# Basic R Test -- Normalize A Function

We can set a variable using the **<-** construct.

*Ex:  "a <- 10" would set a variable called "a" to a value of 10.*

Well, technically, the variable "a" is set to a vector of size one, whose sole element is the number 10.  This turns out to be a big advantage for performing set-based operations in R, but let's not get too distracted.
    
Note that you can also write "a = 10" and it will work, but this is not a great practice: = does double duty in R, because inside a function call it names an argument rather than assigning a variable. It doesn't quite work the same way as it does in C-based languages or SQL, and you might end up with an edge case that backfires on you.  Stick with <- and you'll be fine.
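One such edge case, as a small sketch: inside a function call, `=` only matches an argument by name, while `<-` really assigns. The helper name `demo_assignment` is made up for this illustration and is not part of the original notebook.

```r
# Hypothetical helper: wraps the demo in a function so it cannot touch
# any variables you already have in your session.
demo_assignment <- function() {
  # '=' here just says "the argument named x is c(1, 2, 3)".
  m1 <- mean(x = c(1, 2, 3))
  before <- exists("x", inherits = FALSE)  # was a variable x created?

  # '<-' here assigns c(1, 2, 3) to x and then passes it to mean().
  m2 <- mean(x <- c(1, 2, 3))
  after <- exists("x", inherits = FALSE)

  c(before = before, after = after)
}

demo_assignment()
```

The first call leaves no variable `x` behind; the second one does.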

In [None]:
y <- 500
y
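To see the "vector of size one" idea from earlier in action, here is a minimal sketch using the variable `a` from the example text.

```r
a <- 10

length(a)        # 1: a is a vector with a single element
is.vector(a)     # TRUE
a[1]             # 10: we can index it like any other vector

# Because a is just a short vector, it combines naturally with longer ones.
a * c(1, 2, 3)   # 10 20 30
```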

We can also assign a multi-element vector using <-, and so we'll do that with the next example.  Note that the . in R is **not** a way of accessing methods or properties on an object like in C#, but is just another character like the underscore.  R variables often use dots instead of underscores, and that can get confusing for new developers.

In [None]:
#The . in R is like _ in C-based languages.
our.second.vector <- c(10.4, 11.9, 38.1, 2.2, 0.7664)
#Of course, the _ in R is also like the _ in C-based languages!
our_third_vector <- c(9.99, 19.99, 29.99)

print('Results of our.second.vector:')
our.second.vector
print('Results of our_third_vector:')
our_third_vector

Accessing elements in a vector is simple:  you can use array notation.  Note that R uses 1-based arrays rather than 0-based arrays!

In [None]:
our.second.vector[3]
our_third_vector[3]
#Index 0 returns an empty vector (numeric(0)) rather than an element, so no value appears:
our.second.vector[0]
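A few more indexing forms are worth knowing. This is a short sketch on a made-up vector called `prices` (a hypothetical name, chosen so it does not overwrite the vectors above).

```r
prices <- c(9.99, 19.99, 29.99)   # hypothetical example data

prices[1]      # 9.99: the first element, since R is 1-based
prices[2:3]    # 19.99 29.99: a range of positions
prices[-1]     # 19.99 29.99: a negative index drops that position
prices[4]      # NA: reading past the end gives a missing value, not an error
prices[0]      # numeric(0): an empty vector
```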

We can create functions using the same <- construct that we use to create vectors.

In [None]:
normalize <- function(x) {
  #standardize values of x; subtract mean & rescale so sd = 1
  (x-mean(x))/sd(x)
}

The normalization function will shift and rescale a vector such that the mean is 0 and the standard deviation is 1.
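A tiny worked example makes this concrete. The sketch below repeats the `normalize` definition so it runs on its own, applies it to a made-up vector `small`, and compares the result with base R's `scale` function, which centers and scales in the same way by default.

```r
normalize <- function(x) {
  # subtract the mean, then divide by the standard deviation
  (x - mean(x)) / sd(x)
}

small <- c(2, 4, 6)   # hypothetical data: mean 4, standard deviation 2
normalize(small)      # -1 0 1

# scale() returns a one-column matrix, so flatten it before comparing.
all.equal(normalize(small), as.vector(scale(small)))
```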

In [None]:
x <- rnorm(100)*5+2
#Call our function on the data set x and put the results into z.
z <- normalize(x)

In the code snippet above, we use the **rnorm** function.  Whenever you see a function which is unfamiliar to you, you can use **help(function)** to get help on that function.  For example, to get help on rnorm, we call:

In [None]:
help(rnorm)
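If you only want a quick reminder of a function's arguments rather than the full help page, a couple of base R functions can do that. A brief sketch:

```r
# ?rnorm is shorthand for help(rnorm) in an interactive session.

# args() prints the function signature, including default values.
args(rnorm)

# formals() returns the arguments as a list, so we can inspect them in code.
names(formals(rnorm))   # "n" "mean" "sd"
```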

rnorm(100) returns a vector of 100 values drawn at random from a normal distribution with mean 0 and standard deviation 1.  We want to show off our normalize function, so we'll scale and shift the results by multiplying each value by 5 and adding 2, which gives x a mean of roughly 2 and a standard deviation of roughly 5.
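As an aside, rnorm also accepts `mean` and `sd` arguments, so the same shift and scale can be requested directly. The sketch below uses an arbitrary seed (42, picked only for repeatability) and made-up variable names to compare the two approaches.

```r
set.seed(42)                          # arbitrary seed, only for repeatability
scaled_by_hand <- rnorm(100) * 5 + 2  # the approach used in this notebook

set.seed(42)                          # reset so both calls see the same draws
scaled_by_args <- rnorm(100, mean = 2, sd = 5)

# With the same seed, the two vectors should agree.
all.equal(scaled_by_hand, scaled_by_args)
```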

Note that there's no foreach loop, no for loop, etc.  Like SQL, RBAR (row by agonizing row) is a no-no in general here.  Better to use vectorized, set-based operations, as they're typically MUCH faster!

Calling a function on a vector is easy; one of the nice things about R is that, most of the time, you don't have to think about the difference between a single value and a vector, and can treat sets as you would single elements.
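To make the contrast concrete, here is a sketch that centers a small made-up vector (`values`) twice: once with an explicit loop and once with a single vectorized expression. Both give the same answer; the second is simply the idiomatic R way.

```r
values <- c(3, 8, 13)   # hypothetical data with mean 8

# RBAR style: visit each element one at a time.
centered_loop <- numeric(length(values))
for (i in seq_along(values)) {
  centered_loop[i] <- values[i] - mean(values)
}

# Vectorized style: one expression operates on the whole vector.
centered_vec <- values - mean(values)

centered_vec                            # -5 0 5
identical(centered_loop, centered_vec)  # TRUE
```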

We can call built-in functions easily.  We want to make sure that after normalization, our mean is (approximately) 0 and our standard deviation is (approximately) 1.

In [None]:
mean(z)

Running this shows that the mean is effectively 0.  The value we get back won't say exactly 0 because of floating-point rounding, but it's so close (a tiny number in scientific notation) that we can treat it as such.

In [None]:
sd(z)

The standard deviation is 1.  By contrast, the original mean and standard deviation are:

In [None]:
mean(x)
sd(x)
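Rather than eyeballing "close enough", we can let R make the comparison with a tolerance. This is a sketch using separate, made-up names (`sample_x`, `sample_z`) and an arbitrary seed so it does not disturb the x and z used in the rest of the notebook.

```r
normalize <- function(x) (x - mean(x)) / sd(x)

set.seed(1)                        # arbitrary seed, only for repeatability
sample_x <- rnorm(100) * 5 + 2
sample_z <- normalize(sample_x)

# The mean is not exactly 0, but it is within a very small tolerance.
abs(mean(sample_z)) < 1e-12

# all.equal() compares numbers while allowing for floating-point error.
isTRUE(all.equal(sd(sample_z), 1))
```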

Let's do something with the vectors x and z, and put them into a data frame.  Data frames are R's preferred method of working with sets of data, and from a SQL Server practitioner's point of view, you can think of them as result sets.  With data frames, each row is known as an *observation* and each column is known as a *variable*.

We will use the **data.frame** function to build a data frame.  Then, we will use the **names** function to provide reasonable names for each variable in the data frame.  To set all of the names at once, we'll use the combine function, **c**.

As a bonus, we'll calculate the mean and standard deviation of the vector x and use those values to build a calculated column in the data frame to test our function.  If the Normalized and Calculated columns match, we'll know that our calculation is correct.

Finally, we will look at a table of the first 10 rows of the data frame, so we can see how the values in x map to the values in z.

In [None]:
meanx <- mean(x)
sdx <- sd(x)

df <- data.frame(x, z, ((x-meanx)/sdx))
names(df) <- c("Original","Normalized","Calculated")

head(df, 10)
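Instead of comparing the two columns by eye, we can ask R whether they match. The sketch below also names the columns directly inside the data.frame call, which is an alternative to calling names afterwards. The names `sample_x` and `check_df` are made up for this example.

```r
normalize <- function(x) (x - mean(x)) / sd(x)

set.seed(1)                        # arbitrary seed, only for repeatability
sample_x <- rnorm(100) * 5 + 2

# Name each column as we build the data frame.
check_df <- data.frame(
  Original   = sample_x,
  Normalized = normalize(sample_x),
  Calculated = (sample_x - mean(sample_x)) / sd(sample_x)
)

# Compare the function's output with the by-hand calculation.
all.equal(check_df$Normalized, check_df$Calculated)

# str() gives a compact summary of the columns and their types.
str(check_df)
```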

Looking at tables of numbers is cool and all, but visuals are easier to follow. This visual shows that our normalize function does NOT change the shape of the distribution.

In [None]:
plot(x,z)

Plotting shows that every point falls on a straight line: we've simply shifted and scaled the values, and we have not modified the distribution in any other way.
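The straight line can also be confirmed numerically. As a sketch with made-up names and an arbitrary seed: the correlation between the original and normalized values should be 1, and fitting a line with lm should recover a slope of 1/sd and an intercept of -mean/sd, which is exactly what the normalize formula implies.

```r
normalize <- function(x) (x - mean(x)) / sd(x)

set.seed(1)                        # arbitrary seed, only for repeatability
sample_x <- rnorm(100) * 5 + 2
sample_z <- normalize(sample_x)

# A perfectly linear relationship has a correlation of 1.
cor(sample_x, sample_z)

# Fit z = intercept + slope * x and compare with what the formula implies.
fit <- lm(sample_z ~ sample_x)
expected <- c(-mean(sample_x) / sd(sample_x), 1 / sd(sample_x))
all.equal(unname(coef(fit)), expected)
```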