# Intro to R

This week, we will go over some of the basics of R and start using the key data structures in R. If you would like to dig deeper or read more about these concepts, I recommend reading Chapters 1-2.5 in Introduction to Data Science (IDS). 

**Main Concepts**: objects, types, vectors, matrices, lists, data frames  
**Additional Concepts**: naming variables, indexing, editing objects  
**Resources**: IDS Chapters 1-2.5  

## Why R?

What are some of the benefits to using R? 
- R is open source.  
- R has most of the latest statistical methods available.
- R is flexible. 

Since R is built for statisticians, it is built with data in mind. That comes in handy when we want to streamline how we process and analyze data. It also means that many statisticians working on new methods publish user-created packages in R, so we have access to most methods of interest. R is also an interpreted language, which means that we do not have to compile our code into machine language before running it. For you, this means simpler syntax and more flexibility when writing code, which also makes R a great first language to learn.  

Python is another interpreted language often used for data analysis. Both languages feature simple and flexible syntax. Python is more broadly developed for usage outside data science and statistical analysis. R, on the other hand, has more available statistical packages, the implementation in these packages can sometimes be simpler (especially with respect to machine learning models), and R has some fantastic data visualization capabilities. I have programmed frequently in both and find switching between them to be relatively straightforward, but I do prefer R for data applications. For those who are interested, I believe anyone completing this course would be able to pick up Python as well. 

The content of this course will focus on using JupyterHub for weekly assignments, but as we get to the end of the class you will want to switch to RStudio. I will provide information for making that switch when we get there. 


### Now we are ready to get started using R!

As you go through this notebook, be sure to run each code block to see the output. You can do so by pressing SHIFT-ENTER or by using the play button at the top of the notebook.

## Main Types and Classes in R

We can get started in R doing some basic calculations using all the operators we are familiar with:

Addition: +  
Subtraction: -  
Multiplication: *  
Division: /  
Exponentiation: ^  
Modulo: %%  

Run the code below and check that the results match what you expect. 

In [1]:
3.5+4.5
7/3.5
2^3
15%%2

Once I have run the lines above, though, I can no longer access the results. A variable stores an object so that I can use it later. Below, I store the value 4 as a variable named `ex_var`. To do so, we use the **assignment** operator `<-`. This operator takes the object on the right-hand side and assigns it to the variable on the left-hand side. Later, I can access `ex_var` by name. Note that `print` statements are helpful for giving output. 

In [2]:
ex_var <- 4
ex_var <- ex_var + 1
print(ex_var)

[1] 5


You may also use the equals sign `=` for assignment. In a few (very very rare) cases, the `<-` operator is preferred and so that is what I will use, but you can use what you find more natural. 

In [3]:
ex_var = 4
ex_var = ex_var + 1
print(ex_var)

[1] 5


The variable `ex_var` is a numeric variable, but we can also work with two other kinds of data: characters (e.g. "php", "stats") and booleans (TRUE, FALSE), also known as logicals. We can check the type by using the `class()` function, as shown below. 

Knowing the type changes the interpretation of the variables. To test this, update the code below to try running TRUE+FALSE or "Alice"+"Bob". What happens? The computer tries to look for what to do with boolean+boolean or character+character rather than numeric+numeric. Note that it does not have a way to interpret addition for two character variables.

In [4]:
ex_bool <- TRUE
ex_char <- "Alice"

class(ex_var)
class(ex_bool)
class(ex_char)
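
As a minimal sketch of the experiment described above, the cell below shows what R does with each type. The variable names `bool_sum`, `char_sum`, and `full_names` are made up for this illustration, and I use `tryCatch()` only so that the error message is captured instead of stopping the cell.

```r
# Logicals are treated as 1 (TRUE) and 0 (FALSE) in arithmetic
bool_sum <- TRUE + FALSE
print(bool_sum)

# Adding two character values raises an error; tryCatch() captures the
# error message so the rest of the cell keeps running
char_sum <- tryCatch("Alice" + "Bob", error = function(e) conditionMessage(e))
print(char_sum)

# paste() is the usual way to join character values
full_names <- paste("Alice", "Bob")
print(full_names)
```

So addition has a meaning for booleans but not for characters, and joining text is done with a function such as `paste()` instead.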

## Basic Types

Beyond the single-valued variables above, the other basic **data types** in R are **vectors**, **factors**, **matrices**, **lists**, and **data frames**. When we refer to an **object**, we mean an instance of a data type. For example, `ex_var` refers to an object of the numeric type. These main structures determine how we store information in R and tell R how to interpret our commands. Later on in the course, we will see that these base types form the building blocks for programming in R and help build more complicated data structures. 

# Vectors

Vectors store one or more data values of the same type (e.g. character, boolean, or numeric). One way to create a vector is to use the combine function `c()`. Vectors cannot hold values of different types: if we run `c("Monday", 5)`, R does not keep the number 5 but converts it to the character "5" so that every value has the same type. Below we create two vectors: one with the days of the week and one with the amount of rain on each day.

In [5]:
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
rain <- c(5, 0.1, 0, 0, 0.4)
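
Mixing types inside `c()` does not keep both types. As a small sketch (the names `mixed_vec` and `mixed_num` are just for illustration), R converts every value to one common type:

```r
# Mixing a character and a number: the number is converted to a character
mixed_vec <- c("Monday", 5)
print(mixed_vec)
class(mixed_vec)

# Mixing a logical and a number: the logical is converted to a number
mixed_num <- c(TRUE, 2.5)
print(mixed_num)
class(mixed_num)
```

This is worth checking with `class()` whenever a vector that should be numeric behaves strangely in a calculation.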

If we check the classes of these two variables, we will see the type of the data they hold (character and numeric). In fact, the variables `ex_var`, `ex_bool`, and `ex_char` are all vectors too, just with a single value each.

In [6]:
class(days)
class(rain)

What happens when we create an empty vector? What is the class?

In [7]:
ex_empty <- c()
class(ex_empty)

If we wanted to specify the type we could make an empty vector using the `vector()` function.

In [8]:
ex_empty <- vector(mode = "numeric")
class(ex_empty)

Another way to create a non-empty vector is with the `rep()` or `seq()` functions. The first function `rep(x, times)` takes in a vector `x` and a number of times `times` and outputs `x` repeated that many times. Let's try this with a single value below. The second function `seq(from, to, by)` takes in a starting value, end value, and step size (all numeric values) and returns a sequence that starts at `from` and increases in increments of `by` without going past `to`. 

In [9]:
rep(0, 5)
rep("Monday", 4)
seq(1, 5, 1)
seq(0, 10, 2)
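
The same calls can be written with named arguments, which I find easier to read. The sketch below also shows two optional arguments, `length.out` for `seq()` and `each` for `rep()`; the variable names are only for illustration.

```r
# Naming the arguments makes the call easier to read
evens <- seq(from = 0, to = 10, by = 2)
print(evens)

# length.out asks for a fixed number of evenly spaced values
# instead of giving a step size
quarters <- seq(from = 0, to = 1, length.out = 5)
print(quarters)

# rep() can repeat a whole vector, or repeat each element in turn
rep_whole <- rep(c("a", "b"), times = 2)
rep_each <- rep(c("a", "b"), each = 2)
print(rep_whole)
print(rep_each)
```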

### Indexing a Vector

Now that we have a vector, we can access a subset of the values by using the indices, where the first value has index 1, the second value has index 2, etc. Below, we use these indices to return a single value. The first line returns the value at index 1 and the second line returns the value at index 4.

In [10]:
days[1]
days[4]

We can either access a single value or a subset of values using a vector of indices. See what happens when you use a vector of indices `c(1,4)` and then try what happens when you use `-c(1,4)`. In the first case, we get the values at index 1 and at index 4. In the second case, we get all values *except* at those indices.

In [11]:
days[c(1,4)]
days[-c(1,4)]

However, always indexing by the index value can sometimes be difficult. One extra feature of vectors is that we can associate a name with each value. Below, we update the names of the vector `rain` to be the days of the week and then find Friday's rain count by indexing with the name.

In [12]:
names(rain) <- days
print(rain)
rain["Friday"]

   Monday   Tuesday Wednesday  Thursday    Friday 
      5.0       0.1       0.0       0.0       0.4 


We can also index a vector using TRUE and FALSE values. If we have a vector of booleans that is the same length as our original vector, then this will return all the values that correspond to a TRUE value. Below, this will return the first and fourth values.

In [13]:
ind_bools <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
days[ind_bools]

This is useful because we can use logic to access certain values. We will see more about this later on. 
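
As a short preview of what that looks like, here is a sketch in which the boolean vector comes from a comparison rather than being typed by hand. I use copies named `demo_days` and `demo_rain` (names made up for this example) so that the notebook's own vectors are left untouched.

```r
# Illustrative copies of the two vectors
demo_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
demo_rain <- c(5, 0.1, 0, 0, 0.4)

# A comparison returns one TRUE/FALSE value per element
rainy <- demo_rain > 0
print(rainy)

# Use the boolean vector to keep only the days with rain
rainy_days <- demo_days[rainy]
print(rainy_days)
```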

### Editing a Vector and Calculations

After we create a vector, we may need to update it or change it. For example, we may want to change a value. We can do so using indexing. Below, we update the rain value for Friday. 

In [14]:
rain["Friday"] <- 0.5
print(rain)

   Monday   Tuesday Wednesday  Thursday    Friday 
      5.0       0.1       0.0       0.0       0.5 


Further, we may need to add extra entries. We can do so using the `c()` function again, this time passing in the existing vector as our first argument. This creates a single vector with all values. Below, we add two days to both vectors and then check the length of the updated vector `rain`. The `length()` function returns the length of a vector.

In [15]:
length(rain)
days <- c(days,"Saturday", "Sunday") # add the weekend with no rain
rain <- c(rain,0, 0)
length(rain)
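
One detail to watch when appending to a named vector: values added without a name get an empty name. The sketch below, using a small made-up vector `demo_rain`, shows the difference and one way to keep every entry labelled.

```r
# A small named vector for illustration
demo_rain <- c(Monday = 5, Tuesday = 0.1)

# Appending unnamed values leaves their names as empty strings
unnamed_append <- c(demo_rain, 0, 0)
names(unnamed_append)

# Supplying names inside c() keeps the vector fully labelled
named_append <- c(demo_rain, Saturday = 0, Sunday = 0)
names(named_append)
named_append["Sunday"]
```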

We can also call some useful functions on vectors. For example, the `sum()`, `max()`, and `min()` functions will return the sum, maximum value, and minimum value of a vector, respectively. 

In [16]:
sum(rain)
max(rain)
min(rain)
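
A few more calculations work the same way. This is only a sketch: the vector `demo_rain` is a made-up copy, and the unit conversion assumes the values are in inches, which is my assumption for the example.

```r
demo_rain <- c(Monday = 5, Tuesday = 0.1, Wednesday = 0, Thursday = 0, Friday = 0.4)

# Arithmetic is applied to every element of the vector
# (assuming inches, this converts each value to centimetres)
demo_rain_cm <- demo_rain * 2.54
print(demo_rain_cm)

# mean() returns the average value
avg_rain <- mean(demo_rain)
print(avg_rain)

# which.max() returns the position of the largest value;
# names() then gives the label stored at that position
wettest <- names(which.max(demo_rain))
print(wettest)
```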

### Question 1
Create a vector of the odd numbers from 1 to 11 using the `seq()` function. Then, find the third value in the vector using indexing.

In [17]:
# Solution:

# Factors

A special kind of vector is a **factor** vector. A **factor** vector behaves much like a regular vector, but it assumes the values come from a fixed set of categories. In particular, it keeps track of all possible values, called the **levels**. Factors are helpful when we start getting into data analysis and have categorical variables. The `as.factor()` function will convert a vector to a factor. 

In [18]:
days <- c("Monday", "Tuesday", "Wednesday", "Monday", "Thursday", "Wednesday")
days <- as.factor(days)

class(days)
levels(days)

Above we did not specify the possible levels for our variable. Instead, R found all values in the vector `days`. Therefore, if we try to change one to Friday, R will not accept it: it gives a warning about an invalid factor level and stores `NA` (the missing value) instead. Uncomment the line below to see the warning it gives.

In [19]:
#days[2] <- "Friday"   

We can avoid this problem by specifying the levels with the `factor()` function instead of `as.factor()`. 

In [20]:
days <- c("Monday", "Tuesday", "Wednesday", "Monday", "Thursday", "Wednesday")
days <- factor(days, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

class(days)
levels(days)
days[2] <- "Friday"
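
To see what the levels are doing, here is a small sketch with a made-up factor `demo_days`. The `table()` function counts how often each level appears, and `as.integer()` reveals the integer codes that R stores behind the labels.

```r
demo_days <- factor(c("Monday", "Tuesday", "Monday"),
                    levels = c("Monday", "Tuesday", "Friday"))

# table() counts every level, including levels that never appear
day_counts <- table(demo_days)
print(day_counts)

# Each value is stored as the position of its level
day_codes <- as.integer(demo_days)
print(day_codes)
```

Note that Friday is reported with a count of zero because it is a level even though no value uses it.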

# Matrices

Matrices are similar to vectors in that they store data of the same type, but matrices are two-dimensional and consist of rows and columns. Below, we create a matrix reporting the daily rainfall over multiple weeks. We can create a matrix using the `matrix(data, nrow, ncol, byrow)` function. This creates an `nrow` by `ncol` matrix from the values in the vector `data`, filling in by row if `byrow` is TRUE and by column otherwise. Run the code below. Then, change the last argument to `byrow=FALSE` and see what happens to the values.

In [21]:
rainfall <- matrix(c(5,6,0.1,3,0,1,0,1,0.4,0.2,0.5,0.3,0,0), ncol=7, nrow=2, byrow=TRUE)
print(rainfall)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    5  6.0  0.1  3.0  0.0    1    0
[2,]    1  0.4  0.2  0.5  0.3    0    0
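
The effect of `byrow` is easiest to see with a small matrix of consecutive numbers. The sketch below uses made-up names `by_row` and `by_col` and is separate from the rainfall data.

```r
vals <- 1:6

# byrow = TRUE fills the first row, then the second row
by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)

# byrow = FALSE (the default) fills the first column, then the next
by_col <- matrix(vals, nrow = 2, ncol = 3, byrow = FALSE)
print(by_col)
```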


## Indexing a Matrix
Note that matrices are two-dimensional. That means that to access a subset of values, we will need to provide row and column indices. Below, we access a single value in the 1st row and the 4th column.

In [22]:
rainfall[1,4]

As before, we can also provide multiple indices to get multiple values. Below, we choose multiple columns but we can also choose multiple rows (or multiple rows and multiple columns!).

In [23]:
rainfall[1,c(4,5,7)]
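
Here is a sketch of selecting several rows and several columns at once, using a made-up matrix `demo_mat`. It also shows that selecting a single row returns a plain vector unless we add `drop = FALSE`, an optional argument of the bracket operator.

```r
demo_mat <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Rows 1 and 2, columns 1 and 3: the result is a 2 x 2 matrix
sub_mat <- demo_mat[c(1, 2), c(1, 3)]
print(sub_mat)

# A single row comes back as a plain vector, so it has no dimensions
row_vec <- demo_mat[1, ]
dim(row_vec)

# drop = FALSE keeps the result as a matrix with one row
row_mat <- demo_mat[1, , drop = FALSE]
dim(row_mat)
```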

As with vectors, we can also use booleans to index a matrix by providing boolean values for the row and/or columns. Note that below we give a vector for the row indices and no values for the columns. Since we did not specify any column indices, this will select all of them.

In [24]:
rainfall[c(FALSE, TRUE), ]

Let's do the opposite and select some columns and all rows.

In [25]:
rainfall[ ,c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)]

     [,1] [,2]
[1,]    5  6.0
[2,]    1  0.4


As with vectors, we can specify row names and column names to access an entry instead of using indices.

In [26]:
colnames(rainfall) <- c("Monday", "Tuesday", "Wednesday", "Thursday", 
                        "Friday", "Saturday", "Sunday")
rownames(rainfall) <- c("Week1", "Week2")
rainfall["Week1",c("Friday","Saturday")]

## Editing a Matrix

If we want to change the values in a matrix, we need to index those values and give the new value(s). Below, we change a single entry to be 3 and then update several values to all be 0. Note that we do not need to provide multiple 0's on the right-hand side. R will infer that all selected values should be set to 0.

In [27]:
rainfall["Week1", "Friday"] <- 3

In [28]:
rainfall["Week1", c("Monday", "Tuesday")] <- 0
print(rainfall)

      Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Week1      0     0.0       0.1      3.0    3.0        1      0
Week2      1     0.4       0.2      0.5    0.3        0      0


Further, we can append values to our matrix by adding rows or columns through the **rbind** and **cbind** functions. The first function appends a row (or multiple rows) to a matrix and the second appends a column (or multiple columns). Note that below I provide a row and column name when passing in the additional data (if I didn't then those rows and columns would not be named). 

In [29]:
rainfall <- rbind(rainfall, "Week3" = c(0.4, 0.0, 0.0, 0.0, 1.2, 2.2, 0.0))
rainfall <- cbind(rainfall, "Total" = c(7.1, 2.4, 3.8))
print(rainfall)

      Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total
Week1    0.0     0.0       0.1      3.0    3.0      1.0      0   7.1
Week2    1.0     0.4       0.2      0.5    0.3      0.0      0   2.4
Week3    0.4     0.0       0.0      0.0    1.2      2.2      0   3.8
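
Instead of typing the totals by hand, they can be computed from the matrix itself. This is a sketch on a smaller made-up matrix `demo_rainfall`; `rowSums()` adds up the values in each row.

```r
demo_rainfall <- matrix(c(0, 0, 0.1,
                          1, 0.4, 0.2),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(c("Week1", "Week2"),
                                        c("Monday", "Tuesday", "Wednesday")))

# rowSums() returns one total per row, keeping the row names
week_totals <- rowSums(demo_rainfall)
print(week_totals)

# Bind the totals on as a new named column
demo_rainfall <- cbind(demo_rainfall, Total = week_totals)
print(demo_rainfall)
```

Computing the column this way means the totals stay correct if the underlying values change and the cell is run again.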


Here is an example where we bind two matrices by column. 

In [30]:
cbind(matrix(c(1,2,3,4), nrow=2), matrix(c(5,6,7,8), nrow=2))

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8


### Question 2
Create a 3x4 matrix using the `rep()` and `matrix()` functions. Then, sum the first two rows of the matrix using indexing and the `sum()` function.

In [31]:
# Solution:

# Data Frames
Matrices can store data such as the rainfall data when everything is of the same type. However, if we want to capture more complex data records we also want to allow for different measurement types. That is where data frames come in. A data frame is like a matrix in that it is two-dimensional except we allow for each column to be a different type. In this case, each row corresponds to a single data entry (or record) and each column is for a different variable. 
    
For example, suppose that for every day we want to record the temperature, rainfall, and day of the week. Temperature and rainfall can be numeric values, but day of the week will be character. We create a data frame using the `data.frame()` function. Note that I am providing column names for each vector (column).

The `head()` function prints the first portion of a data frame (to avoid printing very large datasets). In our case, it will show all the data. The column names are displayed as well as their type. 
   

In [32]:
weather_data <- data.frame(day_of_week = c("Monday","Tuesday","Wednesday","Monday"), 
                           temp = c(70,62,75,50), rain = c(5,0.1,0.0,0.5))
head(weather_data)

  day_of_week  temp  rain
  <chr>       <dbl> <dbl>
1 Monday         70   5.0
2 Tuesday        62   0.1
3 Wednesday      75   0.0
4 Monday         50   0.5


It is often useful to find the dimensions of a data frame, that is, its number of rows and columns. The `dim()`, `nrow()`, and `ncol()` functions return the dimensions, number of rows, and number of columns of a data frame, respectively. These functions also work for matrices.

In [33]:
dim(weather_data)
nrow(weather_data)
ncol(weather_data)

## Indexing a Data Frame

We can select elements of the data frame using the indices in the same way as matrices. Below, we access a single value and then a subset of our data frame. The subset returned is itself a data frame.

In [34]:
weather_data[1,2]
weather_data[1,c("day_of_week","temp")]

  day_of_week  temp
  <chr>       <dbl>
1 Monday         70


Another useful way to access the columns of a data frame is by using the $ accessor and the column name. 

In [35]:
weather_data$day_of_week
weather_data$temp

The column `day_of_week` is a categorical variable in that it can only take on a limited number of values. For this kind of variable, it is often useful to convert that column to a **factor**, as we do below. It is especially useful to convert a column to a factor if you have a numerical value corresponding to a category (e.g. 0/1 encodings).

In [36]:
weather_data$day_of_week <- factor(weather_data$day_of_week)
levels(weather_data$day_of_week)

Now, let's suppose that I want to get the temperature of the days when it rained. I can do so using the code below. The first line tests whether each entry in the vector of rain values is greater than zero. This is called a logical test. What type of output is this? 

The next line uses that result to index the temperature vector (remember that TRUE means select that entry and FALSE means do not). 

In [37]:
weather_data$rain > 0
weather_data$temp[weather_data$rain > 0]
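
The same logical test can select whole rows of a data frame. Below is a sketch on a copy named `demo_weather` (a made-up name so the notebook's `weather_data` is not changed). Conditions can be combined with `&`, which means "and".

```r
demo_weather <- data.frame(day_of_week = c("Monday", "Tuesday", "Wednesday", "Monday"),
                           temp = c(70, 62, 75, 50),
                           rain = c(5, 0.1, 0.0, 0.5))

# Logical vector in the row position, nothing in the column position:
# keep the rows where it rained and keep all columns
rainy_rows <- demo_weather[demo_weather$rain > 0, ]
print(rainy_rows)

# Both conditions must be TRUE for a row to be kept
warm_and_rainy <- demo_weather[demo_weather$rain > 0 & demo_weather$temp > 60, ]
print(warm_and_rainy)
```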

## Editing a Data Frame

As with matrices, we can change values in a data frame by indexing those entries. 

In [38]:
weather_data[1, "rain"] <- 2.2
weather_data

day_of_week  temp  rain
<fct>       <dbl> <dbl>
Monday         70   2.2
Tuesday        62   0.1
Wednesday      75   0.0
Monday         50   0.5


The **rbind** functions and **cbind** functions also work for data frames in the same way as for matrices. However, another way to add a column is to directly use the $ accessor. Below, we add a categorical column called heavy_traffic.

In [39]:
weather_data$heavy_traffic <- as.factor(c(1, 0, 0, 0))
weather_data

day_of_week  temp  rain heavy_traffic
<fct>       <dbl> <dbl> <fct>
Monday         70   2.2 1
Tuesday        62   0.1 0
Wednesday      75   0.0 0
Monday         50   0.5 0
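
New columns can also be computed from existing ones. The sketch below works on a made-up data frame `demo_weather`; the conversion assumes the temperatures are in degrees Fahrenheit, which is my assumption for the example.

```r
demo_weather <- data.frame(temp = c(70, 62, 75, 50),
                           rain = c(5, 0.1, 0.0, 0.5))

# Arithmetic on a column is applied to every row
# (assuming Fahrenheit, this gives the temperature in Celsius)
demo_weather$temp_c <- (demo_weather$temp - 32) * 5 / 9

# A logical test on a column gives a TRUE/FALSE column
demo_weather$rained <- demo_weather$rain > 0

print(demo_weather)
```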


### Question 3
Add a column to `weather_data` called `late_to_work` using the `rep()` function so that all values are `NA` (the missing value in R). Then, index the second value of this column and set it to be 1.

In [40]:
# Solution:

# Lists

A data frame is actually a special case of a more general data type called a list. A list is a collection of objects under the same name. These objects can be vectors, matrices, data frames, or even other lists! With a list there does not have to be any relation in size, type, or other attribute between different members of the list. Below, we create an example list using the `list()` function, which takes in a series of objects. What are the types of each element of the list? We can access each element using the index, but here we need to use double brackets. 

In [41]:
ex_list <- list("Alice", c("mint_chip", "caramel"), c(3.1, 2.5, 4.0))
print(ex_list)
ex_list[[2]]

[[1]]
[1] "Alice"

[[2]]
[1] "mint_chip" "caramel"  

[[3]]
[1] 3.1 2.5 4.0
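
To see why double brackets are needed, compare them with single brackets. This is a small sketch using made-up objects `demo_list` and `demo_df`; it also checks the claim that a data frame is a list.

```r
demo_list <- list("Alice", c("mint_chip", "caramel"), c(3.1, 2.5, 4.0))

# Single brackets return a shorter list that still wraps the element
single_class <- class(demo_list[2])
print(single_class)

# Double brackets return the element itself
double_class <- class(demo_list[[2]])
print(double_class)

# A data frame is a list whose elements are its columns
demo_df <- data.frame(temp = c(70, 62))
is.list(demo_df)
demo_df[["temp"]]
```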



More often, however, it is useful to name the elements of the list for easier access. Let's create this list again but this time give names to each object.

In [42]:
ex_list <- list(name="Alice", ice_cream_flavors = c("mint_chip", "caramel"), 
                run_lengths = c(3.1, 2.5, 4.0))
print(ex_list)
ex_list$ice_cream_flavors

$name
[1] "Alice"

$ice_cream_flavors
[1] "mint_chip" "caramel"  

$run_lengths
[1] 3.1 2.5 4.0



To edit a list, we can index and access different objects in the list. Additionally, we can add objects to the list using the $ accessor.

In [43]:
ex_list$toppings = c("Cherry", "Sprinkles")
ex_list

In [44]:
ex_list$toppings[2] = "Hot Fudge"
ex_list

## Video: Example Matrices and Data Frames


%%HTML
<iframe width="560" height="315"
 src="https://www.youtube.com/embed/fZQERMjaoUY">
</iframe>


# A Note on Naming Conventions and Comments

As we start coding, we want to make sure our code is easy to read and clear to both ourselves and others. Two easy ways to improve our code are to use good variable names and to comment our code. A few tips for effective names:
-  Stick to a single format. One popular format is to use lower case letters with underscores between words (e.g. `my_var`, `names`, `prec_days`).
-  Make your names useful. Try to avoid using names that are too long (e.g. `which_day_of_the_week`) or do not contain enough information (e.g. `x1`, `x2`, `x3`). 
-  Replace magic numbers or unexplained values with a variable. For example, if you are using a class size of 32 to calculate some measurements, make a variable `class_size` rather than repeatedly using the number. 

Above, I had a lot of text surrounding each code chunk, but often you will want to write larger blocks of code and have it speak for itself. Or, you might be writing an R script which does not contain chunks of text. A good rule of thumb is to include enough comments so that if you came back to this code a month later you would be able to follow your logic. We will work on these coding practices throughout the course. 
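
As a brief sketch of these tips together, the cell below names the class size once and uses comments to explain each step. The value of `total_points` is hypothetical and only there to make the example run.

```r
# Number of students in the class, stored once instead of repeating 32
class_size <- 32

# Hypothetical total of all quiz points earned by the class
total_points <- 2720

# Average points per student
avg_points <- total_points / class_size
print(avg_points)
```

If the class size changes, only the first line needs to be updated.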
