# Data Structures in R

In this chapter, we will demonstrate the key **data structures** in R. A data structure is how information is stored in R and it informs R how to interpret our code. Any **object** is a named instance of a data structure. For example, the object `ex_num` is a vector of numeric type. 

In [12]:
ex_num <- 4

The main data structures in R are **vectors**, **factors**, **matrices**, **arrays**, **lists**, and **data frames**. These structures are distinguished by their dimensions and by the types of data they store. For example, a vector is a 1-dimensional data structure whose values all have the same type, whereas a data frame is a 2-dimensional data structure in which all values within a column have the same type. We will cover each structure except for arrays, which are an extension of matrices that allow for data with more than two dimensions.

## Data Types

Each individual value in R has a type: logical, integer, double, or character. We can think of these as the building blocks of all data structures. Below, we use the `typeof` function to find the type, which shows that the value of `ex_num` is a **double**. A double is a numeric value that can store a decimal part as well.

In [13]:
typeof(ex_num)

We now create an integer object `ex_int`. To indicate to R that we want to restrict our values to integer values we use an `L` after the number. For our examples, we will not need to use this extra type. 

In [14]:
ex_int <- 4L
typeof(ex_int)
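
To make the difference between the two numeric types more concrete, here is a small sketch. The names `int_val` and `dbl_val` are hypothetical and used only for illustration.

```r
int_val <- 4L   # hypothetical integer value
dbl_val <- 4    # hypothetical double value

is.integer(int_val)          # TRUE
is.integer(dbl_val)          # FALSE
is.numeric(int_val)          # TRUE: integers also count as numeric
int_val == dbl_val           # TRUE: the values are equal
identical(int_val, dbl_val)  # FALSE: the types differ
typeof(int_val + 0.5)        # "double": mixing with a double gives a double
```

In practice, R converts between the two as needed, which is why we rarely have to think about the integer type.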

Both `ex_num` and `ex_int` are numeric objects, but we can also work with two other kinds of types: characters (e.g. "php", "stats") and booleans (TRUE, FALSE), also known as logicals. 

In [15]:
ex_bool <- TRUE
ex_char <- "Alice"

typeof(ex_bool)
typeof(ex_char)

One characteristic of logical objects is that R will also interpret them as 0/1. 

In [17]:
TRUE+FALSE+TRUE

To create these objects, we have used the assignment operator `<-`. We could instead use the `=` operator, which can be used for assignment but also has other meanings. It is generally preferable to stick to using `<-`. 

## Vectors

In the examples above, we created objects with a single value. However, R actually uses a vector of length 1 to store this information. Vectors are 1-dimensional data structures that can store multiple data values of the same type (e.g. character, boolean, or numeric). We can confirm this by using the `is.vector` function, which returns TRUE if the inputted argument is a vector.

In [19]:
is.vector(ex_bool)

One way to create a vector with multiple values is to use the combine function `c()`. Below we create two vectors: one with the days of the week and one with the amount of rain on each day.

In [18]:
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
rain <- c(5, 0.1, 0, 0, 0.4)

Remember, vectors cannot store objects of different types. In the code below, R automatically converts the numeric value to be a character.

In [20]:
c("Monday", 5)
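
The same kind of conversion happens for other mixtures of types. The sketch below uses hypothetical object names to show which type wins in each case: logical values are converted to numeric, and anything combined with a character becomes a character.

```r
# Hypothetical examples of mixing types inside c()
mixed_lgl_num <- c(TRUE, 2.5)   # logical is converted to double
mixed_num_chr <- c(1, "a")      # double is converted to character
mixed_lgl_chr <- c(TRUE, "a")   # logical is converted to character

typeof(mixed_lgl_num)  # "double"
typeof(mixed_num_chr)  # "character"
typeof(mixed_lgl_chr)  # "character"
```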

If we check the classes of `days` and `rain`, we will see that `class` tells us the type of data each vector stores (character and numeric). 

In [6]:
class(days)
class(rain)

What happens when we create an empty vector? What is the class?

In [7]:
ex_empty <- c()
class(ex_empty)

In this case, there is no specified type yet. If we wanted to specify the type we could make an empty vector using the `vector()` function.

In [8]:
ex_empty <- vector(mode = "numeric")
class(ex_empty)
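
As a small sketch of what is going on here (the object names are hypothetical), `c()` with no arguments returns `NULL`, while `vector()` can also create a vector of a given type and length filled with default values.

```r
empty_null <- c()
is.null(empty_null)       # TRUE: c() with no arguments returns NULL

# A character vector of length 3, filled with empty strings
empty_chr <- vector(mode = "character", length = 3)
empty_chr

# numeric(0) is a shortcut for an empty numeric vector
empty_num <- numeric(0)
length(empty_num)         # 0
```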

Another way to create a non-empty vector is with the `rep()` or `seq()` functions. The first function, `rep(x, times)`, takes a vector `x` and a number `times` and returns `x` repeated that many times. Let's try this with a single value below. The second function, `seq(from, to, by)`, takes a starting value, an end value, and a step size (all numeric) and returns a sequence that starts at `from` and moves in increments of `by` without going past `to`. 

In [5]:
rep(0, 5)
rep("Monday", 4)
seq(1, 5, 1)
seq(0, -10, -2)
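
Both functions accept a few other arguments that are often handy. The sketch below, with hypothetical object names, repeats a whole vector, repeats each element using `each`, and builds sequences with `by` and `length.out`.

```r
rep_times <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
rep_each  <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"

seq_by    <- seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
seq_stop  <- seq(1, 10, by = 4)            # 1 5 9 (does not go past 10)
seq_len5  <- seq(0, 1, length.out = 5)     # five evenly spaced values
```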

The mathematical operators we saw in the last chapter (e.g. `+`, `-`, `*`, `/`, `^`) can all be applied to numeric vectors and work element-wise. That is, in the code below the two vectors are added together index by index. This holds true for some of the built-in math functions as well:

* `exp` - exponential
* `log` - natural log
* `sqrt` - square root
* `abs` - absolute value 
* `round` - round to nearest integer value
* `ceiling` - round up to the nearest integer value 
* `floor` - round down to the nearest integer value

In [42]:
c(1,2,3) + c(1,1,1)
c(1,2,3) + 1 # equivalent to the code above
sqrt(c(1,4,16))
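
Here is a short sketch of the rounding functions and of what happens when the two vectors have different lengths. The vector `measurements` is hypothetical. When one vector is shorter, R reuses (recycles) its values, which is why adding a single number works.

```r
measurements <- c(1.2, 2.6, 3.7, -0.4)   # hypothetical values

rounded <- round(measurements)     # 1 3 4 0
up      <- ceiling(measurements)   # 2 3 4 0
down    <- floor(measurements)     # 1 2 3 -1
sizes   <- abs(measurements)       # 1.2 2.6 3.7 0.4

# The shorter vector c(10, 100) is recycled to match the longer one
recycled <- c(1, 2, 3, 4) * c(10, 100)   # 10 200 30 400
```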

### Indexing a Vector

Now that we have a vector, we may want to access the values. To do so, we index the values starting from index 1: the first value has index 1, the second value has index 2, etc. This is what we mean when we say a vector is 1-dimensional. Below, we use these indices to find the value at index 1 and the value at index 4.

In [21]:
days[1]
days[4]

We can either access a single value or a subset of values using a vector of indices. See what happens when you use a vector of indices `c(1,4)` and then try what happens when you use `-c(1,4)`. In the first case, we get the values at index 1 and at index 4. In the second case, we get all values *except* at those indices. The `-` indicates that we want to remove rather than select these indices.

In [11]:
days[c(1,4)]
days[-c(1,4)]

However, always indexing by the index value can sometimes be difficult. One extra feature of vectors is that we can associate a name with each value. Below, we update the names of the vector `rain` to be the days of the week and then find Friday's rain count by indexing with the name.

In [12]:
names(rain) <- days
print(rain)
rain["Friday"]

   Monday   Tuesday Wednesday  Thursday    Friday 
      5.0       0.1       0.0       0.0       0.4 


We can also index a vector using TRUE and FALSE values. If we have a vector of booleans that is the same length as our original vector, then this will return all the values that correspond to a TRUE value. Below, this will return the first and fourth values.

In [13]:
ind_bools <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
days[ind_bools]

This is useful because we can use logic to access certain values. We will see more about this later on. 
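
As a brief preview, the sketch below builds the logical vector from a comparison instead of typing it by hand. The vector `rain_amt` is a hypothetical copy of the rainfall values so that the snippet runs on its own.

```r
# Hypothetical named vector of rainfall amounts
rain_amt <- c(Monday = 5, Tuesday = 0.1, Wednesday = 0, Thursday = 0, Friday = 0.4)

rained <- rain_amt > 0     # TRUE where there was some rain
rain_amt[rained]           # only the rainy days
dry_days <- names(rain_amt)[rain_amt == 0]
dry_days                   # "Wednesday" "Thursday"
sum(rained)                # 3: each TRUE counts as 1
```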

### Editing a Vector and Calculations

After we create a vector, we may need to update it or change it. For example, we may want to change a value. We can do so using indexing. Below, we update the rain value for Friday using the assignment operator. 

In [14]:
rain["Friday"] <- 0.5
rain

   Monday   Tuesday Wednesday  Thursday    Friday 
      5.0       0.1       0.0       0.0       0.5 


Further, we may need to add extra entries. We can do so using the `c()` function again, but this time passing in a vector as our first argument. This will create a single vector with all values. Below, we add two days to both vectors and then check the length of the updated vector `rain`. The `length()` function returns the length of a vector.

In [15]:
length(rain)
days <- c(days,"Saturday", "Sunday") # add the weekend with no rain
rain <- c(rain,0, 0)
length(rain)
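
The `c()` function always adds values at one of the ends. If we want to insert a value somewhere in the middle, one option is the base R function `append()`, sketched below with a hypothetical vector `temps`.

```r
temps <- c(70, 62, 75)                       # hypothetical temperatures

temps_end   <- c(temps, 50)                  # add to the end
temps_front <- c(65, temps)                  # add to the front
temps_mid   <- append(temps, 68, after = 1)  # insert after the first value
temps_mid                                    # 70 68 62 75
```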

We can also call some useful functions on vectors. For example, the `sum()`, `max()`, and `min()` functions will return the sum, maximum value, and minimum value of a vector, respectively. 

### Practice Question

Create a vector of the odd numbers from 1 to 11 using the `seq()` function. Then, find the third value in the vector using indexing. 

In [22]:
# Insert your solution here:

### Common Vector Functions

Below we list some of the most common vector functions that are available in base R. These functions are intended for numeric vectors. If we pass one of them a logical vector, R will generally treat TRUE as 1 and FALSE as 0. If we pass one of them a character vector, most will return an error or a warning (`max()` and `min()` are exceptions: they compare strings alphabetically). 

* `sum()` - summation
* `median()` - median value
* `mean()` - mean
* `sd()` - standard deviation
* `var()` - variance
* `max()` - maximum value
* `which.max()` - index of the first element with the maximum value
* `min()` - minimum value
* `which.min()` - index of the first element with the minimum value

Try these out using the vector `rain`. 

In [40]:
mean(rain)  
min(rain) 
which.min(rain) 

We may also be interested in the order of the values. The `sort` function returns the values of a vector in sorted order, whereas the `order` function returns the indices that would put the elements in sorted order. The last line of code below sorts the days of the week from smallest to largest rain value. 

In [28]:
rain
order(rain)
days[order(rain)]

Both of these functions have an optional argument `decreasing`, which has a default value of FALSE. We can set this to TRUE to find the days of the week sorted from largest to smallest rainfall.

In [29]:
days[order(rain, decreasing=TRUE)]
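
To see how `sort` and `order` relate, here is a small sketch with a hypothetical named vector `scores`. Indexing a vector by its own `order()` gives the same values as `sort()`.

```r
scores <- c(b = 3, a = 1, c = 2)   # hypothetical values

sorted_vals <- sort(scores)        # values 1 2 3
idx <- order(scores)               # 2 3 1: positions of the smallest to largest
by_index <- scores[idx]            # same values as sort(scores)

idx_desc <- order(scores, decreasing = TRUE)   # 1 3 2
```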

## Factors

A special kind of vector is a **factor** vector. A **factor** vector behaves much like a regular vector, but it assumes the vector represents values from a category. In particular, it keeps track of all possible values, called the **levels**. Factors are helpful when we start getting into data analysis and have categorical variables. The `as.factor()` function will convert a vector to a factor. 

In [18]:
days <- c("Monday", "Tuesday", "Wednesday", "Monday", "Thursday", "Wednesday")
days_fct <- as.factor(days)

class(days_fct)
levels(days_fct)

Above we did not specify the possible levels for our variable. Instead, R found all unique values in the vector `days`. Therefore, if we try to change one entry to "Friday", which is not one of the levels, R will give a warning and store a missing value (`NA`) instead. Uncomment the line below to see the warning.

In [19]:
#days_fct[2] <- "Friday"   

We can avoid this problem by specifying the levels with the `factor` function instead of `as.factor`. 

In [30]:
days <- c("Monday", "Tuesday", "Wednesday", "Monday", "Thursday", "Wednesday")
days_fct <- factor(days, 
               levels = c("Monday", "Tuesday", "Wednesday", "Thursday", 
                          "Friday", "Saturday", "Sunday"))

class(days_fct)
levels(days_fct)
days_fct[2] <- "Friday"
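
One reason the levels matter is that summaries keep track of categories that have not been observed yet. The sketch below uses a hypothetical factor `sizes_fct` to show this with `table()`, and also shows that a factor stores each value as an integer code pointing to a level.

```r
sizes <- c("small", "large", "small", "medium")   # hypothetical data
sizes_fct <- factor(sizes, levels = c("small", "medium", "large", "xlarge"))

size_counts <- table(sizes_fct)   # includes a zero count for "xlarge"
size_counts

as.integer(sizes_fct)   # 1 3 1 2: position of each value in the levels
nlevels(sizes_fct)      # 4
```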

## Matrices

Matrices are similar to vectors in that they store data of the same type, but matrices are two-dimensional and consist of rows and columns. 


Below, we create a matrix reporting the daily rainfall over multiple weeks. We can create a matrix using the `matrix(data, nrow, ncol, byrow)` function. This creates an `nrow` by `ncol` matrix from the values in the vector `data`, filling in by row if `byrow` is TRUE and by column otherwise. Run the code below. Then, change the last argument to `byrow=FALSE` and see what happens to the values.

In [33]:
rainfall <- matrix(c(5,6,0.1,3,0,1,0,1,0.4,0.2,0.5,0.3,0,0), ncol=7, nrow=2, byrow=TRUE)
rainfall

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    5  6.0  0.1  3.0  0.0    1    0
[2,]    1  0.4  0.2  0.5  0.3    0    0


We can find the dimensions of a matrix using the `nrow` and `ncol` functions, which return the number of rows and the number of columns, respectively. Additionally, the `dim` function returns both.

In [34]:
nrow(rainfall)
ncol(rainfall)
dim(rainfall)
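
To make the effect of `byrow` easier to see, here is a sketch with a small hypothetical matrix built from the values 1 to 6. Both matrices have the same dimensions, but the values land in different positions.

```r
vals <- 1:6   # hypothetical values

by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)
by_col <- matrix(vals, nrow = 2, ncol = 3, byrow = FALSE)

by_row[1, ]   # 1 2 3: the first row is filled first
by_col[1, ]   # 1 3 5: the first column is filled first
dim(by_row)   # 2 3
```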

### Indexing a Matrix

Since matrices are two-dimensional, a single value is indexed by its row number and its column number. This means that to access a subset of values, we will need to provide row and column indices. Below, we access a single value in the 1st row and the 4th column. The first value is always the row index and the second value is always the column index.

In [22]:
rainfall[1,4]

As before, we can also provide multiple indices to get multiple values. Below, we choose multiple columns but we can also choose multiple rows (or multiple rows and multiple columns!).

In [23]:
rainfall[1,c(4,5,7)]

As with vectors, we can also use booleans to index a matrix by providing boolean values for the row and/or columns. Note that below we give a vector for the row indices and no values for the columns. Since we did not specify any column indices, this will select all of them.

In [24]:
rainfall[c(FALSE, TRUE), ]

Let's do the opposite and select some columns and all rows.

In [25]:
rainfall[ ,c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)]

     [,1] [,2]
[1,]    5  6.0
[2,]    1  0.4


As with vectors, we can specify row names and column names to access an entry instead of using indices.

In [26]:
colnames(rainfall) <- c("Monday", "Tuesday", "Wednesday", "Thursday", 
                        "Friday", "Saturday", "Sunday")
rownames(rainfall) <- c("Week1", "Week2")
rainfall["Week1",c("Friday","Saturday")]
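
One detail worth knowing is that selecting a single row or column returns a vector rather than a matrix. The sketch below uses a small hypothetical matrix `m` and the `drop = FALSE` argument, which asks R to keep the result as a matrix.

```r
# Hypothetical 2 x 3 matrix with row and column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_vec <- m["r1", ]                 # a named vector
row_mat <- m["r1", , drop = FALSE]   # still a matrix with one row

is.matrix(row_vec)   # FALSE
is.matrix(row_mat)   # TRUE
dim(row_mat)         # 1 3
```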

### Editing a Matrix

If we want to change the values in a matrix, we need to index those values and give the new value(s). Below, we change a single entry to be 3 and then update several values to all be 0. Note that we do not need to provide multiple 0's on the right-hand side. R will infer that all values should be set to 0.

In [27]:
rainfall["Week1", "Friday"] <- 3

In [28]:
rainfall["Week1", c("Monday", "Tuesday")] <- 0
print(rainfall)

      Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Week1      0     0.0       0.1      3.0    3.0        1      0
Week2      1     0.4       0.2      0.5    0.3        0      0


Further, we can append values to our matrix by adding rows or columns through the **rbind** and **cbind** functions. The first function appends a row (or multiple rows) to a matrix and the second appends a column (or multiple columns). Note that below I provide a row and column name when passing in the additional data (if I didn't then those rows and columns would not be named). 

In [29]:
rainfall <- rbind(rainfall, "Week3" = c(0.4, 0.0, 0.0, 0.0, 1.2, 2.2, 0.0))
rainfall <- cbind(rainfall, "Total" = c(7.1, 2.4, 3.8))
print(rainfall)

      Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total
Week1    0.0     0.0       0.1      3.0    3.0      1.0      0   7.1
Week2    1.0     0.4       0.2      0.5    0.3      0.0      0   2.4
Week3    0.4     0.0       0.0      0.0    1.2      2.2      0   3.8
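
Rather than typing the totals by hand, one option is to let R compute them with the base R function `rowSums()`. This is only a sketch on a small hypothetical matrix `weekly`, not the approach used above.

```r
# Hypothetical rainfall for two weeks and four days
weekly <- matrix(c(0, 0,   0.1, 3,
                   1, 0.4, 0.2, 0.5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Week1", "Week2"),
                                 c("Mon", "Tue", "Wed", "Thu")))

totals <- rowSums(weekly)               # one total per row
weekly <- cbind(weekly, Total = totals) # append the totals as a new column
weekly
```

The related function `colSums()` works the same way for columns.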


Here is an example where we bind two matrices by column. 

In [35]:
A <- matrix(c(1,2,3,4), nrow=2)
B <- matrix(c(5,6,7,8), nrow=2)
C <- cbind(A, B)
C

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8


As with vectors, most mathematical operators (`+`, `-`, `*`, `/`, etc.) are applied element-wise to matrices in R.

In [36]:
A+B

     [,1] [,2]
[1,]    6   10
[2,]    8   12


In [43]:
exp(C)

         [,1]     [,2]     [,3]     [,4]
[1,] 2.718282 20.08554 148.4132 1096.633
[2,] 7.389056 54.59815 403.4288 2980.958
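
Because `*` is element-wise, it does not perform matrix multiplication. Base R provides the separate operator `%*%` for that. The sketch below recreates `A` and `B` so that it runs on its own and compares the two results.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)

elementwise <- A * B     # multiplies matching entries
matprod     <- A %*% B   # matrix product (rows of A times columns of B)

elementwise
matprod
```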


### Practice Question
Create a 3x4 matrix of all 1's using the `rep()` and `matrix()` functions. Then select the first and third column using indexing.

In [39]:
# Insert your solution here:

## Data Frames
Matrices can store data such as the rainfall data when everything is of the same type. However, if we want to capture more complex data records we also want to allow for different measurement types. That is where data frames come in. A data frame is like a matrix in that it is two-dimensional except we allow for each column to be a different type. In this case, each row corresponds to a single data entry (or observation) and each column is for a different variable. 
    
For example, suppose that for every day we want to record the temperature, rainfall, and day of the week. Temperature and rainfall can be numeric values, but day of the week will be character. We create a data frame using the `data.frame()` function. Note that I am providing column names for each vector (column).

The `head()` function prints the first six rows of a data frame (to avoid printing very large datasets). In our case, it will show all the data. The column names are displayed as well as their type. By contrast, the `tail()` function prints the last six rows of a data frame.
   

In [45]:
weather_data <- data.frame(day_of_week = c("Monday","Tuesday","Wednesday","Monday"), 
                           temp = c(70,62,75,50), rain = c(5,0.1,0.0,0.5))
head(weather_data)

| | day_of_week | temp | rain |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | Monday | 70 | 5.0 |
| 2 | Tuesday | 62 | 0.1 |
| 3 | Wednesday | 75 | 0.0 |
| 4 | Monday | 50 | 0.5 |


It is often useful to find the dimensions of a data frame, that is, its number of rows and columns. The `dim`, `nrow`, and `ncol` functions return the dimensions, number of rows, and number of columns of a data frame, respectively. 

In [46]:
dim(weather_data)
nrow(weather_data)
ncol(weather_data)
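
Beyond the dimensions, it can help to check the type of each column. The sketch below uses a small hypothetical data frame `weather_example` with the base R functions `str()` and `sapply()`. The argument `stringsAsFactors = FALSE` is set explicitly so that the character column stays a character in older versions of R as well.

```r
# Hypothetical data frame, similar in shape to the one above
weather_example <- data.frame(day_of_week = c("Monday", "Tuesday"),
                              temp = c(70, 62),
                              rain = c(5, 0.1),
                              stringsAsFactors = FALSE)

str(weather_example)                          # compact overview of every column
col_classes <- sapply(weather_example, class) # class of each column
col_classes
```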

The column names can be found (or assigned) using the `colnames` or `names` function. These were specified when I created the data. On the other hand, the row names are currently the indices. 

In [48]:
colnames(weather_data)
rownames(weather_data)
names(weather_data)

We could update the row names to be more informative as with a matrix.

In [51]:
rownames(weather_data) <- c("6/1", "6/2", "6/3", "6/8")
head(weather_data)

| | day_of_week | temp | rain |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 6/1 | Monday | 70 | 5.0 |
| 6/2 | Tuesday | 62 | 0.1 |
| 6/3 | Wednesday | 75 | 0.0 |
| 6/8 | Monday | 50 | 0.5 |


### Indexing a Data Frame

We can select elements of a data frame using indices in the same way as with matrices. Below, we access a single value and then a subset of our data frame. The single value is returned as a plain value (a vector of length 1), whereas the subset is itself a data frame. 

In [52]:
weather_data[1,2]
weather_data[1,c("day_of_week","temp")]

| | day_of_week | temp |
|---|---|---|
| | `<chr>` | `<dbl>` |
| 6/1 | Monday | 70 |


Another useful way to access the columns of a data frame is by using the `$` accessor and the column name. 

In [35]:
weather_data$day_of_week
weather_data$temp
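
There are a few other ways to pull out a column, and they differ in what they return. The sketch below uses a hypothetical data frame `df_example`: `$` and double brackets return the column as a vector, while single brackets with just a name return a one-column data frame.

```r
df_example <- data.frame(temp = c(70, 62), rain = c(5, 0.1))   # hypothetical data

col_dollar <- df_example$temp        # vector
col_double <- df_example[["temp"]]   # the same vector
col_single <- df_example["temp"]     # a data frame with one column

identical(col_dollar, col_double)    # TRUE
is.data.frame(col_single)            # TRUE
```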

The column `day_of_week` is a categorical variable in that it can only take on a limited number of values. For this kind of variable, it is often useful to convert that column to a **factor** as we do below. It is especially useful to convert a column to a factor if you have a numerical value corresponding to a category (e.g. 0/1 encodings).

In [36]:
weather_data$day_of_week <- factor(weather_data$day_of_week)
levels(weather_data$day_of_week)

Now, let's suppose that I want to get the temperature of the days when it rained. I can do so using the code below. The first line tests whether each entry in the vector of rain values is greater than zero. This is called a logical test. What type of output is this? 

The next line uses that result to index the temperature vector (remember that TRUE indicates to select that entry and FALSE indicates not to). 

In [37]:
weather_data$rain > 0
weather_data$temp[weather_data$rain > 0]
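
The same idea can be used to select whole rows of a data frame by putting the logical test in the row position. The sketch below uses a hypothetical copy of the data called `wd`, and the cutoff of 60 degrees is an arbitrary value chosen for illustration. Two tests can be combined with `&` (and) or `|` (or).

```r
# Hypothetical copy of the weather data so the snippet runs on its own
wd <- data.frame(day_of_week = c("Monday", "Tuesday", "Wednesday", "Monday"),
                 temp = c(70, 62, 75, 50),
                 rain = c(5, 0.1, 0.0, 0.5))

rainy_days <- wd[wd$rain > 0, ]                  # all columns, rainy rows only
warm_rainy <- wd[wd$rain > 0 & wd$temp > 60, ]   # rainy and warmer than 60

nrow(rainy_days)   # 3
nrow(warm_rainy)   # 2
```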

### Editing a Data Frame

As with matrices, we can change values in a data frame by indexing those entries. 

In [38]:
weather_data[1, "rain"] <- 2.2
weather_data

| day_of_week | temp | rain |
|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` |
| Monday | 70 | 2.2 |
| Tuesday | 62 | 0.1 |
| Wednesday | 75 | 0.0 |
| Monday | 50 | 0.5 |


The **rbind** and **cbind** functions also work for data frames in much the same way as for matrices. However, another way to add a column is to use the `$` accessor directly. Below, we add a categorical column called `heavy_traffic`.

In [39]:
weather_data$heavy_traffic <- as.factor(c(1, 0, 0, 0))
weather_data

| day_of_week | temp | rain | heavy_traffic |
|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<fct>` |
| Monday | 70 | 2.2 | 1 |
| Tuesday | 62 | 0.1 | 0 |
| Wednesday | 75 | 0.0 | 0 |
| Monday | 50 | 0.5 | 0 |
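
New columns can also be computed from existing ones instead of being typed in. The sketch below uses a hypothetical data frame `wd_edit` and a hypothetical column name `rained`. It also shows that assigning `NULL` to a column removes it.

```r
wd_edit <- data.frame(temp = c(70, 62, 75, 50),
                      rain = c(2.2, 0.1, 0.0, 0.5))   # hypothetical data

wd_edit$rained <- wd_edit$rain > 0   # logical column derived from rain
wd_edit

wd_edit$rained <- NULL               # remove the column again
names(wd_edit)                       # "temp" "rain"
```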


### Practice Question
Add a column to `weather_data` called `late_to_work` using the `rep()` function so that all values are `NA` (the missing value in R). Then, index the second value of this column and set it to be 1.

In [40]:
# Insert your solution here:


## Lists

A data frame is actually a special kind of another data structure called a list. A list is a collection of objects stored under the same name. These objects can be vectors, matrices, data frames, or even other lists! With a list there does not have to be any relation in size, type, or other attribute between different members of the list. Below, we create an example list using the `list` function, which takes in a series of objects. What is the type of each element of the list? We can access each element by its index, but here we need to use double brackets. 

In [41]:
ex_list <- list("Alice", c("mint_chip", "caramel"), c(3.1, 2.5, 4.0))
print(ex_list)
ex_list[[2]]

[[1]]
[1] "Alice"

[[2]]
[1] "mint_chip" "caramel"  

[[3]]
[1] 3.1 2.5 4.0
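
To see why double brackets are needed, here is a sketch comparing the two kinds of brackets on a hypothetical list `example_list`. Single brackets return a smaller list, whereas double brackets return the object stored inside.

```r
example_list <- list("Alice", c("mint_chip", "caramel"), c(3.1, 2.5, 4.0))

sub_list <- example_list[2]     # a list of length 1
element  <- example_list[[2]]   # the character vector itself

class(sub_list)   # "list"
class(element)    # "character"

first_flavor <- example_list[[2]][1]   # index into the extracted vector
first_flavor                           # "mint_chip"
```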



More often, however, it is useful to name the elements of the list for easier access. Let's create this list again but this time give names to each object.

In [42]:
ex_list <- list(name="Alice", ice_cream_flavors = c("mint_chip", "caramel"), 
                run_lengths = c(3.1, 2.5, 4.0))
print(ex_list)
ex_list$ice_cream_flavors

$name
[1] "Alice"

$ice_cream_flavors
[1] "mint_chip" "caramel"  

$run_lengths
[1] 3.1 2.5 4.0



To edit a list, we can index and access different objects in the list. Additionally, we can add objects to the list using the `$` accessor.

In [43]:
ex_list$toppings <- c("Cherry", "Sprinkles")
ex_list

In [44]:
ex_list$toppings[2] <- "Hot Fudge"
ex_list
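
A few more list operations are sketched below on a hypothetical list `person`, where the `city` entry is made up for illustration. Assigning `NULL` removes an element, and `names()` and `length()` work as they do for vectors. The last lines check the earlier claim that a data frame is a list.

```r
person <- list(name = "Alice", run_lengths = c(3.1, 2.5, 4.0))   # hypothetical list

person$city <- "Boston"   # add an element (made-up value)
names(person)             # "name" "run_lengths" "city"

person$city <- NULL       # remove the element again
length(person)            # 2

df_check <- data.frame(x = 1:3)
is.list(df_check)         # TRUE: a data frame is a list of columns
```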

## Video: Example Matrices and Data Frames


%%HTML
<iframe width="560" height="315"
 src="https://www.youtube.com/embed/fZQERMjaoUY">
</iframe>


## Exercises

## Tips and Resources
