<img src="images/econ140R_logo.png" width="200" />

<h1>ECON 140R - Coding Bootcamp, Part 1</h1>

## Position Indices and Pipes

This material is closely and gratefully adapted from the work of the UC Berkeley EEP/IAS 118 team, including Jeremy Magruder, Sofia Villas-Boas, James Sears, and many other people working on these materials for EEP118. This is your work. We are in your debt!

This bootcamp will help us get better at manipulating data and improve our coding efficiency with <b>pipes</b>.

For today, let's load a few packages and read in a dataset on sleep quality and time allocation for 706 individuals. This dataset is saved to the section folder as `sleep75.dta`. 

In [None]:
library(tidyverse)
library(haven)
sleepdata <- read_dta("sleep75.dta")
# The following seeds the random number generators (RNG) so that we'll get reproducible results
set.seed(12345)

## Working with Indices

We've seen how to manipulate datasets by adding in variables or removing certain observations. But what if we want to obtain either one element or a subset of elements from a known location? We can do this by referring to matrix indices of the dataframe.

### Matrix and Vector Indices in R
**R** uses brackets `[]` to indicate we're referring to a position. If added to the end of a one-dimensional vector object, we only need to include the position of the object we want to pull. When it comes to matrices, we now have to account for both the row and column dimensions, in that order. Let's see some examples!


### Vectors
Let's start by working with a vector. Run the following code in the cell below to create and see a vector named `vec`:

`vec <- rnorm(10, mean = 4, sd = 2)`

`vec`

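**What did we do?** We created a vector of length 10 containing random draws from a Normal(4, 4) distribution, that is, mean 4 and variance 4. Note that `rnorm()` takes the standard deviation (`sd = 2`), not the variance.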
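
To make the seeding and the `sd` argument concrete, here is a minimal sketch. The object names (`first_draw`, `second_draw`, `big_draw`) and the large sample size are just for illustration and are not part of the section materials.

```r
# Re-seeding with the same value reproduces the same "random" draws
set.seed(12345)
first_draw <- rnorm(10, mean = 4, sd = 2)
set.seed(12345)
second_draw <- rnorm(10, mean = 4, sd = 2)
identical(first_draw, second_draw)  # TRUE

# rnorm() takes the standard deviation, so sd = 2 implies variance 2^2 = 4
big_draw <- rnorm(100000, mean = 4, sd = 2)
mean(big_draw)  # close to 4
var(big_draw)   # close to 4
```

With only 10 draws, the sample mean and variance of `vec` can land fairly far from 4; the large sample is only there to make the pattern visible.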

If we wanted just the third element of this vector, we could call the following. Try it below:

`vec[3]`

The square brackets `[]` let __R__ know that you want to select on position, while the 3 is our instruction for which position to pull from. 

(Note that since we're working with a one-dimensional vector and not a dataframe, we can't use `$` to call a certain column.)

If we are interested in elements 5 through 7, we can use `:` within the brackets to pull a sequence of consecutive positions:

`vec[5:7]`

Finally, if we want to pull the first, fourth, and seventh elements, we can do that using `c()`:

`vec[c(1, 4, 7)]`


Remember that `c()` combines the listed elements into a vector itself, so if you ran `c(1, 4, 7)` on its own you'd get back a vector of those three elements in order. We can see that by creating a new vector of 30, 34, 38, and 42 and checking its type:

`newvec <- c(30,34,38,42)`

`newvec`

`is.vector(newvec)`
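
Because `vec` holds random draws, it can be hard to tell at a glance whether an index pulled the right element. The sketch below repeats the same three indexing patterns on a made-up vector (`heights`, with values invented for illustration) where each result is easy to verify by eye.

```r
# A hypothetical vector with known values
heights <- c(150, 160, 170, 180, 190, 200, 210)

heights[3]           # single position: 170
heights[5:7]         # consecutive positions: 190 200 210
heights[c(1, 4, 7)]  # arbitrary positions: 150 180 210

# c(1, 4, 7) is itself a vector, so it can be stored and reused as an index
keep <- c(1, 4, 7)
heights[keep]        # same result as heights[c(1, 4, 7)]
```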

### Matrices

What happens when we are working with multidimensional objects? Largely the same thing!

Let's start by using `matrix()` to create a matrix with 4 rows and 10 columns containing the values 1:40, and then checking its type:

`mat40 <- matrix(1:40, nrow = 4, ncol = 10)`

`mat40`

`is.matrix(mat40)`

To pull an element from a matrix using its index, we now have to account for both the row and column dimensions. We refer to a position by specifying `[row#, column#]`: first the row we want to draw from, then the column.

To get the first element (first row and first column), we can use 

`mat40[1,1]`

And for the element in the 3rd row and 6th column:

`mat40[3, 6]`

As with vectors, we can pull multiple adjacent elements with a `:`. If we wanted the fifth, sixth, and seventh elements from the 2nd row, we could use

`mat40[2, 5:7]`

If instead we wanted an entire row, we can just omit the column value entirely:

`mat40[2, ]`

And for the sixth column:

`mat40[,6]`
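
It helps to know that `matrix()` fills values column by column by default, so the first column of `mat40` holds 1 through 4, the second holds 5 through 8, and so on. The sketch below rebuilds the same matrix and annotates what each of the calls above returns.

```r
# Same construction as mat40; matrix() fills column by column by default
mat40 <- matrix(1:40, nrow = 4, ncol = 10)

mat40[1, 1]    # 1
mat40[3, 6]    # 23: five full columns (5 * 4 = 20) plus 3 rows down
mat40[2, 5:7]  # 18 22 26
mat40[2, ]     # entire second row: 2 6 10 ... 38
mat40[, 6]     # entire sixth column: 21 22 23 24
```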


We can also use `c()` as before, and combine it with `:` to pull many elements/rows/columns at once!

#### Quiz Time
What elements do we pull with the following commands?

* `mat40[c(2,4), ]`
* `mat40[3,c(1,4)]`
* `mat40[c(1,3), 5:7]`


We have a lot of flexibility here to call one element or multiple elements at the same time. The only restriction is that we follow the `[row#, col#]` syntax.
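
To practice before checking the quiz answers in R, here is a small sketch on a different, made-up matrix (`small`, 3 rows by 4 columns) that mixes `c()` and `:` in the same ways.

```r
# A hypothetical 3 x 4 matrix, filled column by column:
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12
small <- matrix(1:12, nrow = 3, ncol = 4)

small[c(1, 3), ]      # rows 1 and 3, all columns (a 2 x 4 matrix)
small[2, c(1, 4)]     # row 2, columns 1 and 4: 2 11
small[c(1, 3), 2:3]   # rows 1 and 3, columns 2 to 3 (a 2 x 2 matrix)

dim(small[c(1, 3), 2:3])  # 2 2
```

Notice that pulling from a single row or column returns a plain vector, while pulling several rows and several columns returns a smaller matrix.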

### Data Frames

The process for data frames is pretty similar, albeit with one extension. Now that we have variables, we can combine a position call with the `$` for a specific variable.

In [None]:
sleepdf <- select(sleepdata, age, educ, exper, hrwage)
head(sleepdf)
nrow(sleepdf)
ncol(sleepdf)
dim(sleepdf)

is.data.frame(sleepdf)

Remember that we can access a column by referring to it by name after the `$`:

`sleepdf$age`

You can similarly refer to the matrix position to pull the age column:

`sleepdf[,1]`


You can also refer to the column by name, in quotation marks:

`sleepdf[, "age"]`

*Note: **tidyverse** functions expect just the name of the variable and don't depend on the specific position of the variable in the dataframe.*

If we wanted to get the element in the fourth row of column 4 (`hrwage`), we could refer to the matrix position:

`sleepdf[4,4]`

Alternatively, we can do the same thing by referring to the specific position within the variable:

`sleepdf$hrwage[4]`

Note that when we use the `$` to call a specific variable, __R__ now treats that variable as a vector, so we can refer to its elements with `[]` in one dimension. In that case, our call `sleepdf$hrwage[4]` gives us just a number, whereas the previous call of `sleepdf[4,4]` gives us the same value but presented in a 1x1 table.
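
The 1x1 table shows up because `read_dta()` returns a tibble, and tibbles keep their table shape under `[row, col]` indexing. The sketch below illustrates this on a small made-up tibble (`toy_df`, with invented values standing in for `sleepdf`).

```r
library(tibble)

# Hypothetical stand-in for sleepdf, with made-up values
toy_df <- tibble(
  age    = c(32, 31, 44, 30),
  educ   = c(12, 14, 17, 12),
  exper  = c(14, 11, 21, 12),
  hrwage = c(7.07, 1.43, 20.53, 9.62)
)

cell_tbl <- toy_df[4, 4]          # a 1 x 1 tibble
cell_num <- toy_df$hrwage[4]      # a plain number: 9.62
cell_dbl <- toy_df[["hrwage"]][4] # double brackets also pull the column as a vector

is.data.frame(cell_tbl)  # TRUE
is.numeric(cell_num)     # TRUE

# A base data.frame (not a tibble) drops to a plain value instead
as.data.frame(toy_df)[4, 4]  # 9.62
```

If you need a bare number to feed into later arithmetic, the `$` form is the safer habit.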

## Pipes

As we start performing multiple consecutive manipulations on a dataframe, generating more specific summary statistics that require multiple coding steps, or making custom figures in ggplot, it can get tedious (and memory-intensive) to constantly have to assign objects to memory in each intermediate step. Fortunately **tidyverse** has a great alternative that will save us time and effort: pipes!

For example, suppose we want to alter our sleep variable to measure hours slept per night, and then obtain summary statistics for everyone over age 25, separately for individuals who are in good or excellent health (`gdhlth == 1`) and those who are not (`gdhlth == 0`). We could do it in the following way:

`sleepdata <- filter(sleepdata, age > 25)
sleepdata <- mutate(sleepdata, hrs_night = sleep/(7*60))
sleepdata_goodhealth <- filter(sleepdata, gdhlth == 1)
sleepdata_poorhealth <- filter(sleepdata, gdhlth == 0)
summarize(sleepdata_poorhealth, mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count_poorhealth = n())
summarize(sleepdata_goodhealth, mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count_goodhealth = n())`


To get summary statistics on hours slept per night for each of the good and poor health groups, we had to use `filter()` to subset the data on health quality, store those subsets in memory, and then generate summary statistics for each subset individually.

**tidyverse** has a fantastic alternative that helps us skip these intermediate steps: the pipe, `%>%`. The pipe takes the output of the expression on its left and plugs it in as the first argument of the function to its right (or on the next line). For instance, we could rewrite the above code using pipes in fewer lines and without storing any intermediate objects:

`sleepdata %>%
    filter(age > 25) %>%
    mutate(hrs_night = sleep/(7*60)) %>%
    group_by(gdhlth) %>%
    summarize(mean_hours = mean(hrs_night), min_hours = min(hrs_night), max = max(hrs_night), count = n())`

This gives us the same summary statistics, now in a single table with one row per health group, without storing any intermediate objects and with less code.

What the pipe is doing here is:
1. Passing our dataframe *sleepdata* on as the first argument of the next function (filter)
2. Filtering on age > 25. Since we used the pipe to pass in *sleepdata*, **R** knows that the first argument is our *sleepdata* object, so we only specify the second argument (the condition for filtering). We then pass the filtered version of *sleepdata* on to the next line
3. Creating a new variable `hrs_night` that measures hours slept per night (minutes per week divided by 7 days/week and divided by 60 min/hr), then passing the filtered dataframe with the new variable onward
4. Taking the filtered/mutated version of sleepdata and grouping it by our good health variable and sending it on
5. Summarizing the grouped data, reporting mean/min/max hours per night and the total number in each group.
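
To see that the pipe really is just a way of filling in the first argument, the sketch below runs the same steps twice on a tiny made-up dataset (`toy_sleep`, with invented values): once as nested function calls and once with pipes. The shortened `summarize()` call is only to keep the example compact.

```r
library(dplyr)

# Hypothetical miniature of sleepdata; sleep is minutes slept per week
toy_sleep <- data.frame(
  age    = c(23, 30, 41, 52, 36, 28),
  sleep  = c(3360, 2940, 3150, 2520, 3570, 3360),
  gdhlth = c(1, 1, 0, 1, 0, 1)
)

# Nested calls: read from the inside out
nested <- summarize(
  group_by(
    mutate(filter(toy_sleep, age > 25), hrs_night = sleep / (7 * 60)),
    gdhlth
  ),
  mean_hours = mean(hrs_night),
  count = n()
)

# Piped calls: read from top to bottom
piped <- toy_sleep %>%
  filter(age > 25) %>%
  mutate(hrs_night = sleep / (7 * 60)) %>%
  group_by(gdhlth) %>%
  summarize(mean_hours = mean(hrs_night), count = n())

# Both routes produce the same table
all.equal(as.data.frame(nested), as.data.frame(piped))  # TRUE
```

Both versions do the same work; the piped one simply lists the steps in the order they happen.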

One quick note: if we wanted to use a pipe for a number of steps and then save the resulting object to memory, we can do that! As long as you add `name <-` before the object at the top of the pipe, the result at the end of all the pipes will be saved to memory.

For example, we can filter to keep data only for people older than 25 and add our hours slept per night variable using pipes:

`sleep_old <- sleepdata %>%
    filter(age > 25) %>%
    mutate(hrs_night = sleep/(7*60))
head(sleep_old)`
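
One point worth checking for yourself: a pipe never changes the object it starts from. The sketch below uses a tiny made-up dataset (`toy_sleep` and `toy_old` are illustrative names) to show that only the assigned name picks up the result.

```r
library(dplyr)

# Hypothetical three-row dataset; sleep is minutes slept per week
toy_sleep <- data.frame(age = c(23, 30, 41), sleep = c(3360, 2940, 3150))

# Without an assignment, the result is printed and then discarded
toy_sleep %>% filter(age > 25)

# With `toy_old <-` at the top, the end result of the whole chain is stored
toy_old <- toy_sleep %>%
  filter(age > 25) %>%
  mutate(hrs_night = sleep / (7 * 60))

nrow(toy_sleep)  # 3: the original object is unchanged
nrow(toy_old)    # 2
names(toy_old)   # "age" "sleep" "hrs_night"
```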

We could also subset the older data for those not in a union, keep only a few variables of interest, and then arrange the subset by hours slept:

`sleep_nonunion <- sleep_old %>%
    filter(union == 0) %>%
    select(hrs_night, union, gdhlth, age, exper) %>%
    arrange(hrs_night)
head(sleep_nonunion)`

### Practice with Grouping and Pipes

We want to know the average hours slept per night for everyone under age 30 in our sample. We feel the mean will be more informative if we can see the average hours slept per night by year of age. 

Report the mean of hours slept per night by ages 23, 24, 25, 26, 27, 28, and 29.

### Pipes: Summary

Pipes send the object to the left of the pipe into the next expression as its first argument. They are a handy way to cut down on repetitive code and avoid storing intermediate objects, especially when working with **tidyverse** functions. However, there are situations where it is worth intentionally pausing between steps to inspect an intermediate result, and times when a plain pipe does not fit because the next function does not take the data as its first argument.
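
As one illustration of the "first argument" issue, `lm()` takes a formula first and the data second. The `%>%` pipe offers a `.` placeholder for this case. The sketch below is a minimal example on made-up data (`toy_sleep` and its values are invented), not something the rest of this bootcamp relies on.

```r
library(dplyr)  # also makes %>% available

# Hypothetical data: age and hours slept per night
toy_sleep <- data.frame(
  age       = c(23, 30, 41, 52, 36, 28),
  hrs_night = c(8, 7, 7.5, 6, 8.5, 8)
)

# lm() expects the formula first, so a plain pipe would put the data frame
# in the wrong slot. The `.` placeholder marks where the piped object goes.
fit <- toy_sleep %>%
  filter(age > 25) %>%
  lm(hrs_night ~ age, data = .)

coef(fit)  # intercept and slope from the five rows with age > 25
```

When a chain needs many placeholders, it is often clearer to stop, store the intermediate object, and start a new step.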