# R Extra Review Section 9/14

Today we are going to go over the following topics:

- Data wrangling
    - Indexing and subsetting
    - Manipulating variables
- Plotting:
    - `plot()` 
    - `curve()`
    - `ggplot()`
- Regression:
    - hand-cooked
    - `lm()`
- Other issues you've been having

Megan will be holding an in-depth review of the solutions to Small Assignment 1 tomorrow.

## Data Wrangling

Let's start by reading in our favorite toy dataset, `autos.dta`, for which we will need the `haven` package. We also might as well load the `tidyverse` library while we're at it. 

In [None]:
#Load libraries
library(haven)
library(tidyverse)
#Read in data
df <- read_dta('autos.dta')
#Display the head
head(df)

### Object Oriented Programming
What would happen if we didn't assign the data to the object `df`?

In [None]:
#Clear the objects from the workspace
rm(list=ls())
read_dta('autos.dta')
head('autos.dta')


Recall that **R** doesn't keep the result of a command unless we assign it to an *object*. Since we didn't give the data from `autos.dta` a name, `read_dta()` just printed the data and then discarded it, and the `head()` command thought we wanted to see the first 6 elements of the string (text fragment) 'autos.dta', so it simply returned the string itself. Not very useful. Reformed Stata users generally struggle with this at first. 
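
To make the distinction concrete, here is a minimal sketch that uses the built-in `mtcars` data frame as a stand-in for `autos.dta` (an assumption, so that it runs without the file). The object names `s` and `h` are just for illustration.

```r
# head() on a character string: there is only one element, so we get the string back
s <- head('autos.dta')
s

# head() on a data frame that lives in an object: the first 6 rows (the default)
h <- head(mtcars)
nrow(h)
```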

Let's re-read the data back in properly and look a bit more at the types of **R** objects we might deal with. 

In [None]:
df <- read_dta('autos.dta')

#Class of object df
paste('Class of data frame df is:', class(df)[3])
paste('Class of variable rep78 is:', class(df$rep78))
paste('Class of variable make is:',class(df$make)) 


Objects in **R** can be pretty general. We might store scalars (numbers), plots, functions, text strings, or a whole bunch of other stuff. As you can see above, **R** assigns these objects different classes. We probably won't be doing too much with classes in this course, but they're good to know about for general programming in **R**. If you're getting a weird error message, it may be that your variable is of the wrong class. 
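
As a rough illustration of how classes differ, and how the wrong class can get in the way, consider the sketch below. The vectors `x_chr` and `x_num` are made up for this example.

```r
# A few common classes
class(3.14)          # numeric
class('autos.dta')   # character
class(TRUE)          # logical
class(mean)          # function

# Numbers stored as text have class "character",
# so arithmetic such as x_chr + 1 would throw an error
x_chr <- c('1', '2', '3')
class(x_chr)

# Converting to numeric fixes the problem
x_num <- as.numeric(x_chr)
mean(x_num)
```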

### Indexing

Most of the time, we will be dealing with data frames. As you can see, when we read a .dta or .csv file into **R** it takes on the class 'data.frame'. Data frames are useful because we can treat them either like matrices or like an Excel sheet with well-defined column names. For example, the following two lines both return 3. 

In [None]:
df$headroom[3]
df[3,5]


(The difference is that the former is just the number 3, while the latter returns a 1x1 data frame that just contains 3, because `read_dta()` gives us a tibble). But the key point is that we can either ask **R** for the 3rd row in column 5 or the 3rd observation in the variable 'headroom'. (*note for Python users,* **R** *indices start at 1 rather than 0*). 

Note that when we deal with two-dimensional objects, we index the row first and the column second. If we leave out the column index (but still keep the comma in place) we get the entire row; if we leave out the row index, we get the entire column. The same bracket notation works on one-dimensional vectors, just without the comma. Can you figure out what the following commands index?

In [None]:
#Create a vector of the first 10 prime numbers
x <- c(2,3,5,7,11,13,17,19,23,29)
#Play around with indexing
x[2:4]
x[-1]
x[c(1,5,9)]
x[c(9,1,5)]
x[-c(2,4:9)]

The `c()` command will come in handy when you want to combine different elements for one command.
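
The row-first, column-second rule can be sketched on a small made-up data frame (the object `toy` and its columns are hypothetical, not part of `autos.dta`). Note that this uses a plain base-**R** data frame; a tibble such as the one returned by `read_dta()` keeps a single column as a one-column table rather than simplifying it to a vector.

```r
# A small made-up data frame with 3 rows and 2 columns
toy <- data.frame(make = c('A', 'B', 'C'), mpg = c(22, 17, 30))

toy[2, ]          # entire 2nd row (column index left out)
toy[, 2]          # entire 2nd column (row index left out)
toy$mpg[2]        # 2nd observation of the variable mpg
toy[2, 'mpg']     # same element, using the column name instead of its position
toy[-1, ]         # everything except the 1st row
```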

### Data Manipulation and Subsetting

There are multiple ways to create new variables in **R**. One way is just to assign it to a new object. Another is to add a new column to the data frame by specifying `df$new_col`. Perhaps the most efficient way is with the `mutate()` function, where in the first argument you specify the data frame you want to alter and in the second you specify `newvar=expr`. 

In [None]:
price_thou <- df$price/1000
df$weight_tons <- df$weight/2000
head(df)
df <- mutate(df, price_thou=price/1000)
head(df)

You also might want to drop variables, or keep only certain rows. We refer to this as subsetting. Below are a few of the functions you can use to do this:

* `subset()` keeps the rows of `df` where the statement to the right of the comma is true
* We can also put a logical expression in place of an index number. The second command replaces `df3` with the subset of its rows where `mpg` is 14 or 18
* The third command keeps only the columns that have no missing values
* The fourth command uses the `select()` function to specify a list of variables in `df3` to keep (using the `c()` function)
* The minus signs in the last command indicate that we want to drop these variables

In [None]:
#Create df3 as the subset of observations that are foreign with less than 30 mpg
df3<-subset(df, foreign==1 & mpg<30)
head(df3)
#Keep if mpg is equal to 14 or 18
df3<-df3[df3$mpg==18 | df3$mpg==14,]
head(df3)
#Drop any variable that contains missing values
df3<-df3[,colSums(is.na(df3))==0]
head(df3)
#Keep the following variables
df3 <- select(df3, c(make, price_thou, weight_tons, gear_ratio, trunk))
head(df3)
#Drop gear ratio and trunk
df3 <- select(df3, -gear_ratio, -trunk)
head(df3)
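
For comparison, here is a hedged base-**R** sketch of the same kinds of row and column subsetting on a made-up data frame (`toy` and its values are hypothetical). It also shows `%in%`, a compact alternative to chaining `==` comparisons with `|`.

```r
# Made-up data with one missing value in 'trunk'
toy <- data.frame(make    = c('A', 'B', 'C', 'D'),
                  mpg     = c(14, 18, 25, 18),
                  foreign = c(1, 1, 0, 1),
                  trunk   = c(11, NA, 16, 9))

# Keep rows that are foreign with less than 30 mpg
toy2 <- subset(toy, foreign == 1 & mpg < 30)

# Keep rows where mpg is 14 or 18 (same as mpg == 14 | mpg == 18)
toy2 <- toy2[toy2$mpg %in% c(14, 18), ]

# Drop any column that contains a missing value (here: trunk)
toy2 <- toy2[, colSums(is.na(toy2)) == 0]

# Keep only make and mpg
toy2 <- toy2[, c('make', 'mpg')]
toy2
```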

## Regression

Okay, now that we have a better understanding of how to handle our data, let's quickly breeze through our hand-cooked regression example before we introduce the `lm()` function. We promise that this week is the last time we'll make you calculate a regression by hand on an assignment. 

Suppose our model is $\text{price}=\beta_0+\beta_1 \text{weight}+u$. Estimate $\hat\beta_0$ and $\hat\beta_1$.

Let's try to implement the steps Jeremy did in his Excel demonstration using **R** code that is as efficient as possible. Remembering the formulas for sample covariance and variance $$cov(x,y)=\frac{\sum_{i=1}^N(x_i-\bar x)(y_i-\bar y)}{N-1},\;\;\;var(x)=\frac{\sum_{i=1}^N(x_i-\bar x)^2}{N-1}$$ let's construct these ingredients one-by-one.

In [None]:
#Get means
xbar<- mean(df$weight)
ybar<- mean(df$price)

#Calculate x_i-xbar and y_i-ybar
df <- mutate(df, xres= weight-xbar, yres=price-ybar)
df <- mutate(df, xy= xres*yres, xx=xres^2)

##Get covariance and variance by summing up xy and xx and dividing by n-1. 
covxy<- sum(df$xy)/(nrow(df)-1)
varx<- sum(df$xx)/(nrow(df)-1)

#Calculate b0 and b1
b1 <- covxy/varx
b0 <- ybar-b1*xbar

#Display coeffs
c(b0,b1)
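
As a sanity check on the formulas, **R**'s built-in `cov()` and `var()` functions use the same $N-1$ denominators, so they should reproduce the hand-cooked ingredients. The sketch below uses made-up vectors `x` and `y` rather than `autos.dta`.

```r
# Made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Hand-cooked covariance and variance
covxy_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
varx_hand  <- sum((x - mean(x))^2) / (n - 1)

# Built-in versions
covxy_builtin <- cov(x, y)
varx_builtin  <- var(x)

# Slope and intercept from the hand-cooked ingredients
b1_hand <- covxy_hand / varx_hand
b0_hand <- mean(y) - b1_hand * mean(x)

# Compare the two routes, then display the coefficients
all.equal(c(covxy_hand, varx_hand), c(covxy_builtin, varx_builtin))
c(b0_hand, b1_hand)
```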

Now let's check this against the `lm()` command.

In [None]:
reg <- lm(data=df, price~weight)
summary(reg)

Looks good! A few notes about the `lm()` function. 
* We could also have written `lm(df$price~df$weight)`
* To get the R-squared, we could just write `summary(reg)$r.squared`. We can also use similar syntax to extract coefficients and other statistics we'll learn about later. 
* If we wanted to extend this to a multiple regression model, we could write the regression expression as `price~weight+mpg+headroom` etc. 
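
These notes can be sketched with the built-in `mtcars` dataset, used here as a stand-in (its variable names `mpg`, `wt`, and `hp` differ from those in `autos.dta`).

```r
# Simple regression of mpg on weight
reg1 <- lm(mpg ~ wt, data = mtcars)

# Extract the coefficients and the R-squared
coef(reg1)
summary(reg1)$r.squared

# Full coefficient table (estimates, standard errors, t-stats, p-values)
summary(reg1)$coefficients

# Multiple regression: add regressors with +
reg2 <- lm(mpg ~ wt + hp, data = mtcars)
coef(reg2)
```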

## Plotting

Plotting in **R** can be really simple, but there are plenty of tools to make plots super fancy as well. The simplest plot you can make is a scatter plot, which is done as follows:

In [None]:
plot(df$weight, df$price, xlab="X label", ylab="Y label", main="Main title")


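To plot a line graph, just add the option `type='l'`. To plot a histogram, use the `hist()` function. Note that the `breaks` option controls roughly how many bins there are: a single number is treated as a suggestion, and **R** picks "pretty" breakpoints close to it. 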

In [None]:
hist(x=df$price, breaks= 12, main='Histogram of price', xlab='Price')
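
To see what `breaks` actually does, one option is to call `hist()` with `plot=FALSE`, which returns the bin edges and counts without drawing anything. The sketch below uses the built-in `mtcars$mpg` as a stand-in for `df$price`.

```r
x <- mtcars$mpg

# A single number is only a suggestion for the number of bins
h1 <- hist(x, breaks = 12, plot = FALSE)
h1$breaks           # the bin edges R actually chose
length(h1$counts)   # the number of bins actually used

# A vector of breakpoints is used exactly as given
h2 <- hist(x, breaks = seq(10, 35, by = 5), plot = FALSE)
h2$counts           # 5 bins: 10-15, 15-20, ..., 30-35
```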

To plot a function, such as a regression line, one simple way is the `curve()` function. We can also use the `points()` function to overlay the scatter plot and the `abline()` function to overlay lines corresponding to different statistics. 

In [None]:
curve(b0+b1*x,0,max(df$weight))
points(df$weight, df$price)
abline(v=median(df$weight), col='blue', lwd=3)

We can even use `abline()` to overlay a regression line on our original scatter plot.

In [None]:
plot(df$weight, df$price, xlab="Weight", ylab="Price", main="Regression of price on weight")
abline(lm(df$price~df$weight), col='red')

Finally, I'd like to briefly introduce the `ggplot2` package and its `ggplot()` function, which are a lot more powerful and flexible (but not necessary for the purposes of this course).

In [None]:
library(ggplot2)
p <- ggplot(data=df, aes(x=weight, y=price))
p + geom_point(color='magenta') + theme_minimal() +
  geom_smooth(method = "lm", se=TRUE, color="gold", formula = y ~ x) +
  labs(x='Weight', y='Price', title='Regression of Price on Weight', subtitle='Unnecessarily Fancy') +
  theme(panel.background = element_rect(fill = "lightblue", colour = "lightblue", linewidth = 0.5, linetype = "solid")) #On ggplot2 versions before 3.4.0, use size instead of linewidth

That's all for now. Hopefully this answers some of the questions you had. If you still have questions about Small Assignment 1, please tune into Megan's section tomorrow. 