# Coding Section 2
## Econ 130
GSIs: Sarah Albert and Bryan Chu

### Goals for today
* Do some data analysis, building up to a diff-in-diff
* We will build on what we did in the previous section, but we will not assume any knowledge beyond what we covered there.


Oftentimes, user-written open-source packages are needed for specific functionality in R (e.g., nice graphics). We need to install these packages manually (once) and load them at the beginning of every script. In the Jupyter notebooks, the packages have been pre-installed.

*If you are wondering why a command you've used before is no longer working, it may be because you haven't loaded the package.*
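
When running scripts outside the notebook, one common pattern is to install a package only when it is missing and then load it. The sketch below is one way to do that; the helper name `load_or_install` is hypothetical (it is not part of R or of the course materials), while `requireNamespace`, `install.packages`, and `library` are standard R functions.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call never triggers an install.
load_or_install("stats")
```

Calling `load_or_install("binsreg")` at the top of a script would follow the same logic.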

In [None]:
# You only need to install binsreg once (installation may print a warning message that is safe to ignore),
# but you will need to call the "library" command in every new session.

install.packages('binsreg')
library('binsreg')


In [None]:
# Clear all objects stored in memory (the workspace)
rm(list = ls())

# Read in the data that we constructed last time
mw_no_shore <- read.csv('minwage_no_shore.csv')

In [None]:
# First, let's try to visualize the relationship between employment (full_time)
# and wages (wage_st, the starting wage). We'll focus on the relationship
# in period 1

# First, we'll try a scatterplot
pre_data <- mw_no_shore[which(mw_no_shore$interview==1),]
plot(pre_data$wage_st,pre_data$full_time)


In [None]:
# It's hard to interpret! Let's add a line of best fit
plot(pre_data$wage_st,pre_data$full_time)
abline(lm(pre_data$full_time~pre_data$wage_st), col="red")

# Note that it's plot(x, y) but the linear model is lm(y ~ x)!
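
Passing a `data` argument to `lm` keeps the formula readable and gives the same estimates as the `$` version above. Here is a minimal sketch on simulated data; the data frame `toy` and its columns are made up for illustration and are not part of the minimum-wage data.

```r
# Simulated data, for illustration only
set.seed(130)
toy <- data.frame(x = runif(50, min = 4, max = 6))
toy$y <- 2 + 3 * toy$x + rnorm(50)

# plot() takes (x, y); lm() takes a formula y ~ x
fit_dollar  <- lm(toy$y ~ toy$x)
fit_formula <- lm(y ~ x, data = toy)

plot(toy$x, toy$y)
abline(fit_formula, col = "red")

# Both calls estimate the same intercept and slope
all.equal(unname(coef(fit_dollar)), unname(coef(fit_formula)))
```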

In [None]:
# The line of best fit is upward-sloping, indicating that in the cross-section,
# higher wages are correlated with higher employment. But the figure is still
# a bit difficult to interpret, so let's try a binscatter.

binsglm(pre_data$full_time, pre_data$wage_st, polyreg=1)

# When you run this command, you may see a lot of warning messages in a red rectangle.
# In general, you do not want the code you submit to produce warnings, but your GSIs
# have done a lot of work and Googling and cannot figure out why these still appear.
# You may ignore warning messages when you run the binsglm command, but be wary if
# you see warning messages from other commands that you run.
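
To see what a binscatter does conceptually, the sketch below builds one by hand in base R: sort observations into bins of x with roughly equal counts, then plot the mean of y against the mean of x within each bin. This only illustrates the idea; it is not how `binsreg` chooses the number of bins or computes its fit, and the simulated `toy` data and the choice of 10 bins are assumptions.

```r
# Simulated data, for illustration only
set.seed(130)
toy <- data.frame(x = runif(200, min = 4, max = 6))
toy$y <- 5 + 2 * toy$x + rnorm(200, sd = 4)

# Cut x into 10 bins with roughly equal numbers of observations
n_bins <- 10
breaks <- quantile(toy$x, probs = seq(0, 1, length.out = n_bins + 1))
toy$bin <- cut(toy$x, breaks = breaks, include.lowest = TRUE)

# Mean of x and y within each bin
bin_x <- tapply(toy$x, toy$bin, mean)
bin_y <- tapply(toy$y, toy$bin, mean)

# One point per bin, plus the line of best fit from the full data
plot(bin_x, bin_y, xlab = "x (bin mean)", ylab = "y (bin mean)")
abline(lm(y ~ x, data = toy), col = "red")
```

Averaging within bins removes much of the noise in y, which is why the underlying relationship is easier to see than in the raw scatterplot.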

In [None]:
# How does the post period look?
post_data <- mw_no_shore[which(mw_no_shore$interview==2),]
binsglm(post_data$full_time, post_data$wage_st, polyreg=1)

## Correlations

In [None]:
# What are the correlations among full-time employment, part-time employment, and wages? How do we read this table?
# Note: the argument use = "complete.obs" tells R to drop rows with missing data. Otherwise it will
# return a lot of NAs (you can try it if you want!).

cor(mw_no_shore[, c('full_time','part_time','wage_st')], use = "complete.obs")
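
The effect of `use = "complete.obs"` is easiest to see on a tiny example. The data frame below is made up for illustration; its column names `a`, `b`, and `d` are placeholders.

```r
# Made-up data with a few missing values
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, 8, 10),
  d = c(5, 3, NA, 1, 0)
)

# Default: correlations between columns where one has missing values come back as NA
cor(toy)

# complete.obs: drop every row that has a missing value, then compute correlations
cor(toy, use = "complete.obs")
```

Note that `complete.obs` drops a row if any of the selected columns is missing, so every correlation in the table is computed on the same set of rows.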

In [None]:
# Recall (from lecture) that the correlation coefficient is not the same as the regression 
# coefficient, although they are related. (If you've taken ECON 140/141, you'll know why. 
# If not, don't worry about it!) For example,

m <- lm(full_time ~ part_time, data = mw_no_shore)

summary(m)
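
With a single regressor, the two quantities are linked by a simple identity: the regression slope equals the correlation coefficient multiplied by sd(y) / sd(x). The sketch below checks this on simulated data (the `toy` data frame is made up for illustration).

```r
# Simulated data, for illustration only
set.seed(130)
toy <- data.frame(x = rnorm(100, mean = 10, sd = 2))
toy$y <- 1 + 0.5 * toy$x + rnorm(100)

r     <- cor(toy$x, toy$y)
slope <- unname(coef(lm(y ~ x, data = toy))["x"])

# The slope is the correlation rescaled by the ratio of standard deviations
all.equal(slope, r * sd(toy$y) / sd(toy$x))
```

So the two always share a sign, but the slope depends on the units of x and y while the correlation does not.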

## Diff-in-Diff Table

For a diff-in-diff, we need to calculate four means of full-time employment: a "pre" mean and a "post" mean for each of NJ and PA.

Let's make things more intuitive by generating some new variables "treated" and "post."

In [None]:
# Treated = 1 if NJ = 1 and 0 otherwise
mw_no_shore$treated <- 0
mw_no_shore$treated[mw_no_shore$nj == 1] <- 1 

# Post = 1 if interview = 2 and 0 otherwise
mw_no_shore$post <- 0
mw_no_shore$post[mw_no_shore$interview == 2] <- 1

# We'll use the print command to help us organize our output

print("Pre; NJ then PA")
summary(mw_no_shore$full_time[mw_no_shore$nj == 1 & mw_no_shore$post == 0])
summary(mw_no_shore$full_time[mw_no_shore$nj == 0 & mw_no_shore$post == 0])

print("Post; NJ then PA")
summary(mw_no_shore$full_time[mw_no_shore$nj == 1 & mw_no_shore$post == 1])
summary(mw_no_shore$full_time[mw_no_shore$nj == 0 & mw_no_shore$post == 1])
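
The four means can also be arranged in a 2x2 table with `tapply`, which makes the diff-in-diff arithmetic explicit. The numbers below are made up for illustration and are not from the minimum-wage data.

```r
# Made-up numbers, for illustration only
toy <- data.frame(
  treated   = c(1, 1, 1, 1, 0, 0, 0, 0),
  post      = c(0, 0, 1, 1, 0, 0, 1, 1),
  full_time = c(6, 8, 9, 11, 10, 12, 8, 10)
)

# Rows are treated (0/1), columns are post (0/1)
means <- tapply(toy$full_time, list(treated = toy$treated, post = toy$post),
                mean, na.rm = TRUE)
means

# (treated post - treated pre) - (control post - control pre)
did <- (means["1", "1"] - means["1", "0"]) - (means["0", "1"] - means["0", "0"])
did
```

The `na.rm = TRUE` argument matters on real data, where some stores have missing employment values.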

## Diff-in-Diff Regression

In [None]:
# Now let's see how we can get the differences with regressions

# Here's a naive regression: a single difference (NJ pre vs. NJ post)
nj_single <- lm(full_time ~ post, data = mw_no_shore[which(mw_no_shore$nj==1),])
summary(nj_single)

# Which difference in means does this correspond to? Is it causal?

In [None]:
# What about this version?
post_single <- lm(full_time ~ treated, data = mw_no_shore[which(mw_no_shore$post==1),])
summary(post_single)

In [None]:
# Now let's do the diff-in-diff

mw_no_shore$treatedxpost <- mw_no_shore$treated * mw_no_shore$post

diff_in_diff <- lm(full_time ~ treated + post + treatedxpost, data = mw_no_shore)
summary(diff_in_diff)

# Does this look like your table? What is the advantage of doing things this way
# vs. in a table? Are there disadvantages?
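
When the regression includes only the two indicators and their interaction (no other controls), the coefficient on the interaction equals the diff-in-diff computed from the four group means. The sketch below checks this on made-up numbers; it also uses the formula shorthand `treated * post`, which R expands to both main effects plus the interaction.

```r
# Made-up numbers, for illustration only
toy <- data.frame(
  treated   = c(1, 1, 1, 1, 0, 0, 0, 0),
  post      = c(0, 0, 1, 1, 0, 0, 1, 1),
  full_time = c(6, 8, 9, 11, 10, 12, 8, 10)
)

# treated * post in a formula expands to treated + post + treated:post
fit <- lm(full_time ~ treated * post, data = toy)
coef(fit)

# Diff-in-diff computed directly from the four group means
m <- with(toy, tapply(full_time, list(treated, post), mean))
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

all.equal(unname(coef(fit)["treated:post"]), did_means)
```

Unlike the table, the regression also reports a standard error for the estimate, and it extends naturally when more controls are added.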

In [None]:
# How do we feel about this specification?
# Is there anything else you want to control for? Do we have these variables?

# I want to control for chain. Here's a nice way to do it without manually generating
# a lot of variables:

w_chain <- lm(full_time ~ treated + post + treatedxpost + factor(chain), data = mw_no_shore)
summary(w_chain)

# This set of indicator variables for chain is often referred to as "factor variables," which is
# where the R command gets its name. I figured out how to do this by Googling "R ols factor variables"
# and reading what was the first result for me (from the UCLA stats department)
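
To see the indicator variables that `factor()` creates behind the scenes, `model.matrix` can print the regressor columns that `lm` would build from a formula. The `chain` codes below are made up for illustration.

```r
# Made-up chain codes, for illustration only
toy <- data.frame(chain = c(1, 2, 3, 4, 2, 1))

# model.matrix() shows the columns lm() would build from factor(chain)
X <- model.matrix(~ factor(chain), data = toy)
X
colnames(X)
```

One category (here the first) has no column of its own: it is the omitted baseline absorbed by the intercept, so each chain coefficient is read relative to that baseline.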

In [None]:
# What other outcome variables are you interested in that might be related
# to economic hypotheses about raising the minimum wage? Do we have data to test this?