# Coding Section 2
## Econ 130
GSIs: Richard Calvo and Julia Paris

### Goals for today
* Do some data analysis, building up to a diff-in-diff
* We will start with what we did in the previous section, but we will not assume any knowledge other than what we covered.


Oftentimes, user-written open-source packages are needed for specific functionality in R (e.g., nice graphics). We need to install these packages manually (once) and load them at the beginning of every script. The packages have been pre-installed in the Jupyter notebooks.

*If you are wondering why a command you've used before is no longer working, it may be because you haven't loaded the package.*
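
As a rough sketch of the install-once, load-every-time pattern, assuming the `ggplot2` package used later in this notebook: the `requireNamespace()` check and the CRAN mirror URL are our choices for illustration, not something the course setup requires.

```r
# Install ggplot2 only if it is not already available (needed once per machine).
# The repos URL is an assumption: any CRAN mirror would work.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Load the package (needed at the top of every script or notebook)
library(ggplot2)
```

On the course notebooks the `if` branch should never run, since the packages are already installed; only the `library()` call is needed.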

In [None]:
# Clear any existing output and data stored in memory
rm(list = ls())

# Read in the data that we constructed last time
mw_no_shore <- read.csv('../Section 9/minwage_no_shore.csv')
head(mw_no_shore)

In [None]:
# Let's try to visualize the relationship between employment (full_time)
# and wages (wage_st, the starting wage).

library(ggplot2)

# First, we'll try a scatterplot
basic_plot <- ggplot(data = mw_no_shore, aes(x = wage_st, y = full_time)) +
  geom_point()

basic_plot

In [None]:
# Now, we want to visualize the relationship between employment
# and wages in the pre-period.

# Create a new dataframe called "pre_data" which contains only data from the pre-period
# (hint: pre data was collected in the first interview (interview==1))

pre_data <- mw_no_shore[mw_no_shore$interview == 1, ]

# Plot a scatterplot of employment against wages using 
# observations from "pre_data"

pre_plot <- ggplot(data = pre_data, aes(x = wage_st, y = full_time)) +
  geom_point()

pre_plot

In [None]:
# The scatterplot is hard to interpret! 
# Let's add a line of best fit

pre_plot + geom_smooth(method = "lm", se = FALSE)

## Correlations vs Regressions

In [None]:
# What are the pairwise correlations among full-time employees, part-time employees, and wages?
# How do we read this table?
# Note: the use = "complete.obs" argument is necessary in order to tell R to ignore missing data.
# Otherwise it will return a lot of NA's (you can try it if you want!).

cor(mw_no_shore[, c('full_time','part_time','wage_st')], use = "complete.obs")

In [None]:
# Recall (from lecture) that the correlation coefficient is not the same as the regression 
# coefficient, although they are related. (If you've taken ECON 140/141, you'll know why. 
# If not, don't worry about it!) For example,

model <- lm(full_time~part_time, data = mw_no_shore)

summary(model)
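
To make the link concrete: in a regression with a single regressor, the slope equals the correlation coefficient rescaled by the ratio of standard deviations, `cor(x, y) * sd(y) / sd(x)`. The sketch below checks this on a small toy dataset that is made up purely for illustration (it is not the minimum wage data).

```r
# Toy data, invented purely for illustration (not from the minimum wage dataset)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# Slope from a simple regression of y on x
slope <- coef(lm(y ~ x))[["x"]]

# Correlation rescaled by the ratio of standard deviations
rescaled_cor <- cor(x, y) * sd(y) / sd(x)

# Compare the two numbers, allowing for floating-point error
all.equal(slope, rescaled_cor)
```

The two only coincide exactly when `x` and `y` have the same standard deviation, which is why the correlation table and the regression output generally show different numbers.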

In [None]:
# Now try regressing the number of full-time employees on the starting wage 
# (so y=full_time, x=wage_st)

model <- lm(full_time~wage_st, data = mw_no_shore)

# Print the results using summary()
summary(model)

## Diff-in-Diff Table

For a diff-in-diff, we need to calculate four means: two "pre" means (one each for NJ and for PA) and two "post" means for full-time employment.
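
Written out, the four means combine as follows, where $\bar{Y}$ denotes mean full-time employment in a given state and period (this notation is ours and may differ from the lecture slides):

$$
\hat{\delta}_{DD} = \left(\bar{Y}_{NJ,\,post} - \bar{Y}_{NJ,\,pre}\right) - \left(\bar{Y}_{PA,\,post} - \bar{Y}_{PA,\,pre}\right)
$$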

Let's make things more intuitive by generating some new variables "treated" and "post."

In [None]:
# Create an indicator for "treated" which is equal to one if a store is in NJ,
# and 0 otherwise

mw_no_shore$treated <- ifelse(mw_no_shore$nj == 1, 1, 0)

# Create an indicator for "post" which is equal to one if an observation is
# in the post period, and 0 otherwise
# Post = 1 if interview == 2 and 0 otherwise

mw_no_shore$post <- ifelse(mw_no_shore$interview == 2, 1, 0)

# Generate the four means needed for a difference-in-difference estimator
# Remember the mean for full_time employment for the entire dataset can be written as:
mean(mw_no_shore$full_time, na.rm = TRUE)

# We'll use the print command to help us organize our output
print("Pre; NJ then PA")

mean(mw_no_shore$full_time[mw_no_shore$post == 0 & mw_no_shore$treated == 1], na.rm = TRUE)
mean(mw_no_shore$full_time[mw_no_shore$post == 0 & mw_no_shore$treated == 0], na.rm = TRUE)

print("Post; NJ then PA")

mean(mw_no_shore$full_time[mw_no_shore$post == 1 & mw_no_shore$treated == 1], na.rm = TRUE)
mean(mw_no_shore$full_time[mw_no_shore$post == 1 & mw_no_shore$treated == 0], na.rm = TRUE)


Now we can complete the table that we started last week:
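
As a rough sketch, the same 2x2 layout can also be assembled directly in R with `tapply()`. The data frame `toy` below is made up for illustration only; with the real data you would pass the `mw_no_shore` columns instead.

```r
# Toy data, invented for illustration only (not the minimum wage dataset)
toy <- data.frame(
  treated   = c(1, 1, 1, 1, 0, 0, 0, 0),
  post      = c(0, 0, 1, 1, 0, 0, 1, 1),
  full_time = c(8, 10, 9, 11, 12, 14, 10, 12)
)

# 2x2 table of means: rows are treated (0 = PA, 1 = NJ), columns are post (0 = pre, 1 = post)
means <- tapply(toy$full_time, list(treated = toy$treated, post = toy$post),
                mean, na.rm = TRUE)
means

# Diff-in-diff: (NJ post - NJ pre) - (PA post - PA pre)
did <- (means["1", "1"] - means["1", "0"]) - (means["0", "1"] - means["0", "0"])
did
```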

## Diff-in-Diff Regression

In [None]:
# Now let's see how we can get the differences with regressions

# Here's a naive regression: a single difference (NJ pre vs. NJ post)
nj_single <- lm(full_time ~ post, data = mw_no_shore[mw_no_shore$nj==1,])
summary(nj_single)

# Which difference in means does this correspond to? Is it causal?

In [None]:
# What about this version?
post_single <- lm(full_time ~ treated, data = mw_no_shore[mw_no_shore$post==1,])
summary(post_single)

In [None]:
# Now let's do the diff-in-diff

mw_no_shore$treatedxpost <- mw_no_shore$treated * mw_no_shore$post

diff_in_diff <- lm(full_time ~ treated + post + treatedxpost, data = mw_no_shore)

summary(diff_in_diff)

# How can we relate this to the table from the Diff-in-Diff slides?

# Coding note: we could have gotten the same result by using:
# summary(lm(full_time ~ treated*post, data = mw_no_shore))
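
For reference, the regression above estimates the following equation (the $\beta$ labels are our notation, not necessarily the one used in the slides):

$$
full\_time_i = \beta_0 + \beta_1\, treated_i + \beta_2\, post_i + \beta_3\,(treated_i \times post_i) + \varepsilon_i
$$

Plugging in the four combinations of `treated` and `post` gives the four group means:

$$
\begin{aligned}
\text{PA, pre:}\quad & \beta_0 \\
\text{PA, post:}\quad & \beta_0 + \beta_2 \\
\text{NJ, pre:}\quad & \beta_0 + \beta_1 \\
\text{NJ, post:}\quad & \beta_0 + \beta_1 + \beta_2 + \beta_3
\end{aligned}
$$

so that $(\text{NJ post} - \text{NJ pre}) - (\text{PA post} - \text{PA pre}) = (\beta_2 + \beta_3) - \beta_2 = \beta_3$. The coefficient on `treatedxpost` is the diff-in-diff estimate.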

In [None]:
# How do we feel about this specification?
# Is there anything else you want to control for? Do we have these variables?

# I want to control for chain. Here's a nice way to do it without manually generating
# a lot of variables:

w_chain <- lm(full_time ~ treated + post + treatedxpost + factor(chain), data = mw_no_shore)
summary(w_chain)

# This set of indicator variables for chain is often referred to as "factor variables," which is
# where the R command gets its name. I figured out how to do this by Googling "R ols factor variables"
# and reading the first result (from the UCLA stats department).
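
To see what `factor()` does inside a regression formula, `model.matrix()` shows the columns R builds behind the scenes. This is a minimal sketch on hypothetical chain codes (1, 2, 3) that are made up for illustration, not taken from the dataset.

```r
# Hypothetical chain codes, invented for illustration
toy <- data.frame(chain = c(1, 2, 3, 1, 2, 3))

# model.matrix() shows the columns lm() would build from factor(chain):
# an intercept plus one 0/1 indicator per chain, omitting the first level as the baseline
dummies <- model.matrix(~ factor(chain), data = toy)
dummies
```

Each chain coefficient in the regression output is therefore read relative to the omitted baseline chain.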

In [None]:
# What other outcome variables are you interested in that might be related
# to economic hypotheses about raising the minimum wage? Do we have data to test this?

names(mw_no_shore)

In [None]:
# For example, let's do a diff-in-diff to see the effect of the minimum
# wage on part-time employees

summary(lm(part_time ~ treated*post, data = mw_no_shore))

# Similarly, let's do a diff-in-diff to see the effect of the minimum
# wage on starting wages (like we did in Section 4, but this time more formally)
summary(lm(wage_st ~ treated*post, data = mw_no_shore))

In [None]:
# BONUS: Let's make a plot of the difference-in-difference on full-time employees

# The dashed line is the counterfactual for NJ under parallel trends: it starts at
# NJ's pre-period mean (intercept + treated) and follows PA's change (the post coefficient)
diff_in_diff_plot <- ggplot(data = mw_no_shore, aes(x = post, y = full_time, group = treated)) +
  geom_point(aes(color = as.factor(treated))) +
  geom_smooth(aes(color = as.factor(treated)), method = 'lm', se = FALSE) +
  geom_abline(slope = diff_in_diff$coefficients['post'], intercept = 
                diff_in_diff$coefficients['(Intercept)'] + 
                diff_in_diff$coefficients['treated'], linetype = "dashed")

diff_in_diff_plot

In [None]:
diff_in_diff_plot <- diff_in_diff_plot +
  labs(x = "Period",
       y = "Number of Full-Time Employees") +
  scale_color_manual(name = '',
                     labels = c("Control (PA)", "Treated (NJ)"),
                     values = c('0' = "red", '1' = "dodgerblue")) +
  scale_x_continuous(breaks = c(0,1), labels = c('Pre', 'Post')) +
  geom_abline(slope = diff_in_diff$coefficients['post'], intercept = 
                diff_in_diff$coefficients['(Intercept)'] + 
                diff_in_diff$coefficients['treated'], linetype = "dashed", color = "dodgerblue") +
  theme_minimal()

diff_in_diff_plot

In [None]:
# BONUS: How could we make a similar plot for wages, completing our work from Section 4?