# EEP/IAS 118 - Problem Set 5

## Due __Friday, December 3__ at 11:59PM. 

Submit materials as one combined pdf on __Gradescope__. All work can be completed in this notebook. Make sure to run (`shift` + `enter`) all your answer cells before submission to make sure all your output is displayed. After exporting your file to PDF, make sure that your output cells are not being cut off so we can read all of your code and results. If your output is getting cut off, try different ways of generating a PDF (File->Download as->PDF via HTML; go to print in your browser and save as PDF; whatever has worked for you or your peers in the past).


# Exercise 1. The Effect of Broadband Internet on Employment - Difference-in-Differences

## Background


In this exercise, we are going to look at a paper studying the effects of access to broadband internet on employment outcomes (note that we are only giving you a subset of the data, so your results are not going to match the results in the paper). This paper answers a very policy-relevant question: does improving broadband access and affordability increase opportunities on the labor market for low-income households? There are large disparities in the US in adoption and access to home internet, most of it explained by income. Affordability is often cited as the main barrier to at-home internet, and lack of internet access at home may prevent low-income job seekers from accessing new job opportunities. According to the author, job seekers without broadband are 21 percent less likely to use online resources for job search and face other obstacles to employment that modern online tools may be suited to address. 

Zuo (2021) tests whether this is true using a 2012 policy change, under which high-speed internet was offered at a very affordable price, but only to eligible households (families whose children receive free or reduced-price lunch) in areas covered by Comcast. Zuo uses information on the specific internet provider, eligibility, and labor market outcomes to study this question (he actually implements a difference-in-difference-in-differences strategy, which is similar to difference-in-differences but beyond the scope of this class). 
Pre-policy change observations are coded as 2011 and post-policy change observations are coded as 2012 in your data for the sake of simplicity. This data is then used to obtain a difference-in-differences estimate of the effect of broadband internet on employment. 

The dataset is saved as `ps5_wired_data.csv` and contains the following variables:
    

|    Variable Name     | Description                       | 
|----------------------|-----------------------------------|
| $hhid $      | Household ID      |
| $year    $ | Year    |
| $comcast   $  | Dummy =1 if household is connected to Comcast for internet, =0 otherwise     |
| $employed   $   | Dummy =1 if household's respondent is employed |
| $unemployed   $   | Dummy =1 if household's respondent is unemployed |
| $met2013 $     | Metropolitan area  |
| $earnings95 $    | Yearly earnings  |

## Question 1.1.

### Load the data. Generate a summary table with two columns and two rows. There should be two columns:  one for households in Comcast areas (Treatment column) and one for households in areas with another provider (Control column) and two rows: one for the pre-period (year 2011), one for the post-period (year 2012). Within each cell, compute the mean percentage of respondents who are employed (the variable $employed$).
*Hint: Remember that you can subset data by writing, for example `data[data$var1==0 & data$var2==7,]$var3` to select values for var3 for observations in data that meet the given criteria for var1 and var2.*

*Hint: The command `cbind` may be helpful for constructing your table. Remember that you can create a vector of values using `c()`*

*Hint: Consider loading necessary packages for the rest of the assignment here as well. It is good practice to load all necessary packages at the beginning of your code. Think about what packages we have needed previously in this class. You will also need the `lfe` package.*

In [None]:
# code here
library(tidyverse)
library(haven)
library(lfe)

data <- read_dta('ps5_wired_data.dta')
controlled_2011 <- mean(data[data$comcast==0&data$year==2011,]$employed, na.rm=T)
controlled_2012 <- mean(data[data$comcast==0&data$year==2012,]$employed, na.rm=T)
treatment_2011 <- mean(data[data$comcast==1&data$year==2011,]$employed, na.rm=T)
treatment_2012 <- mean(data[data$comcast==1&data$year==2012,]$employed, na.rm=T)

In [None]:
# code here
controlled_data <- c(controlled_2011,controlled_2012)
treatment_data <- c(treatment_2011,treatment_2012)
time<-c("2011","2012")
cbind(time,controlled_data,treatment_data)

## Question 1.2.

### State the difference-in-differences estimator for the change in employment in terms of the following quantities $\bar Y_{Comcast, pre}, \bar Y_{Comcast, post}, \bar Y_{Control, pre}, \bar Y_{Control, post}$, where $\bar Y$ refers to the mean of $employed$ (writing a formula in R code is ok). Using the means reported in part 1, calculate a value for the estimator you just proposed.

In [9]:
# code here
d_in_d_estimator <- (treatment_2012 - treatment_2011) - (controlled_2012 - controlled_2011)
d_in_d_estimator

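The difference-in-differences estimator is $(\bar Y_{Comcast, post} - \bar Y_{Comcast, pre}) - (\bar Y_{Control, post} - \bar Y_{Control, pre})$, which evaluates to 0.004367.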
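
As a minimal sketch of the table-then-difference logic, the block below builds the 2x2 table of means with `tapply` and computes the estimator from its cells. The `toy` data frame and its values are made up for illustration and are not the problem set sample.

```r
# Hypothetical toy data: 8 households, NOT the ps5_wired_data sample
toy <- data.frame(
  year     = c(2011, 2011, 2012, 2012, 2011, 2011, 2012, 2012),
  comcast  = c(0, 0, 0, 0, 1, 1, 1, 1),
  employed = c(1, 0, 1, 0, 1, 0, 1, 1)
)

# 2x2 table of mean employment: rows = year, columns = comcast (0 = control, 1 = treatment)
means <- tapply(toy$employed, list(year = toy$year, comcast = toy$comcast), mean)
print(means)

# Difference-in-differences: (treated post - treated pre) - (control post - control pre)
did <- (means["2012", "1"] - means["2011", "1"]) -
       (means["2012", "0"] - means["2011", "0"])
print(did)  # 0.5 for this toy data
```

Unlike `cbind` with a character column of years, `tapply` keeps the table numeric, so the cells can be used directly in further arithmetic.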

## Question 1.3.

### Let's proceed with estimating the difference-in-differences estimator via a regression:

#### (a)  Write an equation that will give you the difference-in-differences estimator for the impact of the subsidized internet on employment. State which coefficient gives the estimated treatment effect of this policy.

We can write a difference-in-difference estimator in regression form as

$$ y_{it} = \beta_ 0+ \beta_1 Post_t + \beta_2 Treatment_i + \beta_3 Post_t \times Treatment_i + u_{it}$$

In this context,

$$ Employed_{it} = \beta_ 0+ \beta_1 2012_t + \beta_2 Comcast_i + \beta_3 2012_t \times Comcast_i + u_{it}$$

And $\hat\beta_3$ tells us the estimated treatment effect for households in Comcast areas in the period following the policy change.

#### (b) Perform the estimation. 
*Hint: You will need to create a 'post' dummy variable from the 'year' variable to run this regression. Note that 'comcast' is already a dummy variable.*

In [5]:
# code here
data <- mutate(data, post = if_else(year== 2012,1,0))
reg1 <- lm(employed ~ post + comcast + post:comcast, data = data)
summary(reg1)


Call:
lm(formula = employed ~ post + comcast + post:comcast, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7896 -0.5750  0.2104  0.4227  0.4250 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5772583  0.0009708 594.635   <2e-16 ***
post         -0.0022184  0.0012762  -1.738   0.0822 .  
comcast       0.2102288  0.0016751 125.503   <2e-16 ***
post:comcast  0.0043673  0.0022008   1.984   0.0472 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.467 on 827749 degrees of freedom
Multiple R-squared:  0.04428,	Adjusted R-squared:  0.04428 
F-statistic: 1.278e+04 on 3 and 827749 DF,  p-value: < 2.2e-16


#### (c) What do you conclude from the results of your estimation (about the differences between households connected to Comcast and other households, and the effect of the policy change)? Confirm that the results in this part are the same as your estimate in Question 1.2.

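$\hat\beta_3 = 0.0044$ is the estimated treatment effect for households in Comcast areas in the period following the policy change: employment rose by about 0.44 percentage points more in Comcast areas than in other areas, and the estimate is statistically significant at the 5% level (p-value of 0.047). The coefficient on comcast (0.210) shows that households connected to Comcast were already about 21 percentage points more likely to be employed than other households before the policy change. The regression estimate matches the value of 0.004367 computed in Question 1.2.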
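
The equality between the regression coefficient and the difference of means is not a coincidence: with two groups and two periods the model is saturated. The following sketch checks this on a small invented data set (the `toy` data and the helper `cell_mean` are hypothetical, introduced only for this illustration).

```r
# Hypothetical toy data (not the problem set sample)
toy <- data.frame(
  post     = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 1),
  comcast  = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  employed = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 1)
)

# Helper: mean employment in one (period, group) cell
cell_mean <- function(p, tr) mean(toy$employed[toy$post == p & toy$comcast == tr])

# Difference-in-differences from the four cell means
did_means <- (cell_mean(1, 1) - cell_mean(0, 1)) -
             (cell_mean(1, 0) - cell_mean(0, 0))

# Same quantity from the interaction coefficient of the saturated regression
fit <- lm(employed ~ post + comcast + post:comcast, data = toy)
did_reg <- unname(coef(fit)["post:comcast"])

print(c(means = did_means, regression = did_reg))
```

The regression has the practical advantage of also reporting a standard error, which the table of means alone does not provide.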

## Question 1.4.

### In this question, we will explore the identifying assumptions for the difference-in-differences estimator.

#### (a) What key assumption do you need to make for your regression in part 1.3 to estimate the causal effect of the subsidized internet policy?


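We are assuming that there is no omitted factor that changes employment differently in Comcast and non-Comcast areas over time. In other words, without the policy, employment in the two groups would have followed the same trend (the parallel trends assumption). Under that assumption, any difference in the change in employment is due to the policy, that is, to internet being made affordable for households that did not have it before.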

#### (b) What additional data might you need to provide evidence for this assumption?

We would need data for more years, especially years before the policy change. With several pre-policy years we could check whether employment in Comcast and non-Comcast areas followed the same trend before 2012, which would support the assumption. With only 2011 and 2012, we cannot tell whether a change in employment is due to internet access or to something else happening at the same time: the economy was still recovering from a large recession during this period, which would raise employment regardless of whether people now have internet access. With a longer sample the trends would be much easier to see and support.
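
As a rough sketch of what such a check could look like, the block below computes the treated-minus-control employment gap by year and how it changes from one year to the next. The employment rates and the extra years are invented for illustration; the problem set data only contain 2011 and 2012.

```r
# Hypothetical employment rates by group and year (invented numbers)
trend <- data.frame(
  year    = rep(2008:2012, times = 2),
  comcast = rep(c(0, 1), each = 5),
  emp     = c(0.60, 0.58, 0.57, 0.58, 0.58,   # control areas
              0.80, 0.78, 0.77, 0.78, 0.79)   # Comcast areas
)

# Treated-minus-control gap in each year (rows are ordered by year within each group)
gap <- with(trend, emp[comcast == 1] - emp[comcast == 0])
names(gap) <- 2008:2012

# Year-to-year change in the gap: values near zero before 2012 are
# consistent with parallel pre-trends
gap_change <- diff(gap)

print(round(gap, 3))
print(round(gap_change, 3))
```

In this made-up example the gap is flat through 2011 and only moves in 2012, which is the pattern one would hope to see in real pre-policy data.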

## Question 1.5.

### Let's say that we wanted to estimate the effect of broadband access on employment ($employed$), but we only had data from households who are in Comcast areas. Using only data for areas with Comcast coverage, estimate and interpret the effect of subsidized internet on employment. Interpret your result, including testing for significance. 
*Hint: Save the subset of data for households in areas covered by Comcast as a new dataset, and run your regression on that dataset.*

In [6]:
# code here
comcast_area <- data %>% filter(comcast == 1)
reg2 <- lm(employed ~ post, data = comcast_area)
summary(reg2)


Call:
lm(formula = employed ~ post, data = comcast_area)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7896  0.2104  0.2104  0.2125  0.2125 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.787487   0.001193 659.998   <2e-16 ***
post        0.002149   0.001567   1.371     0.17    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4082 on 278489 degrees of freedom
Multiple R-squared:  6.752e-06,	Adjusted R-squared:  3.162e-06 
F-statistic:  1.88 on 1 and 278489 DF,  p-value: 0.1703


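Here $\hat\beta_1$, the coefficient on post, is 0.0021: employment in Comcast areas was about 0.2 percentage points higher in 2012 than in 2011. With a p-value of 0.17, it is not statistically significant at conventional levels, so using only Comcast households we cannot reject that the subsidized internet had no effect on employment. For comparison, the coefficient on post in the full-sample regression, which measures the change for households with other providers, is -0.0022.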

## Question 1.6. 

###  In no more than 3 sentences, compare your result from Question 1.3 to your result from Question 1.5. What are some factors that could explain the difference between the two results, and which estimator would be preferable?

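The difference-in-differences estimate from reg1 (0.0044, p-value of 0.047) is statistically significant, whereas the before-after estimate from reg2 (0.0021, p-value of 0.17) is not. The before-after comparison suffers from omitted variable bias because it attributes every change between 2011 and 2012 to the policy, including the common change in employment that reg1 picks up in the post coefficient for control areas (-0.0022). The difference-in-differences estimator is preferable because it uses the control group to net out this common time trend.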
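
The link between the two estimates can be checked by simple arithmetic on the coefficients reported above: the before-after change in Comcast areas equals the control-area change plus the difference-in-differences term. The variable names below are only labels chosen for this sketch.

```r
# Coefficients copied from the regression output reported above
post_control <- -0.0022184   # reg1: coefficient on post (change in control areas)
did          <-  0.0043673   # reg1: coefficient on post:comcast
before_after <-  0.002149    # reg2: coefficient on post (Comcast areas only)

# In the saturated model, the change within Comcast areas is post + post:comcast
implied_before_after <- post_control + did
print(implied_before_after)

# The gap between the two estimators is the control-area time trend
print(before_after - did)
```

Up to rounding in the printed output, the implied value matches the reg2 coefficient, so the before-after estimate is the difference-in-differences estimate contaminated by the common time trend.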

## Question 1.7


### Consider each of the following statements (that are not necessarily true) and discuss whether it supports, violates, or is irrelevant to assumptions necessary for the DD estimator to provide a valid causal effect in this case. If it violates the assumptions, discuss how it might bias the results.

1. Comcast has long been considered to be the preferred internet provider for denser areas, which are on average richer and younger, but which also experienced a boom in the labor market in the early 2010s.
2. The dot com crash has deeply affected the bottom line of all US-based internet providers, and led to a profound restructuring of the internet provider market lasting over the next 10 years.
3. In 2012, Comcast and Sonic were condemned by the FTC for violating customer privacy, reducing demand for internet products nationwide.
4. Between 2010 and 2012, rural areas, where Comcast has less coverage, happened to experience more natural disasters such as hurricanes and wildfires, and had more internet outages than average. 

1. This would violate the assumptions. Permanent differences between Comcast areas and other areas (richer, younger households) are not a problem by themselves, because the comcast dummy absorbs them. However, a labor market boom in the early 2010s that is concentrated in Comcast areas means employment there would have grown faster even without the policy, for reasons unrelated to internet access, so the trends are not parallel and the DD estimate would be biased upward.

2. This would violate the assumptions if the dot com crash and the restructuring that followed affected employment trends differently in Comcast and non-Comcast areas. The crash was followed by a recession in which employment dipped, so the state of the economy and the labor market, rather than internet access, could be driving employment. To the extent that the shock hit all providers and areas equally, it is absorbed by the post dummy and does not bias the estimate.

3. This would be irrelevant to us. Households that value their privacy might choose not to subscribe to Comcast or Sonic, but the drop in demand is nationwide, so it affects Comcast and non-Comcast areas alike and has nothing to do with determining the causal effect on employment rates.

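4. This would violate the assumptions. Natural disasters and internet outages between 2010 and 2012 hit rural areas, which are mostly in the control group, so employment there may be low for reasons that have nothing to do with Comcast's subsidized internet. If these shocks changed the control group's employment trend around the policy change, the trends are no longer parallel and the DD estimate is biased.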

# Exercise 2:  Lead testing and Graduation - Panel Regression

## Background


In this exercise, we will look at the effect of improving water pipes to prevent lead exposure in schools on students' school outcomes. Many school districts in California, particularly less wealthy school districts, have school infrastructure that is many decades old. These schools were built at a time when lead standards might not have been as stringent as they are now, exposing the students to potentially high levels of lead in walls or in drinking water. In 2017, the state of California started an initiative to test lead levels in all public K-12 schools in the state, and to help school districts replace their water pipes. We have data for these replacements for the years 2017-2019, with the number of replacements per year more or less increasing over the sample period. This data is combined with attendance data from all school districts in California over the same period to test the impact of reducing lead exposure through infrastructure upgrading on student health. Attendance is used to measure student health because students who are chronically ill are often absent from school. The full dataset is described in detail below.

The dataset `Schools_PS5_2022.csv` is an unbalanced panel of 200 school districts for the years 2017-2020, and contains the following variables:
 
 

|    Variable Name     | Description                       | 
|----------------------|-----------------------------------|
| $district\_code $      | Unique School District Identifier    |
| $year    $ | Year    |
| $lead\_replace   $  | Number of Pipes Replaced   |
| $attendance  $  | Percent of students in attendance, on average in the year   |
| $aptrack  $   | Number of students in Advanced Learning Track   |
| $latinx $    | Number of Latinx Students  |
| $college $    | Number of Students with Parents that Attended College   |
| $advtgd $     | Number of Students from Higher Socio-Economic Backgrounds    |
| $fleet\_size $    | Average lead content in the school district? in the District Fleet   |
| $pupils\_trans  $    | Average Number of Students Traveling per Day   |
| $enrollment$     | Number of Students Enrolled in the District |
 
Some summary statistics are provided below:

In [19]:
schooldata <- read.csv("Schools_PS5_2022.csv")
head(schooldata)

# Summary Stats
piperep <- summarize(schooldata, mean = mean(lead_replace),
             sd= sd(lead_replace),
             min= min(lead_replace),
             max = max(lead_replace))
enroll <- summarize(schooldata, mean = mean(enrollment),
             sd = sd(enrollment),
             min = min(enrollment),
             max = max(enrollment))
pupils <- summarize(schooldata, mean = mean(pupils_trans, na.rm = TRUE),
             sd = sd(pupils_trans, na.rm = TRUE),
             min = min(pupils_trans, na.rm = TRUE),
             max = max(pupils_trans, na.rm = TRUE))

ss <- rbind(piperep, enroll, pupils)
sumstats <- cbind(c("Pipe Replacements", "Enrollment", "# of Students Commuting Daily"), ss)
names(sumstats)[1] <- "Variable"

print('Sum Stats')
sumstats

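|   | district_code | year | lead_replace | aptrack | latinx | college | advtgd | pupils_trans | enrollment | attendance |
|---|---------------|------|--------------|---------|--------|---------|--------|--------------|------------|------------|
|   | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` |
| 1 | 261333 | 2020 | 11.750881 | 0 | 31 | 6 | 25 | 63.5 | 97 | 94.07732 |
| 2 | 461382 | 2018 | 0.0 | 17 | 87 | 22 | 36 | 31.0 | 125 | 93.432 |
| 3 | 461382 | 2019 | 2.060723 | 14 | 71 | 21 | 28 | 62.0 | 126 | 92.92857 |
| 4 | 461382 | 2020 | 3.942311 | 13 | 62 | 19 | 25 | 44.5 | 126 | 96.20634 |
| 5 | 461408 | 2020 | 3.942311 | 18 | 202 | 37 | 134 | 70.0 | 534 | 94.89887 |
| 6 | 461424 | 2018 | 0.0 | 776 | 6081 | 1976 | 5670 | 694.0 | 12364 | 92.08743 |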


[1] "Sum Stats"


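| Variable | mean | sd | min | max |
|----------|------|----|-----|-----|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Pipe Replacements | 2.351899 | 3.315411 | 0 | 12.92825 |
| Enrollment | 1006.066536 | 1562.073757 | 7 | 12931.0 |
| # of Students Commuting Daily | 276.635616 | 393.571615 | 0 | 2988.0 |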


In [20]:
#Confirm number of observations
length(unique(schooldata$district_code))

## Question 2.1.

### You think that it might be important to control for the year in your regression of attendance on pipe replacements.  First, generate year dummy variables $(yr_{2017}, yr_{2018}, yr_{2019}, yr_{2020})$.  Next, estimate the following equation for school attendance.
\begin{align*}
attendance_{it} = \beta_0+ \beta_1 lead\_replace_{it} + \beta_2latinx_{it} &+ \beta_3college_{it} + \beta_4advtgd_{it} + \beta_5aptrack_{it} \ \ \ \ \ \ (1) \\
&+ \delta_1yr_{2018} + \delta_2yr_{2019} + \delta_3yr_{2020} + u_{it}    
\end{align*}

#### (a) Estimate the model and report your results.

In [21]:
library(haven) 
library(tidyverse)

In [22]:
schooldata <-mutate(schooldata,yr_2017  = as.numeric((year == 2017)))
schooldata <-mutate(schooldata,yr_2018  = as.numeric((year == 2018)))
schooldata <-mutate(schooldata,yr_2019  = as.numeric((year == 2019)))
schooldata <-mutate(schooldata,yr_2020  = as.numeric((year == 2020)))
head(schooldata)

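|   | district_code | year | lead_replace | aptrack | latinx | college | advtgd | pupils_trans | enrollment | attendance | yr_2017 | yr_2018 | yr_2019 | yr_2020 |
|---|---------------|------|--------------|---------|--------|---------|--------|--------------|------------|------------|---------|---------|---------|---------|
|   | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 261333 | 2020 | 11.750881 | 0 | 31 | 6 | 25 | 63.5 | 97 | 94.07732 | 0 | 0 | 0 | 1 |
| 2 | 461382 | 2018 | 0.0 | 17 | 87 | 22 | 36 | 31.0 | 125 | 93.432 | 0 | 1 | 0 | 0 |
| 3 | 461382 | 2019 | 2.060723 | 14 | 71 | 21 | 28 | 62.0 | 126 | 92.92857 | 0 | 0 | 1 | 0 |
| 4 | 461382 | 2020 | 3.942311 | 13 | 62 | 19 | 25 | 44.5 | 126 | 96.20634 | 0 | 0 | 0 | 1 |
| 5 | 461408 | 2020 | 3.942311 | 18 | 202 | 37 | 134 | 70.0 | 534 | 94.89887 | 0 | 0 | 0 | 1 |
| 6 | 461424 | 2018 | 0.0 | 776 | 6081 | 1976 | 5670 | 694.0 | 12364 | 92.08743 | 0 | 1 | 0 | 0 |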


In [25]:
reg1a <- lm(attendance ~ lead_replace + latinx + college + advtgd+ aptrack + yr_2018 + yr_2019 + yr_2020, data = schooldata)
summary(reg1a)


Call:
lm(formula = attendance ~ lead_replace + latinx + college + advtgd + 
    aptrack + yr_2018 + yr_2019 + yr_2020, data = schooldata)

Residuals:
    Min      1Q  Median      3Q     Max 
-87.605  -0.505   3.148   5.806  19.207 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  90.8109466  1.2949151  70.129   <2e-16 ***
lead_replace  0.1527748  0.3502172   0.436   0.6629    
latinx       -0.0049153  0.0023530  -2.089   0.0372 *  
college       0.0008201  0.0075426   0.109   0.9135    
advtgd        0.0018550  0.0038718   0.479   0.6321    
aptrack       0.0118125  0.0098848   1.195   0.2326    
yr_2018      -1.2216789  1.7158642  -0.712   0.4768    
yr_2019       0.0238259  1.8417807   0.013   0.9897    
yr_2020       2.8612529  3.0550689   0.937   0.3494    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.65 on 502 degrees of freedom
Multiple R-squared:  0.034,	Adjusted R-squared:  0.01861 
F-statist

#### (b) Give the meaning (economic interpretation) of $\beta_0$ and $\delta_{1}$

$\beta_0$ is the expected attendance rate in 2017 (the omitted base year) for a district with no pipe replacements and zero values of latinx, college, advtgd, and aptrack; the estimate is about 90.8 percent. $\delta_1$ is the difference in expected attendance between 2018 and 2017, holding lead_replace and the other variables constant; the estimate is about -1.22 percentage points and is not statistically significant (p-value of 0.477). The other year coefficients are read the same way, each relative to 2017.

#### (c) Interpret $\hat \beta_1$.  Be sure to mention sign, size and significance and what is being held constant.

$\hat\beta_1 = 0.153$: one additional lead pipe replacement is associated with a 0.15 percentage point increase in school attendance, holding all else constant. The sign is positive, but the effect is small and not statistically significant, with a p-value of 0.663. The variables being held constant are latinx, college (Number of Students with Parents that Attended College), advtgd (Number of Students from Higher Socio-Economic Backgrounds), aptrack (Number of students in Advanced Learning Track), and the year dummies.
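
The significance statement can be reproduced by hand from the estimate, standard error, and residual degrees of freedom printed in the regression output. This is only an arithmetic sketch; the variable names are labels chosen here.

```r
# Values copied from the summary(reg1a) output above
estimate  <- 0.1527748   # coefficient on lead_replace
std_error <- 0.3502172   # its standard error
df_resid  <- 502         # residual degrees of freedom

# t statistic and two-sided p-value for H0: beta_1 = 0
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for beta_1
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

print(round(c(t = t_stat, p = p_value), 4))
print(round(ci, 3))
```

The confidence interval includes zero, which is another way of seeing that the estimate is not significant at the 5% level.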

#### (d) Why is the year 2017 dummy excluded?

When I first ran the regression I actually included the 2017 dummy as well, that is, I estimated

\begin{align*}
attendance_{it} = \beta_0+ \beta_1 lead\_replace_{it} + \beta_2latinx_{it} &+ \beta_3college_{it} + \beta_4advtgd_{it} + \beta_5aptrack_{it} \\
&+ \delta_0yr_{2017} + \delta_1yr_{2018} + \delta_2yr_{2019} + \delta_3yr_{2020} + u_{it}    
\end{align*}

With all four year dummies included, R reports a coefficient for 2017 but returns NA for 2020. The four dummies sum to one for every observation, which is exactly the intercept column, so the regressors are perfectly multicollinear (the dummy variable trap) and the coefficients cannot all be estimated separately. Excluding the 2017 dummy makes 2017 the base year, and each $\delta$ then measures the difference in attendance relative to 2017.
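
The dummy variable trap is easy to reproduce on a small example. The sketch below uses an invented data frame (`toy`, with made-up attendance values) and compares the regression with all four year dummies to the one that omits 2017.

```r
# Hypothetical toy data: three observations per year, invented attendance values
toy <- data.frame(
  year       = rep(2017:2020, each = 3),
  attendance = c(90, 92, 91, 89, 90, 88, 93, 94, 92, 95, 96, 97)
)
toy$yr_2017 <- as.numeric(toy$year == 2017)
toy$yr_2018 <- as.numeric(toy$year == 2018)
toy$yr_2019 <- as.numeric(toy$year == 2019)
toy$yr_2020 <- as.numeric(toy$year == 2020)

# All four dummies plus an intercept: the dummies sum to the intercept column,
# so lm() cannot estimate every coefficient and reports NA for one of them
fit_all <- lm(attendance ~ yr_2017 + yr_2018 + yr_2019 + yr_2020, data = toy)
print(coef(fit_all))

# Dropping yr_2017 makes 2017 the base year: the intercept is the 2017 mean and
# each dummy coefficient is the difference from 2017
fit_base <- lm(attendance ~ yr_2018 + yr_2019 + yr_2020, data = toy)
print(coef(fit_base))
```

In the second regression the intercept equals the 2017 mean (91 here) and the dummy coefficients are the year means minus that value.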

## Question 2.2.

### Consider now the following (unobserved) fixed effects model:
\begin{align*}
attendance_{it} =\beta_0+ \beta_1 lead\_replace_{it} + \beta_2latinx_{it} + \beta_3college_{it} + \beta_4advtgd_{it}+ \beta_5aptrack_{it} + \boldsymbol{\delta}_t+\mathbf{a_i} +u_{it} \ \ \ (2)
\end{align*}
#### (a) Why are we adding district fixed effects ($\mathbf{a_{i}}$)? In other words, what do these fixed effects control for in the regression? 

We add $\mathbf{a_i}$ to control for all characteristics of a school district that do not change over time, whether or not we observe them, so that $\hat\beta_1$ is identified from changes in attendance within each district (identified by its unique ID) rather than from differences across districts. Basically, we want these fixed effects in the regression to eliminate or reduce the omitted variable bias that these time-invariant characteristics would create if they were left in the error term.

#### (b)  Estimate the model and interpret $\hat \beta_1$.  Be sure to mention sign, size, and significance and what is being held constant.
*Hint: Use `felm`. Remember to use `as.factor` to turn your fixed effects variables into factors (dummy variables for each category).*

In [65]:
# code here
reg2c <- felm(attendance ~ lead_replace + latinx + college + advtgd + aptrack | as.factor(year) 
              + as.factor(district_code), data = schooldata)
summary(reg2c)


Call:
   felm(formula = attendance ~ lead_replace + latinx + college +      advtgd + aptrack | as.factor(year) + as.factor(district_code),      data = schooldata) 

Residuals:
    Min      1Q  Median      3Q     Max 
-31.332  -1.301   0.000   1.373  28.485 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
lead_replace  0.5361813  0.2348613   2.283   0.0231 *
latinx       -0.0051474  0.0063825  -0.806   0.4206  
college      -0.0159869  0.0141405  -1.131   0.2591  
advtgd        0.0004412  0.0033121   0.133   0.8941  
aptrack       0.0010158  0.0101086   0.100   0.9200  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.679 on 303 degrees of freedom
Multiple R-squared(full model): 0.8991   Adjusted R-squared: 0.8301 
Multiple R-squared(proj model): 0.02517   Adjusted R-squared: -0.6408 
F-statistic(full model):13.04 on 207 and 303 DF, p-value: < 2.2e-16 
F-statistic(proj model): 1.564 on 5 and 303 DF, p-value: 0.1699 



Here $\hat\beta_1$ tells us that each additional lead pipe replacement is associated with a 0.54 percentage point increase in school attendance, holding constant latinx, college, advtgd, aptrack, and the year and district fixed effects. The estimate is positive and statistically significant at the 5% level (p-value of 0.023), so there is evidence that lead pipe replacement increases school attendance.

#### (c) Comment specifically on how the size of $\hat \beta_1$ changes from model (1) to model (2). Describe your intuition for why it changes in the way that it does, and a possible omitted variable that could explain the differences. 

In model (1) $\hat\beta_1$ was 0.153, and in model (2) it increased to 0.536. This tells us that school districts play a key role in determining attendance. My theory is that richer districts already had newer pipes and barely replaced any, whereas poorer districts, which tend to have lower attendance to begin with, replaced many pipes and saw attendance increase as a result. In model (1) we assumed that all districts were otherwise the same, so the lower baseline attendance of the districts that replaced more pipes pulled $\hat\beta_1$ down. The omitted variable is therefore a fixed district characteristic such as district wealth (poor versus rich), or a location where the water is less contaminated.
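
The mechanism can be illustrated with a small constructed panel in which districts with lower baseline attendance replace more pipes. Everything below is hypothetical: the panel, the district effects, and the true slope of 0.5 are invented, and the data contain no noise so that the results are exactly reproducible. The sketch also uses one-way (district only) fixed effects in base R rather than `felm`.

```r
# Hypothetical balanced panel: 4 districts observed in 3 periods
panel <- data.frame(
  district = rep(1:4, each = 3),
  period   = rep(0:2, times = 4)
)

# Unobserved district effect a_i: districts that replace more pipes have lower baseline attendance
district_effect    <- c(6, 2, -2, -6)[panel$district]
panel$lead_replace <- c(0, 1, 2, 3)[panel$district] + panel$period
panel$attendance   <- 90 + 0.5 * panel$lead_replace + district_effect  # true slope = 0.5

# Pooled OLS ignores a_i and is biased downward
pooled <- lm(attendance ~ lead_replace, data = panel)

# District fixed effects via dummies
fe_dummies <- lm(attendance ~ lead_replace + factor(district), data = panel)

# District fixed effects via the within transformation (demeaning by district)
panel$x_dm <- panel$lead_replace - ave(panel$lead_replace, panel$district)
panel$y_dm <- panel$attendance - ave(panel$attendance, panel$district)
fe_within  <- lm(y_dm ~ 0 + x_dm, data = panel)

print(c(pooled  = unname(coef(pooled)["lead_replace"]),
        dummies = unname(coef(fe_dummies)["lead_replace"]),
        within  = unname(coef(fe_within)["x_dm"])))
```

Both fixed effects versions recover the slope of 0.5, while the pooled regression returns a much smaller (here negative) slope, the same direction of bias as the move from model (1) to model (2).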

## Question 2.3.

### What is the MLR 4 assumption for model (2) to be unbiased? Do you think it is likely to hold in this case? Whatever position you take, give your argument.

MLR 4 for model (2) requires that the error term has zero conditional mean given the regressors and the fixed effects, $E[u_{it} \mid X_{i}, a_i, \delta_t] = 0$: there must be no omitted factor that varies over time within a district and is correlated with pipe replacements. I do not think it is likely to hold. The data go up to 2020, when attendance would change because classes moved to Zoom and not because more lead pipes were replaced: even if you are sick, you can attend class from home. The year fixed effects absorb the part of this COVID shock that is common to all districts, but if it affected districts differently in ways related to pipe replacements, it ends up in $u_{it}$. Another example is migration: during COVID a lot of people moved, which changes the composition of a district over time (its economy, or the health of its students), and district fixed effects cannot control for that.