<h1>ECON 140R Class 21</h1>

<b>Difference-in-differences (DID)</b> estimation is an extremely useful tool that can take several forms. Let us finish up our exploration of the second generalization of DID, which we call <b>panel fixed effects</b>. This estimation strategy applies when there are many groups, each of which can switch into and out of treatment, as we will see. This is the topic of section 5.2 in <i>Mastering Metrics</i>.

Learning objectives:

1. Examine panel data on the Minimum Legal Drinking Age (MLDA)
2. Run a DID using panel fixed effects and a clever TREAT $\times$ POST in `lm()`
3. Fail and then succeed. Failure is part of the story!
4. Use `plot()` to draw time series on the same axes
5. Include other potential confounders like state-specific trends and beer taxes

In [None]:
library(haven)
library(dplyr)
library(estimatr)

Now let us load in the dataset `deaths.dta` that Angrist and Pischke examine in Tables 5.2 and 5.3 in section 5.2.

In [None]:
deaths <- read_dta("deaths.dta")
head(deaths)

This panel dataset is complicated. Each row is not a state-year; it is instead a <b>state-year-age_group-cause</b>. That is, variable `dtype` indexes the cause of death for which that row's `mrate` measures the mortality rate in deaths per 100,000; deaths and population at risk are for the age group indexed by variable `agegr`.

The codings for each of these variables are:

`dtype`
1. All deaths
2. Motor vehicle accidents
3. Suicide
4. Homicide
5. Other external cause
6. Internal cause 

`agegr`
1. ages 15-17
2. ages 18-20
3. ages 21-24
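
As a small aid for reading these codes, here is a minimal sketch that attaches the labels listed above to the integer codes with `factor()`. It runs on a made-up toy data frame rather than on `deaths`, and the `*_lab` column names are hypothetical.

```r
# Toy stand-in for a few rows of the data (values are made up for illustration)
toy <- data.frame(
  dtype = c(1, 2, 3, 6),
  agegr = c(2, 2, 1, 3)
)

# Lookup vectors copied from the code lists above
dtype_labels <- c("All deaths", "Motor vehicle accidents", "Suicide",
                  "Homicide", "Other external cause", "Internal cause")
agegr_labels <- c("ages 15-17", "ages 18-20", "ages 21-24")

# factor() maps each integer code to its label
# (dtype_lab and agegr_lab are hypothetical column names)
toy$dtype_lab <- factor(toy$dtype, levels = 1:6, labels = dtype_labels)
toy$agegr_lab <- factor(toy$agegr, levels = 1:3, labels = agegr_labels)

toy

# Cross-tabulate to see which cause / age-group cells appear in the toy data
table(toy$dtype_lab, toy$agegr_lab)
```

The same two `factor()` calls should carry over to the real data, assuming `dtype` and `agegr` hold the integer codes shown above.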



This kind of dataset is challenging to work with, although it is perfectly set up for panel analysis. Simply put, it is hard to see the trees for the forest, to turn a common idiom on its head. It is not easy to peer inside these complex data.

Inside __R__, I think the most obvious way to peer inside the data is to use `subset()` to carve off a piece of the whole. 

In [None]:
# This is data for California showing a bunch of variables between 1970 and 1996, 
# and the outcome variable mrate = Motor Vehicle Accident mortality per 100,000 aged 18-20 
ca_18_20_mva <- subset(deaths, state == 6 & agegr == 2 & dtype == 2)
ca_18_20_mva

There is not a lot going on here with the minimum legal drinking age (MLDA), because California has had it set at 21 since 1933, apparently. [Source: Wikipedia](https://en.wikipedia.org/wiki/U.S._history_of_alcohol_minimum_purchase_age_by_state).

A pair of states that are more interesting to look at, per Angrist and Pischke, are Alabama and Arkansas. Alabama lowered its minimum legal drinking age to 19 in 1975, while Arkansas has kept its MLDA at 21 since Prohibition, like California.

In [None]:
# Alabama is state FIPS == 1
al_18_20_mva <- subset(deaths, state == 1 & agegr == 2 & dtype == 2 & year <= 1983)
al_18_20_mva
# Arkansas is state FIPS == 5
ar_18_20_mva <- subset(deaths, state == 5 & agegr == 2 & dtype == 2 & year <= 1983)
ar_18_20_mva

To run a panel fixed effects model, we need to stack or pool these two sets of state-specific observations. In __R__, `rbind()` allows us to do exactly that.

In [None]:
ar_al_18_20_mva <- rbind(al_18_20_mva, ar_18_20_mva)
ar_al_18_20_mva

Now let's call `mutate()` several times to create new variables for our standard DID set:

* $TREAT$ is an indicator of the treatment state, here Alabama
* $POST$ is an indicator of being past the treatment, here 1975 and onward
* $TREAT \times POST$ is the DID treatment variable, which measures being the treatment group during the treatment

In [None]:
# TREAT == 1 when state == 1, for Alabama
ar_al_18_20_mva <- mutate(ar_al_18_20_mva, 
                          treat = as.integer(state == 1))

# POST == 1 when year >= 1975, which is when Alabama started to
# lower its minimum legal drinking age
ar_al_18_20_mva <- mutate(ar_al_18_20_mva, 
                          post = as.integer(year >= 1975))

# TREAT x POST is the product. It registers 1 when the 
# treated state (Alabama) is actually treated
ar_al_18_20_mva <- mutate(ar_al_18_20_mva, 
                          treatxpost = as.integer(treat*post))

In [None]:
ar_al_18_20_mva

Now let us run regression DID with Alabama and Arkansas from 1970-1983. Here is the regression equation, which is also shown on page 192 of <i>Mastering Metrics</i>:

$$
mrate_{st} = \alpha + \beta \ TREAT_s + \gamma \ POST_t + \delta_{rDID} \ TREAT_s \times POST_t + e_{st} 
$$

where $mrate_{st}$ is the motor vehicle accident mortality rate in state $s$ at time $t$. (Angrist and Pischke are silent on which mortality rate they are talking about. Total or MVA mortality are the two obvious choices.) 
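
To see what $\delta_{rDID}$ measures, here is a minimal sketch on made-up numbers (not the Alabama and Arkansas data). With two groups and two periods, the coefficient on $TREAT_s \times POST_t$ is the change in the treated group's mean minus the change in the control group's mean.

```r
# Made-up mortality rates for a treated and a control state (illustration only)
toy <- data.frame(
  treat = c(1, 1, 1, 1, 0, 0, 0, 0),
  post  = c(0, 0, 1, 1, 0, 0, 1, 1),
  mrate = c(50, 54, 60, 62, 40, 42, 41, 45)
)
toy$treatxpost <- toy$treat * toy$post

# The four cell means: rows index treat, columns index post
m <- with(toy, tapply(mrate, list(treat = treat, post = post), mean))

# (treated after - treated before) - (control after - control before)
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same number comes out of the regression
fit <- lm(mrate ~ treat + post + treatxpost, data = toy)
did_by_lm <- unname(coef(fit)["treatxpost"])

did_by_hand  # (61 - 52) - (43 - 41) = 7
did_by_lm

# Formula shorthand: treat * post expands to treat + post + treat:post
coef(lm(mrate ~ treat * post, data = toy))["treat:post"]
```

The shorthand in the last line saves building the product by hand, at the cost of a coefficient name (`treat:post`) that differs from the `treatxpost` used in this notebook.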

In [None]:
ar_al_did_70_83 <- lm(mrate ~ treat + post + treatxpost,
               data = ar_al_18_20_mva)
summary(ar_al_did_70_83)

What do we see in the results above? Are you impressed or disheartened? Are Alabama and Arkansas good states to compare? Is the time period very long?

<hr>

Another thing we could do is try to visualize what is happening. It turns out that the death rates are pretty easy to superimpose on the same axes, even with my poor graphing skills in __R__!

In [None]:
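# These are the "pre-stacked" data frames
# ylim spans both states so that neither line is clipped by the axes
plot(ar_18_20_mva$year, ar_18_20_mva$mrate, col = "red", type = "b",
     ylim = range(c(ar_18_20_mva$mrate, al_18_20_mva$mrate), na.rm = TRUE),
     xlab = "Year", ylab = "MVA deaths per 100,000, ages 18-20",
     main = "Arkansas (red) and Alabama (blue) MVA death rates")
lines(al_18_20_mva$year, al_18_20_mva$mrate, col = "blue", type = "b")
grid()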

There is some evidence of similar trending, I suppose, but it is not exactly breathtaking, is it! 

Below is the clever Angrist-Pischke measure of treatment, $LEGAL_{st}$, which is a little different from $TREAT_s \times POST_t$ because it measures the share of 18-20 year olds who can legally drink. That makes a lot of sense given that the left-hand-side variable is the death rate among ages 18-20y. The variable $LEGAL_{st}$ tends to be near 0, 1/3, 2/3, or 1 because ages 18, 19, and 20 each make up roughly a third of the group: an MLDA of 21 lets none of them drink, 20 lets one third, 19 lets two thirds, and 18 lets all of them.
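
Here is a rough sketch of the arithmetic behind a share like $LEGAL_{st}$. It assumes that ages 18, 19, and 20 are equally sized, and the function names `share_legal()` and `share_legal_midyear()` are hypothetical. The construction Angrist and Pischke actually use may differ in its details.

```r
# Share of the 18-20 age group allowed to drink under a given MLDA,
# treating ages 18, 19, and 20 as equally sized (an assumption)
share_legal <- function(mlda) {
  mean(c(18, 19, 20) >= mlda)
}

sapply(c(21, 20, 19, 18), share_legal)  # 0, 1/3, 2/3, 1

# If the MLDA changes partway through a year, weight the old and new shares
# by the fraction of the year each law was in force
share_legal_midyear <- function(mlda_old, mlda_new, frac_new) {
  (1 - frac_new) * share_legal(mlda_old) + frac_new * share_legal(mlda_new)
}

# A drop from 21 to 19 halfway through the year: 0.5 * 0 + 0.5 * 2/3 = 1/3
share_legal_midyear(21, 19, 0.5)
```

This is why values between the thirds can show up in years when a law changes.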

In [None]:
plot(al_18_20_mva$year, al_18_20_mva$legal, col = "blue", type = "b")
lines(ar_18_20_mva$year, ar_18_20_mva$legal, col = "red")

Perhaps to no great surprise, Arkansas does not appear to be a particularly good control group for Alabama. Visually speaking, parallel trends are far from easy to pick out here.

Takeaways:
1. It is OK to be "wrong." Everybody is wrong often. Breathe it in and feel the wrong set you free
2. With annual averages for two states over 14 years, we have only 28 observations, so it can be easy not to find statistically significant results

Before we leave the Alabama and Arkansas example, let us examine the clever trick introduced by Angrist and Pischke in <i>Mastering Metrics</i> section 5.2, where they propose a new DID treatment variable for the minimum legal drinking age scenario, called $LEGAL_{st}$, which is an index ranging between 0 and 1 for state $s$ in year $t$.

Because states changed their MLDAs between 18 and 21, the age group 18-20 is most interesting to examine. But it also is not equally interesting in all places, because different states did different things. They also did them at <i>different times</i> during the year, passing a law that affected less than the full year, for example.

(An additional wrinkle that MM can largely sidestep is that alcohol-related motor vehicle accidents are surely not uniformly distributed across the days and weeks of the year. With annual data, these issues are unlikely to be important. But if the data were quarterly, monthly, weekly, or daily, these issues would be important. For more, see the ECON 140R modules on time series and seasonal adjustment.) 

Bottom line, $LEGAL_{st}$ is a constructed variable that helps estimation while still looking a lot like the $TREAT_s \times POST_t$ variable that we know and love. Let us examine what happens in the case of Alabama and Arkansas when we use $LEGAL_{st}$ instead:

In [None]:
ar_al_did_70_83_2 <- lm(mrate ~ treat + post + legal,
               data = ar_al_18_20_mva)
summary(ar_al_did_70_83_2)

What do we see here? Is the story much different using $LEGAL_{st}$?

<hr>

Let us jump ahead to the finish line in Chapter 5: Table 5.2, which shows the results of four different models (across the columns) applied to four different measures of mortality (down the rows). Here is Table 5.2:

<img src = "images/MMtbl52.png" width = 500 /> 

The way to read the table is that each cell is the DID estimate of the effect of a state's <i>lowering</i> its minimum legal drinking age (MLDA) on a particular death rate, which is shown along the row. A positive number is an increase in the death rate that is caused by lowering the MLDA, so that more 18-20 year olds can legally drink alcohol. The number in the upper left, $10.80$, means that expanding legal drinking down to age 18 triggers an increase in the total death rate among 18-20 year olds of 10.80 deaths per 100,000. That estimate is obtained without state trends or weights; when those are included, along the columns, the results change a little.

See that $10.80$? Here it is below, in the first row after the intercept. We are just running `lm()` with the all-cause death rate as our $y$-variable (`dtype == 1`), in the appropriate time range and with the 18-20yo age group.

In [None]:
panel_fe_1 <- lm(mrate ~ legal + factor(state) + factor(year),
                 data = subset(deaths, year <= 1983 & agegr == 2 & dtype == 1))
summary(panel_fe_1)

Note how the standard error of that coefficient on $LEGAL_{st}$ is $3.14$ rather than the reported value of $4.59$ shown in Table 5.2. As Angrist and Pischke discuss in the appendix to Chapter 5, best practice is to <i>cluster the standard errors</i> on the group ID, here the states.

In the long, long ago we did the same for the RAND HIE, only there we clustered on `famid`, the family ID. So let us bring back our beloved friend `lm_robust()` for some good times:

In [None]:
panel_fe_1r <- lm_robust(mrate ~ legal + factor(state) + factor(year),
                         data = subset(deaths, year <= 1983 & agegr == 2 & dtype == 1),
                         clusters = state, se_type = "stata")
summary(panel_fe_1r)

Ha! Note the finishing touch: setting `se_type = "stata"` asks `lm_robust()` to compute the clustered standard errors with the same formula that Stata uses.
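
For the curious, here is a minimal base-R sketch of what a Stata-style cluster-robust variance looks like under the hood. It runs on simulated data rather than the MLDA panel, and `cluster_vcov()` is a hypothetical helper written for illustration, not a function from `estimatr`.

```r
# Simulated data: 10 hypothetical groups with 8 observations each
set.seed(140)
g <- rep(1:10, each = 8)
x <- rnorm(80)
# A shock shared within each group makes the errors correlated inside clusters
y <- 1 + 2 * x + rnorm(10)[g] + rnorm(80)
fit <- lm(y ~ x)

# Hypothetical helper: cluster-robust "sandwich" variance for an lm fit
cluster_vcov <- function(fit, cluster) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  N <- nrow(X)
  K <- ncol(X)
  G <- length(unique(cluster))
  bread <- solve(crossprod(X))
  # Sum x_i * e_i within each cluster, then add up the outer products
  scores <- rowsum(X * e, group = cluster)
  meat <- crossprod(scores)
  # Finite-sample correction of the kind Stata applies
  adj <- (G / (G - 1)) * ((N - 1) / (N - K))
  adj * bread %*% meat %*% bread
}

# Compare conventional and clustered standard errors side by side
se_conventional <- sqrt(diag(vcov(fit)))
se_clustered    <- sqrt(diag(cluster_vcov(fit, g)))
rbind(se_conventional, se_clustered)
```

The point estimates do not change at all; only the measure of their uncertainty does, because residuals are allowed to be correlated within a cluster.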

<hr>

<h2>Interlude</h2> 

There must be a way to generate interactions between `state` and `year` in __R__, but alas I cannot figure it out before class. I can hardcode it, but that would be 50-times extremely sad for me and you alike. Progress awaits further inquiry.

We care because Angrist and Pischke introduce state-specific linear time trends to check the robustness of the MLDA result. Those are the "State trends" in Table 5.2 and on the right-hand side of Table 5.3. Operationally, we mean variables that shadow `year` and only "switch on" for an individual state. We can get there by using 50 calls to `mutate()`, but that is just preposterously inelegant. 
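
One candidate for building those state-specific trends without hardcoding is the formula term `factor(state):year`, which interacts each state dummy with the numeric `year`. The sketch below tries it on a simulated panel (4 hypothetical states, 10 years, made-up coefficients), so it illustrates the formula syntax and does not replicate Table 5.2.

```r
# Simulated panel: 4 hypothetical states observed over 10 years
set.seed(21)
toy <- expand.grid(year = 1970:1979, state = 1:4)
# States 1 and 2 are "treated" from 1975 onward
toy$legal <- as.numeric(toy$state <= 2 & toy$year >= 1975)
# Each state gets its own made-up linear trend
state_slope <- c(0.5, -0.2, 0.1, 0.3)
toy$mrate <- 50 + 5 * toy$legal +
  state_slope[toy$state] * (toy$year - 1970) + rnorm(nrow(toy))

# factor(state):year adds one slope on year for each state
trend_fit <- lm(mrate ~ legal + factor(state) + factor(year) + factor(state):year,
                data = toy)

# One trend is collinear with the year dummies, so lm() reports it as NA
coef(trend_fit)[grep(":year", names(coef(trend_fit)))]

# The treatment coefficient, now net of state-specific trends
coef(trend_fit)["legal"]
```

The same term should be usable in the panel regressions in this notebook, but that is an untested assumption here; expect one trend to be dropped for collinearity there as well.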

<hr>

<h2>More control variables in panel fixed effects</h2>
    
The last topic in Chapter 5 is introducing another control variable to a panel fixed effects regression model. This turns out to be as easy as inserting the confounder of interest as another right-hand-side variable.

<img src = "images/MMtbl53.png" width = 500 /> 

The code below replicates the first two rows in the first two columns above. Note the very large standard error on the beer tax coefficient.

In [None]:
# Use a new name so the earlier model without the beer tax is not overwritten
panel_fe_2r <- lm_robust(mrate ~ legal + beertaxa + factor(state) + factor(year),
                         data = subset(deaths, year <= 1983 & agegr == 2 & dtype == 1),
                         clusters = state, se_type = "stata")
summary(panel_fe_2r)

What do you see here? What is the bottom line for beer taxes and minimum legal drinking ages (MLDA) in terms of their effects on the death rate for people aged 18-20y?

<hr>

<h2>If there's time: Another successful DID failure(?)</h2>

Courtesy of Prof. Ethan Lewis at Dartmouth, we have another interesting DID regression or two up our sleeves. Recall the famous Card and Krueger paper about the rise in the minimum wage in New Jersey in 1992 compared to Pennsylvania? Then, it rose from \\$4.25 to \\$5.05 per hour.

In 2014, New Jersey raised its minimum wage from \\$7.25 to \\$8.25, and then a little more thereafter, to keep up with inflation. Like before, Pennsylvania left its minimum wage alone. Both time series can be found on the St. Louis Fed FRED website: [for NJ](https://fred.stlouisfed.org/series/STTMINWGNJ) and [for PA](https://fred.stlouisfed.org/series/STTMINWGPA).

This might be fun to examine. A big problem, as we've discussed in class, is that for a DID to be successful, you need to measure the units most likely to show an effect from treatment. Card and Krueger actually surveyed fast food restaurants, which employ lots of minimum-wage workers.

What we can do is look at annual averages of employment in the Census Bureau's American Community Survey (ACS) courtesy of IPUMS. We have two takes on this below: one that measures all employment in NJ and PA, and another that measures employment of just men with high school or less and aged 16-25. 

In [None]:
njpa1117_annavgemp <- read_dta("njpa1117_annavgemp.dta")
njpa1117_m1625_hs_annavgemp <- read_dta("njpa1117_m1625_hs_annavgemp.dta")

In [None]:
njpa1117_annavgemp <- mutate(njpa1117_annavgemp, 
                             treat = as.integer(nj == 1))

njpa1117_annavgemp <- mutate(njpa1117_annavgemp, 
                             post = as.integer(year >= 2014))

njpa1117_annavgemp <- mutate(njpa1117_annavgemp, 
                             treatxpost = as.integer(treat*post))

njpa1117_annavgemp

In [None]:
njpa1117_reg1 <- lm(sum_employed ~ treat + post + treatxpost,
                   data = njpa1117_annavgemp)
summary(njpa1117_reg1)

Hmm. What do you see in this DID regression of total employment on the 2014 minimum wage increase? Are you convinced of anything?

<hr>

Now let us also look at men aged 16-25 with a high school degree or less.

In [None]:
njpa1117_m1625_hs_annavgemp <- mutate(njpa1117_m1625_hs_annavgemp,
                                      treat = as.integer(nj == 1))

njpa1117_m1625_hs_annavgemp <- mutate(njpa1117_m1625_hs_annavgemp,
                                      post = as.integer(year >= 2014))

njpa1117_m1625_hs_annavgemp <- mutate(njpa1117_m1625_hs_annavgemp,
                                      treatxpost = as.integer(treat*post))

njpa1117_m1625_hs_annavgemp

In [None]:
njpa1117_reg2 <- lm(sum_employed ~ treat + post + treatxpost,
                   data = njpa1117_m1625_hs_annavgemp)
summary(njpa1117_reg2)

Hmm. What do you see in this DID regression of employment among men aged 16-25 with a high school degree or less on the 2014 minimum wage increase? Are you convinced of anything?

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>