--- 
Project for the course in Microeconometrics | Summer 2021, M.Sc. Economics, Bonn University | [Edoardo Falchi](https://github.com/edoardofalchi)

# Replication of David Card & Alan Krueger (1994) <a class="tocSkip"></a>
---

This notebook contains my replication of the results from the following paper:

> Card, David, and Alan Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” American Economic Review 84: 772–93.

##### Downloading and viewing this notebook:

* The best way to view this notebook is by downloading it and the repository it is located in from [GitHub](https://github.com/edoardofalchi/ose-data-science-course-projeect-edoardofalchi). Other viewing options like _MyBinder_ or _NBViewer_ may have issues with displaying images or coloring of certain parts (missing images can be viewed in the folder [files](https://github.com/edoardofalchi/ose-data-science-course-projeect-edoardofalchi/tree/master/files) on GitHub).


* The original paper can be accessed [here](https://www.nber.org/papers/w4509), and the [dataset](https://davidcard.berkeley.edu/data_sets.html) is freely available from David Card's website.


In [4]:
%matplotlib inline
import numpy as np
import pandas as pd
import pandas.io.formats.style
import seaborn as sns
import statsmodels as sm
import statsmodels.formula.api as smf
import statsmodels.api as sm_api
import matplotlib.pyplot as plt
from IPython.display import HTML

# 1. Introduction

To get a general idea of the paper, let's lay the groundwork by posing some questions:

* **What is the causal link the paper is trying to reveal?**  
The big question that makes this paper important: _"How do employers in a low-wage labor market respond to an increase in the minimum wage?"_

    The narrow question this paper seeks to address: _"How does a rise in the minimum wage in New Jersey affect employment and wage levels within fast-food restaurants?"_
    The authors compare employment levels in fast-food restaurants in New Jersey and Pennsylvania (a neighboring state without a similar policy change) before and after the minimum wage in New Jersey rises from \$4.25 to \$5.05 per hour.

* **What is the key dependent variable?**  
The outcome variable $Y_{i,s,t}$ is employment in restaurant $i$ in state $s$ (NJ or PA) and period $t$ (a month before or eight months after the minimum-wage increase).

* **What is the key independent variable?**  
The treatment dummy, which takes the value 1 for stores in New Jersey in the period after the higher minimum wage took effect, and the value 0 otherwise.

* **What is the data source?**  
The researchers constructed a sample frame of fast-food restaurants in New Jersey and eastern Pennsylvania from the Burger King, KFC, Wendy's, and Roy Rogers chains. They collected two waves of interview data through a telephone survey that included questions on employment, starting wages, prices, and other store characteristics. The first wave was collected between February 15 and March 4, 1992 (before the minimum wage increase), and the second wave between November 5 and December 31, 1992 (after the increase).

* **What is the identification strategy?**   
The authors use a difference-in-differences approach as the identification strategy and estimate the impact at the micro (store) level. They compare employment levels in fast-food restaurants in New Jersey and Pennsylvania (a neighboring state without a similar policy change) before and after the minimum wage shift in New Jersey. The dependent variable of the model is the change in employment from wave 1 to wave 2 at a particular store; the explanatory variables are a set of store characteristics and a dummy variable that equals 1 for stores in New Jersey. In a second equation with the same dependent variable, the explanatory variables are the set of store characteristics and an alternative measure of the impact of the minimum wage at a given store.

* **What are the assumptions / threats to this identification strategy?**   
The difference-in-differences method rests on the parallel trends assumption, and a violation of it is the central threat: employment levels in New Jersey and Pennsylvania would have followed the same time trend in the absence of the minimum wage policy change.   
This identification strategy also assumes that fast-food restaurants in Pennsylvania are good counterfactuals for those in New Jersey. Most importantly, the degree of competition among firms in the food industry is assumed to be the same in the two states.
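The two regression models described under the identification strategy can be written out explicitly. The notation below follows equations (1a) and (1b) of Card and Krueger (1994) as I read them; the symbols are the paper's and do not correspond to variable names in this notebook. $\Delta E_i$ is the change in employment at store $i$ between wave 1 and wave 2, $X_i$ is the set of store characteristics, and $NJ_i$ is the New Jersey dummy:

$$
\Delta E_i = a + b X_i + c \, NJ_i + \varepsilon_i \tag{1a}
$$

$$
\Delta E_i = a' + b' X_i + c' \, GAP_i + \varepsilon'_i \tag{1b}
$$

The alternative measure $GAP_i$ is the proportional increase in the starting wage needed to reach the new minimum, where $W_{1i}$ denotes the wave-1 starting wage at store $i$:

$$
GAP_i =
\begin{cases}
0 & \text{for stores in Pennsylvania} \\
0 & \text{for stores in New Jersey with } W_{1i} \geq 5.05 \\
\dfrac{5.05 - W_{1i}}{W_{1i}} & \text{for other stores in New Jersey}
\end{cases}
$$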


---
# 1. Causal graph
---

![ERROR:Here should be causal graph 1](files/dagitty-model.jpg)
_How do we come up with this causal graph?_

The treated sample (NJ restaurants) had a new policy applied to it at a certain point in time (the minimum wage increase). The authors gathered data so that we can observe these restaurants both before and after the treatment went into effect. We argue that the policy treatment D might have had an effect on Y.  

The simple difference between Y before treatment and Y after treatment for the treated group reflects two things: the effect of treatment on Y (the part we're interested in) and the way Y may have changed over time for reasons unrelated to treatment (the part we're not interested in). Time gives us a back-door path from Treatment to Y: we can get from Treatment to Y either through the Treatment → Y path, which we want, or through the Treatment ← Time → Y path, which we don't.

But we can't close this back door by controlling for time if we only look at the treated group, since time perfectly predicts treatment (it's either before and you're not treated, or after and you are). If we removed all the parts of treatment explained by time, there would be nothing left!

What can we do? We can add a control group that never gets treated (Pennsylvania restaurants). This is going to let us control for time, but introduces the problem that now we have another back door, since the control and treatment groups may be different. In our scenario, a restaurant receives treatment only if it is in the treated State group (New Jersey) AND in the time period after treatment is applied. In addition to our time back door, we also have a back door from Treatment ← State → Y that we need to close, and we can do so by applying **Difference-in-Differences**.

Looking at before/after differences separately for the treatment and control groups is a way of controlling for State, closing the Treatment ← State → Y back door. Then we take the before/after difference for the treated group and subtract the before/after difference for the control group. This removes the before/after change that is explained by time, as measured in the control group, in effect controlling for time and closing the Treatment ← Time → Y back door.
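The two differencing steps can be made concrete with a small sketch. The numbers below are made up for illustration and are not taken from the Card and Krueger data; the column names `nj`, `post`, and `emp` are hypothetical. The sketch computes the difference-in-differences from the four cell means and then checks that an OLS regression with an interaction term gives the same number.

```python
import numpy as np
import pandas as pd

# Toy panel (made-up numbers, NOT the Card-Krueger data):
# two stores per state, each observed before (post=0) and after (post=1).
toy = pd.DataFrame({
    "nj":   [1, 1, 1, 1, 0, 0, 0, 0],
    "post": [0, 0, 1, 1, 0, 0, 1, 1],
    "emp":  [20.0, 22.0, 21.0, 23.0, 24.0, 26.0, 22.0, 24.0],
})

# Step 1: mean outcome in each state/period cell (rows: nj, columns: post).
cell_means = toy.groupby(["nj", "post"])["emp"].mean().unstack("post")

# Step 2: before/after difference within each group (controls for State).
change = cell_means[1] - cell_means[0]

# Step 3: subtract the control group's change from the treated group's
# change (controls for Time).
did_means = change.loc[1] - change.loc[0]

# The same number from OLS with an interaction term:
# emp = a + b*nj + c*post + d*(nj*post) + e, where d is the DiD estimate.
X = np.column_stack([
    np.ones(len(toy)),
    toy["nj"],
    toy["post"],
    toy["nj"] * toy["post"],
])
coef, *_ = np.linalg.lstsq(X, toy["emp"].to_numpy(), rcond=None)
did_ols = coef[3]

print(cell_means)
print("DiD from cell means:", did_means)
print("DiD from OLS interaction:", round(did_ols, 6))
```

In this toy example the treated group's mean rises by 1 while the control group's mean falls by 2, so both approaches give a difference-in-differences of 3. The regression form is convenient because store characteristics can be added as further columns of `X`.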

---
# 2. Dataset
---

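The original dataset is in wide format, i.e. there are separate columns for the variables of each survey wave. The version loaded below is in long format: each store appears once per wave, and the `time` column distinguishes the two waves.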

In [5]:
# Load the long-format survey data and order the rows by store id
df = pd.read_stata('data/CK1994.dta')
df = df.sort_values(by="store").set_index("store")
df


| store | chain | co_owned | state | southj | centralj | northj | pa1 | pa2 | shore | ncalls | ... | firstinc | meals | open | hoursopen | pricesoda | pricefry | priceentree | nregisters | nregisters11 | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.25 | 3.0 | 7.0 | 16.0 | 0.93 | 0.83 | 0.85 | 4.0 | 3.0 | 0.0 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | ... | NaN | 2.0 | 7.0 | 16.0 | 1.05 | 0.79 | 0.90 | 4.0 | 3.0 | 1.0 |
| 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.50 | 1.0 | 7.0 | 14.0 | 1.06 | 0.91 | 0.96 | 2.0 | 2.0 | 0.0 |
| 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | ... | 0.38 | 1.0 | 7.0 | 15.0 | 1.05 | 1.01 | 0.94 | 2.0 | 2.0 | 1.0 |
| 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | ... | 0.25 | 2.0 | 11.0 | 10.0 | 1.06 | 0.95 | 3.09 | 5.0 | 3.0 | 0.0 |
| 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.25 | 1.0 | 11.0 | 11.0 | 1.05 | 0.94 | 2.75 | 5.0 | 3.0 | 1.0 |
| 4.0 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | ... | 0.28 | 2.0 | 8.0 | 13.5 | 1.22 | 1.37 | 0.89 | 6.0 | 4.0 | 1.0 |
| 4.0 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.25 | 2.0 | 9.0 | 12.5 | 1.06 | 1.00 | 2.13 | 8.0 | 4.0 | 0.0 |
| 5.0 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.18 | 2.0 | 6.0 | 17.0 | 1.05 | 0.94 | 0.94 | 5.0 | 5.0 | 1.0 |
| 5.0 | 3.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.17 | 1.0 | 6.0 | 17.0 | 0.95 | NaN | 2.13 | 5.0 | 4.0 | 0.0 |


In [17]:
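#!pip install plotly
import plotly.express as px

# Mean opening hours per state for rows with time == 0; state is coded 0 = PA, 1 = NJ
mean_hours = df[df["time"] == 0].groupby("state")["hoursopen"].mean()

# groupby sorts by the state code, so the means come out in the order PA, NJ
fig = px.choropleth(locations=["PA", "NJ"], locationmode="USA-states", color=mean_hours.to_numpy(), scope="usa")
fig.show()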

In [15]:
df[df["time"]<1].groupby("state")["hoursopen"].mean()

state
0.0    14.525316
1.0    14.418429
Name: hoursopen, dtype: float32
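The numeric state codes in this output are easy to misread. A minimal sketch of how the group means could be labelled with state abbreviations is given below. It uses made-up rows rather than the survey data, and it assumes the coding 0 = Pennsylvania and 1 = New Jersey, which is consistent with the regional dummies (`pa1`, `pa2`, `centralj`) in the tables of this notebook. The `state_names` mapping is a helper introduced only for this illustration.

```python
import pandas as pd

# Toy rows with made-up opening hours (NOT the survey data).
# Assumed coding: state 0 = PA, state 1 = NJ.
toy = pd.DataFrame({
    "state": [0.0, 0.0, 1.0, 1.0],
    "time": [0.0, 1.0, 0.0, 1.0],
    "hoursopen": [14.0, 15.0, 16.0, 17.0],
})

# Hypothetical helper: map the numeric codes to state abbreviations.
state_names = {0.0: "PA", 1.0: "NJ"}

# Keep the first period only, then group by the labelled state.
first_period = toy[toy["time"] == 0]
labelled = first_period.groupby(first_period["state"].map(state_names))["hoursopen"].mean()
print(labelled)
```

The resulting index holds the labels `NJ` and `PA` instead of `1.0` and `0.0`, which also makes it harder to pair a mean with the wrong state when passing the values on to a plot.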

In [10]:
df_extended = pd.read_stata('data/fastfood.dta')
df_extended

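|  | sheet | chain | co_owned | state | southj | centralj | northj | pa1 | pa2 | shore | ... | firstin2 | special2 | meals2 | open2r | hrsopen2 | psoda2 | pfry2 | pentree2 | nregs2 | nregs112 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0.08 | 1.0 | 2.0 | 6.5 | 16.5 | 1.03 | NaN | 0.94 | 4.0 | 4.0 |
| 1 | 49 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0.05 | 0.0 | 2.0 | 10.0 | 13.0 | 1.01 | 0.89 | 2.35 | 4.0 | 4.0 |
| 2 | 506 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0.25 | NaN | 1.0 | 11.0 | 11.0 | 0.95 | 0.74 | 2.33 | 4.0 | 3.0 |
| 3 | 56 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0.15 | 0.0 | 2.0 | 10.0 | 12.0 | 0.92 | 0.79 | 0.87 | 2.0 | 2.0 |
| 4 | 61 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0.15 | 0.0 | 2.0 | 10.0 | 12.0 | 1.01 | 0.84 | 0.95 | 2.0 | 2.0 |
| 5 | 62 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | NaN | 0.0 | 2.0 | 10.0 | 12.0 | NaN | 0.84 | 1.79 | 3.0 | 3.0 |
| 6 | 445 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.15 | 0.0 | 2.0 | 6.0 | 18.0 | 1.04 | 0.86 | 0.94 | 3.0 | 3.0 |
| 7 | 451 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.20 | 0.0 | 2.0 | 0.0 | 24.0 | 1.11 | 0.84 | 0.94 | 6.0 | 3.0 |
| 8 | 455 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0.25 | 0.0 | 2.0 | 11.0 | 11.0 | 0.94 | 0.84 | 2.32 | 4.0 | 3.0 |
| 9 | 458 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0.25 | 0.0 | 1.0 | 11.0 | 10.5 | 0.90 | 0.73 | 2.32 | 4.0 | 3.0 |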
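A wide table like the one above can be reshaped into the long layout used earlier in this section, with one row per store and wave. The sketch below shows one possible way to do this with `pd.wide_to_long` on a toy table. The values are made up, and the column names `store`, `hrsopen1`, and `psoda1` are hypothetical: in the real file the wave-1 columns may not share a common stub with their wave-2 counterparts, so they would probably need to be renamed first.

```python
import pandas as pd

# Toy wide table (made-up values, hypothetical column names):
# one row per store, wave-specific columns end in 1 or 2.
wide = pd.DataFrame({
    "store": [46, 49],
    "state": [0, 0],
    "hrsopen1": [16.0, 13.0],
    "hrsopen2": [17.0, 12.0],
    "psoda1": [1.00, 1.01],
    "psoda2": [1.03, 1.05],
})

# Stack the wave-specific columns: the numeric suffix becomes the "wave" column,
# and columns without a stub (here "state") are repeated for each wave.
long_df = pd.wide_to_long(
    wide, stubnames=["hrsopen", "psoda"], i="store", j="wave"
).reset_index()

# Recode the wave number into a 0/1 period indicator, as in the long dataset.
long_df["time"] = long_df["wave"] - 1
long_df = long_df.sort_values(["store", "time"]).reset_index(drop=True)
print(long_df)
```

Each store now contributes two rows, and the `time` column plays the same role as in the long-format dataset loaded at the start of this section.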
