In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab07.ipynb")

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Lab 7: The Impact of Minimum Wage on Employment

## Overview

We are going to replicate a study conducted by **[Card and Krueger in 1994](https://davidcard.berkeley.edu/papers/njmin-aer.pdf)** that investigates the relationship between a rise in the minimum wage and employment.

## Background

[Economic theories](https://www.frbsf.org/research-and-insights/publications/economic-letter/2015/12/effects-of-minimum-wage-on-employment/) have long suggested that increases in the minimum wage lead to a reduction in employment for at least two reasons:

1. **Businesses are less likely to hire** and would rather invest in other resources that become relatively cheaper after the wage increase.
2. **Higher salaries will induce businesses to raise their prices** to compensate for their greater costs; as prices increase, we expect fewer buyers, which will lead to lower demand and employment.

These theories have found [mixed support](https://www.nber.org/papers/w12663), but the discussion is still very much open within the policy world, as states consider raising their minimum wage to help local populations cope with rising living costs. Discussions are currently occurring in **[New Jersey](https://www.nytimes.com/2019/01/17/nyregion/nj-minimum-wage.html)** and **[Illinois](https://kmox.radio.com/articles/discussions-underway-raise-illinois-minimum-wage-15hour)** to raise the minimum wage to **$15/hour** ([New York](https://www.nytimes.com/2018/12/31/nyregion/15-minimum-wage-new-york.html?module=inline) passed this same raise in 2018).

## The Original Study

One of the first studies looking at this policy problem was **Card and Krueger’s**. They applied a difference-in-differences design to look at two groups of fast-food restaurants:

- Fast-food restaurants in **New Jersey** where the minimum wage **increased** from \$4.25 to \$5.05 per hour (treatment group)
- Fast-food restaurants in **Pennsylvania** where the minimum wage did not change (control group).

They collected data before and after the minimum wage was approved. Data used in the study can be downloaded [here](https://github.com/DS4PS/PROG-EVAL-III/blob/master/TEXTBOOK/DATA/DID_Example.csv).
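
To make the design concrete: the difference-in-differences estimate is the change over time in the treatment group minus the change over time in the control group. Below is a minimal sketch using made-up average employment figures (the numbers and variable names are hypothetical, not the study's data).

```python
# Hypothetical average employment per restaurant (illustrative values only,
# NOT taken from Card and Krueger's data).
means = {
    ("NJ", "pre"): 20.0,
    ("NJ", "post"): 21.0,
    ("PA", "pre"): 23.0,
    ("PA", "post"): 21.5,
}

# Change over time within each state.
change_nj = means[("NJ", "post")] - means[("NJ", "pre")]  # treatment group
change_pa = means[("PA", "post")] - means[("PA", "pre")]  # control group

# The DiD estimate nets out the change that both states would share.
did_estimate = change_nj - change_pa
print(did_estimate)
```

The control group's change stands in for what would have happened to the treatment group without the policy, which is why the design relies on both groups following similar trends.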

## Research Question
- Does an increase in the minimum wage affect employment rates?

## Hypothesis

- An increase in the minimum wage is negatively correlated with employment.


## Part 1: Understanding the Research 

<!-- BEGIN QUESTION -->

**Question 1.1:** How does using a *difference-in-differences* research design affect the *validity* and reliability of findings in studies on changes in minimum wage and their impact on employment rate?

*Hint*: Think about the parallel trends assumption.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.2:** What's the difference between the `Group` and `Treatment` columns?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<hr>

## Part 2: Data Cleaning

We now proceed with loading in the data used in the original 1994 study. It contains the following variables:

| Variable Name | Description                                          |
|---------------|------------------------------------------------------|
| ID            | Unique identifier for each fast-food restaurant      |
| Treatment     | Pre-treatment (=0) and post-treatment (=1)           |
| Group         | 1 if NJ (treatment); 0 if PA (control)               |
| Empl          | Number of full-time employees                        |
| C.Owned       | If owned by a company (=1) or not (=0)               |
| Hours.Opening | Number of hours open per day                         |
| Soda          | Price of medium soda, including tax                  |
| Fries         | Price of small fries, including tax                  |
| Chain         | 1 = BK, 2 = KFC, 3 = Roys, 4 = Wendys                |
| SouthJ        | South New Jersey                                     |
| CentralJ      | Central New Jersey                                   |
| NorthJ        | North New Jersey                                     |
| PA1           | Northeast suburbs of Philadelphia                    |
| PA2           | Easton and other PA areas                            |
| Shore         | New Jersey Shore                                     |


In [None]:
dd_df = pd.read_csv('data/DID_Example.csv')
dd_df.head()

<!-- BEGIN QUESTION -->

**Question 2.1:** What does each row in `dd_df` represent?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.2:** Summarize the data above such that it shows count, mean, and other useful statistics for each column.

*Hint*: use `.describe()`
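
As a quick illustration of what `.describe()` returns, here is a minimal sketch on a toy DataFrame (the data and column names are made up and unrelated to `dd_df`).

```python
import pandas as pd

# Toy data, unrelated to the lab's dataset (column names are hypothetical).
toy = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0], "open": [1, 0, 1, 1]})

# describe() returns a DataFrame with one column of summary statistics
# (count, mean, std, min, quartiles, max) per numeric column.
toy_summary = toy.describe()
print(toy_summary.loc[["count", "mean"]])
```

The result is itself a DataFrame, so individual statistics can be looked up with `.loc`.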

In [None]:
dd_df_summarized = ...


dd_df_summarized

In [None]:
grader.check("q2_2")

**Question 2.3:** What's the average number of employees per fast-food restaurant?

In [None]:
mean_employees = ...

mean_employees

In [None]:
grader.check("q2_3")

**Question 2.4:** What percentage of fast-food restaurants are part of the treatment group? (e.g. 40 instead of 0.4)
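
Because the mean of a 0/1 indicator is the share of ones, multiplying that mean by 100 gives a percentage. A minimal sketch on a made-up indicator (the variable names are hypothetical):

```python
import pandas as pd

# Made-up 0/1 indicator: 3 of the 4 units are flagged.
flag = pd.Series([1, 0, 1, 1])

share = flag.mean()        # proportion of ones, between 0 and 1
percentage = share * 100   # same quantity on a 0-100 scale
print(percentage)
```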

In [None]:
per_treatment = ...

per_treatment

In [None]:
grader.check("q2_4")

**Question 2.5:** Notice that our `Chain` column is a categorical variable hidden as a numeric value! One-hot encode this column and save your work along with the original `dd_df` in `dd_df_encoded`. After one-hot encoding `Chain`, you should *drop* the original column and *drop* `Roys` to avoid multicollinearity. This will be helpful for our analysis below.

*Hint*: use `pd.get_dummies()` for one-hot encoding. 
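
To show the general pattern, here is a minimal sketch of one-hot encoding on a toy DataFrame. The `cup` column, its labels, and the choice of `large` as the dropped baseline are all made up for illustration; dropping one category is what avoids perfect multicollinearity with the intercept.

```python
import pandas as pd

# Toy data with a categorical variable stored as numeric codes (hypothetical).
toy = pd.DataFrame({"id": [1, 2, 3, 4], "cup": [1, 2, 3, 2]})
cup_names = {1: "small", 2: "medium", 3: "large"}

# Map the codes to readable labels, then create one 0/1 column per label.
dummies = pd.get_dummies(toy["cup"].map(cup_names), dtype=int)

# Drop one category to serve as the baseline.
dummies = dummies.drop(columns="large")

# Replace the original coded column with the dummy columns.
toy_encoded = pd.concat([toy.drop(columns="cup"), dummies], axis=1)
print(toy_encoded)
```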

In [None]:
chain_dummies = ...
...
dd_df_encoded = ...

dd_df_encoded

In [None]:
grader.check("q2_5")

<hr>

## Part 3: Analysis: Estimating the Difference-in-Differences Model

We now estimate the Difference-in-Differences model based on the model below. Along with the **Group** and **Treatment** variables, we also include a set of control variables to account for differences across restaurants. 

Here we consider the variables:  
- **Opening hours**, since fast-food restaurants that are open for more hours might need more employees.
- **Prices of fries and sodas**, under the assumption that more expensive fast-food restaurants might have more resources to hire additional staff.


<!-- BEGIN QUESTION -->

**Question 3.1:** Using the general diff-in-diff model below, specify the model we're about to estimate using $\LaTeX$. *Hint*: Don't forget to add your controls. Feel free to copy and reformat the $\LaTeX$ code below.

$(1.1)$ 

$$Y = \beta_0 + \beta_1 \cdot \text{Treatment} + \beta_2 \cdot \text{Post} + \beta_3 \cdot \text{Treatment} \times \text{Post} + \text{Controls} + e$$


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 3.2:** Now, rename the relevant column names of the `dd_df` to fit the equation specified in (1.1). Instead of having `Group` and `Treatment`, they should be `Treatment` and `Post`, both of which take on binary values. This will simplify our coming analysis!
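
One detail worth knowing when a new name is the same as another column's old name: a single `rename` call applies the whole mapping at once, so the names do not collide. A minimal sketch with made-up column names:

```python
import pandas as pd

# Toy frame with hypothetical column names "a" and "b".
toy = pd.DataFrame({"a": [0, 1], "b": [10, 20]})

# "a" becomes "b" while the old "b" becomes "c", all in one step.
toy_renamed = toy.rename(columns={"a": "b", "b": "c"})
print(toy_renamed.columns.tolist())
```

Renaming in two separate steps would instead require choosing the order carefully so that two columns never share a name.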

In [None]:
dd_df = dd_df_encoded...
dd_df

In [None]:
grader.check("q3_2")

**Question 3.3:** Now, using `statsmodels`, run the diff-in-diff regression from equation (1.1). Run this regression *without* your controls.

*Hint*: [here](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) is how to run OLS in statsmodels. 

*Hint*: think of what data type `X` and `y` should be.  

In [None]:
# Create your Treatment*Post interaction term.
dd_df['Treatment_Post'] = ...

# Define your dependent variable.
y = dd_df[...] 

# Define your independent variables, including the interaction term
X = dd_df[...]

# Add a constant to the model (the intercept)
X = sm.add_constant(X)

# Create the OLS model
model = ...

# Fit the model
results = ...

# Print the summary of the model
print(results.summary())

In [None]:
grader.check("q3_3")

**Question 3.4:** You notice that the output of `print(results.summary())` shows NaNs for all coefficients and standard errors. This may be because of NaNs in the data. For the sake of simplicity, feel free to just drop all the NaNs for now.
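
Before dropping anything, it can help to see where the missing values are. A minimal sketch on a toy DataFrame (the data are made up):

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (hypothetical).
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# dropna() keeps only the rows that have no missing values at all.
toy_clean = toy.dropna()
print(len(toy), "->", len(toy_clean))
```

Note that dropping rows shrinks the sample, so it is worth checking how many observations remain.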

In [None]:
dd_df = ...
dd_df

In [None]:
grader.check("q3_4")

**Question 3.5:** Repeat the analysis from `q3.3` and estimate the Diff-in-Diff estimator. Again, do not include controls from `q3.1` here. 

In [None]:
...

In [None]:
grader.check("q3_5")

**Question 3.6:** We notice that our $R^2$ (one of many goodness-of-fit parameters we can use) is very low. To aid our analysis, include the controls you specified in the $\LaTeX$ equation above.
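
For reference, a common definition of $R^2$ for OLS with an intercept compares the residual variation to the total variation in the outcome. The notation here ($\hat{y}_i$ for fitted values, $\bar{y}$ for the sample mean) is introduced only for illustration:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

On a fixed sample, adding regressors can never lower this quantity, so a higher $R^2$ after adding controls does not by itself show that the controls matter.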

In [None]:
...

In [None]:
grader.check("q3_6")

<!-- BEGIN QUESTION -->

**Question 3.7:** Controlling for opening hours and the prices of soda and fries, what is the diff-in-diff estimator for the impact of raising the minimum wage in New Jersey? Remember to include units and give a brief interpretation of our findings following the SSS framework below. Be sure to mention how your $R^2$ changed and its practical implications.

*Note*: Please format your markdown nicely (like the following cell) to aid readability.

#### Sign, Size, and Significance (SSS) framework for interpreting regression outputs

##### 1. Sign

- **Expected Sign**: What sign did you expect the estimated parameter(s) to have? Why?
- **Actual Sign**: Does your estimate(s) have this sign (i.e., are you surprised or reassured by your results)?

##### 2. Significance

- **Statistical Significance**: Is the estimate(s) statistically different from zero?
- **T-Statistic**: What is the t-statistic of this hypothesis?

##### 3. Size

- **Effect on Dependent Variable**: How do changes in this variable affect the dependent variable according to your estimation?
- **Economic Significance**: Is this an economically meaningful effect size?

This framework is borrowed from Berkeley's EEP C118 course. See more [here.](https://are.berkeley.edu/courses/EEP118/spring2014/section/Handout4_2014.pdf)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<hr>

## Part 4: Visualizing the diff-in-diff estimator

Below, we plot the difference-in-differences estimates from question 3.5 and interpret our findings visually.

In [None]:
# First, calculate mean employment by group and time
mean_empl = dd_df.groupby(['Treatment', 'Post'])['Empl'].mean().reset_index()

# Now, plot these averages with lines to show the change from pre to post for each group
plt.figure(figsize=(14, 7))
sns.lineplot(x='Post', y='Empl', hue='Treatment', data=mean_empl, marker='o', palette='viridis')

plt.title('Difference-in-Differences: Employment Pre and Post Treatment')
plt.xlabel('Period (0 = Pre-Treatment, 1 = Post-Treatment)')
plt.ylabel('Average Employment')
plt.xticks(ticks=[0, 1], labels=['Pre-Treatment', 'Post-Treatment'])
plt.legend(title='Group / State', labels=['Control / PA', 'Treatment / NJ'])

# Adding annotations for clarity
for line in range(mean_empl.shape[0]):
    plt.text(mean_empl.Post[line]+0.02, mean_empl.Empl[line], 
             f"{mean_empl.Empl[line]:.2f}", horizontalalignment='left', 
             size='medium', color='black', weight='semibold')

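# Tidy the layout of the figure created above and display it.
# (A second plt.figure() call here would open a new, empty figure.)
plt.tight_layout()
plt.show()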

**Question 4.1:** Now, let's make sure we can visually interpret our findings. Match the regression output numbers below to the appropriate regression equation.

**Equations:**

a) $const$

b) $const + \text{Treatment}$

c) $const + \text{Post}$

d) $const + \text{Post} + \text{Treatment} + \text{Treatment} \times \text{Post}$

**Regression Output Numbers:**

1. $10.44$
2. $7.59$
3. $7.75$
4. $8.47$


In [None]:
a = ... # These should be either 10.44, or 7.59, or 7.75, or 8.47
b = ...
c = ...
d = ...


In [None]:
grader.check("q4_1")

**Question 4.2:** Using the numbers from the visualization above, calculate the difference-in-differences estimate. It should match your estimate from question 3.5.

In [None]:
DD_estimate = ...
DD_estimate

In [None]:
grader.check("q4_2")

<!-- BEGIN QUESTION -->

**Question 4.3:** Looking back at your response to question 1.1 and the plot above, what are some limitations of Card and Krueger's difference-in-differences setup?

*Hint*: Think of the assumptions used in Difference in Differences. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<hr>

### Conclusion
Congratulations, you have finished lab 7! We hope you enjoyed the lab - you're one step closer to becoming a master replicator of economics papers...

Have a great week!

Justin, Luis, and Dawson.


---
## Sources

*FOUNDATIONS OF PROGRAM EVALUATION III REGRESSION TOOLS FOR CAUSAL ANALYSIS* by Data Science for Public Service (https://ds4ps.org/PROG-EVAL-III/index.html), retrieved 22 Feb 2024.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)