## Difference-in-Differences

TODO

### Libraries

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as ss
import statsmodels.formula.api as smf
from causalinference import CausalModel

from matplotlib import style
import matplotlib.pyplot as plt
import seaborn as sns
style.use("fivethirtyeight")

import warnings
warnings.filterwarnings("ignore")

### Data

For the sake of simplicity and learning, we are going to use an adapted version of a well-known dataset from Card and Krueger (1994).

The aim is to estimate the **causal effect of an increase in the state minimum wage on employment**. On April 1, 1992, New Jersey raised its state minimum wage from $4.25 to $5.05, while Pennsylvania's minimum wage stayed at $4.25.

Because the two states are geographically close and demographically similar, Pennsylvania can serve as a control group for New Jersey when answering this causal question.
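
As a sketch of where this setup leads, the difference-in-differences estimate compares the change over time in the treated state with the change in the control state. The symbols below are introduced here for illustration only: $\bar{Y}_{s,t}$ denotes mean employment per sampled restaurant in state $s$ (NJ or PA) at time $t$ (Feb or Nov).

$$
\hat{\delta}_{DiD} = \left(\bar{Y}_{NJ,\,Nov} - \bar{Y}_{NJ,\,Feb}\right) - \left(\bar{Y}_{PA,\,Nov} - \bar{Y}_{PA,\,Feb}\right)
$$

This quantity has a causal interpretation only under the parallel trends assumption: without the increase, employment in New Jersey would have changed by the same amount as in Pennsylvania.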

Variables:
- `state` is a flag: 1 for New Jersey (treatment), 0 for Pennsylvania (control).
- `total_emp_feb` is the total number of employees of the sampled restaurants in February 1992, before the increase.
- `total_emp_nov` is the total number of employees of the sampled restaurants in November 1992, after the increase.

In [None]:
df = pd.read_csv("../data/employment.csv")
df.head()

We need to reshape the dataset from wide to long format, so that each observation carries a flag for whether it was recorded before or after the increase:

In [None]:
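# Stack the February and November observations into one long table:
# each original row appears twice, once per period, flagged by `after_increase`.
df = pd.concat([
    df.rename(columns={"total_emp_feb": "emp"}).assign(after_increase=0).drop(columns=["total_emp_nov"]),
    df.rename(columns={"total_emp_nov": "emp"}).assign(after_increase=1).drop(columns=["total_emp_feb"]),
], axis=0, ignore_index=True)
df.head()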

Column names for the outcome, the treatment flag, and the time flag:

In [None]:
Y = "emp"
T = "state"
TIME = "after_increase"

### Theoretical Approach Step-By-Step

Let's apply what we've seen in the theoretical introduction.
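
A minimal sketch of the manual calculation, assuming the long format built above (a 0/1 treatment column and a 0/1 time column). The helper name `did_from_means` and the toy numbers are made up for illustration; they are not part of the original notebook or of the Card and Krueger data.

```python
import pandas as pd


def did_from_means(data, outcome, treatment, time):
    """Difference-in-differences from the four group/period means.

    Hypothetical helper: expects `treatment` and `time` to be 0/1 columns.
    """
    # Mean outcome per (treatment, time) cell; rows = group, columns = period.
    means = data.groupby([treatment, time])[outcome].mean().unstack(time)
    # Change over time within each group.
    change_treated = means.loc[1, 1] - means.loc[1, 0]
    change_control = means.loc[0, 1] - means.loc[0, 0]
    # The control group's change stands in for the treated group's
    # counterfactual trend (parallel trends assumption).
    return change_treated - change_control


# Made-up numbers, for illustration only (NOT the Card and Krueger data).
toy = pd.DataFrame({
    "state":          [0, 0, 0, 0, 1, 1, 1, 1],
    "after_increase": [0, 0, 1, 1, 0, 0, 1, 1],
    "emp":            [22.0, 24.0, 20.0, 22.0, 19.0, 21.0, 20.0, 22.0],
})

# Control: 23 -> 21 (change -2); treated: 20 -> 21 (change +1); DiD = 1 - (-2).
did_toy = did_from_means(toy, "emp", "state", "after_increase")
print(did_toy)  # 3.0
```

On the notebook's data, the same helper would be called as `did_from_means(df, Y, T, TIME)`.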

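The same estimate can also be read off a linear regression with an interaction term, $Y = \beta_0 + \beta_1\,\text{state} + \beta_2\,\text{after} + \beta_3\,(\text{state} \times \text{after}) + \varepsilon$, where $\beta_3$ is the difference-in-differences effect. The sketch below uses `statsmodels` (already imported above as `smf`) on made-up toy numbers, not on the Card and Krueger data; the variable names `toy`, `ols_fit`, `did_ols` and `se_ols` are illustrative.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up numbers, for illustration only (NOT the Card and Krueger data).
toy = pd.DataFrame({
    "state":          [0, 0, 0, 0, 1, 1, 1, 1],
    "after_increase": [0, 0, 1, 1, 0, 0, 1, 1],
    "emp":            [22.0, 24.0, 20.0, 22.0, 19.0, 21.0, 20.0, 22.0],
})

# `state * after_increase` expands to both main effects plus their interaction.
ols_fit = smf.ols("emp ~ state * after_increase", data=toy).fit()

# The interaction coefficient is the DiD estimate; OLS also provides a
# standard error, which the plain difference of means does not.
did_ols = ols_fit.params["state:after_increase"]
se_ols = ols_fit.bse["state:after_increase"]
print(f"DiD estimate: {did_ols:.2f} (std. error: {se_ols:.2f})")
```

With the notebook's column definitions, the formula could be built as `f"{Y} ~ {T} * {TIME}"` and fitted on `df`.
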
### Python package: `causalinference`

Let's take a look at how we can approach the same task with a third-party library.

In [None]:
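# NOTE: `X` was not defined anywhere above. As an assumption, the period flag
# is used here as the only matching covariate.
X = [TIME]

model = CausalModel(
    Y=df[Y].values.squeeze(),
    D=df[T].values.squeeze(),
    X=df[X].values,
)
model.est_via_matching(matches=1, bias_adj=True)
model.estimates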

### Our approach

In [None]:
# from causal_inference.linear import DiffInDiffEstimator

In [None]:
# m = DiffInDiffEstimator(
#     data=df,
#     outcome=Y,
#     treatment=T,
#     time_dimension=TIME,
# )
# m.fit()

In [None]:
# m.estimate_ate(plot_result=True)