# Difference-in-differences

## A little background

Differences in the outcome variable (Y) can be measured with a simple
comparison, either between groups (treatment vs control) or over time
(before vs after):

-   observations from regions that underwent intervention vs.
    observations from regions where the intervention did not take place
    -   treatment = intervention
    -   time = before/after
-   observations from a region before and after the intervention
    -   treatment = intervention
    -   synthetic control needed

**The main assumption of a simple before-after comparison is that,
without the change in the environment (intervention, policy, etc.), the
outcome variable would have remained constant!**

The **diff-in-diff** approach includes a before-after comparison for a
**treatment** and **control** group. This is a combination of:

-   a *cross-sectional comparison* (treated vs non-treated control
    group)
-   a **before-after** (*longitudinal*) **comparison** (treatment group
    with itself, before and after the treatment)

The before-after difference in the treatment group is corrected by
subtracting the before-after difference in the control group, which
removes any time trend common to both groups.
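
As a minimal sketch of this correction, the four group means can be
arranged in a 2x2 table and differenced twice. The numbers and labels
below are made up for illustration and do not come from any dataset.

```python
import pandas as pd

# Hypothetical group means of the outcome (illustrative numbers only)
means = pd.DataFrame(
    {"before": [10.0, 12.0], "after": [11.0, 16.0]},
    index=["control", "treatment"],
)

# Before-after difference within each group
change = means["after"] - means["before"]

# Diff-in-diff: change in the treatment group minus change in the control group
did = change["treatment"] - change["control"]

print(change)
print(f"diff-in-diff estimate: {did}")
```

With these numbers the naive before-after difference in the treatment
group is 4.0, of which 1.0 is the trend also seen in the control group,
leaving a diff-in-diff estimate of 3.0.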

To obtain an unbiased estimate of the treatment effect, one needs the
parallel trends assumption: without the change in the environment, the
change in the outcome variable would have been the same in the two
groups (the **counterfactual** outcome).
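
In symbols, the estimator and the assumption might be written as
follows. The notation is introduced here for illustration only:
$\bar{Y}_{g,t}$ is the mean outcome in group $g$ ($T$ = treatment, $C$ =
control) at time $t$, and $Y^{0}$ is the outcome that would be observed
without the intervention.

$$
\hat{\delta}_{\text{DiD}} = \left(\bar{Y}_{T,\text{after}} - \bar{Y}_{T,\text{before}}\right) - \left(\bar{Y}_{C,\text{after}} - \bar{Y}_{C,\text{before}}\right)
$$

$$
E\left[Y^{0}_{T,\text{after}} - Y^{0}_{T,\text{before}}\right] = E\left[Y^{0}_{C,\text{after}} - Y^{0}_{C,\text{before}}\right]
$$

The second line is the parallel trends assumption: the left-hand side is
never observed for the treated group, so the change in the control group
stands in for it.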

The validity of the diff-in-diff approach is closely related to the
similarity of the treatment and control groups. Hence, some plausibility
checks should be conducted:

-   compute placebo-diff-in-diff for periods without a change in the
    environment
-   for (longer) time series: check and demonstrate the parallel time
    trends
-   use an alternative control group (if available, or synthetic): the
    estimate should be similar
-   replace **y** by an alternative outcome which is known to be
    independent of the treatment (the diff-in-diff estimate should be
    close to 0)
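
The placebo check in the first bullet can be sketched on simulated data.
Everything below is an assumption made for illustration: the three
periods (`pre1`, `pre2`, `post`), the group levels, the common trend, the
noise, and the true effect of 2.0. The helper `diff_in_diff` is a
hypothetical function, not part of any library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000            # observations per group and period (assumed)
true_effect = 2.0   # assumed treatment effect, active only in period "post"

# Simulate two groups with different levels but the same time trend
rows = []
for group, level in [("control", 0.0), ("treatment", 1.0)]:
    for period, trend in [("pre1", 0.0), ("pre2", 0.5), ("post", 1.0)]:
        effect = true_effect if (group == "treatment" and period == "post") else 0.0
        y = level + trend + effect + rng.normal(0.0, 1.0, size=n)
        rows.append(pd.DataFrame({"group": group, "period": period, "y": y}))
data = pd.concat(rows, ignore_index=True)

def diff_in_diff(df, before, after):
    """Difference of before-after changes in group means (treatment minus control)."""
    m = df.groupby(["group", "period"])["y"].mean()
    change_t = m.loc[("treatment", after)] - m.loc[("treatment", before)]
    change_c = m.loc[("control", after)] - m.loc[("control", before)]
    return change_t - change_c

placebo = diff_in_diff(data, "pre1", "pre2")   # no intervention between these periods
actual = diff_in_diff(data, "pre2", "post")    # intervention takes effect in "post"
print(f"placebo estimate: {placebo:.3f}")
print(f"actual estimate:  {actual:.3f}")
```

Because the simulated groups share the same trend, the placebo estimate
should be close to zero while the actual estimate should be close to the
assumed effect. A placebo estimate far from zero on real data would cast
doubt on the parallel trends assumption.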

## Diff-in-diff by hand

First, we walk through the calculations without using an explicit model.

### Injury dataset

This dataset comes from the R package **wooldridge**
([here](https://cran.r-project.org/web/packages/wooldridge/index.html)).

-   **1980**: Kentucky introduced a new policy that raised the cap on
    weekly earnings covered by workers' compensation (more after-injury
    benefits for high earners).
-   **Research question**: did this new policy cause higher-earning
    workers to spend more time injured? (Generous benefits may lead
    higher-earning workers to be more reckless on the job, to claim
    that off-the-job injuries were incurred at work, or, at least for
    mild injuries, to prefer injury benefits to going back to work.)

In [None]:
import numpy as np ## arrays
import pandas as pd ## dataframes
import seaborn as sns ## plots
import statsmodels.api as sm ## statistical models
import matplotlib.pyplot as plt ## plots

In [None]:
url = "https://raw.githubusercontent.com/filippob/longitudinal_data_analysis/refs/heads/main/data/injury/injury.csv"
injury = pd.read_csv(url)

injury

### Preprocessing

Rename columns:

In [None]:
# Map original column names (keys) to more readable names (values)
injury = injury.rename(columns={'durat': "duration", 'ldurat': "log_duration",
                                'afchnge': "after_1980", 'highearn': "earnings"})

injury

-   `duration`: duration of injury benefits, measured in weeks

-   `log_duration`: `log(duration)` [natural logarithm]

-   `after_1980`: observation happened before (0) or after (1) the
    policy change in 1980. This is our time (or before/after) variable

    <!-- - `policy`: states that implemented (Kentucky, `1`) or not (other states, `0`) the policy on unemployment benefits //-->

-   `earnings` (originally `highearn`): observation is a low (0) or
    high (1) earner. This is our group (or treatment/control) variable:
    there was no change for low earners (same benefits), while high
    earners now have benefits that they did not have earlier

In [None]:
# Create 'earnings' as a categorical variable
injury['earnings'] = np.where(injury['earnings'] == 0, 'low-income', 'high-income')

# Create 'after_1980' as a categorical variable
injury['after_1980'] = np.where(injury['after_1980'] == 0, 'before', 'after')

injury
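
One side effect of recoding the 0/1 flags as plain strings is that the
labels carry no natural order, so tables and plots may list
"high-income" before "low-income" or "after" before "before". A possible
alternative, sketched here on a small made-up data frame rather than on
`injury`, is to store the labels as ordered categoricals. The `toy` data
frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data frame with the same 0/1 coding as the raw flags (made-up values)
toy = pd.DataFrame({"highearn": [1, 0, 1, 0], "afchnge": [0, 0, 1, 1]})

# Recode to labels and declare the intended order of the levels
toy["earnings"] = pd.Categorical(
    toy["highearn"].map({0: "low-income", 1: "high-income"}),
    categories=["low-income", "high-income"], ordered=True,
)
toy["after_1980"] = pd.Categorical(
    toy["afchnge"].map({0: "before", 1: "after"}),
    categories=["before", "after"], ordered=True,
)

# Group-by output follows the declared category order
print(toy.groupby(["earnings", "after_1980"], observed=True).size())
```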

### EDA

In [None]:
# Set up the FacetGrid
g = sns.FacetGrid(injury, col="earnings", height=4, aspect=1.2)

# Map a histogram with bin width = 8 and white edges
g.map_dataframe(sns.histplot, x="duration", binwidth=8, edgecolor="white", binrange=(0, injury["duration"].max()))

# Add titles and layout
g.set_axis_labels("Duration", "Count")
g.set_titles(col_template="{col_name}")
plt.tight_layout()
plt.show()

The distribution is strongly right-skewed, with most people in both
groups receiving benefits for only a few weeks. Taking the logarithm of
duration reduces this skew, making the distribution of $y$ closer to
Gaussian and hence more amenable to analysis with linear regression
models.

In [None]:
injury['log_duration'] = np.log(injury['duration'])

# Set up faceted histogram
g = sns.FacetGrid(injury, col="earnings", height=4, aspect=1.2)
g.map_dataframe(
    sns.histplot,
    x="log_duration",
    binwidth=0.5,
    edgecolor="white",
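    # start at the minimum: log(duration) is negative for durations under one week
    binrange=(injury['log_duration'].min(), injury['log_duration'].max())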
)

# Label axes and layout
g.set_axis_labels("Log(Duration)", "Count")
g.set_titles(col_template="{col_name}")
plt.tight_layout()
plt.show()
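
The effect of the log transform on skewness can also be checked
numerically. The sketch below uses simulated log-normal draws as a
stand-in for durations (an assumption: these are not the injury data) and
compares the sample skewness before and after taking logs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated right-skewed "durations" (log-normal draws, not the injury data)
duration = pd.Series(rng.lognormal(mean=1.0, sigma=1.2, size=5000))
log_duration = np.log(duration)

# Sample skewness: near 0 for a symmetric distribution, positive for a long right tail
skew_raw = duration.skew()
skew_log = log_duration.skew()
print(f"skewness of duration:      {skew_raw:.2f}")
print(f"skewness of log(duration): {skew_log:.2f}")
```

The same two calls, `injury['duration'].skew()` and
`injury['log_duration'].skew()`, could be applied to the real data.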

Let's plot average log-durations in the two groups, before and after the
implementation of the policy: we see that higher-income workers already
had a higher number of injury-benefit weeks before the new policy
(maybe workers who do riskier jobs are paid better). With the new
policy, this gap seems to widen.

In [None]:
# Fix the display order so points and means line up in every panel
group_order = ["low-income", "high-income"]

# Create plot, with the "before" panel on the left
g = sns.FacetGrid(injury, col="after_1980", col_order=["before", "after"],
                  height=4, aspect=1.2)

# Plot individual points
g.map_dataframe(
    sns.stripplot,
    x="earnings", y="log_duration", order=group_order,
    size=2, alpha=0.2, jitter=True
)

# Overlay group means, placed at the x positions given by group_order
def add_group_means(data, **kwargs):
    means = data.groupby("earnings")["log_duration"].mean().reindex(group_order)
    plt.scatter(range(len(group_order)), means.values, color='red', s=50, zorder=3)

g.map_dataframe(add_group_means)

# Label axes
g.set_axis_labels("Earnings group", "log(Duration)")
g.set_titles(col_template="{col_name}")

plt.tight_layout()
plt.show()