# Difference-in-differences

## A little background

A simple difference in the outcome variable (Y) can be taken either
across groups (treatment vs control) or over time (before vs after):

-   observations from regions that underwent intervention vs.
    observations from regions where the intervention did not take place
    -   treatment = intervention
    -   time = before/after
-   observations from a region before and after the intervention
    -   treatment = intervention
    -   synthetic control needed

**The main assumption of a simple before-after comparison is that,
without the change in the environment (intervention/policy etc.), the
outcome variable would have remained constant!**

The **diff-in-diff** approach includes a before-after comparison for a
**treatment** and **control** group. This is a combination of:

-   a *cross-sectional comparison* (treated vs non-treated control
    group)
-   a **before-after** (*longitudinal*) **comparison** (treatment group
    with itself, before and after the treatment)

The before-after difference in the treatment group is corrected by
subtracting the before-after difference in the control group, which
removes any time trend common to both groups (the trend problem).

To obtain an unbiased estimate of the treatment effect one needs to make
a parallel trend assumption. That means without the change in the
environment, the change in the outcome variable would have been the same
for the two groups (**counterfactual** outcome).
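As a minimal sketch of the arithmetic, assuming made-up group means (the numbers below are hypothetical and not taken from any dataset), the diff-in-diff estimate and the counterfactual implied by parallel trends could be computed like this:

```python
# Hypothetical mean outcomes (illustrative values only)
treat_before, treat_after = 10.0, 14.0
ctrl_before, ctrl_after = 8.0, 9.5

# Before-after change within each group
treat_change = treat_after - treat_before   # treatment effect + common trend
ctrl_change = ctrl_after - ctrl_before      # common trend only

# Diff-in-diff: treatment-group change corrected by the control-group change
did = treat_change - ctrl_change

# Counterfactual under parallel trends: where the treatment group would have
# ended up had it only followed the control group's trend
counterfactual = treat_before + ctrl_change

print(f"diff-in-diff estimate: {did}")
print(f"counterfactual treated outcome: {counterfactual}")
```

The gap between the observed post-treatment mean and the counterfactual is, by construction, the diff-in-diff estimate.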

The validity of the diff-in-diff approach is closely related to the
similarity of the treatment and control groups. Hence, some plausibility
checks should be conducted:

-   compute placebo-diff-in-diff for periods without a change in the
    environment
-   for (longer) time series: check and demonstrate the parallel time
    trends
-   use an alternative control group (if available, or synthetic): the
    estimate should be the same
-   replace **y** by an alternative outcome which is known to be
    independent of the treatment (the diff-in-diff estimator should be
    0)  
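The placebo check from the list above can be sketched on synthetic data (all column names, sample sizes and effect sizes here are hypothetical). A diff-in-diff between two pre-policy periods should be close to zero, while the one spanning the policy change should recover the simulated effect:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Synthetic data: two pre-policy periods (0, 1) and one post-policy period (2)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, size=n),
    "period": rng.integers(0, 3, size=n),
})
true_effect = 0.5
df["y"] = (
    1.0
    + 0.3 * df["treated"]                                 # fixed gap between groups
    + 0.2 * df["period"]                                  # common (parallel) time trend
    + true_effect * df["treated"] * (df["period"] == 2)   # effect only after the policy
    + rng.normal(0, 0.1, size=n)                          # noise
)

def diff_in_diff(data, before, after):
    """Diff-in-diff of mean y between two periods (treated minus control)."""
    means = data.groupby(["treated", "period"])["y"].mean()
    change_treated = means.loc[(1, after)] - means.loc[(1, before)]
    change_control = means.loc[(0, after)] - means.loc[(0, before)]
    return change_treated - change_control

placebo = diff_in_diff(df, before=0, after=1)  # no policy change in between
actual = diff_in_diff(df, before=1, after=2)   # spans the policy change

print(f"placebo diff-in-diff: {placebo:.3f}")
print(f"actual diff-in-diff:  {actual:.3f}")
```

A placebo estimate clearly different from zero would suggest that the two groups were already on diverging trends before the intervention.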

## Diff-in-diff by hand

First, we walk through the calculations without using an explicit model.

### Injury dataset

This dataset comes from the R package **wooldridge**
([here](https://cran.r-project.org/web/packages/wooldridge/index.html)).

-   **1980**: Kentucky raised the cap on weekly earnings covered by
    workers' compensation (more after-injury benefits for high
    earners).
-   **Research question**: has this new policy caused higher-earning
    workers to spend more time injured? More generous benefits may lead
    higher-earning workers to be more reckless on the job, to claim
    that off-the-job injuries were incurred while at work, or (for mild
    injuries at least) to prefer injury benefits to going back to work.

In [None]:
import numpy as np ## arrays
import pandas as pd ## dataframes
import seaborn as sns ## plots
import statsmodels.api as sm ## statistical models
import matplotlib.pyplot as plt ## plots

In [None]:
url = "https://raw.githubusercontent.com/filippob/longitudinal_data_analysis/refs/heads/main/data/injury/injury.csv"
injury = pd.read_csv(url)

injury

### Preprocessing

Rename columns:

In [None]:
# rename() maps old column names to new ones
injury = injury.rename(columns={'durat': "duration", 'ldurat': "log_duration",
                                'afchnge': "after_1980", 'highearn': "earnings"})

injury

-   `duration`: duration of injury benefits, measured in weeks

-   `log_duration`: `log(duration)` [natural logarithm]

-   `after_1980`: observation happened before (0) or after (1) the
    policy change in 1980. This is our time (or before/after) variable

-   `earnings` (originally `highearn`): observation is a low (0) or
    high (1) earner. This is our group (or treatment/control) variable:
    there was no change for low earners (same benefits), while higher
    earners now have benefits that they did not have earlier

In [None]:
# Create 'earnings' as a categorical variable
injury['earnings'] = np.where(injury['earnings'] == 0, 'low-income', 'high-income')

# Create 'after_1980' as a categorical variable
injury['after_1980'] = np.where(injury['after_1980'] == 0, 'before', 'after')

injury

In [None]:
injury["after_1980"] = pd.Categorical(injury["after_1980"], ["before", "after"])

### EDA

In [None]:
# Set up the FacetGrid
g = sns.FacetGrid(injury, col="earnings", height=4, aspect=1.2)

# Map a histogram with bin width = 8 and white edges
g.map_dataframe(sns.histplot, x="duration", binwidth=8, edgecolor="white", binrange=(0, injury["duration"].max()))

# Add titles and layout
g.set_axis_labels("Duration", "Count")
g.set_titles(col_template="{col_name}")
plt.tight_layout()
plt.show()

The distribution is strongly right-skewed: in both groups, most workers
fall in the lowest range of benefit weeks. Taking the logarithm of
duration reduces the skew, making the distribution of $y$ more
"gaussian", and hence better suited to linear regression models.
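The effect of the log transform on skewness can be checked numerically. This is a minimal sketch on simulated log-normal values standing in for `duration` (hypothetical data, not the injury dataset); note that `np.log` requires strictly positive values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Hypothetical right-skewed durations (log-normal draws)
duration = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=5000))

skew_raw = duration.skew()          # strongly positive: long right tail
skew_log = np.log(duration).skew()  # near zero: roughly symmetric

print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```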

In [None]:
injury['log_duration'] = np.log(injury['duration'])

# Set up faceted histogram
g = sns.FacetGrid(injury, col="earnings", height=4, aspect=1.2)
g.map_dataframe(
    sns.histplot,
    x="log_duration",
    binwidth=0.5,
    edgecolor="white",
    binrange=(0, injury['log_duration'].max())
)

# Label axes and layout
g.set_axis_labels("Log(Duration)", "Count")
g.set_titles(col_template="{col_name}")
plt.tight_layout()
plt.show()

Let's plot the log-durations in the two groups, with their averages,
before and after the implementation of the policy: higher-income workers
already had a higher number of injury-benefit weeks before the new
policy (maybe workers who do riskier jobs are paid better). With the new
policy, this gap seems to widen.

In [None]:
# Create plot
g = sns.FacetGrid(injury, col="after_1980", height=4, aspect=1.2)

# Plot individual points (explicit order, so that the mean overlay lines up)
earnings_order = ["low-income", "high-income"]
g.map_dataframe(
    sns.stripplot,
    x="earnings", y="log_duration",
    order=earnings_order,
    size=2, alpha=0.5, jitter=False
)

# Overlay group means, in the same order as the x-axis categories
def add_group_means(data, **kwargs):
    means = data.groupby("earnings")["log_duration"].mean().reindex(earnings_order)
    for i, mean_val in enumerate(means):
        plt.scatter(i, mean_val, color='red', s=50, zorder=3)

g.map_dataframe(add_group_means)

# Label axes
g.set_axis_labels("earnings", "log(Duration)")
g.set_titles(col_template="{col_name}")

plt.tight_layout()
plt.show()

In [None]:
# Compute group means and 95% confidence intervals
plot_data = (
    injury
    .groupby(['earnings', 'after_1980'])
    .agg(
        mean_duration=('log_duration', 'mean'),
        se_duration=('log_duration', lambda x: x.std(ddof=1) / np.sqrt(len(x)))
    )
    .reset_index()
)

# Add 95% confidence intervals
plot_data['upper'] = plot_data['mean_duration'] + 1.96 * plot_data['se_duration']
plot_data['lower'] = plot_data['mean_duration'] - 1.96 * plot_data['se_duration']

In [None]:
plot_data

In [None]:
## to suppress warnings with plots

import warnings
warnings.filterwarnings('ignore')

In [None]:

# Plot point estimates without error bars first
g = sns.FacetGrid(plot_data, col="after_1980", height=4, aspect=1.2)
g.map_dataframe(
    sns.pointplot,
    x="earnings", y="mean_duration",
    join=False, color="darkgreen", errorbar=None,
    order=["low-income", "high-income"]
)

# Function to add error bars correctly
def add_error_bars(data, **kwargs):
    ax = plt.gca()
    earnings_order = ["low-income", "high-income"]
    for i, row in data.iterrows():
        x_pos = earnings_order.index(row['earnings'])  # categorical x-position
        ax.errorbar(
            x=x_pos,
            y=row['mean_duration'],
            yerr=[[row['mean_duration'] - row['lower']], [row['upper'] - row['mean_duration']]],
            fmt='none', ecolor='darkgreen', capsize=4, linewidth=1
        )

# Apply error bar overlay
g.map_dataframe(add_error_bars)

# Label and layout
g.set_axis_labels("earnings", "mean duration (log scale)")
g.set_titles(col_template="{col_name}")
plt.tight_layout()
plt.show()

We can now see the change, in terms of number of weeks of injury
benefits (log scale) before and after the new policy, in the two groups:

In [None]:
# Ensure proper categorical order
plot_data['after_1980'] = pd.Categorical(plot_data['after_1980'], categories=["before", "after"], ordered=True)

# Create plot with grouped lines and CI bars
plt.figure(figsize=(6, 4))
for earning_group in ["low-income", "high-income"]:
    group_data = plot_data[plot_data["earnings"] == earning_group]
    sns.lineplot(
        data=group_data,
        x="after_1980",
        y="mean_duration",
        label=earning_group,
        marker="o",
        color="C0" if earning_group == "low-income" else "C1"
    )

    # Add error bars manually
    plt.errorbar(
        x=group_data["after_1980"],
        y=group_data["mean_duration"],
        yerr=[
            group_data["mean_duration"] - group_data["lower"],
            group_data["upper"] - group_data["mean_duration"]
        ],
        fmt='none',
        ecolor="C0" if earning_group == "low-income" else "C1",
        capsize=4,
        linewidth=1
    )

# Final plot formatting
plt.xlabel("policy")
plt.ylabel("mean duration (log scale)")
plt.legend(title="Earnings")
plt.tight_layout()
plt.show()

### Computing the diff-in-diff estimate

After having explored the data, we can now actually calculate the
estimate of the **difference in differences** for the two groups:
difference between after-before differences in high vs low income
workers.

In [None]:
diffs = (
    injury.groupby(['after_1980', 'earnings'], as_index=False).agg({'log_duration' : 'mean', 'duration' : 'mean'}).round(2)
)

diffs

#### After - before differences

In [None]:
dd = diffs.drop(columns='duration').pivot(index = "earnings", columns = 'after_1980', values = 'log_duration')
dd['diff'] = dd['after'] - dd['before']
dd

#### High income - low income differences

In [None]:
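# np.diff takes row[1] - row[0], i.e. low - high given the alphabetical row order;
# the sign is flipped to obtain high-income minus low-income for each column
df = pd.DataFrame(-np.diff(dd, axis=0))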
df.columns = ['before','after','diff']
df.rename(index={0:'diff'},inplace=True)
df

#### Difference in differences

In [None]:
dd = pd.concat([dd,df])
dd

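The **diff-in-diff estimate** is 0.20, which means that, under the
parallel trend assumption, the policy causes an increase in
injury-benefit duration of 0.20 on the log(weeks) scale. For
*log-linear models* ($\log(y) = \mu + \beta x + e$), this is a
multiplicative effect: duration is multiplied by
$e^{0.20} \approx 1.22$, i.e. an increase of about 22%, rather than a
fixed number of weeks.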
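Since the outcome is on the log scale, exponentiating the estimate gives a multiplicative factor on duration rather than an additive number of weeks. A minimal sketch of the back-transformation, using the rounded 0.20 value quoted above (variable names are illustrative):

```python
import math

# Diff-in-diff estimate on the log(duration) scale (rounded value from the text)
did_log = 0.20

ratio = math.exp(did_log)        # multiplicative effect on duration
pct_change = (ratio - 1) * 100   # the same effect expressed as a percentage

print(f"duration ratio: {ratio:.2f}")
print(f"percent change: {pct_change:.0f}%")
```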

The diff-in-diff estimate is shown graphically in the plot below: the
dashed gray line is the **counterfactual**.

In [None]:
from plotnine import (
    ggplot, aes, geom_point, geom_line, annotate, labs, theme_minimal
)

dd['earnings'] = dd.index
dd


In [None]:
g = (
    ggplot(diffs, aes(x='after_1980', y='log_duration', color='earnings')) +
    geom_point() +
    geom_line(aes(group='earnings')) +

    # Dashed segment: from before to after
    annotate("segment",
             x='before', xend='after',
             y=dd.iloc[0, 0], yend=dd.iloc[0, 1] - dd.iloc[2, 2],
             linetype='dashed', color='gray') +

    # Dotted vertical segment at 'after'
    annotate("segment",
             x='after', xend='after',
             y=dd.iloc[0, 1], yend=dd.iloc[0, 1] - dd.iloc[2, 2],
             linetype='dotted', color='blue') +

    # Label annotation for "Program effect", halfway along the dotted segment
    annotate("label",
             x='after',
             y=dd.iloc[0, 1] - (dd.iloc[2, 2] / 2),
             label='Program effect', size=8) +  # Size is larger in plotnine
    theme_minimal()
)

In [None]:
g.draw()

## Diff-in-diff: a regression model

Rather than calculating diff-in-diff by hand, we can use a regression
model which, besides simplifying the calculations, also allows for a
more flexible, powerful and robust analysis (e.g. accounting for
covariables).

$$
\log(\text{duration}) = \mu + \beta_1 \text{income} + \beta_2 \text{time} + \beta_3 (\text{income} \times \text{time}) + e
$$
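
To see why $\beta_3$ is the diff-in-diff estimate, one can write out the expected log-duration in each of the four cells of the model above. This sketch assumes the coding income = 1 for high earners, time = 1 after the policy change, and $E[e] = 0$:

$$
\begin{aligned}
E[\log(\text{duration}) \mid \text{income}=0, \text{time}=0] &= \mu \\
E[\log(\text{duration}) \mid \text{income}=0, \text{time}=1] &= \mu + \beta_2 \\
E[\log(\text{duration}) \mid \text{income}=1, \text{time}=0] &= \mu + \beta_1 \\
E[\log(\text{duration}) \mid \text{income}=1, \text{time}=1] &= \mu + \beta_1 + \beta_2 + \beta_3
\end{aligned}
$$

The after-before difference is $\beta_2$ for low earners and $\beta_2 + \beta_3$ for high earners, so the difference between the two differences is $(\beta_2 + \beta_3) - \beta_2 = \beta_3$.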

In [None]:
injury["earnings"] = pd.Categorical(injury["earnings"], ["low-income", "high-income"])

In [None]:
from statsmodels.formula.api import ols

res = ols('log_duration ~ earnings + after_1980 + earnings*after_1980', data=injury).fit()
print(res.summary())

We see that we got (approximately) the same value for the coefficient of
the interaction term as we did by hand (reassuring ;-)).
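
With only the two dummies and their interaction in the model, the interaction coefficient reproduces the by-hand diff-in-diff of cell means exactly. In the notebook the match is only approximate because the cell means were rounded to two decimals before differencing. A minimal sketch on synthetic data (hypothetical values, not the injury dataset), using plain NumPy least squares instead of statsmodels to keep it self-contained:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Synthetic 2x2 design
n = 1000
group = rng.integers(0, 2, size=n)   # 1 = treated group
time = rng.integers(0, 2, size=n)    # 1 = after the policy
y = 1.0 + 0.4 * group + 0.1 * time + 0.25 * group * time + rng.normal(0, 0.5, size=n)

# By hand: difference of before-after differences between cell means
def cell_mean(g, t):
    return y[(group == g) & (time == t)].mean()

did_by_hand = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Regression: intercept, group, time and their interaction
X = np.column_stack([np.ones(n), group, time, group * time])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_regression = beta[3]  # coefficient of the interaction term

print(f"by hand:    {did_by_hand:.4f}")
print(f"regression: {did_regression:.4f}")
```

Once covariables are added to the model, the two numbers no longer need to coincide, since the regression estimate is then adjusted for those covariables.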

## Exercise

We see that the R-squared is pretty low: the model explains only a small
fraction of the variability in log-duration, so the specification can
probably be improved:

**Q: what if we add covariables to the model?**

You can try with obvious ones:

-   sex (`male`)
-   `age`
-   `married`

In [None]:
injury.head()

In [None]:
## TASK1: do a bit of EDA: get a feel of the other variables, and decide which to use

In [None]:
## TASK2: fit a model with additional covariables: is the R-squared better?

In [None]:
## TASK3: re-test the interaction term: has it changed?