# Part IV Panel Data
# Ch 8 Differences-in-Differences

David Card shared the 2021 Nobel prize in economics, in part for a DID analysis (with Alan Krueger) challenging the intuitive claim that increasing the minimum wage increases unemployment. The study used data from New Jersey and Pennsylvania when the former raised its minimum wage.

So far we have treated all data as if it were of the following kind.

**Definition:** <u>Cross-sectional</u> data is characterized by each unit appearing only once.

In this section we study the following.

**Definition:**  <u>Panel data</u> is characterized by units having several observations across time.

aka <u>longitudinal data</u>.

In panel data you can see what happens before and after treatment.  

In previous chapters we treated repeated observation data as cross-sectional (restaurant discount causing sales). This is sometimes referred to as pooled cross-section.



There are categories that fall between the two.

**Definition:** In <u>pooled cross-section data</u> one treats repeated observations of units as cross sectional. 

For example, in the restaurant example we considered the treatment effect ($ATE$) on sales (outcome $Y$) when we thought of whether a discount was offered on a given day as the treatment value ($T$).

**Definition:** In <u>repeated cross-sectional data</u> there are multiple time entries, but the units in each entry are not necessarily the same. 

Me: If $S_b$ is a sample of units from a population before a treatment, $S_a$ is a sample from the same population after the treatment, and $S_a \cap S_b = \{\}$, then you have repeated cross-sectional data.
 


## Panel Data

You often can’t control who sees your advertisements, especially in offline marketing. Did placing some billboards in a city bring value in excess of its costs? Try geo-experiments: deploy a marketing campaign to some geographical regions but not others and compare them with panel data methods. The unit is then a geographic area, and we make observations across time.

### Notation and Terminology:

- $t$ is time stamp, aka periods 
- $T$ is the number of periods
- $T_{\text{pre}}$ is the number of periods before intervention
    - the others are called post intervention periods 
- $\text{Post}=1$ when $t>T_{\text{pre}}$ (i.e. $\text{Post}$ is a dummy variable)
- $D=1$ denotes treatment.
- $W_{it} = D_i\mathbb{1}(t>T_{\text{pre}}) = D_i \text{Post}_t$ is the conjunction of treatment and post; $W_{it} = 1$ indicates that unit $i$ has received the treatment by period $t$.
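As a minimal sketch of this notation, the dummies $\text{Post}$ and $W$ can be built from a long-format table. The table, its column names (`unit`, `t`, `D`) and the value of `T_pre` are all assumptions made for illustration.

```python
import pandas as pd

T_pre = 2  # assumed number of pre-intervention periods (hypothetical)

# Hypothetical panel: 2 units observed over T = 4 periods
panel = pd.DataFrame({
    "unit": [1, 1, 1, 1, 2, 2, 2, 2],
    "t":    [1, 2, 3, 4, 1, 2, 3, 4],
    "D":    [1, 1, 1, 1, 0, 0, 0, 0],  # unit 1 belongs to the treated group
})

# Post = 1 when t > T_pre
panel["Post"] = (panel["t"] > T_pre).astype(int)
# W = D * Post: treated group AND post-intervention period
panel["W"] = panel["D"] * panel["Post"]

print(panel)
```

Only the rows of unit 1 in periods 3 and 4 end up with `W == 1`.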

**Definition:** In panel data literature treatment is referred to as an <u>intervention</u>. 

ATT is the quantity of interest in panel data applications. To see why, note that in the image below $Y(0)$ is observable in 3 of the 4 combinations of $D$ and $\text{Post}$. In the one where it is not observable, you have correlations (across units and across time) that allow imputation of $Y(0)|D=1,\text{Post}=1$. 
- it is easy to impute the following from the other 3 quadrants' observables
    -  $Y(0)|D=1,\text{Post}=1$  (upper right)
- it is hard to impute all three of the following from the remaining 1 quadrant's observable
    - $Y(1)|D=1,\text{Post}=0$  (upper left)
    - $Y(1)|D=0,\text{Post}=0$  (lower left)
    - $Y(1)|D=0,\text{Post}=1$  (lower right)


<img src="images/panel_data_factuals.png" width="300">

## ATT 
Goal: You want to understand the effect of the offline marketing campaign on the cities that got treated, after the treatment takes place:
$$
ATT:=\color{green}{E[Y(1) | D=1,\text{Post}=1]} −\color{red}{E[Y(0)| D=1,\text{Post}=1]}
$$

The first term is observable, since for treated units in the post period the factual outcome is $Y_{it}(1)$. 
The second term needs to be identified. 

## Canonical Difference-in-Differences

Consider the following image in which the slope of the potential outcome $Y(0)$ is the same for the treated and untreated; we want to identify the unobservable red point.

<img src="images/DID_identification.JPG" width="300"> 

The following identification is made under the assumption that the slope of the potential outcome $Y(0)$ is approximately the same for the treated and untreated groups.

The unobservable $\color{red}{E[Y(0)| D=1,\text{Post}=1]}$ is approximately the sum of
- pre-treatment baseline for treated units $\color{magenta}{E[Y(0)|D=1,Post =0]}$
- the time evolution from control $\delta = \color{blue}{E[Y(0)|D=0,\text{Post}=1]} - \color{purple}{E[Y(0)|D=0,\text{Post} = 0]}$

In each of those three terms the potential outcome in the expectation, $Y(0)$, is the observed outcome. Thus in moving to empirical expectations we can drop the $(0)$.
That is,
$$
\begin{array}{ll}
\color{red}{E[Y(0) | D =1, \text{Post}=1] }&
    \approx \color{magenta}{ \hat E[Y|D=1,\text{Post}=0]}\\
    &+( \color{blue}{\hat E[Y|D=0,\text{Post}=1]} 
        - \color{purple}{\hat E[Y|D=0,\text{Post} = 0]})
\end{array}
$$

Similarly, in the green term $Y(1)$ is the factual, so it can be replaced by $Y$ in the empirical estimate.
If this is substituted into the ATT, the result is a difference in differences:
$$
\begin{array}{lll}
\text{ATT} 
    & \approx (\color{green}{\hat E[Y | D=1,\text{Post}=1]} 
        - \color{magenta}{ \hat  E[Y|D=1,\text{Post}=0])}
            & \text{treated pre-post difference}\\
    & \, -( \color{blue}{\hat E[Y|D=0,\text{Post}=1]} 
        - \color{purple}{\hat E[Y|D=0,\text{Post} = 0]})
            & \text{untreated pre-post difference}
\end{array}
$$

To obtain the data for this calculation with pandas:
1. group by `post`
2. group by `treated`
3. aggregate by the mean of the outcome 

In what follows, the minimum date of each group is also kept.
```python
# Mean outcome and first date for each (treated, post) cell
did_data = (mkt_data
            .groupby(["treated", "post"])
            .agg({"downloads": "mean", "date": "min"}))
```
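Once the four cell means are available, the DID estimate is just the treated pre-post difference minus the untreated pre-post difference. The sketch below uses a small made-up table; the numbers and the column names `treated`, `post`, `downloads` are assumptions, not the real marketing data.

```python
import pandas as pd

# Hypothetical toy data (made-up numbers)
toy = pd.DataFrame({
    "treated":   [1, 1, 1, 1, 0, 0, 0, 0],
    "post":      [0, 0, 1, 1, 0, 0, 1, 1],
    "downloads": [50.0, 52.0, 57.0, 59.0, 30.0, 32.0, 33.0, 35.0],
})

# Mean outcome in each (treated, post) cell
means = toy.groupby(["treated", "post"])["downloads"].mean()

treated_diff = means.loc[(1, 1)] - means.loc[(1, 0)]  # 58 - 51 = 7
control_diff = means.loc[(0, 1)] - means.loc[(0, 0)]  # 34 - 31 = 3
did_estimate = treated_diff - control_diff            # 7 - 3 = 4

print("DID estimate of the ATT:", did_estimate)
```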

ME: note that DID can be viewed as a second (discrete) derivative
$$
\text{ATT} = \frac{\partial^2}{\partial D \partial \,\text{post}} E[Y|D,\text{post}]
$$

### DID as Outcome Growth Diff
Here is another way to think of DID. 

**Definition:** The <u>difference in the outcome across time</u> for unit $i$ is $Δy_i = E[ y_i |\text{Post} =1 ]− E[ y_i | \text{Post}=0]$.

Each line of the DID above (the treated pre-post difference and the untreated pre-post difference) is a difference in outcome across time, but for a different value of the treatment $D$. 
That is, the DID estimate of ATT is
$$
\begin{array}{ll}
\text{ATT} & := E[Y(1) | D=1,\text{Post}=1] −E[Y(0)| D=1,\text{Post}=1]\\
           & \stackrel{\text{DID}}{\approx} E_i[ Δy_{i,D=1} − Δy_{i,D=0} ] \\
           & \stackrel{\text{empirical}}{\approx} E [Δy | D = 1 ]− E [Δy | D = 0]
\end{array}
$$
where going to an empirical estimate takes advantage of the fact that the terms match up with observable quantities. 
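The growth view can be sketched as follows: compute $Δy_i$ for each unit, then compare its average between the treated and control groups. The panel and its column names (`unit`, `D`, `post`, `y`) are hypothetical.

```python
import pandas as pd

# Hypothetical panel: 4 units, each observed twice before and twice after
df = pd.DataFrame({
    "unit": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    "D":    [1] * 8 + [0] * 8,
    "post": [0, 0, 1, 1] * 4,
    "y":    [10, 12, 17, 19, 20, 22, 25, 27, 5, 7, 8, 10, 15, 17, 18, 20],
})

# Mean outcome per unit, one column for post=0 and one for post=1
unit_means = df.pivot_table(index=["unit", "D"], columns="post",
                            values="y", aggfunc="mean")

# Delta y_i = E[y_i | Post=1] - E[y_i | Post=0]
delta_y = (unit_means[1] - unit_means[0]).rename("delta_y").reset_index()

# Average growth by treatment group, then the difference between groups
growth = delta_y.groupby("D")["delta_y"].mean()
att_hat = growth.loc[1] - growth.loc[0]

print(delta_y)
print("ATT estimate:", att_hat)
```

Here the treated units grow by 7 and 5, the control units by 3 and 3, so the estimate is $6 - 3 = 3$.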

### DID OLS

$$
Y_{it} = β_0 + β_1 D_i + β_2 \text{Post}_t + β_3 D_i \text{Post}_t + e_{it}
$$

The parameter estimate $\hat{\beta}_3 \approx \text{ATT}$ (in agreement with my second derivative idea).

Each of the four DID terms is a different sum of $\beta$ coefficients


$$
\begin{array}{lll}
\text{ATT} & \approx (E[Y | D=1,\text{Post}=1] - E[Y|D=1,\text{Post}=0])& \text{treated pre-post difference}\\
    &\,-(E[Y|D=0,\text{Post}=1] - E[Y|D=0,\text{Post} = 0])& \text{untreated pre-post difference} \\
    & = (\beta_0 + \beta_1+\beta_2 +\beta_3) - (\beta_0 +\beta_1) \\
    &- [(\beta_0 + \beta_2) - \beta_0]\\
    &=\beta_3
\end{array}
$$
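A quick numerical check of the algebra above, on made-up numbers: the coefficient on the interaction term of the saturated regression coincides with the DID of the four cell means. The sketch uses `numpy.linalg.lstsq` to stay self-contained; with the statsmodels formula interface the same model could be written as `Y ~ D * post`. All data here is hypothetical.

```python
import numpy as np

# Hypothetical toy data (made-up numbers)
D = np.array([1, 1, 1, 1, 0, 0, 0, 0])
post = np.array([0, 0, 1, 1, 0, 0, 1, 1])
Y = np.array([50.0, 52.0, 57.0, 59.0, 30.0, 32.0, 33.0, 35.0])

# Design matrix columns: intercept, D, Post, D * Post
X = np.column_stack([np.ones_like(Y), D, post, D * post])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)


def cell_mean(d, p):
    """Mean outcome in the cell with D == d and Post == p."""
    return Y[(D == d) & (post == p)].mean()


# DID computed directly from the four cell means
did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print("beta_3:", beta[3])
print("DID of means:", did)
```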

### Diff-in-Diff with Fixed Effects

This is another way to think about DID: a model with time- and unit-fixed effects (two-way fixed effects, or TWFE). 

Model
$$
Y_{it} = τ W_{it} + α_i + γ_t + e_{it}
$$

where 
- $W_{it} = D_i \text{Post}_t$
- $\tau$ is the treatment effect; it will match the ATT estimate from the DID OLS above
- $\alpha_i$ is unit fixed effect
- $\gamma_t$ is time fixed effects
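One way to see what the fixed effects do, sketched on a hypothetical balanced panel: subtracting unit means and time means (and adding back the grand mean) from both $Y$ and $W$ removes $α_i$ and $γ_t$, and a simple regression on the transformed variables gives $\hat τ$. The data and column names are assumptions, and this double demeaning shortcut is only valid for a balanced panel.

```python
import numpy as np
import pandas as pd

# Hypothetical balanced panel: 4 units x 4 periods, units 1-2 treated from t = 3
df = pd.DataFrame({
    "unit": np.repeat([1, 2, 3, 4], 4),
    "t":    np.tile([1, 2, 3, 4], 4),
    "y":    [10, 12, 17, 19, 20, 22, 25, 27, 5, 7, 8, 10, 15, 17, 18, 20],
})
df["W"] = ((df["unit"] <= 2) & (df["t"] >= 3)).astype(int)


def within(s):
    """Two-way within transformation (balanced panels only)."""
    return (s
            - s.groupby(df["unit"]).transform("mean")
            - s.groupby(df["t"]).transform("mean")
            + s.mean())


y_dd = within(df["y"].astype(float))
w_dd = within(df["W"].astype(float))

# Slope of the demeaned outcome on the demeaned treatment indicator
tau_hat = (w_dd * y_dd).sum() / (w_dd ** 2).sum()
print("tau_hat:", tau_hat)
```

On this toy panel the result equals the plain DID of cell means, $(22 - 16) - (14 - 11) = 3$.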

### Multiple Time Periods

**Definition:** In a  <u>block design</u> a comparison is made between (i) a group of units that are never treated and (ii) a group of units that are eventually treated at the same time period.

That is, you can’t have the treatment rolling out to units at different moments. This gives the following block form for $W(t)$:

<img src="images/block.png" width="300">
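Before relying on the canonical DID setup it can be useful to check that the treatment indicator really has this block form. The helper below is a hypothetical sketch (the function name and the column names `unit`, `t`, `W` are assumptions): it checks that some units are never treated, that all treated units start in the same period, and that treatment never switches off.

```python
import pandas as pd


def is_block_design(df, unit="unit", time="t", w="W"):
    """Check for a block design; column names are assumptions."""
    treated_rows = df[df[w] == 1]
    if treated_rows.empty:
        return False  # nobody is ever treated
    first_treated = treated_rows.groupby(unit)[time].min()
    has_never_treated = df[unit].nunique() > first_treated.size
    same_start = first_treated.nunique() == 1
    # Once treated, a unit must stay treated in all later periods
    start = df[unit].map(first_treated)  # NaN for never-treated units
    stays_on = df.loc[df[time] >= start, w].eq(1).all()
    return bool(has_never_treated and same_start and stays_on)


block = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "t":    [1, 2, 3] * 3,
    "W":    [0, 1, 1, 0, 1, 1, 0, 0, 0],
})
# Same panel, but unit 2 now starts treatment one period later
staggered = block.assign(W=[0, 1, 1, 0, 0, 1, 0, 0, 0])

print(is_block_design(block))      # all treated units start at t = 2
print(is_block_design(staggered))  # units start at different periods
```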

### DID Causal Inference 

The confidence intervals (CIs) for $\text{ATT} = \tau$ from the regression methods just discussed are probably wrong due to 
- having multiple points for $i$ with $\text{Post}=0$. 
- having multiple points for $i$ with $\text{Post}=1$. 

The $NT$ data points are not IID. The correct standard error will come from treating your sample size as $N$, for $N$ clusters each of size $T$.

Statsmodels' `fit` methods provide the keyword argument `cov_type="cluster"`. 

ChatGPT says the Statsmodels fit options for `cov_type` include
- "nonrobust": This is the default option. It calculates standard errors assuming homoscedasticity and no correlation between observations. It does not account for heteroscedasticity or correlation in the data.
- "cluster": This option allows for cluster-robust standard errors. It computes standard errors that account for clustering of observations. You need to specify the clustering structure using another parameter, `cov_kwds` (for example `cov_kwds={'groups': ...}`, as in the code below).

The general idea: clustering the errors will typically give you wider confidence intervals than no clustering at all. 

```python
import statsmodels.formula.api as smf

# With clustering: standard errors clustered by city (the unit)
m = smf.ols(
    'downloads ~ treated:post + C(city) + C(date)', data=mkt_data
).fit(cov_type='cluster', cov_kwds={'groups': mkt_data['city']})

print("ATT:", m.params["treated:post"])
CI = m.conf_int().loc["treated:post"].values
width = CI[1] - CI[0]
print(f"CI: {CI} of width {width}")
```
```text
ATT: 0.6917359536407139
CI: [0.29610141 1.0873705 ] of width 0.7912690882307024
```

```python
# Without clustering
mod = smf.ols('downloads ~ treated:post + C(city) + C(date)',data=mkt_data )
m=mod.fit()

print("ATT:", m.params["treated:post"])
CI = m.conf_int().loc["treated:post"].values
width = CI[1]-CI[0]
print(f"CI: {CI} of width {width}")
```
```text
ATT: 0.6917359536407139
CI: [0.47801441 0.9054575 ] of width 0.4274430848117712
```

Note that if you want to bootstrap, you need to sample (with replacement) entire units, where a unit is now a cluster containing all the observations of one $i$. This procedure is called <u>block bootstrap</u>. 
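A minimal sketch of a block bootstrap for the DID estimate, on simulated data. Everything here is an assumption made for illustration: the column names, the simulated effect of 2.0, and the choice to resample treated and control units separately so that every resample contains both groups.

```python
import numpy as np
import pandas as pd


def did_estimate(df):
    """DID of the four cell means (column names are assumptions)."""
    m = df.groupby(["treated", "post"])["y"].mean()
    return (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])


def block_bootstrap(df, unit_col="unit", n_boot=200, seed=0):
    """Resample whole units with replacement and re-estimate each time."""
    rng = np.random.default_rng(seed)
    groups = {u: g for u, g in df.groupby(unit_col)}
    # Unit ids per arm, so each resample keeps treated and control units
    arms = [df.loc[df["treated"] == d, unit_col].unique() for d in (0, 1)]
    estimates = []
    for _ in range(n_boot):
        drawn = np.concatenate(
            [rng.choice(a, size=len(a), replace=True) for a in arms])
        sample = pd.concat([groups[u] for u in drawn], ignore_index=True)
        estimates.append(did_estimate(sample))
    return np.array(estimates)


# Simulated panel: 20 units x 6 periods, first 10 units treated from t = 3
rng = np.random.default_rng(42)
rows = []
for u in range(20):
    treated = int(u < 10)
    for t in range(6):
        post = int(t >= 3)
        y = treated + 0.5 * post + 2.0 * treated * post + rng.normal(0, 1)
        rows.append({"unit": u, "treated": treated, "post": post, "y": y})
toy = pd.DataFrame(rows)

boot = block_bootstrap(toy)
ci = np.percentile(boot, [2.5, 97.5])
print("point estimate:", did_estimate(toy))
print("95% block-bootstrap CI:", ci)
```

Resampling individual rows instead of whole units would ignore the within-unit correlation, which is exactly the problem clustering is meant to address.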

## Identification Assumptions

It’s time to dig a little deeper into the assumptions you make when using DID.

### Parallel trends Assumption

If all you had was unit treatment assignments (no time dimension), you would
- use $Y|D=0$ to estimate the counterfactual $Y(0)|D=1$.

If all you had was the time dimension, but no control group (all units were treated at some point in time), you would
- use the past $Y|\text{Post} = 0 ,D=1$ of the treated units as the counterfactual, in a sort of before-and-after comparison to $Y|\text{Post} = 1 ,D=1$.

Difference-in-differences makes a weaker assumption.

**Definition:** The <u>parallel trends</u> assumption is that the trajectory of outcomes across time would be the same, on average, for treatment and control groups, in the absence of the treatment;
$E[\color{red}{Y(0)_{i t=1}}−Y(0)_{it=0} |  D=1] =E[Y(0)_{i t=1}−Y(0)_{it=0} | D=0]$.

This assumption is untestable because it contains a term that is nonobservable: $E [Y (0)_{it = 1} | D = 1 ]$.

I discussed this assumption above. 

<img src="images/DID_identification.jpg" width="300">

Here I visualize violations. 

<img src="images/parallel_trends_violations.JPG" width="400">


Note that the parallel trends assumption is not scale invariant: if you non-linearly rescale the outcome (for example, by taking logs), trends that were parallel may no longer be. In “When Is Parallel Trends Sensitive to Functional Form?”, Jonathan Roth and Pedro Sant’Anna derive a stricter version of parallel trends that is invariant to monotonic transformations of the outcome and discuss in which situations that assumption is plausible.
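A tiny numerical illustration of the scale sensitivity, with made-up group means for $Y(0)$: the two groups have identical growth in levels but different growth in logs.

```python
import numpy as np

# Hypothetical mean untreated outcomes Y(0), [pre, post] (made-up numbers)
control = np.array([10.0, 20.0])
treated = np.array([20.0, 30.0])  # counterfactual path of the treated group

# In levels both groups grow by 10, so parallel trends holds
gap_levels = (treated[1] - treated[0]) - (control[1] - control[0])

# In logs the growth differs: log(30/20) versus log(20/10)
gap_logs = np.diff(np.log(treated))[0] - np.diff(np.log(control))[0]

print("trend gap in levels:", gap_levels)
print("trend gap in logs:", gap_logs)
```

The gap in logs is $\log(1.5) - \log(2) = \log(0.75) \approx -0.29$, so a DID on the log outcome would be biased even though the DID in levels is not.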


Comparison: 
- CIA (the conditional independence assumption) states that the level of $Y(0)$ is the same, on average, in the treated and control groups when confounding factors are held constant; $Y(0) \perp D |X$
- parallel trends states that the growth of $Y(0)$ is the same, on average, between treated and control groups; $\Delta Y(0)\perp D$. (There is no conditional part here.)

Here lies the power of panel data: even if the treatment is not randomly assigned, so long as the treated and control groups have the same potential outcome growth, the ATT can be identified.

You can relax the parallel trends assumption to be conditioned on covariates; $\Delta Y(0)\perp D|X$.

### No Anticipation Assumption

The no anticipation assumption is more closely related to the stable unit treatment value assumption (SUTVA). SUTVA violations happen when the effect spills over from treated units into control units, or vice versa.

Anticipation is when the effect spills over to periods when the treatment hasn’t yet taken place. 

SUTVA is still a big issue in panel data analysis, especially when the unit is a geographic region.

Now we will talk about three assumptions that are all implied by the technical and obscure condition called strict exogeneity.

### 1 No Time Varying Confounders

Confounders need to be constant over time.

By zooming in on a unit and tracking how it evolves over time, you are already controlling for anything that is fixed over time. That includes any time-fixed confounders, even those that are unmeasured. 

### 2 No Feedback

Treatment cannot be decided based on the outcome trajectory.

To condition on past outcomes, you need to look into methods that work under sequential ignorability. There is more and more to causal inference.

### 3 No Carryover

Past treatment cannot affect the current outcome: the outcome at period $t$ depends only on the treatment status at period $t$.

Me: e.g. previous ad campaigns still affecting the outcome when you are trying to measure the effect of the most recent one.

This can be relaxed. If you believe that treatment at period $t-1$ impacts the outcome at period $t$, you can use the following model:
$Y_{it} =τ_{it}W_{it}+θ W_{i,t−1}+α_i+γ_t+e_{it}$.
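To fit such a model you first need the lagged treatment $W_{i,t-1}$ as a column. A minimal sketch, assuming a hypothetical long-format panel with columns `unit`, `t`, `W`, and assuming that a unit counts as untreated before its first observed period:

```python
import pandas as pd

# Hypothetical panel: unit 1 is treated from t = 3, unit 2 never
panel = pd.DataFrame({
    "unit": [1, 1, 1, 1, 2, 2, 2, 2],
    "t":    [1, 2, 3, 4, 1, 2, 3, 4],
    "W":    [0, 0, 1, 1, 0, 0, 0, 0],
})

panel = panel.sort_values(["unit", "t"])
# Lag W within each unit so one unit's history never leaks into another's
panel["W_lag1"] = panel.groupby("unit")["W"].shift(1)
# The first period of each unit has no lag; filling with 0 is an assumption
panel["W_lag1"] = panel["W_lag1"].fillna(0).astype(int)

print(panel)
```

The column `W_lag1` can then enter the regression as an additional regressor alongside $W_{it}$.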


There should also be no lagged dependent variable, meaning that the past outcome doesn’t directly cause the current outcome.

## Effect Dynamics over Time 

Often, it takes some time for the treatment to reach its full effect. In that case you might be underestimating the final treatment effect, because you are including periods where it hasn’t fully matured yet.

Skipping the rest of the CH 

# Ch 9 Synthetic Control

DID works great if you have a relatively large number of units $N$ compared to time periods $T$, but it falls short when the reverse is true. In contrast, synthetic control was designed to work with very few, even one, treated unit.

It is an incredibly clever way to use (but not condition on) past outcomes in order to estimate $E[ Y(0) |D = 1, \text{Post} = 1 ]$.

Stopped reading here.

