# Fixed effects and first differencing versus diffs in diffs
I have always been a little confused about whether to use diffs in diffs or fixed effects when we have panel data with two time periods. In this notebook, I generate some fake data to compare the results of using one versus the other.

In [10]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [11]:
# generate mock data
sampsize = 1000
pre = np.random.normal(0,1, sampsize)
treat = np.random.binomial(1,.5,sampsize)
# optional: uncomment the line below to add a little extra noise just to the "treated" units,
# making them a bit different from the control units. note that this would not be the case in an RCT
# pre = pre + treat*np.random.normal(.1,.05,sampsize)

# generate noise to add to the difference between pre and post
# this is so that even without the treatment effect there would not be perfect correlation between pre and post
noise = np.random.normal(0,.15,sampsize)
effect = .1
post = pre + noise + effect*treat

# show the correlation between pre and post
np.corrcoef(pre,post)

array([[ 1.        ,  0.98791903],
       [ 0.98791903,  1.        ]])

In [12]:
# calculate the diffs in diffs estimate directly from the four cell means
post_treat = post[treat == 1].mean()
pre_treat = pre[treat == 1].mean()
post_control = post[treat == 0].mean()
pre_control = pre[treat == 0].mean()

# calculate the coeffs from a diffs in diffs model
# note that these should match the coeffs from the model below
did = (post_treat - pre_treat) - (post_control - pre_control)
print(pre_control)                 # intercept
print(post_control - pre_control)  # coefficient on the post-period dummy
print(pre_treat - pre_control)     # coefficient on the treatment dummy
print(did)                         # coefficient on the interaction

0.015534681524
0.00751526976216
-0.0414956251501
0.101513919793
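
The four numbers printed above are the cells of a 2x2 table of means rearranged into regression coefficients. As a minimal sketch of that correspondence, the block below regenerates data of the same form and checks that least squares on a constant, a post-period dummy, a treatment dummy, and their interaction reproduces the hand-computed values. The seed, the shorter variable names, and the use of `np.linalg.lstsq` instead of statsmodels are my own choices for illustration, not part of the original cells.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
n = 1000
pre = rng.normal(0, 1, n)
treat = rng.binomial(1, 0.5, n)
post = pre + rng.normal(0, 0.15, n) + 0.1 * treat

# Hand-computed quantities, in the same order as the prints above
by_hand = np.array([
    pre[treat == 0].mean(),                                # intercept
    post[treat == 0].mean() - pre[treat == 0].mean(),      # post-period effect
    pre[treat == 1].mean() - pre[treat == 0].mean(),       # baseline gap
    (post[treat == 1].mean() - pre[treat == 1].mean())
    - (post[treat == 0].mean() - pre[treat == 0].mean()),  # diffs in diffs
])

# Stack the two periods and fit the saturated 2x2 regression by least squares
y = np.concatenate([pre, post])
period = np.repeat([0, 1], n)   # n zeros (pre) followed by n ones (post)
treat2 = np.tile(treat, 2)      # treatment indicator repeated for both periods
X = np.column_stack([np.ones(2 * n), period, treat2, period * treat2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(by_hand, coefs))
```

Because the regression is saturated (one parameter per cell), the match is exact up to floating-point error rather than approximate.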


In [13]:
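# estimate the effect using a diffs-in-diffs model
import statsmodels.api as sm

# stack the pre and post observations into one long outcome vector
outcome = np.concatenate((pre, post), 0)
repeat_treat = np.concatenate((treat, treat), 0)
# period indicator: 0 = pre, 1 = post (named `period` so it does not shadow the built-in `round`)
period = np.concatenate((np.repeat(0, sampsize), np.repeat(1, sampsize)), 0)

# columns: period, treat, period x treat; add_constant prepends the intercept
X = np.stack((period, repeat_treat, period * repeat_treat), 1)
X = sm.add_constant(X)

model = sm.OLS(outcome, X)
results = model.fit()
print(results.summary())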

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9135
Date:                Thu, 16 Feb 2017   Prob (F-statistic):              0.434
Time:                        10:36:32   Log-Likelihood:                -2865.1
No. Observations:                2000   AIC:                             5738.
Df Residuals:                    1996   BIC:                             5761.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.0155      0.044      0.354      0.7

# Analytical results for variance from DiD model
For the DiD model above, my hunch was that the variance of the estimate would be equal to the sum of the variances of the four cell means calculated above. To make absolutely sure that this is the case, I calculated the term $(X'X)^{-1}$ (including the intercept) to see how the variance is calculated for the regression. This is fairly easy in this case since the columns of X are all vectors of 1s and 0s. (To test this out quickly, perform this operation on the X matrix above.) Sure enough, with one observation in each of the four cells, the diagonal term in this matrix corresponding to the treatment × post-period interaction is 4. More generally, it is the sum of the reciprocals of the four cell sizes, which is what the sum of the variances of four independent cell means implies.
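
As a quick numerical check of the $(X'X)^{-1}$ argument, here is a minimal sketch that rebuilds a design matrix of the same form and compares the interaction's diagonal element with the sum of the reciprocal cell sizes. The seed and variable names are assumptions made for the example, and the 4-row matrix at the end is a toy case with one observation per cell.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 1000
treat = rng.binomial(1, 0.5, n)
period = np.repeat([0, 1], n)
treat2 = np.tile(treat, 2)
X = np.column_stack([np.ones(2 * n), period, treat2, period * treat2])

XtX_inv = np.linalg.inv(X.T @ X)

# Number of observations in each (period, treatment) cell
cell_sizes = np.array([
    np.sum((period == p) & (treat2 == t)) for p in (0, 1) for t in (0, 1)
])

interaction_diag = XtX_inv[3, 3]
sum_of_reciprocals = np.sum(1.0 / cell_sizes)
print(interaction_diag, sum_of_reciprocals)

# Toy case: exactly one observation per cell
X_small = np.array([
    [1, 0, 0, 0],   # pre, control
    [1, 1, 0, 0],   # post, control
    [1, 0, 1, 0],   # pre, treated
    [1, 1, 1, 1],   # post, treated
], dtype=float)
small_diag = np.linalg.inv(X_small.T @ X_small)[3, 3]
print(small_diag)
```

Multiplying the diagonal element by the residual variance gives the classical variance of the interaction coefficient, so the two printed values for the full matrix should agree.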


In [14]:
# Estimate the same model using a first difference approach
delta_y = post - pre
delta_treat = treat
X = delta_treat
X = sm.add_constant(X)
model = sm.OLS(delta_y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.103
Model:                            OLS   Adj. R-squared:                  0.102
Method:                 Least Squares   F-statistic:                     114.8
Date:                Thu, 16 Feb 2017   Prob (F-statistic):           2.00e-25
Time:                        10:36:32   Log-Likelihood:                 482.84
No. Observations:                1000   AIC:                            -961.7
Df Residuals:                     998   BIC:                            -951.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.0075      0.006      1.163      0.2

# Key Results
For the very simple case with no covariates presented above:
1. The estimates of impact are always the same.
2. The standard errors are not. In the run above, where treatment is assigned at random (as you would expect in an RCT), the nonrobust standard error on the constant is 0.044 in the diffs in diffs regression and 0.006 in the first difference regression.
3. With or without a systematic difference between treat and control, the standard error of the first difference estimator can be much smaller than the standard error of the diffs in diffs estimator, because differencing removes each unit's persistent level while the pooled diffs in diffs regression treats a unit's pre and post observations as independent.
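
The comparison of estimates and standard errors can be rerun in a few lines. The sketch below is an illustration under my own assumptions (a fixed seed, a small hypothetical helper `ols`, and classical nonrobust standard errors computed by hand with NumPy rather than taken from statsmodels). It fits the pooled diffs in diffs regression and the first difference regression on one simulated draw and prints both impact estimates and their standard errors side by side.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and classical (nonrobust) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])  # residual variance
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))

rng = np.random.default_rng(2)  # arbitrary seed
n = 1000
pre = rng.normal(0, 1, n)
treat = rng.binomial(1, 0.5, n)
post = pre + rng.normal(0, 0.15, n) + 0.1 * treat

# Pooled diffs in diffs regression on the stacked data
y = np.concatenate([pre, post])
period = np.repeat([0, 1], n)
treat2 = np.tile(treat, 2)
X_did = np.column_stack([np.ones(2 * n), period, treat2, period * treat2])
beta_did, se_did = ols(X_did, y)

# First difference regression: change in outcome on a constant and treatment
X_fd = np.column_stack([np.ones(n), treat])
beta_fd, se_fd = ols(X_fd, post - pre)

print("impact estimates:", beta_did[3], beta_fd[1])
print("standard errors: ", se_did[3], se_fd[1])
```

Changing the standard deviation of `pre` or of the noise term is a simple way to see which source of variation each standard error responds to.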

One other thing to consider:

1. With fixed effects / first differencing, you can't include covariates that don't change over time, because they drop out when you take the difference.
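
To see why a time-invariant covariate cannot be used, here is a small sketch. The covariate `female` is hypothetical and not part of the simulation above; it is simply a unit characteristic that takes the same value in both periods.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 1000
treat = rng.binomial(1, 0.5, n)
female = rng.binomial(1, 0.5, n)  # hypothetical covariate, identical in pre and post

# Differencing a covariate that is the same in both periods gives all zeros
delta_female = female - female
all_zero = bool(np.all(delta_female == 0))
print(all_zero)

# Adding the differenced covariate to the first difference design matrix
# leaves a column of zeros, so the matrix loses rank
X_fd = np.column_stack([np.ones(n), treat, delta_female])
rank = np.linalg.matrix_rank(X_fd)
print(rank)  # 2 rather than 3: the covariate's coefficient is not identified
```

In the pooled diffs in diffs regression the same covariate would enter in levels and could be estimated, which is one practical difference between the two approaches.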

# More stuff to do
1. Ideally, I should derive the variance of each estimator analytically and see how they differ.
