In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
targeted_ads = pd.read_csv('targeted_ads.csv')
targeted_ads.head()

Unnamed: 0,user_id,age_below25,age_25to40,age_above40,treatment,past_rev,revenue
0,1,0,1,0,0,9.3268,8.701338
1,2,0,0,1,0,9.832103,8.952526
2,3,1,0,0,1,10.82672,13.562074
3,4,0,1,0,0,10.32761,8.143866
4,5,0,0,1,0,9.582078,10.420373


## Randomization Check

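First, we check whether treatment was randomly assigned by regressing the treatment indicator on the age-bin dummies (users below 25 are the omitted category). Under random assignment, the age dummies should not predict treatment.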

In [3]:
random_check_reg = smf.ols(formula = 'treatment ~ age_25to40 + age_above40', data = targeted_ads)
result = random_check_reg.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:              treatment   R-squared:                       0.351
Model:                            OLS   Adj. R-squared:                  0.350
Method:                 Least Squares   F-statistic:                     270.0
Date:                Thu, 21 Mar 2024   Prob (F-statistic):           1.96e-94
Time:                        00:22:29   Log-Likelihood:                -454.64
No. Observations:                1000   AIC:                             915.3
Df Residuals:                     997   BIC:                             930.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.7508      0.022     34.788      
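
Because the age dummies are saturated, this regression only restates group means: the intercept is the share of treated users in the omitted bin (below 25), and each dummy coefficient is the difference in treated share relative to that bin. The sketch below illustrates this on synthetic data. The data frame `df`, the `age_group` column, and the treatment probabilities are hypothetical stand-ins, not values taken from `targeted_ads.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values):
# the probability of treatment depends on the age bin.
rng = np.random.default_rng(0)
n = 1000
age_group = rng.choice(["below25", "25to40", "above40"], size=n)
p_treat = pd.Series(age_group).map({"below25": 0.75, "25to40": 0.25, "above40": 0.10})
df = pd.DataFrame({
    "age_group": age_group,
    "treatment": (rng.random(n) < p_treat.to_numpy()).astype(int),
})

# Share of treated users within each age bin.
share_by_age = df.groupby("age_group")["treatment"].mean()

# OLS of treatment on an intercept and two age dummies (below25 omitted).
X = np.column_stack([
    np.ones(n),
    (df["age_group"] == "25to40").to_numpy(dtype=float),
    (df["age_group"] == "above40").to_numpy(dtype=float),
])
beta, *_ = np.linalg.lstsq(X, df["treatment"].to_numpy(dtype=float), rcond=None)

print(share_by_age)
print("intercept (share treated, below 25):", beta[0])
print("intercept + dummy (share treated, 25 to 40):", beta[0] + beta[1])
```

If the treated shares differ sharply across bins, as they do in this synthetic example by construction, assignment is not random with respect to age.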

Another way to check randomization is to regress treatment on past revenue. Past revenue is determined before treatment, so under random assignment it should not predict treatment.

In [4]:
random_check_reg2 = smf.ols(formula = 'treatment ~ past_rev', data = targeted_ads)
result = random_check_reg2.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:              treatment   R-squared:                       0.035
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     36.28
Date:                Thu, 21 Mar 2024   Prob (F-statistic):           2.40e-09
Time:                        00:22:47   Log-Likelihood:                -653.21
No. Observations:                1000   AIC:                             1310.
Df Residuals:                     998   BIC:                             1320.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.4477      0.131     -3.406      0.0
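
A complementary view of the same association is to compare mean past revenue between treated and control users. The sketch below does this on synthetic data with a two-sample t-statistic (unequal variances). The assignment rule and all numbers are assumptions made for illustration, not properties of the real data set.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
past_rev = rng.normal(10.0, 1.0, size=n)

# Hypothetical non-random assignment: users with higher past revenue
# are more likely to be treated.
p = 1.0 / (1.0 + np.exp(-(past_rev - 10.5)))
treatment = (rng.random(n) < p).astype(int)

treated = past_rev[treatment == 1]
control = past_rev[treatment == 0]

# Difference in means and its standard error (unequal variances).
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
t_stat = diff / se

print(f"difference in mean past revenue: {diff:.3f}")
print(f"t-statistic: {t_stat:.2f}")
```

A large t-statistic here signals the same problem as a significant slope in the regression of treatment on past revenue: the two groups already differed before treatment.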

## Estimating the effect of the treatment

Ignoring the non-random assignment for now, we regress revenue on treatment alone.

In [5]:
effect_bias = smf.ols(formula = 'revenue ~ treatment', data = targeted_ads)
result = effect_bias.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                revenue   R-squared:                       0.255
Model:                            OLS   Adj. R-squared:                  0.254
Method:                 Least Squares   F-statistic:                     341.1
Date:                Thu, 21 Mar 2024   Prob (F-statistic):           9.51e-66
Time:                        00:23:08   Log-Likelihood:                -1811.0
No. Observations:                1000   AIC:                             3626.
Df Residuals:                     998   BIC:                             3636.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.8098      0.058    152.878      0.0

We now estimate a multivariate regression of revenue on treatment, controlling for the source of non-random assignment (the age bins), and compare the output to the univariate regression above.

In [6]:
effect = smf.ols(formula = 'revenue ~ treatment + age_25to40 + age_above40', data = targeted_ads)
result = effect.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                revenue   R-squared:                       0.311
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     150.2
Date:                Thu, 21 Mar 2024   Prob (F-statistic):           2.71e-80
Time:                        00:23:28   Log-Likelihood:                -1771.5
No. Observations:                1000   AIC:                             3551.
Df Residuals:                     996   BIC:                             3571.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       9.7054      0.120     80.955      
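
The gap between the univariate and multivariate treatment coefficients can be written out exactly. The identity below is the standard omitted-variable decomposition; the symbols $\hat\tau$, $\hat\gamma_k$ and $\hat\delta_k$ are notation introduced here for illustration and do not appear in the notebook.

$$
\hat\tau_{\text{univariate}} \;=\; \hat\tau_{\text{multivariate}} \;+\; \hat\gamma_{1}\,\hat\delta_{1} \;+\; \hat\gamma_{2}\,\hat\delta_{2}
$$

Here $\hat\gamma_1$ and $\hat\gamma_2$ are the coefficients on `age_25to40` and `age_above40` in the multivariate regression, and $\hat\delta_1$ and $\hat\delta_2$ are the slopes from regressing each age dummy on `treatment` (with an intercept). The two estimates coincide only if the age dummies are unrelated to treatment (all $\hat\delta_k = 0$) or unrelated to revenue (all $\hat\gamma_k = 0$).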

### Frisch-Waugh theorem
Let’s verify that the sequential estimation in the Frisch-Waugh theorem yields the same treatment coefficient as the multivariate regression.

**Step 1:** Regress treatment on the age-bin dummies and compute the residual from this regression. Then verify that this residual is uncorrelated with the age-bin dummies.

In [7]:
# Form the residual from the regression of treatment on the age-bin dummies
result_1 = smf.ols(formula = 'treatment ~ age_25to40 + age_above40', data = targeted_ads).fit()
targeted_ads['residuals_1'] = result_1.resid

# Regress the treatment residual on the age-bin dummies; all coefficients should be numerically zero
result_2 = smf.ols(formula = 'residuals_1 ~ age_25to40 + age_above40', data = targeted_ads).fit()
print(result_2.summary())

                            OLS Regression Results                            
Dep. Variable:            residuals_1   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                 7.798e-13
Date:                Thu, 21 Mar 2024   Prob (F-statistic):               1.00
Time:                        00:24:04   Log-Likelihood:                -454.64
No. Observations:                1000   AIC:                             915.3
Df Residuals:                     997   BIC:                             930.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    3.476e-16      0.022   1.61e-14      

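**Step 2:** Now regress revenue on the treatment residual from Step 1. The coefficient on the residual should match the treatment coefficient from the multivariate regression.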


In [9]:
result_3 = smf.ols(formula = 'revenue ~ residuals_1', data = targeted_ads).fit()
print(result_3.summary())

                            OLS Regression Results                            
Dep. Variable:                revenue   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.078
Method:                 Least Squares   F-statistic:                     84.98
Date:                Thu, 21 Mar 2024   Prob (F-statistic):           1.74e-19
Time:                        00:24:55   Log-Likelihood:                -1917.2
No. Observations:                1000   AIC:                             3838.
Df Residuals:                     998   BIC:                             3848.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       9.4295      0.052    180.994      
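
The two steps can also be checked end to end on synthetic data. The sketch below uses plain NumPy least squares rather than statsmodels; the helper `ols`, the data-generating process, and every coefficient value in it are hypothetical and chosen only to mimic the structure of this exercise.

```python
import numpy as np

def ols(X, y):
    """Return the OLS coefficients from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(42)
n = 1000

# Hypothetical data: the age bin drives both treatment and revenue.
age = rng.integers(0, 3, size=n)  # 0 = below 25, 1 = 25 to 40, 2 = above 40
age_25to40 = (age == 1).astype(float)
age_above40 = (age == 2).astype(float)
p_treat = np.array([0.75, 0.25, 0.10])[age]
treatment = (rng.random(n) < p_treat).astype(float)
revenue = (9.0 + 1.5 * treatment - 0.5 * age_25to40
           - 1.0 * age_above40 + rng.normal(0.0, 1.0, size=n))

const = np.ones(n)
controls = np.column_stack([const, age_25to40, age_above40])

# Multivariate regression: revenue on controls and treatment.
tau_full = ols(np.column_stack([controls, treatment]), revenue)[-1]

# Step 1: residualize treatment on the age-bin dummies.
resid_treat = treatment - controls @ ols(controls, treatment)

# Step 2: regress revenue on the treatment residual.
tau_fwl = ols(np.column_stack([const, resid_treat]), revenue)[1]

print("multivariate coefficient:", tau_full)
print("Frisch-Waugh coefficient:", tau_fwl)
```

The point estimates agree to numerical precision. The standard errors reported by the Step 2 regression generally do not match the multivariate ones, because revenue itself has not been residualized and the Step 2 error term still contains the variation explained by age.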

## Including “past revenue” as additional control

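We now regress the treatment residual from Step 1 on past revenue. If past revenue is unrelated to treatment once age is accounted for, its coefficient should be close to zero and statistically insignificant.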


In [10]:
result_4 = smf.ols(formula = 'residuals_1 ~ past_rev', data = targeted_ads).fit()
print(result_4.summary())

                            OLS Regression Results                            
Dep. Variable:            residuals_1   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.2143
Date:                Thu, 21 Mar 2024   Prob (F-statistic):              0.643
Time:                        00:25:18   Log-Likelihood:                -454.53
No. Observations:                1000   AIC:                             913.1
Df Residuals:                     998   BIC:                             922.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0496      0.108      0.460      0.6

Including past revenue as an additional control:

In [11]:
result_5 = smf.ols(formula = 'revenue ~ treatment + age_25to40 + age_above40 + past_rev', data = targeted_ads).fit()
print(result_5.summary())

                            OLS Regression Results                            
Dep. Variable:                revenue   R-squared:                       0.640
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     442.4
Date:                Thu, 21 Mar 2024   Prob (F-statistic):          5.08e-219
Time:                        00:25:26   Log-Likelihood:                -1447.1
No. Observations:                1000   AIC:                             2904.
Df Residuals:                     995   BIC:                             2929.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -0.2403      0.341     -0.704      
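
When a control is unrelated to treatment but strongly predicts the outcome, adding it typically leaves the treatment coefficient roughly unchanged while shrinking its standard error. The sketch below illustrates that pattern on synthetic data. It is a simplified setting (no age bins, treatment independent of past revenue by construction), and the helper `ols_with_se` and all parameter values are hypothetical.

```python
import numpy as np

def ols_with_se(X, y):
    """Return OLS coefficients and classical (nonrobust) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(7)
n = 1000

# Hypothetical data: past revenue predicts revenue but not treatment.
treatment = rng.integers(0, 2, size=n).astype(float)
past_rev = rng.normal(10.0, 1.0, size=n)
revenue = 1.5 * treatment + 1.0 * past_rev + rng.normal(0.0, 1.0, size=n)

const = np.ones(n)
b_short, se_short = ols_with_se(np.column_stack([const, treatment]), revenue)
b_long, se_long = ols_with_se(np.column_stack([const, treatment, past_rev]), revenue)

print(f"without past_rev: coef = {b_short[1]:.3f}, se = {se_short[1]:.3f}")
print(f"with past_rev:    coef = {b_long[1]:.3f}, se = {se_long[1]:.3f}")
```

In the notebook, the analogous comparison is between the treatment rows of the regressions with and without `past_rev`.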