## Setup
All the imports. simutils contains a set of functions I created for generating synthetic data for these experiments.

In [1]:
#!pip install statsmodels
#!pip install scikit-learn
#!pip install numpy
#!pip install pandas
#!pip install matplotlib
#!pip install scipy

import simutils as sim
import numpy as np
import pandas as pd

## Predictor Variables of Non-Interest
First, create some synthetic values for age, credit score, and an arbitrary, hypothetically relevant personality characteristic. Most of these variables are nuisance or confounding variables, but they will partially determine both the predictor of interest and the dependent variable.

Age will be normally distributed (M=35, SD=10) and range from 18 to 78. Credit scores are 700-730 on average, but tend to be lower for younger people, and are reduced by up to 40 points for the youngest, relative to the oldest. The hypothetical personality variable will stand in for whatever set of normally distributed intrinsic characteristics (e.g., inquisitiveness, intelligence) might drive the behavior of interest.

In [2]:
N=50000
ages=sim.generate_ages(35, 10, N)
credit_scores=sim.generate_credit_scores(ages)
personality=np.random.normal(loc=0, scale=1, size=N)
#create data frame out of factors
Z=pd.DataFrame({
    'age': ages,
    'credit_score': credit_scores,
    'personality': personality
    })
Z.head()

| | age | credit_score | personality |
|---|---|---|---|
| 0 | 38 | 721 | -2.989375 |
| 1 | 24 | 631 | -0.068916 |
| 2 | 42 | 751 | -0.902461 |
| 3 | 44 | 763 | 0.195817 |
| 4 | 18 | 572 | -0.119864 |
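
simutils is not shown here, so the following is only a rough sketch of how generators matching the description above might look. The function names, the redraw-until-in-range approach, the 730 base score, and the noise level are all assumptions for illustration, not the actual simutils code.

```python
import numpy as np


def sketch_generate_ages(mean, sd, n, low=18, high=78, rng=None):
    """Hypothetical stand-in: normal ages, redrawn until all fall in [low, high]."""
    rng = np.random.default_rng() if rng is None else rng
    ages = rng.normal(mean, sd, size=n)
    out_of_range = (ages < low) | (ages > high)
    while out_of_range.any():
        # Redraw only the values that fell outside the allowed range.
        ages[out_of_range] = rng.normal(mean, sd, size=out_of_range.sum())
        out_of_range = (ages < low) | (ages > high)
    return np.rint(ages).astype(int)


def sketch_generate_credit_scores(ages, base=730, max_penalty=40, noise_sd=25, rng=None):
    """Hypothetical stand-in: the youngest lose up to max_penalty points vs. the oldest."""
    rng = np.random.default_rng() if rng is None else rng
    ages = np.asarray(ages)
    age_span = max(ages.max() - ages.min(), 1)
    youth = (ages.max() - ages) / age_span  # 1 for the youngest, 0 for the oldest
    scores = base - max_penalty * youth + rng.normal(0, noise_sd, size=len(ages))
    return np.rint(scores).astype(int)


rng = np.random.default_rng(0)
sketch_ages = sketch_generate_ages(35, 10, 10000, rng=rng)
sketch_scores = sketch_generate_credit_scores(sketch_ages, rng=rng)
print(sketch_ages.min(), sketch_ages.max())
```

The point of the sketch is only the dependency structure: credit score is partly a function of age, which is what makes both of them confounders later on.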


## Predictor of Interest
The independent measure will be the number of recorded interactions with product information pages, obtained from cookies, since monitoring was implemented.

Our goal will be to determine the causal relationship between clicking behavior and the dependent variable, personal wealth. This relationship can be quantified by computing the coefficient or weight in a regression or machine learning model that predicts wealth from clicks.

Clicking behavior will be influenced by all three of the variables in Z. These influences will be non-linear, which would make a linear regression-based approach a poor choice, even though the total number of clicks is simply the sum of the three influences plus a random baseline. I will explicitly record each confound's contribution to an individual's clicking behavior, making it easier to establish the ground truth.

In [3]:
from scipy.stats import randint

clicks=sim.generate_clicks(Z)
clicks.columns=['age_delta', 'csv_delta', 'personality_delta']
#external random factors are responsible for the baseline number of interactions
clicks['baseline'] = randint.rvs(10,50, size=N)
clicks['total'] = clicks.apply(np.sum, axis=1)
clicks.head()

| | age_delta | csv_delta | personality_delta | baseline | total |
|---|---|---|---|---|---|
| 0 | 38 | 7 | -19 | 22 | 48 |
| 1 | 37 | 34 | 9 | 36 | 116 |
| 2 | 47 | 12 | 0 | 17 | 76 |
| 3 | 69 | 16 | 11 | 35 | 131 |
| 4 | -5 | 45 | 8 | 11 | 59 |


From the perspective of an all-knowing oracle, we can isolate the total number of clicks for each person that were not attributable to either age or credit score value:

In [4]:
clean_clicks=pd.DataFrame(clicks['personality_delta']+clicks['baseline'])
clean_clicks.columns=['i_total']
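
Since sim.generate_clicks is not shown, here is a rough sketch of the structure described above: each variable contributes a non-linear delta, the deltas and a random baseline are summed into the total, and the oracle total keeps only the parts not driven by age or credit score. The specific curves (sine, square, tanh) and their scales are invented for illustration and are not the actual simutils functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Toy inputs (assumed ranges, for illustration only)
age = rng.integers(18, 79, size=n)
credit_score = rng.integers(550, 800, size=n)
personality = rng.normal(0, 1, size=n)

# Hypothetical non-linear influence of each variable on clicks
age_delta = np.rint(40 * np.sin((age - 18) / 60 * np.pi)).astype(int)
csv_delta = np.rint(45 * ((credit_score - 550) / 250) ** 2).astype(int)
personality_delta = np.rint(10 * np.tanh(personality)).astype(int)

# Like randint.rvs(10, 50), the upper bound is excluded, so the baseline is 10..49
baseline = rng.integers(10, 50, size=n)

# Total clicks are a plain sum of the non-linear pieces
total = age_delta + csv_delta + personality_delta + baseline

# Oracle view: clicks not attributable to age or credit score
i_total = personality_delta + baseline
```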

## The Dependent Variable: Wealth/Value
To create the dependent variable, portfolio value, we can use a similar procedure. Wealth will be caused in part by the number of clicks, as well as by age and credit score, and the question is whether we can tease these components apart. **Every click increases a person's wealth by $500**, and this is the causal parameter we are trying to discover.

In [5]:
#baseline wealth increases as individuals get older until ~75, at which point it drops off
#(columns are accessed by label; positional row[0] on a labeled row is deprecated in recent pandas)
age_wealth=Z.apply(lambda row: sim.age_worth(row['age']), axis=1)
#baseline wealth increases for people with better credit scores, who can get better rates
csv_wealth=Z.apply(lambda row: sim.csv_worth(row['credit_score']), axis=1)
#wealth increases caused by the behavior we care about (clicks), itself influenced by age and credit
click_wealth=clicks.apply(lambda row: sim.clicks_worth(row['total']), axis=1)
#wealth attributable to all the other random chance factors
circumstance_wealth=clicks.apply(lambda row: sim.circumstance_worth(), axis=1)
wealth=pd.concat([age_wealth, csv_wealth, click_wealth, circumstance_wealth], axis=1)
wealth.columns=['age_wealth', 'csv_wealth', 'click_wealth', 'circumstance_wealth']
wealth["total_wealth"]=wealth.apply(np.sum, axis=1)
wealth.head()


| | age_wealth | csv_wealth | click_wealth | circumstance_wealth | total_wealth |
|---|---|---|---|---|---|
| 0 | 18379 | 54368 | 23990 | 34327 | 131064 |
| 1 | 15229 | 39603 | 57995 | 49381 | 162208 |
| 2 | 35530 | 60454 | 38009 | 37480 | 171473 |
| 3 | 38973 | 63080 | 65495 | 38179 | 205727 |
| 4 | 11037 | 32217 | 29497 | 36577 | 109328 |


As with clicks, I want an oracle measure of the wealth attributable to those clicks that were not driven by age and credit score:

In [6]:
clean_wealth=clean_clicks.apply(lambda row: sim.clicks_worth(row['i_total']), axis=1)

## Obtaining Residuals

Double ML works by using the nuisance variables to predict each variable of interest and then keeping the residuals from those predictions. Theoretically, the residuals represent the variability in the scores that is not predicted by the nuisance variables.
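
One common way to write this down is the partially linear model that is often used to motivate Double ML. The notation is mine and is only a sketch: Y is wealth, X is clicks, Z stands for the nuisance variables used as predictors (age and credit score), g and m are unknown, possibly non-linear functions, and θ is the causal parameter (500 in this simulation). The residuals are what the code stores in the Y_hat and X_hat columns.

```latex
\begin{aligned}
Y &= \theta X + g(Z) + \varepsilon, & \mathbb{E}[\varepsilon \mid X, Z] &= 0,\\
X &= m(Z) + \nu, & \mathbb{E}[\nu \mid Z] &= 0,\\
\tilde{Y} &= Y - \hat{\ell}(Z), & \hat{\ell}(Z) &\approx \mathbb{E}[Y \mid Z],\\
\tilde{X} &= X - \hat{m}(Z), & \hat{m}(Z) &\approx \mathbb{E}[X \mid Z],\\
\hat{\theta} &= \frac{\sum_i \tilde{X}_i \tilde{Y}_i}{\sum_i \tilde{X}_i^{2}}.
\end{aligned}
```

Under these assumptions, subtracting the conditional means gives Y - E[Y | Z] = θ(X - E[X | Z]) + ε, which is why regressing one set of residuals on the other can recover θ.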

In [12]:
import statsmodels.formula.api as smf
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

M_clicks=GradientBoostingRegressor()
M_wealth=GradientBoostingRegressor()

y=wealth["total_wealth"]
x=clicks["total"]

#use cross-validation values to obtain residuals
residualized_y = y-cross_val_predict(M_wealth, Z[["age", "credit_score"]], y, cv=3)
residualized_x = x-cross_val_predict(M_clicks, Z[["age", "credit_score"]], x, cv=3)
df=pd.DataFrame()
df["Y"]=y
df["X"]=x
#note: despite the names, Y_hat and X_hat hold the residuals, not the model predictions
df["Y_hat"]=residualized_y
df["X_hat"]=residualized_x
df["age"]=Z["age"]
df["credit_score"]=Z["credit_score"]

#A meaningful comparison of regression coefficients will require normalized values
norm_df=(df-df.mean())/df.std()
norm_df.head()


| | Y | X | Y_hat | X_hat | age | credit_score |
|---|---|---|---|---|---|---|
| 0 | -1.102131 | -2.576214 | -2.508272 | -2.315385 | 0.343853 | 0.30749 |
| 1 | -0.14887 | 0.578175 | 1.253811 | 0.317924 | -1.114945 | -1.047467 |
| 2 | 0.134715 | -1.277348 | -2.215201 | -2.31089 | 0.760653 | 0.759142 |
| 3 | 1.183169 | 1.273997 | 0.560698 | 0.949115 | 0.969052 | 0.939803 |
| 4 | -1.767431 | -2.065945 | -1.737784 | -1.569089 | -1.740144 | -1.935716 |
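
To see what residuals like these are for, here is a minimal, self-contained sketch on toy data. It is not the pipeline above: a cubic polynomial fit with np.polyfit stands in for GradientBoostingRegressor, a single made-up confounder z replaces age and credit score, and every name and constant (including the confounding weight of 300) is an assumption chosen for illustration. Only the true effect of 500 per unit of x mirrors the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta_true = 500.0  # causal effect of x on y, mirroring the $500-per-click setup

# Toy data (all assumptions): one confounder z drives both x and y non-linearly.
z = rng.normal(0, 1, size=n)
x = z**2 + rng.normal(0, 1, size=n)
y = theta_true * x + 300 * z**2 + rng.normal(0, 100, size=n)


def cross_fit_residuals(z, target, degree=3, n_folds=3, seed=0):
    """Out-of-fold residuals of target after a polynomial fit on z."""
    fold_rng = np.random.default_rng(seed)
    folds = np.array_split(fold_rng.permutation(len(z)), n_folds)
    residuals = np.empty(len(z))
    for held_out in folds:
        # Fit on every row outside the held-out fold, predict on the held-out fold.
        train = np.setdiff1d(np.arange(len(z)), held_out)
        coefs = np.polyfit(z[train], target[train], deg=degree)
        residuals[held_out] = target[held_out] - np.polyval(coefs, z[held_out])
    return residuals


x_res = cross_fit_residuals(z, x)
y_res = cross_fit_residuals(z, y)

# Naive slope of y on x, ignoring the confounder.
naive_slope = np.polyfit(x, y, deg=1)[0]
# Residual-on-residual slope (no intercept; both residuals are close to mean zero).
theta_hat = (x_res @ y_res) / (x_res @ x_res)

print(f"naive slope: {naive_slope:.1f}")
print(f"residual-on-residual slope: {theta_hat:.1f}")
```

In this toy setup the naive slope absorbs part of the confounder's contribution, while the residual-on-residual slope should land close to 500.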


## Can We Use the Nuisance Variables to Predict the Residualized Scores?
Theoretically, no. Let's see what happens when we try. Compare the regression using the normalized Y against the regression using the normalized Y residuals:

In [13]:
#fit Y using age and credit score
reference_model = smf.ols(formula='Y ~ 1 + age + credit_score', data = norm_df).fit()
print(reference_model.summary())


                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.742
Model:                            OLS   Adj. R-squared:                  0.742
Method:                 Least Squares   F-statistic:                 7.190e+04
Date:                Wed, 15 Nov 2023   Prob (F-statistic):               0.00
Time:                        13:18:41   Log-Likelihood:                -37076.
No. Observations:               50000   AIC:                         7.416e+04
Df Residuals:                   49997   BIC:                         7.418e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -2.374e-16      0.002  -1.05e-13   

In [14]:
#fit residualized Y using age and credit score
residual_model = smf.ols(formula='Y_hat ~ 1 + age + credit_score', data = norm_df).fit()
print(residual_model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Y_hat   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.1030
Date:                Wed, 15 Nov 2023   Prob (F-statistic):              0.902
Time:                        13:19:30   Log-Likelihood:                -70946.
No. Observations:               50000   AIC:                         1.419e+05
Df Residuals:                   49997   BIC:                         1.419e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     1.369e-17      0.004   3.06e-15   

This is a fair comparison because all normalized values are on the same relative scale. The coefficients on the residualized scores are closer to zero, at least two orders of magnitude smaller, and their confidence intervals include zero. Accordingly, the t-statistics indicate that neither age nor credit score is a significant predictor of the residualized Y scores.
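
The same kind of check can be sketched without statsmodels. The toy example below uses assumed data, with a quadratic np.polyfit fit standing in for gradient boosting and simple two-fold cross-fitting, and compares the R-squared of a linear regression on the confounder before and after residualizing. One caveat worth keeping in mind: a linear regression like this can only detect leftover linear dependence on the nuisance variables.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Toy data (assumed): y depends on a single confounder z, partly non-linearly.
z = rng.normal(0, 1, size=n)
y = 2 * z + z**2 + rng.normal(0, 1, size=n)

# Two-fold cross-fitting: fit on one half, take residuals on the other half.
half = n // 2
first, second = np.arange(half), np.arange(half, n)
y_res = np.empty(n)
for train, held_out in [(first, second), (second, first)]:
    coefs = np.polyfit(z[train], y[train], deg=2)
    y_res[held_out] = y[held_out] - np.polyval(coefs, z[held_out])


def linear_r_squared(z, target):
    """R-squared of an ordinary least squares fit of target on [1, z]."""
    design = np.column_stack([np.ones_like(z), z])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ coefs
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


r2_raw = linear_r_squared(z, y)      # z explains a large share of raw y
r2_res = linear_r_squared(z, y_res)  # z explains almost none of the residuals
print(f"R-squared, raw y: {r2_raw:.3f}")
print(f"R-squared, residualized y: {r2_res:.3f}")
```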