In [149]:
#pip install scikit-learn
import numpy as np
from numpy.random import randn
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

n_samples=10000

We want to learn the effect of X1 on X4. Let's start by simulating the mean of X4 after setting X1 to 1 and to 0:

In [150]:
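# Intervention do(X1 = 1): X1 is set by hand and no longer depends on X2
x2_1 = randn(n_samples)
x1_1 = 1
x3_1 = 5 * x1_1 + 4 * x2_1 + randn(n_samples)
x4_1 = 6 * x3_1 + randn(n_samples)

# Intervention do(X1 = 0)
x2_0 = randn(n_samples)
x1_0 = 0
x3_0 = 5 * x1_0 + 4 * x2_0 + randn(n_samples)
x4_0 = 6 * x3_0 + randn(n_samples)

# Estimated average causal effect of X1 on X4
diff = np.mean(x4_1) - np.mean(x4_0)
print(diff)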

30.514748479180785
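
The simulated difference is not exactly 30, and a rough standard error helps judge whether the gap is just sampling noise. The sketch below assumes the unit-variance Gaussian noise terms used in the simulation; the names `var_x3`, `var_x4`, `se_mean` and `se_diff` are introduced here for illustration only.

```python
import numpy as np

n_samples = 10000

# Under do(X1 = x), X3 = 5*x + 4*X2 + noise, so Var(X3) = 4**2 * Var(X2) + 1
var_x3 = 4**2 * 1.0 + 1.0
# X4 = 6*X3 + noise, so Var(X4) = 6**2 * Var(X3) + 1
var_x4 = 6**2 * var_x3 + 1.0

# Standard error of one sample mean of X4
se_mean = np.sqrt(var_x4 / n_samples)
# Standard error of the difference of two independent sample means
se_diff = np.sqrt(2) * se_mean

print(var_x4, se_diff)
```

With a standard error of roughly 0.35, a simulated difference of 30.51 is about one and a half standard errors away from 30, which is compatible with sampling noise.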


We can also check whether the total average causal effect of X2 on X4 predicted by the path method (114 = 3*5*6 + 4*6) is correct:

In [147]:
# Intervention do(X2 = 1): X1 and X3 still respond to X2
x2_1 = 1
x1_1 = 3 * x2_1 + randn(n_samples)
x3_1 = 5 * x1_1 + 4 * x2_1 + randn(n_samples)
x4_1 = 6 * x3_1 + randn(n_samples)

# Intervention do(X2 = 0)
x2_0 = 0
x1_0 = 3 * x2_0 + randn(n_samples)
x3_0 = 5 * x1_0 + 4 * x2_0 + randn(n_samples)
x4_0 = 6 * x3_0 + randn(n_samples)

# Estimated total average causal effect of X2 on X4
diff = np.mean(x4_1) - np.mean(x4_0)
print(diff)

115.57450550736193
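
The path-method value can be reproduced by summing, over all directed paths, the product of the edge coefficients. The following is a minimal sketch, assuming the edge coefficients read off the simulation code above; the `edges` dictionary and the `total_effect` helper are hypothetical names introduced for this example.

```python
# Edge coefficients of the linear SCM, as used in the simulation code
edges = {
    ("x2", "x1"): 3,
    ("x1", "x3"): 5,
    ("x2", "x3"): 4,
    ("x3", "x4"): 6,
}

def total_effect(source, target):
    """Sum over all directed paths source -> target of the product of coefficients."""
    if source == target:
        return 1
    return sum(
        coef * total_effect(child, target)
        for (parent, child), coef in edges.items()
        if parent == source
    )

print(total_effect("x2", "x4"))  # 3*5*6 + 4*6 = 114
print(total_effect("x1", "x4"))  # 5*6 = 30
```

The recursion terminates because the graph is acyclic; it is not meant for graphs with cycles.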


We now show that we can also estimate the effect of X1 on X4 empirically, without simulating interventions. We simulate observational samples from the SCM in the example:

In [148]:
x2 = randn(n_samples) 
x1 = 3 * x2 + randn(n_samples) 
x3 = 5 * x1 + 4 * x2 + randn(n_samples)
x4 = 6 * x3 + randn(n_samples) 

df = pd.DataFrame({"x2": x2, "x1": x1, "x3": x3, "x4": x4})
Y = df[["x4"]].values              # outcome, shape (n_samples, 1)
X1 = df[["x1"]].values             # shape (n_samples, 1)
X21 = df[["x2", "x1"]].values      # column order: x2, x1
X13 = df[["x1", "x3"]].values      # column order: x1, x3
X = df[["x2", "x1", "x3"]].values  # column order: x2, x1, x3

We start by regressing X4 on X1:

In [155]:
linear_regressor = LinearRegression() 
linear_regressor.fit(X1, Y)
linear_regressor.coef_

array([[37.15893506]])

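The linear coefficient is far from the causal effect of X1 on X4 predicted by the path method (30 = 5*6), because X2 influences both X1 and X4 and this regression does not account for it.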
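
The value of about 37.16 is not arbitrary: it is close to the population slope Cov(X1, X4) / Var(X1) implied by the SCM. The sketch below works this out, assuming the unit-variance noise terms of the simulation; all variable names are introduced here for illustration.

```python
# Variances and covariances implied by the SCM with unit-variance noise
var_x2 = 1.0
var_x1 = 3**2 * var_x2 + 1.0            # X1 = 3*X2 + noise      -> 10
cov_x1_x2 = 3 * var_x2                  # Cov(X1, X2)            -> 3
cov_x1_x3 = 5 * var_x1 + 4 * cov_x1_x2  # X3 = 5*X1 + 4*X2 + ... -> 62
cov_x1_x4 = 6 * cov_x1_x3               # X4 = 6*X3 + noise      -> 372

# Population slope of the simple regression of X4 on X1
slope = cov_x1_x4 / var_x1
# Part of the slope that comes from the confounder X2, not from X1 itself
bias = slope - 5 * 6

print(slope, bias)
```

The slope is 37.2: the causal effect of 30 plus a bias of 7.2 carried by the path from X1 back through X2 to X3 and X4.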

We try something different and consider using both X1 and X3 as covariates in the regression:

In [151]:
linear_regressorX13 = LinearRegression() 
linear_regressorX13.fit(X13, Y)
linear_regressorX13.coef_[:,0]

array([0.14289015])

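The result for X1 is even worse: X3 carries the entire effect of X1 on X4, so once X3 is held fixed the coefficient of X1 drops to nearly 0. We now consider X1 and X2 as covariates in the regression: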

In [154]:
linear_regressorX12 = LinearRegression() 
linear_regressorX12.fit(X21, Y)
linear_regressorX12.coef_[:,1]

array([29.87150906])

Finally, we use X1, X2 and X3 as covariates in the regression:

In [153]:
linear_regressorX123 = LinearRegression() 
linear_regressorX123.fit(X, Y)
linear_regressorX123.coef_[:,1]

array([0.15806091])

The only correct estimate of the causal effect of X1 on X4 (29.87, close to 30) comes from the regression that adjusts for X2 but not for X3, which fits the adjustment formula we will see in the next slides.
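
As a preview, adjusting for X2 can be written as averaging the fitted conditional mean of X4 given X1 and X2 over the observed values of X2. The following is a minimal sketch under the assumption that this conditional mean is linear, as it is in this SCM. It uses a seeded NumPy generator and `np.linalg.lstsq` instead of `LinearRegression` to stay self-contained, and `mean_under_do` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Observational samples from the same SCM
x2 = rng.standard_normal(n)
x1 = 3 * x2 + rng.standard_normal(n)
x3 = 5 * x1 + 4 * x2 + rng.standard_normal(n)
x4 = 6 * x3 + rng.standard_normal(n)

# Fit E[X4 | X1, X2] as a linear model with an intercept
design = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(design, x4, rcond=None)

def mean_under_do(x1_value):
    """Average the fitted conditional mean over the observed X2 values."""
    design_do = np.column_stack([np.ones(n), np.full(n, x1_value), x2])
    return (design_do @ beta).mean()

effect = mean_under_do(1.0) - mean_under_do(0.0)
print(effect)
```

In the linear case this difference equals the X1 coefficient of the regression on X1 and X2, which is why reading off that coefficient was enough in the cells above.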