# Causal Inference for Tabular Data

Causal inference estimates the effect that an intervention on one set of variables has on another variable. For instance, suppose the causal graph is A->B->C. All three variables may be correlated, but an intervention on C does not affect the values of B, since C is not a causal ancestor of B. Interventions on A or B, on the other hand, both affect the values of C.
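The chain example can be checked numerically. The sketch below is a minimal simulation, assuming a linear chain with made-up coefficients (2.0 and 3.0) and standard Gaussian noise; the helper `simulate_chain` is hypothetical and is not part of the `causalai` library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def simulate_chain(n, rng, do_a=None, do_c=None):
    """Sample from the linear chain A -> B -> C, optionally intervening on A or C."""
    # Hypothetical coefficients chosen only for illustration.
    a = rng.normal(size=n) if do_a is None else np.full(n, float(do_a))
    b = 2.0 * a + rng.normal(size=n)
    c = 3.0 * b + rng.normal(size=n)
    if do_c is not None:
        # An intervention overwrites C and severs the edge B -> C.
        c = np.full(n, float(do_c))
    return a, b, c

# Observational data: B and C are strongly correlated.
_, b_obs, c_obs = simulate_chain(n, rng)
corr_bc = np.corrcoef(b_obs, c_obs)[0, 1]

# Intervening on C leaves the distribution of B unchanged.
_, b_do_c, _ = simulate_chain(n, rng, do_c=5.0)
shift_b = b_do_c.mean() - b_obs.mean()

# Intervening on A propagates through B to C (expected shift: 3.0 * 2.0 * 1.0).
_, _, c_do_a = simulate_chain(n, rng, do_a=1.0)
shift_c = c_do_a.mean() - c_obs.mean()

print(f"corr(B, C) = {corr_bc:.2f}")
print(f"shift in mean of B under do(C=5) = {shift_b:.2f}")
print(f"shift in mean of C under do(A=1) = {shift_c:.2f}")
```

The correlation between B and C is high in observational data, yet only interventions on upstream variables move the downstream mean.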

While there are many kinds of causal inference questions one may be interested in, we currently support two: Average Treatment Effect (ATE) and Conditional Average Treatment Effect (CATE). In ATE, we intervene on one set of variables with a treatment value and a control value, and estimate the expected change in the value of a specified target variable. Mathematically,

$$\texttt{ATE} = \mathbb{E}[Y | \texttt{do}(X=x_t)] - \mathbb{E}[Y | \texttt{do}(X=x_c)]$$

where $\texttt{do}$ denotes the intervention operation. In words, ATE is the difference between the expected value of $Y$ when we intervene to set $X$ to $x_t$ and its expected value when we intervene to set $X$ to $x_c$. Here $x_t$ and $x_c$ are the treatment value and the control value, respectively.
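To see why the $\texttt{do}$ operator matters, the following sketch contrasts the interventional quantity with a naive observational estimate. It assumes a hypothetical confounded model (a variable `z` that drives both `x` and `y`, with made-up coefficients); none of these names come from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structural model: Z -> X, Z -> Y, X -> Y.
z = rng.normal(size=n)
noise_y = rng.normal(size=n)

def outcome(x):
    """Structural equation for Y; the true causal coefficient of X is 1.0."""
    return 1.0 * x + 2.0 * z + noise_y

# Observational regime: X is driven by the confounder Z.
x_obs = z + rng.normal(size=n)
y_obs = outcome(x_obs)

# Interventional regime: X is set by hand, Z and the noise stay as they were.
x_t, x_c = 1.0, 0.0
ate = (outcome(np.full(n, x_t)) - outcome(np.full(n, x_c))).mean()

# Naive estimate: slope of Y on X in observational data, which absorbs
# the effect of Z and therefore overstates the causal effect.
naive_slope = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs, ddof=1)

print(f"ATE under do(X): {ate:.2f}")
print(f"Naive observational slope: {naive_slope:.2f}")
```

In this toy model the interventional difference recovers the causal coefficient, while the observational slope is roughly twice as large because of the confounder.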

CATE makes a similar estimate, but under some condition specified for a set of variables. Mathematically,

$$\texttt{CATE} = \mathbb{E}[Y | \texttt{do}(X=x_t), C=c] - \mathbb{E}[Y | \texttt{do}(X=x_c), C=c]$$

where we condition on some set of variables $C$ taking value $c$. Notice that $X$ is intervened on, whereas $C$ is only conditioned on.

In [1]:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
import pickle as pkl
import time
from functools import partial

from causalai.data.data_generator import DataGenerator, ConditionalDataGenerator
from causalai.models.tabular.causal_inference import CausalInference
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

def define_treatments(name, t,c):
    treatment = dict(var_name=name,
                    treatment_value=t,
                    control_value=c)
    return treatment



## Continuous Data

### Average Treatment Effect (ATE)
For this example, we will use synthetic data that has linear dependence among data variables.

In [2]:
fn = lambda x:x
coef = 0.1
sem = {
        'a': [], 
        'b': [('a', coef, fn), ('f', coef, fn)], 
        'c': [('b', coef, fn), ('f', coef, fn)],
        'd': [('b', coef, fn), ('g', coef, fn)],
        'e': [('f', coef, fn)], 
        'f': [],
        'g': [],
        }
T = 1000
data, var_names, graph_gt = DataGenerator(sem, T=T, seed=0, discrete=False)

graph_gt

{'a': [],
 'b': ['a', 'f'],
 'c': ['b', 'f'],
 'd': ['b', 'g'],
 'e': ['f'],
 'f': [],
 'g': []}

In [3]:

# Once we intervene on b, c no longer depends on a (see graph_gt above),
# so the intervention on a has no effect on the ATE in this case.
# This can be verified by changing the intervention values of a, which should leave the ATE unchanged.
t1='a' 
t2='b'
target = 'c'
target_var = var_names.index(target)

intervention11 = 10*np.ones(T)
intervention21 = 10*np.ones(T)
intervention_data1,_,_ = DataGenerator(sem, T=T, seed=0,
                        intervention={t1:intervention11, t2:intervention21})

intervention12 = 0.*np.ones(T)
intervention22 = -2.*np.ones(T)
intervention_data2,_,_ = DataGenerator(sem, T=T, seed=0,
                        intervention={t1:intervention12, t2:intervention22})



true_effect = (intervention_data1[:,target_var] - intervention_data2[:,target_var]).mean()
print("True ATE = %.2f" %true_effect)

True ATE = 1.20
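The value 1.20 can be checked by hand, under the assumption (read off from the `sem` specification, not confirmed against the `DataGenerator` source) that each variable is the coefficient-weighted sum of its parents plus additive noise, so that $c = 0.1\,b + 0.1\,f + \epsilon_c$. Because both interventional datasets are generated with the same seed, $f$ and $\epsilon_c$ are presumably identical in the two runs and cancel in the difference:

$$\texttt{ATE} = 0.1\,(b_t - b_c) = 0.1\,(10 - (-2)) = 1.2$$

The intervention on $a$ (10 versus 0) contributes nothing, since $b$ is itself fixed by intervention.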


In [4]:

tic = time.time()


treatments = [define_treatments(t1, intervention11, intervention12),
              define_treatments(t2, intervention21, intervention22)]
# Alternative: use a non-linear prediction model instead of LinearRegression
# CausalInference_ = CausalInference(data, var_names, graph_gt,
#         partial(MLPRegressor, hidden_layer_sizes=(100,100)), discrete=False)
CausalInference_ = CausalInference(data, var_names, graph_gt, LinearRegression, discrete=False)

ate, y_treat,y_control = CausalInference_.ate(target, treatments)
print(f'Estimated ATE: {ate:.2f}')
toc = time.time()
print(f'{toc-tic:.2f}s')



Estimated ATE: 1.02
0.01s
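One generic way to estimate an ATE from observational data when the graph is known is to regress the target on its parents and then predict with the intervened parent set to the treatment and control values. The sketch below illustrates that idea with plain NumPy least squares on data simulated from the same linear structure. It is an assumption-laden illustration, not a description of what `CausalInference.ate` does internally, and the helper `predict_c` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
coef = 0.1

# Simulate the relevant part of the SEM: a, f -> b and b, f -> c
# (assumed additive form with standard Gaussian noise).
a = rng.normal(size=n)
f = rng.normal(size=n)
b = coef * a + coef * f + rng.normal(size=n)
c = coef * b + coef * f + rng.normal(size=n)

# Fit c ~ b + f + intercept on observational data.
design = np.column_stack([b, f, np.ones(n)])
weights, *_ = np.linalg.lstsq(design, c, rcond=None)

def predict_c(b_value):
    """Predict c with b fixed to b_value, keeping the observed f."""
    intervened = np.column_stack([np.full(n, b_value), f, np.ones(n)])
    return intervened @ weights

# Treatment b = 10, control b = -2, as in the example above.
ate_hat = (predict_c(10.0) - predict_c(-2.0)).mean()
print(f"Estimated ATE: {ate_hat:.2f}")
```

With a small coefficient (0.1) and unit-variance noise, the fitted slope is noisy at small sample sizes, which is one plausible reason an estimate from 1000 samples can differ noticeably from the true value.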


### Conditional Average Treatment Effect (CATE)

The data is generated using the following structural equation model:
$$C = \text{noise}$$
$$W = C + \text{noise}$$
$$X = C \cdot W + \text{noise}$$
$$Y = C \cdot X + \text{noise}$$

We will treat C as the condition variable, X as the intervention variable, and Y as the target variable in our example below. The noise used in our example is sampled from the standard Gaussian distribution.
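Under this model, the CATE at $C=c$ is $c\,(x_t - x_c)$, because the noise in $Y$ cancels in the difference. The sketch below simulates the structural equations directly in NumPy to illustrate this; it is a stand-alone approximation of the setup, not the `ConditionalDataGenerator` implementation, and the window width of 0.05 around the condition value is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Structural equations from the text, with standard Gaussian noise.
c = rng.normal(size=n)
w = c + rng.normal(size=n)
x = c * w + rng.normal(size=n)  # observational X; replaced under do(X)
noise_y = rng.normal(size=n)

def y_under_do(x_value):
    """Value of Y when X is set to x_value by intervention."""
    return c * x_value + noise_y

x_t, x_c = 0.1, 0.9          # treatment and control values
condition_state = 2.1        # condition C close to 2.1

effect = y_under_do(x_t) - y_under_do(x_c)

# CATE: average effect over samples whose C lies near the condition value.
near = np.abs(c - condition_state) < 0.05
cate = effect[near].mean()

# ATE: average effect over all samples; E[C] = 0, so it is close to 0.
ate = effect.mean()

print(f"CATE near C={condition_state}: {cate:.2f}")
print(f"ATE: {ate:.2f}")
```

The contrast between the two printed values shows why conditioning matters here: the effect of $X$ on $Y$ depends on $C$, and it averages out to roughly zero over the whole population.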

In [5]:
T=2000
data, var_names, graph_gt = ConditionalDataGenerator(T=T, data_type='tabular', seed=0, discrete=False)
# var_names = ['C', 'W', 'X', 'Y']
treatment_var='X'
target = 'Y'
target_idx = var_names.index(target)


intervention1 = 0.1*np.ones(T)
intervention_data1,_,_ = ConditionalDataGenerator(T=T, data_type='tabular',\
                                    seed=0, intervention={treatment_var:intervention1}, discrete=False)

intervention2 = 0.9*np.ones(T)
intervention_data2,_,_ = ConditionalDataGenerator(T=T, data_type='tabular',\
                                    seed=0, intervention={treatment_var:intervention2}, discrete=False)
graph_gt

{'C': [], 'W': ['C'], 'X': ['C', 'W'], 'Y': ['C', 'X']}

In [6]:
condition_state=2.1
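# find the observed sample whose condition variable C is closest to condition_state
diff = np.abs(data[:,var_names.index('C')] - condition_state)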
idx = np.argmin(diff)
assert diff[idx]<0.1, f'No observational data exists for the conditional variable close to {condition_state}'


cate_gt = (intervention_data1[idx,target_idx] - intervention_data2[idx,target_idx])
print(f'True CATE: {cate_gt:.2f}')

####
treatments = define_treatments(treatment_var, intervention1,intervention2)
conditions = {'var_name': 'C', 'condition_value': condition_state}

tic = time.time()
model = partial(MLPRegressor, hidden_layer_sizes=(100,100), max_iter=200)
CausalInference_ = CausalInference(data, var_names, graph_gt, model, discrete=False)

cate = CausalInference_.cate(target, treatments, conditions, model)
toc = time.time()
print(f'Estimated CATE: {cate:.2f}')
print(f'Time taken: {toc-tic:.2f}s')

True CATE: -1.69
Estimated CATE: -1.71
Time taken: 1.63s


## Discrete Data

The synthetic data generation procedure for the ATE and CATE examples below is identical to the one used above for the continuous case, except that the generated data is discrete.

### Average Treatment Effect (ATE)
For this example, we will use synthetic data that has linear dependence among data variables.

In [7]:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import pickle as pkl
import time
from functools import partial

from causalai.data.data_generator import DataGenerator, ConditionalDataGenerator
from causalai.models.tabular.causal_inference import CausalInference
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def define_treatments(name, t,c):
    treatment = dict(var_name=name,
                    treatment_value=t,
                    control_value=c)
    return treatment

In [8]:
fn = lambda x:x
coef = 0.1
sem = {
        'a': [], 
        'b': [('a', coef, fn), ('f', coef, fn)], 
        'c': [('b', coef, fn), ('f', coef, fn)],
        'd': [('b', coef, fn), ('b', coef, fn), ('g', coef, fn)],
        'e': [('f', coef, fn)], 
        'f': [],
        'g': [],
        }
T = 5000
data, var_names, graph_gt = DataGenerator(sem, T=T, seed=0, discrete=True, nstates=10)

graph_gt

{'a': [],
 'b': ['a', 'f'],
 'c': ['b', 'f'],
 'd': ['b', 'b', 'g'],
 'e': ['f'],
 'f': [],
 'g': []}

In [9]:

t1='a'
t2='b'
target = 'c'
target_var = var_names.index(target)

# note that states can be [0,1,...,9], so the intervention values below must lie in this range
intervention11 = 0*np.ones(T, dtype=int)
intervention21 = 9*np.ones(T, dtype=int)
intervention_data1,_,_ = DataGenerator(sem, T=T, seed=0,
                            intervention={t1: intervention11, t2:intervention21}, discrete=True, nstates=10)

intervention12 = 6*np.ones(T, dtype=int)
intervention22 = 5*np.ones(T, dtype=int)
intervention_data2,_,_ = DataGenerator(sem, T=T, seed=0,
                            intervention={t1:intervention12, t2:intervention22}, discrete=True, nstates=10)

true_effect = (intervention_data1[:,target_var] - intervention_data2[:,target_var]).mean()
print("Ground truth ATE = %.2f" %true_effect)

Ground truth ATE = 0.66


In [10]:

tic = time.time()

treatments = [define_treatments(t1, intervention11,intervention12),\
             define_treatments(t2, intervention21,intervention22)]
model = partial(MLPRegressor, hidden_layer_sizes=(100,100), max_iter=200) # LinearRegression
CausalInference_ = CausalInference(data, var_names, graph_gt, model, discrete=True)
ate, y_treat,y_control = CausalInference_.ate(target, treatments)
print(f'Estimated ATE: {ate:.2f}')
toc = time.time()
print(f'Time taken: {toc-tic:.2f}s')


Estimated ATE: 0.50
Time taken: 0.98s


### Conditional Average Treatment Effect (CATE)
For this example we will use synthetic data that has non-linear dependence among data variables.

In [11]:
T=5000
data, var_names, graph_gt = ConditionalDataGenerator(T=T, data_type='tabular', seed=0, discrete=True, nstates=10)
# var_names = ['C', 'W', 'X', 'Y']

treatment_var='X'
target = 'Y'
target_idx = var_names.index(target)

# note that states can be [0,1,...,9], so the intervention values below must lie in this range
intervention1 = 1*np.ones(T, dtype=int)
intervention_data1,_,_ = ConditionalDataGenerator(T=T, data_type='tabular',\
                                    seed=0, intervention={treatment_var:intervention1}, discrete=True, nstates=10)

intervention2 = 9*np.ones(T, dtype=int)
intervention_data2,_,_ = ConditionalDataGenerator(T=T, data_type='tabular',\
                                    seed=0, intervention={treatment_var:intervention2}, discrete=True, nstates=10)
graph_gt

{'C': [], 'W': ['C'], 'X': ['C', 'W'], 'Y': ['C', 'X']}

In [12]:
condition_var = 'C'
condition_var_idx = var_names.index(condition_var)
condition_state=1
idx = np.where(data[:,condition_var_idx]==condition_state)[0]
cate_gt = (intervention_data1[idx,target_idx] - intervention_data2[idx,target_idx]).mean()
print(f'True CATE: {cate_gt:.2f}')

####
treatments = define_treatments(treatment_var, intervention1,intervention2)
conditions = {'var_name': condition_var, 'condition_value': condition_state}

tic = time.time()
model = partial(MLPRegressor, hidden_layer_sizes=(100,100), max_iter=200)
CausalInference_ = CausalInference(data, var_names, graph_gt, model, discrete=True)

cate = CausalInference_.cate(target, treatments, conditions, model)
toc = time.time()
print(f'Estimated CATE: {cate:.2f}')
print(f'Time taken: {toc-tic:.2f}s')

True CATE: 6.01
Estimated CATE: 6.00
Time taken: 3.66s
