# Causal Inference with Python Part 1

http://www.degeneratestate.org/posts/2018/Mar/24/causal-inference-with-python-part-1-potential-outcomes/

In [1]:
from __future__ import division

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.set_palette("colorblind")

%matplotlib inline

import datagenerators as dg

In [2]:
observed_data_0 = dg.generate_dataset_0()

observed_data_0.head()

| | x | y |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |


## Average Causal Effect

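We are calculating ACE = P(Y=1|X=1) - P(Y=1|X=0). We get an estimated_effect of roughly -0.15, which suggests that, in the observed data, wearing a cool hat (X=1) is associated with lower productivity (Y=1) than not wearing one (X=0). Because these data are observational, this difference is only a naive estimate of the ACE.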

In [3]:
def estimate_uplift(ds):
    """
    Estimate the difference in means between two groups.
    ACE = P(Y=1|X=1) - P(Y=1|X=0)
    
    Parameters
    ----------
    ds: pandas.DataFrame
        a dataframe of samples.
        
    Returns
    -------
    estimated_uplift: dict[str, float] containing two items:
        "estimated_effect" - the difference in mean values of y for treated and untreated samples.
        "standard_error" - half-width of the 95% confidence interval around "estimated_effect"
            (1.96 times the standard error of the difference).
    """
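    # Split into untreated (x == 0) and treated (x == 1) samples.
    base = ds[ds.x == 0]
    variant = ds[ds.x == 1]

    # Difference in the mean outcome between the two groups.
    delta = variant.y.mean() - base.y.mean()
    # Margin of error: 1.96 standard errors, i.e. a two-sided 95% normal interval.
    delta_err = 1.96 * np.sqrt(
        variant.y.var() / variant.shape[0] +
        base.y.var() / base.shape[0])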
    
    return {"estimated_effect": delta, "standard_error": delta_err}

estimate_uplift(observed_data_0)

{'estimated_effect': -0.15695440573770492,
 'standard_error': 0.08668256024159274}
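
As a reading aid, the two numbers above can be turned into an explicit interval. This is a minimal sketch that only reuses the printed output. It assumes, based on the 1.96 multiplier in `estimate_uplift`, that the value stored under `"standard_error"` is the half-width of a 95% interval.

```python
# Values copied from the output of estimate_uplift(observed_data_0) above.
result = {
    "estimated_effect": -0.15695440573770492,
    "standard_error": 0.08668256024159274,
}

# Assumption: "standard_error" already includes the 1.96 multiplier,
# so it is used directly as the margin of error.
lower = result["estimated_effect"] - result["standard_error"]
upper = result["estimated_effect"] + result["standard_error"]

excludes_zero = not (lower <= 0.0 <= upper)
print(f"95% interval: [{lower:.3f}, {upper:.3f}]")
print("interval excludes zero:", excludes_zero)
```

The whole interval lies below zero, so the negative association in the observed data is unlikely to be sampling noise alone. That says nothing yet about whether it is causal.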

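## Chi-squared test

A chi-squared test of independence checks whether X and Y are associated. The small p-value (< 0.05) indicates that they are. Passing `lambda_="log-likelihood"` makes `chi2_contingency` use the log-likelihood ratio statistic (the G-test) rather than Pearson's statistic.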

In [4]:
from scipy.stats import chi2_contingency

contingency_table = observed_data_0.assign(placeholder=1).pivot_table(index="x", columns="y", values="placeholder", aggfunc="sum").values
# contingency_table now looks like this: 
# [[122 153]
# [133  92]]

_, p, _, _ = chi2_contingency(contingency_table, lambda_="log-likelihood")

# p-value
p

0.0006070642072701378
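
To see what the test compares, here is a small sketch that starts from the counts written in the comment of the cell above instead of from `observed_data_0`. It assumes the rows are x = 0, 1 and the columns are y = 0, 1, which is the order `pivot_table` produces for `index="x", columns="y"`. The p-value it prints need not match the one shown above, which may come from a different random draw of the data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the comment in the cell above.
# Assumed layout: rows are x = 0, 1; columns are y = 0, 1.
table = np.array([[122, 153],
                  [133, 92]])

# P(Y=1 | X=x) for each row: count in the y=1 column over the row total.
p_y1_given_x = table[:, 1] / table.sum(axis=1)
difference = p_y1_given_x[1] - p_y1_given_x[0]

# Same call as in the cell above; `expected` holds the counts we would
# expect in each cell if X and Y were independent.
stat, p_value, dof, expected = chi2_contingency(table, lambda_="log-likelihood")

print("P(Y=1|X=0), P(Y=1|X=1):", p_y1_given_x)
print("difference:", difference)
print("expected counts under independence:")
print(expected)
print("degrees of freedom:", dof)
print("p-value:", p_value)
```

The test asks whether the observed counts are further from the expected ones than chance would explain. The difference in proportions computed here is the same quantity that `estimate_uplift` reports.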

## Randomized controlled trial, intervention and A/B test

In [5]:
def run_ab_test(datagenerator, n_samples=10000, filter_=None):
    """
    Generates n_samples from datagenerator with the value of X randomized
    so that 50% of the samples receive treatment X=1 and 50% receive X=0,
    and feeds the results into `estimate_uplift` to get an unbiased 
    estimate of the average treatment effect.
    
    Returns
    -------
    effect: dict
    """
    n_samples_a = int(n_samples / 2)
    n_samples_b = n_samples - n_samples_a
    set_X = np.concatenate([np.ones(n_samples_a), np.zeros(n_samples_b)]).astype(np.int64)
    ds = datagenerator(n_samples=n_samples, set_X=set_X)
    if filter_ is not None:
        ds = ds[filter_(ds)].copy()
    return estimate_uplift(ds)

run_ab_test(dg.generate_dataset_0)

{'estimated_effect': 0.20800000000000007,
 'standard_error': 0.019173011541954423}
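
The randomized estimate has the opposite sign to the observational one. One way this can happen is a hidden variable that influences both X and Y. The sketch below illustrates that with a hypothetical generator, `generate_confounded`. It is not the code in `datagenerators`, and the confounder `z` and all of the probabilities are made-up assumptions chosen only to reproduce a sign flip.

```python
import numpy as np
import pandas as pd


def generate_confounded(n_samples, set_X=None, rng=None):
    """Hypothetical stand-in for a data generator; NOT dg.generate_dataset_0."""
    rng = np.random.default_rng() if rng is None else rng
    # Hidden confounder Z influences both X and Y (assumed, for illustration).
    z = rng.binomial(1, 0.5, size=n_samples)
    if set_X is None:
        # Observational regime: Z = 1 makes X = 1 much more likely.
        x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
    else:
        # Interventional regime: X is assigned from outside, independent of Z.
        x = np.asarray(set_X)
    # Assumed mechanism: X raises P(Y=1) by 0.2, Z lowers it by 0.5.
    y = rng.binomial(1, 0.6 + 0.2 * x - 0.5 * z)
    return pd.DataFrame({"x": x, "y": y})


def difference_in_means(ds):
    """Mean of y among x == 1 minus mean of y among x == 0."""
    return ds.loc[ds.x == 1, "y"].mean() - ds.loc[ds.x == 0, "y"].mean()


rng = np.random.default_rng(0)
n = 20000

# Observational data: X is chosen "naturally" and depends on Z.
observed = generate_confounded(n, rng=rng)

# Randomized data: half the samples get X = 1, half get X = 0, as in run_ab_test.
set_X = np.concatenate([np.ones(n // 2), np.zeros(n - n // 2)]).astype(np.int64)
randomized = generate_confounded(n, set_X=set_X, rng=rng)

print("observational estimate:", difference_in_means(observed))
print("randomized estimate:   ", difference_in_means(randomized))
```

Under these assumed probabilities the observational difference is about -0.1 in expectation and the randomized one about +0.2. Setting X from outside breaks its link to the hidden variable, which is why `run_ab_test` can recover the effect that the observational comparison hides.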