In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline 

# What is the causal impact of having a PhD on making over 50K/year?

As of v0.27.x, ktrain supports causal inference using [meta-learners](https://arxiv.org/abs/1706.03461). We will use the well-studied [Adults](https://raw.githubusercontent.com/amaiya/ktrain/master/ktrain/tests/tabular_data/adults.csv) dataset from the UCI ML repository, which is census data from the early to mid 1990s.  The objective is to estimate how much earning a PhD increases the probability of making over $50K in salary.  This dataset is used here purely as a demonstration.  Unlike conventional supervised machine learning, there is typically no ground truth for causal inference models unless you are using a simulated dataset.  So, for illustration purposes, we will simply check whether our estimates agree with intuition.

Let's begin by loading the dataset and creating a binary treatment (1 for PhD and 0 for no PhD).

In [2]:
!wget https://raw.githubusercontent.com/amaiya/ktrain/master/ktrain/tests/tabular_data/adults.csv -O /tmp/adults.csv

--2021-07-16 15:50:38--  https://raw.githubusercontent.com/amaiya/ktrain/master/ktrain/tests/tabular_data/adults.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4573758 (4.4M) [text/plain]
Saving to: ‘/tmp/adults.csv’


2021-07-16 15:50:38 (34.0 MB/s) - ‘/tmp/adults.csv’ saved [4573758/4573758]



In [3]:
import pandas as pd
df = pd.read_csv('/tmp/adults.csv')
df = df.rename(columns=lambda x: x.strip())
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x) 
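# Use a set so that `in` tests exact membership (with a plain string it would test for substrings).
filter_set = {'Doctorate'}
df['treatment'] = df['education'].apply(lambda x: 1 if x in filter_set else 0)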
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class,treatment
0,25,Private,178478,Bachelors,13,Never-married,Tech-support,Own-child,White,Female,0,0,40,United-States,<=50K,0
1,23,State-gov,61743,5th-6th,3,Never-married,Transport-moving,Not-in-family,White,Male,0,0,35,United-States,<=50K,0
2,46,Private,376789,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,0,15,United-States,<=50K,0
3,55,?,200235,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,50,United-States,>50K,0
4,36,Private,224541,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,40,El-Salvador,<=50K,0
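
One detail worth keeping in mind when building a treatment indicator this way: Python's `in` operator tests for substrings when the right-hand side is a string, and for exact membership when it is a set or list. The sketch below illustrates the difference on a small made-up DataFrame (the `toy` frame and the value `'Doc'` are hypothetical, not taken from the Adults data) and uses `isin` as an equivalent vectorized way to build the indicator.

```python
import pandas as pd

# Toy stand-in for the `education` column (values are illustrative only).
toy = pd.DataFrame(
    {'education': ['Doctorate', 'Masters', 'HS-grad', 'Doctorate', 'Bachelors']}
)

# With a plain string, `in` is a substring test, so a partial value matches.
substring_match = 'Doc' in 'Doctorate'
# With a set, `in` requires an exact match.
exact_match = 'Doc' in {'Doctorate'}

# Vectorized exact-membership test, converted to a 0/1 treatment indicator.
treated_levels = {'Doctorate'}
toy['treatment'] = toy['education'].isin(treated_levels).astype(int)

# Quick sanity check on how many rows fall into each arm.
print(toy['treatment'].value_counts())
```

Checking the counts in each arm before fitting is a cheap way to catch an indicator that is accidentally all zeros or all ones.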


Next, let's call the `causal_inference_model` function to create a `CausalInferenceModel` instance and then call `fit` to estimate the individualized treatment effect for each row in this dataset.  By default, a [T-Learner](https://arxiv.org/abs/1706.03461) meta-learner is used with LightGBM models as base learners. Note that we are ignoring the education-related columns, as they are already captured in the treatment. We also ignore the weighting column, `fnlwgt`, since this example is just for illustration purposes.

In [4]:
from ktrain.tabular import causal_inference_model
cm = causal_inference_model(df,
                            treatment_col='treatment', 
                            outcome_col='class',
                            ignore_cols=['fnlwgt', 'education','education-num']).fit()

replaced ['<=50K', '>50K'] in column "class" with [0, 1]
outcome column (categorical): class
treatment column: treatment
numerical/categorical covariates: ['age', 'workclass', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
preprocess time:  0.48633289337158203  sec
start fitting causal inference model
time to fit causal inference model:  1.550849199295044  sec
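
To give a feel for what a T-Learner does, here is a minimal sketch on synthetic data. It is not ktrain's or CausalNLP's implementation: the data-generating process is made up, and scikit-learn's `LogisticRegression` stands in for the LightGBM base learners purely to keep the example small. The idea is to fit one outcome model on treated rows and another on untreated rows, then take the difference of their predictions for every row.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
n = 5000

# Synthetic data (entirely made up): two covariates, a binary treatment whose
# probability depends on the first covariate (confounding), and a binary
# outcome whose probability rises when treated.
X = rng.normal(size=(n, 2))
t = rng.binomial(1, sigmoid(X[:, 0]))
p0 = sigmoid(-0.5 + X[:, 0])        # P(y=1 | x, untreated)
p1 = sigmoid(-0.5 + X[:, 0] + 1.0)  # P(y=1 | x, treated)
y = rng.binomial(1, np.where(t == 1, p1, p0))

# T-Learner step 1: fit one outcome model per treatment arm.
model_treated = LogisticRegression().fit(X[t == 1], y[t == 1])
model_control = LogisticRegression().fit(X[t == 0], y[t == 0])

# T-Learner step 2: score every row with both models and take the difference.
ite_hat = (model_treated.predict_proba(X)[:, 1]
           - model_control.predict_proba(X)[:, 1])

ate_hat = ite_hat.mean()
ate_true = (p1 - p0).mean()  # known only because the data are simulated
print(f"estimated ATE: {ate_hat:.3f}, true ATE: {ate_true:.3f}")
```

Because the data are simulated, the estimate can be compared against a known true effect, which is exactly what is missing for the Adults dataset.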


### Average Treatment Effect (ATE)

The estimated average treatment effect across all examples is about 0.20.  That is, the model estimates that having a PhD increases the probability of making over $50K by roughly 20 percentage points.

In [5]:
cm.estimate_ate()

{'ate': 0.20340645077516034}

### Conditional Average Treatment Effects (CATE)

We also compute treatment effects after conditioning on attributes.  

For those with Master's degrees, the estimated effect is lower than the overall ATE (as expected):

In [6]:
cm.estimate_ate(cm.df['education'] == 'Masters')

{'ate': 0.17672418257642838}

For those who dropped out of school, the estimated effect is higher than the overall ATE (also as expected):

In [7]:
cm.estimate_ate(cm.df['education'].isin(['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '12th']))

{'ate': 0.2586697863578173}
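
The ATE and CATE figures above can be read as averages of per-row effect estimates: the ATE averages over every row, and a CATE averages over the rows selected by a condition. The sketch below shows that arithmetic on a small made-up DataFrame. The `effect` column and its values are hypothetical, and treating `estimate_ate` as a masked mean is an assumption about how the aggregation works, not a statement about ktrain's internals.

```python
import pandas as pd

# Made-up per-row effect estimates for five hypothetical individuals.
toy = pd.DataFrame({
    'education': ['Masters', 'Masters', '9th', '9th', 'HS-grad'],
    'effect':    [0.10, 0.20, 0.30, 0.40, 0.25],
})

# ATE: mean effect over all rows.
ate = toy['effect'].mean()

# CATE: mean effect over the rows selected by a boolean mask.
mask = toy['education'] == 'Masters'
cate_masters = toy.loc[mask, 'effect'].mean()

# The same idea for every education level at once.
cate_by_group = toy.groupby('education')['effect'].mean()

print(ate, cate_masters)
print(cate_by_group)
```

Small subgroups give noisy averages, so it can help to look at the group size alongside each CATE.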

### Individualized Treatment Effects (ITE)

The CATEs above illustrate how causal effects vary across different subpopulations in the dataset.  In fact, `CausalInferenceModel.df` stores a DataFrame representation of your dataset that has been augmented with a column called `treatment_effect` that stores the **individualized** treatment effect for each row in your dataset.

We can, for example, view the individuals whose estimated treatment effect is nearest to zero.  These are people for whom having or not having a PhD is estimated to make almost no difference.  As shown below, older white males who are executives/managers and professionals are minimally affected by having (or not having) a PhD.

In [8]:
cm.df[(cm.df['treatment_effect'] < 0.01)  & (cm.df['treatment_effect'] > -0.01)].head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class,treatment,treatment_effect
11,41,Private,191547,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,99999,0,55,United-States,1,0,-0.001229
17,39,Private,173175,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,50,United-States,1,0,-0.000222
26,51,State-gov,103063,Bachelors,13,Married-civ-spouse,Protective-serv,Husband,White,Male,7298,0,40,United-States,1,0,0.005321
66,72,Self-emp-inc,149689,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,20051,0,48,United-States,1,0,0.000649
71,43,Private,152958,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,7298,0,40,United-States,1,0,0.00268


On the other hand, the largest estimated effects, with increases of nearly 100 percentage points in the probability of making over $50K, are mostly for women without a four-year college degree.

In [9]:
cm.df.sort_values('treatment_effect', ascending=False).head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class,treatment,treatment_effect
19283,40,Private,207025,HS-grad,9,Never-married,Adm-clerical,Not-in-family,White,Female,6849,0,38,United-States,0,0,0.991928
16500,35,Private,162256,Assoc-voc,11,Divorced,Adm-clerical,Not-in-family,White,Female,6849,0,40,United-States,0,0,0.991656
30597,72,Private,298070,Assoc-voc,11,Separated,Other-service,Unmarried,White,Female,6723,0,25,United-States,0,0,0.991625
9888,27,Private,169557,HS-grad,9,Divorced,Machine-op-inspct,Not-in-family,White,Male,6849,0,40,United-States,0,0,0.989816
29341,39,Private,106183,HS-grad,9,Divorced,Other-service,Unmarried,Amer-Indian-Eskimo,Female,6849,0,40,United-States,0,0,0.989737


Examining how the treatment effect varies across units in the population can be useful for a variety of different applications.  [Uplift modeling](https://en.wikipedia.org/wiki/Uplift_modelling) is often used by companies for targeted promotions by identifying those individuals with the highest estimated treatment effects. For instance, one might target informational brochures on educational opportunities to those individuals with the highest treatment effects shown above.
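
A targeting rule of this kind can be sketched in a few lines of pandas. The example below is illustrative only: the `scored` DataFrame, its `person_id` column, and the `select_targets` helper are hypothetical, and the effect values are made up. It keeps rows with a positive estimated effect and returns the `k` largest.

```python
import pandas as pd

# Made-up scores standing in for a DataFrame with a `treatment_effect` column.
scored = pd.DataFrame({
    'person_id': [101, 102, 103, 104, 105, 106],
    'treatment_effect': [0.02, 0.95, -0.40, 0.61, 0.33, 0.88],
})


def select_targets(df, k, min_effect=0.0):
    """Return the k rows with the largest estimated effect above min_effect."""
    eligible = df[df['treatment_effect'] > min_effect]
    return eligible.nlargest(k, 'treatment_effect')


targets = select_targets(scored, k=3)
print(targets)
```

Filtering on a minimum effect before ranking avoids targeting anyone whose estimated effect is zero or negative when fewer than `k` people stand to benefit.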


Interestingly, there is a subpopulation for which our model predicts a negative effect of a PhD.  That is, earning a PhD is estimated to decrease the probability of making over $50K.  These appear to be white males in their 30s who work in a trade or specialty (e.g., `'Prof-specialty', 'Craft-repair'`).

In [10]:
cm.df.sort_values('treatment_effect').head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class,treatment,treatment_effect
4699,36,Private,58343,Bachelors,13,Married-civ-spouse,Tech-support,Husband,White,Male,3103,0,42,United-States,1,0,-0.903521
13164,37,Private,186934,11th,7,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,3103,0,44,United-States,1,0,-0.900153
18149,37,Private,377798,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,3103,0,40,United-States,1,0,-0.894803
26259,36,Private,131239,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,45,United-States,1,0,-0.886201
23290,36,Private,24106,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,3103,0,40,United-States,1,0,-0.874514


Indeed, there are a number of PhD holders in this dataset who earn less than $50K, most of whom appear to be in occupations like `'Prof-specialty'`.

In [11]:
cm.df[((cm.df.education == 'Doctorate') & (cm.df['class']==0))].head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class,treatment,treatment_effect
236,62,Local-gov,167889,Doctorate,16,Widowed,Prof-specialty,Unmarried,White,Female,0,0,40,Iran,0,1,-0.021622
435,42,Private,34037,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,50,United-States,0,1,-0.136606
516,32,Self-emp-not-inc,65278,Doctorate,16,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,0,1,0.067737
833,33,State-gov,25806,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,20,?,0,1,-0.174712
980,41,Self-emp-not-inc,49156,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,30,United-States,0,1,0.210994


Peculiar results like this may warrant further investigation, such as looking into the precise meaning of `Prof-specialty`.

### Making Predictions on New Examples

Finally, we can predict treatment effects on new examples, as long as they are formatted like the original DataFrame.  For instance, let's make a prediction for one of the rows we already examined:

In [12]:
df_example = cm.df.sort_values('treatment_effect', ascending=False).iloc[[0]]
df_example

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class,treatment,treatment_effect
19283,40,Private,207025,HS-grad,9,Never-married,Adm-clerical,Not-in-family,White,Female,6849,0,38,United-States,0,0,0.991928


In [13]:
cm.predict(df_example)

array([[0.99192821]])

### Validation

As mentioned above, there is no ground truth for this problem to validate our estimates.  In the cells above, we simply inspected the estimates to see if they correspond to our intuition on what should happen.  Another approach to validating causal estimates is to evaluate robustness to various data manipulations (i.e., sensitivity analysis). For instance, the Placebo Treatment test replaces the treatment with a random covariate.  We see below that this causes our estimate to drop to near zero, which is expected and exactly what we want. Such tests might be used to compare different models.

In [14]:
cm.evaluate_robustness()

Unnamed: 0,Method,ATE,New ATE,New ATE LB,New ATE UB,Distance from Desired (should be near 0)
0,Placebo Treatment,0.203406,-0.00129309,-0.00702968,0.00444351,-0.00129309
0,Random Add,0.203406,0.256089,0.241756,0.270421,0.0526821
0,Subset Data(sample size @0.5),0.203406,0.279642,0.248744,0.31054,0.0762357
0,Random Replace,0.203406,0.261979,0.247675,0.276282,0.0585724
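
The logic of the Placebo Treatment check can be illustrated on synthetic data. The sketch below is not what `evaluate_robustness` runs internally: the data are simulated, the `t_learner_ate` helper is hypothetical, and scikit-learn's `LogisticRegression` is assumed as a simple base learner. It estimates the ATE once with the real treatment and once with a coin-flip placebo that has no relationship to the outcome.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def t_learner_ate(X, t, y):
    """Mean per-row difference between treated and control outcome models."""
    m1 = LogisticRegression().fit(X[t == 1], y[t == 1])
    m0 = LogisticRegression().fit(X[t == 0], y[t == 0])
    diff = m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1]
    return float(diff.mean())


rng = np.random.default_rng(42)
n = 8000

# Simulated data (made up): the real treatment raises the outcome probability.
X = rng.normal(size=(n, 2))
t = rng.binomial(1, sigmoid(X[:, 0]))
y = rng.binomial(1, sigmoid(-0.5 + X[:, 0] + 1.0 * t))

# Placebo: a random 0/1 column that cannot have caused the outcome.
placebo = rng.integers(0, 2, size=n)

ate = t_learner_ate(X, t, y)
placebo_ate = t_learner_ate(X, placebo, y)
print(f"ATE with real treatment:    {ate:.3f}")
print(f"ATE with placebo treatment: {placebo_ate:.3f}")
```

If the placebo estimate were far from zero, that would suggest the estimator is picking up something other than the treatment.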


**ktrain** uses the **CausalNLP** package for inferring causality.  For more information, see the [CausalNLP documentation](https://amaiya.github.io/causalnlp).