Reference: https://github.com/PGuti/Uplift

# Uplift Modeling

In this notebook we run two common approaches to uplift modeling:

- the two-model approach
- the class transformation approach

These models are evaluated in a separate notebook.

We use simulated data. The objective is to predict uplift, that is, for each person, the difference in the probability of the outcome that is caused by the treatment.
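
In symbols, the definition above can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the dataset: $Y$ is the binary outcome, $T$ is 1 for a treated person and 0 for a control person, and $x$ is the vector of features describing the person.

$$
u(x) = P(Y = 1 \mid X = x, T = 1) - P(Y = 1 \mid X = x, T = 0)
$$

A positive $u(x)$ means the treatment makes the positive outcome more likely for that person, and a negative value means it makes it less likely.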

In [1]:
%pylab inline
import warnings
warnings.filterwarnings("ignore")
import pandas as pd



Populating the interactive namespace from numpy and matplotlib


## Loading the data

In [2]:
# load dataset
thefile = "customer_simulation.csv"
df = pd.read_csv(thefile)

In [3]:
df.shape

(10000, 24)

This simulated dataset contains 10000 rows and 24 columns.

In [4]:
df.head()

Unnamed: 0,customer_id,Node1,Node2,Node3,Node4,Node5,Node6,Node7,Node8,Node9,...,Node14,Node15,customer_type,Node17,Node18,Node19,Node20,target_control,outcome,train_test
0,1,Value4,Value1,Value2,Value2,Value2,Value4,Value2,Value4,Value4,...,Value3,Value3,persuadable,Value2,Value2,Value3,Value1,control,0,train
1,2,Value2,Value1,Value1,Value2,Value3,Value4,Value2,Value2,Value4,...,Value1,Value1,sleeping_dog,Value3,Value3,Value1,Value4,control,1,test
2,3,Value2,Value2,Value1,Value3,Value2,Value1,Value2,Value4,Value3,...,Value2,Value3,lost_cause,Value2,Value3,Value2,Value4,target,0,train
3,4,Value3,Value1,Value1,Value2,Value3,Value4,Value4,Value1,Value4,...,Value2,Value1,persuadable,Value2,Value3,Value1,Value4,control,0,train
4,5,Value4,Value1,Value1,Value3,Value2,Value1,Value2,Value4,Value3,...,Value3,Value2,sleeping_dog,Value1,Value3,Value2,Value4,control,1,train


Each row represents a customer who is in either the treatment (target) set or the control set. Treated customers were exposed to some action, for example an email or an offer sent because the person was likely to churn. Customers in the control set, by contrast, were left alone.

In [5]:
df.target_control.value_counts()

control    5063
target     4937
Name: target_control, dtype: int64

Each customer is also one of four types: persuadable, lost_cause, sleeping_dog or sure_thing.

In [6]:
df.customer_type.value_counts()

lost_cause      2554
sleeping_dog    2528
persuadable     2471
sure_thing      2447
Name: customer_type, dtype: int64

- A lost cause is a person who will react negatively no matter what (whether targeted or not). Targeting these people is a waste of resources.
- A sleeping dog is a person who will react negatively if treated but not if left alone. An example would be someone who had forgotten about a gym subscription they were not using and has just received an email about it. Targeting these people is bad for business.
- A persuadable is a person who reacts positively when solicited but would have reacted negatively otherwise. These are the people we want to target with the uplift model.
- Finally, a sure thing is a person who would react positively no matter what. Targeting these people is a waste of resources.
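
The four types can be read as pairs of potential outcomes. The sketch below is an idealized, deterministic reading of the list above; the names `POTENTIAL_OUTCOMES` and `individual_uplift` are hypothetical, and the simulated data may well contain noise on top of this.

```python
# Idealized potential outcomes per customer type:
# (outcome if treated, outcome if left alone), where 1 is the positive outcome.
# This table is an assumption drawn from the descriptions, not from the CSV file.
POTENTIAL_OUTCOMES = {
    "lost_cause":   (0, 0),
    "sleeping_dog": (0, 1),
    "persuadable":  (1, 0),
    "sure_thing":   (1, 1),
}

def individual_uplift(customer_type):
    """Outcome if treated minus outcome if left alone."""
    treated, untreated = POTENTIAL_OUTCOMES[customer_type]
    return treated - untreated

for name in sorted(POTENTIAL_OUTCOMES):
    print("%s: %+d" % (name, individual_uplift(name)))
```

Under this reading only persuadables have a positive uplift, sleeping dogs have a negative one, and the two remaining types have none, which is why a good uplift model should rank persuadables first and sleeping dogs last.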

Each row also contains the outcome variable. Here 1 could mean "not churning", for example.

In [7]:
df.outcome.value_counts()

0    5047
1    4953
Name: outcome, dtype: int64

Each row belongs to either the train or the test set. The split is roughly 80 / 20.

In [8]:
df.train_test.value_counts()

train    7952
test     2048
Name: train_test, dtype: int64

The remaining 19 columns (the Node columns) are categorical variables that can be used to predict the outcome or the uplift for each individual.

# Reshaping

In [9]:
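# dummify: one-hot encode every categorical Node column

feat = [x for x in df.columns if "Node" in x]
features = []  # names of the dummy columns, used later as model inputs
for f in feat:
    # prefix=f names each dummy column "<f>_<value>", e.g. Node1_Value4
    dummies = pd.get_dummies(df[f], prefix=f)
    features = features + list(dummies.columns)
    # append the dummies and drop the original categorical column
    df = pd.concat([df, dummies], axis=1)
    df = df.drop([f], axis=1)
    print("done " + f)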

done Node1
done Node2
done Node3
done Node4
done Node5
done Node6
done Node7
done Node8
done Node9
done Node10
done Node11
done Node12
done Node13
done Node14
done Node15
done Node17
done Node18
done Node19
done Node20
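
As a hedged aside, the same encoding can be written without an explicit loop by passing `columns=` to `pd.get_dummies`. The toy frame below is made up for illustration and only mimics the layout of the real data.

```python
import pandas as pd

# Toy data with the same layout as the real file (the values are invented).
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "Node1": ["Value1", "Value2", "Value1"],
    "Node2": ["Value3", "Value3", "Value4"],
})

node_cols = [c for c in toy.columns if "Node" in c]
# columns= encodes the listed columns, drops the originals,
# and names each dummy "<column>_<value>".
encoded = pd.get_dummies(toy, columns=node_cols)
toy_features = [c for c in encoded.columns if "Node" in c]
print(toy_features)
```

The loop version has the advantage of printing progress per column, which can be useful on wider datasets.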


In [10]:
# .copy() so that columns added later do not write into a view of df
train_df = df[df["train_test"]=="train"].copy()
test_df = df[df["train_test"]=="test"].copy()

print(train_df.shape)
print(test_df.shape)

(7952, 81)
(2048, 81)


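# Two-model approach

We fit one model on the target (treated) customers and one on the control customers. Uplift is then modeled as the predicted probability of the outcome from the target model minus the predicted probability of the outcome from the control model.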

In [11]:
target = train_df[train_df["target_control"]=='target']  
control = train_df[train_df["target_control"]=='control']

print(target.shape)
print(control.shape)

(3934, 81)
(4018, 81)


In [12]:
target_X = target[features]
control_X = control[features]
target_Y = target[['outcome']]
control_Y = control[['outcome']]
test_X = test_df[features]

#### training

In [13]:
from sklearn.ensemble import GradientBoostingClassifier

clf1 = GradientBoostingClassifier(n_estimators = 100,learning_rate = 0.1,max_depth = 3)
clf2 = GradientBoostingClassifier(n_estimators = 100,learning_rate = 0.1,max_depth = 3)

In [14]:
clf1.fit(target_X.values,target_Y.values.ravel())
clf2.fit(control_X.values,control_Y.values.ravel())

GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

#### scoring 

In [36]:
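# clf1 was fitted on treated customers, clf2 on control customers;
# column 1 of predict_proba is the probability of outcome == 1
test_df["proba_outcome_target"] = clf1.predict_proba(test_X.values)[:,1]
test_df["proba_outcome_control"] = clf2.predict_proba(test_X.values)[:,1]
# uplift is just the difference.
test_df["uplift_1"] = test_df["proba_outcome_target"] - test_df["proba_outcome_control"]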
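
To see that a difference of two models can recover a known effect, here is a minimal self-contained sketch on made-up data. Everything in it is an assumption chosen for illustration: a single binary feature `x`, random 50/50 assignment, and a treatment that only helps when `x == 1`. It is not the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
n = 4000
x = rng.randint(0, 2, size=n)        # one binary feature
treated = rng.randint(0, 2, size=n)  # random 50/50 treatment assignment

# Assumed ground truth: when x == 1 the treatment lifts P(outcome = 1)
# from 0.2 to 0.8; when x == 0 it has no effect (0.5 either way).
p = np.where(x == 1, np.where(treated == 1, 0.8, 0.2), 0.5)
y = (rng.uniform(size=n) < p).astype(int)

X = x.reshape(-1, 1)
# One model per group, as in the two-model approach.
model_t = GradientBoostingClassifier(random_state=0).fit(X[treated == 1], y[treated == 1])
model_c = GradientBoostingClassifier(random_state=0).fit(X[treated == 0], y[treated == 0])

# Predicted uplift for x == 0 and x == 1.
grid = np.array([[0], [1]])
toy_uplift = model_t.predict_proba(grid)[:, 1] - model_c.predict_proba(grid)[:, 1]
print(toy_uplift)
```

With these settings the two printed values should land near 0 and 0.6, the true uplifts built into the simulation, up to sampling noise.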

# Class transformation approach

This implements the class transformation approach:
- stack the target and control data
- flip the outcome for the control dataset
- train a model on this modified outcome
- uplift is 2 times the predicted probability of the modified outcome, minus 1
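
The last step can be justified with a short derivation, sketched here under one assumption: each person is equally likely to be in the target or the control group whatever their features, which is roughly the case here (4937 target versus 5063 control rows). The notation is introduced for illustration only: $Y$ is the outcome, $T$ is 1 for target and 0 for control, and $Z$ is the modified outcome, equal to $Y$ for target rows and $1 - Y$ for control rows.

$$
\begin{aligned}
P(Z = 1 \mid x) &= P(Y = 1 \mid x, T = 1)\,P(T = 1 \mid x) + P(Y = 0 \mid x, T = 0)\,P(T = 0 \mid x) \\
&= \tfrac{1}{2}\left[ P(Y = 1 \mid x, T = 1) + 1 - P(Y = 1 \mid x, T = 0) \right]
\end{aligned}
$$

Rearranging gives

$$
P(Y = 1 \mid x, T = 1) - P(Y = 1 \mid x, T = 0) = 2\,P(Z = 1 \mid x) - 1 .
$$

If the two groups are far from balanced, this identity no longer holds as written and the modified outcome would need reweighting.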

In [21]:
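# 1 for target (treated) customers, 0 for control customers
train_df['istarget'] = (train_df['target_control'] == 'target').astype(int)
# modified outcome: 1 if (target and outcome == 1) or (control and outcome == 0), else 0
train_df['modified_outcome'] = train_df['outcome'] * train_df['istarget'] \
                                 + (1-train_df['outcome'])*(1-train_df['istarget'])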
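
The formula for the modified outcome is easier to read as a truth table. The helper below is a hypothetical scalar version of the same expression, written only to enumerate the four cases.

```python
def modified_outcome(outcome, istarget):
    """Scalar version of the column expression used for modified_outcome."""
    return outcome * istarget + (1 - outcome) * (1 - istarget)

# Enumerate the four combinations of group and outcome.
for istarget in (1, 0):
    for outcome in (1, 0):
        group = "target" if istarget else "control"
        print("%-7s outcome=%d -> modified_outcome=%d"
              % (group, outcome, modified_outcome(outcome, istarget)))
```

Target rows keep their outcome and control rows get the opposite one, which is exactly the "flip" described in the steps above.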

In [27]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 500,max_depth=10)

In [28]:
clf.fit(train_df[features].values, train_df['modified_outcome'].values.ravel())

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [43]:
# uplift = 2 * P(modified_outcome == 1) - 1
test_df["uplift_2"] = 2 * clf.predict_proba(test_X.values)[:,1] - 1

# Save Predictions 

In [44]:
test_df.to_csv("/Users/pierregutierrez/Downloads/uplift_predictions.csv")