# Uplift Models for Promoters Campaigns

### Summary

During the Promoters Campaigns, the monthly cashflow effect of sending promoters to an outlet can range from very positive to very negative. A negative effect means that money was spent on sending the promoters while monthly sales actually fell, which is undesirable for both Tsel and the outlet. We therefore want to send promoters only to outlets where the expected outcome is positive enough to justify the investment.

Uplift models can be used to select the outlets where sending promoters is expected to pay off most. In this notebook, we load the data from the Promoters Pilot and build an uplift model.
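
The targeting rule described above can be sketched as a simple threshold on predicted uplift. This is only an illustration: the outlet IDs, the uplift values and the `promoter_cost` figure are hypothetical and do not come from the pilot data.

```python
import pandas as pd

# Hypothetical predicted monthly cashflow uplift per outlet (illustrative values only)
predicted = pd.DataFrame({
    "outlet_id": ["A", "B", "C", "D"],
    "predicted_uplift": [1_500_000.0, -400_000.0, 250_000.0, 900_000.0],
})

# Hypothetical cost of sending promoters to one outlet for a month
promoter_cost = 500_000.0

# Target only outlets whose expected uplift exceeds the cost of the visit
predicted["send_promoter"] = predicted["predicted_uplift"] > promoter_cost
targeted_outlets = predicted.loc[predicted["send_promoter"], "outlet_id"].tolist()
print(targeted_outlets)
```

Outlet C has a positive predicted uplift but is still excluded, because the uplift does not cover the assumed cost.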

### Load packages

The model is built with the "causalml" library from Uber; the data processing relies on standard data science libraries.

In [None]:
%matplotlib inline
from causalml.dataset import synthetic_data
import math
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score, classification_report, confusion_matrix, accuracy_score
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

### Load Promoters Pilot table and organize it

This dataset contains the change in cashflow from February to March for the Promoters Pilot Test and Control Groups.

The Control Group was created by finding, for each Test outlet, another random outlet with the same classification, type and region. Both groups therefore start out with the same size (although, as we will see later, timing-related reasons cause the two groups to end up with different sizes).
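
The matching procedure described above could look roughly like the sketch below. It is not the code that produced the pilot's Control Group; the `outlets` table, its column names and the `test_ids` list are hypothetical, and the sketch assumes every stratum has at least one unused candidate.

```python
import pandas as pd

# Hypothetical outlet universe (column names are assumptions)
outlets = pd.DataFrame({
    "outlet_id": list(range(1, 13)),
    "classification": ["gold"] * 3 + ["silver"] * 3 + ["gold"] * 3 + ["silver"] * 3,
    "type": ["kiosk"] * 12,
    "region": ["west"] * 6 + ["east"] * 6,
})
test_ids = [1, 4, 7, 10]  # hypothetical Test outlets
keys = ["classification", "type", "region"]

test = outlets[outlets["outlet_id"].isin(test_ids)]
pool = outlets[~outlets["outlet_id"].isin(test_ids)]

picked = []
for _, test_row in test.iterrows():
    # Candidates share classification, type and region with the Test outlet
    same_stratum = (pool[keys] == test_row[keys]).all(axis=1)
    # Do not reuse an outlet already chosen as a control
    unused = ~pool["outlet_id"].isin([p["outlet_id"] for p in picked])
    match = pool[same_stratum & unused].sample(n=1, random_state=0).iloc[0]
    picked.append(match)

control = pd.DataFrame(picked)
print(control)
```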

In [None]:
PP_experiment_df = pd.read_csv('Promoters Pilot input table v2.txt')
PP_experiment_df.rename(columns={'Outlet_id':'outlet_id'},inplace=True)
#PP_experiment_df.head()

In [None]:
#PP_experiment_df = PP_experiment_df.drop(columns=['Cashflow_feb','Cashflow_march_1_29'])
PP_experiment_df.head()

In [None]:
# List of test and control outlets
outlets_list = PP_experiment_df['outlet_id'].tolist()
len(outlets_list) # Number of outlets

### Load outlets master table of features

This table is the input for the Reseller's Model; here we use it to add features to the Test and Control group outlets. These features are used in the modelling stage.

In [None]:
master_df = pd.read_csv('../../../data/reseller/07_model_output/gridsearchcv/ra_mck_int_gridsearchcv_master_prepared.csv')

In [None]:
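#master_df.head()
# Print all three counts (a notebook cell only displays its last expression)
print(len(master_df))                     # rows
print(len(master_df.outlet_id.unique()))  # distinct outlets
print(len(master_df.columns))             # columns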

### Check how many Test and Control outlets are in the master table of features

Not all outlets in the Test and Control groups are present in the master table, so we check how many are. Note that this makes the Test and Control groups smaller and of different sizes.

In [None]:
# Make list of all outlets in master table of features
master_outlets_list = master_df.outlet_id.unique().tolist()
len(master_outlets_list) # Outlets in master table of features

In [None]:
# Check how many Test and Control outlets in master table
Exp_outlets_in_master_df = PP_experiment_df[PP_experiment_df['outlet_id'].isin(master_outlets_list)]
len(Exp_outlets_in_master_df)

In [None]:
# Check for only Test/Treatment outlets
Exp_outlets_in_master_1_df = Exp_outlets_in_master_df.loc[Exp_outlets_in_master_df['Treatment'] == 1]
len(Exp_outlets_in_master_1_df)

In [None]:
# Check for only Control/Non-Treatment outlets
Exp_outlets_in_master_0_df = Exp_outlets_in_master_df.loc[Exp_outlets_in_master_df['Treatment'] == 0]
len(Exp_outlets_in_master_0_df)
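
The three counts above can also be obtained in one step. The sketch below uses `merge` with `indicator=True` on small made-up tables; the `experiment` and `master` frames and the `feature_a` column are hypothetical stand-ins for `PP_experiment_df` and `master_df`.

```python
import pandas as pd

# Hypothetical stand-ins for the experiment and master tables
experiment = pd.DataFrame({"outlet_id": [1, 2, 3, 4, 5, 6],
                           "Treatment": [1, 1, 1, 0, 0, 0]})
master = pd.DataFrame({"outlet_id": [1, 2, 4, 7],
                       "feature_a": [0.1, 0.2, 0.3, 0.4]})

# A left merge with an indicator flags which experiment outlets exist in the master table
coverage = experiment.merge(master[["outlet_id"]].drop_duplicates(),
                            on="outlet_id", how="left", indicator=True)
coverage["in_master"] = coverage["_merge"] == "both"

# Outlets found in the master table ('sum') versus group size ('count'), per group
summary = coverage.groupby("Treatment")["in_master"].agg(["sum", "count"])
print(summary)
```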

### Filter master table of features by outlets list (Test and Control)

Now we filter the master table of features by the Test and Control outlets.

In [None]:
PP_features_df = master_df[master_df['outlet_id'].isin(outlets_list)]
len(PP_features_df)

In [None]:
# Check number of columns
len(PP_features_df.columns) 

In [None]:
# Drop columns that contain any NA value, just in case
PP_features_df = PP_features_df.dropna(axis=1)
len(PP_features_df.columns)

### Attach master table of features to Promoters Pilot table

Now both tables are joined to produce a Promoters Pilot master table.

In [None]:
PP_master_df = PP_experiment_df.join(PP_features_df.set_index('outlet_id'), on='outlet_id')
PP_master_df.head()

In [None]:
PP_master_df = PP_master_df.dropna(axis=0)
len(PP_master_df)
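
As a hedged alternative to the join followed by `dropna`, an inner `merge` keeps only outlets present in both tables, and `validate` raises an error if an outlet appears more than once on either side. The frames below are hypothetical. Note that `dropna(axis=0)` additionally removes rows with missing values in the experiment columns, so the two approaches are only equivalent when those columns are complete.

```python
import pandas as pd

# Hypothetical stand-ins for the pilot table and the filtered feature table
experiment = pd.DataFrame({"outlet_id": [1, 2, 3],
                           "Treatment": [1, 0, 1],
                           "Delta_feb_mar": [10.0, -5.0, 3.0]})
features = pd.DataFrame({"outlet_id": [1, 3, 4],
                         "feature_a": [0.5, 0.7, 0.9]})

# Inner merge: outlets without features are dropped; validate guards against duplicate keys
joined = experiment.merge(features, on="outlet_id", how="inner", validate="one_to_one")
print(joined)
```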

In [None]:
len(PP_features_df.columns)

### Create target variable y (several possible target definitions are included)

In [None]:
PP_master_df['Delta_feb_mar'].describe()

In [None]:
cash = PP_master_df['Delta_feb_mar'].tolist()
PP_master_df['target_class1'] = -1
target1 = PP_master_df['target_class1'].tolist()
for i in range(0,len(cash)):
    if cash[i] > 6.835000e+05:
        target1[i] = 4
    elif cash[i] <= 6.835000e+05 and cash[i] > -1.254000e+06:
        target1[i] = 3
    elif cash[i] <= -1.254000e+06 and cash[i] > -3.588500e+06:  
        target1[i] = 2
    elif cash[i] <= -3.588500e+06:
        target1[i] = 1
    else:
        print('Error')

In [None]:
# Assign target variable
PP_master_df['target_class'] = target1
# The target variable is chosen here; 'target_class' holds the binned classes,
# while 'target_class1' still contains only the -1 placeholder
y = PP_master_df['target_class']
len(y)
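
The same four-class binning can be expressed with `pd.cut`. The sketch below reuses the thresholds from the loop above on made-up values of the cashflow change; the `delta` series is hypothetical. With the default `right=True`, each bin includes its upper edge, which matches the `<=` comparisons in the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical cashflow changes, including values exactly on the thresholds
delta = pd.Series([2_000_000.0, 683_500.0, 0.0, -1_254_000.0,
                   -2_000_000.0, -3_588_500.0, -5_000_000.0])

# Bin edges taken from the loop above; bins are (lower, upper]
bins = [-np.inf, -3_588_500, -1_254_000, 683_500, np.inf]
target_class = pd.cut(delta, bins=bins, labels=[1, 2, 3, 4]).astype(int)
print(target_class.tolist())
```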

### Create treatment

In [None]:
treatment = PP_master_df['Treatment']
len(treatment)

### Create features X

In [None]:
# Check for relevant columns
PP_master_df.columns[0:20].tolist()

In [None]:
# Relevant columns
df0 = PP_master_df.iloc[:,1:2]
df1 = PP_master_df.iloc[:,3:6]
df2 = PP_master_df.iloc[:,10:11]
df3 = PP_master_df.iloc[:,12:2108]
df4 = PP_master_df.iloc[:,2213:2371]

In [None]:
x = pd.concat([df0,df1,df2,df3,df4], axis=1)
#x

In [None]:
len(x)

In [None]:
# Keep only numeric (float64/int64) columns
columns = x.select_dtypes(include=['float64', 'int64']).columns.tolist()
print(len(columns))
#columns

In [None]:
x = x[columns]

### Random forest model to define subset of x features to use in uplift model

#### Create model

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=40)

In [None]:
regressor = RandomForestClassifier()
regressor.fit(x_train,y_train)

In [None]:
y_pred = regressor.predict(x_test)

#### Check model

In [None]:
print(confusion_matrix(y_test,y_pred))

In [None]:
print(classification_report(y_test,y_pred))

#### Choose top variables

In [None]:
Variables = pd.Series(x.columns)
Feature_importances = pd.Series(regressor.feature_importances_)
Feature_importances_dic = {'Variable': Variables, "Feature_importance": Feature_importances}
Feature_importances_df = pd.DataFrame(Feature_importances_dic)
Feature_importance_sorted_df = Feature_importances_df.sort_values(by="Feature_importance", ascending=False)
#Feature_importances_df

In [None]:
fis = Feature_importances_df.sort_values(by="Feature_importance", ascending=False)
fis.iloc[0:100,:]
#fis['Variable'].tolist()

In [None]:
vars_ranked = fis['Variable'].tolist()
top_vars = vars_ranked[0:50]
#top_vars

In [None]:
X = x[top_vars]
#X
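
The selection step above (fit a random forest, rank features by importance, keep the top ones) can be tried on synthetic data. In this sketch the data, the feature names `f0` to `f5` and the choice of keeping three features are all hypothetical; only `f0` determines the label, so it should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
features = pd.DataFrame(rng.normal(size=(n, 6)),
                        columns=[f"f{i}" for i in range(6)])
# Synthetic label driven by f0 only; the other columns are pure noise
labels = (features["f0"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, labels)

# Rank features by impurity-based importance and keep the top 3
importances = pd.Series(forest.feature_importances_, index=features.columns)
top_features = importances.sort_values(ascending=False).head(3).index.tolist()
selected = features[top_features]
print(top_features)
```

Indexing the importances by column name, as done here, avoids any mismatch between the list of variable names and the importance values.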

### Check for possible sample bias

In [None]:
x_check_bias = x.drop(columns=['Promoter_days'])
x_train, x_test, t_train, t_test = train_test_split(x_check_bias, treatment, test_size=0.30, random_state=42)

In [None]:
regressor = RandomForestClassifier()
regressor.fit(x_train,t_train)

In [None]:
t_pred = regressor.predict(x_test)

In [None]:
print(confusion_matrix(t_test,t_pred))

In [None]:
print(classification_report(t_test,t_pred))

In [None]:
print(accuracy_score(t_test,t_pred))

In [None]:
Variables = pd.Series(x_check_bias.columns) # Must match the columns the classifier was trained on
Feature_importances = pd.Series(regressor.feature_importances_)
Feature_importances_dic = {'Variable': Variables, "Feature_importance": Feature_importances}
Feature_importances_df = pd.DataFrame(Feature_importances_dic)
Feature_importance_sorted_df = Feature_importances_df.sort_values(by="Feature_importance", ascending=False)
#Feature_importances_df

In [None]:
fis = Feature_importances_df.sort_values(by="Feature_importance", ascending=False)
fis.iloc[0:10,:]
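
The idea behind this check is that, if treatment assignment is unrelated to the features, a classifier should not predict it better than chance. The sketch below illustrates this on synthetic data with cross-validated ROC AUC; the data and the helper `treatment_auc` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
features = rng.normal(size=(n, 5))

# Assignment unrelated to the features (no sample bias)
random_assignment = rng.integers(0, 2, size=n)
# Assignment that depends on the first feature (sample bias)
biased_assignment = (features[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

def treatment_auc(X, t):
    """Mean cross-validated ROC AUC for predicting treatment from features."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, X, t, cv=5, scoring="roc_auc").mean()

auc_random = treatment_auc(features, random_assignment)
auc_biased = treatment_auc(features, biased_assignment)
print(auc_random, auc_biased)
```

An AUC near 0.5 suggests the groups are comparable on the features used; a clearly higher value suggests the features differ between Test and Control.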

### Uplift modeling

In [None]:
from causalml.inference.meta import BaseTRegressor
from xgboost import XGBRegressor
from causalml.inference.meta import XGBTRegressor

# Approach 1
y = PP_master_df['Delta_feb_mar']
# Approach 2
#y = PP_master_df['Percentage_change_feb_mar']
# Approach 3
#y = PP_master_df['Cashflow_march_1_29']

PP_master_df['target'] = y

In [None]:
data = pd.concat([
    pd.DataFrame({"y": y, "treatment": treatment}),
    pd.DataFrame(X)],
    axis = 1
)
#data

In [None]:
# Plot histograms of the target for the test (left) and control (right) groups
fig, axes = plt.subplots(1,2)
Hist_test = PP_master_df[PP_master_df.Treatment == 1]
Hist_control = PP_master_df[PP_master_df.Treatment == 0]
Hist_test.hist('target',bins=30,ax=axes[0])
Hist_control.hist('target',bins=30,ax=axes[1])

In [None]:
# Check variances of the target in the test and control groups
print(Hist_test['target'].var(), Hist_control['target'].var())

#### Model 1

In [None]:
X_train, X_test, y_train, y_test, treatment_train, treatment_test = train_test_split(X, y, treatment, test_size=0.30, random_state=42)

In [None]:
## Training T-learner on train
learner_t = XGBTRegressor(learner=XGBRegressor(random_state=42))
learner_t.fit(X=X_train, treatment=treatment_train, y=y_train)

## Get predictions on the test set
t_pred = learner_t.predict(X=X_test)
#uplift, outcome_c, outcome_t = learner_t.predict(X=X_test, return_components=True)

## Aggregating everything in a dataframe
df = pd.DataFrame({'y': y_test,
                   'w': treatment_test,
                   'T-Learner': t_pred.reshape(-1)
                  })
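
For intuition, a T-learner fits one outcome model on the control group and another on the treated group, then takes the difference of their predictions as the uplift estimate. The sketch below shows this with plain scikit-learn on synthetic data. It is a simplified stand-in, not causalml's implementation, and all variable names and the data-generating process are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
features = rng.normal(size=(n, 3))
w = rng.integers(0, 2, size=n)                 # random treatment assignment
true_uplift = 2.0 * (features[:, 0] > 0)       # treatment helps only when feature 0 is positive
outcome = features[:, 1] + w * true_uplift + rng.normal(scale=0.1, size=n)

# One outcome model per group
model_c = RandomForestRegressor(n_estimators=100, random_state=0)
model_c.fit(features[w == 0], outcome[w == 0])
model_t = RandomForestRegressor(n_estimators=100, random_state=0)
model_t.fit(features[w == 1], outcome[w == 1])

# Estimated uplift = predicted outcome if treated minus predicted outcome if not treated
uplift_hat = model_t.predict(features) - model_c.predict(features)

mean_uplift_pos = uplift_hat[features[:, 0] > 0].mean()
mean_uplift_neg = uplift_hat[features[:, 0] <= 0].mean()
print(mean_uplift_pos, mean_uplift_neg)
```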

In [None]:
# Lift plot for cumulative raw sales due to campaign
from causalml.metrics import plot
plot(df,kind='lift', outcome_col='y', treatment_col='w',figsize=(10, 3.3))

In [None]:
# Qini plot (where uplift is in y-axis, which is test - control for the given segment)
plot(df,kind='qini', outcome_col='y', treatment_col='w',figsize=(10, 3.3))

In [None]:
from causalml.metrics import auuc_score, qini_score
print('\nQINI Score\n',qini_score(df))

In [None]:
print('AUUC:\n',auuc_score(df))
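
To make the curves above more concrete, the sketch below computes a simplified cumulative gain curve by hand: outlets are sorted by predicted uplift, and for each top-k segment the difference in mean outcome between treated and control is scaled by k. This is an illustrative approximation on synthetic data and may differ in detail from what `causalml.metrics` computes; the function `cumulative_gain` and all data are hypothetical.

```python
import numpy as np
import pandas as pd

def cumulative_gain(y, w, score):
    """Top-k (treated mean - control mean) * k, with rows ranked by score descending."""
    df = pd.DataFrame({"y": y, "w": w, "score": score})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    n_t = df["w"].cumsum()
    n_c = (1 - df["w"]).cumsum()
    sum_t = (df["y"] * df["w"]).cumsum()
    sum_c = (df["y"] * (1 - df["w"])).cumsum()
    # Difference in mean outcomes within the top-k; 0 while one group is still empty
    lift = (sum_t / n_t.replace(0, np.nan) - sum_c / n_c.replace(0, np.nan)).fillna(0.0)
    return lift * np.arange(1, len(df) + 1)

rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
y = w * 2.0 * (x0 > 0) + rng.normal(size=n)   # uplift exists only when x0 > 0

gain_model = cumulative_gain(y, w, score=x0)                     # informative ranking
gain_random = cumulative_gain(y, w, score=rng.normal(size=n))    # uninformative ranking
print(gain_model.iloc[n // 2 - 1], gain_random.iloc[n // 2 - 1])
```

A ranking that carries information about uplift accumulates gain faster than a random one, while both curves end at the same value once the whole population is included.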

In [None]:
# Feature importances using SHAP
import shap
from sklearn.ensemble import RandomForestRegressor

# Raw SHAP values
shap_values = learner_t.get_shap_values(X=X_test,
                                        tau=learner_t.predict(X_test),
                                        # optionally specify the model that is fitted on tau to derive feature importances
                                        model_tau_feature = RandomForestRegressor(n_estimators=100))
#shap_values

In [None]:
# SHAP importance plot
learner_t.plot_shap_values(X=X_test, tau=learner_t.predict(X_test))