# How-Many-Customers-Stay
This notebook explores the power of gradient boosting on a binary classification problem: predicting whether a customer leaves (the `Exited` column).

In [1]:
import pandas as pd
import numpy as np

## Import data

In [2]:
train_df = pd.read_csv('train.csv', index_col = 0)
test_df = pd.read_csv('test.csv', index_col = 0)

In [3]:
train_df = train_df.assign(is_train = 1)
test_df = test_df.assign(is_train = 0)

train_df = train_df.drop(["CustomerId","Surname"],axis=1)
test_df = test_df.drop(["CustomerId","Surname"],axis=1)

test_df.head()

| RowNumber | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7500 | 477 | Germany | Male | 34 | 8 | 139959.55 | 2 | 1 | 1 | 189875.83 | 0 |
| 7501 | 827 | Spain | Female | 35 | 0 | 0.0 | 2 | 0 | 1 | 184514.01 | 0 |
| 7502 | 726 | France | Female | 53 | 1 | 113537.73 | 1 | 0 | 1 | 28367.21 | 0 |
| 7503 | 600 | France | Female | 43 | 5 | 134022.06 | 1 | 1 | 0 | 194764.83 | 0 |
| 7504 | 624 | France | Male | 37 | 0 | 0.0 | 2 | 0 | 0 | 112104.55 | 0 |


## Exploratory Data Analysis

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7500 entries, 0 to 7499
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      7500 non-null   int64  
 1   Geography        7500 non-null   object 
 2   Gender           7500 non-null   object 
 3   Age              7500 non-null   int64  
 4   Tenure           7500 non-null   int64  
 5   Balance          7500 non-null   float64
 6   NumOfProducts    7500 non-null   int64  
 7   HasCrCard        7500 non-null   int64  
 8   IsActiveMember   7500 non-null   int64  
 9   EstimatedSalary  7500 non-null   float64
 10  Exited           7500 non-null   int64  
 11  is_train         7500 non-null   int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 761.7+ KB


We take a quick look at the features in the dataset and make sure there are no null values.

In [5]:
train_df.describe()

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | is_train |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 | 7500.0 |
| mean | 650.378667 | 38.959333 | 5.026 | 76968.155137 | 1.5304 | 0.7076 | 0.512667 | 100178.652747 | 0.207867 | 1.0 |
| std | 96.556472 | 10.526458 | 2.890266 | 62467.964419 | 0.58131 | 0.454895 | 0.499873 | 57595.705469 | 0.405808 | 0.0 |
| min | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 91.75 | 0.0 | 1.0 |
| 25% | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51193.135 | 0.0 | 1.0 |
| 50% | 652.0 | 37.0 | 5.0 | 98196.235 | 1.0 | 1.0 | 1.0 | 100114.385 | 0.0 | 1.0 |
| 75% | 717.0 | 44.0 | 7.0 | 128208.9125 | 2.0 | 1.0 | 1.0 | 149595.1125 | 0.0 | 1.0 |
| max | 850.0 | 92.0 | 10.0 | 238387.56 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 | 1.0 |


In [7]:
train_df.describe(include='O')

| | Geography | Gender |
|---|---|---|
| count | 7500 | 7500 |
| unique | 3 | 2 |
| top | France | Male |
| freq | 3735 | 4109 |


These two categorical attributes need to be encoded later.

In [8]:
cat_fea = ['Geography','Gender']

In [9]:
train_df['Exited'].value_counts(normalize=True)

0    0.792133
1    0.207867
Name: Exited, dtype: float64

The normalized class distribution shows that the dataset is imbalanced: about 79% of customers stayed (`Exited` = 0) and about 21% exited (`Exited` = 1).
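To put a number on the imbalance, here is a minimal sketch that recovers the class counts from the printed proportions and the 7500 training rows, then computes the ratio of negatives to positives. The variable names are illustrative and not part of the notebook. A ratio like this is often used as a starting point when choosing class weights.

```python
# Class counts implied by the printed proportions (0.792133 and 0.207867 of 7500 rows)
n_total = 7500
n_exited = round(0.207867 * n_total)   # customers with Exited == 1
n_stayed = n_total - n_exited          # customers with Exited == 0

# Ratio of negatives to positives: how many stayers there are per exiter
imbalance_ratio = n_stayed / n_exited

print(n_stayed, n_exited, round(imbalance_ratio, 2))
```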

## Data Preprocessing

In [10]:
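# Stack train and test so both receive identical dummy columns
all_df = pd.concat([train_df,test_df])
# Gender is binary, so a single indicator column (Male) is enough
all_df = pd.concat([all_df,pd.get_dummies(all_df['Gender'], drop_first=True)],axis=1)
# Geography keeps one indicator column per country
all_df = pd.concat([all_df,pd.get_dummies(all_df['Geography'], drop_first=False)],axis=1)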

In [11]:
all_df.drop(cat_fea, axis=1, inplace=True)

In [12]:
train = all_df[all_df['is_train']==1].drop(['is_train'], axis=1)
test = all_df[all_df['is_train']==0].drop(['is_train','Exited'], axis=1)
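One plausible reason for stacking train and test before encoding is that both splits then end up with the same dummy columns, even if a category is missing from one of them. The sketch below illustrates this on toy frames with made-up values. The `toy_*`, `separate_*` and `enc_*` names are hypothetical, and the real test set may well contain every category.

```python
import pandas as pd

# Toy frames (hypothetical values); the test split is missing the 'Spain' category
toy_train = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany'], 'Age': [38, 28, 34]})
toy_test = pd.DataFrame({'Geography': ['France', 'Germany'], 'Age': [45, 40]})

# Encoding each split on its own yields mismatched columns
separate_train = pd.get_dummies(toy_train, columns=['Geography'])
separate_test = pd.get_dummies(toy_test, columns=['Geography'])

# Encoding the stacked frame, then splitting by a flag, keeps the columns aligned
stacked = pd.concat([toy_train.assign(is_train=1), toy_test.assign(is_train=0)])
encoded = pd.get_dummies(stacked, columns=['Geography'])
enc_train = encoded[encoded['is_train'] == 1].drop('is_train', axis=1)
enc_test = encoded[encoded['is_train'] == 0].drop('is_train', axis=1)

print(list(separate_train.columns) == list(separate_test.columns))
print(list(enc_train.columns) == list(enc_test.columns))
```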

In [13]:
train.head()

| RowNumber | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Male | France | Germany | Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 683 | 38 | 5 | 127616.56 | 1 | 1 | 0 | 123846.07 | 0.0 | 0 | 1 | 0 | 0 |
| 1 | 619 | 28 | 3 | 0.0 | 2 | 1 | 0 | 53394.12 | 0.0 | 0 | 1 | 0 | 0 |
| 2 | 718 | 34 | 5 | 113922.44 | 2 | 1 | 0 | 30772.22 | 0.0 | 1 | 0 | 0 | 1 |
| 3 | 616 | 45 | 3 | 143129.41 | 2 | 0 | 1 | 64327.26 | 0.0 | 1 | 0 | 1 | 0 |
| 4 | 787 | 40 | 6 | 0.0 | 2 | 1 | 1 | 84151.98 | 0.0 | 0 | 1 | 0 | 0 |


In [14]:
from sklearn.model_selection import GridSearchCV

Since our dataset is small, we use cross-validation to test and tune the model.
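With an imbalanced target, it helps if every fold keeps roughly the same share of positives. For a classifier and an integer `cv`, scikit-learn's `GridSearchCV` uses stratified folds. The sketch below is a simplified pure-Python illustration of that idea, not scikit-learn's implementation. The function name and toy labels are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy labels (hypothetical): 20% positives, similar in spirit to the Exited column
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels, k=5)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives / len(fold))
```

Each fold serves once as the validation set while the other four are used for training.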

Next, we compare a very basic model, a **Decision Tree**, with a gradient boosting ensemble method, **XGBoost**, which is also tree-based.

In [15]:
df_submission = pd.read_csv('samplesubmission.csv')

def testing(model, X_train, Y_train, X_test, model_name, drop_cat=None):
    # Optionally drop the same columns from both feature sets
    if drop_cat is not None:
        X_train = X_train.drop(drop_cat, axis=1)
        X_test = X_test.drop(drop_cat, axis=1)

    # Fit on the full training set and write the test predictions to a submission file
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    df_submission['Exited'] = pred.astype(int)
    df_submission.to_csv('submission_2_' + model_name + '.csv', index = False)

## Model Training: Decision Tree

In [54]:
from sklearn import tree

grid = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(criterion="gini", class_weight="balanced", random_state=1),
    param_grid={'max_depth': [6, 8, 10, 12], 'min_samples_split': [2, 4, 6], 'min_samples_leaf': [1, 3, 5]},
    n_jobs=-1, cv=5, scoring='f1')

grid.fit(train.drop('Exited',axis=1), train.Exited)

GridSearchCV(cv=5,
             estimator=DecisionTreeClassifier(class_weight='balanced',
                                              random_state=1),
             n_jobs=-1,
             param_grid={'max_depth': [6, 8, 10, 12],
                         'min_samples_leaf': [1, 3, 5],
                         'min_samples_split': [2, 4, 6]},
             scoring='f1')

In [55]:
print("Best: %f" %(grid.best_score_))
for key,value in grid.best_params_.items():
    print(key+" : %f" %(value))

Best: 0.578536
max_depth : 6.000000
min_samples_leaf : 1.000000
min_samples_split : 6.000000
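
Because the search uses `scoring='f1'`, the "Best" value is the mean F1 score across the five folds. F1 is the harmonic mean of precision and recall for the positive class. As a minimal sketch, with confusion counts that are hypothetical and not taken from this notebook:

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for one validation fold (not taken from this notebook)
tp, fp, fn = 60, 50, 40
print(round(f1_from_counts(tp, fp, fn), 4))
```

Unlike accuracy, F1 ignores true negatives, which makes it a reasonable choice when most customers belong to the negative class.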


In [56]:
grid = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(criterion="gini", class_weight="balanced", random_state=1),
    param_grid={'max_depth': [5, 6, 7], 'min_samples_split': [5, 6, 7], 'min_samples_leaf': [1, 2]},
    n_jobs=-1, cv=5, scoring='f1')

grid.fit(train.drop('Exited',axis=1), train.Exited)

GridSearchCV(cv=5,
             estimator=DecisionTreeClassifier(class_weight='balanced',
                                              random_state=1),
             n_jobs=-1,
             param_grid={'max_depth': [5, 6, 7], 'min_samples_leaf': [1, 2],
                         'min_samples_split': [5, 6, 7]},
             scoring='f1')

In [57]:
print("Best: %f" %(grid.best_score_))
for key,value in grid.best_params_.items():
    print(key+" : %f" %(value))

Best: 0.578536
max_depth : 6.000000
min_samples_leaf : 1.000000
min_samples_split : 6.000000


In [59]:
dt = tree.DecisionTreeClassifier(criterion="gini", class_weight="balanced", max_depth=6, min_samples_leaf=1, min_samples_split=6, random_state=1)
testing(dt, train.drop('Exited',axis=1), train.Exited, test, 'tree')

## Model Training: XGBoost

In [60]:
from xgboost.sklearn import XGBClassifier
import xgboost as xgb

First, we check the optimal number of trees using the `cv` function of xgboost.

In [61]:
target = 'Exited'

def modelfit(alg, dtrain, predictors, cv_folds=5, early_stopping_rounds=50):
    # Run xgboost's built-in cross-validation with early stopping on AUC;
    # per-round scores are printed because verbose_eval=True
    xgb_param = alg.get_xgb_params()
    xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
    xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
           metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=True)
        

def grid_tuning(sta, var):
    grid = GridSearchCV (
        estimator = XGBClassifier(**sta),
        param_grid = var,
        n_jobs=-1, cv=5, scoring='f1')
    grid.fit(train.drop('Exited',axis=1), train.Exited)
    print("Best: %f" %(grid.best_score_))
    for key,value in grid.best_params_.items():
        print(key+" : %f" %(value))

In [62]:
predictors = test.columns 

xgb1 = XGBClassifier(
 learning_rate=0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 scale_pos_weight=1,
 random_state=20)

modelfit(xgb1, train, predictors)

[0]	train-auc:0.847234+0.00423116	test-auc:0.836662+0.00680219
[1]	train-auc:0.853176+0.00486974	test-auc:0.842808+0.00747938
[2]	train-auc:0.860545+0.00429672	test-auc:0.847768+0.00639288
[3]	train-auc:0.863788+0.00494137	test-auc:0.851627+0.00731558
[4]	train-auc:0.867534+0.00412619	test-auc:0.855928+0.00578762
[5]	train-auc:0.870795+0.00419033	test-auc:0.857571+0.00768679
[6]	train-auc:0.872028+0.00304867	test-auc:0.857461+0.00662626
[7]	train-auc:0.875303+0.00373494	test-auc:0.859877+0.00719701
[8]	train-auc:0.877431+0.00303124	test-auc:0.860898+0.00683028
[9]	train-auc:0.879211+0.00328173	test-auc:0.861922+0.00656606
[10]	train-auc:0.881168+0.00303745	test-auc:0.863439+0.00617217
[11]	train-auc:0.882464+0.00240792	test-auc:0.863896+0.0060595
[12]	train-auc:0.88274+0.00242652	test-auc:0.863997+0.00562744
[13]	train-auc:0.883449+0.00256611	test-auc:0.864334+0.00581241
[14]	train-auc:0.884924+0.00285812	test-auc:0.86513+0.00566131
[15]	train-auc:0.886218+0.00258294	test-auc:0.866187+
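
The cross-validation call uses `early_stopping_rounds=50`, so training stops once the test AUC has not improved for 50 consecutive rounds. The sketch below illustrates that stopping rule on made-up scores. It is a simplified stand-in, not xgboost's internal code, and the function name and toy values are hypothetical.

```python
def best_round_with_early_stopping(scores, patience):
    """Return the index of the best score, stopping once `patience` rounds pass without a new best."""
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            break
    return best_index

# Hypothetical per-round validation AUC values (not from the log above)
toy_auc = [0.836, 0.843, 0.848, 0.852, 0.851, 0.850, 0.849, 0.853]
print(best_round_with_early_stopping(toy_auc, patience=3))
```

With a short patience the search stops before seeing the late improvement in the last round, which is why a generous value such as 50 is a common choice.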

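Next, we tune *max_depth*, the maximum depth of a tree, and *min_child_weight*, the minimum sum of instance weights (Hessian) required in a child node, which plays a role similar to `min_samples_leaf` in scikit-learn trees.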

In [63]:
variable_params = {'max_depth':range(3,10,2), 'min_child_weight':range(1,6,2)}
static_params = {'objective':'binary:logistic', 'learning_rate':0.1, 'n_estimators':119, 'gamma':0, 'subsample':0.8, 'colsample_bytree':0.8, 'scale_pos_weight':1, 'random_state':20}
grid_tuning(static_params, variable_params)

Best: 0.588710
max_depth : 7.000000
min_child_weight : 1.000000
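
To get a feel for the cost of this search, the sketch below expands the same `variable_params` grid and counts the candidates. Only the grid itself comes from the notebook; the other names are illustrative.

```python
from itertools import product

variable_params = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 6, 2)}

# Every combination of the listed values is one candidate model
candidates = [dict(zip(variable_params, combo)) for combo in product(*variable_params.values())]
cv_folds = 5
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

A coarse grid like this one, followed by a finer grid around the best values, keeps the number of fits manageable.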


In [64]:
variable_params = {'max_depth':[6,7,8], 'min_child_weight':[1,2]}
grid_tuning(static_params, variable_params)

Best: 0.588710
max_depth : 7.000000
min_child_weight : 1.000000


With the new parameters fixed, we re-run the cross-validation to update n_estimators.

In [65]:
xgb2 = XGBClassifier(
 learning_rate=0.1,
 n_estimators=1000,
 max_depth=7,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 scale_pos_weight=1,
 random_state=20)

modelfit(xgb2, train, predictors)

[0]	train-auc:0.868904+0.00295162	test-auc:0.842754+0.00481612
[1]	train-auc:0.876683+0.00483307	test-auc:0.848634+0.0103694
[2]	train-auc:0.885724+0.00302127	test-auc:0.853018+0.00823052
[3]	train-auc:0.888694+0.00305368	test-auc:0.85545+0.00854117
[4]	train-auc:0.892293+0.00290103	test-auc:0.857688+0.00771907
[5]	train-auc:0.895581+0.00192945	test-auc:0.85919+0.00829884
[6]	train-auc:0.897472+0.00102625	test-auc:0.85899+0.00789364
[7]	train-auc:0.901719+0.00181447	test-auc:0.861209+0.0082306
[8]	train-auc:0.904614+0.0023968	test-auc:0.861585+0.00788253
[9]	train-auc:0.908208+0.00271857	test-auc:0.861403+0.00711999
[10]	train-auc:0.910284+0.0025468	test-auc:0.86244+0.00663055
[11]	train-auc:0.911819+0.00171067	test-auc:0.862142+0.00653955
[12]	train-auc:0.913279+0.00146025	test-auc:0.86267+0.00603324
[13]	train-auc:0.914945+0.00177506	test-auc:0.861994+0.0059501
[14]	train-auc:0.916842+0.00203339	test-auc:0.863094+0.00547334
[15]	train-auc:0.918323+0.00194882	test-auc:0.864339+0.00556

Next, we tune *gamma*, the minimum loss reduction required to make a split; *subsample*, the fraction of observations sampled for each tree; and *colsample_bytree*, the fraction of columns sampled for each tree (similar to `max_features` in scikit-learn).

In [66]:
variable_params = {'gamma':[0, 0.05, 0.1], 'subsample':[0.7, 0.8, 0.9], 'colsample_bytree':[0.7, 0.8, 0.9]}
static_params = {'max_depth':7, 'min_child_weight':1, 'objective':'binary:logistic', 'learning_rate':0.1, 'n_estimators':81, 'scale_pos_weight':1, 'random_state':20}
grid_tuning(static_params, variable_params)

Best: 0.594302
colsample_bytree : 0.800000
gamma : 0.000000
subsample : 0.800000
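
To make the role of *gamma* concrete, here is a sketch of the split gain formula from the XGBoost documentation's introduction to boosted trees, where a split is only worthwhile if its gain stays positive after subtracting gamma. The function name and the gradient statistics are hypothetical, and `reg_lambda=1.0` is assumed as the L2 regularization term.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split from summed gradients (g) and Hessians (h) on each side."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma

# Hypothetical gradient statistics for one candidate split
stats = dict(g_left=-4.0, h_left=3.0, g_right=2.0, h_right=2.0)
for gamma in (0.0, 0.05, 0.1):
    print(gamma, round(split_gain(**stats, gamma=gamma), 4))
```

A larger gamma makes the tree more conservative, since weak splits no longer clear the threshold.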


Next, we tune *scale_pos_weight*, which controls the weight of positive examples relative to negative examples.

In [68]:
variable_params = {'scale_pos_weight':[1,1.5,2]}
static_params = {'colsample_bytree': 0.8, 'gamma':0, 'subsample':0.8, 'max_depth':7, 'min_child_weight':1, 'objective':'binary:logistic', 'learning_rate':0.1, 'n_estimators':81, 'random_state':20}
grid_tuning(static_params, variable_params)

Best: 0.619473
scale_pos_weight : 2.000000
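
The effect of *scale_pos_weight* can be pictured as multiplying the loss contribution of every positive example. The sketch below shows this on a weighted logistic loss with made-up labels and probabilities. It illustrates the idea only and is not xgboost's internal code; the function name and toy values are hypothetical.

```python
import math

def weighted_log_loss(labels, probs, scale_pos_weight=1.0):
    """Mean logistic loss where each positive example's term is multiplied by scale_pos_weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        weight = scale_pos_weight if y == 1 else 1.0
        total += -weight * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Hypothetical data: one poorly predicted positive among four negatives
toy_labels = [1, 0, 0, 0, 0]
toy_probs = [0.3, 0.2, 0.2, 0.2, 0.2]
for w in (1, 1.5, 2):
    print(w, round(weighted_log_loss(toy_labels, toy_probs, scale_pos_weight=w), 4))
```

Raising the weight makes mistakes on exiting customers more expensive, which tends to improve recall on the positive class at the cost of more false positives.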


Although scale_pos_weight > 2 gives a higher CV score, I restrict scale_pos_weight to be <= 2 to avoid overfitting.

Lastly, we lower the learning rate, which lets us add more trees.

In [69]:
xgb3 = XGBClassifier(
 learning_rate=0.01,
 n_estimators=1000,
 max_depth=7,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 scale_pos_weight=2,
 random_state=20)

modelfit(xgb3, train, predictors)

[0]	train-auc:0.875633+0.00183756	test-auc:0.842854+0.00588081
[1]	train-auc:0.883066+0.0036854	test-auc:0.844398+0.0092044
[2]	train-auc:0.891598+0.00353096	test-auc:0.849709+0.0041049
[3]	train-auc:0.894337+0.00309053	test-auc:0.85349+0.0045578
[4]	train-auc:0.896439+0.00212776	test-auc:0.854398+0.00629576
[5]	train-auc:0.898862+0.00229122	test-auc:0.856284+0.00782554
[6]	train-auc:0.898786+0.000823369	test-auc:0.857331+0.00738492
[7]	train-auc:0.901977+0.00198587	test-auc:0.859886+0.0077526
[8]	train-auc:0.904125+0.00194065	test-auc:0.860371+0.00790004
[9]	train-auc:0.906445+0.00288985	test-auc:0.8609+0.00812829
[10]	train-auc:0.907896+0.00187214	test-auc:0.862712+0.00653065
[11]	train-auc:0.907891+0.00129872	test-auc:0.862848+0.00639722
[12]	train-auc:0.907833+0.00136872	test-auc:0.862502+0.00599227
[13]	train-auc:0.908291+0.00218172	test-auc:0.862391+0.00638941
[14]	train-auc:0.909053+0.0029145	test-auc:0.863528+0.00638487
[15]	train-auc:0.909937+0.00294195	test-auc:0.864456+0.006
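
A lower learning rate shrinks the contribution of each tree, so more trees are needed to reach the same fit. The toy sketch below makes that trade-off visible under a strong simplifying assumption: every tree fits the current residual perfectly, so each round removes a `learning_rate` fraction of what is left. The function name and tolerance are hypothetical, and real boosting behaves less regularly.

```python
def rounds_to_tolerance(learning_rate, tolerance=0.01):
    """Count boosting rounds until the remaining residual fraction drops to `tolerance`.

    Toy assumption: every tree fits the current residual perfectly, so each round
    removes a `learning_rate` fraction of what is left.
    """
    residual = 1.0
    rounds = 0
    while residual > tolerance:
        residual *= (1.0 - learning_rate)
        rounds += 1
    return rounds

for eta in (0.1, 0.01):
    print(eta, rounds_to_tolerance(eta))
```

Under this assumption, cutting the learning rate by a factor of ten requires roughly ten times as many rounds.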

Final decision:

In [72]:
xgb_model = XGBClassifier(learning_rate=0.01,
 n_estimators=110,
 max_depth=7,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 scale_pos_weight=2,
 random_state=20)

testing(xgb_model, train.drop('Exited',axis=1), train.Exited, test, 'xgb')

Running evaluate_2 gives the following test F1 scores:<br />
for Decision Tree: **0.5470**<br />
for XGBoost: **0.6314**<br />
The F1 score of XGBoost is about 8.4 percentage points higher than that of the decision tree (roughly 15% in relative terms)!
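
The gap between the two reported scores can be read in absolute or relative terms. A quick check using the two test F1 values above (the variable names are illustrative):

```python
f1_tree = 0.5470
f1_xgb = 0.6314

absolute_gain = f1_xgb - f1_tree            # difference in F1 points
relative_gain = absolute_gain / f1_tree     # improvement relative to the tree

print(round(absolute_gain, 4), round(100 * relative_gain, 1))
```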