# Predicting Effect of Marketing Campaign

Input variables:

bank client data:
   * age (numeric)
   * job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services") 
   * marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
   * education (categorical: "unknown","secondary","primary","tertiary")
   * default: has credit in default? (binary: "yes","no")
   * balance: average yearly balance, in euros (numeric) 
   * housing: has housing loan? (binary: "yes","no")
   * loan: has personal loan? (binary: "yes","no")

related with the last contact of the current campaign:
   * contact: contact communication type (categorical: "unknown","telephone","cellular") 
   * day: last contact day of the month (numeric)
   * month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
   * duration: last contact duration, in seconds (numeric)

other attributes:
   * campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
   * pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
   * previous: number of contacts performed before this campaign and for this client (numeric)
   * poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):
   * y - has the client subscribed a term deposit? (binary: "yes","no")

Source: https://archive.ics.uci.edu/ml/datasets/bank+marketing

# Import required packages

In [1]:
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Set the parameters for the run

In [2]:
model_name = 'bank_deposit_campaign.joblib'

# Get the data

In [10]:
df = pd.read_csv('bank.csv', sep=';')

In [11]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

In [12]:
df.rename(columns={'y':'subscribed'}, inplace=True)

In [13]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [14]:
df.dtypes

age            int64
job           object
marital       object
education     object
default       object
balance        int64
housing       object
loan          object
contact       object
day            int64
month         object
duration       int64
campaign       int64
pdays          int64
previous       int64
poutcome      object
subscribed    object
dtype: object

# Train Test Split

In [16]:
y = df['subscribed']
X = df.drop('subscribed', axis=1)

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# Exploratory Data Analysis

In [18]:
y_train.head()

3480    no
3344    no
1939    no
834     no
3715    no
Name: subscribed, dtype: object

In [19]:
type(y_train[1])

str

In [23]:
y_train.value_counts()

no     2788
yes     376
Name: subscribed, dtype: int64

In [20]:
# The y value is a string, convert to int
y_train.astype(int).head()

ValueError: invalid literal for int() with base 10: 'no'

In [89]:
y_train.describe()

count    105000.000000
mean          0.066857
std           0.249776
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: SeriousDlqin2yrs, dtype: float64

In [90]:
len(X_train)

105000

# Set up the model pipeline

In [92]:
pipeline = Pipeline(steps=[
    ('impute_mean', ColumnTransformer(transformers=[
                        ('scalar imputing mean', SimpleImputer(strategy='mean'), X_train.columns),
                        ], remainder='drop')),
    ('scale', ColumnTransformer(transformers=[
                        ('scalar scaling', MinMaxScaler(feature_range=(0, 1)), np.arange(0, len(X_train.columns))),
                        ], remainder='drop')),
    ('GBT', GradientBoostingClassifier())
    ])



# Set up the grid Search

Set the parameters for a grid search over the selected family of models.

In [93]:
# Currently only contains default values.  Commented parameters needn't be used
# in a grid search.
dct_grid = {
    'GBT__n_estimators' : [25, 50, 100],
    'GBT__max_depth'    : [2, 5, 9]
}

Search for the best model.

In [94]:
grid_search = GridSearchCV(pipeline, dct_grid, cv=10, return_train_score=False
                   , scoring=['accuracy', 'precision', 'average_precision', 'neg_log_loss']
                   , refit='average_precision', n_jobs=-1 )
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('impute_mean',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='drop',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('scalar '
                                                                         'imputing '
                                                                         'mean',
                                                                         SimpleImputer(add_indicator=False,
                                                                                       copy=True,
                                                                                 

# Time of Run

In [97]:
pd.DataFrame(grid_search.cv_results_)['mean_fit_time']

0      2.771568
1      4.582119
2      8.994104
3      7.313626
4     15.102574
5     29.684747
6     44.170989
7     75.898678
8    111.161170
Name: mean_fit_time, dtype: float64

In [98]:
time_of_run_in_hours = (pd.DataFrame(grid_search.cv_results_)['mean_fit_time'] * 10).sum() / 60 / 60
print('time of run in hours: {}'.format(time_of_run_in_hours))
hours_per_record = time_of_run_in_hours / 20000
print('hours per record: {}'.format(hours_per_record))
records_in_an_hour = 1 / hours_per_record
print('number of records in 1 hour {}'.format(records_in_an_hour))

time of run in hours: 0.8324432647228241
hours per record: 4.1622163236141207e-05
number of records in 1 hour 24025.66138445403


# Analyze the Model Results

In [106]:
pd.DataFrame(grid_search.cv_results_).T

Unnamed: 0,0,1,2,3,4,5,6,7,8
mean_fit_time,2.77157,4.58212,8.9941,7.31363,15.1026,29.6847,44.171,75.8987,111.161
mean_score_time,0.0563925,0.0743516,0.0943837,0.0737344,0.11627,0.171288,0.118278,0.1719,0.217594
mean_test_accuracy,0.9358,0.937057,0.937238,0.936543,0.936657,0.936533,0.934552,0.933867,0.933181
mean_test_average_precision,0.344293,0.362574,0.368537,0.372784,0.375265,0.372893,0.354394,0.347527,0.339908
mean_test_neg_log_loss,-0.194948,-0.191332,-0.189446,-0.188682,-0.186317,-0.185873,-0.187784,-0.188721,-0.191175
mean_test_precision,0.615378,0.621309,0.606148,0.6072,0.582789,0.574893,0.532536,0.514871,0.500386
param_GBT__max_depth,2,2,2,5,5,5,9,9,9
param_GBT__n_estimators,25,50,100,25,50,100,25,50,100
params,"{'GBT__max_depth': 2, 'GBT__n_estimators': 25}","{'GBT__max_depth': 2, 'GBT__n_estimators': 50}","{'GBT__max_depth': 2, 'GBT__n_estimators': 100}","{'GBT__max_depth': 5, 'GBT__n_estimators': 25}","{'GBT__max_depth': 5, 'GBT__n_estimators': 50}","{'GBT__max_depth': 5, 'GBT__n_estimators': 100}","{'GBT__max_depth': 9, 'GBT__n_estimators': 25}","{'GBT__max_depth': 9, 'GBT__n_estimators': 50}","{'GBT__max_depth': 9, 'GBT__n_estimators': 100}"
rank_test_accuracy,6,2,1,4,3,5,7,8,9


In [99]:
pd.DataFrame(grid_search.cv_results_)[['mean_fit_time', 'param_GBT__max_depth', 'param_GBT__n_estimators',
                                      'mean_test_accuracy', 'mean_test_precision', 'mean_test_average_precision',
                                      'mean_test_neg_log_loss']].T#.to_csv('grid_search_cv_results.csv')

Unnamed: 0,0,1,2,3,4,5,6,7,8
mean_fit_time,2.77157,4.58212,8.9941,7.31363,15.1026,29.6847,44.171,75.8987,111.161
param_GBT__max_depth,2.0,2.0,2.0,5.0,5.0,5.0,9.0,9.0,9.0
param_GBT__n_estimators,25.0,50.0,100.0,25.0,50.0,100.0,25.0,50.0,100.0
mean_test_accuracy,0.9358,0.937057,0.937238,0.936543,0.936657,0.936533,0.934552,0.933867,0.933181
mean_test_precision,0.615378,0.621309,0.606148,0.6072,0.582789,0.574893,0.532536,0.514871,0.500386
mean_test_average_precision,0.344293,0.362574,0.368537,0.372784,0.375265,0.372893,0.354394,0.347527,0.339908
mean_test_neg_log_loss,-0.194948,-0.191332,-0.189446,-0.188682,-0.186317,-0.185873,-0.187784,-0.188721,-0.191175


In [100]:
#pd.DataFrame(grid_sesarch.cv_results_).columns

In [101]:
grid_search.best_estimator_

Pipeline(memory=None,
         steps=[('impute_mean',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('scalar imputing mean',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  Index(['id', 'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30...
                                            learning_rate=0.1, loss='d

# Feature Importance

In [107]:
feature_imp = pd.DataFrame({'column': X_train.columns,
                            'feature_importance': grid_search.best_estimator_.named_steps["GBT"].feature_importances_})
feature_imp = feature_imp.sort_values('feature_importance', ascending=False)
#feature_imp.to_csv('feature_importance_90k_lapse.csv')

In [108]:
feature_imp.head(15)

Unnamed: 0,column,feature_importance
7,NumberOfTimes90DaysLate,0.515592
3,NumberOfTime30-59DaysPastDueNotWorse,0.147328
1,RevolvingUtilizationOfUnsecuredLines,0.103022
9,NumberOfTime60-89DaysPastDueNotWorse,0.092474
2,age,0.036528
5,MonthlyIncome,0.027886
6,NumberOfOpenCreditLinesAndLoans,0.021263
0,id,0.019938
8,NumberRealEstateLoansOrLines,0.01709
4,DebtRatio,0.010905


# Save the Model Output

In [104]:
dump(grid_search, '{}.joblib'.format(model_name))

['credit_default.joblib.joblib']

# Write the Predictions

Evaluate on the Test set

In [109]:
from sklearn.metrics import average_precision_score, log_loss, accuracy_score, precision_score
print('Test ({} samples) performance metrics'.format(len(y_test)))
print('average precision: {}'.format(average_precision_score(y_test, 
                                                            grid_search.predict_proba(X_test)[:, 1])))
print('log loss: {}'.format(log_loss(y_test, grid_search.predict_proba(X_test)[:, 1])))
print('accuracy: {}'.format(accuracy_score(y_test, grid_search.predict(X_test))))
print('precision: {}'.format(precision_score(y_test, grid_search.predict(X_test))))


Test (45000 samples) performance metrics
average precision: 0.39835852319365445
log loss: 0.18009017599069507
accuracy: 0.9371111111111111
precision: 0.5942184154175589


Save the results

In [112]:
test_scores = pd.DataFrame({'id': X_test['id'],
                            'probability': grid_search.predict_proba(X_test)[:, 1],
                            'prediction': grid_search.predict(X_test),
                             'actual': y_test})

In [113]:
test_scores.head()

Unnamed: 0,actual,id,prediction,probability
102264,0,102265,0,0.016047
149055,0,149056,0,0.039565
94569,0,94570,0,0.02733
22569,0,22570,0,0.033487
13656,0,13657,0,0.027812


In [113]:
#test_scores_with_data = pd.concat([test_scores, X_train.drop('id', axis=1)], axis=1)

In [114]:
#test_scores_with_data.head()

In [98]:
#test_scores_with_data.to_csv('{}_scores_with_data.csv'.format(model_name))

# Predcitions Confusion Matrix

In [114]:
cm = confusion_matrix(test_scores['actual'], test_scores['prediction'])
cm

array([[41615,   379],
       [ 2451,   555]])

In [115]:
total = len(test_scores)
perc_vals = [[cm[0][0]/total*100, cm[0][1]/total*100], 
             [cm[1][0]/total*100, cm[1][1]/total*100]]
perc_vals

[[92.47777777777777, 0.8422222222222222],
 [5.446666666666666, 1.2333333333333334]]