# Predictive analysis of Bank Marketing

#### Problem Statement
The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

#### What to achieve?
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

#### The data contains information in the following format:

### Categorical Variables:

* Marital - (Married, Single, Divorced)
* Job - (Management, BlueCollar, Technician, entrepreneur, retired, admin., services, selfemployed, housemaid, student, unemployed, unknown)
* Contact - (Telephone,Cellular,Unknown)
* Education - (Primary,Secondary,Tertiary,Unknown)
* Month - (Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec)
* Poutcome - (Success,Failure,Other,Unknown)
* Housing - (Yes/No)
* Loan - (Yes/No)
* Default - (Yes/No)

### Numerical Variables:

* Age
* Balance
* Day
* Duration
* Campaign
* Pdays
* Previous

#### Class
* y (term deposit subscribed) - (Yes/No)

In [1]:
#Importing required libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
# (sklearn.cross_validation has been removed; model_selection replaces it)
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Importing and displaying data
data = pd.read_csv("bank.csv", delimiter=";",header='infer')
data.head()

In [None]:
#4521 rows and 17 columns (16 input features plus the target y)
data.shape

In [None]:
#datatypes of the columns
data.dtypes

Since some columns have dtypes other than int and float, we need to convert their values to a numeric format before the data can be fitted to the model.
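
A quick way to see which columns need converting is to select them by dtype. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the bank data); the same `select_dtypes` call can be applied to `data`.

```python
import pandas as pd

# Toy frame standing in for the bank data (values are illustrative only).
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "balance": [1787.0, -120.5, 300.0],
    "job": ["services", "management", "retired"],
    "housing": ["yes", "no", "yes"],
})

# Columns with a non-numeric dtype are the ones that need encoding.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("needs encoding:", non_numeric_cols)
print("already numeric:", numeric_cols)
```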

In [None]:
#Convert the object-type (non-numeric categorical) columns to numeric columns using one-hot encoding,
#since XGBoost expects numeric input by default
data_new = pd.get_dummies(data, columns=['job','marital',
                                         'education','default',
                                         'housing','loan',
                                         'contact','month',
                                         'poutcome'])
#pd is the alias under which pandas was imported. get_dummies replaces each listed
#categorical column with one indicator column per category.
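
To make the effect of one-hot encoding concrete, here is a minimal sketch on a made-up two-column frame (the data is illustrative only). Each category of `marital` becomes its own indicator column, while the numeric column is left untouched.

```python
import pandas as pd

# Illustrative data, not taken from bank.csv.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "married"],
})

# One indicator column is created per distinct category of 'marital'.
encoded = pd.get_dummies(toy, columns=["marital"])

print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns hold booleans or 0/1 integers; both are accepted as numeric input.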

In [None]:
#y is the class variable with two unique values, so map it to binary (yes -> 1, no -> 0).
#Assigning the result back avoids relying on an in-place replace through attribute access.
data_new['y'] = data_new['y'].map({'yes': 1, 'no': 0})

In [None]:
#Checking types of all the columns converted
data_new.dtypes

In [None]:
#Our New dataframe ready for XGBoost
data_new.head()

In [None]:
#Splitting data as X -> features and y -> class variable
data_y = pd.DataFrame(data_new['y'])
data_X = data_new.drop(['y'], axis=1)
print(data_X.columns)
print(data_y.columns)

In [None]:
#Split the records into training (70%) and test (30%) sets and print their shapes (rows, cols)
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.3, random_state=2, stratify=data_y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
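
The `stratify` argument keeps the class proportions the same in both splits. A minimal sketch on a synthetic target with 10% positives (the data and variable names here are illustrative, not the bank data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 100 positives out of 1000 rows.
y = pd.Series([1] * 100 + [0] * 900)
X = pd.DataFrame({"feature": np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

# With stratification the share of positives is preserved in each split.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positives in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```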

In [None]:
#Create an XGBoost classifier (it is trained on the 70% training split in the next cell)
from xgboost import XGBClassifier
clf = XGBClassifier()
clf

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
#classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
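
Accuracy alone can be misleading when one class dominates, which is common for marketing response data. The sketch below uses made-up labels (not the notebook's predictions) to show a classifier that always answers "no": it scores 90% accuracy while finding none of the subscribers. The same `metrics` functions can be applied to `y_test` and `y_pred`.

```python
from sklearn import metrics

# Illustrative labels: 90 'no' (0) and 10 'yes' (1).
y_true = [0] * 90 + [1] * 10
# A trivial classifier that always predicts 'no'.
y_pred_all_no = [0] * 100

accuracy = metrics.accuracy_score(y_true, y_pred_all_no)
recall = metrics.recall_score(y_true, y_pred_all_no)
cm = metrics.confusion_matrix(y_true, y_pred_all_no)

print("accuracy:", accuracy)
print("recall for class 1:", recall)
print(cm)  # rows: true class, columns: predicted class
```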

**Using xgb Library**

In [None]:
import xgboost as xgb
# X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.3, random_state=2, stratify=data_y)


In [None]:
#Split into point-prediction training, confidence training, calibration and test sets
X_pp_train, X_conf_train, y_pp_train, y_conf_train = train_test_split(data_X, data_y, test_size=0.66, random_state=10)
X_conf_train, X_cal, y_conf_train, y_cal = train_test_split(X_conf_train, y_conf_train, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_cal, y_cal, test_size=0.5, random_state=0)

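#Random forest used as the baseline in the confidence comparison further down
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=0,
                               n_estimators=800,
                               n_jobs=-1)

model.fit(X_pp_train, y_pp_train.values.ravel())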
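
The three successive `train_test_split` calls in this cell leave roughly a third of the rows for the point-prediction model, a third for the confidence model, and about a sixth each for calibration and testing. A minimal sketch that checks this on a synthetic index array (`idx` and the short variable names are illustrative, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the row indices of the dataset.
idx = np.arange(1000)

# Same split sizes and seeds as in the cell above.
pp, rest = train_test_split(idx, test_size=0.66, random_state=10)
conf, rest = train_test_split(rest, test_size=0.5, random_state=0)
cal, test = train_test_split(rest, test_size=0.5, random_state=0)

for name, part in [("pp_train", pp), ("conf_train", conf), ("cal", cal), ("test", test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(idx):.1%})")
```

The four parts are disjoint, so the confidence model is calibrated and tested on rows the point-prediction model never saw.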



In [None]:
dtrain = xgb.DMatrix(X_pp_train, label=y_pp_train)
#Labels are attached so that dtest can also be used for evaluation in a watchlist
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
watchlist = [(dtrain, 'train'),(dtest, 'val')]
print(watchlist)

In [None]:
#Parameters for training the model
params = {
    'objective':'multi:softprob',
    'max_depth':4,
    'verbosity':0,
    'eta':0.3,
    'gamma': 0,
    'num_class': 2
}
num_rounds=20

In [None]:
XGB_Model = xgb.train(params,dtrain,num_rounds)

In [None]:
XGB_Model.dump_model('dump.rawBank.txt')

In [None]:
y_predict = XGB_Model.predict(dtest)
print(y_predict)
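
With the `multi:softprob` objective, `predict` returns one probability per class for each row rather than a class label. A minimal sketch of turning such an array into labels and per-row confidence, using made-up probabilities (`probs` is illustrative, not actual model output):

```python
import numpy as np

# Illustrative softprob-style output: one row per sample, one column per class.
probs = np.array([
    [0.90, 0.10],
    [0.30, 0.70],
    [0.55, 0.45],
])

# The predicted class is the column with the highest probability.
labels = probs.argmax(axis=1)
# The confidence of that prediction is the highest probability itself.
confidence = probs.max(axis=1)

print(labels)
print(confidence)
```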

In [None]:
from xgboost import plot_importance
from matplotlib import pyplot
plot_importance(XGB_Model)
pyplot.show()

In [None]:
#Tree visualisation (Double tap to zoom)
xgb.plot_tree(XGB_Model, num_trees=2)
fig = plt.gcf()
fig.set_size_inches(150, 300)
fig.savefig('tree.png')

### Let's look at the performance of the model on unseen data and compare it with the random forest confidence predictions

In [None]:
# pip install plot_utils
from plotting_utils import plot_prediction_conf_surface, plot_macest_sklearn_comparison_surface

In [None]:
from sklearn.calibration import CalibratedClassifierCV

from macest.classification import models as clmod
from macest.classification import plots as clplot

macest_model = clmod.ModelWithConfidence(XGB_Model, X_conf_train, y_conf_train)

macest_model.fit(X_cal, y_cal)

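#multi:softprob returns one probability per class, so take the argmax to get class labels
preds = np.argmax(XGB_Model.predict(dtest), axis=1)
conf_preds = macest_model.predict_proba(X_test)
#The random forest is a scikit-learn model, so it takes the feature frame rather than a DMatrix
rf_conf_preds = model.predict_proba(X_test)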

In [None]:
macest_point_prediction_conf = macest_model.predict_confidence_of_point_prediction(X_test)

rf_point_prediction_conf = np.amax(rf_conf_preds, axis=1)

In [None]:
clplot.plot_calibration_curve([rf_point_prediction_conf,
                               macest_point_prediction_conf],
                              ['Random Forest', 'MACEst'],
                              preds,
                              y_test)

### Let's compare calibration and forecast metrics

In [None]:
clplot.plot_calibration_metrics([rf_point_prediction_conf,
                                 macest_point_prediction_conf],
                                ['RF', 'MACEst'], preds, y_test)

In [None]:
clplot.plot_forecast_metrics([rf_point_prediction_conf,
                              macest_point_prediction_conf],
                             ['RF', 'MACEst'], preds, y_test)

## What does the surface learnt by the random forest look like?

#### As expected, it is able to partition the space around the circles. It then extrapolates this surface and predicts anything outside the inner circle as blue, no matter how far we are from the data. This is probably what we want to happen.

#### If we now look at the confidence plot, we see that, apart from the boundary between the two circles, the random forest is >90% confident. This high confidence is also extrapolated very far from the data, which is probably bad, as we have no relevant data 20 standard deviations away from the data.

In [None]:
plot_prediction_conf_surface(3, 25, model, X_pp_train=X_pp_train, plot_training_data=False)

## Let's now compare those confidence estimates with MACEst

#### MACEst also learns to be very confident close to the data. However, as we move further away from the data, the confidence drops off. At a certain point MACEst returns an approximately uniform distribution, which suggests that we do not have the relevant data to be confident about the prediction.

In [None]:
plot_macest_sklearn_comparison_surface(3, 25, macest_model, model, X_pp_train=X_pp_train, plot_training_data=True)
