### Prediction of Customer Response to a Marketing Campaign

o   The data comes from the direct marketing campaigns of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit, given the customer information as input variables.

o   This self-directed project classifies customers with logistic regression, KNN, decision trees, and Random Forest, and diagnoses each model with several metrics.

o   First, exploratory data analysis is performed to understand the data, followed by data pre-processing. Outliers are detected with box plots. Because the data is imbalanced, a resampling technique is used to build the training data.

o   Implemented the models and checked evaluation metrics such as the confusion matrix and the area under the ROC curve (AUC).

o   Performed exploratory data analysis for outlier detection and used a resampling technique to deal with imbalanced data.

o   Solved the binary classification problem with supervised machine learning methods (logistic regression, KNN, decision tree, and the ensemble learning algorithm Random Forest) and compared their evaluation metrics.

o   Employed grid search to select the best parameters, with the area under the ROC curve as the metric, and trained the model.

o   Plotted gain and lift charts to identify the customers most likely to respond and to target them in the campaign.



In [None]:
#importing useful packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Exploratory Data Analysis

In [None]:
#Loading the bank data from UCI machine learning repository [Moro et al., 2011]

df = pd.read_csv('bank.csv',sep=';')
df.info()  # info() prints its summary directly and returns None
display(df.head(5))


The data contains 4521 observations and 17 columns (attributes) in total.
Not all columns are needed to build the model; we keep only the features that can explain the variation in the dependent variable.
Here 'y' is the dependent variable: it records whether the customer subscribed to the term deposit.
Only two output classes are present ("yes" or "no").

In [None]:
#Categorical features and the count of each category in every such column
lis=list(df.select_dtypes(include=['object']).columns)
for i in lis:
    print(i)
    display(df[i].value_counts())

In [None]:
#First, check whether the data contains any null values
display(df.isnull().values.any())
display(df.isnull().sum())
#There are no null values in the data

In [None]:
#Dropping contact, day, month, duration, pdays and poutcome as they don't explain the variation in the dependent variable
new_df=df.drop(['contact','day','month','duration','pdays','poutcome'],axis=1)
bank_df=new_df.rename(columns={'housing':'housing-loan','loan':'personal-loan','campaign':'current-campaign','previous':'previous-campaign','y':'subscribed'})
bank_df.head(5)

In [None]:
bank_df.info()
#Count the customers who subscribed, using a vectorized comparison instead of a row loop
count=(bank_df.subscribed=="yes").sum()
print('No. of customers who subscribed to the term deposit:',count,'out of',len(bank_df),'customers contacted')

#### Outlier Analysis

In [None]:
import seaborn as sn
display(bank_df[['balance']].boxplot())
bank_df.hist(column='balance')

In [None]:
display(bank_df[['age']].boxplot())
bank_df.hist(column='age')

In [None]:
bank_df[['current-campaign']].boxplot()
bank_df.hist(column='current-campaign')


In [None]:
bank_df[['previous-campaign']].boxplot()
bank_df.hist(column='previous-campaign')
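
The whiskers in the box plots above follow the usual 1.5 × IQR rule, so the same rule can be used to count outliers numerically. This is a minimal sketch on toy values; the helper name `iqr_outlier_mask` and the series `toy_balance` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Toy values standing in for a numeric column such as 'balance' (assumption)
toy_balance = pd.Series([100, 120, 130, 150, 160, 170, 180, 5000])
mask = iqr_outlier_mask(toy_balance)
print('Outliers:', toy_balance[mask].tolist())
print('Share of outliers:', mask.mean())
```

On the real data the same call could be applied to `bank_df['balance']`, `bank_df['age']`, and the two campaign columns to see how many rows each box plot flags.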

#### Imbalanced Data Handling
The data set is imbalanced: far fewer customers subscribed than did not, so the two classes are not equally represented.
Bootstrapping (sampling with replacement) is used to deal with the imbalance: the minority class is upsampled here.
It is also important to shuffle the data after upsampling to remove any ordering pattern.

In [None]:

from sklearn.utils import resample
from sklearn.utils import shuffle

customers_subscribed= bank_df[bank_df.subscribed =='yes']
customers_not_subscribed= bank_df[bank_df.subscribed =='no']
##Upsample the subscribed observations.

customers_subscribed_upsampled = resample(customers_subscribed,replace=True,n_samples=2000)
bank_df_concated=pd.concat([customers_not_subscribed,customers_subscribed_upsampled])

bank_df_new=shuffle(bank_df_concated)
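
One caveat worth considering: when upsampling happens before the train/test split, copies of the same customer can land in both sets, which tends to inflate test metrics. A possible alternative, sketched below under stated assumptions, is to split first and upsample only the training rows. The toy frame `toy_df` and the names `train_balanced` and `test_df` are hypothetical and stand in for `bank_df` and its splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample, shuffle

# Toy frame standing in for bank_df: 90 'no' rows and 10 'yes' rows (assumption)
toy_df = pd.DataFrame({
    'balance': range(100),
    'subscribed': ['no'] * 90 + ['yes'] * 10,
})

# 1. Hold out the test set first, keeping the original class ratio
train_df, test_df = train_test_split(
    toy_df, test_size=0.3, stratify=toy_df['subscribed'], random_state=42)

# 2. Upsample the minority class inside the training split only
train_yes = train_df[train_df.subscribed == 'yes']
train_no = train_df[train_df.subscribed == 'no']
train_yes_up = resample(train_yes, replace=True,
                        n_samples=len(train_no), random_state=42)

# 3. Recombine and shuffle to remove any ordering pattern
train_balanced = shuffle(pd.concat([train_no, train_yes_up]), random_state=42)
print(train_balanced.subscribed.value_counts())
print('Test rows left untouched:', len(test_df))
```

With this ordering the test set keeps the original class ratio, so its metrics describe how the model would behave on customers it has not seen.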


In [None]:
bank_df_new.info()

In [None]:
X_features = list( bank_df_new.columns )
X_features.remove('subscribed')
X_features
#Convert the categorical features to dummies
X= pd.get_dummies( bank_df_new[X_features],drop_first = True )
X.head(3)

In [None]:
#Encoding the output categories: yes=>1 and no=>0
Y = bank_df_new.subscribed.map( lambda x: int( x == 'yes') )


In [None]:
#Split the data into train and test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3,random_state=42)

# Logistic Regression Classifier:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn

logit = LogisticRegression()
logit.fit(X_train, y_train)
y_pred_log=logit.predict(X_test)

#labels must be passed by keyword in current scikit-learn releases
cm= metrics.confusion_matrix( y_test, y_pred_log, labels=[1,0] )
sn.heatmap(cm, annot=True, fmt='.2f',xticklabels = ['Subscribed', 'Not Subscribed'],
           yticklabels = ['Subscribed', 'Not Subscribed'] )

In [None]:
#Classification report
print(metrics.classification_report( y_test, y_pred_log ) )

In [None]:
#ROC curve and Area under ROC
predict_proba_df = pd.DataFrame( logit.predict_proba(X_test) )
display(predict_proba_df.head(3))

df_prob=pd.DataFrame({'actual':y_test})
df_prob=df_prob.reset_index()
df_prob['prob_of_1']=predict_proba_df.iloc[:,1:2]
display(df_prob.head(10))

auc_score = metrics.roc_auc_score(df_prob.actual,df_prob.prob_of_1)

fpr, tpr, thresholds = metrics.roc_curve( df_prob.actual,df_prob.prob_of_1,drop_intermediate =False )
plt.plot(fpr, tpr, label = 'ROC curve (area = %0.2f)' % auc_score)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')  #show the label that carries the AUC value

print('Area under ROC:',round( float( auc_score ), 2 ))

#### Gain and Lift Charts

In [None]:
df_n=df_prob.sort_values('prob_of_1',ascending=False).reset_index(drop=True)
df_n.head(3)
#divide the data into deciles by rank (decile 1 = highest predicted probability);
#every row gets a decile, including the remainder when len(df_n) is not a multiple of 10
df_n['decile']=(np.arange(len(df_n))*10//len(df_n))+1
df_n.head(5)
#number of actual subscribers captured in each decile
gain_df=df_n.groupby('decile')['actual'].sum().reset_index()
gain_df.columns=['decile','gain']


In [None]:
gain_df['percentage_of_gain']=(gain_df['gain'].cumsum()*100)/gain_df['gain'].sum()
display(gain_df)
plt.plot(gain_df['decile'],gain_df['percentage_of_gain'])
plt.title('Gain_Chart')

From the gain chart, contacting the customers in the first 6 deciles would capture about 75% of the subscribers.
In this way we can target the customers who are most likely to subscribe to a product or service.

In [None]:
gain_df['lift']=gain_df.percentage_of_gain/(gain_df.decile*10)
display(gain_df)
plt.plot(gain_df['decile'],gain_df['lift'])
plt.title('Lift_Chart')


Lift is the ratio of the number of positive observations captured up to decile i using the model to the
number of positives expected up to decile i under a random model (that is, without using a model).
The lift chart shows up to which decile the lift stays above 1, i.e. how far down the ranking targeting by the model still beats random contact.
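
The gain and lift definitions above can be collected into one compact helper. This is a sketch on toy labels and scores; the function name `gain_lift_table` and the toy arrays are illustrative assumptions, and the column names simply mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def gain_lift_table(actual, score, n_bins=10):
    """Cumulative gain (%) and lift per bin, where bin 1 holds the highest scores."""
    df = pd.DataFrame({'actual': np.asarray(actual), 'score': np.asarray(score)})
    df = df.sort_values('score', ascending=False).reset_index(drop=True)
    # Assign bins by rank so that every row falls into exactly one bin
    df['decile'] = (np.arange(len(df)) * n_bins // len(df)) + 1
    table = df.groupby('decile')['actual'].sum().reset_index(name='gain')
    table['percentage_of_gain'] = table['gain'].cumsum() * 100 / table['gain'].sum()
    # A random model captures (decile * 100 / n_bins) percent of positives by this bin
    table['lift'] = table['percentage_of_gain'] / (table['decile'] * 100 / n_bins)
    return table

# Toy example (assumption): 20 customers, the 5 subscribers have the top scores
actual = [1] * 5 + [0] * 15
score = np.linspace(1.0, 0.0, 20)
table = gain_lift_table(actual, score)
print(table)
```

By construction the cumulative gain reaches 100% and the lift falls to 1 in the last decile, since contacting everyone is the same with or without a model.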


# KNN-Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf= KNeighborsClassifier()
knn_clf.fit( X_train, y_train)

y_pred_knn=knn_clf.predict(X_test)

#labels must be passed by keyword in current scikit-learn releases
cm_knn= metrics.confusion_matrix( y_test, y_pred_knn, labels=[1,0] )
sn.heatmap(cm_knn, annot=True, fmt='.2f',xticklabels = ['Subscribed', 'Not Subscribed'],
           yticklabels = ['Subscribed', 'Not Subscribed'] )

In [None]:
#Classification report
print(metrics.classification_report( y_test, y_pred_knn ) )

In [None]:
predict_proba_df_knn = pd.DataFrame( knn_clf.predict_proba(X_test) )
display(predict_proba_df_knn.head(3))

df_prob_knn=pd.DataFrame({'actual':y_test})
df_prob_knn=df_prob_knn.reset_index()
df_prob_knn['prob_of_1']=predict_proba_df_knn.iloc[:,1:2]
df_prob_knn.head(3)

#compute the KNN AUC before plotting so the legend shows the KNN value
auc_score_knn = metrics.roc_auc_score(df_prob_knn.actual,df_prob_knn.prob_of_1)

fpr_knn, tpr_knn, thresholds_knn = metrics.roc_curve( df_prob_knn.actual,df_prob_knn.prob_of_1,drop_intermediate =False )
plt.plot(fpr_knn, tpr_knn, label = 'ROC curve (area = %0.2f)' % auc_score_knn)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')

print('Area under ROC KNN:',round( float( auc_score_knn ), 2 ))

#### Grid Search for Best Model Parameters

In [None]:
from sklearn.model_selection import GridSearchCV
grid_parameters = [{'n_neighbors': range(5,10),'metric': ['canberra', 'euclidean','minkowski']}]
clf = GridSearchCV(KNeighborsClassifier(),grid_parameters,cv=10,scoring='roc_auc')
clf.fit(X_train,y_train)
clf.best_score_


In [None]:
clf.best_params_
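
Because KNN is distance-based, features on large scales (such as a balance in currency units) can dominate 0/1 dummy columns. One common remedy, sketched here as an option rather than as what this notebook does, is to standardize inside a `Pipeline` so that the scaler is refit on each cross-validation fold. The synthetic data from `make_classification` and the names `pipe`, `param_grid`, and `search` are assumptions for illustration; the metric list is also a simplified stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Step-prefixed names route each parameter to the KNN step
param_grid = {
    'knn__n_neighbors': range(5, 10),
    'knn__metric': ['euclidean', 'manhattan'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc')
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

Comparing `search.best_score_` with the unscaled grid search above would show whether scaling matters for this data.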

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf_tree = DecisionTreeClassifier(criterion = 'gini',max_depth = 15)
clf_tree.fit( X_train, y_train )
#score with class-1 probabilities rather than hard labels, so the ROC AUC reflects the ranking
tree_proba = clf_tree.predict_proba( X_test )[:,1]
print(metrics.roc_auc_score( y_test, tree_proba ))

clf_tree_entropy= DecisionTreeClassifier(criterion = 'entropy',max_depth =15)
clf_tree_entropy.fit( X_train, y_train )
tree_proba_entropy = clf_tree_entropy.predict_proba( X_test )[:,1]
print(metrics.roc_auc_score( y_test, tree_proba_entropy ))

from sklearn.model_selection import GridSearchCV
params=[{'criterion':['gini','entropy'],'max_depth':range(2,15)}]
clf_tree_best = DecisionTreeClassifier()
clf_best = GridSearchCV(clf_tree_best,params,cv=10,scoring='roc_auc')
clf_best.fit(X_train, y_train)
clf_best.best_score_

In [None]:
clf_best.best_params_

# Random Forest (Ensemble)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rand_forest_clf=RandomForestClassifier(max_depth=15,n_estimators=10)
rand_forest_clf.fit(X_train,y_train)
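#score with class-1 probabilities rather than hard labels, so the ROC AUC reflects the ranking
y_proba_randf=rand_forest_clf.predict_proba(X_test)[:,1]
print('AUC under ROC random forest:',metrics.roc_auc_score(y_test,y_proba_randf))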

tuned_parameters = [{'max_depth': [10, 18],'n_estimators': [10,20],'max_features': ['sqrt', 0.2]}]
radm_clf = RandomForestClassifier()
clf_randf = GridSearchCV(radm_clf,tuned_parameters,cv=5,scoring='roc_auc')
clf_randf.fit(X_train,y_train)
print('AUC under ROC using BEST MODEL parameters:',clf_randf.best_score_)


In [None]:
clf_randf.best_params_

In [None]:
clf_rf=RandomForestClassifier(max_depth=18,max_features='sqrt',n_estimators=20)
clf_rf.fit(X_train,y_train)
y_pred_clf=clf_rf.predict(X_test)

predict_proba_df_randf = pd.DataFrame( clf_rf.predict_proba(X_test) )

df_prob_randf=pd.DataFrame({'actual':y_test})
df_prob_randf=df_prob_randf.reset_index()
df_prob_randf['prob_of_1']=predict_proba_df_randf.iloc[:,1:2]
df_prob_randf.head(3)

#compute the random forest AUC before plotting so the legend shows the right value
auc_score_randf= metrics.roc_auc_score(df_prob_randf.actual,df_prob_randf.prob_of_1)

fpr_rf, tpr_rf, thresholds_rf = metrics.roc_curve( df_prob_randf.actual,df_prob_randf.prob_of_1,drop_intermediate =False )
plt.plot(fpr_rf, tpr_rf, label = 'ROC curve (area = %0.2f)' % auc_score_randf)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')

print('Area under ROC Random forest best params:',round( float( auc_score_randf), 3))



In [None]:
#labels must be passed by keyword in current scikit-learn releases
cm_rf= metrics.confusion_matrix( y_test, y_pred_clf, labels=[1,0] )
sn.heatmap(cm_rf, annot=True, fmt='.2f',xticklabels = ['Subscribed', 'Not Subscribed'],
           yticklabels = ['Subscribed', 'Not Subscribed'] )

In [None]:
print( metrics.classification_report( y_test, y_pred_clf ) )

#### Feature Importance Values

In [None]:
feature_imp = pd.DataFrame( { 'feature': X_train.columns,'importance': clf_rf.feature_importances_ } )

feature_imp_sorted = feature_imp.sort_values('importance', ascending =False)
plt.figure(figsize=(8, 8))
# plot the values
sn.barplot( y = 'feature', x = 'importance', data = feature_imp_sorted )

Hence the features 'balance', 'age', 'previous-campaign' and 'current-campaign' are the top 4 features that explain whether a customer is likely to subscribe to the product or service.
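
Impurity-based importances from a random forest tend to favour continuous, high-cardinality columns, which may partly explain why numeric features rank highest here. As a cross-check, permutation importance on held-out data is one option. The sketch below uses synthetic data; the names `X_toy`, `rf`, and `perm_imp` are illustrative assumptions and the result says nothing about the bank data itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded bank features (assumption)
X_toy, y_toy = make_classification(n_samples=400, n_features=6,
                                   n_informative=3, random_state=42)
X_toy = pd.DataFrame(X_toy, columns=[f'f{i}' for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=42)

rf = RandomForestClassifier(n_estimators=20, max_depth=18, random_state=42)
rf.fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in ROC AUC
result = permutation_importance(rf, X_te, y_te, scoring='roc_auc',
                                n_repeats=5, random_state=42)
perm_imp = pd.Series(result.importances_mean,
                     index=X_te.columns).sort_values(ascending=False)
print(perm_imp)
```

If the two rankings broadly agree, the conclusion about the top features is better supported; large disagreements would suggest looking more closely at the impurity-based values.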