# Problem Description

ABC Bank wants to sell its term deposit product to customers. Before launching the product, the bank wants to develop a model that helps it understand whether a particular customer will buy the product, based on the customer's past interactions with the bank or other financial institutions.

# Importing the Data

In [1]:
# ! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

# Import and view the dataset
df = pd.read_csv('https://raw.githubusercontent.com/agbaysa/dataglacier_week7/main/bank-additional-full.csv', sep=";")
df.head(3)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [2]:
# Check dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [3]:
# Check for missing values
print('Data columns with null values:',df.isnull().sum(), sep = '\n')

Data columns with null values:
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64


We will now use `pandas-profiling` to profile the dataset in terms of the following:
* Number of Variables
* Number of observations
* Missing Data
* Duplicate Rows
* Data size
* Data Types
* Distribution of Continuous Features
* Cardinality of Categorical Features
* Correlations of Continuous Features
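
Several of the items listed above can also be computed directly with pandas, which can be useful when the full profiling report is slow to render. The following is a minimal sketch on a small made-up DataFrame (`toy`); the column names and values are illustrative assumptions, not taken from the bank dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "age": [30, 41, 41, 56, 30],
    "job": ["admin.", "services", "services", "retired", "admin."],
    "balance": [1.5, 2.0, 2.0, None, 1.5],
})

summary = {
    "n_variables": toy.shape[1],
    "n_observations": toy.shape[0],
    "missing_cells": int(toy.isnull().sum().sum()),
    "duplicate_rows": int(toy.duplicated().sum()),
    "memory_bytes": int(toy.memory_usage(deep=True).sum()),
}
print(summary)
print(toy.dtypes)  # data types per column

# Cardinality of the non-numeric columns
cardinality = toy.select_dtypes(exclude="number").nunique()
print(cardinality)

# Distribution summary and pairwise correlations of the numeric columns
numeric = toy.select_dtypes(include="number")
print(numeric.describe())
correlations = numeric.corr()
print(correlations)
```

The same calls can be applied to `df` once it has been loaded.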

In [4]:
# Data profiling
import pandas_profiling
from pandas_profiling import ProfileReport

profile = ProfileReport(df)
profile




# Data Cleansing and Features Engineering

In [5]:
# Drop the duration column as required
df.drop(['duration'], axis=1, inplace=True)
df.info()

# Drop duplicates
duplicated_rows = df[df.duplicated()]
print('Number of duplicated rows:', duplicated_rows.shape[0])

df.drop_duplicates(inplace = True)
print(df.shape)


# Select numeric and non-numeric columns
cat_vars = df.select_dtypes(include='object').columns
num_vars = df.select_dtypes(exclude='object').columns
print(cat_vars)

# Do label encoding
# from sklearn.preprocessing import LabelEncoder
# df[cat_vars] = df[cat_vars].apply(LabelEncoder().fit_transform)
# df.head(3)

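# Convert target variable to numeric.
# Note: this helper is only defined here, not called; y is encoded in a later cell.
def target_variable_binary(y):
    # Return the mapped Series; assigning the result of replace(..., inplace=True)
    # would overwrite the column with None
    return y.map({"yes": 1, "no": 0})

# Preview the data (categorical columns are not yet encoded at this point)
df.head(3)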

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  campaign        41188 non-null  int64  
 11  pdays           41188 non-null  int64  
 12  previous        41188 non-null  int64  
 13  poutcome        41188 non-null  object 
 14  emp.var.rate    41188 non-null  float64
 15  cons.price.idx  41188 non-null  float64
 16  cons.conf.idx   41188 non-null  float64
 17  euribor3m       41188 non-null 

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [6]:
# Selective label encoding of categorical features
# cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'y']
from sklearn.preprocessing import LabelEncoder

# job: nominal labelling
dict_job = {'blue-collar': 0, 'management': 1, 'technician': 2, 'admin.': 3, 'services': 4,
            'entrepreneur': 5, 'unknown': 5, 'retired': 5, 'self-employed': 5, 'unemployed': 5,
            'housemaid': 5, 'student': 5}
df.replace({'job':dict_job}, inplace=True)

In [7]:
# marital: nominal labelling
dict_marital = {'single': 0, 'married':1, 'divorced':2, 'unknown': 4}
df.replace({'marital':dict_marital}, inplace=True)

In [8]:
# education: nominal labelling
dict_education = {'university.degree': 0, 'high.school': 1, 'basic.9y': 2, 'professional.course': 2, 'basic.4y': 2,
                  'basic.6y': 2, 'unknown': 2, 'illiterate': 2}
df.replace({'education':dict_education}, inplace=True)

In [9]:
# default: nominal labelling
dict_default = {'no': 0, 'yes': 1, 'unknown': 2}
df.replace({'default':dict_default}, inplace=True)

In [10]:
# housing: nominal labelling
dict_housing = {'no': 0, 'yes': 1, 'unknown': 2}
df.replace({'housing':dict_housing}, inplace=True)

In [11]:
# loan: nominal labelling
dict_loan = {'no': 0, 'yes': 1, 'unknown': 2}
df.replace({'loan':dict_loan}, inplace=True)

In [12]:
# contact: nominal labelling
dict_contact = {'cellular': 0, 'telephone': 1}
df.replace({'contact':dict_contact}, inplace=True)

In [13]:
# month: nominal labelling
dict_month = {'jan': 1, 'feb': 2,'mar': 3,'apr':4,'may':5,'jun':6,'jul':7,'aug':8,'sep':9,'oct':10,'nov':11,'dec':12}
df.replace({'month':dict_month}, inplace=True)

In [14]:
# campaign: Classify campaign 6 and above as 6 to minimize cardinality
# Use a single .loc call on the DataFrame to avoid chained assignment
df.loc[df['campaign'] >= 6, 'campaign'] = 6
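
The capping step can also be written with `Series.clip`, which avoids boolean-mask assignment altogether. A minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical campaign counts
campaign = pd.Series([1, 3, 6, 9, 43], name="campaign")

# Values above 6 are set to 6; everything else is unchanged
capped = campaign.clip(upper=6)
print(capped.tolist())
```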

In [15]:
# poutcome: nominal labelling
dict_poutcome = {'nonexistent': 0, 'failure': 1, 'success': 2}
df.replace({'poutcome':dict_poutcome}, inplace=True)

In [16]:
# y: nominal labelling
dict_y = {'no': 0, 'yes': 1}
df.replace({'y':dict_y}, inplace=True)

In [17]:
# Encoding pdays by aging buckets
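# Use .loc on the DataFrame to avoid chained assignment.
# Note: pdays == 999 means the client was not previously contacted, so it falls into bucket 2.
df.loc[df['pdays'].between(-999, 0), 'pdays'] = 0
df.loc[df['pdays'].between(1, 60), 'pdays'] = 1
df.loc[df['pdays'].between(61, 9999), 'pdays'] = 2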
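
The three bucket assignments above could also be expressed in one step with `pd.cut`, which makes the bin edges explicit. This is a sketch on made-up values, assuming the same edges (up to 0, 1 to 60, above 60):

```python
import numpy as np
import pandas as pd

# Hypothetical pdays values, including the 999 placeholder
pdays = pd.Series([0, 3, 60, 61, 999])

# Intervals are right-closed: (-inf, 0] -> 0, (0, 60] -> 1, (60, inf) -> 2
buckets = pd.cut(pdays, bins=[-np.inf, 0, 60, np.inf], labels=[0, 1, 2]).astype(int)
print(buckets.tolist())
```

Because 999 lands in the same bucket as genuinely long gaps, a separate bucket for "never contacted" may be worth considering.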

In [19]:
# day_of_week: nominal labelling
dict_dow = {'mon': 1, 'tue': 2, 'wed':3, 'thu':4,'fri':5}
df.replace({'day_of_week':dict_dow}, inplace=True)

In [21]:
# Check result of features engineering
df.head(3)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,5,1,2,0,0,0,1,5,1,1,2,0,0,1.1,93.994,-36.4,4.857,5191.0,0
1,57,4,1,1,2,0,0,1,5,1,1,2,0,0,1.1,93.994,-36.4,4.857,5191.0,0
2,37,4,1,1,0,1,0,1,5,1,1,2,0,0,1.1,93.994,-36.4,4.857,5191.0,0


The Data Cleansing and Features Engineering have been presented above.
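
One caveat of dictionary-based encoding with `replace` is that categories missing from the dictionary are left unchanged without any error. The sketch below shows one possible safeguard using `Series.map`; the helper name `encode_column` is hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_column(series, mapping):
    """Map categories to integers and fail loudly on unmapped values."""
    encoded = series.map(mapping)
    # Categories absent from the mapping come back as NaN
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped categories in '{series.name}': {list(unmapped)}")
    return encoded.astype(int)

# Hypothetical sample values, using the marital mapping from above
marital = pd.Series(["single", "married", "divorced", "married"], name="marital")
dict_marital = {"single": 0, "married": 1, "divorced": 2, "unknown": 4}
encoded_marital = encode_column(marital, dict_marital)
print(encoded_marital.tolist())
```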

The initial phases of the Modeling section are included below in order to test the data cleansing process. Based on the results, the data cleansing and features engineering process yielded satisfactory model metrics.

Kindly note that the Modeling section below will still be updated to include more models and parameter tuning.

# Modeling

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import precision_score, recall_score, precision_recall_curve,f1_score, fbeta_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import sklearn.metrics as metrics


models = []
models.append(('RFC', RandomForestClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models.append(('LDA', LinearDiscriminantAnalysis()))

In [23]:
# Create X and y
X = df.drop('y', axis=1)
y = df.y

# Create train and test data for selected features only and stratify the target variable
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.3, random_state=6, stratify=y)
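
To illustrate what `stratify` does, the sketch below splits a made-up imbalanced target (10% positives) and compares the class share in each part. The arrays `X_demo` and `y_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 negatives, 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=6, stratify=y_demo
)

# With stratification, both parts keep roughly the original positive rate
print("train positive rate:", y_tr.mean())
print("valid positive rate:", y_te.mean())
```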

In [24]:
# Set parameters
num_folds = 10
seed = 6
scoring = 'roc_auc'

# Results using default parameters using roc-auc as a metric
results = []
names = []

for name, model in models:
  kfold = KFold(n_splits=num_folds)
  cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
  results.append(cv_results)
  names.append(name)
  msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
  print(msg)

RFC: 0.768676 (0.014665)
KNN: 0.718652 (0.014480)
CART: 0.612235 (0.016348)
LR: 0.765225 (0.019134)
NB: 0.765242 (0.015178)
SVM: 0.468449 (0.219556)
LDA: 0.778983 (0.015993)
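
The scores above were computed on unscaled features with unshuffled folds. Margin- and distance-based models such as SVM and KNN are often sensitive to feature scale, so one option is to wrap them in a `Pipeline` with `StandardScaler` and use stratified, shuffled folds. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration); it makes no claim about the scores this would produce on the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=6
)

# Scaling happens inside each fold, so the validation fold does not leak into the fit
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=6)

scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")
print("SVM (scaled): %f (%f)" % (scores.mean(), scores.std()))
```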


In [25]:
# Results using default parameters using accuracy as a metric
scoring = 'accuracy'

results = []
names = []

for name, model in models:
  kfold = KFold(n_splits=num_folds)
  cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
  results.append(cv_results)
  names.append(name)
  msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
  print(msg)

RFC: 0.887028 (0.004411)
KNN: 0.880284 (0.005621)
CART: 0.828693 (0.006230)
LR: 0.891016 (0.005446)
NB: 0.832753 (0.008344)
SVM: 0.883294 (0.005729)
LDA: 0.890544 (0.004081)


## Logistic Regression

In [26]:
# Log Regression
import numpy as np

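# Note: the default 'lbfgs' solver only supports the 'l2' penalty, so the 'l1'
# candidates fail to fit and are scored as NaN; pass solver='liblinear' to search both.
param_grid = {'C': np.logspace(-4, 4, 50),
              'penalty': ['l1', 'l2']}
clf = GridSearchCV(LogisticRegression(random_state=0), param_grid, cv=5, verbose=0, n_jobs=-1)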
best_model = clf.fit(X_train,y_train)
print(best_model.best_estimator_)
print("Mean accuracy of Logistic Regression:",best_model.score(X_valid,y_valid))

LogisticRegression(C=1526.4179671752302, random_state=0)
Mean accuracy of Logistic Regression: 0.8917272881069193


In [27]:
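# Refit with a fixed C. Note that this value differs from the C reported by the
# grid search output shown above (C=1526.4179671752302).
logreg = LogisticRegression(C=0.18420699693267145, random_state=0)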
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_valid)
print('Accuracy of logistic regression: {:.2f}'.format(logreg.score(X_valid, y_valid)))

Accuracy of logistic regression: 0.89


In [28]:
from sklearn.metrics import confusion_matrix
# Use a distinct variable name so the confusion_matrix function is not shadowed
cm = confusion_matrix(y_valid, y_pred)
print("Confusion Matrix:\n", cm)
print("Classification Report:\n",classification_report(y_valid, y_pred))

Confusion Matrix:
 [[10287   156]
 [ 1119   260]]
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.99      0.94     10443
           1       0.62      0.19      0.29      1379

    accuracy                           0.89     11822
   macro avg       0.76      0.59      0.62     11822
weighted avg       0.87      0.89      0.87     11822



In [29]:
pd.DataFrame(y_valid).value_counts()

y
0    10443
1     1379
dtype: int64

In [32]:
# Sanity check: the first confusion-matrix row (TN + FP) should equal the class-0 support
10287 + 156

10443
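
The same arithmetic extends to the other figures in the classification report. The sketch below recomputes accuracy, precision, recall and F1 for class 1 from the four confusion-matrix counts shown above, and compares accuracy with the share of the majority class (the score obtained by always predicting 0).

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 10287, 156, 1119, 260
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts class 0
majority_baseline = (tn + fp) / total

print(f"accuracy:          {accuracy:.3f}")
print(f"precision (1):     {precision:.3f}")
print(f"recall (1):        {recall:.3f}")
print(f"f1 (1):            {f1:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

The accuracy of about 0.89 is only slightly above the majority baseline of about 0.88, which is why recall and ROC AUC are more informative than accuracy for this imbalanced target.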