## A Business Case project


This is a classification problem related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. I have used the following classification models, to see how these models will perform on this dataset:

1. Logistic Regression
2. K-Nearest Neighbours
3. Support Vector Machine
4. Kernel Support Vector Machine
5. Naive Bayes
6. Decision Tree Classifier
7. Random Forest Classifier

This dataset was derived from UCI Machine Learning Repository <a href="https://archive.ics.uci.edu/ml/datasets/Bank+Marketing" target="_blank"> (Click Here) </a>.

## ----------------------------------------------------------------------------------------------------------------------------------

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Importing the dataset

In [2]:
dataset = pd.read_csv('bank.csv', sep=';')
dataset.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [3]:
dataset.shape ## There are 4521 instances and 17 attributes

(4521, 17)

## View statistical details

The mean details in this dataset suggests that:

- The average age is around 40
- The average annual balance is €1423
- The last contact day of the month is 16 days
- The last contact duration was 264 seconds
- The number of contacts performed during this campaign and for this client is 3
- The number of days that passed by after the client was last contacted from a previous campaign is 40 days
- The number of contacts performed before this campaign and for this client is 1

In [4]:
dataset.describe()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0
mean,41.170095,1422.657819,15.915284,263.961292,2.79363,39.766645,0.542579
std,10.576211,3009.638142,8.247667,259.856633,3.109807,100.121124,1.693562
min,19.0,-3313.0,1.0,4.0,1.0,-1.0,0.0
25%,33.0,69.0,9.0,104.0,1.0,-1.0,0.0
50%,39.0,444.0,16.0,185.0,2.0,-1.0,0.0
75%,49.0,1480.0,21.0,329.0,3.0,-1.0,0.0
max,87.0,71188.0,31.0,3025.0,50.0,871.0,25.0


In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  duration   4521 non-null   int64 
 12  campaign   4521 non-null   int64 
 13  pdays      4521 non-null   int64 
 14  previous   4521 non-null   int64 
 15  poutcome   4521 non-null   object
 16  y          4521 non-null   object
dtypes: int64(7), object(10)
memory usage: 600.6+ KB


## Check for missing values
No missing values present

In [6]:
dataset.isnull().sum()

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

##### Let x be the independant variables, and y be the dependant variable

In [7]:
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [8]:
print(x)

[[30 'unemployed' 'married' ... -1 0 'unknown']
 [33 'services' 'married' ... 339 4 'failure']
 [35 'management' 'single' ... 330 1 'failure']
 ...
 [57 'technician' 'married' ... -1 0 'unknown']
 [28 'blue-collar' 'married' ... 211 3 'other']
 [44 'entrepreneur' 'single' ... 249 7 'other']]


## Encoding categorical data

### Encoding the Independent Variable

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,2,3,4,6,7,8,10,15])], remainder='passthrough')
x = np.array(ct.fit_transform(x))

In [10]:
x

array([[0.0, 0.0, 0.0, ..., 1, -1, 0],
       [0.0, 0.0, 0.0, ..., 1, 339, 4],
       [0.0, 0.0, 0.0, ..., 1, 330, 1],
       ...,
       [0.0, 0.0, 0.0, ..., 11, -1, 0],
       [0.0, 1.0, 0.0, ..., 4, 211, 3],
       [0.0, 0.0, 1.0, ..., 2, 249, 7]], dtype=object)

### Encoding the Dependant Variable

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [12]:
print(y)

[0 0 0 ... 0 0 0]


## Splitting the dataset into the Training set and Test set

In [13]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 42)

## Feature Scaling
There are two ways to scale feautres; normalisation or standardisation. I will use standardisation method because normalisation will work better when a dataset is normally distributed.

In [14]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)

In [15]:
print(x_train)

[[ 3.01983493 -0.51425049 -0.18823035 ... -0.56451332 -0.4074478
  -0.31253213]
 [-0.33114393 -0.51425049 -0.18823035 ...  0.34982025 -0.4074478
  -0.31253213]
 [-0.33114393  1.94457765 -0.18823035 ... -0.25973547 -0.4074478
  -0.31253213]
 ...
 [-0.33114393 -0.51425049 -0.18823035 ...  0.34982025  3.3457769
   0.82425319]
 [-0.33114393 -0.51425049 -0.18823035 ...  0.34982025 -0.4074478
  -0.31253213]
 [-0.33114393  1.94457765 -0.18823035 ... -0.25973547 -0.4074478
  -0.31253213]]


In [16]:
print(x_test)

[[-0.38044296 -0.51487928  4.55521679 ...  1.3431067  -0.40792925
  -0.35238102]
 [-0.38044296 -0.51487928 -0.21952852 ... -0.24544617 -0.37982219
   3.03083574]
 [-0.38044296 -0.51487928 -0.21952852 ...  0.15169205 -0.40792925
  -0.35238102]
 ...
 [-0.38044296 -0.51487928 -0.21952852 ...  1.3431067  -0.40792925
  -0.35238102]
 [-0.38044296 -0.51487928 -0.21952852 ... -0.24544617 -0.40792925
  -0.35238102]
 [-0.38044296 -0.51487928 -0.21952852 ... -0.64258438  1.31597029
   0.32426233]]


# --------------------------------- End of data preprocessing --------------------------------------

# Model Performance Selection

## Training the Logistic Regression model on the Training set

In [17]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(random_state=42)
log.fit(x_train, y_train)

LogisticRegression(random_state=42)

## Predicting the Test set results

In [18]:
y_pred = log.predict(x_test)

## Making the Confusion Matrix

In [19]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = log.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[979  27]
 [ 87  38]]


0.8992042440318302

## Training the Decision Tree classifier model on the Training set

In [20]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
dtc.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=42)

## Predicting the Test set results

In [21]:
y_pred = dtc.predict(x_test)

## Making the Confusion Matrix

In [22]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = dtc.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[945  61]
 [ 72  53]]


0.8824049513704686

## Training the K-NN model on the Training set

In [23]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(x_train, y_train)

KNeighborsClassifier()

## Making the Confusion Matrix

In [24]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = knn.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[979  27]
 [104  21]]


0.8841732979664014

## Training the Kernel SVM model on the Training set

In [25]:
from sklearn.svm import SVC
ksvm = SVC(kernel = 'rbf', random_state = 42)
ksvm.fit(x_train, y_train)

SVC(random_state=42)

## Making the Confusion Matrix

In [26]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = ksvm.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[988  18]
 [ 96  29]]


0.8992042440318302

## Training the Naive Bayes model on the Training set

In [27]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train, y_train)

GaussianNB()

## Making the Confusion Matrix

In [28]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = gnb.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[876 130]
 [ 76  49]]


0.8178603006189213

## Training the Random Forest Classification model on the Training set

In [29]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
rfc.fit(x_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=42)

## Making the Confusion Matrix

In [30]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = rfc.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[983  23]
 [102  23]]


0.8894783377541998

## Training the SVM model on the Training set

In [31]:
from sklearn.svm import SVC
svm = SVC(kernel = 'linear', random_state = 42)
svm.fit(x_train, y_train)

SVC(kernel='linear', random_state=42)

## Making the Confusion Matrix

In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = svm.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[987  19]
 [100  25]]


0.8947833775419982

### It looks like Logistic Regression and Kernel SVM performed really well, both having an accuracy of 89.92%. I will examine both of these models closely by applying k-Fold Cross Validation. I am applying this technique,  to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

## Computing the Logistic Regression accuracy with k-Fold Cross Validation

In [33]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = log, X = x_train, y = y_train, cv = 10)
print("Accuracy : {:.2f} %" .format(accuracies.mean()*100))
print("Standard Deviation : {:.2f} %" .format(accuracies.std()*100))

Accuracy : 90.35 %
Standard Deviation : 1.11 %


## Computing the Kernel SVM accuracy with k-Fold Cross Validation

In [34]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = ksvm, X = x_train, y = y_train, cv = 10)
print("Accuracy : {:.2f} %" .format(accuracies.mean()*100))
print("Standard Deviation : {:.2f} %" .format(accuracies.std()*100))

Accuracy : 88.88 %
Standard Deviation : 1.27 %


##### It seems that Logistic Regression has a higher accuracy rate, compared to Kernel SVM. It shows the model's ability to predict new data that was not used in estimating it. The Standard Deviation for Logistic Regression suggests that the model's accuracy lies between 89.24% to 91.46%.

##### To conclude, this dataset will perform well with Logistic Regression, predicting if a client will subscribe to a term deposit.