
13-03-2024 <center>SVM PROJECT</center>
<center>Predicting Term Deposit Subscription by a Client</center>

# Introduction

**Abstract:**<br>
Marketing campaigns focus on customer needs and overall satisfaction. Even so,
several variables determine whether a campaign succeeds, and these need to be
taken into account when designing one.
A term deposit is a deposit that a bank or other financial institution offers at a fixed rate
(often better than that of a regular deposit account), with the money returned
at a specified maturity date.<br><br>
**Problem Statement:**
Predict whether a customer subscribes to a term deposit when contacted by a
marketing agent, by understanding the available features and applying predictive
analytics.

# Loading Data & Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from scipy.stats import chi2_contingency
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
data = pd.read_csv('/content/bank-additional-full.csv',sep=';')

# Exploratory Data Analysis

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [None]:
data.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [None]:
data.describe()

| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [None]:
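# Restrict to numeric columns; recent pandas versions raise on object columns otherwise
data.skew(numeric_only=True)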

age               0.784697
duration          3.263141
campaign          4.762507
pdays            -4.922190
previous          3.832042
emp.var.rate     -0.724096
cons.price.idx   -0.230888
cons.conf.idx     0.303180
euribor3m        -0.709188
nr.employed      -1.044262
dtype: float64

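We infer that:
* The dataset has 41,188 rows and 21 columns
* There are no missing values
* The `object` columns need to be encoded
* Several numeric columns are highly skewed (`duration`, `campaign`, `pdays`, `previous`)
* The data is highly heterogeneous
* Many categorical variables have high cardinality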
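
The cardinality and skewness observations can be checked directly. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data and the variable names are illustrative assumptions, not the bank data); the same two calls could be applied to `data`.

```python
import pandas as pd

# Toy stand-in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [25, 31, 38, 44, 95],
    "job": ["admin.", "services", "admin.", "technician", "retired"],
    "marital": ["single", "married", "married", "single", "married"],
})

# Number of distinct levels in each non-numeric (categorical) column.
cardinality = toy.select_dtypes(exclude="number").nunique()

# Skewness of the numeric columns only; a long right tail gives a positive value.
skewness = toy.select_dtypes(include="number").skew()

print(cardinality)
print(skewness)
```

Columns with many levels produce many indicator columns after one-hot encoding, which is worth knowing before the encoding step.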



# Data Preparation

One-hot encoding of the categorical features (the target `y`, the last column, is excluded)

In [None]:
data_transform = pd.get_dummies(data.iloc[:,:-1])
data_transform

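| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | ... | month_oct | month_sep | day_of_week_fri | day_of_week_mon | day_of_week_thu | day_of_week_tue | day_of_week_wed | poutcome_failure | poutcome_nonexistent | poutcome_success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 261 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 57 | 149 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 37 | 226 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 40 | 151 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 56 | 307 | 1 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41183 | 73 | 334 | 1 | 999 | 0 | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 41184 | 46 | 383 | 1 | 999 | 0 | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 41185 | 56 | 189 | 2 | 999 | 0 | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 41186 | 44 | 442 | 1 | 999 | 0 | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |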


In [None]:
data_transform.info()

In [None]:
x = data_transform
y = data['y']
print(x.shape)
print(y.shape)

(41188, 63)
(41188,)
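
The 20 input columns grow to 63 because `pd.get_dummies` replaces each categorical column with one indicator column per level and leaves numeric columns untouched. A small sketch on made-up rows shows the mechanism; the `toy` frame is illustrative, and `dtype=int` is an assumption added here so the indicators print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Three made-up rows with one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "contact": ["telephone", "cellular", "telephone"],
    "poutcome": ["nonexistent", "failure", "success"],
})

# Each categorical column becomes one 0/1 column per distinct level.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

Here 3 columns become 6: `age` is kept, `contact` contributes 2 indicators, and `poutcome` contributes 3.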


Splitting the data into training and test sets (80/20)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
print("x_train", x_train.shape)
print("x_test", x_test.shape)
print("y_train", y_train.shape)
print("y_test", y_test.shape)

x_train (32950, 63)
x_test (8238, 63)
y_train (32950,)
y_test (8238,)
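
The random split above leaves about 11.2% `yes` labels in the training set (3705 of 32950) and about 11.3% in the test set (935 of 8238), as the later reports show. With an imbalanced target, one option is to pass `stratify` to `train_test_split` so that both parts keep the same class proportions by construction. The following is a sketch on synthetic data (`X_toy` and `y_toy` are made up for illustration), not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced target with 11% "yes" labels.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array(["yes"] * 110 + ["no"] * 890)

# stratify=y_toy keeps the yes/no ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_rate = (y_tr == "yes").mean()
test_rate = (y_te == "yes").mean()
print(f"train yes-rate={train_rate:.3f}, test yes-rate={test_rate:.3f}")
```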


# Multivariate Analysis

## Logistic Regression

In [None]:
lr = LogisticRegression(max_iter = 10000)
lr.fit(x_train, y_train)
y_hat_train_lr = lr.predict(x_train)
y_hat_test_lr = lr.predict(x_test)

In [None]:
def model_eval(actual, predicted):
  """Print accuracy, confusion matrix, and classification report."""
  conf_matrix = confusion_matrix(actual, predicted)  # rows: actual, columns: predicted
  acc_score = accuracy_score(actual, predicted)
  clas_rep = classification_report(actual, predicted)
  print('The Accuracy of the model is: ', round(acc_score,2))
  print(conf_matrix)
  print(clas_rep)

# Evaluate logistic regression on the training set
model_eval(y_train, y_hat_train_lr)

The Accuracy of the model is:  0.91
[[28485   760]
 [ 2174  1531]]
              precision    recall  f1-score   support

          no       0.93      0.97      0.95     29245
         yes       0.67      0.41      0.51      3705

    accuracy                           0.91     32950
   macro avg       0.80      0.69      0.73     32950
weighted avg       0.90      0.91      0.90     32950



## Decision Tree Classifier

In [None]:
dtree = DecisionTreeClassifier()
dtree.fit(x_train, y_train)
y_hat_train_dtree = dtree.predict(x_train)
y_hat_test_dtree = dtree.predict(x_test)

In [None]:
model_eval(y_train, y_hat_train_dtree)

The Accuracy of the model is:  1.0
[[29245     0]
 [    0  3705]]
              precision    recall  f1-score   support

          no       1.00      1.00      1.00     29245
         yes       1.00      1.00      1.00      3705

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950



In [None]:
model_eval(y_test, y_hat_test_dtree)

The Accuracy of the model is:  0.89
[[6818  485]
 [ 438  497]]
              precision    recall  f1-score   support

          no       0.94      0.93      0.94      7303
         yes       0.51      0.53      0.52       935

    accuracy                           0.89      8238
   macro avg       0.72      0.73      0.73      8238
weighted avg       0.89      0.89      0.89      8238



## Random Forest Classifier

In [None]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_hat_train_rf = rf.predict(x_train)
y_hat_test_rf = rf.predict(x_test)

In [None]:
model_eval(y_train, y_hat_train_rf)

The Accuracy of the model is:  1.0
[[29245     0]
 [    0  3705]]
              precision    recall  f1-score   support

          no       1.00      1.00      1.00     29245
         yes       1.00      1.00      1.00      3705

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950



In [None]:
model_eval(y_test, y_hat_test_rf)

The Accuracy of the model is:  0.91
[[7077  226]
 [ 512  423]]
              precision    recall  f1-score   support

          no       0.93      0.97      0.95      7303
         yes       0.65      0.45      0.53       935

    accuracy                           0.91      8238
   macro avg       0.79      0.71      0.74      8238
weighted avg       0.90      0.91      0.90      8238



In [None]:
importances = rf.feature_importances_
importances_df = pd.DataFrame({'feature': x_train.columns, 'importance': importances})
importances_df = importances_df.sort_values('importance', ascending=False)
importances_df

## Ada-Boost

In [None]:
ada = AdaBoostClassifier(n_estimators = 100)
ada.fit(x_train, y_train)
y_train_ada = ada.predict(x_train)
y_hat_test_ada = ada.predict(x_test)

In [None]:
model_eval(y_train, y_train_ada)

The Accuracy of the model is:  0.91
[[28521   724]
 [ 2180  1525]]
              precision    recall  f1-score   support

          no       0.93      0.98      0.95     29245
         yes       0.68      0.41      0.51      3705

    accuracy                           0.91     32950
   macro avg       0.80      0.69      0.73     32950
weighted avg       0.90      0.91      0.90     32950



In [None]:
model_eval(y_test, y_hat_test_ada)

The Accuracy of the model is:  0.91
[[7129  174]
 [ 563  372]]
              precision    recall  f1-score   support

          no       0.93      0.98      0.95      7303
         yes       0.68      0.40      0.50       935

    accuracy                           0.91      8238
   macro avg       0.80      0.69      0.73      8238
weighted avg       0.90      0.91      0.90      8238




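* AdaBoost reaches 0.91 accuracy on both the training and test sets, so it shows no sign of overfitting, but its test recall on the `yes` class is only 0.40. XGBoost, evaluated below, reaches a slightly higher test accuracy of 0.92 and a `yes` recall of 0.56.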

## Gradient Boost

In [None]:
gb = GradientBoostingClassifier(n_estimators = 150)
gb.fit(x_train, y_train)
y_hat_train_gb = gb.predict(x_train)
y_hat_test_gb = gb.predict(x_test)

In [None]:
model_eval(y_train, y_hat_train_gb)

The Accuracy of the model is:  0.92
[[28366   879]
 [ 1602  2103]]
              precision    recall  f1-score   support

          no       0.95      0.97      0.96     29245
         yes       0.71      0.57      0.63      3705

    accuracy                           0.92     32950
   macro avg       0.83      0.77      0.79     32950
weighted avg       0.92      0.92      0.92     32950



In [None]:
model_eval(y_test, y_hat_test_gb)

The Accuracy of the model is:  0.92
[[7074  229]
 [ 435  500]]
              precision    recall  f1-score   support

          no       0.94      0.97      0.96      7303
         yes       0.69      0.53      0.60       935

    accuracy                           0.92      8238
   macro avg       0.81      0.75      0.78      8238
weighted avg       0.91      0.92      0.91      8238



## Extreme Gradient Boosting (XGBoost)

In [None]:
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)  # reuse the encoder fitted on y_train

In [None]:
xgb = XGBClassifier()
xgb.fit(x_train, y_train_enc)
y_hat_train_xgb = xgb.predict(x_train)
y_hat_test_xgb = xgb.predict(x_test)

In [None]:
model_eval(y_train_enc, y_hat_train_xgb)

The Accuracy of the model is:  0.96
[[28853   392]
 [  847  2858]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98     29245
           1       0.88      0.77      0.82      3705

    accuracy                           0.96     32950
   macro avg       0.93      0.88      0.90     32950
weighted avg       0.96      0.96      0.96     32950



In [None]:
model_eval(y_test_enc, y_hat_test_xgb)

The Accuracy of the model is:  0.92
[[7016  287]
 [ 410  525]]
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      7303
           1       0.65      0.56      0.60       935

    accuracy                           0.92      8238
   macro avg       0.80      0.76      0.78      8238
weighted avg       0.91      0.92      0.91      8238



* XGBoost reaches an accuracy of 0.96 on the training set and 0.92 on the test set, ahead of AdaBoost (0.91) and level with Gradient Boost (0.92) on the test set. It is strong on the majority class (class 0, `no`: F1 of 0.95) but weaker on the minority class (class 1, `yes`: precision 0.65, recall 0.56, F1 0.60), and the drop from 0.96 to 0.92 suggests mild overfitting.
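
The minority-class figures in the XGBoost test report can be recomputed by hand from its confusion matrix, which makes clear why a 0.92 accuracy coexists with a much lower recall on subscribers. The sketch below copies the test matrix printed above; the variable names are illustrative.

```python
import numpy as np

# XGBoost test-set confusion matrix from the output above
# (rows: actual 0/1, columns: predicted 0/1).
cm = np.array([[7016, 287],
               [410, 525]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted subscribers, how many subscribed
recall = tp / (tp + fn)      # of actual subscribers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.92 precision=0.65 recall=0.56 f1=0.60
```

Because 7303 of the 8238 test rows are `no`, accuracy is dominated by the majority class, so the `yes` recall is the more informative number for this task.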

## K Nearest Neighbour

In [None]:
np.sqrt(len(x_train))

181.52134860671347

In [None]:
knn = KNeighborsClassifier(n_neighbors=41)
knn.fit(x_train, y_train)
y_hat_train_knn = knn.predict(x_train)
y_hat_test_knn = knn.predict(x_test)

In [None]:
model_eval(y_train, y_hat_train_knn)

The Accuracy of the model is:  0.92
[[28317   928]
 [ 1865  1840]]
              precision    recall  f1-score   support

          no       0.94      0.97      0.95     29245
         yes       0.66      0.50      0.57      3705

    accuracy                           0.92     32950
   macro avg       0.80      0.73      0.76     32950
weighted avg       0.91      0.92      0.91     32950



In [None]:
model_eval(y_test, y_hat_test_knn)

The Accuracy of the model is:  0.91
[[7073  230]
 [ 487  448]]
              precision    recall  f1-score   support

          no       0.94      0.97      0.95      7303
         yes       0.66      0.48      0.56       935

    accuracy                           0.91      8238
   macro avg       0.80      0.72      0.75      8238
weighted avg       0.90      0.91      0.91      8238



Trying multiple values of k

In [None]:
k_values = range(1, 200)  # Vary k from 1 to 199
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

plt.plot(k_values, accuracies, marker='o')
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Accuracy")
plt.title("Accuracy vs. Number of Neighbors (KNN)")
plt.grid(True)
plt.show()
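
The loop above compares values of k by their accuracy on the test set, which means the test set is no longer an unbiased check of the chosen k. One alternative is to pick k by cross-validation on the training data only. The following is a minimal sketch on synthetic data; `make_classification`, the small `k_grid`, and the 5-fold setting are assumptions chosen to keep the example fast, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for x_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.89, 0.11], random_state=42,
)

k_grid = [1, 5, 11, 21, 41]
cv_means = []
for k in k_grid:
    # Mean accuracy over 5 folds; no test data is touched.
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
        cv=5, scoring="accuracy",
    )
    cv_means.append(scores.mean())

best_k = k_grid[int(np.argmax(cv_means))]
print("best k by cross-validation:", best_k)
```

The test set would then be used once, with the selected k, to report the final score.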

## Support Vector Machines

In [None]:
svm = SVC(C=1.5, kernel='poly', degree=5)
svm.fit(x_train,y_train)
y_hat_train_svm=svm.predict(x_train)
y_hat_test_svm=svm.predict(x_test)

In [None]:
model_eval(y_train, y_hat_train_svm)

The Accuracy of the model is:  0.9
[[28775   470]
 [ 2859   846]]
              precision    recall  f1-score   support

          no       0.91      0.98      0.95     29245
         yes       0.64      0.23      0.34      3705

    accuracy                           0.90     32950
   macro avg       0.78      0.61      0.64     32950
weighted avg       0.88      0.90      0.88     32950



In [None]:
model_eval(y_test, y_hat_test_svm)

The Accuracy of the model is:  0.9
[[7171  132]
 [ 726  209]]
              precision    recall  f1-score   support

          no       0.91      0.98      0.94      7303
         yes       0.61      0.22      0.33       935

    accuracy                           0.90      8238
   macro avg       0.76      0.60      0.64      8238
weighted avg       0.87      0.90      0.87      8238
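
SVM kernels are computed from inner products and distances between feature vectors, so columns on large scales (for example `duration`, which ranges up to 4918) can dominate the 0/1 indicator columns. A common remedy is to standardize the features inside a pipeline before fitting. The sketch below keeps the same `SVC` settings as above but runs on synthetic data; the data, the inflated first column, and the variable names are illustrative assumptions, and no result for the bank data is implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; the first column is inflated to mimic a large-scale feature.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training part only, then applied to the test part.
scaled_svm = make_pipeline(StandardScaler(), SVC(C=1.5, kernel="poly", degree=5))
scaled_svm.fit(X_tr, y_tr)

test_acc = scaled_svm.score(X_te, y_te)
print(f"test accuracy on synthetic data: {test_acc:.2f}")
```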

