
# Bank customers: SVM

Goal: Build a classifier that predicts bank customers' response to a marketing campaign. The data contain the following variables:
- deposit: whether the client purchased a term deposit (yes, no)
- age
- job: type of job (admin, bluecollar, entrepreneur, housemaid, management, retired, selfemployed, services, student, technician, unemployed)
- marital: marital status (divorced/widowed, married, single)
- education (primary, secondary, tertiary)
- default: has credit in default? (yes, no)
- balance: average yearly balance, in euros
- housing: has housing loan? (yes, no)
- loan: has personal loan? (yes, no)
- day, month: last contact day and month
- duration: last contact duration, in seconds
- campaign: number of contacts made with this client during this campaign
- passdays: number of days since the client was last contacted in a previous campaign (the sample rows below show -1 where `previous` is 0, which presumably marks clients who were not contacted before)
- previous: number of contacts performed before this campaign

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
# Read the data

df = pd.read_csv('Bank customer.csv')
df.head()

Unnamed: 0,deposit,age,job,marital,education,default,balance,housing,loan,day,month,duration,campaign,passdays,previous
0,no,58,management,married,tertiary,no,2143,yes,no,5,may,261,1,-1,0
1,no,44,technician,single,secondary,no,29,yes,no,5,may,151,1,-1,0
2,no,33,entrepreneur,married,secondary,no,2,yes,yes,5,may,76,1,-1,0
3,no,35,management,married,tertiary,no,231,yes,no,5,may,139,1,-1,0
4,no,28,management,single,tertiary,no,447,yes,yes,5,may,217,1,-1,0


In [3]:
# Encode 'deposit' (binary variable)

df['deposit'] = df['deposit'].apply(lambda x: 1 if x=='yes' else 0)
df.head()

Unnamed: 0,deposit,age,job,marital,education,default,balance,housing,loan,day,month,duration,campaign,passdays,previous
0,0,58,management,married,tertiary,no,2143,yes,no,5,may,261,1,-1,0
1,0,44,technician,single,secondary,no,29,yes,no,5,may,151,1,-1,0
2,0,33,entrepreneur,married,secondary,no,2,yes,yes,5,may,76,1,-1,0
3,0,35,management,married,tertiary,no,231,yes,no,5,may,139,1,-1,0
4,0,28,management,single,tertiary,no,447,yes,yes,5,may,217,1,-1,0


- `df['deposit'].apply(lambda x: 1 if x=='yes' else 0)` applies a lambda function to each value in the 'deposit' column. The lambda function checks if the value is 'yes'. If it is, it returns 1; otherwise, it returns 0. This effectively encodes the 'deposit' column into a binary variable.
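The same encoding can also be written as a vectorized comparison instead of a row-by-row lambda. The sketch below uses a small made-up frame (`toy` is hypothetical, not the bank data) to show that both approaches give the same result.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'deposit' column
toy = pd.DataFrame({'deposit': ['yes', 'no', 'no', 'yes']})

# The approach used in the notebook: apply a lambda to each value
encoded_apply = toy['deposit'].apply(lambda x: 1 if x == 'yes' else 0)

# Vectorized alternative: compare the whole column, then cast booleans to integers
encoded_vec = (toy['deposit'] == 'yes').astype(int)

print(encoded_apply.tolist())  # [1, 0, 0, 1]
print(encoded_vec.tolist())    # [1, 0, 0, 1]
```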

In [4]:
# Encode 'month' (change into numbers)

mth = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
df['month'] = df['month'].apply(lambda x: mth.index(x)+1)
df.head()

Unnamed: 0,deposit,age,job,marital,education,default,balance,housing,loan,day,month,duration,campaign,passdays,previous
0,0,58,management,married,tertiary,no,2143,yes,no,5,5,261,1,-1,0
1,0,44,technician,single,secondary,no,29,yes,no,5,5,151,1,-1,0
2,0,33,entrepreneur,married,secondary,no,2,yes,yes,5,5,76,1,-1,0
3,0,35,management,married,tertiary,no,231,yes,no,5,5,139,1,-1,0
4,0,28,management,single,tertiary,no,447,yes,yes,5,5,217,1,-1,0


- `mth = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']` creates a list called `mth` containing the abbreviated names of the months from January to December.
- `df['month'].apply(lambda x: mth.index(x)+1)` applies a lambda function to each value in the 'month' column. The lambda function finds the index of the month abbreviation in the `mth` list and adds 1 to it. This effectively converts the month names into corresponding numbers (1 for January, 2 for February, and so on).
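As an alternative sketch, the month list can be turned into a lookup dictionary once and passed to `Series.map`. The sample values and the name `month_to_num` are made up for illustration. One difference to be aware of: `mth.index(x)` raises a `ValueError` for a label that is not in the list, whereas `map` returns `NaN` for it.

```python
import pandas as pd

mth = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
       'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# Build the lookup once: {'jan': 1, 'feb': 2, ..., 'dec': 12}
month_to_num = {name: i for i, name in enumerate(mth, start=1)}

# Hypothetical sample of month labels
toy_months = pd.Series(['may', 'jan', 'dec'])
nums = toy_months.map(month_to_num)

print(nums.tolist())  # [5, 1, 12]
```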

In [5]:
# Create dummies from the categorical variables

df = pd.get_dummies(df, columns=['job','marital','education','default','housing','loan'], drop_first=True)
df.head()

Unnamed: 0,deposit,age,balance,day,month,duration,campaign,passdays,previous,job_blue-collar,...,job_student,job_technician,job_unemployed,marital_married,marital_single,education_secondary,education_tertiary,default_yes,housing_yes,loan_yes
0,0,58,2143,5,5,261,1,-1,0,False,...,False,False,False,True,False,False,True,False,True,False
1,0,44,29,5,5,151,1,-1,0,False,...,False,True,False,False,True,True,False,False,True,False
2,0,33,2,5,5,76,1,-1,0,False,...,False,False,False,True,False,True,False,False,True,True
3,0,35,231,5,5,139,1,-1,0,False,...,False,False,False,True,False,False,True,False,True,False
4,0,28,447,5,5,217,1,-1,0,False,...,False,False,False,False,True,False,True,False,True,True


- `pd.get_dummies(df, columns=['job','marital','education','default','housing','loan'], drop_first=True)` generates dummy variables for the specified categorical columns in the DataFrame `df`. The `columns` parameter specifies which columns to encode, and `drop_first=True` drops the first level of each categorical variable to avoid multicollinearity issues in some modeling techniques.
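To see what `drop_first=True` does, here is a minimal sketch on a made-up `marital` column (the `toy` frame is hypothetical). The first level in sorted order, `divorced`, gets no column of its own and is represented by all remaining dummies being false.

```python
import pandas as pd

# Hypothetical toy data with three levels
toy = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'married']})

dummies = pd.get_dummies(toy, columns=['marital'], drop_first=True)

# 'marital_divorced' is dropped; it becomes the baseline level
print(dummies.columns.tolist())  # ['marital_married', 'marital_single']

# The 'divorced' row (index 2) has no dummy switched on
print(dummies.iloc[2].tolist())
```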

In [6]:
# Define x and y. Split into train, test data

y=df.deposit
x=df.drop('deposit', axis=1)

xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=1)

- `y=df.deposit` defines the target variable `y` as the 'deposit' column of the DataFrame `df`.
- `x=df.drop('deposit', axis=1)` defines the feature variables `x` by dropping the 'deposit' column from the DataFrame `df` along the columns axis (axis=1).
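- `xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=1)` splits the features (`x`) and target variable (`y`) into training and testing sets. Because `test_size` is not specified, scikit-learn's default applies and 25% of the rows (10,799 here) go to the test set.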
- `random_state=1` sets the random seed for reproducibility.
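The following sketch illustrates the default split sizes on made-up data (`X_toy` and `y_toy` are hypothetical). It also shows the optional `stratify` argument, which the notebook does not use but which may be worth considering for an imbalanced target such as `deposit`, since it keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 100 rows, 20% positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# Default behaviour: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(len(X_tr), len(X_te))  # 75 25

# Optional: stratify on the target to preserve the 20% positive rate
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy
)
print(y_tr_s.mean(), y_te_s.mean())  # 0.2 0.2
```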

### Decision tree

In [7]:
dt = DecisionTreeClassifier(random_state=42).fit(xtrain, ytrain)
pred1 = dt.predict(xtest)  # Prediction

print("Decision Tree Accuracy", accuracy_score(ytest, pred1))
print("Decision Tree Confusion Matrix \n", confusion_matrix(ytest, pred1))
print("Decision Tree Classification Report \n", classification_report(ytest, pred1))

Decision Tree Accuracy 0.8741550143531809
Decision Tree Confusion Matrix 
 [[8865  703]
 [ 656  575]]
Decision Tree Classification Report 
               precision    recall  f1-score   support

           0       0.93      0.93      0.93      9568
           1       0.45      0.47      0.46      1231

    accuracy                           0.87     10799
   macro avg       0.69      0.70      0.69     10799
weighted avg       0.88      0.87      0.88     10799



- `DecisionTreeClassifier(random_state=42)` initializes a Decision Tree classifier object; `random_state=42` makes the fitted tree reproducible.
  - `fit(xtrain, ytrain)` trains the Decision Tree classifier on the training data `xtrain` (features) and `ytrain` (target).
- `pred1 = dt.predict(xtest)` generates predictions (`pred1`) on the test data (`xtest`) using the trained Decision Tree classifier (`dt`).
- `accuracy_score(ytest, pred1)` calculates the accuracy of the Decision Tree classifier as the share of predicted values (`pred1`) that match the actual target values (`ytest`).
- `confusion_matrix(ytest, pred1)` computes the confusion matrix for the Decision Tree classifier.
  - The confusion matrix summarizes the classifier's performance by counting true negatives, false positives, false negatives, and true positives. Rows are actual classes and columns are predicted classes, so here 575 depositors were correctly identified and 656 were missed.
- `classification_report(ytest, pred1)` generates a classification report with precision, recall, F1 score, and support for each class.
- `\n` is a newline character.
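The class 1 figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below takes the Decision Tree matrix printed above and derives accuracy, precision, recall, and F1 from its four cells.

```python
import numpy as np

# Decision Tree confusion matrix from the output above
# (rows: actual class, columns: predicted class)
cm = np.array([[8865, 703],
               [656, 575]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
precision = tp / (tp + fp)             # of predicted depositors, how many were right
recall = tp / (tp + fn)                # of actual depositors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.87 precision=0.45 recall=0.47 f1=0.46
```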

### Random forests

In [8]:
rf = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=1).fit(xtrain, ytrain)
pred2 = rf.predict(xtest)

print("Random Forest Accuracy", accuracy_score(ytest, pred2))
print("Random Forest Confusion Matrix \n", confusion_matrix(ytest, pred2))
print("Random Forest Classification Report \n", classification_report(ytest, pred2))

Random Forest Accuracy 0.90387998888786
Random Forest Confusion Matrix 
 [[9283  285]
 [ 753  478]]
Random Forest Classification Report 
               precision    recall  f1-score   support

           0       0.92      0.97      0.95      9568
           1       0.63      0.39      0.48      1231

    accuracy                           0.90     10799
   macro avg       0.78      0.68      0.71     10799
weighted avg       0.89      0.90      0.89     10799



- `RandomForestClassifier(n_estimators=200, max_depth=20, random_state=1)` initializes a Random Forest classifier object.
    - `n_estimators` sets the number of trees in the forest to 200.
    - `max_depth` sets the maximum depth of each tree to 20.
    - `random_state=1` ensures reproducibility.

### SVM

In [9]:
# RBF kernel

svm_rbf = SVC(kernel='rbf', C=1).fit(xtrain, ytrain)
pred3 = svm_rbf.predict(xtest)  # Prediction

print("RBF Kernel Accuracy", accuracy_score(ytest, pred3))
print("RBF Kernel Confusion Matrix \n", confusion_matrix(ytest, pred3))
print("RBF Kernel Classification Report \n", classification_report(ytest, pred3))

RBF Kernel Accuracy 0.8866561718677656
RBF Kernel Confusion Matrix 
 [[9561    7]
 [1217   14]]
RBF Kernel Classification Report 
               precision    recall  f1-score   support

           0       0.89      1.00      0.94      9568
           1       0.67      0.01      0.02      1231

    accuracy                           0.89     10799
   macro avg       0.78      0.51      0.48     10799
weighted avg       0.86      0.89      0.84     10799



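- `SVC(kernel='rbf', C=1)` initializes a Support Vector Classifier (SVC) object with an RBF kernel.
    - The `C` parameter controls regularization; the regularization strength is inversely proportional to `C`. Higher values penalize training misclassifications more heavily, which leads to fewer errors on the training data, potentially at the cost of overfitting.
- Although accuracy is 0.89, the model predicts class 1 for only 21 test customers and recovers 14 of the 1,231 actual depositors (recall 0.01). The features are not scaled in this cell, and SVMs are sensitive to feature scale; the tuning section below adds a `StandardScaler` step.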
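To get a feel for `C`, the sketch below fits an RBF SVM with three different values on synthetic data from `make_classification` (a stand-in, not the bank data) and records the training accuracy for each. Training accuracy typically rises with `C`, though the exact numbers depend on the data. The features are standardized first, which is an assumption of this sketch rather than something the cell above does.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the bank data)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

train_acc = {}
for C in [0.01, 1, 100]:
    # Scale the features, then fit an RBF SVM with the given C
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=C))
    model.fit(X_syn, y_syn)
    train_acc[C] = model.score(X_syn, y_syn)  # accuracy on the training data

for C, acc in train_acc.items():
    print(f"C={C}: training accuracy {acc:.3f}")
```

A high training accuracy at large `C` says nothing about test performance, which is why the notebook evaluates on `xtest` and later uses cross-validation.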

In [10]:
# Sigmoid kernel

svm_sm = SVC(kernel='sigmoid', C=0.1).fit(xtrain, ytrain)
pred4 = svm_sm.predict(xtest)  # Prediction

print("Sigmoid Kernel Accuracy", accuracy_score(ytest, pred4))
print("Sigmoid Kernel Confusion Matrix \n", confusion_matrix(ytest, pred4))
print("Sigmoid Kernel Classification Report \n", classification_report(ytest, pred4))

Sigmoid Kernel Accuracy 0.8703583665154181
Sigmoid Kernel Confusion Matrix 
 [[9220  348]
 [1052  179]]
Sigmoid Kernel Classification Report 
               precision    recall  f1-score   support

           0       0.90      0.96      0.93      9568
           1       0.34      0.15      0.20      1231

    accuracy                           0.87     10799
   macro avg       0.62      0.55      0.57     10799
weighted avg       0.83      0.87      0.85     10799



### Hyperparameter tuning

In [11]:
# Define the SVC model pipeline
svc = make_pipeline(StandardScaler(), SVC())

# Set up the grid search parameter grid
param = {'svc__C': [1, 10], 'svc__kernel': ['rbf', 'poly']}

# Initialize the GridSearchCV object with cross-validation
search = GridSearchCV(svc, param, cv=5, scoring=['accuracy', 'f1'], refit='f1', verbose=2).fit(xtrain, ytrain)

# Print the best parameters (selected by F1) and that model's cross-validated scores
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy of the best model:", search.cv_results_['mean_test_accuracy'][search.best_index_])
print("Best cross-validated F1 score:", search.best_score_)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END ..........................svc__C=1, svc__kernel=rbf; total time=  18.8s
[CV] END ..........................svc__C=1, svc__kernel=rbf; total time=  14.5s
[CV] END ..........................svc__C=1, svc__kernel=rbf; total time=  14.5s
[CV] END ..........................svc__C=1, svc__kernel=rbf; total time=  16.9s
[CV] END ..........................svc__C=1, svc__kernel=rbf; total time=  17.7s
[CV] END .........................svc__C=1, svc__kernel=poly; total time=  16.5s
[CV] END .........................svc__C=1, svc__kernel=poly; total time=  14.1s
[CV] END .........................svc__C=1, svc__kernel=poly; total time=  14.2s
[CV] END .........................svc__C=1, svc__kernel=poly; total time=  14.5s
[CV] END .........................svc__C=1, svc__kernel=poly; total time=  13.3s
[CV] END .........................svc__C=10, svc__kernel=rbf; total time=  29.7s
[CV] END .........................svc__C=10, svc_

- `svc = make_pipeline(StandardScaler(), SVC())`
   - `make_pipeline` creates a pipeline, which chains multiple processing steps (such as feature scaling, feature selection, and model training) into a single object. Calling `fit` on the pipeline runs every step in order, so during cross-validation the scaler is fit only on each training fold.
   - `StandardScaler()` standardizes each feature to zero mean and unit variance.
   - `SVC()` is for Support Vector Classification.
- The `param` dictionary defines the hyperparameters to search over.
   - Each hyperparameter is prefixed with `svc__` to specify that it belongs to the SVC step in the pipeline (`make_pipeline` names each step after its lowercased class name).
- `search = GridSearchCV(svc, param, cv=5, scoring=['accuracy', 'f1'], refit='f1', verbose=2)`
    - `svc`: the estimator, here the SVC pipeline.
    - `param`: the parameter grid to search over (2 values of `C` times 2 kernels, so 4 candidates).
    - `cv=5`: 5-fold cross-validation.
    - `scoring`: the evaluation metrics computed for each candidate, here accuracy and F1.
    - `refit='f1'` tells `GridSearchCV` to use the F1 score to select the best configuration. After the grid search is complete, the candidate with the highest mean cross-validated F1 score is retrained on the entire training set.
    - `verbose`: controls the verbosity of the output during the grid search (set to 2 for detailed output).
- The final `print` statements report the results of the search:
    - `search.best_params_` returns the hyperparameters that resulted in the best performance during the grid search.
    - `search.best_score_` returns the mean cross-validated score (F1 in this case) of the best candidate, averaged over the 5 validation folds.
    - `cv_results_` is a dictionary with detailed information about the cross-validation results for each combination of hyperparameters.
      - `cv_results_['mean_test_accuracy']` contains the mean test accuracy for each combination of hyperparameters.
      - `search.best_index_` is the index into the `cv_results_` arrays that corresponds to the best candidate. The accuracy printed is therefore that of the model chosen by F1, which is not necessarily the highest accuracy in the grid.
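`cv_results_` is often easier to read as a DataFrame. The sketch below repeats the same kind of search on small synthetic data (`X_syn`, `y_syn`, and `demo` are illustrative names, and the data are not the bank data) and lists both metrics for every candidate, ranked by F1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in so the search runs in a few seconds
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
grid = {'svc__C': [1, 10], 'svc__kernel': ['rbf', 'poly']}
demo = GridSearchCV(pipe, grid, cv=5,
                    scoring=['accuracy', 'f1'], refit='f1').fit(X_syn, y_syn)

# One row per candidate; keep the columns of interest
results = pd.DataFrame(demo.cv_results_)
summary = results[['param_svc__C', 'param_svc__kernel',
                   'mean_test_accuracy', 'mean_test_f1', 'rank_test_f1']]
print(summary.sort_values('rank_test_f1'))

# best_index_ points at the row ranked first by F1
print(results.loc[demo.best_index_, 'params'])
```

With multi-metric scoring, each metric gets its own `mean_test_<name>` and `rank_test_<name>` columns, so candidates can be compared on accuracy and F1 side by side.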