# Exercise - SVM

The data set for this exercise is from the banking industry. It contains data about the home loans of 2,500 bank clients. Each row represents a single loan. The columns describe the characteristics of the client who took out the loan. This is a binary classification task: predict whether a loan will be bad or not (1=Yes, 0=No). This is an important task for banks, because it helps prevent bad loans from being issued.

## Description of Variables

The descriptions of the variables are provided in "Loan - Data Dictionary.docx".

## Goal

Use the **loan.csv** data set and build a model to predict **BAD**. 

Since you have a relatively small data set, I recommend using cross-validation to evaluate your accuracy.
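
As a rough illustration of the cross-validation recommendation, here is a minimal sketch using `cross_val_score`. It runs on synthetic data from `make_classification` as a stand-in for the loan data, so the names `X_demo`, `y_demo`, and `demo_model` and the sample sizes are assumptions for illustration only, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan data (assumption: 500 rows, 10 numeric inputs)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaling sits inside the pipeline, so it is re-fit on the training part of each fold
demo_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))

# 5-fold cross-validated accuracy: one score per fold
cv_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print("Mean CV accuracy: {:.3f} (+/- {:.3f})".format(cv_scores.mean(), cv_scores.std()))
```

Averaging over several folds gives a more stable accuracy estimate than a single train/test split, which matters more when the data set is small.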

# Read and Prepare the Data

In [47]:
# Common imports

import pandas as pd
import numpy as np

np.random.seed(42)

# Get the data

In [48]:
# We will predict the "BAD" value in the data set:

loan = pd.read_csv("loan.csv")
loan.head()

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,0,25900,61064.0,94714.0,DebtCon,Office,2.0,0.0,0.0,98.809375,0.0,23.0,34.565944
1,0,26100,113266.0,182082.0,DebtCon,Sales,18.0,0.0,0.0,304.852469,1.0,31.0,33.193949
2,1,50000,220528.0,300900.0,HomeImp,Self,5.0,0.0,0.0,0.0,0.0,2.0,
3,1,22400,51470.0,68139.0,DebtCon,Mgr,9.0,0.0,0.0,31.168696,2.0,8.0,37.95218
4,0,20900,62615.0,87904.0,DebtCon,Office,5.0,,,177.864849,,15.0,36.831076


# Split data (train/test)

In [49]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(loan, test_size=0.35)

# Data Prep

Perform your data prep here. You can use pipelines like we do in the tutorials. Otherwise, feel free to use your own data prep steps. Eventually, you should do the following at a minimum:<br>
- Separate inputs from target<br>
- Impute/remove missing values<br>
- Standardize the continuous variables<br>
- One-hot encode categorical variables<br>

In [50]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

## Separate the target variable 

In [123]:
train_y = train['BAD']
test_y = test['BAD']

train_inputs = train.drop(['BAD'], axis=1)
test_inputs = test.drop(['BAD'], axis=1)

## Identify the numeric and categorical columns

In [52]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [53]:
numeric_columns

['LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'DEROG',
 'DELINQ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']

In [54]:
categorical_columns

['REASON', 'JOB']

# Pipeline

In [55]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [56]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [57]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional. You don't have to use it.
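
To show what the `remainder` setting does, here is a small sketch on a toy DataFrame. The frame `toy` and its columns `a` and `flag` are hypothetical and are not part of the loan data. With `remainder='passthrough'`, columns not listed in any transformer are kept unchanged. With `remainder='drop'` (the default), they are discarded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): only column "a" is listed in the transformer
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "flag": [0, 1, 0]})

keep = ColumnTransformer([("num", StandardScaler(), ["a"])], remainder="passthrough")
drop = ColumnTransformer([("num", StandardScaler(), ["a"])], remainder="drop")

kept_shape = keep.fit_transform(toy).shape       # "flag" is appended untouched
dropped_shape = drop.fit_transform(toy).shape    # "flag" is discarded

print("passthrough:", kept_shape)
print("drop:", dropped_shape)
```

In this notebook every input column is listed as either numeric or categorical, so the setting has no effect on the result here.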

# Transform: fit_transform() for TRAIN

In [58]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[-0.40252748, -0.12260323,  0.09790573, ...,  0.        ,
         0.        ,  0.        ],
       [-0.39364354, -0.75819809, -0.56091742, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.7968052 , -0.88280712, -0.63820513, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.21596462, -0.83083216, -0.82443238, ...,  0.        ,
         0.        ,  0.        ],
       [-0.4647151 ,  1.81989076,  1.39734775, ...,  0.        ,
         0.        ,  0.        ],
       [-0.31368803, -0.08009314, -0.21285451, ...,  0.        ,
         0.        ,  0.        ]])

In [59]:
train_x.shape

(1625, 20)

# Transform: transform() for TEST

In [60]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.06832164,  0.37936176,  0.33393544, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.32595607,  0.59090733,  0.44187357, ...,  0.        ,
         1.        ,  0.        ],
       [-0.33145592,  0.42488659,  0.13209687, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.70796574,  0.04909642, -0.2232263 , ...,  0.        ,
         0.        ,  0.        ],
       [-0.27815224, -1.52606735, -1.14273899, ...,  0.        ,
         0.        ,  0.        ],
       [-0.26038435, -0.0044677 , -0.25865132, ...,  0.        ,
         0.        ,  0.        ]])

In [61]:
test_x.shape

(875, 20)

# Calculate the Baseline

In [62]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_y)

DummyClassifier(strategy='most_frequent')

In [63]:
from sklearn.metrics import accuracy_score

In [64]:
# This is the baseline Train Accuracy

dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_y, dummy_train_pred)

print('Baseline Train Accuracy: {}' .format(baseline_train_acc))

Baseline Train Accuracy: 0.6092307692307692


In [65]:
# This is the baseline Test Accuracy

dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_y, dummy_test_pred)

print('Baseline Test Accuracy: {}' .format(baseline_test_acc))

Baseline Test Accuracy: 0.5702857142857143


# Train an SVM model with linear kernel

In [88]:
from sklearn.svm import SVC
 
lin_svm = SVC(kernel="linear", C=10)

lin_svm.fit(train_x, train_y)

SVC(C=10, kernel='linear')

### Calculate the accuracy

In [89]:
from sklearn.metrics import accuracy_score

In [90]:
#Predict the train values
train_y_pred = lin_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.7581538461538462

In [91]:
#Predict the test values
test_y_pred = lin_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.7142857142857143

## Checking another SVC model with Linear kernel by changing 'C' Value

In [96]:
from sklearn.svm import SVC
 
lin_svm = SVC(kernel="linear", C=0.1)

lin_svm.fit(train_x, train_y)

SVC(C=0.1, kernel='linear')

In [97]:
from sklearn.metrics import accuracy_score

In [98]:
#Predict the train values
train_y_pred = lin_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.7550769230769231

In [100]:
#Predict the test values
test_y_pred = lin_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.7211428571428572

### Test accuracy rises slightly (from 0.714 to 0.721) after changing the 'C' value from 10 to 0.1, while train accuracy stays almost the same (0.758 vs. 0.755).
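
Rather than trying one value of `C` at a time, several values can be compared in a loop. The sketch below does this with cross-validation on synthetic data. `X_demo`, `y_demo`, `c_values`, and `c_results` are illustrative names, and the candidate values are an assumption, not a recommendation for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# Smaller C means stronger regularization (a wider, softer margin)
c_values = [0.01, 0.1, 1, 10]
c_results = {}

for c in c_values:
    scores = cross_val_score(SVC(kernel="linear", C=c), X_demo, y_demo, cv=5)
    c_results[c] = scores.mean()
    print("C={}: mean CV accuracy {:.3f}".format(c, c_results[c]))
```

Comparing the values on cross-validation folds rather than on the test set keeps the test set untouched for the final evaluation.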

# Train an SVM model with poly kernel

In [101]:
from sklearn.svm import SVC

# For the poly kernel, degree sets the degree of the polynomial and
# coef0 controls how much the model is influenced by high-degree vs. low-degree terms.
# gamma is left at its default value ('scale').

pol_svm = SVC(kernel="poly", degree=4, coef0=1, C=10)

pol_svm.fit(train_x, train_y)

SVC(C=10, coef0=1, degree=4, kernel='poly')

### Calculate the accuracy

In [102]:
#Predict the train values
train_y_pred = pol_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.96

In [103]:
#Predict the test values
test_y_pred = pol_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8057142857142857

### The large gap between train accuracy (0.96) and test accuracy (0.81) indicates overfitting. As a first attempt to reduce it, we lower the degree and build another model to see how the train and test accuracy change.

In [116]:
pol_svm = SVC(kernel="poly", degree=2, coef0=1, C=10)

pol_svm.fit(train_x, train_y)

SVC(C=10, coef0=1, degree=2, kernel='poly')

In [117]:
#Predict the train values
train_y_pred = pol_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.8135384615384615

In [118]:
#Predict the test values
test_y_pred = pol_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.7554285714285714

### Lowering the degree from 4 to 2 narrowed the gap between train and test accuracy (0.81 vs. 0.76), but test accuracy also dropped (from 0.81 to 0.76). The degree-2 model overfits less than the degree-4 model, although some overfitting remains. So, we need to tune the other parameters (C value, coef0, etc.) further to reduce overfitting and build a better model.
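
One way to tune `degree`, `coef0`, and `C` together is a small grid search that also reports train scores, so the train/validation gap is visible for each combination. This is a sketch on synthetic data. `X_demo`, `y_demo`, `poly_grid`, and `poly_search` are illustrative names, and the grid values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the prepared training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# 2 x 2 x 2 = 8 combinations of poly-kernel hyperparameters
poly_grid = {"degree": [2, 3], "coef0": [0, 1], "C": [0.1, 1]}

poly_search = GridSearchCV(SVC(kernel="poly"), poly_grid, cv=5,
                           scoring="accuracy", return_train_score=True)
poly_search.fit(X_demo, y_demo)

print("Best parameters:", poly_search.best_params_)

# A large train-minus-validation gap suggests overfitting for that combination
res = poly_search.cv_results_
for train_score, val_score, params in zip(res["mean_train_score"],
                                          res["mean_test_score"],
                                          res["params"]):
    print("train {:.3f}  validation {:.3f}  {}".format(train_score, val_score, params))
```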

# Train an SVM model with rbf kernel

In [119]:
rbf_svm = SVC(kernel="rbf", C=10, gamma='scale')

rbf_svm.fit(train_x, train_y)

SVC(C=10)

### Calculate the accuracy

In [120]:
#Predict the train values
train_y_pred = rbf_svm.predict(train_x)

#Train accuracy
accuracy_score(train_y, train_y_pred)

0.9243076923076923

In [122]:
#Predict the test values
test_y_pred = rbf_svm.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_y_pred)

0.8274285714285714

## As with the poly kernel, the SVM model with the rbf kernel overfits (train accuracy 0.92 vs. test accuracy 0.83). So, we will try different parameter values to reduce the overfitting and find a better model.

### All of the SVM models above (linear, poly, rbf) have a test accuracy well above the 57% test accuracy of our baseline model. So, all of these models are better than the baseline.
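
For context on the baseline figure: a `most_frequent` dummy classifier always predicts the majority class, so its accuracy equals the share of that class in the data being scored. The sketch below checks this on a toy label vector. `y_toy` and `X_toy` are hypothetical and are not taken from the loan data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 good loans (0) and 4 bad loans (1)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # the dummy classifier ignores the inputs

toy_dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
toy_acc = accuracy_score(y_toy, toy_dummy.predict(X_toy))

# Share of the majority class in the labels
majority_share = np.bincount(y_toy).max() / len(y_toy)

print("Dummy accuracy:", toy_acc)
print("Majority class share:", majority_share)
```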

# Optional: try grid search on one of your SVM models

In [124]:
train_y = train['BAD']
test_y = test['BAD']

In [131]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 4 (2×2) combinations of hyperparameters
    {'C': [1, 10], 
     'gamma': [0.1, 0.2]}
  ]

rbf_svm = SVC(kernel="rbf", C=10, gamma='scale')

# train across 10 folds, that's a total of 4*10=40 rounds of training 
grid_search = GridSearchCV(rbf_svm, param_grid, cv=10,
                           scoring='accuracy', return_train_score=True)

grid_search.fit(train_x, train_y)

GridSearchCV(cv=10, estimator=SVC(C=10),
             param_grid=[{'C': [1, 10], 'gamma': [0.1, 0.2]}],
             return_train_score=True, scoring='accuracy')

In [132]:
grid_search.best_params_

{'C': 10, 'gamma': 0.2}

In [128]:
grid_search.best_estimator_

SVC(C=10, gamma=0.2)

In [133]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

0.8024691358024691 {'C': 1, 'gamma': 0.1}
0.8295235931227751 {'C': 1, 'gamma': 0.2}
0.8344770128001212 {'C': 10, 'gamma': 0.1}
0.8529160039384986 {'C': 10, 'gamma': 0.2}


In [135]:
final_model = grid_search.best_estimator_

test_predictions = final_model.predict(test_x)

#Test accuracy
accuracy_score(test_y, test_predictions)

0.8628571428571429

### The test accuracy of this final model is about 86% using the best estimator.