## SVM - Credit Default Prediction

In this project, we will build an SVM model to predict whether a given customer defaults. Credit default is one of the most important problems in the banking and risk analytics industry. Various attributes can be used to predict default, such as demographic data (age, income, employment status, etc.) and credit behavioural data (past loans, payments, the number of times the customer has delayed a credit payment, etc.).

We'll start the process with data cleaning and preparation and then tune the model to find optimal hyperparameters.

<hr>

### Data Understanding and Cleaning

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Reading the csv file and putting it into 'df' object.
df = pd.read_csv('credit-card-default.csv')
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,defaulted
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [3]:
# Let's understand the type of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   ID         30000 non-null  int64
 1   LIMIT_BAL  30000 non-null  int64
 2   SEX        30000 non-null  int64
 3   EDUCATION  30000 non-null  int64
 4   MARRIAGE   30000 non-null  int64
 5   AGE        30000 non-null  int64
 6   PAY_0      30000 non-null  int64
 7   PAY_2      30000 non-null  int64
 8   PAY_3      30000 non-null  int64
 9   PAY_4      30000 non-null  int64
 10  PAY_5      30000 non-null  int64
 11  PAY_6      30000 non-null  int64
 12  BILL_AMT1  30000 non-null  int64
 13  BILL_AMT2  30000 non-null  int64
 14  BILL_AMT3  30000 non-null  int64
 15  BILL_AMT4  30000 non-null  int64
 16  BILL_AMT5  30000 non-null  int64
 17  BILL_AMT6  30000 non-null  int64
 18  PAY_AMT1   30000 non-null  int64
 19  PAY_AMT2   30000 non-null  int64
 20  PAY_AMT3   30000 non-null  int64
 21  PAY_AMT4   3

Every column listed above has 30,000 non-null integer values, so there are no missing values to handle and no major data quality issues. We'll go ahead and build the model.
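A quick check along the following lines can back up a data quality claim like this one. It is only a sketch: `sample` is a small made-up frame standing in for `df` (an assumption, with a few of the same column names), and the random values are not taken from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the credit data: a few columns with the same names
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "LIMIT_BAL": rng.integers(10_000, 500_000, size=200),
    "AGE": rng.integers(21, 70, size=200),
    "PAY_0": rng.integers(-2, 9, size=200),
    "defaulted": rng.choice([0, 1], size=200, p=[0.78, 0.22]),
})

# Count of missing values per column (all zeros means nothing to impute)
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Share of each class in the target; a skewed split affects how accuracy should be read
class_share = sample["defaulted"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same two calls on `df` would show whether any column needs imputation and how imbalanced the `defaulted` target is.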

<hr>

### Data Preparation and Model Building

In [4]:
# Importing train_test_split from sklearn library
from sklearn.model_selection import train_test_split

In [5]:
# Putting feature variable to X
X = df.drop('defaulted',axis=1)

# Putting response variable to y
y = df['defaulted']

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)



#### Default Hyperparameters
Let's first fit an SVM model with default hyperparameters.

In [6]:
# Importing SVC (Support Vector Classifier) from sklearn library
from sklearn.svm import SVC

# Running the SVM with default parameters
svm_model = SVC()

In [7]:
# fit
svm_model.fit(X_train,y_train)

In [8]:
# Making predictions
predictions = svm_model.predict(X_test)

In [9]:
# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

In [10]:
# Let's check the report of our default model
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.78      1.00      0.88      7058
           1       0.00      0.00      0.00      1942

    accuracy                           0.78      9000
   macro avg       0.39      0.50      0.44      9000
weighted avg       0.62      0.78      0.69      9000



In [11]:
# Printing confusion matrix
print(confusion_matrix(y_test,predictions))

[[7058    0]
 [1942    0]]


In [12]:
print(accuracy_score(y_test,predictions))

0.7842222222222223
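The figures in the report can be reproduced by hand from the confusion matrix. The short sketch below uses only the four counts printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = np.array([[7058, 0],
               [1942, 0]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()        # share of all test rows predicted correctly
recall_default = tp / (tp + fn)        # share of actual defaulters that were caught
majority_share = (tn + fp) / cm.sum()  # share of non-defaulters in the test set

print(accuracy, recall_default, majority_share)
```

Because no customer is predicted as class 1, the accuracy is exactly the share of non-defaulters in the test set, and recall for defaulters is 0.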


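The 78% accuracy is misleading: the confusion matrix shows that the model predicts class 0 (no default) for every test customer, so recall for defaulters is 0 and the accuracy simply equals the share of non-defaulters in the test set (7058 of 9000). Let's now look at the list of hyperparameters which we can tune to improve model performance.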

<hr>

### Hyperparameter Tuning

The following hyperparameters are present in an SVM (Support Vector Machine) Classifier:


- **C:** float, default=1.0. This is the regularization parameter, controlling the trade-off between achieving a low error on the training data and minimizing the model complexity (which helps avoid overfitting). The strength of the regularization is inversely proportional to C.
- **kernel:** string, default='rbf'. Specifies the kernel type to be used in the algorithm. Common options include:
    - 'linear' for a linear kernel.
    - 'poly' for a polynomial kernel.
    - 'rbf' for the Radial Basis Function kernel (default).
    - 'sigmoid' for the sigmoid kernel.
- **degree:** int, default=3. Degree of the polynomial kernel function (only relevant when the kernel is set to 'poly').
- **gamma:** {'scale', 'auto'} or float, default='scale'. Defines how far the influence of a single training example reaches. With 'scale', it uses 1/(n_features⋅X.var()) as the value; with 'auto', it uses 1/n_features.
- **class_weight:** dict or 'balanced', default=None. Assigns weights to different classes. Useful when dealing with imbalanced datasets.
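The `'scale'` formula for gamma can be checked numerically. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, not the credit data): it computes 1/(n_features⋅X.var()) by hand and compares a model fitted with that value against one fitted with `gamma='scale'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data used purely for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# gamma='scale' corresponds to 1 / (n_features * X.var())
manual_gamma = 1.0 / (X_demo.shape[1] * X_demo.var())

svc_scale = SVC(kernel="rbf", gamma="scale").fit(X_demo, y_demo)
svc_manual = SVC(kernel="rbf", gamma=manual_gamma).fit(X_demo, y_demo)

# Both models should produce the same decision values
same_decisions = np.allclose(
    svc_scale.decision_function(X_demo),
    svc_manual.decision_function(X_demo),
)
print(manual_gamma, same_decisions)
```

Since `X.var()` depends on the units of the features, the value that `'scale'` resolves to changes if the features are rescaled.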

<hr>

In [None]:
# Import required libraries
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Set up parameter grid
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf', 'poly']}

# Instantiate the SVC model
svm_model = SVC()

# Perform grid search
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best parameters
print('Best parameters:', grid_search.best_params_)
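SVMs are sensitive to the scale of the input features, and the columns here range from single-digit codes to amounts in the hundreds of thousands. One possible variant of the grid search, sketched below on synthetic data, puts a `StandardScaler` and the `SVC` into a `Pipeline` so that scaling is learned inside each cross-validation fold. The names `pipe`, `demo_grid` and `demo_search` are illustrative, and this is a suggestion rather than the approach used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small imbalanced synthetic set standing in for the credit data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Scaling lives inside the pipeline, so each CV fold is scaled using its own training part
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
demo_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

demo_search = GridSearchCV(pipe, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

With an imbalanced target, a `scoring` value such as `'f1'` or `'recall'` may be more informative than `'accuracy'`.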

#### Tuning gamma for non-linear kernels
For non-linear kernels like 'rbf', the gamma parameter controls the influence of each training example.

In [None]:
# GridSearchCV to find optimal gamma
param_grid = {'gamma': ['scale', 'auto', 0.01, 0.1, 1], 'C': [1, 10, 100], 'kernel': ['rbf']}

# GridSearch for 'rbf' kernel
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best parameters
print('Best parameters for rbf kernel:', grid_search.best_params_)
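To see how gamma alone affects the fit, training accuracy can be compared with cross-validation accuracy over a range of values. The sketch below uses scikit-learn's `validation_curve` on synthetic data (an assumption, not the credit data) with C held at 1. A large gamma gives each training example a narrow reach, so the model can follow the training set closely, while a small gamma yields a smoother decision boundary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic data used purely for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

gamma_range = [0.001, 0.01, 0.1, 1, 10]

# Each returned array has one row per gamma value and one column per CV fold
train_scores, cv_scores = validation_curve(
    SVC(kernel="rbf", C=1),
    X_demo,
    y_demo,
    param_name="gamma",
    param_range=gamma_range,
    cv=5,
    scoring="accuracy",
)

# Average over folds and print training vs. cross-validation accuracy per gamma
mean_train = train_scores.mean(axis=1)
mean_cv = cv_scores.mean(axis=1)
for gamma, train_acc, cv_acc in zip(gamma_range, mean_train, mean_cv):
    print(f"gamma={gamma:<6} train={train_acc:.3f} cv={cv_acc:.3f}")
```

A widening gap between the two columns as gamma grows is a sign of overfitting.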


#### Plotting Accuracy vs. C

In [None]:
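# Results from GridSearchCV as a DataFrame (one row per C/gamma combination)
scores = pd.DataFrame(grid_search.cv_results_)
scores['param_C'] = scores['param_C'].astype(float)

# Best mean cross-validation accuracy for each C, taken over the gamma values tried
best_per_C = scores.groupby('param_C')['mean_test_score'].max()

# Plot accuracy as a function of C (log scale, since C spans several orders of magnitude)
plt.plot(best_per_C.index, best_per_C.values, marker='o', label="Mean CV Accuracy")
plt.xscale('log')
plt.xlabel("C (Regularization Parameter)")
plt.ylabel("Accuracy")
plt.title("Accuracy vs. C")
plt.legend()
plt.show()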


<hr>

The grid searches above use GridSearchCV to tune the hyperparameters and show how C, kernel, and gamma influence model performance. The final model is then refit with the best hyperparameters found.

In [None]:
# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix

# Final SVM model with the best hyperparameters from the last grid search (C, gamma and kernel)
svm_final = SVC(**grid_search.best_params_)
svm_final.fit(X_train, y_train)

# Predictions and evaluation
predictions = svm_final.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
