## Random Forest - Credit Default Prediction

In this lab, we will build a random forest model to predict whether a given customer will default. Credit default is one of the most important problems in banking and risk analytics. Several kinds of attributes can be used to predict default, such as demographic data (age, income, employment status, etc.) and credit behavioural data (past loans, payments, the number of times the customer has delayed a credit payment, etc.).

We'll start the process with data cleaning and preparation and then tune the model to find optimal hyperparameters.

<hr>

### Data Understanding and Cleaning

In [22]:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
%matplotlib inline

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [23]:
# Reading the csv file and putting it into 'df' object.
df = pd.read_csv('ecommerce_consumers.csv')
df.head()

|   | ratio | time | label |
|---|-------|------|-------|
| 0 | 0.54  | 17.2 | 1     |
| 1 | 0.93  | 18.2 | 0     |
| 2 | 0.84  | 13.6 | 1     |
| 3 | 0.19  | 6.0  | 0     |
| 4 | 0.89  | 13.2 | 1     |


In [24]:
# Let's understand the type of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
ratio    200 non-null float64
time     200 non-null float64
label    200 non-null int64
dtypes: float64(2), int64(1)
memory usage: 4.8 KB


In this case, we know that there are no major data quality issues, so we'll go ahead and build the model.
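
A quick way to back up that claim is to count missing values, check for duplicated rows, and look at the class balance. The sketch below rebuilds only the five rows shown by `df.head()` so that it runs on its own (the name `sample` is hypothetical). On the real data, the same methods would be called on `df` directly.

```python
import pandas as pd

# Only the five rows printed by df.head() above; a stand-in for the full 200-row df
sample = pd.DataFrame({
    "ratio": [0.54, 0.93, 0.84, 0.19, 0.89],
    "time": [17.2, 18.2, 13.6, 6.0, 13.2],
    "label": [1, 0, 1, 0, 1],
})

# Number of missing values per column (all zeros means nothing to impute)
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Number of fully duplicated rows
n_duplicates = sample.duplicated().sum()
print("duplicated rows:", n_duplicates)

# Share of each class in the target; a strong imbalance changes how accuracy should be read
class_share = sample["label"].value_counts(normalize=True)
print(class_share)
```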

### Data Preparation and Model Building

In [4]:
# Importing train_test_split from sklearn library
from sklearn.model_selection import train_test_split

In [9]:
# Putting the feature variables into X
X = df.drop('label', axis=1)

# Putting response variable to y
y = df['label']

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

In [10]:
# Importing random forest classifier from sklearn library
from sklearn.ensemble import RandomForestClassifier

# Instantiating the random forest with default parameters.
rfc = RandomForestClassifier()

In [11]:
# fit
rfc.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [25]:
# Making predictions on the test set
predictions = rfc.predict(X_test)

# Let's check the report of our default model
print(metrics.classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.90      1.00      0.95        35
           1       1.00      0.84      0.91        25

    accuracy                           0.93        60
   macro avg       0.95      0.92      0.93        60
weighted avg       0.94      0.93      0.93        60



In [16]:
# Printing confusion matrix
print(metrics.confusion_matrix(y_test, predictions))

[[35  0]
 [ 4 21]]


In [17]:
print(metrics.accuracy_score(y_test, predictions))

0.9333333333333333
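
The three outputs above are consistent with each other: every figure in the classification report can be recomputed from the confusion matrix. As a check, the sketch below hard-codes the matrix printed above (rows are true classes, columns are predicted classes) and derives the accuracy and the per-class precision and recall by hand.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[35, 0],
               [4, 21]])

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = np.trace(cm) / cm.sum()

# Precision per class: diagonal over column totals (everything predicted as that class)
precision = np.diag(cm) / cm.sum(axis=0)

# Recall per class: diagonal over row totals (everything truly in that class)
recall = np.diag(cm) / cm.sum(axis=1)

print("accuracy :", accuracy)            # 56 correct out of 60
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
```

The recall of 0.84 for class 1 corresponds to the 4 class-1 test samples (out of 25) that the model predicted as class 0.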


So far so good. Before we look at the hyperparameters we can tune to improve model performance, let's fit a linear SVM on the same train/test split as a point of comparison.

In [26]:
from sklearn.svm import SVC
# linear model

model_linear = SVC(kernel='linear')
model_linear.fit(X_train, y_train)

# predict
y_pred = model_linear.predict(X_test)

In [27]:
# accuracy
print("accuracy:", metrics.accuracy_score(y_true=y_test, y_pred=y_pred), "\n")

# cm
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred))

accuracy: 0.5833333333333334 

[[35  0]
 [25  0]]
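
Returning to the random forest, the tuning step announced at the start of the lab could look like the sketch below. It is only an illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (200 rows, two features, to mirror the shape of `df`), the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` are hypothetical, and the grid values are arbitrary starting points, not settings taken from the lab. `GridSearchCV` fits one forest per parameter combination and cross-validation fold, then refits the best combination on the whole training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for df (assumption): 200 rows, 2 informative features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101
)

# Illustrative grid over a few commonly tuned hyperparameters (values are arbitrary)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validated search, scored on accuracy
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=101),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_tr, y_tr)

print("best parameters:", grid_search.best_params_)
print("best CV accuracy:", grid_search.best_score_)
print("test accuracy:", grid_search.score(X_te, y_te))
```

On the real data, `X_train` and `y_train` from the split above would take the place of the synthetic arrays. Fixing `random_state` on the classifier keeps the comparison between grid points repeatable.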
