A data science approach to predict and understand the applicant’s profile to minimize the risk of future loan defaults.

#**Machine Learning: Predicting Bank Loan Defaults**

![](https://miro.medium.com/max/828/0*7oPZJ8exIl1oqAaW)

## About the project
The dataset contains information about credit applicants. Banks, globally, use this kind of dataset and type of informative data to create models to help in deciding on who to accept/refuse for a loan.

After all the exploratory data analysis, cleansing and dealing with all the anomalies we might (will) find along the way, the patterns of a good/bad applicant will be exposed to be learned by machine learning models.

##Machine Learning issue and objectives
We’re dealing with a supervised binary classification problem. The goal is to train the best machine learning model to maximize the predictive capability of deeply understanding the past customer’s profile minimizing the risk of future loan defaults.

##Performance Metric
The metric used for the models’ evaluation is the ROC AUC given that we’re dealing with a highly unbalanced data.

## Project structure

The project divides into three categories:

EDA: Exploratory Data Analysis

1.  EDA: Exploratory Data Analysis
2.  Data Wrangling: Cleansing and Feature Selection
3.  Machine Learning: Predictive Modelling

# Feature description

* **id**: Unique ID of the loan application.
* **grade**: LC assigned loan grade.
* **annual_inc**: The self-reported annual income provided by the borrower during registration.
* **short_emp**: 1 when employed for 1 year or less.
* **emp_length_num**: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
* **home_ownership**: Type of home ownership.
* **dti (Debt-To-Income Ratio)**: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
* **purpose**: A category provided by the borrower for the loan request.
* **term**: The number of payments on the loan. Values are in months and can be either 36 or 60.
* **last_delinq_none**: 1 when the borrower had at least one event of delinquency.
* **last_major_derog_none**: 1 borrower had at least 90 days of a bad rating.
* **revol_util**: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
* **total_rec_late_fee**: Late fees received to date.
* **od_ratio**: Overdraft ratio.
* **bad_loan**: 1 when a loan was not paid.


![](https://miro.medium.com/max/1400/0*k03uxySXGeuRTAGv)

##Importing the libraries and dependencies required:

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import pingouin as pg
import scipy
from scipy.stats import chi2
from scipy.stats import chi2_contingency
from scipy.stats import pearsonr, spearmanr
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import tree
from sklearn.tree import export_graphviz
from sklearn.externals.six import StringIO
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import  precision_recall_curve, roc_auc_score, confusion_
matrix,accuracy_score, recall_score, precision_score, f1_score,auc, roc_curve,
plot_
confusion_matrix
​from category_encoders import BinaryEncoder
from IPython.display import Image
import pydotplus
import matplotlib.pyplot as plt
%matplotlib inline
color = sns.color_palette()
seed = 42

SyntaxError: invalid non-printable character U+200B (226667298.py, line 26)

Loading and displaying the dataset:

In [None]:
>> data = pd.read_csv('lending_club_loan_dataset.csv', low_memory=False)
>> data.head()

![](https://miro.medium.com/max/828/1*7WrV6qGoG1d5cckXEG4rJw.webp)

## EDA: Explaratory Data Analysis
Main stats of numeric attributes:

In [None]:
>> data.describe().round(3)

![](https://miro.medium.com/max/1400/1*Th4y415-r17gC93z9EJDgg.webp)

The dataset has 2000 observations and 15 variables including the target, divided into 11 numeric and 4 categoric features.

There are variables with missing values: ‘home_ownership’ with 7.46%, ‘dti’ with 0.77%, and ‘last_major_derog_none’ with 97.13%.

From the difference between the mean and the median, and also the distance of maximum values of the variables ‘annual_inc’, ‘revol_util’ and ‘total_rec_late_fee’, it seems there are some outliers.

Main stats of categoric attributes:

In [None]:
>> data.describe(include=[np.object])



![](https://miro.medium.com/max/828/1*C2WH3HlitN80ilirdXcxsw.webp)

In [None]:
# Checking data balance/proportion
loan = data.bad_loan.value_counts().to_frame().rename(columns={"bad_loan":"absolute"})
loan["percent"] = (loan.apply(lambda x: x/x.sum()*100).round(2))
display(loan)
---
​# pie chart
data.bad_loan.value_counts().plot(kind='pie', subplots=True, autopct='%1.2f%%',
explode= (0.05, 0.05), startangle=80, legend=True, fontsize=12, figsize=(14,6),
 textprops={'color':"black"})
plt.legend(["0: paid loan","1: not paid loan"]);


![](https://miro.medium.com/max/828/1*3-YDeBbl53iOF6Qu7upBjg.webp)

**Unbalanced data:** target has 80% of default results (value 1) against 20% of loans that ended up by been paid/ non-default (value 0).

**Type of variables:**

In [None]:
>> data.dtypes.sort_values(ascending=True)

In [None]:
id                         int64
short_emp                  int64
emp_length_num             int64
last_delinq_none           int64
bad_loan                   int64
annual_inc               float64
dti                      float64
last_major_derog_none    float64
revol_util               float64
total_rec_late_fee       float64
od_ratio                 float64
grade                     object
home_ownership            object
purpose                   object
term                      object
dtype: object

## Couting variables by type:

In [None]:
>> data.dtypes.value_counts()

In [None]:
float64    6
int64      5
object     4
dtype: int64

**Checking for missing values:**

In [None]:
nulval = data.isnull().sum().to_frame().rename(columns={0:"absolute"})
nulval["percent"] = (nulval.apply(lambda x: x/x.sum())*100).round(2)
nulval

![](https://miro.medium.com/max/720/1*xdlyh45PtgS-ski1zItVEw.webp)

#EDA functions

Describing all the features in the dataset using and abusing graphics. Start by defining a few functions for every chart: boxplot, histograms, bar and pie charts, scatterplots, pivot charts, as well as a statistic descriptions.

In [None]:
# General statistics
def stats(x):
    print(f"Variable: {x}")
    print(f"Type of variable: {data[x].dtype}")
    print(f"Total observations: {data[x].shape[0]}")
    detect_null_val = data[x].isnull().values.any()
    if detect_null_val:
        print(f"Missing values: {data[x].isnull().sum()} ({(data[x].isnull().sum() / data[x].isnull().shape[0] *100).round(2)}%)")
    else:
        print(f"Missing values? {data[x].isnull().values.any()}")
    print(f"Unique values: {data[x].nunique()}")
    if data[x].dtype != "O":
        print(f"Min: {int(data[x].min())}")
        print(f"25%: {int(data[x].quantile(q=[.25]).iloc[-1])}")
        print(f"Median: {int(data[x].median())}")
        print(f"75%: {int(data[x].quantile(q=[.75]).iloc[-1])}")
        print(f"Max: {int(data[x].max())}")
        print(f"Mean: {data[x].mean()}")
        print(f"Std dev: {data[x].std()}")
        print(f"Variance: {data[x].var()}")
        print(f"Skewness: {scipy.stats.skew(data[x])}")
        print(f"Kurtosis: {scipy.stats.kurtosis(data[x])}")
        print("")

        # Percentiles 1%, 5%, 95% and 99%
print("Percentiles 1%, 5%, 95%, 99%")
        display(data[x].quantile(q=[.01, .05, .95, .99]))
        print("")
    else:
        print(f"List of unique values: {data[x].unique()}")
---
# Variable vs. target chart
def target(x):
    short_0 = data[data.bad_loan == 0].loc[:,x]
    short_1 = data[data.bad_loan == 1].loc[:,x]

    a = np.array(short_0)
    b = np.array(short_1)

    np.warnings.filterwarnings('ignore')

    plt.hist(a, bins=40, density=True, color="g", alpha = 0.6, label='Not-default', align="left")
    plt.hist(b, bins=40, density=True, color="r", alpha = 0.6, label='Default', align="right")
plt.legend(loc='upper right')
    plt.title(x, fontsize=10, loc="right")
    plt.xlabel('Relative frequency')
    plt.ylabel('Absolute frequency')
    plt.show()
---
​# Boxplot + Hist chart
def boxhist(x):
    variable = data[x]
    np.array(variable).mean()
    np.median(variable)
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.5, 2)})
    mean=np.array(variable).mean()
    median=np.median(variable)
sns.boxplot(variable, ax=ax_box)
    ax_box.axvline(mean, color='r', linestyle='--')
    ax_box.axvline(median, color='g', linestyle='-')
sns.distplot(variable, ax=ax_hist)
    ax_hist.axvline(mean, color='r', linestyle='--')
    ax_hist.axvline(median, color='g', linestyle='-')
plt.title(x, fontsize=10, loc="right")
    plt.legend({'Mean':mean,'Median':median})
    ax_box.set(xlabel='')
    plt.show()
---
# Histogram
def hist(x):
    plt.hist(data[x], bins=25)
    plt.title(x, fontsize=10, loc="right")
    plt.xlabel('Relative frequency')
    plt.ylabel('Absolute frequency')
    plt.show()
---
# Pie chart
def pie(x):
    data[x].value_counts(dropna=False).plot(kind='pie', figsize=(6,5), fontsize=10, autopct='%1.1f%%', startangle=0, legend=True, textprops={'color':"white", 'weight':'bold'});
# Number of observations by class
obs = data[x].value_counts(dropna=False)
o = pd.DataFrame(obs)
o.rename(columns={x:"Freq abs"}, inplace=True)
o_pc = (data[x].value_counts(normalize=True) * 100).round(2)
obs_pc = pd.DataFrame(o_pc)
obs_pc.rename(columns={x:"percent %"}, inplace=True)
obs = pd.concat([o,obs_pc], axis=1)
display(obs)
---
​# Variable vs. target chart
def target(x):
    short_0 = data[data.bad_loan == 0].loc[:,x]
    short_1 = data[data.bad_loan == 1].loc[:,x]

    a = np.array(short_0)
    b = np.array(short_1)

    np.warnings.filterwarnings('ignore')

    plt.hist(a, bins=40, density=True, color="g", alpha = 0.6, label='Not-default', align="left")
    plt.hist(b, bins=40, density=True, color="r", alpha = 0.6, label='Default', align="right")
plt.legend(loc='upper right')
    plt.title(x, fontsize=10, loc="right")
    plt.xlabel('Relative frequency')
    plt.ylabel('Absolute frequency')
    plt.show()
---
​# Boxplot + Hist chart
def boxhist(x):
    variable = data[x]
    np.array(variable).mean()
    np.median(variable)
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.5, 2)})
    mean=np.array(variable).mean()
    median=np.median(variable)
sns.boxplot(variable, ax=ax_box)
    ax_box.axvline(mean, color='r', linestyle='--')
    ax_box.axvline(median, color='g', linestyle='-')
sns.distplot(variable, ax=ax_hist)
    ax_hist.axvline(mean, color='r', linestyle='--')
    ax_hist.axvline(median, color='g', linestyle='-')
plt.title(x, fontsize=10, loc="right")
    plt.legend({'Mean':mean,'Median':median})
    ax_box.set(xlabel='')
    plt.show()
----
# Bar chart
def bar(x):
    ax = data[x].value_counts().plot(kind="bar", figsize=(6,5), fontsize=10, color=sns.color_palette("rocket"), table=False)
    for p in ax.patches:
        ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 5), textcoords='offset points')
plt.xlabel(x, fontsize=10)
    plt.xticks(rotation=0, horizontalalignment="center")
    plt.ylabel("Absolute values", fontsize=10)
    plt.title(x, fontsize=10, loc="right")

---

# Barh chart
def barh(x):
    data[x].value_counts().plot(kind="barh", figsize=(6,5), fontsize=10, color=sns.color_palette("rocket"), table=False)
    plt.xlabel("Absolute values", fontsize=10)
    plt.xticks(rotation=0, horizontalalignment="center")
    plt.ylabel(x, fontsize=10)
    plt.title(x, fontsize=10, loc="right")
---

# Pivot_table_mean
def pivot_mean(a, b, c):
    type_pivot_mean = data.pivot_table(
        columns=a,
        index=b,
        values=c, aggfunc=np.mean)
    display(type_pivot_mean)
# Display pivot_table
    type_pivot_mean.sort_values(by=[b], ascending=True).plot(kind="bar", title=(b), figsize=(6,4),fontsize = 12);
# Pivot_table_sum
def pivot_sum(a, b, c):
    type_pivot_sum = data.pivot_table(
        columns=a,
        index=b,
        values=c, aggfunc=np.sum)
    display(type_pivot_sum)
# Display pivot_table
    type_pivot_sum.sort_values(by=[b], ascending=True).plot(kind="bar", title=(b), figsize=(6,4),fontsize = 12);
---

# Scatter plot
def scatter(x, y):
    targets = data["bad_loan"].unique()
for target in targets:
        a = data[data["bad_loan"] == target][x]
        b = data[data["bad_loan"] == target][y]
plt.scatter(a, b, label=f"bad loan: {target}", marker="*")

    plt.xlabel(x, fontsize=10)
    plt.ylabel(y, fontsize=10)
    plt.title("abc", fontsize=10, loc="right")
    plt.legend()
    plt.show()

# Visualization of the numeric distribution:

In [None]:
data.hist(figsize=(10,9), bins=12, ec="b", xlabelsize=8, ylabelsize=8, alpha=0.9, grid=False)
plt.tight_layout()
plt.show()

![](https://miro.medium.com/max/828/1*Pe5X20TccBJ4-2ydDPKBBw.webp)

#Visualization of the categoric distribution:

In [None]:
for col in data.select_dtypes(include=["object"]).columns:
    data[col].value_counts().plot(kind="bar", color=sns.color_palette("rocket"))

    plt.xlabel("Class", fontsize=10)
    plt.xticks(rotation=90, horizontalalignment="center")
    plt.ylabel("Count", fontsize=10)
    plt.title(col, fontsize=10, loc="right")
    plt.show()

![](https://miro.medium.com/max/828/1*CN5GYi961u4v9CV_yp91yA.webp)

It seems there is a typo in the ’36 Months’ class on the variable ‘term’. Let’s fix it by lowering the capitalized characters.

In [None]:
>> data.term = data.term.str.lower()
>> data.term.value_counts()
36 months    15001
 60 months     4999
Name: term, dtype: int64

## Feature: grade
LC assigned loan grade.

In [None]:
stats("grade")
Variable: grade
Type of variable: object
Total observations: 20000
Missing values? False
Unique values: 7
List of unique values: ['A' 'D' 'E' 'B' 'G' 'C' 'F']
---
bar("grade")

![](https://miro.medium.com/max/828/1*dPlP4ct0sF7xrol3HMEBfg.webp)

In [None]:
pivot_sum("home_ownership","grade","id")

![](https://miro.medium.com/max/828/1*aIQJLZpKeTyq1qvcE9NOEA.webp)

![](https://miro.medium.com/max/828/1*4GnzSash1LhToWWvEvOvpw.webp)

When the grade classes decrease, the type of homeownership tends to shift from mortgage to rent. It is on grades B, C, and D that we see the type of own propriety as the highest class.

In [None]:
target("grade")

![](https://miro.medium.com/max/828/1*VTgcgnGrZ_htsTPROzqBsw.webp)

It is between the upper-grade classes that the highest not-default loans happen.

## Feature: annual_inc

The self-reported annual income provided by the borrower during registration.

In [None]:
boxhist("annual_inc")

![](https://miro.medium.com/max/828/1*VAOaQ5o99Q5dNn1Fk-4Rtg.webp)

In [None]:
Variable: annual_inc
Type of variable: float64
Total observations: 20000
Missing values? False
Unique values: 2566
Min: 8412
25%: 47000
Median: 65000
75%: 88000
Max: 1000000
Mean: 73349.57835
Std dev: 45198.56725472537
Variance: 2042910481.8799326
Skewness: 5.275648123592321
Kurtosis: 66.72665803201564

Percentiles 1%, 5%, 95%, 99%
0.01     20519.5
0.05     30000.0
0.95    145000.0
0.99    225000.0
Name: annual_inc, dtype: float64
---
target("annual_inc")

![](https://miro.medium.com/max/828/1*YVq_ZcQgCHEx_AAH3h15Jw.webp)


The histogram tells us that the higher the income, the higher is the trend of default.

In [None]:
scatter("annual_inc","dti")

![](https://miro.medium.com/max/828/1*pGJPzn58WKh1TN07ALXVSg.webp)


The scatterplot shows a weak and negative association between ‘annual income’ and ‘debt to income ratio’.

In [None]:
data.annual_inc.corr(dti)
>> -0.22853314935876534

The correlation value is of -0.23 meaning as the annual_inc decreases, the loans at instance 1 (default/ not paid) increases.

## Feature: short_emp
1 when employed for 1 year or less.

In [None]:
hist("short_emp")

![](https://miro.medium.com/max/828/1*fuJ9Xsdz1Hig9WS1gMlraA.webp)


The clients that have been employed for one or less years (instance 1) represent 11.25% whereas 88.75% of the clients were employed for more than 1 year.

In [None]:
stats("short_emp")
Variable: short_emp
Type of variable: int64
Total observations: 20000
Missing values? False
Unique values: 2
Min: 0
25%: 0
Median: 0
75%: 0
Max: 1
Mean: 0.1125
Std dev: 0.3159885163057429
Variance: 0.09984874243710473
Skewness: 2.4526820936006293
Kurtosis: 4.015649452269171

Percentiles 1%, 5%, 95%, 99%
0.01    0.0
0.05    0.0
0.95    1.0
0.99    1.0
Name: short_emp, dtype: float64
---
target("short_emp")


![](https://miro.medium.com/max/828/1*DIZetnVN3mWxa7-wni8Y7A.webp)

The segment employed for less than 1 year had loans on default more frequently than the other segment.

## Feature: emp_length_num
Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

In [None]:
boxhist("emp_length_num")

![](https://miro.medium.com/max/828/1*zcuNZXvKFEInNmnW_UEutA.webp)

In [None]:
stats("emp_length_num")
Variable: emp_length_num
Type of variable: int64
Total observations: 20000
Missing values? False
Unique values: 12
Min: 0
25%: 3
Median: 7
75%: 11
Max: 11
Mean: 6.8214
Std dev: 3.7742302898357223
Variance: 14.24481428071344
Skewness: -0.27964924120655704
Kurtosis: -1.3664296257576731

Percentiles 1%, 5%, 95%, 99%
0.01     0.0
0.05     1.0
0.95    11.0
0.99    11.0
Name: emp_length_num, dtype: float64
---
target("emp_length_num")

![](https://miro.medium.com/max/828/1*EHP8StIGmEA1ELoD5pX4-w.webp)

With a few exceptions, it’s amongst nonstop employed clients for more than 10 years that the not-default loans occur.

In [None]:
pivot_mean("bad_loan", "purpose", "emp_length_num")


![](https://miro.medium.com/max/828/1*XABhYzYwSFW-_0sajYc5hw.webp)

Wedding and vacation are the two purposes in which, on average, the majority of loans ended up not been paid.

##Feature: home_ownership
Type of home ownership.

In [None]:
stats("home_ownership")
Variable: home_ownership
Type of variable: object
Total observations: 20000
Missing values: 1491 (7.46%)
Unique values: 3
List of unique values: ['RENT' 'OWN' 'MORTGAGE' nan]
---
bar("home_ownership")

![](https://miro.medium.com/max/828/1*O94LlPEGd1pzmc9X9yD00A.webp)

In [None]:
pie("home_ownership")

![](https://miro.medium.com/max/828/1*4FzHFUO8YacNQje9FcE_ag.webp)

In [None]:
pivot_sum("bad_loan", "home_ownership", "id")

![](https://miro.medium.com/max/828/1*7tx6UdStyUx7h3BBNi7U8Q.webp)

Proportionally, there is no major difference between the type of homeownership and the default loans.

## Feature: dti (debt-to-income)


A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

In [None]:
boxhist("dti")

![](https://miro.medium.com/max/828/1*PZfu5YX_GrC1Ll_hgsCaZg.webp)

In [None]:
stats("dti")
Variable: dti
Type of variable: float64
Total observations: 20000
Missing values: 154
Unique values: 3295
Min: 0
25%: 10
Median: 16
75%: 22
Max: 34
Mean: 16.58784137861536
Std dev: 7.585811951545168
Variance: 57.544542964205505
Skewness: nan
Kurtosis: nan

Percentiles 1%, 5%, 95%, 99%
0.01     1.7800
0.05     4.6500
0.95    29.6900
0.99    33.4355
Name: dti, dtype: float64
---
target("dti")

![](https://miro.medium.com/max/828/1*9ecx-iLTOeGjK7k8awc5Kg.webp)

The distribution of bad loans (default) has, on average, higher ‘dti’ values (debt to income ratio) in comparison with the good loans. The trend is: the higher the effort rate, the more frequently are the loans on default.

In [None]:
pivot_sum("home_ownership", "purpose", "dti")

![](https://miro.medium.com/max/828/1*fkiYJJ3EBCajr61XzJEZfQ.webp)

It is notorious that the main purpose for the requested loans is ‘debt consolidation’ followed by ‘credit card’, for ‘mortage’ and ‘own’ as types of home ownership by more than 6000 and 4000 people respectivelly.

On the other hand, ‘moving’ and ‘wedding’ and the purpose less declared by 19 and 47 in both same segments of type of ownership. For those who live in a rented place, the number of people who requests loans is substancially inferior, proporcionally.

![](https://miro.medium.com/max/828/1*Jp4-eR_eGMJb7wUU_AGdEQ.webp)

In [None]:
pivot_sum("bad_loan", "grade", "dti")


![](https://miro.medium.com/max/828/1*5C195281ZL_6QB9K2buSQQ.webp)

The trend is when the grade classes decrease, the probability of a default loan increases.

##Feature: purpose
A category provided by the borrower for the loan request.

In [None]:
stats("purpose")
Variable: purpose
Type of variable: object
Total observations: 20000
Missing values? False
Unique values: 12
List of unique values: ['credit_card' 'debt_consolidation' 'medical' 'other' 'home_improvement'
 'small_business' 'major_purchase' 'vacation' 'car' 'house' 'moving'
 'wedding']
---
barh("purpose")

![](https://miro.medium.com/max/828/1*vOLggx61tc3mx0pSOzz_hw.webp)

In [None]:
pivot_sum("bad_loan", "purpose", "id")

![](https://miro.medium.com/max/828/1*UZenCxogg-9VhaZuD2rusA.webp)

##Feature: term
The number of payments on the loan. Values are in months and can be either 36 or 60.

In [None]:
pie("term")

![](https://miro.medium.com/max/828/1*yFULmo4Ft-jHharkTyFCIg.webp)

In [None]:
target("term")

![](https://miro.medium.com/max/828/1*ZD2eJklJU-2d3aKcMC8ppQ.webp)

Default loans occur inversely and more frequently over a 60 months term period.

In [None]:
pivot_mean("term", "grade", "annual_inc")

![](https://miro.medium.com/max/828/1*7c7dmqUBhevubJEw5n777g.webp)

On average, the 36 months term is the most common amongst clients with the highest debt-to-income that belong to the lowest grade class.

##Feature: last_delinq_none
1 when the borrower had at least one event of delinquency.

In [None]:
target("last_delinq_none")

![](https://miro.medium.com/max/1400/1*ZDb5Ok1EZUo2U7jRSyIXAA.webp)

In [None]:
pie("last_delinq_none")

![](https://miro.medium.com/max/828/1*c_DpgeKfdd_bdj-nIcs93w.webp)

In [None]:
stats("last_delinq_none")
Variable: last_delinq_none
Type of variable: int64
Total observations: 20000
Missing values? False
Unique values: 2
Min: 0
25%: 0
Median: 1
75%: 1
Max: 1
Mean: 0.5466
Std dev: 0.49783614979391533
Variance: 0.24784083204162968
Skewness: -0.18721487004502552
Kurtosis: -1.9649505924340243

Percentiles 1%, 5%, 95%, 99%
0.01    0.0
0.05    0.0
0.95    1.0
0.99    1.0
Name: last_delinq_none, dtype: float64
---
pivot_mean("bad_loan","purpose","last_delinq_none")


![](https://miro.medium.com/max/828/1*PwPocAtxanroske2srlVew.webp)

Loans on default are, on average, the more frequent between loans’ purposes such as vacation and house.

##Feature: last_major_derog_none
1 borrower had at least 90 days of a bad rating.

In [None]:
bar("last_major_derog_none")


![](https://miro.medium.com/max/828/1*IyF1tPqq4BIGos_sfg_6Jw.webp)

In [None]:
stats("last_major_derog_none")
Variable: last_major_derog_none
Type of variable: float64
Total observations: 20000
Missing values: 19426 (97.13%)
Unique values: 2
Min: 0
25%: 1
Median: 1
75%: 1
Max: 1
Mean: 0.759581881533101
Std dev: 0.42771012441406686
Variance: 0.18293595052629658
Skewness: nan
Kurtosis: nan
Percentiles 1%, 5%, 95%, 99%
0.01    0.0
0.05    0.0
0.95    1.0
0.99    1.0
Name: last_major_derog_none, dtype: float64
---
target("last_major_derog_none")

![](https://miro.medium.com/max/828/1*5IHJVJqplGSPD7EtcccoTw.webp)

##Feature: revol_util
Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.


In [None]:
stats("revol_util")
Variable: revol_util
Type of variable: float64
Total observations: 20000
Missing values? False
Unique values: 1030
Min: 0
25%: 38
Median: 57
75%: 73
Max: 5010
Mean: 55.95814805499972
Std dev: 42.117455872216155
Variance: 1773.880089148075
Skewness: 81.32716395041949
Kurtosis: 9569.242123791564
Percentiles 1%, 5%, 95%, 99%
0.01     2.699
0.05    14.500
0.95    91.800
0.99    97.300
Name: revol_util, dtype: float64
---
scatter("annual_inc", "revol_util")

![](https://miro.medium.com/max/828/1*X0ra6Sp0F4Qo0vsshre3FA.webp)

The lowest the clients’ annual income, the highest the amount of credit the borrower is using relative to all available revolving credit.

In [None]:
boxhist("revol_util")

![](https://miro.medium.com/max/828/1*58dxIqTATziU0lcHt3Azlw.webp)

# Feature: total_rec_late_fee
Late fees received to date.

In [None]:
stats("total_rec_late_fee")
Variable: total_rec_late_fee
Type of variable: float64
Total observations: 20000
Missing values? False
Unique values: 166
Min: 0
25%: 0
Median: 0
75%: 0
Max: 96
Mean: 0.29062163999999996
Std dev: 3.1086544166442467
Variance: 9.663732282121781
Skewness: 14.299156408331024
Kurtosis: 262.30322072057277
Percentiles 1%, 5%, 95%, 99%
0.01     0.0
0.05     0.0
0.95     0.0
0.99    15.0
Name: total_rec_late_fee, dtype: float64
---
target("total_rec_late_fee")


![](https://miro.medium.com/max/828/1*DsxPJEj0AAqnAahc6fQvGw.webp)

In [None]:
scatter("annual_inc", "total_rec_late_fee")

![](https://miro.medium.com/max/1400/1*SS1efqu8TvdAwMSLymGDAA.webp)

In [None]:
data.total_rec_late_fee.corr(annual_inc)
>> -0.00975830114406848

The customers with the lowest annual income are the ones that have more late fees, especially the highest and heavy ones.

In [None]:
pivot_mean("bad_loan", "purpose", "total_rec_late_fee")

![](https://miro.medium.com/max/1400/1*GqvACq8453faNRuK8G4hMg.webp)

The late fees occur in a higher frequency amongst loan purposes such as a house, small business, or vacation. On the other hand, wedding and car are the credit purposes with the lowest late fees execution.

# Feature: od_ratio
Overdraft ratio.

In [None]:
boxhist("od_ratio")

![](https://miro.medium.com/max/828/1*6MafxZoIS5xM2XibM5eEIg.webp)

In [None]:
stats("od_ratio")
Variable: od_ratio
Type of variable: float64
Total observations: 20000
Missing values? False
Unique values: 20000
Min: 0
25%: 0
Median: 0
75%: 0
Max: 0
Mean: 0.5044303048872487
Std dev: 0.2877201586666063
Variance: 0.0827828897031371
Skewness: -0.02052095981509419
Kurtosis: -1.1914529752985776
Percentiles 1%, 5%, 95%, 99%
0.01    0.009887
0.05    0.051495
0.95    0.951616
0.99    0.990142
Name: od_ratio, dtype: float64
---
scatter("annual_inc", "od_ratio")

![](https://miro.medium.com/max/1400/1*2_YXgx0kuHd1L-3CfdOG2Q.webp)

There is a higher overdraft rate between clients with the lowest annual income, the same applicants involved with the most frequent defaulted loans.

In [None]:
pivot_sum("bad_loan", "term", "od_ratio")

![](https://miro.medium.com/max/828/1*1S2K2eza2AvXjGTxIf4qvw.webp)

Proportionally, overdraft ratios are higher on a 60-month term amongst defaulted loans.

# Feature: bad_loan
1 when a loan was not paid.

In [None]:
stats("bad_loan")
Variable: bad_loan
Type of variable: int64
Total observations: 20000
Missing values? False
Unique values: 2
Min: 0
25%: 0
Median: 0
75%: 0
Max: 1
Mean: 0.2
Std dev: 0.40001000037498174
Variance: 0.16000800039999288
Skewness: 1.4999999999999996
Kurtosis: 0.24999999999999956
Percentiles 1%, 5%, 95%, 99%
0.01    0.0
0.05    0.0
0.95    1.0
0.99    1.0
Name: bad_loan, dtype: float64
---
bar("bad_loan")

![](https://miro.medium.com/max/828/1*PoBR8pKv768AWdCkjvI-AQ.webp)

#**CORRELATIONS**
**Heatmap → Pearson method**

In [None]:
mask = np.triu(data.corr(), 1)
plt.figure(figsize=(19, 9))
sns.heatmap(data.corr(), annot=True, vmax=1, vmin=-1, square=True, cmap='BrBG', mask=mask);

![](https://miro.medium.com/max/828/1*2K824DQz8MKV8uoFzi0vgg.webp)

The heatmap shows there are some positive and negative correlations amongst variables.

Let’s now find out which numerical features are the most correlated with the target.

In [None]:
bad_loan_c = pg.pairwise_corr(data, columns=['bad_loan'], method='pearson').loc[:,['X','Y','r']]
bad_loan_c.sort_values(by=['r'], ascending=False)

![](https://miro.medium.com/max/786/1*MMNqqY2exph5ubpeKFlAvA.webp)

The variable that is most correlated with the target is ‘dti’ with a weak and positive correlation of 0.141884.

**Heatmap → Spearman method**

In [None]:
data_spear = data.copy()
data_spear.drop(["bad_loan"], axis=1, inplace=True)
---
spearman_rank = pg.pairwise_corr(data_spear, method='spearman').loc[:,['X','Y','r']]
pos = spearman_rank.sort_values(kind="quicksort", by=['r'], ascending=False).iloc[:5,:]
neg = spearman_rank.sort_values(kind="quicksort", by=['r'], ascending=False).iloc[-5:,:]
con = pd.concat([pos,neg], axis=0)
display(con.reset_index(drop=True))

![](https://miro.medium.com/max/828/1*5rOzPEKi4U_EpXeEZsGXJQ.webp)

In [None]:
mask = np.triu(data_spear.corr(method='spearman'), 1)
plt.figure(figsize=(19, 9))
sns.heatmap(data_spear.corr(method='spearman'), annot=True, vmax=1, vmin=-1, square=True, cmap='BrBG', mask=mask);

![](https://miro.medium.com/max/828/1*tj3guVCLqAEzn-88M1Wenw.webp)

By plotting a heatmap — spearman method — it’s easy to understand that the variables ‘last_major_derog_none’ and ‘last_delinq_none’ are the two most correlated features with the highest monotonic relationship (60%). Nevertheless, ‘last_major_derog_none’ has 19426 missing values (97%), which is too much.

In this scenario, the column ‘last_major_derog_none’ is to be dropped as well as all the nan values from the dataset. Thus, the next most correlated variables would be ‘emp_length_num’ and ‘short_emp’, with a Spearman value of -55%.



#**Data Wrangling: Cleansing and Feature Selection**
**OUTLIERS**

Let’s examine the data and check for any outliers.
Starting by selecting and filtering numeric and categoric data.

In [None]:
data_ca = data.select_dtypes(exclude=["int64","float64"]).copy()
data_nu = data.select_dtypes(exclude=["object","category"]).copy()

**Boxplot**: Visualizing the numeric data dispersion

In [None]:
fig, axs = plt.subplots(ncols=3, nrows=4, figsize=(16, 8))
index = 0
axs = axs.flatten()
for k,v in data_nu.items():
    sns.boxplot(y=k, data=data_nu, ax=axs[index], orient="h")
    index += 1
    plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

![](https://miro.medium.com/max/828/1*ekadzEhgx5WaDG4ojhOCgQ.webp)

In [None]:
display(data.describe().loc[["mean","50%","std"]].loc[:,["annual_inc","revol_util","total_rec_late_fee"]])

![](https://miro.medium.com/max/828/1*JW9F4mJ5uR0x2Sb2MwY_Bg.webp)

Definitely, there are some outliers in the variables ‘annual_inc’, ‘revol_util’ and ‘total_rec_late_fee’. Let’s detect and solve them.

**‘annual_inc’**

In [None]:
print(data.annual_inc.describe())
count      20000.000000
mean       73349.578350
std        45198.567255
min         8412.000000
25%        47000.000000
50%        65000.000000
75%        88000.000000
max      1000000.000000
Name: annual_inc, dtype: float64
---
boxhist("annual_inc")

![](https://miro.medium.com/max/828/1*fbDwwLiScGH_ZcLVND2RGg.webp)

The graph and the boxplot suggests this variable has too many data points outside the upper quartile. Outliers are plotted as those individual points beyond the boxplot whiskers. The method used here is the IQR score.

In [None]:
# Dealing with the outliers through IQR score method
Q1 = data['annual_inc'].quantile(0.25)
Q3 = data['annual_inc'].quantile(0.75)
IQR = Q3 - Q1
data['annual_inc'] = data.annual_inc[~((data.annual_inc < (Q1 - 1.5 * IQR)) |(data.annual_inc > (Q3 + 1.5 * IQR)))]
---
print(data.annual_inc.describe())
count     19074.000000
mean      66792.117857
std       27241.646991
min        8412.000000
25%       46000.000000
50%       62000.000000
75%       84000.000000
max      149000.000000
Name: annual_inc, dtype: float64
---
boxhist("annual_inc")

![](https://miro.medium.com/max/828/1*rEGKMXpJzy4mb8Mybp-V0w.webp)

In [None]:
print(int(data_nu.annual_inc.describe()[0]) - int(data.annual_inc.describe()[0]),"outliers were removed with this operation.")
>> 926 outliers were removed with this operation.

**‘revol_util’**

In [None]:
print(data.revol_util.describe())
count    20000.000000
mean        55.958148
std         42.117456
min          0.000000
25%         38.800000
50%         57.100000
75%         73.900000
max       5010.000000
Name: revol_util, dtype: float64
---
boxhist("revol_util")

![](https://miro.medium.com/max/828/1*bSta08oqzIqWp_4XO05mmw.webp)

The graph and the boxplot analisys suggests that this variable has one data point far away from the upper quartile.

Outliers are plotted as those individual points beyond the boxplot whiskers but that doesn’t mean that every single data point outside the lower and upper quartiles are indeed outliers.

Better check it out and possibly remove that single outlier.

In [None]:
# Dealing with the 5010.0 outlier
value = data.revol_util.quantile([.99999])
p = value.iloc[0]
data = data[data["revol_util"] < p]
---
print(data['revol_util'].describe())
​count    19999.000000
mean        55.710434
std         23.380722
min          0.000000
25%         38.800000
50%         57.100000
75%         73.900000
max        128.100000
Name: revol_util, dtype: float64
---
boxhist("revol_util")

![](https://miro.medium.com/max/828/1*44s3HA4X63-9MBnHnGonQA.webp)

In [None]:
print(int(data_nu.revol_util.describe()[0]) - int(data.revol_util.describe()[0]),"outlier was removed with this operation.")
>> 1 outlier was removed with this operation.

**‘total_rec_late_fee’**
Visualizing the data dispersion:

In [None]:
sns.boxplot(x=data['total_rec_late_fee'],data=data)
plt.xlabel('total_rec_late_fee', fontsize=10)
plt.show()


![](https://miro.medium.com/max/828/1*n6l8SU4WXMvKVxdJ0AkzlA.webp)

**Removing outlier:**

value = data.total_rec_late_fee.quantile([.989])

p = value.iloc[0]

data = data[data["total_rec_late_fee"] < p]

**Checking results:**

sns.boxplot(x=data['total_rec_late_fee'],data=data)

plt.xlabel('total_rec_late_fee', fontsize=10)

plt.show()

![](https://miro.medium.com/max/828/1*W-6b-6oHW9GNQ_ugNSgumw.webp)

Although there is a significant number of data point distant from the upper quartile, I believe this is the case that those are not outliers, their values are significant for the target classification. Not removing any others.

In [None]:
for col in data[["annual_inc", "total_rec_late_fee", "revol_util"]].columns:
    sns.boxplot(data[col])
    plt.show()

![](https://miro.medium.com/max/1400/1*C9QXTaPmkbBaajT1Ul3cuA.webp)

#MISSING VALUES
Time to detect and eliminate them.

In [None]:
for column in data.columns:
    if data[column].isna().sum() != 0:
        missing = data[column].isna().sum()
        portion = (missing / data.shape[0]) * 100
        print(f"'{column}': number of missing values '{missing}' ---> '{portion:.3f}%'")
>> 'annual_inc': number of missing values '915' ---> '4.626%'
>> 'home_ownership': number of missing values '1476' ---> '7.462%'
>> 'dti': number of missing values '152' ---> '0.768%'
>> 'last_major_derog_none': number of missing values '19208' ---> '97.113%'

**‘annual_inc’**

In [None]:
data.annual_inc.value_counts(dropna=False)
NaN         915 <---
60000.0     771
50000.0     729
65000.0     607
70000.0     599
           ...
109097.0      1
88621.0       1
50455.0       1
18300.0       1
96241.0       1
Name: annual_inc, Length: 2349, dtype: int64
---
boxhist("annual_inc")

![](https://miro.medium.com/max/828/1*-rksXe2K5oMN4UlogFnGuA.webp)

**Strategy**: Replacing missing values with the mean (average).

In [None]:
data["annual_inc"] = data.annual_inc.fillna(data.annual_inc.mean())
print(f"Fillna done. Anomalies detected: {data.annual_inc.isnull().values.any()}")
>> Fillna done. Anomalies detected: False

**‘home_ownership’**

In [None]:
data.home_ownership.value_counts(dropna=False)
MORTGAGE    9744
RENT        6959
OWN         1600
NaN         1476 <---
Name: home_ownership, dtype: int64
---
bar("home_ownership")

![](https://miro.medium.com/max/828/1*B9qGbYbo4KjJj0MIq0dGkQ.webp)

**Strategy**: Mode imputation (replacing NaN by most frequent value: Mortage).

In [None]:
data["home_ownership"] = data.home_ownership.fillna(data.home_ownership.value_counts().index[0])
print(f"Imputation done. Missing values: {data.home_ownership.isnull().sum()}")
>> Imputation done. Missing values: 0

**‘dti’**

In [None]:
data.dti.value_counts(dropna=False)
NaN      152 <---
19.64     20
15.87     20
9.60      19
13.49     19
        ...
1.61       1
29.00      1
29.19      1
31.78      1
3.26       1
Name: dti, Length: 3286, dtype: int64
---
boxhist("dti")


![](https://miro.medium.com/max/828/1*Qb7DHQRyhAVIGFY9leY-hA.webp)

**Strategy**: Replacing missing values with the mean (average).

In [None]:
data["dti"] = data.dti.fillna(data.dti.mean())
print(f"Fillna done. Missing values: {data.dti.isnull().values.any()}")
>> Fillna done. Missing values: False

**‘last_major_derog_none’**

In [None]:
abs_mv = data.last_major_derog_none.value_counts(dropna=False)
pc_mv = data.last_major_derog_none.value_counts(dropna=False, normalize=True) * 100
pc_mv_df = pd.DataFrame(pc_mv)
pc_mv_df.rename(columns={"last_major_derog_none":"Percent %"}, inplace=True)
abs_pc = pd.concat([abs_mv,pc_mv_df], axis=1)
abs_pc

![](https://miro.medium.com/max/640/1*GWjpn-GVqau2lVqqWIw2Dw.webp)

**Strategy**: Drop ‘last_major_derog_none’ numerical variable (too many anomalies).

In [None]:
data.drop("last_major_derog_none", axis=1, inplace=True)
print(f"All missing values are solved in the entire dataset: {data.notnull().values.any()}")
>> All missing values are solved in the entire dataset: True

## FEATURE SELECTION

In [None]:
>> data.info()

![](https://miro.medium.com/max/828/1*aNz5K95zKVvsHtP8K2eJPQ.webp)

**Strategy**: Drop ‘id’ numerical variable (irrelevant feature).

In [None]:
data.drop("id", axis=1, inplace=True)
---
data.shape
>> (19779, 13)

##Numerical Features and Categorical/Binary Target

Selecting numeric variables only:

In [None]:
>> data_nu = data.select_dtypes(exclude=["object","category"]).copy()

Creating subsets:

In [None]:
>> Xnum = data_nu.drop(["bad_loan"], axis= "columns")

In [None]:
>> ynum = data_nu.bad_loan

In [None]:
# Identifying the predictive features using the Pearson Correlation p-value
pd.DataFrame(
    [scipy.stats.pearsonr(Xnum[col],
    ynum) for col in Xnum.columns],
    columns=["Pearson Corr.", "p-value"],
    index=Xnum.columns,
).round(4)

![](https://miro.medium.com/max/640/1*WDnr6ErOST-qe_mIR06p9w.webp)

**Strategy**: Drop ‘od_ratio’ (p-value > 0.05) → low information to the target, and keep all the others.

##Categorical Features and Categorical/Binary Target

Selecting categoric variables only:

In [None]:
Xcat = data.select_dtypes(exclude=['int64','float64']).copy()

Creating subsets:

In [None]:
Xcat['target'] = data.bad_loan
Xcat.dropna(how="any", inplace=True)
ycat = Xcat.target
Xcat.drop("target", axis=1, inplace=True)

**Chi-square** test for independence:

In [None]:
for col in Xcat.columns:
    table = pd.crosstab(Xcat[col], ycat)
    print()
    display(table)
    _, pval, _, expected_table = scipy.stats.chi2_contingency(table)
    print(f"p-value: {pval:.25f}")


![](https://miro.medium.com/max/786/1*S14GSrS87x1xMIYONNi5Sg.webp)

![](https://miro.medium.com/max/828/1*JKu7tjJ0SYxtnQLMADMxOw.webp)

**Strategy**: Keep all features (p-value < 0.05). The categorical variables have predictive power.

##ENCODING & TRANSFORMATIONS

Let’s continue by encoding and transforming the categorical variables into numeric ones.

The feature ‘grade’ is a scale which means i’ll be mapping it to numbers. On the other hand, concerning the variables ‘term’, ‘home_ownership’ and ‘purpose’, we need to inspect and decide which procedure (OHE or Binary Encoding) is the best option.

**Variable: ‘grade’**

In [None]:
>> data["grade"] = data.grade.map({"A":7, "B":6, "C":5, "D":4, "E":3, "F":2, "G":1})

**Variables: ‘term’, ‘home_ownership’, ‘purpose’**

One Hot Encoding and Binary Encoding will be both displayed so we can chose the best to apply.

In [None]:
df_term = data.term
df_home = data.home_ownership
df_purp = data.purpose
#term
t_ohe = pd.get_dummies(df_term)
bin_enc_term = BinaryEncoder()
t_bin = bin_enc_term.fit_transform(df_term)
#home_ownsership
h_ohe = pd.get_dummies(df_home)
bin_enc_home = BinaryEncoder()
h_bin = bin_enc_home.fit_transform(df_home)
#purpose
p_ohe = pd.get_dummies(df_purp)
bin_enc_purp = BinaryEncoder()
p_bin = bin_enc_purp.fit_transform(df_purp)
>> The results are:
** COLUMNS OHE **
term: 2 <--- best
home: 3 <--- best
purp: 12

** COLUMNS BINARY **
term: 2
home: 3
purp: 5 <--- best

**One Hot Encoding (OHE)**

In [None]:
>> data = pd.get_dummies(data, columns=["term","home_ownership"])

**Binary Encoding**

In [None]:
>> bin_enc_purp = BinaryEncoder()
>> data_bin = bin_enc_purp.fit_transform(data.purpose)

In [None]:
# Concatenating both datasets
df = pd.concat([data,data_bin],axis=1)
# Dropping 'purpose'
df.drop(["purpose"], axis=1, inplace=True)
# Lowering upper characters
df.columns = [x.lower() for x in df.columns]
# printing 5 first rows
df.head()

![](https://miro.medium.com/max/828/1*A-G29ncWTMFXywthGS9XMg.webp)

At this point, we are ready to test and train some models!



#**MACHINE LEARNING: Predictive modeling**

We’re dealing with a supervised binary problem using classification techniques.

Given that we have unbalanced data, we’ll use AUC ROC as the best metric to evaluate the performance of the following models.

Let’s define a function of the AUC to plot and display the threshold between the true positive rate (TPR) and false positive rate (FPR).

In [None]:
# ROC Curve: Area Under the Curve
def auc_roc_plot(y_test, y_preds):
    fpr, tpr, thresholds = roc_curve(y_test,y_preds)
    roc_auc = auc(fpr, tpr)
    print(roc_auc)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--'
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate'
    plt.xlabel('False Positive Rate')
    plt.show()

#Logistic Regression (LR)

In [None]:
# Making a copy of the dataset
df_lr = df.copy()
---
# Dividing the dataset in train (80%) and test (20%)
train_set_lr, test_set_lr = train_test_split(df_lr, test_size = 0.2, random_state = seed)
X_train_lr = train_set_lr.drop(['bad_loan'], axis = 1)
y_train_lr = train_set_lr['bad_loan']
X_test_lr = test_set_lr.drop(['bad_loan'], axis = 1)
y_test_lr = test_set_lr['bad_loan']
---
# Normalizing the train and test data
scaler_lr = MinMaxScaler()
features_names = X_train_lr.columns
X_train_lr = scaler_lr.fit_transform(X_train_lr)
X_train_lr = pd.DataFrame(X_train_lr, columns = features_names)
X_test_lr = scaler_lr.transform(X_test_lr)
X_test_lr = pd.DataFrame(X_test_lr, columns = features_names)
---
%%time
lr = LogisticRegression(max_iter = 1000, solver = 'lbfgs', random_state = seed, class_weight = 'balanced' )
parameters = {'C':[0.001, 0.01, 0.1, 1, 10, 100]}
clf_lr = GridSearchCV(lr, parameters, cv = 5).fit(X_train_lr, y_train_lr)
>>> CPU times: user 10.3 s, sys: 449 ms, total: 10.8 s
Wall time: 3.21 s
---
clf_lr
>>> GridSearchCV(cv=5, error_score=nan, estimator=LogisticRegression(C=1.0, class_weight='balanced', dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=1000, multi_class='auto', n_jobs=None, penalty='l2', random_state=42, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False), iid='deprecated', n_jobs=None, param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=False, scoring=None, verbose=0)
clf_lr.best_estimator_
>>> LogisticRegression(C=0.1, class_weight='balanced', dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=1000, multi_class='auto', n_jobs=None, penalty='l2', random_state=42, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
---
y_preds_lr = clf_lr.predict_proba(X_test_lr)[:,1]
---
auc_roc_plot(y_test_lr, y_preds_lr)
>> 0.7074872114784778

![](https://miro.medium.com/max/828/1*9N9EarySA3vlXMo-1n7hCw.webp)

In [None]:
# Confusion Matrix display
plot_confusion_matrix(clf_lr, X_test_lr, y_test_lr, values_format=".4g", cmap="Blues");
---
# Creating assignments for Final Results
tn, fp, fn, tp = confusion_matrix(y_test_lr == 1, y_preds_lr > 0.5).ravel()
tn_lr = tn
fp_lr = fp
fn_lr = fn
tp_lr = tp


![](https://miro.medium.com/max/786/1*F6WY_6sNxnS2dxJEjmqciQ.webp)

#K-Nearest Neighbors (KNN)

In [None]:
# Making a copy of the dataset
df_knn = df.copy()
---
# Dividing the dataset in train (80%) and test (20%)
train_set_knn, test_set_knn = train_test_split(df_knn, test_size = 0.2, random_state = seed)
​X_train_knn = train_set_knn.drop(['bad_loan'], axis = 1)
y_train_knn = train_set_knn['bad_loan']
X_test_knn = test_set_knn.drop(['bad_loan'], axis = 1)
y_test_knn = test_set_knn['bad_loan']
---
# Normalizing train and test data
scaler_knn = MinMaxScaler()
features_names = X_train_knn.columns
X_train_knn = scaler_knn.fit_transform(X_train_knn)
X_train_knn = pd.DataFrame(X_train_knn, columns = features_names)
X_test_knn = scaler_knn.transform(X_test_knn)
X_test_knn = pd.DataFrame(X_test_knn, columns = features_names)
---
%%time
for k in range(1, 200, 5):
    k = k + 1
    knn = KNeighborsClassifier(n_neighbors = k).fit(X_train_knn, y_train_knn)
    acc = knn.score(X_test_knn, y_test_knn)
    print('Accuracy for k =', k, ' is:', acc)
Accuracy for k = 2  is: 0.7965116279069767
Accuracy for k = 7  is: 0.7944893832153691
Accuracy for k = 12  is: 0.8066228513650152
Accuracy for k = 17  is: 0.8066228513650152
Accuracy for k = 22  is: 0.8088978766430738
Accuracy for k = 27  is: 0.8081395348837209
Accuracy for k = 32  is: 0.8106673407482305
Accuracy for k = 37  is: 0.8094034378159757
Accuracy for k = 42  is: 0.8116784630940344
Accuracy for k = 47  is: 0.8119312436804853
Accuracy for k = 52  is: 0.8109201213346815
Accuracy for k = 57  is: 0.8104145601617796
Accuracy for k = 62  is: 0.8096562184024267
Accuracy for k = 67  is: 0.8101617795753286
Accuracy for k = 72  is: 0.8101617795753286
Accuracy for k = 77  is: 0.8104145601617796
Accuracy for k = 82  is: 0.8109201213346815
Accuracy for k = 87  is: 0.8106673407482305
Accuracy for k = 92  is: 0.8104145601617796
Accuracy for k = 97  is: 0.8106673407482305
Accuracy for k = 102  is: 0.8104145601617796
Accuracy for k = 107  is: 0.8104145601617796
Accuracy for k = 112  is: 0.8104145601617796
Accuracy for k = 117  is: 0.8104145601617796
Accuracy for k = 122  is: 0.8104145601617796
Accuracy for k = 127  is: 0.8106673407482305
Accuracy for k = 132  is: 0.8104145601617796
Accuracy for k = 137  is: 0.8104145601617796
Accuracy for k = 142  is: 0.8101617795753286
Accuracy for k = 147  is: 0.8101617795753286
Accuracy for k = 152  is: 0.8101617795753286
Accuracy for k = 157  is: 0.8104145601617796
Accuracy for k = 162  is: 0.8104145601617796
Accuracy for k = 167  is: 0.8104145601617796
Accuracy for k = 172  is: 0.8104145601617796
Accuracy for k = 177  is: 0.8104145601617796
Accuracy for k = 182  is: 0.8104145601617796
Accuracy for k = 187  is: 0.8104145601617796
Accuracy for k = 192  is: 0.8104145601617796
Accuracy for k = 197  is: 0.8104145601617796
>> CPU times: user 1min 8s, sys: 883 ms, total: 1min 9s
Wall time: 1min 10s
---
%%time
knn = KNeighborsClassifier(n_neighbors = 47, weights='uniform').fit(X_train_knn, y_train_knn)
y_preds_knn = knn.predict(X_test_knn)
>> CPU times: user 1.29 s, sys: 16 ms, total: 1.31 s
Wall time: 1.31 s
---
auc_roc_plot(y_test_knn, y_preds_knn)
>> 0.6670792264504056

![](https://miro.medium.com/max/828/1*rNcN5EH_1OFLezrdvPmjjw.webp)

In [None]:
# Confusion Matrix display
plot_confusion_matrix(knn, X_test_knn, y_test_knn, values_format=".4g", cmap="Blues");
---
​# Creating assignments for Final Results
tn, fp, fn, tp = confusion_matrix(y_test_knn == 1, y_preds_knn > 0.5).ravel()
tn_knn = tn
fp_knn = fp
fn_knn = fn
tp_knn = tp

![](https://miro.medium.com/max/786/1*jWXdv2qEddcpa_2OlZgetw.webp)

#Support Vector Machine (SVC)

In [None]:
# Making a copy of the dataset
df_svm = df.copy()
---
# Dividing the dataset in train (80%) and test (20%)
train_set_svc, test_set_svc = train_test_split(df_svm, test_size = 0.2, random_state = seed)
X_train_svc = train_set_svc.drop(['bad_loan'], axis = 1)
y_train_svc = train_set_svc['bad_loan']
X_test_svc = test_set_svc.drop(['bad_loan'], axis = 1)
y_test_svc = test_set_svc['bad_loan']
---
# Standardization of train and test data
zscore_svc = StandardScaler()
features_names = X_train_svc.columns
X_train_svc = zscore_svc.fit_transform(X_train_svc)
X_train_svc = pd.DataFrame(X_train_svc, columns = features_names)
X_test_svc = zscore_svc.transform(X_test_svc)
X_test_svc = pd.DataFrame(X_test_svc, columns = features_names)
---
%%time
svc = SVC(random_state=seed, class_weight='balanced',probability=True, verbose=True)
parameters = {'C':[0.1, 1, 10]}
clf_svc = GridSearchCV(svc, parameters, cv = 5).fit(X_train_svc, y_train_svc)
>> [LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM]CPU times: user 14min 34s, sys: 22.7 s, total: 14min 57s
Wall time: 15min 2s
---
%%time
y_preds_svc = clf_svc.predict_proba(X_test_svc)[:,1]
>> CPU times: user 2.95 s, sys: 17.9 ms, total: 2.97 s
Wall time: 3 s
---
auc_roc_plot(y_test_svc, y_preds_svc)
>> 0.6754917862341443

![](https://miro.medium.com/max/828/1*alKutmhimT8pnPvp93k5yQ.webp)

In [None]:
# Confusion Matrix display
plot_confusion_matrix(clf_svc, X_test_svc, y_test_svc, values_format=".4g", cmap="Blues");
---
​# Creating assignments for Final Results
tn, fp, fn, tp = confusion_matrix(y_test_svc == 1, y_preds_svc > 0.5).ravel()
tn_svc = tn
fp_svc = fp
fn_svc = fn
tp_svc = tp


![](https://miro.medium.com/max/786/1*f49zWYqCgPaUb8x3Hflamg.webp)

#Decision Trees (DT)

In [None]:
# Making a copy of the dataset
df_trees = df.copy()
---
# Dividing the dataset in train (80%) and test (20%)
train_set_dt, test_set_dt = train_test_split(df_trees, test_size = 0.2, random_state = seed)
X_train_dt = train_set_dt.drop(['bad_loan'], axis = 1)
y_train_dt = train_set_dt['bad_loan']
X_test_dt = test_set_dt.drop(['bad_loan'], axis = 1)
y_test_dt = test_set_dt['bad_loan']
---
%%time
clf_tree = tree.DecisionTreeClassifier(random_state = seed, max_depth = 10).fit(X_train_dt, y_train_dt)
>> CPU times: user 203 ms, sys: 65.8 ms, total: 268 ms
Wall time: 388 ms
---
clf_tree.score(X_test_dt, y_test_dt)
>> 0.788675429726997
---
# Visualizing variables by importance
important_features = pd.DataFrame(data = clf_tree.feature_importances_, index = X_train_dt.columns, columns = ["value"])
important_features.sort_values(by = "value", ascending = False)

![](https://miro.medium.com/max/640/1*Mtr_Cu1VHRBZYNcRtbAvOA.webp)

In [None]:
y_preds_dt = clf_tree.predict_proba(X_test_dt)[:,1]
---
auc_roc_plot(y_test_dt, y_preds_dt)
>> 0.6295855687253067


![](https://miro.medium.com/max/828/1*DzGUzG26pkBrpIOkQlQ0TA.webp)

In [None]:
# Confusion Matrix display
plot_confusion_matrix(clf_tree, X_test_dt, y_test_dt, values_format=".4g", cmap="Blues");
---
# Creating assignments Final Results
tn, fp, fn, tp = confusion_matrix(y_test_dt == 1, y_preds_dt > 0.5).ravel()
tn_dt = tn
fp_dt = fp
fn_dt = fn
tp_dt = tp


![](https://miro.medium.com/max/786/1*IBNCggWi2WRN57U_gdxXFQ.webp)

#Random Forest (RF)

In [None]:
# Making a copy of the dataset
df_rf = df.copy()
---
# Dividing the dataset in train (80%) and test (20%)
train_set_rf, test_set_rf = train_test_split(df_rf, test_size = 0.2, random_state = seed)
X_train_rf = train_set_rf.drop(['bad_loan'], axis = 1)
y_train_rf = train_set_rf['bad_loan']
X_test_rf = test_set_rf.drop(['bad_loan'], axis = 1)
y_test_rf = test_set_rf['bad_loan']
---
%%time
rf = RandomForestClassifier(random_state = seed, class_weight = None).fit(X_train_rf, y_train_rf)
parameters = {'n_estimators':[10, 100, 300, 1000]}
clf_rf = GridSearchCV(rf, parameters, cv = 5).fit(X_train_rf, y_train_rf)
>> CPU times: user 2min 11s, sys: 3.33 s, total: 2min 14s
Wall time: 2min 15s
---
y_preds_rf = clf_rf.predict_proba(X_test_rf)[:,1]
---
auc_roc_plot(y_test_rf, y_preds_rf)
>> 0.6735905593678521

![](https://miro.medium.com/max/828/1*Vdad9mJuJpos9m_QMhiXGw.webp)

In [None]:
# Confusion Matrxi display
plot_confusion_matrix(clf_rf, X_test_rf, y_test_rf, values_format=".4g", cmap="Blues");
---
​# Creating assignments for Final Results
tn, fp, fn, tp = confusion_matrix(y_test_rf == 1, y_preds_rf > 0.5).ravel()
tn_rf = tn
fp_rf = fp
fn_rf = fn
tp_rf = tp


![](https://miro.medium.com/max/786/1*YUUUMp11Iw8qSIl1mrEGTQ.webp)

#Neural Networks (NN)

In [None]:
# Making a copy of the dataset
df_nn = df.copy()
---
# Dividing the dataset in train (80%) and test (20%)
train_set_nn, test_set_nn = train_test_split(df_nn, test_size = 0.2, random_state = seed)
X_train_nn = train_set_nn.drop(['bad_loan'], axis = 1)
y_train_nn = train_set_nn['bad_loan']
X_test_nn = test_set_nn.drop(['bad_loan'], axis = 1)
y_test_nn = test_set_nn['bad_loan']
---
# Normalization of the train and test data
scaler_nn = MinMaxScaler()
features_names = X_train_nn.columns
X_train_nn = scaler_nn.fit_transform(X_train_nn)
X_train_nn = pd.DataFrame(X_train_nn, columns = features_names)
X_test_nn = scaler_nn.transform(X_test_nn)
X_test_nn = pd.DataFrame(X_test_nn, columns = features_names)
---
%%time
mlp_nn = MLPClassifier(solver = 'adam', random_state = seed, max_iter = 1000 )
parameters = {'hidden_layer_sizes': [(20,), (20,10), (20, 10, 2)], 'learning_rate_init':[0.0001, 0.001, 0.01, 0.1]}
clf_nn = GridSearchCV(mlp_nn, parameters, cv = 5).fit(X_train_nn, y_train_nn)
>> CPU times: user 25min 41s, sys: 41.4 s, total: 26min 22s
Wall time: 6min 53s
---
y_preds_nn = clf_nn.predict_proba(X_test_nn)[:,1]
---
auc_roc_plot(y_test_nn, y_preds_nn)
>> 0.7081023081721772


![](https://miro.medium.com/max/828/1*hYiQuCh8qTfhbuf1rDgNDw.webp)

In [None]:
# Confusion Matrix display
plot_confusion_matrix(clf_nn, X_test_nn, y_test_nn, values_format=".4g", cmap="Blues");​
---
# Creating assignments for Final Results
tn, fp, fn, tp = confusion_matrix(y_test_nn == 1, y_preds_nn > 0.5).ravel()
tn_nn = tn
fp_nn = fp
fn_nn = fn
tp_nn = tp


![](https://miro.medium.com/max/786/1*nOM3t6fx5HA0y4Z7Y2I9TA.webp)

#**RESULTS: Performance comparison between models**

In [None]:
# Creating performance table
results_1 = {'Classifier': ['AUC ROC (%)','TN (%)','FP (%)','FN (%)','TP (%)'],
'Logistic Regression (LR)': [aucroclr, (tn_lr/3956*100).round(2), (fp_lr/3956*100).round(2), (fn_lr/3956*100).round(2), (tp_lr/3956*100).round(2)],
'K Nearest Neighbour (KNN)': [aucrocknn, (tn_knn/3956*100).round(2),(fp_knn/3956*100).round(2), (fn_knn/3956*100).round(2),(tp_nn/3956*100).round(2)],
'Support Vector Machine (SVC)': [aucrocsvc, (tn_svc/3956*100).round(2),(fp_svc/3956*100).round(2), (fn_svc/3956*100).round(2),(tp_svc/3956*100).round(2)],
'Decision Trees (DT)': [aucrocdt, (tn_dt/3956*100).round(2), (fp_dt/3956*100).round(2), (fn_dt/3956*100).round(2),(tp_dt/3956*100).round(2)],
'Random Forest (RF)': [aucrocrf, (tn_rf/3956*100).round(2), (fp_rf/3956*100).round(2), (fn_rf/3956*100).round(2),(tp_rf/3956*100).round(2)],
'Neural Networks (NN)': [aucrocnn, (tn_nn/3956*100).round(2), (fp_nn/3956*100).round(2),(fn_nn/3956*100).round(2),(tp_nn/3956*100).round(2)]}
df1 = pd.DataFrame(results_1, columns = ['Classifier', 'Logistic Regression (LR)', 'K Nearest Neighbour (KNN)', 'Support Vector Machine (SVC)', 'Decision Trees (DT)', 'Random Forest (RF)', 'Neural Networks (NN)'])
df1.set_index("Classifier", inplace=True)
results = df1.T
results


![](https://miro.medium.com/max/828/1*svGFkUT00v2gTAfGApFGWQ.webp)

In [None]:
​# Creating table for graphic visualization
results_2 = {'Classifier': ['ROC AUC'], 'Logistic Regression (LR)': [aucroclr], 'K Nearest Neighbour (KNN)': [aucrocknn], 'Support Vector Machine (SVC)': [aucrocsvc], 'Decision Trees (DT)': [aucrocdt], 'Random Forest (RF)': [aucrocrf], 'Neural Networks (NN)': [aucrocnn]}
df2 = pd.DataFrame(results_2, columns = ['Classifier', 'Logistic Regression (LR)', 'K Nearest Neighbour (KNN)', 'Support Vector Machine (SVC)', 'Decision Trees (DT)', 'Random Forest (RF)', 'Neural Networks (NN)'])
df2.set_index("Classifier", inplace=True)
results_2 = df2
---
# Display tHe graph
ax = results_2.plot(kind="bar", title=("Evaluating models' performance"), figsize=(12,8) ,fontsize=10, grid=True)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 5), textcoords='offset points')
plt.legend(fontsize=8.5, loc="upper right")
plt.xlabel('')
plt.xticks(rotation='horizontal')
plt.ylabel('Relative frequency (%)')
plt.show()

![](https://miro.medium.com/max/828/1*VntsjdXRHDcJ2-XuaqssOQ.webp)

#**Conclusion**

Best model: Support Vector Machine - Classifier (SVC): 75.21%.

The rule of thumb is very straightforward: the higher the value of the ROC AUC metric, the better. If a random model would show 0.5, a perfect model would achieve 1.0.

The academic scoring system stands as follows:

In [None]:
.9 -  1 = excellent  (A)
.8 - .9 = good       (B)
.7 - .8 = reasonable (C)
.6 - .7 = weak       (D)
.5 - .6 = terrible   (F)

The ratio between TPR and FPR determined by a threshold over which results in a positive instance puts the chosen model (SVC) at a reasonable level ( C ) with a ROC AUC score of 75.21%.