# Abstract
For this project, I used two models, a logistic regression and a random forest classifier, to predict the loan status of individual borrowers from measures of their creditworthiness, e.g. debt-to-income ratio, number of open accounts, total debt, and income.
Both are binary classification models, trained on the 7 variables shown in the dataframe to predict each borrower's loan status.

# Prediction
I expect the models to fall short of perfect accuracy when predicting the loan status "disapproved" (`1`). The table already shows some individuals with high total debt, a high number of open accounts, and a high debt-to-income ratio who were nonetheless approved, which is denoted by `0`.
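
One way to check this kind of overlap before modelling is to summarise the features per class. The cell below is a minimal sketch on a hypothetical toy frame (the rows and the names `toy`, `class_share`, and `summary` are made up for illustration and are not taken from `lending_data.csv`); the same `groupby` call could be pointed at the real dataframe.

```python
import pandas as pd

# Hypothetical toy rows for illustration only (NOT from lending_data.csv)
toy = pd.DataFrame({
    "total_debt": [22800, 13600, 16100, 50000, 48000, 47000],
    "debt_to_income": [0.43, 0.31, 0.35, 0.62, 0.61, 0.60],
    "num_of_accounts": [5, 3, 3, 11, 10, 10],
    "loan_status": [0, 0, 0, 1, 1, 0],  # 0 = approved, 1 = disapproved
})

# Share of each class: shows how imbalanced the target is
class_share = toy["loan_status"].value_counts(normalize=True)

# Per-class mean and max: overlapping ranges hint at cases a model will confuse
summary = toy.groupby("loan_status")[
    ["total_debt", "debt_to_income", "num_of_accounts"]
].agg(["mean", "max"])

print(class_share)
print(summary)
```

In this toy frame the largest `total_debt` among approved loans sits inside the range of the disapproved loans, which is the kind of overlap that would keep a classifier from being perfect on class `1`.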

In [1]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [2]:
# Load dataset
# Note: the CSV's saved index is read in as the column "Unnamed: 0"
file_path = os.path.join("Resources", "lending_data.csv")
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [4]:
y = df["loan_status"]
X = df.drop("loan_status", axis=1)
target_names = ["Approve loan", "Disapprove loan"]

In [5]:
X = df.drop("loan_status", axis=1)
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


In [6]:
# Split into training and testing sets (default split: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Logistic regression

In [7]:
# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [8]:
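# Note: the model is fit on the unscaled features (X_train), not X_train_scaled,
# and the scores below reflect that. The scaled arrays are used by the later models.
classifier = LogisticRegression(max_iter=100000)
classifier.fit(X_train, y_train)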
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9919177328380795
Testing Data Score: 0.9924680148576145
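
If the logistic regression were meant to use the scaled features, one option is to chain the scaler and the model so they cannot drift apart. This is a minimal sketch on synthetic data from `make_classification` (an assumption standing in for the lending data); the names `X_demo`, `y_demo`, `pipe`, and `test_score` are illustrative, and the score it prints says nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the lending dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training split only and
# applies the same transform automatically at predict/score time
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_score = pipe.score(X_te, y_te)
print(f"Testing Data Score (synthetic data): {test_score}")
```

With a pipeline, `pipe.score(X_te, y_te)` takes the raw features, so there is no separate scaled array to forget.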


In [9]:
y_true = y_test
y_pred = classifier.predict(X_test)
# labels=[1, 0] puts the positive class (1) first: [[TP, FN], [FP, TN]],
# which matches the layout in the slide picture
cm = confusion_matrix(y_true, y_pred, labels=[1,0])
cm

array([[  539,    53],
       [   93, 18699]], dtype=int64)

In [10]:
y_true = y_test
y_pred = classifier.predict(X_test)
# With the default label order [0, 1], ravel() returns tn, fp, fn, tp
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"True positives (TP): {tp}")
print(f"True negatives (TN): {tn}")
print(f"False positives (FP): {fp}")
print(f"False negatives (FN): {fn}")

True positives (TP): 539
True negatives (TN): 18699
False positives (FP): 93
False negatives (FN): 53
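
The two confusion-matrix cells differ only in label order. As a sketch, the cell below rebuilds label arrays from the four counts printed above (the arrays `y_true_demo` and `y_pred_demo` are reconstructions for illustration, not the real test set) and shows where each count lands in both layouts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Counts reported above for the logistic regression model
tp, fn, fp, tn = 539, 53, 93, 18699

# Rebuild label arrays that produce exactly these counts
y_true_demo = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred_demo = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# Default label order [0, 1]: rows are true labels, columns are predictions
cm_default = confusion_matrix(y_true_demo, y_pred_demo)  # [[TN, FP], [FN, TP]]

# Positive class first: same counts, rows and columns both reversed
cm_pos_first = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 0])  # [[TP, FN], [FP, TN]]

print(cm_default)
print(cm_pos_first)
```

Unpacking `tn, fp, fn, tp = cm.ravel()` is only valid for the default layout; on the `labels=[1, 0]` matrix the same unpacking would swap the names.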


![image.png](attachment:image.png)

In [11]:
# Calculate the precision of the model based on the confusion matrix
precision = tp / (tp + fp)
precision

0.8528481012658228

In [12]:
# Calculate the sensitivity of the model based on the confusion matrix
sensitivity = tp / (tp + fn)
sensitivity

0.910472972972973

In [13]:
# F1 score: harmonic mean of precision and sensitivity (recall)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
f1

0.880718954248366
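
As a cross-check, the same metrics can be recomputed directly from the four counts. This sketch uses the counts printed earlier; the `_demo` names are illustrative. It also uses the equivalent count-based form of F1 and derives accuracy, which should agree with the testing score.

```python
# Counts reported above for the logistic regression model
tp, fn, fp, tn = 539, 53, 93, 18699

precision_demo = tp / (tp + fp)            # share of predicted 1s that are truly 1
recall_demo = tp / (tp + fn)               # share of true 1s that were found
f1_demo = 2 * tp / (2 * tp + fp + fn)      # same value as the harmonic-mean form
accuracy_demo = (tp + tn) / (tp + tn + fp + fn)

print(precision_demo, recall_demo, f1_demo, accuracy_demo)
```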

In [14]:
# Print the classification report for the logistic regression model
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18792
           1       0.85      0.91      0.88       592

    accuracy                           0.99     19384
   macro avg       0.93      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



# Random Forest Classifier

In [15]:
# Import a Random Forests classifier
from sklearn.ensemble import RandomForestClassifier

In [16]:
# Fit a model, and then print a classification report
clf = RandomForestClassifier(random_state=1).fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
print(classification_report(y_test, y_pred, target_names=target_names))
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

                 precision    recall  f1-score   support

   Approve loan       1.00      0.99      1.00     18792
Disapprove loan       0.85      0.90      0.87       592

       accuracy                           0.99     19384
      macro avg       0.92      0.95      0.93     19384
   weighted avg       0.99      0.99      0.99     19384

Training Score: 0.9971798046498831
Testing Score: 0.9917973586463062


The random forest classifier reached an F1-score of 1.00 when predicting loan approval from the individual's creditworthiness measures. As anticipated, it did worse on loan disapproval: precision 0.85, recall 0.90, and an F1-score of 0.87. This is consistent with the earlier observation that some individuals were given a loan despite high debt, a high debt-to-income ratio, and a high number of open accounts. One possible explanation is that the creditworthiness threshold varies by institution.
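
Because approvals far outnumber disapprovals in the test set (18792 versus 592 in the reports), overall accuracy says little about the disapproval class. As a rough sketch, the cell below scores a baseline that always predicts the majority class on labels rebuilt from those support counts; the feature array `X_dummy` is a placeholder, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the support counts in the classification reports
y_demo = np.array([0] * 18792 + [1] * 592)
X_dummy = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the baseline

# Baseline that always predicts the most frequent class (0 = approve)
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_demo)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base)  # recall for class 1

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall on class 1: {baseline_recall:.4f}")
```

The baseline gets roughly 97% accuracy while catching no disapprovals at all, which is why the per-class precision, recall, and F1-score are the more informative numbers here.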

In [17]:
# Import an Extremely Randomized Trees (extra-trees) classifier
from sklearn.ensemble import ExtraTreesClassifier

In [18]:
clf = ExtraTreesClassifier(random_state=1).fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
print(classification_report(y_test, y_pred, target_names=target_names))
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

                 precision    recall  f1-score   support

   Approve loan       1.00      1.00      1.00     18792
Disapprove loan       0.85      0.87      0.86       592

       accuracy                           0.99     19384
      macro avg       0.92      0.93      0.93     19384
   weighted avg       0.99      0.99      0.99     19384

Training Score: 0.9971970009629936
Testing Score: 0.9912298803136608


In [19]:
# Import an Adaptive Boosting classifier
from sklearn.ensemble import AdaBoostClassifier

In [20]:
clf = AdaBoostClassifier(random_state=1).fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
print(classification_report(y_test, y_pred, target_names=target_names))
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

                 precision    recall  f1-score   support

   Approve loan       1.00      0.99      1.00     18792
Disapprove loan       0.85      1.00      0.92       592

       accuracy                           0.99     19384
      macro avg       0.92      1.00      0.96     19384
   weighted avg       1.00      0.99      0.99     19384

Training Score: 0.9944111982390975
Testing Score: 0.9944799834915394


# Conclusion
Our **Random Forests model** predicted loan approval almost perfectly (F1-score of 1.00) from the individual's creditworthiness measures but, as anticipated, was weaker on loan disapproval (F1-score of 0.87). This is consistent with our observation that some individuals were given a loan despite high debt, a high debt-to-income ratio, and a high number of open accounts, perhaps because the creditworthiness threshold varies by institution.
The **adaptive boosting classifier** gave the best results for the disapproval class: recall rose to 1.00 and the F1-score to 0.92, with a testing score of 0.9945. A likely reason is that boosting re-weights the training instances that earlier trees misclassified, so later trees focus on them, and gives more accurate trees a larger say in the final vote.
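
The re-weighting idea behind boosting can be sketched with one round of textbook discrete AdaBoost on a toy example. This is an illustration under stated assumptions, not necessarily the exact update scikit-learn's `AdaBoostClassifier` applies; the function `adaboost_round` and the toy labels are hypothetical.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One round of textbook discrete AdaBoost re-weighting (illustrative)."""
    miss = y_true != y_pred
    # Weighted error of the current weak learner
    err = np.sum(weights[miss]) / np.sum(weights)
    # Learner weight: lower error gives a larger say in the final vote
    alpha = 0.5 * np.log((1 - err) / err)
    # Increase weights of misclassified samples, decrease the rest
    new_w = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_w / new_w.sum(), alpha

# Toy labels: the weak learner misses the last sample only
y_true_toy = np.array([0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 0])
w0 = np.full(5, 0.2)  # start with uniform sample weights

w1, alpha = adaboost_round(w0, y_true_toy, y_pred_toy)
print(w1, alpha)
```

After one round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.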