Predict whether a loan will be approved using machine learning.
Train a Logistic Regression model and a Random Forest Classifier, then compare their scores.

In [34]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [44]:
# Import the data
df = pd.read_csv('Resources/lending_data.csv')
df

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


In [36]:
# List the DataFrame's data types and make any necessary adjustments.
# All columns are already floats or integers, so no adjustments are needed.
df.dtypes

loan_size           float64
interest_rate       float64
borrower_income       int64
debt_to_income      float64
num_of_accounts       int64
derogatory_marks      int64
total_debt            int64
loan_status           int64
dtype: object

In [37]:
# Find null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

Column loan_size has 0 null values
Column interest_rate has 0 null values
Column borrower_income has 0 null values
Column debt_to_income has 0 null values
Column num_of_accounts has 0 null values
Column derogatory_marks has 0 null values
Column total_debt has 0 null values
Column loan_status has 0 null values


In [38]:
X = df.drop('loan_status', axis=1)
y = df['loan_status']

In [39]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [40]:
# Standardize the features so that no single feature dominates because of its scale.
# The scaler is fit on the training data only, then applied to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
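
To make the scaling step concrete, here is a minimal sketch of the arithmetic that StandardScaler performs, written in plain Python. The income values and the `standardize` helper are made up for illustration and are not taken from lending_data.csv. Each value has the training mean subtracted and is divided by the training (population) standard deviation; the test values reuse the training statistics.

```python
from statistics import fmean, pstdev

# Hypothetical borrower incomes, for illustration only.
train_income = [40000, 50000, 60000]
test_income = [50000, 70000]

# "fit": learn the mean and population standard deviation from the training data only.
mean = fmean(train_income)
std = pstdev(train_income)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(x - mean) / std for x in values]


# "transform": apply the training statistics to both sets.
train_scaled = standardize(train_income, mean, std)
test_scaled = standardize(test_income, mean, std)

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1. The test values generally do not, because they are scaled with statistics learned from the training set.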

Consider two models: Logistic Regression and Random Forest Classifier.
I predict that the Random Forest Classifier will outperform Logistic Regression at predicting whether a loan will be approved. A random forest is an ensemble method that builds many decision trees during training and combines their predictions, so it can capture non-linear relationships, and I do not expect the loan approval decision to be strictly linear in the features.

Time to test this hypothesis.

In [41]:
# Instantiate the Logistic Regression model
classifier = LogisticRegression(max_iter=10000)

In [42]:
# Train the Logistic Regression model and print its training and testing scores
classifier.fit(X_train_scaled, y_train)
print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.9942908240473243
Testing Data Score: 0.9936545604622369
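
The scores printed above are mean accuracy, which is what `score` returns for scikit-learn classifiers. Accuracy alone can hide how a model does on each class, so a per-class check may be a useful complement. The sketch below computes accuracy, precision, and recall by hand; the label lists are hypothetical and are not results from this notebook.

```python
# Hypothetical true and predicted labels, for illustration only.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true_demo, y_pred_demo))

# Count each cell of the confusion matrix, treating 1 as the positive class.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)  # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted 1s that are truly 1
recall = tp / (tp + fn)            # share of true 1s that were found

print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}")
```

On the real data, `sklearn.metrics.classification_report(y_test, classifier.predict(X_test_scaled))` reports the same quantities for each class.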


In [43]:
# Train a Random Forest Classifier model and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9975409272252029
Testing Score: 0.9917457697069748


On the testing data, the Logistic Regression model scored slightly higher than the Random Forest Classifier (0.994 vs. 0.992, respectively).

A summary of the results is below:
Logistic Regression:
Training Data Score: 0.9942908240473243
Testing Data Score: 0.9936545604622369

Random Forest Classifier:
Training Score: 0.9975409272252029
Testing Score: 0.9917457697069748

As you can see, the Random Forest Classifier scored higher on the training data (0.998 vs. 0.994 for Logistic Regression) but lower on the testing data. This gap between its training and testing scores may be a sign that the random forest overfit the training data.
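
The size of each model's train-test gap can be checked directly from the scores reported above. This is only arithmetic on the printed results; the dictionary layout is my own choice for illustration.

```python
# Scores copied from the cell outputs above.
scores = {
    "Logistic Regression": {
        "train": 0.9942908240473243,
        "test": 0.9936545604622369,
    },
    "Random Forest Classifier": {
        "train": 0.9975409272252029,
        "test": 0.9917457697069748,
    },
}

# A larger positive gap means the model does noticeably better on data it has already seen.
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train - test gap = {gap:.4f}")
```

The random forest's gap is roughly nine times larger than the logistic regression's, although both gaps are well under one percentage point.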

Logistic regression models the log-odds of a binary outcome (0 or 1) as a linear combination of the predictor variables, so it works well when each feature pushes the outcome consistently in one direction. Because it scored slightly better on the testing data, I suspect that loan approval in this dataset is described well by a linear combination of one or more of the features.
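
As a rough illustration of what "linear" means here: logistic regression computes a weighted sum of the features plus an intercept (the log-odds) and passes it through the sigmoid function to get a probability. The sketch below uses made-up weights, an intercept, and inputs for two scaled features; none of these values come from the fitted model in this notebook.

```python
import math

# Hypothetical coefficients and intercept, for illustration only.
weights = [1.5, -0.8]
intercept = -0.2


def log_odds(x):
    """Linear combination of the features: intercept + w1*x1 + w2*x2."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))


def probability(x):
    """Sigmoid of the log-odds, giving the probability of class 1."""
    return 1.0 / (1.0 + math.exp(-log_odds(x)))


def predict(x):
    """Predict class 1 when the probability is at least 0.5."""
    return 1 if probability(x) >= 0.5 else 0


for x in [[1.0, 0.5], [-1.0, 1.0], [0.0, 1.0]]:
    print(x, round(log_odds(x), 3), round(probability(x), 3), predict(x))
```

The predicted class flips exactly where the log-odds cross zero, so the decision boundary is a straight line (a hyperplane with more features) in the scaled feature space.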

Overall, I would recommend using logistic regression on this dataset to predict whether a loan will be approved.