In [1]:
#Dependencies
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
#Loading data
df = pd.read_csv('Resources/lending_data.csv')
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
#Defining features and target sets
y = df['loan_status']
X = df.drop('loan_status', axis=1)

In [4]:
#Splitting data into training and testing sets (default 75/25 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#Scaling data: the scaler is fit on the training set only, then applied to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Model Prediction
Which model will yield better results: a logistic regression, or a random forest classifier?

For the data used, a random forest classifier would probably yield better results.
A logistic regression predicts a categorical target, but it does so by assuming that the features relate linearly to the log-odds of that target. The features here are continuous values, with no strictly defined categories that the dataset can be sorted by.
It also isn't clear whether such a logistic relationship holds across the tens of thousands of data points.
To have the best chance of classifying the loans correctly, a model that resamples the many variations within a large dataset, as a random forest does by training each tree on a bootstrap sample, would be preferred.
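
For reference, the "logistic relationship" mentioned above can be written out as follows. This is the standard textbook form of the model, not something derived from this dataset; the symbols $\beta_0$ (intercept), $\beta$ (coefficient vector), and $x$ (scaled feature vector) are notation introduced here for illustration.

$$
P(\text{loan\_status} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}
\quad\Longleftrightarrow\quad
\log\frac{P}{1 - P} = \beta_0 + \beta^{\top} x
$$

The right-hand form shows the assumption in question: the log-odds of the positive class are a linear function of the features.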

In [7]:
#Logistic Regression
clf_lr = LogisticRegression(max_iter=1000)
clf_lr.fit(X_train_scaled, y_train)
print(f'Training data score for Logistic Regression: {clf_lr.score(X_train_scaled, y_train)}')
print(f'Testing data score for Logistic Regression: {clf_lr.score(X_test_scaled, y_test)}')

Training data score for Logistic Regression: 0.9942908240473243
Testing data score for Logistic Regression: 0.9936545604622369


In [8]:
#Random Forest Classifier
clf_rf = RandomForestClassifier(random_state=1, n_estimators=1000).fit(X_train_scaled, y_train)
print(f'Training data score for Random Forest Classification: {clf_rf.score(X_train_scaled, y_train)}')
print(f'Testing data score for Random Forest Classification: {clf_rf.score(X_test_scaled, y_test)}')

Training data score for Random Forest Classification: 0.9975409272252029
Testing data score for Random Forest Classification: 0.9917973586463062


## Model Results

After creating, fitting, and testing both models, both proved highly accurate at classifying the data, each scoring over 99% accuracy on the training and testing sets.
The random forest classifier had the highest training data score (0.9975), which was also the highest score overall.
The logistic regression, however, had the highest testing data score (0.9937 versus 0.9918). It also ran significantly faster than the random forest.
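
The speed difference was observed informally in this notebook rather than measured. A minimal sketch of how the fit times could be compared is below. It runs on synthetic stand-in data from `make_classification` (an assumption, so that the block runs on its own), and the helper `time_fit` is a hypothetical name introduced here; in the notebook, `X_train_scaled` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Synthetic stand-in data (an assumption, not the lending dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=7, random_state=1)

fit_times = {
    "LogisticRegression": time_fit(LogisticRegression(max_iter=1000), X_demo, y_demo),
    "RandomForestClassifier": time_fit(
        RandomForestClassifier(random_state=1, n_estimators=100), X_demo, y_demo
    ),
}

for name, seconds in fit_times.items():
    print(f"{name}: {seconds:.3f} s")
```

The timings depend on the machine, the dataset size, and `n_estimators`, so the numbers are only meaningful relative to each other within one run.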

In this initial run, the logistic regression produced slightly more accurate results on the testing data. To get a better picture of the models' performance, more tests would have to be run with different random states and parameters.
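
One way such repeated tests might look is sketched below. It repeats the split, scale, fit, and score steps from this notebook over several random states and collects the testing scores. The function name `compare_over_seeds`, the seed range, and the synthetic `make_classification` data are all assumptions made so the block runs on its own; in the notebook, `X` and `y` from the lending data would be passed in instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def compare_over_seeds(X, y, seeds, n_estimators=100):
    """Return {model name: list of testing accuracies}, one entry per seed."""
    scores = {"logistic_regression": [], "random_forest": []}
    for seed in seeds:
        # Same steps as the notebook, with the random state varied per run.
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
        scaler = StandardScaler().fit(X_train)
        X_train_s = scaler.transform(X_train)
        X_test_s = scaler.transform(X_test)

        lr = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
        rf = RandomForestClassifier(
            random_state=seed, n_estimators=n_estimators
        ).fit(X_train_s, y_train)

        scores["logistic_regression"].append(lr.score(X_test_s, y_test))
        scores["random_forest"].append(rf.score(X_test_s, y_test))
    return scores


# Synthetic stand-in data (an assumption, not the lending dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=0)
results = compare_over_seeds(X_demo, y_demo, seeds=range(5), n_estimators=50)

for name, accs in results.items():
    print(f"{name}: mean={np.mean(accs):.4f}, std={np.std(accs):.4f}")
```

Comparing the mean and spread of the testing scores across seeds gives a steadier basis for choosing between the models than a single split does.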