### Predicting model accuracy

This analysis compares the accuracy of a Logistic Regression model and a Random Forest Classifier on the given dataset. Logistic Regression looks for a linear boundary that separates the data into classes, while a Random Forest Classifier trains an ensemble of decision trees, each on a random sample of the data, and combines their votes. We predict that the Random Forest Classifier will perform better (or at least no worse than Logistic Regression), because the output of the logistic model can be viewed as a particularly rigid decision tree that asks a single question: is the data point on one side of the dividing line or the other? Real-world data may not fit a single separation like that, so we expect the ensemble of randomized trees to perform best.
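
To illustrate the reasoning above, here is a minimal sketch on synthetic data rather than the lending data. It assumes scikit-learn's `make_circles` generator, and the `*_demo` and `Xd_*` names are hypothetical. The two classes form concentric rings, so no single straight line separates them, and the forest should score well above the linear model.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the lending data):
# two concentric rings, which no single straight line can separate.
X_demo, y_demo = make_circles(n_samples=1000, factor=0.5, noise=0.1, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Fit both model types on the same training split
linear_model = LogisticRegression().fit(Xd_train, yd_train)
forest_model = RandomForestClassifier(random_state=0).fit(Xd_train, yd_train)

# Compare accuracy on the held-out split
lr_acc = linear_model.score(Xd_test, yd_test)
rf_acc = forest_model.score(Xd_test, yd_test)
print(f"Logistic Regression test accuracy: {lr_acc:.3f}")
print(f"Random Forest test accuracy: {rf_acc:.3f}")
```

On data that is close to linearly separable, the gap between the two models would be expected to shrink.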

In [2]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

In [3]:
# Import data
path = Path("Resources/lending_data.csv")
df = pd.read_csv(path)
df

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.430740 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 77531 | 19100.0 | 11.261 | 86600 | 0.653580 | 12 | 2 | 56600 | 1 |
| 77532 | 17700.0 | 10.662 | 80900 | 0.629172 | 11 | 2 | 50900 | 1 |
| 77533 | 17600.0 | 10.595 | 80300 | 0.626401 | 11 | 2 | 50300 | 1 |
| 77534 | 16300.0 | 10.068 | 75300 | 0.601594 | 10 | 2 | 45300 | 1 |


In [4]:
# Define the X (features) and y (target) sets
y = df["loan_status"].values
X = df.drop("loan_status", axis=1)

In [5]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 52022 | 9400.0 | 7.116 | 47600 | 0.369748 | 3 | 0 | 17600 |
| 6190 | 9700.0 | 7.251 | 48800 | 0.385246 | 4 | 0 | 18800 |
| 34125 | 9400.0 | 7.112 | 47500 | 0.368421 | 3 | 0 | 17500 |
| 56259 | 9200.0 | 7.028 | 46700 | 0.357602 | 3 | 0 | 16700 |
| 64959 | 10900.0 | 7.758 | 53600 | 0.440299 | 5 | 1 | 23600 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 45171 | 10100.0 | 7.408 | 50300 | 0.403579 | 4 | 1 | 20300 |
| 59717 | 10600.0 | 7.621 | 52300 | 0.426386 | 5 | 1 | 22300 |
| 58794 | 10400.0 | 7.546 | 51600 | 0.418605 | 4 | 1 | 21600 |
| 44454 | 7500.0 | 6.311 | 40000 | 0.250000 | 2 | 0 | 10000 |
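
The split above relies on the defaults of `train_test_split`: 25% of the rows are held out for testing, and the shuffle differs on each run unless `random_state` is set, so the scores reported below may vary slightly between runs. A minimal sketch of a reproducible, class-balanced split follows. The toy arrays and the seed value 42 are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 20% in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
split_a = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)
X_tr, X_te, y_tr, y_te = split_a

print(len(X_tr), len(X_te))               # sizes follow the default test_size of 0.25
print(y_te.mean())                        # share of positive labels in the test part
print(np.array_equal(X_te, split_b[1]))   # same seed, same test rows
```

Passing `random_state` and `stratify=y` to the notebook's own call would make its split repeatable in the same way, although the exact scores would then depend on the seed chosen.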


In [6]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9920724996560737
Testing Data Score: 0.9917973586463062


In [10]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9973345714678773
Testing Data Score: 0.9920553033429633


We find that the Random Forest Classifier marginally outperforms the Logistic Regression model on the test data (about 0.9921 versus 0.9918). However, the Logistic Regression model also performs very well on this dataset, which suggests that the classes are, to a good approximation, linearly separable: there is at least one separating hyperplane, and the logistic regression can find it.
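
A gap this small between two test scores may shift with the random train/test split, so one way to check it is to compare the models over several folds. The sketch below assumes scikit-learn's `cross_val_score` with 5 folds and uses synthetic stand-in data from `make_classification`. The names `X_cv`, `y_cv`, and `cv_scores` are hypothetical, and the notebook's `X` and `y` could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption); substitute the notebook's X and y.
X_cv, y_cv = make_classification(n_samples=1000, n_features=7, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}
cv_scores = {}
for name, model in models.items():
    # Accuracy on each of 5 held-out folds
    cv_scores[name] = cross_val_score(model, X_cv, y_cv, cv=5)
    print(f"{name}: {cv_scores[name].mean():.4f} +/- {cv_scores[name].std():.4f}")
```

If the difference between the mean scores is smaller than the spread across folds, the ranking of the two models should be treated as inconclusive.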