In [1]:
# Import dependencies
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
# Read in data from 2 CSV files. One will provide the training data and the other will provide the testing data
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
# Pull X and y data from these dataframes
X_train = train_df.drop(["loan_status"], axis=1)
y_train = train_df["loan_status"]
X_test = test_df.drop(["loan_status"], axis=1)
y_test = test_df["loan_status"]

### Make sure all data is Numeric

In [4]:
# Print out the first few rows of the X_train dataframe
X_train.head()

Unnamed: 0.1,Unnamed: 0,index,loan_amnt,int_rate,installment,home_ownership,annual_inc,verification_status,pymnt_plan,dti,...,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,debt_settlement_flag
0,57107,57107,13375.0,0.1797,483.34,MORTGAGE,223000.0,Not Verified,n,29.99,...,100.0,50.0,0.0,0.0,577150.0,122018.0,32000.0,170200.0,N,N
1,141451,141451,21000.0,0.1308,478.68,MORTGAGE,123000.0,Source Verified,n,11.26,...,85.0,33.3,0.0,0.0,132750.0,27896.0,15900.0,35398.0,N,N
2,321143,321143,20000.0,0.124,448.95,MORTGAGE,197000.0,Source Verified,n,11.28,...,85.7,33.3,0.0,0.0,628160.0,114043.0,22600.0,90340.0,N,N
3,11778,11778,3000.0,0.124,100.22,RENT,45000.0,Not Verified,n,18.08,...,100.0,16.7,1.0,0.0,42006.0,20761.0,19900.0,15406.0,N,N
4,169382,169382,30000.0,0.1612,1056.49,MORTGAGE,133000.0,Source Verified,n,27.77,...,100.0,66.7,0.0,0.0,283248.0,109056.0,79500.0,58778.0,N,N


In [5]:
# Print out the first few rows of the X_test dataframe
X_test.head()

Unnamed: 0.1,Unnamed: 0,index,loan_amnt,int_rate,installment,home_ownership,annual_inc,verification_status,pymnt_plan,dti,...,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,debt_settlement_flag
0,67991,67991,40000.0,0.0819,814.7,MORTGAGE,140000.0,Not Verified,n,19.75,...,97.7,0.0,0.0,0.0,527975.0,70914.0,74600.0,99475.0,N,N
1,25429,25429,6000.0,0.1524,208.7,RENT,55000.0,Not Verified,n,11.52,...,66.7,0.0,0.0,0.0,34628.0,23460.0,5900.0,23628.0,N,N
2,38496,38496,3600.0,0.1695,128.27,RENT,42000.0,Not Verified,n,6.74,...,100.0,0.0,0.0,0.0,23100.0,19183.0,7300.0,15000.0,N,N
3,19667,19667,20000.0,0.1524,478.33,RENT,100000.0,Not Verified,n,12.13,...,100.0,50.0,0.0,0.0,56481.0,43817.0,13800.0,35981.0,N,N
4,37505,37505,3600.0,0.124,120.27,RENT,50000.0,Not Verified,n,16.08,...,100.0,25.0,0.0,0.0,45977.0,32448.0,21000.0,24977.0,N,N


We can see that some columns, such as home_ownership, hold strings as their values. scikit-learn's models require numeric input, so I need to convert these columns to numeric data.

In [6]:
# get_dummies converts categorical columns into dummy/indicator columns
X_train_dummies = pd.get_dummies(X_train)
X_test_dummies = pd.get_dummies(X_test)
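
As a minimal sketch of what `get_dummies` does, here is a toy dataframe. The values are made up for illustration and do not come from the loan files; the `toy` names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the loan CSV files
toy = pd.DataFrame({
    "loan_amnt": [3000.0, 20000.0, 13375.0],
    "home_ownership": ["RENT", "MORTGAGE", "MORTGAGE"],
})

# String columns become one indicator column per category;
# numeric columns pass through unchanged
toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))
```

Note that a category that never appears in a dataframe gets no indicator column, which is why two separately encoded dataframes can end up with different columns.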

In [7]:
# Here is a sample of the train dummy data
X_train_dummies.head(3)

Unnamed: 0.1,Unnamed: 0,index,loan_amnt,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,...,verification_status_Verified,pymnt_plan_n,initial_list_status_f,initial_list_status_w,application_type_Individual,application_type_Joint App,hardship_flag_N,hardship_flag_Y,debt_settlement_flag_N,debt_settlement_flag_Y
0,57107,57107,13375.0,0.1797,483.34,223000.0,29.99,0.0,0.0,15.0,...,0,1,0,1,1,0,1,0,1,0
1,141451,141451,21000.0,0.1308,478.68,123000.0,11.26,2.0,0.0,16.0,...,0,1,0,1,1,0,1,0,1,0
2,321143,321143,20000.0,0.124,448.95,197000.0,11.28,0.0,0.0,12.0,...,0,1,0,1,1,0,1,0,1,0


### Make Sure the training and testing dataframes match

I am going to train the models using X_train_dummies and then test them using X_test_dummies. So I need to make sure these dataframes have the same columns, or else the models will fail.

In [8]:
# Get all column headers from both dataframes and print the number of columns of each df
training_columns = list(X_train_dummies.columns)
testing_columns = list(X_test_dummies.columns)

print(f"The training dataframe has {len(training_columns)} columns. The testing dataframe has {len(testing_columns)} columns")

The training dataframe has 94 columns. The testing dataframe has 93 columns


In [9]:
# Find any column headers that are in one data set but not in the other
missing_headers = list(set(training_columns) - set(testing_columns))
print(missing_headers)

['debt_settlement_flag_Y']


In [10]:
# Add column to testing dataframe so it matches training dataframe. Since this column was not in the testing dataframe before I know all the values are 0
X_test_dummies[missing_headers] = 0
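
An alternative that also guards against columns ending up in a different order is to reindex the test dataframe on the training columns. This is only a sketch on hypothetical toy frames (`train_toy` and `test_toy` are stand-ins, not the loan data):

```python
import pandas as pd

# Hypothetical toy frames standing in for the training and testing dummies
train_toy = pd.DataFrame({"loan_amnt": [1000.0, 2000.0], "flag_N": [1, 0], "flag_Y": [0, 1]})
test_toy = pd.DataFrame({"flag_N": [1, 1], "loan_amnt": [1500.0, 2500.0]})

# Reindex the test frame on the training columns: missing columns are
# created and filled with 0, extra columns are dropped, and the column
# order matches the training frame
test_aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(list(test_aligned.columns))
```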

In [11]:
# Get a sorted list of possible y categories (sorted so the label order is deterministic)
target_names = sorted(set(y_train))
print(target_names)

['high_risk', 'low_risk']


##  Models
In the next lines of code I am going to be comparing 2 models, a logistic regression and a random forests classifier, on how well they predict whether each loan will become high risk or not. 

Logistic regression (LR) is a classification algorithm used to predict a discrete set of classes or categories. LR uses the sigmoid function to return a probability between 0 and 1, which is then compared against a threshold (0.5 by default) to choose a class.
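
To make the sigmoid step concrete, here is a small sketch. The `scores` array is hypothetical; in a real model the scores come from the fitted weights applied to each loan's features.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w·x + b) for three loans
scores = np.array([-2.0, 0.0, 2.0])
probabilities = sigmoid(scores)

# Compare the probabilities against 0.5 to pick a class
predicted_positive = probabilities > 0.5
print(probabilities, predicted_positive)
```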

Random Forests Classifier (RF) is also a classification algorithm that returns a discrete set of classes or categories. However, unlike logistic regression, RF uses a collection of decision trees to make predictions. A decision tree is a sequence of questions / tests, each with a binary result: it is either true or false. Decision trees can become very complex very quickly. This is why the Random Forests Classifier model uses several smaller decision trees. Each of these decision trees on its own is considered a weak classifier, but when you combine them they form a strong classifier.
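
As a rough sketch of how the trees are combined: in scikit-learn a fitted forest exposes its individual trees as `estimators_`, and the forest's predicted probabilities are the average of the trees' probabilities. The data below is synthetic and used only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

# Each fitted tree gives its own class probabilities...
per_tree = np.array([tree.predict_proba(X) for tree in forest.estimators_])

# ...and the forest's probability is the average over the trees
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X)))
```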

## Comparison
- On average, LR is faster
- LR is easier to understand
- RF performs better for more categorical data
- LR performs better for linear data
- RF is better for unbalanced data
- LR performs better when signal-to-noise is low (i.e. the problem is “hard” and there is little data)

## Prediction: Random Forests Classifier will perform better
I have not studied the data extensively, but I am assuming it will be more categorical than linear so I am going to guess that Random Forests Classifier will perform better than Logistic Regression

## Logistic Regression

In [12]:
# Train the Logistic Regression model on the unscaled data and print the model score
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train_dummies, y_train)

print(f"Training Data Score: {classifier.score(X_train_dummies, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_dummies, y_test)}")

Training Data Score: 0.648440065681445
Testing Data Score: 0.5253083794130158


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
# Run a confusion matrix for the logistic regression model
from sklearn.metrics import confusion_matrix, classification_report

y_true = y_test
y_pred = classifier.predict(X_test_dummies)
confusion_matrix(y_true, y_pred, labels = target_names)

array([[ 526, 1825],
       [ 407, 1944]], dtype=int64)

In [14]:
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   high_risk       0.56      0.22      0.32      2351
    low_risk       0.52      0.83      0.64      2351

    accuracy                           0.53      4702
   macro avg       0.54      0.53      0.48      4702
weighted avg       0.54      0.53      0.48      4702



## Random Forest Classifier

In [16]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=1, n_estimators=100).fit(X_train_dummies, y_train)

print(f'Training Score: {clf.score(X_train_dummies, y_train)}')
print(f'Testing Score: {clf.score(X_test_dummies, y_test)}')

Training Score: 1.0
Testing Score: 0.6405784772437261


In [17]:
# Run a confusion matrix for the random forest model
from sklearn.metrics import confusion_matrix, classification_report

y_true = y_test
y_pred = clf.predict(X_test_dummies)
confusion_matrix(y_true, y_pred, labels = target_names)

array([[ 930, 1421],
       [ 269, 2082]], dtype=int64)

In [18]:
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   high_risk       0.78      0.40      0.52      2351
    low_risk       0.59      0.89      0.71      2351

    accuracy                           0.64      4702
   macro avg       0.68      0.64      0.62      4702
weighted avg       0.68      0.64      0.62      4702



## Conclusions

#### Overall the Random Forests Classifier performed better than the Logistic Regression model. 

### Score

First, let's look at the training and testing scores of each model. For a scikit-learn classifier, `score` returns the mean accuracy: the fraction of samples the model labels correctly. (It is not R-squared, which is what `score` returns for regressors.) Accuracy is a common baseline metric for evaluating classifiers. So a score of 1 means the model is 100% accurate and a score of 0.50 means the model is 50% accurate.

For each model I calculated 2 accuracy scores: a training score and a testing score. The models were created with training data, so the training score shows how well the model fits the data that it was created with. The testing score shows how well the model predicts new data.

The LR model had a training score of 0.648, which means it did not fit the training data well. On the other hand, the RF had a perfect training score of 1. It fit the training data perfectly, which, combined with its much lower testing score, suggests it is overfit.

The LR model had a testing score of 0.525 (52.5%). This is absolutely terrible: the test set is evenly split between the two classes (2351 loans each), so the model is only 2.5 percentage points more accurate than flipping a coin to predict the result. The RF had a testing score of 0.640 (64.0%). This testing score is still not great, but it is significantly better than the logistic regression model.
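
The testing scores can be reproduced by hand from the confusion matrices printed earlier, since the score of a classifier is the share of correct predictions. A short sketch (the `accuracy` helper is my own, not a scikit-learn function):

```python
import numpy as np

# Confusion matrices copied from the unscaled-model outputs above
# (rows = true class, columns = predicted class; order: high_risk, low_risk)
lr_cm = np.array([[526, 1825], [407, 1944]])
rf_cm = np.array([[930, 1421], [269, 2082]])

def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

print(f"LR accuracy: {accuracy(lr_cm):.3f}")  # 0.525
print(f"RF accuracy: {accuracy(rf_cm):.3f}")  # 0.641
```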

### Classification Report

It also helps to look at metrics other than the testing score when evaluating and comparing the models. The testing score is just the overall accuracy of the model, and sometimes you would rather be wrong in one direction than the other, to be safe rather than sorry. For example, I would rather have this model predict a loan as high risk when it is actually low risk than predict a loan as low risk when it is actually high risk.

To see how accurate the models are in this area we will look at the classification reports. Precision measures how reliable a predicted label is: of all the loans the model labelled as a given class, the fraction that truly belong to it. Recall measures coverage: of all the loans that truly belong to a class, the fraction the model labelled correctly.

The LR had a high-risk recall of 0.22 (22%), so out of all the high-risk loans it only predicted around a fifth correctly. The RF had a high-risk recall of 0.40 (40%), so it predicted 4 out of 10 high-risk loans correctly, which is significantly better than the LR model.

The LR had a low risk Precision of 0.52 (52%) meaning when it predicts a loan is low risk it is correct 52% of the time. The RF had a low risk precision of 0.59 (59%).
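
These two numbers can be checked directly against the confusion matrices printed earlier. A small sketch, with helper functions that are my own naming rather than part of scikit-learn:

```python
import numpy as np

# Confusion matrices copied from the unscaled-model outputs above
# (rows = true class, columns = predicted class; order: high_risk, low_risk)
lr_cm = np.array([[526, 1825], [407, 1944]])
rf_cm = np.array([[930, 1421], [269, 2082]])

def high_risk_recall(cm):
    """Correctly flagged high-risk loans / all truly high-risk loans (row 0)."""
    return cm[0, 0] / cm[0, :].sum()

def low_risk_precision(cm):
    """Truly low-risk loans / all loans predicted low risk (column 1)."""
    return cm[1, 1] / cm[:, 1].sum()

for name, cm in [("LR", lr_cm), ("RF", rf_cm)]:
    print(f"{name}: high-risk recall={high_risk_recall(cm):.2f}, "
          f"low-risk precision={low_risk_precision(cm):.2f}")
```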

Overall the RF performed better than the LR in every important metric.


## Prediction 2: Scaling the data 
In the first models I did not scale the data before fitting the models. I am going to re-fit the models using scaled data. I am confident the accuracy of both models will increase. Additionally I predict the Random Forests Classifier will still perform better than Logistic Regression

In [19]:
# Scale data and refit it
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train_dummies)
X_train_scaled = scaler.transform(X_train_dummies)
X_test_scaled = scaler.transform(X_test_dummies)
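
To illustrate what the scaler does, here is a sketch on hypothetical toy values (the `_toy` arrays are invented, not loan data). The scaler learns each column's mean and standard deviation from the training data only, then applies that same shift and scale to both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales (e.g. income vs. interest rate)
X_train_toy = np.array([[45000.0, 0.12], [123000.0, 0.13], [223000.0, 0.18]])
X_test_toy = np.array([[100000.0, 0.15]])

# Fit on the training data only, then apply the same shift/scale to both sets
toy_scaler = StandardScaler().fit(X_train_toy)
X_train_toy_scaled = toy_scaler.transform(X_train_toy)
X_test_toy_scaled = toy_scaler.transform(X_test_toy)

# Each training column now has mean 0 and standard deviation 1
print(X_train_toy_scaled.mean(axis=0).round(6))
print(X_train_toy_scaled.std(axis=0).round(6))
```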

## Logistic Regression

In [20]:
# Train the Logistic Regression model on the scaled data and print the model score
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)

print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.713136288998358
Testing Data Score: 0.7203317737133135


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [21]:
# Run a confusion matrix for the logistic regression model
y_true = y_test
y_pred = classifier.predict(X_test_scaled)
confusion_matrix(y_true, y_pred, labels = target_names)

array([[1243, 1108],
       [ 207, 2144]], dtype=int64)

In [22]:
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   high_risk       0.86      0.53      0.65      2351
    low_risk       0.66      0.91      0.77      2351

    accuracy                           0.72      4702
   macro avg       0.76      0.72      0.71      4702
weighted avg       0.76      0.72      0.71      4702



In [23]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=100).fit(X_train_scaled, y_train)

print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 1.0
Testing Score: 0.6418545299872395


In [27]:
y_true = y_test
y_pred = clf.predict(X_test_scaled)
confusion_matrix(y_true, y_pred, labels = target_names)

array([[ 936, 1415],
       [ 269, 2082]], dtype=int64)

In [28]:
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

   high_risk       0.78      0.40      0.53      2351
    low_risk       0.60      0.89      0.71      2351

    accuracy                           0.64      4702
   macro avg       0.69      0.64      0.62      4702
weighted avg       0.69      0.64      0.62      4702



# Before and After 

## Logistic Regression 

- Training Score: 0.648 vs 0.713
- Testing Score: 0.525 vs 0.720
- Recall high-risk: 0.22 vs 0.53
- Precision low-risk: 0.52 vs 0.66

## Random Forest Classifier 

- Training Score: 1 vs 1
- Testing Score: 0.64 vs 0.64
- Recall high-risk: 0.40 vs 0.40
- Precision low-risk: 0.59 vs 0.60
    
## Conclusions
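I was only partly correct: scaling did not improve both models. The Logistic Regression model improved when the data was scaled, but the Random Forests Classifier gave essentially the same results. This makes sense because decision trees split on thresholds within each feature, so rescaling a feature does not change which splits are available.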
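
A tiny sketch of why a tree-based model is largely unaffected by scaling: a single decision tree fit on hypothetical toy data makes the same predictions before and after standardization, because the ordering of values within each feature is unchanged. The `_toy` names are invented for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two features on very different scales
X_train_toy = np.array(
    [[1, 100], [2, 200], [3, 300], [10, 1000], [11, 1100], [12, 1200]], dtype=float
)
y_train_toy = np.array([0, 0, 0, 1, 1, 1])
X_new_toy = np.array([[2.5, 250.0], [10.5, 1050.0]])

toy_scaler = StandardScaler().fit(X_train_toy)

# Fit one tree on the raw features and one on the standardized features
tree_raw = DecisionTreeClassifier(random_state=1).fit(X_train_toy, y_train_toy)
tree_scaled = DecisionTreeClassifier(random_state=1).fit(
    toy_scaler.transform(X_train_toy), y_train_toy
)

pred_raw = tree_raw.predict(X_new_toy)
pred_scaled = tree_scaled.predict(toy_scaler.transform(X_new_toy))
print(pred_raw, pred_scaled)
```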

# Comparison

## LR vs RF

- Training score: 0.713 vs 1. Conclusion: RF is still overfit.
- Testing score: 0.720 vs 0.64. Winner: LR. LR was 8 percentage points more accurate overall.
- Recall high-risk: 0.53 vs 0.40. Winner: LR. LR correctly identified 13 percentage points more of the high-risk loans.
- Precision low-risk: 0.66 vs 0.60. Winner: LR. LR's prediction that a loan is low risk is 6 percentage points more reliable.

## Conclusion 
The LR model was the better model with scaled data. The RF model is still overfit, though, so I am confident that if features were removed from the RF model it would become more accurate.
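
One way this hypothesis could be tested is with scikit-learn's `SelectFromModel`, which keeps only the features a fitted model considers important. The sketch below uses synthetic stand-in data, so it says nothing about the loan results; whether the reduced model actually scores higher has to be checked on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in the loan features to run the real check
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Keep only features whose importance is at least the mean importance
# (the default threshold for a random forest)
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1)).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Fit one forest on all features and one on the selected subset
full = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
reduced = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr_sel, y_tr)

print(f"Features kept: {X_tr_sel.shape[1]} of {X_tr.shape[1]}")
print(f"Full model test score: {full.score(X_te, y_te):.3f}")
print(f"Reduced model test score: {reduced.score(X_te_sel, y_te):.3f}")
```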