In [29]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [4]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,index,loan_amnt,int_rate,installment,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,debt_settlement_flag
0,57107,57107,13375.0,0.1797,483.34,MORTGAGE,223000.0,Not Verified,low_risk,n,...,100.0,50.0,0.0,0.0,577150.0,122018.0,32000.0,170200.0,N,N
1,141451,141451,21000.0,0.1308,478.68,MORTGAGE,123000.0,Source Verified,low_risk,n,...,85.0,33.3,0.0,0.0,132750.0,27896.0,15900.0,35398.0,N,N
2,321143,321143,20000.0,0.124,448.95,MORTGAGE,197000.0,Source Verified,low_risk,n,...,85.7,33.3,0.0,0.0,628160.0,114043.0,22600.0,90340.0,N,N
3,11778,11778,3000.0,0.124,100.22,RENT,45000.0,Not Verified,low_risk,n,...,100.0,16.7,1.0,0.0,42006.0,20761.0,19900.0,15406.0,N,N
4,169382,169382,30000.0,0.1612,1056.49,MORTGAGE,133000.0,Source Verified,low_risk,n,...,100.0,66.7,0.0,0.0,283248.0,109056.0,79500.0,58778.0,N,N


In [27]:
# Split the data and target
X = train_df.drop('loan_status', axis=1)
y = train_df['loan_status']

# Convert our target to numeric values
y_train = LabelEncoder().fit_transform(train_df['loan_status'])

In [22]:
# Convert categorical data to numeric for training data
X_train = pd.get_dummies(X)

In [23]:
# Convert categorical data to numeric and separate target feature for testing data
Xtest = test_df.drop('loan_status', axis=1)
ytest = test_df['loan_status']

y_test = LabelEncoder().fit_transform(test_df['loan_status'])
X_test = pd.get_dummies(Xtest)

In [24]:
# Add dummy columns missing from either set (filled with 0) so both sets share the same columns
X_train2, X_test2 = X_train.align(X_test, join='outer', axis=1, fill_value=0)
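
As a small illustration of what the alignment step does, here is a sketch on toy data. The two tiny frames and the `ANY` category are made up for the example and are not taken from the loan files; the point is only that a dummy column present in one set is added to the other and filled with 0.

```python
import pandas as pd

# Toy data (hypothetical): the training set has a category the test set lacks
train = pd.DataFrame({"home_ownership": ["MORTGAGE", "RENT", "ANY"]})
test = pd.DataFrame({"home_ownership": ["MORTGAGE", "RENT"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Outer join on columns: keep the union of columns, fill the gaps with 0
train_aligned, test_aligned = train_dummies.align(
    test_dummies, join="outer", axis=1, fill_value=0
)

print(list(train_aligned.columns))
print(list(test_aligned.columns))
```

After the call, both frames have the same columns in the same order, so a model trained on one can score the other.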

## Prediction!

I believe a random forest classifier will perform better, because a random forest trains many independent decision trees on random subsets of the data and predicts the class chosen by the majority of them.
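
The majority-vote idea can be sketched in a few lines of plain Python. This is only an illustration of the concept: the `majority_vote` helper and the list of votes are hypothetical, and scikit-learn's `RandomForestClassifier` actually averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single loan
tree_votes = ["low_risk", "high_risk", "low_risk", "low_risk", "high_risk"]
winner = majority_vote(tree_votes)
print(winner)
```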

### Let's test this theory!

Run the cells below to see if we're correct.


In [28]:
# Train the Logistic Regression model on the unscaled data and print the model score
classifier = LogisticRegression()

classifier.fit(X_train2, y_train)

print(f"Training Data Score: {classifier.score(X_train2, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test2, y_test)}")

Training Data Score: 0.648440065681445
Testing Data Score: 0.5253083794130158


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
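
The warning above suggests raising `max_iter` (or scaling the data). A minimal sketch of the first option follows, on synthetic data from `make_classification` rather than the loan data; the value `max_iter=1000` is an arbitrary assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the loan data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)

# Give the solver a larger iteration budget than the default
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(model.n_iter_)
```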


In [30]:
# Train a Random Forest Classifier model and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train2, y_train)
print(f'Training Score: {clf.score(X_train2, y_train)}')
print(f'Testing Score: {clf.score(X_test2, y_test)}')

Training Score: 1.0
Testing Score: 0.5971926839642705


## We were right!

The random forest classifier overfit the training data (training score of 100%), but its testing score (60%) is still higher than the logistic regression model's (53%).

### And on the scaled data?

My prediction is that the random forest classifier will still score higher than the logistic regression model, for the same reason given above.


In [31]:
# Scale the data (note: the training and testing sets are each standardized with their own scaler here)
scalertrain = StandardScaler().fit(X_train2)
scalertest = StandardScaler().fit(X_test2)

X_train_scaled = scalertrain.transform(X_train2)
X_test_scaled = scalertest.transform(X_test2)
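
The cell above fits one scaler on the training set and a second one on the testing set. A more common convention is to learn the mean and standard deviation from the training data only and reuse them for the test data, so both sets go through the same transformation. The sketch below shows that idea in plain Python on a made-up column; the helper names are hypothetical. With scikit-learn, the equivalent would be fitting a single `StandardScaler` on `X_train2` and calling `transform` on both sets, which would likely change the scores reported below.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation of a training column."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # avoid dividing by zero for constant columns
    return mean, std

def standardize(column, mean, std):
    """Apply already-learned statistics to any column."""
    return [(value - mean) / std for value in column]

# Toy values (hypothetical), standing in for one numeric feature
train_col = [10.0, 20.0, 30.0]
test_col = [20.0, 40.0]

mean, std = fit_standardizer(train_col)          # statistics come from training data only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)   # test data reuses the training statistics

print(train_scaled)
print(test_scaled)
```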

In [34]:
# Train the Logistic Regression model on the scaled data and print the model score
classifier.fit(X_train_scaled, y_train)

print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.7130541871921182
Testing Data Score: 0.670565716716291


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')


Training Score: 1.0
Testing Score: 0.5791152700978307


## Analysis

### Unscaled data

The random forest classifier scored higher on the testing data than the logistic regression model (60% versus 53%).

### Scaled data

On the scaled data, the logistic regression model did better (67%) than the random forest classifier (58%), contrary to the prediction made before scaling.

### Final thoughts

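Scaling improved the logistic regression model, whose testing score rose from 53% to 67%, but not the random forest classifier, whose testing score slipped from 60% to 58%. Scaling is frequently part of data preprocessing because it normalizes the range of the features, which helps models such as logistic regression that are fit by iterative optimization. Tree-based models split on feature thresholds and are generally much less sensitive to it.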