# Homework 19
## William Vann

# My Prediction

My prediction is that **Logistic Regression (LR) will perform better than Random Forest Classification (RFC)**. 

One reason is that the credit risk problem resembles other problems (such as medical disease risk) for which we have used LR effectively. We are predicting a 0 or 1 for credit risk, not a continuous numeric value, and LR is a standard, reliable choice (often the first choice, at least for testing) in such situations.

Of course, RFC is a good choice for such situations as well, so I've considered what I know about the math underlying each approach to favor LR. My understanding is that LR works by calculating probabilities, while RFC relies on "majority voting" among some number of decision trees. This particular problem seems more probabilistic to me: the question "Is person A a good credit risk (loan_status=0) or a bad credit risk (loan_status=1)?" seems better answered with "Yes (with an x% probability)" than with "Yes (a majority of the trees vote Yes)". I am persuaded further when I consider that LR has a probability "threshold" that the data scientist can change, which suggests it might be more "fine-tunable" than RFC.
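
To make the "threshold" idea concrete, here is a minimal sketch of how a custom probability cutoff could be applied to a fitted LR model. It runs on synthetic stand-in data from `make_classification`, not on the lending dataset, and the names `X_demo`, `y_demo`, `lr_demo`, and `predict_with_threshold` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in data (an assumption; not the lending dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, weights=[0.9, 0.1], random_state=0
)

lr_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1
# (the class this notebook calls "bad risk").
proba_bad = lr_demo.predict_proba(X_demo)[:, 1]


def predict_with_threshold(proba, threshold):
    """Label a sample 1 when its class-1 probability meets the threshold."""
    return (proba >= threshold).astype(int)


# A lower threshold flags more samples as class 1; a higher one flags fewer.
for threshold in (0.3, 0.5, 0.7):
    flagged = predict_with_threshold(proba_bad, threshold).sum()
    print(f"threshold={threshold}: {flagged} samples flagged as class 1")
```

One caveat on the comparison: scikit-learn's `RandomForestClassifier` also exposes `predict_proba` (it averages the trees' probability estimates), so the same kind of thresholding could in principle be tried with RFC as well.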

In [1]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the dataset

In [2]:
# Import the data

df = pd.read_csv("Resources/lending_data.csv")
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
df.describe()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
count,77536.0,77536.0,77536.0,77536.0,77536.0,77536.0,77536.0,77536.0
mean,9805.562577,7.292333,49221.949804,0.377318,3.82661,0.392308,19221.949804,0.032243
std,2093.223153,0.889495,8371.635077,0.081519,1.904426,0.582086,8371.635077,0.176646
min,5000.0,5.25,30000.0,0.0,0.0,0.0,0.0,0.0
25%,8700.0,6.825,44800.0,0.330357,3.0,0.0,14800.0,0.0
50%,9500.0,7.172,48100.0,0.376299,4.0,0.0,18100.0,0.0
75%,10400.0,7.528,51400.0,0.416342,4.0,1.0,21400.0,0.0
max,23800.0,13.235,105200.0,0.714829,16.0,3.0,75200.0,1.0


In [4]:
df["loan_status"].value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64

In [5]:
df.groupby(["loan_status"]).mean()

Unnamed: 0_level_0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
loan_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,9515.627166,7.169118,48062.314089,0.368549,3.565968,0.333533,18062.314089
1,18507.8,10.990529,84027.72,0.640501,11.6496,2.1564,54027.72


In [6]:
df.groupby(["loan_status"]).median()

Unnamed: 0_level_0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
loan_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,9500.0,7.151,47900.0,0.373695,4.0,0.0,17900.0
1,18600.0,11.0235,84300.0,0.644128,12.0,2.0,54300.0


# Preprocess the data

In [7]:
X = df.drop("loan_status", axis=1).values

y = df["loan_status"].values

target_names = ["good risk", "bad risk"]  # loan_status=0 or 1, respectively

In [8]:
# Split the data into X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [9]:
# Create a scaler to standardize the data

scaler = StandardScaler()

# Train the scaler with the X_train data.

scaler.fit(X_train)

StandardScaler()

In [10]:
# Transform X_train and X_test.
# Note that the scaler used to transform X_train and X_test was trained on X_train.

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
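
The fit-on-train, transform-both pattern above can also be wrapped in a `Pipeline`, which keeps the scaler and the model together so the test data never influences the scaling. The following is a minimal sketch on synthetic stand-in data; `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `pipe` are hypothetical names, and this is not what the notebook itself runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the lending dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The pipeline fits the scaler on the training data only, then fits the model
# on the scaled training data.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to X_te before predicting.
test_score = pipe.score(X_te, y_te)
print(f"Pipeline testing score: {test_score:.3f}")
```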

# Create predictions with Logistic Regression

In [11]:
# Train a Logistic Regression model 

model = LogisticRegression().fit(X_train_scaled, y_train)

In [12]:
# Score the model

print(f"Logistic Regression Training Data Score: {model.score(X_train_scaled, y_train)}")
print(f"Logistic Regression Testing Data Score: {model.score(X_test_scaled, y_test)}")

Logistic Regression Training Data Score: 0.9941876461686614
Logistic Regression Testing Data Score: 0.9939640940982254


In [13]:
y_true = y_test
y_pred = model.predict(X_test_scaled)

cm_lr = confusion_matrix(y_true, y_pred)
cr_lr = classification_report(y_true, y_pred, target_names=target_names)

print(f"Logistic Regression CONFUSION MATRIX: \n\n {cm_lr}\n\n")
print(f"Logistic Regression CLASSIFICATION REPORT: \n\n {cr_lr}\n\n")

Logistic Regression CONFUSION MATRIX: 

 [[18617   106]
 [   11   650]]


Logistic Regression CLASSIFICATION REPORT: 

               precision    recall  f1-score   support

   good risk       1.00      0.99      1.00     18723
    bad risk       0.86      0.98      0.92       661

    accuracy                           0.99     19384
   macro avg       0.93      0.99      0.96     19384
weighted avg       0.99      0.99      0.99     19384





# Create predictions with Random Forest Classifier

In [14]:
# Train a Random Forest Classifier model

clf = RandomForestClassifier(random_state=0, n_estimators=500).fit(X_train_scaled, y_train)

In [15]:
# Score the model

print(f"Random Forest Classifier Training Score: {clf.score(X_train_scaled, y_train)}")
print(f"Random Forest Classifier Testing Score: {clf.score(X_test_scaled, y_test)}")

Random Forest Classifier Training Score: 0.9973861604072087
Random Forest Classifier Testing Score: 0.9911782913743293


In [16]:
y_true = y_test
y_pred = clf.predict(X_test_scaled)

cm_rfc = confusion_matrix(y_true, y_pred)
cr_rfc = classification_report(y_true, y_pred, target_names=target_names)

print(f"Random Forest Classifier CONFUSION MATRIX: \n\n {cm_rfc}\n\n")
print(f"Random Forest Classifier CLASSIFICATION REPORT: \n\n {cr_rfc}\n\n")

Random Forest Classifier CONFUSION MATRIX: 

 [[18626    97]
 [   74   587]]


Random Forest Classifier CLASSIFICATION REPORT: 

               precision    recall  f1-score   support

   good risk       1.00      0.99      1.00     18723
    bad risk       0.86      0.89      0.87       661

    accuracy                           0.99     19384
   macro avg       0.93      0.94      0.93     19384
weighted avg       0.99      0.99      0.99     19384





## Results and Reflection


In [17]:
print(f"LR: {cr_lr}")
print(f"RFC: {cr_rfc}")

LR:               precision    recall  f1-score   support

   good risk       1.00      0.99      1.00     18723
    bad risk       0.86      0.98      0.92       661

    accuracy                           0.99     19384
   macro avg       0.93      0.99      0.96     19384
weighted avg       0.99      0.99      0.99     19384

RFC:               precision    recall  f1-score   support

   good risk       1.00      0.99      1.00     18723
    bad risk       0.86      0.89      0.87       661

    accuracy                           0.99     19384
   macro avg       0.93      0.94      0.93     19384
weighted avg       0.99      0.99      0.99     19384



## Logistic Regression is the Better-Performing Model

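Looking at the classification reports, both models score well on precision, recall, and f1-score for "good risk" (1.00, 0.99, and 1.00), but less well on the same metrics for "bad risk". The two models tie on "bad risk" precision (0.86), but LR is clearly ahead on "bad risk" recall (0.98 vs. 0.89) and f1-score (0.92 vs. 0.87), and its testing score is slightly higher as well (0.9940 vs. 0.9912). In confusion-matrix terms, LR missed 11 of the 661 bad-risk loans in the test set, while RFC missed 74. Therefore, my prediction was confirmed.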
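
As a cross-check, the "bad risk" row of each classification report can be recomputed by hand from the confusion matrices printed earlier. This is a small sketch; the matrices are copied from the outputs above, and `cm_lr_reported`, `cm_rfc_reported`, and `bad_risk_metrics` are hypothetical names used only here.

```python
import numpy as np

# Confusion matrices copied from the outputs above.
# Rows are the true class, columns the predicted class; class 1 is "bad risk".
cm_lr_reported = np.array([[18617, 106], [11, 650]])
cm_rfc_reported = np.array([[18626, 97], [74, 587]])


def bad_risk_metrics(cm):
    """Return (precision, recall, f1) for class 1 of a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for name, cm in (("LR", cm_lr_reported), ("RFC", cm_rfc_reported)):
    p, r, f = bad_risk_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# LR: precision=0.86 recall=0.98 f1=0.92
# RFC: precision=0.86 recall=0.89 f1=0.87
```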

Exploring the dataset shows that it is quite "lop-sided": only about 3% of the sample (2,500 of 77,536 rows) had a `loan_status` value of 1, i.e., was judged to be a "bad" credit risk, while the remaining 97% was judged to be a "good" risk. My intuition is that most classification models will post a high accuracy score with such a skew, since predicting "good risk" for every row would already be right about 97% of the time. That was borne out in our comparison of two of the most often used binary classification models: both scored above 99% on accuracy, so the "bad risk" metrics are the more informative basis for comparing them.
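
The effect of the skew on accuracy can be illustrated with a baseline that always predicts the majority class. The sketch below rebuilds the labels from the `value_counts()` output above (75,036 zeros and 2,500 ones) and uses scikit-learn's `DummyClassifier`; the placeholder feature matrix and the names `y_all`, `X_placeholder`, and `baseline` are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the class counts reported by value_counts() above.
y_all = np.concatenate([np.zeros(75036, dtype=int), np.ones(2500, dtype=int)])

# The baseline ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((y_all.size, 1))

# "most_frequent" always predicts the majority class (0, "good risk").
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
y_base = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_all, y_base)
baseline_recall_bad = recall_score(y_all, y_base, pos_label=1)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")           # 0.9678
print(f"Baseline 'bad risk' recall: {baseline_recall_bad:.2f}")  # 0.00
```

A baseline like this scores about 97% on accuracy while catching none of the bad-risk loans, which is why recall on the minority class separates the two models more clearly than accuracy does.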