# Credit Risk Evaluator

In [41]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [42]:
# Import the data
credit_riskdf = pd.read_csv("Resources/lending_data.csv")
credit_riskdf.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


## Predict Model Performance

You will create and compare two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, predict which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

*Replace the text in this markdown cell with your predictions, and be sure to provide justification for your guess.*

## Split the Data into Training and Testing Sets

In [43]:
# Split the data into X_train, X_test, y_train, y_test

X = credit_riskdf.drop(columns=['loan_status'])
Y = credit_riskdf['loan_status']

X_train,X_test,Y_train,Y_test = train_test_split(X,Y)
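
A plain `train_test_split(X, Y)` call shuffles differently on every run and does not guarantee that the training and testing sets contain the same share of positive loans. The following is a minimal sketch, on a small synthetic label list rather than the lending data, of passing `random_state` and `stratify` to address both points. The toy data, the `toy_` names, and the seed value 1 are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the lending data: 100 rows, 10 of them positive (label 1).
toy_X = [[i] for i in range(100)]
toy_y = [1] * 10 + [0] * 90

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=1, stratify=toy_y
)

print(len(toy_y_train), len(toy_y_test))   # split sizes
print(sum(toy_y_train), sum(toy_y_test))   # positives in each split
```

Applied to this notebook, the same two arguments would make the scores reproducible, although the exact numbers reported below would change with a different split.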

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [44]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
X_r, y_r = ros.fit_resample(X_train, Y_train)
X_r, y_r

(        loan_size  interest_rate  borrower_income  debt_to_income  \
 0          9100.0          7.006            46500        0.354839   
 1          9800.0          7.300            49300        0.391481   
 2         11600.0          8.068            56500        0.469027   
 3          8900.0          6.910            45600        0.342105   
 4          7800.0          6.443            41200        0.271845   
 ...           ...            ...              ...             ...   
 112535    20000.0         11.642            90200        0.667406   
 112536    20200.0         11.698            90700        0.669239   
 112537    15800.0          9.844            73200        0.590164   
 112538    17900.0         10.732            81600        0.632353   
 112539    19100.0         11.246            86400        0.652778   
 
         num_of_accounts  derogatory_marks  total_debt  
 0                     3                 0       16500  
 1                     4                 0  
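
The later cells fit on `X_train` and `Y_train`, so the resampled `X_r` and `y_r` do not affect the reported scores. To illustrate what random oversampling does, here is a minimal standard-library sketch that duplicates randomly chosen minority rows until both classes have the same count. It is not the imblearn implementation; the function name `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=1):
    """Duplicate random minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement to make up the shortfall for this class.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data: four negative rows and one positive row (values are illustrative).
toy_rows = [[9100.0], [9800.0], [11600.0], [8900.0], [20000.0]]
toy_labels = [0, 0, 0, 0, 1]
bal_rows, bal_labels = random_oversample(toy_rows, toy_labels)
print(Counter(bal_labels))
```

To train on balanced data in the notebook itself, the models would need to be fit on `X_r` and `y_r` instead of `X_train` and `Y_train`.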

In [45]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=0)
model_fit = model.fit(X_train, Y_train)
training_score = model_fit.score(X_train, Y_train)
testing_score = model_fit.score(X_test, Y_test)


print(f"Training Score: {training_score}")
print(f"Testing Score: {testing_score}")

Training Score: 0.9921756775347366
Testing Score: 0.9916941807676434


In [46]:
from sklearn.metrics import confusion_matrix

# Get the predictions for the test set
Y_pred = model_fit.predict(X_test)

# Calculate the confusion matrix
confusion_mat = confusion_matrix(Y_test, Y_pred)

# Print the confusion matrix
print(confusion_mat)


[[18661   105]
 [   56   562]]


* True negatives (TN): 18661 loans were correctly predicted as negative (not bad credit).
* False positives (FP): 105 loans were predicted as positive (bad credit) but are actually negative.
* False negatives (FN): 56 loans were predicted as negative (not bad credit) but are actually positive (bad credit).
* True positives (TP): 562 loans were correctly predicted as positive (bad credit).
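
scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, so for labels 0 and 1 the flattened order is TN, FP, FN, TP. A small standard-library sketch of that layout follows; the labels are made up and the helper name `confusion_2x2` is hypothetical.

```python
def confusion_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = actual, column = predicted
    return matrix

# Made-up labels: four actual negatives and two actual positives.
toy_true = [0, 0, 0, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 0]

toy_matrix = confusion_2x2(toy_true, toy_pred)
(tn_toy, fp_toy), (fn_toy, tp_toy) = toy_matrix
print(toy_matrix)
```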

In [47]:
tn, fp, fn, tp = confusion_mat.ravel()
print(f"(TP): {tp}")
accuracy = (tp + tn)/ (tp + fp + tn + fn)
print(f" accuracy: {accuracy:.3f}")

(TP): 562
 accuracy: 0.992


In [48]:
# True negative rate (specificity): TN / (TN + FP)
tn_rate = tn / (tn + fp)
print(tn_rate)

# Share of all test samples that are actually negative: (TN + FP) / total
neg_rate = (tn + fp) / (tp + fp + tn + fn)
neg_rate

0.9944047745923479


0.9681180354931903

A high true negative rate (specificity) is generally desirable, as it means the classifier correctly identifies most of the actual negatives.
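
The second value in the output, 0.968, is the fraction of test loans that are actually negative. It is also the accuracy a model would reach by predicting "negative" for every loan. A quick check in plain Python, using the counts from the confusion matrix above:

```python
# Counts from the logistic regression confusion matrix on the test set.
tn, fp, fn, tp = 18661, 105, 56, 562
total = tn + fp + fn + tp

# Always predicting the negative class is right for every actual negative.
baseline_accuracy = (tn + fp) / total
model_accuracy = (tn + tp) / total

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
```

The model's accuracy is only about 2.4 percentage points above that baseline, which is why precision and recall on the positive class are more informative here than accuracy alone.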

In [49]:
# Precision and recall give a better picture of how well the model identifies bad credit.
from sklearn.metrics import precision_score, recall_score

# Get the predictions for the test set
Y_pred = model_fit.predict(X_test)

# Calculate precision
precision = precision_score(Y_test, Y_pred)

# Calculate recall
recall = recall_score(Y_test, Y_pred)
# Specificity would be TN / (TN + FP)

# Print the results
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
f1_score = 2*(precision*recall)/(precision + recall)
print(f"f1_score: {f1_score:.3f}")

Precision: 0.843
Recall: 0.909
f1_score: 0.875


In [62]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(random_state=0)
forest_fit = forest.fit(X_train, Y_train)
ftraining_score = forest_fit.score(X_train, Y_train)
ftesting_score = forest_fit.score(X_test, Y_test)


print(f"Training Score: {ftraining_score}")
print(f"Testing Score: {ftesting_score}")

Training Score: 0.9974549456596505
Testing Score: 0.991384647131655


In [64]:
# Get the predictions for the test set
Y_predf = forest_fit.predict(X_test)

# Calculate the confusion matrix
fconfusion_mat = confusion_matrix(Y_test, Y_predf)

# Print the confusion matrix
print(fconfusion_mat)

[[18667    99]
 [   68   550]]


In [66]:
tn, fp, fn, tp = fconfusion_mat.ravel()
print(f"(TP): {tp}")
accuracy = (tp + tn)/ (tp + fp + tn + fn)
print(f" accuracy: {accuracy:.3f}")

(TP): 550
 accuracy: 0.991


In [67]:
# True negative rate (specificity): TN / (TN + FP)
tn_rate = tn / (tn + fp)
print(tn_rate)

# Share of all test samples that are actually negative: (TN + FP) / total
neg_rate = (tn + fp) / (tp + fp + tn + fn)
neg_rate

0.9947245017584995


0.9681180354931903

In [68]:
# Reuse Y_predf, the random forest test-set predictions from the earlier cell
# Calculate precision
precisionf = precision_score(Y_test, Y_predf)

# Calculate recall
recallf = recall_score(Y_test, Y_predf)
# Specificity would be TN / (TN + FP)

# Print the results
print(f"Precision: {precisionf:.3f}")
print(f"Recall: {recallf:.3f}")
f1_scoref = 2*(precisionf*recallf)/(precisionf + recallf)
print(f"f1_score: {f1_scoref:.3f}")

Precision: 0.847
Recall: 0.890
f1_score: 0.868


*Which model performed better? How does that compare to your prediction? Replace the text in this markdown cell with your answers to these questions.*

I predicted that the Logistic Regression model would be better suited to this dataset because there are not many features for this classification. Judging by precision, recall, and F1 score, the Logistic Regression model performed slightly better overall. The Random Forest Classifier has marginally higher precision (0.847 vs. 0.843) but lower recall (0.890 vs. 0.909) and a lower F1 score (0.868 vs. 0.875), while test accuracy is nearly identical (0.991 vs. 0.992). The Random Forest's training score (0.997) also sits further above its testing score than the Logistic Regression's does, which may indicate mild overfitting. My prediction was correct!
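
To make the comparison concrete, the sketch below recomputes the headline metrics for both models directly from the two confusion matrices printed earlier. The helper name `summarize` is hypothetical; the counts are taken from the notebook output.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the test-set confusion matrices of the two models.
logistic = summarize(tn=18661, fp=105, fn=56, tp=562)
forest = summarize(tn=18667, fp=99, fn=68, tp=550)

for name in logistic:
    print(f"{name:<10} logistic={logistic[name]:.3f}  forest={forest[name]:.3f}")
```

The Random Forest makes 6 fewer false positives but 12 more false negatives, which is where its lower recall and F1 score come from.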