# Modelling Credit Risk

## Making Predictions
Now that the feature columns have been cleaned and prepared for modelling, I can start using the data to make predictions. 

I do need to keep in mind the class imbalance in my target column, `loan_status`, however. There are six times as many loans that were paid off on time (`1`) as loans that weren't (`0`). This imbalance will influence both the error metric I choose and the way I configure my machine learning models.

### Error Metric
Before I start, I should pick an error metric. Since I am approaching the problem as a conservative investor, I want to minimize risk, which means avoiding false positives (loans the model predicts will be paid off that actually are not) as much as possible. Because of the class imbalance, accuracy would be misleading: a model that predicts `1` for every loan would score well on accuracy while approving every bad loan. Instead, I will measure error using the rates of true and false positives and negatives. This means I will optimize for a high true positive rate (recall) and a low false positive rate (fall-out).
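
To make the two rates concrete, here is a minimal sketch of how they fall out of the four confusion-matrix counts. The helper name `tpr_fpr` and the toy labels are hypothetical and exist only for illustration; as in the data, `1` is a loan paid off on time and `0` is one that was not.

```python
def tpr_fpr(actual, predicted):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # good loans approved
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # good loans rejected
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # bad loans approved
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # bad loans rejected
    return tp / (tp + fn), fp / (fp + tn)

# Made-up example: 6 good loans followed by 2 bad loans.
actual    = [1, 1, 1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 1, 0]

tpr, fpr = tpr_fpr(actual, predicted)
print(tpr, fpr)
```

Here 4 of the 6 good loans are approved and 1 of the 2 bad loans is approved, so the true positive rate is 4/6 and the false positive rate is 1/2.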

### Logistic Regression
Logistic regression is a good first algorithm to apply to binary classification problems because it is quick to train, easy to interpret, and less prone to overfitting than more complex models like decision trees. To get a realistic estimate of how the model performs on data it was not trained on, I will use k-fold cross validation, so that every prediction comes from a model that never saw that row during training. This is how I will get my initial predictions.

In [10]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [11]:
#bring in data
import pandas as pd
import numpy as np

loans = pd.read_csv('loans_clean_V2.csv')

#import algorithms for logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

#instantiating the model
lr = LogisticRegression()

#getting feature columns from df
cols = loans.columns
train_cols = cols.drop('loan_status')

#setting feature and target
features = loans[train_cols]
target = loans['loan_status']

#performing 3-fold cross validation
predictions = cross_val_predict(lr, features, target, cv=3)

#convert predictions to pandas series
predictions = pd.Series(predictions)

#calculating false positive and true positive rate
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)
print(predictions.head(15))

0.9989121566494424
0.9967943009795192
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
dtype: int64


It appears the model is predicting `1` for almost every row: the true positive rate and the false positive rate are both above 99%. That would still produce a high accuracy, because most loans are paid off, but it means nearly every bad loan would be approved. The classifier isn't accounting for the imbalance in the classes, so I need a way to tell it to correct for that. I can do this by having the classifier penalize misclassifications of the less prevalent class more heavily than misclassifications of the other class.
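
As a rough illustration of why accuracy hides this problem, the sketch below scores a hypothetical "model" that approves every loan. The seven-row sample is made up; it only mirrors the six-to-one class ratio described earlier.

```python
# Hypothetical sample: 6 loans paid off (1) for every loan that was not (0).
actual = [1] * 6 + [0] * 1
predicted = [1] * len(actual)  # approve everything

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)

accuracy = (tp + tn) / len(actual)
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(f"accuracy={accuracy:.3f} tpr={tpr:.1f} fpr={fpr:.1f}")
```

Accuracy comes out at 6/7 even though the false positive rate is 100%, which is the pattern the near-100% rates above point to.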

### Penalizing the Classifier
By setting the `class_weight` parameter to `balanced` when creating the `LogisticRegression` instance, I can have the classifier weight each class inversely to its frequency, so that it pays more attention to correctly classifying rows where `loan_status` is `0`. As a result, I expect the model to get fewer rows right where `loan_status` is `1` and more rows right where it is `0`.
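
For intuition, scikit-learn documents the `balanced` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on a made-up label list with a six-to-one ratio; the helper name `balanced_weights` is hypothetical and is not part of scikit-learn.

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up labels: 6 paid-off loans (1) for every loan that was not (0).
labels = [1] * 6 + [0] * 1
weights = balanced_weights(labels)

print(weights)
print(weights[0] / weights[1])  # how much more a 0 counts than a 1
```

With a six-to-one ratio, each `0` row ends up counting six times as much as each `1` row, which is what pushes the model to stop ignoring the minority class.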

In [12]:
#instantiating the model
lr = LogisticRegression(class_weight='balanced')

#performing 3-fold cross validation
predictions = cross_val_predict(lr, features, target, cv=3)

#convert predictions to pandas series
predictions = pd.Series(predictions)

#calculating false positive and true positive rate
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

0.6593237240504034
0.3796972395369546


Balancing the classes significantly lowered the false positive rate, at the cost of a lower true positive rate. The true positive rate is now about 66% and the false positive rate is about 38%. From a conservative investor's standpoint, the lower false positive rate is reassuring because it means I would do a better job of avoiding bad loans than if I funded everything. However, I would fund only about 66% of the good loans (the true positive rate) and reject the remaining third.

I could try to lower the false positive rate further by assigning a harsher penalty for misclassifying bad loans. For the model below, I will pass in a dictionary that sets a penalty of 10 for misclassifying a `0` and a penalty of 1 for misclassifying a `1`.
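
One way to picture what these penalties do is as multipliers on each row's contribution to the log loss. The sketch below is a simplified illustration of that idea only: the function name `weighted_log_loss` and the predicted probabilities are hypothetical, and the actual objective scikit-learn optimizes also includes a regularization term.

```python
import math

# Same penalties as the dictionary passed to the model: class label -> weight.
class_weight = {0: 10, 1: 1}

def weighted_log_loss(actual, prob_of_1):
    """Log loss for one row, scaled by the weight of its true class."""
    prob_of_actual = prob_of_1 if actual == 1 else 1 - prob_of_1
    return -class_weight[actual] * math.log(prob_of_actual)

# Two equally confident mistakes (probabilities are made up):
loss_bad_loan_approved = weighted_log_loss(actual=0, prob_of_1=0.9)
loss_good_loan_rejected = weighted_log_loss(actual=1, prob_of_1=0.1)

print(loss_bad_loan_approved, loss_good_loan_rejected)
```

Under this simplified view, approving a bad loan costs the model ten times as much as rejecting a good one with the same confidence, so it becomes much more reluctant to predict `1`.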

In [15]:
lr = LogisticRegression(class_weight={
    0: 10,
    1: 1
})

#performing 3-fold cross validation
predictions = cross_val_predict(lr, features, target, cv=3)

#convert predictions to pandas series
predictions = pd.Series(predictions)

#calculating false positive and true positive rate
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

0.23560873900824947
0.08708815672306322


Assigning manual penalties lowered the false positive rate to about 9%, which lowers my risk. This comes at the expense of the true positive rate, which fell to about 24%. While I have fewer false positives, I'm also missing opportunities to fund more loans and potentially make more money. Given that I am approaching this as a conservative investor, this strategy makes sense, but it's worth keeping the tradeoff in mind. I'll move on to try a different model.


### Random Forests
A random forest may perform better than logistic regression because it can pick up on columns that relate nonlinearly to `loan_status`.

In [14]:
from sklearn.ensemble import RandomForestClassifier

#instantiating the model
rfc = RandomForestClassifier(random_state=1, class_weight='balanced')

#performing 3-fold cross validation
predictions = cross_val_predict(rfc, features, target, cv=3)

#convert predictions to pandas series
predictions = pd.Series(predictions)

#calculating false positive and true positive rate
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)

0.9709304082434351
0.9271593944790739


## Conclusion
Unfortunately, the random forest classifier didn't improve the false positive rate. With a true positive rate of about 97% and a false positive rate of about 93%, the model is still predicting `1` for most rows, so the `balanced` class weights were not enough on their own. I could apply a harsher penalty for misclassifying `0`s, like I did for the logistic regression. As it stands, my best model is the logistic regression with manual penalties, with a false positive rate of about 9% and a true positive rate of about 24%. At these rates, a conservative investor will make money as long as the interest earned on the good loans that get funded (about 24% of all good loans) is enough to offset the losses from the bad loans that slip through (about 9% of all bad loans).
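
To put rough numbers on that last point, here is a back-of-the-envelope sketch using the rates reported for the best model. It rests on assumptions that are not in the analysis above: exactly six good loans for every bad one, equal loan sizes, simple one-period interest, and a bad loan losing its entire principal. Treat it as an illustration of the reasoning rather than a return estimate.

```python
# Rates printed above for LogisticRegression(class_weight={0: 10, 1: 1}).
tpr = 0.23560873900824947
fpr = 0.08708815672306322

# Assumption: 6 good loans for every bad loan, as noted at the start.
good, bad = 6, 1

funded_good = good * tpr  # good loans the model would fund
funded_bad = bad * fpr    # bad loans that slip through

# Share of bad loans in the funded portfolio vs. funding everything.
bad_share_funded = funded_bad / (funded_good + funded_bad)
bad_share_all = bad / (good + bad)

# Assumption: a good loan earns rate r on its principal and a bad loan
# loses all of its principal, so break-even is funded_good * r == funded_bad.
break_even_rate = funded_bad / funded_good

print(f"bad share if funding everything: {bad_share_all:.1%}")
print(f"bad share among funded loans:    {bad_share_funded:.1%}")
print(f"break-even interest rate:        {break_even_rate:.1%}")
```

Under these assumptions, roughly 6% of the funded loans would be bad, compared with about 14% when funding everything, and the interest earned on the good loans would need to exceed roughly 6% of principal to cover the losses.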