<img style="float: left;" src="../images/fanniemae.png">
<br><br><br><br><br><br>
______

# Mortgage Loan Default Classifier
____________
____________

## Problem Statement:
_____________
Fannie Mae, formally the Federal National Mortgage Association (FNMA), is a government-sponsored enterprise whose primary goal is to expand home ownership and affordable housing.  In essence, it does this by purchasing mortgage loans that meet certain parameters from mortgage lenders, which in turn gives those lenders the cash flow to issue additional mortgages.

The Financial Crisis of 2008 can be traced in part to the purchase of mortgage loans whose actual probability of default was higher than assumed.  A classification model that predicts whether a mortgage loan will default, based on pre-purchase characteristics, may help Fannie Mae avoid high-risk mortgage loans.  The model will be evaluated on Accuracy and False Negative Rate.  In this case the "positive" class is loans that default; therefore, we will seek to minimize the False Negative Rate while maximizing Accuracy.

##  Preprocessing and Modeling
_________
The proportion of Defaults in this dataset is approximately 0.309%.  For this model to be useful, it is tasked with beating this percentage.  In this case, the type of incorrect prediction matters a great deal.  Incorrectly predicting that a loan will default (a false positive) results in a lost opportunity for revenue.  In contrast, incorrectly predicting that a loan will not default (a false negative) can result in large losses, in the neighborhood of $50,000 per default on average*.  The model will therefore seek not only a high overall accuracy but also a low false negative rate.  A Logistic Regression will be used as the base model.

\* Source: https://homeguides.sfgate.com/basic-foreclosure-fees-costs-61248.html
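
To make the role of the two metrics concrete, here is a small arithmetic sketch. It uses the row and default counts that the notebook reports further down (504,559 loans, 1,561 defaults) and considers a hypothetical trivial classifier that always predicts "no default". That classifier is an illustration only and is not part of the notebook's modeling.

```python
# Counts taken from the notebook outputs: 504,559 loans, 1,561 of them defaults.
n_loans = 504_559
n_defaults = 1_561

default_rate = n_defaults / n_loans

# Hypothetical trivial classifier: predict "no default" for every loan.
tn, fp = n_loans - n_defaults, 0   # every non-default is classified correctly
fn, tp = n_defaults, 0             # every default is missed

baseline_accuracy = (tn + tp) / n_loans
baseline_fnr = fn / (fn + tp)

print(f"default rate:      {default_rate:.3%}")
print(f"baseline accuracy: {baseline_accuracy:.3%}")
print(f"baseline FNR:      {baseline_fnr:.0%}")
```

On data this imbalanced, accuracy alone can look excellent while every default is missed, which is why the false negative rate is tracked alongside it.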

### Import packages

In [1]:
# import necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

### Load dataset

In [2]:
df = pd.read_csv('../data/complete2011q1.csv')
df.head()

| | LOAN IDENTIFIER | ORIGINATION CHANNEL | SELLER NAME | ORIGINAL INTEREST RATE | ORIGINAL UPB | ORIGINAL LOAN TERM | ORIGINAL LOAN-TO-VALUE (LTV) | ORIGINAL COMBINED LOAN-TO-VALUE (CLTV) | NUMBER OF BORROWERS | ORIGINAL DEBT TO INCOME RATIO | ... | LOAN PURPOSE | PROPERTY TYPE | NUMBER OF UNITS | OCCUPANCY TYPE | PROPERTY STATE | PRODUCT TYPE | RELOCATION MORTGAGE INDICATOR | DEFAULT | MI | MIN CREDIT SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100000841305 | C | CITIMORTGAGE, INC. | 4.125 | 124000 | 360 | 79 | 79.0 | 1.0 | 28.0 | ... | R | SF | 1 | P | TX | FRM | N | 0 | 0.0 | 792.0 |
| 1 | 100001889356 | R | OTHER | 4.625 | 115000 | 240 | 68 | 68.0 | 1.0 | 34.0 | ... | C | SF | 1 | P | IL | FRM | N | 0 | 0.0 | 705.0 |
| 2 | 100006453372 | C | BANK OF AMERICA, N.A. | 4.375 | 175000 | 360 | 52 | 52.0 | 2.0 | 29.0 | ... | C | PU | 1 | S | AZ | FRM | N | 0 | 0.0 | 776.0 |
| 3 | 100010656545 | C | BANK OF AMERICA, N.A. | 4.375 | 365000 | 360 | 59 | 59.0 | 3.0 | 40.0 | ... | C | PU | 1 | P | IL | FRM | N | 0 | 0.0 | 797.0 |
| 4 | 100010758624 | R | CITIMORTGAGE, INC. | 3.875 | 69000 | 120 | 28 | 28.0 | 1.0 | 32.0 | ... | C | SF | 1 | P | SC | FRM | N | 0 | 0.0 | 785.0 |


### Transform categorical features to binary indicator variables

In [3]:
categorical_features = ['ORIGINATION CHANNEL', 'SELLER NAME', 'FIRST TIME HOME BUYER INDICATOR', 'LOAN PURPOSE', 
                        'PROPERTY TYPE', 'OCCUPANCY TYPE', 'PROPERTY STATE', 'PRODUCT TYPE', 'RELOCATION MORTGAGE INDICATOR']

In [4]:
df = pd.get_dummies(df, columns=categorical_features, drop_first=True)
df.shape

(504559, 92)
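
As a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does, the toy frame below borrows a few column names and values from the preview above. The frame itself is made up for illustration and is not the Fannie Mae data.

```python
import pandas as pd

# Toy frame (illustrative only) with one numeric and two categorical columns.
toy = pd.DataFrame({
    "ORIGINAL INTEREST RATE": [4.125, 4.625, 4.375],
    "PROPERTY TYPE": ["SF", "SF", "PU"],
    "OCCUPANCY TYPE": ["P", "P", "S"],
})

# Each categorical column becomes one indicator column per level,
# and drop_first=True drops the first level (in sorted order) of each.
encoded = pd.get_dummies(toy,
                         columns=["PROPERTY TYPE", "OCCUPANCY TYPE"],
                         drop_first=True)

print(encoded.columns.tolist())
# ['ORIGINAL INTEREST RATE', 'PROPERTY TYPE_SF', 'OCCUPANCY TYPE_S']
```

Dropping one level per feature avoids perfectly collinear indicator columns; the dropped level becomes the reference category for that feature.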

### Balance target classes

In [5]:
# balance target
# split classes
df_maj = df[df['DEFAULT'] == 0]
df_min = df[df['DEFAULT'] == 1]
 
# downsample majority
df_maj_resample = resample(df_maj, 
                           replace=False,    
                           n_samples=df_min.shape[0],
                           random_state=42)             

# concat downsample and minority
df_resample = pd.concat([df_maj_resample, df_min])
 
# display new class counts
df_resample['DEFAULT'].value_counts()

1    1561
0    1561
Name: DEFAULT, dtype: int64
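
The downsampling step above can be wrapped in a small helper. The sketch below runs it on synthetic data; the function name `downsample_majority` and the toy frame are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic, heavily imbalanced frame: 10 defaults out of 1,000 rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "DEFAULT": [1] * 10 + [0] * 990,
})

def downsample_majority(frame, target="DEFAULT", random_state=42):
    """Return a frame with as many majority (0) rows as minority (1) rows."""
    minority = frame[frame[target] == 1]
    majority = frame[frame[target] == 0]
    majority_down = resample(majority,
                             replace=False,              # sample without replacement
                             n_samples=len(minority),    # match the minority count
                             random_state=random_state)
    return pd.concat([majority_down, minority])

balanced = downsample_majority(toy)
print(balanced["DEFAULT"].value_counts())
```

Note the trade-off: balancing this way discards most of the majority class. In the notebook, 502,998 non-default loans are reduced to 1,561.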

### Create train and test datasets 

In [6]:
# train test split
X = df_resample.drop(columns=['DEFAULT'])
y = df_resample['DEFAULT']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    shuffle=True,
                                                    random_state=42)

### Scale features

In [7]:
# scale features to balance feature weights
ss = StandardScaler()
ss.fit(X_train, y_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

### Base Logistic Model

In [8]:
# define custom metrics
def false_negative_rate(fn, tp):
    return fn/(tp+fn)

def accuracy(tn, tp, y_test):
    return (tn + tp)/len(y_test)

In [9]:
# instantiate Logistic model
log = LogisticRegression()
log.fit(X_train_sc, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [10]:
# accuracy on train data
log.score(X_train_sc, y_train)

0.7876975651431012

In [11]:
# cross validation accuracy on train data
cross_val_score(log, X_train_sc, y_train, cv=5)

array([0.7228145 , 0.76709402, 0.77136752, 0.78632479, 0.73504274])
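
One possible refinement, shown here only as a sketch on synthetic data: because `X_train_sc` was scaled with statistics from the whole training set, every validation fold has already influenced the scaler. Wrapping the scaler and the model in a scikit-learn pipeline refits the scaler inside each fold. The names `X_demo`, `y_demo`, and `pipe` are illustrative, and the effect on the scores is often small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the balanced loan data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# The scaler is fit on the training part of each fold only,
# then applied to that fold's validation part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```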

In [12]:
# accuracy on test data
log.score(X_test_sc, y_test)

0.7554417413572343

__Insight:__
The accuracy of the base logistic classification model leaves a lot to be desired.  The overall test accuracy of about 0.76 is low, and the gap between the training accuracy (0.79) and the test accuracy hints at some overfitting.

In [13]:
# assign predictions
preds = log.predict(X_test_sc)

# assign outcome cases for use in custom metrics
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

# display confusion matrix on base threshold of 50%
confusion_matrix(y_test, preds)

array([[283, 108],
       [ 83, 307]], dtype=int64)

In [14]:
# display false negative rate
false_negative_rate(fn, tp)

0.2128205128205128

__Insight:__
This base model, at a 50% threshold, has a false negative rate of 21.3%, which is nowhere near the 0.31% bar set by the overall Default proportion.
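
As a quick check of the arithmetic, the sketch below recomputes the metrics directly from the confusion matrix printed above. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so `ravel()` yields `tn, fp, fn, tp`. The false positive rate and recall are extra quantities added here for context.

```python
import numpy as np

# Confusion matrix from the base model at the 50% threshold (copied from the output above).
cm = np.array([[283, 108],
               [ 83, 307]])
tn, fp, fn, tp = cm.ravel()

acc = (tn + tp) / cm.sum()   # share of all test loans classified correctly
fnr = fn / (fn + tp)         # share of actual defaults predicted as non-default
fpr = fp / (fp + tn)         # share of actual non-defaults predicted as default
recall = tp / (tp + fn)      # complement of the false negative rate

print(f"accuracy={acc:.4f}  FNR={fnr:.4f}  FPR={fpr:.4f}  recall={recall:.4f}")
```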

In [31]:
# create prediction table to see effect of adjusted threshold
preds_adj = pd.DataFrame(log.predict_proba(X_test_sc), columns=['NoDefault', 'Default'])
preds_adj['y_pred'] = preds
preds_adj['y_pred_adj'] = preds_adj['Default'].map(lambda x: 1 if x >= .05 else 0)
preds_adj.head()

| | NoDefault | Default | y_pred | y_pred_adj |
|---|---|---|---|---|
| 0 | 0.368981 | 0.631019 | 1 | 1 |
| 1 | 0.978883 | 0.021117 | 0 | 0 |
| 2 | 0.51798 | 0.48202 | 0 | 1 |
| 3 | 0.495776 | 0.504224 | 1 | 1 |
| 4 | 0.981273 | 0.018727 | 0 | 0 |


In [32]:
# assign outcome cases for use in custom metrics given threshold adjustment
tn, fp, fn, tp = confusion_matrix(y_test, preds_adj['y_pred_adj']).ravel()

In [33]:
# display confusion matrix based on adjusted threshold
confusion_matrix(y_test, preds_adj['y_pred_adj'])

array([[ 66, 325],
       [  9, 381]], dtype=int64)

In [34]:
# display accuracy given adjusted threshold
accuracy(tn, tp, y_test)

0.5723431498079385

In [35]:
# display false negative rate given adjusted threshold
false_negative_rate(fn, tp)

0.023076923076923078

__Insight:__
By drastically lowering the threshold for what is considered a Default from 50% to 5%, the false negative rate is pushed down to 2.3%.  Unfortunately, this still does not surpass the 0.31% benchmark, and it comes at the cost of misclassifying most Non Default loans (325 of 391 in the test set), which drops accuracy to 57.2%.
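
The trade-off described above can be explored by sweeping several thresholds rather than comparing only 50% and 5%. The following is a minimal sketch on synthetic data; the dataset, the variable names, and the chosen thresholds are all assumptions for illustration, so the numbers it prints will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced loan data, with some label noise.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          stratify=y_demo, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_default = model.predict_proba(X_te)[:, 1]   # probability of class 1

results = []
for threshold in [0.5, 0.3, 0.1, 0.05]:
    y_hat = (proba_default >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    results.append((threshold, fn / (fn + tp), (tn + tp) / len(y_te)))

for threshold, fnr, acc in results:
    print(f"threshold={threshold:.2f}  FNR={fnr:.3f}  accuracy={acc:.3f}")
```

Lowering the threshold can only move predictions from 0 to 1, so the false negative rate never increases as the threshold drops. What it costs in accuracy depends on how well the model separates the two classes.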