<img style="float: left;" src="../images/fanniemae.png">
<br><br><br><br><br><br>
______

# Mortgage Loan Default Classifier
____________
____________

## Problem Statement:
_____________
Fannie Mae, formally the Federal National Mortgage Association (FNMA), is a government-sponsored enterprise whose primary goal is to raise home ownership and affordable housing levels.  In essence, it does this by purchasing mortgage loans that meet certain parameters from mortgage lenders, which in turn gives those lenders the cash flow to issue additional mortgages.<br>

The Financial Crisis of 2008 can in part be traced back to the purchase of mortgage loans whose actual probability of default was higher than assumed.  With a classification model that predicts whether a mortgage loan will default based on pre-purchase characteristics, Fannie Mae may better avoid high-risk mortgage loans.  The model will be evaluated on Accuracy and False Negative Rate.  In this case, the "positive" class is loans that default; therefore, we will seek to minimize the False Negative Rate while maximizing Accuracy.

## Engineered Features, Production Model, and Conclusion
___________
To achieve better accuracy and a lower false negative rate, interaction features and a neural network will be introduced.  Correlations between the continuous features and whether a loan defaults were minimal.  By using PolynomialFeatures, the model can investigate whether the interactions between these features are more important than the features alone.  Additionally, a neural network will be introduced to increase accuracy and decrease the false negative rate.  A logistic regressor has the advantage of being able to identify the key features that have the greatest influence on whether a loan will default.  However, as evidenced by the base model, it may not reach the accuracy needed to be a successful model.  A neural network lacks the interpretability inherent in the logistic model; however, neural networks have been shown to be highly accurate because of their ability to generalise and respond to unexpected patterns.  The high degree of interaction between features in this dataset may make a neural network the right choice in this situation.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures 
from sklearn.metrics import confusion_matrix

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers
from keras.callbacks import EarlyStopping

%matplotlib inline

Using TensorFlow backend.


In [2]:
df = pd.read_csv('../data/complete2011q1.csv')
df.head()

Unnamed: 0,LOAN IDENTIFIER,ORIGINATION CHANNEL,SELLER NAME,ORIGINAL INTEREST RATE,ORIGINAL UPB,ORIGINAL LOAN TERM,ORIGINAL LOAN-TO-VALUE (LTV),ORIGINAL COMBINED LOAN-TO-VALUE (CLTV),NUMBER OF BORROWERS,ORIGINAL DEBT TO INCOME RATIO,...,LOAN PURPOSE,PROPERTY TYPE,NUMBER OF UNITS,OCCUPANCY TYPE,PROPERTY STATE,PRODUCT TYPE,RELOCATION MORTGAGE INDICATOR,DEFAULT,MI,MIN CREDIT SCORE
0,100000841305,C,"CITIMORTGAGE, INC.",4.125,124000,360,79,79.0,1.0,28.0,...,R,SF,1,P,TX,FRM,N,0,0.0,792.0
1,100001889356,R,OTHER,4.625,115000,240,68,68.0,1.0,34.0,...,C,SF,1,P,IL,FRM,N,0,0.0,705.0
2,100006453372,C,"BANK OF AMERICA, N.A.",4.375,175000,360,52,52.0,2.0,29.0,...,C,PU,1,S,AZ,FRM,N,0,0.0,776.0
3,100010656545,C,"BANK OF AMERICA, N.A.",4.375,365000,360,59,59.0,3.0,40.0,...,C,PU,1,P,IL,FRM,N,0,0.0,797.0
4,100010758624,R,"CITIMORTGAGE, INC.",3.875,69000,120,28,28.0,1.0,32.0,...,C,SF,1,P,SC,FRM,N,0,0.0,785.0


In [3]:
continuous_features = ['ORIGINAL INTEREST RATE', 'ORIGINAL UPB', 'ORIGINAL LOAN-TO-VALUE (LTV)', 
                       'ORIGINAL COMBINED LOAN-TO-VALUE (CLTV)', 'ORIGINAL DEBT TO INCOME RATIO',
                       'MIN CREDIT SCORE']

In [4]:
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = pd.DataFrame(poly.fit_transform(df[continuous_features]), columns=poly.get_feature_names(continuous_features))
df_poly = df_poly.drop(columns=continuous_features)
df_poly = pd.concat([df, df_poly], axis=1)
df_poly.head()


Unnamed: 0,LOAN IDENTIFIER,ORIGINATION CHANNEL,SELLER NAME,ORIGINAL INTEREST RATE,ORIGINAL UPB,ORIGINAL LOAN TERM,ORIGINAL LOAN-TO-VALUE (LTV),ORIGINAL COMBINED LOAN-TO-VALUE (CLTV),NUMBER OF BORROWERS,ORIGINAL DEBT TO INCOME RATIO,...,ORIGINAL LOAN-TO-VALUE (LTV)^2,ORIGINAL LOAN-TO-VALUE (LTV) ORIGINAL COMBINED LOAN-TO-VALUE (CLTV),ORIGINAL LOAN-TO-VALUE (LTV) ORIGINAL DEBT TO INCOME RATIO,ORIGINAL LOAN-TO-VALUE (LTV) MIN CREDIT SCORE,ORIGINAL COMBINED LOAN-TO-VALUE (CLTV)^2,ORIGINAL COMBINED LOAN-TO-VALUE (CLTV) ORIGINAL DEBT TO INCOME RATIO,ORIGINAL COMBINED LOAN-TO-VALUE (CLTV) MIN CREDIT SCORE,ORIGINAL DEBT TO INCOME RATIO^2,ORIGINAL DEBT TO INCOME RATIO MIN CREDIT SCORE,MIN CREDIT SCORE^2
0,100000841305,C,"CITIMORTGAGE, INC.",4.125,124000,360,79,79.0,1.0,28.0,...,6241.0,6241.0,2212.0,62568.0,6241.0,2212.0,62568.0,784.0,22176.0,627264.0
1,100001889356,R,OTHER,4.625,115000,240,68,68.0,1.0,34.0,...,4624.0,4624.0,2312.0,47940.0,4624.0,2312.0,47940.0,1156.0,23970.0,497025.0
2,100006453372,C,"BANK OF AMERICA, N.A.",4.375,175000,360,52,52.0,2.0,29.0,...,2704.0,2704.0,1508.0,40352.0,2704.0,1508.0,40352.0,841.0,22504.0,602176.0
3,100010656545,C,"BANK OF AMERICA, N.A.",4.375,365000,360,59,59.0,3.0,40.0,...,3481.0,3481.0,2360.0,47023.0,3481.0,2360.0,47023.0,1600.0,31880.0,635209.0
4,100010758624,R,"CITIMORTGAGE, INC.",3.875,69000,120,28,28.0,1.0,32.0,...,784.0,784.0,896.0,21980.0,784.0,896.0,21980.0,1024.0,25120.0,616225.0
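To make the new columns concrete, the sketch below recomputes the degree-2 terms for three of the continuous features using only the standard library. It is a simplified illustration of the squares and pairwise products that `PolynomialFeatures(degree=2, include_bias=False)` appends, not scikit-learn's implementation; the helper name `degree2_terms` and the shortened labels `LTV` and `DTI` are hypothetical. The input values are taken from row 0 of the table above (LTV 79, debt-to-income ratio 28, minimum credit score 792).

```python
from itertools import combinations_with_replacement

def degree2_terms(names, row):
    """Return {term name: value} for every square and pairwise product."""
    terms = {}
    for i, j in combinations_with_replacement(range(len(row)), 2):
        # i == j gives a squared term; i < j gives an interaction term
        name = f"{names[i]}^2" if i == j else f"{names[i]} {names[j]}"
        terms[name] = row[i] * row[j]
    return terms

# Shortened, hypothetical labels for three of the notebook's features
names = ["LTV", "DTI", "MIN CREDIT SCORE"]
row0 = [79.0, 28.0, 792.0]

for name, value in degree2_terms(names, row0).items():
    print(f"{name}: {value}")
```

The printed values agree with the corresponding squared and interaction columns shown for row 0.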


In [5]:
categorical_features = ['ORIGINATION CHANNEL', 'SELLER NAME', 'FIRST TIME HOME BUYER INDICATOR', 'LOAN PURPOSE', 
                        'PROPERTY TYPE', 'OCCUPANCY TYPE', 'PROPERTY STATE', 'PRODUCT TYPE', 'RELOCATION MORTGAGE INDICATOR']

In [6]:
df_poly = pd.get_dummies(df_poly, columns=categorical_features, drop_first=True)
df_poly.shape

(504559, 113)

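### Target number of observations per class ≈ features ^ 2
- 112 features (113 columns less the `DEFAULT` target): 112^2 = 12,544
- The resampling cell uses `int(len(df_poly.columns)**2.05)`, i.e. 113^2.05, which gives 16,173 observations per class

To reach this target, the minority class is upsampled and the majority class is downsampled.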

In [7]:
df['DEFAULT'].value_counts()

0    502998
1      1561
Name: DEFAULT, dtype: int64

In [8]:
# balance targets
# split classes
df_maj = df_poly[df_poly['DEFAULT'] == 0]
df_min = df_poly[df_poly['DEFAULT'] == 1]

# upsample minority
df_min_resample = resample(df_min,
                           replace=True,
                           n_samples=int(len(df_poly.columns)**2.05),
                           random_state=42)

# downsample majority
df_maj_resample = resample(df_maj, 
                           replace=False,    
                           n_samples=df_min_resample.shape[0],
                           random_state=42)             


# concat downsample and minority
df_resample = pd.concat([df_maj_resample, df_min_resample])
 
# Display new class counts
df_resample['DEFAULT'].value_counts()

1    16173
0    16173
Name: DEFAULT, dtype: int64
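The balancing step can also be illustrated without pandas. The following is a minimal sketch on hypothetical toy data (1,000 majority rows, 10 minority rows, a target of 50 per class), assuming the same scheme as the cell above: draw the minority class with replacement up to the target, then draw the majority class without replacement down to the same size. The function name `balance` is an assumption made for illustration.

```python
import random

def balance(majority, minority, n_target, seed=42):
    """Upsample the minority with replacement, downsample the majority without."""
    rng = random.Random(seed)
    minority_up = rng.choices(minority, k=n_target)   # duplicates allowed
    majority_down = rng.sample(majority, k=n_target)  # each row at most once
    return majority_down, minority_up

majority = [("maj", i) for i in range(1000)]  # hypothetical non-default rows
minority = [("min", i) for i in range(10)]    # hypothetical default rows

maj_bal, min_bal = balance(majority, minority, n_target=50)
print(len(maj_bal), len(min_bal))  # both classes now have 50 rows
print(len(set(min_bal)))           # at most 10 distinct minority rows
```

Because the minority rows are drawn with replacement, the balanced set necessarily contains exact copies of the same defaults.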

In [9]:
# train test split
X = df_resample.drop(columns=['DEFAULT'])
y = df_resample['DEFAULT']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    shuffle=True,
                                                    random_state=42)
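
One caveat with the order of operations: the minority class was upsampled with replacement before this split, so copies of the same default loan can land in both the training and the test set, which may inflate test scores. The sketch below, on hypothetical loan ids and using only the standard library, contrasts that ordering with splitting first and upsampling only the training portion. It illustrates the concern and is not a re-run of the notebook; all names and sizes are assumptions.

```python
import random

rng = random.Random(42)
minority_ids = list(range(10))  # hypothetical default-loan ids

# Ordering A: upsample first, then split 75/25
upsampled = rng.choices(minority_ids, k=100)
rng.shuffle(upsampled)
train_a, test_a = upsampled[:75], upsampled[75:]
shared_a = set(train_a) & set(test_a)  # ids present on both sides

# Ordering B: split the distinct ids first, then upsample the training part only
ids = minority_ids[:]
rng.shuffle(ids)
train_ids, test_ids = ids[:7], ids[7:]
train_b = rng.choices(train_ids, k=75)
shared_b = set(train_b) & set(test_ids)

print("shared ids, upsample then split:", len(shared_a))
print("shared ids, split then upsample:", len(shared_b))  # 0 by construction
```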

In [10]:
ss = StandardScaler()
ss.fit(X_train, y_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

## Logistic Classifier

In [11]:
params = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 0.2, 0.3, 0.4]
}

grid_log_cv = GridSearchCV(LogisticRegression(), params, cv=5, return_train_score=True, n_jobs=-1)
grid_log_cv.fit(X_train_sc, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 0.2, 0.3, 0.4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [12]:
grid_log_cv.best_params_

{'C': 0.2, 'penalty': 'l1'}

In [13]:
grid_log_cv.best_score_

0.7796694010470341

In [14]:
log_opt = grid_log_cv.best_estimator_

In [15]:
log_opt.score(X_test_sc, y_test)

0.7813775194757018

In [16]:
y_pred = log_opt.predict(X_test_sc)

In [17]:
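def false_negative_rate(fn, tp):
    # share of actual defaults (positive class) that the model missed
    return fn/(tp+fn)

def accuracy(tn, tp, y_test):
    # share of all test observations that were classified correctly
    return (tn + tp)/len(y_test)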

In [18]:
confusion_matrix(y_test, y_pred)

array([[3104,  940],
       [ 828, 3215]], dtype=int64)

In [19]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

In [20]:
false_negative_rate(fn, tp)

0.20479841701706653

In [21]:
accuracy(tn, tp, y_test)

0.7813775194757018

In [22]:
df_pred = pd.DataFrame(log_opt.predict_proba(X_test_sc), columns=['NO DEFAULT', 'DEFAULT'])
df_pred['y_pred'] = y_pred
df_pred['y_pred_adj'] = df_pred['DEFAULT'].map(lambda x: 1 if x >= 0.05 else 0)
df_pred.head()

| | NO DEFAULT | DEFAULT | y_pred | y_pred_adj |
|---|---|---|---|---|
| 0 | 0.082869 | 0.917131 | 1 | 1 |
| 1 | 0.706924 | 0.293076 | 0 | 1 |
| 2 | 0.820701 | 0.179299 | 0 | 1 |
| 3 | 0.882574 | 0.117426 | 0 | 1 |
| 4 | 0.232236 | 0.767764 | 1 | 1 |


In [23]:
confusion_matrix(y_test, df_pred['y_pred_adj'])

array([[ 561, 3483],
       [  27, 4016]], dtype=int64)

In [24]:
tn, fp, fn, tp = confusion_matrix(y_test, df_pred['y_pred_adj']).ravel()

In [25]:
false_negative_rate(fn, tp)

0.006678209250556517

In [26]:
accuracy(tn, tp, y_test)

0.565970075429702

__Insight:__
The interaction features that were engineered increased the accuracy of the logistic model from 0.72 to 0.78.  However, an acceptable false negative rate is not obtainable without a large loss of accuracy and widespread misclassification of non-default loans.
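
The trade-off between the two metrics can be seen by sweeping the decision threshold. The following sketch uses a small set of hypothetical probabilities and labels (not the notebook's predictions) and recomputes the false negative rate and accuracy at a few cut-offs; `metrics_at` is an illustrative helper, not part of the notebook.

```python
def metrics_at(threshold, probs, labels):
    """Return (false negative rate, accuracy) when predicting 1 if prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return fn / (tp + fn), (tp + tn) / len(labels)

# Hypothetical predicted default probabilities and true labels (1 = default)
probs  = [0.92, 0.29, 0.18, 0.12, 0.77, 0.55, 0.03, 0.10]
labels = [1,    0,    0,    1,    1,    1,    0,    0]

for t in (0.5, 0.25, 0.05):
    fnr, acc = metrics_at(t, probs, labels)
    print(f"threshold={t}: FNR={fnr:.2f}, accuracy={acc:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.05 removes the last false negative but drops accuracy from 0.875 to 0.625, the same pattern as in the confusion matrices above.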

## Neural Network

In [27]:
num = X_train_sc.shape[1]

nn = Sequential()

nn.add(Dense(num, activation='relu', input_dim=num))
nn.add(Dropout(0.1))
nn.add(Dense(int(num/2), activation='relu'))
nn.add(Dense(1, activation='sigmoid'))

nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early = EarlyStopping(monitor='val_loss', min_delta=0, patience=2)

nn_result = nn.fit(X_train_sc, y_train, epochs=100, validation_data=(X_test_sc, y_test), callbacks=[early])

Train on 24259 samples, validate on 8087 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
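
The log ends at epoch 19 of 100, which is consistent with the `EarlyStopping` callback halting training once `val_loss` stopped improving. As a simplified sketch of that patience rule (not Keras's actual implementation; the loss values are hypothetical and `stop_epoch` is an illustrative helper), training stops once the monitored loss has failed to improve for `patience` consecutive epochs:

```python
def stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:  # improvement: record it and reset the counter
            best = loss
            wait = 0
        else:                        # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None                      # every epoch ran without triggering a stop

# Hypothetical validation losses, one per epoch
losses = [0.50, 0.40, 0.41, 0.39, 0.40, 0.395, 0.41]
print(stop_epoch(losses))  # 6: epochs 5 and 6 both failed to beat 0.39
```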


In [28]:
df_pred_nn = pd.DataFrame(nn.predict_proba(X_test_sc), columns=['DEFAULT'])
df_pred_nn['y_pred'] = nn.predict_classes(X_test_sc)
df_pred_nn['y_pred_adj'] = df_pred_nn['DEFAULT'].map(lambda x: 1 if x >= 0.1 else 0)
df_pred_nn.head()

| | DEFAULT | y_pred | y_pred_adj |
|---|---|---|---|
| 0 | 0.987966 | 1 | 1 |
| 1 | 0.9708326 | 1 | 1 |
| 2 | 1.677061e-10 | 0 | 0 |
| 3 | 3.102563e-06 | 0 | 0 |
| 4 | 0.9992926 | 1 | 1 |


In [29]:
confusion_matrix(y_test, df_pred_nn['y_pred_adj'])

array([[3250,  794],
       [   3, 4040]], dtype=int64)

In [30]:
tn, fp, fn, tp = confusion_matrix(y_test, df_pred_nn['y_pred_adj']).ravel()

In [31]:
false_negative_rate(fn, tp)

0.0007420232500618352

In [32]:
accuracy(tn, tp, y_test)

0.9014467664152344

## Conclusion
The neural network model with an adjusted threshold of 0.10 obtains 0.90 accuracy and a false negative rate of 0.074%, which far surpasses the benchmark of 0.309%.  For reference, this model would let Fannie Mae avoid 235 defaulted loans per 100,000 loans.  At an average loss of \$50,000 per default, that equates to a loss avoidance of \$11,750,000 per 100,000 loans.
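
The arithmetic behind these figures can be checked directly. The sketch below assumes the avoided defaults are the difference between the benchmark rate and the model's false negative rate, applied to 100,000 loans. Rates are expressed as counts per 100,000 loans to keep the calculation in integers, and the variable names are illustrative.

```python
benchmark_per_100k = 309       # 0.309% benchmark, as a count per 100,000 loans
model_per_100k = 74            # 0.074% false negative rate, same units
avg_loss_per_default = 50_000  # average loss per default, in dollars

avoided_defaults = benchmark_per_100k - model_per_100k
loss_avoided = avoided_defaults * avg_loss_per_default

print(avoided_defaults)     # 235
print(f"{loss_avoided:,}")  # 11,750,000
```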