# Lending Club Loan Data Modeling

In this section we will attempt to determine the best model to predict whether or not a borrower will default in the Lending Club Loan data.

Before beginning, we'll define our **_Satisficing_** and **_Optimizing_** metrics. Andrew Ng recommends outlining these before beginning in the _deeplearning.ai_ course named _Structuring Machine Learning Projects_.

After that, we'll get down and dirty with some data cleaning to get this dataset in tip-top shape and ready to be modeled.

We then start the modeling, beginning with a **_Logistic Regression_** model, using **_Forward Selection_** to determine the features. We will then try a **_K-Nearest Neighbors Classifier_** and end with a **_Random Forest_** and some hyperparameter tuning.

Afterwards, we'll wrap it all up with a summary of what we have learned.

First though, let's do our usual import of a billion packages so we're ready to machine learn.

In [None]:
import os
import pandas as pd
import numpy as np
import re
import itertools
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import matplotlib.pyplot as plt  # needed by the plotting cells
import mcnulty_util as mcu

In [None]:
import warnings
warnings.filterwarnings('ignore')
%load_ext autoreload
%autoreload 2

# Table of Contents

1. [Project Goals](#project_goals)
2. [Data Cleaning](#data_cleaning)
3. [Logistic Regression](#log_reg)
    1. [Single Feature](#log_reg_one)
    2. [Multiple Features](#log_reg_mult)
    3. [Visualization of Best Model](#log_reg_viz)
    4. [Hyperparameter Tuning with Grid Search](#log_reg_hyperparams)
4. [K-Nearest Neighbors](#knn)
5. [Random Forest](#rf)

<a id="project_goals"></a>
# Project Goals

Our goal is to provide investors with a model that allows them to invest in loans with high confidence that the borrower will not default. This will minimize losses to that investor by making sure their principal is secure.

As we are cautious investors, we want our model to predict default if it is not confident either way. To do this we will optimize **_Recall_**, also known as the **_True Positive Rate_**, which is defined as follows:

\begin{equation*}
Recall = True\ Positive\ Rate = \frac{TP} {TP + FN}
\end{equation*}

\begin{equation*}
TP = Number\ of\ True\ Positives
\end{equation*}

\begin{equation*}
FN = Number\ of\ False\ Negatives
\end{equation*}

**_Recall_** is a measure of a classifier's **_completeness_**. 
In this instance, recall measures what share of the actual defaulters we correctly predicted would default.

However, if we optimize too much towards recall we will end up predicting default every time. 
This will give us 100% recall, but not a good model. 
To avoid this we'll also include **_precision_**. 
**_Precision_**, also known as the **_Positive Predictive Value_**, is defined as follows:

\begin{equation*}
Precision = Positive\ Predictive\ Value = \frac{TP} {TP + FP}
\end{equation*}

\begin{equation*}
TP = Number\ of\ True\ Positives
\end{equation*}

\begin{equation*}
FP = Number\ of\ False\ Positives
\end{equation*}


Since we want to balance precision and recall, we'll use the **_F1 Score_**, which is the harmonic mean of the two.

This means **_F1 Score_** will be our optimizing metric. Our satisficing metric is precision: we want a model with precision above 80%.
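
To make the three metrics concrete, here is a minimal sketch on a made-up set of labels. The `y_true` and `y_pred` arrays are illustrative only and are not taken from the Lending Club data. It also shows why recall alone is a poor target: a model that always predicts default gets perfect recall but poor precision.

```python
import numpy as np
from sklearn import metrics

# Illustrative labels only: 1 = default, 0 = no default
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# Confusion matrix counts, in scikit-learn's (tn, fp, fn, tp) order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)
precision = tp / (tp + fp)
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print("recall={:.3f} precision={:.3f} f1={:.3f}".format(recall, precision, f1))

# A degenerate classifier that predicts default for every loan
y_all_default = np.ones_like(y_true)
recall_all = metrics.recall_score(y_true, y_all_default)
precision_all = metrics.precision_score(y_true, y_all_default)
f1_all = metrics.f1_score(y_true, y_all_default)
print("always default: recall={:.3f} precision={:.3f} f1={:.3f}".format(
    recall_all, precision_all, f1_all))
```

Because F1 punishes a large gap between precision and recall, the always-default model scores lower on F1 than the imperfect but balanced one.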

Before we get to that, let's get down and dirty with some data cleaning.

<a id="data_cleaning"></a>
# Data Cleaning

I have moved our data cleaning to the _mcnulty_util.py_ module to keep things modular. 

In this function, we filter out loan statuses that don't apply, per the EDA file. We also subset the columns.
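
As a rough idea of what this kind of preprocessing involves, here is a hypothetical sketch on a toy frame. It is not the code in _mcnulty_util.py_: the function name `preprocess_loans`, its arguments, and the status strings used below are all assumptions made for illustration.

```python
import pandas as pd


def preprocess_loans(raw, keep_columns, default_statuses, paid_statuses):
    """Hypothetical stand-in for a preprocessing helper (not the mcnulty_util code)."""
    statuses = list(default_statuses) + list(paid_statuses)
    # Keep only loans whose final outcome is known
    df = raw.loc[raw['loan_status'].isin(statuses)].copy()
    # Binary target: 1 = default, 0 = repaid
    df['default'] = df['loan_status'].isin(default_statuses).astype(int)
    # Subset to the modeling columns plus the target
    return df.loc[:, list(keep_columns) + ['default']].reset_index(drop=True)


# Toy data; the status labels are assumed, not read from the real file
raw = pd.DataFrame({
    'loan_status': ['Fully Paid', 'Charged Off', 'Current', 'Fully Paid'],
    'dti': [10.5, 25.1, 18.0, 7.2],
    'int_rate': [7.9, 18.2, 12.5, 6.0],
    'url': ['a', 'b', 'c', 'd'],
})
clean = preprocess_loans(raw, ['dti', 'int_rate'], ['Charged Off'], ['Fully Paid'])
print(clean)
```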

In [None]:
df = mcu.mcnulty_preprocessing()

<a id="log_reg"></a>
# Logistic Regression

<a id="log_reg_one"></a>
## Single Feature Models

In [None]:
independents = [
    ['dti'],
    ['int_rate'],
    ['annual_inc'],
    ['loan_amnt'],
    ['revol_bal'],
    ['term'],
    ['delinq_2yrs'],
    ['home_ownership'],
    ['grade'],
    ['purpose'],
    ['emp_length']]
dependent = 'default'

In [None]:
results = list()
for variable in independents:
    X, y = df.loc[:, variable], df.loc[:, dependent]
    clf = LogisticRegression(C=1000000, penalty='l1')
    if X.iloc[:, 0].dtype not in [np.float64, np.int64]:
        enc = OneHotEncoder()
        X = enc.fit_transform(X)
        record = mcu.log_clf_model(clf, 'Logistic Regression', X, y, variable)
        results.append(record)
    else:
        for degree in range(1, 4):
            if degree == 1:
                clf = LogisticRegression(C=1000000, penalty='l1')
                record = mcu.log_clf_model(clf, 'Logistic Regression', X, y, variable)
                results.append(record)
            else:
                clf = Pipeline([('poly', PolynomialFeatures(degree)), 
                                ('clf', LogisticRegression(C=1000000, penalty='l1'))])
                record = mcu.log_clf_model(clf, 'Logistic Regression', X, y, variable, degree)
                results.append(record)
# Let's also add a bias model
X = np.ones((df.shape[0], 1))
y = df.loc[:, dependent]
clf = LogisticRegression(C=1000000, penalty='l1')
results.append(mcu.log_clf_model(clf, 'Logistic Regression', X, y, 'bias'))
(mcu.results_to_df(results)
 .pipe(mcu.scores_formatted))

We can see that all but two of our models guess non-default 100 percent of the time. This is common with imbalanced classes. We are basically dealing with a high-bias problem here, so we need to add features to **_reduce bias_** and **_add variance_**.
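
A small sketch of why this happens, using synthetic labels (the 10% positive rate is an arbitrary illustration, not the measured default rate). A model that always guesses the majority class looks fine on accuracy but scores zero on recall and F1.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Synthetic, imbalanced target: roughly 10% positives (illustrative rate only)
rng = np.random.RandomState(11)
y_demo = (rng.rand(1000) < 0.1).astype(int)
X_demo = np.ones((y_demo.shape[0], 1))  # a constant, uninformative feature

# Always predict the most frequent class, i.e. non-default
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

accuracy = metrics.accuracy_score(y_demo, y_hat)
recall = metrics.recall_score(y_demo, y_hat)
f1 = metrics.f1_score(y_demo, y_hat, zero_division=0)
print("accuracy={:.3f} recall={:.3f} f1={:.3f}".format(accuracy, recall, f1))
```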

<a id="log_reg_mult"></a>
## Multiple Features

To add some variance, we'll now fit models with two or three features, with each numeric independent variable getting polynomial transformations of degree 1 to 3.
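
For reference, this is roughly what a polynomial transformation does to two numeric features. The input values are made up, and the sketch calls scikit-learn's `PolynomialFeatures` directly rather than going through the `mcu` helpers.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One row with two made-up numeric features, x0 = 2 and x1 = 3
X_demo = np.array([[2.0, 3.0]])

# Expand to degree 1, 2 and 3; each expansion includes a bias column of ones
expanded = {d: PolynomialFeatures(degree=d).fit_transform(X_demo)
            for d in range(1, 4)}
for degree, values in expanded.items():
    print(degree, values.shape[1], values[0])
```

The number of columns grows quickly with the degree (3, 6 and 10 here), which is where the extra variance comes from.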

In [None]:
for features_tuple in itertools.combinations(list(mcu.independents.keys()), 2):
    features = list(features_tuple)
    if mcu.independents[features[0]] == 'dummy' and mcu.independents[features[1]] == 'dummy':
        clf = LogisticRegression(C=1000000, penalty='l1')
        pipeline = mcu.clf_pipeline(clf, features, 1)  # categorical-only pair: degree 1, matching the logged record
        record = mcu.log_clf_model(pipeline, 'Logistic Regression', df, y, features, 1)
        results.append(record)
    else: 
        for degree in range(1, 4):
            clf = LogisticRegression(C=1000000, penalty='l1')
            pipeline = mcu.clf_pipeline(clf, features, degree)
            record = mcu.log_clf_model(pipeline, 'Logistic Regression', df, y, features, degree)
            results.append(record)
for features_tuple in itertools.combinations(list(mcu.independents.keys()), 3):
    features = list(features_tuple)
    if (    mcu.independents[features[0]] == 'dummy'
        and mcu.independents[features[1]] == 'dummy'
        and mcu.independents[features[2]] == 'dummy'):
        clf = LogisticRegression(C=1000000, penalty='l1')
        pipeline = mcu.clf_pipeline(clf, features, 1)  # categorical-only triple: degree 1, matching the logged record
        record = mcu.log_clf_model(pipeline, 'Logistic Regression', df, y, features, 1)
        results.append(record)
    else: 
        for degree in range(1, 4):
            clf = LogisticRegression(C=1000000, penalty='l1')
            pipeline = mcu.clf_pipeline(clf, features, degree)
            record = mcu.log_clf_model(pipeline, 'Logistic Regression', df, y, features, degree)
            results.append(record)
(mcu.results_to_df(results)
 .pipe(mcu.scores_formatted)
 .head(10))

In [None]:
# unpack_list is assumed to live in mcnulty_util alongside the other helpers
list(set(mcu.unpack_list(mcu.results_to_df(results)
         .pipe(mcu.scores_formatted)
         .head(15)
         .features.tolist())))

<a id="log_reg_viz"></a>
## Visualization of Best Model

In [None]:
features = ['dti', 'int_rate']
dependent = 'default'
X, y = df.loc[:, features], df.loc[:, dependent]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11,
                                                    stratify=y)
degree = 2
clf = mcu.clf_pipeline(LogisticRegression(), features, degree)
clf.fit(X_train, y_train)
ax = mcu.plot_estimator(clf, X_test, y_test)
ax.set(title='Defaulters by DTI and Interest Rate')
plt.show()

It's not great, but our classifier is definitely telling us that people with higher Debt-to-Income Ratios and higher Interest Rates are more likely to default, which makes sense. Unfortunately, we can also tell from this that the data doesn't provide us with a clean split.

<a id="log_reg_hyperparams"></a>
## Hyperparameter Tuning with Grid Search

In [None]:
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs','revol_bal', 'loan_amnt', 'grade', 'term']
degree = 2
X, y = df.loc[:, features], df.loc[:, dependent]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11,
                                                    stratify=y)
pipeline = mcu.clf_pipeline(LogisticRegression(), features, degree)
weight_space = np.linspace(0.05, 0.95, 20)
class_weights = [{0: x, 1: 1.0-x} for x in weight_space]
hyperparameters = dict(clf__class_weight=class_weights)
gs = GridSearchCV(pipeline, hyperparameters, scoring='f1', cv=5)
gs.fit(X_train, y_train)
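
The search above tries pairs of class weights that sum to one, so that mistakes on the rarer default class can be penalized more heavily. The same idea can be sketched on synthetic data. In this sketch the `make_classification` data, the smaller five-point grid, and the pipeline step name `clf` are all assumptions for illustration; the step name has to match whatever the real pipeline calls its classifier step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the loan features
X_demo, y_demo = make_classification(n_samples=600, n_features=5,
                                     weights=[0.85, 0.15], random_state=11)

# The 'clf' step name is what makes the 'clf__class_weight' key resolve
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])

# Candidate weights for class 0; class 1 gets the complement
weight_space = np.linspace(0.05, 0.95, 5)
grid = {'clf__class_weight': [{0: w, 1: 1.0 - w} for w in weight_space]}

search = GridSearchCV(pipe, grid, scoring='f1', cv=3)
search.fit(X_demo, y_demo)
best_weights = search.best_params_['clf__class_weight']
print(best_weights, round(search.best_score_, 3))
```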

In [None]:
print("Best Class Weights : {}".format(gs.best_params_))

In [None]:
model_desc = 'Logistic Regression with Class Weights'
#class_weight = gs.best_params_['clf__class_weight']
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs','revol_bal', 'loan_amnt', 'grade', 'term']
class_weight = {0: 0.23947368421052628, 1: 0.7605263157894737}
logr = LogisticRegression(class_weight=class_weight)
X, y = df.loc[:, features], df.loc[:, dependent]
pipeline = mcu.clf_pipeline(logr, features, degree)
results.append(mcu.log_clf_model(pipeline, model_desc, X, y, features, degree=degree))

In [None]:
model_desc = 'Logistic Regression with Class Weights'
degree = 2
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs','revol_bal', 'loan_amnt', 'grade', 'term', 'installment']
class_weight = {0: 0.23947368421052628, 1: 0.7605263157894737}
logr = LogisticRegression(class_weight=class_weight)
X, y = df.loc[:, features], df.loc[:, dependent]
pipeline = mcu.clf_pipeline(logr, features, degree)
results.append(mcu.log_clf_model(pipeline, model_desc, X, y, features, degree=degree))

In [None]:
model_desc = 'Logistic Regression with Class Weights'
degree = 3
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs','revol_bal', 'loan_amnt', 'grade', 'term', 'installment']
class_weight = {0: 0.23947368421052628, 1: 0.7605263157894737}
logr = LogisticRegression(class_weight=class_weight)
X, y = df.loc[:, features], df.loc[:, dependent]
pipeline = mcu.clf_pipeline(logr, features, degree)
results.append(mcu.log_clf_model(pipeline, model_desc, X, y, features, degree=degree))

In [None]:
model_desc = 'Logistic Regression with Class Weights'
degree = 3
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs','revol_bal', 'loan_amnt', 'grade', 'term', 'installment',
            'addr_state']
class_weight = {0: 0.23947368421052628, 1: 0.7605263157894737}
logr = LogisticRegression(class_weight=class_weight)
X, y = df.loc[:, features], df.loc[:, dependent]
pipeline = mcu.clf_pipeline(logr, features, degree)
results.append(mcu.log_clf_model(pipeline, model_desc, X, y, features, degree=degree))

In [None]:
(mcu.results_to_df(results)
 .pipe(mcu.scores_formatted)
 .head(10))

<a id="knn"></a>
# K-Nearest Neighbors

Next let's try the K-Nearest Neighbors algorithm on the data. We'll pick the features that have the highest absolute correlation with default.

In [None]:
(df.corr()
 .loc[:, ['default']]
 .drop('default', axis=0)
 .rename(columns={'default': 'correlation'})
 .assign(correlation_abs=lambda x: x.correlation.abs())
 .sort_values('correlation_abs', ascending=False)
 .head(10))

In [None]:
features, dependent = ['out_prncp', 'int_rate'], 'default'
X, y = df.loc[:, features], df.loc[:, dependent]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11,
                                                    stratify=y)
# Fit the scaler on the training split only so test-set statistics don't leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
n_neighbors = [2, 4, 8, 16, 32]
train_results = list()
test_results = list()
for neighbors in n_neighbors:
    knc = KNeighborsClassifier(n_neighbors=neighbors, n_jobs=-1)
    knc.fit(X_train, y_train)
    y_train_hat = knc.predict(X_train)
    f1_train_score = metrics.f1_score(y_train, y_train_hat)
    train_results.append(f1_train_score)
    y_test_hat = knc.predict(X_test)
    f1_test_score = metrics.f1_score(y_test, y_test_hat)
    test_results.append(f1_test_score)
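
An alternative to scoring each `k` on a single train/test split is to wrap the scaler and the classifier in a pipeline and cross-validate it, so the scaler is refit inside every fold. This is only a sketch on synthetic data: the `make_classification` inputs and the step names `scale` and `knn` are assumptions, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature, imbalanced data standing in for the loan features
X_demo, y_demo = make_classification(n_samples=500, n_features=2, n_informative=2,
                                     n_redundant=0, weights=[0.8, 0.2],
                                     random_state=11)

cv_f1 = {}
for k in [2, 4, 8, 16, 32]:
    # Scaling lives inside the pipeline, so each fold scales on its own training part
    model = Pipeline([('scale', StandardScaler()),
                      ('knn', KNeighborsClassifier(n_neighbors=k))])
    cv_f1[k] = cross_val_score(model, X_demo, y_demo, scoring='f1', cv=5).mean()

best_k = max(cv_f1, key=cv_f1.get)
print(cv_f1, best_k)
```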

In [None]:
plt.style.use('ggplot')
fig, ax = plt.subplots()
ax.plot(n_neighbors, train_results, c='blue', label='Training Set')
ax.plot(n_neighbors, test_results, c='red', label='Test Set')
ax.set(title='K-Nearest Neighbors\nDefault by Outstanding Principal and Interest Rate',
       xlabel="Number of Neighbors", ylabel="F1 Score")
ax.legend()  # without this the line labels are never drawn
plt.show()

Our best KNN is one with 2 neighbors. Let's check out a visual of this for funzies.

<a id="rf"></a>
# Random Forest

We did some great work with our logistic regression modeling, but let's see if we can squeeze out a better F1 score with a random forest.

In [None]:
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs','revol_bal', 'loan_amnt', 'grade', 'term', 'installment', 'addr_state']
dependent = 'default'
model_name = 'Random Forest'
X, y = df.loc[:, features], df.loc[:, dependent]
rf = RandomForestClassifier()
pipeline = mcu.clf_pipeline(rf, features, 1)
mcu.log_clf_model(pipeline, model_name, X, y, features, degree=1)

### Random Forest Hyperparameter Grid Search

In [None]:
""" Parameters """
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs','revol_bal', 'loan_amnt', 'grade', 'term', 'installment', 'addr_state']
degree = 1
X, y = df.loc[:, features], df.loc[:, dependent]

""" Preprocessing """
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11,
                                                    stratify=y)
transformer_list = mcu.feature_transformer_list(features, degree)
feats = FeatureUnion(transformer_list=transformer_list)
X_train_trans = feats.fit_transform(X_train)
rf = RandomForestClassifier()
rf.fit(X_train_trans, y_train)

""" Grid Search Hyperparameters """
# Number of trees in random forest
n_estimators = list(range(200, 2000, 100))
# Number of features to consider at every split
# ('auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [None] + list(range(10, 110, 10))
# Minimum number of samples required to split a node
min_samples_split = [2**r for r in range(1, 4)]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2**r for r in range(0, 3)]
# Whether to draw bootstrap samples for each tree (values assumed here)
bootstrap = [True, False]

""" Grid Search """
hyperparameters = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}
gs = GridSearchCV(rf, hyperparameters, scoring='f1', cv=5)
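
Before fitting a grid like this, it is worth counting how many models it implies. The sketch below rebuilds the grid with scikit-learn's `ParameterGrid`, assuming two values for `bootstrap` and two for `max_features`. It then shows `RandomizedSearchCV` as one possible way to sample the space instead of exhausting it. The synthetic data, the shrunken `n_estimators` list, and `n_iter=5` are assumptions chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

full_grid = {
    'n_estimators': list(range(200, 2000, 100)),
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None] + list(range(10, 110, 10)),
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],  # assumed values
}

# Every combination is one candidate; 5-fold CV fits each candidate five times
n_candidates = len(ParameterGrid(full_grid))
n_fits = n_candidates * 5
print(n_candidates, n_fits)

# Randomized search on a deliberately shrunken grid and synthetic data
demo_grid = dict(full_grid, n_estimators=[10, 20])
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=11)
search = RandomizedSearchCV(RandomForestClassifier(random_state=11), demo_grid,
                            n_iter=5, scoring='f1', cv=3, random_state=11)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With several thousand candidates and five folds each, the exhaustive search is expensive, so sampling a fixed number of combinations is a common compromise.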