# Lending Club Loan Data Modeling

In this section we will attempt to determine the best model to predict whether or not a borrower will default in the Lending Club Loan data.

Before beginning, we'll define our **_Satisficing_** and **_Optimizing_** metrics, as Andrew Ng recommends in the _deeplearning.ai_ course _Structuring Machine Learning Projects_. The optimizing metric is the one we try to make as good as possible, while a satisficing metric only has to clear a threshold.
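To make the two kinds of metric concrete, here is a minimal sketch of how they combine when picking a model. Everything in it is an assumption for illustration: the scores and timings are made-up placeholders rather than results from this notebook, and `pick_model`, `fit_seconds`, and the 120-second limit are hypothetical names and values.

```python
# Hypothetical validation results: the numbers are placeholders, not real results.
candidates = [
    {"name": "logistic_regression", "f1": 0.41, "fit_seconds": 12.0},
    {"name": "knn", "f1": 0.38, "fit_seconds": 95.0},
    {"name": "random_forest", "f1": 0.44, "fit_seconds": 240.0},
]


def pick_model(results, optimizing="f1", satisficing="fit_seconds", limit=120.0):
    """Return the result with the best optimizing metric among those
    whose satisficing metric stays within the limit (None if none do)."""
    acceptable = [r for r in results if r[satisficing] <= limit]
    if not acceptable:
        return None
    return max(acceptable, key=lambda r: r[optimizing])


best = pick_model(candidates)
print(best["name"])
```

The satisficing metric acts as a filter and the optimizing metric as the tie-breaker among whatever passes the filter, so a model with the highest score can still lose if it misses the threshold.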

After that, we'll get down and dirty with some data cleaning to get this dataset in tip-top shape and ready to be modeled.

We then start the modeling, beginning with a **_Logistic Regression_** model, using **_Forward Selection_** to determine the features. We will then try a **_K-Nearest Neighbors Classifier_** and end with a **_Random Forest_** and some hyperparameter tuning.
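For readers who haven't met forward selection before, the idea can be sketched as a greedy loop: start with no features, add whichever single feature improves the score most, and stop when nothing helps. The sketch below is a generic illustration, not the notebook's actual implementation. `forward_select`, `toy_score`, and the values in `toy_value` are hypothetical; in practice the scoring function would be something like a cross-validated F1 score of the logistic regression on the chosen columns.

```python
def forward_select(candidates, score_fn, min_gain=0.0):
    """Greedy forward selection.

    candidates: list of feature names to choose from.
    score_fn:   callable taking a list of feature names and returning a
                score where higher is better.
    min_gain:   stop when the best addition improves the score by no more
                than this amount.
    """
    selected = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current selection.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score - best_score <= min_gain:
            break  # no candidate improves the score enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Toy stand-in for a real model score: made-up per-feature value minus a
# small penalty per feature, so that weak features are not worth adding.
toy_value = {"int_rate": 0.30, "grade": 0.25, "dti": 0.10, "term": 0.02}


def toy_score(features):
    return sum(toy_value[f] for f in features) - 0.05 * len(features)


selected, score = forward_select(list(toy_value), toy_score)
print(selected)
```

Because the search is greedy, it evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing combinations that only help together.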

Afterward, we'll wrap it all up with a summary of what we have learned.

First though, let's do our usual import of a billion packages so we're ready to machine learn.

In [3]:
import os
import pandas as pd
import numpy as np
import re
import itertools
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
import mcnulty_util as mcu

In [4]:
df = mcu.mcnulty_preprocessing()

Initiating MAXIMUM data munging power
Luther Preprocessing Successful Woo Woo!



In [20]:
# Count defaults (1) vs. non-defaults (0), both as raw counts and as shares
gross_counts = df.default.value_counts()
normalized_counts = df.default.value_counts(normalize=True)
df_counts = pd.concat([gross_counts, normalized_counts], axis=1)
df_counts.columns = ['Gross Count', 'Percentage of Total']
# Format for display; assigning whole columns avoids dtype clashes from .loc
df_counts['Gross Count'] = df_counts['Gross Count'].map('{:,d}'.format)
df_counts['Percentage of Total'] = df_counts['Percentage of Total'].map('{:0.2%}'.format)
print(df_counts.to_html())

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Gross Count</th>
      <th>Percentage of Total</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>207,721</td>
      <td>78.16%</td>
    </tr>
    <tr>
      <th>1</th>
      <td>58,056</td>
      <td>21.84%</td>
    </tr>
  </tbody>
</table>
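With roughly 78% of loans not defaulting, plain accuracy is a misleading yardstick. As a quick sanity check, the sketch below uses only the two counts from the table above to work out what two trivial classifiers would score. The variable names are my own.

```python
# Counts taken from the table above.
n_non_default, n_default = 207_721, 58_056
total = n_non_default + n_default

# Trivial classifier A: always predict "no default".
# It is right on every non-default loan and wrong on every default.
accuracy_majority = n_non_default / total

# Trivial classifier B: always predict "default".
# Every default is caught (recall = 1), but precision equals the default rate.
precision = n_default / total
recall = 1.0
f1_all_default = 2 * precision * recall / (precision + recall)

print(f"Accuracy of always predicting no default: {accuracy_majority:.2%}")
print(f"F1 of always predicting default: {f1_all_default:.3f}")
```

Classifier A looks respectable on accuracy yet never identifies a single default, so its F1 for the default class is 0 by the usual convention. That is one reason to judge models on this dataset with a metric such as F1 rather than accuracy, and any real model should beat classifier B's F1.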


In [14]:
df_counts

   Gross Count  Percentage_Count
0       207721          0.781561
1        58056          0.218439


In [5]:
independents = [
    ['dti'],
    ['int_rate'],
    ['annual_inc'],
    ['loan_amnt'],
    ['revol_bal'],
    ['term'],
    ['delinq_2yrs'],
    ['home_ownership'],
    ['grade'],
    ['purpose'],
    ['emp_length']]
dependent = 'default'

<a id="#log_reg_hyperparams"></a>
## Hyperparameter Tuning with Grid Search

In [8]:
features = ['dti', 'int_rate', 'emp_length', 'home_ownership', 'purpose',
            'delinq_2yrs', 'revol_bal', 'loan_amnt', 'grade', 'term']
degree = 2
X, y = df.loc[:, features], df.loc[:, dependent]
# Hold out 20% for testing; stratify to keep the default rate equal in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11,
                                                    stratify=y)
pipeline = mcu.clf_pipeline(LogisticRegression(), features, degree)
# Candidate class weights: x on non-default (0) and 1 - x on default (1)
weight_space = np.linspace(0.05, 0.95, 20)
class_weights = [{0: x, 1: 1.0-x} for x in weight_space]
hyperparameters = dict(clf__class_weight=class_weights)
# 5-fold cross-validated grid search, scored on F1 for the default class
gs = GridSearchCV(pipeline, hyperparameters, scoring='f1', cv=5)
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('union',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('numeric',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('selector',
                                                                                         ItemSelector(key=['dti',
                                                                                                           'int_rate',
                                                                                                           'delinq_2yrs',
                                                                                                           'revol_bal',
                                   

In [15]:
print("Best Class Weights:\n{}".format(pd.DataFrame(gs.best_params_)))

Best Class Weights:
   clf__class_weight
0           0.239474
1           0.760526
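To put the winning weights in context, the sketch below rebuilds the same 20-point grid in plain Python (standing in for `np.linspace(0.05, 0.95, 20)`) and locates the reported best value in it. It then compares the implied weight ratio with the inverse class-frequency ratio from the class counts shown earlier. The comparison is my own addition, not something the grid search itself reports.

```python
# Rebuild the grid searched above: 20 evenly spaced weights for class 0.
n_points = 20
low, high = 0.05, 0.95
step = (high - low) / (n_points - 1)
weight_space = [low + i * step for i in range(n_points)]
class_weights = [{0: w, 1: 1.0 - w} for w in weight_space]

# Find the grid point matching the reported best class-0 weight.
reported_best_w0 = 0.239474
best = min(class_weights, key=lambda cw: abs(cw[0] - reported_best_w0))
best_index = class_weights.index(best)

# How much more a default (1) counts than a non-default (0) in the loss.
weight_ratio = best[1] / best[0]

# Ratio of non-defaults to defaults in the data, for comparison.
inverse_freq_ratio = 207_721 / 58_056

print(f"Grid index of best weights: {best_index}")
print(f"Weight ratio (default : non-default): {weight_ratio:.2f}")
print(f"Class count ratio (non-default : default): {inverse_freq_ratio:.2f}")
```

The best grid point weights each default a little over three times as heavily as a non-default, which is in the same ballpark as, though somewhat below, the ratio of class counts that a fully "balanced" weighting would use.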


In [17]:
print(pd.DataFrame(gs.best_params_).to_html())

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>clf__class_weight</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.239474</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0.760526</td>
    </tr>
  </tbody>
</table>
