# Classification Task
Objective: Predict which loans will default so that a lender can decide which loans to invest in.

In [None]:
import pandas as pd
import numpy as np
import pickle

In [None]:
from datetime import datetime

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
pd.set_option('display.max_columns', 100)

## About

This file contains complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter.

The data dictionary is in an accompanying file titled LCDataDictionary.xlsx

Source: https://www.lendingclub.com/info/download-data.action

In [None]:
loans = pd.read_csv('lending-club-loan-data/loan.csv')
loans.head(2)

In [None]:
loans.shape

In [None]:
loans.columns.values

There are 74 columns. At a glance, some of the important columns that would help to predict defaults are listed below.

### Exploratory Data Analysis

In [None]:
# summary of all variables
train.describe(include='all')

In [None]:
# univariate plot (histplot replaces distplot, which is deprecated since seaborn 0.11)
sns.histplot(train.loan_amnt, kde=True, stat="density");

The loan amount is not normally distributed.

In [None]:
sns.boxplot(x='loan_status', y='loan_amnt', data=train);

The median loan amount is higher for Charged Off loans than for Fully Paid loans.

In [None]:
# one panel per loan status (histplot replaces the deprecated distplot)
g = sns.FacetGrid(train, col='loan_status')
g.map(sns.histplot, "loan_amnt", kde=True, stat="density");

The loan amount follows a roughly similar distribution for both loan statuses.

In [None]:
# Create another column called 'default' to help with EDA
train['default'] = np.where(train.loan_status == 'Charged Off', 1, 0)

In [None]:
plt.figure(figsize=(15, 5))
sns.barplot(x="purpose", y="default", data=train, estimator=np.mean)
plt.xticks(rotation='vertical');

Small business loans have the highest proportion of defaults while loans for purchasing a car have the lowest.
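
The bar heights above are simply the mean of the 0/1 `default` column within each purpose. A minimal sketch of the same calculation with `groupby`, using a made-up toy frame (the values below are illustrative only and are not taken from the Lending Club data):

```python
import pandas as pd

# Toy stand-in for `train`; values are invented for illustration.
toy = pd.DataFrame({
    "purpose": ["car", "car", "small_business", "small_business", "credit_card"],
    "default": [0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the proportion of defaults in each group.
default_rate = (
    toy.groupby("purpose")["default"]
    .mean()
    .sort_values(ascending=False)
)
print(default_rate)
```

Sorting the result makes it easy to read off which purposes sit at either end of the ranking.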

### Task for Students:
* Explore each variable in the data set further.

In [None]:
# Remove observations whose home_ownership is "ANY"
loans_df = loans_df.query('home_ownership != "ANY"')

In [None]:
loans_df.home_ownership.value_counts()

#### Verification Status

In [None]:
loans_df.verification_status.value_counts()

#### Credit Line Time
The length of an applicant's credit history is an important predictor of default, so we calculate the number of days between `issue_d` and `earliest_cr_line`.

In [None]:
loans_df.issue_d.value_counts()

In [None]:
loans_df.earliest_cr_line.value_counts()

Since the issue date and earliest credit line variables are stored as strings rather than datetimes, we first convert them to datetime.

In [None]:
loans_df.issue_d = pd.to_datetime(loans_df.issue_d)
loans_df.issue_d.value_counts()

In [None]:
loans_df.earliest_cr_line = pd.to_datetime(loans_df.earliest_cr_line)
loans_df.earliest_cr_line.value_counts()

In [None]:
loans_df['credit_line_days'] = (loans_df['issue_d'] - loans_df['earliest_cr_line']).dt.days
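
The conversion and subtraction above can be sketched end to end on toy values. This assumes the dates are stored as strings such as `Dec-2011` (abbreviated month and year); that format is an assumption to check against the actual file. Passing an explicit `format` avoids relying on pandas to infer it.

```python
import pandas as pd

# Toy values in an assumed "Mon-YYYY" layout; not taken from the real file.
toy = pd.DataFrame({
    "issue_d": ["Dec-2011", "Jan-2015"],
    "earliest_cr_line": ["Jan-1985", "Apr-1999"],
})

# Parse both columns with an explicit format string.
for col in ["issue_d", "earliest_cr_line"]:
    toy[col] = pd.to_datetime(toy[col], format="%b-%Y")

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
toy["credit_line_days"] = (toy["issue_d"] - toy["earliest_cr_line"]).dt.days
print(toy)
```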

Since we no longer need the `earliest_cr_line` and `issue_d` variables, we drop them.

In [None]:
# Keep only columns whose non-null count exceeds a (1 - frac) share of the rows
loans_df = loans_df.loc[:, loans_df.notnull().sum() > int(len(loans_df) * (1 - frac))]

In [None]:
loans_df.head()

In [None]:
loans_df.shape

In [None]:
loans_df.isnull().sum()

Variables such as `mths_since_last_delinq`, `mths_since_last_record`, `revol_util` and `total_rev_hi_lim` still have `NA` values. These will be replaced later with the column means of the training set.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(loans_df, test_size=0.30, random_state=1)

In [None]:
train.shape

In [None]:
test.shape

In [None]:
imp_cols = ['loan_amnt', 'term', 'int_rate', 'installment', 'emp_length', 'home_ownership', 
           'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'dti', 'delinq_2yrs', 'earliest_cr_line', 
           'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
           'total_acc', 'acc_now_delinq', 'max_bal_bc','all_util', 'total_rev_hi_lim', 'inq_fi', 'inq_last_12m']

Some excluded columns, such as `desc` and `emp_title`, contain unstructured text. Others, such as `funded_amnt`, are only created after the loan has been invested in, so they are not available at the time the lender makes a decision.

#### Task For Students
Explore each variable in the data dictionary and work out why it was or was not included.

In [None]:
# Subset the original data frame to the columns listed above.
# .copy() gives an independent frame, so later assignments do not act on a view of `loans`.
loans_df = loans[imp_cols].copy()
del loans

The number of columns in the data frame has been reduced from the original 74.

## Feature Engineering
We now engineer a few of the variables.

#### Term

In [None]:
loans_df.term.value_counts()

#### Employment Length

In [None]:
loans_df.emp_length.value_counts()

We could leave `emp_length` as categorical or ordinal data, but since each level corresponds to a known number of years, we convert it to numerical data instead.
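
A minimal sketch of this conversion on a toy Series. As in the notebook's `replace_dict`, both `< 1 year` and `n/a` are sent to 0 and `10+ years` is capped at 10; those are modelling choices, not the only option.

```python
import pandas as pd

# Toy sample of emp_length labels.
emp_length = pd.Series(["10+ years", "< 1 year", "3 years", "n/a", "1 year"])

# Partial mapping covering only the labels in this toy sample.
mapping = {"< 1 year": 0, "n/a": 0, "1 year": 1, "3 years": 3, "10+ years": 10}

# Series.map returns NaN for any label missing from the mapping,
# which makes unexpected categories easy to spot.
emp_length_num = emp_length.map(mapping)
print(emp_length_num.tolist())
```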

In [None]:
loans_df.drop(axis=1, labels=['earliest_cr_line', 'issue_d'], inplace=True)

#### Loan Status

In [None]:
loans_df.loan_status.value_counts()

In [None]:
# consider only loans that are fully paid, charged off or default.
loans_df = loans_df.query('loan_status == "Fully Paid" | loan_status == "Charged Off" | loan_status == "Default"')

In [None]:
# Replace Default with Charged Off so there are only 2 classes.
# Assign the result back: in-place replace on a selected column may not update the frame.
loans_df['loan_status'] = loans_df['loan_status'].replace('Default', 'Charged Off')

#### Find the number of NA values under each column.

#### Drop columns which have more than 90% NA values.

In [None]:
frac = 0.9
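
A sketch of the column filter on a toy frame, assuming the same threshold `frac = 0.9`. It uses the share of missing values per column (`isnull().mean()`), which is a slightly different way of writing the same idea as the non-null count used in this notebook. The column `mostly_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

frac = 0.9

# Toy frame: one complete column, one 95% missing, one 25% missing.
toy = pd.DataFrame({
    "loan_amnt": np.arange(20, dtype=float),
    "mostly_missing": [1.0] + [np.nan] * 19,
    "some_missing": [np.nan] * 5 + [1.0] * 15,
})

# Share of missing values in each column.
na_share = toy.isnull().mean()

# Keep columns whose missing share does not exceed the threshold.
kept = toy.loc[:, na_share <= frac]
print(kept.columns.tolist())
```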

In [None]:
replace_dict = {'10+ years':10, '2 years':2, '< 1 year':0, '3 years':3, '1 year':1, '5 years':5, '4 years':4, 'n/a':0, 
                '7 years':7, '8 years':8, '6 years':6, '9 years':9}

In [None]:
# Assign the result back rather than replacing in place on a selected column
loans_df['emp_length'] = loans_df['emp_length'].replace(replace_dict)

#### Home Ownership

Since only a few observations have home ownership `ANY`, we remove them.

In [None]:
X_test = pd.get_dummies(columns=['term', 'home_ownership', 'verification_status', 'purpose'], data=X_test)

In [None]:
X_test.columns.values
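
When the training and testing sets are one-hot encoded separately, a category that appears in only one of them leads to mismatched columns. A minimal sketch of one way to guard against this, using toy frames (the names `X_train_toy` and `X_test_toy` are hypothetical): reindex the encoded test set to the training columns.

```python
import pandas as pd

# Toy data: "wedding" appears in training but not in testing.
X_train_toy = pd.DataFrame({"loan_amnt": [1000, 2000, 3000],
                            "purpose": ["car", "wedding", "house"]})
X_test_toy = pd.DataFrame({"loan_amnt": [1500, 2500],
                           "purpose": ["car", "house"]})

X_train_enc = pd.get_dummies(X_train_toy, columns=["purpose"])
X_test_enc = pd.get_dummies(X_test_toy, columns=["purpose"])

# Align test to the training columns: missing dummies become all-zero columns,
# and any dummy seen only in test is dropped.
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)
print(X_test_enc.columns.tolist())
```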

#### Replace NA values with the mean in the training and testing sets.

In [None]:
X_train.isnull().sum()

In [None]:
X_train.fillna(X_train.mean(), inplace=True)

In [None]:
X_test.fillna(X_train.mean(), inplace=True)
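
Note that both sets are filled with the training means, so no information from the testing set leaks into preprocessing. A toy sketch of the same pattern, with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frames with one missing value each.
X_train_toy = pd.DataFrame({"revol_util": [10.0, 30.0, np.nan]})
X_test_toy = pd.DataFrame({"revol_util": [np.nan, 50.0]})

# Compute the column means once, on the training rows only.
train_means = X_train_toy.mean()

# Reuse the same means for both sets.
X_train_filled = X_train_toy.fillna(train_means)
X_test_filled = X_test_toy.fillna(train_means)
print(X_test_filled)
```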

In [None]:
X_test.isnull().sum()

In [None]:
y_train.value_counts()

In [None]:
y_train = np.where(y_train == 'Charged Off', 1, 0)

In [None]:
y_test = np.where(y_test == 'Charged Off', 1, 0)

### Modelling
Fit a random forest model, using accuracy as the performance metric.
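
A minimal sketch of this step with scikit-learn, run on synthetic data so that it is self-contained. The arrays below are hypothetical stand-ins for `X_train`, `y_train`, `X_test` and `y_test`, and the hyperparameters (`n_estimators=100`, `random_state=1`) are illustrative rather than tuned.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(200, 4))
y_train_toy = (X_train_toy[:, 0] > 0).astype(int)
X_test_toy = rng.normal(size=(50, 4))
y_test_toy = (X_test_toy[:, 0] > 0).astype(int)

# Fit the forest on the training set only.
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train_toy, y_train_toy)

# Accuracy is the share of test rows whose predicted class matches the true class.
y_pred = clf.predict(X_test_toy)
acc = accuracy_score(y_test_toy, y_pred)
print(f"accuracy: {acc:.3f}")
```

On the real data, it is worth comparing the accuracy against the share of the majority class from `y_train.value_counts()`, since a model that always predicts the majority class already reaches that score.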