## Introduction
The purpose of this notebook is to explore a very large data set with about 100 columns and over 420,000 rows of data.  We become familiar with the memory and processing power of our equipment (a laptop in my case) and are then challenged to pare down the data to the smallest amount possible while maintaining 90% accuracy in a 10-fold cross validation.

First, we import necessary modules and methods.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
%matplotlib inline

## Import data

In [2]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(
    'https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1',
    skipinitialspace=True,
    header=1
)

# Note the warning about dtypes.

  interactivity=interactivity, compiler=compiler, result=result)
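The dtype warning usually means pandas inferred different types for the same column in different chunks of a large file. Here that is likely caused by the text footer rows visible in the tail of the frame. A minimal sketch of two ways to avoid the warning, assuming a tiny in-memory stand-in for the file (the CSV text below is made up for illustration and is not the real data):

```python
import io
import pandas as pd

# Hypothetical stand-in: a title line, a header row, two data rows, one footer row.
csv_text = (
    "Title line that is skipped by header=1\n"
    "id,loan_amnt,int_rate\n"
    "1,1000, 7.49%\n"
    "2,2500, 14.85%\n"
    "Total amount funded: 3500,,\n"
)

demo = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    header=1,
    dtype={"id": str},   # declare the mixed column as text up front
    low_memory=False,    # infer types from the whole file, not chunk by chunk
)
print(demo.dtypes)
print(demo)
```

Either option on its own is usually enough. Declaring `dtype` is the more explicit of the two, while `low_memory=False` trades memory for a single consistent type inference.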


In [3]:
y2015.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,68009401,72868139.0,16000.0,16000.0,16000.0,60 months,14.85%,379.39,C,C5,...,0.0,2.0,78.9,0.0,0.0,2.0,298100.0,31329.0,281300.0,13400.0
1,68354783,73244544.0,9600.0,9600.0,9600.0,36 months,7.49%,298.58,A,A4,...,0.0,2.0,100.0,66.7,0.0,0.0,88635.0,55387.0,12500.0,75635.0
2,68466916,73356753.0,25000.0,25000.0,25000.0,36 months,7.49%,777.55,A,A4,...,0.0,0.0,100.0,20.0,0.0,0.0,373572.0,68056.0,38400.0,82117.0
3,68466961,73356799.0,28000.0,28000.0,28000.0,36 months,6.49%,858.05,A,A2,...,0.0,0.0,91.7,22.2,0.0,0.0,304003.0,74920.0,41500.0,42503.0
4,68495092,73384866.0,8650.0,8650.0,8650.0,36 months,19.89%,320.99,E,E3,...,0.0,12.0,100.0,50.0,1.0,0.0,38998.0,18926.0,2750.0,18248.0


In [4]:
y2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
421092,36271333,38982739.0,13000.0,13000.0,13000.0,60 months,15.99%,316.07,D,D2,...,0.0,3.0,100.0,50.0,1.0,0.0,51239.0,34178.0,10600.0,33239.0
421093,36490806,39222577.0,12000.0,12000.0,12000.0,60 months,19.99%,317.86,E,E3,...,1.0,2.0,95.0,66.7,0.0,0.0,96919.0,58418.0,9700.0,69919.0
421094,36271262,38982659.0,20000.0,20000.0,20000.0,36 months,11.99%,664.2,B,B5,...,0.0,1.0,100.0,50.0,0.0,1.0,43740.0,33307.0,41700.0,0.0
421095,Total amount funded in policy code 1: 6417608175,,,,,,,,,,...,,,,,,,,,,
421096,Total amount funded in policy code 2: 1944088810,,,,,,,,,,...,,,,,,,,,,


As you can see, there are over 420,000 rows and many, many columns.  The next script gathers all of the categorical (object) columns and prints each one with its number of unique values.  These data types can cause problems when using sklearn libraries, so we need to either remove them or convert them to dummy numerical values (we will use both methods further down).

## Clean and explore data

In [5]:
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

id
421097
term
2
int_rate
110
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
1
url
421095
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
25
next_pymnt_d
4
last_credit_pull_d
26
application_type
2
verification_status_joint
3
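For the low-cardinality columns in the listing above, conversion to dummy values can be sketched as follows. The toy frame is an assumption for illustration; only the column names and a few values are borrowed from the data shown earlier.

```python
import pandas as pd

# Hypothetical three-row frame using column names from the loan data.
toy = pd.DataFrame({
    "loan_amnt": [16000.0, 9600.0, 25000.0],
    "term": ["60 months", "36 months", "36 months"],
    "grade": ["C", "A", "A"],
})

# Non-numeric columns cannot be passed directly to most sklearn estimators.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# get_dummies replaces each listed column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=text_cols)
print(encoded.columns.tolist())
```

A column such as `emp_title`, with over 120,000 unique values, would add that many indicator columns, which is why high-cardinality columns are better dropped than encoded.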


We need to remove some of the data in order to avoid hardware memory problems and to make the data easier to work with.  We'll drop most of the rows, taking them from the end of the frame.

In [6]:
y2015.drop(y2015.tail(421000).index, inplace=True)

After testing this method multiple times, we are able to drop as many as 421,000 rows and still maintain an accuracy above 90% in the cross validation score, as you will see below.
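If memory is the main constraint, the extra rows do not have to be loaded at all. The sketch below, which uses a made-up in-memory CSV rather than the loan file, measures a frame with `memory_usage` and compares trimming after loading with reading only the first rows via `nrows`.

```python
import io
import pandas as pd

# Hypothetical 1,000-row CSV standing in for the large file.
csv_text = "id,loan_amnt\n" + "\n".join(f"{i},{1000 + i}" for i in range(1000))

# Option 1: load everything, measure it, then keep only the first 97 rows.
full = pd.read_csv(io.StringIO(csv_text))
full_bytes = full.memory_usage(deep=True).sum()
small = full.head(97)
small_bytes = small.memory_usage(deep=True).sum()

# Option 2: read only the first 97 rows, so the rest never reaches memory.
small_direct = pd.read_csv(io.StringIO(csv_text), nrows=97)

print(f"full: {full_bytes} bytes, trimmed: {small_bytes} bytes")
```

Both options keep the first rows of the file in their original order, which matches what dropping from the tail does. A random sample (`DataFrame.sample`) would be the alternative if the file order is not random.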

In [7]:
y2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
92,68537594,73427429.0,5000.0,5000.0,5000.0,36 months,7.91%,156.48,A,A5,...,0.0,0.0,100.0,0.0,1.0,0.0,21690.0,11665.0,10100.0,8190.0
93,68374676,73264437.0,10000.0,10000.0,10000.0,60 months,16.59%,246.33,D,D2,...,0.0,2.0,96.9,0.0,1.0,0.0,85929.0,65214.0,21300.0,62929.0
94,68376217,73266027.0,23100.0,23100.0,23100.0,60 months,20.50%,618.46,E,E4,...,0.0,3.0,87.5,85.7,0.0,0.0,102524.0,55048.0,21700.0,80824.0
95,68547679,73437541.0,18500.0,18500.0,18500.0,60 months,12.88%,419.8,C,C2,...,0.0,3.0,89.5,16.7,0.0,0.0,143748.0,28564.0,27500.0,17944.0
96,68446784,73336607.0,12000.0,12000.0,12000.0,36 months,10.78%,391.62,B,B4,...,0.0,2.0,90.5,20.0,0.0,0.0,59420.0,31915.0,21200.0,32820.0


Just 97 rows remain, which is all we will need.

Convert ID and Interest Rate columns to numerical values in order to use prediction methods in sklearn.

In [8]:
# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values.
# axis is passed by keyword; recent pandas versions reject the positional form.
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

In [9]:
y2015.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,emp_length,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,68009401,72868139.0,16000.0,16000.0,16000.0,60 months,14.85,379.39,C,10+ years,...,0.0,2.0,78.9,0.0,0.0,2.0,298100.0,31329.0,281300.0,13400.0
1,68354783,73244544.0,9600.0,9600.0,9600.0,36 months,7.49,298.58,A,8 years,...,0.0,2.0,100.0,66.7,0.0,0.0,88635.0,55387.0,12500.0,75635.0
2,68466916,73356753.0,25000.0,25000.0,25000.0,36 months,7.49,777.55,A,10+ years,...,0.0,0.0,100.0,20.0,0.0,0.0,373572.0,68056.0,38400.0,82117.0
3,68466961,73356799.0,28000.0,28000.0,28000.0,36 months,6.49,858.05,A,10+ years,...,0.0,0.0,91.7,22.2,0.0,0.0,304003.0,74920.0,41500.0,42503.0
4,68495092,73384866.0,8650.0,8650.0,8650.0,36 months,19.89,320.99,E,8 years,...,0.0,12.0,100.0,50.0,1.0,0.0,38998.0,18926.0,2750.0,18248.0
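The percent-to-number conversion relies on `errors='coerce'`, which silently turns anything unparseable into NaN. A small sketch of that behavior, using made-up values, shows why it is worth counting the NaNs afterwards:

```python
import pandas as pd

# Hypothetical values: two valid rates, one missing, one malformed.
rates = pd.Series(["14.85%", "7.49%", None, "n/a"])

# Strip the percent sign, then convert; unparseable entries become NaN
# instead of raising an error.
numeric = pd.to_numeric(rates.str.strip("%"), errors="coerce")
lost = int(numeric.isna().sum())

print(numeric.tolist())
print("values lost to coercion:", lost)
```

The same applies to the `id` column, where the two footer rows at the end of the original file would be coerced to NaN if they were still present.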


Let's check to see some of our columns left in play and their corresponding data types.

In [10]:
y2015.dtypes

id                                  int64
member_id                         float64
loan_amnt                         float64
funded_amnt                       float64
funded_amnt_inv                   float64
term                               object
int_rate                          float64
installment                       float64
grade                              object
emp_length                         object
home_ownership                     object
annual_inc                        float64
verification_status                object
issue_d                            object
loan_status                        object
pymnt_plan                         object
purpose                            object
title                              object
dti                               float64
delinq_2yrs                       float64
inq_last_6mths                    float64
mths_since_last_delinq            float64
mths_since_last_record            float64
open_acc                          

Check what we are trying to predict, which is the status of the loan.

In [11]:
y2015['loan_status'].head()

0       Current
1       Current
2    Fully Paid
3       Current
4    Fully Paid
Name: loan_status, dtype: object

In [12]:
# print the number of columns in our dataset
len(y2015.columns)

103

## Building the model
So, we have 103 columns and 97 rows.  (The number of rows to delete was found through trial and error: first 150,000, then eventually 421,000.)  We need to figure out which columns or features explain the most variance in our dependent variable, 'loan_status' in this case.

We use the SelectKBest method in sklearn with the number of features set to 2.  We print the shape of the final data that the RandomForestClassifier model uses and, as you can see, SelectKBest determined that the best 2 features were 'total_rec_prncp' and 'last_pymnt_amnt'.

The cross val score is still above 95%.

In [13]:
rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
# One-hot encode the remaining categorical columns, then drop columns with missing values.
X = pd.get_dummies(X)
X = X.dropna(axis=1)

# Keep the 2 features with the highest chi-squared score against the target.
selector = SelectKBest(chi2, k=2)
selector.fit(X, Y)

X_new = selector.transform(X)
print(X_new.shape)

# Names of the selected features.
vector_names = list(X.columns[selector.get_support(indices=True)])
print(vector_names)

score = cross_val_score(rfc, X_new, Y, cv=10)
print(score)
print(score.mean())

(97, 2)
['total_rec_prncp', 'last_pymnt_amnt']
[0.91666667 0.90909091 0.90909091 0.9        1.         0.88888889
 1.         1.         1.         1.        ]
0.9523737373737374
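One caveat about the cell above: the selector is fit on all 97 rows before cross validation, so the held-out rows in each fold have already influenced which features were chosen. A hedged sketch of the alternative is to place the selector and the classifier in one pipeline so that selection is repeated inside every training fold. The data here is synthetic (`make_classification`), not the loan data, and the estimator settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the loan data (hypothetical).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=4, random_state=0
)
# chi2 only accepts non-negative values, so shift each column to start at 0.
X_demo = X_demo - X_demo.min(axis=0)

# The selector is refit on the training part of every fold.
pipe = make_pipeline(
    SelectKBest(chi2, k=2),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
print(scores)
print(scores.mean())
```

With only 97 rows the difference between the two approaches can be noticeable, so the pipeline score is the safer number to report.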




Print the 2 most important features and the dependent variable which we are trying to predict just to see what some of the data looks like.

In [14]:
y2015[['total_rec_prncp', 'last_pymnt_amnt', 'loan_status']].head(10)

Unnamed: 0,total_rec_prncp,last_pymnt_amnt,loan_status
0,2331.12,379.39,Current
1,2964.31,298.58,Current
2,25000.0,20807.39,Fully Paid
3,8736.23,858.05,Current
4,8650.0,8251.42,Fully Paid
5,3856.31,471.77,Current
6,4553.34,678.49,Current
7,18000.0,18004.9,Fully Paid
8,5099.11,829.9,Current
9,10400.0,10128.96,Fully Paid


We are then asked to build the model without the features that closely track 'loan_status', to compare the results.  Therefore, in the steps below, we will delete columns related to payment amount or outstanding principal.

In [15]:
# Drop columns related to payment amount or outstanding principal (note: this includes 'last_pymnt_amnt').
y2015.drop(['last_pymnt_amnt', 'total_rec_prncp', 'total_pymnt_inv', 'out_prncp_inv', 'total_pymnt', 'out_prncp'], axis=1, inplace=True)

Build the model again.

In [16]:
rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
# One-hot encode the remaining categorical columns, then drop columns with missing values.
X = pd.get_dummies(X)
X = X.dropna(axis=1)

# Keep the 2 features with the highest chi-squared score against the target.
selector = SelectKBest(chi2, k=2)
selector.fit(X, Y)

X_new = selector.transform(X)
print(X_new.shape)

# Names of the selected features.
vector_names = list(X.columns[selector.get_support(indices=True)])
print(vector_names)

score = cross_val_score(rfc, X_new, Y, cv=10)
print(score)
print(score.mean())

(97, 2)
['annual_inc', 'total_bal_il']
[0.66666667 0.63636364 0.63636364 0.7        0.77777778 0.66666667
 0.55555556 0.77777778 0.77777778 0.75      ]
0.6944949494949495




In [18]:
y2015[['annual_inc', 'total_bal_il', 'loan_status']].tail(10)

Unnamed: 0,annual_inc,total_bal_il,loan_status
87,40000.0,0.0,Current
88,41600.0,27723.0,Current
89,135000.0,16182.0,Fully Paid
90,53750.0,6468.0,Fully Paid
91,50000.0,6600.0,Current
92,32000.0,6113.0,Current
93,30000.0,59096.0,Fully Paid
94,110000.0,34454.0,Charged Off
95,70000.0,12681.0,Current
96,40000.0,20907.0,Fully Paid


## Evaluate the model
As you can see, after dropping 6 features related to payment amount or outstanding principal we have an accuracy of about **70%**.
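An accuracy figure is easier to judge next to a baseline that always predicts the most common class. The sketch below uses made-up label counts (60/30/10), not the notebook's actual class distribution, to show how such a baseline can be scored with the same 10-fold procedure.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical, imbalanced labels; the real counts are not shown in this notebook.
labels = np.array(["Current"] * 60 + ["Fully Paid"] * 30 + ["Charged Off"] * 10)
features = np.zeros((len(labels), 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen in the training fold.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, features, labels, cv=10)
print(baseline_scores.mean())
```

If the majority class in the 97 rows is large, a 70% score may be only a modest improvement over this kind of baseline. Running `y2015['loan_status'].value_counts()` would show the actual class mix.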

## Conclusion and discussion
Before dropping the six payment and principal features, the model accuracy was around **95%**; dropping them reduced accuracy to about 70%.  As a next step, we would drop fewer rows (maybe 420,000 instead of 421,000) and rerun the whole model to see if the 70% accuracy increases.
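The effect of keeping more rows can be checked systematically by scoring the same model at several sample sizes. This is a minimal sketch on synthetic data; the sizes, the estimator settings, and the `results` dictionary are illustrative assumptions, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the loan data (hypothetical).
X_all, y_all = make_classification(
    n_samples=2000, n_features=10, n_informative=4, random_state=0
)

results = {}
for n_rows in (100, 500, 2000):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    # Score the same model on the first n_rows rows only.
    scores = cross_val_score(model, X_all[:n_rows], y_all[:n_rows], cv=10)
    results[n_rows] = scores.mean()
    print(n_rows, round(results[n_rows], 3))
```

On the real data, the same loop over `y2015.head(n)` for a few values of `n` would show whether accuracy is still improving as rows are added or has already levelled off.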

We see that annual income became an important feature after dropping the payment amount and outstanding principal features.

After working with this model and experiencing all of the memory errors, we see the importance of paring down the original data while keeping the features that matter.