In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We've talked about Random Forests. Now it's time to build one.

Here we'll use data from Lending Club to predict the state of a loan given some information about it. You can find the dataset [here](https://www.lendingclub.com/info/download-data.action). We'll use 2015 data. ([Thinkful mirror](https://www.dropbox.com/s/m7z42lubaiory33/LoanStats3d.csv?dl=0))

In [5]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(
    'LoanStats3d.csv',
    skipinitialspace=True,
    header=1
)

# Note the warning about dtypes.

  interactivity=interactivity, compiler=compiler, result=result)
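A dtype warning like this usually means a column mixes numbers with text. Here is a minimal sketch of one way to handle that, using a tiny in-memory stand-in for the real file. The footer text and the `toy` frame are made up for illustration and are not taken from the Lending Club data.

```python
import io
import pandas as pd

# Toy stand-in for the real file: a numeric id column followed by a text footer row.
# The footer wording is hypothetical.
raw = "id,loan_amnt\n1,1000\n2,2500\nTotal amount funded: 3500,\n"

# Read the ambiguous column as strings so pandas does not have to guess a dtype.
toy = pd.read_csv(io.StringIO(raw), dtype={'id': str})

# Coerce to numbers; anything that cannot be parsed (the footer) becomes NaN.
toy['id'] = pd.to_numeric(toy['id'], errors='coerce')
print(toy)
```

Declaring the dtype up front keeps the read deterministic, and `errors='coerce'` makes the bad rows easy to find afterwards with `isna()`.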


In [6]:
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

id
421097
term
2
int_rate
110
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
1
url
421095
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
25
next_pymnt_d
4
last_credit_pull_d
26
application_type
2
verification_status_joint
3


In [7]:
# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values.
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

Wondering what caused the dtype warning on the id column, which _should_ have contained only integers? Let's look at the end of the file.

In [8]:
y2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,emp_length,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
421092,36271333.0,38982739.0,13000.0,13000.0,13000.0,60 months,15.99,316.07,D,5 years,...,0.0,3.0,100.0,50.0,1.0,0.0,51239.0,34178.0,10600.0,33239.0
421093,36490806.0,39222577.0,12000.0,12000.0,12000.0,60 months,19.99,317.86,E,1 year,...,1.0,2.0,95.0,66.7,0.0,0.0,96919.0,58418.0,9700.0,69919.0
421094,36271262.0,38982659.0,20000.0,20000.0,20000.0,36 months,11.99,664.2,B,10+ years,...,0.0,1.0,100.0,50.0,0.0,1.0,43740.0,33307.0,41700.0,0.0
421095,,,,,,,,,,,...,,,,,,,,,,
421096,,,,,,,,,,,...,,,,,,,,,,


In [9]:
# Remove two summary rows at the end that don't actually contain data.
y2015 = y2015[:-2]
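Slicing off a fixed number of rows works when we know exactly how many footer rows there are. As a hedged alternative, assuming the footer rows are empty in every column, we could drop any row that is entirely missing. The `toy` frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two real rows (one with a single missing value) and two empty footer rows.
toy = pd.DataFrame({
    'id': [1.0, 2.0, np.nan, np.nan],
    'loan_amnt': [1000.0, np.nan, np.nan, np.nan],
})

# how='all' removes only rows where every column is NaN,
# so the partially filled second row is kept.
cleaned = toy.dropna(how='all')
print(cleaned)
```

This does not depend on the footer being exactly two rows long, but it only helps if those rows really are empty everywhere.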

Now this should be better. Let's try again.

In [10]:
pd.get_dummies(y2015)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,last_credit_pull_d_Nov-2016,last_credit_pull_d_Oct-2015,last_credit_pull_d_Oct-2016,last_credit_pull_d_Sep-2015,last_credit_pull_d_Sep-2016,application_type_INDIVIDUAL,application_type_JOINT,verification_status_joint_Not Verified,verification_status_joint_Source Verified,verification_status_joint_Verified
0,68009401.0,72868139.0,16000.0,16000.0,16000.0,14.85,379.39,48000.0,33.18,0.0,...,0,0,0,0,0,1,0,0,0,0
1,68354783.0,73244544.0,9600.0,9600.0,9600.0,7.49,298.58,60000.0,22.44,0.0,...,0,0,0,0,0,1,0,0,0,0
2,68466916.0,73356753.0,25000.0,25000.0,25000.0,7.49,777.55,109000.0,26.02,0.0,...,0,0,0,0,0,1,0,0,0,0
3,68466961.0,73356799.0,28000.0,28000.0,28000.0,6.49,858.05,92000.0,21.60,0.0,...,0,0,0,0,0,1,0,0,0,0
4,68495092.0,73384866.0,8650.0,8650.0,8650.0,19.89,320.99,55000.0,25.49,0.0,...,0,0,0,0,0,1,0,0,0,0
5,68506798.0,73396623.0,23000.0,23000.0,23000.0,8.49,471.77,64000.0,18.28,0.0,...,0,0,0,0,0,1,0,0,0,0
6,68566886.0,73456723.0,29900.0,29900.0,29900.0,12.88,678.49,65000.0,21.77,0.0,...,0,0,0,0,0,1,0,0,0,0
7,68577849.0,73467703.0,18000.0,18000.0,18000.0,11.99,400.31,112000.0,8.68,0.0,...,0,0,0,0,0,1,0,0,0,0
8,66310712.0,71035433.0,35000.0,35000.0,35000.0,14.85,829.90,110000.0,17.06,0.0,...,0,0,0,0,0,1,0,0,0,0
9,68476807.0,73366655.0,10400.0,10400.0,10400.0,22.45,289.91,104433.0,25.37,1.0,...,0,0,0,0,0,1,0,0,0,0


It finally works! We had to sacrifice sub grade, state, and description, but that's fine. If you want to include them, you could create dummies for those columns separately and then append them back to the dataframe.
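Creating dummies for one column separately and appending them back might look like the sketch below. It uses a small made-up frame rather than `y2015`, and `addr_state` stands in for any column you want to bring back.

```python
import pandas as pd

# Toy frame with one numeric column and one categorical column.
toy = pd.DataFrame({
    'loan_amnt': [1000, 2500, 4000],
    'addr_state': ['CA', 'NY', 'CA'],
})

# Build indicator columns for the single categorical column.
state_dummies = pd.get_dummies(toy['addr_state'], prefix='addr_state')

# Replace the original column with its dummies.
toy = pd.concat([toy.drop(columns='addr_state'), state_dummies], axis=1)
print(toy.columns.tolist())
```

Doing this one column at a time lets you check how many new columns each feature adds before committing to it.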

## Second Attempt

Now let's try this model again.

We're also going to drop NA columns, rather than impute, because our data is rich enough that we can probably get away with it.
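Before dropping columns with missing values, it can help to see how much is actually missing. The sketch below uses an invented three-column frame to show the check and the drop side by side.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one partly missing, one fully missing.
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [1, np.nan, 3],
    'c': [np.nan, np.nan, np.nan],
})

# Share of missing values per column (0.0 means complete).
missing_share = toy.isna().mean()
print(missing_share)

# axis=1 drops every column that contains at least one NaN.
kept = toy.dropna(axis=1)
print(kept.columns.tolist())
```

Note that `dropna(axis=1)` removes a column even if only a single value is missing, so a column like `b` is lost entirely. That is the trade-off we accept by not imputing.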

This model may take a few minutes to run.

In [11]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

cross_val_score(rfc, X, Y, cv=10)

KeyboardInterrupt: 

The score cross validation reports is the accuracy of the forest on each held-out fold. Here we're about 98% accurate.

That works pretty well, but there are a few potential problems. First, we didn't do much in the way of feature selection or model refinement, so the model includes a lot of features that we don't really need. Some of them are impressively useless.

There's also some variance in the scores. The fact that one fold gave us only 93% accuracy while others gave higher than 98% is concerning. This variance could be reduced by increasing the number of estimators (trees) in the forest. That will make it take even longer to run, however, and it is already quite slow.
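To see how the number of estimators affects the spread of fold scores, we can compare the mean and standard deviation for a small and a larger forest. This is a sketch on synthetic data from `make_classification`, not on the loan data, and the sizes (5 and 50 trees, 5 folds) are arbitrary choices to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for n in (5, 50):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    # Store the average accuracy and the spread across folds.
    results[n] = (scores.mean(), scores.std())

for n, (mean, std) in results.items():
    print(n, round(mean, 3), round(std, 3))
```

On the full dataset, passing `n_jobs=-1` to `RandomForestClassifier` trains trees on all available cores, which may offset some of the extra run time.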

## DRILL: Third Attempt

So here's your task. Get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

You'll want to do a few things in this process. First, dive into the data that we have and see which features are most important. This can be the raw features or the generated dummies. You may want to use PCA or correlation matrices.
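One way to see which features matter is the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data with made-up column names (`feature_0` through `feature_7`); on the loan data you would fit on your own `X` and `Y` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 8 features, only some of which carry signal.
X_arr, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Importances sum to 1; sort so the strongest features come first.
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
top_features = importances.head(3).index.tolist()
print(importances)
```

Refitting on only the top-ranked columns and re-running the cross validation shows how much accuracy you give up for each feature you remove.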

Can you do it without using anything related to payment amount or outstanding principal? How do you know?
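A rough way to set aside payment and principal columns is to match on their names. The keyword list below is an assumption based on column names that appear in this dataset (such as `out_prncp` and `total_pymnt`); it is a starting point to check by hand, not a complete rule.

```python
import pandas as pd

# Empty toy frame that only carries column names for the demonstration.
toy = pd.DataFrame(columns=[
    'loan_amnt', 'int_rate', 'out_prncp',
    'total_pymnt', 'total_rec_int', 'last_pymnt_d',
])

# Hypothetical substrings that flag payment or principal information.
leak_keywords = ('pymnt', 'prncp', 'total_rec')

leaky = [c for c in toy.columns if any(k in c for k in leak_keywords)]
safe = toy.drop(columns=leaky)
print(leaky)
print(safe.columns.tolist())
```

Printing the matched names before dropping them makes it easier to confirm that nothing unrelated was caught by the substring match.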

In [12]:
# Your code here.
y2015.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,emp_length,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,68009401.0,72868139.0,16000.0,16000.0,16000.0,60 months,14.85,379.39,C,10+ years,...,0.0,2.0,78.9,0.0,0.0,2.0,298100.0,31329.0,281300.0,13400.0
1,68354783.0,73244544.0,9600.0,9600.0,9600.0,36 months,7.49,298.58,A,8 years,...,0.0,2.0,100.0,66.7,0.0,0.0,88635.0,55387.0,12500.0,75635.0
2,68466916.0,73356753.0,25000.0,25000.0,25000.0,36 months,7.49,777.55,A,10+ years,...,0.0,0.0,100.0,20.0,0.0,0.0,373572.0,68056.0,38400.0,82117.0
3,68466961.0,73356799.0,28000.0,28000.0,28000.0,36 months,6.49,858.05,A,10+ years,...,0.0,0.0,91.7,22.2,0.0,0.0,304003.0,74920.0,41500.0,42503.0
4,68495092.0,73384866.0,8650.0,8650.0,8650.0,36 months,19.89,320.99,E,8 years,...,0.0,12.0,100.0,50.0,1.0,0.0,38998.0,18926.0,2750.0,18248.0


In [16]:
y2015.drop(['id', 'member_id'], axis=1, inplace=True)

In [18]:
y2015 = y2015.dropna(axis=1)

In [21]:
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

term
2
grade
7
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
1
purpose
14
initial_list_status
2
application_type
2


In [29]:
# Reassign rather than use inplace=True, which triggers a
# SettingWithCopyWarning because y2015 was created by slicing.
y2015 = y2015.drop(['issue_d', 'verification_status'], axis=1)


In [32]:
y2015.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421095 entries, 0 to 421094
Data columns (total 64 columns):
loan_amnt                     421095 non-null float64
funded_amnt                   421095 non-null float64
funded_amnt_inv               421095 non-null float64
term                          421095 non-null object
int_rate                      421095 non-null float64
installment                   421095 non-null float64
grade                         421095 non-null object
home_ownership                421095 non-null object
annual_inc                    421095 non-null float64
loan_status                   421095 non-null object
pymnt_plan                    421095 non-null object
purpose                       421095 non-null object
dti                           421095 non-null float64
delinq_2yrs                   421095 non-null float64
inq_last_6mths                421095 non-null float64
open_acc                      421095 non-null float64
pub_rec                       4

In [37]:
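# Restrict to numeric columns; recent pandas versions raise an error
# if corr() is called on a frame that still contains object columns.
y2015.select_dtypes(include='number').corr() > .5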

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,...,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
loan_amnt,True,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
funded_amnt,True,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
funded_amnt_inv,True,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
int_rate,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
installment,True,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
annual_inc,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
dti,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
delinq_2yrs,False,False,False,False,False,False,False,True,False,False,...,False,True,False,False,False,False,False,False,False,False
inq_last_6mths,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
open_acc,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False


In [38]:
# Drop columns that duplicate loan_amnt or reveal payment history.
# Reassigning avoids the SettingWithCopyWarning raised by inplace=True.
y2015 = y2015.drop(['funded_amnt', 'funded_amnt_inv', 'installment',
                    'out_prncp', 'out_prncp_inv', 'total_pymnt',
                    'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int'],
                   axis=1)


In [41]:
y2015.select_dtypes(include='number').corr() > .75

Unnamed: 0,loan_amnt,int_rate,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,total_acc,...,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
loan_amnt,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
int_rate,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
annual_inc,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
dti,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
delinq_2yrs,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
inq_last_6mths,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
open_acc,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
pub_rec,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
revol_bal,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
total_acc,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False


In [42]:
# Reassigning avoids the SettingWithCopyWarning raised by inplace=True.
y2015 = y2015.drop(['num_sats', 'num_op_rev_tl', 'acc_open_past_24mths'],
                   axis=1)


In [43]:
y2015.select_dtypes(include='number').corr()

Unnamed: 0,loan_amnt,int_rate,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,total_acc,...,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
loan_amnt,1.0,0.140572,0.305734,0.006103,-0.010424,-0.03904,0.188214,-0.088899,0.334656,0.20843,...,0.002461,-0.031334,-0.042158,0.105056,-0.12443,0.002087,0.347289,0.289565,0.395843,0.203093
int_rate,0.140572,1.0,-0.090399,0.077932,0.04371,0.231139,-0.012985,0.058231,-0.057412,-0.040211,...,0.021897,0.031344,0.256831,-0.061652,0.065078,0.012234,-0.121081,-0.000676,-0.257222,0.004055
annual_inc,0.305734,-0.090399,1.0,-0.068237,0.03791,0.02299,0.122328,-0.006739,0.270174,0.159804,...,0.011602,0.005709,0.04533,-0.005866,-0.050312,0.038323,0.391129,0.297563,0.257184,0.242637
dti,0.006103,0.077932,-0.068237,1.0,-0.005868,0.003277,0.104364,-0.0232,0.048135,0.080499,...,0.002671,-0.005476,0.031698,0.04202,-0.015054,-0.014954,-0.000674,0.100831,0.012969,0.113685
delinq_2yrs,-0.010424,0.04371,0.03791,-0.005868,1.0,0.036139,0.040005,-0.01522,-0.035801,0.114623,...,0.111401,0.649762,-0.017943,-0.451422,-0.040508,0.011366,0.059349,0.026929,-0.081793,0.061466
inq_last_6mths,-0.03904,0.231139,0.02299,0.003277,0.036139,1.0,0.159144,0.097382,-0.012303,0.161029,...,-0.004539,0.049446,0.370187,-0.05829,0.12026,0.018906,0.009545,0.029615,-0.012923,0.039506
open_acc,0.188214,-0.012985,0.122328,0.104364,0.040005,0.159144,1.0,-0.029598,0.222987,0.709499,...,0.017054,0.002805,0.361077,0.127752,-0.032279,-0.010632,0.280703,0.39989,0.354028,0.359085
pub_rec,-0.088899,0.058231,-0.006739,-0.0232,-0.01522,0.097382,-0.029598,1.0,-0.106714,0.014886,...,-0.007092,0.011977,0.105336,-0.059723,0.615794,0.720037,-0.082777,-0.068061,-0.143872,-0.026032
revol_bal,0.334656,-0.057412,0.270174,0.048135,-0.035801,-0.012303,0.222987,-0.106714,1.0,0.180191,...,0.002292,-0.040672,-0.036702,0.128087,-0.132012,-0.019116,0.481402,0.514992,0.504089,0.109299
total_acc,0.20843,-0.040211,0.159804,0.080499,0.114623,0.161029,0.709499,0.014886,0.180191,1.0,...,0.021654,0.069411,0.335309,0.019182,0.058621,-0.023568,0.327725,0.427075,0.262416,0.419243
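Scanning a large correlation matrix by eye is error-prone. A possible shortcut is to look only at the upper triangle and list every column that is strongly correlated with an earlier one. The sketch below uses synthetic columns (the names are borrowed from the dataset, the values are random) and the same 0.75 threshold as above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# funded_amnt is built as a near copy of loan_amnt; dti is independent noise.
toy = pd.DataFrame({
    'loan_amnt': base,
    'funded_amnt': base + rng.normal(scale=0.01, size=200),
    'dti': rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only entries above the diagonal so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is flagged if it correlates strongly with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
print(to_drop)
```

This keeps the first column of each correlated pair and flags the later one, so the column order decides which of the two survives.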


In [44]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

cross_val_score(rfc, X, Y, cv=10)

array([0.89261713, 0.89584669, 0.89518178, 0.90064354, 0.90268345,
       0.9022085 , 0.9048422 , 0.90439099, 0.90640511, 0.89942051])

In [45]:
# A plain list has no .mean() method; use np.mean to average the fold scores.
np.mean([0.89261713, 0.89584669, 0.89518178, 0.90064354, 0.90268345,
         0.9022085 , 0.9048422 , 0.90439099, 0.90640511, 0.89942051])