<br>
<br>
**<font size=8><center>Milestone 6</center></font>**

### Authors:
Devon Luongo <br>
Ankit Agarwal <br>
Bryn Clarke <br>
Ben Yuen

### Libraries:

In [1]:
import pandas as pd
import numpy as np

**<font size=6>Data Cleaning</font>**

Some of the data files have fields that contain NAs for older time periods. In order to collapse the data sets into one file, all numerical data will be stored in float fields (integer fields do not support NA missing values). To do this, we first define a conversion dictionary that stores the numeric fields with lookups to the *float* data type.

In [4]:
convert_float = dict([s, float] for s in
                     ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment',
                      'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq',                      
                      'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc',
                      'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
                      'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
                      'last_pymnt_amnt', 'collections_12_mths_ex_med', 'mths_since_last_major_derog',
                      'annual_inc_joint', 'dti_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 
                      'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 
                      'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 
                      'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 
                      'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
                      'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
                      'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc',
                      'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq',
                      'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 
                      'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl',
                      'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m',
                      'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq',
                      'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim',
                      'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit'])
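To illustrate why we store counts as floats, here is a minimal sketch on a toy column (the values are made up and are not taken from the loan files). It shows that a column with a missing value is inferred as *float64*, and that casting it to a plain integer type fails while the NaN is present.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data).
counts = pd.Series([3, 0, None], name="delinq_2yrs")

# pandas infers float64 because NaN has no int64 representation.
print(counts.dtype)

# An explicit float cast keeps the missing value intact.
as_float = counts.astype(float)

# Casting to a plain integer dtype raises while NaN is present.
try:
    counts.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

print(cast_failed)
```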

We also define a dictionary of string fields, to handle situations where the inferred data type might be numeric even though the field should be read in as a string/object.

In [5]:
convert_str = dict([s, str] for s in
                    ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 
                     'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 
                     'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 
                     'initial_list_status', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d',
                     'policy_code', 'application_type', 'verification_status_joint'])

In [126]:
convert_categ = dict([s, "categorical"] for s in
                      ["term", "grade", "sub_grade", "home_ownership", "emp_length",
                       "verification_status", "loan_status", "pymnt_plan", "purpose", "addr_state",
                       "initial_list_status", "policy_code", "application_type", "verification_status_joint"])

We read the input data from the CSV files using pandas *read_csv*. Each file has a blank row above the header and two blank rows in the footer. Because *skip_footer* is not supported by the C engine, we use the Python engine instead. The first two columns (*id* and *member_id*) are unique and are used to build the table index.

In [78]:
def read_data(file_loc):
    data = pd.read_csv(file_loc,
                       skiprows=1,
                       skip_footer=2,
                       engine="python",
                       na_values=['NaN', 'nan'],
                       converters=convert_str,
                       index_col=[0,1])
    
    # Depending on the specific dataset used, the numeric values may be read in as integers.
    # For best performance and to enable merging of the datasets, we convert those fields
    # to floats (which allow NaN values):
    for k, v in convert_float.items():
        data[k] = data[k].astype(v)
    
    # There are 5 object fields that contain dates in the format *YYYY-MMM* (e.g. '2010-Jan').
    # We parse those to return datetime fields, which are more easily input into time series models
    # or plotted in charts.
    data.issue_d = pd.to_datetime(data.issue_d, errors="coerce")
    data.last_pymnt_d = pd.to_datetime(data.last_pymnt_d, errors="coerce")
    data.next_pymnt_d = pd.to_datetime(data.next_pymnt_d, errors="coerce")
    data.last_credit_pull_d = pd.to_datetime(data.last_credit_pull_d, errors="coerce")
    data.earliest_cr_line = pd.to_datetime(data.earliest_cr_line, errors="coerce")
    
    return data

data = pd.concat([read_data(s) for s in ["./data/LoanStats3a.csv", "./data/LoanStats3b.csv", "./data/LoanStats3c.csv", "./data/LoanStats3d.csv"]])
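The header and footer handling can be checked on a small in-memory file. The sketch below is only an illustration: the banner and footer text are invented, and it uses the `skipfooter` spelling that newer pandas releases expect, whereas the cell above uses the older `skip_footer` name.

```python
import io
import pandas as pd

# Hypothetical file layout: one extra row above the header, two footer rows.
raw = (
    "Banner row above the header\n"
    "id,member_id,loan_amnt,annual_inc\n"
    "1,10,5000,24000\n"
    "2,20,2500,30000\n"
    "Footer row one\n"
    "Footer row two\n"
)

toy = pd.read_csv(io.StringIO(raw),
                  skiprows=1,        # drop the row above the header
                  skipfooter=2,      # drop the two footer rows
                  engine="python",   # skipfooter is not supported by the C engine
                  index_col=[0, 1])  # (id, member_id) become the index

print(toy)
```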

In [79]:
print(data.shape)

(887441, 109)


Many of the remaining fields contain categorical data. We use the pandas *category* data type to store the data more efficiently.

In [None]:
for k, v in convert_categ.items():
    data[k] = pd.Categorical(data[k])
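As a rough illustration of the storage saving, the following sketch compares the memory footprint of a repetitive string column before and after conversion. The column is synthetic and the exact byte counts will vary by pandas version.

```python
import pandas as pd

# Synthetic column with only three distinct labels repeated many times.
grades = pd.Series(["A", "B", "C", "B"] * 25000, name="grade")

# deep=True includes the memory held by the Python string objects.
bytes_as_strings = grades.memory_usage(deep=True)
bytes_as_category = grades.astype("category").memory_usage(deep=True)

print(bytes_as_strings, bytes_as_category)
```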

Some percentages are stored as strings (*int_rate*, *revol_util*). Here we convert them into a float by stripping the % symbol and dividing by 100.

In [81]:
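def percent_to_float(s):
    # Non-string values (e.g. NaN) pass through unchanged.
    if not isinstance(s, str):
        return s
    # Percentage strings such as "10.65%" become 0.1065.
    if "%" in s:
        return float(s.strip("%")) / 100
    # The literal string "None" is treated as a missing value.
    if s == "None":
        return np.nan
    return s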

data.int_rate = [percent_to_float(s) for s in data.int_rate]
data.revol_util = [percent_to_float(s) for s in data.revol_util]
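A vectorized alternative is sketched below on toy values. It is not the method used above: with `errors="coerce"` any string that is not a number (including "None") becomes NaN, whereas *percent_to_float* leaves other strings unchanged.

```python
import pandas as pd

# Toy percentage strings (hypothetical values), including a missing entry.
rates = pd.Series(["10.65%", "7.90%", "None", None])

# Strip the trailing % sign, parse as numbers, and rescale to a fraction.
as_fraction = pd.to_numeric(rates.str.rstrip("%"), errors="coerce") / 100

print(as_fraction)
```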

In [82]:
# The remaining 5 object type fields are dropped as they aren't easily incorporated in a design matrix
data = data.drop(["emp_title", "url", "desc", "title", "zip_code"], axis=1)

In [83]:
test = data.copy()

In [87]:
# We decided to convert the date fields to the number of days since 1/1/1900
for date_col in ["issue_d", "earliest_cr_line", "last_pymnt_d", "next_pymnt_d", "last_credit_pull_d"]:
    data.loc[:, date_col] = (data.loc[:, date_col] - pd.to_datetime("1/1/1900")).dt.days
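The day-count conversion can be verified by hand on a single date. In this small sketch, 1 December 2011 maps to 40876, which is the *issue_d* value that appears in the `data.head()` output further down; missing dates stay missing.

```python
import pandas as pd

# One known date and one missing date (NaT).
issue = pd.Series(pd.to_datetime(["2011-12-01", None]))

# Subtracting a timestamp yields timedeltas; .dt.days extracts whole days.
days = (issue - pd.Timestamp("1900-01-01")).dt.days

print(days)
```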

In [88]:
# Inspect the distribution of loan statuses.
print(data["loan_status"].value_counts())

Current                                                467679
Fully Paid                                             315192
Charged Off                                             75740
Late (31-120 days)                                      14548
In Grace Period                                          8432
Late (16-30 days)                                        2968
Does not meet the credit policy. Status:Fully Paid       1988
Does not meet the credit policy. Status:Charged Off       761
Default                                                   132
None                                                        1
Name: loan_status, dtype: int64


In [89]:
# We exclude loans with status "Current" or "None" because their final outcome is not yet known.
# Loans that are late or in a grace period are excluded for the same reason.
# Finally, we remove loans that do not meet the credit policy, which leaves a binary outcome to model.
data = data[((data["loan_status"] == "Default") | (data["loan_status"] == "Charged Off") | (data["loan_status"] == "Fully Paid"))]

# Recode "Charged Off" as "Default", then encode Default as 1 and Fully Paid as 0
data["loan_status"] = data["loan_status"].replace("Charged Off", "Default")
data["loan_status"] = data["loan_status"].replace("Default", 1)
data["loan_status"] = data["loan_status"].replace("Fully Paid", 0)
print(data.shape)
print(data["loan_status"].value_counts())

(391064, 104)
0    315192
1     75872
Name: loan_status, dtype: int64
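The same filtering and recoding can be expressed with a single lookup table. This is a sketch on a toy series, assuming the three resolved statuses are the only ones we keep; `label_map` is a name introduced here for illustration.

```python
import pandas as pd

# Toy statuses (hypothetical sample covering kept and dropped cases).
status = pd.Series(["Fully Paid", "Charged Off", "Current",
                    "Default", "Late (16-30 days)"])

# Resolved outcomes only: 1 = default or charge-off, 0 = fully paid.
label_map = {"Fully Paid": 0, "Charged Off": 1, "Default": 1}

# Keep rows with a resolved outcome, then map the labels to 0/1.
kept = status[status.isin(list(label_map))]
target = kept.map(label_map).astype(int)

print(target.value_counts())
```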


In [94]:
data.head()

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
1077501,1296599.0,5000.0,5000.0,4975.0,36 months,0.1065,162.87,B,B2,10+ years,Rent,24000.0,Verified,40876.0,0,N,Credit_Card,AZ,27.65,0.0,31046.0,1.0,,,3.0,0.0,13648.0,0.837,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,42003.0,171.62,,42612.0,0.0,,1,Individual,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1077430,1314167.0,2500.0,2500.0,2500.0,60 months,0.1527,59.83,C,C4,< 1 year,Rent,30000.0,Source Verified,40876.0,1,N,Car,GA,1.0,0.0,36249.0,5.0,,,3.0,0.0,1687.0,0.094,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,41363.0,119.66,,42612.0,0.0,,1,Individual,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1077175,1313524.0,2400.0,2400.0,2400.0,36 months,0.1596,84.33,C,C5,10+ years,Rent,12252.0,Not Verified,40876.0,0,N,Small_Business,IL,8.72,0.0,37194.0,2.0,,,2.0,0.0,2956.0,0.985,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,41789.0,649.91,,42612.0,0.0,,1,Individual,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1076863,1277178.0,10000.0,10000.0,10000.0,36 months,0.1349,339.31,C,C1,10+ years,Rent,49200.0,Source Verified,40876.0,0,N,Other,CA,20.0,0.0,35094.0,1.0,35.0,,10.0,0.0,5598.0,0.21,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,42003.0,357.48,,42459.0,0.0,,1,Individual,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1075269,1311441.0,5000.0,5000.0,5000.0,36 months,0.079,156.46,A,A4,3 years,Rent,36000.0,Source Verified,40876.0,0,N,Wedding,AZ,11.2,0.0,38290.0,3.0,,,9.0,0.0,7963.0,0.283,12.0,f,0.0,0.0,5632.21,5632.21,5000.0,632.21,0.0,0.0,0.0,42003.0,161.03,,42368.0,0.0,,1,Individual,,,,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,


In [96]:
# There are only 50 joint applications, so we coalesce the fields that have joint values (annual_inc, dti, verification_status)
for joint_col in ["annual_inc", "dti", "verification_status"]:
    data.loc[data.application_type=="Joint", joint_col] = data.loc[data.application_type=="Joint", joint_col + "_joint"]

data = data.drop(["annual_inc_joint", "dti_joint", "verification_status_joint"], axis=1)
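To make the coalescing step concrete, here is a minimal sketch on a two-row toy frame with one individual and one joint application (the figures are invented). Only the joint row takes its value from the `_joint` column.

```python
import numpy as np
import pandas as pd

# Toy applications (hypothetical values).
apps = pd.DataFrame({
    "application_type": ["Individual", "Joint"],
    "annual_inc": [40000.0, 30000.0],
    "annual_inc_joint": [np.nan, 85000.0],
})

# Overwrite the individual figure with the joint figure for joint applications.
is_joint = apps["application_type"] == "Joint"
apps.loc[is_joint, "annual_inc"] = apps.loc[is_joint, "annual_inc_joint"]

# The joint column is now redundant.
apps = apps.drop(columns=["annual_inc_joint"])

print(apps)
```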

In [104]:
data.pymnt_plan.value_counts()

N       391060
Y            4
None         0
Name: pymnt_plan, dtype: int64

In [105]:
# policy_code has only one level, "1", and pymnt_plan is all "N" except for 4 records so we drop both 
data = data.drop(["pymnt_plan", "policy_code"], axis=1)
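Fields like these could also be flagged programmatically. The sketch below computes, for each column of a toy frame, the share of rows taken by the most common value; the 0.75 cut-off is an arbitrary choice for this tiny example and not a threshold used in our analysis.

```python
import pandas as pd

# Toy frame (hypothetical values): one constant and one near-constant column.
frame = pd.DataFrame({
    "policy_code": ["1", "1", "1", "1"],
    "pymnt_plan": ["N", "N", "N", "Y"],
    "term": ["36 months", "60 months", "36 months", "60 months"],
})

# value_counts sorts by frequency, so the first entry is the dominant level.
top_share = frame.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Columns dominated by a single level carry little information.
near_constant = top_share[top_share >= 0.75].index.tolist()

print(near_constant)
```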

In [107]:
# While grade and sub_grade are thought to be predictive, they come from Lending Club's credit model, and we
# don't want our scoring model to be influenced by a model that captures much of the same information.
# This is also true of int_rate and installment, which are each correlated with Lending Club's credit grade.
data = data.drop(["int_rate", "installment", "grade", "sub_grade"], axis=1)

In [117]:
# Finally we drop other fields that are not known a priori and may contain information
# post-default / delinquency that will bias our results
# This includes payment history and loan balances
data = data.drop(["out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
                  "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee",
                  "last_pymnt_d", "last_pymnt_amnt", "next_pymnt_d"], axis=1)

In [118]:
data.loc[data.loan_status == 1]

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,purpose,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
1077430,1314167.0,2500.0,2500.0,2500.000000,60 months,< 1 year,Rent,30000.0,Source Verified,40876.0,1,Car,GA,1.00,0.0,36249.0,5.0,,,3.0,0.0,1687.0,0.094,4.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1071795,1306957.0,5600.0,5600.0,5600.000000,60 months,4 years,Own,40000.0,Source Verified,40876.0,1,Small_Business,CA,5.55,0.0,38076.0,2.0,,,11.0,0.0,5210.0,0.326,13.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1071570,1306721.0,5375.0,5375.0,5350.000000,60 months,< 1 year,Rent,15000.0,Verified,40876.0,1,Other,TX,18.08,0.0,38229.0,0.0,,,2.0,0.0,9279.0,0.365,3.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1064687,1298717.0,9000.0,9000.0,9000.000000,36 months,< 1 year,Rent,30000.0,Source Verified,40876.0,1,Debt_Consolidation,VA,10.08,0.0,38076.0,1.0,,,4.0,0.0,10452.0,0.917,9.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1069057,1303503.0,10000.0,10000.0,10000.000000,36 months,3 years,Rent,100000.0,Source Verified,40876.0,1,Other,CA,7.06,0.0,33357.0,2.0,,,14.0,0.0,11997.0,0.555,29.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1039153,1269083.0,21000.0,21000.0,21000.000000,36 months,10+ years,Rent,105000.0,Verified,40876.0,1,Debt_Consolidation,FL,13.22,0.0,30346.0,0.0,,,7.0,0.0,32135.0,0.903,38.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1069559,1304634.0,6000.0,6000.0,6000.000000,36 months,1 year,Rent,76000.0,Not Verified,40876.0,1,Major_Purchase,CA,2.40,0.0,37041.0,1.0,,,7.0,0.0,5963.0,0.297,7.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1069800,1304679.0,15000.0,15000.0,8725.000000,36 months,9 years,Rent,60000.0,Not Verified,40876.0,1,Debt_Consolidation,NY,15.22,0.0,37893.0,1.0,,,7.0,0.0,5872.0,0.576,11.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1069657,1304764.0,5000.0,5000.0,5000.000000,60 months,2 years,Rent,50004.0,Not Verified,40876.0,1,Other,PA,13.97,3.0,37893.0,0.0,20.0,,14.0,0.0,4345.0,0.595,22.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,
1069465,1304521.0,5000.0,5000.0,5000.000000,36 months,10+ years,Mortgage,100000.0,Source Verified,40876.0,1,Debt_Consolidation,OH,16.33,0.0,34849.0,0.0,,,17.0,0.0,74351.0,0.621,35.0,f,42612.0,0.0,,Individual,0.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,


In [150]:
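df_y = data["loan_status"]
df_X = data.drop("loan_status", axis=1)
# Intended to remove member_id. Note that id and member_id are already index levels
# (index_col=[0, 1] in read_data), so this positional slice drops the first data column, loan_amnt.
df_X = df_X.iloc[:, 1:]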

In [151]:
print(df_X.shape)
print(df_y.shape)

(391064, 81)
(391064L,)


In [152]:
# One-hot encode each remaining categorical field and drop the original column.
for k in convert_categ:
    if k not in ["verification_status_joint", "policy_code", "grade", "sub_grade", "pymnt_plan", "loan_status"]:
        df_X = pd.concat([df_X,
                          pd.get_dummies(df_X[k], prefix=k)],
                         axis=1)
        df_X = df_X.drop(k, axis=1)
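The effect of the encoding loop is easiest to see on a toy frame. The sketch below uses the `columns` argument of `pd.get_dummies`, which encodes the listed fields and removes the originals in one call; the data are invented and this is an equivalent shortcut, not the loop used above.

```python
import pandas as pd

# Toy design matrix with one numeric and two categorical fields (hypothetical values).
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "term": ["36 months", "60 months", "36 months"],
    "home_ownership": ["Rent", "Own", "Rent"],
})

# Each listed column is replaced by one indicator column per level.
encoded = pd.get_dummies(toy, columns=["term", "home_ownership"])

print(encoded.columns.tolist())
```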

In [166]:
X = np.array(df_X)
y = np.array(df_y)

In [167]:
print(X.shape)

(391064L, 172L)


In [169]:
np.save("./data_compressed/X.npy", X)
np.save("./data_compressed/y.npy", y)
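A quick round-trip check on a toy array confirms that missing values survive the save. This sketch writes to a temporary directory rather than *./data_compressed*, and the array contents are made up.

```python
import os
import tempfile

import numpy as np

# Toy float matrix containing a missing value (hypothetical data).
X_toy = np.array([[5000.0, np.nan], [2500.0, 1.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_toy.npy")
    np.save(path, X_toy)     # binary .npy format, preserves dtype and shape
    X_back = np.load(path)   # loaded fully into memory before the directory is removed

# equal_nan=True treats NaN entries in matching positions as equal.
round_trip_ok = np.allclose(X_toy, X_back, equal_nan=True)

print(round_trip_ok)
```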