_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip

--2019-07-18 22:48:34--  https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2019Q1.csv.zip’

LoanStats_2019Q1.cs     [    <=>             ]  19.30M   813KB/s    in 25s     

2019-07-18 22:49:04 (795 KB/s) - ‘LoanStats_2019Q1.csv.zip’ saved [20240936]



In [2]:
!ls

LoanStats_2019Q1.csv.zip  sample_data


In [3]:
!unzip LoanStats_2019Q1.csv.zip

Archive:  LoanStats_2019Q1.csv.zip
  inflating: LoanStats_2019Q1.csv    


In [4]:
!head LoanStats_2019Q1.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [5]:
import pandas as pd

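# header=1 takes the column names from the second line of the file (the first line is a note);
# skiprows=1 would have the same effect. skipfooter=2 drops the last two lines of the file
# and requires engine="python".
df = pd.read_csv("LoanStats_2019Q1.csv", header=1, skipfooter=2, engine="python")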
print(df.shape)
df.head()

(115675, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,20000,20000,20000,60 months,17.19%,499.1,C,C5,Front desk supervisor,6 years,RENT,47000.0,Source Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,958xx,CA,14.02,0,Sep-2006,1,50.0,,15,0,10687,19.7%,53,w,19254.76,19254.76,1459.1,1459.1,...,12.5,0,0,75824,31546,33800,21524,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,12000,12000,12000,36 months,16.40%,424.26,C,C4,Executive Director,4 years,MORTGAGE,95000.0,Not Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,436xx,OH,29.5,0,Nov-1992,0,29.0,,19,0,16619,64.9%,36,w,11475.92,11475.92,826.65,826.65,...,66.7,0,0,209488,51734,18500,54263,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,3000,3000,3000,36 months,14.74%,103.62,C,C2,Office Manager,4 years,MORTGAGE,58750.0,Verified,Mar-2019,Current,n,,,medical,Medical expenses,327xx,FL,30.91,0,Jun-2004,0,24.0,,16,0,20502,60.1%,25,f,2865.64,2865.64,202.33,202.33,...,71.4,0,0,69911,37816,11400,35811,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,35000,35000,35000,36 months,15.57%,1223.08,C,C3,Store Manager,10+ years,RENT,122000.0,Verified,Mar-2019,Current,n,,,credit_card,Credit card refinancing,333xx,FL,22.0,1,Dec-2009,0,20.0,,5,0,1441,24.4%,18,w,33459.43,33459.43,2446.76,2446.76,...,100.0,0,0,65640,24471,1600,59740,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,5000,5000,5000,36 months,15.57%,174.73,C,C3,Area Manager,3 years,OWN,65000.0,Verified,Mar-2019,Current,n,,,house,Home buying,640xx,MO,16.28,1,Jul-2001,0,7.0,,9,0,5604,64.4%,25,w,4778.86,4778.86,340.81,340.81,...,50.0,0,0,38190,29775,4400,29490,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
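
To see what `header=1` and `skipfooter=2` do without downloading the full file, here is a minimal sketch on a made-up in-memory CSV. The contents of `raw` are hypothetical and only imitate the layout seen in the `!head` output: a note on the first line, the real header on the second, and two trailing summary lines.

```python
import io

import pandas as pd

# Hypothetical miniature file: note line, header line, two data rows, two footer lines.
raw = (
    "Notes offered by Prospectus\n"
    "loan_amnt,term,int_rate\n"
    "20000,60 months,17.19%\n"
    "12000,36 months,16.40%\n"
    "Total amount funded: 32000\n"
    "Total number of loans: 2\n"
)

# header=1: the second line holds the column names; the line above it is discarded.
# skipfooter=2: the last two lines are dropped (only supported by the python engine).
small = pd.read_csv(io.StringIO(raw), header=1, skipfooter=2, engine="python")

# skiprows=1 reaches the same result by skipping the note line before parsing.
same = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine="python")

print(small)
```

Both calls should produce the same two-row frame, which is why either option works for the LendingClub file.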


In [6]:
df.isna().sum()

id                                            115675
member_id                                     115675
loan_amnt                                          0
funded_amnt                                        0
funded_amnt_inv                                    0
term                                               0
int_rate                                           0
installment                                        0
grade                                              0
sub_grade                                          0
emp_title                                      19518
emp_length                                     11101
home_ownership                                     0
annual_inc                                         0
verification_status                                0
issue_d                                            0
loan_status                                        0
pymnt_plan                                         0
url                                           

In [0]:
# df[df.loan_amnt.isna()]  # Would select the rows where loan_amnt is NaN.
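
The commented-out line above filters rows with a boolean mask. A minimal sketch on a toy frame (the `toy` data is invented for illustration) shows the difference between counting missing values and selecting the rows that contain them:

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the LendingClub file.
toy = pd.DataFrame({
    "loan_amnt": [20000, np.nan, 3000],
    "emp_title": ["Owner", "Driver", None],
})

# Count of missing values in one column.
n_missing = toy["loan_amnt"].isna().sum()

# Rows where that column is missing (boolean mask inside the brackets).
missing_amount = toy[toy["loan_amnt"].isna()]

# Rows with a missing value in any column.
rows_any_missing = toy[toy.isna().any(axis=1)]

print(n_missing)
print(missing_amount)
print(rows_any_missing)
```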

In [8]:
df.info()  # A dtype of "object" usually means the column holds strings.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115675 entries, 0 to 115674
Columns: 144 entries, id to settlement_term
dtypes: float64(56), int64(52), object(36)
memory usage: 127.1+ MB


In [9]:
df.head().T  # .T transposes the frame, swapping rows and columns.

Unnamed: 0,0,1,2,3,4
id,,,,,
member_id,,,,,
loan_amnt,20000,12000,3000,35000,5000
funded_amnt,20000,12000,3000,35000,5000
funded_amnt_inv,20000,12000,3000,35000,5000
term,60 months,36 months,36 months,36 months,36 months
int_rate,17.19%,16.40%,14.74%,15.57%,15.57%
installment,499.1,424.26,103.62,1223.08,174.73
grade,C,C,C,C,C
sub_grade,C5,C4,C2,C3,C3


In [10]:
pd.options.display.max_columns = 150  # Display up to 150 columns (None means no limit).
pd.options.display.max_rows = 150  # The same setting exists for rows.
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,20000,20000,20000,60 months,17.19%,499.1,C,C5,Front desk supervisor,6 years,RENT,47000.0,Source Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,958xx,CA,14.02,0,Sep-2006,1,50.0,,15,0,10687,19.7%,53,w,19254.76,19254.76,1459.1,1459.1,745.24,713.86,0.0,0.0,0.0,Jun-2019,499.1,Jul-2019,Jun-2019,0,50.0,1,Individual,,,,0,0,31546,3,2,1,2,10.0,20859,97.0,4,9,5909,42.0,54300,6,1,3,11,2103.0,23647.0,30.0,0,0,150.0,100,1,1,0,5.0,,3.0,50.0,3,3,4,8,19,19,13,33,4,15,0.0,0,0,5,98.0,12.5,0,0,75824,31546,33800,21524,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,12000,12000,12000,36 months,16.40%,424.26,C,C4,Executive Director,4 years,MORTGAGE,95000.0,Not Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,436xx,OH,29.5,0,Nov-1992,0,29.0,,19,0,16619,64.9%,36,w,11475.92,11475.92,826.65,826.65,524.08,302.57,0.0,0.0,0.0,May-2019,424.26,Jul-2019,Jun-2019,0,30.0,1,Individual,,,,0,0,176551,1,5,2,5,7.0,35115,65.0,3,6,4899,65.0,25600,4,0,3,12,9292.0,3231.0,82.5,0,0,316.0,269,5,5,2,12.0,58.0,7.0,29.0,1,6,8,6,9,10,13,23,8,19,0.0,0,0,6,80.6,66.7,0,0,209488,51734,18500,54263,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,3000,3000,3000,36 months,14.74%,103.62,C,C2,Office Manager,4 years,MORTGAGE,58750.0,Verified,Mar-2019,Current,n,,,medical,Medical expenses,327xx,FL,30.91,0,Jun-2004,0,24.0,,16,0,20502,60.1%,25,f,2865.64,2865.64,202.33,202.33,134.36,67.97,0.0,0.0,0.0,May-2019,103.62,Jul-2019,Jun-2019,0,,1,Individual,,,,0,0,37816,0,3,0,0,27.0,17314,48.0,0,0,3962,54.0,34100,0,1,0,0,2364.0,1208.0,89.4,0,0,177.0,164,35,27,4,40.0,24.0,,24.0,0,7,12,7,8,6,13,15,12,16,0.0,0,0,0,96.0,71.4,0,0,69911,37816,11400,35811,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,35000,35000,35000,36 months,15.57%,1223.08,C,C3,Store Manager,10+ years,RENT,122000.0,Verified,Mar-2019,Current,n,,,credit_card,Credit card refinancing,333xx,FL,22.0,1,Dec-2009,0,20.0,,5,0,1441,24.4%,18,w,33459.43,33459.43,2446.76,2446.76,1540.57,845.04,61.15,0.0,0.0,Jun-2019,1223.08,Jul-2019,Jun-2019,0,20.0,1,Individual,,,,0,0,24471,0,3,0,0,26.0,23030,39.0,0,0,1441,37.0,5900,2,4,1,1,4894.0,159.0,90.1,0,0,108.0,111,52,24,0,85.0,,9.0,,1,1,1,1,3,10,2,7,1,5,0.0,0,1,0,94.4,100.0,0,0,65640,24471,1600,59740,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,5000,5000,5000,36 months,15.57%,174.73,C,C3,Area Manager,3 years,OWN,65000.0,Verified,Mar-2019,Current,n,,,house,Home buying,640xx,MO,16.28,1,Jul-2001,0,7.0,,9,0,5604,64.4%,25,w,4778.86,4778.86,340.81,340.81,221.14,119.67,0.0,0.0,0.0,May-2019,174.73,Jul-2019,Jun-2019,0,7.0,1,Individual,,,,0,0,29775,0,1,0,1,17.0,24171,82.0,1,6,1824,78.0,8700,1,0,0,7,3722.0,1070.0,75.7,0,0,111.0,212,10,10,4,17.0,69.0,17.0,69.0,4,3,5,5,9,7,8,14,5,9,0.0,0,1,1,76.0,50.0,0,0,38190,29775,4400,29490,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
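
Setting `pd.options.display.max_columns` changes the display for the rest of the session. If the wider view is only needed for one cell, `pd.option_context` applies the setting temporarily. A small sketch, using an invented 30-column frame named `wide`:

```python
import pandas as pd

# Invented one-row frame with 30 columns, just to have something wide to print.
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

before = pd.get_option("display.max_columns")

# Inside the with-block the limit is lifted; it is restored automatically afterwards.
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

after = pd.get_option("display.max_columns")
```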


## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can find out which columns have the dtype "object", which pandas uses for strings.
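
One way to list the string columns directly is `select_dtypes`. The sketch below uses a toy frame and assumes, as in the `df.info()` output above, that strings are stored with the "object" dtype; the columns are created with `dtype=object` explicitly to make that assumption visible.

```python
import pandas as pd

# Toy frame; the two text columns are forced to dtype=object to mirror the notebook's data.
toy = pd.DataFrame({
    "loan_amnt": [20000, 12000],
    "term": pd.Series(["60 months", "36 months"], dtype=object),
    "int_rate": pd.Series(["17.19%", "16.40%"], dtype=object),
})

# Keep only the object columns and read off their names.
string_cols = toy.select_dtypes(include="object").columns.tolist()
print(string_cols)
```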

In [11]:
# Most models cannot take strings as input, so string columns need a numeric representation.
df.describe()  # Summary statistics for the numeric columns only; object (string) columns are skipped.

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,installment,annual_inc,url,desc,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,annual_inc_joint,dti_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,0.0,115675.0,115675.0,115675.0,115675.0,115675.0,0.0,0.0,115418.0,115675.0,115675.0,49876.0,13423.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,24998.0,115675.0,16681.0,16681.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,112158.0,115675.0,98356.0,115675.0,115675.0,115675.0,115651.0,115675.0,115675.0,115675.0,115675.0,115675.0,115668.0,114352.0,114297.0,115675.0,115675.0,112158.0,115675.0,115675.0,115675.0,115675.0,114417.0,22598.0,101318.0,32896.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,112529.0,115675.0,115675.0,115675.0,115675.0,114346.0,115675.0,115675.0,115675.0,115675.0,115675.0,115675.0,16681.0,16681.0,16681.0,16681.0,16451.0,16681.0,16681.0,16681.0,16681.0,4901.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
mean,,,16671.263021,16671.263021,16668.042144,481.083874,85053.33,,,20.445427,0.227595,0.443069,36.654864,87.79483,11.723544,0.117484,17859.725835,23.121262,14769.657755,14766.934018,2526.362844,2525.758973,1880.341193,645.880945,0.140706,0.0,0.0,1026.634408,0.016693,46.419034,1.0,139303.4,19.539382,0.0,176.482818,153088.8,0.913776,2.79043,0.701145,1.586721,20.14336,37631.73,67.932236,1.167806,2.463384,6395.217324,54.217058,41527.28836,1.119118,1.54571,1.877994,4.316672,14307.482752,16306.032531,49.81788,0.007469,3.038591,125.115747,176.763761,15.652777,8.732648,1.378561,26.151743,39.70108,7.663525,37.31706,0.436395,3.741967,5.544664,5.020324,7.205749,8.475159,8.30163,13.071761,5.501388,11.705909,0.0,0.0,0.058007,1.992349,94.919273,32.61193,0.117441,0.0,197153.1,55856.0,29185.391589,48716.11,39336.440801,0.564535,1.647743,11.707931,55.987022,3.083148,12.661231,0.027277,0.052515,39.24342,3.0,35.21,3.0,11.0,105.63,5636.06,5.9,2839.0,65.0,18.0
std,,,10352.13279,10352.13279,10354.685658,290.651469,109894.9,,,20.480562,0.746965,0.736914,21.806311,21.745046,5.977082,0.327165,22735.701183,12.177136,10017.773578,10019.645719,3712.764848,3712.694129,3631.076603,520.540978,2.43211,0.0,0.0,3362.997605,0.140971,21.887349,0.0,80273.6,8.099141,0.0,1776.351801,171805.5,1.127978,2.912667,0.934572,1.554939,24.583029,47968.8,23.492706,1.436514,2.447904,5920.727685,20.36786,37737.346245,1.488827,2.729418,2.309489,3.15496,17725.923722,20214.367192,28.543757,0.095435,327.162309,56.721665,100.724495,19.253788,9.39791,1.720424,34.075895,22.369914,6.089147,22.12795,1.341723,2.482458,3.507135,3.263597,4.548438,7.413471,4.983926,7.92211,3.450848,5.974131,0.0,0.0,0.421806,1.85769,8.640351,34.464534,0.327114,0.0,194023.4,55911.84,26975.924435,50492.59,33132.18992,0.911911,1.795547,6.671331,25.583408,3.224937,8.226743,0.260578,0.310101,23.687664,,,,,,,,,,
min,,,1000.0,1000.0,725.0,30.64,0.0,,,0.0,0.0,0.0,1.0,12.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,17604.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,16.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,35.21,3.0,11.0,105.63,5636.06,5.9,2839.0,65.0,18.0
25%,,,9000.0,9000.0,9000.0,264.94,49000.0,,,12.21,0.0,0.0,19.0,74.0,8.0,0.0,6166.0,14.0,7211.5,7211.5,1014.86,1014.43,635.485,262.87,0.0,0.0,0.0,272.85,0.0,29.0,1.0,91000.0,13.53,0.0,0.0,30200.0,0.0,1.0,0.0,0.0,7.0,8849.0,54.0,0.0,1.0,2500.0,40.0,18200.0,0.0,0.0,0.0,2.0,3162.0,3373.0,26.5,0.0,0.0,86.0,101.0,4.0,3.0,0.0,6.0,22.0,3.0,19.0,0.0,2.0,3.0,3.0,4.0,3.0,5.0,7.0,3.0,7.0,0.0,0.0,0.0,1.0,92.9,0.0,0.0,0.0,58100.0,21680.0,11100.0,16060.0,18077.0,0.0,0.0,7.0,37.1,1.0,7.0,0.0,0.0,20.0,3.0,35.21,3.0,11.0,105.63,5636.06,5.9,2839.0,65.0,18.0
50%,,,15000.0,15000.0,15000.0,401.53,70000.0,,,18.48,0.0,0.0,34.0,92.0,11.0,0.0,12072.0,21.0,12783.75,12775.25,1660.86,1660.86,1044.94,501.21,0.0,0.0,0.0,427.72,0.0,47.0,1.0,123000.0,19.11,0.0,0.0,84442.0,1.0,2.0,0.0,1.0,13.0,24184.0,70.0,1.0,2.0,4904.0,55.0,31600.0,1.0,0.0,1.0,4.0,7673.5,9467.5,48.5,0.0,0.0,132.0,158.0,9.0,6.0,1.0,15.0,37.0,6.0,35.0,0.0,3.0,5.0,4.0,6.0,6.0,7.0,11.0,5.0,11.0,0.0,0.0,0.0,2.0,100.0,25.0,0.0,0.0,129400.0,40781.0,21500.0,36130.0,31108.0,0.0,1.0,11.0,57.4,2.0,11.0,0.0,0.0,38.0,3.0,35.21,3.0,11.0,105.63,5636.06,5.9,2839.0,65.0,18.0
75%,,,24000.0,24000.0,24000.0,648.1,100000.0,,,25.79,0.0,1.0,53.0,105.0,15.0,0.0,21914.0,30.0,21182.43,21179.855,2746.02,2746.02,1783.535,885.11,0.0,0.0,0.0,688.13,0.0,64.0,1.0,165000.0,25.2,0.0,0.0,232450.0,1.0,3.0,1.0,2.0,23.0,48601.5,84.0,2.0,3.0,8484.0,68.0,52800.0,2.0,2.0,3.0,6.0,19812.5,21678.0,73.6,0.0,0.0,158.0,228.0,20.0,11.0,2.0,32.0,56.0,12.0,52.0,0.0,5.0,7.0,7.0,9.0,11.0,11.0,17.0,7.0,15.0,0.0,0.0,0.0,3.0,100.0,50.0,0.0,0.0,287455.0,70670.0,38600.0,65640.5,50620.0,1.0,3.0,15.0,76.3,4.0,17.0,0.0,0.0,58.0,3.0,35.21,3.0,11.0,105.63,5636.06,5.9,2839.0,65.0,18.0
max,,,40000.0,40000.0,40000.0,1676.23,9000000.0,,,999.0,21.0,5.0,176.0,119.0,72.0,4.0,652794.0,164.0,40000.0,40000.0,43131.146484,43131.15,40000.0,4501.17,185.31,0.0,0.0,41409.34,6.0,190.0,1.0,1836000.0,39.98,0.0,199308.0,4062258.0,14.0,63.0,6.0,22.0,480.0,1260281.0,313.0,18.0,44.0,207484.0,216.0,759500.0,29.0,70.0,42.0,45.0,635178.0,506507.0,169.5,6.0,65000.0,696.0,902.0,564.0,226.0,28.0,564.0,198.0,25.0,181.0,44.0,30.0,45.0,45.0,60.0,149.0,70.0,98.0,43.0,72.0,0.0,0.0,20.0,19.0,100.0,100.0,4.0,0.0,7365708.0,1295455.0,509400.0,1587274.0,473339.0,6.0,15.0,75.0,150.1,36.0,110.0,13.0,16.0,132.0,3.0,35.21,3.0,11.0,105.63,5636.06,5.9,2839.0,65.0,18.0


In [12]:
df.describe(include="object")  # Summarizes only the 36 object (string) columns reported by df.info().

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint,sec_app_earliest_cr_line,hardship_flag,hardship_type,hardship_reason,hardship_status,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_loan_status,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date
count,115675,115675,115675,115675,96157,104574,115675,115675,115675,115675,115675,115675,115675,115675,115675,115675,115546,115675,115445,110769,115673,115675,14624,16681,115675,1,1,1,1,1,1,1,115675,1,1,1
unique,2,53,7,33,39387,11,5,3,3,6,1,12,12,877,50,652,1055,2,7,3,7,2,3,577,2,1,1,1,1,1,1,1,2,1,1,1
top,36 months,8.19%,A,A4,Teacher,10+ years,MORTGAGE,Not Verified,Jan-2019,Current,n,debt_consolidation,Debt consolidation,750xx,CA,Aug-2006,0%,w,Jun-2019,Jul-2019,Jun-2019,Individual,Not Verified,Aug-2006,N,INTEREST ONLY-3 MONTHS DEFERRAL,INCOME_CURTAILMENT,ACTIVE,Jun-2019,Sep-2019,Jul-2019,In Grace Period,N,May-2019,ACTIVE,May-2019
freq,78429,11314,37060,11314,2037,34490,58578,54608,43584,109176,115675,63747,63747,1162,15902,1033,1054,101423,102101,110738,111240,98994,6395,150,115674,1,1,1,1,1,1,1,115674,1,1,1


In [13]:
df.emp_length.value_counts()

10+ years    34490
< 1 year     15044
2 years       9695
3 years       8719
1 year        7919
5 years       7189
4 years       6777
6 years       4636
7 years       3913
8 years       3625
9 years       2567
Name: emp_length, dtype: int64

In [14]:
df.grade.value_counts()  # Counts how often each unique value appears (the "unique" row in describe gives their number).

A    37060
B    33129
C    27277
D    14797
E     3364
F       31
G       17
Name: grade, dtype: int64
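
Columns like `emp_length` are strings that mostly encode a number, so they can be turned into a numeric feature. The following is a sketch, not the notebook's own method: it pulls the digits out with a regular expression and, as an explicit assumption, treats "< 1 year" as 0 years.

```python
import numpy as np
import pandas as pd

# Sample values copied from the value_counts output above, plus a missing entry.
emp_length = pd.Series(["10+ years", "< 1 year", "2 years", "1 year", np.nan])

# Extract the first run of digits; missing entries stay missing.
years = emp_length.str.extract(r"(\d+)", expand=False).astype(float)

# Assumption: "< 1 year" should count as 0 rather than 1.
years = years.mask(emp_length == "< 1 year", 0.0)

print(years)
```

Whether "10+ years" should really be capped at 10 is a modeling decision; the sketch simply keeps the number that appears in the string.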

### Convert `int_rate`


In [15]:
x = "12.5%"
float(x.strip("%"))

12.5

In [0]:
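# Vectorized version: strip "%" from every value and convert to float.
# The result is stored in a new column named "int rate" (with a space); int_rate itself is unchanged.
df["int rate"] = df["int_rate"].str.strip('%').astype(float)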

In [17]:
df["int rate"].head()

0    17.19
1    16.40
2    14.74
3    15.57
4    15.57
Name: int rate, dtype: float64

Define a function that strips the percent sign from a string and returns the rate as a float fraction (so "12.5%" becomes 0.125), then apply it to the `int_rate` column.

In [18]:
def remove_percent_sign(string):
  """Strip the % from a string such as "12.5%" and return the rate as a fraction (0.125)."""
  return float(string.strip("%"))/100

remove_percent_sign(x)

0.125

In [19]:
df["int_rate"] = df["int_rate"].apply(remove_percent_sign)
df["int_rate"].head()

0    0.1719
1    0.1640
2    0.1474
3    0.1557
4    0.1557
Name: int_rate, dtype: float64
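
The `apply` approach calls `.strip` on every value, so it would fail with an `AttributeError` on a missing value (NaN is a float). `int_rate` has no missing values, but a percent column such as `revol_util`, whose count in the `describe` output is below the row count, does. A vectorized sketch on a toy Series that tolerates missing values:

```python
import numpy as np
import pandas as pd

# Toy percent strings with one missing value.
rates = pd.Series(["17.19%", "16.40%", np.nan])

# .str methods skip missing values, so NaN passes through unchanged.
as_fraction = rates.str.rstrip("%").astype(float) / 100

print(as_fraction)
```

Either version can only run once on a column: after the conversion the values are floats, and the string methods no longer apply.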

### Clean `emp_title`

Look at top 20 titles

In [20]:
df.emp_title.value_counts().head(20)

Teacher                     2037
Manager                     1626
Registered Nurse             898
Driver                       857
Supervisor                   655
RN                           623
Sales                        586
Office Manager               574
Project Manager              540
General Manager              486
Owner                        449
Director                     374
Operations Manager           313
Engineer                     309
Truck Driver                 308
Sales Manager                288
Nurse                        281
Administrative Assistant     267
Supervisor                   260
Accountant                   259
Name: emp_title, dtype: int64

In [21]:
df.emp_title.value_counts().head(20)/len(df)

Teacher                     0.017610
Manager                     0.014057
Registered Nurse            0.007763
Driver                      0.007409
Supervisor                  0.005662
RN                          0.005386
Sales                       0.005066
Office Manager              0.004962
Project Manager             0.004668
General Manager             0.004201
Owner                       0.003882
Director                    0.003233
Operations Manager          0.002706
Engineer                    0.002671
Truck Driver                0.002663
Sales Manager               0.002490
Nurse                       0.002429
Administrative Assistant    0.002308
Supervisor                  0.002248
Accountant                  0.002239
Name: emp_title, dtype: float64
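
Dividing by `len(df)` gives each title's share of all rows, including rows where the title is missing. `value_counts(normalize=True)` computes shares too, but by default over the non-missing values only, so the two can differ. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Invented titles: two teachers, one manager, one missing.
titles = pd.Series(["Teacher", "Teacher", "Manager", np.nan])

# Share of all rows (the approach used above): the NaN row is in the denominator.
share_of_all_rows = titles.value_counts() / len(titles)

# Share of non-missing rows: NaN is dropped before normalizing.
share_of_non_null = titles.value_counts(normalize=True)

# Keeping NaN as its own category makes the shares match the first version.
share_with_nan = titles.value_counts(normalize=True, dropna=False)

print(share_of_all_rows, share_of_non_null, share_with_nan, sep="\n\n")
```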

How often is `emp_title` null?

In [22]:
df.emp_title.isna().sum()

19518

In [23]:
df.emp_title.isna().sum()/len(df)

0.1687313594121461
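
Because `True` counts as 1, the mean of an `isna()` mask is the fraction of missing values, which gives the same ratio in one step. Applied to a whole frame, it returns that fraction for every column. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame: half of the titles are missing, no loan amounts are.
toy = pd.DataFrame({
    "emp_title": ["Teacher", np.nan, "Manager", np.nan],
    "loan_amnt": [1000, 2000, 3000, 4000],
})

# Fraction of missing values per column, largest first.
null_fraction = toy.isna().mean().sort_values(ascending=False)

print(null_fraction)
```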

In [24]:
df.shape

(115675, 145)

In [25]:
df.emp_title.value_counts(dropna=False)  # dropna=False includes NaN in the counts.

NaN                                         19518
Teacher                                      2037
Manager                                      1626
Registered Nurse                              898
Driver                                        857
Supervisor                                    655
RN                                            623
Sales                                         586
Office Manager                                574
Project Manager                               540
General Manager                               486
Owner                                         449
Director                                      374
Operations Manager                            313
Engineer                                      309
Truck Driver                                  308
Sales Manager                                 288
Nurse                                         281
Administrative Assistant                      267
Supervisor                                    260


Clean the title and handle missing values


- Strip leading and trailing spaces
- Capitalize the first letter of each word
- Replace NaN with "Missing"




In [26]:
isinstance("dog", str)

True

In [27]:
import numpy as np
example = ['owner', 'Supervisor ', 'Project Manager', np.nan]

def clean_emp_title(x):
  if isinstance(x, str):
    return x.strip().title()
  else:
    return "Missing"

for ex in example:
  print(clean_emp_title(ex))
  
[clean_emp_title(x) for x in example]

Owner
Supervisor
Project Manager
Missing


['Owner', 'Supervisor', 'Project Manager', 'Missing']

In [28]:
df["emp_title"] = df["emp_title"].apply(clean_emp_title)
df["emp_title"].head(20)

0       Front Desk Supervisor
1          Executive Director
2              Office Manager
3               Store Manager
4                Area Manager
5                         Svp
6                 Vp Of Sales
7                       Sales
8                          Rn
9             Project Manager
10          Custom Applicator
11                 Av Manager
12    Human Resources Liaison
13                    Missing
14                      Agent
15            Sales/Marketing
16                        Ceo
17                  Route Rep
18                Maintenance
19        Executive Assistant
Name: emp_title, dtype: object
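
The same cleaning can be written with vectorized string methods instead of `apply`. This is an alternative sketch, shown on the example values from the cell above, and not the notebook's own approach:

```python
import numpy as np
import pandas as pd

# Same example values as in the clean_emp_title cell.
titles = pd.Series(["owner", "Supervisor ", "Project Manager", np.nan])

# Strip whitespace, title-case each word, then label missing values.
cleaned = titles.str.strip().str.title().fillna("Missing")

print(cleaned.tolist())
```

Note that title-casing also rewrites acronyms, which is why the output above shows "Rn", "Svp", and "Ceo".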

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [29]:
df["emp_title"].str.contains("manager", case=False).head(10)  # case=False makes the match case-insensitive.

0    False
1    False
2     True
3     True
4     True
5    False
6    False
7    False
8    False
9     True
Name: emp_title, dtype: bool

In [30]:
df["emp_title"].iloc[0:10]

0    Front Desk Supervisor
1       Executive Director
2           Office Manager
3            Store Manager
4             Area Manager
5                      Svp
6              Vp Of Sales
7                    Sales
8                       Rn
9          Project Manager
Name: emp_title, dtype: object

In [31]:
df["emp_title_manager"] = df.emp_title.str.contains("Manager")
df["emp_title_manager"].value_counts()

False    99713
True     15962
Name: emp_title_manager, dtype: int64
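
Two details are worth knowing when building a flag like this on a column that has not been cleaned yet: `str.contains` returns a missing value for missing inputs unless `na` is set, and it interprets the pattern as a regular expression unless `regex=False`. A sketch on toy titles, ending with the conversion to 0/1 that many models expect:

```python
import numpy as np
import pandas as pd

# Toy titles with mixed capitalization and one missing value.
titles = pd.Series(["Office Manager", "Teacher", "Project manager", np.nan])

# Literal, case-insensitive match; missing titles count as "not a manager".
is_manager = titles.str.contains("manager", case=False, na=False, regex=False)

# Booleans become 1/0 for use as a numeric feature.
as_int = is_manager.astype(int)

print(as_int.tolist())
```

In the notebook itself the column was already cleaned, so the case-sensitive `"Manager"` match works and there are no missing values left to handle.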

In [32]:
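df.groupby("emp_title_manager").mean(numeric_only=True)  # numeric_only=True averages only the numeric columns; recent pandas versions raise an error on string columns otherwise.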

emp_title_manager,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,url,desc,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,annual_inc_joint,dti_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term,int rate
False,,,16448.182534,16448.182534,16444.838186,0.127442,475.519424,83559.458416,,,20.592776,0.22656,0.44267,36.631389,87.613775,11.61685,0.117918,17586.025112,22.963947,14567.067655,14564.236669,2499.304451,2498.678031,1861.299217,637.864935,0.140299,0.0,0.0,1017.02171,0.016628,46.328798,1.0,137541.633416,19.541955,0.0,174.940589,150180.009959,0.90927,2.760693,0.69327,1.568742,20.403991,37305.27524,67.872938,1.164984,2.458085,6295.674406,54.015357,40994.239477,1.107388,1.541043,1.865033,4.286502,14109.133703,16102.276128,49.612016,0.007471,3.473048,124.802636,177.265412,15.688697,8.783699,1.360906,26.318597,39.703281,7.666162,37.27717,0.434377,3.703158,5.497257,4.965902,7.154273,8.401051,8.234242,13.010851,5.453832,11.598628,0.0,0.0,0.058448,1.978709,94.926925,32.464631,0.117868,0.0,193665.50514,55243.302488,28729.7211,48153.360555,38804.680792,0.570886,1.644383,11.685335,55.920828,3.072208,12.682351,0.028748,0.054105,39.140069,,,,,,,,2839.0,65.0,18.0,12.744184
True,,,18064.82427,18064.82427,18062.374702,0.125701,515.844428,94385.379918,,,19.526991,0.234056,0.445558,36.798941,88.957965,12.390051,0.114773,19569.506578,24.103997,16035.217625,16033.163865,2695.393888,2694.930881,1999.294491,695.956149,0.14325,0.0,0.0,1086.683964,0.017103,46.980358,1.0,152753.20279,19.519736,0.0,186.116965,171259.574427,0.941925,2.976193,0.750345,1.699035,18.535068,39671.02863,68.288965,1.18544,2.496492,7017.051873,55.476942,44857.190828,1.192394,1.574865,1.958965,4.505137,15546.462223,17573.604256,51.098597,0.007455,0.324583,127.047887,173.629996,15.428392,8.413733,1.488849,25.113209,39.68736,7.647246,37.565189,0.449004,3.9844,5.840809,5.360293,7.527315,8.938103,8.722591,13.452262,5.798459,12.376081,0.0,0.0,0.055256,2.077559,94.87147,33.528227,0.114773,0.0,218939.590465,59683.425636,32031.919058,52231.558827,43395.927536,0.516046,1.673395,11.880435,56.491562,3.166667,12.5,0.016046,0.040373,40.067766,3.0,35.21,3.0,11.0,105.63,5636.06,5.9,,,,12.570085


## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [33]:
df["issue_d"].head().values

array(['Mar-2019', 'Mar-2019', 'Mar-2019', 'Mar-2019', 'Mar-2019'],
      dtype=object)

In [34]:
df["issue_d"].describe()

count       115675
unique           3
top       Jan-2019
freq         43584
Name: issue_d, dtype: object

In [0]:
df["issue_d"] = pd.to_datetime(df["issue_d"], infer_datetime_format=True)

In [36]:
df["issue_d"].describe()

count                  115675
unique                      3
top       2019-01-01 00:00:00
freq                    43584
first     2019-01-01 00:00:00
last      2019-03-01 00:00:00
Name: issue_d, dtype: object
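
In pandas 2.0 and later, `infer_datetime_format` is deprecated, so passing an explicit `format` may be the more durable choice. The following is a minimal sketch, assuming the column follows the `Mon-YYYY` pattern shown above; the `sample` Series is made up for illustration and is not the real data.

```python
import pandas as pd

# Hypothetical sample mimicking the "Mon-YYYY" strings in issue_d
sample = pd.Series(["Mar-2019", "Jan-2019", "Feb-2019"])

# %b = abbreviated month name, %Y = four-digit year
parsed = pd.to_datetime(sample, format="%b-%Y")

# Date components are available through the .dt accessor
months = parsed.dt.month
years = parsed.dt.year
print(months.tolist(), years.tolist())
```

An explicit format also makes the parse fail loudly if a value does not match the expected pattern, instead of silently guessing.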

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid"; otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

# Answers to Assignment

## Loading and looking at data

In [46]:
# Reload the data
import pandas as pd

# header=1 treats the second line of the file as the column names (skiprows=1 would work too).
# skipfooter=2 drops the last two lines and requires engine="python".
df = pd.read_csv("LoanStats_2019Q1.csv", header=1, skipfooter=2, engine="python")
print(df.shape)
df.head()

(115675, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,20000,20000,20000,60 months,17.19%,499.1,C,C5,Front desk supervisor,6 years,RENT,47000.0,Source Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,958xx,CA,14.02,0,Sep-2006,1,50.0,,15,0,10687,19.7%,53,w,19254.76,19254.76,1459.1,1459.1,745.24,713.86,0.0,0.0,0.0,Jun-2019,499.1,Jul-2019,Jun-2019,0,50.0,1,Individual,,,,0,0,31546,3,2,1,2,10.0,20859,97.0,4,9,5909,42.0,54300,6,1,3,11,2103.0,23647.0,30.0,0,0,150.0,100,1,1,0,5.0,,3.0,50.0,3,3,4,8,19,19,13,33,4,15,0.0,0,0,5,98.0,12.5,0,0,75824,31546,33800,21524,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,12000,12000,12000,36 months,16.40%,424.26,C,C4,Executive Director,4 years,MORTGAGE,95000.0,Not Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,436xx,OH,29.5,0,Nov-1992,0,29.0,,19,0,16619,64.9%,36,w,11475.92,11475.92,826.65,826.65,524.08,302.57,0.0,0.0,0.0,May-2019,424.26,Jul-2019,Jun-2019,0,30.0,1,Individual,,,,0,0,176551,1,5,2,5,7.0,35115,65.0,3,6,4899,65.0,25600,4,0,3,12,9292.0,3231.0,82.5,0,0,316.0,269,5,5,2,12.0,58.0,7.0,29.0,1,6,8,6,9,10,13,23,8,19,0.0,0,0,6,80.6,66.7,0,0,209488,51734,18500,54263,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,3000,3000,3000,36 months,14.74%,103.62,C,C2,Office Manager,4 years,MORTGAGE,58750.0,Verified,Mar-2019,Current,n,,,medical,Medical expenses,327xx,FL,30.91,0,Jun-2004,0,24.0,,16,0,20502,60.1%,25,f,2865.64,2865.64,202.33,202.33,134.36,67.97,0.0,0.0,0.0,May-2019,103.62,Jul-2019,Jun-2019,0,,1,Individual,,,,0,0,37816,0,3,0,0,27.0,17314,48.0,0,0,3962,54.0,34100,0,1,0,0,2364.0,1208.0,89.4,0,0,177.0,164,35,27,4,40.0,24.0,,24.0,0,7,12,7,8,6,13,15,12,16,0.0,0,0,0,96.0,71.4,0,0,69911,37816,11400,35811,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,35000,35000,35000,36 months,15.57%,1223.08,C,C3,Store Manager,10+ years,RENT,122000.0,Verified,Mar-2019,Current,n,,,credit_card,Credit card refinancing,333xx,FL,22.0,1,Dec-2009,0,20.0,,5,0,1441,24.4%,18,w,33459.43,33459.43,2446.76,2446.76,1540.57,845.04,61.15,0.0,0.0,Jun-2019,1223.08,Jul-2019,Jun-2019,0,20.0,1,Individual,,,,0,0,24471,0,3,0,0,26.0,23030,39.0,0,0,1441,37.0,5900,2,4,1,1,4894.0,159.0,90.1,0,0,108.0,111,52,24,0,85.0,,9.0,,1,1,1,1,3,10,2,7,1,5,0.0,0,1,0,94.4,100.0,0,0,65640,24471,1600,59740,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,5000,5000,5000,36 months,15.57%,174.73,C,C3,Area Manager,3 years,OWN,65000.0,Verified,Mar-2019,Current,n,,,house,Home buying,640xx,MO,16.28,1,Jul-2001,0,7.0,,9,0,5604,64.4%,25,w,4778.86,4778.86,340.81,340.81,221.14,119.67,0.0,0.0,0.0,May-2019,174.73,Jul-2019,Jun-2019,0,7.0,1,Individual,,,,0,0,29775,0,1,0,1,17.0,24171,82.0,1,6,1824,78.0,8700,1,0,0,7,3722.0,1070.0,75.7,0,0,111.0,212,10,10,4,17.0,69.0,17.0,69.0,4,3,5,5,9,7,8,14,5,9,0.0,0,1,1,76.0,50.0,0,0,38190,29775,4400,29490,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
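
To see why `header=1` and `skiprows=1` give the same result for this file layout, here is a small sketch on an in-memory file. The contents of `raw` are a hypothetical miniature (one notice line, a header, two data rows, two footer lines), not the real LendingClub file.

```python
import io
import pandas as pd

# Hypothetical miniature file with the same layout as the download
raw = (
    "Notes offered by Prospectus\n"
    "loan_amnt,term\n"
    "20000,60 months\n"
    "12000,36 months\n"
    "Footer line one\n"
    "Footer line two\n"
)

# Option 1: use the second line (index 1) as the header row
by_header = pd.read_csv(io.StringIO(raw), header=1, skipfooter=2, engine="python")

# Option 2: skip the first line, then the header is the first remaining line
by_skiprows = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine="python")

print(by_header.equals(by_skiprows))
```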


In [47]:
# Looking at total NaN values for each column
df.isna().sum()

id                                            115675
member_id                                     115675
loan_amnt                                          0
funded_amnt                                        0
funded_amnt_inv                                    0
term                                               0
int_rate                                           0
installment                                        0
grade                                              0
sub_grade                                          0
emp_title                                      19518
emp_length                                     11101
home_ownership                                     0
annual_inc                                         0
verification_status                                0
issue_d                                            0
loan_status                                        0
pymnt_plan                                         0
url                                           
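
With this many columns, raw counts are hard to scan. One possible follow-up is to look at the share of missing values per column and sort it. This is a sketch on a hypothetical toy frame (`toy` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data
toy = pd.DataFrame({
    "id": [np.nan, np.nan, np.nan, np.nan],
    "emp_title": ["Manager", None, "Nurse", None],
    "loan_amnt": [20000, 12000, 3000, 35000],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns that are entirely empty are candidates for dropping
all_missing = missing_share[missing_share == 1.0].index.tolist()
print(missing_share)
print(all_missing)
```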

In [88]:
# Grabbing some info about the dataframe, such as dtypes and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115675 entries, 0 to 115674
Columns: 145 entries, id to loan_status_is_great
dtypes: float64(56), int64(54), object(35)
memory usage: 128.0+ MB


In [89]:
# Transposing the head of the dataframe, which swaps the rows and columns so each column name becomes a row
df.head().T

Unnamed: 0,0,1,2,3,4
id,,,,,
member_id,,,,,
loan_amnt,20000,12000,3000,35000,5000
funded_amnt,20000,12000,3000,35000,5000
funded_amnt_inv,20000,12000,3000,35000,5000
term,60,36,36,36,36
int_rate,17.19%,16.40%,14.74%,15.57%,15.57%
installment,499.1,424.26,103.62,1223.08,174.73
grade,C,C,C,C,C
sub_grade,C5,C4,C2,C3,C3


## Converting the term column from string to an integer.

In [48]:
# Looking at the first five rows in the term column 
df.term.head()

0     60 months
1     36 months
2     36 months
3     36 months
4     36 months
Name: term, dtype: object

In [0]:
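# Remove "months" from the term column and convert the remaining strings to integers.
# Note: str.strip(" months") strips any of the characters ' ', m, o, n, t, h, s from both ends,
# which is safe here because digits are not in that set.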
df["term"] = df["term"].str.strip(" months").astype(int)

In [51]:
# Check the head again: "months" is gone and the dtype is now int64
df.term.head()

0    60
1    36
2    36
3    36
4    36
Name: term, dtype: int64
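
Because `str.strip` works on a set of characters rather than a whole word, pulling the digits out with a regular expression is another option. A minimal sketch, assuming every value contains exactly one run of digits; the `terms` Series is a made-up stand-in for `df["term"]`:

```python
import pandas as pd

# Hypothetical stand-in for the term column
terms = pd.Series([" 60 months", " 36 months", " 36 months"])

# Capture the first run of digits in each string, then cast to int
term_int = terms.str.extract(r"(\d+)", expand=False).astype(int)
print(term_int.tolist())
```

This version states the intent (keep the number) directly, so it keeps working if the surrounding text changes.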

In [90]:
# Looking at the new head of the dataframe
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,loan_status_is_great
0,,,20000,20000,20000,60,17.19%,499.1,C,C5,Front desk supervisor,6 years,RENT,47000.0,Source Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,958xx,CA,14.02,0,Sep-2006,1,50.0,,15,0,10687,19.7%,53,w,19254.76,19254.76,1459.1,1459.1,745.24,713.86,0.0,0.0,0.0,Jun-2019,499.1,Jul-2019,Jun-2019,0,50.0,1,Individual,,,,0,0,31546,3,2,1,2,10.0,20859,97.0,4,9,5909,42.0,54300,6,1,3,11,2103.0,23647.0,30.0,0,0,150.0,100,1,1,0,5.0,,3.0,50.0,3,3,4,8,19,19,13,33,4,15,0.0,0,0,5,98.0,12.5,0,0,75824,31546,33800,21524,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1
1,,,12000,12000,12000,36,16.40%,424.26,C,C4,Executive Director,4 years,MORTGAGE,95000.0,Not Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,436xx,OH,29.5,0,Nov-1992,0,29.0,,19,0,16619,64.9%,36,w,11475.92,11475.92,826.65,826.65,524.08,302.57,0.0,0.0,0.0,May-2019,424.26,Jul-2019,Jun-2019,0,30.0,1,Individual,,,,0,0,176551,1,5,2,5,7.0,35115,65.0,3,6,4899,65.0,25600,4,0,3,12,9292.0,3231.0,82.5,0,0,316.0,269,5,5,2,12.0,58.0,7.0,29.0,1,6,8,6,9,10,13,23,8,19,0.0,0,0,6,80.6,66.7,0,0,209488,51734,18500,54263,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1
2,,,3000,3000,3000,36,14.74%,103.62,C,C2,Office Manager,4 years,MORTGAGE,58750.0,Verified,Mar-2019,Current,n,,,medical,Medical expenses,327xx,FL,30.91,0,Jun-2004,0,24.0,,16,0,20502,60.1%,25,f,2865.64,2865.64,202.33,202.33,134.36,67.97,0.0,0.0,0.0,May-2019,103.62,Jul-2019,Jun-2019,0,,1,Individual,,,,0,0,37816,0,3,0,0,27.0,17314,48.0,0,0,3962,54.0,34100,0,1,0,0,2364.0,1208.0,89.4,0,0,177.0,164,35,27,4,40.0,24.0,,24.0,0,7,12,7,8,6,13,15,12,16,0.0,0,0,0,96.0,71.4,0,0,69911,37816,11400,35811,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1
3,,,35000,35000,35000,36,15.57%,1223.08,C,C3,Store Manager,10+ years,RENT,122000.0,Verified,Mar-2019,Current,n,,,credit_card,Credit card refinancing,333xx,FL,22.0,1,Dec-2009,0,20.0,,5,0,1441,24.4%,18,w,33459.43,33459.43,2446.76,2446.76,1540.57,845.04,61.15,0.0,0.0,Jun-2019,1223.08,Jul-2019,Jun-2019,0,20.0,1,Individual,,,,0,0,24471,0,3,0,0,26.0,23030,39.0,0,0,1441,37.0,5900,2,4,1,1,4894.0,159.0,90.1,0,0,108.0,111,52,24,0,85.0,,9.0,,1,1,1,1,3,10,2,7,1,5,0.0,0,1,0,94.4,100.0,0,0,65640,24471,1600,59740,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1
4,,,5000,5000,5000,36,15.57%,174.73,C,C3,Area Manager,3 years,OWN,65000.0,Verified,Mar-2019,Current,n,,,house,Home buying,640xx,MO,16.28,1,Jul-2001,0,7.0,,9,0,5604,64.4%,25,w,4778.86,4778.86,340.81,340.81,221.14,119.67,0.0,0.0,0.0,May-2019,174.73,Jul-2019,Jun-2019,0,7.0,1,Individual,,,,0,0,29775,0,1,0,1,17.0,24171,82.0,1,6,1824,78.0,8700,1,0,0,7,3722.0,1070.0,75.7,0,0,111.0,212,10,10,4,17.0,69.0,17.0,69.0,4,3,5,5,9,7,8,14,5,9,0.0,0,1,1,76.0,50.0,0,0,38190,29775,4400,29490,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1


## Adding new column loan_status_is_great

In [70]:
# Counting how often each value appears in loan_status
df["loan_status"].value_counts()

Current               109176
Fully Paid              4730
Late (31-120 days)       795
In Grace Period          538
Late (16-30 days)        260
Charged Off              176
Name: loan_status, dtype: int64

In [0]:
# Add a new column, loan_status_is_great, that is 1 where loan_status is "Current" or "Fully Paid" and 0 otherwise
import numpy as np
df["loan_status_is_great"] = np.where(df["loan_status"] == "Current", 1, 
                             np.where(df["loan_status"] == "Fully Paid", 1, 0))

In [86]:
# Checking to see if the totals are correct
df["loan_status_is_great"].value_counts()

1    113906
0      1769
Name: loan_status_is_great, dtype: int64
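
The same flag can be built without nesting conditions by testing membership in a list. A minimal sketch, using a made-up `status` Series in place of `df["loan_status"]`:

```python
import pandas as pd

# Hypothetical stand-in for the loan_status column
status = pd.Series(
    ["Current", "Fully Paid", "Late (31-120 days)", "Charged Off", "Current"]
)

# isin gives True/False per row; astype(int) turns that into 1/0
great = status.isin(["Current", "Fully Paid"]).astype(int)
print(great.tolist())
```

Adding another "great" status later only means extending the list.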

In [87]:
# head(-10) returns every row except the last 10; pandas abbreviates the printed Series to its first and last rows
df["loan_status_is_great"].head(-10)

0         1
1         1
2         1
3         1
4         1
5         1
6         1
7         1
8         1
9         1
10        1
11        1
12        1
13        1
14        1
15        1
16        1
17        1
18        1
19        1
20        1
21        1
22        1
23        1
24        1
25        1
26        1
27        1
28        1
29        1
30        1
31        1
32        1
33        1
34        1
35        1
36        1
37        1
38        1
39        1
40        1
41        1
42        1
43        1
44        1
45        1
46        1
47        1
48        1
49        1
50        1
51        1
52        1
53        1
54        1
55        1
56        1
57        1
58        1
59        1
60        1
61        1
62        1
63        1
64        1
65        1
66        1
67        1
68        1
69        1
70        1
71        1
72        1
73        1
74        1
         ..
115590    1
115591    1
115592    1
115593    1
115594    1
115595    1
115596    0
1155

## Making a last_pymnt_d_month column and a last_pymnt_d_year column

In [91]:
# Looking at the head of the last payment date column
df["last_pymnt_d"].head()

0    Jun-2019
1    May-2019
2    May-2019
3    Jun-2019
4    May-2019
Name: last_pymnt_d, dtype: object

In [92]:
# Looking at some info for the last payment date column
df["last_pymnt_d"].describe()

count       115445
unique           7
top       Jun-2019
freq        102101
Name: last_pymnt_d, dtype: object

In [0]:
df["last_pymnt_d"] = pd.to_datetime(df["last_pymnt_d"], infer_datetime_format=True)

In [95]:
df["last_pymnt_d"].describe()

count                  115445
unique                      7
top       2019-06-01 00:00:00
freq                   102101
first     2019-01-01 00:00:00
last      2019-07-01 00:00:00
Name: last_pymnt_d, dtype: object

In [0]:
# Adding a new column that shows the last payment month as a number (float, because rows with a missing date become NaN)
df["last_pymnt_d_month"] = df["last_pymnt_d"].dt.month

In [98]:
# Counting the values in the new column: 7 unique values, one for each month from January to July
df["last_pymnt_d_month"].value_counts()

6.0    102101
5.0      9503
4.0      1581
3.0      1187
2.0       864
1.0       208
7.0         1
Name: last_pymnt_d_month, dtype: int64

In [0]:
# Adding a new column that shows the last payment year as a number (float, because rows with a missing date become NaN)
df["last_pymnt_d_year"] = df["last_pymnt_d"].dt.year

In [99]:
# Looking at total count of values for last payment year which are all 2019
df["last_pymnt_d_year"].value_counts()

2019.0    115445
Name: last_pymnt_d_year, dtype: int64
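
The months and years above show up as `6.0` and `2019.0` because some rows have no payment date. If true integers are wanted, pandas' nullable `Int64` dtype is one way to keep them while preserving the missing entries. A minimal sketch on made-up dates (the `dates` Series is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for last_pymnt_d, with one missing value
dates = pd.to_datetime(pd.Series(["Jun-2019", None, "May-2019"]), format="%b-%Y")

# Default result is float, since the missing date has no month
month_float = dates.dt.month

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
month_int = dates.dt.month.astype("Int64")
print(month_int)
```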

In [100]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,,,20000,20000,20000,60,17.19%,499.1,C,C5,Front desk supervisor,6 years,RENT,47000.0,Source Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,958xx,CA,14.02,0,Sep-2006,1,50.0,,15,0,10687,19.7%,53,w,19254.76,19254.76,1459.1,1459.1,745.24,713.86,0.0,0.0,0.0,2019-06-01,499.1,Jul-2019,Jun-2019,0,50.0,1,Individual,,,,0,0,31546,3,2,1,2,10.0,20859,97.0,4,9,5909,42.0,54300,6,1,3,11,2103.0,23647.0,30.0,0,0,150.0,100,1,1,0,5.0,,3.0,50.0,3,3,4,8,19,19,13,33,4,15,0.0,0,0,5,98.0,12.5,0,0,75824,31546,33800,21524,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,6.0,2019.0
1,,,12000,12000,12000,36,16.40%,424.26,C,C4,Executive Director,4 years,MORTGAGE,95000.0,Not Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,436xx,OH,29.5,0,Nov-1992,0,29.0,,19,0,16619,64.9%,36,w,11475.92,11475.92,826.65,826.65,524.08,302.57,0.0,0.0,0.0,2019-05-01,424.26,Jul-2019,Jun-2019,0,30.0,1,Individual,,,,0,0,176551,1,5,2,5,7.0,35115,65.0,3,6,4899,65.0,25600,4,0,3,12,9292.0,3231.0,82.5,0,0,316.0,269,5,5,2,12.0,58.0,7.0,29.0,1,6,8,6,9,10,13,23,8,19,0.0,0,0,6,80.6,66.7,0,0,209488,51734,18500,54263,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
2,,,3000,3000,3000,36,14.74%,103.62,C,C2,Office Manager,4 years,MORTGAGE,58750.0,Verified,Mar-2019,Current,n,,,medical,Medical expenses,327xx,FL,30.91,0,Jun-2004,0,24.0,,16,0,20502,60.1%,25,f,2865.64,2865.64,202.33,202.33,134.36,67.97,0.0,0.0,0.0,2019-05-01,103.62,Jul-2019,Jun-2019,0,,1,Individual,,,,0,0,37816,0,3,0,0,27.0,17314,48.0,0,0,3962,54.0,34100,0,1,0,0,2364.0,1208.0,89.4,0,0,177.0,164,35,27,4,40.0,24.0,,24.0,0,7,12,7,8,6,13,15,12,16,0.0,0,0,0,96.0,71.4,0,0,69911,37816,11400,35811,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
3,,,35000,35000,35000,36,15.57%,1223.08,C,C3,Store Manager,10+ years,RENT,122000.0,Verified,Mar-2019,Current,n,,,credit_card,Credit card refinancing,333xx,FL,22.0,1,Dec-2009,0,20.0,,5,0,1441,24.4%,18,w,33459.43,33459.43,2446.76,2446.76,1540.57,845.04,61.15,0.0,0.0,2019-06-01,1223.08,Jul-2019,Jun-2019,0,20.0,1,Individual,,,,0,0,24471,0,3,0,0,26.0,23030,39.0,0,0,1441,37.0,5900,2,4,1,1,4894.0,159.0,90.1,0,0,108.0,111,52,24,0,85.0,,9.0,,1,1,1,1,3,10,2,7,1,5,0.0,0,1,0,94.4,100.0,0,0,65640,24471,1600,59740,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,6.0,2019.0
4,,,5000,5000,5000,36,15.57%,174.73,C,C3,Area Manager,3 years,OWN,65000.0,Verified,Mar-2019,Current,n,,,house,Home buying,640xx,MO,16.28,1,Jul-2001,0,7.0,,9,0,5604,64.4%,25,w,4778.86,4778.86,340.81,340.81,221.14,119.67,0.0,0.0,0.0,2019-05-01,174.73,Jul-2019,Jun-2019,0,7.0,1,Individual,,,,0,0,29775,0,1,0,1,17.0,24171,82.0,1,6,1824,78.0,8700,1,0,0,7,3722.0,1070.0,75.7,0,0,111.0,212,10,10,4,17.0,69.0,17.0,69.0,4,3,5,5,9,7,8,14,5,9,0.0,0,1,1,76.0,50.0,0,0,38190,29775,4400,29490,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
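
As a starting point for the first two LendingClub options, here is one possible approach on toy data. Everything in `toy` is invented for illustration, and `top_n` is set to 1 only so the effect is visible on three rows (the exercise asks for 20).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the LendingClub file
toy = pd.DataFrame({
    "revol_util": ["19.7%", np.nan, "64.9%"],
    "emp_title": ["Manager", "Teacher", "Manager"],
})

# Drop the trailing percent sign, then convert; missing entries stay NaN
toy["revol_util"] = pd.to_numeric(toy["revol_util"].str.rstrip("%"))

# Keep the top_n most frequent titles and label everything else "Other"
top_n = 1
top_titles = toy["emp_title"].value_counts().head(top_n).index
toy["emp_title"] = toy["emp_title"].where(toy["emp_title"].isin(top_titles), "Other")
print(toy)
```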

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01