_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-04-25 16:36:22--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [               <=>  ]  21.37M   890KB/s    in 25s     

2019-04-25 16:36:48 (882 KB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22408410]



In [2]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [7]:
!tail LoanStats_2018Q4.csv

"","","5600","5600","5600"," 36 months"," 13.56%","190.21","C","C1","","n/a","RENT","15600","Not Verified","Oct-2018","Current","n","","","credit_card","Credit card refinancing","836xx","ID","15.31","0","Aug-2012","0","","97","9","1","5996","34.5%","11","w","4950.84","4950.84","940.5","940.50","649.16","291.34","0.0","0.0","0.0","Mar-2019","190.21","Apr-2019","Mar-2019","0","","1","Individual","","","","0","0","5996","0","0","0","1","20","0","","0","2","3017","35","17400","1","0","0","3","750","4689","45.5","0","0","20","73","13","13","0","13","","20","","0","3","5","4","4","1","9","10","5","9","0","0","0","0","100","25","1","0","17400","5996","8600","0","","","","","","","","","","","","N","","","","","","","","","","","","","","","Cash","N","","","","","",""
"","","23000","23000","23000"," 36 months"," 15.02%","797.53","C","C3","Tax Consultant","10+ years","MORTGAGE","75000","Source Verified","Oct-2018","Charged Off","n","","","debt_consolidation","Debt consolidation","352xx","AL","2

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)

In [62]:
loan_stats = pd.read_csv("LoanStats_2018Q4.csv", header=1, skipfooter=2, engine='python')
loan_stats.shape

(128412, 145)
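The `header=1` and `skipfooter=2` arguments skip lines that surround the table but are not data. Here is a minimal sketch on a made-up CSV string; the notice and summary lines are invented stand-ins, not the actual LendingClub text.

```python
import io
import pandas as pd

# Hypothetical miniature file: one notice line, the header row,
# two data rows, and two trailing summary lines.
raw = (
    "Notice line that is not part of the table\n"
    "loan_amnt,term,int_rate\n"
    "10000, 36 months, 10.33%\n"
    "2500, 36 months, 13.56%\n"
    "Summary line one\n"
    "Summary line two\n"
)

# header=1 uses the second line as column names and discards the line above it.
# skipfooter=2 drops the last two lines; it requires the slower python engine.
toy = pd.read_csv(io.StringIO(raw), header=1, skipfooter=2, engine="python")
print(toy.shape)
```

The `engine='python'` argument is needed because the default C parser does not support `skipfooter`.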

In [55]:
loan_stats.isna().sum().sort_values(ascending=False)

id                                            128412
member_id                                     128412
url                                           128412
desc                                          128412
hardship_loan_status                          128411
hardship_amount                               128411
hardship_start_date                           128411
hardship_end_date                             128411
payment_plan_start_date                       128411
hardship_length                               128411
hardship_dpd                                  128411
hardship_payoff_balance_amount                128411
orig_projected_additional_accrued_interest    128411
hardship_status                               128411
hardship_last_payment_amount                  128411
hardship_reason                               128411
hardship_type                                 128411
deferral_term                                 128411
settlement_percentage                         

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can check which columns have the datatype "object", which is how pandas stores strings here.

In [56]:
loan_stats.dtypes

id                                            float64
member_id                                     float64
loan_amnt                                       int64
funded_amnt                                     int64
funded_amnt_inv                               float64
term                                           object
int_rate                                       object
installment                                   float64
grade                                          object
sub_grade                                      object
emp_title                                      object
emp_length                                     object
home_ownership                                 object
annual_inc                                    float64
verification_status                            object
issue_d                                        object
loan_status                                    object
pymnt_plan                                     object
url                         
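With 145 columns, scanning the full `dtypes` listing is tedious. As a hedged alternative, `select_dtypes` can list only the object columns. The toy frame below is an assumption for illustration, and its string columns are built with an explicit `dtype=object` so the sketch behaves the same across pandas versions.

```python
import pandas as pd

# Toy stand-in for loan_stats with one numeric and two string columns.
toy = pd.DataFrame({
    "loan_amnt": [10000, 2500],
    "term": pd.Series([" 36 months", " 60 months"], dtype=object),
    "int_rate": pd.Series([" 10.33%", " 13.56%"], dtype=object),
})

# Keep only the columns stored as Python objects (strings, in this case).
object_columns = toy.select_dtypes(include="object").columns.tolist()
print(object_columns)
```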

### Convert `int_rate`

Define a function that removes the percent sign from a string and converts the result to a float between 0 and 1.

In [0]:
def strip_percent(string):
  # ' 13.56%' -> 0.1356; float() ignores the leading space
  return float(string.strip('%'))/100

Apply the function to the `int_rate` column.

In [63]:
loan_stats['int_rate'] = loan_stats['int_rate'].apply(strip_percent)
loan_stats.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,10000,10000,10000.0,36 months,0.1033,324.23,B,B1,,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9521.66,9521.66,639.85,639.85,478.34,161.51,0.0,0.0,0.0,Feb-2019,324.23,Apr-2019,Mar-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,
1,,,2500,2500,2500.0,36 months,0.1356,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2386.02,2386.02,167.02,167.02,113.98,53.04,0.0,0.0,0.0,Feb-2019,84.92,Apr-2019,Mar-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
2,,,12000,12000,12000.0,60 months,0.1356,276.49,C,C1,,< 1 year,MORTGAGE,40000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,180xx,PA,19.23,0,Aug-2005,2,,,8,0,23195,55.5%,9,w,11716.63,11716.63,539.42,539.42,283.37,256.05,0.0,0.0,0.0,Feb-2019,276.49,Apr-2019,Mar-2019,0,,1,Individual,,,,0,0,32462,1,1,1,1,4.0,9267,103.0,2,2,9747,64.0,41800,1,0,3,3,4058.0,16601.0,54.9,0,0,4.0,149,9,4,1,10.0,,4.0,,0,5,6,5,5,1,7,7,6,8,0.0,0,0,3,100.0,60.0,0,0,50800,32462,36800,9000,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
3,,,15000,15000,14975.0,60 months,0.1447,352.69,C,C2,,,MORTGAGE,30000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,756xx,TX,41.6,0,Oct-1999,0,,,20,0,20049,37.2%,30,f,14654.57,14630.15,687.29,686.14,345.43,341.86,0.0,0.0,0.0,Feb-2019,352.69,Apr-2019,Mar-2019,0,,1,Joint App,55000.0,34.95,Source Verified,0,0,65496,1,1,0,1,16.0,4458,31.0,1,1,3616,36.0,53900,0,2,0,2,3275.0,2545.0,74.3,0,0,149.0,230,6,6,1,31.0,,,,0,4,13,4,5,8,18,21,13,20,0.0,0,0,1,100.0,50.0,0,0,118790,24507,9900,14490,29704.0,Oct-1999,0.0,0.0,16.0,48.8,0.0,15.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
4,,,16000,16000,16000.0,60 months,0.1797,406.04,D,D1,Instructional Coordinator,5 years,MORTGAGE,51000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,21.91,0,Oct-2005,1,52.0,,11,0,7326,24.8%,27,w,15664.63,15664.63,788.12,788.12,335.37,452.75,0.0,0.0,0.0,Feb-2019,406.04,Apr-2019,Mar-2019,0,52.0,1,Individual,,,,0,0,39339,3,8,2,2,2.0,32013,83.0,1,1,3881,58.0,29500,0,0,1,3,3576.0,22174.0,24.8,0,0,123.0,158,6,2,0,6.0,52.0,4.0,52.0,6,2,2,3,6,21,3,6,2,11,0.0,0,0,3,77.8,0.0,0,0,67924,39339,29500,38424,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
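The same conversion can also be written with the vectorized `.str` accessor, which passes missing values through as `NaN` instead of raising. This is a sketch on a made-up Series, not the lesson's own method.

```python
import numpy as np
import pandas as pd

# Toy values in the same shape as the percent columns, plus one missing entry.
rates = pd.Series([" 10.33%", "38%", np.nan])

# Strip whitespace, drop the trailing percent sign, then convert and rescale.
as_fraction = rates.str.strip().str.rstrip("%").astype(float) / 100
print(as_fraction)
```

A plain Python function applied with `.apply` would need its own check for missing values, because `NaN` is a float and has no `.strip` method.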


### Clean `emp_title`

Look at the first 20 titles

In [59]:
loan_stats['emp_title'].head(20)

0                                  NaN
1                                 Chef
2                                  NaN
3                                  NaN
4            Instructional Coordinator
5                  driver coordinator 
6                             Security
7                        gas attendant
8                                  NaN
9                              Manager
10    Financial Relationship Associate
11                         Postmaster 
12                    Material Handler
13                      Administrative
14                            Operator
15                  Nursing Supervisor
16                 Sale Representative
17                    Sr Sales Manager
18                                 NaN
19                            Optician
Name: emp_title, dtype: object

How often is `emp_title` null?

In [60]:
loan_stats['emp_title'].isna().sum()

20947
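A raw count is easier to judge as a share of all rows. Because `isna()` returns booleans, its mean is the fraction of missing values. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy titles: two of the four entries are missing.
titles = pd.Series(["Chef", np.nan, "Manager", np.nan])

missing_count = titles.isna().sum()   # number of missing entries
missing_share = titles.isna().mean()  # fraction of missing entries
print(missing_count, missing_share)
```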

Clean the titles (strip whitespace, normalize capitalization) and replace missing values with "Unknown".

0                      Unknown
1                         Chef
2                      Unknown
3                      Unknown
4    Instructional Coordinator
Name: emp_title, dtype: object

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [64]:
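def clean_title(title):
  # Strip stray whitespace and normalize capitalization,
  # e.g. 'gas attendant ' -> 'Gas Attendant'
  if isinstance(title, str):
    return title.strip().title()
  # Missing titles are NaN (a float), so label them explicitly
  return "Unknown"

loan_stats['emp_title'] = loan_stats['emp_title'].apply(clean_title)
loan_stats.head(20)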

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,10000,10000,10000.0,36 months,0.1033,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9521.66,9521.66,639.85,639.85,478.34,161.51,0.0,0.0,0.0,Feb-2019,324.23,Apr-2019,Mar-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,
1,,,2500,2500,2500.0,36 months,0.1356,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2386.02,2386.02,167.02,167.02,113.98,53.04,0.0,0.0,0.0,Feb-2019,84.92,Apr-2019,Mar-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
2,,,12000,12000,12000.0,60 months,0.1356,276.49,C,C1,Unknown,< 1 year,MORTGAGE,40000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,180xx,PA,19.23,0,Aug-2005,2,,,8,0,23195,55.5%,9,w,11716.63,11716.63,539.42,539.42,283.37,256.05,0.0,0.0,0.0,Feb-2019,276.49,Apr-2019,Mar-2019,0,,1,Individual,,,,0,0,32462,1,1,1,1,4.0,9267,103.0,2,2,9747,64.0,41800,1,0,3,3,4058.0,16601.0,54.9,0,0,4.0,149,9,4,1,10.0,,4.0,,0,5,6,5,5,1,7,7,6,8,0.0,0,0,3,100.0,60.0,0,0,50800,32462,36800,9000,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
3,,,15000,15000,14975.0,60 months,0.1447,352.69,C,C2,Unknown,,MORTGAGE,30000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,756xx,TX,41.6,0,Oct-1999,0,,,20,0,20049,37.2%,30,f,14654.57,14630.15,687.29,686.14,345.43,341.86,0.0,0.0,0.0,Feb-2019,352.69,Apr-2019,Mar-2019,0,,1,Joint App,55000.0,34.95,Source Verified,0,0,65496,1,1,0,1,16.0,4458,31.0,1,1,3616,36.0,53900,0,2,0,2,3275.0,2545.0,74.3,0,0,149.0,230,6,6,1,31.0,,,,0,4,13,4,5,8,18,21,13,20,0.0,0,0,1,100.0,50.0,0,0,118790,24507,9900,14490,29704.0,Oct-1999,0.0,0.0,16.0,48.8,0.0,15.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
4,,,16000,16000,16000.0,60 months,0.1797,406.04,D,D1,Instructional Coordinator,5 years,MORTGAGE,51000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,21.91,0,Oct-2005,1,52.0,,11,0,7326,24.8%,27,w,15664.63,15664.63,788.12,788.12,335.37,452.75,0.0,0.0,0.0,Feb-2019,406.04,Apr-2019,Mar-2019,0,52.0,1,Individual,,,,0,0,39339,3,8,2,2,2.0,32013,83.0,1,1,3881,58.0,29500,0,0,1,3,3576.0,22174.0,24.8,0,0,123.0,158,6,2,0,6.0,52.0,4.0,52.0,6,2,2,3,6,21,3,6,2,11,0.0,0,0,3,77.8,0.0,0,0,67924,39339,29500,38424,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
5,,,9600,9600,9600.0,36 months,0.234,373.62,E,E1,Driver Coordinator,9 years,RENT,65000.0,Not Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,265xx,WV,23.01,1,Sep-2003,0,16.0,,12,0,10678,37.5%,20,f,9223.52,9223.52,728.52,728.52,376.48,352.04,0.0,0.0,0.0,Feb-2019,373.62,Apr-2019,Mar-2019,0,,1,Individual,,,,0,66,24165,0,2,0,0,43.0,13487,33.0,1,1,2264,35.0,28500,0,0,0,1,2014.0,232.0,96.9,0,0,183.0,92,12,12,0,27.0,,15.0,30.0,0,7,9,7,8,8,10,12,9,12,0.0,0,0,1,90.0,85.7,0,0,68902,24165,7600,40402,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
6,,,4000,4000,4000.0,36 months,0.234,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3843.13,3843.13,303.56,303.56,156.87,146.69,0.0,0.0,0.0,Feb-2019,155.68,Apr-2019,Mar-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
7,,,3500,3500,3500.0,36 months,0.2089,131.67,D,D4,Gas Attendant,10+ years,MORTGAGE,40000.0,Source Verified,Dec-2018,Current,n,,,car,Car financing,078xx,NJ,9.09,0,Oct-2004,1,24.0,,4,0,1944,33.5%,18,w,3357.29,3357.29,257.25,257.25,142.71,114.54,0.0,0.0,0.0,Feb-2019,131.67,Apr-2019,Mar-2019,0,48.0,1,Joint App,59500.0,10.59,Source Verified,0,0,189794,1,1,1,1,7.0,24958,100.0,2,2,1317,87.0,5800,1,0,3,3,47449.0,3683.0,26.3,0,0,150.0,170,1,1,4,8.0,48.0,1.0,48.0,2,1,2,1,8,3,2,11,2,4,0.0,0,0,3,38.9,0.0,0,0,217000,26902,5000,25000,6902.0,May-2003,0.0,3.0,6.0,47.9,0.0,21.0,0.0,0.0,46.0,N,,,,,,,,,,,,,,,Cash,N,,,,,,
8,,,9600,9600,9600.0,36 months,0.1298,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,,,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,9158.56,9158.56,670.98,670.98,441.44,229.54,0.0,0.0,0.0,Mar-2019,323.37,Apr-2019,Mar-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
9,,,8000,8000,8000.0,36 months,0.234,311.35,E,E1,Manager,10+ years,OWN,43000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,357xx,AL,33.24,0,Jan-1995,0,,107.0,8,1,9019,81.3%,16,w,7686.27,7686.27,607.1,607.1,313.73,293.37,0.0,0.0,0.0,Feb-2019,311.35,Apr-2019,Mar-2019,0,,1,Individual,,,,0,0,169223,0,3,2,2,7.0,22059,69.0,0,0,2174,72.0,11100,1,1,1,2,21153.0,126.0,94.5,0,0,148.0,287,44,7,1,51.0,,7.0,,0,1,4,1,2,8,4,7,4,8,0.0,0,0,2,100.0,100.0,1,0,199744,31078,2300,32206,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,


In [65]:
loan_stats['emp_title_manager'] = loan_stats['emp_title'] == 'Manager'
loan_stats['emp_title_manager'].head(100)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
51    False
52    False
53    False
54    False
55    False
56    False
57    False
58    False
59    False
60    False
61    False
62    False
63    False
64    False
65    False
66    False
67    False
68    False
69    False
70    False
71    False
72    False
73    False
74    False
75    False
76    False
77    False
78    False
79    False
80    False
81    False
82    False
83  
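The equality test above only flags titles that are exactly "Manager", so a title such as "Sr Sales Manager" comes out `False`. As a hedged alternative, `str.contains` flags any title that includes the word. The titles below are a toy sample.

```python
import pandas as pd

titles = pd.Series(["Manager", "Sr Sales Manager", "Chef", "Unknown"])

# Exact match: only the first title qualifies.
exact = titles == "Manager"

# Substring match, ignoring case; na=False treats missing titles as non-matches.
contains = titles.str.contains("manager", case=False, na=False)
print(pd.DataFrame({"title": titles, "exact": exact, "contains": contains}))
```

Which definition is better depends on what the feature should capture; the substring version will also match titles like "Assistant to the Manager".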

## Work with dates

pandas documentation
- [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
# Convert the whole column at once instead of calling to_datetime on each element
loan_stats['issue_d'] = pd.to_datetime(loan_stats['issue_d'])
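Once a column holds datetimes, the `.dt` accessor exposes components such as year and month. A minimal sketch on made-up values in the same "Dec-2018" style; the explicit `format` string is an optional assumption that makes the parsing rule visible, and `%b` expects English month abbreviations under the default locale.

```python
import pandas as pd

# Toy issue dates in the month-abbreviation-and-year style used by the dataset.
issue_d = pd.Series(["Dec-2018", "Nov-2018", "Oct-2018"])

# %b is the abbreviated month name, %Y the four-digit year.
parsed = pd.to_datetime(issue_d, format="%b-%Y")

issue_year = parsed.dt.year
issue_month = parsed.dt.month
print(issue_year.tolist(), issue_month.tolist())
```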

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid", and the integer 0 otherwise.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [67]:
# ' 36 months' -> 36: take the leading number rather than stripping a set of characters
loan_stats['term'] = loan_stats['term'].str.split().str[0].astype(int)
loan_stats['term'].head()

0    36
1    36
2    60
3    60
4    60
Name: term, dtype: int64

In [70]:
# The assignment asks for integers, so cast the boolean mask to 1/0
loan_stats['loan_status_is_great'] = loan_stats['loan_status'].isin(['Current', 'Fully Paid']).astype(int)
loan_stats['loan_status_is_great'].value_counts()

1    126584
0      1828
Name: loan_status_is_great, dtype: int64

In [71]:
loan_stats['last_pymnt_d'] = pd.to_datetime(loan_stats['last_pymnt_d'])
loan_stats['last_pymnt_d'].head()

0   2019-02-01
1   2019-02-01
2   2019-02-01
3   2019-02-01
4   2019-02-01
Name: last_pymnt_d, dtype: datetime64[ns]

In [72]:
loan_stats['last_pymnt_d_month'] = loan_stats['last_pymnt_d'].dt.month
loan_stats['last_pymnt_d_year'] = loan_stats['last_pymnt_d'].dt.year
loan_stats['last_pymnt_d_month'].head()

0    2.0
1    2.0
2    2.0
3    2.0
4    2.0
Name: last_pymnt_d_month, dtype: float64
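The month comes back as `float64` because loans with no payment yet have a missing date, and a plain integer column cannot hold `NaN`. If whole-number months are preferred, one option is the nullable `Int64` dtype, sketched here on toy data:

```python
import numpy as np
import pandas as pd

# Toy payment dates with one missing value.
last_pymnt_d = pd.to_datetime(
    pd.Series(["Feb-2019", np.nan, "Mar-2019"]), format="%b-%Y"
)

month_float = last_pymnt_d.dt.month                # float64, missing value is NaN
month_int = last_pymnt_d.dt.month.astype("Int64")  # nullable integer, missing value is <NA>
print(month_float.dtype, month_int.dtype)
```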

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20.
- Take initiative and work on your own ideas!
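One possible approach to the 'Other' option combines `value_counts` with `where`. This is a sketch on a toy Series with `top_n = 2` for readability; the variable names are illustrative, and the stretch goal itself asks for the top 20.

```python
import pandas as pd

titles = pd.Series(
    ["Teacher", "Manager", "Teacher", "Chef", "Manager", "Teacher", "Optician"]
)

top_n = 2  # use 20 for the stretch goal
top_titles = titles.value_counts().nlargest(top_n).index

# Keep a title when it is among the most frequent ones, otherwise use 'Other'.
grouped = titles.where(titles.isin(top_titles), "Other")
print(grouped.tolist())
```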

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

In [75]:
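# revol_util has missing values, which strip_percent cannot handle (NaN has no .strip),
# so use the NaN-safe .str accessor instead
loan_stats['revol_util'] = loan_stats['revol_util'].str.strip('%').astype(float) / 100
loan_stats['revol_util'].head()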

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01