<a href="https://colab.research.google.com/github/MAL3X-01/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/module4-makefeatures/LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
#!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [0]:
#!unzip LoanStats_2018Q4.csv.zip

In [0]:
#!head LoanStats_2018Q4.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can check which columns have the `object` datatype, which pandas typically uses for strings.
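
As a rough sketch of that check, `select_dtypes` can list the object columns. The small DataFrame below is a hypothetical stand-in for the LendingClub data, and `dtype=object` is set explicitly so the example behaves the same regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the LendingClub data.
# dtype=object is forced so the string columns are stored as "object".
df = pd.DataFrame({
    'loan_amnt': [9600, 4000, 2500],
    'term': pd.Series([' 36 months', ' 36 months', ' 60 months'], dtype=object),
    'int_rate': pd.Series(['12.98%', '23.40%', '13.56%'], dtype=object),
})

# Show the dtype of every column
print(df.dtypes)

# Keep only the names of the columns stored as "object"
object_cols = df.select_dtypes(include='object').columns.tolist()
print(object_cols)
```

On the real data, `df.describe(exclude='number')` is another way to get a quick summary of the non-numeric columns.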

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

Apply the function to the `int_rate` column
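
A minimal sketch of these two steps, assuming the values look like `'12.98%'` as in the sample rows. The function name `remove_percent` and the example Series are made up for illustration.

```python
import pandas as pd

def remove_percent(value):
    """Strip whitespace and a trailing percent sign, then convert to float."""
    return float(value.strip().strip('%'))

# Hypothetical sample of int_rate values
int_rate = pd.Series(['12.98%', ' 23.40%', '13.56%'])

# Apply the function to every value in the Series
int_rate_clean = int_rate.apply(remove_percent)
print(int_rate_clean)
```

This version assumes there are no missing values; a column with nulls would need a check before calling `strip`.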

### Clean `emp_title`

Look at top 20 titles

How often is `emp_title` null?

Clean the title and handle missing values
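
One possible way to carry out these three steps is sketched below. The sample titles, the `clean_title` helper, and the `'Unknown'` placeholder for missing values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of emp_title values, including a missing one
emp_title = pd.Series(['Teacher', 'teacher ', ' Registered Nurse', np.nan, 'MANAGER'])

# Most frequent titles (up to 20), counting missing values too
print(emp_title.value_counts(dropna=False).head(20))

# Share of rows where the title is null
null_share = emp_title.isnull().mean()
print(null_share)

def clean_title(title):
    """Normalize whitespace and capitalization; label missing titles 'Unknown'."""
    if isinstance(title, str):
        return title.strip().title()
    return 'Unknown'

emp_title_clean = emp_title.apply(clean_title)
print(emp_title_clean.value_counts())
```

Normalizing case and whitespace first means that variants such as `'teacher '` and `'Teacher'` are counted as the same title.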

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)
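
A short sketch of how `str.contains` could produce this feature, assuming a case-insensitive match on the word "manager" is what is wanted. The sample titles are hypothetical.

```python
import pandas as pd

# Hypothetical sample of emp_title values, including a missing one
emp_title = pd.Series(['Store Manager', 'Chef', 'manager of sales', None])

# True where the title contains "manager" in any capitalization;
# na=False treats missing titles as "not a manager"
emp_title_manager = emp_title.str.contains('manager', case=False, na=False)
print(emp_title_manager)
```

The boolean result can be converted to 0/1 with `.astype(int)` if a numeric column is preferred.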

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
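
As an illustration of both links, the sketch below parses month-year strings and pulls out components with `.dt`. It assumes the dates are formatted like `'Dec-2018'`, as in the `issue_d` sample rows, and an English locale for the month abbreviations.

```python
import pandas as pd

# Hypothetical sample of issue_d values
issue_d = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# Parse abbreviated-month and four-digit-year strings into datetimes
issue_d = pd.to_datetime(issue_d, format='%b-%Y')

# Extract components through the .dt accessor
issue_year = issue_d.dt.year
issue_month = issue_d.dt.month
print(issue_year.tolist(), issue_month.tolist())
```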

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
# getting zip file from lendingclub website
#!wget -nc https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [0]:
# unzipping File
#!unzip LoanStats_2018Q4.csv.zip

In [0]:
# importing python libraries
import numpy as np
import pandas as pd

# Show up to 500 rows and 500 columns when displaying DataFrames
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [7]:
# Read the CSV, skipping the first line and the last two lines of the file
# (skipfooter requires engine='python')
df1 = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1,
                  skipfooter=2, engine='python')

print(df1.shape)
df1.head()

(128412, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,9600,9600,9600.0,36 months,12.98%,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,,,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8707.52,8707.52,1317.72,1317.72,892.48,425.24,0.0,0.0,0.0,May-2019,323.37,Jun-2019,May-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,4000,4000,4000.0,36 months,23.40%,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3596.15,3596.15,770.6,770.6,403.85,366.75,0.0,0.0,0.0,May-2019,155.68,Jun-2019,May-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2210.17,2210.17,421.78,421.78,289.83,131.95,0.0,0.0,0.0,May-2019,84.92,Jun-2019,May-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,30000,30000,30000.0,60 months,18.94%,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,20872.06,20872.06,11338.8,11338.8,9127.94,2210.86,0.0,0.0,0.0,May-2019,777.23,Jun-2019,May-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,5000,5000,5000.0,36 months,17.97%,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.1%,13,w,4567.57,4567.57,715.27,715.27,432.43,282.84,0.0,0.0,0.0,May-2019,180.69,Jun-2019,May-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [8]:
# Checking for null values for each column in descending order
df1.isnull().sum().sort_values(ascending=False)

id                                            128412
member_id                                     128412
url                                           128412
desc                                          128412
hardship_dpd                                  128411
deferral_term                                 128411
hardship_amount                               128411
hardship_start_date                           128411
hardship_end_date                             128411
payment_plan_start_date                       128411
hardship_length                               128411
orig_projected_additional_accrued_interest    128411
hardship_loan_status                          128411
hardship_reason                               128411
hardship_payoff_balance_amount                128411
hardship_last_payment_amount                  128411
hardship_type                                 128411
hardship_status                               128411
settlement_percentage                         

In [0]:
# Drop columns that contain only null values
df1 = df1.drop(['id','url','desc','member_id'], axis='columns')

In [10]:
df1.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,9600,9600,9600.0,36 months,12.98%,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8707.52,8707.52,1317.72,1317.72,892.48,425.24,0.0,0.0,0.0,May-2019,323.37,Jun-2019,May-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,4000,4000,4000.0,36 months,23.40%,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3596.15,3596.15,770.6,770.6,403.85,366.75,0.0,0.0,0.0,May-2019,155.68,Jun-2019,May-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2210.17,2210.17,421.78,421.78,289.83,131.95,0.0,0.0,0.0,May-2019,84.92,Jun-2019,May-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,30000,30000,30000.0,60 months,18.94%,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,20872.06,20872.06,11338.8,11338.8,9127.94,2210.86,0.0,0.0,0.0,May-2019,777.23,Jun-2019,May-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,5000,5000,5000.0,36 months,17.97%,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.1%,13,w,4567.57,4567.57,715.27,715.27,432.43,282.84,0.0,0.0,0.0,May-2019,180.69,Jun-2019,May-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [11]:
# Show one raw value from the term column; note the leading space
df1['term'].loc[0]

' 36 months'

In [12]:
df1['term'].isnull().sum()

0

In [0]:
# Replace the term strings (e.g. ' 36 months') with the integer number of months
df1['term'] = df1['term'].str.extract(r'(\d+)', expand=False).astype(int)

In [14]:
df1.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,9600,9600,9600.0,36,12.98%,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8707.52,8707.52,1317.72,1317.72,892.48,425.24,0.0,0.0,0.0,May-2019,323.37,Jun-2019,May-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,4000,4000,4000.0,36,23.40%,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3596.15,3596.15,770.6,770.6,403.85,366.75,0.0,0.0,0.0,May-2019,155.68,Jun-2019,May-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,2500,2500,2500.0,36,13.56%,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2210.17,2210.17,421.78,421.78,289.83,131.95,0.0,0.0,0.0,May-2019,84.92,Jun-2019,May-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,30000,30000,30000.0,60,18.94%,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,20872.06,20872.06,11338.8,11338.8,9127.94,2210.86,0.0,0.0,0.0,May-2019,777.23,Jun-2019,May-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,5000,5000,5000.0,36,17.97%,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.1%,13,w,4567.57,4567.57,715.27,715.27,432.43,282.84,0.0,0.0,0.0,May-2019,180.69,Jun-2019,May-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [15]:
df1['term'].loc[0]

36

In [16]:
# Confirm the value is now an integer
type(df1['term'].loc[0])

numpy.int64

In [17]:
df1.describe()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,annual_inc_joint,dti_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128175.0,128412.0,128412.0,56216.0,15450.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,29180.0,128412.0,16782.0,16782.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,123934.0,128412.0,108138.0,128412.0,128412.0,128412.0,128375.0,128412.0,128412.0,128412.0,128412.0,128412.0,128399.0,126721.0,126658.0,128412.0,128412.0,123934.0,128412.0,128412.0,128412.0,128412.0,126821.0,25169.0,112365.0,36782.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,125553.0,128412.0,128412.0,128412.0,128412.0,126720.0,128412.0,128412.0,128412.0,128412.0,128412.0,128412.0,16782.0,16782.0,16782.0,16782.0,16524.0,16782.0,16782.0,16782.0,16782.0,5154.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,16.0,16.0,16.0
mean,15971.321021,15971.321021,15968.498166,43.519484,463.257143,82797.33,19.933178,0.227837,0.447038,36.880337,86.130162,11.564052,0.12185,16898.0,22.677413,13215.874173,13213.710139,3610.799892,3610.004688,2680.64798,929.758408,0.269793,0.123707,0.022267,1223.614717,0.017958,46.553461,1.0,133551.6,19.226602,0.0,188.304286,146792.2,0.939507,2.760202,0.689071,1.572665,20.201519,36272.28,68.211757,1.195901,2.540783,6062.072275,54.298812,39382.18,1.114561,1.519928,1.876912,4.382698,13828.378874,15357.302957,50.048701,0.007087,0.967768,123.056925,173.063623,15.431634,8.6354,1.323155,26.013957,40.415233,7.552832,37.817275,0.461553,3.659876,5.414128,4.882129,7.065944,8.288524,8.19921,12.866702,5.386342,11.546717,0.0,0.0,0.059488,2.011642,94.659843,32.900756,0.121733,0.0,188485.2,53560.97,27439.534163,46821.8,36642.8,0.585985,1.587177,11.436003,55.878831,3.027112,12.423907,0.03617,0.062984,38.51591,3.0,378.39,3.0,22.0,1135.17,15351.85,1045.41,5702.2575,59.386875,14.8125
std,10150.384233,10150.384233,10152.16897,11.132203,285.71767,108298.5,20.143542,0.733793,0.73448,21.813805,21.880055,5.981599,0.332825,24082.55,12.129216,9570.760896,9571.809597,4348.489422,4348.426901,4204.676651,755.282796,3.621427,20.16513,3.629723,3831.057018,0.146569,21.801716,0.0,96870.01,8.141631,0.0,1569.290033,173872.7,1.145306,2.942377,0.935776,1.565118,24.86993,47263.87,23.589461,1.470123,2.524326,5932.797469,20.655736,37945.75,1.501455,2.712328,2.296951,3.209299,17487.605341,19792.599023,28.791276,0.095924,112.158421,56.581697,100.200199,19.053976,9.573207,1.713149,34.306721,22.30527,6.057088,22.023835,1.349412,2.448079,3.439644,3.205305,4.517165,7.389195,4.958905,7.873911,3.38838,5.977599,0.0,0.0,0.410652,1.880559,8.989288,34.899647,0.332552,0.0,196553.6,55993.35,26377.282557,49574.91,32525.66,0.936053,1.801878,6.690119,26.071241,3.254318,8.190067,0.347726,0.364083,23.659436,,,,,,,,3536.322386,8.346258,5.431007
min,1000.0,1000.0,725.0,36.0,30.48,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,9000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,378.39,3.0,22.0,1135.17,15351.85,1045.41,699.12,45.0,1.0
25%,8000.0,8000.0,8000.0,36.0,253.63,47058.0,11.76,0.0,0.0,19.0,72.0,7.0,0.0,5599.0,14.0,5851.01,5851.01,1552.5,1552.5,964.04,371.9475,0.0,0.0,0.0,264.22,0.0,29.0,1.0,87394.0,13.2,0.0,0.0,27465.5,0.0,1.0,0.0,0.0,6.0,8194.0,54.0,0.0,1.0,2291.0,40.0,17000.0,0.0,0.0,0.0,2.0,2927.0,3006.0,26.4,0.0,0.0,84.0,98.0,4.0,3.0,0.0,6.0,23.0,3.0,20.0,0.0,2.0,3.0,3.0,4.0,3.0,5.0,7.0,3.0,7.0,0.0,0.0,0.0,1.0,92.3,0.0,0.0,0.0,53159.75,20091.0,10100.0,15000.0,16104.0,0.0,0.0,7.0,36.4,1.0,7.0,0.0,0.0,19.0,3.0,378.39,3.0,22.0,1135.17,15351.85,1045.41,3158.0,53.7575,12.0
50%,14000.0,14000.0,14000.0,36.0,382.905,68000.0,17.99,0.0,0.0,34.0,90.0,10.0,0.0,11199.5,21.0,10999.79,10995.955,2472.76,2471.89,1565.91,711.77,0.0,0.0,0.0,412.88,0.0,47.0,1.0,117000.0,18.655,0.0,0.0,74985.0,1.0,2.0,0.0,1.0,13.0,22990.0,71.0,1.0,2.0,4583.0,55.0,29700.0,1.0,0.0,1.0,4.0,7048.0,8628.0,48.9,0.0,0.0,130.0,155.0,9.0,6.0,1.0,15.0,38.0,6.0,35.0,0.0,3.0,5.0,4.0,6.0,6.0,7.0,11.0,5.0,10.0,0.0,0.0,0.0,2.0,100.0,25.0,0.0,0.0,117700.0,38469.0,19800.0,34605.0,28585.0,0.0,1.0,10.0,57.5,2.0,11.0,0.0,0.0,37.0,3.0,378.39,3.0,22.0,1135.17,15351.85,1045.41,4985.0,65.0,18.0
75%,21600.0,21600.0,21600.0,60.0,622.68,99000.0,25.3,0.0,1.0,53.0,104.0,15.0,0.0,20563.0,29.0,18744.48,18744.48,4119.0275,4117.61,2663.31,1280.62,0.0,0.0,0.0,688.13,0.0,64.0,1.0,158679.0,24.93,0.0,0.0,221823.5,1.0,3.0,1.0,2.0,24.0,46830.25,85.0,2.0,4.0,8013.0,69.0,49800.0,2.0,2.0,3.0,6.0,19122.5,20181.0,74.2,0.0,0.0,156.0,224.0,20.0,11.0,2.0,31.0,57.0,12.0,53.0,0.0,5.0,7.0,6.0,9.0,11.0,10.0,17.0,7.0,14.0,0.0,0.0,0.0,3.0,100.0,55.6,0.0,0.0,273606.5,68041.25,36100.0,63289.5,47228.75,1.0,3.0,15.0,76.9,4.0,16.0,0.0,0.0,58.0,3.0,378.39,3.0,22.0,1135.17,15351.85,1045.41,7224.75,65.01,18.0
max,40000.0,40000.0,40000.0,60.0,1618.24,9757200.0,999.0,24.0,5.0,160.0,119.0,94.0,6.0,2358150.0,160.0,39507.58,39507.58,43818.228393,43818.23,40000.0,7025.79,212.4,5204.43,936.7974,41253.54,8.0,162.0,1.0,6282000.0,39.99,0.0,208593.0,9971659.0,13.0,56.0,6.0,19.0,505.0,1837038.0,428.0,22.0,40.0,338224.0,188.0,1596201.0,29.0,56.0,51.0,43.0,623229.0,605996.0,183.8,7.0,31593.0,827.0,826.0,382.0,382.0,87.0,640.0,152.0,25.0,160.0,31.0,48.0,59.0,64.0,69.0,130.0,69.0,127.0,37.0,94.0,0.0,0.0,23.0,24.0,100.0,100.0,6.0,0.0,9999999.0,2622906.0,666200.0,2118996.0,1110019.0,6.0,15.0,67.0,170.1,34.0,95.0,21.0,15.0,153.0,3.0,378.39,3.0,22.0,1135.17,15351.85,1045.41,14292.0,65.12,24.0


In [18]:
df1['loan_status'].loc[0]

'Current'

In [0]:
# Flag loans whose status is 'Current' or 'Fully Paid' with 1, all others with 0
df1['loan_status_is_great'] = df1['loan_status'].isin(['Current', 'Fully Paid']).astype(int)

In [20]:
# Show the new column next to the column it was derived from
df1[['loan_status','loan_status_is_great']].head(71)

Unnamed: 0,loan_status,loan_status_is_great
0,Current,1
1,Current,1
2,Current,1
3,Current,1
4,Current,1
5,Current,1
6,Current,1
7,Current,1
8,Current,1
9,Current,1


In [21]:
df1[['loan_status','loan_status_is_great']].isnull().sum().sort_values(ascending=False)

loan_status_is_great    0
loan_status             0
dtype: int64

Make `last_pymnt_d_month` and `last_pymnt_d_year` columns


In [22]:
# Check the raw string format before converting
df1['last_pymnt_d'].head().values

array(['May-2019', 'May-2019', 'May-2019', 'May-2019', 'May-2019'],
      dtype=object)

In [23]:
df1['last_pymnt_d'].describe()

count       128253
unique           9
top       May-2019
freq        117659
Name: last_pymnt_d, dtype: object

In [0]:
# Convert the column to datetime; the raw strings look like 'May-2019'
df1['last_pymnt_d'] = pd.to_datetime(df1['last_pymnt_d'], format='%b-%Y')

In [25]:
df1['last_pymnt_d'].head().values

array(['2019-05-01T00:00:00.000000000', '2019-05-01T00:00:00.000000000',
       '2019-05-01T00:00:00.000000000', '2019-05-01T00:00:00.000000000',
       '2019-05-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [26]:
df1['last_pymnt_d'].describe()

count                  128253
unique                      9
top       2019-05-01 00:00:00
freq                   117659
first     2018-10-01 00:00:00
last      2019-06-01 00:00:00
Name: last_pymnt_d, dtype: object

In [0]:
# Create two new columns using the .dt accessor
df1['last_pymnt_d_month'] = df1['last_pymnt_d'].dt.month
df1['last_pymnt_d_year'] = df1['last_pymnt_d'].dt.year

In [28]:
# Show the two new columns
df1[['last_pymnt_d_year','last_pymnt_d_month']].head()

Unnamed: 0,last_pymnt_d_year,last_pymnt_d_month
0,2019.0,5.0
1,2019.0,5.0
2,2019.0,5.0
3,2019.0,5.0
4,2019.0,5.0


In [29]:
df1[['last_pymnt_d_year','last_pymnt_d_month']].isnull().sum().sort_values(ascending=False)

last_pymnt_d_month    159
last_pymnt_d_year     159
dtype: int64
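
The 159 missing payment dates are why the new columns display as floats such as `2019.0` and `5.0`: a missing date becomes `NaT`, its month and year become `NaN`, and `NaN` forces a float column. If integer values are preferred, one option is pandas' nullable `Int64` dtype. The sketch below uses a hypothetical three-row sample rather than the real column.

```python
import pandas as pd

# Hypothetical sample with one missing payment date
last_pymnt_d = pd.to_datetime(
    pd.Series(['May-2019', None, 'Jun-2019']), format='%b-%Y'
)

# With a missing date present, the month comes back as float64 (NaN for the gap)
month_float = last_pymnt_d.dt.month
print(month_float.dtype)

# Nullable integers keep whole numbers and mark the gap as <NA>
month_int = last_pymnt_d.dt.month.astype('Int64')
print(month_int)
```

Whether to keep the gaps, fill them, or drop those rows depends on how the feature will be used downstream.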

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20.
- Take initiative and work on your own ideas!
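
A possible starting point for the first two LendingClub options is sketched below. It assumes `revol_util` is the other percent column (it holds values such as `11.5%` in the sample rows shown earlier) and uses a tiny hypothetical DataFrame, so `top_n` is set to 1 here instead of 20.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; the real data has far more rows and titles
df = pd.DataFrame({
    'revol_util': ['11.5%', np.nan, '19.2%'],
    'emp_title': ['Teacher', 'Chef', 'Teacher'],
})

# Strip the percent sign and convert to float; missing values stay NaN
df['revol_util'] = pd.to_numeric(df['revol_util'].str.strip().str.rstrip('%'))

# Keep the most frequent titles and label everything else 'Other'
top_n = 1  # use 20 on the full dataset
top_titles = df['emp_title'].value_counts().head(top_n).index
df['emp_title'] = df['emp_title'].where(df['emp_title'].isin(top_titles), 'Other')

print(df)
```

Using vectorized string methods instead of `apply` means missing values pass through without a separate null check.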

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

# **Stretch option Lending Club**

In [30]:
df1.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,9600,9600,9600.0,36,12.98%,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8707.52,8707.52,1317.72,1317.72,892.48,425.24,0.0,0.0,0.0,2019-05-01,323.37,Jun-2019,May-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
1,4000,4000,4000.0,36,23.40%,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3596.15,3596.15,770.6,770.6,403.85,366.75,0.0,0.0,0.0,2019-05-01,155.68,Jun-2019,May-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
2,2500,2500,2500.0,36,13.56%,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2210.17,2210.17,421.78,421.78,289.83,131.95,0.0,0.0,0.0,2019-05-01,84.92,Jun-2019,May-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
3,30000,30000,30000.0,60,18.94%,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,20872.06,20872.06,11338.8,11338.8,9127.94,2210.86,0.0,0.0,0.0,2019-05-01,777.23,Jun-2019,May-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
4,5000,5000,5000.0,36,17.97%,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.1%,13,w,4567.57,4567.57,715.27,715.27,432.43,282.84,0.0,0.0,0.0,2019-05-01,180.69,Jun-2019,May-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0


In [31]:
df1['int_rate'].loc[0]

' 12.98%'

In [0]:
# Strip surrounding whitespace, then the percent sign, before converting to float
df1['int_rate'] = df1['int_rate'].apply(lambda x: float(x.strip().strip('%')))

In [33]:
df1['int_rate'].head()

0    12.98
1    23.40
2    13.56
3    18.94
4    17.97
Name: int_rate, dtype: float64
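
The same conversion can also be written with pandas' vectorized string methods instead of `apply`. The following is a minimal sketch on a small made-up Series (the sample values and the names `rates` and `rates_clean` are assumptions for illustration, not part of the LendingClub data):

```python
import pandas as pd

# Hypothetical sample mimicking raw percent strings, including a missing entry
rates = pd.Series([' 12.98%', ' 23.40%', '38%', None])

# Strip surrounding whitespace, drop the trailing percent sign, then cast to float.
# The .str methods skip missing values, which come out as NaN after the cast.
rates_clean = rates.str.strip().str.rstrip('%').astype(float)
print(rates_clean)
```

Because missing values pass through untouched, this form should not need a separate string conversion step for columns that contain nulls.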

In [34]:
df1['revol_util'].head(20)

0     11.5%
1     19.2%
2     10.3%
3     24.2%
4     19.1%
5       38%
6       13%
7     81.3%
8       32%
9     55.5%
10    73.7%
11    24.8%
12    37.2%
13    65.5%
14    83.1%
15    37.5%
16    33.5%
17    50.9%
18    93.7%
19    39.8%
Name: revol_util, dtype: object

In [0]:
df1['revol_util'] = df1['revol_util'].apply(lambda x: str(x))

In [0]:
# The string 'nan' (from missing values) converts to a float NaN
df1['revol_util'] = df1['revol_util'].apply(lambda x: float(x.strip().strip('%')))

In [37]:
df1['revol_util'].head()

0    11.5
1    19.2
2    10.3
3    24.2
4    19.1
Name: revol_util, dtype: float64

In [38]:
df1['revol_util'].isnull().sum()

156

In [39]:
# Inspect row 234, one of the rows where revol_util is missing
df1[['revol_bal', 'total_rev_hi_lim', 'revol_util']].loc[234]

revol_bal           0.0
total_rev_hi_lim    0.0
revol_util          NaN
Name: 234, dtype: float64

In [0]:
# Recompute utilization as balance / credit limit, expressed as a percentage
df1['revol_util'] = (df1['revol_bal'] / df1['total_rev_hi_lim']) * 100
# 0 / 0 yields NaN when both balance and limit are zero; treat that as 0% utilization
df1['revol_util'] = df1['revol_util'].fillna(0)

In [41]:
df1['revol_util'][234]

0.0

In [42]:
df1['revol_util'].isnull().sum()

0

In [43]:
df1['revol_bal'].isnull().sum()

0

In [44]:
df1['total_rev_hi_lim'].isnull().sum()

0

In [45]:
df1[['revol_bal', 'total_rev_hi_lim', 'revol_util']].head(15)

Unnamed: 0,revol_bal,total_rev_hi_lim,revol_util
0,748,6500,11.507692
1,5199,27100,19.184502
2,4341,42000,10.335714
3,12315,50800,24.242126
4,4599,24100,19.082988
5,9082,23900,38.0
6,976,7500,13.013333
7,9019,11100,81.252252
8,19077,59600,32.008389
9,23195,41800,55.490431

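The recomputation above depends on how pandas handles division by zero. As a rough illustration on a tiny made-up frame (the two rows and the names `demo`, `util`, and `util_filled` are assumptions, not taken from the dataset), a zero balance divided by a zero limit gives NaN rather than an error, and `fillna(0)` then turns it into 0:

```python
import pandas as pd

# Hypothetical rows: a normal account, and one with zero balance and zero limit
demo = pd.DataFrame({'revol_bal': [748, 0], 'total_rev_hi_lim': [6500, 0]})

# Utilization as a percentage of the credit limit
util = demo['revol_bal'] / demo['total_rev_hi_lim'] * 100

# 0 / 0 produces NaN in pandas rather than raising; replace it with 0
util_filled = util.fillna(0)
print(util_filled)
```

Note that a nonzero balance over a zero limit would produce `inf`, which `fillna` does not touch, so it may be worth checking for that case separately.
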

In [46]:
df1.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,9600,9600,9600.0,36,12.98,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.507692,23,w,8707.52,8707.52,1317.72,1317.72,892.48,425.24,0.0,0.0,0.0,2019-05-01,323.37,Jun-2019,May-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
1,4000,4000,4000.0,36,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.184502,20,w,3596.15,3596.15,770.6,770.6,403.85,366.75,0.0,0.0,0.0,2019-05-01,155.68,Jun-2019,May-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
2,2500,2500,2500.0,36,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.335714,34,w,2210.17,2210.17,421.78,421.78,289.83,131.95,0.0,0.0,0.0,2019-05-01,84.92,Jun-2019,May-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
3,30000,30000,30000.0,60,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.242126,44,w,20872.06,20872.06,11338.8,11338.8,9127.94,2210.86,0.0,0.0,0.0,2019-05-01,777.23,Jun-2019,May-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
4,5000,5000,5000.0,36,17.97,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.082988,13,w,4567.57,4567.57,715.27,715.27,432.43,282.84,0.0,0.0,0.0,2019-05-01,180.69,Jun-2019,May-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0


In [47]:
df1['emp_title'].value_counts(dropna=False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

In [0]:
df1['emp_title'] = df1['emp_title'].apply(lambda x: str(x))

In [49]:
df1['emp_title'][0]

'nan'

In [0]:
df1['emp_title'] = df1['emp_title'].apply(lambda x: 'Unknown' if x == 'nan' else x)

In [0]:
df1['emp_title'] = df1['emp_title'].str.capitalize()

In [0]:
counts = df1['emp_title'].value_counts().head(20)

In [53]:
counts.head(20)

Unknown               20947
Teacher                2373
Manager                2130
Registered nurse       1261
Driver                 1156
Supervisor              867
Rn                      806
Truck driver            751
Sales                   741
Office manager          718
General manager         690
Project manager         644
Owner                   567
Operations manager      456
Sales manager           442
Director                435
Nurse                   392
Engineer                368
Police officer          354
Store manager           337
Name: emp_title, dtype: int64
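
The cleanup steps above (string conversion, replacing `'nan'`, capitalizing) can be condensed into one chained expression. The sketch below uses a small made-up Series; the sample titles and the names `titles` and `cleaned` are assumptions for illustration. It also adds `str.strip()`, on the guess that entries such as the two separate "Supervisor" rows in the raw counts may differ only by stray whitespace:

```python
import pandas as pd

# Hypothetical raw titles with inconsistent case, stray whitespace, and a missing value
titles = pd.Series(['Manager', 'manager', 'Supervisor ', 'supervisor', None])

cleaned = (
    titles
    .fillna('Unknown')   # label missing values explicitly
    .str.strip()         # 'Supervisor ' -> 'Supervisor'
    .str.capitalize()    # 'manager' -> 'Manager'
)
print(cleaned.value_counts())
```

Filling missing values first avoids the detour through the literal string `'nan'`.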

In [0]:
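def word_count(titles):
  """Return the 20 most frequent values in `titles` as (count, value) tuples."""
  # Tally how often each title occurs
  d = {}
  for title in titles:
    d[title] = d.get(title, 0) + 1

  # Build (count, title) pairs and sort so the most frequent come first
  word_freq = [(count, title) for title, count in d.items()]
  word_freq.sort(reverse=True)

  return word_freq[:20]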

In [55]:
emp = word_count(df1['emp_title'])
emp

[(20947, 'Unknown'),
 (2373, 'Teacher'),
 (2130, 'Manager'),
 (1261, 'Registered nurse'),
 (1156, 'Driver'),
 (867, 'Supervisor'),
 (806, 'Rn'),
 (751, 'Truck driver'),
 (741, 'Sales'),
 (718, 'Office manager'),
 (690, 'General manager'),
 (644, 'Project manager'),
 (567, 'Owner'),
 (456, 'Operations manager'),
 (442, 'Sales manager'),
 (435, 'Director'),
 (392, 'Nurse'),
 (368, 'Engineer'),
 (354, 'Police officer'),
 (337, 'Store manager')]
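
A similar tally is available from the standard library. As a small sketch on made-up data (the list `sample_titles` and the name `top` are assumptions for illustration), `collections.Counter` counts the values and `most_common` returns the top entries:

```python
from collections import Counter

# Hypothetical sample of already-cleaned titles
sample_titles = ['Teacher', 'Manager', 'Teacher', 'Driver', 'Teacher', 'Manager']

# most_common(n) returns (value, count) pairs, most frequent first
top = Counter(sample_titles).most_common(2)
print(top)  # [('Teacher', 3), ('Manager', 2)]
```

Note the tuple order is `(title, count)` here, the reverse of what `word_count` returns.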

In [0]:
# Each entry in emp is a (count, title) tuple; keep only the titles
emp_list = [title for count, title in emp]

In [57]:
emp_list

['Unknown',
 'Teacher',
 'Manager',
 'Registered nurse',
 'Driver',
 'Supervisor',
 'Rn',
 'Truck driver',
 'Sales',
 'Office manager',
 'General manager',
 'Project manager',
 'Owner',
 'Operations manager',
 'Sales manager',
 'Director',
 'Nurse',
 'Engineer',
 'Police officer',
 'Store manager']

In [0]:
def search_emp(x):
  # Keep the title only if it is one of the 20 most common; otherwise return None
  return x if x in emp_list else None

In [0]:
df1['emp_title'] = df1['emp_title'].apply(search_emp)

In [60]:
df1['emp_title'].value_counts()

Unknown               20947
Teacher                2373
Manager                2130
Registered nurse       1261
Driver                 1156
Supervisor              867
Rn                      806
Truck driver            751
Sales                   741
Office manager          718
General manager         690
Project manager         644
Owner                   567
Operations manager      456
Sales manager           442
Director                435
Nurse                   392
Engineer                368
Police officer          354
Store manager           337
Name: emp_title, dtype: int64

In [65]:
df1['emp_title'].isnull().sum()

91977

In [0]:
df1['emp_title'] = df1['emp_title'].fillna('Other')

In [70]:
df1['emp_title'].value_counts()

Other                 91977
Unknown               20947
Teacher                2373
Manager                2130
Registered nurse       1261
Driver                 1156
Supervisor              867
Rn                      806
Truck driver            751
Sales                   741
Office manager          718
General manager         690
Project manager         644
Owner                   567
Operations manager      456
Sales manager           442
Director                435
Nurse                   392
Engineer                368
Police officer          354
Store manager           337
Name: emp_title, dtype: int64
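
Collapsing every title outside the top 20 into `'Other'` can also be done without a helper function. The following is a minimal sketch on a made-up Series, keeping only the top 2 values so the effect is easy to see (the sample data and the names `titles`, `top_titles`, and `collapsed` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical cleaned titles
titles = pd.Series(['Teacher', 'Teacher', 'Manager', 'Manager',
                    'Chef', 'Unknown', 'Teacher'])

# Index of the most frequent values (top 2 here; the notebook uses 20)
top_titles = titles.value_counts().head(2).index

# Keep values found in top_titles and replace everything else with 'Other'
collapsed = titles.where(titles.isin(top_titles), 'Other')
print(collapsed.value_counts())
```

`Series.where` keeps entries where the condition is true and substitutes the second argument elsewhere, so the intermediate nulls and the later `fillna('Other')` are not needed in this form.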

In [71]:
df1.head(20)

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,9600,9600,9600.0,36,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.507692,23,w,8707.52,8707.52,1317.72,1317.72,892.48,425.24,0.0,0.0,0.0,2019-05-01,323.37,Jun-2019,May-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
1,4000,4000,4000.0,36,23.4,155.68,E,E1,Other,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.184502,20,w,3596.15,3596.15,770.6,770.6,403.85,366.75,0.0,0.0,0.0,2019-05-01,155.68,Jun-2019,May-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
2,2500,2500,2500.0,36,13.56,84.92,C,C1,Other,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.335714,34,w,2210.17,2210.17,421.78,421.78,289.83,131.95,0.0,0.0,0.0,2019-05-01,84.92,Jun-2019,May-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
3,30000,30000,30000.0,60,18.94,777.23,D,D2,Other,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.242126,44,w,20872.06,20872.06,11338.8,11338.8,9127.94,2210.86,0.0,0.0,0.0,2019-05-01,777.23,Jun-2019,May-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
4,5000,5000,5000.0,36,17.97,180.69,D,D1,Other,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.082988,13,w,4567.57,4567.57,715.27,715.27,432.43,282.84,0.0,0.0,0.0,2019-05-01,180.69,Jun-2019,May-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
5,10000,10000,10000.0,36,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38.0,23,w,8788.59,8788.59,1612.54,1612.54,1211.41,401.13,0.0,0.0,0.0,2019-05-01,324.23,Jun-2019,May-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
6,23000,23000,23000.0,60,20.89,620.81,D,D4,Other,5 years,RENT,68107.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,672xx,KS,0.52,0,Feb-1997,0,,,5,0,976,13.013333,10,w,21838.48,21838.48,3065.83,3065.83,1161.52,1904.31,0.0,0.0,0.0,2019-05-01,620.81,Jun-2019,May-2019,1,,1,Individual,,,,0,2693,976,0,0,0,0,36.0,0,,3,4,0,13.0,7500,2,2,4,4,195.0,3300.0,0.0,0,0,237.0,262,10,10,0,10.0,,9.0,,0,0,1,3,4,3,5,7,1,5,0.0,0,0,3,100.0,0.0,0,0,7500,976,3300,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
7,8000,8000,8000.0,36,23.4,311.35,E,E1,Manager,10+ years,OWN,43000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,357xx,AL,33.24,0,Jan-1995,0,,107.0,8,1,9019,81.252252,16,w,7192.36,7192.36,1541.15,1541.15,807.64,733.51,0.0,0.0,0.0,2019-05-01,311.35,Jun-2019,May-2019,0,,1,Individual,,,,0,0,169223,0,3,2,2,7.0,22059,69.0,0,0,2174,72.0,11100,1,1,1,2,21153.0,126.0,94.5,0,0,148.0,287,44,7,1,51.0,,7.0,,0,1,4,1,2,8,4,7,4,8,0.0,0,0,2,100.0,100.0,1,0,199744,31078,2300,32206,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
8,32075,32075,32075.0,60,11.8,710.26,B,B4,Other,10+ years,MORTGAGE,150000.0,Not Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,231xx,VA,22.21,0,Aug-2005,0,,,17,0,19077,32.008389,24,w,30061.51,30061.51,3519.76,3519.76,2013.49,1506.27,0.0,0.0,0.0,2019-05-01,710.26,Jun-2019,May-2019,0,,1,Individual,,,,0,0,272667,1,4,1,1,9.0,37558,47.0,1,1,3910,41.0,59600,1,2,0,3,16039.0,10446.0,47.8,0,0,160.0,70,4,4,2,27.0,,14.0,,0,4,10,4,4,8,12,14,10,17,0.0,0,0,2,100.0,50.0,0,0,360433,56635,20000,80125,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0
9,12000,12000,12000.0,60,13.56,276.49,C,C1,Unknown,< 1 year,MORTGAGE,40000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,180xx,PA,19.23,0,Aug-2005,2,,,8,0,23195,55.490431,9,w,11279.45,11279.45,1368.89,1368.89,720.55,648.34,0.0,0.0,0.0,2019-05-01,276.49,Jun-2019,May-2019,0,,1,Individual,,,,0,0,32462,1,1,1,1,4.0,9267,103.0,2,2,9747,64.0,41800,1,0,3,3,4058.0,16601.0,54.9,0,0,4.0,149,9,4,1,10.0,,4.0,,0,5,6,5,5,1,7,7,6,8,0.0,0,0,3,100.0,60.0,0,0,50800,32462,36800,9000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,1,5.0,2019.0


# Instacart Option

In [0]:
#!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
#!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
#%cd instacart_2017_05_01