<a href="https://colab.research.google.com/github/jazzathoth/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/module4-make-features/LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-03-28 20:53:11--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [               <=>  ]  21.29M   895KB/s    in 25s     

2019-03-28 20:53:37 (880 KB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22329081]



In [2]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [3]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [4]:
!tail LoanStats_2018Q4.csv

"","","5600","5600","5600"," 36 months"," 13.56%","190.21","C","C1","","n/a","RENT","15600","Not Verified","Oct-2018","Current","n","","","credit_card","Credit card refinancing","836xx","ID","15.31","0","Aug-2012","0","","97","9","1","5996","34.5%","11","w","5083.61","5083.61","750.29","750.29","516.39","233.90","0.0","0.0","0.0","Feb-2019","190.21","Mar-2019","Feb-2019","0","","1","Individual","","","","0","0","5996","0","0","0","1","20","0","","0","2","3017","35","17400","1","0","0","3","750","4689","45.5","0","0","20","73","13","13","0","13","","20","","0","3","5","4","4","1","9","10","5","9","0","0","0","0","100","25","1","0","17400","5996","8600","0","","","","","","","","","","","","N","","","","","","","","","","","","","","","Cash","N","","","","","",""
"","","23000","23000","23000"," 36 months"," 15.02%","797.53","C","C3","Tax Consultant","10+ years","MORTGAGE","75000","Source Verified","Oct-2018","Charged Off","n","","","debt_consolidation","Debt consolidation","352xx","AL","

In [5]:
!ls ../content

LoanStats_2018Q4.csv  LoanStats_2018Q4.csv.zip	sample_data


## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
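import pandas as pd

# The first line of the file is the "Notes offered by Prospectus" notice, so skip it.
# skipfooter drops the last two lines of the file and requires engine='python'.
df = pd.read_csv('LoanStats_2018Q4.csv', 
                 skiprows=1, 
                 skipfooter=2, 
                 engine='python')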

In [0]:
pd.options.display.max_rows = 500
pd.options.display.max_columns = 500


In [8]:
df.shape

(128412, 145)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128412 entries, 0 to 128411
Columns: 145 entries, id to settlement_term
dtypes: float64(57), int64(51), object(37)
memory usage: 142.1+ MB


In [10]:
df.isnull().sum()

id                                            128412
member_id                                     128412
loan_amnt                                          0
funded_amnt                                        0
funded_amnt_inv                                    0
term                                               0
int_rate                                           0
installment                                        0
grade                                              0
sub_grade                                          0
emp_title                                      20947
emp_length                                     11704
home_ownership                                     0
annual_inc                                         0
verification_status                                0
issue_d                                            0
loan_status                                        0
pymnt_plan                                         0
url                                           

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can summarize the columns whose datatype is "object" (strings) by excluding the numeric columns from `describe`:

In [11]:
df.describe(exclude='number')

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint,sec_app_earliest_cr_line,hardship_flag,hardship_type,hardship_reason,hardship_status,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_loan_status,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date
count,128412,128412,128412,128412,107465,116708,128412,128412,128412,128412,128412,128412,128412,128412,128412,128412,128256,128412,128250,124941,128411,128412,14848,16782,128412,1,1,1,1,1,1,1,128412,128412,2,2,2
unique,2,46,7,35,43892,11,4,3,3,6,2,12,12,880,50,644,1074,2,5,3,7,2,3,573,2,1,1,1,1,1,1,1,2,2,1,1,1
top,36 months,13.56%,A,A4,Teacher,10+ years,MORTGAGE,Not Verified,Oct-2018,Current,n,debt_consolidation,Debt consolidation,112xx,CA,Aug-2006,0%,w,Feb-2019,Mar-2019,Feb-2019,Individual,Not Verified,Aug-2006,N,INTEREST ONLY-3 MONTHS DEFERRAL,UNEMPLOYMENT,ACTIVE,Feb-2019,Apr-2019,Feb-2019,Late (16-30 days),Cash,N,Feb-2019,ACTIVE,Feb-2019
freq,88179,6976,38011,9770,2090,38826,63490,58350,46305,123768,128411,70603,70603,1370,17879,1130,1132,114498,123797,124903,125061,111630,6360,155,128411,1,1,1,1,1,1,1,102516,128410,2,2,2


In [12]:
type(df['int_rate'].str.strip('%').astype(float))

pandas.core.series.Series

In [0]:
df['int_rate'] = df['int_rate'].str.strip('%').astype(float)

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [14]:
string = '13.56%'

def remove_percent(string):
  """Strip the percent sign from a string such as '13.56%' and return a float."""
  return float(string.strip('%'))

remove_percent(string)

13.56

Apply the function to the `int_rate` column

In [0]:
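# `int_rate` was already converted to float in an earlier cell, and floats have no
# .strip() method, so only apply the function while the column still holds strings
if df['int_rate'].dtype == object:
  df['int_rate'] = df['int_rate'].apply(remove_percent)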
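A function like `remove_percent` fails on missing values, because `NaN` is a float and has no `.strip()` method. The sketch below shows one possible way to make the conversion tolerate missing values. The toy `rates` Series and the helper name `percent_to_float` are illustrative assumptions, not part of the lesson.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a percent-formatted column (hypothetical values)
rates = pd.Series(['13.56%', ' 7.02%', np.nan, '0%'])

def percent_to_float(value):
    """Return the number in a string like '13.56%', or NaN for non-strings."""
    if isinstance(value, str):
        return float(value.strip().rstrip('%'))
    return np.nan

# Element-wise version, mirroring the apply-based approach
converted = rates.apply(percent_to_float)

# Vectorized version: the .str methods leave missing values as NaN
vectorized = rates.str.strip().str.rstrip('%').astype(float)
```

Both versions give the same floats here. The vectorized form is usually shorter to write, while the function form makes the missing-value rule explicit.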

### Clean `emp_title`

Look at top 20 titles

In [18]:
df['emp_title'].value_counts().head(20)

Teacher                     2090
Manager                     1773
Registered Nurse             952
Driver                       924
RN                           726
Supervisor                   697
Sales                        580
Project Manager              526
General Manager              523
Office Manager               521
Owner                        420
Director                     402
Truck Driver                 387
Operations Manager           387
Nurse                        326
Engineer                     325
Sales Manager                304
manager                      301
Supervisor                   270
Administrative Assistant     269
Name: emp_title, dtype: int64

How often is `emp_title` null?

In [19]:
df['emp_title'].isnull().sum()

20947

Clean the title and handle missing values

In [20]:
import numpy as np
examples = ['owner', 'Supervisor ',
           ' Project Manager', np.nan]

def clean_title(title):
  if isinstance(title, str):
    return title.strip().title()
  else:
    return 'Unknown'

[clean_title(x) for x in examples]

['Owner', 'Supervisor', 'Project Manager', 'Unknown']
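As an alternative sketch, the same cleanup can be written with pandas' vectorized string methods instead of a Python function. The `titles` Series below reuses the example values from the cell above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

titles = pd.Series(['owner', 'Supervisor ', ' Project Manager', np.nan])

# .str methods skip missing values, so fill them in afterwards
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.tolist())
# ['Owner', 'Supervisor', 'Project Manager', 'Unknown']
```

Either form should give the same result on string data. The function version is easier to extend with extra rules, while the chained version avoids a Python-level loop.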

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [22]:
df['emp_title'].head(10)

0                   Chef
1             Postmaster
2         Administrative
3          It Supervisor
4               Mechanic
5           Director Coe
6        Account Manager
7     Assistant Director
8    Legal Assistant Iii
9                Unknown
Name: emp_title, dtype: object

In [23]:
df['emp_title'].value_counts(ascending=False)

Unknown                                     20947
Teacher                                      2557
Manager                                      2395
Registered Nurse                             1418
Driver                                       1258
Supervisor                                   1160
Truck Driver                                  920
Rn                                            834
Office Manager                                805
Sales                                         803
General Manager                               791
Project Manager                               720
Owner                                         625
Director                                      523
Operations Manager                            518
Sales Manager                                 500
Police Officer                                440
Nurse                                         425
Technician                                    420
Engineer                                      412


In [24]:
df['emp_title'].isnull().sum()

0

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')

In [26]:
df.shape

(128412, 146)

In [27]:
df['emp_title_manager'].value_counts(normalize=True)

False    0.860745
True     0.139255
Name: emp_title_manager, dtype: float64

In [28]:
df.groupby('emp_title_manager')['int_rate'].mean()

emp_title_manager
False    12.958054
True     12.761840
Name: int_rate, dtype: float64
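`str.contains` is case-sensitive by default and returns `NaN` for missing values. In the lesson neither matters, because `emp_title` was already title-cased and filled with 'Unknown'. On an uncleaned column, the `case` and `na` parameters are one way to handle both, as in this sketch on toy data (the `titles` values are made up for illustration).

```python
import numpy as np
import pandas as pd

titles = pd.Series(['Project Manager', 'manager', 'Teacher', np.nan])

# Default behavior: case-sensitive, and missing values stay missing
strict = titles.str.contains('Manager')

# case=False also matches 'manager'; na=False turns missing values into False
relaxed = titles.str.contains('manager', case=False, na=False)
print(relaxed.tolist())
# [True, True, False, False]
```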

In [29]:
# Fraction of values missing in each column (1.0 means entirely missing)

df.isnull().sum().sort_values(ascending=False) / len(df)

id                                            1.000000
member_id                                     1.000000
url                                           1.000000
desc                                          1.000000
hardship_length                               0.999992
hardship_amount                               0.999992
hardship_type                                 0.999992
hardship_reason                               0.999992
hardship_status                               0.999992
hardship_last_payment_amount                  0.999992
hardship_payoff_balance_amount                0.999992
orig_projected_additional_accrued_interest    0.999992
hardship_loan_status                          0.999992
hardship_dpd                                  0.999992
deferral_term                                 0.999992
payment_plan_start_date                       0.999992
hardship_end_date                             0.999992
hardship_start_date                           0.999992
settlement

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [30]:
df['issue_d'].head().values

array(['Dec-2018', 'Dec-2018', 'Dec-2018', 'Dec-2018', 'Dec-2018'],
      dtype=object)

In [31]:
df['issue_d'].describe()

count       128412
unique           3
top       Oct-2018
freq         46305
Name: issue_d, dtype: object

In [0]:
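# infer_datetime_format is deprecated since pandas 2.0; the values look like
# 'Dec-2018', so pass the format explicitly
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')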

In [33]:
df['issue_d'].head().values

array(['2018-12-01T00:00:00.000000000', '2018-12-01T00:00:00.000000000',
       '2018-12-01T00:00:00.000000000', '2018-12-01T00:00:00.000000000',
       '2018-12-01T00:00:00.000000000'], dtype='datetime64[ns]')
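A month-year string such as 'Dec-2018' is parsed to the first day of that month. The sketch below, on made-up values, shows how an explicit `format` combined with `errors='coerce'` can turn unparseable entries into `NaT` instead of raising. It assumes English month abbreviations, which is what `%b` matches under the default locale.

```python
import pandas as pd

# Toy values: two valid month-year strings, one missing, one malformed
raw = pd.Series(['Dec-2018', 'Oct-2018', None, 'not a date'])

# %b is the abbreviated month name and %Y the four-digit year
parsed = pd.to_datetime(raw, format='%b-%Y', errors='coerce')

# Components are then available through the .dt accessor
print(parsed.dt.year.tolist())
```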

In [34]:
df['issue_d'].describe()

count                  128412
unique                      3
top       2018-10-01 00:00:00
freq                    46305
first     2018-10-01 00:00:00
last      2018-12-01 00:00:00
Name: issue_d, dtype: object

In [35]:
df['issue_d'].dt.month.value_counts()

10    46305
11    41973
12    40134
Name: issue_d, dtype: int64

In [0]:
df['issue_month'] = df['issue_d'].dt.month
df['issue_year'] = df['issue_d'].dt.year

In [37]:
df.shape

(128412, 148)

In [0]:
# Values look like 'Aug-2006'; infer_datetime_format is deprecated since pandas 2.0
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%Y')

In [39]:
df['earliest_cr_line'].head()

0   2001-04-01
1   1987-06-01
2   2011-04-01
3   2006-02-01
4   2000-12-01
Name: earliest_cr_line, dtype: datetime64[ns]

In [0]:
df['days_since_first_credit_issue'] = (df['issue_d'] - df['earliest_cr_line']).dt.days

In [41]:
df['days_since_first_credit_issue'].describe()

count    128412.000000
mean       5859.891490
std        2886.535578
min        1126.000000
25%        4049.000000
50%        5266.000000
75%        7244.000000
max       25171.000000
Name: days_since_first_credit_issue, dtype: float64
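Subtracting one datetime column from another gives a timedelta column, and `.dt.days` extracts the whole number of days. The sketch below reproduces that step on two dates taken from the first rows shown above; the conversion to years by dividing by 365.25 is an added approximation, not part of the lesson.

```python
import pandas as pd

issue = pd.to_datetime(pd.Series(['2018-12-01', '2018-12-01']))
first_credit = pd.to_datetime(pd.Series(['2001-04-01', '1987-06-01']))

# Datetime minus datetime yields a timedelta64 Series
elapsed = issue - first_credit
days = elapsed.dt.days
print(days.tolist())
# [6453, 11506]

# Rough length of credit history in years (approximation)
years = days / 365.25
```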

In [0]:
df['days_since_first_credit_issue']

In [43]:
# Find the date columns by their '_d' name suffix

[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']

In [0]:
# Values look like 'Feb-2019'; missing values become NaT
for col in ['last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
  df[col] = pd.to_datetime(df[col], format='%b-%Y')

In [45]:
df.describe(include='datetime')

Unnamed: 0,issue_d,earliest_cr_line,last_pymnt_d,next_pymnt_d,last_credit_pull_d
count,128412,128412,128250,124941,128411
unique,3,644,5,3,7
top,2018-10-01 00:00:00,2006-08-01 00:00:00,2019-02-01 00:00:00,2019-03-01 00:00:00,2019-02-01 00:00:00
freq,46305,1130,123797,124903,125061
first,2018-10-01 00:00:00,1950-01-01 00:00:00,2018-10-01 00:00:00,2019-02-01 00:00:00,2018-08-01 00:00:00
last,2018-12-01 00:00:00,2015-11-01 00:00:00,2019-02-01 00:00:00,2019-04-01 00:00:00,2019-02-01 00:00:00


# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
# Convert 'term' from strings like ' 36 months' to integers by removing "months"
df['term'] = (df['term']
              .str
              .replace('months', '')
              .astype(int))

In [61]:
# Check that it worked:
df['term'].dtypes

dtype('int64')

In [62]:
df['term'].value_counts()

36    88179
60    40233
Name: term, dtype: int64
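Another way to pull the number out of a string like ' 36 months' is a regular expression, which does not depend on the exact wording or spacing around the digits. This is a sketch on toy data using `str.extract`; the `term` Series here is a stand-in, not the real column.

```python
import pandas as pd

term = pd.Series([' 36 months', ' 60 months'])

# Capture the first run of digits; expand=False returns a Series
months = term.str.extract(r'(\d+)', expand=False).astype(int)
print(months.tolist())
# [36, 60]
```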

In [63]:
df['loan_status'].value_counts()

Current               123768
Fully Paid              3438
Late (31-120 days)       509
In Grace Period          441
Late (16-30 days)        223
Charged Off               33
Name: loan_status, dtype: int64

In [73]:
(df['loan_status'] == 'Fully Paid').value_counts()

False    124974
True       3438
Name: loan_status, dtype: int64

In [0]:
status_condition = ((df['loan_status'] == 'Fully Paid') | 
                    (df['loan_status'] == 'Current'))

df['loan_status_is_great'] = status_condition.astype(int)

In [84]:
df.shape

(128412, 150)
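When a condition checks one column against several allowed values, `isin` can replace a chain of `==` comparisons joined with `|`. The sketch below applies it to a few status strings from the value counts above; the variable names are illustrative.

```python
import pandas as pd

status = pd.Series(['Current', 'Fully Paid', 'Charged Off', 'In Grace Period'])

# True where the status is one of the "great" values, then cast to 0/1
is_great = status.isin(['Current', 'Fully Paid']).astype(int)
print(is_great.tolist())
# [1, 1, 0, 0]
```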

In [79]:
# Make sure casting booleans to int works the way I think it does
bool_series = pd.Series([True, True, False, True, False, False, False], dtype='bool')
print(bool_series.astype(int))


0    1
1    1
2    0
3    1
4    0
5    0
6    0
dtype: int64


In [86]:
# Make sure the column is already in datetime format
df['last_pymnt_d'].head()

0   2019-02-01
1   2019-02-01
2   2019-02-01
3   2019-02-01
4   2019-02-01
Name: last_pymnt_d, dtype: datetime64[ns]

In [89]:
# adding last_pymnt_d(month and year) columns and populate from last_pymnt_d

df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year
df.shape

(128412, 152)
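Because `last_pymnt_d` contains missing values, `.dt.month` and `.dt.year` come back as floats (for example 2.0 and 2019.0). If integer columns are preferred, one option is pandas' nullable `Int64` dtype, sketched here on toy dates that are not taken from the dataset.

```python
import pandas as pd

last_payment = pd.to_datetime(pd.Series(['2019-02-01', None, '2018-12-01']))

# NaT becomes NaN, which forces a float dtype
month_float = last_payment.dt.month

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
month_int = last_payment.dt.month.astype('Int64')
print(month_int.tolist())
```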

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
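For the 'Other' option in the list above, one possible approach is to take the most frequent values from `value_counts` and use `where` to replace everything else. The sketch uses a toy Series and `top_n = 2` so the result is easy to check; the stretch goal would use 20.

```python
import pandas as pd

titles = pd.Series(['Teacher', 'Teacher', 'Manager', 'Manager', 'Chef', 'Driver'])

top_n = 2  # assumption for the toy data; the stretch goal asks for the top 20
top_titles = titles.value_counts().head(top_n).index

# Keep values that are in the top list, replace the rest with 'Other'
reduced = titles.where(titles.isin(top_titles), 'Other')
print(reduced.tolist())
# ['Teacher', 'Teacher', 'Manager', 'Manager', 'Other', 'Other']
```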

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
df.iloc[1].str.find('%')


In [106]:
# str.find() returns -1 when '%' is absent and NaN for non-string values, and both
# cast to True, so masking with .astype(bool) selects every column.
# Test for the substring explicitly instead.
df.loc[:, df.iloc[1].astype(str).str.contains('%')].head(10)

Unnamed: 0,revol_util
0,10.3%
1,24.2%
2,19.1%
3,78.1%
4,3.6%
5,48.1%
6,
7,69.3%
8,35.2%
9,49.8%


In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01