_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-03-28 19:27:37--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [               <=>  ]  21.29M   805KB/s    in 27s     

2019-03-28 19:28:10 (794 KB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22329081]



In [2]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [3]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [4]:
!tail LoanStats_2018Q4.csv

"","","5600","5600","5600"," 36 months"," 13.56%","190.21","C","C1","","n/a","RENT","15600","Not Verified","Oct-2018","Current","n","","","credit_card","Credit card refinancing","836xx","ID","15.31","0","Aug-2012","0","","97","9","1","5996","34.5%","11","w","5083.61","5083.61","750.29","750.29","516.39","233.90","0.0","0.0","0.0","Feb-2019","190.21","Mar-2019","Feb-2019","0","","1","Individual","","","","0","0","5996","0","0","0","1","20","0","","0","2","3017","35","17400","1","0","0","3","750","4689","45.5","0","0","20","73","13","13","0","13","","20","","0","3","5","4","4","1","9","10","5","9","0","0","0","0","100","25","1","0","17400","5996","8600","0","","","","","","","","","","","","N","","","","","","","","","","","","","","","Cash","N","","","","","",""
"","","23000","23000","23000"," 36 months"," 15.02%","797.53","C","C3","Tax Consultant","10+ years","MORTGAGE","75000","Source Verified","Oct-2018","Charged Off","n","","","debt_consolidation","Debt consolidation","352xx","AL","

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd
import numpy as np

df = pd.read_csv('LoanStats_2018Q4.csv',skiprows=1, skipfooter=2, engine='python')

## engine='python' is needed because skipfooter is not supported by the default C engine (pandas warns otherwise)

In [6]:
df.shape

(128412, 145)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128412 entries, 0 to 128411
Columns: 145 entries, id to settlement_term
dtypes: float64(57), int64(51), object(37)
memory usage: 142.1+ MB


In [0]:
# The first line of the file is a note, not the header. Without skiprows=1,
# pandas treats that note as the column names and everything after it as rows,
# which shows up when you first look at the shape and the info.
#
# Looking at the tail showed that the end of the file also needs handling,
# which is what skipfooter is for.
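
To see what `skiprows` and `skipfooter` do without downloading the full file, here is a minimal sketch on an in-memory CSV. The note line and the two footer lines are made-up stand-ins, not the real contents of `LoanStats_2018Q4.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for the file layout: a note line, the header row,
# two loan rows, and two trailing lines that are not loan records.
raw = (
    "Notes offered by Prospectus\n"
    '"loan_amnt","term"\n'
    '"2500"," 36 months"\n'
    '"30000"," 60 months"\n'
    "Hypothetical footer line 1\n"
    "Hypothetical footer line 2\n"
)

# skiprows=1 drops the note so the second line becomes the header;
# skipfooter=2 drops the trailing lines (python engine only).
sample = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine='python')
print(sample.shape)
print(list(sample.columns))
```

Only the two loan rows survive, and the column names come from the real header row rather than from the note.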

In [0]:
df.isnull().sum().sort_values()

In [9]:
## Every column appears to have at least 2 nulls, which is odd.
## id is always null, which is also odd.

pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

## These display options show every column and row instead of truncating large datasets with '...'
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2386.02,2386.02,167.02,167.02,113.98,53.04,0.0,0.0,0.0,Feb-2019,84.92,Mar-2019,Feb-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
1,,,30000,30000,30000.0,60 months,18.94%,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29387.75,29387.75,1507.11,1507.11,612.25,894.86,0.0,0.0,0.0,Feb-2019,777.23,Mar-2019,Feb-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
2,,,5000,5000,5000.0,36 months,17.97%,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.1%,13,w,4787.21,4787.21,353.89,353.89,212.79,141.1,0.0,0.0,0.0,Feb-2019,180.69,Mar-2019,Feb-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
3,,,4000,4000,4000.0,36 months,18.94%,146.51,D,D2,IT Supervisor,10+ years,MORTGAGE,92000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,985xx,WA,16.74,0,Feb-2006,0,,,10,0,5468,78.1%,13,w,3831.93,3831.93,286.71,286.71,168.07,118.64,0.0,0.0,0.0,Feb-2019,146.51,Mar-2019,Feb-2019,0,,1,Individual,,,,0,686,305049,1,5,3,5,5.0,30683,68.0,0,0,3761,70.0,7000,2,4,3,5,30505.0,1239.0,75.2,0,0,62.0,154,64,5,3,64.0,,5.0,,0,1,2,1,2,7,2,3,2,10,0.0,0,0,3,100.0,100.0,0,0,385183,36151,5000,44984,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
4,,,30000,30000,30000.0,60 months,16.14%,731.78,C,C4,Mechanic,10+ years,MORTGAGE,57250.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,212xx,MD,26.35,0,Dec-2000,0,,,12,0,829,3.6%,26,w,29339.02,29339.02,1423.21,1423.21,660.98,762.23,0.0,0.0,0.0,Feb-2019,731.78,Mar-2019,Feb-2019,0,,1,Individual,,,,0,0,116007,3,5,3,5,4.0,28845,89.0,2,4,516,54.0,23100,1,0,0,9,9667.0,8471.0,8.9,0,0,53.0,216,2,2,2,2.0,,13.0,,0,2,2,3,8,9,6,15,2,12,0.0,0,0,5,92.3,0.0,0,0,157548,29674,9300,32332,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,


## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can list the columns whose datatype is "object" (strings):

In [10]:
df.describe(exclude='number') ## show which features are strings

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint,sec_app_earliest_cr_line,hardship_flag,hardship_type,hardship_reason,hardship_status,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_loan_status,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date
count,128412,128412,128412,128412,107465,116708,128412,128412,128412,128412,128412,128412,128412,128412,128412,128412,128256,128412,128250,124941,128411,128412,14848,16782,128412,1,1,1,1,1,1,1,128412,128412,2,2,2
unique,2,46,7,35,43892,11,4,3,3,6,2,12,12,880,50,644,1074,2,5,3,7,2,3,573,2,1,1,1,1,1,1,1,2,2,1,1,1
top,36 months,13.56%,A,A4,Teacher,10+ years,MORTGAGE,Not Verified,Oct-2018,Current,n,debt_consolidation,Debt consolidation,112xx,CA,Aug-2006,0%,w,Feb-2019,Mar-2019,Feb-2019,Individual,Not Verified,Aug-2006,N,INTEREST ONLY-3 MONTHS DEFERRAL,UNEMPLOYMENT,ACTIVE,Feb-2019,Apr-2019,Feb-2019,Late (16-30 days),Cash,N,Feb-2019,ACTIVE,Feb-2019
freq,88179,6976,38011,9770,2090,38826,63490,58350,46305,123768,128411,70603,70603,1370,17879,1130,1132,114498,123797,124903,125061,111630,6360,155,128411,1,1,1,1,1,1,1,102516,128410,2,2,2


In [0]:
## Some columns that look numeric (for example int_rate and revol_util) are stored as strings
## because their values include characters such as '%'
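
If you only need the names of the non-numeric columns, `select_dtypes` returns them directly. This is a small sketch on a toy frame; `toy` and its values are invented for illustration.

```python
import pandas as pd

# Toy frame with one numeric column and two string columns.
toy = pd.DataFrame({
    'loan_amnt': [2500, 30000],
    'int_rate': ['13.56%', '18.94%'],
    'grade': ['C', 'D'],
})

# Same filter as describe(exclude='number'), but returning column names.
string_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(string_cols)
```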

### Convert `int_rate`



Directly manipulate the 'int_rate' column

In [0]:
type(float('13.56%'.strip('%')))

float

In [0]:
## strip percentage signs and change type to float
df['int_rate'] = df['int_rate'].str.strip('%').astype(float)

In [0]:
df['int_rate'].head()

0    13.56
1    18.94
2    17.97
3    18.94
4    16.14
Name: int_rate, dtype: float64

Define a function to remove percent signs from strings and convert to floats

In [0]:
string = '13.56%'

def remove_percent(string):
  return float(string.strip('%'))

remove_percent(string)

13.56

In [0]:
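## Alternative to the .str.strip cell above: run one or the other, not both.
## Once int_rate is already float, remove_percent fails because floats have no .strip
df['int_rate'] = df['int_rate'].apply(remove_percent)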

In [0]:
df['int_rate'].head()

0    13.56
1    18.94
2    17.97
3    18.94
4    16.14
Name: int_rate, dtype: float64

The cell above applies the function to the `int_rate` column.
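
The two approaches behave differently when a column has missing values. The sketch below compares them on a toy Series; `remove_percent_safe` is a hypothetical variant, not part of the lesson code.

```python
import numpy as np
import pandas as pd

rates = pd.Series(['13.56%', ' 18.94%', np.nan])

# Vectorized: .str methods skip missing values, so NaN stays NaN.
vectorized = rates.str.strip().str.strip('%').astype(float)

def remove_percent_safe(value):
    # Hypothetical variant of remove_percent: only strings are converted,
    # anything else (NaN, or an already-converted float) passes through.
    if isinstance(value, str):
        return float(value.strip().strip('%'))
    return value

applied = rates.apply(remove_percent_safe)
print(vectorized.tolist())
print(applied.tolist())
```

A guard like this also makes the function safe to re-run on a column that has already been converted.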

### Clean `emp_title`

Look at top 20 titles

In [0]:
df['emp_title'].value_counts().head(20)

Teacher                     2090
Manager                     1773
Registered Nurse             952
Driver                       924
RN                           726
Supervisor                   697
Sales                        580
Project Manager              526
General Manager              523
Office Manager               521
Owner                        420
Director                     402
Truck Driver                 387
Operations Manager           387
Nurse                        326
Engineer                     325
Sales Manager                304
manager                      301
Supervisor                   270
Administrative Assistant     269
Name: emp_title, dtype: int64

In [0]:
## The counts show trailing spaces, NaNs, misspellings, and casing differences,
## and some roles could be aggregated, so the column needs cleaning
df['emp_title'].isnull().sum()

20947

In [0]:
## Plan: use consistent casing and spacing, and replace NaNs with 'Unknown'
examples = ['owner', 'Supervisor  ', ' Project Manager', np.nan]

def clean_title(x):
  if isinstance(x, str):
    return x.strip().title()
  else:
    return 'Unknown'
  
[clean_title(x) for x in examples]

## Check the cleaning function on a small list of examples before applying it to the column



['Owner', 'Supervisor', 'Project Manager', 'Unknown']
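
The same cleaning can also be written with vectorized string methods instead of `apply`. This is a sketch of an equivalent, run on the same four examples:

```python
import numpy as np
import pandas as pd

titles = pd.Series(['owner', 'Supervisor  ', ' Project Manager', np.nan])

# strip whitespace, normalize casing, then fill the missing values
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.tolist())
```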

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [0]:
df['emp_title'].head(10)

0                   Chef
1             Postmaster
2         Administrative
3          It Supervisor
4               Mechanic
5           Director Coe
6        Account Manager
7     Assistant Director
8    Legal Assistant Iii
9                Unknown
Name: emp_title, dtype: object

How often is `emp_title` null? After cleaning, missing titles are counted as `Unknown`.

In [0]:
df['emp_title'].value_counts().head(20)

Unknown               20947
Teacher                2557
Manager                2395
Registered Nurse       1418
Driver                 1258
Supervisor             1160
Truck Driver            920
Rn                      834
Office Manager          805
Sales                   803
General Manager         791
Project Manager         720
Owner                   625
Director                523
Operations Manager      518
Sales Manager           500
Police Officer          440
Nurse                   425
Technician              420
Engineer                412
Name: emp_title, dtype: int64

The cells above clean the title and handle missing values.

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')
df['emp_title_manager'].value_counts(normalize=True)

False    0.860745
True     0.139255
Name: emp_title_manager, dtype: float64

In [0]:
df.groupby('emp_title_manager')['int_rate'].mean()

emp_title_manager
False    12.958054
True     12.761840
Name: int_rate, dtype: float64
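
`str.contains` is case-sensitive by default, so it only works cleanly here because the titles were normalized first. On raw titles, the `case` and `na` parameters make the behavior explicit, as this sketch on a toy Series shows:

```python
import numpy as np
import pandas as pd

titles = pd.Series(['Project Manager', 'manager', 'Teacher', np.nan])

# Case-sensitive match; na=False counts missing titles as "not a manager".
sensitive = titles.str.contains('Manager', na=False)

# Case-insensitive match catches 'manager' as well.
relaxed = titles.str.contains('manager', case=False, na=False)

print(sensitive.tolist())
print(relaxed.tolist())
```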

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [57]:
df['issue_d'].head().values

array(['Dec-2018', 'Dec-2018', 'Dec-2018', 'Dec-2018', 'Dec-2018'],
      dtype=object)

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)

In [59]:
df['issue_d'].head().values

array(['2018-12-01T00:00:00.000000000', '2018-12-01T00:00:00.000000000',
       '2018-12-01T00:00:00.000000000', '2018-12-01T00:00:00.000000000',
       '2018-12-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [60]:
df['issue_d'].describe()

count                  128412
unique                      3
top       2018-10-01 00:00:00
freq                    46305
first     2018-10-01 00:00:00
last      2018-12-01 00:00:00
Name: issue_d, dtype: object
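
An alternative to letting pandas infer the layout is to pass it explicitly with `format`. This is a sketch under the assumption that every value looks like `Dec-2018`; `infer_datetime_format` is deprecated in recent pandas releases, so an explicit format also avoids a warning there. Note that `%b` matches English month abbreviations in the default locale.

```python
import pandas as pd

issue = pd.Series(['Dec-2018', 'Oct-2018', None])

# '%b-%Y' = abbreviated month name, hyphen, four-digit year.
# Missing values become NaT instead of raising.
parsed = pd.to_datetime(issue, format='%b-%Y')
print(parsed)
```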

In [0]:
## The .dt accessor exposes date components such as the month,
## which can be stored as new columns and analyzed

df['issue_d'].dt.month.head()
df['issue_month'] = df['issue_d'].dt.month

In [62]:
df['issue_month'].head()

0    12
1    12
2    12
3    12
4    12
Name: issue_month, dtype: int64

In [0]:
## Repeat the conversion for another date column, used below for date arithmetic

df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], infer_datetime_format=True)

In [64]:
df['earliest_cr_line'].head()

0   2001-04-01
1   1987-06-01
2   2011-04-01
3   2006-02-01
4   2000-12-01
Name: earliest_cr_line, dtype: datetime64[ns]

In [0]:
## Time between the borrower's earliest credit line and the issue date of the loan.
## Subtracting two datetime columns gives timedeltas; .dt.days turns them into whole days

df['days_from_earliest_credit_to_issue'] = (df['issue_d'] - df['earliest_cr_line']).dt.days



In [66]:
df['days_from_earliest_credit_to_issue'].describe()

count    128412.000000
mean       5859.891490
std        2886.535578
min        1126.000000
25%        4049.000000
50%        5266.000000
75%        7244.000000
max       25171.000000
Name: days_from_earliest_credit_to_issue, dtype: float64
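
A day count in the thousands is hard to read, so it can help to derive an approximate number of years as well. This is a sketch on two toy rows; dividing by 365.25 is a rough convention and not something the lesson prescribes.

```python
import pandas as pd

issue = pd.to_datetime(pd.Series(['2018-12-01', '2018-12-01']))
earliest = pd.to_datetime(pd.Series(['2001-04-01', '1987-06-01']))

gap = issue - earliest      # timedelta64 Series
days = gap.dt.days          # whole days as integers
years = days / 365.25       # approximate credit history length in years

print(days.tolist())
print(years.round(1).tolist())
```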

In [67]:
## Several columns can be converted as a group.
## First, find every column whose name ends with '_d'

[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']

In [0]:
for col in ['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
  df[col] = pd.to_datetime(df[col], infer_datetime_format=True)

In [69]:
df.describe(include='datetime')

Unnamed: 0,issue_d,earliest_cr_line,last_pymnt_d,next_pymnt_d,last_credit_pull_d
count,128412,128412,128250,124941,128411
unique,3,644,5,3,7
top,2018-10-01 00:00:00,2006-08-01 00:00:00,2019-02-01 00:00:00,2019-03-01 00:00:00,2019-02-01 00:00:00
freq,46305,1130,123797,124903,125061
first,2018-10-01 00:00:00,1950-01-01 00:00:00,2018-10-01 00:00:00,2019-02-01 00:00:00,2018-08-01 00:00:00
last,2018-12-01 00:00:00,2015-11-01 00:00:00,2019-02-01 00:00:00,2019-04-01 00:00:00,2019-02-01 00:00:00


# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

***convert 'term' column from string to integer***

In [11]:
# check the column to see current state
df['term'].head()

0     36 months
1     60 months
2     36 months
3     36 months
4     60 months
Name: term, dtype: object

In [12]:
df['term'].isnull().sum()

0

In [0]:
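## Remove ' months' and convert to integer. Note that str.strip(' months') strips any of
## the characters ' ', 'm', 'o', 'n', 't', 'h', 's' from both ends rather than the literal
## suffix; that is safe here only because digits are not in that set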
df['term'] = df['term'].str.strip(' months').astype(int)

In [22]:
df['term'].head()

0    36
1    60
2    36
3    36
4    60
Name: term, dtype: int64
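
Because `strip` works on a set of characters, a pattern that pulls out the digits is a more direct way to state the intent. This is a sketch of that alternative, with a small demonstration of how `strip` can surprise on other strings:

```python
import pandas as pd

term = pd.Series([' 36 months', ' 60 months'])

# Extract the run of digits rather than stripping characters around it.
months = term.str.extract(r'(\d+)', expand=False).astype(int)
print(months.tolist())

# strip() removes characters from a set, not a suffix:
# every leading/trailing ' ', 'm', 'o', 'n', 't', 'h', 's' is removed.
leftover = 'months to go'.strip(' months')
print(repr(leftover))
```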

***Make column named 'loan_status_is_great'. It should contain the integer 1 if loan_status is "Current" or "Fully Paid." Else it should contain the integer 0.***

In [26]:
# check data in column
df['loan_status'].isnull().sum()

0

In [44]:
# Flag rows whose loan_status contains 'Current' or 'Fully Paid'.
# str.contains treats the pattern as a regex, so '|' means "or". It returns
# booleans, and astype(int) turns them into 1s and 0s
df['loan_status_is_great'] = df['loan_status'].str.contains('Current|Fully Paid').astype(int)
df.columns

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_last_payment_amount', 'disbursement_method',
       'debt_settlement_flag', 'debt_settlement_flag_date',
       'settlement_status', 'settlement_date', 'settlement_amount',
       'settlement_percentage', 'settlement_term', 'loan_status_is_great'],
      dtype='object', length=146)

In [56]:
df['loan_status_is_great'].sample(10)

85145     1
112074    1
20457     0
36876     1
10713     1
26020     1
76988     1
67562     1
18333     1
71384     1
Name: loan_status_is_great, dtype: int64
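
Since the assignment asks whether the status *is* one of two values, an exact match with `isin` is a stricter alternative to a substring search. This is a sketch on a toy Series; the statuses listed are examples, not the full set in the data.

```python
import pandas as pd

status = pd.Series(['Current', 'Fully Paid', 'Charged Off', 'Late (16-30 days)', None])

# isin matches whole values only; missing values count as False.
is_great = status.isin(['Current', 'Fully Paid']).astype(int)
print(is_great.tolist())
```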

***Make last_pymnt_d_month and last_pymnt_d_year columns.***

In [0]:
## 'last_pymnt_d_month' column (strftime('%b') stores the abbreviated month name as a string, e.g. 'Feb')
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.strftime('%b')


In [0]:
# 'last_pymnt_d_year' column (strftime('%Y') stores the year as a string, e.g. '2019')
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.strftime('%Y')

In [84]:
#check to see if columns were added to the dataframe
df[['last_pymnt_d_month', 'last_pymnt_d_year', 'last_pymnt_d']].sample(20)

Unnamed: 0,last_pymnt_d_month,last_pymnt_d_year,last_pymnt_d
43946,Feb,2019,2019-02-01
41284,Feb,2019,2019-02-01
46058,Jan,2019,2019-01-01
4417,Feb,2019,2019-02-01
79878,Feb,2019,2019-02-01
50637,Feb,2019,2019-02-01
106137,Feb,2019,2019-02-01
6677,Feb,2019,2019-02-01
84010,Feb,2019,2019-02-01
112475,Feb,2019,2019-02-01
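
If the goal is numeric features for a model, `.dt.month` and `.dt.year` (as used for `issue_month` in the lesson) give numbers instead of strings. Here is a sketch on a toy Series that includes one missing payment date:

```python
import pandas as pd

last_pymnt = pd.to_datetime(pd.Series(['2019-02-01', '2019-01-01', None]))

# With a NaT present, the components come back as floats with NaN.
month = last_pymnt.dt.month
year = last_pymnt.dt.year

# Optional: pandas' nullable integer dtype keeps whole numbers and marks gaps as <NA>.
month_int = month.astype('Int64')
print(month_int)
```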


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
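
One possible approach to the first two LendingClub options, sketched on a toy frame. The values and the `top_n` variable are made up for illustration, and `top_n` is 1 only so the effect is visible on three rows; the stretch goal uses 20.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'revol_util': ['10.3%', np.nan, '78.1%'],
    'emp_title': ['Teacher', 'Chef', 'Teacher'],
})

# Percent strings to floats; .str.strip leaves missing values as NaN.
toy['revol_util'] = toy['revol_util'].str.strip('%').astype(float)

# Keep the most common titles and replace everything else with 'Other'.
top_n = 1  # hypothetical; use 20 on the real data
top_titles = toy['emp_title'].value_counts().head(top_n).index
toy['emp_title'] = toy['emp_title'].where(toy['emp_title'].isin(top_titles), 'Other')

print(toy)
```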

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01