<a href="https://colab.research.google.com/github/Ravenha/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/Bethany_LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
-  understand the purpose of feature engineering
-  work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip

--2019-07-19 02:23:43--  https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2019Q1.csv.zip.1’

LoanStats_2019Q1.cs     [              <=>   ]  19.30M  1.69MB/s    in 12s     

2019-07-19 02:23:55 (1.68 MB/s) - ‘LoanStats_2019Q1.csv.zip.1’ saved [20240936]



In [0]:
!ls

LoanStats_2019Q1.csv	  LoanStats_2019Q1.csv.zip.1
LoanStats_2019Q1.csv.zip  sample_data


In [0]:
!unzip LoanStats_2019Q1.csv.zip

Archive:  LoanStats_2019Q1.csv.zip
replace LoanStats_2019Q1.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

KeyboardInterrupt: ignored

In [0]:
!head LoanStats_2019Q1.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd

In [0]:
df=pd.read_csv('LoanStats_2019Q1.csv', skiprows=1, skipfooter=2)
print(df.shape)
df.head()

In [0]:
df.isnull().sum()

In [0]:
df[df.loan_amnt.isna()]

In [0]:
df.describe(include='object')

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get info about which columns have a datatype of "object" (strings)

In [0]:
df.describe(include='object')

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
x = '12.5%'

In [0]:
float(x.strip('%'))

In [0]:
df.int_rate = df.int_rate.str.strip('%').astype(float)

In [0]:
df.int_rate.head()

In [0]:
df.int_rate.describe()

In [0]:
import numpy as np

Apply the function to the `int_rate` column

In [0]:
example = ['Owner', 'Supervisor ', ' Project manager', np.nan]
def clean_emp_title(x):
  if isinstance(x, str):
    return x.strip().title()
  else:
    return 'Missing'

for ex in example:
  print(clean_emp_title(ex))

In [0]:
df['emp_title'].str.contains('Manager')

In [0]:
df['int_rate'].groupby(df['emp_title']).mean()

### Clean `emp_title`

Look at top 20 titles

How often is `emp_title` null?

Clean the title and handle missing values

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
df['issue_d'].describe()

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)

In [0]:
df['issue_d'].describe()

In [0]:
df['issue_d'].iloc[0:5].dt.month

In [0]:
[col for col in df if col.endswith('_d')]

In [0]:
for col in ['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
  df[col] = pd.to_datetime(df[col])

In [0]:
df.describe(include='datetime')

In [0]:
df.term.value_counts()

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip

--2019-07-19 02:31:50--  https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2019Q1.csv.zip.2’

LoanStats_2019Q1.cs     [               <=>  ]  19.30M  1.70MB/s    in 11s     

2019-07-19 02:32:02 (1.69 MB/s) - ‘LoanStats_2019Q1.csv.zip.2’ saved [20240936]



In [0]:
!ls

LoanStats_2019Q1.csv	  LoanStats_2019Q1.csv.zip.1  sample_data
LoanStats_2019Q1.csv.zip  LoanStats_2019Q1.csv.zip.2


In [0]:
!unzip LoanStats_2019Q1.csv.zip

Archive:  LoanStats_2019Q1.csv.zip
replace LoanStats_2019Q1.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [0]:
!head LoanStats_2019Q1.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [0]:
df=pd.read_csv('LoanStats_2019Q1.csv', skiprows=1, skipfooter=2)
print(df.shape)
df.head()

  """Entry point for launching an IPython kernel.


(115675, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,20000,20000,20000,60 months,17.19%,499.1,C,C5,Front desk supervisor,6 years,RENT,47000.0,Source Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,958xx,CA,14.02,0,Sep-2006,1,50.0,,15,0,10687,19.7%,53,w,19254.76,19254.76,1459.1,1459.1,...,12.5,0,0,75824,31546,33800,21524,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,12000,12000,12000,36 months,16.40%,424.26,C,C4,Executive Director,4 years,MORTGAGE,95000.0,Not Verified,Mar-2019,Current,n,,,debt_consolidation,Debt consolidation,436xx,OH,29.5,0,Nov-1992,0,29.0,,19,0,16619,64.9%,36,w,11475.92,11475.92,826.65,826.65,...,66.7,0,0,209488,51734,18500,54263,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,3000,3000,3000,36 months,14.74%,103.62,C,C2,Office Manager,4 years,MORTGAGE,58750.0,Verified,Mar-2019,Current,n,,,medical,Medical expenses,327xx,FL,30.91,0,Jun-2004,0,24.0,,16,0,20502,60.1%,25,f,2865.64,2865.64,202.33,202.33,...,71.4,0,0,69911,37816,11400,35811,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,35000,35000,35000,36 months,15.57%,1223.08,C,C3,Store Manager,10+ years,RENT,122000.0,Verified,Mar-2019,Current,n,,,credit_card,Credit card refinancing,333xx,FL,22.0,1,Dec-2009,0,20.0,,5,0,1441,24.4%,18,w,33459.43,33459.43,2446.76,2446.76,...,100.0,0,0,65640,24471,1600,59740,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,5000,5000,5000,36 months,15.57%,174.73,C,C3,Area Manager,3 years,OWN,65000.0,Verified,Mar-2019,Current,n,,,house,Home buying,640xx,MO,16.28,1,Jul-2001,0,7.0,,9,0,5604,64.4%,25,w,4778.86,4778.86,340.81,340.81,...,50.0,0,0,38190,29775,4400,29490,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [0]:
df.isnull().sum()

id                                            115675
member_id                                     115675
loan_amnt                                          0
funded_amnt                                        0
funded_amnt_inv                                    0
term                                               0
int_rate                                           0
installment                                        0
grade                                              0
sub_grade                                          0
emp_title                                      19518
emp_length                                     11101
home_ownership                                     0
annual_inc                                         0
verification_status                                0
issue_d                                            0
loan_status                                        0
pymnt_plan                                         0
url                                           

In [0]:
df[df.loan_amnt.isna()]

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term


In [0]:
df.describe(include='object')

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint,sec_app_earliest_cr_line,hardship_flag,hardship_type,hardship_reason,hardship_status,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_loan_status,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date
count,115675,115675,115675,115675,96157,104574,115675,115675,115675,115675,115675,115675,115675,115675,115675,115675,115546,115675,115445,110769,115673,115675,14624,16681,115675,1,1,1,1,1,1,1,115675,1,1,1
unique,2,53,7,33,39387,11,5,3,3,6,1,12,12,877,50,652,1055,2,7,3,7,2,3,577,2,1,1,1,1,1,1,1,2,1,1,1
top,36 months,8.19%,A,A4,Teacher,10+ years,MORTGAGE,Not Verified,Jan-2019,Current,n,debt_consolidation,Debt consolidation,750xx,CA,Aug-2006,0%,w,Jun-2019,Jul-2019,Jun-2019,Individual,Not Verified,Aug-2006,N,INTEREST ONLY-3 MONTHS DEFERRAL,INCOME_CURTAILMENT,ACTIVE,Jun-2019,Sep-2019,Jul-2019,In Grace Period,N,May-2019,ACTIVE,May-2019
freq,78429,11314,37060,11314,2037,34490,58578,54608,43584,109176,115675,63747,63747,1162,15902,1033,1054,101423,102101,110738,111240,98994,6395,150,115674,1,1,1,1,1,1,1,115674,1,1,1


In [0]:
df.int_rate = df.int_rate.str.strip('%').astype(float)

In [0]:
df.int_rate.head()

0    17.19
1    16.40
2    14.74
3    15.57
4    15.57
Name: int_rate, dtype: float64

In [0]:
df.int_rate.describe()

count    115675.000000
mean         12.720160
std           4.854991
min           6.000000
25%           8.190000
50%          11.800000
75%          15.570000
max          30.890000
Name: int_rate, dtype: float64

In [0]:
example = ['Owner', 'Supervisor ', ' Project manager', np.nan]
def clean_emp_title(x):
  if isinstance(x, str):
    return x.strip().title()
  else:
    return 'Missing'

for ex in example:
  print(clean_emp_title(ex))

Owner
Supervisor
Project Manager
Missing


In [0]:
df['emp_title'].str.contains('Manager')

0         False
1         False
2          True
3          True
4          True
5         False
6         False
7         False
8         False
9          True
10        False
11         True
12        False
13          NaN
14        False
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27          NaN
28        False
29        False
          ...  
115645     True
115646    False
115647    False
115648     True
115649    False
115650    False
115651    False
115652    False
115653    False
115654    False
115655    False
115656    False
115657      NaN
115658    False
115659      NaN
115660      NaN
115661    False
115662      NaN
115663      NaN
115664    False
115665    False
115666      NaN
115667      NaN
115668    False
115669      NaN
115670    False
115671      NaN
115672    False
115673    False
115674    False
Name: emp_title, Length:

In [0]:
df['int_rate'].groupby(df['emp_title']).mean()

emp_title
 \tProductión Team Leader               13.900000
   Manager                               8.190000
   hairdresser                          13.560000
  Executive  assistant                  28.800000
  Sales specialist                      19.920000
  Sergeant first class                  13.560000
  field operator                        14.740000
  site manager                           8.810000
 Account Representative                  8.810000
 Accounting & HR & Sales                11.020000
 Accounting assistant                   14.470000
 Accounts Receivable Rep                 8.190000
 Aerospace Painter                      13.560000
 Analyst                                13.900000
 Assistant manager                       8.810000
 Assistant manager                      20.000000
 Assistant supervisor                   10.720000
 Associate director Of public safety    27.270000
 Asst. food service director            16.400000
 Auditor                                

In [0]:
df['issue_d'].describe()

count       115675
unique           3
top       Jan-2019
freq         43584
Name: issue_d, dtype: object

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)

In [0]:
df['issue_d'].describe()

count                  115675
unique                      3
top       2019-01-01 00:00:00
freq                    43584
first     2019-01-01 00:00:00
last      2019-03-01 00:00:00
Name: issue_d, dtype: object

In [0]:
df['issue_d'].iloc[0:5].dt.month

0    3
1    3
2    3
3    3
4    3
Name: issue_d, dtype: int64

In [0]:
[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']

In [0]:
for col in ['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
  df[col] = pd.to_datetime(df[col])

In [0]:
df.describe(include='datetime')

Unnamed: 0,issue_d,last_pymnt_d,next_pymnt_d,last_credit_pull_d
count,115675,115445,110769,115673
unique,3,7,3,7
top,2019-01-01 00:00:00,2019-06-01 00:00:00,2019-07-01 00:00:00,2019-06-01 00:00:00
freq,43584,102101,110738,111240
first,2019-01-01 00:00:00,2019-01-01 00:00:00,2019-06-01 00:00:00,2018-12-01 00:00:00
last,2019-03-01 00:00:00,2019-07-01 00:00:00,2019-08-01 00:00:00,2019-06-01 00:00:00


In [0]:
df.term.value_counts()

 36 months    78429
 60 months    37246
Name: term, dtype: int64

In [0]:
df['term'] = pd.to_numeric(df['term'], errors='coerce')
df.dtypes

id                                                   float64
member_id                                            float64
loan_amnt                                              int64
funded_amnt                                            int64
funded_amnt_inv                                        int64
term                                                 float64
int_rate                                             float64
installment                                          float64
grade                                                 object
sub_grade                                             object
emp_title                                             object
emp_length                                            object
home_ownership                                        object
annual_inc                                           float64
verification_status                                   object
issue_d                                       datetime64[ns]
loan_status             

In [0]:
df['loan_status_is_great'] = df.loan_status.str.contains('Current', 'Fully Paid')
df['loan_status_is_great'].value_counts()

True     109176
False      6499
Name: loan_status_is_great, dtype: int64

In [0]:
df.loan_status_is_great = int(loan_status_is_great == 'true')

In [0]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,loan_status_is_great,1
0,,,20000,20000,20000,,17.19,499.1,C,C5,Front desk supervisor,6 years,RENT,47000.0,Source Verified,2019-03-01,Current,n,,,debt_consolidation,Debt consolidation,958xx,CA,14.02,0,Sep-2006,1,50.0,,15,0,10687,19.7%,53,w,19254.76,19254.76,1459.1,1459.1,...,0,75824,31546,33800,21524,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,0,<class 'tuple'>
1,,,12000,12000,12000,,16.4,424.26,C,C4,Executive Director,4 years,MORTGAGE,95000.0,Not Verified,2019-03-01,Current,n,,,debt_consolidation,Debt consolidation,436xx,OH,29.5,0,Nov-1992,0,29.0,,19,0,16619,64.9%,36,w,11475.92,11475.92,826.65,826.65,...,0,209488,51734,18500,54263,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,0,<class 'tuple'>
2,,,3000,3000,3000,,14.74,103.62,C,C2,Office Manager,4 years,MORTGAGE,58750.0,Verified,2019-03-01,Current,n,,,medical,Medical expenses,327xx,FL,30.91,0,Jun-2004,0,24.0,,16,0,20502,60.1%,25,f,2865.64,2865.64,202.33,202.33,...,0,69911,37816,11400,35811,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,0,<class 'tuple'>
3,,,35000,35000,35000,,15.57,1223.08,C,C3,Store Manager,10+ years,RENT,122000.0,Verified,2019-03-01,Current,n,,,credit_card,Credit card refinancing,333xx,FL,22.0,1,Dec-2009,0,20.0,,5,0,1441,24.4%,18,w,33459.43,33459.43,2446.76,2446.76,...,0,65640,24471,1600,59740,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,0,<class 'tuple'>
4,,,5000,5000,5000,,15.57,174.73,C,C3,Area Manager,3 years,OWN,65000.0,Verified,2019-03-01,Current,n,,,house,Home buying,640xx,MO,16.28,1,Jul-2001,0,7.0,,9,0,5604,64.4%,25,w,4778.86,4778.86,340.81,340.81,...,0,38190,29775,4400,29490,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,0,<class 'tuple'>


In [0]:
[col for col in df if col.endswith('_d')]
for col in ['last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
  df[col] = pd.to_datetime(df[col])
df.describe(include='datetime')

Unnamed: 0,issue_d,last_pymnt_d,next_pymnt_d,last_credit_pull_d
count,115675,115445,110769,115673
unique,3,7,3,7
top,2019-01-01 00:00:00,2019-06-01 00:00:00,2019-07-01 00:00:00,2019-06-01 00:00:00
freq,43584,102101,110738,111240
first,2019-01-01 00:00:00,2019-01-01 00:00:00,2019-06-01 00:00:00,2018-12-01 00:00:00
last,2019-03-01 00:00:00,2019-07-01 00:00:00,2019-08-01 00:00:00,2019-06-01 00:00:00


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiatve and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01