_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip

--2019-01-17 20:55:13--  https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q3.csv.zip’

LoanStats_2018Q3.cs     [                <=> ]  21.42M   883KB/s    in 25s     

2019-01-17 20:55:39 (878 KB/s) - ‘LoanStats_2018Q3.csv.zip’ saved [22461905]



In [2]:
!unzip LoanStats_2018Q3.csv.zip

Archive:  LoanStats_2018Q3.csv.zip
  inflating: LoanStats_2018Q3.csv    


In [3]:
!head LoanStats_2018Q3.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [4]:
!tail LoanStats_2018Q3.csv

"","","12000","12000","12000"," 36 months"," 14.03%","410.31","C","C2","Medical Support Staff","10+ years","OWN","53414","Source Verified","Jul-2018","Current","n","","","home_improvement","Home improvement","136xx","NY","25.68","0","Mar-1997","0","","","8","0","6527","44.4%","20","w","10618.01","10618.01","2037.52","2037.52","1381.99","655.53","0.0","0.0","0.0","Dec-2018","410.31","Jan-2019","Dec-2018","0","","1","Individual","","","","0","0","29869","0","2","0","2","13","23342","73","0","0","3977","64","14700","1","1","0","2","3734","7173","47.6","0","0","139","255","42","13","0","66","","18","","0","2","2","5","9","8","6","12","2","8","0","0","0","0","100","40","0","0","46792","29869","13700","32092","","","","","","","","","","","","N","","","","","","","","","","","","","","","Cash","N","","","","","",""
"","","5000","5000","5000"," 36 months"," 16.46%","176.93","C","C5","Labor Worker","3 years","MORTGAGE","57000","Not Verified","Jul-2018","Current","n","","","debt_consolidation",

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [5]:
import pandas as pd

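# skiprows=1 drops the prospectus notice above the header row.
# skipfooter is only supported by the Python parser engine, so request it explicitly.
df = pd.read_csv('LoanStats_2018Q3.csv', skiprows=1, skipfooter=2, engine='python')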
df.shape


(128194, 145)
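The effect of `skiprows` and `skipfooter` is easier to see on a tiny file. The sketch below uses a made-up CSV string (the notice line, the `Total` footer, and the `toy` name are illustrative assumptions, not the real LendingClub contents):

```python
import io

import pandas as pd

# Hypothetical miniature file: one notice line, a header, two rows, one footer line
raw = (
    "Notes offered by Prospectus\n"
    "loan_amnt,term\n"
    "1000, 36 months\n"
    "2000, 60 months\n"
    "Total amount funded: 3000\n"
)

# skiprows=1 drops the notice line; skipfooter=1 drops the trailing summary line.
# skipfooter needs the Python parser engine.
toy = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=1, engine='python')
print(toy.shape)
print(list(toy.columns))
```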

In [6]:
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500
df.head().T # .T transposes the df

Unnamed: 0,0,1,2,3,4
id,,,,,
member_id,,,,,
loan_amnt,20000,25000,30000,6000,10650
funded_amnt,20000,25000,30000,6000,10650
funded_amnt_inv,20000,25000,30000,6000,10650
term,60 months,60 months,36 months,36 months,36 months
int_rate,17.97%,13.56%,18.94%,7.84%,7.84%
installment,507.55,576.02,1098.78,187.58,332.95
grade,D,C,D,A,A
sub_grade,D1,C1,D2,A4,A4


In [7]:
# Some columns are very sparse -- this one is more than 99.9% blank
df['hardship_payoff_balance_amount'].isnull().sum() / len(df)

0.999898591197716
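The same check can be run over every column at once. A minimal sketch, assuming a small made-up frame (`toy`, its values, and the 0.5 threshold are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loan data
toy = pd.DataFrame({
    'loan_amnt': [1000, 2000, 3000, 4000],
    'desc': [np.nan, np.nan, np.nan, 'car repair'],
    'member_id': [np.nan] * 4,
})

# isnull() gives booleans; the mean of booleans is the fraction that are True
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)

# Columns that are more than half empty (the threshold is arbitrary)
mostly_empty = null_fraction[null_fraction > 0.5].index.tolist()
print(mostly_empty)
```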

## Work with strings

For machine learning, we usually want to replace strings with numbers

In [0]:
import numpy as np

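def all_numeric(df):
  # select_dtypes with np.number matches integer and float columns alike
  numeric_or_bool = df.select_dtypes(include=[np.number, bool])
  return numeric_or_bool.shape[1] == df.shape[1]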

def no_nulls(df):
  return not any(df.isnull().sum())

def ready_for_sklearn(df):
  return all_numeric(df) and no_nulls(df)

In [9]:
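# See which columns compare equal to np.number
# Caution: in the output below only float columns (e.g. installment) show True;
# integer columns such as loan_amnt show False even though they are numeric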
df.dtypes==np.number

id                                             True
member_id                                      True
loan_amnt                                     False
funded_amnt                                   False
funded_amnt_inv                               False
term                                          False
int_rate                                      False
installment                                    True
grade                                         False
sub_grade                                     False
emp_title                                     False
emp_length                                    False
home_ownership                                False
annual_inc                                     True
verification_status                           False
issue_d                                       False
loan_status                                   False
pymnt_plan                                    False
url                                            True
desc        

In [10]:
# Using all function to see if all columns are numbers
all(df.dtypes==np.number)

False

In [11]:
# Seeing if df is ready for ML -- it's not
ready_for_sklearn(df)

False

In [12]:
# Seeing if df is all numeric -- it's not
all_numeric(df)

False
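Comparing dtypes to `np.number` with `==` can miss integer columns, as the output above suggests. One alternative is the type-checking helpers in `pandas.api.types`. This is a sketch on a made-up frame; `all_numeric_strict` and `toy` are hypothetical names, not part of the lesson code:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def all_numeric_strict(frame):
    """Return True when every column is numeric (int or float) or boolean."""
    return all(
        is_numeric_dtype(dtype) or is_bool_dtype(dtype)
        for dtype in frame.dtypes
    )

# Hypothetical toy frame with an int, a float and a bool column
toy = pd.DataFrame({
    'loan_amnt': [20000, 25000],
    'installment': [507.55, 576.02],
    'is_manager': [False, True],
})
before = all_numeric_strict(toy)

# Adding a string column should make the check fail
toy['grade'] = ['D', 'C']
after = all_numeric_strict(toy)
print(before, after)
```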

We can use `info` to see which columns have dtype `object` (strings)

In [13]:
# Looking at all columns that are objects
df.select_dtypes('object').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128194 entries, 0 to 128193
Data columns (total 37 columns):
term                         128194 non-null object
int_rate                     128194 non-null object
grade                        128194 non-null object
sub_grade                    128194 non-null object
emp_title                    114757 non-null object
emp_length                   117807 non-null object
home_ownership               128194 non-null object
verification_status          128194 non-null object
issue_d                      128194 non-null object
loan_status                  128194 non-null object
pymnt_plan                   128194 non-null object
purpose                      128194 non-null object
title                        128194 non-null object
zip_code                     128194 non-null object
addr_state                   128194 non-null object
earliest_cr_line             128194 non-null object
revol_util                   128065 non-null object
initi

**Convert int_rate**

In [14]:
# looking at first 5 values of int_rate column -- they are strings
df.int_rate.head().values

array([' 17.97%', ' 13.56%', ' 18.94%', '  7.84%', '  7.84%'],
      dtype=object)

Define a function to remove percent signs from strings and convert to floats

In [0]:
string = '17.97%' # Start with simple example

# string.replace('%', '')
# float(string.replace('%', '')) # wrap the above in float

In [17]:
def remove_percent(string):
  return float(string.strip('%'))

remove_percent(string)

17.97

Apply the function to the int_rate column

In [0]:
# Convert the int_rate column to float
# (re-running this cell raises an error, because the values are no longer strings)
df['int_rate'] = df['int_rate'].apply(remove_percent)

In [19]:
# Column is now floats
df['int_rate'].head()

0    17.97
1    13.56
2    18.94
3     7.84
4     7.84
Name: int_rate, dtype: float64
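The same conversion can also be written with pandas' vectorized string methods instead of `apply`. A minimal sketch on a made-up Series shaped like the `int_rate` values shown earlier (`rates` and `as_float` are illustrative names):

```python
import pandas as pd

# Hypothetical sample with the same leading spaces and percent signs
rates = pd.Series([' 17.97%', ' 13.56%', '  7.84%'])

# Strip whitespace, drop the trailing percent sign, then cast the whole column
as_float = rates.str.strip().str.rstrip('%').astype(float)
print(as_float)
```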

**Clean emp_title**

Look at top 20 titles

In [20]:
# Looking at the 20 most frequent emp_title values (sorted by count, descending)
df['emp_title'].value_counts().head(20)

Teacher               2294
Manager               2075
Owner                 1231
Driver                1089
Registered Nurse       944
Supervisor             810
RN                     757
Sales                  726
Project Manager        637
General Manager        548
Office Manager         542
Director               482
owner                  398
Engineer               383
Truck Driver           367
Operations Manager     366
President              350
Sales Manager          323
Supervisor             321
Server                 319
Name: emp_title, dtype: int64

In [21]:
# Getting closer look at formatting for these titles. One 'Supervisor' has extra space at end
df['emp_title'].value_counts().head(20).index

Index(['Teacher', 'Manager', 'Owner', 'Driver', 'Registered Nurse',
       'Supervisor', 'RN', 'Sales', 'Project Manager', 'General Manager',
       'Office Manager', 'Director', 'owner', 'Engineer', 'Truck Driver',
       'Operations Manager', 'President', 'Sales Manager', 'Supervisor ',
       'Server'],
      dtype='object')

How often is emp_title null?

In [22]:
df['emp_title'].isnull().sum()

13437

Clean title and handle missing values


In [23]:
examples = ['owner', 'Supervisor ', ' Project manager', np.nan] # simple example

def clean_title(x):
  # isinstance also accepts subclasses of str; type(x) == str would not
  if isinstance(x, str):
    return x.strip().title()
  # Anything that is not a string (here, NaN for a missing title) becomes 'Unknown'
  return 'Unknown'

for example in examples:
  print(clean_title(example))

Owner
Supervisor
Project Manager
Unknown


In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [25]:
df['emp_title'].value_counts().head(20)

Unknown                     13437
Teacher                      2843
Manager                      2749
Owner                        1856
Driver                       1498
Registered Nurse             1386
Supervisor                   1345
Sales                         980
Truck Driver                  921
Rn                            905
Office Manager                846
Project Manager               835
General Manager               809
Director                      585
Operations Manager            516
Sales Manager                 510
Engineer                      474
Administrative Assistant      466
Store Manager                 466
President                     464
Name: emp_title, dtype: int64
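The same cleaning can be sketched with vectorized string methods plus `fillna`. The sample titles below are the ones from the simple example above, and `cleaned` is an illustrative name:

```python
import numpy as np
import pandas as pd

titles = pd.Series(['owner', 'Supervisor ', ' Project manager', np.nan])

# Missing values pass through the .str methods untouched, so fill them last
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.tolist())
```

Title-casing has a side effect on acronyms: 'RN' becomes 'Rn', which is why that spelling appears in the counts above.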

In [26]:
# To square int_rate, use a vectorized operation like this:
df.int_rate ** 2

# Not a row-by-row apply like this, which is slower:

# def square(x):
#   return x ** 2

# df.int_rate.apply(square)

0         322.9209
1         183.8736
2         358.7236
3          61.4656
4          61.4656
5          37.3321
6          44.4889
7          61.4656
8          51.9841
9         225.6004
10        225.6004
11         61.4656
12        183.8736
13         71.5716
14         51.9841
15         71.5716
16         71.5716
17        436.3921
18         37.3321
19        499.5225
20         51.9841
21        285.9481
22        260.4996
23         44.4889
24        183.8736
25         61.4656
26        183.8736
27         51.9841
28         44.4889
29         61.4656
30         44.4889
31        358.7236
32        162.0529
33        209.3809
34         37.3321
35        183.8736
36        133.4025
37         71.5716
38        285.9481
39        285.9481
40        109.6209
41         37.3321
42        122.3236
43        209.3809
44         44.4889
45         37.3321
46         71.5716
47         44.4889
48         61.4656
49         37.3321
50        101.6064
51         71.5716
52        49

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
# Boolean column, True if 'emp_title' contains 'Manager', otherwise False

In [0]:
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')

In [29]:
# Look at value counts of new emp_title_manager column
df['emp_title_manager'].value_counts()

False    109498
True      18696
Name: emp_title_manager, dtype: int64
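`str.contains` is case-sensitive by default and, on object-dtype columns, returns a missing value for a missing title. The lesson data has already been title-cased and filled with 'Unknown', so neither matters here, but on raw titles the `case` and `na` parameters may help. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical raw titles, before any cleaning
titles = pd.Series(['Project Manager', 'manager', 'Teacher', np.nan], dtype=object)

# Default behaviour: case-sensitive, missing input gives a missing result
strict = titles.str.contains('Manager')

# case=False ignores capitalisation; na=False treats missing titles as no match
relaxed = titles.str.contains('manager', case=False, na=False)
print(relaxed.tolist())
```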

In [30]:
# See that it's on the df
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager
0,,,20000,20000,20000,60 months,17.97,507.55,D,D1,Motor Vehicle Operator,7 years,RENT,68000.0,Source Verified,Sep-2018,Current,n,,,debt_consolidation,Debt consolidation,254xx,WV,15.8,0,Feb-2009,1,30.0,118.0,11,1,13483,69.1%,26,w,19580.78,19580.78,975.17,975.17,419.22,555.95,0.0,0.0,0.0,Dec-2018,507.55,Jan-2019,Dec-2018,0,,1,Individual,,,,0,0,194908,0,2,0,2,17.0,45299,93.0,0,1,9815,86.0,19500,1,4,2,5,19491.0,2185.0,81.8,0,0,115.0,94,13,7,4,25.0,,2.0,30.0,0,1,3,2,4,11,8,10,3,11,0.0,0,0,1,92.3,50.0,1,0,205625,58782,12000,48733,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False
1,,,25000,25000,25000,60 months,13.56,576.02,C,C1,Firefighter,7 years,MORTGAGE,61000.0,Verified,Sep-2018,Current,n,,,major_purchase,Major purchase,770xx,TX,16.74,0,Jan-2009,0,,,5,0,6299,48.5%,14,w,24109.45,24109.45,1567.98,1567.98,890.55,677.43,0.0,0.0,0.0,Dec-2018,576.02,Jan-2019,Dec-2018,0,,1,Individual,,,,0,0,204007,1,2,0,1,19.0,13862,37.0,0,0,0,40.0,13000,0,7,4,2,40801.0,1500.0,0.0,0,0,54.0,116,26,3,1,74.0,,4.0,,0,0,1,1,2,5,2,8,1,5,0.0,0,0,1,100.0,0.0,0,0,234573,20161,1500,37273,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False
2,,,30000,30000,30000,36 months,18.94,1098.78,D,D2,Unknown,< 1 year,RENT,100000.0,Source Verified,Sep-2018,Current,n,,,debt_consolidation,Debt consolidation,300xx,GA,16.07,0,Mar-2008,1,,114.0,6,1,14574,70.1%,9,w,28739.57,28739.57,2134.43,2134.43,1260.43,874.0,0.0,0.0,0.0,Dec-2018,1098.78,Jan-2019,Dec-2018,0,,1,Individual,,,,0,0,44048,1,2,1,1,1.0,29474,69.0,1,2,8521,69.0,20800,1,0,2,3,8810.0,5226.0,73.6,0,0,126.0,104,11,1,0,17.0,,1.0,,0,2,2,2,2,4,4,5,2,6,0.0,0,0,2,100.0,0.0,1,0,63636,44048,19800,42836,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False
3,,,6000,6000,6000,36 months,7.84,187.58,A,A4,Unknown,,RENT,30000.0,Not Verified,Sep-2018,Current,n,,,debt_consolidation,Debt consolidation,923xx,CA,5.44,0,Apr-2000,0,,104.0,8,1,5936,34.5%,11,w,5702.27,5702.27,369.93,369.93,297.73,72.2,0.0,0.0,0.0,Dec-2018,187.58,Jan-2019,Dec-2018,0,,1,Individual,,,,0,350,5936,0,0,0,1,23.0,0,,1,4,2913,35.0,17200,2,0,0,5,848.0,7698.0,35.9,0,0,139.0,221,7,7,0,7.0,,18.0,,0,2,3,4,4,2,8,9,3,8,0.0,0,0,1,100.0,33.3,1,0,17200,5936,12000,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False
4,,,10650,10650,10650,36 months,7.84,332.95,A,A4,Unknown,,RENT,28000.0,Verified,Sep-2018,Current,n,,,medical,Medical expenses,430xx,OH,16.89,0,Nov-2002,0,,,3,0,37,0.3%,3,w,10121.54,10121.54,656.62,656.62,528.46,128.16,0.0,0.0,0.0,Dec-2018,332.95,Jan-2019,Dec-2018,0,,1,Joint App,43000.0,14.01,Source Verified,0,0,18254,0,1,0,1,16.0,18217,81.0,0,0,0,54.0,11500,1,1,0,1,6085.0,,,0,0,16.0,190,113,16,0,,,16.0,,0,0,1,0,0,1,2,2,2,3,0.0,0,0,0,100.0,,0,0,33876,18254,0,22376,2024.0,Oct-1996,0.0,0.0,8.0,18.7,1.0,9.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False


In [31]:
df.shape

(128194, 146)

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [32]:
# See first few values of issue_d
df['issue_d'].head().values

array(['Sep-2018', 'Sep-2018', 'Sep-2018', 'Sep-2018', 'Sep-2018'],
      dtype=object)

In [0]:
# Change issue_d column to datetime format (values look like 'Sep-2018')
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')

In [34]:
# Look at new values
df['issue_d'].head().values

array(['2018-09-01T00:00:00.000000000', '2018-09-01T00:00:00.000000000',
       '2018-09-01T00:00:00.000000000', '2018-09-01T00:00:00.000000000',
       '2018-09-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [35]:
df['issue_d'].describe()

count                  128194
unique                      3
top       2018-08-01 00:00:00
freq                    46079
first     2018-07-01 00:00:00
last      2018-09-01 00:00:00
Name: issue_d, dtype: object

In [0]:
# Creating two new columns with just the year and month of issue_d
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

In [37]:
# Check new column with random sample
df['issue_month'].sample(n=10).values

array([9, 8, 9, 8, 7, 8, 9, 8, 7, 7])

In [38]:
# See values of earliest_cr_line column
df['earliest_cr_line'].head().values

array(['Feb-2009', 'Jan-2009', 'Mar-2008', 'Apr-2000', 'Nov-2002'],
      dtype=object)

In [0]:
# Change column to datetime format (values look like 'Feb-2009')
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%Y')

In [0]:
# Create new column by subtracting earliest_cr_line from issue_d, in days
df['days_from_earliest_credit_to_issue'] = (
    df['issue_d'] - df['earliest_cr_line']).dt.days

In [41]:
# Looking at value of new column
df['days_from_earliest_credit_to_issue'].head().values

array([3499, 3530, 3836, 6727, 5783])
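Subtracting two datetime columns yields timedeltas, and `.dt.days` turns them into whole days. The sketch below reproduces two of the values above on a made-up pair of Series and then rescales to approximate years. The `years` feature and the 365.25 divisor are illustrative assumptions, and `%b` assumes English month abbreviations:

```python
import pandas as pd

# Two issue dates and two earliest credit lines, taken from the rows shown above
issue = pd.to_datetime(pd.Series(['Sep-2018', 'Sep-2018']), format='%b-%Y')
earliest = pd.to_datetime(pd.Series(['Feb-2009', 'Apr-2000']), format='%b-%Y')

# Datetime minus datetime gives a timedelta; .dt.days extracts whole days
days = (issue - earliest).dt.days
print(days.tolist())  # [3499, 6727]

# A rough length of credit history in years
years = days / 365.25
print(years.round(1).tolist())
```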

In [42]:
# List comprehension to see all columns that end in '_d'
[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']
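The column search and the conversion can be combined so the list does not have to be typed by hand. A minimal sketch on a made-up frame (`toy` and its values are hypothetical). `errors='coerce'` is an optional choice that turns unparseable entries into `NaT` instead of raising:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two date-like columns and one numeric column
toy = pd.DataFrame({
    'issue_d': ['Sep-2018', 'Aug-2018'],
    'last_pymnt_d': ['Dec-2018', np.nan],
    'loan_amnt': [20000, 25000],
})

# Find the columns whose names end in '_d', then convert each one
date_cols = [col for col in toy.columns if col.endswith('_d')]
for col in date_cols:
    toy[col] = pd.to_datetime(toy[col], format='%b-%Y', errors='coerce')

print(toy.dtypes)
```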

In [0]:
# Loop over the list above and convert each column to datetime format
# (values look like 'Dec-2018'; issue_d was already converted, so it is unchanged)
for col in ['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
  df[col] = pd.to_datetime(df[col], format='%b-%Y')

In [44]:
df['loan_status'].value_counts()

Current               121082
Fully Paid              4786
In Grace Period          948
Late (31-120 days)       920
Late (16-30 days)        348
Charged Off              110
Name: loan_status, dtype: int64

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid", and the integer 0 otherwise.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.



In [45]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,days_from_earliest_credit_to_issue
0,,,20000,20000,20000,60 months,17.97,507.55,D,D1,Motor Vehicle Operator,7 years,RENT,68000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,254xx,WV,15.8,0,2009-02-01,1,30.0,118.0,11,1,13483,69.1%,26,w,19580.78,19580.78,975.17,975.17,419.22,555.95,0.0,0.0,0.0,2018-12-01,507.55,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,194908,0,2,0,2,17.0,45299,93.0,0,1,9815,86.0,19500,1,4,2,5,19491.0,2185.0,81.8,0,0,115.0,94,13,7,4,25.0,,2.0,30.0,0,1,3,2,4,11,8,10,3,11,0.0,0,0,1,92.3,50.0,1,0,205625,58782,12000,48733,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,3499
1,,,25000,25000,25000,60 months,13.56,576.02,C,C1,Firefighter,7 years,MORTGAGE,61000.0,Verified,2018-09-01,Current,n,,,major_purchase,Major purchase,770xx,TX,16.74,0,2009-01-01,0,,,5,0,6299,48.5%,14,w,24109.45,24109.45,1567.98,1567.98,890.55,677.43,0.0,0.0,0.0,2018-12-01,576.02,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,204007,1,2,0,1,19.0,13862,37.0,0,0,0,40.0,13000,0,7,4,2,40801.0,1500.0,0.0,0,0,54.0,116,26,3,1,74.0,,4.0,,0,0,1,1,2,5,2,8,1,5,0.0,0,0,1,100.0,0.0,0,0,234573,20161,1500,37273,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3530
2,,,30000,30000,30000,36 months,18.94,1098.78,D,D2,Unknown,< 1 year,RENT,100000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,300xx,GA,16.07,0,2008-03-01,1,,114.0,6,1,14574,70.1%,9,w,28739.57,28739.57,2134.43,2134.43,1260.43,874.0,0.0,0.0,0.0,2018-12-01,1098.78,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,44048,1,2,1,1,1.0,29474,69.0,1,2,8521,69.0,20800,1,0,2,3,8810.0,5226.0,73.6,0,0,126.0,104,11,1,0,17.0,,1.0,,0,2,2,2,2,4,4,5,2,6,0.0,0,0,2,100.0,0.0,1,0,63636,44048,19800,42836,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3836
3,,,6000,6000,6000,36 months,7.84,187.58,A,A4,Unknown,,RENT,30000.0,Not Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,923xx,CA,5.44,0,2000-04-01,0,,104.0,8,1,5936,34.5%,11,w,5702.27,5702.27,369.93,369.93,297.73,72.2,0.0,0.0,0.0,2018-12-01,187.58,2019-01-01,2018-12-01,0,,1,Individual,,,,0,350,5936,0,0,0,1,23.0,0,,1,4,2913,35.0,17200,2,0,0,5,848.0,7698.0,35.9,0,0,139.0,221,7,7,0,7.0,,18.0,,0,2,3,4,4,2,8,9,3,8,0.0,0,0,1,100.0,33.3,1,0,17200,5936,12000,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,6727
4,,,10650,10650,10650,36 months,7.84,332.95,A,A4,Unknown,,RENT,28000.0,Verified,2018-09-01,Current,n,,,medical,Medical expenses,430xx,OH,16.89,0,2002-11-01,0,,,3,0,37,0.3%,3,w,10121.54,10121.54,656.62,656.62,528.46,128.16,0.0,0.0,0.0,2018-12-01,332.95,2019-01-01,2018-12-01,0,,1,Joint App,43000.0,14.01,Source Verified,0,0,18254,0,1,0,1,16.0,18217,81.0,0,0,0,54.0,11500,1,1,0,1,6085.0,,,0,0,16.0,190,113,16,0,,,16.0,,0,0,1,0,0,1,2,2,2,3,0.0,0,0,0,100.0,,0,0,33876,18254,0,22376,2024.0,Oct-1996,0.0,0.0,8.0,18.7,1.0,9.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,5783


**Converting 'term' column from string to integer**


In [46]:
df.term.value_counts()

 36 months    89574
 60 months    38620
Name: term, dtype: int64

In [47]:
df.term.values

array([' 60 months', ' 60 months', ' 36 months', ..., ' 36 months',
       ' 36 months', ' 36 months'], dtype=object)

In [0]:
# Putting 'in months' in the term header for clarity
# (only columns= is passed, so the row index is left as it is)
df = df.rename(columns={'term': 'term (in months)'})

In [52]:
test = '60 months' # Test example

def turn_to_int(string):
  return int(string.replace('months', ''))

turn_to_int(test)

60

In [0]:
df['term (in months)'] = df['term (in months)'].apply(turn_to_int)

In [55]:
df['term (in months)'].values

array([60, 60, 36, ..., 36, 36, 36])
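A vectorized alternative pulls the digits out with a regular expression rather than removing the word 'months'. A minimal sketch on a made-up Series shaped like the `term` values (`term_months` is an illustrative name):

```python
import pandas as pd

# Hypothetical sample with the same leading space as the real column
term = pd.Series([' 60 months', ' 36 months', ' 36 months'])

# Capture the first run of digits; expand=False returns a Series, not a DataFrame
term_months = term.str.extract(r'(\d+)', expand=False).astype(int)
print(term_months.tolist())
```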

**Making loan_status_is_great column**

In [63]:
df['loan_status'].value_counts()

Current               121082
Fully Paid              4786
In Grace Period          948
Late (31-120 days)       920
Late (16-30 days)        348
Charged Off              110
Name: loan_status, dtype: int64

In [69]:
test2 = 'Late (31-120 days)'

def status_great(string):
  # 1 for loans that are current or fully paid, 0 for everything else
  if string in ('Current', 'Fully Paid'):
    return 1
  return 0
  
status_great(test2)

0

In [0]:
df['loan_status_is_great'] = df['loan_status'].apply(status_great)

In [71]:
df['loan_status_is_great'].value_counts()

1    125868
0      2326
Name: loan_status_is_great, dtype: int64
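The same flag can be built without `apply` by using `isin`. A minimal sketch on a made-up Series of statuses (`status` and `is_great` are illustrative names):

```python
import pandas as pd

# Hypothetical sample of loan_status values
status = pd.Series(['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off'])

# isin gives True/False per row; astype(int) turns that into 1/0
is_great = status.isin(['Current', 'Fully Paid']).astype(int)
print(is_great.tolist())
```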

In [72]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term (in months),int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,days_from_earliest_credit_to_issue,loan_status_is_great
0,,,20000,20000,20000,60,17.97,507.55,D,D1,Motor Vehicle Operator,7 years,RENT,68000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,254xx,WV,15.8,0,2009-02-01,1,30.0,118.0,11,1,13483,69.1%,26,w,19580.78,19580.78,975.17,975.17,419.22,555.95,0.0,0.0,0.0,2018-12-01,507.55,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,194908,0,2,0,2,17.0,45299,93.0,0,1,9815,86.0,19500,1,4,2,5,19491.0,2185.0,81.8,0,0,115.0,94,13,7,4,25.0,,2.0,30.0,0,1,3,2,4,11,8,10,3,11,0.0,0,0,1,92.3,50.0,1,0,205625,58782,12000,48733,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,3499,1
1,,,25000,25000,25000,60,13.56,576.02,C,C1,Firefighter,7 years,MORTGAGE,61000.0,Verified,2018-09-01,Current,n,,,major_purchase,Major purchase,770xx,TX,16.74,0,2009-01-01,0,,,5,0,6299,48.5%,14,w,24109.45,24109.45,1567.98,1567.98,890.55,677.43,0.0,0.0,0.0,2018-12-01,576.02,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,204007,1,2,0,1,19.0,13862,37.0,0,0,0,40.0,13000,0,7,4,2,40801.0,1500.0,0.0,0,0,54.0,116,26,3,1,74.0,,4.0,,0,0,1,1,2,5,2,8,1,5,0.0,0,0,1,100.0,0.0,0,0,234573,20161,1500,37273,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3530,1
2,,,30000,30000,30000,36,18.94,1098.78,D,D2,Unknown,< 1 year,RENT,100000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,300xx,GA,16.07,0,2008-03-01,1,,114.0,6,1,14574,70.1%,9,w,28739.57,28739.57,2134.43,2134.43,1260.43,874.0,0.0,0.0,0.0,2018-12-01,1098.78,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,44048,1,2,1,1,1.0,29474,69.0,1,2,8521,69.0,20800,1,0,2,3,8810.0,5226.0,73.6,0,0,126.0,104,11,1,0,17.0,,1.0,,0,2,2,2,2,4,4,5,2,6,0.0,0,0,2,100.0,0.0,1,0,63636,44048,19800,42836,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3836,1
3,,,6000,6000,6000,36,7.84,187.58,A,A4,Unknown,,RENT,30000.0,Not Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,923xx,CA,5.44,0,2000-04-01,0,,104.0,8,1,5936,34.5%,11,w,5702.27,5702.27,369.93,369.93,297.73,72.2,0.0,0.0,0.0,2018-12-01,187.58,2019-01-01,2018-12-01,0,,1,Individual,,,,0,350,5936,0,0,0,1,23.0,0,,1,4,2913,35.0,17200,2,0,0,5,848.0,7698.0,35.9,0,0,139.0,221,7,7,0,7.0,,18.0,,0,2,3,4,4,2,8,9,3,8,0.0,0,0,1,100.0,33.3,1,0,17200,5936,12000,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,6727,1
4,,,10650,10650,10650,36,7.84,332.95,A,A4,Unknown,,RENT,28000.0,Verified,2018-09-01,Current,n,,,medical,Medical expenses,430xx,OH,16.89,0,2002-11-01,0,,,3,0,37,0.3%,3,w,10121.54,10121.54,656.62,656.62,528.46,128.16,0.0,0.0,0.0,2018-12-01,332.95,2019-01-01,2018-12-01,0,,1,Joint App,43000.0,14.01,Source Verified,0,0,18254,0,1,0,1,16.0,18217,81.0,0,0,0,54.0,11500,1,1,0,1,6085.0,,,0,0,16.0,190,113,16,0,,,16.0,,0,0,1,0,0,1,2,2,2,3,0.0,0,0,0,100.0,,0,0,33876,18254,0,22376,2024.0,Oct-1996,0.0,0.0,8.0,18.7,1.0,9.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,5783,1


**Making last_pymnt_d_month and last_pymnt_d_year columns**

In [73]:
df['last_pymnt_d'].values

array(['2018-12-01T00:00:00.000000000', '2018-12-01T00:00:00.000000000',
       '2018-12-01T00:00:00.000000000', ...,
       '2018-12-01T00:00:00.000000000', '2018-12-01T00:00:00.000000000',
       '2018-12-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [0]:
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year
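
The new columns show up as floats (`12.0`, `2018.0`) rather than integers. A likely reason is that `last_pymnt_d` contains missing dates, and pandas represents the missing month or year as `NaN`, which forces a float dtype. The sketch below reproduces this on a toy Series (the values are made up for illustration, not taken from the dataset) and shows the optional nullable `Int64` dtype as one way to keep whole numbers.

```python
import pandas as pd

# Toy stand-in for df['last_pymnt_d']; the last value is missing (NaT)
last_pymnt_d = pd.Series(pd.to_datetime(['2018-12-01', '2018-11-01', None]))

# .dt.month / .dt.year return float64 here because NaT becomes NaN
month = last_pymnt_d.dt.month
year = last_pymnt_d.dt.year

# Optional: the nullable 'Int64' dtype keeps whole numbers and marks gaps as <NA>
month_int = month.astype('Int64')
year_int = year.astype('Int64')

print(month.dtype, month_int.dtype)
```

Whether to keep the floats or convert depends on what happens next: the float columns still contain `NaN`, and the `Int64` columns still contain `<NA>`, so missing values need handling either way.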

In [79]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term (in months),int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_
plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,days_from_earliest_credit_to_issue,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,,,20000,20000,20000,60,17.97,507.55,D,D1,Motor Vehicle Operator,7 years,RENT,68000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,254xx,WV,15.8,0,2009-02-01,1,30.0,118.0,11,1,13483,69.1%,26,w,19580.78,19580.78,975.17,975.17,419.22,555.95,0.0,0.0,0.0,2018-12-01,507.55,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,194908,0,2,0,2,17.0,45299,93.0,0,1,9815,86.0,19500,1,4,2,5,19491.0,2185.0,81.8,0,0,115.0,94,13,7,4,25.0,,2.0,30.0,0,1,3,2,4,11,8,10,3,11,0.0,0,0,1,92.3,50.0,1,0,205625,58782,12000,48733,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,3499,1,12.0,2018.0
1,,,25000,25000,25000,60,13.56,576.02,C,C1,Firefighter,7 years,MORTGAGE,61000.0,Verified,2018-09-01,Current,n,,,major_purchase,Major purchase,770xx,TX,16.74,0,2009-01-01,0,,,5,0,6299,48.5%,14,w,24109.45,24109.45,1567.98,1567.98,890.55,677.43,0.0,0.0,0.0,2018-12-01,576.02,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,204007,1,2,0,1,19.0,13862,37.0,0,0,0,40.0,13000,0,7,4,2,40801.0,1500.0,0.0,0,0,54.0,116,26,3,1,74.0,,4.0,,0,0,1,1,2,5,2,8,1,5,0.0,0,0,1,100.0,0.0,0,0,234573,20161,1500,37273,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3530,1,12.0,2018.0
2,,,30000,30000,30000,36,18.94,1098.78,D,D2,Unknown,< 1 year,RENT,100000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,300xx,GA,16.07,0,2008-03-01,1,,114.0,6,1,14574,70.1%,9,w,28739.57,28739.57,2134.43,2134.43,1260.43,874.0,0.0,0.0,0.0,2018-12-01,1098.78,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,44048,1,2,1,1,1.0,29474,69.0,1,2,8521,69.0,20800,1,0,2,3,8810.0,5226.0,73.6,0,0,126.0,104,11,1,0,17.0,,1.0,,0,2,2,2,2,4,4,5,2,6,0.0,0,0,2,100.0,0.0,1,0,63636,44048,19800,42836,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3836,1,12.0,2018.0
3,,,6000,6000,6000,36,7.84,187.58,A,A4,Unknown,,RENT,30000.0,Not Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,923xx,CA,5.44,0,2000-04-01,0,,104.0,8,1,5936,34.5%,11,w,5702.27,5702.27,369.93,369.93,297.73,72.2,0.0,0.0,0.0,2018-12-01,187.58,2019-01-01,2018-12-01,0,,1,Individual,,,,0,350,5936,0,0,0,1,23.0,0,,1,4,2913,35.0,17200,2,0,0,5,848.0,7698.0,35.9,0,0,139.0,221,7,7,0,7.0,,18.0,,0,2,3,4,4,2,8,9,3,8,0.0,0,0,1,100.0,33.3,1,0,17200,5936,12000,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,6727,1,12.0,2018.0
4,,,10650,10650,10650,36,7.84,332.95,A,A4,Unknown,,RENT,28000.0,Verified,2018-09-01,Current,n,,,medical,Medical expenses,430xx,OH,16.89,0,2002-11-01,0,,,3,0,37,0.3%,3,w,10121.54,10121.54,656.62,656.62,528.46,128.16,0.0,0.0,0.0,2018-12-01,332.95,2019-01-01,2018-12-01,0,,1,Joint App,43000.0,14.01,Source Verified,0,0,18254,0,1,0,1,16.0,18217,81.0,0,0,0,54.0,11500,1,1,0,1,6085.0,,,0,0,16.0,190,113,16,0,,,16.0,,0,0,1,0,0,1,2,2,2,3,0.0,0,0,0,100.0,,0,0,33876,18254,0,22376,2024.0,Oct-1996,0.0,0.0,8.0,18.7,1.0,9.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,5783,1,12.0,2018.0


In [81]:
df.shape

(128194, 152)

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Process the dataframe so that `ready_for_sklearn(df)` returns `True`. You can drop columns, or select the subset of numeric columns with no missing values. (Or you can try automating the process to handle missing values and convert objects to numbers!)
- Take initiative and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01

## **More LendingClub**

In [82]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term (in months),int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_
plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,days_from_earliest_credit_to_issue,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,,,20000,20000,20000,60,17.97,507.55,D,D1,Motor Vehicle Operator,7 years,RENT,68000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,254xx,WV,15.8,0,2009-02-01,1,30.0,118.0,11,1,13483,69.1%,26,w,19580.78,19580.78,975.17,975.17,419.22,555.95,0.0,0.0,0.0,2018-12-01,507.55,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,194908,0,2,0,2,17.0,45299,93.0,0,1,9815,86.0,19500,1,4,2,5,19491.0,2185.0,81.8,0,0,115.0,94,13,7,4,25.0,,2.0,30.0,0,1,3,2,4,11,8,10,3,11,0.0,0,0,1,92.3,50.0,1,0,205625,58782,12000,48733,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,3499,1,12.0,2018.0
1,,,25000,25000,25000,60,13.56,576.02,C,C1,Firefighter,7 years,MORTGAGE,61000.0,Verified,2018-09-01,Current,n,,,major_purchase,Major purchase,770xx,TX,16.74,0,2009-01-01,0,,,5,0,6299,48.5%,14,w,24109.45,24109.45,1567.98,1567.98,890.55,677.43,0.0,0.0,0.0,2018-12-01,576.02,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,204007,1,2,0,1,19.0,13862,37.0,0,0,0,40.0,13000,0,7,4,2,40801.0,1500.0,0.0,0,0,54.0,116,26,3,1,74.0,,4.0,,0,0,1,1,2,5,2,8,1,5,0.0,0,0,1,100.0,0.0,0,0,234573,20161,1500,37273,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3530,1,12.0,2018.0
2,,,30000,30000,30000,36,18.94,1098.78,D,D2,Unknown,< 1 year,RENT,100000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,300xx,GA,16.07,0,2008-03-01,1,,114.0,6,1,14574,70.1%,9,w,28739.57,28739.57,2134.43,2134.43,1260.43,874.0,0.0,0.0,0.0,2018-12-01,1098.78,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,44048,1,2,1,1,1.0,29474,69.0,1,2,8521,69.0,20800,1,0,2,3,8810.0,5226.0,73.6,0,0,126.0,104,11,1,0,17.0,,1.0,,0,2,2,2,2,4,4,5,2,6,0.0,0,0,2,100.0,0.0,1,0,63636,44048,19800,42836,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3836,1,12.0,2018.0
3,,,6000,6000,6000,36,7.84,187.58,A,A4,Unknown,,RENT,30000.0,Not Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,923xx,CA,5.44,0,2000-04-01,0,,104.0,8,1,5936,34.5%,11,w,5702.27,5702.27,369.93,369.93,297.73,72.2,0.0,0.0,0.0,2018-12-01,187.58,2019-01-01,2018-12-01,0,,1,Individual,,,,0,350,5936,0,0,0,1,23.0,0,,1,4,2913,35.0,17200,2,0,0,5,848.0,7698.0,35.9,0,0,139.0,221,7,7,0,7.0,,18.0,,0,2,3,4,4,2,8,9,3,8,0.0,0,0,1,100.0,33.3,1,0,17200,5936,12000,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,6727,1,12.0,2018.0
4,,,10650,10650,10650,36,7.84,332.95,A,A4,Unknown,,RENT,28000.0,Verified,2018-09-01,Current,n,,,medical,Medical expenses,430xx,OH,16.89,0,2002-11-01,0,,,3,0,37,0.3%,3,w,10121.54,10121.54,656.62,656.62,528.46,128.16,0.0,0.0,0.0,2018-12-01,332.95,2019-01-01,2018-12-01,0,,1,Joint App,43000.0,14.01,Source Verified,0,0,18254,0,1,0,1,16.0,18217,81.0,0,0,0,54.0,11500,1,1,0,1,6085.0,,,0,0,16.0,190,113,16,0,,,16.0,,0,0,1,0,0,1,2,2,2,3,0.0,0,0,0,100.0,,0,0,33876,18254,0,22376,2024.0,Oct-1996,0.0,0.0,8.0,18.7,1.0,9.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,5783,1,12.0,2018.0


**Converting revol_util column to floats and handling missing values**

In [84]:
df['revol_util'].values

array(['69.1%', '48.5%', '70.1%', ..., '63%', '28.1%', '45.4%'],
      dtype=object)

In [105]:
df['revol_util'].isnull().sum()

129

In [110]:
test3 = '10.6%'

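def turn_to_float(string):
  # Strip the '%' from strings; anything else (e.g. a missing value) becomes NaN.
  # Note: `string == np.nan` is always False, so it cannot be used to detect NaN.
  if isinstance(string, str):
    return float(string.strip('%'))
  return np.nan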

turn_to_float(test3)

10.6

In [0]:
df['revol_util'] = df['revol_util'].apply(turn_to_float)

In [113]:
df['revol_util'].isnull().sum()

129

In [114]:
df['revol_util'].mean()

44.45021668683871

In [0]:
# Call .mean() so missing values are filled with the mean itself, not the method object
df['revol_util'] = df['revol_util'].fillna(df['revol_util'].mean())

In [117]:
df['revol_util'].isnull().sum()

0
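
The same cleanup can also be written without a helper function, using the vectorized string methods from the Handbook chapter linked above. This is a minimal sketch on a toy Series (values invented for illustration); it assumes every non-missing entry is a number followed by `%`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['revol_util'], with one missing value
revol_util = pd.Series(['69.1%', '48.5%', np.nan, '63%'])

# .str methods skip missing values, so NaN stays NaN through the conversion
as_float = revol_util.str.rstrip('%').astype(float)

# Fill the gaps with the column mean (note the parentheses on .mean())
filled = as_float.fillna(as_float.mean())

print(filled)
```

Filling with the mean is only one choice; the median or a separate "was missing" indicator column are common alternatives.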

**Modifying the emp_title column to replace titles with 'Other' if the title is not in the top 20.**

In [125]:
top_twenty = df['emp_title'].value_counts().head(20).reset_index()
top_twenty

Unnamed: 0,index,emp_title
0,Unknown,13437
1,Teacher,2843
2,Manager,2749
3,Owner,1856
4,Driver,1498
5,Registered Nurse,1386
6,Supervisor,1345
7,Sales,980
8,Truck Driver,921
9,Rn,905


In [132]:
test4 = 'Basketball Player'

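top_titles = set(top_twenty['index'])  # build the lookup once instead of on every call

def replace_with_other(string):
  # Keep titles that are in the top 20; collapse everything else into 'Other'
  return string if string in top_titles else 'Other'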
  
replace_with_other(test4)

'Other'

In [0]:
df['emp_title'] = df['emp_title'].apply(replace_with_other)
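
An alternative to `apply` is to combine `isin` with `where`, which works on the whole column at once. The sketch below uses a toy Series and a hypothetical `top_n = 2` so the result is easy to check by eye; the notebook's version would use 20. Reading the titles from `.index` avoids relying on the column names that `reset_index()` produces, which can differ between pandas versions.

```python
import pandas as pd

# Toy stand-in for df['emp_title']
titles = pd.Series(['Teacher', 'Manager', 'Teacher', 'Basketball Player',
                    'Manager', 'Teacher', 'Owner'])

top_n = 2  # hypothetical; the notebook uses 20
top_titles = titles.value_counts().head(top_n).index

# Keep values that are in the top list, replace the rest with 'Other'
reduced = titles.where(titles.isin(top_titles), 'Other')

print(reduced.value_counts())
```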

In [138]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term (in months),int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_
plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,days_from_earliest_credit_to_issue,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,,,20000,20000,20000,60,17.97,507.55,D,D1,Other,7 years,RENT,68000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,254xx,WV,15.8,0,2009-02-01,1,30.0,118.0,11,1,13483,69.1,26,w,19580.78,19580.78,975.17,975.17,419.22,555.95,0.0,0.0,0.0,2018-12-01,507.55,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,194908,0,2,0,2,17.0,45299,93.0,0,1,9815,86.0,19500,1,4,2,5,19491.0,2185.0,81.8,0,0,115.0,94,13,7,4,25.0,,2.0,30.0,0,1,3,2,4,11,8,10,3,11,0.0,0,0,1,92.3,50.0,1,0,205625,58782,12000,48733,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,3499,1,12.0,2018.0
1,,,25000,25000,25000,60,13.56,576.02,C,C1,Other,7 years,MORTGAGE,61000.0,Verified,2018-09-01,Current,n,,,major_purchase,Major purchase,770xx,TX,16.74,0,2009-01-01,0,,,5,0,6299,48.5,14,w,24109.45,24109.45,1567.98,1567.98,890.55,677.43,0.0,0.0,0.0,2018-12-01,576.02,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,204007,1,2,0,1,19.0,13862,37.0,0,0,0,40.0,13000,0,7,4,2,40801.0,1500.0,0.0,0,0,54.0,116,26,3,1,74.0,,4.0,,0,0,1,1,2,5,2,8,1,5,0.0,0,0,1,100.0,0.0,0,0,234573,20161,1500,37273,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3530,1,12.0,2018.0
2,,,30000,30000,30000,36,18.94,1098.78,D,D2,Unknown,< 1 year,RENT,100000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,300xx,GA,16.07,0,2008-03-01,1,,114.0,6,1,14574,70.1,9,w,28739.57,28739.57,2134.43,2134.43,1260.43,874.0,0.0,0.0,0.0,2018-12-01,1098.78,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,44048,1,2,1,1,1.0,29474,69.0,1,2,8521,69.0,20800,1,0,2,3,8810.0,5226.0,73.6,0,0,126.0,104,11,1,0,17.0,,1.0,,0,2,2,2,2,4,4,5,2,6,0.0,0,0,2,100.0,0.0,1,0,63636,44048,19800,42836,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,3836,1,12.0,2018.0
3,,,6000,6000,6000,36,7.84,187.58,A,A4,Unknown,,RENT,30000.0,Not Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,923xx,CA,5.44,0,2000-04-01,0,,104.0,8,1,5936,34.5,11,w,5702.27,5702.27,369.93,369.93,297.73,72.2,0.0,0.0,0.0,2018-12-01,187.58,2019-01-01,2018-12-01,0,,1,Individual,,,,0,350,5936,0,0,0,1,23.0,0,,1,4,2913,35.0,17200,2,0,0,5,848.0,7698.0,35.9,0,0,139.0,221,7,7,0,7.0,,18.0,,0,2,3,4,4,2,8,9,3,8,0.0,0,0,1,100.0,33.3,1,0,17200,5936,12000,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False,2018,9,6727,1,12.0,2018.0
4,,,10650,10650,10650,36,7.84,332.95,A,A4,Unknown,,RENT,28000.0,Verified,2018-09-01,Current,n,,,medical,Medical expenses,430xx,OH,16.89,0,2002-11-01,0,,,3,0,37,0.3,3,w,10121.54,10121.54,656.62,656.62,528.46,128.16,0.0,0.0,0.0,2018-12-01,332.95,2019-01-01,2018-12-01,0,,1,Joint App,43000.0,14.01,Source Verified,0,0,18254,0,1,0,1,16.0,18217,81.0,0,0,0,54.0,11500,1,1,0,1,6085.0,,,0,0,16.0,190,113,16,0,,,16.0,,0,0,1,0,0,1,2,2,2,3,0.0,0,0,0,100.0,,0,0,33876,18254,0,22376,2024.0,Oct-1996,0.0,0.0,8.0,18.7,1.0,9.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9,5783,1,12.0,2018.0


**Processing the dataframe so that ready_for_sklearn(df) returns True.**

In [0]:
# import numpy as np

# def all_numeric(df):
#   return all ((df.dtypes==np.number) |
#              (df.dtypes==bool))

# def no_nulls(df):
#   return not any(df.isnull().sum())

# def ready_for_sklearn(df):
#   return all_numeric(df) and no_nulls(df)
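
The column-by-column checks that follow could also be automated. The sketch below is one possible approach, not the lesson's reference solution: it re-implements the three helpers with `pandas.api.types.is_numeric_dtype` (an assumption on my part, since comparing dtypes to `np.number` with `==` may not match every numeric dtype) and then keeps only the numeric columns that have no missing values. The `toy` dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def all_numeric(df):
    # True only if every column has a numeric or boolean dtype
    return all(is_numeric_dtype(t) or is_bool_dtype(t) for t in df.dtypes)

def no_nulls(df):
    # True only if no cell in the dataframe is missing
    return not df.isnull().values.any()

def ready_for_sklearn(df):
    return all_numeric(df) and no_nulls(df)

# Toy dataframe: one clean numeric column, one string column, one with a NaN
toy = pd.DataFrame({'loan_amnt': [20000, 25000],
                    'grade': ['D', 'C'],
                    'dti': [15.8, np.nan]})

# Keep numeric/boolean columns, then drop any column that contains a missing value
subset = toy.select_dtypes(include=['number', 'bool']).dropna(axis=1)

print(ready_for_sklearn(toy), ready_for_sklearn(subset))
```

Dropping columns is the bluntest option; it throws away `grade` and `dti` here, which could instead be encoded or imputed.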

In [141]:
df.shape

(128194, 152)

In [140]:
df['id'].isnull().sum()

128194

In [0]:
df = df.drop('id', axis=1)

In [145]:
df['member_id'].isnull().sum()

128194

In [0]:
df = df.drop('member_id', axis=1)

In [149]:
df['loan_amnt'].isnull().sum()

0

In [152]:
df['funded_amnt'].isnull().sum(), df['funded_amnt'].dtype

(0, dtype('int64'))

In [153]:
df['funded_amnt_inv'].isnull().sum(), df['funded_amnt_inv'].dtype

(0, dtype('int64'))

In [154]:
df['int_rate'].isnull().sum(), df['int_rate'].dtype

(0, dtype('float64'))

In [155]:
df['installment'].isnull().sum(), df['installment'].dtype

(0, dtype('float64'))

In [157]:
df['emp_length'].isnull().sum(), df['emp_length'].dtype

(10387, dtype('O'))

In [158]:
df['emp_length'].value_counts()

10+ years    42035
2 years      11407
< 1 year     11353
3 years      10540
1 year        8295
4 years       8238
5 years       8103
6 years       5750
7 years       4629
8 years       4311
9 years       3146
Name: emp_length, dtype: int64