<a href="https://colab.research.google.com/github/nrvanwyck/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/module4-makefeatures/LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-05-30 21:02:17--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.2’

LoanStats_2018Q4.cs     [               <=>  ]  21.42M   798KB/s    in 28s     

2019-05-30 21:02:46 (791 KB/s) - ‘LoanStats_2018Q4.csv.zip.2’ saved [22458773]



In [2]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: yes
  inflating: LoanStats_2018Q4.csv    


In [3]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [0]:
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')

In [0]:
df = df.drop(['id', 'url', 'desc', 'member_id'], axis='columns')

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get info about which columns have a datatype of "object" (strings), for example with `df.dtypes` or `df.select_dtypes('object')`.
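One way to list the string columns is `select_dtypes`. The sketch below uses a small made-up frame rather than the real file: the values are illustrative, and the strings are stored with an explicit `object` dtype, which is an assumption about how this notebook's `read_csv` call loads them. On the real data the equivalent call would be `df.select_dtypes(include='object')`.

```python
import pandas as pd

# Toy frame standing in for the LendingClub data (values are illustrative)
toy = pd.DataFrame({
    'loan_amnt': [10000, 9600],
    'int_rate': pd.Series(['10.33%', '12.98%'], dtype=object),
    'term': pd.Series(['36 months', '36 months'], dtype=object),
})

# Keep only the columns whose dtype is "object"
object_columns = toy.select_dtypes(include='object').columns.tolist()
print(object_columns)
```

Every column in that list is a candidate for conversion to numbers or for cleaning.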

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
def remove_percent(string):
  return float(string.strip('%'))

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_percent)
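As an alternative to `apply`, the same conversion can probably be written with the vectorized `.str` accessor described in the Handbook chapter linked above. This is only a sketch on made-up values; `rates` and `rates_float` are hypothetical names, not columns from the dataset.

```python
import pandas as pd

# Illustrative values in the same style as the raw int_rate column
rates = pd.Series([' 10.33%', ' 12.98%', ' 23.40%'])

# Strip spaces and percent signs from both ends, then convert to float
rates_float = rates.str.strip(' %').astype(float)
print(rates_float.tolist())
```

The vectorized form avoids writing a helper function and tends to read more clearly when the cleaning is a single step.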

### Clean `emp_title`

Look at top 20 titles

In [10]:
df['emp_title'].value_counts(dropna=False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

How often is `emp_title` null?

In [11]:
df['emp_title'].isnull().sum()

20947

Clean the title and handle missing values

In [0]:
def clean_title(title):
  if isinstance(title, str):
    return title.strip().title()
  else:
    return "Unknown"
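For comparison, here is a sketch of the same cleaning done with the `.str` accessor and `fillna` instead of a Python function. The sample titles are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up titles, including stray whitespace, mixed case, and a missing value
titles = pd.Series([' manager ', 'RN', 'Truck driver', np.nan])

# Strip whitespace and title-case the strings, then label missing values
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.tolist())
```

Missing values pass through the string methods untouched, so `fillna` can label them at the end.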

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [14]:
df['emp_title'].value_counts()

Unknown                                     20947
Teacher                                      2557
Manager                                      2395
Registered Nurse                             1418
Driver                                       1258
Supervisor                                   1160
Truck Driver                                  920
Rn                                            834
Office Manager                                805
Sales                                         803
General Manager                               791
Project Manager                               720
Owner                                         625
Director                                      523
Operations Manager                            518
Sales Manager                                 500
Police Officer                                440
Nurse                                         425
Technician                                    420
Engineer                                      412


In [15]:
df['emp_title_manager'] = df['emp_title'].str.contains("Manager")
df['emp_title_manager']

0         False
1         False
2         False
3         False
4         False
5         False
6         False
7          True
8         False
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18         True
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
30         True
31        False
32        False
33        False
34        False
35        False
36        False
37         True
38        False
39        False
40        False
41        False
42        False
43        False
44        False
45        False
46         True
47        False
48        False
49        False
50        False
51        False
52        False
53        False
54        False
55         True
56        False
57        False
58        False
59        False
60        False
61        False
62      
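`str.contains` is case-sensitive by default, which is why the lesson cleans the titles first. If the titles had not been cleaned, a sketch like the following (on made-up values) could match regardless of case and treat missing titles as `False`, using the `case` and `na` parameters.

```python
import pandas as pd

# Made-up, uncleaned titles with mixed case and a missing value
raw_titles = pd.Series(['Office Manager', 'manager', 'Teacher', None])

# case=False ignores capitalization; na=False maps missing titles to False
is_manager = raw_titles.str.contains('manager', case=False, na=False)
print(is_manager.tolist())
```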

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)

In [0]:
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month
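To see the parsing and the `.dt` accessor in isolation, here is a small sketch. It assumes the raw dates look like `Dec-2018` (the style of the unparsed date values shown in this notebook's output) and passes an explicit `format` rather than relying on inference; month-name parsing with `%b` also assumes an English locale.

```python
import pandas as pd

# Made-up month-year strings in the assumed raw format
raw_dates = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# Parse with an explicit format; the day defaults to the 1st of the month
parsed = pd.to_datetime(raw_dates, format='%b-%Y')

# Pull date components out as new numeric features
years = parsed.dt.year
months = parsed.dt.month
print(years.tolist(), months.tolist())
```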

In [0]:
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'],
                                       infer_datetime_format=True)

In [21]:
df['days_from_earliest_credit_to_issue'] = (df['issue_d'] - 
                                            df['earliest_cr_line']).dt.days

df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,days_from_earliest_credit_to_issue
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,1996-01-01,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,Apr-2019,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,8370
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,2003-11-01,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,Apr-2019,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,5509
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,2006-09-01,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,Apr-2019,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,4474
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,2001-04-01,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,Apr-2019,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,6453
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,1987-06-01,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,Apr-2019,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,11506
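The `days_from_earliest_credit_to_issue` feature relies on the fact that subtracting two datetime columns gives a timedelta column, whose `.dt.days` attribute is an integer count of days. A minimal sketch, reusing the dates from the first two rows of the table above:

```python
import pandas as pd

# issue_d and earliest_cr_line values from rows 0 and 1 of the output above
issue = pd.to_datetime(pd.Series(['2018-12-01', '2018-12-01']))
earliest = pd.to_datetime(pd.Series(['1996-01-01', '2003-11-01']))

# Datetime minus datetime yields timedeltas; .dt.days extracts whole days
elapsed = issue - earliest
days = elapsed.dt.days
print(days.tolist())
```

The results match the last column of those two rows (8370 and 5509).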


# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [22]:
# code to get LendingClub dataset to end-of-lecture state is above

df['term']

0          36 months
1          36 months
2          36 months
3          36 months
4          60 months
5          36 months
6          60 months
7          36 months
8          60 months
9          60 months
10         60 months
11         60 months
12         60 months
13         60 months
14         36 months
15         36 months
16         36 months
17         60 months
18         60 months
19         60 months
20         36 months
21         36 months
22         36 months
23         60 months
24         60 months
25         36 months
26         36 months
27         36 months
28         36 months
29         60 months
30         60 months
31         36 months
32         36 months
33         36 months
34         36 months
35         36 months
36         36 months
37         36 months
38         60 months
39         36 months
40         60 months
41         60 months
42         36 months
43         36 months
44         60 months
45         36 months
46         60 months
47         36

In [28]:
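def remove_months(string):
  # str.strip(' months') removes a set of characters, not the word "months",
  # so take the first whitespace-separated token instead
  return int(string.split()[0])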

df['term'] = df['term'].apply(remove_months)

df['term']

0         36
1         36
2         36
3         36
4         60
5         36
6         60
7         36
8         60
9         60
10        60
11        60
12        60
13        60
14        36
15        36
16        36
17        60
18        60
19        60
20        36
21        36
22        36
23        60
24        60
25        36
26        36
27        36
28        36
29        60
30        60
31        36
32        36
33        36
34        36
35        36
36        36
37        36
38        60
39        36
40        60
41        60
42        36
43        36
44        60
45        36
46        60
47        36
48        36
49        60
50        36
51        60
52        60
53        36
54        60
55        36
56        60
57        36
58        60
59        36
60        36
61        36
62        36
63        60
64        60
65        60
66        60
67        36
68        36
69        36
70        36
71        36
72        36
73        36
74        36
75        60
76        36
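A vectorized alternative for the `term` conversion is to pull the digits out with a regular expression. This is a sketch on made-up values in the same style as the column; `terms` and `term_months` are hypothetical names.

```python
import pandas as pd

# Made-up values in the same style as the raw term column
terms = pd.Series([' 36 months', ' 60 months', ' 36 months'])

# Capture the first run of digits; expand=False returns a Series
term_months = terms.str.extract(r'(\d+)', expand=False).astype(int)
print(term_months.tolist())
```

Extracting the digits does not depend on the exact surrounding text, so it should tolerate small variations in spacing.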

In [46]:
def loan_status_evaluator(string):
  if string == 'Current':
    return 1
  elif string == 'Fully Paid':
    return 1
  else:
    return 0
  
df['loan_status_is_great'] = df['loan_status'].apply(loan_status_evaluator)

df[['loan_status', 'loan_status_is_great']]

Unnamed: 0,loan_status,loan_status_is_great
0,Current,1
1,Current,1
2,Current,1
3,Current,1
4,Current,1
5,Current,1
6,Current,1
7,Current,1
8,Current,1
9,Current,1
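The same flag can likely be built without a helper function by combining `isin` with `astype(int)`. The sketch below uses made-up status values; only "Current" and "Fully Paid" come from the assignment text, and the other labels are illustrative.

```python
import pandas as pd

# Made-up statuses; the non-great labels are illustrative
statuses = pd.Series(['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off'])

# isin gives True/False per row; astype(int) turns that into 1/0
is_great = statuses.isin(['Current', 'Fully Paid']).astype(int)
print(is_great.tolist())
```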


In [68]:
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'], infer_datetime_format=True)

# without astype, new columns have floats due to NaNs; 'Int64' can handle NaNs
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month.astype('Int64')
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year.astype('Int64')

df[['last_pymnt_d', 'last_pymnt_d_month', 'last_pymnt_d_year']]

Int64
Int64


Unnamed: 0,last_pymnt_d,last_pymnt_d_month,last_pymnt_d_year
0,2019-04-01,4,2019
1,2019-04-01,4,2019
2,2019-04-01,4,2019
3,2019-04-01,4,2019
4,2019-04-01,4,2019
5,2019-04-01,4,2019
6,2019-04-01,4,2019
7,2019-04-01,4,2019
8,2019-04-01,4,2019
9,2019-04-01,4,2019
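To illustrate the comment about `'Int64'` in the cell above, here is a sketch with one missing payment date. The date strings are made up, and the `%b-%Y` format is an assumption based on the `Apr-2019` style values shown earlier in this notebook.

```python
import pandas as pd

# Made-up payment dates with one missing entry
payments = pd.to_datetime(pd.Series(['Apr-2019', None, 'May-2019']),
                          format='%b-%Y')

# The nullable 'Int64' dtype keeps integers and marks the gap as <NA>
months_nullable = payments.dt.month.astype('Int64')
print(months_nullable)
```

Note the capital "I": `'Int64'` is the pandas nullable integer type, while lowercase `'int64'` is the NumPy type and cannot hold missing values.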


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

In [84]:
# forward-fill missing values so remove_percent always receives a string
df['revol_util'] = df['revol_util'].ffill()

df['revol_util'] = df['revol_util'].apply(remove_percent)

df['revol_util']

0          38.0
1          11.5
2          19.2
3          10.3
4          24.2
5          19.1
6          13.0
7          81.3
8          32.0
9          55.5
10         73.7
11         24.8
12         65.5
13         37.2
14         83.1
15         37.5
16         33.5
17         50.9
18         93.7
19         39.8
20         59.2
21         61.2
22        101.8
23         81.0
24         63.5
25         43.9
26         71.0
27         37.6
28         31.1
29         79.1
30         36.8
31          3.9
32         28.9
33          9.8
34         44.1
35          7.5
36         23.8
37         47.2
38         85.0
39         46.9
40         59.9
41         72.4
42         44.2
43         88.0
44         37.5
45         57.8
46         82.9
47         11.9
48         19.2
49         96.5
50         12.2
51          2.1
52         64.1
53         41.7
54         28.9
55          8.2
56         66.4
57         63.6
58         64.5
59          0.3
60         38.0
61         49.1
62      
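Forward-filling copies the previous borrower's value into each gap, which is only one way to handle missing values. A different option, sketched here on made-up values, is to keep the gaps as `NaN` while converting; this swaps in `pd.to_numeric` rather than the `remove_percent` function used above.

```python
import pandas as pd

# Made-up utilization values with one missing entry
revol = pd.Series(['38%', '11.5%', None, '19.2%'])

# Strip spaces and percent signs, then convert; missing stays NaN
revol_float = pd.to_numeric(revol.str.strip(' %'), errors='coerce')
print(revol_float.tolist())
```

Keeping `NaN` leaves the decision about imputation to a later, explicit step.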

In [85]:
df['emp_title'].value_counts(dropna=False).head(20)

Unknown               20947
Teacher                2557
Manager                2395
Registered Nurse       1418
Driver                 1258
Supervisor             1160
Truck Driver            920
Rn                      834
Office Manager          805
Sales                   803
General Manager         791
Project Manager         720
Owner                   625
Director                523
Operations Manager      518
Sales Manager           500
Police Officer          440
Nurse                   425
Technician              420
Engineer                412
Name: emp_title, dtype: int64

In [123]:
# value_counts() sorts by frequency, so the first 20 index labels are the top 20 titles
top_20_titles = df['emp_title'].value_counts().index[:20].tolist()

def titles_simplifier(string):
  if (string in top_20_titles):
    return string
  else:
    return 'Other'
  
df['emp_title'] = df['emp_title'].apply(titles_simplifier)

df['emp_title']

# 'Unknown' stays its own category: those borrowers may hold top-20 jobs for all we know, so it is not folded into 'Other'

0                    Unknown
1                    Unknown
2                      Other
3                      Other
4                      Other
5                      Other
6                      Other
7                    Manager
8                      Other
9                    Unknown
10                     Other
11                     Other
12                     Other
13                   Unknown
14                     Other
15                     Other
16                     Other
17                     Other
18                     Other
19                     Other
20                     Other
21                     Other
22                   Unknown
23                   Unknown
24                     Other
25                     Other
26                   Unknown
27                     Other
28                   Unknown
29                     Other
30                     Other
31                   Unknown
32                     Other
33                     Other
34            
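The "keep the top N, replace the rest" pattern can also be sketched with `Series.where`, which keeps values where a condition holds and substitutes a replacement elsewhere. The titles and the choice of `top_n = 2` below are made up to keep the example small.

```python
import pandas as pd

# Made-up titles; with top_n = 2 only Teacher and Manager survive
titles = pd.Series(['Teacher', 'Teacher', 'Teacher',
                    'Manager', 'Manager', 'Chef', 'Postmaster'])
top_n = 2

# value_counts() is sorted by frequency, so the first labels are the most common
top_titles = titles.value_counts().index[:top_n]

# Keep titles that are in the top list, replace everything else with 'Other'
simplified = titles.where(titles.isin(top_titles), 'Other')
print(simplified.tolist())
```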

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01