_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [63]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

LoanStats_2018Q4.cs     [                <=> ]  21.40M   885KB/s    in 25s     

2019-05-05 08:58:59 (877 KB/s) - ‘LoanStats_2018Q4.csv.zip.1’ saved [22444881]

--2019-05-05 08:59:00--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.2’

LoanStats_2018Q4.cs     [               <=>  ]  21.40M   880KB/s    in 25s     

2019-05-05 08:59:26 (873 KB/s) - ‘LoanStats_2018Q4.csv.zip.2’ saved [22444881]



In [64]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: LoanStats_2018Q4.csv    


In [0]:
!head LoanStats_2018Q4.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)
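
When a CSV export wraps the table in extra lines, `read_csv` can skip them with `skiprows` and `skipfooter`. The sketch below uses a small in-memory file whose contents are invented for illustration; it only assumes that the real file has one notice line above the header and two summary lines at the end, which is what the arguments used in this section imply.

```python
import io

import pandas as pd

# Toy stand-in for the raw file: a notice line, the header, two data rows,
# and two trailing summary lines. The contents are invented for illustration.
raw = (
    "Notes offered by Prospectus\n"
    "loan_amnt,term,int_rate\n"
    "10000, 36 months,10.33%\n"
    "4000, 36 months,23.40%\n"
    "Total amount funded: 14000\n"
    "Total rows: 2\n"
)

# skiprows=1 drops the notice line so the real header row is used.
# skipfooter=2 drops the summary lines; it is only supported by the
# Python parser engine, so engine='python' is passed explicitly.
toy = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine='python')
print(toy)
```

Passing `engine='python'` explicitly avoids the parser warning that pandas emits when it has to fall back from the default C engine.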

In [0]:
import pandas as pd


In [0]:
df = pd.read_csv(filepath_or_buffer='LoanStats_2018Q4.csv')
df.head()

In [68]:
# skipfooter is only supported by the Python parser engine
df = pd.read_csv(filepath_or_buffer='LoanStats_2018Q4.csv', sep=',', skiprows=1, skipfooter=2, engine='python')


## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get info about which columns have a datatype of "object" (strings)
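
One way to list those columns is `DataFrame.select_dtypes`. This is a minimal sketch on a hypothetical miniature frame; the column names mirror the LendingClub data, but the values are made up, and `dtype=object` is set explicitly to match what `read_csv` produced for string columns in this notebook.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and two string columns.
sample = pd.DataFrame({
    'loan_amnt': [10000, 4000],
    'int_rate': pd.Series(['10.33%', '23.40%'], dtype=object),
    'emp_title': pd.Series(['Security', None], dtype=object),
})

# Keep only the columns stored with the generic "object" dtype.
object_columns = sample.select_dtypes(include='object').columns.tolist()
print(object_columns)
```

On the full dataframe, `df.describe(include='object')` gives a similar overview with counts and most frequent values.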

In [0]:
def strip_percent(x_str):
    # Remove the percent sign, then convert the remaining text to a float
    return float(x_str.strip('%'))

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
df['int_rate']= df['int_rate'].apply(strip_percent)

Apply the function to the `int_rate` column

In [71]:
df.int_rate.head()

0    10.33
1    23.40
2    17.97
3    12.98
4    13.56
Name: int_rate, dtype: float64
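
The same conversion can also be written without `apply`, using the vectorized `.str` accessor. A small sketch on a stand-in Series (the values are copied from the output above; the variable names are illustrative):

```python
import pandas as pd

# Stand-in for the raw int_rate column before conversion.
rates = pd.Series(['10.33%', '23.40%', '17.97%'], name='int_rate')

# Strip the percent sign from every element, then cast the column to float.
rates_float = rates.str.strip('%').astype(float)
print(rates_float)
```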

### Clean `emp_title`

Look at the first 10 titles

In [72]:
df['emp_title'].head(n=10)

0                   NaN
1              Security
2        Administrative
3                   NaN
4                  Chef
5           Postmaster 
6              Operator
7    Nursing Supervisor
8               Manager
9      Material Handler
Name: emp_title, dtype: object

How often is `emp_title` null? Count each title, including missing values, and show the 20 most common

In [73]:
df.emp_title.value_counts(dropna=False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

Count the missing values directly. Cleaning the titles and filling in the missing ones happens in the cells below

In [74]:
df['emp_title'].isnull().sum()

20947

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title'].str.lower()

In [0]:
import numpy as np

In [0]:
def clean_title(title):
    if isinstance(title, str):
        return title.strip().lower()
    else:
        return 'unknown'

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [0]:
df.emp_title.value_counts(dropna=False).head(20)

In [0]:
df['emp_title_manager']= df['emp_title'].str.contains('manager')

In [0]:
df['emp_title_manager']
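
As an alternative to a custom function with `apply`, the cleaning and the manager flag can be sketched with vectorized string methods. The Series below is a made-up sample, and `'unknown'` is the same placeholder used by `clean_title`.

```python
import pandas as pd

# Made-up sample with mixed case, stray whitespace, and a missing value.
titles = pd.Series(['Manager', ' Supervisor ', None, 'Project Manager'], dtype=object)

# Trim whitespace, lowercase, then fill missing titles with a placeholder.
cleaned = titles.str.strip().str.lower().fillna('unknown')

# regex=False treats 'manager' as a plain substring, not a pattern.
is_manager = cleaned.str.contains('manager', regex=False)
print(is_manager)
```

Filling missing values before calling `str.contains` keeps the result strictly boolean; otherwise missing titles would propagate as missing flags.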


## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
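
A minimal sketch of both ideas, assuming dates stored as text like `Apr-2019` (the format that `last_pymnt_d` shows in this dataset). The sample values and variable names are illustrative, and `%b` assumes English month abbreviations.

```python
import pandas as pd

# Illustrative sample in the 'Mon-YYYY' format, with one missing value.
pay_dates = pd.Series(['Apr-2019', 'Mar-2019', None, 'Dec-2018'])

# Parse the text into datetimes; missing entries become NaT.
parsed = pd.to_datetime(pay_dates, format='%b-%Y')

# The .dt accessor exposes date components as new Series.
months = parsed.dt.month
years = parsed.dt.year
print(months, years, sep='\n')
```

Because of the missing value, the component columns come back as floats with `NaN`, so any later integer conversion has to decide how to handle those rows.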

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [82]:
df.term.head(20)

0      36 months
1      36 months
2      36 months
3      36 months
4      36 months
5      60 months
6      60 months
7      60 months
8      36 months
9      60 months
10     60 months
11     60 months
12     60 months
13     60 months
14     36 months
15     36 months
16     36 months
17     60 months
18     60 months
19     60 months
Name: term, dtype: object

In [0]:
def strip_months(m_str):
    # ' 36 months' -> 36: take the leading number instead of stripping a character set
    return int(m_str.split()[0])

In [84]:
df.term.apply(strip_months).head(n=10)

0    36
1    36
2    36
3    36
4    36
5    60
6    60
7    60
8    36
9    60
Name: term, dtype: int64

In [85]:
df.columns


Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'debt_settlement_flag', 'debt_settlement_flag_date',
       'settlement_status', 'settlement_date', 'settlement_amount',
       'settlement_percentage', 'settlement_term', 'emp_title_manager'],
      dtype='object', length=145)

In [0]:
df['loan_status']

In [0]:
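def loan_stat(stat):
    # `stat == 'Current' or 'Fully Paid'` is always true, because a
    # non-empty string is truthy; test membership instead
    if stat in ('Current', 'Fully Paid'):
        return 1
    else:
        return 0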

In [0]:
#Make a column named loan_status_is_great. It should contain the integer 1 if loan_status is "Current" or "Fully Paid."
#Else it should contain the integer 0.

df['loan_status_is_great'] = df['loan_status'].apply(loan_stat)
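
One thing to watch when writing this kind of check: in Python a non-empty string is always truthy, so a condition of the form `x == 'a' or 'b'` is true for every input. A membership test avoids that, and `Series.isin` does the same thing for a whole column at once. A hedged sketch on a made-up sample of statuses:

```python
import pandas as pd

# Made-up sample; the last two values stand in for any "not great" status.
statuses = pd.Series(['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off'])

# A bare string literal is truthy, which is why `x == 'Current' or 'Fully Paid'`
# evaluates as true no matter what x is.
always_true = bool('Fully Paid')

# isin returns a boolean Series; astype(int) turns True/False into 1/0.
is_great = statuses.isin(['Current', 'Fully Paid']).astype(int)
print(is_great)
```

Checking `value_counts()` on the new column is a quick way to confirm that it contains both 0s and 1s.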

In [114]:
df['loan_status_is_great'].head(20)

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
Name: loan_status_is_great, dtype: int64

In [115]:
df['loan_status_is_great'][128407]

1

In [107]:
df['loan_status'].dtype

dtype('O')

In [0]:
df['loan_status_is_great'] = df.loan_status.apply(loan_stat)

In [110]:
df['loan_status_is_great']

0         1
1         1
2         1
3         1
4         1
5         1
6         1
7         1
8         1
9         1
10        1
11        1
12        1
13        1
14        1
15        1
16        1
17        1
18        1
19        1
20        1
21        1
22        1
23        1
24        1
25        1
26        1
27        1
28        1
29        1
         ..
128382    1
128383    1
128384    1
128385    1
128386    1
128387    1
128388    1
128389    1
128390    1
128391    1
128392    1
128393    1
128394    1
128395    1
128396    1
128397    1
128398    1
128399    1
128400    1
128401    1
128402    1
128403    1
128404    1
128405    1
128406    1
128407    1
128408    1
128409    1
128410    1
128411    1
Name: loan_status_is_great, Length: 128412, dtype: int64

In [116]:
df.columns


Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_last_payment_amount', 'debt_settlement_flag',
       'debt_settlement_flag_date', 'settlement_status', 'settlement_date',
       'settlement_amount', 'settlement_percentage', 'settlement_term',
       'emp_title_manager', 'loan_status_is_great'],
      dtype='object', length=146)

In [0]:
pd.set_option('display.max_columns', None)

In [120]:
df
# 'issue_d'  'term'

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,loan_status_is_great
0,,,10000,10000,10000.0,36 months,10.33,324.23,B,B1,unknown,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9035.04,9035.04,1288.310000,1288.31,964.96,323.35,0.0,0.0,0.0,Apr-2019,324.23,May-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
1,,,4000,4000,4000.0,36 months,23.40,155.68,E,E1,security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3680.07,3680.07,614.920000,614.92,319.93,294.99,0.0,0.0,0.0,Apr-2019,155.68,May-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
2,,,5000,5000,5000.0,36 months,17.97,180.69,D,D1,administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.1%,13,w,4567.57,4567.57,715.270000,715.27,432.43,282.84,0.0,0.0,0.0,Apr-2019,180.69,May-2019,Apr-2019,0,,1,Individual,,,,0,0,110299,0,1,0,2,14.0,7150,72.0,0,2,0,35.0,24100,1,5,0,4,18383.0,13800.0,0.0,0,0,87.0,92,15,14,2,77.0,,14.0,,0,0,3,3,3,4,6,7,3,8,0.0,0,0,0,100.0,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
3,,,9600,9600,9600.0,36 months,12.98,323.37,B,B5,unknown,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,,,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.350000,994.35,665.75,328.60,0.0,0.0,0.0,Apr-2019,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
4,,,2500,2500,2500.0,36 months,13.56,84.92,C,C1,chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2269.45,2269.45,336.860000,336.86,230.55,106.31,0.0,0.0,0.0,Apr-2019,84.92,May-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
5,,,30000,30000,30000.0,60 months,18.94,777.23,D,D2,postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,26250.75,26250.75,5561.570000,5561.57,3749.25,1812.32,0.0,0.0,0.0,Apr-2019,777.23,May-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
6,,,23000,23000,23000.0,60 months,20.89,620.81,D,D4,operator,5 years,RENT,68107.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,672xx,KS,0.52,0,Feb-1997,0,,,5,0,976,13%,10,w,22307.21,22307.21,1824.210000,1824.21,692.79,1131.42,0.0,0.0,0.0,Apr-2019,620.81,May-2019,Apr-2019,1,,1,Individual,,,,0,2693,976,0,0,0,0,36.0,0,,3,4,0,13.0,7500,2,2,4,4,195.0,3300.0,0.0,0,0,237.0,262,10,10,0,10.0,,9.0,,0,0,1,3,4,3,5,7,1,5,0.0,0,0,3,100.0,0.0,0,0,7500,976,3300,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
7,,,32075,32075,32075.0,60 months,11.80,710.26,B,B4,nursing supervisor,10+ years,MORTGAGE,150000.0,Not Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,231xx,VA,22.21,0,Aug-2005,0,,,17,0,19077,32%,24,w,30472.13,30472.13,2809.500000,2809.50,1602.87,1206.63,0.0,0.0,0.0,Apr-2019,710.26,May-2019,Apr-2019,0,,1,Individual,,,,0,0,272667,1,4,1,1,9.0,37558,47.0,1,1,3910,41.0,59600,1,2,0,3,16039.0,10446.0,47.8,0,0,160.0,70,4,4,2,27.0,,14.0,,0,4,10,4,4,8,12,14,10,17,0.0,0,0,2,100.0,50.0,0,0,360433,56635,20000,80125,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1
8,,,8000,8000,8000.0,36 months,23.40,311.35,E,E1,manager,10+ years,OWN,43000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,357xx,AL,33.24,0,Jan-1995,0,,107.0,8,1,9019,81.3%,16,w,7360.19,7360.19,1229.800000,1229.80,639.81,589.99,0.0,0.0,0.0,Apr-2019,311.35,May-2019,Apr-2019,0,,1,Individual,,,,0,0,169223,0,3,2,2,7.0,22059,69.0,0,0,2174,72.0,11100,1,1,1,2,21153.0,126.0,94.5,0,0,148.0,287,44,7,1,51.0,,7.0,,0,1,4,1,2,8,4,7,4,8,0.0,0,0,2,100.0,100.0,1,0,199744,31078,2300,32206,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,True,1
9,,,10000,10000,10000.0,60 months,19.92,264.50,D,D3,material handler,10+ years,MORTGAGE,80000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,648xx,MO,20.67,0,Apr-2005,0,37.0,,11,0,13198,73.7%,19,w,9596.08,9596.08,1041.400000,1041.40,403.92,637.48,0.0,0.0,0.0,Apr-2019,264.50,May-2019,Apr-2019,0,37.0,1,Individual,,,,0,116,110120,0,2,0,1,15.0,24408,70.0,1,1,6521,72.0,17900,0,0,1,2,10011.0,779.0,90.3,0,0,148.0,164,11,11,1,149.0,37.0,11.0,37.0,0,2,8,2,4,6,8,12,8,11,0.0,0,0,1,84.2,50.0,0,0,170230,37606,8000,55030,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,1


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
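
As a starting point for the first two LendingClub options, here is a rough sketch on made-up data. The helper names `percent_to_float` and `keep_top_n` are hypothetical, not part of pandas, and the sample values are invented; adapt them to the real columns.

```python
import pandas as pd


def percent_to_float(series):
    """Strip a trailing '%' and convert to float; missing values stay NaN."""
    return pd.to_numeric(series.str.strip().str.rstrip('%'), errors='coerce')


def keep_top_n(series, n, other='Other'):
    """Keep the n most frequent values and replace everything else."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other)


# Invented sample of a percent column with a missing entry.
util = pd.Series(['38%', None, '19.2%'], dtype=object)
util_float = percent_to_float(util)

# Invented sample of job titles; n=2 keeps the example small.
titles = pd.Series(['teacher', 'manager', 'teacher', 'chef', 'manager', 'teacher'])
top2 = keep_top_n(titles, n=2)
print(util_float, top2, sep='\n')
```

Note that `keep_top_n` also maps missing values to `'Other'`, since `value_counts` ignores them by default; fill or flag missing titles first if they should stay distinct.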

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01