<a href="https://colab.research.google.com/github/macscheffer/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Data_Wrangling%2C_Make_Features_Mac_Scheffer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip

--2019-01-17 18:54:40--  https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q3.csv.zip’

LoanStats_2018Q3.cs     [         <=>        ]  21.42M  1.68MB/s    in 13s     

2019-01-17 18:54:53 (1.66 MB/s) - ‘LoanStats_2018Q3.csv.zip’ saved [22461905]



In [0]:
!unzip LoanStats_2018Q3.csv.zip

Archive:  LoanStats_2018Q3.csv.zip
  inflating: LoanStats_2018Q3.csv    


In [0]:
!head LoanStats_2018Q3.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [0]:
!tail LoanStats_2018Q3.csv

"","","12000","12000","12000"," 36 months"," 14.03%","410.31","C","C2","Medical Support Staff","10+ years","OWN","53414","Source Verified","Jul-2018","Current","n","","","home_improvement","Home improvement","136xx","NY","25.68","0","Mar-1997","0","","","8","0","6527","44.4%","20","w","10618.01","10618.01","2037.52","2037.52","1381.99","655.53","0.0","0.0","0.0","Dec-2018","410.31","Jan-2019","Dec-2018","0","","1","Individual","","","","0","0","29869","0","2","0","2","13","23342","73","0","0","3977","64","14700","1","1","0","2","3734","7173","47.6","0","0","139","255","42","13","0","66","","18","","0","2","2","5","9","8","6","12","2","8","0","0","0","0","100","40","0","0","46792","29869","13700","32092","","","","","","","","","","","","N","","","","","","","","","","","","","","","Cash","N","","","","","",""
"","","5000","5000","5000"," 36 months"," 16.46%","176.93","C","C5","Labor Worker","3 years","MORTGAGE","57000","Not Verified","Jul-2018","Current","n","","","debt_consolidation",

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd

# skipfooter is only supported by the Python parser, so request it explicitly
df = pd.read_csv('LoanStats_2018Q3.csv', skiprows=1, skipfooter=2,
                 engine='python')

df.shape

(128194, 145)

In [0]:
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

In [0]:
df['hardship_payoff_balance_amount'].isnull().sum() / len(df)

0.999898591197716
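
The same null check can be run over every column at once. The sketch below uses a small made-up frame in place of the LendingClub data, and the 0.5 cutoff is an arbitrary assumption rather than anything the lesson prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the LendingClub data (values are made up).
toy = pd.DataFrame({
    'loan_amnt': [1000, 2000, 3000, 4000],
    'emp_title': ['Teacher', None, 'Owner', None],
    'hardship_amount': [np.nan, np.nan, np.nan, 50.0],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True.
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)

# Hypothetical cutoff: keep columns that are at most half empty.
mostly_present = null_fraction[null_fraction <= 0.5].index.tolist()
print(mostly_present)
```

Sorting the fractions makes nearly empty columns such as `hardship_payoff_balance_amount` easy to spot.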

## Work with strings

For machine learning, we usually want to replace strings with numbers.

In [0]:
import numpy as np

def all_numeric(df):
    # is_numeric_dtype is True for integer, float, and boolean columns
    return all(pd.api.types.is_numeric_dtype(dtype)
               for dtype in df.dtypes)

def no_nulls(df):
    return not any(df.isnull().sum())

def ready_for_sklearn(df):
    return all_numeric(df) and no_nulls(df)

In [0]:
ready_for_sklearn(df)

False

In [0]:
all_numeric(df)

False
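
When these checks return `False`, it helps to see which columns are responsible. This is a minimal sketch on a made-up frame; the column names are borrowed from the dataset, but the values and the `is_joint` column are illustrative assumptions.

```python
import pandas as pd

# Toy frame: one numeric column, one string column, one boolean column.
toy = pd.DataFrame({
    'loan_amnt': [1000, 2000],
    'int_rate': [' 17.97%', ' 13.56%'],
    'is_joint': [False, True],
})

# Columns that are neither numeric nor boolean still need converting.
non_numeric_cols = toy.select_dtypes(exclude=['number', 'bool']).columns.tolist()

# Columns with at least one missing value still need filling or dropping.
cols_with_nulls = toy.columns[toy.isnull().any()].tolist()

print(non_numeric_cols)
print(cols_with_nulls)
```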

We can get info about which columns have a datatype of "object" (strings)

In [0]:
df.select_dtypes('object').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128194 entries, 0 to 128193
Data columns (total 37 columns):
term                         128194 non-null object
int_rate                     128194 non-null object
grade                        128194 non-null object
sub_grade                    128194 non-null object
emp_title                    114757 non-null object
emp_length                   117807 non-null object
home_ownership               128194 non-null object
verification_status          128194 non-null object
issue_d                      128194 non-null object
loan_status                  128194 non-null object
pymnt_plan                   128194 non-null object
purpose                      128194 non-null object
title                        128194 non-null object
zip_code                     128194 non-null object
addr_state                   128194 non-null object
earliest_cr_line             128194 non-null object
revol_util                   128065 non-null object
initi

### Convert `int_rate`

In [0]:
df.int_rate.head().values

array([' 17.97%', ' 13.56%', ' 18.94%', '  7.84%', '  7.84%'],
      dtype=object)

Define a function to remove percent signs from strings and convert to floats

In [0]:
def remove_percent(string):
    return float(string.strip('%'))/100

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_percent)

In [0]:
df['int_rate'].head()

0    0.1797
1    0.1356
2    0.1894
3    0.0784
4    0.0784
Name: int_rate, dtype: float64
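
The same conversion can also be written with vectorized string methods, which pass missing values through as `NaN` instead of raising. A sketch on made-up values, assuming the column holds strings like those shown above:

```python
import numpy as np
import pandas as pd

# Made-up sample, including a missing value.
rates = pd.Series([' 17.97%', '  7.84%', np.nan])

# Strip whitespace and the trailing percent sign, then convert to a fraction.
as_float = pd.to_numeric(rates.str.strip().str.rstrip('%')) / 100
print(as_float)
```

Because the `.str` methods skip missing entries, this form may be a convenient starting point for columns that contain nulls.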

### Clean `emp_title`

Look at top 20 titles

In [0]:
df['emp_title'].value_counts().head(20)

Teacher               2294
Manager               2075
Owner                 1231
Driver                1089
Registered Nurse       944
Supervisor             810
RN                     757
Sales                  726
Project Manager        637
General Manager        548
Office Manager         542
Director               482
owner                  398
Engineer               383
Truck Driver           367
Operations Manager     366
President              350
Sales Manager          323
Supervisor             321
Server                 319
Name: emp_title, dtype: int64

In [0]:
df['emp_title'].value_counts().head(20).index

Index(['Teacher', 'Manager', 'Owner', 'Driver', 'Registered Nurse',
       'Supervisor', 'RN', 'Sales', 'Project Manager', 'General Manager',
       'Office Manager', 'Director', 'owner', 'Engineer', 'Truck Driver',
       'Operations Manager', 'President', 'Sales Manager', 'Supervisor ',
       'Server'],
      dtype='object')

How often is `emp_title` null?

In [0]:
df['emp_title'].isnull().sum()

13437

Clean the title and handle missing values

In [0]:
examples = ['owner', 'Supervisor ', 
            ' Project manager', np.nan]

def clean_title(x):
    if isinstance(x, str):        
        return x.strip().title()
    else:
        return 'Unknown'

for example in examples:
    print(clean_title(example))

Owner
Supervisor
Project Manager
Unknown


In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [0]:
df['emp_title'].value_counts().head(20)

Unknown                     13437
Teacher                      2843
Manager                      2749
Owner                        1856
Driver                       1498
Registered Nurse             1386
Supervisor                   1345
Sales                         980
Truck Driver                  921
Rn                            905
Office Manager                846
Project Manager               835
General Manager               809
Director                      585
Operations Manager            516
Sales Manager                 510
Engineer                      474
Administrative Assistant      466
Store Manager                 466
President                     464
Name: emp_title, dtype: int64
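
For comparison, the same cleaning can be sketched with pandas string methods instead of `apply`. The sample values mirror the `examples` list above.

```python
import numpy as np
import pandas as pd

titles = pd.Series(['owner', 'Supervisor ', ' Project manager', np.nan])

# .str methods leave NaN untouched, so fill missing titles as the last step.
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.tolist())
```

Either way, `title()` also lowercases acronyms, which is why `RN` appears as `Rn` in the counts above.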

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
# Boolean column, True if `emp_title` contains 'Manager',
# otherwise False

In [0]:
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')

In [0]:
df['emp_title_manager'].value_counts()

False    109498
True      18696
Name: emp_title_manager, dtype: int64
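
`str.contains` also accepts `case` and `na` arguments, which can be useful when a column has not been cleaned first. A sketch on made-up titles:

```python
import numpy as np
import pandas as pd

titles = pd.Series(['Project Manager', 'office manager', 'Teacher', np.nan])

# case=False ignores capitalization; na=False treats missing titles as no match.
is_manager = titles.str.contains('manager', case=False, na=False)
print(is_manager.tolist())
```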

In [0]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager
0,,,20000,20000,20000,60 months,0.1797,507.55,D,D1,Motor Vehicle Operator,7 years,RENT,68000.0,Source Verified,Sep-2018,Current,n,,,debt_consolidation,Debt consolidation,254xx,WV,15.8,0,Feb-2009,1,30.0,118.0,11,1,13483,69.1%,26,w,19580.78,19580.78,975.17,975.17,419.22,555.95,0.0,0.0,0.0,Dec-2018,507.55,Jan-2019,Dec-2018,0,,1,Individual,,,,0,0,194908,0,2,0,2,17.0,45299,93.0,0,1,9815,86.0,19500,1,4,2,5,19491.0,2185.0,81.8,0,0,115.0,94,13,7,4,25.0,,2.0,30.0,0,1,3,2,4,11,8,10,3,11,0.0,0,0,1,92.3,50.0,1,0,205625,58782,12000,48733,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False
1,,,25000,25000,25000,60 months,0.1356,576.02,C,C1,Firefighter,7 years,MORTGAGE,61000.0,Verified,Sep-2018,Current,n,,,major_purchase,Major purchase,770xx,TX,16.74,0,Jan-2009,0,,,5,0,6299,48.5%,14,w,24109.45,24109.45,1567.98,1567.98,890.55,677.43,0.0,0.0,0.0,Dec-2018,576.02,Jan-2019,Dec-2018,0,,1,Individual,,,,0,0,204007,1,2,0,1,19.0,13862,37.0,0,0,0,40.0,13000,0,7,4,2,40801.0,1500.0,0.0,0,0,54.0,116,26,3,1,74.0,,4.0,,0,0,1,1,2,5,2,8,1,5,0.0,0,0,1,100.0,0.0,0,0,234573,20161,1500,37273,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False
2,,,30000,30000,30000,36 months,0.1894,1098.78,D,D2,Unknown,< 1 year,RENT,100000.0,Source Verified,Sep-2018,Current,n,,,debt_consolidation,Debt consolidation,300xx,GA,16.07,0,Mar-2008,1,,114.0,6,1,14574,70.1%,9,w,28739.57,28739.57,2134.43,2134.43,1260.43,874.0,0.0,0.0,0.0,Dec-2018,1098.78,Jan-2019,Dec-2018,0,,1,Individual,,,,0,0,44048,1,2,1,1,1.0,29474,69.0,1,2,8521,69.0,20800,1,0,2,3,8810.0,5226.0,73.6,0,0,126.0,104,11,1,0,17.0,,1.0,,0,2,2,2,2,4,4,5,2,6,0.0,0,0,2,100.0,0.0,1,0,63636,44048,19800,42836,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False
3,,,6000,6000,6000,36 months,0.0784,187.58,A,A4,Unknown,,RENT,30000.0,Not Verified,Sep-2018,Current,n,,,debt_consolidation,Debt consolidation,923xx,CA,5.44,0,Apr-2000,0,,104.0,8,1,5936,34.5%,11,w,5702.27,5702.27,369.93,369.93,297.73,72.2,0.0,0.0,0.0,Dec-2018,187.58,Jan-2019,Dec-2018,0,,1,Individual,,,,0,350,5936,0,0,0,1,23.0,0,,1,4,2913,35.0,17200,2,0,0,5,848.0,7698.0,35.9,0,0,139.0,221,7,7,0,7.0,,18.0,,0,2,3,4,4,2,8,9,3,8,0.0,0,0,1,100.0,33.3,1,0,17200,5936,12000,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,False
4,,,10650,10650,10650,36 months,0.0784,332.95,A,A4,Unknown,,RENT,28000.0,Verified,Sep-2018,Current,n,,,medical,Medical expenses,430xx,OH,16.89,0,Nov-2002,0,,,3,0,37,0.3%,3,w,10121.54,10121.54,656.62,656.62,528.46,128.16,0.0,0.0,0.0,Dec-2018,332.95,Jan-2019,Dec-2018,0,,1,Joint App,43000.0,14.01,Source Verified,0,0,18254,0,1,0,1,16.0,18217,81.0,0,0,0,54.0,11500,1,1,0,1,6085.0,,,0,0,16.0,190,113,16,0,,,16.0,,0,0,1,0,0,1,2,2,2,3,0.0,0,0,0,100.0,,0,0,33876,18254,0,22376,2024.0,Oct-1996,0.0,0.0,8.0,18.7,1.0,9.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False


In [0]:
df.shape

(128194, 146)

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
df['issue_d'].head().values

array(['Sep-2018', 'Sep-2018', 'Sep-2018', 'Sep-2018', 'Sep-2018'],
      dtype=object)

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'],
                               format='%b-%Y')  # e.g. 'Sep-2018'

In [0]:
df['issue_d'].head().values

array(['2018-09-01T00:00:00.000000000', '2018-09-01T00:00:00.000000000',
       '2018-09-01T00:00:00.000000000', '2018-09-01T00:00:00.000000000',
       '2018-09-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [0]:
df['issue_d'].describe()

count                  128194
unique                      3
top       2018-08-01 00:00:00
freq                    46079
first     2018-07-01 00:00:00
last      2018-09-01 00:00:00
Name: issue_d, dtype: object

In [0]:
df['issue_year']  = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

In [0]:
df['issue_month'].sample(n=10).values

array([8, 8, 9, 9, 9, 7, 8, 9, 8, 8])
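
A short sketch of parsing month-year strings with an explicit format and reading components through the `.dt` accessor. The sample strings and the use of `errors='coerce'` are assumptions for illustration. Note that `%b` matches English month abbreviations only under an English locale.

```python
import pandas as pd

raw = pd.Series(['Sep-2018', 'Jul-2018', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format='%b-%Y', errors='coerce')

# Date components are available through the .dt accessor.
quarters = parsed.dt.quarter
print(parsed)
print(quarters)
```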

In [0]:
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'],
                                        format='%b-%Y')  # e.g. 'Feb-2009'

In [0]:
df['days_from_earliest_credit_to_issue'] = (
    df['issue_d'] - df['earliest_cr_line']).dt.days

In [0]:
df['days_from_earliest_credit_to_issue'].head().values

array([3499, 3530, 3836, 6727, 5783])
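
Subtracting two datetime columns yields a timedelta column, and `.dt.days` reads off whole days. The sketch below reproduces the first row above (issued Sep-2018, earliest credit line Feb-2009). The conversion to years with a 365.25-day divisor is an added approximation, not part of the lesson.

```python
import pandas as pd

issue = pd.to_datetime(pd.Series(['Sep-2018']), format='%b-%Y')
earliest = pd.to_datetime(pd.Series(['Feb-2009']), format='%b-%Y')

# datetime minus datetime gives a timedelta64 Series.
delta = issue - earliest
days = delta.dt.days

# Rough length of credit history in years (approximation).
years = days / 365.25
print(days[0], round(years[0], 1))
```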

In [0]:
[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']

In [0]:
for col in ['last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
    # All three columns use the same 'Dec-2018' style as issue_d
    df[col] = pd.to_datetime(df[col], format='%b-%Y')
    

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid"; otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.



In [0]:
df.columns

Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'debt_settlement_flag_date', 'settlement_status', 'settlement_date',
       'settlement_amount', 'settlement_percentage', 'settlement_term',
       'emp_title_manager', 'issue_year', 'issue_month',
       'days_from_earliest_credit_to_issue'],
      dtype='object', length=149)

In [0]:
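# Drop the literal 'months' suffix and surrounding spaces, then cast to int
df['term'] = (df['term'].str.replace('months', '', regex=False)
                        .str.strip()
                        .astype(int))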

In [0]:
df['term'].sample(n=5)

106685    36
20865     36
52541     36
125179    36
51693     36
Name: term, dtype: int64
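
An alternative sketch for the same conversion pulls the digits out with a regular expression, which does not depend on the exact wording or spacing of the suffix. The sample values follow the `' 36 months'` format seen in the raw file.

```python
import pandas as pd

terms = pd.Series([' 36 months', ' 60 months'])

# Capture the first run of digits; expand=False returns a Series.
months = terms.str.extract(r'(\d+)', expand=False).astype(int)
print(months.tolist())
```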

In [0]:
df.loan_status.unique()

array(['Current', 'Fully Paid', 'In Grace Period', 'Late (31-120 days)',
       'Late (16-30 days)', 'Charged Off'], dtype=object)

In [0]:
def loanstatus(status):
    # 1 for loans in good standing, 0 for everything else
    return 1 if status in ('Current', 'Fully Paid') else 0

df['loan_status_is_great'] = df['loan_status'].apply(loanstatus)

In [0]:
df.loan_status_is_great.sample(25)
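
The same flag can be sketched without a helper function by combining `isin` with a cast to integer. The sample statuses are taken from the `unique()` output above.

```python
import pandas as pd

status = pd.Series(['Current', 'Fully Paid', 'Late (16-30 days)', 'Charged Off'])

# isin gives a boolean Series; casting to int maps True/False to 1/0.
is_great = status.isin(['Current', 'Fully Paid']).astype(int)
print(is_great.tolist())
```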

In [0]:
# Make last_pymnt_d_month and last_pymnt_d_year columns.

df['last_pymnt_d_month'] = df.last_pymnt_d.dt.month
df['last_pymnt_d_year'] = df.last_pymnt_d.dt.year
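
If a date column contains missing values, the extracted month and year come back as floats with `NaN`. The sketch below assumes such a gap (the sample values are made up) and shows one option, the nullable `Int64` dtype, for keeping whole numbers.

```python
import pandas as pd

last_pymnt = pd.to_datetime(pd.Series(['Dec-2018', None]), format='%b-%Y')

# With a NaT present, the month comes back as a float with NaN.
month = last_pymnt.dt.month

# The nullable integer dtype keeps whole numbers alongside missing values.
month_nullable = month.astype('Int64')
print(month_nullable)
```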

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Process the dataframe so that `ready_for_sklearn(df)` returns `True`. You can drop columns, or select the subset of numeric columns with no missing values. (Or you can try automating the process to handle missing values and convert objects to numbers!)
- Take initiative and work on your own ideas!
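
One possible approach to the top-titles option, sketched on made-up data with `top_n = 2` standing in for 20. The variable names are illustrative.

```python
import pandas as pd

titles = pd.Series(['Teacher'] * 3 + ['Manager'] * 2 + ['Owner', 'Driver'])

top_n = 2  # the stretch goal uses 20
top_titles = titles.value_counts().head(top_n).index

# where() keeps values for which the condition holds and replaces the rest.
collapsed = titles.where(titles.isin(top_titles), 'Other')
print(collapsed.value_counts())
```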

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

In [0]:
df.grade.head()

0    D
1    C
2    D
3    A
4    A
Name: grade, dtype: object

In [0]:
df.sub_grade.head()

0    D1
1    C1
2    D2
3    A4
4    A4
Name: sub_grade, dtype: object

In [0]:
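# Concatenate element-wise; str(df.grade) would stringify the entire Series.
# Note that sub_grade already starts with the grade letter.
df['lc_grade'] = df['grade'] + df['sub_grade']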

In [0]:
df = df.drop('lc_grade',axis='columns')

In [0]:
df.pivot_table(index='sub_grade', values='int_rate').plot.bar();

In [0]:
df.pivot_table(index='sub_grade', values='annual_inc').plot.bar();

In [0]:
df.pivot_table(index='sub_grade', values='annual_inc_joint').plot.bar();
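
With no `aggfunc` given, `pivot_table` averages the values within each index group, so the bars above show the mean per sub-grade. A small sketch on made-up numbers, checked against the equivalent `groupby`:

```python
import pandas as pd

toy = pd.DataFrame({
    'sub_grade': ['A1', 'A1', 'B2'],
    'int_rate': [0.06, 0.08, 0.12],
})

# The default aggregation for pivot_table is the mean.
by_pivot = toy.pivot_table(index='sub_grade', values='int_rate')

# The same result expressed as a groupby.
by_group = toy.groupby('sub_grade')['int_rate'].mean()
print(by_pivot)
```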

In [0]:
# Comparing the annual_inc and annual_inc_joint plots: when the two incomes
# are close, the loan usually falls in the lowest sub_grade

df['annual_inc_joint_minus_single'] = df.annual_inc_joint - df.annual_inc

In [0]:
df.annual_inc_joint_minus_single.corr(df.int_rate)

-0.09285012438557147