
_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip

In [0]:
!unzip LoanStats_2018Q3.csv.zip

In [0]:
!head LoanStats_2018Q3.csv

In [0]:
!tail LoanStats_2018Q3.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd

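# Skip the first line and the last two lines of the file, which are not loan records.
# skipfooter is only supported by the Python parser engine, so request it explicitly.
df = pd.read_csv('LoanStats_2018Q3.csv', skiprows=1, skipfooter=2, engine='python')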

df.shape

df.head()
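
To see what `skiprows` and `skipfooter` do without downloading anything, here is a minimal sketch on a made-up CSV string. The file contents and the name `toy` are assumptions for illustration and do not reproduce the real LendingClub file.

```python
import io
import pandas as pd

# Hypothetical file: one note line on top, two summary lines at the bottom
raw = (
    "Some note that is not part of the table\n"
    "loan_amnt,int_rate\n"
    "1000,10.5%\n"
    "2500,7.2%\n"
    "Summary line one\n"
    "Summary line two\n"
)

# skiprows drops lines from the top, skipfooter drops lines from the bottom.
# skipfooter needs the Python parser engine.
toy = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine='python')
print(toy)
```

Only the header row and the two data rows should survive.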

In [0]:
pd.options.display.max_columns = 500

In [0]:
df.head().T

## Work with strings

In [0]:
pd.options.display.max_rows = 500

In [0]:
df.head().T

For machine learning, we usually want to replace strings with numbers.

In [0]:
import numpy as np

def all_numeric(df):
    # is_numeric_dtype is True for integer, float, and boolean columns
    return all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)

def no_nulls(df):
    return not any(df.isnull().sum())

def ready_for_sklearn(df):
    return all_numeric(df) and no_nulls(df)
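
A quick way to get a feel for these checks is to run them on a tiny frame. The sketch below is self-contained, so it redefines the helpers (using `pandas.api.types.is_numeric_dtype` for the dtype test, which is one possible implementation) and uses made-up data in a hypothetical `toy` frame.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def all_numeric(df):
    # True when every column is integer, float, or boolean
    return all(is_numeric_dtype(dtype) for dtype in df.dtypes)

def no_nulls(df):
    # True when no cell in the frame is missing
    return not df.isnull().any().any()

def ready_for_sklearn(df):
    return all_numeric(df) and no_nulls(df)

# Hypothetical toy data: one string column and one missing value
toy = pd.DataFrame({'loan_amnt': [1000, 2500, 4000],
                    'annual_inc': [50000.0, np.nan, 72000.0],
                    'grade': ['A', 'B', 'C']})

before = ready_for_sklearn(toy)

# Simplest possible fix: keep numeric columns, drop rows with missing values
numeric_only = toy.select_dtypes('number').dropna()
after = ready_for_sklearn(numeric_only)

print(before, after)
```

Dropping columns and rows throws information away, so this is a baseline rather than a recommendation.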

We can get info about which columns have a datatype of "object" (strings).

In [0]:
df.select_dtypes('object').info()

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
string = '17.97%'

float(string.replace('%',''))

def remove_percent(string):
  return float(string.strip('%'))

remove_percent(string)

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_percent)

In [0]:
df.int_rate.head()
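
The same conversion can also be written with pandas' vectorized string methods instead of `apply`. This is a sketch on made-up values (the `rates` Series is an assumption, not the real column). One reason to consider it is that missing values pass through as `NaN` instead of raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including stray whitespace and a missing entry
rates = pd.Series(['17.97%', ' 6.11%', np.nan])

# Strip whitespace, drop the trailing percent sign, then convert to float
rates_float = rates.str.strip().str.rstrip('%').astype(float)
print(rates_float)
```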

### Clean `emp_title`

Look at top 20 titles

In [0]:
df.select_dtypes('object').info()

df['emp_title'].value_counts().head(20)

How often is `emp_title` null?

In [0]:
df['emp_title'].isnull().sum()

Clean the title and handle missing values

In [0]:
examples = ['owner','Supervisor ', ' Project Manager',42,np.nan]

def clean_title(x):
  if isinstance(x, str):
    return x.strip().title()
  else:
    return 'Unknown'

for example in examples:
  print(clean_title(example))

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [0]:
df['emp_title'].value_counts().head(20)
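
For comparison, here is a sketch of the same cleanup using the `.str` accessor. It relies on the fact that, for an object column with mixed types, the string methods return `NaN` for non-string entries, so a single `fillna` covers both numbers and missing values. The `titles` Series is made-up example data.

```python
import numpy as np
import pandas as pd

# Hypothetical examples: messy strings, a number, and a missing value
titles = pd.Series(['owner', 'Supervisor ', ' Project Manager', 42, np.nan])

# Non-string entries become NaN under .str, then fillna labels them
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.tolist())
```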

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')


In [0]:
df['emp_title_manager'].value_counts()
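
`str.contains` is case-sensitive by default. The sketch below, on made-up titles, shows the `case` and `na` parameters, which may be useful if the column has not been title-cased or still holds missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical titles with mixed casing and a missing value
titles = pd.Series(['Project Manager', 'general manager', 'Teacher', np.nan])

# Case-sensitive match; na=False treats missing titles as "not a manager"
exact = titles.str.contains('Manager', na=False)

# Case-insensitive match also catches 'general manager'
loose = titles.str.contains('manager', case=False, na=False)

print(exact.tolist(), loose.tolist())
```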

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [42]:
df['issue_d'].head().values

array(['Sep-2018', 'Sep-2018', 'Sep-2018', 'Sep-2018', 'Sep-2018'],
      dtype=object)

In [0]:
# Values look like 'Sep-2018', so state the format explicitly
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')

In [0]:
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

In [50]:
df['issue_month'].sample(10)

123904    7
19709     9
81971     8
87641     7
36225     9
98781     7
101620    7
120075    7
50717     8
116837    7
Name: issue_month, dtype: int64

In [0]:
df.head(1)

df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%Y')
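
Once two columns are datetimes, they can be combined into new features. The following is a sketch on a made-up two-row frame (the `toy` frame and the new column names are assumptions). It derives the issue quarter and the length of each borrower's credit history.

```python
import pandas as pd

# Hypothetical rows using the same 'Mon-YYYY' text format as the dataset
toy = pd.DataFrame({'issue_d': ['Sep-2018', 'Jul-2018'],
                    'earliest_cr_line': ['Jan-2005', 'Sep-1985']})

for col in ['issue_d', 'earliest_cr_line']:
    toy[col] = pd.to_datetime(toy[col], format='%b-%Y')

# Components come from the .dt accessor
toy['issue_quarter'] = toy['issue_d'].dt.quarter

# Subtracting datetimes gives timedeltas, which also have a .dt accessor
toy['credit_history_days'] = (toy['issue_d'] - toy['earliest_cr_line']).dt.days
toy['credit_history_years'] = toy['issue_d'].dt.year - toy['earliest_cr_line'].dt.year

print(toy)
```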

In [0]:
date_col = [col for col in df if col.endswith('_d')]



In [0]:
for col in date_col:
  # Values look like 'Sep-2018' (abbreviated month, then year)
  df[col] = pd.to_datetime(df[col], format='%b-%Y')

In [59]:
df.sample(5)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month
43201,,,36000,36000,36000,60 months,13.56,829.47,C,C1,Hearing Instrument Specialist,1 year,MORTGAGE,74500.0,Source Verified,2018-08-01,Current,n,,,credit_card,Credit card refinancing,530xx,WI,15.85,0,2005-01-01,1,48.0,,11,0,19657,65.1%,26,w,34717.61,34717.61,2461.29,2461.29,1282.39,1178.9,0.0,0.0,0.0,2018-12-01,829.47,2019-01-01,2018-12-01,0,48.0,1,Joint App,202500.0,8.36,Source Verified,0,0,42912,0,1,0,1,17.0,23017,80.0,0,0,6657,73.0,30200,0,10,3,2,3901.0,8335.0,65.6,0,0,160.0,163,32,17,3,32.0,,0.0,48.0,2,6,8,6,6,9,9,12,8,11,0.0,0,0,0,84.6,16.7,0,0,59215,42912,24200,28777,36449.0,Sep-2004,0.0,5.0,7.0,76.1,1.0,11.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,8
49304,,,5900,5900,5900,36 months,27.27,241.73,E,E5,Tax Repairer,10+ years,OWN,98200.0,Not Verified,2018-08-01,Current,n,,,home_improvement,Home improvement,925xx,CA,27.07,0,2004-05-01,0,77.0,82.0,10,1,6593,37%,14,f,5454.49,5454.49,949.04,949.04,445.51,503.53,0.0,0.0,0.0,2018-12-01,241.73,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,60668,1,3,1,1,1.0,54075,25.0,1,2,1853,28.0,17800,2,0,3,3,6741.0,6098.0,44.6,0,0,113.0,80,12,1,1,12.0,,1.0,,0,4,5,5,5,5,7,8,5,10,0.0,0,0,2,92.3,25.0,1,0,107064,60668,11000,89264,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,8
95252,,,25000,25000,25000,60 months,15.02,595.02,C,C3,Juvenile Probation Officer,3 years,MORTGAGE,72000.0,Verified,2018-07-01,Current,n,,,debt_consolidation,Debt consolidation,765xx,TX,33.23,0,2005-04-01,0,,87.0,10,1,22898,79%,25,w,22917.07,22917.07,3462.62,3462.62,2082.93,1379.69,0.0,0.0,0.0,2018-12-01,595.02,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,155053,2,3,2,3,6.0,50889,64.0,2,3,587,68.0,29000,0,2,1,7,15505.0,249.0,75.1,0,0,67.0,159,0,0,1,50.0,,12.0,,0,2,5,2,4,13,6,11,5,10,0.0,0,0,4,100.0,50.0,1,0,194081,73787,1000,80081,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,7
104170,,,35000,35000,35000,60 months,12.73,791.53,B,B5,Unkown,,OWN,110559.0,Verified,2018-07-01,Current,n,,,debt_consolidation,Debt consolidation,891xx,NV,19.78,0,1985-09-01,0,,110.0,8,1,38031,55.4%,25,w,32853.75,32853.75,3908.14,3908.14,2146.25,1761.89,0.0,0.0,0.0,2018-12-01,791.53,2019-01-01,2018-12-01,0,,1,Individual,,,,0,0,470439,0,1,1,1,12.0,14386,72.0,2,2,16337,59.0,68700,1,2,3,5,58805.0,23986.0,59.8,0,0,172.0,394,9,9,2,9.0,,12.0,,0,4,5,5,9,10,6,13,5,8,0.0,0,0,5,100.0,40.0,1,0,512800,52417,59700,20000,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,7
28526,,,30800,30800,30800,60 months,17.97,781.62,D,D1,Principal,6 years,MORTGAGE,104000.0,Source Verified,2018-09-01,Current,n,,,debt_consolidation,Debt consolidation,752xx,TX,11.04,0,2002-09-01,0,,,16,0,38803,54.8%,35,w,29824.36,29824.36,2437.11,2437.11,975.64,1461.47,0.0,0.0,0.0,2018-12-01,781.62,2019-01-01,2018-12-01,0,,1,Individual,,,,0,9313,285280,1,3,0,3,16.0,58773,45.0,2,3,14450,52.0,70800,3,1,0,7,17830.0,17522.0,66.6,0,0,169.0,192,4,4,3,11.0,,13.0,,0,7,10,7,11,11,12,21,10,16,0.0,0,0,3,100.0,42.9,0,0,330032,97576,52500,69232,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,False,2018,9


# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.



In [65]:
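def conv_term(term):
  # '36 months' -> 36: the first whitespace-separated token is the number.
  # (str.strip('months') removes a set of characters, not the word itself.)
  return int(term.split()[0])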

example = df['term'][0]

example
conv_term(example)

df['term'] = df['term'].apply(conv_term)

df['term'].sample(5)


55955     36
4391      36
114242    36
65142     36
51430     36
Name: term, dtype: int64
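
A vectorized alternative is to pull the digits out with a regular expression. This is a sketch on made-up values. The leading space in the sample strings is an assumption about how the raw text may look, and the pattern tolerates it either way.

```python
import pandas as pd

# Hypothetical raw values, with and without a leading space
terms = pd.Series([' 36 months', '60 months'])

# Capture the first run of digits; expand=False returns a Series
term_months = terms.str.extract(r'(\d+)', expand=False).astype(int)
print(term_months.tolist())
```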

In [66]:
df['loan_status'].value_counts()

Current               121082
Fully Paid              4786
In Grace Period          948
Late (31-120 days)       920
Late (16-30 days)        348
Charged Off              110
Name: loan_status, dtype: int64

In [0]:
def loan_status_is_great(status):
  # 1 for loans that are current or fully paid, 0 for everything else
  great_status = ['Current', 'Fully Paid']
  if status in great_status:
    return 1
  else:
    return 0

df['loan_status_is_great'] = df['loan_status'].apply(loan_status_is_great)

In [0]:
df.loc[df['loan_status_is_great']==0,['loan_status_is_great','loan_status']].sample(20)
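
The same flag can be built without a helper function by using `isin`, which tests membership for the whole column at once. A minimal sketch on made-up statuses:

```python
import pandas as pd

# Hypothetical statuses drawn from the categories listed above
statuses = pd.Series(['Current', 'Fully Paid', 'Charged Off', 'Late (31-120 days)'])

# isin returns booleans; astype(int) turns True/False into 1/0
is_great = statuses.isin(['Current', 'Fully Paid']).astype(int)
print(is_great.tolist())
```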

In [85]:
df['last_pymnt_d'].sample(20)

49503    2018-12-01
115418   2018-12-01
46596    2018-11-01
127657   2018-12-01
73536    2018-12-01
2449     2018-12-01
83395    2018-12-01
112856   2018-12-01
23360    2018-12-01
17695    2018-12-01
86591    2018-12-01
80361    2018-12-01
48353    2018-11-01
91889    2018-12-01
52459    2018-12-01
22213    2018-12-01
74498    2018-12-01
107422   2018-12-01
12631    2018-11-01
77971    2018-12-01
Name: last_pymnt_d, dtype: datetime64[ns]

In [0]:
# Already datetime64 after the loop over date_col, so this call leaves the column unchanged
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'])

In [0]:
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month


In [89]:
df['last_pymnt_d_month'].value_counts()

12.0    116465
11.0      7793
10.0      1594
9.0       1136
8.0        838
7.0        222
Name: last_pymnt_d_month, dtype: int64
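
The months above print as `12.0` rather than `12`. This is most likely because some loans have no last payment date: a missing date yields `NaN` for the month, and `NaN` forces the column to float. The sketch below reproduces this on made-up dates and shows one possible way to keep integers, pandas' nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical dates with one missing value
dates = pd.to_datetime(pd.Series(['Dec-2018', 'Nov-2018', np.nan]), format='%b-%Y')

# The missing date becomes NaN, so the result is float64
month_float = dates.dt.month

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
month_int = dates.dt.month.astype('Int64')

print(month_float.dtype, month_int.dtype)
```

Whether to keep the gap, fill it, or drop those rows depends on what the downstream model can handle.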

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Process the dataframe so that `ready_for_sklearn(df)` returns `True`. You can drop columns, or select the subset of numeric columns with no missing values. (Or you can try automating the process to handle missing values and convert objects to numbers!)
- Take initiative and work on your own ideas!
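
As a starting point for the "top 20 titles" option, here is one possible approach sketched on a made-up Series. It uses `top_n = 2` so the effect is visible on a handful of rows. The names `titles`, `top_n`, and `reduced` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical cleaned titles
titles = pd.Series(['Teacher', 'Manager', 'Teacher', 'Owner',
                    'Manager', 'Teacher', 'Driver'])

top_n = 2  # the stretch goal asks for 20

# value_counts sorts by frequency, so head(top_n) gives the most common titles
top_titles = titles.value_counts().head(top_n).index

# Keep a title when it is in the top list, otherwise replace it with 'Other'
reduced = titles.where(titles.isin(top_titles), 'Other')
print(reduced.value_counts())
```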

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01