<a href="https://colab.research.google.com/github/llpk79/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/Paul_K_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
-  understand the purpose of feature engineering
-  work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
import numpy as np

In [314]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-04-25 23:06:15--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [               <=>  ]  21.37M   890KB/s    in 25s     

2019-04-25 23:06:40 (875 KB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22408410]



In [315]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [316]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [317]:
!tail LoanStats_2018Q4.csv

"","","5600","5600","5600"," 36 months"," 13.56%","190.21","C","C1","","n/a","RENT","15600","Not Verified","Oct-2018","Current","n","","","credit_card","Credit card refinancing","836xx","ID","15.31","0","Aug-2012","0","","97","9","1","5996","34.5%","11","w","4950.84","4950.84","940.5","940.50","649.16","291.34","0.0","0.0","0.0","Mar-2019","190.21","Apr-2019","Mar-2019","0","","1","Individual","","","","0","0","5996","0","0","0","1","20","0","","0","2","3017","35","17400","1","0","0","3","750","4689","45.5","0","0","20","73","13","13","0","13","","20","","0","3","5","4","4","1","9","10","5","9","0","0","0","0","100","25","1","0","17400","5996","8600","0","","","","","","","","","","","","N","","","","","","","","","","","","","","","Cash","N","","","","","",""
"","","23000","23000","23000"," 36 months"," 15.02%","797.53","C","C3","Tax Consultant","10+ years","MORTGAGE","75000","Source Verified","Oct-2018","Charged Off","n","","","debt_consolidation","Debt consolidation","352xx","AL","2

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd

In [359]:
df = pd.read_csv('LoanStats_2018Q4.csv', header=1)

  interactivity=interactivity, compiler=compiler, result=result)


In [0]:
df.head()

In [0]:
pd.options.display.max_rows = 500
pd.options.display.max_columns = 500


In [0]:
# View number of nan per column.
df.isnull().sum().sort_values(ascending=False)

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get info about which columns have a datatype of "object" (strings)

In [0]:
df.dtypes

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
def strip_percent(string):
  # return float value of percentage.
  return float(str(string).strip('%'))

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(strip_percent)

In [0]:
df['int_rate'].dtypes

But, I like regex. :)

In [0]:
# Look for numbers followed by '%' replace with number only.
df['int_rate'].replace(r'([0-9]+)%', r'\1', regex=True, inplace=True)

In [0]:
df['int_rate']

### Clean `emp_title`

Look at top 20 titles

In [0]:
# View job titles and counts including nan.
df['emp_title'].value_counts(dropna=False).head(20)

How often is `emp_title` null?

In [0]:
df['emp_title'].isnull().sum()

Clean the title and handle missing values

In [0]:
def make_title(title):
  if isinstance(title, str):
    # Remove any trailing whitespace and capitalize words.
    return title.strip().title()
  else: return 'Unknown'

In [0]:
df['emp_title'] = df['emp_title'].apply(make_title)

In [0]:
df['emp_title'].value_counts(dropna=False).head(20)

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
# Add a column indicating if emp is a manager.
df['is_manager'] = df['emp_title'].str.contains('Manager')

In [0]:
df['is_manager'].head()

In [0]:
managers = df[df['is_manager'] == True]
proles = df[df['is_manager'] == False]

In [0]:
print(managers['loan_amnt'].mean())
print(proles['loan_amnt'].mean())

In [0]:
print(managers['loan_amnt'].std())
print(proles['loan_amnt'].std())

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
columns = [column for column in df.columns if column[-2:] == '_d']

In [367]:
for column in columns:
  # Clean nans.
  df[column].replace(np.nan, 0, inplace=True)
  # Convert to datetime
  df[column] = pd.to_datetime(df[column], errors='raise')
  # Make sure no nan remains.
  print(column, df[column].isnull().sum())

issue_d 0
last_pymnt_d 0
next_pymnt_d 0
last_credit_pull_d 0


In [0]:
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

In [330]:
df['issue_year'].dtypes

dtype('int64')

In [0]:
df.head()

In [331]:
df['earliest_cr_line'].head()

0    Jan-1996
1    Apr-2001
2    Aug-2005
3    Oct-1999
4    Oct-2005
Name: earliest_cr_line, dtype: object

In [0]:
# Clean nan and convert to datetime.
df['earliest_cr_line'] = df['earliest_cr_line'].fillna(0)
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], infer_datetime_format=True, errors='ignore')

In [361]:
df['earliest_cr_line'].tail()

128409    Jun-2006
128410    Oct-2008
128411    Sep-2006
128412           0
128413           0
Name: earliest_cr_line, dtype: object

In [369]:
df['issue_d'].dtypes

dtype('<M8[ns]')

In [0]:
# Create column of days between first line of credit and loan issue date.
df['first_cr_to_issue'] = (df['issue_d'] - df['earliest_cr_line']).dt.days

In [0]:
df['first_cr_to_issue'].head()

In [0]:
df['loan_status_is_great'] = ((df['loan_status'] == 'Current') | (df['loan_status'] == 'Fully Paid'))
df['loan_status_is_great'] = df['loan_status_is_great'].apply(lambda x: 1 if x == True else 0)

In [0]:
df['loan_status_is_great'].describe()

In [0]:
df['loan_status_is_great'].head()

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
df['term'].replace(np.nan, 0, inplace=True)
df['term'].replace(r'([0-9]+).*', r'\1', regex=True, inplace=True)
df['term'] = df['term'].map(lambda x: int(x))

In [0]:
df['term'].head()

In [372]:
df['loan_status_is_great'] = ((df['loan_status'] == 'Current') | (df['loan_status'] == 'Fully Paid'))
df['loan_status_is_great'] = df['loan_status_is_great'].apply(lambda x: 1 if x == True else 0)
df['loan_status_is_great'].describe()

count    128414.000000
mean          0.985749
std           0.118523
min           0.000000
25%           1.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: loan_status_is_great, dtype: float64

In [0]:
df['loan_status_is_great'].head()

In [0]:
df['last_pymnt_d'].head()

In [0]:
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year

In [0]:
df = df.drop(columns='issue_yearss')

In [0]:
df.head()

In [0]:
grades = set(df['grade'].sort_values())

In [0]:
grade_ranks = {grade: rank for grade, rank in zip(grades, list(range(len(grades))))}

In [0]:
df['grade'] = df['grade'].map(grade_ranks)

In [0]:
own_ranks = {'OWN': 3, 'MORTGAGE': 2, 'RENT': 1, 'ANY': 0, np.nan: 0}

In [0]:
df['home_ownership'] = df['home_ownership'].map(own_ranks)

In [380]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,first_cr_to_issue,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,,,10000.0,10000.0,10000.0,36,10.33%,324.23,7,B1,,< 1 year,2,280000.0,Not Verified,2018-12-01,Current,n,,,debt_consolidation,Debt consolidation,974xx,OR,6.15,2.0,1996-01-01,0.0,18.0,,14.0,0.0,9082.0,38%,23.0,w,9521.66,9521.66,639.85,639.85,478.34,161.51,0.0,0.0,0.0,2019-02-01,324.23,2019-04-01,2019-03-01,0.0,,1.0,Individual,,,,0.0,671.0,246828.0,1.0,3.0,2.0,3.0,1.0,48552.0,62.0,1.0,3.0,4923.0,46.0,23900.0,2.0,7.0,1.0,7.0,17631.0,11897.0,43.1,0.0,0.0,158.0,275.0,11.0,1.0,1.0,11.0,,11.0,,0.0,3.0,4.0,7.0,7.0,10.0,9.0,11.0,4.0,14.0,0.0,0.0,0.0,4.0,91.3,28.6,0.0,0.0,367828.0,61364.0,20900.0,54912.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,,8370,1,2,2019
1,,,2500.0,2500.0,2500.0,36,13.56%,84.92,1,C1,Chef,10+ years,1,55000.0,Not Verified,2018-12-01,Current,n,,,debt_consolidation,Debt consolidation,109xx,NY,18.24,0.0,2001-04-01,1.0,,45.0,9.0,1.0,4341.0,10.3%,34.0,w,2386.02,2386.02,167.02,167.02,113.98,53.04,0.0,0.0,0.0,2019-02-01,84.92,2019-04-01,2019-03-01,0.0,,1.0,Individual,,,,0.0,0.0,16901.0,2.0,2.0,1.0,2.0,2.0,12560.0,69.0,2.0,7.0,2137.0,28.0,42000.0,1.0,11.0,2.0,9.0,1878.0,34360.0,5.9,0.0,0.0,140.0,212.0,1.0,1.0,0.0,1.0,,2.0,,0.0,2.0,5.0,3.0,3.0,16.0,7.0,18.0,5.0,9.0,0.0,0.0,0.0,3.0,100.0,0.0,1.0,0.0,60124.0,16901.0,36500.0,18124.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,6453,1,2,2019
2,,,12000.0,12000.0,12000.0,60,13.56%,276.49,1,C1,,< 1 year,2,40000.0,Not Verified,2018-12-01,Current,n,,,debt_consolidation,Debt consolidation,180xx,PA,19.23,0.0,2005-08-01,2.0,,,8.0,0.0,23195.0,55.5%,9.0,w,11716.63,11716.63,539.42,539.42,283.37,256.05,0.0,0.0,0.0,2019-02-01,276.49,2019-04-01,2019-03-01,0.0,,1.0,Individual,,,,0.0,0.0,32462.0,1.0,1.0,1.0,1.0,4.0,9267.0,103.0,2.0,2.0,9747.0,64.0,41800.0,1.0,0.0,3.0,3.0,4058.0,16601.0,54.9,0.0,0.0,4.0,149.0,9.0,4.0,1.0,10.0,,4.0,,0.0,5.0,6.0,5.0,5.0,1.0,7.0,7.0,6.0,8.0,0.0,0.0,0.0,3.0,100.0,60.0,0.0,0.0,50800.0,32462.0,36800.0,9000.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,4870,1,2,2019
3,,,15000.0,15000.0,14975.0,60,14.47%,352.69,1,C2,,,2,30000.0,Source Verified,2018-12-01,Current,n,,,debt_consolidation,Debt consolidation,756xx,TX,41.6,0.0,1999-10-01,0.0,,,20.0,0.0,20049.0,37.2%,30.0,f,14654.57,14630.15,687.29,686.14,345.43,341.86,0.0,0.0,0.0,2019-02-01,352.69,2019-04-01,2019-03-01,0.0,,1.0,Joint App,55000.0,34.95,Source Verified,0.0,0.0,65496.0,1.0,1.0,0.0,1.0,16.0,4458.0,31.0,1.0,1.0,3616.0,36.0,53900.0,0.0,2.0,0.0,2.0,3275.0,2545.0,74.3,0.0,0.0,149.0,230.0,6.0,6.0,1.0,31.0,,,,0.0,4.0,13.0,4.0,5.0,8.0,18.0,21.0,13.0,20.0,0.0,0.0,0.0,1.0,100.0,50.0,0.0,0.0,118790.0,24507.0,9900.0,14490.0,29704.0,Oct-1999,0.0,0.0,16.0,48.8,0.0,15.0,0.0,0.0,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,7001,1,2,2019
4,,,16000.0,16000.0,16000.0,60,17.97%,406.04,6,D1,Instructional Coordinator,5 years,2,51000.0,Not Verified,2018-12-01,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,21.91,0.0,2005-10-01,1.0,52.0,,11.0,0.0,7326.0,24.8%,27.0,w,15664.63,15664.63,788.12,788.12,335.37,452.75,0.0,0.0,0.0,2019-02-01,406.04,2019-04-01,2019-03-01,0.0,52.0,1.0,Individual,,,,0.0,0.0,39339.0,3.0,8.0,2.0,2.0,2.0,32013.0,83.0,1.0,1.0,3881.0,58.0,29500.0,0.0,0.0,1.0,3.0,3576.0,22174.0,24.8,0.0,0.0,123.0,158.0,6.0,2.0,0.0,6.0,52.0,4.0,52.0,6.0,2.0,2.0,3.0,6.0,21.0,3.0,6.0,2.0,11.0,0.0,0.0,0.0,3.0,77.8,0.0,0.0,0.0,67924.0,39339.0,29500.0,38424.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,,4809,1,2,2019


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiatve and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
%cd instacart_2017_05_01

In [0]:
df1 = pd.read_csv('orders.csv')

In [0]:
df1.shape

In [0]:
df1.isna().sum()

In [0]:
df1.head()

In [0]:
len(set(df1['user_id'].sort_values()))

In [0]:
# Replace nan in 'days_since_prior_order with -1 to aviod confusion with customers who ordered today'
df1.replace(np.nan, -1, inplace=True)