_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-03-28 21:38:35--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.1’

LoanStats_2018Q4.cs     [               <=>  ]  21.29M   894KB/s    in 25s     

2019-03-28 21:39:00 (877 KB/s) - ‘LoanStats_2018Q4.csv.zip.1’ saved [22329081]



In [2]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: LoanStats_2018Q4.csv    


In [3]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [4]:
import pandas as pd
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')
df.shape

(128412, 145)

In [0]:
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500


## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can summarize the columns that have a datatype of "object" (strings):

In [7]:
df.describe(exclude='number')

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint,sec_app_earliest_cr_line,hardship_flag,hardship_type,hardship_reason,hardship_status,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_loan_status,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date
count,128412,128412,128412,128412,107465,116708,128412,128412,128412,128412,128412,128412,128412,128412,128412,128412,128256,128412,128250,124941,128411,128412,14848,16782,128412,1,1,1,1,1,1,1,128412,128412,2,2,2
unique,2,46,7,35,43892,11,4,3,3,6,2,12,12,880,50,644,1074,2,5,3,7,2,3,573,2,1,1,1,1,1,1,1,2,2,1,1,1
top,36 months,13.56%,A,A4,Teacher,10+ years,MORTGAGE,Not Verified,Oct-2018,Current,n,debt_consolidation,Debt consolidation,112xx,CA,Aug-2006,0%,w,Feb-2019,Mar-2019,Feb-2019,Individual,Not Verified,Aug-2006,N,INTEREST ONLY-3 MONTHS DEFERRAL,UNEMPLOYMENT,ACTIVE,Feb-2019,Apr-2019,Feb-2019,Late (16-30 days),Cash,N,Feb-2019,ACTIVE,Feb-2019
freq,88179,6976,38011,9770,2090,38826,63490,58350,46305,123768,128411,70603,70603,1370,17879,1130,1132,114498,123797,124903,125061,111630,6360,155,128411,1,1,1,1,1,1,1,102516,128410,2,2,2


### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [8]:
'13.56%'.strip('%')

'13.56'

Apply the same operation to the whole `int_rate` column with the `.str` accessor

In [9]:
df['int_rate'].str.strip('%').astype(float).head()

0    13.56
1    18.94
2    17.97
3    18.94
4    16.14
Name: int_rate, dtype: float64

In [0]:
def remove_percent(string):
  return float(string.strip('%'))

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_percent)

### Clean `emp_title`

Look at top 20 titles

In [0]:
df['emp_title'].value_counts().head(20)

How often is `emp_title` null?

In [13]:
df['emp_title'].isnull().sum()

20947

Clean the title and handle missing values
- convert to title case
- strip spaces
- replace NaN with 'Missing'

In [0]:
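import numpy as np

examples = ['owner', 'Supervisor ',
            ' Project Manager', np.nan]

def clean_title(x):
    # Strings get whitespace stripped and are title-cased
    if isinstance(x, str):
        return x.strip().title()
    # Anything else (such as NaN) is treated as a missing title
    return 'Missing'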
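
The cleaning steps listed above can be tried on a small example before touching the real column. This is a minimal sketch: the `titles` Series is a made-up stand-in for `df['emp_title']`, and the function here labels missing values `'Missing'` as the bullet list asks.

```python
import numpy as np
import pandas as pd

def clean_title(x):
    # Strings are stripped and title-cased; anything else (e.g. NaN) becomes 'Missing'
    if isinstance(x, str):
        return x.strip().title()
    return 'Missing'

# Hypothetical stand-in for df['emp_title']
titles = pd.Series(['owner', 'Supervisor ', ' Project Manager', np.nan])

# apply calls clean_title once per element
cleaned = titles.apply(clean_title)
print(cleaned.tolist())
# ['Owner', 'Supervisor', 'Project Manager', 'Missing']
```

On the real dataframe the same call would presumably be `df['emp_title'] = df['emp_title'].apply(clean_title)`.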

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)
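
One way this feature might be built with `str.contains` is sketched below. The `titles` Series is a made-up stand-in for a cleaned `df['emp_title']`, and matching the substring "manager" regardless of case is an assumption about what the feature should capture.

```python
import pandas as pd

# Hypothetical stand-in for a cleaned df['emp_title']
titles = pd.Series(['Project Manager', 'Teacher', 'General manager', None])

# case=False ignores capitalization; na=False treats missing titles as "not a manager"
emp_title_manager = titles.str.contains('manager', case=False, na=False)

# Cast the boolean mask to 0/1 so the feature is numeric
emp_title_manager = emp_title_manager.astype(int)
print(emp_title_manager.tolist())
# [1, 0, 1, 0]
```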

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
df['issue_d'].head().values

In [0]:
df['issue_d'].describe()

In [0]:
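# The dates look like 'Oct-2018', so give the format explicitly
# (infer_datetime_format is deprecated in recent pandas versions)
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')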

In [0]:
df['issue_d'].head().values

In [0]:
df['issue_d'].describe()

In [0]:
# Same 'Mon-YYYY' layout as issue_d, e.g. 'Aug-2006'
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'],
                                        format='%b-%Y')

In [0]:
df['earliest_cr_line'].head()

In [0]:
df['issue_d'] - df['earliest_cr_line']
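
Subtracting two datetime columns gives a timedelta column, which most models cannot use directly. A minimal sketch of turning it into a plain number of days follows; the two small Series are made-up stand-ins for `df['issue_d']` and `df['earliest_cr_line']`, and the name `credit_history_days` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the two parsed date columns
issue_d = pd.to_datetime(pd.Series(['Oct-2018', 'Dec-2018']), format='%b-%Y')
earliest_cr_line = pd.to_datetime(pd.Series(['Aug-2006', 'Dec-2017']), format='%b-%Y')

# The difference has dtype timedelta64[ns]
credit_history = issue_d - earliest_cr_line

# .dt.days extracts whole days as integers, giving a numeric feature
credit_history_days = credit_history.dt.days
print(credit_history_days)
```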

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
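# Convert term to integer. str.strip('months') would strip any of the
# characters m, o, n, t, h, s, so remove the literal word instead.
df['term'] = df['term'].str.replace('months', '', regex=False).str.strip().astype('int')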


In [16]:
df['term'].head()

0    36
1    60
2    36
3    36
4    60
Name: term, dtype: int64

In [17]:
# Column loan_status_is_great:
# inspect the distinct values first

df['loan_status'].unique()
# No nulls, no unexpected values


array(['Current', 'Fully Paid', 'Late (31-120 days)', 'In Grace Period',
       'Charged Off', 'Late (16-30 days)'], dtype=object)

In [18]:
# 1 for loans that are Current or Fully Paid, 0 for every other status
df['loan_status_is_great'] = df['loan_status'].isin(
    ['Current', 'Fully Paid']).astype('int')
df['loan_status_is_great'].dtype


dtype('int64')

In [19]:
# Find the payment-related columns by name
pymnt_cols = [col for col in df.columns if 'pymnt' in col]
pymnt_cols

['pymnt_plan',
 'total_pymnt',
 'total_pymnt_inv',
 'last_pymnt_d',
 'last_pymnt_amnt',
 'next_pymnt_d']

In [0]:
# Parse the 'Mon-YYYY' strings (e.g. 'Feb-2019') into datetimes
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'], format='%b-%Y')


In [21]:
df['last_pymnt_d'].head(20)

0    2019-02-01
1    2019-02-01
2    2019-02-01
3    2019-02-01
4    2019-02-01
5    2019-02-01
6    2019-02-01
7    2019-02-01
8    2019-02-01
9    2019-02-01
10   2019-02-01
11   2019-02-01
12   2019-02-01
13   2019-02-01
14   2019-02-01
15   2019-02-01
16   2019-02-01
17   2019-02-01
18   2019-02-01
19   2019-02-01
Name: last_pymnt_d, dtype: datetime64[ns]

In [23]:
# Make last_pymnt_d_month and last_pymnt_d_year columns.
# The .dt accessor exposes datetime components; missing dates (NaT) make the result float
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year
df['last_pymnt_d_year'].head()

0    2019.0
1    2019.0
2    2019.0
3    2019.0
4    2019.0
Name: last_pymnt_d_year, dtype: float64

In [24]:
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_month'].head()

0    2.0
1    2.0
2    2.0
3    2.0
4    2.0
Name: last_pymnt_d_month, dtype: float64
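
The year and month come out as floats because some `last_pymnt_d` values are missing, and NaN cannot be stored in a regular integer column. If integer columns are preferred, pandas' nullable `Int64` dtype is one option. The sketch below uses a made-up Series in place of `df['last_pymnt_d']`.

```python
import pandas as pd

# Hypothetical stand-in for df['last_pymnt_d'], with one missing date
last_pymnt_d = pd.to_datetime(pd.Series(['Feb-2019', None, 'Jan-2019']),
                              format='%b-%Y')

# With a NaT present, the extracted component is float (NaN marks the gap)
year_float = last_pymnt_d.dt.year

# The nullable Int64 dtype keeps integers and marks the gap as <NA>
year_int = last_pymnt_d.dt.year.astype('Int64')
month_int = last_pymnt_d.dt.month.astype('Int64')
print(year_int)
print(month_int)
```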

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20.
- Take initiative and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [29]:
# There's one other column in the dataframe with percent signs.
# Remove them and convert to floats. You'll need to handle missing values.
# revol_util is that column; NaN values pass through str.strip and astype(float) unchanged
df['revol_util'].str.strip('%').astype(float).head()

0    10.3
1    24.2
2    19.1
3    78.1
4     3.6
Name: revol_util, dtype: float64

In [32]:
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
df['revol_util'].ffill().isnull().sum()

0
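
Forward fill copies the previous row's value, which has no particular meaning for unrelated loans. A possible alternative, sketched here on a made-up Series standing in for `df['revol_util']`, is to fill with the median and keep a flag for which rows were missing. Median imputation is an assumption for illustration, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['revol_util']
revol_util = pd.Series(['10.3%', np.nan, '78.1%'])

# Strip the percent sign and convert; NaN stays NaN
revol_util_float = revol_util.str.strip('%').astype(float)

# Record which rows were missing before filling (1 = missing)
revol_util_missing = revol_util_float.isnull().astype(int)

# Fill the gaps with the column median
revol_util_filled = revol_util_float.fillna(revol_util_float.median())
print(revol_util_filled)
print(revol_util_missing.tolist())
```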

In [0]:
#Modify the emp_title column to replace titles with 'Other' if the title is not in the top 20.
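
One possible approach to this stretch goal is sketched below. The helper name `keep_top_n` and the toy `titles` Series are hypothetical; on the real data the call would presumably be `keep_top_n(df['emp_title'], n=20)`.

```python
import pandas as pd

def keep_top_n(titles, n=20):
    # Labels of the n most frequent values
    top = titles.value_counts().head(n).index
    # Keep a title where it is in the top n, otherwise substitute 'Other'
    return titles.where(titles.isin(top), 'Other')

# Hypothetical stand-in for df['emp_title']; n=2 keeps the example small
titles = pd.Series(['Teacher', 'Teacher', 'Manager', 'Manager', 'Owner'])
reduced = keep_top_n(titles, n=2)
print(reduced.tolist())
# ['Teacher', 'Teacher', 'Manager', 'Manager', 'Other']
```

Note that missing titles are not counted by `value_counts`, so they would also become 'Other' unless they are filled in first.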


In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01