_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [19]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-05-05 20:16:59--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.1’

LoanStats_2018Q4.cs     [                <=> ]  21.40M   808KB/s    in 28s     

2019-05-05 20:17:27 (795 KB/s) - ‘LoanStats_2018Q4.csv.zip.1’ saved [22444881]



In [20]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [21]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [56]:
import pandas as pd
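# skiprows=1 drops the prospectus banner line above the header.
# skipfooter is only supported by the Python parser, so request that engine explicitly.
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')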
df.head()

  


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,10000,10000,10000.0,36 months,10.33%,324.23,B,B1,...,,,,N,,,,,,
1,,,4000,4000,4000.0,36 months,23.40%,155.68,E,E1,...,,,,N,,,,,,
2,,,5000,5000,5000.0,36 months,17.97%,180.69,D,D1,...,,,,N,,,,,,
3,,,9600,9600,9600.0,36 months,12.98%,323.37,B,B5,...,,,,N,,,,,,
4,,,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,...,,,,N,,,,,,
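
To see what `skiprows` and `skipfooter` do without downloading anything, here is a minimal sketch on made-up file contents. The banner and footer text are assumptions for illustration, not the exact lines in the LendingClub file.

```python
import io

import pandas as pd

# Made-up file contents: one banner line, a header, two data rows, two footer lines.
raw = (
    "Notes offered by Prospectus\n"
    "loan_amnt,term,int_rate\n"
    "10000,36 months,10.33%\n"
    "4000,36 months,23.40%\n"
    "Hypothetical footer line 1\n"
    "Hypothetical footer line 2\n"
)

# skiprows=1 drops the banner; skipfooter=2 drops the last two lines.
# skipfooter needs the Python parsing engine.
toy = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine='python')
print(toy.shape)
print(list(toy.columns))
```

Only the header and the two data rows should remain after parsing.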


In [24]:
df.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
128407,,,23000,23000,23000.0,36 months,15.02%,797.53,C,C3,...,,,,N,,,,,,
128408,,,10000,10000,10000.0,36 months,15.02%,346.76,C,C3,...,,,,N,,,,,,
128409,,,5000,5000,5000.0,36 months,13.56%,169.83,C,C1,...,,,,N,,,,,,
128410,,,10000,10000,9750.0,36 months,11.06%,327.68,B,B3,...,,,,N,,,,,,
128411,,,10000,10000,10000.0,36 months,16.91%,356.08,C,C5,...,,,,N,,,,,,


In [25]:
df.shape

(128412, 144)

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128412 entries, 0 to 128411
Columns: 144 entries, id to settlement_term
dtypes: float64(57), int64(51), object(36)
memory usage: 141.1+ MB


In [27]:
df.dtypes.value_counts()

float64    57
int64      51
object     36
dtype: int64

In [28]:
df.isnull().sum(axis=0).sort_values(ascending=False)/len(df)

id                                            1.000000
member_id                                     1.000000
url                                           1.000000
desc                                          1.000000
hardship_dpd                                  0.999992
deferral_term                                 0.999992
hardship_amount                               0.999992
hardship_start_date                           0.999992
hardship_end_date                             0.999992
payment_plan_start_date                       0.999992
hardship_length                               0.999992
orig_projected_additional_accrued_interest    0.999992
hardship_loan_status                          0.999992
hardship_reason                               0.999992
hardship_payoff_balance_amount                0.999992
hardship_last_payment_amount                  0.999992
hardship_type                                 0.999992
hardship_status                               0.999992
settlement
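
The fraction of missing values per column can also be computed with `isnull().mean()`, because the mean of a boolean column is the share of `True` values. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up): 'id' is entirely missing, 'emp_title' is 25% missing.
toy = pd.DataFrame({
    'id': [np.nan, np.nan, np.nan, np.nan],
    'emp_title': ['Teacher', np.nan, 'Manager', 'RN'],
    'loan_amnt': [10000, 4000, 5000, 9600],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

This should give the same numbers as dividing `isnull().sum()` by `len(df)`.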

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can count how many columns have a datatype of "object" (strings).

In [29]:
df.dtypes.value_counts()

float64    57
int64      51
object     36
dtype: int64
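
`dtypes.value_counts()` gives the counts but not the column names. One way to list the string columns themselves is `select_dtypes`. This is a sketch on a made-up frame; the string columns are created with `dtype=object` explicitly so the example does not depend on how a given pandas version stores strings by default.

```python
import pandas as pd

# Toy frame; the two string columns are stored with the object dtype on purpose.
toy = pd.DataFrame({
    'loan_amnt': [10000, 4000],
    'int_rate': pd.Series(['10.33%', '23.40%'], dtype=object),
    'grade': pd.Series(['B', 'E'], dtype=object),
})

# Names of the columns whose dtype is object.
object_cols = toy.select_dtypes(include='object').columns.tolist()
print(object_cols)
```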

In [30]:
df['int_rate'].head()

0     10.33%
1     23.40%
2     17.97%
3     12.98%
4     13.56%
Name: int_rate, dtype: object

### Convert `int_rate`

Define a function that removes the percent sign from a string and converts the result to a float.

In [0]:
def strip_percent(x_str):
    return float(x_str.strip('%'))

Apply the function to the `int_rate` column

In [32]:
df['int_rate'] = df['int_rate'].apply(strip_percent)
df['int_rate'].head()

0    10.33
1    23.40
2    17.97
3    12.98
4    13.56
Name: int_rate, dtype: float64
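
An alternative to `apply` is the vectorized `.str` accessor, which also passes missing values through instead of raising an error. A minimal sketch on made-up rates:

```python
import numpy as np
import pandas as pd

# Toy rates, including one missing value.
rates = pd.Series(['10.33%', '23.40%', np.nan, '17.97%'])

# Strip the trailing percent sign, then cast; the missing value stays NaN.
rates_float = rates.str.rstrip('%').astype(float)
print(rates_float)
```

Calling `strip_percent` on a missing value would fail, because `NaN` is a float and has no `strip` method; the `.str` version avoids that case.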

### Clean `emp_title`

Look at the first 10 titles

In [33]:
df['emp_title'].head(n=10)

0                   NaN
1              Security
2        Administrative
3                   NaN
4                  Chef
5           Postmaster 
6              Operator
7    Nursing Supervisor
8               Manager
9      Material Handler
Name: emp_title, dtype: object

How often is `emp_title` null?

In [34]:
df['emp_title'].value_counts(dropna=False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Truck Driver            387
Operations Manager      387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

Clean the title and handle missing values

In [35]:
df['emp_title'].isnull().sum()

20947

In [36]:
import numpy as np
type(np.nan)  # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0

float

In [0]:
def clean_title(title):
    if isinstance(title, str):
        return title.strip().lower()
    else:
        return 'unknown'

In [38]:
df['emp_title'] = df['emp_title'].apply(clean_title)
df['emp_title'].head()

0           unknown
1          security
2    administrative
3           unknown
4              chef
Name: emp_title, dtype: object

In [39]:
df['emp_title'].value_counts(dropna=False).head(20)

unknown               20947
teacher                2557
manager                2395
registered nurse       1418
driver                 1258
supervisor             1160
truck driver            920
rn                      834
office manager          805
sales                   803
general manager         791
project manager         720
owner                   625
director                523
operations manager      518
sales manager           500
police officer          440
nurse                   425
technician              420
engineer                412
Name: emp_title, dtype: int64
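
The same cleaning can be sketched with vectorized string methods and `fillna` rather than a custom function. The titles below are made up:

```python
import numpy as np
import pandas as pd

# Toy titles with mixed case, stray whitespace, and a missing value.
titles = pd.Series(['Teacher', ' manager ', np.nan, 'Supervisor '])

# Trim whitespace, lowercase, then label missing titles.
cleaned = titles.str.strip().str.lower().fillna('unknown')
print(cleaned.tolist())
```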

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [40]:
df['emp_title_manager'] = df['emp_title'].str.contains('manager')
df['emp_title_manager'].sample(10)

101686     True
124485     True
70041     False
86023     False
84534     False
58062     False
27323     False
53853     False
13923     False
27746     False
Name: emp_title_manager, dtype: bool

In [0]:
df.to_csv('tmp.csv', index=False)

In [42]:
idx_manager = df['emp_title_manager'] == True
df_managers = df[idx_manager]
df_managers.shape

(17885, 145)

In [43]:
idx_nonmanager = df['emp_title_manager'] == False
df_nonmanagers = df[idx_nonmanager]
df_nonmanagers.shape

(110527, 145)

In [44]:
print(df_managers['int_rate'].mean(),  df_nonmanagers['int_rate'].mean())

12.76060162146994 12.957682014350915


In [45]:
print(df_managers['int_rate'].std(),  df_nonmanagers['int_rate'].std())

5.070847083428044 5.092995080869786
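
The manager versus non-manager comparison can also be written as a single `groupby`, without building two separate dataframes. A sketch on made-up titles and rates:

```python
import pandas as pd

# Toy data: two titles contain 'manager', two do not.
toy = pd.DataFrame({
    'emp_title': ['manager', 'teacher', 'sales manager', 'driver'],
    'int_rate': [10.0, 14.0, 12.0, 16.0],
})
toy['emp_title_manager'] = toy['emp_title'].str.contains('manager', regex=False)

# One row per group (False first, then True) with mean, std, and count.
summary = toy.groupby('emp_title_manager')['int_rate'].agg(['mean', 'std', 'count'])
print(summary)
```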


## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [47]:
# Work on an explicit copy so that assigning columns does not trigger SettingWithCopyWarning.
df_nonmanagers = df_nonmanagers.copy()
df_nonmanagers['issue_d'] = pd.to_datetime(df_nonmanagers['issue_d'])
df_nonmanagers['issue_d'].head()


0   2018-12-01
1   2018-12-01
2   2018-12-01
3   2018-12-01
4   2018-12-01
Name: issue_d, dtype: datetime64[ns]

In [48]:
df_nonmanagers['issue_year'] = df_nonmanagers['issue_d'].dt.year
df_nonmanagers['issue_month'] = df_nonmanagers['issue_d'].dt.month
df_nonmanagers[['issue_year', 'issue_month']].head()



Unnamed: 0,issue_year,issue_month
0,2018,12
1,2018,12
2,2018,12
3,2018,12
4,2018,12
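
Here is a minimal sketch of parsing month-year strings with an explicit format and reading components through the `.dt` accessor. The sample strings assume the `Mon-YYYY` layout and are made up; `%b` matches English month abbreviations in the default locale.

```python
import pandas as pd

# Toy month-year strings in an assumed 'Mon-YYYY' layout.
issue = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# An explicit format avoids guessing; each value becomes the first day of that month.
issue_dt = pd.to_datetime(issue, format='%b-%Y')

parts = pd.DataFrame({
    'issue_year': issue_dt.dt.year,
    'issue_month': issue_dt.dt.month,
    'issue_quarter': issue_dt.dt.quarter,
})
print(parts)
```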


# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid", and the integer 0 otherwise.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [57]:
# Convert the term column from string to integer.

df['term'].head()


0     36 months
1     36 months
2     36 months
3     36 months
4     36 months
Name: term, dtype: object

In [58]:
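def strip_months(x_str):
    # str.strip(' months') removes any of those characters from both ends,
    # not the literal suffix, so take the leading number instead.
    return int(x_str.split()[0])

df['term'] = df['term'].apply(strip_months)
df['term'].head()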

0    36
1    36
2    36
3    36
4    36
Name: term, dtype: int64

In [61]:
df['loan_status'].sample(10)

123975    Fully Paid
91929        Current
128023       Current
52888        Current
43921     Fully Paid
47280        Current
95456        Current
65753        Current
103550       Current
57847        Current
Name: loan_status, dtype: object

In [69]:
# Make a column named loan_status_is_great. It should contain the integer 1
# if loan_status is "Current" or "Fully Paid", else the integer 0.

# isin tests every row separately; astype(int) turns True/False into 1/0.
df['loan_status_is_great'] = df['loan_status'].isin(['Current', 'Fully Paid']).astype(int)

df['loan_status_is_great'].sample(20)

5704      1
57906     1
67793     1
29685     1
71501     1
46715     1
118555    1
35214     1
122786    1
1671      1
95708     1
126345    1
946       1
5799      1
106971    1
30145     1
26582     1
70655     1
27133     1
32409     1
Name: loan_status_is_great, dtype: int64
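
A random sample can easily show only 1s when most loans are in good standing, so it helps to tabulate the flag. A sketch with made-up status values, using a row-by-row membership test:

```python
import pandas as pd

# Toy statuses (made up for illustration).
status = pd.Series(['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off'])

# Row-by-row membership test, converted to 1/0.
is_great = status.isin(['Current', 'Fully Paid']).astype(int)

# Tabulate the flag to confirm that both values occur.
counts = is_great.value_counts().sort_index()
print(counts)
```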

In [71]:
df['last_pymnt_d'].head()

0    Apr-2019
1    Apr-2019
2    Apr-2019
3    Apr-2019
4    Apr-2019
Name: last_pymnt_d, dtype: object

In [76]:
# Make last_pymnt_d_month and last_pymnt_d_year columns.
# Values look like 'Apr-2019', so parse them with an explicit format.
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'], format='%b-%Y')

# Missing payment dates become NaT, which is why month and year come back as floats.
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year
df[['last_pymnt_d_month', 'last_pymnt_d_year']].head()

Unnamed: 0,last_pymnt_d_month,last_pymnt_d_year
0,4.0,2019.0
1,4.0,2019.0
2,4.0,2019.0
3,4.0,2019.0
4,4.0,2019.0


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
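
As a starting point for the "top 20" option, here is one possible sketch that keeps the most frequent titles and replaces the rest. It uses made-up titles and `top_n = 2` so the result is easy to check by eye; the name `top_n` is hypothetical.

```python
import pandas as pd

# Toy titles: 'teacher' appears 3 times, 'manager' twice, the rest once.
titles = pd.Series(['teacher', 'manager', 'teacher', 'driver',
                    'manager', 'teacher', 'chef'])

top_n = 2  # hypothetical parameter; the stretch goal uses 20
top_titles = titles.value_counts().head(top_n).index

# Keep a title if it is among the most frequent ones, otherwise use 'Other'.
collapsed = titles.where(titles.isin(top_titles), 'Other')
print(collapsed.tolist())
```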

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01