_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [141]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-05-05 09:30:52--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.3’

LoanStats_2018Q4.cs     [                <=> ]  21.40M   880KB/s    in 25s     

2019-05-05 09:31:18 (864 KB/s) - ‘LoanStats_2018Q4.csv.zip.3’ saved [22444881]



In [142]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: LoanStats_2018Q4.csv    


In [143]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)
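The `head` output above shows that the first line of the file is a note rather than the header row. The `skiprows` and `skipfooter` parameters of `read_csv` can be tried on a small in-memory file first. This is a minimal sketch on made-up CSV text (not the LendingClub file), assuming one note line at the top and two non-data lines at the bottom. `skipfooter` is not supported by the default C parser, so `engine='python'` is passed explicitly.

```python
import io
import pandas as pd

# Made-up CSV text for illustration: a note line, a header, two data rows, two footer lines
raw = (
    "Some note line that is not part of the table\n"
    '"id","loan_amnt","term"\n'
    '"1","1000"," 36 months"\n'
    '"2","2500"," 60 months"\n'
    "Footer line one\n"
    "Footer line two\n"
)

# skiprows=1 drops the note line; skipfooter=2 drops the last two lines
toy = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine='python')
print(toy.shape)
print(toy.columns.tolist())
```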

In [0]:
import pandas as pd


In [145]:
df = pd.read_csv(filepath_or_buffer='LoanStats_2018Q4.csv')
df.head()

In [146]:
# skiprows=1 skips the note on the first line; skipfooter=2 drops the last two lines
# skipfooter is only supported by the Python parsing engine
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')


## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get info about which columns have a datatype of "object" (strings)
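One way to list the text columns is `select_dtypes`. This is a sketch on a made-up frame rather than the loan data. It selects everything that is not numeric, which should work whether pandas labels string columns as "object" (as in this notebook) or with a dedicated string dtype.

```python
import pandas as pd

# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    'loan_amnt': [1000, 2500],
    'int_rate': [' 10.33%', ' 23.40%'],
    'term': [' 36 months', ' 60 months'],
})

# Columns that are not numeric, i.e. the ones holding strings here
text_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(text_cols)
```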

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
def strip_percent(x_str):
    # Remove the percent sign, then convert the remaining text to a float
    return float(x_str.strip('%'))

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(strip_percent)

In [149]:
df.int_rate.head()

0    10.33
1    23.40
2    17.97
3    12.98
4    13.56
Name: int_rate, dtype: float64

### Clean `emp_title`

Look at the first 10 titles

In [150]:
df['emp_title'].head(n=10)

0                   NaN
1              Security
2        Administrative
3                   NaN
4                  Chef
5           Postmaster 
6              Operator
7    Nursing Supervisor
8               Manager
9      Material Handler
Name: emp_title, dtype: object

How often is `emp_title` null?

In [151]:
df.emp_title.value_counts(dropna=False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

Clean the title and handle missing values. Start by counting the missing values.

In [152]:
df['emp_title'].isnull().sum()

20947

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [153]:
df['emp_title'].str.lower()

0                                              NaN
1                                         security
2                                   administrative
3                                              NaN
4                                             chef
5                                      postmaster 
6                                         operator
7                               nursing supervisor
8                                          manager
9                                 material handler
10                                             NaN
11                       instructional coordinator
12                                             NaN
13                financial relationship associate
14                             sale representative
15                             driver coordinator 
16                                   gas attendant
17        assistant athletic director of marketing
18                                sr sales manager
19                             

In [0]:
import numpy as np

In [0]:
def clean_title(title):
    if isinstance(title, str):
        return title.strip().lower()
    else:
        return 'unknown'

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [157]:
df.emp_title.value_counts(dropna=False).head(20)

unknown               20947
teacher                2557
manager                2395
registered nurse       1418
driver                 1258
supervisor             1160
truck driver            920
rn                      834
office manager          805
sales                   803
general manager         791
project manager         720
owner                   625
director                523
operations manager      518
sales manager           500
police officer          440
nurse                   425
technician              420
engineer                412
Name: emp_title, dtype: int64

In [0]:
df['emp_title_manager']= df['emp_title'].str.contains('manager')
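`str.contains` also has `case` and `na` parameters, which matter when the column has not been cleaned first. A small sketch on made-up titles (not the loan data):

```python
import pandas as pd

# Made-up titles, including a missing value and inconsistent capitalization
titles = pd.Series(['Manager', None, 'sr sales manager ', 'Chef'])

# case=False ignores capitalization; na=False maps missing titles to False
is_manager = titles.str.contains('manager', case=False, na=False)
print(is_manager.tolist())
```

Because `emp_title` was already lowercased and missing values were replaced with 'unknown', the plain `str.contains('manager')` call above does not need these arguments.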

In [159]:
df['emp_title_manager']


0         False
1         False
2         False
3         False
4         False
5         False
6         False
7         False
8          True
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18         True
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
128382    False
128383    False
128384     True
128385    False
128386    False
128387    False
128388    False
128389    False
128390    False
128391    False
128392    False
128393     True
128394    False
128395    False
128396    False
128397    False
128398    False
128399    False
128400    False
128401    False
128402    False
128403    False
128404    False
128405    False
128406    False
128407    False
128408    False
128409    False
128410    False
128411     True
Name: emp_title_manager,

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
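A minimal sketch of `to_datetime` and the `.dt` accessor, using made-up month-year strings in the same 'Dec-2018' style that `issue_d` uses in this dataset. The `format` string is an assumption that fits that style.

```python
import pandas as pd

# Made-up month-year strings, including one missing value
dates = pd.Series(['Dec-2018', 'Oct-2018', None])

# %b is the abbreviated month name, %Y the four-digit year; None becomes NaT
parsed = pd.to_datetime(dates, format='%b-%Y')

# The .dt accessor exposes the date components
parts = pd.DataFrame({
    'month': parsed.dt.month,
    'year': parsed.dt.year,
})
print(parts)
```

The same pattern could produce `last_pymnt_d_month` and `last_pymnt_d_year`, assuming `last_pymnt_d` is stored in the same format, which this notebook does not show.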

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [160]:
df.term.head(20)

0      36 months
1      36 months
2      36 months
3      36 months
4      36 months
5      60 months
6      60 months
7      60 months
8      36 months
9      60 months
10     60 months
11     60 months
12     60 months
13     60 months
14     36 months
15     36 months
16     36 months
17     60 months
18     60 months
19     60 months
Name: term, dtype: object

In [0]:
def strip_months(m_str):
    # Values look like ' 36 months'; the first whitespace-separated token is the number
    return int(m_str.split()[0])

In [162]:
df.term.apply(strip_months).head(n=10)

0    36
1    36
2    36
3    36
4    36
5    60
6    60
7    60
8    36
9    60
Name: term, dtype: int64
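A vectorized alternative to `apply` is to extract the digits with a regular expression. This is a sketch on made-up values in the same style as `term`, not a required approach.

```python
import pandas as pd

# Made-up values in the same style as the `term` column
term = pd.Series([' 36 months', ' 60 months', ' 36 months'])

# Pull out the run of digits and convert it to an integer column
term_int = term.str.extract(r'(\d+)', expand=False).astype(int)
print(term_int.tolist())
```

Note that the cell above only displays the converted values. The column itself is converted only when the result is assigned back, for example `df['term'] = df['term'].apply(strip_months)`.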

In [163]:
df.columns


Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'debt_settlement_flag', 'debt_settlement_flag_date',
       'settlement_status', 'settlement_date', 'settlement_amount',
       'settlement_percentage', 'settlement_term', 'emp_title_manager'],
      dtype='object', length=145)

In [164]:
df['loan_status']

0                   Current
1                   Current
2                   Current
3                   Current
4                   Current
5                   Current
6                   Current
7                   Current
8                   Current
9                   Current
10                  Current
11                  Current
12                  Current
13                  Current
14                  Current
15                  Current
16                  Current
17                  Current
18                  Current
19                  Current
20                  Current
21                  Current
22                  Current
23                  Current
24                  Current
25                  Current
26                  Current
27                  Current
28                  Current
29                  Current
                ...        
128382              Current
128383              Current
128384              Current
128385              Current
128386              

In [0]:
def loan_stat(stat):
    # Return 1 if the status is 'Current' or 'Fully Paid', otherwise 0
    if stat in ('Current', 'Fully Paid'):
        return 1
    else:
        return 0

In [0]:
# Make a column named loan_status_is_great. It should contain the integer 1
# if loan_status is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

df['loan_status_is_great'] = df['loan_status'].apply(loan_stat)
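The same column can be built without a helper function by using `isin`. A sketch on made-up status values:

```python
import pandas as pd

# Made-up statuses for illustration
status = pd.Series(['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off'])

# isin gives True/False per row; astype(int) turns that into 1/0
is_great = status.isin(['Current', 'Fully Paid']).astype(int)
print(is_great.tolist())
```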

In [167]:
df['loan_status_is_great']

0         1
1         1
2         1
3         1
4         1
5         1
6         1
7         1
8         1
9         1
10        1
11        1
12        1
13        1
14        1
15        1
16        1
17        1
18        1
19        1
20        1
21        1
22        1
23        1
24        1
25        1
26        1
27        1
28        1
29        1
         ..
128382    1
128383    1
128384    1
128385    1
128386    1
128387    1
128388    1
128389    1
128390    1
128391    1
128392    1
128393    1
128394    0
128395    1
128396    1
128397    1
128398    1
128399    1
128400    1
128401    1
128402    1
128403    1
128404    1
128405    1
128406    1
128407    0
128408    1
128409    1
128410    1
128411    1
Name: loan_status_is_great, Length: 128412, dtype: int64

In [168]:
df['loan_status_is_great'][128407]

0

In [169]:
df['loan_status'].dtype

dtype('O')

In [172]:
df.columns


Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       ...
       'hardship_last_payment_amount', 'debt_settlement_flag',
       'debt_settlement_flag_date', 'settlement_status', 'settlement_date',
       'settlement_amount', 'settlement_percentage', 'settlement_term',
       'emp_title_manager', 'loan_status_is_great'],
      dtype='object', length=146)

In [0]:
pd.set_option('display.max_columns', None)

In [174]:

df['issue_d']

0         Dec-2018
1         Dec-2018
2         Dec-2018
3         Dec-2018
4         Dec-2018
5         Dec-2018
6         Dec-2018
7         Dec-2018
8         Dec-2018
9         Dec-2018
10        Dec-2018
11        Dec-2018
12        Dec-2018
13        Dec-2018
14        Dec-2018
15        Dec-2018
16        Dec-2018
17        Dec-2018
18        Dec-2018
19        Dec-2018
20        Dec-2018
21        Dec-2018
22        Dec-2018
23        Dec-2018
24        Dec-2018
25        Dec-2018
26        Dec-2018
27        Dec-2018
28        Dec-2018
29        Dec-2018
            ...   
128382    Oct-2018
128383    Oct-2018
128384    Oct-2018
128385    Oct-2018
128386    Oct-2018
128387    Oct-2018
128388    Oct-2018
128389    Oct-2018
128390    Oct-2018
128391    Oct-2018
128392    Oct-2018
128393    Oct-2018
128394    Oct-2018
128395    Oct-2018
128396    Oct-2018
128397    Oct-2018
128398    Oct-2018
128399    Oct-2018
128400    Oct-2018
128401    Oct-2018
128402    Oct-2018
128403    Oc

In [175]:
df['term']

0          36 months
1          36 months
2          36 months
3          36 months
4          36 months
5          60 months
6          60 months
7          60 months
8          36 months
9          60 months
10         60 months
11         60 months
12         60 months
13         60 months
14         36 months
15         36 months
16         36 months
17         60 months
18         60 months
19         60 months
20         36 months
21         36 months
22         36 months
23         36 months
24         60 months
25         60 months
26         36 months
27         36 months
28         36 months
29         60 months
             ...    
128382     60 months
128383     36 months
128384     36 months
128385     36 months
128386     36 months
128387     60 months
128388     36 months
128389     36 months
128390     36 months
128391     36 months
128392     36 months
128393     36 months
128394     36 months
128395     36 months
128396     60 months
128397     60 months
128398     60

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
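As a starting point for the first two LendingClub options, here is a sketch on made-up data. The helper names `percent_to_float` and `keep_top_n` are hypothetical, and the sketch does not identify which loan column contains the percent signs.

```python
import pandas as pd

def percent_to_float(col):
    # Strip whitespace and a trailing '%', then convert; missing values become NaN
    return pd.to_numeric(col.str.strip().str.rstrip('%'), errors='coerce')

def keep_top_n(col, n=20, other='Other'):
    # Keep the n most frequent values and replace everything else with `other`
    top = col.value_counts().nlargest(n).index
    return col.where(col.isin(top), other)

# Made-up data for illustration
pct = pd.Series([' 45.2%', None, '7%'])
titles = pd.Series(['teacher', 'manager', 'teacher', 'chef', 'manager', 'pilot'])

print(percent_to_float(pct).tolist())
print(keep_top_n(titles, n=2).tolist())
```

If `emp_title` has already been cleaned as in the lesson, 'unknown' is itself one of the most frequent values, so you may want to decide whether it should count toward the top 20.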

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01