_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [36]:
## Import libraries
# File Handling
import wget
import os.path
from os import path
import zipfile
import csv

# Data
import pandas as pd
import numpy as np


In [25]:
def downloadFile(url, name=None):
    '''
    Check if file exists and download to working directory if it does not.
    '''
    # Use the requested name if given; otherwise fall back to the last URL component
    filename = name if name is not None else url.split('/')[-1]

    if path.exists(filename):
        print('File already exists')
    else:
        try:
            filename = wget.download(url, out=name)
            print(filename, 'successfully downloaded')
        except Exception:
            print('File could not be downloaded.  Check URL & connection.')
    return filename

filename = downloadFile("https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip", "loanStats.csv.zip")

loanStats.csv.zip successfully downloaded
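
The file name can also be derived from the URL with the standard library rather than by splitting on `/` by hand. This is only a sketch: `filename_from_url` is a hypothetical helper that is not part of the lesson code, and the second URL is a made-up example.

```python
import os.path
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path component of a URL (hypothetical helper)."""
    # urlparse separates the path from any query string or fragment
    return os.path.basename(urlparse(url).path)

example_url = "https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip"
print(filename_from_url(example_url))

# A query string does not leak into the file name
print(filename_from_url("https://example.com/data/archive.zip?version=2"))
```

Parsing the URL first keeps query strings such as `?version=2` out of the resulting name, which a plain `split('/')` would not.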


In [26]:
## Check contents of .zip file
zfile = zipfile.ZipFile(filename)
print(zfile.namelist())

['LoanStats_2018Q4.csv']
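
A zip member can also be read without extracting it to disk first. The sketch below builds a tiny in-memory archive so that it runs without the real download; the member name `example.csv` and its contents are made up for illustration.

```python
import csv
import io
import zipfile

# Build a small in-memory archive (stand-in for the downloaded .zip)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "loan_amnt,term\n5000,36 months\n")

# Open the first member directly instead of calling extractall()
with zipfile.ZipFile(buffer) as archive:
    member = archive.namelist()[0]
    with archive.open(member) as raw:
        # archive.open() yields bytes, so wrap it to get text for csv.reader
        rows = list(csv.reader(io.TextIOWrapper(raw, encoding="utf-8")))

print(rows)
```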


In [32]:
## Unzip data
zfile.extractall()

In [33]:
## Rename data files to follow convention in given filename
for counter, memberName in enumerate(zfile.namelist()):
    splitName = filename.split('.')
    os.rename(memberName, splitName[0]+'_'+str(counter)+'.'+splitName[1])

In [51]:
## View the first few rows of the renamed file
def csvHead(pathname): 
    with open(pathname) as csvFile:
        csv_read = csv.reader(csvFile, delimiter=',')
        for counter, row in enumerate(csv_read):
            print(row)
            if counter > 10:
                break
                
csvHead("loanStats_0.csv")            

['Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)']
['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq'

In [None]:
## Something is wrong.  The first row isn't helpful. Moving to Pandas will make this a bit easier

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [64]:
lndata = pd.read_csv('loanStats_0.csv', skiprows=1, skipfooter=2, engine='python')
print(lndata.head())

   id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
0 NaN        NaN       5000         5000           5000.0   36 months   
1 NaN        NaN      25000        25000          25000.0   60 months   
2 NaN        NaN      10000        10000          10000.0   36 months   
3 NaN        NaN       4000         4000           4000.0   36 months   
4 NaN        NaN      31450        31450          31450.0   36 months   

  int_rate  installment grade sub_grade  ...  \
0   17.97%       180.69     D        D1  ...   
1   14.47%       587.82     C        C2  ...   
2   10.33%       324.23     B        B1  ...   
3   23.40%       155.68     E        E1  ...   
4    7.56%       979.16     A        A3  ...   

  orig_projected_additional_accrued_interest hardship_payoff_balance_amount  \
0                                        NaN                            NaN   
1                                        NaN                            NaN   
2                                  

In [65]:
print(lndata.tail())
## the last two rows aren't data we need here. Revising read above...

        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
128407 NaN        NaN      23000        23000          23000.0   36 months   
128408 NaN        NaN      10000        10000          10000.0   36 months   
128409 NaN        NaN       5000         5000           5000.0   36 months   
128410 NaN        NaN      10000        10000           9750.0   36 months   
128411 NaN        NaN      10000        10000          10000.0   36 months   

       int_rate  installment grade sub_grade  ...  \
128407   15.02%       797.53     C        C3  ...   
128408   15.02%       346.76     C        C3  ...   
128409   13.56%       169.83     C        C1  ...   
128410   11.06%       327.68     B        B3  ...   
128411   16.91%       356.08     C        C5  ...   

       orig_projected_additional_accrued_interest  \
128407                                        NaN   
128408                                        NaN   
128409                                        NaN   
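
To see what `skiprows` and `skipfooter` do in isolation, here is a minimal sketch on made-up text with one junk line at the top and two summary lines at the bottom. The sample content is an assumption for illustration, not the real LendingClub file.

```python
import io
import pandas as pd

# Made-up file contents: 1 banner line, a header, 2 data rows, 2 footer lines
raw_text = (
    "Notes offered by Prospectus\n"
    "loan_amnt,term\n"
    "5000,36 months\n"
    "25000,60 months\n"
    "Total amount funded: 30000\n"
    "Another summary line\n"
)

# skipfooter is only supported by the slower Python parsing engine
sample = pd.read_csv(io.StringIO(raw_text), skiprows=1, skipfooter=2, engine="python")
print(sample.shape)  # (2, 2)
```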


## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can list the columns whose dtype is `object`, which is how the string columns were loaded here:

In [70]:
## Here I decided to loop through all the columns and check their dtype, printing where the data type is object.
for col in lndata.columns:
    if lndata[col].dtype == 'object':
        print(col)

term
int_rate
grade
sub_grade
emp_title
emp_length
home_ownership
verification_status
issue_d
loan_status
pymnt_plan
purpose
title
zip_code
addr_state
earliest_cr_line
revol_util
initial_list_status
last_pymnt_d
next_pymnt_d
last_credit_pull_d
application_type
verification_status_joint
sec_app_earliest_cr_line
hardship_flag
hardship_type
hardship_reason
hardship_status
hardship_start_date
hardship_end_date
payment_plan_start_date
hardship_loan_status
debt_settlement_flag
debt_settlement_flag_date
settlement_status
settlement_date
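
The same list can be produced without an explicit loop using `DataFrame.select_dtypes`. This is a sketch on a toy frame; the columns are built with `dtype=object` on purpose so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Toy frame: one numeric column and two string columns stored as object
toy = pd.DataFrame({
    "loan_amnt": [5000, 25000],
    "term": pd.Series([" 36 months", " 60 months"], dtype=object),
    "int_rate": pd.Series(["17.97%", "14.47%"], dtype=object),
})

# Keep only the object-dtype columns and collect their names
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(object_cols)  # ['term', 'int_rate']
```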


### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [71]:
## Let's investigate int_rate
lndata['int_rate'].head()

0     17.97%
1     14.47%
2     10.33%
3     23.40%
4      7.56%
Name: int_rate, dtype: object

In [87]:
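## Return float value, or NaN if the value is missing or cannot be parsed
def ptf(percent_str):
    try:
        fval = float(percent_str.strip('%'))
    except (AttributeError, ValueError):
        # AttributeError: missing values are floats with no .strip()
        fval = np.nan

    return fval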

Apply the function to the `int_rate` column

In [88]:
## First, I'll copy the original data
lndCopy = lndata.copy()
## Now I will replace the column with the function applied
lndCopy['int_rate'] = lndata['int_rate'].apply(ptf)
lndCopy['int_rate'].head()

0    17.97
1    14.47
2    10.33
3    23.40
4     7.56
Name: int_rate, dtype: float64
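
The same conversion can be written with vectorized string methods instead of `apply`. This is a sketch on made-up values, using `pd.to_numeric(..., errors="coerce")` so that missing or unparseable entries become NaN.

```python
import pandas as pd

# Made-up sample with stray whitespace and a missing value
rates = pd.Series(["17.97%", " 14.47%", None])

# Strip whitespace, drop the trailing percent sign, then parse as numbers
rates_float = pd.to_numeric(rates.str.strip().str.rstrip("%"), errors="coerce")
print(rates_float)
```

On large columns the `.str` methods usually read more cleanly than a row-by-row function, and the missing value passes through as NaN without a `try`/`except`.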

### Clean `emp_title`

Look at top 20 titles

In [94]:
for counter, row in enumerate(lndata['emp_title'].value_counts().items()):
    if counter < 20:
        print(row)
    else:
        break

('Teacher', 2090)
('Manager', 1773)
('Registered Nurse', 952)
('Driver', 924)
('RN', 726)
('Supervisor', 697)
('Sales', 580)
('Project Manager', 526)
('General Manager', 523)
('Office Manager', 521)
('Owner', 420)
('Director', 402)
('Operations Manager', 387)
('Truck Driver', 387)
('Nurse', 326)
('Engineer', 325)
('Sales Manager', 304)
('manager', 301)
('Supervisor ', 270)
('Administrative Assistant', 269)


How often is `emp_title` null?

In [95]:
lndata['emp_title'].isna().sum()

20947

Clean the title and handle missing values

In [98]:
## Strip surrounding whitespace and convert to title case; non-strings become NaN
def cleanEntry(entry):
    if isinstance(entry, str):
        return entry.strip().title()
    else:
        return np.nan

## Quick test
test_str = ['adfad ', ' adfadfad ', np.nan]
print([cleanEntry(item) for item in test_str])

['Adfad', 'Adfadfad', nan]


In [99]:
## Hit the copied dataframe
lndCopy['emp_title'] = lndata['emp_title'].apply(cleanEntry)
lndCopy['emp_title'].head()

0          Administrative
1                 Teacher
2                     NaN
3                Security
4    Construction Manager
Name: emp_title, dtype: object
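
A vectorized version of the same cleaning step, sketched on made-up titles. Chaining `.str.strip()` and `.str.title()` leaves missing values as missing, so no explicit type check is needed.

```python
import pandas as pd

# Made-up titles with inconsistent case, stray spaces, and a missing value
titles = pd.Series(["manager", "Supervisor ", " registered nurse", None])

# Strip whitespace, then title-case each entry
cleaned = titles.str.strip().str.title()
print(cleaned)
```

After cleaning, variants such as `'manager'` and `'Manager'`, or `'Supervisor '` and `'Supervisor'`, collapse into a single category when counted.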

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [117]:
## Making the boolean manager column
## Note: str.contains returns NaN for a missing emp_title, and astype(bool) turns NaN into True
## (row 2 below has a missing title but is flagged True)
managertypes = ['manager', 'executive', 'supervisor']
lndCopy['emp_title_manager'] = lndata['emp_title'].str.contains('|'.join(managertypes), case=False, regex=True)
lndCopy['emp_title_manager'] = lndCopy['emp_title_manager'].astype(bool)
lndCopy['emp_title_manager'].head(20)

0     False
1     False
2      True
3     False
4      True
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
16    False
17    False
18    False
19    False
Name: emp_title_manager, dtype: bool
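
In the output above, row 2 is flagged `True` even though its `emp_title` is missing. One way to avoid that, sketched here on made-up titles, is the `na` parameter of `str.contains`, which sets the result for missing values explicitly.

```python
import pandas as pd

# Made-up titles, including a missing one
titles = pd.Series(["Teacher", "Office Manager", None, "supervisor "], dtype=object)
pattern = "manager|executive|supervisor"

# na=False makes missing titles count as "not a manager"
is_manager = titles.str.contains(pattern, case=False, regex=True, na=False)
print(is_manager.tolist())  # [False, True, False, True]
```

Whether a missing title should count as `False` or stay missing is a modeling choice; `na=False` is just one reasonable option.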

In [119]:
## Checking interest rates for each group
lndCopy.groupby('emp_title_manager').mean(numeric_only=True)

| emp_title_manager | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | url | desc | ... | deferral_term | hardship_amount | hardship_length | hardship_dpd | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| False | NaN | NaN | 15920.22624 | 15920.22624 | 15917.406203 | 12.929892 | 460.318361 | 83290.769015 | NaN | NaN | ... | 3.0 | 378.39 | 3.0 | 22.0 | 1135.17 | 15351.85 | 1045.41 | 6129.612 | 56.665 | 16.3 |
| True | NaN | NaN | 16070.51561 | 16070.51561 | 16067.687284 | 12.930058 | 468.957513 | 81839.369114 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5982.346111 | 54.455 | 15.5 |


## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [124]:
lndCopy['issue_d'] = pd.to_datetime(lndCopy['issue_d'], infer_datetime_format=True)
lndCopy['issue_d'].head(10)

0   2018-12-01
1   2018-12-01
2   2018-12-01
3   2018-12-01
4   2018-12-01
5   2018-12-01
6   2018-12-01
7   2018-12-01
8   2018-12-01
9   2018-12-01
Name: issue_d, dtype: datetime64[ns]
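
Passing an explicit `format` to `pd.to_datetime` is an alternative to letting pandas infer it. This sketch assumes the raw strings look like `Dec-2018` (abbreviated English month name, hyphen, four-digit year); the sample values are made up.

```python
import pandas as pd

# Made-up date strings in "Mon-YYYY" form, plus a missing value
raw_dates = pd.Series(["Dec-2018", "Jun-2019", None])

# %b = abbreviated month name (locale-dependent), %Y = four-digit year
dates = pd.to_datetime(raw_dates, format="%b-%Y")
print(dates)
```

With no day in the string, each value lands on the first of the month, matching the `2018-12-01` style shown above. Missing entries become `NaT`.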

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid"; otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [129]:
lndCopy['term'].value_counts()
#lndCopy['term'] = lndCopy['term'].astype(int)

 36 months    88179
 60 months    40233
Name: term, dtype: int64

In [138]:
## splitting on ' months' and returning number as integer since column very clean.
def stripMonths(term):
    return int(term.split(' months')[0])

lndCopy['term'] = lndata['term'].apply(stripMonths)
lndCopy['term'].head(5)

0    36
1    60
2    36
3    36
4    36
Name: term, dtype: int64
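
A regular expression is another way to pull the number out of `term`, and it does not depend on the exact wording after the digits. This is a sketch on made-up values that mirror the leading space seen in the `value_counts()` output.

```python
import pandas as pd

# Made-up values with the same leading space as the real column
terms = pd.Series([" 36 months", " 60 months"])

# Capture the first run of digits, then convert to integers
term_months = terms.str.extract(r"(\d+)", expand=False).astype(int)
print(term_months.tolist())  # [36, 60]
```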

In [145]:
## Quick look at loan_status
lndata['loan_status'].value_counts()

Current               114514
Fully Paid              9785
Late (31-120 days)      1878
Charged Off             1028
In Grace Period          807
Late (16-30 days)        397
Default                    3
Name: loan_status, dtype: int64

In [149]:
## Building the feature with method from above
good_status = ['Fully Paid', 'Current']
lndCopy['loan_status_is_great'] = lndata['loan_status'].str.contains('|'.join(good_status), case=False, regex=True)
lndCopy['loan_status_is_great'].value_counts()
lndCopy['loan_status_is_great'] = lndCopy['loan_status_is_great'].astype(int)
lndCopy['loan_status_is_great'].head(10)

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: loan_status_is_great, dtype: int32
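
Because the assignment asks for exact status values, `Series.isin` is a possible alternative to substring matching with `str.contains`. This is a sketch on made-up statuses, including a missing one.

```python
import pandas as pd

# Made-up statuses drawn from the categories listed above, plus a missing value
status = pd.Series(["Current", "Fully Paid", "Late (31-120 days)", "Charged Off", None])

# isin matches whole values only and treats missing values as False
is_great = status.isin(["Current", "Fully Paid"]).astype(int)
print(is_great.tolist())  # [1, 1, 0, 0, 0]
```

Exact matching avoids accidental hits if a future status happened to contain one of the target words as a substring.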

In [158]:
## Quick look at last_pymnt_d
print(lndata['last_pymnt_d'].head())
print(lndata['last_pymnt_d'].value_counts())
print(lndata['last_pymnt_d'].dtype)

type(lndata['last_pymnt_d'][1])

0    Jun-2019
1    May-2019
2    May-2019
3    May-2019
4    Jun-2019
Name: last_pymnt_d, dtype: object
Jun-2019    103574
May-2019     14806
Apr-2019      2408
Mar-2019      2002
Feb-2019      1925
Jan-2019      1560
Dec-2018      1054
Nov-2018       679
Oct-2018       234
Jul-2019        11
Name: last_pymnt_d, dtype: int64
object


str

In [167]:
lndCopy['last_pymnt_d'] = pd.to_datetime(lndata['last_pymnt_d'], infer_datetime_format=True)
print(lndCopy['last_pymnt_d'].value_counts())
print(lndCopy['last_pymnt_d'].dtype)

type(lndCopy['last_pymnt_d'][1])

2019-06-01    103574
2019-05-01     14806
2019-04-01      2408
2019-03-01      2002
2019-02-01      1925
2019-01-01      1560
2018-12-01      1054
2018-11-01       679
2018-10-01       234
2019-07-01        11
Name: last_pymnt_d, dtype: int64
datetime64[ns]


pandas._libs.tslibs.timestamps.Timestamp

In [172]:
## This column is clean and has already been converted to datetime above,
## so the month name and year can be read straight off each Timestamp.
def getMonth(x):
    return x.month_name()
def getYear(x):
    return x.year

lndCopy['last_pymnt_d_month'] = lndCopy['last_pymnt_d'].apply(getMonth)
lndCopy['last_pymnt_d_year'] = lndCopy['last_pymnt_d'].apply(getYear)
print(lndCopy['last_pymnt_d_month'].head(5))
print(lndCopy['last_pymnt_d_year'].head(5))

0    June
1     May
2     May
3     May
4    June
Name: last_pymnt_d_month, dtype: object
0    2019.0
1    2019.0
2    2019.0
3    2019.0
4    2019.0
Name: last_pymnt_d_year, dtype: float64
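
The year comes out as `float64` above because the column contains missing dates. A sketch of one alternative, on made-up values: use the `.dt` accessor and cast to the nullable `Int64` dtype, which keeps whole numbers alongside missing entries.

```python
import pandas as pd

# Made-up payment dates in the same "Mon-YYYY" form, plus a missing value
paid = pd.to_datetime(pd.Series(["Jun-2019", "May-2019", None]), format="%b-%Y")

# .dt.month and .dt.year are vectorized; Int64 (capital I) allows missing values
month = paid.dt.month.astype("Int64")
year = paid.dt.year.astype("Int64")
print(month.tolist())
print(year.tolist())
```

Numeric months are often easier to feed to a model than month names, but which representation is better depends on the downstream use.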


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
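
As a starting point for the "top 20" option, here is a sketch of collapsing rare categories into `'Other'`. It uses a made-up series and a top 2 instead of a top 20; leaving missing titles untouched is an assumption, not something the option specifies.

```python
import pandas as pd

# Made-up titles; "Teacher" and "Manager" are the two most frequent
titles = pd.Series(["Teacher", "Teacher", "Manager", "Manager", "Driver", "Chef", None])

top_n = 2  # the stretch option would use 20
top_titles = titles.value_counts().head(top_n).index

# Keep values that are in the top list or missing; replace everything else
collapsed = titles.where(titles.isin(top_titles) | titles.isna(), "Other")
print(collapsed.tolist())
```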

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [None]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [None]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [None]:
# %cd instacart_2017_05_01