_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [180]:
import pandas as pd

In [181]:
#https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [182]:
#!unzip LoanStats_2018Q4.csv.zip

In [183]:
path = '/Users/ridleyleisy/Documents/lambda/unit_one/DS-Unit-1-Sprint-1-Dealing-With-Data/thursday-make-features/LoanStats_2018Q4.csv'

In [184]:
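# skipfooter is only supported by the Python parsing engine, so request it explicitly
df = pd.read_csv(path, skiprows=1, skipfooter=2, engine='python')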


## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)
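
A minimal sketch of how `skiprows` and `skipfooter` behave, using a small in-memory file instead of the real download. The sample text is made up for illustration. The assumption is that the LendingClub CSV has one note line before the header and two summary lines at the end, which is what `skiprows=1, skipfooter=2` implies. `skipfooter` is only supported by the Python parsing engine, so `engine='python'` is passed explicitly.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: one note line, a header row,
# two data rows, and two footer lines.
raw = io.StringIO(
    "Notes offered by Prospectus\n"
    "loan_amnt,term,int_rate\n"
    "10000,36 months,10.33%\n"
    "2500,36 months,13.56%\n"
    "Total amount funded in policy code 1: 12500\n"
    "Total amount funded in policy code 2: 0\n"
)

# Skip the first line and the last two lines so only the table is parsed.
sample = pd.read_csv(raw, skiprows=1, skipfooter=2, engine='python')
print(sample)
```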

In [105]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,10000,10000,10000.0,36 months,10.33%,324.23,B,B1,...,,,DirectPay,N,,,,,,
1,,,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,...,,,Cash,N,,,,,,
2,,,12000,12000,12000.0,60 months,13.56%,276.49,C,C1,...,,,Cash,N,,,,,,
3,,,15000,15000,14975.0,60 months,14.47%,352.69,C,C2,...,,,Cash,N,,,,,,
4,,,16000,16000,16000.0,60 months,17.97%,406.04,D,D1,...,,,Cash,N,,,,,,


In [106]:
df.isnull().sum().sort_values(ascending=False)

id                                            128412
member_id                                     128412
url                                           128412
desc                                          128412
hardship_loan_status                          128411
hardship_amount                               128411
hardship_start_date                           128411
hardship_end_date                             128411
payment_plan_start_date                       128411
hardship_length                               128411
hardship_dpd                                  128411
hardship_payoff_balance_amount                128411
orig_projected_additional_accrued_interest    128411
hardship_status                               128411
hardship_last_payment_amount                  128411
hardship_reason                               128411
hardship_type                                 128411
deferral_term                                 128411
settlement_percentage                         

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can check which columns have a datatype of "object" (strings).

In [107]:
df

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,10000,10000,10000.0,36 months,10.33%,324.23,B,B1,...,,,DirectPay,N,,,,,,
1,,,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,...,,,Cash,N,,,,,,
2,,,12000,12000,12000.0,60 months,13.56%,276.49,C,C1,...,,,Cash,N,,,,,,
3,,,15000,15000,14975.0,60 months,14.47%,352.69,C,C2,...,,,Cash,N,,,,,,
4,,,16000,16000,16000.0,60 months,17.97%,406.04,D,D1,...,,,Cash,N,,,,,,
5,,,9600,9600,9600.0,36 months,23.40%,373.62,E,E1,...,,,Cash,N,,,,,,
6,,,4000,4000,4000.0,36 months,23.40%,155.68,E,E1,...,,,Cash,N,,,,,,
7,,,3500,3500,3500.0,36 months,20.89%,131.67,D,D4,...,,,Cash,N,,,,,,
8,,,9600,9600,9600.0,36 months,12.98%,323.37,B,B5,...,,,Cash,N,,,,,,
9,,,8000,8000,8000.0,36 months,23.40%,311.35,E,E1,...,,,Cash,N,,,,,,
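
The cell above displays the whole dataframe rather than the string columns specifically. One way to list only the "object" columns is `select_dtypes`. This is a minimal sketch on a made-up frame; the values are illustrative, and the string columns are declared as `object` explicitly so the sketch behaves the same across pandas versions.

```python
import pandas as pd

# Made-up frame standing in for the LendingClub dataframe.
toy = pd.DataFrame({
    'loan_amnt': [10000, 2500],
    'term': pd.Series(['36 months', '60 months'], dtype=object),
    'int_rate': pd.Series(['10.33%', '13.56%'], dtype=object),
})

# Keep only the columns whose dtype is object, then list their names.
string_columns = toy.select_dtypes(include='object').columns.tolist()
print(string_columns)
```

On the real data, the same call on `df` would show which columns still need converting to numbers.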


### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [108]:
pd.to_numeric(df['int_rate'].str.strip('%'))

0         10.33
1         13.56
2         13.56
3         14.47
4         17.97
5         23.40
6         23.40
7         20.89
8         12.98
9         23.40
10        26.31
11        18.94
12        19.92
13        17.97
14        20.89
15        11.80
16        23.40
17        15.02
18        11.31
19        10.33
20        27.27
21        16.14
22        26.31
23        15.02
24        18.94
25        11.31
26        11.31
27        14.47
28        16.14
29        14.47
          ...  
128382    26.31
128383    13.56
128384    16.91
128385    17.97
128386    11.55
128387    17.97
128388     8.46
128389    19.92
128390     6.67
128391    26.31
128392    16.91
128393     6.11
128394    11.55
128395    11.55
128396    15.02
128397    15.02
128398    18.94
128399    15.02
128400    15.02
128401    16.14
128402    22.35
128403    11.55
128404     7.84
128405    16.14
128406    13.56
128407    15.02
128408    15.02
128409    13.56
128410    11.06
128411    16.91
Name: int_rate, Length: 

Apply the function to the `int_rate` column
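
The cell above strips the percent signs with vectorized string methods, but it does not define a function or store the result. A minimal sketch of the function-based version follows. The name `remove_percent` and the sample values are hypothetical, and returning NaN for missing values is one possible choice.

```python
import pandas as pd

def remove_percent(value):
    """Turn a string like '10.33%' into the float 10.33; missing values become NaN."""
    if pd.isnull(value):
        return float('nan')
    return float(value.strip().strip('%'))

# Made-up values standing in for df['int_rate'].
rates = pd.Series(['10.33%', ' 13.56%', None])
cleaned = rates.apply(remove_percent)
print(cleaned)
```

On the real dataframe this would be `df['int_rate'] = df['int_rate'].apply(remove_percent)`. Because the function tolerates missing values, it could also be tried on other percent-formatted columns.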

### Clean `emp_title`

Look at top 20 titles

In [109]:
df['emp_title'].value_counts()

Teacher                             2090
Manager                             1773
Registered Nurse                     952
Driver                               924
RN                                   726
Supervisor                           697
Sales                                580
Project Manager                      526
General Manager                      523
Office Manager                       521
Owner                                420
Director                             402
Operations Manager                   387
Truck Driver                         387
Nurse                                326
Engineer                             325
Sales Manager                        304
manager                              301
Supervisor                           270
Administrative Assistant             269
Accountant                           268
Server                               265
Vice President                       261
Mechanic                             258
Account Manager 

How often is `emp_title` null?

In [110]:
df['emp_title'].isnull().sum()

20947

Clean the title and handle missing values

In [111]:
df['emp_title'] = df['emp_title'].str.lower()
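
The cell above only lowercases the titles. "Supervisor" appears twice in the value counts, which may mean some titles differ only by surrounding whitespace, so trimming is probably worthwhile too. The sketch below is one possible cleanup on made-up values; the placeholder `'unknown'` for missing titles is an arbitrary choice, not something the lesson specifies.

```python
import pandas as pd

# Made-up titles, including a trailing space and a missing value.
titles = pd.Series(['Supervisor', 'Supervisor ', 'manager', None])

cleaned_titles = (
    titles
    .str.strip()        # drop leading and trailing whitespace
    .str.lower()        # make capitalization consistent
    .fillna('unknown')  # label missing titles instead of leaving them null
)
print(cleaned_titles.value_counts())
```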

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)
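
A minimal sketch of how `str.contains` could build this feature, shown on made-up titles. Passing `na=False` is an assumption about how missing titles should be treated (as "not a manager").

```python
import pandas as pd

# Made-up titles standing in for df['emp_title'].
titles = pd.Series(['Project Manager', 'manager', 'Teacher', None])

# case=False ignores capitalization; na=False maps missing titles to False.
emp_title_manager = titles.str.contains('manager', case=False, na=False)
print(emp_title_manager)
```

On the real dataframe the result would be assigned to `df['emp_title_manager']`.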

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
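
A minimal sketch of the two steps, parsing with `to_datetime` and then reading components through the `.dt` accessor. The sample strings are made up, and the `Dec-2018` month-year style is an assumption about how the date columns are formatted.

```python
import pandas as pd

# Made-up month-year strings; the format is an assumption.
dates = pd.Series(['Dec-2018', 'Nov-2018', 'Jan-2019'])

# Convert the strings to datetimes, stating the format explicitly.
parsed = pd.to_datetime(dates, format='%b-%Y')

# Pull out individual components with the .dt accessor.
months = parsed.dt.month
years = parsed.dt.year
print(months.tolist(), years.tolist())
```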

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [178]:
pd.options.display.max_rows = 50
pd.options.display.max_columns = 50

In [123]:
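# Remove surrounding whitespace so the number comes first
df['term'] = df['term'].str.strip()

def strip_months(term):
    """Convert a string like '36 months' to the integer 36."""
    return int(term.split()[0])

df['term'] = df['term'].apply(strip_months)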

In [161]:
def loan_status(status):
    """Return 1 if the loan is 'Current' or 'Fully Paid', else 0."""
    return 1 if status in ('Current', 'Fully Paid') else 0

df['loan_status_is_great'] = df['loan_status'].apply(loan_status)

In [167]:
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'])

In [176]:
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
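
For the `emp_title` option above, one possible approach is to find the most frequent values with `value_counts` and replace everything else. This is a sketch only: `keep_top_n` is a hypothetical helper, the sample titles are made up, and leaving missing values untouched is a design choice rather than a requirement.

```python
import pandas as pd

def keep_top_n(series, n, other='Other'):
    """Replace values outside the n most frequent with `other`; leave missing values alone."""
    top = series.value_counts().head(n).index
    # where() keeps entries where the condition is True and substitutes `other` elsewhere.
    return series.where(series.isin(top) | series.isnull(), other)

# Made-up titles; on the real data n would be 20.
titles = pd.Series(['teacher', 'teacher', 'manager', 'manager', 'driver', None])
top_two = keep_top_n(titles, n=2)
print(top_two)
```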

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [179]:

'''
Finishing up a Medium article that I want to focus on.

'''


In [None]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [None]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [None]:
# %cd instacart_2017_05_01