
_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-05-03 02:08:47--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [              <=>   ]  21.40M   902KB/s    in 26s     

2019-05-03 02:09:14 (829 KB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22444881]



In [2]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [3]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd

In [5]:
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')
df.head()


| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 10000 | 10000 | 10000.0 | 36 months | 10.33% | 324.23 | B | B1 | ... | | | | N | | | | | | |
| 1 | | | 4000 | 4000 | 4000.0 | 36 months | 23.40% | 155.68 | E | E1 | ... | | | | N | | | | | | |
| 2 | | | 5000 | 5000 | 5000.0 | 36 months | 17.97% | 180.69 | D | D1 | ... | | | | N | | | | | | |
| 3 | | | 9600 | 9600 | 9600.0 | 36 months | 12.98% | 323.37 | B | B5 | ... | | | | N | | | | | | |
| 4 | | | 2500 | 2500 | 2500.0 | 36 months | 13.56% | 84.92 | C | C1 | ... | | | | N | | | | | | |
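
The effect of `skiprows` and `skipfooter` can be seen on a tiny in-memory file. This is only a sketch: the sample text below is hypothetical (the wording of the footer lines is an assumption), but it has the same shape as the download, with a notice line above the header and two summary lines at the end. `skipfooter` is only supported by the Python parsing engine, so the sketch requests that engine explicitly.

```python
import io

import pandas as pd

# Hypothetical miniature file: one notice line on top, two summary lines at the bottom.
raw = (
    "Notes offered by Prospectus\n"
    "loan_amnt,term,int_rate\n"
    "10000, 36 months, 10.33%\n"
    "4000, 36 months, 23.40%\n"
    "Total amount funded: 14000\n"
    "Total loans: 2\n"
)

# Skip the notice line, drop the two footer lines, and use the Python engine,
# which is the only engine that supports skipfooter.
demo = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine="python")
print(demo.shape)
print(demo.columns.tolist())
```

Only the two data rows survive, and the header row is read from the second line of the text.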


## Work with strings

For machine learning, we usually want to replace strings with numbers.

To find them, we can list the columns whose dtype is `object`, which is how pandas stores strings here.

In [64]:
object_columns = df.select_dtypes(include='object')
object_list = object_columns.columns.tolist()
object_list



['grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'verification_status',
 'loan_status',
 'pymnt_plan',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'earliest_cr_line',
 'revol_util',
 'initial_list_status',
 'application_type',
 'verification_status_joint',
 'sec_app_earliest_cr_line',
 'hardship_flag',
 'hardship_type',
 'hardship_reason',
 'hardship_status',
 'hardship_start_date',
 'hardship_end_date',
 'payment_plan_start_date',
 'hardship_loan_status',
 'debt_settlement_flag',
 'debt_settlement_flag_date',
 'settlement_status',
 'settlement_date']

### Convert `int_rate`



In [0]:
df['int_rate'] = df['int_rate'].str.strip('%').astype(float)

In [8]:
df['int_rate'].head()

0    10.33
1    23.40
2    17.97
3    12.98
4    13.56
Name: int_rate, dtype: float64

### Clean `emp_title`

Look at the first 20 titles

In [9]:
df['emp_title'].head(20)

0                                          NaN
1                                     Security
2                               Administrative
3                                          NaN
4                                         Chef
5                                  Postmaster 
6                                     Operator
7                           Nursing Supervisor
8                                      Manager
9                             Material Handler
10                                         NaN
11                   Instructional Coordinator
12                                         NaN
13            Financial Relationship Associate
14                         Sale Representative
15                         driver coordinator 
16                               gas attendant
17    Assistant Athletic Director of Marketing
18                            Sr Sales Manager
19                                 Casino Host
Name: emp_title, dtype: object

How often is `emp_title` null?

In [10]:
df.emp_title.isnull().sum()

20947

Clean the title and handle missing values

In [11]:
df['emp_title'] = df['emp_title'].fillna('other')
df.emp_title.isnull().sum()


0

In [12]:
df.emp_title.head()

0             other
1          Security
2    Administrative
3             other
4              Chef
Name: emp_title, dtype: object
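
Filling missing values is only part of cleaning. The listing above also shows titles with trailing spaces ("Postmaster ") and inconsistent capitalization ("driver coordinator "). A minimal sketch of one way to normalize them, on a small hypothetical sample rather than the full column:

```python
import pandas as pd

# Hypothetical sample that mirrors the quirks seen in emp_title.
titles = pd.Series(["Postmaster ", "driver coordinator ", "Manager", "manager", None])

# Strip surrounding whitespace, normalize capitalization, then fill missing values.
cleaned = titles.str.strip().str.title().fillna("other")
print(cleaned.tolist())
```

One caveat with this choice: `str.title` also rewrites acronyms (for example "RN" becomes "Rn"), so lowercasing with `str.lower` may be the safer normalization for some columns.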

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title_manager'] = df.emp_title.str.lower().str.contains('manager')

In [14]:
df.emp_title_manager.value_counts()

False    110527
True      17885
Name: emp_title_manager, dtype: int64
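
The flag above works because missing titles were already filled. As a sketch of an alternative, `str.contains` can handle both the case-insensitive match and the missing values itself through its `case` and `na` parameters. The sample values here are hypothetical.

```python
import pandas as pd

# Hypothetical sample, including a missing title.
titles = pd.Series(["Sr Sales Manager", "Chef", None, "manager"])

# case=False ignores capitalization; na=False treats missing titles as "not a manager".
is_manager = titles.str.contains("manager", case=False, na=False)
print(is_manager.tolist())

# Cast to integers if the model needs a 0/1 feature.
print(is_manager.astype(int).tolist())
```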

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
#Create a list of all date columns (their names end in '_d')
dates_features = [col for col in df.columns if col.endswith('_d')]

In [16]:
#Use the list to convert each date column to datetime
for feature in dates_features:
  df[feature] = pd.to_datetime(df[feature])

df[dates_features].dtypes

issue_d               datetime64[ns]
last_pymnt_d          datetime64[ns]
next_pymnt_d          datetime64[ns]
last_credit_pull_d    datetime64[ns]
dtype: object
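
When the date layout is known, passing an explicit `format` to `pd.to_datetime` avoids relying on format inference. The sketch below assumes month-year strings such as "Dec-2018"; that layout is an assumption about the raw columns, since the notebook does not display them before conversion.

```python
import pandas as pd

# Hypothetical raw values; the "Mon-YYYY" layout is assumed, not shown in the notebook.
raw_dates = pd.Series(["Dec-2018", "Apr-2019", None])

# %b is the abbreviated month name and %Y the four-digit year; missing values become NaT.
parsed = pd.to_datetime(raw_dates, format="%b-%Y")
print(parsed)
```

Each parsed value lands on the first day of its month, and the missing entry becomes `NaT`.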

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise, it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
df.term.head()

In [18]:
df['term'] = df['term'].str.replace('months', '', regex=False).str.strip().astype('int64')
df.term.head()


0    36
1    36
2    36
3    36
4    36
Name: term, dtype: int64
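
Another way to pull the number out of `term` is to extract the digits with a regular expression. This is a sketch on hypothetical sample values, and it also shows why `str.strip` with a multi-character argument deserves care: the argument is treated as a set of characters, not as a suffix.

```python
import pandas as pd

# Hypothetical sample values in the same " 36 months" shape as the term column.
term = pd.Series([" 36 months", " 60 months"])

# Capture the first run of digits, then cast to integer.
term_months = term.str.extract(r"(\d+)", expand=False).astype("int64")
print(term_months.tolist())  # [36, 60]

# strip(" months") removes any of the characters ' ', m, o, n, t, h, s from both ends,
# so on other inputs it can remove far more than the word "months".
print(repr("hot months".strip(" months")))  # ''
```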

In [0]:
df.loan_status.head()

In [0]:
df.loan_status.value_counts()

In [31]:
#convert current and fully paid status to string '1' 
df['loan_status_is_great'] = df.loan_status.str.replace('Current' , '1').str.replace("Fully Paid", '1')
df['loan_status_is_great'].value_counts()

1                     125907
Late (31-120 days)      1168
In Grace Period          666
Late (16-30 days)        350
Charged Off              319
Default                    2
Name: loan_status_is_great, dtype: int64

In [49]:
#Collect the remaining unique values, excluding the string '1'
zeroes = [status for status in df['loan_status_is_great'].unique() if status != '1']
zeroes

['Late (31-120 days)',
 'In Grace Period',
 'Late (16-30 days)',
 'Charged Off',
 'Default']

In [52]:
#convert remaining values in list to string '0'
df['loan_status_is_great'] = df['loan_status_is_great'].replace(to_replace = zeroes, value = '0')
df['loan_status_is_great'].value_counts()

1    125907
0      2505
Name: loan_status_is_great, dtype: int64

In [55]:
#cast series to int type
df['loan_status_is_great'] = df['loan_status_is_great'].astype('int64')
df['loan_status_is_great'].dtype

dtype('int64')
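
The same 0/1 column can also be built in one step with `isin`, which avoids the intermediate string replacements. A minimal sketch on a hypothetical sample of status values taken from the counts above:

```python
import pandas as pd

# Hypothetical sample of loan_status values.
loan_status = pd.Series(
    ["Current", "Fully Paid", "Late (31-120 days)", "Charged Off", "Default"]
)

# isin gives True for the "great" statuses; casting the booleans yields 1 and 0.
great = loan_status.isin(["Current", "Fully Paid"]).astype("int64")
print(great.tolist())  # [1, 1, 0, 0, 0]
```

Because `isin` compares whole values, it cannot accidentally rewrite part of a longer status string the way a substring replacement could.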

In [59]:
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_month'].value_counts()

4.0     120098
3.0       2660
2.0       1956
1.0       1571
12.0      1055
11.0       679
10.0       234
Name: last_pymnt_d_month, dtype: int64

In [62]:
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year
df['last_pymnt_d_year'].value_counts()

2019.0    126285
2018.0      1968
Name: last_pymnt_d_year, dtype: int64
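
The months and years above show up as floats (4.0, 2019.0) because some loans have no payment date, and a missing value forces the column to float. If integer values are preferred, one option is the pandas nullable integer dtype `Int64`. A sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical payment dates with one missing value.
last_pymnt_d = pd.Series(pd.to_datetime(["2019-04-01", None, "2018-12-01"]))

# With a NaT present, .dt.month comes back as float64.
month_float = last_pymnt_d.dt.month
print(month_float.dtype)

# The nullable Int64 dtype keeps whole numbers and represents the gap as <NA>.
month_int = month_float.astype("Int64")
print(month_int)
```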

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

In [72]:
#Check the first rows of each object column to find one containing %
for target in object_list:
  if df[target].head().str.contains('%').sum() > 0:
    target_feature = target
    
target_feature

'revol_util'

In [73]:
df['revol_util'].head()

0      38%
1    19.2%
2    19.1%
3    11.5%
4    10.3%
Name: revol_util, dtype: object

In [74]:
df['revol_util'] = df['revol_util'].str.strip('%')
df['revol_util'].head()

0      38
1    19.2
2    19.1
3    11.5
4    10.3
Name: revol_util, dtype: object

In [76]:
df['revol_util'] = df['revol_util'].astype(float)
df['revol_util'].head()

0    38.0
1    19.2
2    19.1
3    11.5
4    10.3
Name: revol_util, dtype: float64
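
The conversion works even with missing values because the `.str` methods pass them through as NaN, which `float` can represent. A small sketch that makes this explicit, using `pd.to_numeric` as an alternative to `astype(float)` on hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample with a missing utilization value.
revol_util = pd.Series(["38%", "19.2%", None])

# rstrip removes the trailing percent sign; to_numeric converts the rest to floats.
# errors="coerce" turns anything unparseable into NaN instead of raising.
as_float = pd.to_numeric(revol_util.str.rstrip("%"), errors="coerce")
print(as_float.tolist())
```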

In [95]:
#value_counts sorts by frequency, so its index lists titles from most to least common
emp_titles_list = df['emp_title'].value_counts().index.tolist()

#index 0 is the 'other' placeholder for missing titles, so indices 1-20 are the top 20 real titles;
#replace everything from index 21 onward with 'other'
df['emp_title'] = df['emp_title'].replace(to_replace = emp_titles_list[21:], value = 'other')
df['emp_title'].value_counts()


other                       115709
Teacher                       2090
Manager                       1773
Registered Nurse               952
Driver                         924
RN                             726
Supervisor                     697
Sales                          580
Project Manager                526
General Manager                523
Office Manager                 521
Owner                          420
Director                       402
Operations Manager             387
Truck Driver                   387
Nurse                          326
Engineer                       325
Sales Manager                  304
manager                        301
Supervisor                     270
Administrative Assistant       269
Name: emp_title, dtype: int64
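
A variation on the same idea, sketched on hypothetical data: select the top titles explicitly (leaving out the 'other' placeholder) and use `where` with `isin`, so the result does not depend on where 'other' happens to rank. The variable `n` and the sample values are illustrative only.

```python
import pandas as pd

# Hypothetical sample; 'other' stands in for missing titles, as in the notebook.
titles = pd.Series(
    ["Teacher", "Teacher", "Teacher", "Manager", "Manager", "Chef",
     "other", "other", "other", "other"]
)

n = 2  # number of real titles to keep (the notebook keeps 20)

# Rank the real titles only, so the placeholder never takes one of the top slots.
counts = titles[titles != "other"].value_counts()
top_titles = counts.index[:n]

# Keep a title if it is in the top n; otherwise replace it with 'other'.
reduced = titles.where(titles.isin(top_titles), "other")
print(reduced.value_counts())
```

The counts above also list "Manager" and "manager" separately and "Supervisor" twice, which suggests stripping whitespace and normalizing case before ranking would merge some of these groups.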

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01