_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

In [5]:
!mkdir -p downloads
%cd downloads

/Users/work/.work/src/projects/lambdaschool/unit-1/DS-Unit-1-Sprint-2-Data-Wrangling/module4-make-features/downloads


In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip

--2019-01-17 09:11:56--  https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q3.csv.zip’

LoanStats_2018Q3.cs     [           <=>      ]  21.42M   647KB/s    in 32s     

2019-01-17 09:12:29 (686 KB/s) - ‘LoanStats_2018Q3.csv.zip’ saved [22461905]



In [9]:
!unzip LoanStats_2018Q3.csv.zip

Archive:  LoanStats_2018Q3.csv.zip
  inflating: LoanStats_2018Q3.csv    


In [24]:
!head -3 LoanStats_2018Q3.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","o

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Load LendingClub data

In [101]:
file = 'LoanStats_2018Q3.csv'
# skipfooter is only supported by the Python parser engine
loans = pd.read_csv(file, skiprows=1, skipfooter=3, engine='python')

  


In [102]:
pd.options.display.max_columns = len(loans.columns)
loans.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | | | | | |
| member_id | | | | | |
| loan_amnt | 20000 | 25000 | 30000 | 6000 | 10650 |
| funded_amnt | 20000 | 25000 | 30000 | 6000 | 10650 |
| funded_amnt_inv | 20000 | 25000 | 30000 | 6000 | 10650 |
| term | 60 months | 60 months | 36 months | 36 months | 36 months |
| int_rate | 17.97% | 13.56% | 18.94% | 7.84% | 7.84% |
| installment | 507.55 | 576.02 | 1098.78 | 187.58 | 332.95 |
| grade | D | C | D | A | A |
| sub_grade | D1 | C1 | D2 | A4 | A4 |


In [103]:
def drop_empty_columns(df):
    """Return a copy of df without the columns that contain only missing values."""
    return df.dropna(axis='columns', how='all')

cleanly_loans = drop_empty_columns(loans)
cleanly_loans.shape, loans.shape

((128194, 141), (128194, 145))

In [104]:
loans['emp_title'].value_counts().head(20)

Teacher               2294
Manager               2075
Owner                 1231
Driver                1089
Registered Nurse       944
Supervisor             810
RN                     757
Sales                  726
Project Manager        637
General Manager        548
Office Manager         542
Director               482
owner                  398
Engineer               383
Truck Driver           367
Operations Manager     366
President              350
Sales Manager          323
Supervisor             321
Server                 319
Name: emp_title, dtype: int64

## Work with strings

In [105]:
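from pandas.api.types import is_bool_dtype, is_numeric_dtype

def all_numeric(df):
    # Comparing dtypes to np.number does not reliably match integer columns,
    # so test each column's dtype explicitly instead.
    return all(is_numeric_dtype(dtype) or is_bool_dtype(dtype)
               for dtype in df.dtypes)

def no_nulls(df):
    # True only when no column contains a missing value
    return not df.isnull().any().any()

def ready_for_sklearn(df):
    return all_numeric(df) and no_nulls(df)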

In [106]:
ready_for_sklearn(cleanly_loans), ready_for_sklearn(loans)

(False, False)
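
Both checks fail, but the result alone does not say which columns are responsible. The following is a minimal sketch of how one might find the offending columns and keep only the usable ones. The small frame `df` and its `is_joint` column are hypothetical stand-ins, not part of the LendingClub file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loans frame.
df = pd.DataFrame({
    'loan_amnt': [20000, 25000, 30000],
    'int_rate': [' 17.97%', ' 13.56%', ' 18.94%'],  # strings, not numbers
    'dti': [18.2, np.nan, 25.1],                     # numeric, one gap
    'is_joint': [False, True, False],                # hypothetical column
})

# Columns whose dtype is neither numeric nor boolean.
non_numeric = df.select_dtypes(exclude=['number', 'bool']).columns.tolist()

# Columns containing at least one missing value.
has_nulls = df.columns[df.isnull().any()].tolist()

# Keep only numeric/boolean columns that are fully populated.
usable = df.select_dtypes(include=['number', 'bool']).dropna(axis='columns')

print(non_numeric)              # ['int_rate']
print(has_nulls)                # ['dti']
print(usable.columns.tolist())  # ['loan_amnt', 'is_joint']
```

Dropping columns is the bluntest option. Converting strings to numbers and filling gaps keeps more information.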

In [107]:
loans.select_dtypes('object').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128194 entries, 0 to 128193
Data columns (total 37 columns):
term                         128194 non-null object
int_rate                     128194 non-null object
grade                        128194 non-null object
sub_grade                    128194 non-null object
emp_title                    114757 non-null object
emp_length                   117807 non-null object
home_ownership               128194 non-null object
verification_status          128194 non-null object
issue_d                      128194 non-null object
loan_status                  128194 non-null object
pymnt_plan                   128194 non-null object
purpose                      128194 non-null object
title                        128194 non-null object
zip_code                     128194 non-null object
addr_state                   128194 non-null object
earliest_cr_line             128194 non-null object
revol_util                   128065 non-null object
initi

In [108]:
loans.int_rate.head().values

array([' 17.97%', ' 13.56%', ' 18.94%', '  7.84%', '  7.84%'],
      dtype=object)
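
The values above are strings with leading spaces and a trailing percent sign, so they cannot be used as numbers yet. A minimal sketch of the conversion follows. It uses a small stand-in Series named `rates` (an illustrative name) whose values mirror the output above.

```python
import pandas as pd

# Stand-in for loans['int_rate']; values copied from the output above.
rates = pd.Series([' 17.97%', ' 13.56%', ' 18.94%', '  7.84%', '  7.84%'])

# Strip surrounding whitespace, drop the trailing '%', then cast to float.
rates_float = rates.str.strip().str.rstrip('%').astype(float)

print(rates_float.tolist())  # [17.97, 13.56, 18.94, 7.84, 7.84]
```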

In [109]:
def is_manager(titles):
    # na=False treats a missing title as "not a manager" instead of returning NaN
    return titles.str.contains('Manager', na=False)
    
loans['emp_title'].value_counts(), is_manager(loans['emp_title'])

(Teacher                                  2294
 Manager                                  2075
 Owner                                    1231
 Driver                                   1089
 Registered Nurse                          944
 Supervisor                                810
 RN                                        757
 Sales                                     726
 Project Manager                           637
 General Manager                           548
 Office Manager                            542
 Director                                  482
 owner                                     398
 Engineer                                  383
 Truck Driver                              367
 Operations Manager                        366
 President                                 350
 Sales Manager                             323
 Supervisor                                321
 Server                                    319
 Nurse                                     317
 teacher     
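
The counts above list "Owner" and "owner" separately, and "Supervisor" appears twice, possibly because of stray whitespace. A hedged sketch of normalizing titles before deriving a feature is shown below. The sample values are hypothetical, and whether whitespace is the actual cause in the real file is an assumption.

```python
import pandas as pd

# Hypothetical sample; None stands in for a missing emp_title.
titles = pd.Series(['Owner', 'owner', 'Supervisor ', 'Sales Manager', None])

# Trim whitespace and unify case so that variants collapse into one label.
normalized = titles.str.strip().str.title()

# na=False maps missing titles to False instead of NaN.
is_manager_flag = normalized.str.contains('Manager', na=False)

print(normalized.value_counts().to_dict())
print(is_manager_flag.tolist())  # [False, False, False, True, False]
```

Title-casing also changes abbreviations ("RN" becomes "Rn"), so lowercasing with `.str.lower()` may be the safer choice when the label is only used for matching.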

## Work with dates

In [110]:
# infer_datetime_format is deprecated since pandas 2.0; the format is inferred by default
loans['issue_d'] = pd.to_datetime(loans['issue_d'])

In [111]:
loans.issue_d.describe()

count                  128194
unique                      3
top       2018-08-01 00:00:00
freq                    46079
first     2018-07-01 00:00:00
last      2018-09-01 00:00:00
Name: issue_d, dtype: object

In [112]:
loans['issue_year'] = loans['issue_d'].dt.year
loans['issue_month'] = loans['issue_d'].dt.month

In [113]:
loans['earliest_cr_line'] = pd.to_datetime(loans['earliest_cr_line'])

In [114]:
(loans['issue_d'] - loans['earliest_cr_line']).head()

0   3499 days
1   3530 days
2   3836 days
3   6727 days
4   5783 days
dtype: timedelta64[ns]
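
Subtracting two datetime columns yields `timedelta64` values, which are not plain numbers. The sketch below shows one way to turn such a difference into numeric features. The dates are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical issue dates and earliest credit line dates.
issue = pd.to_datetime(pd.Series(['2018-09-01', '2018-08-01']))
earliest = pd.to_datetime(pd.Series(['2009-02-01', '2008-12-01']))

delta = issue - earliest   # timedelta64[ns]
days = delta.dt.days       # whole days as integers
years = days / 365.25      # approximate length of credit history in years

print(days.tolist())  # [3499, 3530]
```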

In [115]:
loans['days_from_earliest_credit_to_issue'] = loans['issue_d'] - loans['earliest_cr_line']

ASSIGNMENT
- Replicate the lesson code.
- Convert the term column from string to integer.
- Make a column named loan_status_is_great. It should contain the integer 1 if loan_status is "Current" or "Fully Paid", and the integer 0 otherwise.
- Make last_pymnt_d_month and last_pymnt_d_year columns.

STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the emp_title column to replace titles with 'Other' if the title is not in the top 20.
- Process the dataframe so that ready_for_sklearn(df) returns True. You can drop columns, or select the subset of numeric columns with no missing values. (Or you can try automating the process to handle missing values and convert objects to numbers!)
- Take initiative and work on your own ideas!

Instacart options:
- Read Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera, especially the Feature Engineering section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of Simple Exploration Notebook - Instacart. (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!


## Assignment

In [116]:
loans['term'] = pd.to_numeric(loans['term'].str.replace(' months', ''))
loans['term'].dtype

dtype('int64')

In [117]:
loans['loan_status_is_great'] = (
    loans['loan_status'].isin(['Fully Paid', 'Current']).astype('int64')
)

pd.crosstab(loans['loan_status'], loans['loan_status_is_great'])

| loan_status | loan_status_is_great = 0 | loan_status_is_great = 1 |
|---|---|---|
| Charged Off | 110 | 0 |
| Current | 0 | 121082 |
| Fully Paid | 0 | 4786 |
| In Grace Period | 948 | 0 |
| Late (16-30 days) | 348 | 0 |
| Late (31-120 days) | 920 | 0 |


In [118]:
loans['last_pymnt_d'] = (
    pd.to_datetime(loans['last_pymnt_d'])
)

loans['last_pymnt_d'].head()

0   2018-12-01
1   2018-12-01
2   2018-12-01
3   2018-12-01
4   2018-12-01
Name: last_pymnt_d, dtype: datetime64[ns]

In [119]:
loans['last_pymnt_d_month'] = loans['last_pymnt_d'].dt.month
loans['last_pymnt_d_year'] = loans['last_pymnt_d'].dt.year

In [120]:
loans[['last_pymnt_d', 'last_pymnt_d_year', 'last_pymnt_d_month']].sample(5)

| | last_pymnt_d | last_pymnt_d_year | last_pymnt_d_month |
|---|---|---|---|
| 112307 | 2018-12-01 | 2018.0 | 12.0 |
| 66121 | 2018-12-01 | 2018.0 | 12.0 |
| 44252 | 2018-12-01 | 2018.0 | 12.0 |
| 18198 | 2018-12-01 | 2018.0 | 12.0 |
| 78505 | 2018-12-01 | 2018.0 | 12.0 |
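
The year and month come out as floats (2018.0, 12.0), most likely because some rows have no payment date and NaN cannot be stored in a regular integer column. A minimal sketch of keeping integer values with pandas' nullable `Int64` dtype follows. The dates are hypothetical.

```python
import pandas as pd

# Hypothetical payment dates; None stands in for a loan with no payment yet.
dates = pd.to_datetime(pd.Series(['2018-12-01', None, '2018-11-01']))

# The nullable Int64 dtype keeps integers and marks the gap as <NA>.
year_int = dates.dt.year.astype('Int64')
month_int = dates.dt.month.astype('Int64')

print(year_int.dropna().tolist())   # [2018, 2018]
print(month_int.dropna().tolist())  # [12, 11]
```

The missing rows still need a decision (fill, flag, or drop) before the columns are passed to a model that does not accept missing values.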


## Stretch

#### LendingClub

In [121]:
top_20_employee_titles = (
    loans['emp_title'].value_counts()[:20].index.tolist()
)
is_not_in_top_20 = ~loans['emp_title'].isin(top_20_employee_titles)

loans.loc[is_not_in_top_20, 'emp_title'] = 'Other'

In [122]:
loans[['emp_title']].sample(10)

| | emp_title |
|---|---|
| 39753 | Other |
| 104129 | Other |
| 28580 | Other |
| 102761 | Other |
| 35004 | RN |
| 9327 | Other |
| 29299 | Other |
| 850 | Other |
| 35654 | Other |
| 19271 | Other |


In [126]:
loans['emp_title'].value_counts()

Other                 113232
Teacher                 2294
Manager                 2075
Owner                   1231
Driver                  1089
Registered Nurse         944
Supervisor               810
RN                       757
Sales                    726
Project Manager          637
General Manager          548
Office Manager           542
Director                 482
owner                    398
Engineer                 383
Truck Driver             367
Operations Manager       366
President                350
Sales Manager            323
Supervisor               321
Server                   319
Name: emp_title, dtype: int64
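
The same idea can be wrapped in a reusable helper. The sketch below is one possible variant, not the lesson's method: the function name `keep_top_n` and the sample data are hypothetical, and unlike the cell above it leaves missing titles missing instead of folding them into 'Other'.

```python
import pandas as pd

def keep_top_n(series, n, other='Other'):
    """Replace values outside the n most frequent with `other`; keep NaN as is."""
    top = series.value_counts().index[:n]  # value_counts sorts by frequency
    return series.where(series.isin(top) | series.isna(), other)

# Hypothetical sample; None stands in for a missing emp_title.
titles = pd.Series(['Teacher', 'Teacher', 'Manager', 'Manager', 'Chef', None])
result = keep_top_n(titles, n=2)

print(result.tolist()[:5])  # ['Teacher', 'Teacher', 'Manager', 'Manager', 'Other']
```

Keeping missing values separate lets a later step decide whether "no title given" deserves its own category.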

In [135]:
# int_rate has no missing values (see the info() output above), so no fill is needed
loans['int_rate'] = (
    loans['int_rate'].str.strip().str.rstrip('%').astype('float64')
)

loans['int_rate'].sample(5)

55853     13.56
26159     13.56
102792    16.91
123029    19.92
64006     11.55
Name: int_rate, dtype: float64
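
The stretch option asks for the other percent column, presumably `revol_util`, which has missing values according to the `info()` output earlier. A hedged sketch of one way to handle it is shown below. The sample values are hypothetical, and filling with the median is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for loans['revol_util'].
revol_util = pd.Series(['45.3%', np.nan, '0%', ' 101.2%'])

# Strip whitespace and '%', then convert; missing entries stay NaN.
cleaned = pd.to_numeric(revol_util.str.strip().str.rstrip('%'), errors='coerce')

# Record which rows were missing before filling them.
was_missing = cleaned.isna()
filled = cleaned.fillna(cleaned.median())

print(filled.tolist())       # [45.3, 45.3, 0.0, 101.2]
print(was_missing.tolist())  # [False, True, False, False]
```

Keeping the `was_missing` flag as its own column preserves the information that a value was imputed.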

#### Instacart

In [138]:
%cd instacart_2017_05_01

/Users/work/.work/src/projects/lambdaschool/unit-1/DS-Unit-1-Sprint-2-Data-Wrangling/module4-make-features/downloads/instacart_2017_05_01


In [139]:
!ls

aisles.csv                order_products__prior.csv orders.csv
departments.csv           order_products__train.csv products.csv


In [142]:
pd.options.mode.chained_assignment = None

order_products_train_df = pd.read_csv("order_products__train.csv")
order_products_prior_df = pd.read_csv("order_products__prior.csv")
orders_df = pd.read_csv("orders.csv")
products_df = pd.read_csv("products.csv")
aisles_df = pd.read_csv("aisles.csv")
departments_df = pd.read_csv("departments.csv")

In [143]:
orders_df.head()

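| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |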


In [168]:
orders_df['user_id'].nunique()

206209
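
With one row per order and many orders per user, a natural next step is to aggregate orders into per-user features. The sketch below is an assumption about what such features might look like, not a reproduction of any feature from the winner's interview. The feature names and the tiny `orders` frame are hypothetical, and only the column names come from `orders_df.head()` above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of orders_df with two users.
orders = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747, 11, 12],
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'order_hour_of_day': [8, 7, 12, 20, 22],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 7.0],
})

# One row per user: order count, typical gap between orders, typical hour.
user_features = orders.groupby('user_id').agg(
    n_orders=('order_number', 'max'),
    mean_days_between_orders=('days_since_prior_order', 'mean'),
    mean_order_hour=('order_hour_of_day', 'mean'),
)

print(user_features)
```

The first order of each user has no prior order, and `mean` skips that NaN by default.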