<a href="https://colab.research.google.com/github/trista-paul/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/module4-make-features/LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip

--2019-01-17 17:24:58--  https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q3.csv.zip’

LoanStats_2018Q3.cs     [         <=>        ]  21.42M  1.67MB/s    in 13s     

2019-01-17 17:25:16 (1.66 MB/s) - ‘LoanStats_2018Q3.csv.zip’ saved [22461905]



In [2]:
!unzip LoanStats_2018Q3.csv.zip

Archive:  LoanStats_2018Q3.csv.zip
  inflating: LoanStats_2018Q3.csv    


In [3]:
!head LoanStats_2018Q3.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

In [5]:
import pandas as pd
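# skiprows=1 skips the prospectus notice on the first line;
# skipfooter is only supported by the Python parser engine, so request it explicitly
df = pd.read_csv('LoanStats_2018Q3.csv', skiprows=1, skipfooter=2, engine='python')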

  


In [6]:
df['hardship_payoff_balance_amount'].isnull().sum()/len(df) # fraction of missing values: over 99.9% of this column is null

0.999898591197716
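
The same check can be run on every column at once. This is a minimal sketch on a made-up toy frame (the column values and the 0.5 cutoff are assumptions for illustration, not part of the LendingClub data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up for illustration)
toy = pd.DataFrame({
    'loan_amnt': [1000, 2500, 4000, 1200],
    'hardship_amount': [np.nan, np.nan, np.nan, 50.0],
    'emp_title': ['Teacher', None, 'Owner', 'Driver'],
})

# isnull() gives booleans; their column-wise mean is the fraction missing
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Hypothetical cutoff: keep columns with at most half of their values missing
threshold = 0.5
mostly_present = toy.loc[:, toy.isnull().mean() <= threshold]
print(mostly_present.columns.tolist())
```

Whether to drop a mostly-empty column is a judgment call; a column that is almost always null can still carry signal when it is filled in.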

## Work with strings

In [8]:
import numpy as np
from pandas.api.types import is_numeric_dtype

def all_numeric(df):
  # True only if every column has a numeric dtype (is_numeric_dtype also accepts bool)
  return all(is_numeric_dtype(dtype) for dtype in df.dtypes)

def no_nulls(df):
  # True only if no column contains a missing value
  return not df.isnull().any().any()

def ready_for_sklearn(df):
  # scikit-learn estimators generally expect numeric features with no missing values
  return all_numeric(df) and no_nulls(df)

ready_for_sklearn(df)

False

In [9]:
all_numeric(df)

False

In [10]:
df.select_dtypes('object').info() # 37 columns have object dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128194 entries, 0 to 128193
Data columns (total 37 columns):
term                         128194 non-null object
int_rate                     128194 non-null object
grade                        128194 non-null object
sub_grade                    128194 non-null object
emp_title                    114757 non-null object
emp_length                   117807 non-null object
home_ownership               128194 non-null object
verification_status          128194 non-null object
issue_d                      128194 non-null object
loan_status                  128194 non-null object
pymnt_plan                   128194 non-null object
purpose                      128194 non-null object
title                        128194 non-null object
zip_code                     128194 non-null object
addr_state                   128194 non-null object
earliest_cr_line             128194 non-null object
revol_util                   128065 non-null object
initi

In [15]:
df.int_rate.head() # stored as strings because of the percent sign
def remove_percent(string):
  return float(string.replace('%', ''))
# int_rate has no missing values here; handle any nulls before applying a function like this
df['int_rate'] = df['int_rate'].apply(remove_percent)
df.int_rate.head()

0    17.97
1    13.56
2    18.94
3     7.84
4     7.84
Name: int_rate, dtype: float64
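
As an alternative to `apply`, the percent sign can be stripped with pandas' vectorized `.str` methods, which pass missing values through as `NaN` instead of raising. A minimal sketch on a made-up sample (the `rates` values are illustrative, not taken from the file):

```python
import numpy as np
import pandas as pd

# Made-up sample of rate strings, including one missing value
rates = pd.Series([' 17.97%', ' 13.56%', np.nan, '  7.84%'])

# Strip whitespace, drop the trailing percent sign, then convert to float;
# the missing entry stays NaN throughout
cleaned = rates.str.strip().str.rstrip('%').astype(float)
print(cleaned)
```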

In [16]:
# Clean emp_title
# look at the top 20 titles
df['emp_title'].value_counts().head(20) # near-duplicates: 'Owner' vs 'owner', 'Supervisor' listed twice (likely stray whitespace), many kinds of manager

Teacher               2294
Manager               2075
Owner                 1231
Driver                1089
Registered Nurse       944
Supervisor             810
RN                     757
Sales                  726
Project Manager        637
General Manager        548
Office Manager         542
Director               482
owner                  398
Engineer               383
Truck Driver           367
Operations Manager     366
President              350
Sales Manager          323
Supervisor             321
Server                 319
Name: emp_title, dtype: int64

In [41]:
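# how often is emp_title null? (a cell displays only its last expression, so this count is not shown)
df['emp_title'].isnull().sum()
# check the type of the first value
type(df['emp_title'][0])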

str

In [42]:
#clean the title and handle missing values
def clean_title(x):
    if isinstance(x, str):        
        return x.strip().title()
    else:
        return 'Unknown'

df['emp_title'] = df['emp_title'].apply(clean_title)
df['emp_title'].isnull().sum()

0
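
The same cleanup can also be written with vectorized `.str` methods plus `fillna`. A minimal sketch on a made-up sample of titles (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up sample with stray whitespace, mixed case, and a missing value
titles = pd.Series(['Teacher', ' owner', 'Owner', 'Supervisor ', np.nan, 'RN'])

# Strip whitespace and title-case each string, then label missing values
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.value_counts())
```

Note that `.title()` also turns acronyms such as `RN` into `Rn`, so title-casing merges `Owner` and `owner` but will not merge `RN` with `Registered Nurse`.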

In [43]:
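# Every row shows as 'Unknown' below, which means emp_title held no strings when clean_title ran
# (likely overwritten by an earlier out-of-order run); reload df and rerun the cells in order to reproduce the cleaning
df['emp_title'].value_counts().head(20)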

Unknown    128194
Name: emp_title, dtype: int64

In [40]:
#create emp_title_manager
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')
df['emp_title_manager'].value_counts()

False    128194
Name: emp_title_manager, dtype: int64
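
`str.contains` is case-sensitive by default and returns a missing value for missing inputs. A minimal sketch of a more forgiving manager flag, on a made-up sample (the titles are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up sample of titles with mixed case and a missing value
titles = pd.Series(['Project Manager', 'general manager', 'Teacher', np.nan, 'Owner'])

# case=False matches 'manager' in any casing; na=False treats missing titles as non-matches
is_manager = titles.str.contains('manager', case=False, na=False)
print(is_manager.value_counts())
```

With `na=False` the result is a plain boolean column with no nulls, which is the form the `ready_for_sklearn`-style checks earlier in this notebook look for.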

## Work with dates

In [0]:
pd.to_datetime(df['issue_d'], infer_datetime_format=True)
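
Once a column is parsed to datetimes, the `.dt` accessor exposes its parts as features. A minimal sketch, assuming the date strings look like `Sep-2018` (that format, the sample values, and the new column names are assumptions, since the raw `issue_d` values are not shown above). It passes an explicit `format` rather than `infer_datetime_format`, which newer pandas releases deprecate:

```python
import pandas as pd

# Made-up sample; the 'Mon-YYYY' layout is an assumption about the raw data
toy = pd.DataFrame({
    'issue_d': ['Sep-2018', 'Aug-2018', 'Jul-2018'],
    'earliest_cr_line': ['Sep-2017', 'Aug-2008', 'Jul-2018'],
})

# Parse both columns with an explicit format; day defaults to the 1st of the month
toy['issue_d'] = pd.to_datetime(toy['issue_d'], format='%b-%Y')
toy['earliest_cr_line'] = pd.to_datetime(toy['earliest_cr_line'], format='%b-%Y')

# Date parts become numeric features via the .dt accessor
toy['issue_year'] = toy['issue_d'].dt.year
toy['issue_month'] = toy['issue_d'].dt.month

# Subtracting datetimes gives a timedelta; .dt.days turns it into a number
toy['days_since_earliest_cr_line'] = (toy['issue_d'] - toy['earliest_cr_line']).dt.days
print(toy)
```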

## Load Instacart data

Let's return to the dataset of [3 Million Instacart Orders](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2).

If necessary, uncomment and run the cells below to re-download and extract the data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

Here's a list of the six CSV filenames

In [0]:
%cd instacart_2017_05_01

In [0]:
!ls -lh

Load the CSV files with pandas
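
One way to do this is to read every CSV in the current directory into a dictionary of DataFrames. This is a minimal sketch; `load_csvs` is a hypothetical helper written for illustration, and it assumes the working directory is the extracted data folder:

```python
from pathlib import Path

import pandas as pd

def load_csvs(directory='.'):
    """Read every *.csv file in `directory` into a dict keyed by file name without extension."""
    return {
        path.stem: pd.read_csv(path)
        for path in sorted(Path(directory).glob('*.csv'))
    }

# Assumes the current directory holds the extracted CSV files
tables = load_csvs('.')
for name, table in tables.items():
    print(name, table.shape)
```

Keeping the tables in a dictionary makes it easy to check shapes in one loop; the larger files can take a while to read, so loading only the ones needed is also reasonable.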

## Do feature engineering