<a href="https://colab.research.google.com/github/trista-paul/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/module4-make-features/LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

In [44]:
!wget https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip

--2019-01-17 20:07:31--  https://resources.lendingclub.com/LoanStats_2018Q3.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q3.csv.zip.1’

LoanStats_2018Q3.cs     [         <=>        ]  21.42M  1.71MB/s    in 13s     

2019-01-17 20:07:44 (1.69 MB/s) - ‘LoanStats_2018Q3.csv.zip.1’ saved [22461905]



In [45]:
!unzip LoanStats_2018Q3.csv.zip

Archive:  LoanStats_2018Q3.csv.zip
replace LoanStats_2018Q3.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [46]:
!head LoanStats_2018Q3.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

In [48]:
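import pandas as pd
# skipfooter is only supported by the Python parsing engine, so request it explicitly
df = pd.read_csv('LoanStats_2018Q3.csv', skiprows=1, skipfooter=2, engine='python')
df.shape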

  


(128194, 145)

In [49]:
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | | | | | |
| member_id | | | | | |
| loan_amnt | 20000 | 25000 | 30000 | 6000 | 10650 |
| funded_amnt | 20000 | 25000 | 30000 | 6000 | 10650 |
| funded_amnt_inv | 20000 | 25000 | 30000 | 6000 | 10650 |
| term | 60 months | 60 months | 36 months | 36 months | 36 months |
| int_rate | 17.97% | 13.56% | 18.94% | 7.84% | 7.84% |
| installment | 507.55 | 576.02 | 1098.78 | 187.58 | 332.95 |
| grade | D | C | D | A | A |
| sub_grade | D1 | C1 | D2 | A4 | A4 |


In [50]:
df['hardship_payoff_balance_amount'].isnull().sum()/len(df)  # about 99.99% of values are missing

0.999898591197716
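To find other columns that are mostly empty, the same calculation can be applied to every column at once. A minimal sketch on a small made-up frame (the column names `a`, `b`, `c` and the 0.5 threshold are illustrative assumptions, not values from the LendingClub data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': ['x', None, 'y', 'z'],
})

# isnull() gives booleans; their mean is the fraction of missing values per column
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

# Columns where more than half of the values are missing (threshold is arbitrary)
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()

print(missing_fraction)
print(mostly_missing)
```

Taking the mean of a boolean mask is equivalent to `sum() / len()` and reads a little more directly.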

## Work with strings

In [51]:
import numpy as np

def all_numeric(df):
  # True only if every column has a numeric or boolean dtype
  return all(pd.api.types.is_numeric_dtype(dtype) or
             pd.api.types.is_bool_dtype(dtype)
             for dtype in df.dtypes)

def no_nulls(df):
  # True only if no column contains a missing value
  return not df.isnull().any().any()

def ready_for_sklearn(df):
  # scikit-learn estimators generally expect numeric input with no missing values
  return all_numeric(df) and no_nulls(df)

ready_for_sklearn(df)

False

In [52]:
all_numeric(df)

False
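When a check like this fails, it helps to know which columns are responsible. A sketch on a toy frame (the frame and its column names are hypothetical) that lists the non-numeric columns and the columns containing nulls:

```python
import numpy as np
import pandas as pd

# Toy frame with one string column and two columns containing missing values
toy = pd.DataFrame({
    'amount': [100, 200, 300],
    'rate': ['7.84%', '13.56%', None],      # strings, one missing
    'income': [50000.0, np.nan, 72000.0],   # numeric, one missing
})

# Columns that are neither numeric nor boolean
non_numeric_cols = toy.select_dtypes(exclude=['number', 'bool']).columns.tolist()

# Columns containing at least one null
null_cols = toy.columns[toy.isnull().any()].tolist()

print(non_numeric_cols)
print(null_cols)
```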

In [53]:
df.select_dtypes('object').info()  # 37 columns have object dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128194 entries, 0 to 128193
Data columns (total 37 columns):
term                         128194 non-null object
int_rate                     128194 non-null object
grade                        128194 non-null object
sub_grade                    128194 non-null object
emp_title                    114757 non-null object
emp_length                   117807 non-null object
home_ownership               128194 non-null object
verification_status          128194 non-null object
issue_d                      128194 non-null object
loan_status                  128194 non-null object
pymnt_plan                   128194 non-null object
purpose                      128194 non-null object
title                        128194 non-null object
zip_code                     128194 non-null object
addr_state                   128194 non-null object
earliest_cr_line             128194 non-null object
revol_util                   128065 non-null object
initi

In [54]:
df.int_rate.head()  # strings, because of the percent signs

def remove_percent(string):
  return float(string.replace('%', ''))

# int_rate has no missing values; a column with nulls must have them handled before applying this
df['int_rate'] = df['int_rate'].apply(remove_percent)
df.int_rate.head()

0    17.97
1    13.56
2    18.94
3     7.84
4     7.84
Name: int_rate, dtype: float64
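A function like `remove_percent` raises an `AttributeError` when it meets a missing value, because `NaN` is a float and has no `.replace` method. One alternative, sketched here on made-up values, is to use pandas' vectorized string methods together with `pd.to_numeric`, which leave missing values missing:

```python
import numpy as np
import pandas as pd

# Made-up percent strings with one missing entry
rates = pd.Series(['17.97%', '13.56%', np.nan, '7.84%'])

# .str methods skip missing entries; to_numeric converts the rest to floats
rates_float = pd.to_numeric(rates.str.rstrip('%'))

print(rates_float)
```

The missing entry stays `NaN` in the result, so it still has to be imputed or dropped before modeling.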

In [55]:
#Clean emp_title
#look at top 20 titles
df['emp_title'].value_counts().head(20)  # some titles repeat, differing only by case or whitespace: Owner/owner, Supervisor twice

Teacher               2294
Manager               2075
Owner                 1231
Driver                1089
Registered Nurse       944
Supervisor             810
RN                     757
Sales                  726
Project Manager        637
General Manager        548
Office Manager         542
Director               482
owner                  398
Engineer               383
Truck Driver           367
Operations Manager     366
President              350
Sales Manager          323
Supervisor             321
Server                 319
Name: emp_title, dtype: int64

In [56]:
#how often is emp_title null?
df['emp_title'].isnull().sum()
type(df['emp_title'][0])

str

In [57]:
#clean the title and handle missing values
def clean_title(x):
    if isinstance(x, str):        
        return x.strip().title()
    else:
        return 'Unknown'

df['emp_title'] = df['emp_title'].apply(clean_title)
df['emp_title'].isnull().sum()

0

In [58]:
df['emp_title'].value_counts().head(20)

Unknown                     13437
Teacher                      2843
Manager                      2749
Owner                        1856
Driver                       1498
Registered Nurse             1386
Supervisor                   1345
Sales                         980
Truck Driver                  921
Rn                            905
Office Manager                846
Project Manager               835
General Manager               809
Director                      585
Operations Manager            516
Sales Manager                 510
Engineer                      474
Store Manager                 466
Administrative Assistant      466
President                     464
Name: emp_title, dtype: int64

In [59]:
#create emp_title_manager
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')
df['emp_title_manager'].value_counts()

False    109498
True      18696
Name: emp_title_manager, dtype: int64
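`str.contains` is case-sensitive by default and returns a missing value where the input is missing. The lesson sidesteps both issues by title-casing the strings and filling nulls with 'Unknown' first. A sketch of an alternative, using the `case` and `na` parameters on uncleaned, made-up titles:

```python
import numpy as np
import pandas as pd

# Made-up titles with inconsistent case and one missing value
titles = pd.Series(['Project Manager', 'store manager', 'Teacher', np.nan])

# case=False ignores capitalization; na=False treats missing titles as non-matches;
# regex=False does a plain substring search instead of a regular expression
is_manager = titles.str.contains('manager', case=False, na=False, regex=False)

print(is_manager)
```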

## Work with dates

In [61]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)
df['issue_d'].describe()

count                  128194
unique                      3
top       2018-08-01 00:00:00
freq                    46079
first     2018-07-01 00:00:00
last      2018-09-01 00:00:00
Name: issue_d, dtype: object

In [62]:
df['issue_year']  = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month
df['issue_month'].sample(n=10).values

array([9, 8, 7, 8, 8, 9, 8, 7, 7, 8])

In [63]:
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], 
                                        infer_datetime_format=True)
(df['issue_d'] - df['earliest_cr_line']).head()

0   3499 days
1   3530 days
2   3836 days
3   6727 days
4   5783 days
dtype: timedelta64[ns]

In [64]:
df['days_from_earliest_credit_to_issue'] = (
    df['issue_d'] - df['earliest_cr_line']).dt.days
df['days_from_earliest_credit_to_issue'].head().values

array([3499, 3530, 3836, 6727, 5783])
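Subtracting one datetime column from another yields a timedelta column, and `.dt.days` extracts whole days as integers. A sketch on made-up dates (dividing by 365.25 to approximate years is an added illustration, not part of the lesson):

```python
import pandas as pd

# Made-up issue dates and earliest credit line dates
issue = pd.to_datetime(pd.Series(['2018-08-01', '2018-09-01']))
earliest = pd.to_datetime(pd.Series(['2008-08-01', '2000-01-01']))

# Timedelta column -> integer number of days
days = (issue - earliest).dt.days

# Approximate length of credit history in years
years = days / 365.25

print(days)
print(years.round(1))
```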

In [151]:
[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']

In [0]:
for col in ['last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
    df[col] = pd.to_datetime(df[col],
                             infer_datetime_format=True)
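When the string layout is known, passing an explicit `format` avoids inference, and `errors='coerce'` turns unparseable entries into `NaT` instead of raising. A sketch assuming month-year strings such as 'Dec-2018' (that layout is an assumption; the raw date strings are not shown in this notebook):

```python
import numpy as np
import pandas as pd

# Made-up raw values: two valid dates, one missing, one malformed
raw = pd.Series(['Dec-2018', 'Nov-2018', np.nan, 'not a date'])

# '%b-%Y' means abbreviated month name, a dash, then a four-digit year
parsed = pd.to_datetime(raw, format='%b-%Y', errors='coerce')

print(parsed)
```

Coercing hides bad values, so it is worth counting the resulting `NaT` entries afterwards.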

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.



In [68]:
#making df[loan_status_is_great]
print(df['loan_status'].value_counts())

Current               121082
Fully Paid              4786
In Grace Period          948
Late (31-120 days)       920
Late (16-30 days)        348
Charged Off              110
Name: loan_status, dtype: int64


In [101]:
type(df['loan_status'].str.contains('Fully Paid')) 
#as this is a Series it can be added as a new col without any more work

pandas.core.series.Series

In [100]:
# isin matches the two statuses exactly; astype(int) gives the 1/0 integers the assignment asks for
df['loan_status_is_great'] = df['loan_status'].isin(['Current', 'Fully Paid']).astype(int)
df['loan_status_is_great'].value_counts()

1    125868
0      2326
Name: loan_status_is_great, dtype: int64

In [103]:
print(df['loan_status'].str.contains('Fully Paid').sum()
      +df['loan_status'].str.contains('Current').sum())
# confirmation: the number of great loans equals the Fully Paid count plus the Current count

125868


In [150]:
#Converting term from string to int
# .str.replace works on substrings; Series.replace would only match whole cell values
df['term'] = (df['term'].str.replace(' months', '', regex=False)
                        .str.strip()
                        .astype(int))

print(df['term'].value_counts())
print(type(df.term[0]))

36    89574
60    38620
Name: term, dtype: int64
<class 'numpy.int64'>
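Another way to pull the number out of strings like '36 months' is a regular-expression capture with `str.extract`. A sketch on made-up values (the leading space in some values is deliberate, to show that surrounding whitespace does not matter):

```python
import pandas as pd

# Made-up term strings, some with a leading space
term = pd.Series([' 36 months', '60 months', ' 36 months'])

# Capture the first run of digits; expand=False returns a Series rather than a DataFrame
term_int = term.str.extract(r'(\d+)', expand=False).astype(int)

print(term_int)
```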


In [157]:
# Inspect last_pymnt_d before making the year and month columns
df['last_pymnt_d'].value_counts()

2018-12-01    116465
2018-11-01      7793
2018-10-01      1594
2018-09-01      1136
2018-08-01       838
2018-07-01       222
Name: last_pymnt_d, dtype: int64

In [167]:
df['last_pymnt_d_year']  = df['last_pymnt_d'].dt.year
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
print(df['last_pymnt_d_year'].value_counts())
print(df['last_pymnt_d_month'].value_counts())

2018.0    128048
Name: last_pymnt_d_year, dtype: int64
12.0    116465
11.0      7793
10.0      1594
9.0       1136
8.0        838
7.0        222
Name: last_pymnt_d_month, dtype: int64
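The year and month print as floats (`2018.0`, `12.0`) because some `last_pymnt_d` values are missing (128048 of the 128194 rows have a date), and a missing value forces the column to float. If integer columns are preferred, pandas' nullable `Int64` dtype is one option. A sketch on made-up dates:

```python
import pandas as pd

# Made-up payment dates with one missing value
dates = pd.to_datetime(pd.Series(['2018-12-01', None, '2018-07-01']))

# With a missing date, .dt.month comes back as float64 with NaN
month_float = dates.dt.month

# The nullable Int64 dtype keeps integers and marks the gap as <NA>
month_int = dates.dt.month.astype('Int64')

print(month_float)
print(month_int)
```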


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
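For the `emp_title` option, one possible approach is to take the most frequent titles from `value_counts()` and replace everything else with `Series.where`. A sketch on made-up titles, keeping the top 2 rather than the top 20 so the effect is visible (the variable names are hypothetical):

```python
import pandas as pd

# Made-up titles: Teacher x3, Manager x2, Owner x1, Driver x1
titles = pd.Series(['Teacher', 'Manager', 'Teacher', 'Owner',
                    'Manager', 'Driver', 'Teacher'])

top_n = 2  # the stretch goal uses 20

# value_counts() sorts by frequency, so head(top_n) gives the most common titles
top_titles = titles.value_counts().head(top_n).index

# Keep titles that are in the top list; replace all others with 'Other'
titles_reduced = titles.where(titles.isin(top_titles), 'Other')

print(titles_reduced.value_counts())
```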

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

If necessary, uncomment and run the cells below to re-download and extract the data

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
#%cd instacart_2017_05_01

In [0]:
#!ls -lh