Lambda School Data Science

*Unit 1, Sprint 1, Module 2*

---

# Make Features 

- Student should be able to understand the purpose of feature engineering
- Student should be able to work with strings in pandas
- Student should be able to work with dates and times in pandas
- Student should be able to filter a dataframe based on conditions
- Student should be able to modify or create columns of a dataframe using the `.apply()` function


Helpful Links:
- [Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428)
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series
- [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-11-26 18:08:28--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [        <=>         ]  21.67M   411KB/s    in 56s     

2019-11-26 18:09:24 (398 KB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22727580]



In [2]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [3]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [5]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt


data=pd.read_csv('LoanStats_2018Q4.csv', header=1)




In [6]:
data.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,35000.0,35000.0,35000.0,36 months,14.47%,1204.23,C,C2,Staff Physician,8 years,MORTGAGE,360000.0,Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,336xx,FL,19.9,0.0,Apr-1995,1.0,,,24.0,0.0,57259.0,43.2%,51.0,w,0.0,0.0,38187.046837,38187.05,...,30.8,0.0,0.0,1222051.0,169286.0,124600.0,258401.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,5000.0,5000.0,5000.0,36 months,22.35%,191.86,D,D5,Director of Sales,10+ years,MORTGAGE,72000.0,Source Verified,Dec-2018,Fully Paid,n,,,debt_consolidation,Debt consolidation,333xx,FL,20.12,0.0,Mar-2010,0.0,,,13.0,0.0,11720.0,47.1%,26.0,f,0.0,0.0,5615.977674,5615.98,...,50.0,0.0,0.0,218686.0,34418.0,18200.0,37786.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,10000.0,10000.0,10000.0,60 months,23.40%,284.21,E,E1,,< 1 year,RENT,55000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,902xx,CA,13.51,0.0,Apr-2007,0.0,44.0,88.0,9.0,1.0,11859.0,53.9%,11.0,w,9131.55,9131.55,2538.39,2538.39,...,100.0,1.0,0.0,34386.0,21235.0,10500.0,12386.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,17100.0,17100.0,17100.0,36 months,18.94%,626.3,D,D2,Receptionist,10+ years,RENT,38000.0,Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,150xx,PA,38.09,0.0,Mar-1998,1.0,47.0,,14.0,0.0,15323.0,53%,21.0,w,13682.21,13682.21,5609.71,5609.71,...,75.0,0.0,0.0,70954.0,43351.0,16600.0,41784.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,4000.0,4000.0,4000.0,36 months,10.72%,130.43,B,B2,Extrusion assistant,10+ years,MORTGAGE,56000.0,Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,301xx,GA,31.03,0.0,Sep-2006,0.0,,,7.0,0.0,4518.0,28.6%,11.0,w,3116.62,3116.62,1160.78,1160.78,...,0.0,0.0,0.0,221310.0,71375.0,12300.0,77865.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [7]:
data.shape

(128414, 144)
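The `header=1` argument in the `read_csv` call tells pandas to take column names from the second line of the file, which skips the prospectus note that `head` printed as the first line. A minimal sketch of that behaviour on a small in-memory file (the four-column layout here is a shortened, hypothetical stand-in for the real file):

```python
import io
import pandas as pd

# In-memory stand-in for the CSV: a one-line note, then the real header row.
raw = io.StringIO(
    "Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)\n"
    '"id","loan_amnt","term","int_rate"\n'
    '"","35000","36 months","14.47%"\n'
    '"","5000","36 months","22.35%"\n'
)

# header=1 means "row index 1 holds the column names"; the note above it is skipped.
sample = pd.read_csv(raw, header=1)
print(sample.columns.tolist())
print(sample.shape)
```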

## Work with strings

Most machine learning models expect numeric input, so string columns usually need to be converted to numbers.

To find those columns, list the ones whose dtype is `object`, which is how pandas stores the strings in this dataframe:

In [8]:
data.columns[data.dtypes=='object']

Index(['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code',
       'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status',
       'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d',
       'application_type', 'verification_status_joint',
       'sec_app_earliest_cr_line', 'hardship_flag', 'hardship_type',
       'hardship_reason', 'hardship_status', 'hardship_start_date',
       'hardship_end_date', 'payment_plan_start_date', 'hardship_loan_status',
       'debt_settlement_flag', 'debt_settlement_flag_date',
       'settlement_status', 'settlement_date'],
      dtype='object')
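Another way to get the same kind of list is `select_dtypes`, which filters columns by dtype. The sketch below runs on a toy frame (its three columns and two rows are copied from the `head` output, not loaded from the file) and keeps everything that is not numeric:

```python
import pandas as pd

# Toy frame; column names and values mirror the first two rows shown by data.head().
toy = pd.DataFrame({
    'loan_amnt': [35000.0, 5000.0],
    'term': ['36 months', '36 months'],
    'int_rate': ['14.47%', '22.35%'],
})

# Keep every non-numeric column, whatever dtype pandas uses for strings.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```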

In [9]:
data.dropna(axis=1, how='all', inplace=True)
data.shape

(128414, 141)
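The drop from 144 to 141 columns means three columns held no values at all. `dropna(axis=1, how='all')` removes only such fully empty columns and keeps any column with at least one value. A small sketch on a toy frame (the column names `empty_col`, `loan_amnt` and `grade` are chosen for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'empty_col': [np.nan, np.nan, np.nan],     # entirely missing: dropped
    'loan_amnt': [35000.0, np.nan, 10000.0],   # partly missing: kept
    'grade': ['C', 'D', 'E'],                  # no missing values: kept
})

# axis=1 works on columns; how='all' requires every value to be missing.
trimmed = toy.dropna(axis=1, how='all')
print(trimmed.columns.tolist())
```

Assigning the result to a new name, as here, is an alternative to `inplace=True` that leaves the original frame untouched.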

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
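def remove_pct(column):
    """Strip '%' from a Series of strings and convert the values to floats."""
    return column.str.replace('%', '', regex=False).astype(float)

# Equivalent one-liner without a helper function:
# data['int_rate'] = data['int_rate'].str.replace('%', '', regex=False).astype(float)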

In [11]:
data['int_rate'].dtypes

dtype('O')

Apply the function to the `int_rate` column

In [0]:
data['int_rate']=remove_pct(data['int_rate'])
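To see what the conversion does, here is a sketch on a three-element Series that includes a missing value. The helper is redefined so the block runs on its own, and `regex=False` is added so that `'%'` is treated as a literal character:

```python
import pandas as pd

def remove_pct(column):
    """Strip '%' from a Series of strings and return floats."""
    return column.str.replace('%', '', regex=False).astype(float)

# Two rates taken from the head() output, plus one missing value.
rates = pd.Series(['14.47%', '22.35%', None])
cleaned = remove_pct(rates)
print(cleaned.tolist())
print(cleaned.dtype)
```

Missing entries come through as `NaN`. The `revol_util` column in the `head` output uses the same percent format, so the same helper should apply there. Running the cell a second time on an already converted column raises an error, because the `.str` accessor only works on string data.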

### Clean `emp_title`

Look at the 40 most common titles

In [13]:
data['emp_title'].value_counts()[0:40]

Teacher                     2090
Manager                     1773
Registered Nurse             952
Driver                       924
RN                           726
Supervisor                   697
Sales                        580
Project Manager              526
General Manager              523
Office Manager               521
Owner                        420
Director                     402
Truck Driver                 387
Operations Manager           387
Nurse                        326
Engineer                     325
Sales Manager                304
manager                      301
Supervisor                   270
Administrative Assistant     269
Accountant                   268
Server                       265
Vice President               261
Mechanic                     258
Account Manager              254
Police Officer               252
teacher                      249
Technician                   248
Manager                      246
Store Manager                222
Truck driv

How often is `emp_title` null?

In [14]:
data['emp_title'].isnull().value_counts()

False    107465
True      20949
Name: emp_title, dtype: int64

Lowercase the titles so that variants such as `Manager` and `manager` are counted together. Missing values stay as `NaN` for now.

In [0]:
data['emp_title']=data['emp_title'].str.lower()
#data["emp_title"].fillna("No record", inplace = True) 
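The `value_counts` output lists `Supervisor` and `Manager` twice with identical capitalization, which suggests that some entries differ only by surrounding whitespace. A possible extension, shown on a toy Series, strips whitespace before lowercasing and then labels missing titles (the `'unknown'` label is a hypothetical choice, not something the dataset defines):

```python
import pandas as pd

# Toy titles: different capitalization, a trailing space, and a missing value.
titles = pd.Series(['Manager', 'manager', 'Manager ', None])

# Strip surrounding whitespace, then lowercase, so all three variants match.
cleaned = titles.str.strip().str.lower()

# Replace missing titles with a placeholder label (hypothetical choice).
filled = cleaned.fillna('unknown')
print(filled.value_counts())
```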



### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [16]:
data['emp_title'].str.contains('manager').sum()

17885

In [0]:

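# Flag titles that contain 'manager'; na=False treats missing titles as non-matches.
# Rows that do not match stay NaN unless the second assignment is uncommented.
data.loc[data['emp_title'].str.contains('manager', na=False), 'emp_title_manager'] = 1
#data.loc[~data['emp_title'].str.contains('manager', na=False), 'emp_title_manager'] = 0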


In [18]:
data['emp_title_manager'].value_counts()

1.0    17885
Name: emp_title_manager, dtype: int64
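Because only the matching rows were assigned, `value_counts` shows `1.0` and nothing else; every other row is `NaN`. One way to get a complete 0/1 column in a single step is to cast the boolean mask to integers. A sketch on a toy frame, which also shows filtering on the new column:

```python
import pandas as pd

toy = pd.DataFrame(
    {'emp_title': ['project manager', 'teacher', None, 'general manager']}
)

# Boolean mask: True where the title contains 'manager', False otherwise
# (na=False makes missing titles count as False).
is_manager = toy['emp_title'].str.contains('manager', na=False)

# Casting the mask to int gives 1 for managers and 0 for everyone else.
toy['emp_title_manager'] = is_manager.astype(int)

# Filter the frame down to the flagged rows.
managers = toy[toy['emp_title_manager'] == 1]
print(toy['emp_title_manager'].value_counts())
```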

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
# Values look like 'Apr-1995'; pandas parses them without the datetime module.
data['earliest_cr_line'] = pd.to_datetime(data['earliest_cr_line'])

In [27]:
data['earliest_cr_line'].dtypes


dtype('<M8[ns]')
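The dates in this column are month-year strings such as `Apr-1995`. Passing an explicit `format` to `to_datetime` is optional, but it documents the expected layout. A sketch on a toy Series, assuming every value follows the `%b-%Y` pattern seen in the `head` output:

```python
import pandas as pd

# Two dates from the head() output, plus one missing value.
raw = pd.Series(['Apr-1995', 'Mar-2010', None])

# %b is the abbreviated month name, %Y the four-digit year.
parsed = pd.to_datetime(raw, format='%b-%Y')
print(parsed)
```

With no day in the string, each value is parsed as the first day of its month, and the missing entry becomes `NaT`.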

In [28]:
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same labels.
data['earliest_cr_line'].dt.day_name()

0          Saturday
1            Monday
2            Sunday
3            Sunday
4            Friday
            ...    
128409     Thursday
128410    Wednesday
128411       Friday
128412          NaN
128413          NaN
Name: earliest_cr_line, Length: 128414, dtype: object

In [29]:
data['earliest_cr_line'].dt.year

0         1995.0
1         2010.0
2         2007.0
3         1998.0
4         2006.0
           ...  
128409    2006.0
128410    2008.0
128411    2006.0
128412       NaN
128413       NaN
Name: earliest_cr_line, Length: 128414, dtype: float64

In [35]:
data['earliest_cr_line'].dt.is_month_end

0         False
1         False
2         False
3         False
4         False
          ...  
128409    False
128410    False
128411    False
128412    False
128413    False
Name: earliest_cr_line, Length: 128414, dtype: bool
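Every `is_month_end` value is `False` because month-year strings are parsed as the first day of the month. Components such as the year are more informative, and two date columns can be combined into a numeric feature. The sketch below uses a toy frame; the new column names and the idea of measuring credit history length at `issue_d` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    'earliest_cr_line': pd.to_datetime(['Apr-1995', 'Mar-2010'], format='%b-%Y'),
    'issue_d': pd.to_datetime(['Dec-2018', 'Dec-2018'], format='%b-%Y'),
})

# Date components come from the .dt accessor.
toy['earliest_cr_weekday'] = toy['earliest_cr_line'].dt.day_name()
toy['earliest_cr_year'] = toy['earliest_cr_line'].dt.year

# Subtracting two datetime columns gives a timedelta; convert it to years.
elapsed = toy['issue_d'] - toy['earliest_cr_line']
toy['credit_history_years'] = elapsed.dt.days / 365.25
print(toy)
```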