<a href="https://colab.research.google.com/github/ewuerfel66/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/EricWuerfel_LS_DS5_124_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
-  understand the purpose of feature engineering
-  work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
!wget -nc https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [0]:
!unzip LoanStats_2018Q4.csv.zip

In [0]:
!head LoanStats_2018Q4.csv

In [0]:
!tail LoanStats_2018Q4.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [0]:
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2) # skiprows skips the crap at the top
print(df.shape)
df.head()

In [0]:
df.isnull().sum().sort_values(ascending=False)

#Drop NaN

In [0]:
df = df.drop(['id', 'url', 'desc', 'member_id'], axis = 'columns') #drop is written fro rows. axis=1 sets to columns

In [0]:
df.head()

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get info about which columns have a datatype of "object" (strings)

In [0]:
string = '13.56%'

In [0]:
string.strip('%') #strip only acts on the beginning or the end

In [0]:
float(string.strip('%'))

In [0]:
example_list = ['12.56%','13.59%','13.36%']

In [0]:
def remove_percent(string):
  return float(string.strip('%'))

[remove_percent(x) for x in example_list]

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
def remove_percent(string):
  return float(string.strip('%'))

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_percent)

In [0]:
df.head()

In [0]:
df.describe(exclude='number') # top == Mode

##Filter dataframe by term

In [0]:
df[df['term'] == ' 36 months']

In [0]:
int_rate_14 = df[df['int_rate'] >= 14.0]
int_rate_14.head()

In [0]:
int_rate_14.term.iloc[0]

In [0]:
int_rate_14['term'].loc[2]

### Reset index

In [0]:
int_rate_14.reset_index()

### Clean `emp_title`

Look at top 20 titles

In [0]:
df['emp_title'].value_counts(dropna=False).head(20)

How often is `emp_title` null?

In [0]:
df.emp_title.isnull().sum()

Clean the title and handle missing values

In [0]:
examples = ['owner', 'Supervisor', ' Project Manager', np.NaN]

In [0]:
def clean_title(title):
  if isinstance(title, str):
    return title.strip( ).title()
  else:
    return "Unknown"

In [0]:
[clean_title(x) for x in examples]

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)
#df.head()

In [0]:
df['emp_title'].value_counts()

## Lambda Function

In [0]:
df.apply(lambda X: x.strip().title() if isinstance(x, str))

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
df['issue_d'].head().values

In [0]:
df['issue_d'].describe()

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)

In [0]:
df['issue_d'].head().values

In [0]:
df['issue_d'].dt.year

In [0]:
df['issue_d'].dt.month

In [0]:
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

In [0]:
df.head()

In [0]:
[col for col in df if col.endswith('_d')]

In [0]:
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], infer_datetime_format=True)

In [0]:
df['days_from_earliest_cr_line'] = df['issue_d'] - df['earliest_cr_line']

In [0]:
df.head()

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid." Else it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

### Replicate the Lesson Code

In [0]:
import numpy as np
import pandas as pd
import re
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

**Get the file**

In [131]:
!wget -nc https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

File ‘LoanStats_2018Q4.csv.zip’ already there; not retrieving.



In [0]:
#!unzip LoanStats_2018Q4.csv.zip #This cell may have to be STOPPED MANUALLY

In [133]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [134]:
!tail LoanStats_2018Q4.csv

"","","5600","5600","5600"," 36 months"," 13.56%","190.21","C","C1","","n/a","RENT","15600","Not Verified","Oct-2018","Current","n","","","credit_card","Credit card refinancing","836xx","ID","15.31","0","Aug-2012","0","","97","9","1","5996","34.5%","11","w","4816.58","4816.58","1130.71","1130.71","783.42","347.29","0.0","0.0","0.0","Apr-2019","190.21","May-2019","Apr-2019","0","","1","Individual","","","","0","0","5996","0","0","0","1","20","0","","0","2","3017","35","17400","1","0","0","3","750","4689","45.5","0","0","20","73","13","13","0","13","","20","","0","3","5","4","4","1","9","10","5","9","0","0","0","0","100","25","1","0","17400","5996","8600","0","","","","","","","","","","","","N","","","","","","","","","","","","","","","N","","","","","",""
"","","23000","23000","23000"," 36 months"," 15.02%","797.53","C","C3","Tax Consultant","10+ years","MORTGAGE","75000","Source Verified","Oct-2018","Charged Off","n","","","debt_consolidation","Debt consolidation","352xx","AL","20.95

Enter the data into a **DataFrame**

In [135]:
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')

print(df.shape)

(128412, 144)


**Drop NA Columns**

In [136]:
df = df.drop(['id','url','desc','member_id'], axis='columns')
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,10000,10000,10000.0,36 months,10.33%,324.23,B,B1,,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,Apr-2019,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,9600,9600,9600.0,36 months,12.98%,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,Apr-2019,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,4000,4000,4000.0,36 months,23.40%,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,Apr-2019,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,Apr-2019,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,30000,30000,30000.0,60 months,18.94%,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,Apr-2019,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


**Remove %** from **`'int_rate'`**

In [0]:
def remove_percent(string):
  return float(string.strip('%'))

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_percent)

In [139]:
print(df.int_rate.dtype)
df.head()

float64


Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,Apr-2019,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,Apr-2019,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,Apr-2019,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,Apr-2019,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,Apr-2019,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


**Clean emp_title**

In [140]:
df['emp_title'].value_counts(dropna=False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

In [0]:
def clean_title(title):
  if isinstance(title, str):
    return title.strip().title()
  else:
    return "Unknown"

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)

In [143]:
df['emp_title'].value_counts()

Unknown                                     20947
Teacher                                      2557
Manager                                      2395
Registered Nurse                             1418
Driver                                       1258
Supervisor                                   1160
Truck Driver                                  920
Rn                                            834
Office Manager                                805
Sales                                         803
General Manager                               791
Project Manager                               720
Owner                                         625
Director                                      523
Operations Manager                            518
Sales Manager                                 500
Police Officer                                440
Nurse                                         425
Technician                                    420
Engineer                                      412


Create a **new column for the Truth value**

In [0]:
df['emp_title_manager'] = True

In [0]:
df['emp_title_manager'] = df['emp_title'].str.contains("Manager")
#df['emp_title_manager']

In [146]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,Apr-2019,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,Apr-2019,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,Apr-2019,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,Apr-2019,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,Apr-2019,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False


Add **year** and **month** columns

**issue_d**

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)

In [0]:
df['issue_year'] = df['issue_d'].dt.year
df['issue_month'] = df['issue_d'].dt.month

**last_pymnt_d**

In [0]:
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'], infer_datetime_format=True)

In [0]:
df['last_pymnt_year'] = df['last_pymnt_d'].dt.year
df['last_pymnt_month'] = df['last_pymnt_d'].dt.month

In [151]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,last_pymnt_year,last_pymnt_month
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,2019-04-01,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,2019-04-01,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,2019-04-01,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,2019-04-01,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,2019-04-01,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0


Add **days_from_earliest_credit_to_issue**

In [0]:
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], 
                                        infer_datetime_format=True)

In [0]:
df['days_from_earliest_credit_to_issue'] = (
    df['issue_d'] - df['earliest_cr_line']).dt.days

In [154]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,last_pymnt_year,last_pymnt_month,days_from_earliest_credit_to_issue
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,1996-01-01,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,2019-04-01,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,8370
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,2003-11-01,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,2019-04-01,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,5509
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,2006-09-01,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,2019-04-01,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,4474
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,2001-04-01,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,2019-04-01,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,6453
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,1987-06-01,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,2019-04-01,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,11506


## Convert the `term` column from string to integer

Start by converting **just one instance**

In [155]:
example_string = df.term[0]
print(example_string)

 36 months


In [156]:
example_string.strip(' ')

'36 months'

In [157]:
pattern = re.compile('[0-9]+')
term_in_months = re.findall(pattern, example_string.strip(' '))
int_term_in_months = int(term_in_months[0])
print(int_term_in_months)

36


In [0]:
def term_to_int(term):
  pattern = re.compile('[0-9]+')
  term_in_months = re.findall(pattern, example_string.strip(' '))
  int_term_in_months = int(term_in_months[0])
  return(int_term_in_months)

In [0]:
df['term_in_months'] = df['term'].apply(term_to_int)

In [160]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,last_pymnt_year,last_pymnt_month,days_from_earliest_credit_to_issue,term_in_months
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,1996-01-01,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,2019-04-01,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,8370,36
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,2003-11-01,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,2019-04-01,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,5509,36
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,2006-09-01,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,2019-04-01,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,4474,36
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,2001-04-01,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,2019-04-01,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,6453,36
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,1987-06-01,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,2019-04-01,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,11506,36


## Create **loan_status_is_great**

In [161]:
df['loan_status'].value_counts()

Current               120550
Fully Paid              5737
Late (31-120 days)      1030
In Grace Period          470
Late (16-30 days)        401
Charged Off              209
Default                   15
Name: loan_status, dtype: int64

In [0]:
def loan_status_analyzer(string):
  if string in ['Current', 'Fully Paid']:
    status = 1
  else:
    status = 0
  return(status)

In [0]:
df['loan_status_is_great'] = df['loan_status'].apply(loan_status_analyzer)

In [164]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,last_pymnt_year,last_pymnt_month,days_from_earliest_credit_to_issue,term_in_months,loan_status_is_great
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,1996-01-01,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,0.0,2019-04-01,324.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,671,246828,1,3,2,3,1.0,48552,62.0,1,3,4923,46.0,23900,2,7,1,7,17631.0,11897.0,43.1,0,0,158.0,275,11,1,1,11.0,,11.0,,0,3,4,7,7,10,9,11,4,14,0.0,0,0,4,91.3,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,8370,36,1
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,2003-11-01,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,0.0,2019-04-01,323.37,May-2019,Apr-2019,0,,1,Individual,,,,0,0,748,0,0,0,0,44.0,0,,0,3,748,12.0,6500,0,0,1,3,150.0,3452.0,17.8,0,0,181.0,100,13,13,0,16.0,,3.0,,0,1,1,2,2,16,5,7,1,5,0.0,0,0,0,95.5,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,5509,36,1
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,2006-09-01,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,0.0,2019-04-01,155.68,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,66926,5,4,3,4,5.0,61727,86.0,6,11,1353,68.0,27100,4,0,4,15,4462.0,20174.0,7.9,0,0,147.0,118,2,2,0,2.0,,0.0,,0,5,7,9,9,8,11,12,7,15,0.0,0,0,9,95.0,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,4474,36,1
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,2001-04-01,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,0.0,2019-04-01,84.92,Apr-2019,Apr-2019,0,,1,Individual,,,,0,0,16901,2,2,1,2,2.0,12560,69.0,2,7,2137,28.0,42000,1,11,2,9,1878.0,34360.0,5.9,0,0,140.0,212,1,1,0,1.0,,2.0,,0,2,5,3,3,16,7,18,5,9,0.0,0,0,3,100.0,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,6453,36,1
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,1987-06-01,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,0.0,2019-04-01,777.23,Apr-2019,Apr-2019,0,,1,Individual,,,,0,1208,321915,4,4,2,3,3.0,87153,88.0,4,5,998,57.0,50800,2,15,2,10,24763.0,13761.0,8.3,0,0,163.0,378,4,3,3,4.0,,4.0,,0,2,4,4,9,27,8,14,4,13,0.0,0,0,6,95.0,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,2019.0,4.0,11506,36,1


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiatve and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

## `'revol_util'` also has a '%'.

In [165]:
type(df['revol_util'][0])

str

In [166]:
float(df['revol_util'][0].strip('%'))

38.0

In [0]:
def nan_remove_percent(string):
  if type(string) == str:
    return float(string.strip('%'))
  else:
    return "Unknown"

In [168]:
df['revol_util'] = df['revol_util'].apply(nan_remove_percent)
df['revol_util'][234]

'Unknown'

## Modify the `emp_title` column

In [0]:
df['emp_title'].value_counts()

In [0]:
emp_title_value_count_df = pd.DataFrame()

In [0]:
emp_titles = df['emp_title'].value_counts().index.tolist()

In [0]:
emp_counts = df['emp_title'].value_counts().tolist()

In [180]:
emp_title_value_count_df['emp_titles'] = emp_titles
emp_title_value_count_df['emp_counts'] = emp_counts

emp_title_value_count_df.head()

Unnamed: 0,emp_titles,emp_counts
0,Unknown,20947
1,Teacher,2557
2,Manager,2395
3,Registered Nurse,1418
4,Driver,1258


In [0]:
emp_titles_to_change = []

In [0]:
for index, row in emp_title_value_count_df.iterrows():
  if emp_title_value_count_df['emp_counts'][index] < 410:
    if emp_title_value_count_df['emp_titles'][index] not in emp_titles_to_change:
      emp_titles_to_change.append(emp_title_value_count_df['emp_titles'][index])

In [0]:
emp_titles_to_change

In [197]:
for index, row in df.iterrows():
  if df['emp_title'][index] in emp_titles_to_change:
    df['emp_title'][index] = "Other"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


KeyboardInterrupt: ignored