<a href="https://colab.research.google.com/github/spasatel13/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/module2-make-features/DS20_W1_D2_Make_Features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 1, Sprint 1, Module 2*

---

# Learning Objectives

- Objective 01 - understand the purpose of feature engineering
- Objective 02 - demonstrate how to work with strings in pandas
- Objective 03 - modify or create dataframe columns using the `apply()` function
- Objective 04 - work with dates and times in pandas


Helpful Links:
- [Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428)
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series
- [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)

In [276]:
# Dot syntax: works only when the column name is a valid Python identifier
#df.column_header

# Bracket syntax: this one always works
#df['column_header']
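
To make the difference between the two access styles concrete, here is a minimal sketch on a made-up three-column frame (the `homes` frame and its `count` column are assumptions for illustration, not part of the lesson's data).

```python
import pandas as pd

# Toy frame; two column names mirror ones used later in this notebook,
# and "count" is a made-up column that clashes with a DataFrame method.
homes = pd.DataFrame({
    "SalePrice": [208500, 181500],
    "1stFlrSF": [856, 1262],
    "count": [1, 2],
})

# Dot access works when the name is a valid Python identifier
# that does not clash with an existing DataFrame attribute or method.
prices = homes.SalePrice

# Bracket access works for any column name.
first_floor = homes["1stFlrSF"]  # homes.1stFlrSF would be a SyntaxError
counts = homes["count"]          # homes.count is the DataFrame.count method

print(prices.tolist(), first_floor.tolist(), counts.tolist())
print(callable(homes.count))
```

Sticking to brackets everywhere avoids having to remember which names are safe with dot syntax.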

# [Objective 01](#feature-engineering) - The Purpose of Feature Engineering



## Overview

Feature Engineering is the process of using a combination of domain knowledge, creativity, and the pre-existing columns of a dataset to create completely new columns.

Machine learning models try to detect patterns in the data and then associate those patterns with certain predictions. The hope is that by creating new columns on our dataset, we can expose our model to new patterns in the data so that it can make better predictions.

This is largely a matter of understanding how to work with individual columns of a dataframe with Pandas, which is what we'll be practicing today!

## Follow Along

Columns of a dataframe each hold a specific type of data. Let's inspect some of the common datatypes found in datasets, and then we'll make a new feature on a dataset using pre-existing columns.

In [277]:
import pandas as pd

#Pandas Display Options:
pd.set_option('display.max_rows',150)
pd.set_option('display.max_columns',180)

In [278]:
# Let's take a look at the Ames, Iowa Housing Dataset:
df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv')

### Specific Columns hold specific kinds of data

In [279]:
print(df.shape)
df.head()

(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [None]:
df.dtypes     

Some columns hold integer values, like `BedroomAbvGr`, which stands for "Bedrooms Above Grade." This is the number of non-basement bedrooms in the home.

For more information on specific column meanings view the [data dictionary](https://github.com/ryanleeallred/datasets/blob/master/Ames%20Housing%20Data/data_description.txt).

In [281]:
# Look at a few rows of the `BedroomAbvGr` column.
# Looks like integers to me!
df.BedroomAbvGr.head(10)


0    3
1    3
2    3
3    3
4    4
5    1
6    3
7    3
8    2
9    2
Name: BedroomAbvGr, dtype: int64

What type of variable is BedroomAbvGr?  Is it categorical or quantitative?  If you answered "categorical", is it ordinal, nominal or an identifier?  If you answered "quantitative", is it continuous or discrete?

It is quantitative and discrete.

Some columns hold float values like the `LotFrontage` column.

In [282]:
# Look at a few rows of the `LotFrontage` column.
df.LotFrontage.head(10)

0    65.0
1    80.0
2    68.0
3    60.0
4    84.0
5    85.0
6    75.0
7     NaN
8    51.0
9    50.0
Name: LotFrontage, dtype: float64

Hmmm, do the values above look like floats to you?

They all have .0 on them so technically they're being stored as floats, but *should* they be stored as floats?

Let's see what all of the possible values for this column are.

In [283]:
df.LotFrontage.value_counts()

# for brevity
df.LotFrontage.value_counts().head(10)

60.0    143
70.0     70
80.0     69
50.0     57
75.0     53
65.0     44
85.0     40
78.0     25
21.0     23
90.0     23
Name: LotFrontage, dtype: int64

In [284]:
print(type(df.LotFrontage.value_counts()))

<class 'pandas.core.series.Series'>


Looks to me like the `LotFrontage` column originally held integer values but was cast to `float`, meaning that each original integer value was converted to its corresponding float representation.

Any guesses as to why that would have happened?


HINT: What's the most common `LotFrontage` value for this column?

In [None]:
# NaN is the most common value in this column. What is a NaN?
df.LotFrontage.value_counts(dropna=False)

# for brevity
df.LotFrontage.value_counts(dropna=False).head(10)

`NaN` stands for "Not a Number" and is the default missing value indicator in Pandas. This means there were homes in this dataset that didn't have a `LotFrontage` value recorded.
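
As a small illustration of how missing values can be counted, here is a sketch on a made-up series (the `frontage` values are invented; only the column name comes from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up frontage values, two of them missing.
frontage = pd.Series([65.0, 80.0, np.nan, 60.0, np.nan, 60.0], name="LotFrontage")

# isna() marks each missing entry; summing the booleans counts them.
n_missing = frontage.isna().sum()

# By default value_counts() drops NaN; dropna=False gives NaN its own row.
counts_all = frontage.value_counts(dropna=False)

print(n_missing)
print(counts_all)
```

`isna().sum()` is often the quicker check when you only need the number of missing cells rather than the full frequency table.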

This is where domain knowledge starts to come in. Think about the context we're working with here: houses. What might a null or blank cell representing "Linear feet of street connected to property" mean in the context of a housing dataset?

Ok, so maybe it makes sense to have some NaNs in this column. What is the datatype of a NaN value?

Perhaps some of this data is truly missing or unrecorded data, but sometimes `NaNs` are more likely to indicate something that was "NA" or "Not Applicable" to a particular observation. There could be multiple reasons why there was no value recorded for a particular feature.

Remember - Pandas tries to maintain a single datatype for all values in a column, and therefore...

In [285]:
import numpy as np

# What is the datatype of NaN?
print(type(np.nan))


<class 'float'>


The datatype of a NaN is float! This means that if a column of integer values contains even a single `NaN`, Pandas will not store it with an integer datatype. Instead, all of the integers are converted to floats so that the whole column keeps a single datatype.

You can see already how understanding column datatypes is crucial to understanding how Pandas helps us manage our data.
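
The upcasting described above can be reproduced on a tiny example. This is a sketch with made-up values; the nullable `"Int64"` dtype at the end is an extra option pandas offers and is not used elsewhere in this lesson.

```python
import numpy as np
import pandas as pd

# Three whole numbers: pandas picks an integer dtype.
whole = pd.Series([65, 80, 68])

# Replace one value with NaN: the column is stored as floats instead.
with_gap = pd.Series([65, 80, np.nan])

# Optional: the nullable "Int64" dtype keeps integers alongside a missing value.
nullable = pd.Series([65, 80, None], dtype="Int64")

print(whole.dtype)
print(with_gap.dtype)
print(nullable.dtype)
```

Whether to convert a column like `LotFrontage` to a nullable integer type depends on what the downstream code expects, so treat it as one possibility rather than a required step.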

### Making new Features

Let's slim down the dataset and consider just a few specific columns:

- `TotalBsmtSF`
- `1stFlrSF`
- `2ndFlrSF`
- `SalePrice`


In [289]:
# I can make a smaller dataframe with a few specific column headers
# by passing a list of column headers inside of the square brackets

small_df=df[['TotalBsmtSF','1stFlrSF','2ndFlrSF','SalePrice']]
small_df.head()

Unnamed: 0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice
0,856,856,854,208500
1,1262,1262,0,181500
2,920,920,866,223500
3,756,961,756,140000
4,1145,1145,1053,250000


In [None]:
small_df.dtypes

### Syntax for creating new columns

When making a new column on a dataframe, we have to use the square bracket syntax of accessing a column. We can't use "dot syntax" here.
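
Here is a minimal sketch of why dot syntax fails for new columns, using a made-up two-row frame (the `homes` frame and the `AboveGroundSF` name are assumptions for illustration).

```python
import warnings
import pandas as pd

homes = pd.DataFrame({"1stFlrSF": [856, 1262], "2ndFlrSF": [854, 0]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    # Dot syntax only sets a plain Python attribute on the object.
    homes.AboveGroundSF = homes["1stFlrSF"] + homes["2ndFlrSF"]

created_by_dot = "AboveGroundSF" in homes.columns
print(created_by_dot)  # False: no column was added

# Bracket syntax actually adds the column.
homes["AboveGroundSF"] = homes["1stFlrSF"] + homes["2ndFlrSF"]
created_by_bracket = "AboveGroundSF" in homes.columns
print(created_by_bracket)  # True
```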

In [None]:
# Let's add up all of the square footage to get a single square footage
# column for the entire dataset

# Using bracket syntax to make a new 'TotalSF' column

small_df['TotalSF']=small_df.TotalBsmtSF+small_df['1stFlrSF']+small_df['2ndFlrSF']
small_df.head()

In [None]:
# Overwrite one value using chained indexing. Because small_df was sliced
# from df, pandas emits a SettingWithCopyWarning here.

small_df['TotalBsmtSF'][0]=2
small_df.head()
#df.head()
# The warning means pandas cannot tell whether small_df is a view of df or a copy,
# so the edit might not land where we expect.
# To avoid it, add .copy() when creating small_df.

In [292]:
small_df=df[['TotalBsmtSF','1stFlrSF','2ndFlrSF','SalePrice']].copy()
small_df.head()

Unnamed: 0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice
0,856,856,854,208500
1,1262,1262,0,181500
2,920,920,866,223500
3,756,961,756,140000
4,1145,1145,1053,250000
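
As a sketch of the safer pattern, the example below takes an explicit copy and then assigns with a single `.loc` call instead of chained indexing. The `housing` and `subset` frames are made up for illustration.

```python
import pandas as pd

housing = pd.DataFrame({"TotalBsmtSF": [856, 1262], "SalePrice": [208500, 181500]})

# An explicit copy owns its data, so edits cannot leak back to `housing`.
subset = housing[["TotalBsmtSF"]].copy()

# One .loc call (row label, column label) replaces chained indexing
# such as subset["TotalBsmtSF"][0] = 2.
subset.loc[0, "TotalBsmtSF"] = 2

print(subset["TotalBsmtSF"].tolist())
print(housing["TotalBsmtSF"].tolist())
```

The original frame keeps its value of 856 in the first row, while the copy holds the edited value.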


In [293]:
small_df['TotalSF']=small_df.TotalBsmtSF+small_df['1stFlrSF']+small_df['2ndFlrSF']
small_df.head()

Unnamed: 0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice,TotalSF
0,856,856,854,208500,2566
1,1262,1262,0,181500,2524
2,920,920,866,223500,2706
3,756,961,756,140000,2473
4,1145,1145,1053,250000,3343


In [294]:
# Reassigning the row index
small_df.index=small_df['TotalSF']
small_df.head()

Unnamed: 0_level_0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice,TotalSF
TotalSF,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2566,856,856,854,208500,2566
2524,1262,1262,0,181500,2524
2706,920,920,866,223500,2706
2473,756,961,756,140000,2473
3343,1145,1145,1053,250000,3343


### We can also use if-then statements to create new variables.

Say we want to categorize houses as having a high price per square foot (greater than or equal to 80 dollars per square foot) or a low price per square foot (less than 80 dollars per square foot).

In [295]:
small_df['PricePerSF']=small_df['SalePrice']/small_df['TotalSF']
small_df.head()

Unnamed: 0_level_0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice,TotalSF,PricePerSF
TotalSF,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2566,856,856,854,208500,2566,81.254871
2524,1262,1262,0,181500,2524,71.909667
2706,920,920,866,223500,2706,82.594235
2473,756,961,756,140000,2473,56.611403
3343,1145,1145,1053,250000,3343,74.783129


In [None]:
small_df=small_df.round(2)
small_df.head()

In [None]:
#pd.crosstab(small_df['High_ppsqft'],columns='count')
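
One possible way to build the high/low category described above is `np.where`, which works like a vectorized if-then. This is a sketch, not necessarily the approach used in the lecture: the five rows reuse the `SalePrice` and `TotalSF` values shown earlier, the `High_ppsqft` name is taken from the commented-out cell, and the `"High"`/`"Low"` labels are assumptions.

```python
import numpy as np
import pandas as pd

homes = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "TotalSF": [2566, 2524, 2706, 2473, 3343],
})
homes["PricePerSF"] = homes["SalePrice"] / homes["TotalSF"]

# np.where(condition, value_if_true, value_if_false), applied row by row.
homes["High_ppsqft"] = np.where(homes["PricePerSF"] >= 80, "High", "Low")

print(homes)
print(pd.crosstab(homes["High_ppsqft"], columns="count"))
```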

Now we have added several new columns on our small dataset.

- What does a **high** `PricePerSF` say about a home that the square footage and price alone don't capture as directly?

- What does a **low** `PricePerSF` say about a home that the square footage and price alone don't capture as directly?



### Let's include "and" and "or" conditions in the if-then statements.

Let's identify the 2-story duplex houses.
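
A minimal sketch of combining conditions follows. It uses a made-up four-row frame; the column names `BldgType` and `HouseStyle` come from the housing data, while the `"Duplex"` label and the new column names are assumptions for illustration.

```python
import pandas as pd

homes = pd.DataFrame({
    "BldgType": ["1Fam", "Duplex", "Duplex", "1Fam"],
    "HouseStyle": ["2Story", "2Story", "1Story", "1Story"],
})

# Build each condition as its own boolean Series first.
is_duplex = homes["BldgType"] == "Duplex"
is_two_story = homes["HouseStyle"] == "2Story"

# & is element-wise "and", | is element-wise "or".
# Python's `and` / `or` keywords do not work on whole columns.
homes["TwoStoryDuplex"] = is_duplex & is_two_story
homes["DuplexOrTwoStory"] = is_duplex | is_two_story

print(homes)
```

If the comparisons are written inline, each one needs its own parentheses, for example `(a == 1) & (b == 2)`, because `&` binds more tightly than `==`.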

## Challenge

I hope you can see how we have used existing columns to create a new column on a dataset that says something new about our unit of observation. This is what making new features (columns) on a dataset is all about, and why it's so essential to data science, particularly predictive modeling ("machine learning").

We'll spend the rest of the lecture and assignment today trying to get as good as we can at manipulating (cleaning) and creating new columns on datasets.

# [Objective 02](#work-with-strings) - Work with Strings with Pandas

## Overview

So far we have worked with numeric datatypes (ints and floats) but we haven't worked with any columns containing string values. We can't simply use arithmetic to manipulate string values, so we'll need to learn some more techniques in order to work with this datatype.

## Follow Along

We're going to import a new dataset here to work with. This dataset is from LendingClub and holds information about loans issued in Q4 of 2018. This dataset is a bit messy so it will give us plenty of opportunities to clean up existing columns as well as create new ones.

The `!wget` shell command being used here does much the same thing that your browser does when you type a URL in the address bar. It makes a request for, or "gets", the file at that address. However, in our case the file isn't a webpage, it's a compressed CSV file.

Try copying and pasting the URL from below into your browser. Did it start an automatic download? Any URL like this that starts an automatic download when you navigate to it can be used with the `!wget` command to save the file directly into your notebook's working directory.

### Load a new dataset

In [299]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2020-10-01 19:03:56--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 44.235.65.250, 44.224.91.33, 54.213.174.123
Connecting to resources.lendingclub.com (resources.lendingclub.com)|44.235.65.250|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.4’

LoanStats_2018Q4.cs     [                 <=>]  22.30M  2.15MB/s    in 11s     

2020-10-01 19:04:07 (2.11 MB/s) - ‘LoanStats_2018Q4.csv.zip.4’ saved [23387713]



We need to use the `!unzip` command to extract the CSV from the zip archive.

In [300]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: LoanStats_2018Q4.csv    
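
As an aside, `pd.read_csv` can also read a zip archive directly when the archive holds a single file, inferring the compression from the `.zip` extension. The sketch below builds a tiny made-up archive (`toy_loans.csv.zip`, a hypothetical name) so it runs without a download.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "toy_loans.csv.zip")

    # Write a one-file zip archive containing a small CSV.
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy_loans.csv", "loan_amnt,term\n5000,36 months\n10000,60 months\n")

    # Compression is inferred from the ".zip" file extension.
    toy = pd.read_csv(zip_path)

print(toy)
```

Unzipping first, as done here, still has the advantage that the raw file can be inspected with shell tools.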


We can also use bash/shell commands to look at the raw file using the `!head` and `!tail` commands.

In [301]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [302]:
!tail LoanStats_2018Q4.csv

"","","5600","5600","5600"," 36 months"," 13.56%","190.21","C","C1","","n/a","RENT","15600","Not Verified","Oct-2018","Current","n","","","credit_card","Credit card refinancing","836xx","ID","15.31","0","Aug-2012","0","","97","9","1","5996","34.5%","11","w","2287.33","2287.33","4364.28","4364.28","3312.67","1051.61","0.0","0.0","0.0","Sep-2020","190.21","Oct-2020","Sep-2020","0","","1","Individual","","","","0","0","5996","0","0","0","1","20","0","","0","2","3017","35","17400","1","0","0","3","750","4689","45.5","0","0","20","73","13","13","0","13","","20","","0","3","5","4","4","1","9","10","5","9","0","0","0","0","100","25","1","0","17400","5996","8600","0","","","","","","","","","","","","N","","","","","","","","","","","","","","","N","","","","","",""
"","","23000","23000","23000"," 36 months"," 15.02%","797.53","C","C3","Tax Consultant","10+ years","MORTGAGE","75000","Source Verified","Oct-2018","Charged Off","n","","","debt_consolidation","Debt consolidation","352xx","AL","20.

As we look at the raw file itself, do you see anything that might cause us trouble as we read in the CSV file to a dataframe?

In [303]:
# Read in the CSV
# Because of the additional info at the beginning and end of the file, reading it this way produces a weird-looking df. We need to specify the header row.
import pandas as pd

df = pd.read_csv('LoanStats_2018Q4.csv' )

print(df.shape)
df.head()



(128381, 1)


Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46,Unnamed: 47,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,Unnamed: 69,Unnamed: 70,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77,Unnamed: 78,Unnamed: 79,Unnamed: 80,Unnamed: 81,Unnamed: 82,Unnamed: 83,Unnamed: 84,Unnamed: 85,Unnamed: 86,Unnamed: 87,Unnamed: 88,Unnamed: 89,Unnamed: 90,Unnamed: 91,Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98,Unnamed: 99,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109,Unnamed: 110,Unnamed: 111,Unnamed: 112,Unnamed: 113,Unnamed: 114,Unnamed: 115,Unnamed: 116,Unnamed: 117,Unnamed: 118,Unnamed: 119,Unnamed: 120,Unnamed: 121,Unnamed: 122,Unnamed: 123,Unnamed: 124,Unnamed: 125,Unnamed: 126,Unnamed: 127,Unnamed: 128,Unnamed: 129,Unnamed: 130,Unnamed: 131,Unnamed: 132,Unnamed: 133,Unnamed: 134,Unnamed: 135,Unnamed: 136,Unnamed: 137,Unnamed: 138,Unnamed: 139,Unnamed: 140,Unnamed: 141,Unnamed: 142,Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
,,5000,5000,5000,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13,143909,78,1,3,170,44,6100,1,0,1,4,18122,3283,8.8,0,0,134,86,6,6,0,6,,9,,0,2,5,3,3,19,6,8,5,8,0,0,0,1,100,0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2,0,Apr-2020,Jun-2020,May-2020,2,0,ACTIVE,58.8,3121.41,164.43,N,,,,,,
,,10000,10000,10000,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115,Not Verified,Dec-2018,Current,n,,,other,Other,891xx,NV,33.57,0,Jan-2002,0,38,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38,1,Joint App,91115,18.61,Not Verified,0,0,240589,3,2,1,2,8,34508,88,2,3,1659,64,11000,2,0,5,6,26732,2669,52.3,0,0,158,203,4,4,1,5,38,7,38,1,2,5,3,6,4,6,10,5,9,0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051,Jan-2016,1,1,9,76.4,2,7,0,0,,N,,,,,,,,,,,,,,,N,,,,,,
,,20000,20000,20000,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000,Not Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48,,9,0,25416,29.9%,19,w,0.00,0.00,20215.79243,20215.79,20000.00,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000,11.75,Not Verified,0,0,515779,1,2,0,1,13,46153,71,1,2,9759,39,85100,2,2,0,5,57309,59684,29.9,0,0,171,238,1,1,5,1,,13,48,0,5,5,5,6,5,5,9,5,9,0,0,0,1,94.7,20,0,0,622183,71569,85100,74833,43287,Aug-1998,0,3,10,29.7,2,7,0,0,,N,,,,,,,,,,,,,,,N,,,,,,
,,6500,6500,6500,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500,Source Verified,Dec-2018,Late (16-30 days),n,,,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61,1,Individual,,,,0,0,40223,4,12,2,2,5,33482,97,7,12,3662,79,16200,2,0,4,14,1749,7694,42.2,0,0,88,72,1,1,0,3,,5,61,2,6,10,6,6,14,11,16,10,24,0,0,0,9,93.3,0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [304]:
df.tail()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46,Unnamed: 47,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,Unnamed: 69,Unnamed: 70,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77,Unnamed: 78,Unnamed: 79,Unnamed: 80,Unnamed: 81,Unnamed: 82,Unnamed: 83,Unnamed: 84,Unnamed: 85,Unnamed: 86,Unnamed: 87,Unnamed: 88,Unnamed: 89,Unnamed: 90,Unnamed: 91,Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98,Unnamed: 99,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109,Unnamed: 110,Unnamed: 111,Unnamed: 112,Unnamed: 113,Unnamed: 114,Unnamed: 115,Unnamed: 116,Unnamed: 117,Unnamed: 118,Unnamed: 119,Unnamed: 120,Unnamed: 121,Unnamed: 122,Unnamed: 123,Unnamed: 124,Unnamed: 125,Unnamed: 126,Unnamed: 127,Unnamed: 128,Unnamed: 129,Unnamed: 130,Unnamed: 131,Unnamed: 132,Unnamed: 133,Unnamed: 134,Unnamed: 135,Unnamed: 136,Unnamed: 137,Unnamed: 138,Unnamed: 139,Unnamed: 140,Unnamed: 141,Unnamed: 142,Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
,,5000.0,5000.0,5000.0,36 months,13.56%,169.83,C,C1,Payoff Clerk,10+ years,MORTGAGE,35360.0,Not Verified,Oct-2018,Current,n,,,debt_consolidation,Debt consolidation,381xx,TN,11.3,1.0,Jun-2006,0.0,21.0,,9.0,0.0,2597.0,27.3%,15.0,f,2042.26,2042.26,3902.32,3902.32,2957.74,944.58,0.0,0.0,0.0,Sep-2020,169.83,Oct-2020,Sep-2020,0.0,,1.0,Individual,,,,0.0,1413.0,69785.0,0.0,2.0,0.0,1.0,16.0,2379.0,40.0,3.0,4.0,1826.0,32.0,9500.0,0.0,0.0,1.0,5.0,8723.0,1174.0,60.9,0.0,0.0,147.0,85.0,9.0,9.0,2.0,10.0,21.0,9.0,21.0,0.0,1.0,3.0,2.0,2.0,6.0,6.0,7.0,3.0,9.0,0.0,0.0,0.0,3.0,92.9,50.0,0.0,0.0,93908.0,4976.0,3000.0,6028.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
,,10000.0,10000.0,9750.0,36 months,11.06%,327.68,B,B3,,,RENT,44400.0,Source Verified,Oct-2018,Current,n,,,credit_card,Credit card refinancing,980xx,WA,11.78,0.0,Oct-2008,2.0,40.0,,15.0,0.0,6269.0,13.1%,25.0,f,3996.89,3896.97,7521.28,7333.25,6003.11,1518.17,0.0,0.0,0.0,Sep-2020,327.68,Oct-2020,Sep-2020,0.0,53.0,1.0,Individual,,,,0.0,520.0,16440.0,3.0,1.0,1.0,1.0,2.0,10171.0,100.0,2.0,5.0,404.0,28.0,47700.0,0.0,3.0,5.0,6.0,1265.0,20037.0,2.3,0.0,0.0,61.0,119.0,1.0,1.0,0.0,1.0,,1.0,40.0,1.0,2.0,4.0,6.0,8.0,3.0,14.0,22.0,4.0,15.0,0.0,0.0,0.0,3.0,92.0,0.0,0.0,0.0,57871.0,16440.0,20500.0,10171.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
,,10000.0,10000.0,10000.0,36 months,16.91%,356.08,C,C5,Key Accounts Manager,2 years,RENT,80000.0,Not Verified,Oct-2018,Current,n,,,other,Other,021xx,MA,17.72,1.0,Sep-2006,0.0,14.0,,17.0,0.0,1942.0,30.8%,31.0,w,4202.86,4202.86,8180.45,8180.45,5797.14,2383.31,0.0,0.0,0.0,Sep-2020,356.08,Oct-2020,Sep-2020,0.0,25.0,1.0,Individual,,,,0.0,0.0,59194.0,0.0,15.0,1.0,1.0,12.0,57252.0,85.0,0.0,0.0,1942.0,80.0,6300.0,0.0,5.0,0.0,1.0,3482.0,2058.0,48.5,0.0,0.0,144.0,142.0,40.0,12.0,0.0,131.0,30.0,,30.0,3.0,1.0,1.0,1.0,5.0,22.0,2.0,9.0,1.0,17.0,0.0,0.0,0.0,1.0,74.2,0.0,0.0,0.0,73669.0,59194.0,4000.0,67369.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
Total amount funded in policy code 1: 2050322100,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total amount funded in policy code 2: 819980372,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
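
The note line at the top and the two `Total amount funded...` rows at the bottom could both be skipped at read time. The sketch below shows one way to do that on a tiny made-up stand-in for the file (the `raw` text and its numbers are invented); `header=1` and `skipfooter=2` are real `read_csv` parameters, and `skipfooter` needs the Python parsing engine.

```python
import io

import pandas as pd

# A small stand-in for the raw file: one note line, a header row,
# two data rows, and two summary lines at the end.
raw = (
    "Notes offered by Prospectus (example note line)\n"
    '"loan_amnt","term","int_rate"\n'
    '"5000"," 36 months"," 11.31%"\n'
    '"10000"," 36 months"," 18.94%"\n'
    "Total amount funded in policy code 1: 15000\n"
    "Total amount funded in policy code 2: 0\n"
)

loans = pd.read_csv(
    io.StringIO(raw),
    header=1,         # the second line (index 1) holds the column names
    skipfooter=2,     # drop the two summary lines at the bottom
    engine="python",  # skipfooter is not supported by the default C engine
)

print(loans.shape)
print(loans)
```

For the real file the same two arguments would apply only if the footer is exactly two lines long, which is worth confirming with `!tail` first.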


In [305]:
df = pd.read_csv('LoanStats_2018Q4.csv', header=1 )

print(df.shape)
df.head()



(128380, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,5000.0,5000.0,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,9.66,0.0,Oct-2007,0.0,,,8.0,0.0,1070.0,17.5%,27.0,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0.0,,1.0,Individual,,,,0.0,550.0,144979.0,1.0,2.0,0.0,1.0,13.0,143909.0,78.0,1.0,3.0,170.0,44.0,6100.0,1.0,0.0,1.0,4.0,18122.0,3283.0,8.8,0.0,0.0,134.0,86.0,6.0,6.0,0.0,6.0,,9.0,,0.0,2.0,5.0,3.0,3.0,19.0,6.0,8.0,5.0,8.0,0.0,0.0,0.0,1.0,100.0,0.0,0.0,0.0,125853.0,144979.0,3600.0,119753.0,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,
1,,,10000.0,10000.0,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,,,other,Other,891xx,NV,33.57,0.0,Jan-2002,0.0,38.0,,9.0,0.0,5837.0,53.1%,15.0,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0.0,38.0,1.0,Joint App,91115.0,18.61,Not Verified,0.0,0.0,240589.0,3.0,2.0,1.0,2.0,8.0,34508.0,88.0,2.0,3.0,1659.0,64.0,11000.0,2.0,0.0,5.0,6.0,26732.0,2669.0,52.3,0.0,0.0,158.0,203.0,4.0,4.0,1.0,5.0,38.0,7.0,38.0,1.0,2.0,5.0,3.0,6.0,4.0,6.0,10.0,5.0,9.0,0.0,0.0,0.0,4.0,93.3,33.3,0.0,0.0,253346.0,40345.0,5600.0,41060.0,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,20000.0,20000.0,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,982xx,WA,18.92,0.0,Feb-1999,0.0,48.0,,9.0,0.0,25416.0,29.9%,19.0,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0.0,,1.0,Joint App,190000.0,11.75,Not Verified,0.0,0.0,515779.0,1.0,2.0,0.0,1.0,13.0,46153.0,71.0,1.0,2.0,9759.0,39.0,85100.0,2.0,2.0,0.0,5.0,57309.0,59684.0,29.9,0.0,0.0,171.0,238.0,1.0,1.0,5.0,1.0,,13.0,48.0,0.0,5.0,5.0,5.0,6.0,5.0,5.0,9.0,5.0,9.0,0.0,0.0,0.0,1.0,94.7,20.0,0.0,0.0,622183.0,71569.0,85100.0,74833.0,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,6500.0,6500.0,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,,,debt_consolidation,Debt consolidation,352xx,AL,21.01,0.0,Aug-2011,1.0,61.0,,24.0,0.0,6741.0,41.6%,30.0,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0.0,61.0,1.0,Individual,,,,0.0,0.0,40223.0,4.0,12.0,2.0,2.0,5.0,33482.0,97.0,7.0,12.0,3662.0,79.0,16200.0,2.0,0.0,4.0,14.0,1749.0,7694.0,42.2,0.0,0.0,88.0,72.0,1.0,1.0,0.0,3.0,,5.0,61.0,2.0,6.0,10.0,6.0,6.0,14.0,11.0,16.0,10.0,24.0,0.0,0.0,0.0,9.0,93.3,0.0,0.0,0.0,50845.0,40223.0,13300.0,34645.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,25000.0,25000.0,25000.0,60 months,12.98%,568.58,B,B5,Tire builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,,,debt_consolidation,Debt consolidation,356xx,AL,36.67,1.0,Oct-1993,1.0,10.0,,14.0,0.0,26630.0,53.3%,27.0,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0.0,40.0,1.0,Joint App,140000.0,23.08,Not Verified,0.0,56.0,206980.0,1.0,7.0,1.0,3.0,3.0,69513.0,81.0,1.0,1.0,17292.0,68.0,50000.0,1.0,3.0,1.0,4.0,14784.0,18581.0,57.7,0.0,0.0,186.0,185.0,9.0,3.0,4.0,9.0,10.0,3.0,10.0,1.0,4.0,5.0,4.0,5.0,13.0,6.0,10.0,5.0,14.0,0.0,0.0,0.0,2.0,85.2,0.0,0.0,0.0,275691.0,96143.0,43900.0,99691.0,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,


In [306]:
df.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
128375,,,5000.0,5000.0,5000.0,36 months,13.56%,169.83,C,C1,Payoff Clerk,10+ years,MORTGAGE,35360.0,Not Verified,Oct-2018,Current,n,,,debt_consolidation,Debt consolidation,381xx,TN,11.3,1.0,Jun-2006,0.0,21.0,,9.0,0.0,2597.0,27.3%,15.0,f,2042.26,2042.26,3902.32,3902.32,2957.74,944.58,0.0,0.0,0.0,Sep-2020,169.83,Oct-2020,Sep-2020,0.0,,1.0,Individual,,,,0.0,1413.0,69785.0,0.0,2.0,0.0,1.0,16.0,2379.0,40.0,3.0,4.0,1826.0,32.0,9500.0,0.0,0.0,1.0,5.0,8723.0,1174.0,60.9,0.0,0.0,147.0,85.0,9.0,9.0,2.0,10.0,21.0,9.0,21.0,0.0,1.0,3.0,2.0,2.0,6.0,6.0,7.0,3.0,9.0,0.0,0.0,0.0,3.0,92.9,50.0,0.0,0.0,93908.0,4976.0,3000.0,6028.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128376,,,10000.0,10000.0,9750.0,36 months,11.06%,327.68,B,B3,,,RENT,44400.0,Source Verified,Oct-2018,Current,n,,,credit_card,Credit card refinancing,980xx,WA,11.78,0.0,Oct-2008,2.0,40.0,,15.0,0.0,6269.0,13.1%,25.0,f,3996.89,3896.97,7521.28,7333.25,6003.11,1518.17,0.0,0.0,0.0,Sep-2020,327.68,Oct-2020,Sep-2020,0.0,53.0,1.0,Individual,,,,0.0,520.0,16440.0,3.0,1.0,1.0,1.0,2.0,10171.0,100.0,2.0,5.0,404.0,28.0,47700.0,0.0,3.0,5.0,6.0,1265.0,20037.0,2.3,0.0,0.0,61.0,119.0,1.0,1.0,0.0,1.0,,1.0,40.0,1.0,2.0,4.0,6.0,8.0,3.0,14.0,22.0,4.0,15.0,0.0,0.0,0.0,3.0,92.0,0.0,0.0,0.0,57871.0,16440.0,20500.0,10171.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128377,,,10000.0,10000.0,10000.0,36 months,16.91%,356.08,C,C5,Key Accounts Manager,2 years,RENT,80000.0,Not Verified,Oct-2018,Current,n,,,other,Other,021xx,MA,17.72,1.0,Sep-2006,0.0,14.0,,17.0,0.0,1942.0,30.8%,31.0,w,4202.86,4202.86,8180.45,8180.45,5797.14,2383.31,0.0,0.0,0.0,Sep-2020,356.08,Oct-2020,Sep-2020,0.0,25.0,1.0,Individual,,,,0.0,0.0,59194.0,0.0,15.0,1.0,1.0,12.0,57252.0,85.0,0.0,0.0,1942.0,80.0,6300.0,0.0,5.0,0.0,1.0,3482.0,2058.0,48.5,0.0,0.0,144.0,142.0,40.0,12.0,0.0,131.0,30.0,,30.0,3.0,1.0,1.0,1.0,5.0,22.0,2.0,9.0,1.0,17.0,0.0,0.0,0.0,1.0,74.2,0.0,0.0,0.0,73669.0,59194.0,4000.0,67369.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128378,Total amount funded in policy code 1: 2050322100,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
128379,Total amount funded in policy code 2: 819980372,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


The extra rows at the top and bottom of the file have done two things:

1) The extra row at the top is being read as the header row, so the real column names are not being used as the column headers.

2) The bottom two rows have been read into the 'id' column and are causing every other column to have at least two `NaN` values in it.
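To make the header problem concrete, here is a minimal sketch on a tiny in-memory CSV. The contents of `raw` are invented for illustration (including the padded commas on the note line) and are not the real LendingClub file. It shows the naive read picking up the note line as the header, and two ways of skipping that line that should give the same result here.

```python
import io
import pandas as pd

# Hypothetical miniature of the file layout: one note line, then the real header, then data.
raw = (
    "Notes offered by Prospectus,,\n"
    "loan_amnt,term,int_rate\n"
    "5000,36 months,11.31%\n"
    "10000,36 months,18.94%\n"
)

# Naive read: the note line becomes the header and the real header becomes a data row.
naive = pd.read_csv(io.StringIO(raw))
print(naive.columns.tolist())
print(naive.iloc[0].tolist())

# Fix 1: skip the first line so the second line is used as the header.
fixed = pd.read_csv(io.StringIO(raw), skiprows=1)

# Fix 2: tell pandas that the header lives on line index 1.
alt = pd.read_csv(io.StringIO(raw), header=1)

print(fixed.columns.tolist())
print(fixed.equals(alt))
```

On a file this small either option is fine; `skiprows` simply reads more naturally when the lines being dropped are not part of the table at all.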

In [307]:
# We can fix the header problem by using the 'skiprows' parameter,
# which is another way to skip the extra first line of the file
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1)
df.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,5000.0,5000.0,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,9.66,0.0,Oct-2007,0.0,,,8.0,0.0,1070.0,17.5%,27.0,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0.0,,1.0,Individual,,,,0.0,550.0,144979.0,1.0,2.0,0.0,1.0,13.0,143909.0,78.0,1.0,3.0,170.0,44.0,6100.0,1.0,0.0,1.0,4.0,18122.0,3283.0,8.8,0.0,0.0,134.0,86.0,6.0,6.0,0.0,6.0,,9.0,,0.0,2.0,5.0,3.0,3.0,19.0,6.0,8.0,5.0,8.0,0.0,0.0,0.0,1.0,100.0,0.0,0.0,0.0,125853.0,144979.0,3600.0,119753.0,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,
1,,,10000.0,10000.0,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,,,other,Other,891xx,NV,33.57,0.0,Jan-2002,0.0,38.0,,9.0,0.0,5837.0,53.1%,15.0,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0.0,38.0,1.0,Joint App,91115.0,18.61,Not Verified,0.0,0.0,240589.0,3.0,2.0,1.0,2.0,8.0,34508.0,88.0,2.0,3.0,1659.0,64.0,11000.0,2.0,0.0,5.0,6.0,26732.0,2669.0,52.3,0.0,0.0,158.0,203.0,4.0,4.0,1.0,5.0,38.0,7.0,38.0,1.0,2.0,5.0,3.0,6.0,4.0,6.0,10.0,5.0,9.0,0.0,0.0,0.0,4.0,93.3,33.3,0.0,0.0,253346.0,40345.0,5600.0,41060.0,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,20000.0,20000.0,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,982xx,WA,18.92,0.0,Feb-1999,0.0,48.0,,9.0,0.0,25416.0,29.9%,19.0,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0.0,,1.0,Joint App,190000.0,11.75,Not Verified,0.0,0.0,515779.0,1.0,2.0,0.0,1.0,13.0,46153.0,71.0,1.0,2.0,9759.0,39.0,85100.0,2.0,2.0,0.0,5.0,57309.0,59684.0,29.9,0.0,0.0,171.0,238.0,1.0,1.0,5.0,1.0,,13.0,48.0,0.0,5.0,5.0,5.0,6.0,5.0,5.0,9.0,5.0,9.0,0.0,0.0,0.0,1.0,94.7,20.0,0.0,0.0,622183.0,71569.0,85100.0,74833.0,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,6500.0,6500.0,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,,,debt_consolidation,Debt consolidation,352xx,AL,21.01,0.0,Aug-2011,1.0,61.0,,24.0,0.0,6741.0,41.6%,30.0,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0.0,61.0,1.0,Individual,,,,0.0,0.0,40223.0,4.0,12.0,2.0,2.0,5.0,33482.0,97.0,7.0,12.0,3662.0,79.0,16200.0,2.0,0.0,4.0,14.0,1749.0,7694.0,42.2,0.0,0.0,88.0,72.0,1.0,1.0,0.0,3.0,,5.0,61.0,2.0,6.0,10.0,6.0,6.0,14.0,11.0,16.0,10.0,24.0,0.0,0.0,0.0,9.0,93.3,0.0,0.0,0.0,50845.0,40223.0,13300.0,34645.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,25000.0,25000.0,25000.0,60 months,12.98%,568.58,B,B5,Tire builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,,,debt_consolidation,Debt consolidation,356xx,AL,36.67,1.0,Oct-1993,1.0,10.0,,14.0,0.0,26630.0,53.3%,27.0,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0.0,40.0,1.0,Joint App,140000.0,23.08,Not Verified,0.0,56.0,206980.0,1.0,7.0,1.0,3.0,3.0,69513.0,81.0,1.0,1.0,17292.0,68.0,50000.0,1.0,3.0,1.0,4.0,14784.0,18581.0,57.7,0.0,0.0,186.0,185.0,9.0,3.0,4.0,9.0,10.0,3.0,10.0,1.0,4.0,5.0,4.0,5.0,13.0,6.0,10.0,5.0,14.0,0.0,0.0,0.0,2.0,85.2,0.0,0.0,0.0,275691.0,96143.0,43900.0,99691.0,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,


Let's look at the NaN values of each column so that you can see the problem that the extra rows at the bottom of the file are creating for us.
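As a small warm-up before running this on the full dataset, here is a sketch of the same null-counting pattern on a toy dataframe. The `toy` frame and its values are made up for illustration; only the `isnull().sum()` and `sort_values()` calls mirror what we use on the loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a different number of missing values in each column.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 10000.0, np.nan],
    "emp_title": ["Teacher", None, None],
    "url": [np.nan, np.nan, np.nan],
})

# isnull() gives a True/False frame; sum() counts the True values per column.
null_counts = toy.isnull().sum()

# sort_values() orders the counts from least to greatest.
sorted_counts = null_counts.sort_values()
print(sorted_counts)
```

A column whose count equals the number of rows (like `url` here) is completely empty, which is the pattern to look for in the real output.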

In [308]:
# Sum null values by column (we sort from least to greatest in the next cell)
df.isnull().sum()

id                                            128378
member_id                                     128380
loan_amnt                                          2
funded_amnt                                        2
funded_amnt_inv                                    2
term                                               2
int_rate                                           2
installment                                        2
grade                                              2
sub_grade                                          2
emp_title                                      20945
emp_length                                     11703
home_ownership                                     2
annual_inc                                         2
verification_status                                2
issue_d                                            2
loan_status                                        2
pymnt_plan                                         2
url                                           

In [309]:
df.isnull().sum().sort_values()

inq_fi                                             2
mo_sin_old_rev_tl_op                               2
delinq_amnt                                        2
chargeoff_within_12_mths                           2
acc_open_past_24mths                               2
inq_last_12m                                       2
total_cu_tl                                        2
total_rev_hi_lim                                   2
open_rv_24m                                        2
open_rv_12m                                        2
total_bal_il                                       2
open_il_24m                                        2
open_il_12m                                        2
open_act_il                                        2
open_acc_6m                                        2
tot_cur_bal                                        2
tot_coll_amt                                       2
acc_now_delinq                                     2
application_type                              

In [310]:
# Address the extra NaNs in each column by skipping the footer as well.
# This call produces a warning: pandas defaults to its C parser engine, which does not
#   support skipfooter, so it falls back to the Python engine.
#   You can ignore the warning, or pass engine='python' explicitly.
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2)
print(df.shape)
df.head()



(128378, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,
1,,,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,,,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,,,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,,,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,


In [311]:
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')
# No warning now
print(df.shape)
df.head()

(128378, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_d
ate,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,
1,,,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,,,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,,,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,,,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,
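The footer fix can be sketched on a tiny in-memory CSV as well. The `raw` text below is invented for illustration and only imitates the shape of the real file (one note line on top, two summary lines at the bottom). It compares a read that keeps the footer with one that drops it using `skipfooter` and `engine='python'`.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: note line, header, two loans, two summary lines.
raw = (
    "Notes offered by Prospectus,,\n"
    "loan_amnt,term,int_rate\n"
    "5000,36 months,11.31%\n"
    "10000,36 months,18.94%\n"
    "Total amount funded in policy code 1: 15000,,\n"
    "Total amount funded in policy code 2: 0,,\n"
)

# Footer kept: the summary text lands in the first column, the rest become NaN.
with_footer = pd.read_csv(io.StringIO(raw), skiprows=1)
print(with_footer.isnull().sum())

# Footer skipped: only the two real loans remain, and loan_amnt is numeric again.
clean = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine="python")
print(clean.shape)
print(clean.dtypes)
```

Notice that the footer text also forces the first column to hold strings, so skipping it helps with data types as well as with the stray `NaN` values.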


In [None]:
df.tail()

For good measure, we'll also drop some columns that are made up completely of NaN values.

Why might LendingClub have included columns in their dataset that are 100% blank?

In [312]:
# By leaving the columns blank, they protect user identity.

In [313]:
df = df.drop(['id', 'member_id', 'desc', 'url'], axis=1)
# Alternatively, df.drop(['id', 'member_id', 'desc', 'url'], axis=1, inplace=True)
# modifies df directly, so there is no need to reassign with df =
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,ha
rdship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,
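If we would rather not type out the empty column names by hand, pandas can find them for us. The sketch below uses a made-up `toy` frame, not the loan data, and shows `dropna(axis=1, how='all')` as one possible alternative to naming the columns. Keep in mind that on the real dataset this would drop every fully empty column, which may be more than the four listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'id' and 'url' are entirely NaN, the others are not.
toy = pd.DataFrame({
    "id": [np.nan, np.nan, np.nan],
    "loan_amnt": [5000, 10000, 20000],
    "emp_title": ["Shipping", None, "Teacher"],
    "url": [np.nan, np.nan, np.nan],
})

# List the columns where every single value is missing.
all_nan_cols = toy.columns[toy.isnull().all()].tolist()
print(all_nan_cols)

# Drop only the columns that are 100% NaN; partially filled columns are kept.
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```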


### Clean up the `int_rate` column

When we're preparing a dataset for a machine learning model, we typically don't want to leave any string values in our dataset, because it's hard to do math on words.

Specifically, we have a column that represents a numeric value but currently doesn't have a numeric datatype. Let's look at the first 10 values of the `int_rate` column:

In [314]:
# Look at the first 10 values of the int_rate column
df.int_rate.head(10)

0     11.31%
1     18.94%
2      7.56%
3     11.80%
4     12.98%
5     10.33%
6     19.92%
7     23.40%
8     15.02%
9     12.98%
Name: int_rate, dtype: object

In [315]:
# Look at a specific value from the int_rate column

df.int_rate[0]

' 11.31%'

Problems that we need to address with this column:

- String column that should be numeric
- Percent sign `%` included with the number
- Leading space at the beginning of the string

However, we're not going to try and write exactly the right code to fix this column in one go. We're going to methodically build up to the code that will help us address these problems.


In [316]:
# Let's start with just fixing a single string.
# If we can fix one, we can usually fix all of them


In [317]:
#Remove extra white space
' 14.47%'.strip()



'14.47%'

In [318]:
# .strip() with no argument removes only whitespace, so it leaves the % in place.
# Pass '%' to strip the percent sign instead.

'14.47%'.strip('%')

'14.47'

In [319]:
#You can "chain" two strip functions to remove both the spaces and the % sign
' 14.47%'.strip('%').strip()


'14.47'

In [320]:
#another way to do it is to use slicing
' 14.47%'[1:-1]

'14.47'
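Slicing only gives the right answer when every value has exactly one leading space and one trailing `%`. A minimal sketch, using made-up sample strings rather than values from the dataset, shows where slicing and `strip` diverge:

```python
# Hypothetical sample values; only the first matches the ' 14.47%' layout above.
samples = [' 14.47%', '14.47%', '  7.56%']

# Slicing blindly drops the first and last character of each string.
sliced = [s[1:-1] for s in samples]

# strip('%') removes the percent sign, then strip() removes any whitespace.
stripped = [s.strip('%').strip() for s in samples]

print(sliced)    # the second value silently loses its leading digit
print(stripped)
```

Because slicing can corrupt a value without raising an error, the chained `strip` calls are probably the safer choice when the formatting is not guaranteed.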

In [321]:
# "Cast" the string value to a float
# "cast" -> Change something's data type
# This is the line of code that we're after! ->
float('14.47')

14.47

### Write a function to make our solution reusable!

In [322]:
# Write a function that can do what we have written above to any
# string that is passed to it.
def intRateTofloat(rate_string):
  # Avoid naming the parameter `str`, which would shadow the built-in type
  return float(rate_string.strip('%').strip())

In [323]:
# Test out our function by calling it on our example

x=intRateTofloat(' 14.47%')
print(x)

14.47


In [324]:
# is the data type correct?
type(x)


float

### Apply our solution to every cell in a column

In [325]:
# pass in *only* the name of the function, don't call it. 
# This works because we know the function works on every item in the column
# so I can simply "apply" it to the entire column

df['int_rate'].apply(intRateTofloat)

0         11.31
1         18.94
2          7.56
3         11.80
4         12.98
          ...  
128373    15.02
128374    15.02
128375    13.56
128376    11.06
128377    16.91
Name: int_rate, Length: 128378, dtype: float64
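As an alternative to `.apply()`, pandas exposes string methods on a whole column through the `.str` accessor. The sketch below is one possible equivalent, run on a small stand-in Series (the values are illustrative, not read from the loan file):

```python
import pandas as pd

# Small stand-in for df['int_rate']; the values mimic the ' 11.31%' layout.
int_rate = pd.Series([' 11.31%', ' 18.94%', '  7.56%'])

# .str applies each string method to every element of the Series,
# then astype(float) casts the cleaned strings to numbers.
cleaned = int_rate.str.strip().str.rstrip('%').astype(float)
print(cleaned)
```

Both approaches should give the same result here; `.apply()` is more flexible when the cleaning logic needs ordinary Python control flow.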

In [326]:
# What type of data is held in our new column?
df['new_int_rate']=df['int_rate'].apply(intRateTofloat)
# Look at the datatypes of the last 5 columns
df.dtypes.tail()

settlement_date           object
settlement_amount        float64
settlement_percentage    float64
settlement_term          float64
new_int_rate             float64
dtype: object

In [None]:
df.new_int_rate.mean()

## Challenge

We can create a new column with our cleaned values or overwrite the original, whatever we think best suits our needs. On your assignment you will take the same approach in trying to methodically build up the complexity of your code until you have a few lines that will work for any cell in a column. At that point you'll contain all of that functionality in a reusable function block and then use the `.apply()` function to... well... apply those changes to an entire column.

# [Objective 03](#pandas-apply) - Modify and Create Columns using `.apply()`



## Overview

We've already seen one example of using the `.apply()` function to clean up a column. Let's see if we can do it again, this time on a slightly more complicated use case.

Remember, the goal here is to write a function that will work correctly on any **individual** cell of a specific column. Then we can reuse that function on those individual cells of a dataframe column via the `.apply()` function.

Let's clean up the `emp_title` ("Employment Title") column!

## Follow Along

First we'll try and diagnose how bad the problem is and what improvements we might be able to make.

In [327]:
# Look at the top 20 employment titles

df['emp_title'].value_counts(dropna=False).head(20)

NaN                   20943
Teacher                2089
Manager                1773
Registered Nurse        950
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

In [328]:
# How many different unique employment titles are there currently?
df['emp_title'].nunique()


43881

In [329]:
# How often is the employment_title null?
print(df['emp_title'].isnull().sum())
df['emp_title'].isnull()

20943


0         False
1         False
2         False
3         False
4         False
          ...  
128373    False
128374    False
128375    False
128376     True
128377    False
Name: emp_title, Length: 128378, dtype: bool

What are some possible reasons as to why a person's employment title may have not been provided?

In [330]:
# Create some examples that represent the cases that we want to clean up
import numpy as np
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
examples=['manager',' Operating Manager','Regitered Nurse', np.nan, 'OWNER']


In [331]:
# Write a function to clean up these use cases and increase uniformity.

def clean_emp_title(title):
  if isinstance(title, str):
    print(title.title())
  else:
    print('NaN')
clean_emp_title('OWNER')

Owner


In [332]:
# Using a for loop
for i in examples:
  clean_emp_title(i)

Manager
 Operating Manager
Regitered Nurse
NaN
Owner


In [333]:
  # print things that you want to force to be shown in the notebook output
 

In [334]:
# In a notebook, the value of the last expression in a cell is displayed automatically.
# Now return the cleaned value instead of printing it.
def clean_emp_title(title):
  if isinstance(title, str):
    return title.title()
  else:
    return 'Unknown'
for i in examples:
  print(clean_emp_title(i))

Manager
 Operating Manager
Regitered Nurse
Unknown
Owner


In [335]:
# Now also remove whitespace from the start and end of each title.
# Chained method calls execute from left to right.
def clean_emp_title(title):
  if isinstance(title, str):
    return title.strip().title()
  else:
    return 'Unknown'
for i in examples:
  print(clean_emp_title(i))

Manager
Operating Manager
Regitered Nurse
Unknown
Owner


In [336]:
# if the last thing to be returned gets saved to a variable
# then it won't get printed out


In [337]:
# if I include just a variable name, it will get printed out



In [338]:
# Apply the function to every value in the emp_title column

df['emp_title_cleaned']=df['emp_title'].apply(clean_emp_title)
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,ha
rdship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder


In [339]:
df.tail()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,ha
rdship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned
128373,23000,23000,23000.0,36 months,15.02%,797.53,C,C3,Tax Consultant,10+ years,MORTGAGE,75000.0,Source Verified,Oct-2018,Charged Off,n,debt_consolidation,Debt consolidation,352xx,AL,20.95,1,Aug-1985,2,22.0,,12,0,22465,43.6%,28,w,0.0,0.0,1547.08,1547.08,1025.67,521.41,0.0,0.0,0.0,Dec-2018,797.53,,Nov-2018,0,,1,Individual,,,,0,0,259658,4,2,3,3,6.0,18149,86.0,4,6,12843,56.0,51500,2,2,5,11,21638.0,26321.0,44.1,0,0,12.0,397,4,4,6,5.0,22.0,4.0,22.0,0,4,5,7,14,3,9,19,5,12,0.0,0,0,7,96.4,14.3,0,0,296500,40614,47100,21000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,15.02,Tax Consultant
128374,10000,10000,10000.0,36 months,15.02%,346.76,C,C3,security guard,5 years,MORTGAGE,38000.0,Not Verified,Oct-2018,Current,n,debt_consolidation,Debt consolidation,443xx,OH,13.16,3,Jul-1982,0,6.0,,11,0,5634,37.1%,16,w,4136.11,4136.11,7967.14,7967.14,5863.89,2103.25,0.0,0.0,0.0,Sep-2020,346.76,Oct-2020,Sep-2020,0,,1,Individual,,,,0,155,77424,0,1,0,0,34.0,200,10.0,1,1,1866,42.0,15200,2,0,0,2,7039.0,4537.0,50.1,0,0,34.0,434,11,11,3,11.0,6.0,17.0,6.0,0,3,5,5,6,1,8,11,5,11,0.0,0,0,1,73.3,40.0,0,0,91403,9323,9100,2000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,15.02,Security Guard
128375,5000,5000,5000.0,36 months,13.56%,169.83,C,C1,Payoff Clerk,10+ years,MORTGAGE,35360.0,Not Verified,Oct-2018,Current,n,debt_consolidation,Debt consolidation,381xx,TN,11.3,1,Jun-2006,0,21.0,,9,0,2597,27.3%,15,f,2042.26,2042.26,3902.32,3902.32,2957.74,944.58,0.0,0.0,0.0,Sep-2020,169.83,Oct-2020,Sep-2020,0,,1,Individual,,,,0,1413,69785,0,2,0,1,16.0,2379,40.0,3,4,1826,32.0,9500,0,0,1,5,8723.0,1174.0,60.9,0,0,147.0,85,9,9,2,10.0,21.0,9.0,21.0,0,1,3,2,2,6,6,7,3,9,0.0,0,0,3,92.9,50.0,0,0,93908,4976,3000,6028,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,13.56,Payoff Clerk
128376,10000,10000,9750.0,36 months,11.06%,327.68,B,B3,,,RENT,44400.0,Source Verified,Oct-2018,Current,n,credit_card,Credit card refinancing,980xx,WA,11.78,0,Oct-2008,2,40.0,,15,0,6269,13.1%,25,f,3996.89,3896.97,7521.28,7333.25,6003.11,1518.17,0.0,0.0,0.0,Sep-2020,327.68,Oct-2020,Sep-2020,0,53.0,1,Individual,,,,0,520,16440,3,1,1,1,2.0,10171,100.0,2,5,404,28.0,47700,0,3,5,6,1265.0,20037.0,2.3,0,0,61.0,119,1,1,0,1.0,,1.0,40.0,1,2,4,6,8,3,14,22,4,15,0.0,0,0,3,92.0,0.0,0,0,57871,16440,20500,10171,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.06,Unknown
128377,10000,10000,10000.0,36 months,16.91%,356.08,C,C5,Key Accounts Manager,2 years,RENT,80000.0,Not Verified,Oct-2018,Current,n,other,Other,021xx,MA,17.72,1,Sep-2006,0,14.0,,17,0,1942,30.8%,31,w,4202.86,4202.86,8180.45,8180.45,5797.14,2383.31,0.0,0.0,0.0,Sep-2020,356.08,Oct-2020,Sep-2020,0,25.0,1,Individual,,,,0,0,59194,0,15,1,1,12.0,57252,85.0,0,0,1942,80.0,6300,0,5,0,1,3482.0,2058.0,48.5,0,0,144.0,142,40,12,0,131.0,30.0,,30.0,3,1,1,1,5,22,2,9,1,17,0.0,0,0,1,74.2,0.0,0,0,73669,59194,4000,67369,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,16.91,Key Accounts Manager


In [340]:
#Top 20 titles
df['emp_title_cleaned'].value_counts().head(20)

Unknown               20943
Teacher                2556
Manager                2395
Registered Nurse       1416
Driver                 1258
Supervisor             1160
Truck Driver            920
Rn                      834
Office Manager          805
Sales                   803
General Manager         791
Project Manager         720
Owner                   625
Director                523
Operations Manager      518
Sales Manager           500
Police Officer          440
Nurse                   425
Technician              419
Engineer                412
Name: emp_title_cleaned, dtype: int64

In [341]:
# Number of unique titles
df['emp_title_cleaned'].nunique()

34895

In [342]:
#Top 20 titles
df['emp_title_cleaned'].value_counts(dropna=False).head(20)

Unknown               20943
Teacher                2556
Manager                2395
Registered Nurse       1416
Driver                 1258
Supervisor             1160
Truck Driver            920
Rn                      834
Office Manager          805
Sales                   803
General Manager         791
Project Manager         720
Owner                   625
Director                523
Operations Manager      518
Sales Manager           500
Police Officer          440
Nurse                   425
Technician              419
Engineer                412
Name: emp_title_cleaned, dtype: int64

In [343]:
df['emp_title_cleaned'].nunique()

34895

In [344]:
# list comprehensions can combine function calls and for loops over lists
# into one succinct and fairly readable single line of code.

[clean_emp_title(i) for i in examples]

['Manager', 'Operating Manager', 'Regitered Nurse', 'Unknown', 'Owner']

In [345]:
# We have a function that works as expected. Let's apply it to our column.
# This time we'll overwrite the original column
df['emp_title']=df['emp_title'].apply(clean_emp_title)
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,ha
rdship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire Builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder


In [346]:
# Using a for loop instead of .apply()
empt_title=[]
for i in df['emp_title']:
  empt_title.append(clean_emp_title(i))
df['emp_title_cleaned2']=pd.Series(empt_title)
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,ha
rdship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned,emp_title_cleaned2
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping,Shipping
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1,Especialist 1
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher,Teacher
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator,Educator
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire Builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder,Tire Builder
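A note on the loop version: `pd.Series(empt_title)` gets a fresh 0, 1, 2, ... index, and pandas aligns on index labels when assigning a Series to a column. That matches this dataframe, whose index is the default range, but it would not match a frame with different labels. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame with a non-default index (hypothetical data).
toy = pd.DataFrame({'emp_title': [' manager', 'OWNER']}, index=[10, 11])

cleaned = [t.strip().title() for t in toy['emp_title']]

# pd.Series(cleaned) is labeled 0 and 1, which matches nothing in toy's index,
# so every row of the new column ends up NaN.
toy['misaligned'] = pd.Series(cleaned)

# Passing the frame's own index keeps each value on its original row.
toy['aligned'] = pd.Series(cleaned, index=toy.index)
print(toy)
```

Assigning the plain list (`toy['col'] = cleaned`) also avoids the issue, since a list is placed by position rather than by label.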


We can use the same code as we did earlier to see how much progress was made.


In [347]:
# Look at the top 20 employment titles

df['emp_title_cleaned'].value_counts(dropna=False).sort_values(ascending=False).head(20)

Unknown               20943
Teacher                2556
Manager                2395
Registered Nurse       1416
Driver                 1258
Supervisor             1160
Truck Driver            920
Rn                      834
Office Manager          805
Sales                   803
General Manager         791
Project Manager         720
Owner                   625
Director                523
Operations Manager      518
Sales Manager           500
Police Officer          440
Nurse                   425
Technician              419
Engineer                412
Name: emp_title_cleaned, dtype: int64

In [348]:
# How many different unique employment titles are there currently?
df['emp_title_cleaned'].nunique()

34895

In [349]:
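# How often is the employment_title null (NaN)?
# Note: this checks whether each *count* from value_counts() is NaN,
# not whether a title is missing.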
df['emp_title_cleaned'].value_counts().isna()


Unknown                             False
Teacher                             False
Manager                             False
Registered Nurse                    False
Driver                              False
                                    ...  
Kitchen Bar Manager                 False
Beu                                 False
Director Revenue Integrity Ph&Sa    False
Manufacturing Coordinator           False
Managing Partner/Proprietor         False
Name: emp_title_cleaned, Length: 34895, dtype: bool

In [350]:
#df['emp_title_cleaned'].isna()

In [351]:
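# Both lines print 0, but only the second one counts missing titles.
# value_counts() drops NaN by default and its counts are never NaN, so the first is always 0.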
print(df['emp_title_cleaned'].value_counts().isna().sum())
print(df['emp_title_cleaned'].isna().sum())

0
0
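To see the difference between counting missing values in a column and calling `.isna()` on the result of `value_counts()`, here is a small sketch on a made-up Series that does contain missing titles:

```python
import numpy as np
import pandas as pd

# Hypothetical titles with two missing entries.
titles = pd.Series(['Teacher', np.nan, 'Teacher', np.nan, 'Owner'])

# Counts the missing entries in the column itself.
n_missing = titles.isna().sum()

# value_counts() drops NaN by default and its counts are never NaN,
# so this is 0 no matter how many titles are missing.
n_from_counts = titles.value_counts().isna().sum()

# dropna=False keeps NaN as its own category in the tally.
counts_with_nan = titles.value_counts(dropna=False)

print(n_missing, n_from_counts)
print(counts_with_nan)
```

For the cleaned column both numbers are 0 only because the function already replaced every missing title with `'Unknown'`.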


## Challenge

Using the `.apply()` function isn't always about creating new columns on a dataframe. We can use it to clean up or modify existing columns as well.
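For simple string clean-up like this, the `.str` accessor offers a possible alternative to writing a function and calling `.apply()`. The sketch below assumes a small example Series rather than the real `emp_title` column:

```python
import numpy as np
import pandas as pd

# Hypothetical examples covering lowercase, padded, missing, and uppercase titles.
titles = pd.Series(['manager', ' Operating Manager', np.nan, 'OWNER'])

# .str methods pass missing values through untouched,
# so fillna can label them at the end of the chain.
cleaned = titles.str.strip().str.title().fillna('Unknown')
print(cleaned.tolist())
```

A custom function with `.apply()` remains the more general tool once the logic involves branching that string methods alone cannot express.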

# [Objective 04](#dates-and-times) - Work with Dates and Times in Pandas

## Overview

Pandas has its own datetime datatype, which makes it extremely convenient to convert strings in standard date formats to datetime objects and then use those objects either to create new features on a dataframe or to work with the dataset as a time series.

This section will demonstrate how to take a column of date strings, convert it to datetime, and then use the `.dt` accessor to pull out specific parts of the date (year, month, day) to generate useful columns on a dataframe.

## Follow Along

### Work with Dates 

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

Many of the most useful date columns in this dataset have the suffix `_d` to indicate that they correspond to dates.

We'll use a list comprehension to print them out:

In [352]:
[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']

Let's look at the string format of the `issue_d` column:

In [353]:
df['issue_d'][0]

'Dec-2018'

Because `Dec-2018` follows a common pattern (abbreviated month name and four-digit year, `%b-%Y` in strftime notation), we can just let Pandas detect the format and translate it to the appropriate datetime object.

In [355]:
df['issue_d_new']=pd.to_datetime(df['issue_d'])
df['issue_d_new'].head()

0   2018-12-01
1   2018-12-01
2   2018-12-01
3   2018-12-01
4   2018-12-01
Name: issue_d_new, dtype: datetime64[ns]
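Format inference is convenient, but the format can also be spelled out. The sketch below, on a few made-up strings, passes an explicit `format` to `pd.to_datetime`; `%b` is the abbreviated month name and `%Y` the four-digit year.

```python
import pandas as pd

# Made-up sample in the same 'Mon-YYYY' style as issue_d
issue_strings = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# An explicit format avoids guessing and raises on strings that don't match
parsed = pd.to_datetime(issue_strings, format='%b-%Y')
print(parsed)
```

Being explicit is a reasonable habit when a column might mix several date styles.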

Now we can see that the new `issue_d_new` column holds `datetime` objects, while the original `issue_d` column still holds strings.

Let's look at one of the cells specifically to see what a datetime object looks like:

In [360]:
print([df['issue_d'][0], df['issue_d_new'][0]])
df.head()

['Dec-2018', Timestamp('2018-12-01 00:00:00')]


Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned,emp_title_cleaned2,issue_d_new
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping,Shipping,2018-12-01
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1,Especialist 1,2018-12-01
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher,Teacher,2018-12-01
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator,Educator,2018-12-01
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire Builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder,Tire Builder,2018-12-01


You can see that the month and year come from the strings that were in the column previously, and that the rest of the values have been filled in with defaults: the first day of the month and midnight.
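A small sketch that parses a single string and inspects the resulting `Timestamp` makes the inferred parts concrete:

```python
import pandas as pd

ts = pd.to_datetime('Dec-2018', format='%b-%Y')

# Parts supplied by the string
print(ts.year, ts.month)
# Parts that were not in the string
print(ts.day, ts.hour, ts.minute)
```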

We can now use the `.dt` accessor to grab specific parts of the datetime object. Let's grab just the year from all of the cells in the `issue_d_new` column.

In [361]:
df['issue_d_new'].dt.year

0         2018
1         2018
2         2018
3         2018
4         2018
          ... 
128373    2018
128374    2018
128375    2018
128376    2018
128377    2018
Name: issue_d_new, Length: 128378, dtype: int64

Now the month.

In [363]:
df['issue_d_new'].dt.month

0         12
1         12
2         12
3         12
4         12
          ..
128373    10
128374    10
128375    10
128376    10
128377    10
Name: issue_d_new, Length: 128378, dtype: int64

It's just that easy! Now, instead of printing them out, let's add these year and month values as new columns on our dataframe. Again, you'll have to scroll all the way over to the right in the table to see the new columns.

In [366]:
df['issue_d_year']=df['issue_d_new'].dt.year 
df['issue_d_month']=df['issue_d_new'].dt.month
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned,emp_title_cleaned2,issue_d_new,issue_d_year,issue_d_month
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping,Shipping,2018-12-01,2018,12
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1,Especialist 1,2018-12-01,2018,12
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher,Teacher,2018-12-01,2018,12
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator,Educator,2018-12-01,2018,12
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire Builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder,Tire Builder,2018-12-01,2018,12
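Year and month are not the only components available. As a sketch on made-up dates, the `.dt` accessor also exposes the quarter, the month name, and the day of the week:

```python
import pandas as pd

# Made-up dates for illustration
dates = pd.Series(pd.to_datetime(['2018-10-01', '2018-11-01', '2018-12-01']))

features = pd.DataFrame({
    'quarter': dates.dt.quarter,          # 1 through 4
    'month_name': dates.dt.month_name(),  # e.g. 'December'
    'day_of_week': dates.dt.dayofweek,    # Monday=0 ... Sunday=6
})
print(features)
```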


In [369]:
#check for NaNs
df['issue_d'].isnull().sum()

0
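Missing or malformed date strings matter once you convert. A minimal sketch, using invented values: with `errors='coerce'`, `pd.to_datetime` turns anything it cannot parse into `NaT`, which `isna()` then counts.

```python
import numpy as np
import pandas as pd

# Invented values: one good date, one missing, one malformed
raw = pd.Series(['Dec-2018', np.nan, 'not a date'])

# errors='coerce' converts unparseable entries to NaT instead of raising
parsed = pd.to_datetime(raw, format='%b-%Y', errors='coerce')
print(parsed.isna().sum())
```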

Because all of these dates come from Q4 of 2018, the `issue_d` column isn't all that interesting. Let's look at the `earliest_cr_line` column, which is also stored as strings but can be converted to datetime format.

We're going to create a new column called `days_from_earliest_credit_to_issue`.

It's a long column header, but think about how valuable this piece of information could be. This number essentially indicates the length of a person's credit history, and if that is correlated with repayment or other factors, it could be a valuable predictor!

In [370]:
df['earliest_cr_line_new']=pd.to_datetime(df['earliest_cr_line'])

In [373]:
df['issue_d_new']-df['earliest_cr_line_new']

0         4079 days
1         6178 days
2         7243 days
3         2679 days
4         9192 days
            ...    
128373   12114 days
128374   13241 days
128375    4505 days
128376    3652 days
128377    4413 days
Length: 128378, dtype: timedelta64[ns]
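The first rows of that output can be reproduced on a toy frame. This sketch uses the issue and earliest-credit-line strings from the first three rows of the table shown earlier; the frame name `toy` is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'issue_d': ['Dec-2018', 'Dec-2018', 'Dec-2018'],
    'earliest_cr_line': ['Oct-2007', 'Jan-2002', 'Feb-1999'],
})

issue = pd.to_datetime(toy['issue_d'], format='%b-%Y')
earliest = pd.to_datetime(toy['earliest_cr_line'], format='%b-%Y')

# Subtracting two datetime columns yields a timedelta column
gap = issue - earliest
print(gap)

# .dt.days turns each timedelta into a plain integer number of days
print(gap.dt.days.tolist())
```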

In [404]:
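# .dt.days converts each timedelta to an integer number of days
df['days_from_earliest_credit_to_issue'] = (df['issue_d_new'] - df['earliest_cr_line_new']).dt.days
# Drop leftovers from earlier runs of the cells below; errors='ignore' skips columns that don't exist yet
df = df.drop(columns=['credit_hist_len', 'credit_hist_len_new'], errors='ignore')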
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned,emp_title_cleaned2,issue_d_new,issue_d_year,issue_d_month,earliest_cr_line_new,days_from_earliest_credit_to_issue
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping,Shipping,2018-12-01,2018,12,2007-10-01,4079
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1,Especialist 1,2018-12-01,2018,12,2002-01-01,6178
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher,Teacher,2018-12-01,2018,12,1999-02-01,7243
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator,Educator,2018-12-01,2018,12,2011-08-01,2679
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire Builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder,Tire Builder,2018-12-01,2018,12,1993-10-01,9192


In [405]:
# length of credit history in years
# Dividing by 365.25 averages in leap years (an approximation)
df['credit_hist_len']=df['days_from_earliest_credit_to_issue']/365.25
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned,emp_title_cleaned2,issue_d_new,issue_d_year,issue_d_month,earliest_cr_line_new,days_from_earliest_credit_to_issue,credit_hist_len
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping,Shipping,2018-12-01,2018,12,2007-10-01,4079,11.167693
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1,Especialist 1,2018-12-01,2018,12,2002-01-01,6178,16.914442
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher,Teacher,2018-12-01,2018,12,1999-02-01,7243,19.830253
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator,Educator,2018-12-01,2018,12,2011-08-01,2679,7.334702
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire Builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder,Tire Builder,2018-12-01,2018,12,1993-10-01,9192,25.166324
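Another way to get years, sketched here on the dates from the first row above, is to divide the timedelta directly by `pd.Timedelta(days=365.25)`, which skips the intermediate integer column. Either way the result is an approximation based on the average year length.

```python
import pandas as pd

issue = pd.Series(pd.to_datetime(['2018-12-01']))
earliest = pd.Series(pd.to_datetime(['2007-10-01']))

# Dividing a timedelta Series by a Timedelta gives a float Series
years = (issue - earliest) / pd.Timedelta(days=365.25)
print(years.round(6).tolist())
```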


What we just did is so cool! Pandas' datetime type is smart enough that we can simply use the subtraction operator `-` to calculate the amount of time between two dates. The next cell stores the same difference in days as `credit_hist_len_new`.

Think about everything that's going on under the hood to give us such straightforward syntax: handling months of different lengths, leap years, and so on. Pandas datetime objects are seriously powerful!
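A quick sketch shows the calendar handling at work: the gap from February 1 to March 1 depends on whether the year is a leap year.

```python
import pandas as pd

# 2016 is a leap year, 2017 is not
feb_2016 = pd.Timestamp('2016-03-01') - pd.Timestamp('2016-02-01')
feb_2017 = pd.Timestamp('2017-03-01') - pd.Timestamp('2017-02-01')

print(feb_2016.days, feb_2017.days)
```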

In [407]:
df['credit_hist_len_new']=(df['issue_d_new']-df['earliest_cr_line_new']).dt.days
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,new_int_rate,emp_title_cleaned,emp_title_cleaned2,issue_d_new,issue_d_year,issue_d_month,earliest_cr_line_new,days_from_earliest_credit_to_issue,credit_hist_len,credit_hist_len_new
0,5000,5000,5000.0,36 months,11.31%,164.43,B,B3,Shipping,2 years,RENT,40000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,284xx,NC,9.66,0,Oct-2007,0,,,8,0,1070,17.5%,27,w,2770.45,2770.45,2956.61,2956.61,2229.55,727.06,0.0,0.0,0.0,Sep-2020,164.43,Oct-2020,Sep-2020,0,,1,Individual,,,,0,550,144979,1,2,0,1,13.0,143909,78.0,1,3,170,44.0,6100,1,0,1,4,18122.0,3283.0,8.8,0,0,134.0,86,6,6,0,6.0,,9.0,,0,2,5,3,3,19,6,8,5,8,0.0,0,0,1,100.0,0.0,0,0,125853,144979,3600,119753,,,,,,,,,,,,N,CVD19SKIP,INCOMECURT,COMPLETE,2.0,0.0,Apr-2020,Jun-2020,May-2020,2.0,0.0,ACTIVE,58.8,3121.41,164.43,N,,,,,,,11.31,Shipping,Shipping,2018-12-01,2018,12,2007-10-01,4079,11.167693,4079
1,10000,10000,10000.0,36 months,18.94%,366.26,D,D2,Especialist 1,7 years,MORTGAGE,37115.0,Not Verified,Dec-2018,Current,n,other,Other,891xx,NV,33.57,0,Jan-2002,0,38.0,,9,0,5837,53.1%,15,w,5147.99,5147.99,7309.42,7309.42,4852.01,2457.41,0.0,0.0,0.0,Sep-2020,366.26,Oct-2020,Sep-2020,0,38.0,1,Joint App,91115.0,18.61,Not Verified,0,0,240589,3,2,1,2,8.0,34508,88.0,2,3,1659,64.0,11000,2,0,5,6,26732.0,2669.0,52.3,0,0,158.0,203,4,4,1,5.0,38.0,7.0,38.0,1,2,5,3,6,4,6,10,5,9,0.0,0,0,4,93.3,33.3,0,0,253346,40345,5600,41060,19051.0,Jan-2016,1.0,1.0,9.0,76.4,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,18.94,Especialist 1,Especialist 1,2018-12-01,2018,12,2002-01-01,6178,16.914442,6178
2,20000,20000,20000.0,36 months,7.56%,622.68,A,A3,Teacher,10+ years,MORTGAGE,100000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,982xx,WA,18.92,0,Feb-1999,0,48.0,,9,0,25416,29.9%,19,w,0.0,0.0,20215.79243,20215.79,20000.0,215.79,0.0,0.0,0.0,Feb-2019,20228.39,,Feb-2019,0,,1,Joint App,190000.0,11.75,Not Verified,0,0,515779,1,2,0,1,13.0,46153,71.0,1,2,9759,39.0,85100,2,2,0,5,57309.0,59684.0,29.9,0,0,171.0,238,1,1,5,1.0,,13.0,48.0,0,5,5,5,6,5,5,9,5,9,0.0,0,0,1,94.7,20.0,0,0,622183,71569,85100,74833,43287.0,Aug-1998,0.0,3.0,10.0,29.7,2.0,7.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,,7.56,Teacher,Teacher,2018-12-01,2018,12,1999-02-01,7243,19.830253,7243
3,6500,6500,6500.0,36 months,11.80%,215.28,B,B4,Educator,2 years,RENT,46500.0,Source Verified,Dec-2018,Late (16-30 days),n,debt_consolidation,Debt consolidation,352xx,AL,21.01,0,Aug-2011,1,61.0,,24,0,6741,41.6%,30,w,3354.53,3354.53,4083.93,4083.93,3145.47,938.46,0.0,0.0,0.0,Jul-2020,215.28,Oct-2020,Sep-2020,0,61.0,1,Individual,,,,0,0,40223,4,12,2,2,5.0,33482,97.0,7,12,3662,79.0,16200,2,0,4,14,1749.0,7694.0,42.2,0,0,88.0,72,1,1,0,3.0,,5.0,61.0,2,6,10,6,6,14,11,16,10,24,0.0,0,0,9,93.3,0.0,0,0,50845,40223,13300,34645,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,11.8,Educator,Educator,2018-12-01,2018,12,2011-08-01,2679,7.334702,2679
4,25000,25000,25000.0,60 months,12.98%,568.58,B,B5,Tire Builder,10+ years,MORTGAGE,85000.0,Not Verified,Dec-2018,Fully Paid,n,debt_consolidation,Debt consolidation,356xx,AL,36.67,1,Oct-1993,1,10.0,,14,0,26630,53.3%,27,w,0.0,0.0,29563.763968,29563.76,25000.0,4563.76,0.0,0.0,0.0,Aug-2020,19356.36,,Aug-2020,0,40.0,1,Joint App,140000.0,23.08,Not Verified,0,56,206980,1,7,1,3,3.0,69513,81.0,1,1,17292,68.0,50000,1,3,1,4,14784.0,18581.0,57.7,0,0,186.0,185,9,3,4,9.0,10.0,3.0,10.0,1,4,5,4,5,13,6,10,5,14,0.0,0,0,2,85.2,0.0,0,0,275691,96143,43900,99691,48711.0,Oct-1993,1.0,4.0,16.0,56.5,8.0,12.0,0.0,0.0,49.0,N,,,,,,,,,,,,,,,N,,,,,,,12.98,Tire Builder,Tire Builder,2018-12-01,2018,12,1993-10-01,9192,25.166324,9192


What's the oldest credit history among the loans issued in Q4 2018?

In [409]:
df['credit_hist_len'].describe()

count    128378.000000
mean         16.043800
std           7.903200
min           3.082820
25%          11.085558
50%          14.417522
75%          19.832991
max          68.914442
Name: credit_hist_len, dtype: float64

25,171 days is ~68.91 years of credit history!
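To find which loan a summary statistic belongs to, `idxmax()` returns the index label of the largest value. A minimal sketch on invented numbers:

```python
import pandas as pd

# Invented credit history lengths in years
toy = pd.DataFrame({'credit_hist_len': [11.2, 68.9, 16.9]})

longest_label = toy['credit_hist_len'].idxmax()

# .loc with that label pulls up the full row for inspection
print(toy.loc[longest_label])
```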

## Challenge

Pandas' datetime format is so easy to work with that there's really no excuse for not using dates to make features on a dataframe! Get ready to practice more of this on your assignment.