<a href="https://colab.research.google.com/github/MaxTechniche/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/module2-make-features/Jacob_Maxfield_LS_DS20_112_Make_Features_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 1, Sprint 1, Module 2*

---

# Make Features 

- Student should be able to understand the purpose of feature engineering
- Student should be able to work with strings in pandas
- Student should be able to work with dates and times in pandas
- Student should be able to modify or create columns of a dataframe using the `.apply()` function


Helpful Links:
- [Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428)
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series
- [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)

# [Objective](#feature-engineering) - The Purpose of Feature Engineering



## Overview

Feature Engineering is the process of using a combination of domain knowledge, creativity and the pre-existing columns of a dataset to create completely new columns.

Machine learning models try to detect patterns in the data and then associate those patterns with certain predictions. The hope is that by creating new columns on our dataset, we can expose our model to new patterns in the data so that it can make better predictions.

This is largely a matter of understanding how to work with individual columns of a dataframe with Pandas, which is what we'll be practicing today!

## Follow Along

Each column of a dataframe holds a specific type of data. Let's inspect some of the common datatypes found in datasets, and then we'll make a new feature on a dataset using pre-existing columns.

In [262]:
import pandas as pd

# Pandas Display Options:
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 100)

In [263]:
# Previewing data to make sure there are no anomalies
# !curl https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv

In [264]:
# Loading the dataset
df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/Ames%20Housing%20Data/train.csv')

In [265]:
# Quick look to make sure everything looks okay
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


### Specific Columns hold specific kinds of data

Some columns hold integer values like the `BedroomAbvGr` which stands for "Bedrooms Above Grade." This is the number of non-basement bedrooms in the home.

For more information on specific column meanings view the [data dictionary](https://github.com/ryanleeallred/datasets/blob/master/Ames%20Housing%20Data/data_description.txt).

In [266]:
# Look at the first ten rows of the `BedroomAbvGr` column.
df['BedroomAbvGr'].head(10)

0    3
1    3
2    3
3    3
4    4
5    1
6    3
7    3
8    2
9    2
Name: BedroomAbvGr, dtype: int64

Some columns hold float values like the `LotFrontage` column.

In [267]:
# Look at the first ten rows of the `LotFrontage` column.
df['LotFrontage'].head(10)

0    65.0
1    80.0
2    68.0
3    60.0
4    84.0
5    85.0
6    75.0
7     NaN
8    51.0
9    50.0
Name: LotFrontage, dtype: float64

Hmmm, do the values above look like floats to you?

They all have .0 on them so technically they're being stored as floats, but *should* they be stored as floats?

Let's see what the most common values for this column are.

In [268]:
# Viewing the 5 most common 'LotFrontage' values
df['LotFrontage'].value_counts().head()

60.0    143
70.0     70
80.0     69
50.0     57
75.0     53
Name: LotFrontage, dtype: int64

Looks to me like the `LotFrontage` column originally held integer values but was cast to `float`, meaning that each original integer value was converted to its corresponding float representation.

Any guesses as to why that would have happened?


HINT: What's the most common `LotFrontage` value for this column?

In [269]:
# NaN is the most common value in this column. What is a NaN?
df['LotFrontage'].value_counts(dropna=False).head()

NaN     259
60.0    143
70.0     70
80.0     69
50.0     57
Name: LotFrontage, dtype: int64

`NaN` stands for "Not a Number" and is the default missing value indicator in Pandas. This means there were homes in this dataset that didn't have a `LotFrontage` value recorded.

This is where domain knowledge starts to come in. Think about the context we're working with here: houses. What might a null or blank cell representing "Linear feet of street connected to property" mean in the context of a housing dataset?

Ok, so maybe it makes sense to have some NaNs in this column. What is the datatype of a NaN value?

Perhaps some of this data is truly missing or unrecorded data, but sometimes `NaNs` are more likely to indicate something that was "NA" or "Not Applicable" to a particular observation. There could be multiple reasons why there was no value recorded for a particular feature.

Remember that Pandas tries to maintain a single datatype for all values in a column, and therefore...

In [270]:
# What is the datatype of NaN?
# numpy has not been imported yet in this notebook, so import it here
import numpy as np
type(np.nan)

float

The datatype of a NaN is float! This means that if a column of integer values has even a single `NaN`, Pandas will not store that column with the integer datatype. Instead, all of the integers are converted to floats so that the entire column keeps a single datatype.

You can see already how understanding column datatypes is crucial to understanding how Pandas helps us manage our data.
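The upcasting described above can be reproduced on a toy Series. This is a minimal sketch with made-up values; the nullable `Int64` dtype shown at the end is a pandas extension type that this lesson does not otherwise use, so treat it as an optional aside.

```python
import numpy as np
import pandas as pd

# A column of whole numbers is stored as int64.
whole = pd.Series([65, 80, 68])
print(whole.dtype)

# A single missing value forces the whole column to float64.
with_gap = pd.Series([65, 80, np.nan])
print(with_gap.dtype)

# Assumption: if integers must be preserved, pandas' nullable "Int64"
# extension dtype can hold missing values without upcasting to float.
nullable = pd.Series([65, 80, None], dtype="Int64")
print(nullable.dtype)
```

Whether converting a column like `LotFrontage` to a nullable integer is worthwhile depends on what the missing values mean, which is the domain-knowledge question raised earlier.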

### Making new Features

Let's slim down the dataset and consider just a few specific columns:

- `TotalBsmtSF`
- `1stFlrSF`
- `2ndFlrSF`
- `SalePrice`


In [271]:
# creating smaller data frame from the below columns
small_df = df[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'SalePrice']].copy()
# Quick look to make sure everything is good
small_df.head()

Unnamed: 0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice
0,856,856,854,208500
1,1262,1262,0,181500
2,920,920,866,223500
3,756,961,756,140000
4,1145,1145,1053,250000


In [272]:
# More info on the small_df
small_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   TotalBsmtSF  1460 non-null   int64
 1   1stFlrSF     1460 non-null   int64
 2   2ndFlrSF     1460 non-null   int64
 3   SalePrice    1460 non-null   int64
dtypes: int64(4)
memory usage: 45.8 KB


### Syntax for creating new columns

When making a new column on a dataframe, we have to use the square bracket syntax of accessing a column. We can't use "dot syntax" here.

In [273]:
# Let's add up the square footage columns to get a single
# total square footage value for each home
# Using bracket syntax to make a new 'TotalSquareFootage' column
small_df['TotalSquareFootage'] = small_df['TotalBsmtSF'] + small_df['1stFlrSF'] + small_df['2ndFlrSF']
small_df.head()
# We see now that the TotalSquareFootage column has been added

Unnamed: 0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice,TotalSquareFootage
0,856,856,854,208500,2566
1,1262,1262,0,181500,2524
2,920,920,866,223500,2706
3,756,961,756,140000,2473
4,1145,1145,1053,250000,3343
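
The bracket-versus-dot distinction mentioned earlier can be checked on a toy frame. This is a small sketch with hypothetical column names (`a`, `b`, `dot_total`, `total`), not part of the housing data.

```python
import warnings

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Dot syntax works for READING an existing column...
print(toy.a.tolist())

# ...but assigning to a new name only sets a Python attribute on the object.
# pandas emits a UserWarning here; it is silenced to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.dot_total = toy["a"] + toy["b"]
created_by_dot = "dot_total" in toy.columns
print(created_by_dot)

# Bracket syntax actually creates the column.
toy["total"] = toy["a"] + toy["b"]
created_by_bracket = "total" in toy.columns
print(created_by_bracket)
```

Bracket syntax is also the only option for names such as `1stFlrSF`, which are not valid Python identifiers.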


In [274]:
# Let's make another new column, 'PricePerSqFt', by
# dividing the price by the square footage
small_df['PricePerSqFt'] = small_df['SalePrice'] / small_df['TotalSquareFootage']

# Adding round here displays rounded data for easier consumption
# but keeps the underlying data intact
small_df.head(10).round(2)

Unnamed: 0,TotalBsmtSF,1stFlrSF,2ndFlrSF,SalePrice,TotalSquareFootage,PricePerSqFt
0,856,856,854,208500,2566,81.25
1,1262,1262,0,181500,2524,71.91
2,920,920,866,223500,2706,82.59
3,756,961,756,140000,2473,56.61
4,1145,1145,1053,250000,3343,74.78
5,796,796,566,143000,2158,66.27
6,1686,1694,0,307000,3380,90.83
7,1107,1107,983,200000,3197,62.56
8,952,1022,752,129900,2726,47.65
9,991,1077,0,118000,2068,57.06


Ok, we have made two new columns on our small dataset.

- What does a **high** `PricePerSqFt` say about a home that the square footage and price alone don't capture as directly?

- What does a **low** `PricePerSqFt` say about a home that the square footage and price alone don't directly capture?

In [None]:
# The price per square foot can give us insight on what a house might have. 
# The house might have a large lot that isn't calculated yet.

## Challenge

I hope you can see how we have used existing columns to create a new column on a dataset that says something new about our unit of observation. This is what making new features (columns) on a dataset is all about, and why it's so essential to data science, particularly predictive modeling ("Machine Learning").

We'll spend the rest of the lecture and assignment today trying to get as good as we can at manipulating (cleaning) and creating new columns on datasets.

# [Objective](#work-with-strings) Work with Strings with Pandas

## Overview

So far we have worked with numeric datatypes (ints and floats) but we haven't worked with any columns containing string values. We can't simply use arithmetic to manipulate string values, so we'll need to learn some more techniques in order to work with this datatype.

## Follow Along

We're going to import a new dataset here to work with. This dataset is from LendingClub and holds information about loans issued in Q4 of 2018. This dataset is a bit messy so it will give us plenty of opportunities to clean up existing columns as well as create new ones.

The `!wget` shell command being used here does the same thing that your browser does when you type a URL into the address bar. It makes a request for, or "gets", the file at that address. However, in our case the file isn't a webpage, it's a compressed CSV file.

Try copying and pasting the URL from below into your browser. Did it start an automatic download? Any URL like this that starts an automatic download when navigated to can be used with the `!wget` command to save the file directly into your notebook's working directory.

### Load a new dataset

In [276]:
# Loading in LendingClub csv
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2020-09-01 22:25:16--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 54.244.115.45, 54.148.13.215, 35.161.89.82
Connecting to resources.lendingclub.com (resources.lendingclub.com)|54.244.115.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [                 <=>]  22.28M  2.12MB/s    in 11s     

2020-09-01 22:25:27 (2.12 MB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [23360898]



We need to use the `!unzip` command to extract the csv from the zipped folder.

In [277]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    
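
As an aside to the `!unzip` step, pandas can usually read a zipped CSV directly when the archive contains exactly one file. The sketch below builds a tiny stand-in archive first so that it runs without a download; the file name `toy_loans.csv.zip` and its contents are made up for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive (hypothetical file name and values).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy_loans.csv.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr(
        "toy_loans.csv",
        "loan_amnt,term\n5000,36 months\n10000,60 months\n",
    )

# pandas infers zip compression from the ".zip" extension, as long as the
# archive contains exactly one file.
toy_loans = pd.read_csv(zip_path)
print(toy_loans)
```

Unzipping first, as this notebook does, has the advantage that the raw file can be inspected with shell commands before it is parsed.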


We can also use bash/shell commands to look at the raw file using the `!head` and `!tail` commands.

In [119]:
# Previewing the top of the csv file
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

In [120]:
# Previewing the bottom of the csv file
!tail LoanStats_2018Q4.csv

"","","5600","5600","5600"," 36 months"," 13.56%","190.21","C","C1","","n/a","RENT","15600","Not Verified","Oct-2018","Current","n","","","credit_card","Credit card refinancing","836xx","ID","15.31","0","Aug-2012","0","","97","9","1","5996","34.5%","11","w","2449.86","2449.86","4174.07","4174.07","3150.14","1023.93","0.0","0.0","0.0","Aug-2020","190.21","Sep-2020","Aug-2020","0","","1","Individual","","","","0","0","5996","0","0","0","1","20","0","","0","2","3017","35","17400","1","0","0","3","750","4689","45.5","0","0","20","73","13","13","0","13","","20","","0","3","5","4","4","1","9","10","5","9","0","0","0","0","100","25","1","0","17400","5996","8600","0","","","","","","","","","","","","N","","","","","","","","","","","","","","","N","","","","","",""
"","","23000","23000","23000"," 36 months"," 15.02%","797.53","C","C3","Tax Consultant","10+ years","MORTGAGE","75000","Source Verified","Oct-2018","Charged Off","n","","","debt_consolidation","Debt consolidation","352xx","AL","20.

As we look at the raw file itself, do you see anything that might cause us trouble as we read in the CSV file to a dataframe?

In [121]:
# Read in the CSV
# column names start at row 1 therefore, header=1
df = pd.read_csv('LoanStats_2018Q4.csv', header=1)
# or by using skiprows
# df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1)

df.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,...,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,27975.0,27975.0,27975.0,36 months,14.47%,962.52,C,C2,Conductor,10+ years,MORTGAGE,180000.0,Not Verified,Dec-2018,Fully Paid,n,,,credit_card,Credit card refinancing,117xx,NY,11.47,0.0,Jul-1995,0.0,39.0,,10.0,0.0,29711.0,66.8%,19.0,w,0.0,0.0,31804.529849,31804.53,27975.0,3829.53,0.0,0.0,0.0,Jan-2020,20288.02,,Mar-2020,0.0,...,1.0,8.0,13.0,7.0,10.0,0.0,0.0,0.0,0.0,78.9,60.0,0.0,0.0,286525.0,45387.0,29500.0,25025.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,10000.0,10000.0,10000.0,60 months,12.98%,227.43,B,B5,Printer,9 years,RENT,60000.0,Not Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,660xx,KS,14.9,0.0,May-2007,0.0,,112.0,7.0,1.0,10677.0,54.2%,12.0,w,7352.95,7352.95,4487.31,4487.31,2647.05,1840.26,0.0,0.0,0.0,Aug-2020,227.43,Sep-2020,Aug-2020,0.0,...,6.0,5.0,6.0,5.0,7.0,0.0,0.0,0.0,1.0,100.0,33.3,1.0,0.0,36200.0,27595.0,13000.0,16500.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,28000.0,28000.0,28000.0,60 months,13.56%,645.15,C,C1,Project Manager,10+ years,MORTGAGE,128500.0,Source Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,760xx,TX,27.35,1.0,Jul-1987,0.0,24.0,,16.0,0.0,55206.0,75%,31.0,w,20667.91,20667.91,12871.36,12871.36,7332.09,5539.27,0.0,0.0,0.0,Aug-2020,645.15,Sep-2020,Aug-2020,0.0,...,9.0,11.0,17.0,11.0,15.0,,0.0,0.0,1.0,94.0,100.0,0.0,0.0,542027.0,128345.0,55000.0,60331.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,16000.0,16000.0,16000.0,60 months,13.56%,368.66,C,C1,LMSW,10+ years,RENT,46000.0,Not Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,787xx,TX,11.09,1.0,Oct-1997,1.0,11.0,,9.0,0.0,18946.0,32.1%,21.0,w,11503.84,11503.84,7570.88,7570.88,4496.16,3074.72,0.0,0.0,0.0,Aug-2020,368.66,Sep-2020,Aug-2020,0.0,...,10.0,7.0,11.0,3.0,9.0,0.0,0.0,0.0,1.0,94.7,20.0,0.0,0.0,138125.0,128218.0,57600.0,79125.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,7500.0,7500.0,7500.0,36 months,10.72%,244.55,B,B2,Program Manager,2 years,RENT,84000.0,Not Verified,Dec-2018,Charged Off,n,,,debt_consolidation,Debt consolidation,600xx,IL,3.86,0.0,May-2005,1.0,,114.0,9.0,1.0,2200.0,12.6%,17.0,w,0.0,0.0,2829.93,2829.93,1465.61,484.09,0.0,880.23,158.4414,Sep-2019,244.55,,Mar-2020,0.0,...,10.0,3.0,6.0,2.0,9.0,0.0,0.0,0.0,2.0,76.5,0.0,1.0,0.0,403892.0,20833.0,17400.0,16812.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [122]:
# We still have a problem with our footer
# It includes data that is not relevant to what we're interested in
df.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,...,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
128409,,,5000.0,5000.0,5000.0,36 months,13.56%,169.83,C,C1,Payoff Clerk,10+ years,MORTGAGE,35360.0,Not Verified,Oct-2018,Current,n,,,debt_consolidation,Debt consolidation,381xx,TN,11.3,1.0,Jun-2006,0.0,21.0,,9.0,0.0,2597.0,27.3%,15.0,f,2187.38,2187.38,3732.49,3732.49,2812.62,919.87,0.0,0.0,0.0,Aug-2020,169.83,Sep-2020,Aug-2020,0.0,...,6.0,6.0,7.0,3.0,9.0,0.0,0.0,0.0,3.0,92.9,50.0,0.0,0.0,93908.0,4976.0,3000.0,6028.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128410,,,10000.0,10000.0,9750.0,36 months,11.06%,327.68,B,B3,,,RENT,44400.0,Source Verified,Oct-2018,Current,n,,,credit_card,Credit card refinancing,980xx,WA,11.78,0.0,Oct-2008,2.0,40.0,,15.0,0.0,6269.0,13.1%,25.0,f,4285.08,4177.96,7193.6,7013.76,5714.92,1478.68,0.0,0.0,0.0,Aug-2020,327.68,Sep-2020,Aug-2020,0.0,...,3.0,14.0,22.0,4.0,15.0,0.0,0.0,0.0,3.0,92.0,0.0,0.0,0.0,57871.0,16440.0,20500.0,10171.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128411,,,10000.0,10000.0,10000.0,36 months,16.91%,356.08,C,C5,Key Accounts Manager,2 years,RENT,80000.0,Not Verified,Oct-2018,Current,n,,,other,Other,021xx,MA,17.72,1.0,Sep-2006,0.0,14.0,,17.0,0.0,1942.0,30.8%,31.0,w,4495.59,4495.59,7824.37,7824.37,5504.41,2319.96,0.0,0.0,0.0,Aug-2020,356.08,Sep-2020,Aug-2020,0.0,...,22.0,2.0,9.0,1.0,17.0,0.0,0.0,0.0,1.0,74.2,0.0,0.0,0.0,73669.0,59194.0,4000.0,67369.0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128412,Total amount funded in policy code 1: 2050909275,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
128413,Total amount funded in policy code 2: 820109297,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


The extra rows at the top and bottom of the file cause two problems:

1) The top row is not part of the table. Unless we skip it, Pandas tries to use it as the header row instead of the real column names.

2) The bottom two rows have been read into the `id` column, which causes every other column to have at least two `NaN` values in it.

Let's look at the NaN values of each column so that you can see the problem that the extra rows at the bottom of the file are creating for us.

In [123]:
# Sum null values by column and sort from least to greatest
# We see here that those 2 bottom rows are giving us NaN values across the board
df.isnull().sum().sort_values()

inq_fi                                             2
mo_sin_old_rev_tl_op                               2
delinq_amnt                                        2
chargeoff_within_12_mths                           2
acc_open_past_24mths                               2
inq_last_12m                                       2
total_cu_tl                                        2
total_rev_hi_lim                                   2
open_rv_24m                                        2
open_rv_12m                                        2
total_bal_il                                       2
open_il_24m                                        2
open_il_12m                                        2
open_act_il                                        2
open_acc_6m                                        2
tot_cur_bal                                        2
tot_coll_amt                                       2
acc_now_delinq                                     2
application_type                              

In [278]:
# Address the extra NaNs in each column by skipping the footer as well.
# Reloading the dataset
# The 'c' engine does not support skipfooter, so we use engine='python'
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')

# Loads slower, but solves our problem
print(df.shape)
df.tail()

(128412, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,...,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
128407,,,23000,23000,23000.0,36 months,15.02%,797.53,C,C3,Tax Consultant,10+ years,MORTGAGE,75000.0,Source Verified,Oct-2018,Charged Off,n,,,debt_consolidation,Debt consolidation,352xx,AL,20.95,1,Aug-1985,2,22.0,,12,0,22465,43.6%,28,w,0.0,0.0,1547.08,1547.08,1025.67,521.41,0.0,0.0,0.0,Dec-2018,797.53,,Nov-2018,0,...,3,9,19,5,12,0.0,0,0,7,96.4,14.3,0,0,296500,40614,47100,21000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128408,,,10000,10000,10000.0,36 months,15.02%,346.76,C,C3,security guard,5 years,MORTGAGE,38000.0,Not Verified,Oct-2018,Current,n,,,debt_consolidation,Debt consolidation,443xx,OH,13.16,3,Jul-1982,0,6.0,,11,0,5634,37.1%,16,w,4427.45,4427.45,7620.38,7620.38,5572.55,2047.83,0.0,0.0,0.0,Aug-2020,346.76,Sep-2020,Aug-2020,0,...,1,8,11,5,11,0.0,0,0,1,73.3,40.0,0,0,91403,9323,9100,2000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128409,,,5000,5000,5000.0,36 months,13.56%,169.83,C,C1,Payoff Clerk,10+ years,MORTGAGE,35360.0,Not Verified,Oct-2018,Current,n,,,debt_consolidation,Debt consolidation,381xx,TN,11.3,1,Jun-2006,0,21.0,,9,0,2597,27.3%,15,f,2187.38,2187.38,3732.49,3732.49,2812.62,919.87,0.0,0.0,0.0,Aug-2020,169.83,Sep-2020,Aug-2020,0,...,6,6,7,3,9,0.0,0,0,3,92.9,50.0,0,0,93908,4976,3000,6028,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128410,,,10000,10000,9750.0,36 months,11.06%,327.68,B,B3,,,RENT,44400.0,Source Verified,Oct-2018,Current,n,,,credit_card,Credit card refinancing,980xx,WA,11.78,0,Oct-2008,2,40.0,,15,0,6269,13.1%,25,f,4285.08,4177.96,7193.6,7013.76,5714.92,1478.68,0.0,0.0,0.0,Aug-2020,327.68,Sep-2020,Aug-2020,0,...,3,14,22,4,15,0.0,0,0,3,92.0,0.0,0,0,57871,16440,20500,10171,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
128411,,,10000,10000,10000.0,36 months,16.91%,356.08,C,C5,Key Accounts Manager,2 years,RENT,80000.0,Not Verified,Oct-2018,Current,n,,,other,Other,021xx,MA,17.72,1,Sep-2006,0,14.0,,17,0,1942,30.8%,31,w,4495.59,4495.59,7824.37,7824.37,5504.41,2319.96,0.0,0.0,0.0,Aug-2020,356.08,Sep-2020,Aug-2020,0,...,22,2,9,1,17,0.0,0,0,1,74.2,0.0,0,0,73669,59194,4000,67369,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
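
The effect of the header and footer lines can be reproduced on a small in-memory file. This is a sketch with made-up contents that only mimics the layout of the LendingClub file (one notes line on top, one summary line at the bottom).

```python
import io

import pandas as pd

raw = (
    "Notes line that is not part of the table\n"
    "loan_amnt,int_rate\n"
    "5000,13.56%\n"
    "10000,11.06%\n"
    "Total amount funded: 15000\n"
)

# Skipping only the top line leaves the footer in as an extra row,
# with NaN in every column after the first.
with_footer = pd.read_csv(io.StringIO(raw), skiprows=1)
print(with_footer.isnull().sum())

# skipfooter requires the slower Python parsing engine.
clean = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=1, engine="python")
print(clean.shape)
```

Note that the footer also turns `loan_amnt` into a string column in `with_footer`, because one of its values is text; removing the footer lets Pandas read it as integers again.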


In [279]:
# This dataset has 4 columns that are entirely NaN (blanked out for privacy reasons),
# so we remove them using the .drop() method
df = df.drop(['id', 'member_id', 'desc', 'url'], axis=1)

For good measure, the cell above also drops some columns that are made up completely of NaN values.

Why might LendingClub have included columns in their dataset that are 100% blank?
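
Rather than listing empty columns by hand, they can also be found programmatically. The following is a minimal sketch on a toy frame with hypothetical values; it is an alternative to the explicit `.drop()` call, not what this notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt": [5000, 10000],
    "member_id": [np.nan, np.nan],   # entirely missing
    "emp_title": ["Clerk", None],    # partly missing, should be kept
})

# Names of the columns where every value is missing.
all_nan_cols = toy.columns[toy.isnull().all()].tolist()
print(all_nan_cols)

# how="all" drops a column only when all of its values are missing.
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Naming the columns explicitly, as the notebook does, documents intent more clearly; `dropna` is convenient when the list of empty columns is long or unknown.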

In [284]:
# Those columns are no longer in the dataset
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,...,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,27975,27975,27975.0,36 months,14.47%,962.52,C,C2,Conductor,10+ years,MORTGAGE,180000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,117xx,NY,11.47,0,Jul-1995,0,39.0,,10,0,29711,66.8%,19,w,0.0,0.0,31804.529849,31804.53,27975.0,3829.53,0.0,0.0,0.0,Jan-2020,20288.02,,Mar-2020,0,,1,Individual,,...,1,8,13,7,10,0.0,0,0,0,78.9,60.0,0,0,286525,45387,29500,25025,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,10000,10000,10000.0,60 months,12.98%,227.43,B,B5,Printer,9 years,RENT,60000.0,Not Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,660xx,KS,14.9,0,May-2007,0,,112.0,7,1,10677,54.2%,12,w,7352.95,7352.95,4487.31,4487.31,2647.05,1840.26,0.0,0.0,0.0,Aug-2020,227.43,Sep-2020,Aug-2020,0,,1,Individual,,...,6,5,6,5,7,0.0,0,0,1,100.0,33.3,1,0,36200,27595,13000,16500,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,28000,28000,28000.0,60 months,13.56%,645.15,C,C1,Project Manager,10+ years,MORTGAGE,128500.0,Source Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,760xx,TX,27.35,1,Jul-1987,0,24.0,,16,0,55206,75%,31,w,20667.91,20667.91,12871.36,12871.36,7332.09,5539.27,0.0,0.0,0.0,Aug-2020,645.15,Sep-2020,Aug-2020,0,71.0,1,Individual,,...,9,11,17,11,15,,0,0,1,94.0,100.0,0,0,542027,128345,55000,60331,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,16000,16000,16000.0,60 months,13.56%,368.66,C,C1,LMSW,10+ years,RENT,46000.0,Not Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,787xx,TX,11.09,1,Oct-1997,1,11.0,,9,0,18946,32.1%,21,w,11503.84,11503.84,7570.88,7570.88,4496.16,3074.72,0.0,0.0,0.0,Aug-2020,368.66,Sep-2020,Aug-2020,0,,1,Individual,,...,10,7,11,3,9,0.0,0,0,1,94.7,20.0,0,0,138125,128218,57600,79125,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,7500,7500,7500.0,36 months,10.72%,244.55,B,B2,Program Manager,2 years,RENT,84000.0,Not Verified,Dec-2018,Charged Off,n,debt_consolidation,Debt consolidation,600xx,IL,3.86,0,May-2005,1,,114.0,9,1,2200,12.6%,17,w,0.0,0.0,2829.93,2829.93,1465.61,484.09,0.0,880.23,158.4414,Sep-2019,244.55,,Mar-2020,0,,1,Individual,,...,10,3,6,2,9,0.0,0,0,2,76.5,0.0,1,0,403892,20833,17400,16812,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


### Clean up the `int_rate` column

When we're preparing a dataset for a machine learning model, we typically don't want to leave any string values in our dataset, because it's hard to do math on words.

Specifically, we have a column that represents a numeric value but doesn't currently have a numeric datatype. Let's look at the first 10 values of the `int_rate` column:

In [127]:
# Look at the first 10 values of the int_rate column

df['int_rate'].head(10)

0     14.47%
1     12.98%
2     13.56%
3     13.56%
4     10.72%
5     20.89%
6     26.31%
7     23.40%
8     19.92%
9     17.97%
Name: int_rate, dtype: object

In [128]:
# Look at a specific value from the int_rate column

df['int_rate'][0]

' 14.47%'

Problems that we need to address with this column:

- String column that should be numeric
- Percent Sign `%` included with the number
- Leading space at the beginning of the string

However, we're not going to try and write exactly the right code to fix this column in one go. We're going to methodically build up to the code that will help us address these problems.


In [129]:
int_rate = ' 14.47%'

In [130]:
# Using different functions and methods:
# we strip away the spaces,
# then strip the percent sign,
# then convert/cast the value to a float
float(int_rate.strip().strip('%'))

14.47

In [131]:
type(float(int_rate.strip().strip('%')))

float

### Write a function to make our solution reusable!

In [286]:
# Write a function that can do what we have written above to any
# string that is passed to it.

def int_rate_to_float(int_rate):
  return float(int_rate.strip().strip('%'))

In [133]:
# Test out our function by calling it on our example
int_rate_to_float(' 14.78%')

14.78

In [134]:
# is the data type correct?
type(int_rate_to_float(' 14.78%'))

float

### Apply our solution to every cell in a column

In [287]:
# Pass the function object itself (its name, without parentheses) to .apply()
df['int_rate'] = df['int_rate'].apply(int_rate_to_float)

In [289]:
# What type of data is held in our new column?
df['int_rate'].dtype

dtype('float64')

In [137]:
# Calculate the average of the values within the 'int_rate' column
df['int_rate'].mean()

12.928038734699749
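
As a complement to the `.apply()` approach above, pandas also offers vectorized string methods through the `.str` accessor. The following is a minimal sketch, assuming a small made-up Series (`raw_rates`) standing in for the raw `int_rate` column; it is one possible alternative, not the lesson's method.

```python
import pandas as pd

# Toy data standing in for the raw `int_rate` strings (values are illustrative)
raw_rates = pd.Series([' 14.47%', ' 12.98%', ' 13.56%'])

# Same three steps as int_rate_to_float, applied to the whole column at once:
# strip whitespace, strip the trailing percent sign, cast to float
clean_rates = raw_rates.str.strip().str.rstrip('%').astype(float)

print(clean_rates.tolist())
print(clean_rates.dtype)
```

Both approaches should give the same values here; `.apply()` is easier to extend with custom logic, while the `.str` chain avoids writing a separate function.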

## Challenge

We can create a new column with our cleaned values or overwrite the original, whatever we think best suits our needs. On your assignment you will take the same approach in trying to methodically build up the complexity of your code until you have a few lines that will work for any cell in a column. At that point you'll contain all of that functionality in a reusable function block and then use the `.apply()` function to... well... apply those changes to an entire column.

# [Objective](#pandas-apply) Modify and Create Columns using `.apply()`



## Overview

We've already seen one example of using the `.apply()` function to clean up a column. Let's see if we can do it again, but this time on a slightly more complicated use case.

Remember, the goal here is to write a function that will work correctly on any **individual** cell of a specific column. Then we can reuse that function on those individual cells of a dataframe column via the `.apply()` function.

Let's clean up the `emp_title` ("Employment Title") column!

## Follow Along

First we'll try and diagnose how bad the problem is and what improvements we might be able to make.

In [138]:
# Look at the top 20 employment titles
df['emp_title'].value_counts(dropna=False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

In [139]:
# How many different unique employment titles are there currently?
df['emp_title'].nunique()

43892

In [140]:
# How often is the employment_title null?
df['emp_title'].isnull().sum()

20947

What are some possible reasons as to why a person's employment title may have not been provided?

In [141]:
# Create some examples that represent the cases that we want to clean up

examples = ['manager', ' Operations Manager', 'Registered Nurse', np.nan, 'OWNER']

In [290]:
# Write a function to clean up these use cases and increase uniformity.
def clean_emp_title(title):
  if isinstance(title, str):
    return title.strip().title()
  else:
    return 'Unknown'


for title in examples:
  print(clean_emp_title(title))


Manager
Operations Manager
Registered Nurse
Unknown
Owner


In [143]:
# list comprehensions can combine function calls and for loops over lists
# into one succinct and fairly readable single line of code.

[clean_emp_title(title) for title in examples]

['Manager', 'Operations Manager', 'Registered Nurse', 'Unknown', 'Owner']

In [291]:
# We have a function that works as expected. Let's apply it to our column,
# storing the result in a new column so the original stays intact
df['emp_title_cleaned'] = df['emp_title'].apply(clean_emp_title)
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,...,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_cleaned
0,27975,27975,27975.0,36 months,14.47,962.52,C,C2,Conductor,10+ years,MORTGAGE,180000.0,Not Verified,Dec-2018,Fully Paid,n,credit_card,Credit card refinancing,117xx,NY,11.47,0,Jul-1995,0,39.0,,10,0,29711,66.8%,19,w,0.0,0.0,31804.529849,31804.53,27975.0,3829.53,0.0,0.0,0.0,Jan-2020,20288.02,,Mar-2020,0,,1,Individual,,...,8,13,7,10,0.0,0,0,0,78.9,60.0,0,0,286525,45387,29500,25025,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Conductor
1,10000,10000,10000.0,60 months,12.98,227.43,B,B5,Printer,9 years,RENT,60000.0,Not Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,660xx,KS,14.9,0,May-2007,0,,112.0,7,1,10677,54.2%,12,w,7352.95,7352.95,4487.31,4487.31,2647.05,1840.26,0.0,0.0,0.0,Aug-2020,227.43,Sep-2020,Aug-2020,0,,1,Individual,,...,5,6,5,7,0.0,0,0,1,100.0,33.3,1,0,36200,27595,13000,16500,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Printer
2,28000,28000,28000.0,60 months,13.56,645.15,C,C1,Project Manager,10+ years,MORTGAGE,128500.0,Source Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,760xx,TX,27.35,1,Jul-1987,0,24.0,,16,0,55206,75%,31,w,20667.91,20667.91,12871.36,12871.36,7332.09,5539.27,0.0,0.0,0.0,Aug-2020,645.15,Sep-2020,Aug-2020,0,71.0,1,Individual,,...,11,17,11,15,,0,0,1,94.0,100.0,0,0,542027,128345,55000,60331,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Project Manager
3,16000,16000,16000.0,60 months,13.56,368.66,C,C1,LMSW,10+ years,RENT,46000.0,Not Verified,Dec-2018,Current,n,credit_card,Credit card refinancing,787xx,TX,11.09,1,Oct-1997,1,11.0,,9,0,18946,32.1%,21,w,11503.84,11503.84,7570.88,7570.88,4496.16,3074.72,0.0,0.0,0.0,Aug-2020,368.66,Sep-2020,Aug-2020,0,,1,Individual,,...,7,11,3,9,0.0,0,0,1,94.7,20.0,0,0,138125,128218,57600,79125,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Lmsw
4,7500,7500,7500.0,36 months,10.72,244.55,B,B2,Program Manager,2 years,RENT,84000.0,Not Verified,Dec-2018,Charged Off,n,debt_consolidation,Debt consolidation,600xx,IL,3.86,0,May-2005,1,,114.0,9,1,2200,12.6%,17,w,0.0,0.0,2829.93,2829.93,1465.61,484.09,0.0,880.23,158.4414,Sep-2019,244.55,,Mar-2020,0,,1,Individual,,...,3,6,2,9,0.0,0,0,2,76.5,0.0,1,0,403892,20833,17400,16812,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Program Manager


We can use the same code as we did earlier to see how much progress was made.


In [292]:
# Look at the top 20 employment titles
df['emp_title_cleaned'].value_counts(dropna=False).head(20)

Unknown               20947
Teacher                2557
Manager                2395
Registered Nurse       1418
Driver                 1258
Supervisor             1160
Truck Driver            920
Rn                      834
Office Manager          805
Sales                   803
General Manager         791
Project Manager         720
Owner                   625
Director                523
Operations Manager      518
Sales Manager           500
Police Officer          440
Nurse                   425
Technician              420
Engineer                412
Name: emp_title_cleaned, dtype: int64

In [293]:
# How many different unique employment titles are there currently?
df['emp_title_cleaned'].nunique()

34902
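
The same cleanup can also be sketched with the `.str` accessor instead of `.apply()`. This is a hedged alternative on the lesson's own `examples` list (wrapped in a Series named `titles` for illustration), relying on the fact that `.str` methods leave missing values as missing so they can be filled afterwards.

```python
import numpy as np
import pandas as pd

# The same example cases used above, as a Series
titles = pd.Series(['manager', ' Operations Manager', 'Registered Nurse', np.nan, 'OWNER'])

# Strip whitespace, title-case, then label the missing values
cleaned = titles.str.strip().str.title().fillna('Unknown')

print(cleaned.tolist())
# → ['Manager', 'Operations Manager', 'Registered Nurse', 'Unknown', 'Owner']
```

Note that `.title()` also changes acronyms (`RN` becomes `Rn`, `LMSW` becomes `Lmsw`), which is visible in the cleaned counts above.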

Using the `.apply()` function isn't only about creating new columns on a dataframe; we can use it to clean up or modify existing columns as well.

# [Objective](#dates-and-times) Work with Dates and Times with Pandas

## Overview

Pandas has its own datetime datatype that makes it extremely convenient to convert strings in standard date formats to datetime objects, and then to use those datetime objects either to create new features on a dataframe or to work with the dataset as a time series.

This section will demonstrate how to take a column of date strings, convert it to a datetime dtype, and then use the `.dt` accessor to access specific parts of the date (year, month, day) to generate useful columns on a dataframe.

### Work with Dates 

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

Many of the most useful date columns in this dataset have the suffix `_d` to indicate that they correspond to dates.

We'll use a list comprehension to print them out:

In [151]:
[col for col in df if col.endswith('_d')]

['issue_d', 'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']

Let's look at the string format of the `issue_d` column:

In [294]:
df['issue_d']

0         Dec-2018
1         Dec-2018
2         Dec-2018
3         Dec-2018
4         Dec-2018
            ...   
128407    Oct-2018
128408    Oct-2018
128409    Oct-2018
128410    Oct-2018
128411    Oct-2018
Name: issue_d, Length: 128412, dtype: object

Because the `issue_d` strings follow a common month-year format (`%b-%Y`, for example `Dec-2018`), we can just let Pandas detect the format and translate each string to the appropriate datetime object.

In [296]:
# Convert each object in the 'issue_d' column to a datetime datatype
df['issue_d'] =  pd.to_datetime(df['issue_d'], infer_datetime_format=True)

Now the `issue_d` column holds `datetime` values. We can confirm this in the dtype summary that `df.info()` prints:

In [297]:
# We now have a datetime dtype within our dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128412 entries, 0 to 128411
Columns: 141 entries, loan_amnt to emp_title_cleaned
dtypes: datetime64[ns](1), float64(54), int64(51), object(35)
memory usage: 138.1+ MB


A single cell shows what a datetime object looks like: the month and year come from the strings that were in the column previously, and the rest of the values (day and time) have been filled in with defaults.

In [154]:
# Print out the first item of the 'issue_d' column
df['issue_d'][0]

Timestamp('2018-12-01 00:00:00')
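
The conversion above relies on format inference. As a hedged alternative, the format can be spelled out explicitly with the `format` parameter of `pd.to_datetime`; newer pandas releases deprecate `infer_datetime_format`, so this form may be the safer choice there. The sketch below uses a small made-up Series (`issue_dates`) and assumes English month abbreviations.

```python
import pandas as pd

# Toy strings in the same shape as the `issue_d` column
issue_dates = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# %b = abbreviated month name, %Y = four-digit year
parsed = pd.to_datetime(issue_dates, format='%b-%Y')

print(parsed.dt.year.tolist())   # → [2018, 2018, 2018]
print(parsed.dt.month.tolist())  # → [12, 11, 10]
print(parsed[0])
```

An explicit format also makes parsing fail loudly when a value does not match, rather than guessing.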

We can now use the `.dt` accessor to grab specific parts of the datetime object. Let's grab just the year from all of the cells in the `issue_d` column:

In [155]:
# We can also get more specific information on each of the values such as the year
df['issue_d'].dt.year

0         2018
1         2018
2         2018
3         2018
4         2018
          ... 
128407    2018
128408    2018
128409    2018
128410    2018
128411    2018
Name: issue_d, Length: 128412, dtype: int64

Now the month.

In [156]:
# and the month
df['issue_d'].dt.month

0         12
1         12
2         12
3         12
4         12
          ..
128407    10
128408    10
128409    10
128410    10
128411    10
Name: issue_d, Length: 128412, dtype: int64

It's just that easy! Now, instead of printing them out, let's add these year and month values as new columns on our dataframe. New columns are appended at the far right of the table, so we'll check the tail of `df.dtypes` to see them.

In [299]:
# Create new columns holding the year and the month separately
df['issue_d_year'] = df['issue_d'].dt.year
df['issue_d_month'] = df['issue_d'].dt.month

df.dtypes.tail()

settlement_percentage    float64
settlement_term          float64
emp_title_cleaned         object
issue_d_year               int64
issue_d_month              int64
dtype: object
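
Year and month are only two of the components the `.dt` accessor exposes. The sketch below, on three made-up dates, shows a few others that could plausibly serve as features (`quarter`, `month_name()`, `dayofweek`); the column names in `features` are illustrative, not part of the lesson.

```python
import pandas as pd

# Three illustrative dates, one per month of Q4 2018
dates = pd.Series(pd.to_datetime(['2018-10-01', '2018-11-01', '2018-12-01']))

features = pd.DataFrame({
    'quarter': dates.dt.quarter,
    'month_name': dates.dt.month_name(),   # English names by default
    'day_of_week': dates.dt.dayofweek,     # Monday=0 ... Sunday=6
})

print(features)
```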

Because all of these dates come from Q4 of 2018, the `issue_d` column isn't all that interesting. Let's look at the `earliest_cr_line` column, which is also a string but could be converted to datetime format.

We're going to create a new column called `credit_history_length`

It's a long column header, but think about how valuable this piece of information could be. This number indicates the length of a person's credit history, and if it is correlated with repayment or other factors, it could be a valuable predictor!

In [300]:
# Converting column to datetime dtype
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], infer_datetime_format=True)

In [303]:
# Creating new column using the issue date - their earliest credit line (in years)
df['credit_history_length'] = ((df['issue_d'] - df['earliest_cr_line']).dt.days / 365).round(2)

In [304]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,...,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_cleaned,issue_d_year,issue_d_month,credit_history_length
0,27975,27975,27975.0,36 months,14.47,962.52,C,C2,Conductor,10+ years,MORTGAGE,180000.0,Not Verified,2018-12-01,Fully Paid,n,credit_card,Credit card refinancing,117xx,NY,11.47,0,1995-07-01,0,39.0,,10,0,29711,66.8%,19,w,0.0,0.0,31804.529849,31804.53,27975.0,3829.53,0.0,0.0,0.0,Jan-2020,20288.02,,Mar-2020,0,,1,Individual,,...,10,0.0,0,0,0,78.9,60.0,0,0,286525,45387,29500,25025,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Conductor,2018,12,23.44
1,10000,10000,10000.0,60 months,12.98,227.43,B,B5,Printer,9 years,RENT,60000.0,Not Verified,2018-12-01,Current,n,credit_card,Credit card refinancing,660xx,KS,14.9,0,2007-05-01,0,,112.0,7,1,10677,54.2%,12,w,7352.95,7352.95,4487.31,4487.31,2647.05,1840.26,0.0,0.0,0.0,Aug-2020,227.43,Sep-2020,Aug-2020,0,,1,Individual,,...,7,0.0,0,0,1,100.0,33.3,1,0,36200,27595,13000,16500,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Printer,2018,12,11.59
2,28000,28000,28000.0,60 months,13.56,645.15,C,C1,Project Manager,10+ years,MORTGAGE,128500.0,Source Verified,2018-12-01,Current,n,credit_card,Credit card refinancing,760xx,TX,27.35,1,1987-07-01,0,24.0,,16,0,55206,75%,31,w,20667.91,20667.91,12871.36,12871.36,7332.09,5539.27,0.0,0.0,0.0,Aug-2020,645.15,Sep-2020,Aug-2020,0,71.0,1,Individual,,...,15,,0,0,1,94.0,100.0,0,0,542027,128345,55000,60331,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Project Manager,2018,12,31.44
3,16000,16000,16000.0,60 months,13.56,368.66,C,C1,LMSW,10+ years,RENT,46000.0,Not Verified,2018-12-01,Current,n,credit_card,Credit card refinancing,787xx,TX,11.09,1,1997-10-01,1,11.0,,9,0,18946,32.1%,21,w,11503.84,11503.84,7570.88,7570.88,4496.16,3074.72,0.0,0.0,0.0,Aug-2020,368.66,Sep-2020,Aug-2020,0,,1,Individual,,...,9,0.0,0,0,1,94.7,20.0,0,0,138125,128218,57600,79125,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Lmsw,2018,12,21.18
4,7500,7500,7500.0,36 months,10.72,244.55,B,B2,Program Manager,2 years,RENT,84000.0,Not Verified,2018-12-01,Charged Off,n,debt_consolidation,Debt consolidation,600xx,IL,3.86,0,2005-05-01,1,,114.0,9,1,2200,12.6%,17,w,0.0,0.0,2829.93,2829.93,1465.61,484.09,0.0,880.23,158.4414,Sep-2019,244.55,,Mar-2020,0,,1,Individual,,...,9,0.0,0,0,2,76.5,0.0,1,0,403892,20833,17400,16812,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Program Manager,2018,12,13.59


What we just did is so cool! Pandas' datetime format is so smart that we can simply use the subtraction operator `-` to calculate the amount of time between two dates.

Think about everything that's going on under the hood in order to give us such straightforward syntax! Handling months of different lengths, leap years, etc. Pandas datetime objects are seriously powerful!
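
To make the subtraction concrete, here is a small sketch using the issue date and earliest credit line from the first two rows shown above. Subtracting two datetime Series gives a Series of time differences, and `.dt.days` turns each one into a whole number of days. Dividing by 365 is the lesson's approximation of a year, so results can drift slightly from calendar years.

```python
import pandas as pd

issue = pd.to_datetime(pd.Series(['2018-12-01', '2018-12-01']))
earliest = pd.to_datetime(pd.Series(['1995-07-01', '2007-05-01']))

# Subtraction yields a Series of Timedelta values
gap = issue - earliest
print(gap.dt.days.tolist())   # → [8554, 4232]

# Approximate years, as in the lesson (365-day years)
years = (gap.dt.days / 365).round(2)
print(years.tolist())         # → [23.44, 11.59]
```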

What's the oldest credit history that was involved in Q4 2018?

In [305]:
# Showing Summary Statistics for the Credit History Length Column
df['credit_history_length'].describe()

count    128412.000000
mean         16.054602
std           7.908350
min           3.080000
25%          11.090000
50%          14.430000
75%          19.850000
max          68.960000
Name: credit_history_length, dtype: float64

---

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200> 

# Assignment:

- Replicate the lesson code.
  - This means that if you haven't followed along already, type out the things that we did in class. Forcing your fingers to hit each key will help you internalize the syntax of what we're doing. Make sure you understand each line of code that you're writing, and google things that you don't fully understand.
  - [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)
- Convert the `term` column from string to integer.

In [306]:
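# Strip spaces and the letters of 'months' from both ends, then cast to int
# (str.strip removes any of the given characters, not the exact substring)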
df['term'] = df['term'].apply(lambda x: int(x.strip(' months')))

In [314]:
# Term now has only number of months and is of dtype int
print(df['term'].dtype)
df.head()

int64


Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,...,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_cleaned,issue_d_year,issue_d_month,credit_history_length
0,27975,27975,27975.0,36,14.47,962.52,C,C2,Conductor,10+ years,MORTGAGE,180000.0,Not Verified,2018-12-01,Fully Paid,n,credit_card,Credit card refinancing,117xx,NY,11.47,0,1995-07-01,0,39.0,,10,0,29711,66.8%,19,w,0.0,0.0,31804.529849,31804.53,27975.0,3829.53,0.0,0.0,0.0,Jan-2020,20288.02,,Mar-2020,0,,1,Individual,,...,10,0.0,0,0,0,78.9,60.0,0,0,286525,45387,29500,25025,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Conductor,2018,12,23.44
1,10000,10000,10000.0,60,12.98,227.43,B,B5,Printer,9 years,RENT,60000.0,Not Verified,2018-12-01,Current,n,credit_card,Credit card refinancing,660xx,KS,14.9,0,2007-05-01,0,,112.0,7,1,10677,54.2%,12,w,7352.95,7352.95,4487.31,4487.31,2647.05,1840.26,0.0,0.0,0.0,Aug-2020,227.43,Sep-2020,Aug-2020,0,,1,Individual,,...,7,0.0,0,0,1,100.0,33.3,1,0,36200,27595,13000,16500,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Printer,2018,12,11.59
2,28000,28000,28000.0,60,13.56,645.15,C,C1,Project Manager,10+ years,MORTGAGE,128500.0,Source Verified,2018-12-01,Current,n,credit_card,Credit card refinancing,760xx,TX,27.35,1,1987-07-01,0,24.0,,16,0,55206,75%,31,w,20667.91,20667.91,12871.36,12871.36,7332.09,5539.27,0.0,0.0,0.0,Aug-2020,645.15,Sep-2020,Aug-2020,0,71.0,1,Individual,,...,15,,0,0,1,94.0,100.0,0,0,542027,128345,55000,60331,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Project Manager,2018,12,31.44
3,16000,16000,16000.0,60,13.56,368.66,C,C1,LMSW,10+ years,RENT,46000.0,Not Verified,2018-12-01,Current,n,credit_card,Credit card refinancing,787xx,TX,11.09,1,1997-10-01,1,11.0,,9,0,18946,32.1%,21,w,11503.84,11503.84,7570.88,7570.88,4496.16,3074.72,0.0,0.0,0.0,Aug-2020,368.66,Sep-2020,Aug-2020,0,,1,Individual,,...,9,0.0,0,0,1,94.7,20.0,0,0,138125,128218,57600,79125,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Lmsw,2018,12,21.18
4,7500,7500,7500.0,36,10.72,244.55,B,B2,Program Manager,2 years,RENT,84000.0,Not Verified,2018-12-01,Charged Off,n,debt_consolidation,Debt consolidation,600xx,IL,3.86,0,2005-05-01,1,,114.0,9,1,2200,12.6%,17,w,0.0,0.0,2829.93,2829.93,1465.61,484.09,0.0,880.23,158.4414,Sep-2019,244.55,,Mar-2020,0,,1,Individual,,...,9,0.0,0,0,2,76.5,0.0,1,0,403892,20833,17400,16812,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,Program Manager,2018,12,13.59
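
`str.strip(' months')` works on this column, but it treats its argument as a set of characters to remove from both ends rather than as an exact substring. As a hedged alternative, the sketch below uses a hypothetical helper named `term_to_int` that pulls out the digits with a regular expression, so it does not depend on which letters surround the number.

```python
import re

def term_to_int(term):
    """Return the first run of digits in a string like ' 36 months' as an int."""
    match = re.search(r'\d+', term)
    if match is None:
        raise ValueError(f'no digits found in {term!r}')
    return int(match.group())

print(term_to_int(' 36 months'))   # → 36
print(term_to_int('60 months'))    # → 60

# strip() removes characters from a set, not a substring:
print('north'.strip(' months'))    # → r
```

The helper could be passed to `.apply()` in the same way as the lambda above.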


- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

In [315]:
# Using a list comprehension, we create a new column called loan_status_is_great
# and put in 1 if the loan_status is Current or Fully Paid; otherwise we put in a 0
df['loan_status_is_great'] = [1 * (x in ['Current', 'Fully Paid']) for x in df['loan_status']]

In [317]:
# The count of great vs. not-great loan statuses
df['loan_status_is_great'].value_counts()

1    114898
0     13514
Name: loan_status_is_great, dtype: int64
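
The list comprehension above can also be written with `Series.isin`, which returns a boolean Series that casts cleanly to 0 and 1. A minimal sketch on a few illustrative status strings (the Series `statuses` is made up for the example):

```python
import pandas as pd

statuses = pd.Series(['Current', 'Fully Paid', 'Charged Off', 'Late (31-120 days)'])

# True where the status is one of the "great" values, then cast to 1/0
is_great = statuses.isin(['Current', 'Fully Paid']).astype(int)

print(is_great.tolist())   # → [1, 1, 0, 0]
```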


- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [318]:
# Make a series of datetimes and store it in a variable called lpd
lpd = pd.to_datetime(df['last_pymnt_d'], infer_datetime_format=True)
lpd

0        2020-01-01
1        2020-08-01
2        2020-08-01
3        2020-08-01
4        2019-09-01
            ...    
128407   2018-12-01
128408   2020-08-01
128409   2020-08-01
128410   2020-08-01
128411   2020-08-01
Name: last_pymnt_d, Length: 128412, dtype: datetime64[ns]

In [319]:
# Extracting month and year and creating new columns within the dataset
df['last_pymnt_d_month'] = lpd.dt.month
df['last_pymnt_d_year'] = lpd.dt.year
# Because the column has missing dates, months and years will be displayed as floats
df['last_pymnt_d_year']

0         2020.0
1         2020.0
2         2020.0
3         2020.0
4         2019.0
           ...  
128407    2018.0
128408    2020.0
128409    2020.0
128410    2020.0
128411    2020.0
Name: last_pymnt_d_year, Length: 128412, dtype: float64
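
If whole-number years are preferred even with missing dates, one option is pandas' nullable integer dtype `'Int64'` (capital I), which stores missing entries as `<NA>` instead of forcing the column to float. A sketch on made-up values, assuming a pandas version that supports nullable integers:

```python
import pandas as pd

# One missing payment date, as in the real column
last_payment = pd.to_datetime(pd.Series(['Jan-2020', None, 'Sep-2019']), format='%b-%Y')

year_float = last_payment.dt.year        # float, with NaN for the missing date
year_int = year_float.astype('Int64')    # nullable integer, with <NA>

print(year_float.tolist())
print(year_int.tolist())
```

This keeps the missing value visible rather than replacing it with a placeholder such as 0.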

# Stretch Goals


You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!

In [320]:
# removing and converting percentages to floats in the 'revol_util' column
df['revol_util'] = df['revol_util'].apply(lambda x: float(x.strip('%')) if isinstance(x, str) else float(x))

In [321]:
df['revol_util'].describe()

count    128256.000000
mean         44.197281
std          24.798842
min           0.000000
25%          24.600000
50%          42.400000
75%          62.500000
max         183.800000
Name: revol_util, dtype: float64
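
The `isinstance` check above is needed because `.apply()` hands each missing value to the lambda as a float `NaN`. With the `.str` accessor, missing values pass through untouched, so the same cleanup can be sketched without the check. The Series `raw_util` below is made-up data in the shape of `revol_util`.

```python
import numpy as np
import pandas as pd

raw_util = pd.Series(['66.8%', '75%', np.nan, '0%'])

# .str methods leave NaN as NaN, and astype(float) keeps it
util = raw_util.str.rstrip('%').astype(float)

print(util.tolist())
print(int(util.isna().sum()))   # → 1
```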

In [322]:
# Create series of the top 20 values
top20 = df['emp_title'].value_counts().head(20)
top20

Teacher                     2090
Manager                     1773
Registered Nurse             952
Driver                       924
RN                           726
Supervisor                   697
Sales                        580
Project Manager              526
General Manager              523
Office Manager               521
Owner                        420
Director                     402
Operations Manager           387
Truck Driver                 387
Nurse                        326
Engineer                     325
Sales Manager                304
manager                      301
Supervisor                   270
Administrative Assistant     269
Name: emp_title, dtype: int64

In [323]:
# Change value of emp_title to 'Other' if not in top20
# (`in` on a Series checks its index labels, which here are the titles)
df['emp_title'] = df['emp_title'].apply(lambda x: 'Other' if x not in top20 else x)

In [325]:
# We now have the top 20 AND Other (21 total)
df['emp_title'].value_counts()

Other                       115709
Teacher                       2090
Manager                       1773
Registered Nurse               952
Driver                         924
RN                             726
Supervisor                     697
Sales                          580
Project Manager                526
General Manager                523
Office Manager                 521
Owner                          420
Director                       402
Operations Manager             387
Truck Driver                   387
Nurse                          326
Engineer                       325
Sales Manager                  304
manager                        301
Supervisor                     270
Administrative Assistant       269
Name: emp_title, dtype: int64
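
An alternative way to express the same grouping is `Series.where` combined with `isin`, which keeps values that pass a test and replaces the rest. The sketch below uses made-up titles and `top_n = 2` (the lesson uses 20) to keep the output readable. Note that the counts above still contain `manager` and a second `Supervisor` entry, so running this kind of grouping on a cleaned title column would likely merge those categories.

```python
import pandas as pd

titles = pd.Series(['Teacher', 'Teacher', 'Manager', 'Manager', 'Nurse', 'Pilot', None])

top_n = 2  # the lesson uses 20; 2 keeps the toy example small
top_titles = titles.value_counts().head(top_n).index

# Keep a title only if it is among the top_n; everything else becomes 'Other'
grouped = titles.where(titles.isin(top_titles), 'Other')

print(grouped.tolist())
# → ['Teacher', 'Teacher', 'Manager', 'Manager', 'Other', 'Other', 'Other']
```

As in the lambda version, missing titles end up in `'Other'`.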