<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200> 




# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series
- [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [67]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-09-05 20:08:42--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.3’

LoanStats_2018Q4.cs     [        <=>         ]  21.58M  1.72MB/s    in 13s     

2019-09-05 20:08:55 (1.70 MB/s) - ‘LoanStats_2018Q4.csv.zip.3’ saved [22631049]



In [0]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [0]:
!head LoanStats_2018Q4.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

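# header=1: the column names are on the second line of the file.
# skipfooter is only supported by the Python parser engine, so request it explicitly.
df = pd.read_csv('LoanStats_2018Q4.csv', header=1, na_values=['n/a'],
                 skipfooter=2, engine='python')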
df.head(10)

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can find out which columns have the "object" datatype (strings).

In [0]:
list(df.columns)
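Listing every column does not show which ones hold strings. A minimal sketch of narrowing the list with `select_dtypes`, using a small made-up frame (`toy` and its values are hypothetical stand-ins for `df`):

```python
import pandas as pd

# Toy stand-in for the LendingClub frame (hypothetical values)
toy = pd.DataFrame({
    'loan_amnt': [10000, 2500],
    'int_rate': ['13.56%', '18.94%'],
    'emp_title': ['Teacher', None],
})

# Keep only the columns whose dtype is "object", which usually means strings
object_cols = toy.select_dtypes(include='object').columns.tolist()
print(object_cols)
```

On the real data, the same call would be `df.select_dtypes(include='object')`.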

### Look at null values

In [0]:
# shape is (rows, columns): shape[0] is the number of rows, shape[1] the number of columns
df.shape

In [0]:
df.isnull().sum().sort_values(ascending=False)

In [0]:
df = df.drop(columns=['id', 'member_id', 'desc', 'url'])

In [0]:
df.head()

In [0]:
df.isnull().sum().sort_values(ascending=False)

In [0]:
df['int_rate']

In [0]:
df.dtypes

In [0]:
df.head()

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
int_rate = '17.3038%'

In [0]:
int_rate.strip('%')  # removes leading and trailing % characters from the string

In [0]:
type(int_rate.strip('%'))

In [0]:
float(int_rate.strip('%'))

In [0]:
type(float(int_rate.strip('%')))

In [0]:
def rem_per_type_float(string):
    """Remove percent signs from both ends of a string and return a float."""
    return float(string.strip('%'))

rem_per_type_float(int_rate)

In [0]:
list(df['int_rate'])

In [0]:
# Use a list comprehension to replace the int_rate strings with floats.
# Run this cell only once: after conversion the values are floats, which have no .strip().
df['int_rate'] = [rem_per_type_float(item) for item in df['int_rate']]

In [0]:
df.head()

Apply the function to the `int_rate` column
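One way to do this is `Series.apply`, sketched here on a small made-up Series (`rates` is hypothetical; the function is redefined so the example runs on its own). The vectorized `.str` version is an alternative that should give the same result.

```python
import pandas as pd

def rem_per_type_float(string):
    """Remove percent signs from both ends of a string and return a float."""
    return float(string.strip('%'))

# Hypothetical sample of interest rates stored as strings
rates = pd.Series(['13.56%', '18.94%', '7.02%'], name='int_rate')

# Option 1: call the function on every element
applied = rates.apply(rem_per_type_float)

# Option 2: vectorized string method, then cast the whole column
vectorized = rates.str.strip('%').astype(float)

print(applied)
```

On the real data, this would be `df['int_rate'] = df['int_rate'].apply(rem_per_type_float)`, assuming the column still holds strings.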

### Clean `emp_title`

Look at top 20 titles
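A minimal sketch using `value_counts`, which sorts by frequency and skips missing values by default. The `titles` Series is made-up sample data standing in for `df['emp_title']`.

```python
import pandas as pd

# Hypothetical sample of job titles, including a missing value
titles = pd.Series(
    ['Teacher', 'teacher', 'Manager', 'Teacher', None, 'Registered Nurse'],
    name='emp_title',
)

# Count each distinct title and keep the 20 most frequent
top_titles = titles.value_counts().head(20)
print(top_titles)
```

Note that 'Teacher' and 'teacher' are counted separately, which is one reason to clean the titles.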

How often is `emp_title` null?
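A short sketch of counting nulls and expressing them as a share of all rows, again on made-up data (`titles` is a hypothetical stand-in for `df['emp_title']`):

```python
import pandas as pd

# Hypothetical sample: one of four titles is missing
titles = pd.Series(['Teacher', None, 'Manager', 'Driver'], name='emp_title')

null_count = titles.isnull().sum()   # number of missing titles
null_share = titles.isnull().mean()  # fraction of rows that are missing

print(null_count, null_share)
```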

Clean the title and handle missing values
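One possible approach, sketched under assumptions: trim whitespace, normalize capitalization with `str.title`, and label missing values as 'Unknown'. The function name `clean_title`, the 'Unknown' placeholder, and the sample data are all illustrative choices, not part of the original notebook.

```python
import pandas as pd

def clean_title(title):
    """Return a trimmed, title-cased job title, or 'Unknown' if it is missing."""
    if isinstance(title, str):
        return title.strip().title()
    return 'Unknown'  # assumed placeholder for missing titles

# Hypothetical sample with inconsistent casing, stray spaces, and a missing value
titles = pd.Series([' TEACHER ', 'teacher', 'registered nurse', None], name='emp_title')

cleaned = titles.apply(clean_title)
print(cleaned.value_counts())
```

After cleaning, the two spellings of 'Teacher' collapse into a single category.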

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)
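A minimal sketch of building the flag with `str.contains`. The sample frame is hypothetical; `case=False` makes the match case-insensitive and `na=False` treats missing titles as non-managers.

```python
import pandas as pd

# Hypothetical sample of job titles
toy = pd.DataFrame({'emp_title': ['Project Manager', 'Teacher', 'manager', None]})

# True where the title contains "manager" in any casing
toy['emp_title_manager'] = toy['emp_title'].str.contains('manager', case=False, na=False)
print(toy)
```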

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
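A brief sketch of parsing date strings and pulling out components with the `.dt` accessor. It assumes a column named `issue_d` with values like 'Dec-2018'. Both the column name and the format are assumptions here, so check them against the real data. Parsing month abbreviations with `%b` also assumes an English locale.

```python
import pandas as pd

# Hypothetical sample of issue dates stored as strings
toy = pd.DataFrame({'issue_d': ['Dec-2018', 'Nov-2018', 'Oct-2018']})

# Parse the strings into datetimes using an explicit format
toy['issue_d'] = pd.to_datetime(toy['issue_d'], format='%b-%Y')

# Extract numeric features through the .dt accessor
toy['issue_year'] = toy['issue_d'].dt.year
toy['issue_month'] = toy['issue_d'].dt.month
print(toy)
```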