_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [1]:
# Download the zipped data from the URL and save it locally
!wget https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip

--2019-07-19 16:41:39--  https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2019Q1.csv.zip.2’

          LoanStats     [<=>                 ]       0  --.-KB/s               ^C


In [2]:
# List the files to confirm the download
!ls

LoanStats_2019Q1.csv	  LoanStats_2019Q1.csv.zip.1  sample_data
LoanStats_2019Q1.csv.zip  LoanStats_2019Q1.csv.zip.2


In [0]:
# unzip downloaded file
!unzip LoanStats_2019Q1.csv.zip

Archive:  LoanStats_2019Q1.csv.zip
replace LoanStats_2019Q1.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [0]:
# Preview the first few lines of the raw file
!head LoanStats_2019Q1.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd
# First attempt: read the file as-is to inspect its shape and layout
df = pd.read_csv('LoanStats_2019Q1.csv')
print(df.shape)
df.head()

In [0]:
# Skip the first row and the last two rows; skipfooter requires the Python engine
df = pd.read_csv('LoanStats_2019Q1.csv', skiprows=1, skipfooter=2, engine='python')

In [0]:
df.isna().sum()

In [0]:
df.info()
pd.options.display.max_columns = 25  # None shows all columns
pd.options.display.max_rows = 25

In [0]:
df.head().T

## Work with strings

For machine learning, we usually want to replace strings with numbers.

To see which columns have the dtype `object` (strings), summarize only those columns:

In [0]:
df.describe(include='object')

In [0]:
df.grade.value_counts()

In [0]:
df.emp_length.value_counts()

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
x = '12.5%'
type(float(x.strip('%')))

In [0]:
# Vectorized version: strip '%' from every value, then cast the column to float
df['int_rate'] = df['int_rate'].str.strip('%').astype(float)

In [0]:
df.head()

Apply the function to the `int_rate` column

In [0]:
x

In [0]:
def remove_percent_sign(string):
  # Use the argument, not the global x, so each value is converted on its own
  return float(string.strip('%'))
remove_percent_sign(x)


In [0]:
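# Alternative to the vectorized version above; run this on the original string column,
# because values that are already floats have no .strip method
df['int_rate'] = df['int_rate'].apply(remove_percent_sign)
df['int_rate'].head()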

### Clean `emp_title`

Look at top 20 titles

In [0]:
# value_counts(20) would set normalize=20; take the first 20 rows of the counts instead
df.emp_title.value_counts().head(20)

How often is `emp_title` null?
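
One way to answer this is sketched below on a small made-up Series rather than the real column (the name `emp_title` and its values here are hypothetical stand-ins). `isna()` flags missing entries, `sum()` counts them, and `mean()` gives the share of rows that are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['emp_title']; the values are made up for illustration
emp_title = pd.Series(['Teacher', np.nan, 'manager', None, 'Registered Nurse'])

null_count = emp_title.isna().sum()      # number of missing titles
null_fraction = emp_title.isna().mean()  # share of rows that are missing

print(null_count, null_fraction)
```

On the lesson's dataframe, the same two expressions would presumably be applied to `df['emp_title']` directly.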

Clean the title and handle missing values

In [0]:
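def clean_emp_title(title):
  # Assumed cleaning rule: strip and title-case strings; label missing values 'Unknown'
  return title.strip().title() if isinstance(title, str) else 'Unknown'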

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
# apply() needs a function argument; pass the cleaning function defined above
df['emp_title'] = df['emp_title'].apply(clean_emp_title)
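
A minimal sketch of how `str.contains` could produce an `emp_title_manager` column. The frame `df_sample` and its titles are hypothetical, and treating any title containing "manager" as a manager is an assumed rule, not one stated in the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical sample titles standing in for df['emp_title']
df_sample = pd.DataFrame(
    {'emp_title': ['Project Manager', 'Teacher', np.nan, 'general manager']}
)

# case=False matches 'Manager' and 'manager' alike;
# na=False treats missing titles as non-matches instead of propagating NaN
df_sample['emp_title_manager'] = df_sample['emp_title'].str.contains(
    'manager', case=False, na=False
)

print(df_sample)
```

Passing `na=False` keeps the result a plain boolean column, which is convenient if the feature is later cast to integers.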

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
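
As a hedged illustration of the two links above, the sketch below parses a date column and pulls out components with `.dt`. The frame `df_sample` is hypothetical, and it assumes the dates are stored as strings such as 'Mar-2019' with English month abbreviations; check the actual column before reusing the format string.

```python
import pandas as pd

# Hypothetical sample; assumes dates are stored as strings like 'Mar-2019'
df_sample = pd.DataFrame({'issue_d': ['Mar-2019', 'Feb-2019', 'Jan-2019']})

# '%b-%Y' means abbreviated month name, a dash, then a four-digit year
df_sample['issue_d'] = pd.to_datetime(df_sample['issue_d'], format='%b-%Y')

# Once the column is datetime64, components are available through .dt
df_sample['issue_year'] = df_sample['issue_d'].dt.year
df_sample['issue_month'] = df_sample['issue_d'].dt.month

print(df_sample)
```

Giving `to_datetime` an explicit `format` makes the parsing rule visible and avoids relying on format inference.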

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid"; otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01