<a href="https://colab.research.google.com/github/AnikaZN/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/thursday-make-features/LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [0]:
!unzip LoanStats_2018Q4.csv.zip

In [0]:
!head LoanStats_2018Q4.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd

In [0]:
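# skipfooter is only supported by the Python parsing engine, so request it explicitly
df = pd.read_csv(sep=',', filepath_or_buffer='https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip', skiprows=1, skipfooter=2, engine='python')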
df.head()
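
To see what `skiprows` and `skipfooter` do, here is a minimal sketch on a small in-memory file. The file contents and the name `sample` are made up for illustration; only the `read_csv` parameters come from the cell above.

```python
import io
import pandas as pd

# Made-up contents: one preamble line, a header, two records, two footer lines
raw = (
    "preamble line that is not part of the table\n"
    "loan_amnt,int_rate\n"
    "10000,13.56%\n"
    "2500,18.94%\n"
    "footer line one\n"
    "footer line two\n"
)

# skiprows=1 drops the preamble; skipfooter=2 drops the last two lines
sample = pd.read_csv(io.StringIO(raw), skiprows=1, skipfooter=2, engine='python')
print(sample)
```

Only the header and the two records survive, which is the same clean-up the LendingClub file needs.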

In [0]:
df.tail()

In [0]:
df.shape

In [0]:
df.describe()

In [0]:
df.info()

In [0]:
df.dtypes

In [0]:
df.dtypes.value_counts()

In [0]:
df.shape

In [0]:
df.isnull().sum()

In [0]:
df.isnull().sum().sort_values(ascending=False)

In [0]:
df.isnull().sum().sort_values(ascending=False)/df.shape[0]

In [0]:
#alternative
df.isnull().sum().sort_values(ascending=False)/len(df)
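
Another way to get the share of missing values per column is to take the mean of the boolean mask, since `True` counts as 1. A small sketch on a made-up frame (the name `toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up data: column 'a' is half missing, column 'b' is a quarter missing
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
})

# Mean of the True/False mask equals the fraction of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```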

## Work with strings

For machine learning, we usually want to replace strings with numbers.

Columns with the `object` dtype usually hold strings. We can count how many columns have each dtype:

In [0]:
df.dtypes.value_counts()

In [0]:
df['int_rate'].head()

# The percent signs make pandas read this column as strings (object dtype)
# instead of numbers

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
def strip_percent(x_str):
  # Remove the trailing '%' and convert to float; float() ignores surrounding spaces
  return float(x_str.strip('%'))  # alternative: float(x_str[:-1])

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(strip_percent)
df['int_rate'].head()
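
A vectorized alternative uses the `.str` accessor, which passes missing values through instead of raising an error. This is a sketch on made-up rates, not the lesson's required approach; the names `rates` and `rates_float` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values in the same shape as the percent strings, plus one missing entry
rates = pd.Series([' 13.56%', ' 18.94%', np.nan])

# .str methods leave NaN untouched, so astype(float) keeps it as NaN
rates_float = rates.str.strip().str.rstrip('%').astype(float)
print(rates_float)
```

Calling `.apply(strip_percent)` on the same data would fail on the missing entry, because a float `NaN` has no `.strip` method.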

### Clean `emp_title`

Look at the first 20 titles

In [0]:
df['emp_title'].head(n=20)

How often is `emp_title` null?

In [0]:
df['emp_title'].value_counts(dropna=False).head(20)

In [0]:
df['emp_title'].isnull().sum()

Clean the title and handle missing values

In [0]:
#normalizing

df['emp_title'].str.lower()

In [0]:
import numpy as np

In [0]:
def clean_title(title):
    # isinstance returns True for strings and False for anything else (such as NaN)
    if isinstance(title, str):
        return title.strip().lower()
    else:
        return 'unknown'

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_title)
df['emp_title'].head()

In [0]:
df['emp_title'].value_counts(dropna=False).head(20)

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title'].str.contains('manager')

In [0]:
df['emp_title_manager'] = df['emp_title'].str.contains('manager')
df['emp_title_manager'].sample(10)
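
`str.contains` also accepts `case` and `na` parameters, which matter when a column has not been cleaned first. A minimal sketch on made-up titles (the names `titles` and `is_manager` are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up titles: mixed case and one missing value
titles = pd.Series(['Project Manager', 'teacher', np.nan])

# case=False ignores capitalization; na=False treats missing titles as "no match"
is_manager = titles.str.contains('manager', case=False, na=False)
print(is_manager)
```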

In [0]:
df.to_csv('tmp.csv', index=False)

#"checkpoint" - create a new Save File with all the work you've already done

In [0]:
df['emp_title'].nunique()

In [0]:
idx_manager = df['emp_title_manager'] == True
df_managers = df[idx_manager]
df_managers.shape

In [0]:
idx_nonmanager = df['emp_title_manager'] == False
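# .copy() makes an independent dataframe, so adding columns to it later
# does not trigger SettingWithCopyWarning
df_nonmanagers = df[idx_nonmanager].copy()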
df_nonmanagers.shape

In [0]:
df_2 = pd.read_csv('tmp.csv')
df_2.head()

In [0]:
del df_2

In [0]:
del df

In [0]:
print(df_managers['int_rate'].mean(), df_nonmanagers['int_rate'].mean())

In [0]:
print(df_managers['int_rate'].std(), df_nonmanagers['int_rate'].std())

In [0]:
%matplotlib inline

In [0]:
df_managers['int_rate'].hist()

In [0]:
df_nonmanagers['int_rate'].hist()

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
for col_name in df_nonmanagers.columns:
  if col_name.endswith('_d'):
    print(col_name)

In [0]:
df_nonmanagers['issue_d'].sample(10)

In [0]:
pd.to_datetime(df_nonmanagers['issue_d'])

In [0]:
df_nonmanagers['issue_d'] = pd.to_datetime(df_nonmanagers['issue_d'])
df_nonmanagers['issue_d'].head()

In [0]:
df_nonmanagers['issue_year'] = df_nonmanagers['issue_d'].dt.year
df_nonmanagers['issue_month'] = df_nonmanagers['issue_d'].dt.month
df_nonmanagers[['issue_year', 'issue_month']].head()
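
Passing an explicit `format` to `to_datetime` makes the parsing rule visible. The sketch below assumes dates shaped like `Dec-2018` and an English locale for the `%b` month abbreviation; the series is made up.

```python
import pandas as pd

# Made-up dates in month-abbreviation-year form
issue = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# %b is the abbreviated month name, %Y the four-digit year
issue_dt = pd.to_datetime(issue, format='%b-%Y')

# The .dt accessor exposes the components as integers
parts = pd.DataFrame({'year': issue_dt.dt.year, 'month': issue_dt.dt.month})
print(parts)
```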

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
print(df_nonmanagers.term)

In [0]:
def strip_month(x_str):
  # ' 36 months' -> ['36', 'months'] -> 36
  return int(x_str.split()[0])

In [0]:
df_nonmanagers['term'] = df_nonmanagers['term'].apply(strip_month)
print(df_nonmanagers['term'])
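
One thing to keep in mind with `str.strip`: its argument is a set of characters to remove from both ends, not a literal substring. The sketch below shows that behavior and one possible alternative, `str.extract` with a digit pattern. The sample values are made up.

```python
import pandas as pd

# strip(' months') removes any of the characters ' ', m, o, n, t, h, s from both ends
assert ' 36 months'.strip(' months') == '36'
# ...so a word made only of those characters disappears entirely
assert 'host'.strip(' months') == ''

# Made-up terms in the same shape as the `term` column
term = pd.Series([' 36 months', ' 60 months'])

# Pull out the first run of digits, then convert to integers
term_int = term.str.extract(r'(\d+)', expand=False).astype(int)
print(term_int)
```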

In [0]:
print(df_nonmanagers.loan_status)

In [0]:
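# str.contains takes a single pattern (its second argument is `case`),
# so use isin to test for either of the two exact statuses
df_nonmanagers['loan_status_is_great'] = df_nonmanagers['loan_status'].isin(['Current', 'Fully Paid'])
df_nonmanagers['loan_status_is_great'] = df_nonmanagers['loan_status_is_great'].astype(int)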
df_nonmanagers['loan_status_is_great'].sample(10)
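
Testing whether a value is one of several exact labels is a job for `Series.isin`. `str.contains` accepts only one pattern, and its second positional parameter is `case`, not a second pattern. A small sketch on made-up statuses:

```python
import pandas as pd

# Made-up statuses; the real column may contain other labels
status = pd.Series(['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off'])

# True where the status is exactly one of the listed labels, then cast to 1/0
is_great = status.isin(['Current', 'Fully Paid']).astype(int)
print(is_great)
```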

In [0]:
print(df_nonmanagers['last_pymnt_d'])

In [0]:
df_nonmanagers['last_pymnt_d'].nunique()

In [0]:
df_nonmanagers['last_pymnt_d'].isnull().sum()

In [0]:
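# Fill each missing date with the next non-missing value below it
# (bfill() replaces the deprecated fillna(method='bfill'))
df_nonmanagers['last_pymnt_d'] = df_nonmanagers['last_pymnt_d'].bfill()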

In [0]:
df_nonmanagers['last_pymnt_d'].isnull().sum()

In [0]:
df_nonmanagers['last_pymnt_d'].sort_values(ascending=False)

In [0]:
df_nonmanagers['last_pymnt_d'].sample(10)

#Oct-2018
#Nov-2018
#Dec-2018
#Jan-2019
#Feb-2019
#Mar-2019
#Apr-2019

In [0]:
# Parse strings such as 'Oct-2018' into datetimes once, instead of stripping characters
last_pymnt_dt = pd.to_datetime(df_nonmanagers['last_pymnt_d'], format='%b-%Y')

In [0]:
# The .dt accessor returns the month and year as numbers
df_nonmanagers['last_pymnt_d_month'] = last_pymnt_dt.dt.month
df_nonmanagers['last_pymnt_d_year'] = last_pymnt_dt.dt.year

In [0]:
print(df_nonmanagers['last_pymnt_d_month'])

In [0]:
print(df_nonmanagers['last_pymnt_d_year'])

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
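
As a starting point for the "top 20" option, one possible approach combines `value_counts`, `isin`, and `where`. The sketch uses made-up titles and a hypothetical `top_n` of 2 so the result is easy to check by eye.

```python
import pandas as pd

# Made-up titles: 'teacher' appears 3 times, 'manager' twice, the rest once
titles = pd.Series(['teacher', 'manager', 'teacher', 'driver',
                    'manager', 'teacher', 'nurse'])

top_n = 2  # hypothetical value for this toy example; the stretch goal uses 20
top_titles = titles.value_counts().head(top_n).index

# Keep titles that are in the top list, replace everything else with 'Other'
titles_capped = titles.where(titles.isin(top_titles), 'Other')
print(titles_capped.value_counts())
```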

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01