<a href="https://colab.research.google.com/github/jmend01/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/module4-makefeatures/Jonathan_Mendoza_LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip

In [0]:
!unzip LoanStats_2019Q1.csv.zip

In [0]:
!head LoanStats_2019Q1.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd
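# skiprows=1 drops the line above the header row; skipfooter=2 drops the last
# two lines of the file (skipfooter requires engine='python').
df = pd.read_csv('LoanStats_2019Q1.csv', skiprows=1, skipfooter=2, engine='python')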
print(df.shape)
df.head()

In [0]:
df.isna().sum()

In [0]:
df[df.loan_amnt.isna()]

In [0]:
df.head().T

In [0]:
pd.options.display.max_columns = 150
pd.options.display.max_rows = 150
df.head().T

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get summary info for the columns whose dtype is `object` (strings).

In [0]:
df.describe(include = 'object')

In [0]:
df.grade.value_counts()

### Convert `int_rate`

Remove the percent sign from a string and convert the result to a float. Try it on a single value first.

In [0]:
x = '12.5%'
float(x.strip('%'))


Apply the function to the `int_rate` column

In [0]:
df.int_rate.str.strip('%').astype(float).head()

In [0]:
def remove_pc(string):
  return float(string.strip('%'))

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_pc)
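
As a small illustration of the two approaches used above, the sketch below compares the element-wise `apply` version with the vectorized `.str` version. The `rates` Series is made-up toy data standing in for `int_rate`, not values from the LendingClub file.

```python
import pandas as pd

# Toy stand-in for the `int_rate` column (values invented for illustration).
rates = pd.Series(['12.5%', ' 7.02%', '18.99%'])

# Element-wise version, mirroring remove_pc from the lesson.
def remove_pc(string):
    return float(string.strip('%'))

via_apply = rates.apply(remove_pc)

# Vectorized version: trim whitespace, drop the trailing percent sign, cast.
via_str = rates.str.strip().str.rstrip('%').astype(float)

print(via_apply.equals(via_str))
```

Both forms expect strings, so running the conversion a second time on an already numeric column raises an error.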

### Clean `emp_title`

Look at top 20 titles

In [0]:
df.emp_title.value_counts().head(20)

In [0]:
df.emp_title.nunique()

How often is `emp_title` null?

In [0]:
df.emp_title.isna().sum()/df.shape[0]

Clean the title and handle missing values

In [0]:
import numpy as np

example = ['owner','Supervisor','Project manager',np.nan]

def clean_emp_title(x):
  if isinstance(x, str):
    return x.strip().title()
  else:
    return 'Missing'
  
# for ex in example:
#     print(clean_emp_title(ex))

[clean_emp_title(x) for x in example]

In [0]:
df['emp_title'] = df['emp_title'].apply(clean_emp_title)

In [0]:
df.emp_title.nunique()
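
The same cleaning can probably be written with vectorized string methods instead of `apply`. This is only a sketch on a toy Series (`titles` is a hypothetical stand-in for `emp_title`); it relies on the `.str` methods passing missing values through untouched so that `fillna` can label them afterwards.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `emp_title`, including a missing value.
titles = pd.Series(['owner', ' Supervisor ', 'Project manager', np.nan])

# .str methods skip NaN, so the missing entry survives until fillna.
cleaned = titles.str.strip().str.title().fillna('Missing')

print(cleaned.tolist())
# ['Owner', 'Supervisor', 'Project Manager', 'Missing']
```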

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [0]:
df['emp_title'].str.contains('manager', case = False ).head()

In [0]:
df['emp_title_manager'] = df.emp_title.str.contains('Manager')
df['emp_title_manager'].value_counts()

In [0]:
df.groupby('emp_title_manager').int_rate.mean()
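
To make the effect of the `case` argument concrete, here is a minimal sketch on invented titles. The `na=False` argument is an addition not used in the lesson; it treats missing titles as non-matches, which matters when the column has not been cleaned first.

```python
import numpy as np
import pandas as pd

# Toy titles (invented); one is lowercase and one is missing.
titles = pd.Series(['Office Manager', 'manager', 'Teacher', np.nan])

# Case-sensitive match misses the lowercase title.
case_sensitive = titles.str.contains('Manager', na=False)

# Case-insensitive match catches both spellings.
case_insensitive = titles.str.contains('manager', case=False, na=False)

# Cast the boolean flag to 0/1 if a numeric feature is preferred.
is_manager = case_insensitive.astype(int)
```

In the lesson the titles are already title-cased by `clean_emp_title`, so the case-sensitive `'Manager'` pattern is enough there.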

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
df['issue_d'].describe()

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format = True)

In [0]:
df['issue_d'].describe()

In [0]:
df['issue_d'].iloc[0:5].dt.year

In [0]:
df['issue_month'] = df['issue_d'].dt.month

In [0]:
df['issue_month'].value_counts()

In [0]:
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])

In [0]:
df['days_since_earliest_cr_line'] = (df['issue_d']-df['earliest_cr_line']).dt.days

In [0]:
df['days_since_earliest_cr_line'].describe()
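
The date steps above can be sketched end to end on a tiny frame. The `loans` frame and its values are invented, and the `'%b-%Y'` format (for example `Mar-2019`) is an assumption about how the dates are written rather than something confirmed by the lesson.

```python
import pandas as pd

# Toy frame (invented values); the 'Mon-YYYY' layout is an assumption.
loans = pd.DataFrame({
    'issue_d': ['Mar-2019', 'Jan-2019'],
    'earliest_cr_line': ['Mar-2009', 'Dec-2018'],
})

# Parse both columns with an explicit format instead of inference.
for col in ['issue_d', 'earliest_cr_line']:
    loans[col] = pd.to_datetime(loans[col], format='%b-%Y')

# Date component via the .dt accessor.
loans['issue_month'] = loans['issue_d'].dt.month

# Subtracting datetimes gives a timedelta; .dt.days turns it into whole days.
loans['days_since_earliest_cr_line'] = (
    loans['issue_d'] - loans['earliest_cr_line']
).dt.days

print(loans[['issue_month', 'days_since_earliest_cr_line']])
```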

In [0]:
[col for col in df if col.endswith('_d')]

In [0]:
for col in ['last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d']:
  df[col] = pd.to_datetime(df[col])

In [0]:
df.describe(include = 'datetime')

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid"; otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

## Load in data

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2019Q1.csv.zip
!unzip LoanStats_2019Q1.csv.zip

## Import libraries, initialize and inspect dataset

In [0]:
import pandas as pd
import numpy as np

df = pd.read_csv('LoanStats_2019Q1.csv',skiprows = 1, skipfooter = 2, engine = 'python')
print(df.shape)
df.head()

## Inspect `term` column and change variable type

In [0]:
print(df.term.head())
df.term.value_counts()

In [0]:
def remove_text(string):
  return int(string.strip(' months'))

df['term'] = df['term'].apply(remove_text)
df.term.head()
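
Note that `str.strip(' months')` treats its argument as a set of characters to trim from both ends, not as a literal suffix. It works here because digits are not in that set. One alternative, sketched below on invented values, is to pull the digits out with a regular expression; the `term` Series is a toy stand-in for the real column.

```python
import pandas as pd

# Toy stand-in for the `term` column (values invented for illustration).
term = pd.Series([' 36 months', ' 60 months', ' 36 months'])

# Capture the first run of digits and cast it to an integer.
term_months = term.str.extract(r'(\d+)', expand=False).astype(int)

print(term_months.tolist())
# [36, 60, 36]
```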

## Create `loan_status_is_great` column

In [0]:
df.loan_status.value_counts()

In [0]:
df['loan_status_is_great'] = np.where((df['loan_status']== 'Current')|(df['loan_status']=='Fully Paid'), 1, 0)

df['loan_status_is_great'].value_counts()
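
An equivalent way to build the flag, shown here only as a sketch, is `isin` followed by a cast to integer. This avoids chaining comparisons with `|`. The statuses below are invented examples, and the missing entry is included to show that it maps to 0.

```python
import pandas as pd

# Toy statuses (invented), including a missing value.
loan_status = pd.Series(
    ['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off', None]
)

# True where the status is in the "great" list, then cast to 0/1.
great = loan_status.isin(['Current', 'Fully Paid']).astype(int)

print(great.tolist())
# [1, 1, 0, 0, 0]
```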

In [0]:
col_dates = [col for col in df if col.endswith('_d')]
for col in col_dates:
  df[col] = pd.to_datetime(df[col], infer_datetime_format = True)
  
df[col_dates].head()

In [0]:
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year

print(df['last_pymnt_d_month'].value_counts())
df['last_pymnt_d_year'].value_counts()
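
If some loans have no payment date, the month and year columns come back as floats with `NaN`. The sketch below, on invented dates and assuming the `'%b-%Y'` layout, shows one way to keep them as integers by casting to pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Toy payment dates (invented); None stands for a loan with no payment yet.
last_pymnt_d = pd.to_datetime(
    pd.Series(['Jun-2019', None, 'May-2019']), format='%b-%Y'
)

# With a missing date present, .dt.month yields floats and NaN.
month = last_pymnt_d.dt.month
year = last_pymnt_d.dt.year

# Nullable integer dtype keeps whole numbers and marks the gap as <NA>.
month_int = month.astype('Int64')
year_int = year.astype('Int64')

print(month_int)
```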

# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

## Lending Club Stretch

### Find and update columns with percent signs, convert to float, handle missing values

In [0]:
df.head()

In [0]:
df.shape

In [0]:
print(df.revol_util.isna().sum())
print(df.revol_util.value_counts())
df.revol_util.head()

In [0]:
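# Forward-fill missing values; assign the result back rather than relying on
# inplace=True through attribute access, which may not update df.
df['revol_util'] = df['revol_util'].ffill()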
df.revol_util.isna().sum()

In [0]:
def remove_perc(string):
  return float(string.strip('%'))

df['int_rate'] = df['int_rate'].apply(remove_perc)
df['revol_util'] = df['revol_util'].apply(remove_perc)

In [0]:
print(df.int_rate.head())
df.revol_util.head()
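
Forward fill copies the value from the previous row, which is an arbitrary choice when rows are unrelated loans. As an alternative sketch (not the approach used above), the conversion can keep missing values as `NaN` and leave the imputation decision for later. The Series below is invented toy data, and median imputation is shown only as one possible option.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `revol_util` (values invented), with one missing entry.
revol_util = pd.Series(['30.2%', np.nan, '0%', '101.5%'])

# Strip the percent sign and convert; missing entries stay NaN.
as_float = pd.to_numeric(revol_util.str.rstrip('%'), errors='coerce')

# Count what is missing before deciding how to handle it.
n_missing = as_float.isna().sum()

# One possible imputation: fill with the column median.
filled = as_float.fillna(as_float.median())

print(n_missing, filled.tolist())
```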

### Change `emp_title` to 'Other' if not in top 20

In [0]:
df.emp_title.value_counts()

In [0]:
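# Unfinished: consolidates similar job titles by substring match.

def replacement(col1, col2, string, out):
  # Where df[col2] contains `string` (case-insensitive), write `out` to df[col1];
  # otherwise copy the df[col2] value. Missing titles never match (na=False).
  mask = df[col2].str.contains(string, case=False, na=False)
  df[col1] = np.where(mask, out, df[col2])

  return df[col1]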

In [114]:
replacement('emp_title2','emp_title','manager','Manager')
df.emp_title.value_counts()

Teacher                                     2037
Manager                                     1626
Registered Nurse                             898
Driver                                       857
Supervisor                                   655
RN                                           623
Sales                                        586
Office Manager                               574
Project Manager                              540
General Manager                              486
Owner                                        449
Director                                     374
Operations Manager                           313
Engineer                                     309
Truck Driver                                 308
Sales Manager                                288
Nurse                                        281
Administrative Assistant                     267
Supervisor                                   260
Accountant                                   259
Server              

In [0]:
df.emp_title2.value_counts()

In [0]:
replacement('emp_title2','emp_title2','mgr','Manager')
replacement('emp_title2','emp_title2','rn','RN')
replacement('emp_title2','emp_title2','registered nurse','RN')
replacement('emp_title2','emp_title2','nurse','Nurse')
replacement('emp_title2','emp_title2','vp','Vice President')
replacement('emp_title2','emp_title2','teacher','Educator')
replacement('emp_title2','emp_title2','educator','Educator')
replacement('emp_title2','emp_title2','Vice President', 'Vice President')
replacement('emp_title2','emp_title2','driver','Driver')
replacement('emp_title2','emp_title2','supervisor','Supervisor')
replacement('emp_title2','emp_title2','owner','Owner')
replacement('emp_title2','emp_title2','admin','Administrative Support')
replacement('emp_title2','emp_title2','operator','Machine Operator')
replacement('emp_title2','emp_title2','ceo','CEO')
replacement('emp_title2','emp_title2','underwriter','Underwriter')
replacement('emp_title2','emp_title2','director','Director')
replacement('emp_title2','emp_title2','tech','Technician')
replacement('emp_title2','emp_title2','sales','Sales')
replacement('emp_title2','emp_title2','analyst','Analyst')
replacement('emp_title2','emp_title2','specialist','Specialist')
replacement('emp_title2','emp_title2','labor','Laborer')
replacement('emp_title2','emp_title2','maint','Maintenance')
replacement('emp_title2','emp_title2','assistant','Assistant')
replacement('emp_title2','emp_title2','qa','Quality Assurance')
replacement('emp_title2','emp_title2','quality','Quality Assurance')
replacement('emp_title2','emp_title2','data','Data Scientist')
replacement('emp_title2','emp_title2','coor','Coordinator')
replacement('emp_title2','emp_title2','mech','Mechanic')
replacement('emp_title2','emp_title2','sched','Scheduler')
replacement('emp_title2','emp_title2','secu','Security')
replacement('emp_title2','emp_title2','physician','Physician')
replacement('emp_title2','emp_title2','oper','Operations')
replacement('emp_title2','emp_title2','customer','Customer Service')
replacement('emp_title2','emp_title2','publish','Publisher')
replacement('emp_title2','emp_title2','consul','Consultant')
replacement('emp_title2','emp_title2','assoc','Associate');
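
One caveat with short patterns such as `'rn'` or `'vp'` is that a plain substring match also fires inside longer words. The sketch below, on invented titles, contrasts a substring match with a whole-word match using the regex word boundary `\b`. It is an illustration of the pitfall, not a change to the rules above.

```python
import pandas as pd

# Toy titles (invented): two contain "rn" only inside a longer word.
titles = pd.Series(['RN', 'Attorney', 'Intern', 'Rn Supervisor'])

# Plain substring match: also hits "Attorney" and "Intern".
substring = titles.str.contains('rn', case=False)

# Whole-word match: \b requires a word boundary on both sides.
whole_word = titles.str.contains(r'\brn\b', case=False)

print(substring.tolist())
print(whole_word.tolist())
```

Rule order matters too, since each call overwrites the result of earlier ones. For example, a title mapped to 'Machine Operator' by the `'operator'` rule is later matched by `'oper'` and becomes 'Operations'.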

In [0]:
top_twenty = df.emp_title2.value_counts().index.tolist()[:20]

In [0]:
df['emp_title2'] = np.where(df['emp_title2'].isin(top_twenty), df['emp_title2'],'Other')

In [118]:
df['emp_title2'].value_counts()

Other                     56390
Manager                   16228
Technician                 4732
Director                   4011
Educator                   3279
RN                         3207
Supervisor                 3107
Driver                     3029
Specialist                 2739
Sales                      2424
Operations                 2397
Analyst                    2291
Administrative Support     2243
Assistant                  2157
Coordinator                1671
Vice President             1365
Nurse                      1020
Associate                   930
Owner                       887
Consultant                  812
Mechanic                    756
Name: emp_title2, dtype: int64
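
The top-N step can be sketched on a small invented Series using `Series.where`, which keeps values where the condition holds and substitutes `'Other'` elsewhere. The names `titles` and `top_n` are illustrative only; with the real data `top_n` would be 20.

```python
import pandas as pd

# Toy titles (invented for illustration).
titles = pd.Series(
    ['Teacher', 'Manager', 'Teacher', 'Driver', 'Manager', 'Teacher', 'Chef']
)

top_n = 2  # would be 20 for the assignment

# Most frequent titles, taken from the value counts.
top_titles = titles.value_counts().nlargest(top_n).index

# Keep titles in the top list; replace everything else with 'Other'.
collapsed = titles.where(titles.isin(top_titles), 'Other')

print(collapsed.value_counts())
```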

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01