This is the final preprocessing notebook before modeling. I'll start by comparing model performance under various encoding strategies for the categorical variables. I also need to explore feature reduction: I created multiple features in the previous notebook but am not yet sure whether they'll be valuable for prediction. Finally, it will be helpful to balance the data, because the target variable 'reordered' is currently very unbalanced. Along the way, I can begin exploring various models.

Notebook on which this one builds: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/preprocessing2_feature_engineering.ipynb

One thing to look out for in this notebook: if the computer handles modeling on a dataset of this size without trouble, I could go back to the preprocessing1 notebook, pull additional increments of rows from the full original dataset, concatenate them, re-run all the feature engineering steps on the larger dataset, and come back here to model with more rows (from twice as many up to perhaps 10 times as many).

In [1]:
import pandas as pd
import numpy as np
import os
from library.sb_utils import save_file

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

from IPython.display import Audio
sound_file = './alert.wav'

In [2]:
df = pd.read_csv('../data/processed/features_engineered.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218232 entries, 0 to 218231
Data columns (total 27 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   order_id                 218232 non-null  int64  
 1   user_id                  218232 non-null  int64  
 2   order_by_user_sequence   218232 non-null  int64  
 3   days_since_prior_order   218232 non-null  float64
 4   add_to_cart_sequence     218232 non-null  int64  
 5   reordered                218232 non-null  int64  
 6   product_name             218232 non-null  object 
 7   aisle_name               218232 non-null  object 
 8   dept_name                218232 non-null  object 
 9   prior_purchases          218232 non-null  int64  
 10  purchased_percent_prior  218232 non-null  float64
 11  free                     218232 non-null  int64  
 12  fresh                    218232 non-null  int64  
 13  mix                      218232 non-null  int64  
 14  natu

In [3]:
# order_id is redundant: each order is already identified by the combination of
# user_id and order_by_user_sequence. Drop it.
df = df.drop(columns='order_id')
df.columns

Index(['user_id', 'order_by_user_sequence', 'days_since_prior_order',
       'add_to_cart_sequence', 'reordered', 'product_name', 'aisle_name',
       'dept_name', 'prior_purchases', 'purchased_percent_prior', 'free',
       'fresh', 'mix', 'natural', 'organic', 'original', 'sweet', 'white',
       'whole', 'rice', 'fruit', 'gluten', 'dow_sin', 'dow_cos', 'hour_sin',
       'hour_cos'],
      dtype='object')
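
The redundancy claim in the comment above could be checked before the drop with something like the following. This is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions, not rows from the real data); it tests that each (user_id, order_by_user_sequence) pair maps to exactly one order_id and vice versa.

```python
import pandas as pd

# Toy stand-in for the real frame; values are invented for illustration.
toy = pd.DataFrame({
    "order_id": [10, 10, 11, 12, 12],
    "user_id": [1, 1, 1, 2, 2],
    "order_by_user_sequence": [1, 1, 2, 1, 1],
})

# Number of distinct order_ids for each (user, sequence) pair.
ids_per_pair = toy.groupby(["user_id", "order_by_user_sequence"])["order_id"].nunique()

# Number of distinct (user, sequence) pairs for each order_id.
pairs_per_id = (
    toy.drop_duplicates(["order_id", "user_id", "order_by_user_sequence"])
    .groupby("order_id")
    .size()
)

# The column carries no extra information if the mapping is one-to-one.
order_id_is_redundant = bool((ids_per_pair == 1).all() and (pairs_per_id == 1).all())
print(order_id_is_redundant)
```

On the real data, the same two groupby checks would be run on `df` in place of `toy`, before `order_id` is dropped.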

I'd like to try multiple encoders for categorical data. Here's my current understanding of encoders that could make sense for this data:
- One-Hot could work for the dept_name column because it has only 19 categories, far fewer than the other categorical columns. The others have too many categories for one-hot to be practical. 
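- Hashing works with high-cardinality variables but isn't reversible and can lead to some (usually minimal, as far as I've read) information loss through collisions. Because the hash is computed from the category value alone, without reference to the target or to other rows, it shouldn't introduce leakage across rows. 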
- My understanding is that binary encoding offers the best of both worlds from one-hot and hashing: fewer resulting columns than one-hot, but still interpretable (each category maps to a unique bit pattern) and with no information loss, unlike hashing. 
- My understanding is that Bayesian (target-based) encoders use the dependent variable directly, which risks contamination (target leakage), but they are still widely regarded as effective. As far as I can tell, the leakage is limited not by anything time-related but by fitting the encoder on training data only and by shrinking each category's target statistic toward the overall mean (the prior), so that rare categories don't simply memorize their own labels. 
- I read that LeaveOneOut is a Bayesian encoder that reduces leakage relative to the Target encoder: when encoding a training row, it computes the category's target mean with that row's own target value left out. I also read that LeaveOneOut is especially good for classification tasks, so it's a good one to consider here.
- I know very little about the Weight of Evidence or James-Stein encoders, but they're Bayesian encoders recommended by Springboard. 
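
To make the difference between plain target encoding and leave-one-out concrete, here is a small hand-rolled sketch in pandas. It is not the implementation of any particular encoder library; the toy values and the fallback to the global mean for single-row categories are assumptions made for illustration.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "aisle_name": ["yogurt", "yogurt", "yogurt", "chips", "chips"],
    "reordered": [1, 1, 0, 0, 1],
})

grouped = toy.groupby("aisle_name")["reordered"]
cat_sum = grouped.transform("sum")      # per-row: sum of labels in the row's category
cat_count = grouped.transform("count")  # per-row: number of rows in the row's category

# Plain target (mean) encoding: each row's own label is part of its encoded value.
toy["target_mean"] = cat_sum / cat_count

# Leave-one-out: subtract the row's own label before averaging.
# A category seen only once has no other rows, so fall back to the global mean
# (an assumed choice for this sketch).
global_mean = toy["reordered"].mean()
loo = (cat_sum - toy["reordered"]) / (cat_count - 1)
toy["leave_one_out"] = loo.where(cat_count > 1, global_mean)

print(toy)
```

In the plain version every yogurt row gets the same value, which includes that row's own label. In the leave-one-out version the encoded value for a row depends only on the labels of the other rows in its category, which is where the reduced leakage comes from. This only applies to training rows; rows without a known label would need the ordinary per-category mean learned from training data.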

I'd like any encoder(s) I use to be included in an eventual modeling pipeline, but first I want to explore and try them out individually to see better how they would each work with the data. 
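
As a rough picture of what including an encoder in a modeling pipeline could look like, here is a minimal scikit-learn sketch. The toy rows, the choice of one-hot for dept_name, and the logistic regression model are all placeholder assumptions, not decisions made in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows with invented values; column names follow the real frame.
toy = pd.DataFrame({
    "dept_name": ["produce", "dairy eggs", "snacks", "produce", "dairy eggs", "snacks"],
    "prior_purchases": [5, 0, 2, 7, 1, 0],
    "reordered": [1, 0, 1, 1, 0, 0],
})
X = toy[["dept_name", "prior_purchases"]]
y = toy["reordered"]

# Encode dept_name, pass the numeric column through unchanged.
encode = ColumnTransformer(
    transformers=[("dept", OneHotEncoder(handle_unknown="ignore"), ["dept_name"])],
    remainder="passthrough",
)

# The encoder is fit inside the pipeline, so it only ever sees the data passed to fit().
pipe = Pipeline([
    ("encode", encode),
    ("model", LogisticRegression(max_iter=1000)),  # placeholder model
])
pipe.fit(X, y)
preds = pipe.predict(X)
print(preds)
```

Keeping the encoder inside the pipeline means that, under cross-validation, it is re-fit on each training fold, which matters most for the target-based encoders discussed above.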