# <font color='violet'>Modeling</font>

This notebook builds on exploration I did with various encoding, class balancing, and dimension reduction strategies in the preprocessing stage: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/7-kl-preprocess-encoding.ipynb

Previously, I tuned encoders and selected the Target Encoder with default hyperparameters. Here, I'll use a random grid search to find the best classifier and its best hyperparameters. I'm using random search not only because it's faster, but also because I've read that its results can be as effective as those of a regular grid search if it's repeated several times. A random search should also be fast enough to let me work with more chunks of the dataset and compare results across subsets.

In [31]:
import pandas as pd
import numpy as np
import os
from library.sb_utils import save_file

import matplotlib.pyplot as plt
import seaborn as sns

import category_encoders as ce

import sklearn
from sklearn import svm, neighbors, ensemble, model_selection, preprocessing
from sklearn.pipeline import Pipeline

import warnings
import random

from IPython.display import Audio
sound_file = './alert.wav'

In [2]:
df = pd.read_csv('../data/processed/for_modeling.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2170652 entries, 0 to 2170651
Data columns (total 24 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   user_id                  int64  
 1   order_by_user_sequence   int64  
 2   order_dow                int64  
 3   order_hour_of_day        int64  
 4   days_since_prior_order   float64
 5   add_to_cart_sequence     int64  
 6   reordered                int64  
 7   product_name             object 
 8   aisle_name               object 
 9   dept_name                object 
 10  prior_purchases          int64  
 11  purchased_percent_prior  float64
 12  apple                    int64  
 13  bar                      int64  
 14  cream                    int64  
 15  free                     int64  
 16  fresh                    int64  
 17  green                    int64  
 18  mix                      int64  
 19  natural                  int64  
 20  organic                  int64  
 21  original

<font color='violet'>Encode categorical columns based on encoder selected during preprocessing</font>

In [4]:
categorical_columns = ['user_id', 'product_name', 'aisle_name', 'dept_name']

X = df.drop(columns=['reordered', 'add_to_cart_sequence'])
y = df['reordered']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)

target = ce.target_encoder.TargetEncoder(cols=categorical_columns)
target.fit(X_train, y_train)
X_train = target.transform(X_train)
X_test = target.transform(X_test)

X_train.head(2)



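| | user_id | order_by_user_sequence | order_dow | order_hour_of_day | days_since_prior_order | product_name | aisle_name | dept_name | prior_purchases | purchased_percent_prior | ... | cream | free | fresh | green | mix | natural | organic | original | sweet | white |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 142510 | 0.114031 | 23 | 3 | 6 | 1.0 | 0.074561 | 0.086629 | 0.08514 | 7 | 0.304348 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1056355 | 0.059568 | 5 | 0 | 17 | 4.0 | 0.135348 | 0.087617 | 0.112248 | 2 | 0.4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |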


In [6]:
print(X_train['product_name'].min())
print(X_train['product_name'].max())
print(X_train['user_id'].min())
print(X_train['user_id'].max())

0.0
0.9996949920450138
0.0
0.997751097542715


When I was previewing encoder performance earlier, I used StandardScaler. I have since learned that MinMaxScaler is a better choice when I can't assume that my columns are normally distributed. EDA showed that the variables are in fact not normally distributed, so I'll use MinMaxScaler here.

My current understanding is that since Target Encoder returns values between 0 and 1, MinMaxScaler won't distort those values much. Strictly speaking, it leaves a column unchanged only when that column's minimum is exactly 0 and its maximum is exactly 1. If that turns out to matter, I'll need to make sure to only use MinMaxScaler on the ordinal columns.

Finally, I also learned that the Random Forest classifier doesn't require normalization, because tree splits depend only on the ordering of values within each feature. If I land on Random Forest as I suspect I will, I can try re-running the model without normalization to see if that improves performance.
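If it does turn out that only the ordinal columns should be scaled, one option is scikit-learn's `ColumnTransformer`. The sketch below is only an illustration under assumptions: the toy frame, its values, and the choice of which columns count as "encoded" versus "ordinal" are made up rather than taken from the real `X_train`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for X_train after target encoding (values are made up).
toy = pd.DataFrame({
    "aisle_name": [0.08, 0.09, 0.10, 0.12],   # target-encoded, narrow range
    "order_dow": [0, 3, 6, 2],                # ordinal
    "order_hour_of_day": [6, 17, 23, 0],      # ordinal
})

encoded_cols = ["aisle_name"]                 # assumed: leave these untouched
ordinal_cols = ["order_dow", "order_hour_of_day"]

# Scale only the ordinal columns; pass every other column through unchanged.
selective_scaler = ColumnTransformer(
    transformers=[("minmax", MinMaxScaler(), ordinal_cols)],
    remainder="passthrough",
)
scaled = selective_scaler.fit_transform(toy)

# ColumnTransformer outputs the transformed columns first, then the passthrough ones.
scaled_df = pd.DataFrame(scaled, columns=ordinal_cols + encoded_cols, index=toy.index)
print(scaled_df)
```

Note that the output column order differs from the input order, so any later code that indexes columns by position would need to account for that.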

In [8]:
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled[0]

array([0.11428763, 0.22222222, 0.5       , 0.26086957, 0.06451613,
       0.07458415, 0.40302193, 0.55894116, 0.08860759, 0.31111111,
       1.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])

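The user_id and product_name values are nearly identical to the Target Encoder output above (0.114031 → 0.11428763 and 0.074561 → 0.07458415), since those columns already span almost the full 0 to 1 range. The aisle_name and dept_name values did change (0.086629 → 0.40302193 and 0.08514 → 0.55894116), presumably because their encoded values cover a narrower range. MinMaxScaler is a linear rescaling within each column, so the ordering and relative spacing of values are preserved, and it's fine to move forward.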
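Rather than comparing a single row by eye, one way to check how much scaling moved each column is to compute the largest absolute change per column. This is a minimal sketch on made-up data; `max_shift_per_column` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def max_shift_per_column(before: pd.DataFrame, after: np.ndarray) -> pd.Series:
    """Largest absolute change per column between unscaled and scaled values."""
    diff = np.abs(before.to_numpy(dtype=float) - after)
    return pd.Series(diff.max(axis=0), index=before.columns)

# Made-up encoded values: one column spans all of [0, 1], the other a narrow band.
encoded = pd.DataFrame({
    "wide_range": [0.0, 0.25, 0.5, 1.0],
    "narrow_range": [0.05, 0.08, 0.10, 0.15],
})
scaled = MinMaxScaler().fit_transform(encoded)

shift = max_shift_per_column(encoded, scaled)
print(shift)  # wide_range is unchanged; narrow_range is stretched to fill [0, 1]
```

With the notebook's own variables, the equivalent call would be `max_shift_per_column(X_train, X_train_scaled)[categorical_columns]`, assuming `X_train_scaled` keeps the column order of `X_train`.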

<font color='violet'>Use random grid search to begin to select a model and hyperparameters</font>

Even though the randomized grid search is faster for each iteration than regular GridSearch, it's still too slow to do with the entire dataset. Make classifier selections using a smaller chunk of the dataset, then do final modeling on the full df. 

In [9]:
len(df['user_id'].unique())

2060

In [59]:
all_users = set(df['user_id'].unique())
users_to_search = random.sample(list(all_users), 206)
df_to_search = df.loc[df['user_id'].isin(users_to_search), :].copy()
df_to_search.shape

(266114, 24)

In [60]:
# Remake train, test set, re-do encoding & normalization
X_to_search = df_to_search.drop(columns=['reordered', 'add_to_cart_sequence'])
y_to_search = df_to_search['reordered']
X_search_train, X_search_test, y_search_train, y_search_test = model_selection.train_test_split(
    X_to_search, y_to_search, test_size=0.3)

target = ce.target_encoder.TargetEncoder(cols=categorical_columns)
target.fit(X_search_train, y_search_train)
X_search_train = target.transform(X_search_train)
X_search_test = target.transform(X_search_test)

scaler = preprocessing.MinMaxScaler()
scaler.fit(X_search_train)
X_search_train_scaled = scaler.transform(X_search_train)
X_search_test_scaled = scaler.transform(X_search_test)

X_search_train_scaled.shape



(186279, 22)

In [61]:
#warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
# Initialize classifiers and dictionary of parameter options for each 

clf1 = neighbors.KNeighborsClassifier()
param1 = dict(clf=(clf1,), clf__n_neighbors=list(np.arange(3,22,2)), 
              clf__weights=['uniform','distance'],
              clf__leaf_size=list(np.arange(10,101,10)), 
              clf__p=[1,2], clf__metric=['euclidean','chebyshev','minkowski'])

clf2 = svm.SVC(random_state=43)
# neg_log_loss scoring needs predict_proba, which SVC only provides when
# probability=True, so probability=False is left out of the search space.
param2 = dict(clf=(clf2,), clf__C=list(np.arange(1,11)), 
              clf__kernel=['linear', 'poly', 'rbf', 'sigmoid'],
              clf__degree=list(np.arange(1,11)), clf__gamma=['scale', 'auto'], 
              clf__coef0=list(np.arange(0,4,0.5)), clf__shrinking=[True,False], 
              clf__probability=[True], clf__class_weight=[None,'balanced'])

clf3 = ensemble.RandomForestClassifier(random_state=43)
param3 = dict(clf=(clf3,), clf__n_estimators=list(np.arange(100,201,10)), 
              clf__criterion=['gini', 'entropy'], 
              clf__max_depth=list(np.arange(2,21)), 
              clf__min_samples_split=[2,3,4,5], clf__min_samples_leaf=[1,2,3,4,5], 
              clf__min_weight_fraction_leaf=[0,0.1,0.2,0.3,0.4,0.5], 
              clf__max_features=['sqrt', 'log2', None], 
              clf__oob_score=[True,False],
              clf__class_weight=['balanced', 'balanced_subsample'], 
              clf__ccp_alpha=list(np.arange(0,1.1,0.1)))

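clf4 = ensemble.BaggingClassifier(random_state=43)
# Note: integer max_samples / max_features are absolute counts of rows / columns
# drawn per estimator (here 1 to 10), not fractions of the training set.
param4 = dict(clf=(clf4,), clf__n_estimators=list(np.arange(2,21,2)), 
              clf__max_samples=list(np.arange(1,11)), 
              clf__max_features=list(np.arange(1,11)), 
              clf__oob_score=[True,False])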

clf5 = ensemble.GradientBoostingClassifier(random_state=43)
# Note: loss='deviance' and max_features='auto' were deprecated in scikit-learn 1.1
# and removed in 1.3; on newer versions use 'log_loss' and drop 'auto'.
param5 = dict(clf=(clf5,), clf__loss=['deviance', 'exponential'], 
              clf__learning_rate=list(np.arange(0.1,3.1,0.2)), 
              clf__n_estimators=list(np.arange(100,201,10)), 
              clf__criterion=['friedman_mse', 'squared_error'], 
              clf__min_samples_split=list(np.arange(2,11)), 
              clf__min_samples_leaf=list(np.arange(1,11)),
              clf__min_weight_fraction_leaf=list(np.arange(0,0.6,0.1)), 
              clf__max_depth=list(np.arange(2,11)), 
              clf__min_impurity_decrease=list(np.arange(0,3.1,0.2)), 
              clf__max_features=['auto', 'sqrt', 'log2'])

clf6 = ensemble.AdaBoostClassifier()
# Note: algorithm='SAMME.R' is deprecated in newer scikit-learn releases.
param6 = dict(clf=(clf6,), clf__n_estimators=list(np.arange(10,101,10)), 
              clf__learning_rate=list(np.arange(0.1,3.1,0.2)), 
              clf__algorithm=['SAMME', 'SAMME.R'])

# Create pipeline and list of all parameters

pipeline = Pipeline([('clf', clf1)])
params = [param1, param2, param3, param4, param5, param6]

# Find the best classifier & its best parameters:
rs = model_selection.RandomizedSearchCV(estimator=pipeline, param_distributions=params, 
                                        scoring='neg_log_loss',error_score='raise')
rs.fit(X_search_train_scaled, y_search_train)
print(rs.best_params_)
print(rs.best_score_)

The first iteration of the random search resulted in the following best parameters:

GradientBoosting: log loss score of 0.2
params: 'clf__n_estimators': 180, 'clf__min_weight_fraction_leaf': 0.1, 'clf__min_samples_split': 10, 'clf__min_samples_leaf': 7, 'clf__min_impurity_decrease': 0.4, 'clf__max_features': 'log2', 'clf__max_depth': 5, 'clf__loss': 'exponential', 'clf__learning_rate': 0.30000000000000004, 'clf__criterion': 'friedman_mse'
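Since `best_params_` only reports the single winner, it can help to look at how every sampled candidate scored, grouped by classifier. The sketch below is a small stand-in under assumptions: it uses synthetic data from `make_classification` and two deliberately tiny, made-up search spaces (`toy_params`), not the real grids above. The `neg_log_loss` scorer returns the negated log loss, so the sign is flipped back before summarizing.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; the notebook would use X_search_train_scaled, y_search_train.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(random_state=43)
gb = GradientBoostingClassifier(random_state=43)
toy_params = [
    dict(clf=[rf], clf__n_estimators=[10, 20], clf__max_depth=[2, 4]),
    dict(clf=[gb], clf__n_estimators=[10, 20], clf__learning_rate=[0.1, 0.3]),
]

toy_search = RandomizedSearchCV(
    estimator=Pipeline([("clf", rf)]),
    param_distributions=toy_params,
    n_iter=6,
    scoring="neg_log_loss",
    cv=3,
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

cv_results = toy_search.cv_results_
results = pd.DataFrame({
    # Name of the classifier each sampled candidate used.
    "classifier": [type(p["clf"]).__name__ for p in cv_results["params"]],
    # Flip the sign so the column reads as ordinary log loss (lower is better).
    "mean_log_loss": -cv_results["mean_test_score"],
})

# Best (lowest) cross-validated log loss seen for each classifier.
summary = results.groupby("classifier")["mean_log_loss"].min().sort_values()
print(summary)
```

The same `cv_results_` summary applied to `rs` would show which classifiers were actually sampled in a given run, which is useful context when only ten candidates are drawn by default.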

<font color='violet'>Tune random grid search to see if a better classifier is identified.</font>

For now, I can remove GradientBoosting from my list of classifiers and re-run a few times to see if any other model comes up with a better log loss score. Then, return to do further hyperparameter tuning with just the best classifier(s). 

I ran the cell below 5 times and got the following results:
- RandomForest: 0.58


In [None]:
clf1 = neighbors.KNeighborsClassifier()
param1 = dict(clf=(clf1,), clf__n_neighbors=list(np.arange(3,22,2)), 
              clf__weights=['uniform','distance'],
              clf__leaf_size=list(np.arange(10,101,10)), 
              clf__p=[1,2], clf__metric=['euclidean','chebyshev','minkowski'])

clf2 = svm.SVC(random_state=43)
# neg_log_loss scoring needs predict_proba, which SVC only provides when
# probability=True, so probability=False is left out of the search space.
param2 = dict(clf=(clf2,), clf__C=list(np.arange(1,11)), 
              clf__kernel=['linear', 'poly', 'rbf', 'sigmoid'],
              clf__degree=list(np.arange(1,11)), clf__gamma=['scale', 'auto'], 
              clf__coef0=list(np.arange(0,4,0.5)), clf__shrinking=[True,False], 
              clf__probability=[True], clf__class_weight=[None,'balanced'])

clf3 = ensemble.RandomForestClassifier(random_state=43)
param3 = dict(clf=(clf3,), clf__n_estimators=list(np.arange(100,201,10)), 
              clf__criterion=['gini', 'entropy'], 
              clf__max_depth=list(np.arange(2,21)), 
              clf__min_samples_split=[2,3,4,5], clf__min_samples_leaf=[1,2,3,4,5], 
              clf__min_weight_fraction_leaf=[0,0.1,0.2,0.3,0.4,0.5], 
              clf__max_features=['sqrt', 'log2', None], 
              clf__oob_score=[True,False],
              clf__class_weight=['balanced', 'balanced_subsample'], 
              clf__ccp_alpha=list(np.arange(0,1.1,0.1)))

clf4 = ensemble.BaggingClassifier(random_state=43)
param4 = dict(clf=(clf4,), clf__n_estimators=list(np.arange(2,21,2)), 
              clf__max_samples=list(np.arange(1,11)), 
              clf__max_features=list(np.arange(1,11)), 
              clf__oob_score=[True,False])

clf5 = ensemble.AdaBoostClassifier()
param5 = dict(clf=(clf5,), clf__n_estimators=list(np.arange(10,101,10)), 
              clf__learning_rate=list(np.arange(0.1,3.1,0.2)), 
              clf__algorithm=['SAMME', 'SAMME.R'])

pipeline = Pipeline([('clf', clf1)])
params = [param1, param2, param3, param4, param5]

rs = model_selection.RandomizedSearchCV(estimator=pipeline, param_distributions=params, 
                                        scoring='neg_log_loss',error_score='raise')
rs.fit(X_search_train_scaled, y_search_train)
print(rs.best_params_)
print(rs.best_score_)

In [None]:
Audio(sound_file, autoplay=True)

<font color='violet'>Tune hyperparameters of top classifier</font>

<font color='violet'>Future Options</font>

Try using add_to_cart_sequence as the dependent variable. Instead of predicting exactly whether each item will be reordered, see how well I can predict the items that get placed in the cart early: if I predict the first 5 items to be reordered (placed in the cart early this time), does at least one of them actually end up getting reordered?
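One way this "at least one of the top 5" question could be scored is a hit rate at k. The sketch below is an assumption about how that metric might be set up, not something already implemented in this project: `hit_rate_at_k`, the `score` column, and the toy data are all hypothetical, and grouping by `user_id` (rather than by individual order) is a simplification.

```python
import pandas as pd

def hit_rate_at_k(scored: pd.DataFrame, k: int = 5) -> float:
    """Share of users for whom at least one of the k highest-scored items was reordered.

    Expects columns 'user_id', 'score' (model output, higher means more likely
    to be placed in the cart early) and 'reordered' (0/1). These names are assumptions.
    """
    # Keep each user's k highest-scored rows.
    top_k = (
        scored.sort_values("score", ascending=False)
        .groupby("user_id")
        .head(k)
    )
    # A user counts as a hit if any of those rows was actually reordered.
    hits = top_k.groupby("user_id")["reordered"].max()
    return float(hits.mean())

# Made-up predictions for two users.
toy_scores = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2, 2],
    "score":     [0.9, 0.2, 0.1, 0.8, 0.7, 0.3],
    "reordered": [0, 1, 0, 0, 0, 1],
})

rate_at_1 = hit_rate_at_k(toy_scores, k=1)
rate_at_2 = hit_rate_at_k(toy_scores, k=2)
rate_at_3 = hit_rate_at_k(toy_scores, k=3)
print(rate_at_1, rate_at_2, rate_at_3)
```

In the toy data, neither user's single top item was reordered, one user gets a hit within the top 2, and both get a hit within the top 3.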