# <font color='violet'>Modeling</font>

This notebook builds on exploration I did with various encoding, class balancing, and dimension reduction strategies in the preprocessing stage: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/7-kl-preprocess-encoding.ipynb

Previously, I tuned encoders and selected the Target Encoder with default hyperparameters. Here, I'll use a randomized search to find the best classifier and its best parameters. I'm using a randomized search not only because it's faster, but also because I've read that its results are about as effective as a regular grid search when it is repeated several times. The speed should also let me work with more chunks of the dataset and compare results across subsets.

In [1]:
import pandas as pd
import numpy as np
import os
from library.sb_utils import save_file

import matplotlib.pyplot as plt
import seaborn as sns

import category_encoders as ce

import sklearn
from sklearn import svm, neighbors, ensemble, model_selection, preprocessing, metrics
from sklearn.pipeline import Pipeline

import warnings
import random

from IPython.display import Audio
sound_file = './alert.wav'

In [2]:
df = pd.read_csv('../data/processed/for_modeling.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2170652 entries, 0 to 2170651
Data columns (total 26 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   user_id                       int64  
 1   order_by_user_sequence        int64  
 2   order_dow                     int64  
 3   order_hour_of_day             int64  
 4   days_since_prior_order        float64
 5   add_to_cart_sequence          int64  
 6   reordered                     int64  
 7   product_name                  object 
 8   aisle_name                    object 
 9   dept_name                     object 
 10  prior_purchases               int64  
 11  purchased_percent_prior       float64
 12  apple                         int64  
 13  bar                           int64  
 14  cream                         int64  
 15  free                          int64  
 16  fresh                         int64  
 17  green                         int64  
 18  mix                   

<font color='violet'>Encode categorical columns based on encoder selected during preprocessing. Normalize ordinal columns</font>

Previously, when I was previewing the performance of encoders, I used StandardScaler afterward to normalize numerical columns. I have since learned that MinMaxScaler is a better choice when columns are not known to be normally distributed, and EDA showed that these variables are in fact not normally distributed.

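My current understanding is that since Target Encoder returns values between 0 and 1 for a binary target, MinMaxScaler won't distort those values: at most it stretches each encoded column linearly so that its smallest value becomes 0 and its largest becomes 1. But if I'm wrong, then I'll need to make sure to only use MinMaxScaler on the ordinal columns.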
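If MinMaxScaler should only touch the ordinal columns, one option is scikit-learn's ColumnTransformer, which scales a named subset and passes the remaining columns through unchanged. The following is a minimal sketch on a toy frame: the values are invented, and only the column names are borrowed from this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the encoded training data (values are invented).
toy = pd.DataFrame({
    'user_id': [0.04, 0.10, 0.25],      # pretend this is already target-encoded
    'order_dow': [0, 3, 6],             # ordinal
    'order_hour_of_day': [0, 12, 23],   # ordinal
})
ordinal_columns = ['order_dow', 'order_hour_of_day']

# Scale only the ordinal columns; leave everything else untouched.
scale_ordinals = ColumnTransformer(
    transformers=[('minmax', MinMaxScaler(), ordinal_columns)],
    remainder='passthrough',
)
scaled = scale_ordinals.fit_transform(toy)

# Transformed columns come first in the output, passthrough columns last.
scaled_df = pd.DataFrame(scaled, columns=ordinal_columns + ['user_id'])
print(scaled_df)
```

In this sketch the `user_id` values come out exactly as they went in, while the two ordinal columns are mapped onto the 0 to 1 range.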

In [3]:
categorical_columns = ['user_id', 'product_name', 'aisle_name', 'dept_name']

X = df.drop(columns=['reordered', 'add_to_cart_sequence'])
y = df['reordered']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)

target = ce.target_encoder.TargetEncoder(cols=categorical_columns)
target.fit(X_train, y_train)
X_train = target.transform(X_train)
X_test = target.transform(X_test)

X_train.head(2)



Unnamed: 0,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_name,aisle_name,dept_name,prior_purchases,purchased_percent_prior,...,fresh,green,mix,natural,organic,original,sweet,white,purchased_early_past,percent_past_purchased_early
584036,0.043147,88,3,17,1.0,0.004515,0.026462,0.032175,1,0.011364,...,0,0,0,0,1,0,0,0,1,0.011364
694141,0.041899,16,2,9,4.0,0.095092,0.114362,0.106521,1,0.0625,...,0,0,0,0,1,0,0,0,1,0.0625


In [4]:
# Encoding worked. Scale/normalize. 
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled[0]

array([0.0479464 , 0.87878788, 0.5       , 0.73913043, 0.06451613,
       0.00454232, 0.08978224, 0.        , 0.01265823, 0.01161616,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 1.        , 0.        ,
       0.        , 0.        , 0.01369863, 0.01173021])

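Values for the encoded categorical columns have shifted (for example, user_id went from 0.043147 to 0.0479464) because MinMaxScaler maps each column's training minimum to 0 and its maximum to 1. The rescaling is linear, so the relative ordering and spacing of the encoded values are preserved, and it's fine to move forward.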

<font color='violet'>Use random grid search to select a model</font>

A randomized search is a faster way to tune models than a regular GridSearchCV, and speed is what I need. I learned that repeating the randomized search 3-5 times finds the best option about as well as searching over all combinations.
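The idea of repeating a randomized search and keeping the best result can be sketched as follows. This is only an illustration on synthetic data with a small decision tree so that it runs quickly; the data, the `param_options` grid, and the number of repeats are assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search space, kept tiny for speed.
param_options = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 4, 8],
    'criterion': ['gini', 'entropy'],
}

best_score, best_params = float('-inf'), None
for seed in range(3):  # repeat the search with a different seed each time
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_distributions=param_options,
        n_iter=5,                 # candidates sampled per repeat
        scoring='neg_log_loss',   # higher (closer to 0) is better
        cv=3,
        random_state=seed,
    )
    search.fit(X_demo, y_demo)
    if search.best_score_ > best_score:
        best_score, best_params = search.best_score_, search.best_params_

print(best_params, best_score)
```

Each call to RandomizedSearchCV samples `n_iter` candidates (10 by default), so several repeats with different seeds cover more of the space than a single call while still evaluating far fewer combinations than an exhaustive grid.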

In [6]:
# Initialize classifiers and dictionary of parameter options for each 

clf1 = neighbors.KNeighborsClassifier()
param1 = dict(clf=(clf1,), clf__n_neighbors=list(np.arange(3,22,2)), 
              clf__weights=['uniform','distance'],
              clf__leaf_size=list(np.arange(10,101,10)), 
              clf__p=[1,2], clf__metric=['euclidean','chebyshev','minkowski'])

# neg_log_loss scoring needs predict_proba, which SVC only provides when probability=True
clf2 = svm.SVC(random_state=43)
param2 = dict(clf=(clf2,), clf__C=list(np.arange(1,11)), 
              clf__kernel=['linear', 'poly', 'rbf', 'sigmoid'],
              clf__degree=list(np.arange(1,11)), clf__gamma=['scale', 'auto'], 
              clf__coef0=list(np.arange(0,4,0.5)), clf__shrinking=[True,False], 
              clf__probability=[True], clf__class_weight=[None,'balanced'])

clf3 = ensemble.RandomForestClassifier(random_state=43)
param3 = dict(clf=(clf3,), clf__n_estimators=list(np.arange(100,201,10)), 
              clf__criterion=['gini', 'entropy'], 
              clf__max_depth=list(np.arange(2,21)), 
              clf__min_samples_split=[2,3,4,5], clf__min_samples_leaf=[1,2,3,4,5], 
              clf__min_weight_fraction_leaf=[0,0.1,0.2,0.3,0.4,0.5], 
              clf__max_features=['sqrt', 'log2', None], 
              clf__oob_score=[True,False],
              clf__class_weight=['balanced', 'balanced_subsample'], 
              clf__ccp_alpha=list(np.arange(0,1.1,0.1)))

clf4 = ensemble.BaggingClassifier(random_state=43)
param4 = dict(clf=(clf4,), clf__n_estimators=list(np.arange(2,21,2)), 
              clf__max_samples=list(np.arange(1,11)), 
              clf__max_features=list(np.arange(1,11)), 
              clf__oob_score=[True,False])

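# Note: scikit-learn >= 1.3 no longer accepts loss='deviance' (now 'log_loss') or max_features='auto'
clf5 = ensemble.GradientBoostingClassifier(random_state=43)
param5 = dict(clf=(clf5,), clf__loss=['deviance', 'exponential'], 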
              clf__learning_rate=list(np.arange(0.1,3.1,0.2)), 
              clf__n_estimators=list(np.arange(100,201,10)), 
              clf__criterion=['friedman_mse', 'squared_error'], 
              clf__min_samples_split=list(np.arange(2,11)), 
              clf__min_samples_leaf=list(np.arange(1,11)),
              clf__min_weight_fraction_leaf=list(np.arange(0,0.6,0.1)), 
              clf__max_depth=list(np.arange(2,11)), 
              clf__min_impurity_decrease=list(np.arange(0,3.1,0.2)), 
              clf__max_features=['auto', 'sqrt', 'log2'])

clf6 = ensemble.AdaBoostClassifier()
param6 = dict(clf=(clf6,), clf__n_estimators=list(np.arange(10,101,10)), 
              clf__learning_rate=list(np.arange(0.1,3.1,0.2)), 
              clf__algorithm=['SAMME', 'SAMME.R'])

# Create pipeline and list of all parameters

pipeline = Pipeline([('clf', clf1)])
params = [param1, param2, param3, param4, param5, param6]

# Find the best classifier & its best parameters:
rs = model_selection.RandomizedSearchCV(estimator=pipeline, param_distributions=params, 
                                        scoring='neg_log_loss',error_score='raise')
rs.fit(X_train_scaled, y_train)
print(rs.best_params_)
print(rs.best_score_)

{'clf__n_estimators': 110, 'clf__min_weight_fraction_leaf': 0.2, 'clf__min_samples_split': 9, 'clf__min_samples_leaf': 8, 'clf__min_impurity_decrease': 1.0, 'clf__max_features': 'sqrt', 'clf__max_depth': 2, 'clf__loss': 'deviance', 'clf__learning_rate': 0.30000000000000004, 'clf__criterion': 'friedman_mse', 'clf': GradientBoostingClassifier(learning_rate=0.30000000000000004, max_depth=2,
                           max_features='sqrt', min_impurity_decrease=1.0,
                           min_samples_leaf=8, min_samples_split=9,
                           min_weight_fraction_leaf=0.2, n_estimators=110,
                           random_state=43)}
-0.2380404475183044


In all 5 runs, GradientBoosting came out ahead.

<font color='violet'>Tune hyperparameters of top classifier: GradientBoosting</font>

In [None]:
gb = ensemble.GradientBoostingClassifier(random_state=43)

params = dict(gb__loss=['deviance', 'exponential'], 
              gb__learning_rate=list(np.arange(0.1,3.1,0.2)), 
              gb__n_estimators=list(np.arange(100,201,10)), 
              gb__criterion=['friedman_mse', 'squared_error'], 
              gb__min_samples_split=list(np.arange(2,11)), 
              gb__min_samples_leaf=list(np.arange(1,11)),
              gb__min_weight_fraction_leaf=list(np.arange(0,0.6,0.1)), 
              gb__max_depth=list(np.arange(2,11)), 
              gb__min_impurity_decrease=list(np.arange(0,3.1,0.2)), 
              gb__max_features=['auto', 'sqrt', 'log2'])

pipeline = Pipeline([('gb', gb)])

rs = model_selection.RandomizedSearchCV(estimator=pipeline, param_distributions=params, 
                                        scoring='neg_log_loss',error_score='raise')
rs.fit(X_train_scaled, y_train)
print(rs.best_params_)
print(rs.best_score_)

In [None]:
Audio(sound_file, autoplay=True)

After running the above cell another 5 times, these were the best results:

max_features='sqrt', min_impurity_decrease=1.0, min_samples_leaf=8, min_samples_split=9, min_weight_fraction_leaf=0.2, n_estimators=110, max_depth=2, loss='deviance', learning_rate=0.3, criterion='friedman_mse'
Log Loss: 0.2380404475183044

<font color='violet'>Final Modeling</font>
 
Now that I've selected the GradientBoosting model and its best hyperparameters, I can use the model to make predictions of reorders and get a final evaluation of my model.   

In [None]:
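# Refit with the best hyperparameters found above; loss stays at its default (log loss / deviance)
gb = ensemble.GradientBoostingClassifier(
    random_state=43, max_features='sqrt', min_impurity_decrease=1.0,
    min_samples_leaf=8, min_samples_split=9, min_weight_fraction_leaf=0.2,
    n_estimators=110, max_depth=2, learning_rate=0.3, criterion='friedman_mse')
gb.fit(X_train_scaled, y_train)
y_pred = gb.predict(X_test_scaled)
# log_loss and roc_auc are defined on predicted probabilities, not hard labels
y_proba = gb.predict_proba(X_test_scaled)[:, 1]

print('Final log_loss score: ', metrics.log_loss(y_test, y_proba))
print('Final roc_auc score: ', metrics.roc_auc_score(y_test, y_proba))
print('Final f1 score: ', metrics.f1_score(y_test, y_pred))
cm = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()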
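Log loss and ROC AUC are both computed from predicted probabilities, so passing hard 0/1 labels to them gives a different and usually less informative number. The sketch below illustrates the gap on a handful of invented values; `y_true` and `y_proba_demo` are made up for the illustration and are not results from this model.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1, 0, 1])
# Hypothetical predicted probabilities of the positive class.
y_proba_demo = np.array([0.1, 0.4, 0.8, 0.6, 0.3, 0.45])
# Hard labels obtained by thresholding the probabilities at 0.5.
y_label_demo = (y_proba_demo >= 0.5).astype(float)

loss_from_proba = metrics.log_loss(y_true, y_proba_demo)
loss_from_labels = metrics.log_loss(y_true, y_label_demo)
auc_from_proba = metrics.roc_auc_score(y_true, y_proba_demo)
auc_from_labels = metrics.roc_auc_score(y_true, y_label_demo)

print('log_loss  proba vs labels:', loss_from_proba, loss_from_labels)
print('roc_auc   proba vs labels:', auc_from_proba, auc_from_labels)
```

With hard labels, a single confident mistake is penalized as if the model had been almost certain, which inflates log loss, and the ranking information that ROC AUC relies on is reduced to two levels.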

<font color='violet'>Summary</font>

The GradientBoosting model with the hyperparameters listed above (on data encoded with the Target Encoder, as above) reached a cross-validated log_loss of about 0.238 during tuning; the test-set log_loss is the value printed by the final modeling cell.

All modeling work has been done with just 1% of the users I originally had access to. The model could be applied to predict more users' reorders, but first more users' data would need to be munged with the steps I've taken throughout this project: 
1. Add rows to every order to indicate non-orders.
2. Add columns for the count and percentage of past orders where somebody has ordered an item and, specifically, ordered it within the first 6 items placed in their cart. 
3. Add columns to indicate the presence of keywords that appear in products' names, e.g. 'organic'.
4. Remove rows for items in the 'missing' department, pending improved re-classification of those items into more intuitive, logical department categories. 
5. Encode categorical columns with Target Encoder and normalize ordinal columns with MinMaxScaler (the latter could be done as part of a pipeline along with the GradientBoostingClassifier). 
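The last step mentions folding the scaler into a pipeline with the classifier. A minimal sketch of that idea, assuming synthetic data in place of the encoded features and a reduced `n_estimators` so it runs quickly, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the already-encoded feature matrix.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is fit on the training data only, then reused at predict time.
model = Pipeline([
    ('scale', MinMaxScaler()),
    ('gb', GradientBoostingClassifier(n_estimators=50, random_state=43)),
])
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
print(proba[:5])
```

Keeping the scaler inside the pipeline means new users' data only needs to be encoded before being passed to `model.predict_proba`, and the same pipeline object could be handed to a search such as RandomizedSearchCV.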

<font color='violet'>Next Steps</font>

Looking forward, I could also try using add_to_cart_sequence as a dependent variable. Rather than predicting exactly whether each item will be reordered, I could ask: if I predict the first 5 items to be reordered (placed in the cart early this time), does at least one of them actually end up getting reordered?
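One way to score that question is a "hit at 5" rate: for each order, take the 5 items with the highest predicted probability and check whether any of them was actually reordered. The sketch below assumes a hypothetical frame with `order_id`, `proba`, and `reordered` columns; `order_id` does not exist in the modeling data and would need to be derived, for instance from user_id and order_by_user_sequence.

```python
import pandas as pd

# Hypothetical predictions: one row per (order, product) candidate.
preds = pd.DataFrame({
    'order_id':  [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'proba':     [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.4, 0.3, 0.2],
    'reordered': [0, 0, 1, 0, 0, 1, 0, 0, 0],
})

def hit_at_k(frame, k=5):
    """Share of orders where at least one of the k highest-probability items was reordered."""
    top_k = (frame.sort_values('proba', ascending=False)
                  .groupby('order_id')
                  .head(k))
    hits = top_k.groupby('order_id')['reordered'].max()
    return hits.mean()

print(hit_at_k(preds, k=5))
```

In this toy frame, order 1 has a reordered item among its top 5 and order 2 has none, so half of the orders count as hits.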