This is the final preprocessing notebook before modeling. Here, I'll compare model performance across various categorical encoding strategies. I might also want to balance the data, in addition to focusing on F1, kappa, and ROC AUC as metrics, to see whether that actually improves model performance. I might also need to explore feature reduction; I created multiple features in the previous notebook but am not yet sure whether they'll be valuable in making predictions. I can begin to explore various models in the process of all this. 

Notebook on which this one builds: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/preprocessing2_feature_engineering.ipynb

One thing to look out for in this notebook: If I'm modeling and the computer is doing fine processing the dataset at this size, I could go back to the notebook for preprocessing1, add more rows to further increments of the full original dataset, concatenate them, re-run all the feature engineering steps with the larger dataset, and come back here to try out modeling with more rows (from twice as many, perhaps up to 10 times as many). I could also try a Naive Bayes classifier, which can be used when not all training data fits in memory. 
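
A rough sketch of what the out-of-memory Naive Bayes idea could look like, assuming scikit-learn's `GaussianNB` and its `partial_fit` method. The `make_chunk` function and the toy data are made up for illustration; in practice each chunk would be one increment of the real dataset read from disk.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def make_chunk(n_rows):
    """Hypothetical stand-in for reading one increment of the dataset from disk."""
    y = rng.integers(0, 2, size=n_rows)
    # Two well-separated Gaussian blobs so the toy problem is easy.
    X = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(n_rows, 2))
    return X, y

nb_clf = GaussianNB()
for _ in range(5):
    X_chunk, y_chunk = make_chunk(1000)
    # The full list of classes is required on the first call to partial_fit.
    nb_clf.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))

# Score on a fresh chunk the model has not seen.
X_new, y_new = make_chunk(500)
chunk_accuracy = (nb_clf.predict(X_new) == y_new).mean()
print('toy Naive Bayes accuracy: ', chunk_accuracy)
```

Only one chunk is held in memory at a time, which is the property that would matter if I scale up to many more rows.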

In [1]:
import pandas as pd
import numpy as np
import os
from library.sb_utils import save_file

import matplotlib.pyplot as plt
import seaborn as sns

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

import random

from IPython.display import Audio
sound_file = './alert.wav'

In [2]:
df = pd.read_csv('../data/processed/features_engineered.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218232 entries, 0 to 218231
Data columns (total 27 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   order_id                 218232 non-null  int64  
 1   user_id                  218232 non-null  int64  
 2   order_by_user_sequence   218232 non-null  int64  
 3   days_since_prior_order   218232 non-null  float64
 4   add_to_cart_sequence     218232 non-null  int64  
 5   reordered                218232 non-null  int64  
 6   product_name             218232 non-null  object 
 7   aisle_name               218232 non-null  object 
 8   dept_name                218232 non-null  object 
 9   prior_purchases          218232 non-null  int64  
 10  purchased_percent_prior  218232 non-null  float64
 11  free                     218232 non-null  int64  
 12  fresh                    218232 non-null  int64  
 13  mix                      218232 non-null  int64  
 14  natu

In [3]:
# order_id is redundant as a combination of user and order_by_user_sequence. Delete it. 
df = df.drop(columns='order_id')
df.columns

Index(['user_id', 'order_by_user_sequence', 'days_since_prior_order',
       'add_to_cart_sequence', 'reordered', 'product_name', 'aisle_name',
       'dept_name', 'prior_purchases', 'purchased_percent_prior', 'free',
       'fresh', 'mix', 'natural', 'organic', 'original', 'sweet', 'white',
       'whole', 'rice', 'fruit', 'gluten', 'dow_sin', 'dow_cos', 'hour_sin',
       'hour_cos'],
      dtype='object')

In [4]:
df['reordered'].mean()

0.0973001209721764

I can judge the effect of my work by comparing model scores against the scores I'd get by simply guessing that a random 10% of items get reordered (since 0.097 is the mean of the whole 'reordered' column in this dataset). 

In [5]:
# Make an array with 21823 1s randomly dispersed among the rest 0s. Then use that fake array
# as predictions to see what scores I'd get without using any of the work I've done/ will do. 
ones = [1] * 21823
zeroes = [0] * 196409
array = np.concatenate([ones, zeroes])
len(array)

218232

In [6]:
array[0:6]

array([1, 1, 1, 1, 1, 1])

In [7]:
random.shuffle(array)
array[0:6]

array([0, 0, 0, 0, 0, 0])

In [8]:
y = df['reordered']

In [9]:
# Add a log loss metric
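
One way to fill in the log loss reference point noted above, as a sketch. Log loss needs predicted probabilities rather than 0/1 guesses, so the simplest baseline is a constant prediction equal to the base rate from `df['reordered'].mean()`. The helper name `constant_log_loss` is made up for illustration; with real model output, scikit-learn's `metrics.log_loss` applied to predicted probabilities would be the usual call.

```python
import math

def constant_log_loss(base_rate, predicted_prob):
    """Average log loss when every row gets the same predicted probability.

    base_rate is the true share of positives; predicted_prob is the constant guess.
    """
    return -(base_rate * math.log(predicted_prob)
             + (1 - base_rate) * math.log(1 - predicted_prob))

base_rate = 0.0973001209721764  # df['reordered'].mean() from the cell above

# Predicting the base rate for every row is the best constant guess.
baseline_log_loss = constant_log_loss(base_rate, base_rate)

# A 50/50 guess for every row scores ln(2), which is worse.
coin_flip_log_loss = constant_log_loss(base_rate, 0.5)

print('constant base-rate log loss: ', baseline_log_loss)
print('constant 0.5 log loss: ', coin_flip_log_loss)
```

Any model whose log loss is not below the base-rate value is not adding information beyond the class frequency.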

In [10]:
print('fake baseline conf matrix: ', metrics.confusion_matrix(y, array))
print('fake baseline f1 score: ', metrics.f1_score(y, array))
print('fake baseline kappa score: ', metrics.cohen_kappa_score(y, array))
print('fake baseline roc_auc score: ', metrics.roc_auc_score(y, array))

fake baseline conf matrix:  [[177256  19742]
 [ 19153   2081]]
fake baseline f1 score:  0.0966625635785122
fake baseline kappa score:  -0.002183987584467806
fake baseline roc_auc score:  0.4988944935192412
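
As a sanity check on the shuffled baseline, the scores it should land near can be worked out directly. This is a sketch assuming the guesses are independent of the labels: precision then equals the share of rows that really are reorders, recall equals the share of rows guessed positive, and kappa and ROC AUC are expected to sit at about 0 and 0.5. The counts come from the cells above (21234 is the row total of actual reorders in the confusion matrix); the function name is hypothetical.

```python
def expected_random_f1(n_rows, n_actual_pos, n_guessed_pos):
    """Expected F1 when positive guesses are assigned independently of the labels."""
    precision = n_actual_pos / n_rows   # chance a guessed positive is a real one
    recall = n_guessed_pos / n_rows     # chance a real positive gets guessed
    return 2 * precision * recall / (precision + recall)

expected_f1 = expected_random_f1(n_rows=218232, n_actual_pos=21234, n_guessed_pos=21823)
print('expected random-guess f1: ', expected_f1)
```

The observed F1 of about 0.097 is close to this expectation, which suggests the shuffle behaved as intended.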


My understanding is that categorical features should be encoded prior to any standardization of ordinal features. Start here. 

I'd like to try multiple encoders for categorical data. A summary of my current knowledge of encoders that could make sense for this data:
- One-Hot could work for the dept_name column because there are only 19 categories, much fewer than all the other categorical columns. It wouldn't work for any of the others. 
- Hashing works with high-cardinality variables but isn't reversible and can lead to some (usually minimal, as far as I've read) info loss through collisions. As far as I can tell it shouldn't leak across rows, since each value is hashed on its own without reference to the target. 
- My understanding of binary encoding is that it's the best of both worlds from one-hot and hashing: fewer resultant columns than one-hot but interpretable and no info loss, unlike hashing. 
- My understanding is that Bayesian (target-based) encoders generally cause contamination, so make sure to split into training and test sets prior to encoding. I read that LeaveOneOut is a Bayesian encoder that reduces leakage by leaving each row's own target value out of the mean used to encode that row (it still uses the dependent variable from the other rows). I also read that it is especially good for classification tasks, so it's a good one to consider here.
- I know very little about WeightofEvidence but it's another Bayesian encoder recommended by Springboard and I can try it out along with Target encoder (though I'd expect Target to over-fit compared with LeaveOneOut). 
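
To make the difference between plain target encoding and leave-one-out concrete, here is a plain-Python sketch. It is not the category_encoders implementation; the function names, the fallback to a prior for single-row categories, and the toy data are all assumptions for illustration. Both versions use the dependent variable; leave-one-out just keeps each row's own answer out of its encoding.

```python
from collections import defaultdict

def target_mean_encode(categories, targets):
    """Replace each category with the mean target over all rows in that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    return [sums[cat] / counts[cat] for cat in categories]

def leave_one_out_encode(categories, targets, prior):
    """Same idea, but each row's own target is left out of its category mean.

    Rows whose category appears only once fall back to `prior`
    (an assumption of this sketch; libraries may handle this differently).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    encoded = []
    for cat, t in zip(categories, targets):
        if counts[cat] > 1:
            encoded.append((sums[cat] - t) / (counts[cat] - 1))
        else:
            encoded.append(prior)
    return encoded

# Toy data: product names and whether each row was reordered.
products = ['banana', 'banana', 'banana', 'kale', 'kale', 'tofu']
reordered = [1, 1, 0, 0, 1, 1]
prior = sum(reordered) / len(reordered)

mean_encoded = target_mean_encode(products, reordered)
loo_encoded = leave_one_out_encode(products, reordered, prior)
print('target mean: ', mean_encoded)
print('leave one out: ', loo_encoded)
```

In the toy data, rows of the same product share one value under plain target encoding but get different values under leave-one-out, depending on their own label. Test rows have no label to leave out, so they would get the plain category mean learned from the training rows.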

I'd like any encoder(s) I use to be included in an eventual modeling pipeline, but first I want to explore and try them out individually to see better how they would each work with the data and operate on the categorical columns. 

In [11]:
# Start by just predicting the reordered column. Perhaps try predicting the add_to_cart_sequence
# column later. Create independent & dependent variables, encode independent categories. 
X = df.drop(columns=['reordered', 'add_to_cart_sequence'])
y = df['reordered']

categorical_columns = ['user_id', 'product_name', 'aisle_name', 'dept_name']
ce_bin = ce.BinaryEncoder(cols=categorical_columns)
Xbin = ce_bin.fit_transform(X,y)

In [12]:
Xbin.columns

Index(['user_id_0', 'user_id_1', 'user_id_2', 'user_id_3', 'user_id_4',
       'user_id_5', 'user_id_6', 'user_id_7', 'order_by_user_sequence',
       'days_since_prior_order', 'product_name_0', 'product_name_1',
       'product_name_2', 'product_name_3', 'product_name_4', 'product_name_5',
       'product_name_6', 'product_name_7', 'product_name_8', 'product_name_9',
       'product_name_10', 'product_name_11', 'product_name_12', 'aisle_name_0',
       'aisle_name_1', 'aisle_name_2', 'aisle_name_3', 'aisle_name_4',
       'aisle_name_5', 'aisle_name_6', 'aisle_name_7', 'dept_name_0',
       'dept_name_1', 'dept_name_2', 'dept_name_3', 'dept_name_4',
       'prior_purchases', 'purchased_percent_prior', 'free', 'fresh', 'mix',
       'natural', 'organic', 'original', 'sweet', 'white', 'whole', 'rice',
       'fruit', 'gluten', 'dow_sin', 'dow_cos', 'hour_sin', 'hour_cos'],
      dtype='object')
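
The column counts above can be sanity-checked with a little arithmetic. This sketch assumes the encoder gives each category an ordinal code starting at 1 and then writes that code in base 2, one column per digit; that is my understanding of binary encoding rather than something confirmed from the library source. The function names are hypothetical.

```python
def binary_columns_needed(n_categories):
    """Bits needed to write the ordinal codes 1..n_categories in base 2."""
    return n_categories.bit_length()

def max_categories_for(n_columns):
    """Largest category count that still fits in n_columns binary digits."""
    return 2 ** n_columns - 1

# dept_name has 19 categories, which matches the 5 dept_name_* columns above.
dept_columns = binary_columns_needed(19)

# 13 product_name_* columns can hold at most this many distinct products.
product_upper_bound = max_categories_for(13)

print('columns for 19 departments: ', dept_columns)
print('max products in 13 columns: ', product_upper_bound)
```

Compared with one-hot, which would need one column per product, 13 columns for product_name is a very large reduction.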

In [13]:
# Test out encoder performance in Bagging and RandomForest models. 
# These were overwhelmingly better than others when trying them out with a practice user.
# First need to standardize. Don't bother yet with tuning model hyperparameters.

Xbin_train, Xbin_test, ybin_train, ybin_test = train_test_split(Xbin, y, test_size=0.3)

scaler = StandardScaler()
Xbin_train_scaled = scaler.fit_transform(Xbin_train)
Xbin_test_scaled = scaler.transform(Xbin_test)

bgg_clf = BaggingClassifier()
bgg_clf = bgg_clf.fit(Xbin_train_scaled, ybin_train)
ybin_pred = bgg_clf.predict(Xbin_test_scaled)
print('binary bagging conf matrix: ', metrics.confusion_matrix(ybin_test, ybin_pred))
print('binary bagging f1 score: ', metrics.f1_score(ybin_test, ybin_pred))
print('binary bagging kappa score: ', metrics.cohen_kappa_score(ybin_test, ybin_pred))
print('binary bagging roc_auc score: ', metrics.roc_auc_score(ybin_test, ybin_pred))

binary bagging conf matrix:  [[57973  1117]
 [ 5107  1273]]
binary bagging f1 score:  0.2903078677309008
binary bagging kappa score:  0.25049909366892575
binary bagging roc_auc score:  0.5903132064100721


In [14]:
rf_clf = RandomForestClassifier()
rf_clf = rf_clf.fit(Xbin_train_scaled, ybin_train)
ybin_pred = rf_clf.predict(Xbin_test_scaled)
print('binary RF conf matrix: ', metrics.confusion_matrix(ybin_test, ybin_pred))
print('binary RF f1 score: ', metrics.f1_score(ybin_test, ybin_pred))
print('binary RF kappa score: ', metrics.cohen_kappa_score(ybin_test, ybin_pred))
print('binary RF roc_auc score: ', metrics.roc_auc_score(ybin_test, ybin_pred))

binary RF conf matrix:  [[58389   701]
 [ 5052  1328]]
binary RF f1 score:  0.3158520632655488
binary RF kappa score:  0.28209105627726694
binary RF roc_auc score:  0.5981436053923376
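
A note on the roc_auc numbers: because `predict` returns hard 0/1 labels, the ROC curve has only one point between the corners, and the area under it reduces to the average of the true negative rate and the true positive rate (balanced accuracy). The sketch below recomputes the random forest's roc_auc from its confusion matrix to show this; the function name is made up.

```python
def auc_from_hard_labels(conf_matrix):
    """ROC AUC when predictions are 0/1 labels: the mean of TNR and TPR."""
    (tn, fp), (fn, tp) = conf_matrix
    true_negative_rate = tn / (tn + fp)
    true_positive_rate = tp / (tp + fn)
    return (true_negative_rate + true_positive_rate) / 2

# Confusion matrix printed for the binary-encoded random forest above.
binary_rf_matrix = [[58389, 701], [5052, 1328]]
hard_label_auc = auc_from_hard_labels(binary_rf_matrix)
print('AUC implied by confusion matrix: ', hard_label_auc)
```

Passing `rf_clf.predict_proba(Xbin_test_scaled)[:, 1]` to `metrics.roc_auc_score` instead of `ybin_pred` would score how well the model ranks rows across all thresholds, which is probably closer to what I want from this metric.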


In [15]:
# Try a Bayesian encoder. Start with LeaveOneOut. Even though it has less contamination
# than other Bayesian encoders, it's a good idea to split data first. 

Xloo_train, Xloo_test, yloo_train, yloo_test = train_test_split(X, y, test_size=0.3)

ce_loo = ce.leave_one_out.LeaveOneOutEncoder(cols=categorical_columns, random_state=43)
ce_loo.fit(Xloo_train, yloo_train)
Xloo_train = ce_loo.transform(Xloo_train)
Xloo_test = ce_loo.transform(Xloo_test)

Xloo_train

Unnamed: 0,user_id,order_by_user_sequence,days_since_prior_order,product_name,aisle_name,dept_name,prior_purchases,purchased_percent_prior,free,fresh,...,sweet,white,whole,rice,fruit,gluten,dow_sin,dow_cos,hour_sin,hour_cos
68831,0.075396,58,6.0,0.000000,0.073328,0.083773,1,0.017241,0,0,...,0,0,0,0,0,0,0.781831,0.623490,-0.500000,-0.866025
208314,0.104831,35,9.0,0.296296,0.101341,0.091869,7,0.200000,0,0,...,0,0,0,0,0,0,-0.433884,-0.900969,0.707107,-0.707107
205276,0.100213,18,15.0,0.131387,0.106911,0.123285,1,0.055556,0,0,...,0,0,0,0,0,0,0.000000,1.000000,-0.866025,-0.500000
171286,0.080026,65,1.0,0.000000,0.218484,0.128195,1,0.015385,0,0,...,0,0,0,0,0,0,0.000000,1.000000,-0.258819,-0.965926
133998,0.029821,20,2.0,0.000000,0.099040,0.083773,1,0.050000,0,0,...,0,0,0,1,0,0,0.000000,1.000000,0.258819,0.965926
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72192,0.082047,27,7.0,0.182171,0.137708,0.123285,1,0.037037,0,0,...,0,0,0,0,0,0,0.781831,0.623490,0.707107,-0.707107
152895,0.084492,35,10.0,0.054545,0.157351,0.123285,3,0.085714,0,0,...,0,1,0,0,0,0,0.781831,0.623490,0.866025,-0.500000
153264,0.084492,37,19.0,0.081395,0.149678,0.138347,1,0.027027,0,0,...,0,0,0,0,0,0,0.433884,-0.900969,-0.866025,0.500000
50216,0.095718,10,3.0,0.047619,0.062698,0.056574,1,0.100000,0,0,...,0,0,0,0,0,0,-0.974928,-0.222521,0.258819,-0.965926


In [16]:
# Now try this encoded data in models after standardization

scaler = StandardScaler()
Xloo_train_scaled = scaler.fit_transform(Xloo_train)
Xloo_test_scaled = scaler.transform(Xloo_test)

bgg_clf = BaggingClassifier()
bgg_clf = bgg_clf.fit(Xloo_train_scaled, yloo_train)
yloo_pred = bgg_clf.predict(Xloo_test_scaled)
print('LeaveOneOut bagging conf matrix: ', metrics.confusion_matrix(yloo_test, yloo_pred))
print('LeaveOneOut bagging f1 score: ', metrics.f1_score(yloo_test, yloo_pred))
print('LeaveOneOut bagging kappa score: ', metrics.cohen_kappa_score(yloo_test, yloo_pred))
print('LeaveOneOut bagging roc_auc score: ', metrics.roc_auc_score(yloo_test, yloo_pred))

LeaveOneOut bagging conf matrix:  [[57752  1291]
 [ 4878  1549]]
LeaveOneOut bagging f1 score:  0.3343045214200928
LeaveOneOut bagging kappa score:  0.29168573071238413
LeaveOneOut bagging roc_auc score:  0.6095745250431431


In [17]:
rf_clf = RandomForestClassifier()
rf_clf = rf_clf.fit(Xloo_train_scaled, yloo_train)
yloo_pred = rf_clf.predict(Xloo_test_scaled)
print('LeaveOneOut RF conf matrix: ', metrics.confusion_matrix(yloo_test, yloo_pred))
print('LeaveOneOut RF f1 score: ', metrics.f1_score(yloo_test, yloo_pred))
print('LeaveOneOut RF kappa score: ', metrics.cohen_kappa_score(yloo_test, yloo_pred))
print('LeaveOneOut RF roc_auc score: ', metrics.roc_auc_score(yloo_test, yloo_pred))

LeaveOneOut RF conf matrix:  [[58031  1012]
 [ 4768  1659]]
LeaveOneOut RF f1 score:  0.36469553748076505
LeaveOneOut RF kappa score:  0.32583677471970873
LeaveOneOut RF roc_auc score:  0.6204948572909949


In [18]:
# Now see whether LeaveOneOut performs better if I set drop_invariant to True.

Xloo_train, Xloo_test, yloo_train, yloo_test = train_test_split(X, y, test_size=0.3)

ce_loo = ce.leave_one_out.LeaveOneOutEncoder(cols=categorical_columns, random_state=43,
                                            drop_invariant=True)
ce_loo.fit(Xloo_train, yloo_train)
Xloo_train = ce_loo.transform(Xloo_train)
Xloo_test = ce_loo.transform(Xloo_test)

Xloo_train

Unnamed: 0,user_id,order_by_user_sequence,days_since_prior_order,product_name,aisle_name,dept_name,prior_purchases,purchased_percent_prior,free,fresh,...,sweet,white,whole,rice,fruit,gluten,dow_sin,dow_cos,hour_sin,hour_cos
16266,0.037566,62,7.0,0.083916,0.107698,0.123888,1,0.016129,0,0,...,0,0,0,0,0,0,0.781831,0.623490,-0.707107,-0.707107
205433,0.102362,6,5.0,0.114035,0.107698,0.123888,3,0.500000,0,0,...,0,0,0,0,0,0,0.433884,-0.900969,-0.258819,-0.965926
27242,0.086063,22,22.0,0.099415,0.107698,0.123888,0,0.000000,0,0,...,0,0,0,0,0,0,0.781831,0.623490,-0.707107,-0.707107
84824,0.193624,12,8.0,0.051282,0.107698,0.123888,1,0.083333,0,0,...,0,0,0,0,0,0,0.000000,1.000000,0.258819,-0.965926
186889,0.079315,42,3.0,0.178571,0.091291,0.088919,1,0.023810,0,0,...,0,0,0,0,0,0,-0.433884,-0.900969,0.258819,-0.965926
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112142,0.074981,7,5.0,0.032258,0.075688,0.084889,1,0.142857,0,0,...,0,0,0,0,0,0,-0.781831,0.623490,-0.965926,-0.258819
84765,0.193624,11,21.0,0.000000,0.107698,0.123888,1,0.090909,0,0,...,0,0,0,0,0,0,-0.781831,0.623490,-0.258819,0.965926
88630,0.122807,35,6.0,0.258065,0.092498,0.087185,8,0.228571,0,0,...,0,0,0,0,0,0,0.781831,0.623490,0.500000,-0.866025
167688,0.169986,45,2.0,0.155763,0.124816,0.123888,1,0.022222,0,0,...,0,0,0,0,0,0,-0.974928,-0.222521,-0.500000,-0.866025


In [19]:
scaler = StandardScaler()
Xloo_train_scaled = scaler.fit_transform(Xloo_train)
Xloo_test_scaled = scaler.transform(Xloo_test)

rf_clf = RandomForestClassifier()
rf_clf = rf_clf.fit(Xloo_train_scaled, yloo_train)
yloo_pred = rf_clf.predict(Xloo_test_scaled)
print('LOO drop_invariant conf matrix: ', metrics.confusion_matrix(yloo_test, yloo_pred))
print('LOO drop_invariant f1 score: ', metrics.f1_score(yloo_test, yloo_pred))
print('LOO drop_invariant kappa score: ', metrics.cohen_kappa_score(yloo_test, yloo_pred))
print('LOO drop_invariant roc_auc score: ', metrics.roc_auc_score(yloo_test, yloo_pred))

LOO drop_invariant conf matrix:  [[57951  1040]
 [ 4881  1598]]
LOO drop_invariant f1 score:  0.35055391027750354
LOO drop_invariant kappa score:  0.31110148367750445
LOO drop_invariant roc_auc score:  0.6145065962631153
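
Each run so far draws a fresh random split, so small score differences between runs may partly reflect which rows landed in the test set. A sketch of pinning the split with `train_test_split`'s `random_state` and `stratify` arguments, using toy data in place of X and y (the toy arrays and the seed value are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y with roughly a 10% positive rate.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.1).astype(int)

# random_state pins the shuffle; stratify keeps the positive rate equal in both parts.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=43, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=43, stratify=y_toy)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

# The two calls return identical rows, so different encoders could share one split.
same_split = np.array_equal(X_test_a, X_test_b)
rate_gap = abs(y_train_a.mean() - y_test_a.mean())
print('same split both times: ', same_split)
print('train/test positive-rate gap: ', rate_gap)
```

Reusing one pinned split for every encoder would make the comparisons apples to apples.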


In [20]:
# Model performance dropped very slightly with drop_invariant=True, but the train/test split
# was re-drawn without a fixed random_state, so the difference may just be split noise.
# Try some of the other Bayesian encoders. Start with Target Encoder.
# It has hyperparameters min_samples_leaf and smoothing that I could tune if the Target encoder
# seems worthwhile compared with others. 

Xtar_train, Xtar_test, ytar_train, ytar_test = train_test_split(X, y, test_size=0.3)

ce_tar = ce.target_encoder.TargetEncoder(cols=categorical_columns)
ce_tar.fit(Xtar_train, ytar_train)
Xtar_train = ce_tar.transform(Xtar_train)
Xtar_test = ce_tar.transform(Xtar_test)

Xtar_train



Unnamed: 0,user_id,order_by_user_sequence,days_since_prior_order,product_name,aisle_name,dept_name,prior_purchases,purchased_percent_prior,free,fresh,...,sweet,white,whole,rice,fruit,gluten,dow_sin,dow_cos,hour_sin,hour_cos
81914,0.107345,7,1.0,0.331754,0.082179,0.135367,4,0.571429,0,0,...,0,0,0,0,0,0,-0.974928,-0.222521,-0.866025,-0.500000
208845,0.102848,38,4.0,0.146018,0.106932,0.124672,13,0.342105,0,0,...,0,0,0,0,0,0,0.974928,-0.222521,0.500000,-0.866025
106919,0.051403,40,12.0,0.026316,0.035503,0.036315,1,0.025000,0,0,...,0,0,0,0,0,0,-0.781831,0.623490,-0.707107,-0.707107
58131,0.039130,26,1.0,0.019608,0.007752,0.036315,2,0.076923,0,0,...,0,0,0,0,0,0,-0.974928,-0.222521,0.707107,-0.707107
94981,0.094574,42,12.0,0.069182,0.106932,0.124672,3,0.071429,0,0,...,0,0,0,0,0,0,0.781831,0.623490,0.707107,-0.707107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184535,0.082436,20,7.0,0.058586,0.121382,0.124672,1,0.050000,0,0,...,0,0,0,0,0,0,-0.433884,-0.900969,-0.707107,-0.707107
113726,0.073304,28,11.0,0.437500,0.106932,0.124672,12,0.428571,0,0,...,0,0,0,0,0,0,-0.974928,-0.222521,-0.258819,-0.965926
170493,0.073964,54,0.0,0.073276,0.121382,0.124672,7,0.129630,0,0,...,0,0,1,0,0,0,-0.433884,-0.900969,-0.866025,0.500000
56175,0.104635,13,1.0,0.125000,0.105085,0.087595,2,0.153846,0,0,...,0,0,1,0,0,0,0.000000,1.000000,0.500000,-0.866025


In [21]:
scaler = StandardScaler()
Xtar_train_scaled = scaler.fit_transform(Xtar_train)
Xtar_test_scaled = scaler.transform(Xtar_test)

rf_clf = RandomForestClassifier()
rf_clf = rf_clf.fit(Xtar_train_scaled, ytar_train)
ytar_pred = rf_clf.predict(Xtar_test_scaled)
print('Target RF conf matrix: ', metrics.confusion_matrix(ytar_test, ytar_pred))
print('Target RF f1 score: ', metrics.f1_score(ytar_test, ytar_pred))
print('Target RF kappa score: ', metrics.cohen_kappa_score(ytar_test, ytar_pred))
print('Target RF roc_auc score: ', metrics.roc_auc_score(ytar_test, ytar_pred))

Target RF conf matrix:  [[58100  1017]
 [ 4737  1616]]
Target RF f1 score:  0.359670598709103
Target RF kappa score:  0.3210622797283088
Target RF roc_auc score:  0.6185824208714521


In [22]:
# TargetEncoder's RF scores (f1 0.360) fall between the two LeaveOneOut runs (0.365 and 0.351),
# even on these metrics that are sensitive to class imbalance, so the two encoders look roughly
# equivalent given that each run used a different random split. Try WeightofEvidence.
# It does have some parameters that could be tuned but don't bother for now. 

Xwoe_train, Xwoe_test, ywoe_train, ywoe_test = train_test_split(X, y, test_size=0.3)

ce_woe = ce.woe.WOEEncoder(cols=categorical_columns)
ce_woe.fit(Xwoe_train, ywoe_train)
Xwoe_train = ce_woe.transform(Xwoe_train)
Xwoe_test = ce_woe.transform(Xwoe_test)

Xwoe_train

Unnamed: 0,user_id,order_by_user_sequence,days_since_prior_order,product_name,aisle_name,dept_name,prior_purchases,purchased_percent_prior,free,fresh,...,sweet,white,whole,rice,fruit,gluten,dow_sin,dow_cos,hour_sin,hour_cos
209063,1.251531,5,13.0,-0.413764,0.073783,-0.115033,1,0.200000,0,0,...,0,0,0,0,0,0,0.781831,0.623490,-1.000000e+00,-1.836970e-16
59201,-0.969335,45,1.0,-1.208694,-0.031407,-0.051817,2,0.044444,0,0,...,0,0,0,0,0,0,0.433884,-0.900969,8.660254e-01,-5.000000e-01
155758,-0.726720,34,4.0,-0.719146,-0.517082,-0.680133,1,0.029412,0,0,...,0,0,0,0,0,0,0.974928,-0.222521,-2.588190e-01,-9.659258e-01
124436,0.378894,16,6.0,0.533617,0.364952,-0.680133,1,0.062500,0,0,...,0,0,0,0,0,0,0.000000,1.000000,7.071068e-01,-7.071068e-01
108547,-0.685588,52,6.0,-0.448856,-0.123535,0.401654,2,0.038462,0,0,...,0,0,0,0,0,0,0.433884,-0.900969,-7.071068e-01,-7.071068e-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109762,-0.516524,9,11.0,0.433533,0.056395,0.401654,2,0.222222,0,0,...,0,0,0,0,0,0,-0.433884,-0.900969,5.000000e-01,-8.660254e-01
122918,-0.071100,23,8.0,0.491521,0.538691,0.271407,2,0.086957,0,0,...,0,0,0,0,0,0,0.974928,-0.222521,1.224647e-16,-1.000000e+00
175870,0.271329,26,20.0,-0.482757,-0.167149,-0.069845,1,0.038462,0,0,...,0,0,0,0,0,0,0.974928,-0.222521,7.071068e-01,-7.071068e-01
98520,-0.820764,23,18.0,0.615855,-0.403640,-0.949340,2,0.086957,0,0,...,0,0,0,0,0,0,-0.433884,-0.900969,1.000000e+00,6.123234e-17


In [23]:
scaler = StandardScaler()
Xwoe_train_scaled = scaler.fit_transform(Xwoe_train)
Xwoe_test_scaled = scaler.transform(Xwoe_test)

rf_clf = RandomForestClassifier()
rf_clf = rf_clf.fit(Xwoe_train_scaled, ywoe_train)
ywoe_pred = rf_clf.predict(Xwoe_test_scaled)
print('WeightofEvidence RF conf matrix: ', metrics.confusion_matrix(ywoe_test, ywoe_pred))
print('WeightofEvidence RF f1 score: ', metrics.f1_score(ywoe_test, ywoe_pred))
print('WeightofEvidence RF kappa score: ', metrics.cohen_kappa_score(ywoe_test, ywoe_pred))
print('WeightofEvidence RF roc_auc score: ', metrics.roc_auc_score(ywoe_test, ywoe_pred))

WeightofEvidence RF conf matrix:  [[58049  1080]
 [ 4789  1552]]
WeightofEvidence RF f1 score:  0.3459266688955756
WeightofEvidence RF kappa score:  0.3065240556633575
WeightofEvidence RF roc_auc score:  0.6132455992492114


WeightOfEvidence performed slightly less well than Target encoder and LeaveOneOut. 

Move forward with hyperparameter tuning. Already decided to keep LOO's drop_invariant as default False, and that seems to be the only parameter possibly worth changing. For Target, min_samples_leaf and smoothing values seem to take values greater than 0 (int and float, respectively). I've seen examples with these set to 2 instead of the default 1. Try a number of values to see what makes sense. 

In [24]:
min_samples_leaf_options = [1,2,3]
smoothing_options = [1.0,2.0,3.0]
scores = []
# Re-split from the raw X: Xtar_train/Xtar_test were already encoded in cell 20,
# and each fit below needs the un-encoded categorical columns.
Xraw_train, Xraw_test, ytar_train, ytar_test = train_test_split(X, y, test_size=0.3)
for leaf in min_samples_leaf_options:
    for smooth in smoothing_options:
        ce_tar = ce.target_encoder.TargetEncoder(cols=categorical_columns,
                                                min_samples_leaf=leaf,
                                                smoothing=smooth)
        ce_tar.fit(Xraw_train, ytar_train)
        # Encode into new variables so the raw data isn't overwritten between iterations
        Xtar_train = ce_tar.transform(Xraw_train)
        Xtar_test = ce_tar.transform(Xraw_test)
        scaler = StandardScaler()
        Xtar_train_scaled = scaler.fit_transform(Xtar_train)
        Xtar_test_scaled = scaler.transform(Xtar_test)
        rf_clf = RandomForestClassifier()
        rf_clf = rf_clf.fit(Xtar_train_scaled, ytar_train)
        ytar_pred = rf_clf.predict(Xtar_test_scaled)
        scores.append({'min_samples':leaf, 'smoothing':smooth, 
                       'score': metrics.roc_auc_score(ytar_test, ytar_pred)})

# Named scores_df to match the sort in the next cell
scores_df = pd.DataFrame(scores)

In [None]:
scores_df.sort_values('score', ascending=False)

For resampling, under-sampling of non-reorders might actually be better than over-sampling reorders because I have such a big dataset; try out different methods and ratios of reordered:not. And/or, try SMOTE to generate synthetic samples; try penalized-SVM or other ways of penalizing models for poor precision/recall. 
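
A minimal sketch of random under-sampling of the non-reorders, written in plain Python so the ratio logic is easy to see. The function name, its parameters, and the toy labels are made up for illustration; a library such as imbalanced-learn would be the more likely tool in practice.

```python
import random

def undersample_indices(labels, negatives_per_positive, seed=0):
    """Keep every positive row and a random subset of negative rows.

    Returns sorted row positions; negatives_per_positive sets the class ratio.
    """
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    n_keep = min(len(negatives), int(len(positives) * negatives_per_positive))
    kept_negatives = rng.sample(negatives, n_keep)
    return sorted(positives + kept_negatives)

# Toy labels with a 10% positive rate, similar to the reordered column.
toy_labels = [1] * 10 + [0] * 90
balanced_rows = undersample_indices(toy_labels, negatives_per_positive=1)
two_to_one_rows = undersample_indices(toy_labels, negatives_per_positive=2)
print('rows kept at 1:1: ', len(balanced_rows))
print('rows kept at 2:1: ', len(two_to_one_rows))
```

I'd apply this to the training rows only, so the test set keeps the real class balance and the scores stay comparable with the ones above.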