# Extension of Category Encoders Benchmarking
## By Jeff Hale

As of July 2018, the sklearn-compatible Python package *Category Encoders* provides eleven methods for encoding categorical (ordinal and nominal) data in numerical format for machine learning models.

Will McGinnis, the primary author of the package, benchmarked the first encoders included in the package on three classification datasets with a naive Bayes `BernoulliNB` classifier [here](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/). While writing [this guide](https://www.kaggle.com/discdiver/measurement-scales-for-machine-learning/) to explain options for encoding ordinal and nominal data, I tested the newest Category Encoders with a variety of models to provide insight into when to try different encoders.

There is a good bit of parameter tuning to do. Some models may need different parameter settings for different encoders because the encoders differ in the dimensionality and sparsity of their output.

I ran two of the datasets with nine sklearn classification algorithms, without changing the default parameters. I left out the genetic splicing dataset because the more than 3,000 unique values in two different columns were causing problems with algorithms in my Kaggle Kernel.
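
To give a feel for why a column with thousands of unique values is a problem, here is a rough sketch of how output width grows for two encoding styles. The helper names are hypothetical, and the numbers are back-of-envelope estimates rather than the exact column counts the package produces, which depend on its version and options.

```python
def one_hot_width(n_categories: int) -> int:
    """One indicator column per distinct category."""
    return n_categories


def binary_width(n_categories: int) -> int:
    """Bits needed to write the integer codes 1..n_categories in base 2."""
    return n_categories.bit_length()


# 3000 stands in for the high-cardinality columns mentioned above.
for k in (4, 23, 3000):
    print(f"{k:>5} categories: one-hot ~{one_hot_width(k)} columns, "
          f"binary ~{binary_width(k)} columns")
```

The gap between linear and logarithmic growth is one reason a model's default parameters may suit one encoder better than another.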

My adaptation of McGinnis's code is below. 


In [None]:
import time
import gc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

import category_encoders

from IPython.display import display, HTML

pd.options.display.max_columns = 50
pd.options.display.width = 1000

%matplotlib inline
plt.style.use('ggplot')
sns.set(font_scale=1.5)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

np.random.seed(34)

import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

In [None]:
# moved loaders.py into this file

def get_cars_data():
    """Load the cars dataset, split it into X and y, and then call the label encoder to get an integer y column.
    
    :return X: predictor columns
    :type X: pandas.core.frame.DataFrame
    :return y: prediction column
    :type y: numpy.ndarray
    :return mapping: maps string values to ordinal integers
    :type mapping: list of dictionaries
    """

    df = pd.read_csv('../input/car-names/car.data.txt')
    X = df.reindex(columns=[x for x in df.columns.values if x != 'class'])  # equivalent to drop "class" column
    y = df.reindex(columns=['class'])
    print(y['class'].value_counts())
    y = preprocessing.LabelEncoder().fit_transform(y.values.reshape(-1, ))

    mapping = [
        {'col': 'buying', 'mapping': [('vhigh', 0), ('high', 1), ('med', 2), ('low', 3)]},
        {'col': 'maint', 'mapping': [('vhigh', 0), ('high', 1), ('med', 2), ('low', 3)]},
        {'col': 'doors', 'mapping': [('2', 0), ('3', 1), ('4', 2), ('5more', 3)]},
        {'col': 'persons', 'mapping': [('2', 0), ('4', 1), ('more', 2)]},
        {'col': 'lug_boot', 'mapping': [('small', 0), ('med', 1), ('big', 2)]},
        {'col': 'safety', 'mapping': [('high', 0), ('med', 1), ('low', 2)]},
    ]
    
    return X, y, mapping


def get_mushroom_data():
    """Load the mushroom dataset, split it into X and y, and then call the label encoder to get an integer y column."""

    df = pd.read_csv('../input/mushroom-type/agaricus-lepiota.csv')
    X = df.reindex(columns=[x for x in df.columns.values if x != 'class'])
    y = df.reindex(columns=['class'])
    print(y['class'].value_counts())
    y = preprocessing.LabelEncoder().fit_transform(y.values.reshape(-1, ))

    # this data is truly categorical, with no known concept of ordering
    mapping = None

    return X, y, mapping


def get_splice_data():
    """Load the splice dataset, split it into X and y, and then call the label encoder to get an integer y column."""

    df = pd.read_csv('../input/primate-splicejunction-gene-sequences/splice.csv')
    X = df.reindex(columns=[x for x in df.columns.values if x != 'class'])
    X['dna'] = X['dna'].map(lambda x: list(str(x).strip()))
    for idx in range(60):
        X['dna_%d' % (idx, )] = X['dna'].map(lambda x: x[idx])
    del X['dna']

    y = df.reindex(columns=['class'])
    y = preprocessing.LabelEncoder().fit_transform(y.values.reshape(-1, ))

    # this data is truly categorical, with no known concept of ordering (aka nominal)
    mapping = None

    return X, y, mapping
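
Each loader returns a `mapping` that describes the intended order of the categories, although the scoring code in this notebook instantiates every encoder with default arguments and does not pass the mapping along. As a minimal sketch of what such a mapping expresses, the snippet below applies one entry by hand. The helper `apply_ordinal_mapping` and the sample rows are hypothetical, not part of the package.

```python
# Mapping entry copied from get_cars_data above.
safety_mapping = {'col': 'safety',
                  'mapping': [('high', 0), ('med', 1), ('low', 2)]}


def apply_ordinal_mapping(rows, spec):
    """Return copies of `rows` with spec['col'] replaced by its integer code."""
    lookup = dict(spec['mapping'])
    col = spec['col']
    return [{**row, col: lookup[row[col]]} for row in rows]


# Hypothetical rows, for illustration only.
rows = [{'safety': 'low', 'doors': '2'},
        {'safety': 'high', 'doors': '5more'}]

encoded = apply_ordinal_mapping(rows, safety_mapping)
print(encoded)
```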

In [None]:
def score_models(clf, X, y, encoder, runs=1):
    """
    Takes in a classifier that supports multiclass classification, and X and a y, 
    and returns a cross validation score.

    :param clf: classifier that supports multiclass classification
    :type clf: sklearn algorithm
    :param X: X data columns
    :type X: pandas.core.frame.DataFrame
    :param y: y data column
    :type y: numpy.ndarray
    :param encoder: encoder to use for running the model
    :type encoder: type
    :param runs: default = 1, number of times to fit_transform and run the model
    :type runs: int
    
    :return float(np.mean(scores)): mean of cross val scores
    :return float(np.std(scores)): standard deviation of cross val scores
    :return scores: list of scores
    :return X_test.shape[1]: number of features
    """

    scores = []

    X_test = None
    for _ in range(runs):
        X_test = encoder().fit_transform(X, y)
        scores.append(cross_val_score(clf, X_test, y, n_jobs=1, cv=5))
        gc.collect()

    # flatten the per-run arrays of fold scores into one list
    scores = [s for fold_scores in scores for s in fold_scores]
    return float(np.mean(scores)), float(np.std(scores)), scores, X_test.shape[1]
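
As a small illustration of the summary step in `score_models`, the sketch below flattens per-run fold scores and computes their mean and standard deviation using only the standard library. The fold accuracies are made up for the example.

```python
from itertools import chain
from statistics import mean, pstdev

# Hypothetical fold accuracies from two runs of 5-fold cross-validation.
runs = [
    [0.70, 0.72, 0.68, 0.71, 0.69],
    [0.71, 0.70, 0.69, 0.73, 0.67],
]

flat = list(chain.from_iterable(runs))  # one flat list of 10 fold scores
avg = mean(flat)
spread = pstdev(flat)  # population std, the same convention as np.std's default

print(len(flat), round(avg, 3), round(spread, 3))
```

With the default `runs=1`, the reported standard deviation is simply the spread across the five folds.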

In [None]:
def main(loader, name, models):
    """
    Load a dataset and score with a list of models using different encodings.
    
    :param loader: dataset loading function that returns X, y, and mapping
    :type loader: callable
    :param name: dataset name to show in the output
    :type name: string
    :param models: which machine learning models to use
    :type models: dictionary mapping model name strings to sklearn model instances
    
    :return: rankings info for each encoder
    :return type: dataframe
    """
    
    scores = []                               # list for encoder score info
    raw_scores_ds = {}                        # dict for holding raw scores
    model_name = []                           # list of dataframes of results per model type
    mods = [*models.values()]                 # get models and model names 
    mod_names = [*models.keys()]
    counter = 0                                # counter to print element in mod_names
    ranked = pd.DataFrame()                    # df used to rank each encoder and then take a mean ranking
    X, y, mapping = loader()                   # load the dataset
    
    for clf in mods:                           # iterate through each model
        encoders = category_encoders.__all__   # use each encoding method available
        for encoder_name in encoders:
            encoder = getattr(category_encoders, encoder_name)
            start_time = time.time()
            score, stds, raw_scores, dim = score_models(clf, X, y, encoder)
            scores.append( [encoder_name, score, stds, dim, time.time() - start_time, mod_names[counter], name,])
            raw_scores_ds[encoder_name] = raw_scores
            model_name.append(clf)
            
        re = pd.DataFrame(scores, columns=['Encoding', 'Avg. Score', 'Score StDev', 'Dimensionality',  'Elapsed Time', "Model",  'Dataset',])
        re = re.round(decimals=2)
        re['ranking'] = re['Avg. Score'].rank(ascending=False)
        display(HTML(re.sort_values(by=['Model', 'Avg. Score'], ascending = True).to_html()))
        
        ranked = pd.concat([ranked, re])  # DataFrame.append was removed in pandas 2.0
        
        # plot the scores
        raw = pd.DataFrame.from_dict(raw_scores_ds)
        fig, ax = plt.subplots(figsize = (10, 8))
        sns.boxplot(ax = ax, data=raw, palette="colorblind")
        plt.title('Scores for {mod} Encodings on Dataset {ds}'.format(mod = mod_names[counter], ds = name))
        plt.ylabel('Score (higher better)')
        plt.xticks(rotation=90)
        plt.grid()
        plt.tight_layout()
        plt.show()
        
        scores = []
        counter += 1  
        
    rank_thus_far = ranked.groupby('Encoding')['ranking'].agg(['mean', 'std', 'min', 'max', 'count'])       
    return rank_thus_far
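
The ranking logic in `main` ranks encoders within each model (highest average score gets rank 1, ties share the average rank, which is the default behavior of pandas `rank`) and then averages those ranks per encoder across models. Here is a plain-Python sketch of the same idea. The function name and the scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rank_descending(scores):
    """Rank so the highest score gets rank 1; ties share the average rank."""
    ordered = sorted(scores.values(), reverse=True)
    ranks = {}
    for name, score in scores.items():
        positions = [i + 1 for i, v in enumerate(ordered) if v == score]
        ranks[name] = sum(positions) / len(positions)
    return ranks


# Hypothetical rounded accuracies, for illustration only.
per_model = {
    'LR':  {'OneHotEncoder': 0.85, 'OrdinalEncoder': 0.80, 'HashingEncoder': 0.70},
    'QDA': {'OneHotEncoder': 0.60, 'OrdinalEncoder': 0.75, 'HashingEncoder': 0.60},
}

collected = defaultdict(list)
for model, scores in per_model.items():
    for enc, rank in rank_descending(scores).items():
        collected[enc].append(rank)

mean_rank = {enc: mean(ranks) for enc, ranks in collected.items()}
print(mean_rank)
```

Because scores are rounded to two decimals before ranking, ties like the one in the QDA row are likely in practice.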

In [None]:
# models to test
model_dict =  {"LR": LogisticRegression(),
                #  MLPClassifier(), 
                #  KNeighborsClassifier(),
                #  SVC(),
                #  RandomForestClassifier(),
                #  GaussianNB(),
                #  GaussianProcessClassifier(), 
                #  DecisionTreeClassifier(), 
                #  AdaBoostClassifier(), 
                "QDA": QuadraticDiscriminantAnalysis()
                }

# main(get_splice_data, 'Splice', model_dict)   # commented out because splice throws a ValueError when run:

# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
# There are no NaNs in the dataset. 
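
Since the raw data has no NaNs, one guess is that the non-finite values appear only after encoding. That is an assumption I have not verified. A check along these lines could help narrow down which encoded columns are responsible. The helper name and sample values are hypothetical.

```python
import math


def non_finite_columns(rows, columns):
    """Return the names of columns holding any NaN or infinite value."""
    bad = []
    for j, name in enumerate(columns):
        if any(not math.isfinite(float(row[j])) for row in rows):
            bad.append(name)
    return bad


# Hypothetical encoded output, for illustration only.
columns = ['dna_0', 'dna_1', 'dna_2']
rows = [
    [0.0, 1.0, 2.0],
    [1.0, float('nan'), 0.5],
    [2.0, 3.0, float('inf')],
]

bad_columns = non_finite_columns(rows, columns)
print(bad_columns)
```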

## Run models and encoders on Cars dataset

In [None]:
cars_df = main(get_cars_data, 'Cars', model_dict) 

Note that the Cars dataset is quite imbalanced, with the vast majority of observations listed as unacceptable.
QDA warns us that the variables are collinear, and quite a few encoders perform very poorly with QDA.
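
To put the imbalance in perspective, it helps to know what accuracy a model would get by always predicting the majority class. The sketch below computes that baseline. The class counts are the ones I recall from the UCI Car Evaluation description, so treat them as an assumption and compare them against the `value_counts` output printed by the loader.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Assumed class counts (UCI Car Evaluation); verify against the loader output.
labels = ['unacc'] * 1210 + ['acc'] * 384 + ['good'] * 69 + ['vgood'] * 65

label, baseline_acc = majority_baseline(labels)
print(label, round(baseline_acc, 3))
```

Any encoder and model pair that scores near this baseline is not learning much beyond the class frequencies.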

## Cars dataset ranking

In [None]:
display(HTML(cars_df.sort_values(by=['mean', 'std'], ascending = True).to_html())) 

## Run models and encoders on Mushroom dataset

In [None]:
mushroom_df = main(get_mushroom_data, 'Mushroom', model_dict)  

Note that the Mushroom dataset's binary outcomes, "edible" and "poisonous", are fairly balanced. QDA performs much better with all the encoders.

## Mushroom dataset ranking

In [None]:
display(HTML(mushroom_df.sort_values(by=['mean', 'std'], ascending = True).to_html())) 

## Questions

Is there too much variation? How much do the QDA results change with different random seeds? Possibly a lot, because there is so much collinearity in the features.

Should we use a classification metric like ROC AUC instead of accuracy? Accuracy is simple, but it is not always what you want. Because the imbalanced Cars dataset is not a binary classification problem, this is less of an issue there. The Mushroom dataset is binary but balanced, so it is not a big concern there either.
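
To make the trade-off concrete, here is a toy comparison of plain accuracy with balanced accuracy (the mean of per-class recall) for a model that always predicts the majority class. The labels are made up. In sklearn, `cross_val_score` accepts a `scoring` argument, so a string such as `'balanced_accuracy'` could be tried in `score_models` without other changes; I have not rerun the benchmark that way.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels and a majority-class predictor.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```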

## Takeaways

There is a fair bit of variation for different encoding schemes over different trials, with different models, and across the data sets. We really need to look at regression and more data sets.

The hashing encoder performs consistently poorly. Is it just a poor fit for these data sets?
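
One possible contributor, offered as a hypothesis rather than a finding, is collisions: the hashing trick maps many category values into a small, fixed number of columns, so distinct values can land in the same one. The sketch below uses 8 buckets and MD5 as illustrative choices. It is not a replica of the package's implementation, and the category strings are hypothetical.

```python
import hashlib
from collections import Counter


def bucket(value: str, n_buckets: int = 8) -> int:
    """Map a string to one of n_buckets columns via an MD5 digest."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets


# Hypothetical "column=value" strings, for illustration only.
values = [f"cap-color={i}" for i in range(20)]

counts = Counter(bucket(v) for v in values)
for b in sorted(counts):
    print(f"bucket {b}: {counts[b]} distinct values")
```

With 20 distinct values and 8 buckets, at least one bucket must hold three or more values, and the model cannot tell those apart.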

Backward difference and ordinal do poorly on the Mushroom dataset, which makes sense because both depend on a meaningful ordering of the categories, and the Mushroom data is nominal.

Ordinal does great with the Mushroom dataset.

One-hot doesn't do terribly. With higher cardinality, it could have more problems.

AdaBoost had a very tough time with this, particularly with the Binary and BaseN encodings. It did better with Hashing, but still poorly. It performed decently with ordinal, backward difference, Helmert, and polynomial on the Cars dataset. I'm not sure about polynomial, but the other encoders do have real ordinal data to encode, so they should work better on Cars than on Mushroom.

## Future Projects

Use grid search cross-validation for parameter selection to make this a more realistic comparison of models.

