# sklearn's Category Encoders

#### Handling Categorical Data
* Creating dummy features from categorical values lets you include them in your modeling project: a single categorical column is converted into many binary columns, each indicating the presence or absence of one categorical level (one-hot encoding)
    * **sklearn's Category Encoders package:**
    * **TL;DR;**
        * Use Category Encoders to improve model performance when you have nominal or ordinal data that may provide value.
        * For nominal columns try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot for high cardinality columns and decision tree-based algorithms.
        * For ordinal columns try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. Helmert, Sum, BackwardDifference and Polynomial are less likely to be helpful, but if you have time or a theoretical reason you might want to try them.
        * For regression tasks, Target and LeaveOneOut probably won’t work well.
        
    * We should (arguably) classify data as 1 of 7 types to make better models faster
        * **Useless** — useless for machine learning algorithms, that is — discrete
        * **Nominal** — groups without order — discrete; groups do not overlap
        * **Binary** — either/or — discrete
        * **Ordinal** — groups with order — discrete; natural, ordered categories
        * **Count** — the number of occurrences — discrete
        * **Time** — cyclical numbers with a temporal component — continuous
        * **Interval** — positive and/or negative numbers without a temporal component — continuous
    * **Nominal Data:**
        * Has values that cannot be ordered in any meaningful way
        * Most often one-hot (dummy) encoded
    * **Ordinal Data:**
        * Ordinal data can be rank-ordered in a meaningful way
        * Can be encoded in one of three ways:
            * 1) It can be assumed to be close enough to interval data — with relatively equal magnitudes between the values — to treat it as such. 
            * 2) It can be treated as nominal data, where each category has no numeric relationship to another. You can try one-hot encoding and other encodings appropriate for nominal data.
            * 3) The magnitude of the difference between the numbers can be ignored. You can just train your model with different encodings and see which encoding works best.

#### sklearn's Category Encoders package
   * largely derived from StatsModel's Patsy package
   
   #### Classic Encoders:
        * Ordinal — convert string labels to integer values 1 through k. Ordinal.
        * OneHot — one column for each value to compare vs. all other values. Nominal, ordinal.
        * Binary — convert each integer to binary digits. Each binary digit gets one column. Some info loss but fewer dimensions. Ordinal.
        * BaseN — Ordinal, Binary, or higher encoding. Nominal, ordinal. Doesn’t add much functionality. Probably avoid.
        * Hashing — Like OneHot but fewer dimensions, some info loss due to collisions. Nominal, ordinal.
        * Sum — Just like OneHot except one value is held constant and encoded as -1 across all columns.
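
The Sum description above can be made concrete with a small plain-Python sketch. This is only an illustration of the idea, not the Category Encoders implementation: the function name `sum_encode` and the choice of the last-seen level as the held-out value are assumptions made here.

```python
def sum_encode(values):
    """Sum (deviation) coding sketch: one-hot columns for every level except
    one held-out level, which is encoded as -1 across all columns."""
    levels = list(dict.fromkeys(values))  # unique values in first-seen order
    reference = levels[-1]                # assumption: hold out the last-seen level
    kept = levels[:-1]                    # levels that get their own column
    rows = []
    for value in values:
        if value == reference:
            rows.append([-1] * len(kept))
        else:
            rows.append([1 if value == level else 0 for level in kept])
    return kept, rows


kept, rows = sum_encode(["a", "c", "a", "a", "b", "b"])
print(kept)  # ['a', 'c']
print(rows)  # [[1, 0], [0, 1], [1, 0], [1, 0], [-1, -1], [-1, -1]]
```

Here `b` is the held-out level, so both of its rows are `-1` in every column.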
        
   #### Contrast Encoders:
        * The five contrast encoders all have multiple issues that I argue make them unlikely to be useful for machine learning. They all output one column for each value found in a column.
        * Helmert (reverse) — The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.
        * Backward Difference — the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level.
        * Polynomial — orthogonal polynomial contrasts. The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable.

   #### Bayesian Encoders:
       * The Bayesian encoders use information from the dependent variable in their encodings. They output one column and can work well with high cardinality data.
       * Target — use the mean of the DV, must take steps to avoid overfitting/ response leakage. Nominal, ordinal. For classification tasks.
       * LeaveOneOut — similar to target but avoids contamination. Nominal, ordinal. For classification tasks.
       * WeightOfEvidence — added in v1.3. Not documented in the docs as of April 11, 2019. The method is explained in this post.
       * James-Stein — forthcoming in v1.4. Described in the code here.
       * M-estimator — forthcoming in v1.4. Described in the code here. Simplified target encoder.
       
   * Note that all Category Encoders impute missing values automatically by default. However, I recommend filling missing data yourself prior to encoding so you can test the results of several methods.
   * **Some terminology:**
       * *k* is the original number of unique values in your data column
       * *High cardinality* means a lot of unique values (a large *k*)
       * *High dimensionality* means a matrix with many dimensions; comes with Curse of Dimensionality (often results in overfitting)
       * *Sparse* data is a matrix with lots of zeroes relative to other values. Some algorithms may not work well with sparse data
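
To make the Target and LeaveOneOut descriptions in the Bayesian encoders list more concrete, here is a minimal plain-Python sketch. It is not the library's code: the function names are hypothetical, no smoothing is applied, and falling back to the global mean for a category that appears only once is an assumption of this sketch.

```python
from collections import defaultdict


def category_stats(categories, targets):
    """Sum and count of the target for each category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return sums, counts


def target_encode(categories, targets):
    """Replace each category with the mean target of that category."""
    sums, counts = category_stats(categories, targets)
    return [sums[c] / counts[c] for c in categories]


def leave_one_out_encode(categories, targets):
    """Like target_encode, but each row's own target is excluded from its mean."""
    sums, counts = category_stats(categories, targets)
    global_mean = sum(targets) / len(targets)
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(global_mean)  # assumption: singleton falls back to global mean
    return encoded


color = ["a", "c", "a", "a", "b", "b"]
outcome = [1, 2, 0, 0, 0, 1]
print([round(v, 2) for v in target_encode(color, outcome)])
# [0.33, 2.0, 0.33, 0.33, 0.5, 0.5]
print([round(v, 2) for v in leave_one_out_encode(color, outcome)])
# [0.0, 0.67, 0.5, 0.5, 1.0, 0.0]
```

Both functions return a single numeric column regardless of *k*, which is why this family of encoders can suit high cardinality data. The plain target mean lets each row see its own outcome, which is the leakage that the leave-one-out variant tries to reduce.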

#### The basic code setup for all examples to follow:

```
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
pd.options.display.float_format = '{:.2f}'.format  # two decimal places, to make output legible

# make some data
df = pd.DataFrame({
    'color':["a","c","a","a","b","b"],
    'outcome':[1,2,0,0,0,1]})

# set up X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)
```

#### Ordinal
* OrdinalEncoder converts each string value to a whole number. The first unique value in your column becomes 1, the second becomes 2, the third becomes 3, and so on.
* If the column contains nominal data, stopping after you use OrdinalEncoder is a bad idea. Your machine learning algorithm will treat the variable as continuous and assume the values are on a meaningful scale. Instead, if you have a column with the values car, bus, and truck, first encode this nominal data using OrdinalEncoder, then encode it again using one of the methods appropriate to nominal data.
* If your column values are truly ordinal, that means that the integer assigned to each value is meaningful. Assignment should be done with intention. Say your column had the string values “First”, “Third”, and “Second” in it. Those values should be mapped to the corresponding integers by passing OrdinalEncoder a list of dicts:

```
[{"col": "finished_race_order", 
  "mapping": [("First", 1), 
              ("Second", 2), 
              ("Third", 3)]
}]
```
* Use OrdinalEncoder to transform the color column values from letters to integers:

```
ce_ord = ce.OrdinalEncoder(cols = ['color'])  # the parameter is `cols`, not `col`
ce_ord.fit_transform(X, y['outcome'])
```
* Scikit-learn’s OrdinalEncoder does much the same thing as Category Encoders’ OrdinalEncoder, but is not quite as user friendly. It won’t return a pandas DataFrame; by default it returns a NumPy array even if you pass a DataFrame. It also outputs values starting with 0, whereas the Category Encoders version starts with 1 by default.
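
The numbering behaviour described in this section can be sketched in plain Python. This is an illustration under stated assumptions rather than the package's implementation: `ordinal_encode` is a hypothetical helper that numbers categories from 1 in first-seen order unless an explicit mapping is supplied.

```python
def ordinal_encode(values, mapping=None):
    """Map each category to an integer.

    Without an explicit mapping, categories are numbered 1, 2, 3, ...
    in the order they first appear.
    """
    if mapping is None:
        mapping = {}
        for value in values:
            if value not in mapping:
                mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping


colors = ["a", "c", "a", "a", "b", "b"]
encoded, learned = ordinal_encode(colors)
print(encoded)  # [1, 2, 1, 1, 3, 3]
print(learned)  # {'a': 1, 'c': 2, 'b': 3}

# Truly ordinal data: supply the order deliberately instead of
# relying on the order in which values happen to appear.
race_order = {"First": 1, "Second": 2, "Third": 3}
finishes, _ = ordinal_encode(["First", "Third", "Second"], mapping=race_order)
print(finishes)  # [1, 3, 2]
```

Note that `c` gets 2 and `b` gets 3 purely because `c` appears first, which is why an explicit mapping matters for ordinal columns.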

#### OneHot
* One-hot encoding is the classic approach to dealing with nominal, and maybe ordinal, data. It’s referred to as “The Standard Approach for Categorical Data” in Kaggle’s Machine Learning tutorial series. It also goes by the names dummy encoding, indicator encoding, and occasionally binary encoding.

```
ce_one_hot = ce.OneHotEncoder(cols = ['color'])  # the parameter is `cols`, not `col`
ce_one_hot.fit_transform(X, y)
```
* One-hot encoding can perform very well, but the number of new features is equal to k, the number of unique values. This feature expansion can create serious memory problems if your dataset has high cardinality features. One-hot-encoded data can also be difficult for decision-tree-based algorithms.
* The pandas `get_dummies` function and scikit-learn’s OneHotEncoder class perform the same role as the Category Encoders OneHotEncoder.
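
As a rough illustration of why one-hot encoding produces k new columns, here is a plain-Python sketch. The helper `one_hot_encode` and its `prefix_level` column names are assumptions made for this example; the libraries mentioned above name and order their output columns in their own ways.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) with one 0/1 column per unique value."""
    levels = list(dict.fromkeys(values))  # unique values in first-seen order
    columns = [f"{prefix}_{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return columns, rows


columns, rows = one_hot_encode(["a", "c", "a", "a", "b", "b"], prefix="color")
print(columns)  # ['color_a', 'color_c', 'color_b']
print(rows)
# [[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
```

With three unique colors the single column becomes three; a column with thousands of unique values would become thousands of mostly-zero columns.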

#### Binary 
* Binary encoding can be thought of as a hybrid of one-hot and hashing encoders. Binary creates fewer features than one-hot, while preserving some uniqueness of values in the column. It can work well with higher cardinality ordinal data.
* Here’s how it works:
    * The categories are encoded by OrdinalEncoder if they aren’t already in numeric form.
    * Then those integers are converted into binary code, so for example 5 becomes 101 and 10 becomes 1010.
    * Then the digits from that binary string are split into separate columns. So if there are 4–7 values in an ordinal column then 3 new columns are created: one for the first bit, one for the second, and one for the third.
    * Each observation is encoded across the columns in its binary form.
    
```
ce_bin = ce.BinaryEncoder(cols = ['color'])
ce_bin.fit_transform(X, y)
```
* Binary encoding really shines when the cardinality of the column is higher.
* Binary encoding creates fewer columns than one-hot encoding. It is more memory efficient. It also reduces the chances of dimensionality problems with higher cardinality.
* Binary encoding is a decent compromise for ordinal data with high cardinality.
* For nominal data a hashing algorithm with more fine-grained control usually makes more sense.
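
The three steps listed under "Here’s how it works" can be sketched in plain Python as follows. This is a simplified illustration, not the BinaryEncoder source: `binary_encode` is a hypothetical helper, and numbering categories from 1 in first-seen order is an assumption carried over from the Ordinal section.

```python
def binary_encode(values):
    # Step 1: ordinal-encode in first-seen order, starting at 1.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Step 2: work out how many binary digits the largest integer needs.
    n_bits = max(codes.values()).bit_length()
    # Step 3: split each integer's binary digits into separate columns.
    return [[int(bit) for bit in format(codes[value], f"0{n_bits}b")]
            for value in values]


rows = binary_encode(["a", "c", "a", "a", "b", "b"])
print(rows)
# a -> 1 -> 01, c -> 2 -> 10, b -> 3 -> 11
# [[0, 1], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]

print(format(5, "b"), format(10, "b"))  # 101 1010
```

Three categories fit in two columns here, and anything from four to seven categories needs three, so the column count grows with the logarithm of k rather than with k itself.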

#### BaseN
* When BaseN’s base is 1 it is basically the same as one-hot encoding. When the base is 2 it is basically the same as binary encoding. McGinnis said of this encoder, “Practically, this adds very little new functionality, rarely do people use base-3 or base-8 or any base other than ordinal or binary in real problems.”
* The main reason for BaseN’s existence is to possibly make grid searching easier. You could use BaseN with scikit-learn’s GridSearchCV. However, if you’re going to grid search with these encoding options, you can make the encoder part of your scikit-learn pipeline and put the options in your parameter grid.

```
ce_basen = ce.BaseNEncoder(cols = ['color'])  # class name is BaseNEncoder
ce_basen.fit_transform(X, y)
```
* The default base for BaseNEncoder is 2, which is the equivalent of BinaryEncoder.
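
To show how the base changes the encoding, the sketch below converts ordinal codes into digit columns for an arbitrary base. It is a plain-Python illustration, not BaseNEncoder itself: `to_base_n_digits` is a hypothetical helper, and it only covers bases of 2 or more (base 1 behaves like one-hot and is not handled here).

```python
def to_base_n_digits(number, base, width):
    """Digits of a non-negative integer in `base`, left-padded with zeros to `width`."""
    if base < 2:
        raise ValueError("this sketch only covers base >= 2")
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(remainder)          # least significant digit first
    digits.extend([0] * (width - len(digits)))
    return digits[::-1]                   # most significant digit first


codes = [1, 2, 1, 1, 3, 3]  # ordinal codes for a, c, a, a, b, b

base2 = [to_base_n_digits(code, base=2, width=2) for code in codes]
print(base2)  # [[0, 1], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]

base3 = [to_base_n_digits(code, base=3, width=2) for code in codes]
print(base3)  # [[0, 1], [0, 2], [0, 1], [0, 1], [1, 0], [1, 0]]
```

The base-2 output matches a binary encoding of the same codes; a higher base packs more categories into the same number of columns, at the cost of columns that are no longer 0/1.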

#### Hashing
* HashingEncoder implements the hashing trick. It is similar to one-hot encoding but with fewer new dimensions and some info loss due to collisions. The collisions do not significantly affect performance unless there is a great deal of overlap.

```
ce_hash = ce.HashingEncoder(cols = ['color'])
ce_hash.fit_transform(X, y)
```
* The n_components parameter controls the number of expanded columns. The default is eight columns.
* If you set n_components less than k, more values collide into the same column, so the encoded data carries slightly less information. In exchange you get fewer dimensions.
* You can pass a hashing algorithm of your choice to HashingEncoder; the default is md5. Hashing algorithms have been very successful in some Kaggle competitions. It’s worth trying HashingEncoder for nominal and ordinal data if you have high cardinality features.
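
The hashing trick and the effect of collisions can be sketched with the standard library's `hashlib`. This is a simplified illustration and should not be read as HashingEncoder's exact behaviour: `hash_encode` is a hypothetical helper, and the parameter names simply mirror the `n_components` and md5 defaults mentioned above.

```python
import hashlib


def hash_encode(values, n_components=8, hash_method="md5"):
    """Put a 1 in the column chosen by hashing each value; all other columns are 0."""
    rows = []
    for value in values:
        digest = hashlib.new(hash_method, str(value).encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_components  # hash value folded into a column index
        row = [0] * n_components
        row[column] = 1
        rows.append(row)
    return rows


colors = ["a", "c", "a", "a", "b", "b"]
wide = hash_encode(colors)                     # 8 columns, the default here
narrow = hash_encode(colors, n_components=2)   # fewer columns than unique values

print(len(wide[0]), len(narrow[0]))  # 8 2
# With 3 unique values and only 2 columns, at least two values must share a column.
print(len({tuple(row) for row in narrow}))
```

The output width is fixed by `n_components` rather than by k, and the same value always lands in the same column, so no fitted vocabulary is needed. The price is that distinct values can become indistinguishable once they collide.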

#### Recap:
* For nominal columns try OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot for high cardinality columns.
* For ordinal columns try Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target. Helmert, Sum, BackwardDifference and Polynomial are less likely to be helpful, but if you have time or a theoretical reason you might want to try them.
* The Bayesian encoders can work well for some machine learning tasks. 