In [1]:
%%capture
%cd ..

### Table of Contents

- [Quickstart](#Quickstart)
- [Automatic Encoding of Missing and Unknown Values](#Automatic-Encoding-of-Missing-and-Unknown-Values)
- [Reducing Cardinality of Encoded Labels](#Reducing-Cardinality-of-Encoded-Labels)

---

### Import Libraries and Data

In [2]:
from encoders.OneHotLabelEncoder import OneHotLabelEncoder

import numpy as np
import pandas as pd

For this tutorial, we will be using the [Servo dataset](https://archive.ics.uci.edu/ml/datasets/Servo).

**Dataset Characteristics**  
- Target variable: <font color='green'>class</font>
- Attributes: <font color='green'>4</font> *(2 categorical, 2 integer)*  
- Rows: <font color='green'>167</font>

In [3]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/servo/servo.data'
cols = ['motor', 'screw', 'pgain', 'vgain', 'class']

df = pd.read_csv(url, names=cols)
df.head(5)

Unnamed: 0,motor,screw,pgain,vgain,class
0,E,E,5,4,0.281251
1,B,D,6,5,0.506252
2,D,D,4,3,0.356251
3,B,A,3,2,5.500033
4,D,B,6,5,0.356251


---

### Quickstart

It's simple: just instantiate OneHotLabelEncoder, then fit and transform the DataFrame!  
Categorical features are selected automatically.

In [4]:
ohle = OneHotLabelEncoder().fit(df)
ohle.transform(df.head(5))

Unnamed: 0,motor_A,motor_B,motor_C,motor_D,motor_E,screw_A,screw_B,screw_C,screw_D,screw_E,pgain,vgain,class
0,0,0,0,0,1,0,0,0,0,1,5,4,0.281251
1,0,1,0,0,0,0,0,0,1,0,6,5,0.506252
2,0,0,0,1,0,0,0,0,1,0,4,3,0.356251
3,0,1,0,0,0,1,0,0,0,0,3,2,5.500033
4,0,0,0,1,0,0,1,0,0,0,6,5,0.356251
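
The source does not show how OneHotLabelEncoder decides which columns are categorical. As a rough sketch only, assuming the choice is based on column dtype, non-numeric columns can be picked out with pandas like this (the toy `frame` below is built from the first two Servo rows and is purely illustrative):

```python
import pandas as pd

# Two rows shaped like the Servo data (values copied from df.head()).
frame = pd.DataFrame({
    "motor": ["E", "B"],
    "screw": ["E", "D"],
    "pgain": [5, 6],
    "vgain": [4, 5],
    "class": [0.281251, 0.506252],
})

# Assumption: "categorical" means "not numeric". The encoder's real rule may differ.
categorical = frame.select_dtypes(exclude="number").columns.tolist()
print(categorical)
```

If the automatic choice is not what you want, the ``labels`` parameter used later in this tutorial lets you name the columns explicitly.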


---

### Automatic Encoding of Missing and Unknown Values

You might be wondering...  

- Why not use pandas.get_dummies?
    - get_dummies only produces consistent columns if exactly the same labels appear in both the training and test datasets!
    - If labels are missing from the test set (or new labels appear in it), you are not guaranteed to get the same number of columns in training and test.
  
  
- Why not use LabelEncoder?
    - LabelEncoder will not transform unknown values: it raises an error on labels it did not see during fitting!
    - A separate LabelEncoder also needs to be created for each encoded column, which is tedious to code.
  
  
- Why not use OneHotEncoder?
    - Older versions of OneHotEncoder (scikit-learn before 0.20) can't handle string inputs!
    - Older versions of OneHotEncoder also can't handle missing values.
    
OneHotLabelEncoder solves all these problems!
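
To make the get_dummies problem above concrete, here is a small sketch using only pandas. The `train` and `test` frames are hypothetical toy data, not the Servo dataset. It shows the column mismatch and the manual realignment you would otherwise have to write yourself:

```python
import pandas as pd

# Toy train/test frames (hypothetical data).
train = pd.DataFrame({"motor": ["A", "B", "C"]})
test = pd.DataFrame({"motor": ["A", "D"]})  # "D" never appears in train

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

print(list(train_dummies.columns))  # ['motor_A', 'motor_B', 'motor_C']
print(list(test_dummies.columns))   # ['motor_A', 'motor_D']

# Manual workaround: force the test columns to match the training columns.
# The unseen label "D" is dropped; the absent labels "B" and "C" are zero-filled.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))   # ['motor_A', 'motor_B', 'motor_C']
```

Note that the realignment silently loses the information that row 1 held an unseen label, which is what an indicator column for unknown values is meant to preserve.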

In [5]:
# labels parameter allows us to explicitly specify columns to encode
# delete=False keeps original data column
# ignore_col=True adds indicator column for unknown values
# missing_col=True adds indicator column for missing values

ohle2 = OneHotLabelEncoder(labels=['vgain'], delete=False, ignore_col=True, missing_col=True).fit(df)

Let's intentionally add a missing value and an unknown value to the dataset.

In [6]:
df2 = df.copy()

df2.iloc[0, 3] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
df2.iloc[1, 3] = -999

df2.head()

Unnamed: 0,motor,screw,pgain,vgain,class
0,E,E,5,,0.281251
1,B,D,6,-999.0,0.506252
2,D,D,4,3.0,0.356251
3,B,A,3,2.0,5.500033
4,D,B,6,5.0,0.356251


Voila, the missing value is captured in **missing_vgain** and -999 is captured in **ignore_vgain**!

In [7]:
ohle2.transform(df2.head())

Unnamed: 0,motor,screw,pgain,vgain,vgain_1,vgain_2,vgain_3,vgain_4,vgain_5,ignore_vgain,missing_vgain,class
0,E,E,5,,0,0,0,0,0,0,1,0.281251
1,B,D,6,-999.0,0,0,0,0,0,1,0,0.506252
2,D,D,4,3.0,0,0,1,0,0,0,0,0.356251
3,B,A,3,2.0,0,1,0,0,0,0,0,5.500033
4,D,B,6,5.0,0,0,0,0,1,0,0,0.356251
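
For intuition, the same three kinds of columns can be reproduced by hand with plain pandas. This is only a sketch of the idea, not OneHotLabelEncoder's actual implementation; the `known` list and the toy `vgain` series are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Labels assumed to have been seen during fitting (vgain takes values 1..5).
known = [1, 2, 3, 4, 5]

# Toy column with one missing value and one unseen value (-999).
vgain = pd.Series([np.nan, -999, 3, 2, 5], name="vgain")

# One indicator column per known label; NaN compares unequal to every label.
encoded = pd.DataFrame({f"vgain_{k}": (vgain == k).astype(int) for k in known})

# Unknown: present, but not among the known labels.
encoded["ignore_vgain"] = (vgain.notna() & ~vgain.isin(known)).astype(int)

# Missing: the value is NaN.
encoded["missing_vgain"] = vgain.isna().astype(int)

print(encoded)
```

Every row ends up with exactly one indicator set, so no observation is silently encoded as all zeros.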


---

### Reducing Cardinality of Encoded Labels

Sometimes there are simply too many labels to encode, and we don't want a column for every unique label.  OneHotLabelEncoder supports filtering with the ``top`` parameter!  

If ``top`` is an int, only that many of the most frequent label values are kept; if ``top`` is a float, only that top percentage of the most frequent label values is kept. Less common labels are discarded.  
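
The integer case can be sketched by hand with `value_counts`. This is an illustration of the idea on hypothetical toy data, not the library's own code, and it deliberately leaves out the float case because the exact percentage rule is not spelled out here:

```python
import pandas as pd

# Toy column (hypothetical data): "C" is the most frequent label.
motor = pd.Series(["C", "C", "C", "B", "B", "A"], name="motor")

top = 1  # keep only the single most frequent label

# Labels that survive the cut, ordered from most to least frequent.
kept = motor.value_counts().nlargest(top).index.tolist()

# One indicator column per kept label, plus one catch-all column for the rest.
encoded = pd.DataFrame({f"motor_{label}": (motor == label).astype(int) for label in kept})
encoded["ignore_motor"] = (~motor.isin(kept)).astype(int)

print(encoded)
```

With five labels and ``top=1``, this yields two columns instead of five, at the cost of no longer distinguishing between the less frequent labels.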

Here's an example with the **motor** and **screw** columns, where *C* and *A* are, respectively, the most frequent labels.

In [8]:
df['motor'].value_counts()

C    40
B    36
A    36
E    33
D    22
Name: motor, dtype: int64

In [9]:
df['screw'].value_counts()

A    42
B    35
C    31
D    30
E    29
Name: screw, dtype: int64

Boom. All other labels that don't satisfy ``top`` get moved into the **``ignore_<feature_name>``** column.

In [10]:
ohle3 = OneHotLabelEncoder(top=1, ignore_col=True, delete=False).fit(df)
ohle3.transform(df.tail())

Unnamed: 0,motor,motor_C,ignore_motor,screw,screw_A,ignore_screw,pgain,vgain,class
162,B,0,1,C,0,1,3,2,4.499986
163,B,0,1,E,0,1,3,1,3.699967
164,C,1,0,D,0,1,4,3,0.956256
165,A,0,1,B,0,1,3,2,4.499986
166,A,0,1,A,1,0,6,5,0.806255
