# One-hot encoder

`sklearn.preprocessing.OneHotEncoder` is the scikit-learn implementation of one-hot encoding. This page covers some details of working with this tool.

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from IPython.display import HTML, display

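# Fix the seed so the sampled columns are reproducible
np.random.seed(10)
# Categories "a" to "d"
categories = [chr(i) for i in range(ord("a"), ord("e"))]
sample_size = 10

# Two categorical columns sampled independently from the same categories
test_frame = pd.DataFrame({
    "col1": np.random.choice(categories, sample_size),
    "col2": np.random.choice(categories, sample_size)
})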

header_template = "<p style='font-size:17px'>{}</p>"

## Basic example

Here is how it works in the most basic case: each category found in a column during fitting becomes a separate 0/1 output column.

In [2]:
ohe_transformer = OneHotEncoder(sparse_output=False)

display(HTML(header_template.format("Input DataFrame")))
display(test_frame)
display(HTML(header_template.format("Output as DataFrame")))
display(pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns = ohe_transformer.get_feature_names_out()
))

Unnamed: 0,col1,col2
0,b,a
1,b,b
2,a,b
3,d,c
4,a,a
5,b,b
6,d,a
7,a,c
8,b,a
9,b,c


Unnamed: 0,col1_a,col1_b,col1_d,col2_a,col2_b,col2_c
0,0.0,1.0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0
4,1.0,0.0,0.0,1.0,0.0,0.0
5,0.0,1.0,0.0,0.0,1.0,0.0
6,0.0,0.0,1.0,1.0,0.0,0.0
7,1.0,0.0,0.0,0.0,0.0,1.0
8,0.0,1.0,0.0,1.0,0.0,0.0
9,0.0,1.0,0.0,0.0,0.0,1.0


## `categories`

The `categories` argument lets you choose which categories become output columns. It takes one list of categories per input column.

### Exclude categories

Suppose you don't want certain columns for some categories. For example, you might want to optimise your pipeline and remove unimportant columns.

You can simply specify `categories` and omit the categories you don't want to see in the result. Because the data still contains the omitted values, `handle_unknown='ignore'` is set as well, as in the following cell.

In [3]:
ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b"],
        ["a", "c"]
    ],
    handle_unknown='ignore'
)

display(HTML(header_template.format("Input DataFrame")))
display(test_frame)
display(HTML(header_template.format("Output as DataFrame")))
display(pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns = ohe_transformer.get_feature_names_out()
))

Unnamed: 0,col1,col2
0,b,a
1,b,b
2,a,b
3,d,c
4,a,a
5,b,b
6,d,a
7,a,c
8,b,a
9,b,c


Unnamed: 0,col1_a,col1_b,col2_a,col2_c
0,0.0,1.0,1.0,0.0
1,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,1.0,0.0,1.0,0.0
5,0.0,1.0,0.0,0.0
6,0.0,0.0,1.0,0.0
7,1.0,0.0,0.0,1.0
8,0.0,1.0,1.0,0.0
9,0.0,1.0,0.0,1.0


**Note:** if some values are omitted from `categories` and `drop='first'` is also set, the encoder can lose information silently. Using `drop='first'` on its own does not lose information, because the dropped column can be restored from the remaining ones: a row of all zeros means the dropped category. But values omitted from `categories` are also encoded as all zeros, so they become indistinguishable from the dropped category. In this case sklearn will generate a warning, even though its text does not correspond to the problem.

The following cell shows an example of such a case.

In [5]:
ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b", "c", "d"],
        ["a", "c"]
    ],
    handle_unknown='ignore',
    drop="first"
)

display(HTML(header_template.format("Input DataFrame")))
display(test_frame)
display(HTML(header_template.format("Output as DataFrame")))
display(pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns = ohe_transformer.get_feature_names_out()
))

Unnamed: 0,col1,col2
0,b,a
1,b,b
2,a,b
3,d,c
4,a,a
5,b,b
6,d,a
7,a,c
8,b,a
9,b,c




Unnamed: 0,col1_b,col1_c,col1_d,col2_c
0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,1.0
4,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0
7,0.0,0.0,0.0,1.0
8,1.0,0.0,0.0,0.0
9,1.0,0.0,0.0,1.0


### Extra categories

You can also list categories that never occur in the dataframe used for fitting. Each such category still gets its own output column, which contains only zeros. The following cell demonstrates this with the category `m` in `col1`.

In [4]:
ohe_transformer = OneHotEncoder(
    sparse_output=False,
    categories=[
        ["a", "b", "m"],
        ["a", "b", "c"]
    ],
    handle_unknown='ignore'
)

display(HTML(header_template.format("Input DataFrame")))
display(test_frame)
display(HTML(header_template.format("Output as DataFrame")))
display(pd.DataFrame(
    ohe_transformer.fit_transform(test_frame),
    columns = ohe_transformer.get_feature_names_out()
))

Unnamed: 0,col1,col2
0,b,a
1,b,b
2,a,b
3,d,c
4,a,a
5,b,b
6,d,a
7,a,c
8,b,a
9,b,c


Unnamed: 0,col1_a,col1_b,col1_m,col2_a,col2_b,col2_c
0,0.0,1.0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,1.0,0.0,0.0
5,0.0,1.0,0.0,0.0,1.0,0.0
6,0.0,0.0,0.0,1.0,0.0,0.0
7,1.0,0.0,0.0,0.0,0.0,1.0
8,0.0,1.0,0.0,1.0,0.0,0.0
9,0.0,1.0,0.0,0.0,0.0,1.0
