# One-Hot Encoding
## Introduction
Encoding is the process of converting categorical values into numerical values. It is widely used during data preprocessing because datasets often contain categorical values that cannot be used directly to train Machine Learning models. Two common encoding techniques are Label Encoding and One-Hot Encoding.
One-Hot Encoding is one of the most popular encoding techniques; it is often loosely called Dummy Encoding. It creates a new column for each category and assigns a binary value to each column: 1 if the row belongs to that category and 0 otherwise. The number of columns created equals the number of unique categories in the categorical column. The new columns are also known as dummy variables.
One-Hot Encoding is mostly used for categorical features with two or more categories, as long as the number of categories stays small. Features with a large number of categories are a poor fit because every category adds a column, which inflates the dimensionality of the dataset (the Curse of Dimensionality) and leads to a more complex Machine Learning model. Dummy variables are also highly correlated with one another, since each one can be derived from the rest; this is known as the Dummy Variable Trap. In such cases, Label Encoding is an alternative, as it produces only one column for the categorical feature.
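
To make the column-per-category idea concrete, here is a minimal NumPy sketch on a made-up `colors` column (the data and variable names are illustrative and not part of the Iris example). It derives the label codes, builds the one-hot matrix from them, and shows that every row sums to 1, which is the redundancy behind the Dummy Variable Trap. Dropping one column is shown as one possible way to remove that redundancy.

```python
import numpy as np

# hypothetical toy column, used only for illustration
colors = np.array(['red', 'green', 'blue', 'green'])

# sorted unique categories: one output column per category
categories = np.unique(colors)

# label encoding: a single integer column (position in the sorted categories)
labels = np.searchsorted(categories, colors)

# one-hot encoding: pick the matching row of an identity matrix for each label
one_hot = np.eye(len(categories), dtype=int)[labels]

# each row sums to 1, so any one column can be derived from the others
row_sums = one_hot.sum(axis=1)

# dropping one column removes that redundancy
reduced = one_hot[:, 1:]

print(labels)
print(one_hot)
```

Label Encoding yields one column here, while One-Hot Encoding yields three, one per unique color.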

In [156]:
# importing necessary libraries
import numpy as np
import pandas as pd

In [157]:
# importing the iris species dataset
dataset = pd.read_csv('./../../datasets/iris_species/Iris.csv')
dataset.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [158]:
dataset['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

## Use Case
Here, the number of unique categories is 3. This is well suited to One-Hot Encoding because it creates only 3 new columns, which adds little complexity to the dataset. Furthermore, the feature is not ordinal, so Label Encoding might introduce an unwanted bias by implying an order among the species. Therefore, we use One-Hot Encoding in this context.
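
The ordering concern can be checked numerically. The sketch below assumes the three species are label-encoded as 0, 1, 2 in alphabetical order (an illustrative choice) and compares the distances between categories under each encoding. The helper `pairwise_distances` is hypothetical and defined only for this example.

```python
import numpy as np

species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# label encoding (assumed alphabetical codes 0, 1, 2), as a single column
label_codes = np.arange(len(species), dtype=float).reshape(-1, 1)

# one-hot encoding: each species is a row of the identity matrix
one_hot_codes = np.eye(len(species))

def pairwise_distances(codes):
    # Euclidean distance between every pair of encoded categories
    diff = codes[:, None, :] - codes[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

label_dist = pairwise_distances(label_codes)
one_hot_dist = pairwise_distances(one_hot_codes)

print(label_dist)    # setosa is twice as far from virginica as from versicolor
print(one_hot_dist)  # every pair of distinct species is equally far apart
```

Under label codes, a distance-based model would treat some species as closer than others, even though the data implies no such order. The one-hot vectors keep all species equidistant.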

The OneHotEncoder class below is a simplified version that closely mirrors the OneHotEncoder class in scikit-learn.

In [159]:
# One Hot Encoding class
class OneHotEncoder:
    # this class can map only a single column at a time
    def __init__(self, sparse=True, dtype=float, handle_unknown='ignore'):
        self.sparse = sparse  # kept for API parity only; transform currently always returns a dense NumPy array
        self.dtype = dtype  # dtype of the encoded output -> float | int
        self.handle_unknown = handle_unknown  # only 'ignore' is supported for now: unknown values encode to an all-zero row and decode to None

        self.categories_ = None

    # fit the class with the incoming data
    def fit(self, X):
        # the input X is expected to be a single column value / series
        self.categories_ = np.unique(X)
        return self

    # transform the incoming data to one hot encoded form
    def transform(self, X):
        X = np.array(X)
        X_transformed = np.zeros((X.shape[0], len(self.categories_)), dtype=self.dtype)
        for i, category in enumerate(self.categories_):
            X_transformed[:, i] = (X == category).astype(int)
        return X_transformed

    # fitting the class with the incoming data and transforming it into one-hot encoded form
    def fit_transform(self, X):
        return self.fit(X).transform(X)

    # reverting the encoded data into its original form
    def inverse_transform(self, X):
        X = np.array(X)
        # start from None so that all-zero rows (unknown categories) stay None,
        # regardless of whether the categories are stored as object or string dtype
        X_transformed = np.full(X.shape[0], None, dtype=object)

        for i, category in enumerate(self.categories_):
            X_transformed[X[:, i] == 1] = category

        return X_transformed.tolist()
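
The loop in `transform` compares the input against one category at a time. As an aside, the same matrix could be built in a single step with NumPy broadcasting. The function names `one_hot_loop` and `one_hot_broadcast` below are hypothetical and not part of the class; this is a sketch of an alternative, not a required change.

```python
import numpy as np

def one_hot_loop(values, categories, dtype=float):
    # column-by-column construction, as in the transform method
    values = np.array(values)
    out = np.zeros((values.shape[0], len(categories)), dtype=dtype)
    for i, category in enumerate(categories):
        out[:, i] = (values == category).astype(int)
    return out

def one_hot_broadcast(values, categories, dtype=float):
    # compare an (n, 1) column of values with a (1, k) row of categories
    values = np.array(values)
    categories = np.array(categories)
    return (values[:, None] == categories[None, :]).astype(dtype)

categories = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
sample = ['Iris-virginica', 'Iris-setosa', 'Iris-setosa']

loop_result = one_hot_loop(sample, categories)
broadcast_result = one_hot_broadcast(sample, categories)
print(np.array_equal(loop_result, broadcast_result))
```

Both versions produce the same array; the broadcast form avoids the Python-level loop at the cost of being slightly less explicit.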

## Testing OneHotEncoder class
We test the class on our own dataset, i.e. the Species column of the Iris dataset.

In [160]:
# initializing the OneHotEncoder class
ohe = OneHotEncoder()

In [161]:
ohe.fit(dataset['Species'])

<__main__.OneHotEncoder at 0x216b8604f10>

In [162]:
ohe.categories_

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [163]:
ohe.transform(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [164]:
ohe.fit_transform(dataset['Species'])

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ... (output truncated)

In [165]:
ohe.inverse_transform([[0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0]])

['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
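
The class comments describe unknown categories being ignored. Here is a standalone sketch of that round trip, using hypothetical helper functions `encode` and `decode` rather than the class itself. The value `'Iris-unknown'` is made up for illustration; because it matches no known category it encodes to an all-zero row, which decodes back to `None`.

```python
import numpy as np

categories = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

def encode(values):
    # a value that matches no category produces a row of zeros
    return np.array([[float(v == c) for c in categories] for v in values])

def decode(matrix):
    matrix = np.array(matrix)
    # start from None so rows without a 1 remain None
    decoded = np.full(matrix.shape[0], None, dtype=object)
    for i, category in enumerate(categories):
        decoded[matrix[:, i] == 1] = str(category)
    return decoded.tolist()

encoded = encode(['Iris-setosa', 'Iris-unknown'])
decoded = decode(encoded)
print(encoded)
print(decoded)
```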