# Label Encoding
## Introduction
Encoding is the process of converting categorical values into numerical values. It is widely used during data preprocessing because datasets often contain categorical values that most Machine Learning models cannot use directly for training. Two of the most common encoding techniques are Label Encoding and One-Hot Encoding.
Label Encoding converts the unique categorical values of a feature into integers between 0 and n-1, where n is the number of unique categories. It is most useful when the number of categories is large and the categories are ordinal. For example, if the categories of a feature are “low”, “medium”, “high”, and “very high”, they have a natural order and can be encoded as 0, 1, 2, and 3.
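As a minimal sketch of the mapping described above, the snippet below encodes the four example categories in two ways. The `levels` list and the sample `values` are made up for illustration. A sorted-unique encoding (here via `np.unique`) assigns codes alphabetically, so keeping the ordinal meaning requires stating the order explicitly, for instance with an ordered `pd.Categorical`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
levels = ["low", "medium", "high", "very high"]
values = ["medium", "low", "very high", "high", "low"]

# Sorted-unique encoding: codes follow alphabetical order of the categories
classes, codes_sorted = np.unique(values, return_inverse=True)
print(classes.tolist())       # ['high', 'low', 'medium', 'very high']
print(codes_sorted.tolist())  # [2, 1, 3, 0, 1]

# Explicit ordinal encoding: codes follow the order given in `levels`
cat = pd.Categorical(values, categories=levels, ordered=True)
codes_ordinal = cat.codes
print(codes_ordinal.tolist())  # [1, 0, 3, 2, 0]
```

Both encodings use the integers 0 to n-1, but only the second one guarantees that a larger code means a higher level.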

In [35]:
# importing necessary libraries
import numpy as np
import pandas as pd

In [36]:
# importing the fish market dataset
dataset = pd.read_csv('./../../datasets/fish_market/Fish.csv')
dataset.head()

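| Unnamed: 0 | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.52 | 4.02 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.48 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 12.73 | 4.4555 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 12.444 | 5.134 |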


In [37]:
dataset['Species'].value_counts()

Perch        56
Bream        35
Roach        20
Pike         17
Smelt        14
Parkki       11
Whitefish     6
Name: Species, dtype: int64

## Use Case
Here, the number of unique categories is 7. We consider this too large for the One-Hot Encoding technique because it would create 7 more columns and increase the complexity of the model. One-hot columns also raise the issue of multicollinearity among the variables. Therefore, we use Label Encoding in this context.
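To make the column count concrete, here is a small sketch using `pd.get_dummies`. The `species` series is a hypothetical stand-in holding one row per species name from the counts above, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in: one row per species listed in the value counts
species = pd.Series(
    ["Perch", "Bream", "Roach", "Pike", "Smelt", "Parkki", "Whitefish"],
    name="Species",
)

# One-Hot Encoding creates one indicator column per category
one_hot = pd.get_dummies(species)
print(one_hot.shape)  # (7, 7)

# Every row has exactly one active indicator, so the columns always sum to 1;
# this linear dependence is the source of the multicollinearity
print((one_hot.sum(axis=1) == 1).all())  # True

# Dropping one column removes the dependence but still leaves 6 columns
reduced = pd.get_dummies(species, drop_first=True)
print(reduced.shape)  # (7, 6)
```

Label Encoding, by contrast, keeps a single column regardless of the number of categories.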

The LabelEncoder class below mirrors the LabelEncoder class in scikit-learn.

In [38]:
# Label Encoding class
class LabelEncoder:
    # initialize the class
    def __init__(self):
        self.classes_ = None

    # fit the class with the incoming data
    def fit(self, X):
        self.classes_ = np.unique(X)
        return self

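    # transform the incoming data into label encoded form
    def transform(self, X):
        X = np.array(X)
        X_transformed = np.zeros(X.shape, dtype=int)
        seen = np.zeros(X.shape, dtype=bool)
        for i, val in enumerate(self.classes_):
            mask = X == val
            X_transformed[mask] = i
            seen |= mask
        # raise on labels that were not present during fit instead of encoding them as 0
        if not seen.all():
            raise ValueError("X contains labels that were not seen during fit")
        return X_transformed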

    # fit the class with the incoming data and transform it into label encoded form
    def fit_transform(self, X):
        return self.fit(X).transform(X)

    # revert the encoded data into its original form
    def inverse_transform(self, X):
        X = np.array(X)
        X_transformed = np.zeros(X.shape).astype(self.classes_.dtype)
        for i, val in enumerate(self.classes_):
            X_transformed[X == i] = val
        return X_transformed

    # get the categorical classes
    def get_params(self):
        return {"classes_": self.classes_}
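
The loop in `transform` compares the input against every class in turn. Because `np.unique` returns the classes sorted, the same lookup can be sketched with `np.searchsorted`. The helper name `encode_sorted` is hypothetical and is not part of the class above or of scikit-learn's API.

```python
import numpy as np

def encode_sorted(classes, values):
    """Hypothetical helper: map each value to its position in the sorted `classes` array."""
    values = np.asarray(values)
    # binary search for the insertion position of each value
    idx = np.searchsorted(classes, values)
    # clip so that positions past the end can still be compared safely
    idx_clipped = np.clip(idx, 0, len(classes) - 1)
    # a value is unseen if the class at its position is not the value itself
    unseen = classes[idx_clipped] != values
    if unseen.any():
        raise ValueError(f"unseen labels: {np.unique(values[unseen]).tolist()}")
    return idx

classes = np.unique(["paris", "paris", "tokyo", "amsterdam"])
codes = encode_sorted(classes, ["tokyo", "tokyo", "paris"])
print(codes.tolist())  # [2, 2, 1]

# decoding is plain integer indexing into the sorted classes
decoded = classes[codes]
print(decoded.tolist())  # ['tokyo', 'tokyo', 'paris']
```

This avoids one pass over the data per class, which may matter when the number of categories is large.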


## Testing
The tests below are taken from the scikit-learn documentation.
Source: <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>

In [39]:
# initializing the LabelEncoder class
le = LabelEncoder()

In [40]:
# on numeric data
le.fit([1,2,2,6])

<__main__.LabelEncoder at 0x2870aa401c0>

In [41]:
le.classes_

array([1, 2, 6])

In [42]:
le.transform([1, 1, 2, 6])

array([0, 0, 1, 2])

In [43]:
le.inverse_transform([0, 0, 1, 2])

array([1, 1, 2, 6])

In [44]:
# on categorical data
le.fit(["paris", "paris", "tokyo", "amsterdam"])

<__main__.LabelEncoder at 0x2870aa401c0>

In [45]:
list(le.classes_)

['amsterdam', 'paris', 'tokyo']

In [46]:
le.transform(["tokyo", "tokyo", "paris"])

array([2, 2, 1])

In [47]:
list(le.inverse_transform([2, 2, 1]))

['tokyo', 'tokyo', 'paris']

Next, we test the class on our own dataset, i.e. the Species column of the fish market dataset.

In [48]:
le = LabelEncoder()
le.fit_transform(dataset['Species'])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5])

In [49]:
# getting the categorical values in the object
le.get_params()

{'classes_': array(['Bream', 'Parkki', 'Perch', 'Pike', 'Roach', 'Smelt', 'Whitefish'],
       dtype=object)}
