# Label Encoder

This notebook shows how label encoding converts categorical data to an integer format
- The encoder maps each label to an integer between 0 and `n_classes - 1`, where `n_classes` is the number of unique labels (classes) present
- Label encoders can be used to normalize both numerical and non-numerical labels

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

### Importing modules

In [1]:
from sklearn import preprocessing

import pandas as pd
import numpy as np

## Numerical labels

### Creating the label encoder object
A label encoder converts categories to integer codes, with one unique integer representing each category. Because the codes are integers, they carry an implied order for the categories.

In [2]:
num_encoder = preprocessing.LabelEncoder()

### Fitting the label encoder

The encoder can be fitted on data that is already numerical. The numbers are treated as categories, and each one is assigned a unique integer code.

In [3]:
num_encoder.fit([50, 20, 60, 60])

LabelEncoder()

#### The classes_ attribute
Holds every unique label seen during fitting. You cannot set the order yourself: the encoder sorts the labels, and `classes_` shows the resulting order.

In [4]:
num_encoder.classes_

array([20, 50, 60])
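To make the ordering concrete, here is a minimal sketch that compares `classes_` with the sorted unique values from `np.unique` and shows that each code is simply the label's position in `classes_`. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = [50, 20, 60, 60]
encoder = LabelEncoder().fit(labels)

# classes_ matches the sorted unique values of the input
sorted_unique = np.unique(labels)
print(encoder.classes_)

# The integer code of a label is its index within classes_
codes = encoder.transform(labels)
code_by_position = np.searchsorted(encoder.classes_, labels)
print(codes, code_by_position)
```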

#### The transform() method
Transforms labels to their integer codes

In [5]:
num_encoder.transform([50, 20, 60, 60])

array([1, 0, 2, 2])
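`transform()` only knows the labels it saw during `fit()`. The sketch below, using a made-up unseen value of 70, shows that an unknown label raises a `ValueError` rather than receiving a new code.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit([50, 20, 60, 60])

try:
    encoder.transform([50, 70])  # 70 was not present during fit
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print(f"transform failed: {err}")
```

If new labels can appear later, the encoder has to be refitted on data that includes them.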

#### The inverse_transform() method
Transforms encoded labels back to their original form

In [6]:
num_encoder.inverse_transform([1, 0, 2, 2])

array([50, 20, 60, 60])

In [7]:
num_encoder.inverse_transform([1, 0, 2, 2, 1, 0])

array([50, 20, 60, 60, 50, 20])
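The codes passed to `inverse_transform()` may repeat and come in any order, but each must be a valid index into `classes_`. A small sketch, using the hypothetical out-of-range code 3, of what happens otherwise:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit([50, 20, 60, 60])  # valid codes are 0, 1, 2

try:
    encoder.inverse_transform([1, 3])  # 3 does not correspond to any class
    invalid_code_rejected = False
except ValueError as err:
    invalid_code_rejected = True
    print(f"inverse_transform failed: {err}")
```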

## Non-numerical labels
Encoding non-numerical labels such as strings

#### Creating a label encoder object

In [8]:
string_encoder = preprocessing.LabelEncoder()

#### Fitting the values to the label encoder object

We will use some weather categories and, for this example, assume there is an inherent ordering to them. Note that the encoder assigns codes in sorted (alphabetical) order, and an ML model that receives the encoded values as features will treat that order as meaningful.

In [9]:
string_encoder.fit(["Cloudy", "Sunny", "Windy", "Cloudy", "Rainy"])

LabelEncoder()

In [10]:
string_encoder.classes_

array(['Cloudy', 'Rainy', 'Sunny', 'Windy'], dtype='<U6')

In [11]:
string_encoder.transform(["Cloudy", "Rainy", "Sunny", "Windy"])

array([0, 1, 2, 3])
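The codes depend only on the sorted labels, not on the order in which the labels were passed to `fit()`. The following sketch fits two encoders on differently ordered inputs and builds the label-to-code mapping; the names `first`, `second`, and `mapping` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

first = LabelEncoder().fit(["Cloudy", "Sunny", "Windy", "Cloudy", "Rainy"])
second = LabelEncoder().fit(["Windy", "Rainy", "Sunny", "Cloudy"])

# Both encoders end up with the same alphabetically sorted classes
same_classes = list(first.classes_) == list(second.classes_)

# Label -> code lookup derived from the position in classes_
mapping = {str(label): int(code) for code, label in enumerate(first.classes_)}
print(same_classes, mapping)
```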

#### Inverse transform of the labels

In [12]:
list(string_encoder.inverse_transform([3, 0, 2, 1, 3]))

['Windy', 'Cloudy', 'Sunny', 'Rainy', 'Windy']

## Label encoding on a pandas dataframe

### Reading a csv file 

The dataset was generated by the author of this notebook.

It describes present and former employees of a company with different designations.

In [13]:
employee_data = pd.read_csv("datasets/employee_salary.csv")
employee_data

Unnamed: 0,Designation,Age,Salary,Retired
0,Manager,54,72000,Yes
1,Supervisor,27,32000,No
2,Vice-president,30,42000,No
3,Manager,58,83000,Yes
4,Supervisor,40,35000,No
5,Supervisor,35,42000,No
6,Employee,40,48000,No
7,Vice-president,55,79000,Yes
8,Employee,45,67000,No
9,Supervisor,40,45000,No


`LabelEncoder` works on one column at a time, but `DataFrame.apply` can run it over every column in one go. See what happens if you do not specify which columns you want encoded.

In [14]:
data_encoder = preprocessing.LabelEncoder()

In [15]:
employee_data_encoded = employee_data.apply(data_encoder.fit_transform)
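One side effect worth checking: `apply` calls `fit_transform` once per column on the same object, so the encoder is refitted each time and ends up holding only the classes of the last column. A small sketch on a hypothetical two-column frame (not the employee CSV):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data, used only to illustrate the behaviour
frame = pd.DataFrame({
    "Designation": ["Manager", "Supervisor", "Employee"],
    "Retired": ["Yes", "No", "No"],
})

shared_encoder = LabelEncoder()
encoded = frame.apply(shared_encoder.fit_transform)

# Only the classes from the last fitted column ("Retired") remain
remaining_classes = list(shared_encoder.classes_)
print(remaining_classes)
```

This means the shared encoder cannot be used afterwards to decode the earlier columns.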

### Encoding without specifying columns encodes every column

In [16]:
employee_data_encoded

Unnamed: 0,Designation,Age,Salary,Retired
0,1,6,7,1
1,2,0,0,0
2,3,1,2,0
3,1,8,9,1
4,2,3,1,0
5,2,2,2,0
6,0,3,4,0
7,3,7,8,1
8,0,4,6,0
9,2,3,3,0


* This output is not helpful, because numeric columns that are not categorical (`Age`, `Salary`) were encoded as well
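One possible way to avoid this is to list the categorical columns explicitly and keep one encoder per column. The sketch below rebuilds the first five rows shown above as an inline frame so that it runs without the CSV; the `categorical_columns` list and `encoders` dict are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# First five rows of the employee data, rebuilt inline
employee_data = pd.DataFrame({
    "Designation": ["Manager", "Supervisor", "Vice-president", "Manager", "Supervisor"],
    "Age": [54, 27, 30, 58, 40],
    "Salary": [72000, 32000, 42000, 83000, 35000],
    "Retired": ["Yes", "No", "No", "Yes", "No"],
})

categorical_columns = ["Designation", "Retired"]
encoders = {}                      # one fitted encoder per column
encoded = employee_data.copy()     # numeric columns stay untouched

for column in categorical_columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(employee_data[column])

print(encoded)
```

Keeping the fitted encoders in a dict makes it possible to call `inverse_transform` per column later.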

### Encoding the column `Designation`

Here we specify a single column to encode. Note that scikit-learn estimators treat numeric input as continuous, so they will assume an ordering on the encoded categories.

In [17]:
state_encoder = preprocessing.LabelEncoder()

employee_data['Designation_encoded'] = state_encoder.fit_transform(employee_data.Designation)

In [18]:
state_encoder.classes_

array(['Employee', 'Manager', 'Supervisor', 'Vice-president'],
      dtype=object)
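The order in `classes_` is alphabetical, which may not match the order you have in mind for the designations. If a specific order matters for a feature column, one option is scikit-learn's `OrdinalEncoder`, which accepts an explicit category list. The seniority order used below is an assumption for illustration, not something stated by the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

designations = pd.DataFrame(
    {"Designation": ["Manager", "Supervisor", "Vice-president", "Employee"]}
)

# Assumed seniority order, lowest to highest (hypothetical)
seniority = ["Employee", "Supervisor", "Manager", "Vice-president"]

ordinal_encoder = OrdinalEncoder(categories=[seniority])
# OrdinalEncoder expects 2-D input and returns floats
ranks = ordinal_encoder.fit_transform(designations[["Designation"]]).ravel()
print(ranks)
```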

### Encoding the column `Retired`

Since the `Retired` column has different categories from the `Designation` column, we use a separate encoder

In [19]:
retirement_encoder = preprocessing.LabelEncoder()

employee_data['Retired_encoded'] = retirement_encoder.fit_transform(employee_data.Retired)

In [20]:
retirement_encoder.classes_

array(['No', 'Yes'], dtype=object)

In [21]:
employee_data

Unnamed: 0,Designation,Age,Salary,Retired,Designation_encoded,Retired_encoded
0,Manager,54,72000,Yes,1,1
1,Supervisor,27,32000,No,2,0
2,Vice-president,30,42000,No,3,0
3,Manager,58,83000,Yes,1,1
4,Supervisor,40,35000,No,2,0
5,Supervisor,35,42000,No,2,0
6,Employee,40,48000,No,0,0
7,Vice-president,55,79000,Yes,3,1
8,Employee,45,67000,No,0,0
9,Supervisor,40,45000,No,2,0


### Inverse transform of the encoded labels of a specific column

In [22]:
employee_data['Retirement_decoded'] = retirement_encoder.inverse_transform(
    employee_data['Retired_encoded']
)

In [23]:
employee_data['Designation_decoded'] = state_encoder.inverse_transform(
    employee_data['Designation_encoded']
)

In [24]:
employee_data

Unnamed: 0,Designation,Age,Salary,Retired,Designation_encoded,Retired_encoded,Retirement_decoded,Designation_decoded
0,Manager,54,72000,Yes,1,1,Yes,Manager
1,Supervisor,27,32000,No,2,0,No,Supervisor
2,Vice-president,30,42000,No,3,0,No,Vice-president
3,Manager,58,83000,Yes,1,1,Yes,Manager
4,Supervisor,40,35000,No,2,0,No,Supervisor
5,Supervisor,35,42000,No,2,0,No,Supervisor
6,Employee,40,48000,No,0,0,No,Employee
7,Vice-president,55,79000,Yes,3,1,Yes,Vice-president
8,Employee,45,67000,No,0,0,No,Employee
9,Supervisor,40,45000,No,2,0,No,Supervisor
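The decoded columns above match the originals by eye. A quick programmatic check of the same round trip might look like this, using a short inline sample of `Retired` values rather than the CSV:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

retired = np.array(["Yes", "No", "No", "Yes", "No"])

encoder = LabelEncoder()
codes = encoder.fit_transform(retired)
decoded = encoder.inverse_transform(codes)

# Encoding followed by decoding should return the original labels
round_trip_ok = np.array_equal(decoded, retired)
print(codes, round_trip_ok)
```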
