# Introduction

The performance of a machine learning model is influenced by several factors. These include not only the model and its hyperparameters, but also how we preprocess the different types of variables before feeding them to the model.

Most machine learning models only accept numerical variables, so preprocessing the categorical variables is a necessary step. Converting these variables to numbers lets the model use the information they carry.

[Basic + Core](https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/)

[Advanced + More](https://medium.com/anolytics/all-you-need-to-know-about-encoding-techniques-b3a0af68338b)

## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 


### Nominal/OHE Encoding
One-hot encoding, also known as nominal encoding, represents categorical data as numerical data that machine learning algorithms can consume. Each category becomes a binary vector with exactly one bit set, and each position in the vector corresponds to one unique category. For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it using one-hot encoding as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
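
The mapping above can be reproduced by hand to see what the encoder does internally. This is a minimal sketch, not scikit-learn's implementation: `one_hot` is a hypothetical helper written for illustration, and the column order (red, green, blue) is chosen to match the list above rather than the alphabetical order scikit-learn uses.

```python
import numpy as np


def one_hot(values, categories):
    """Hypothetical helper: one row per value, with a 1 in its category's column."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return encoded


# Column order is an assumption made to match the list above.
categories = ["red", "green", "blue"]
encoded = one_hot(["red", "blue", "green"], categories)
print(encoded)
```

Each row contains exactly one 1, so the row sums are all 1 regardless of how many categories there are.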

In [30]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [31]:
df = pd.DataFrame({
    'color':['red','blue','green','green','red','blue']
})

In [32]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [33]:
encoder = OneHotEncoder()

In [34]:
encoded = encoder.fit_transform(df[['color']]).toarray()

In [35]:
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [36]:
encoded_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [37]:
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])

In [38]:
df

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red
5,blue


In [39]:
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


In [44]:
import seaborn as sns
df = sns.load_dataset('tips')

In [45]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [46]:
encoded_tips = encoder.fit_transform(df[['sex']]).toarray()

In [49]:
encoded_tips_df = pd.DataFrame(encoded_tips,columns=encoder.get_feature_names_out())

In [50]:
encoded_tips_df.head()

Unnamed: 0,sex_Female,sex_Male
0,1.0,0.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0


In [51]:
pd.concat([df,encoded_tips_df],axis=1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,1.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0
...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,1.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0


In [52]:
df['tip'].unique()

array([ 1.01,  1.66,  3.5 ,  3.31,  3.61,  4.71,  2.  ,  3.12,  1.96,
        3.23,  1.71,  5.  ,  1.57,  3.  ,  3.02,  3.92,  1.67,  3.71,
        3.35,  4.08,  2.75,  2.23,  7.58,  3.18,  2.34,  4.3 ,  1.45,
        2.5 ,  2.45,  3.27,  3.6 ,  3.07,  2.31,  2.24,  2.54,  3.06,
        1.32,  5.6 ,  6.  ,  2.05,  2.6 ,  5.2 ,  1.56,  4.34,  3.51,
        1.5 ,  1.76,  6.73,  3.21,  1.98,  3.76,  2.64,  3.15,  2.47,
        1.  ,  2.01,  2.09,  1.97,  3.14,  2.2 ,  1.25,  3.08,  4.  ,
        2.71,  3.4 ,  1.83,  2.03,  5.17,  5.85,  3.25,  4.73,  3.48,
        1.64,  4.06,  4.29,  2.55,  5.07,  1.8 ,  2.92,  1.68,  2.52,
        4.2 ,  1.48,  2.18,  2.83,  6.7 ,  2.3 ,  1.36,  1.63,  1.73,
        2.74,  5.14,  3.75,  2.61,  4.5 ,  1.61, 10.  ,  3.16,  5.15,
        3.11,  3.55,  3.68,  5.65,  6.5 ,  4.19,  2.56,  2.02,  1.44,
        3.41,  5.16,  9.  ,  1.1 ,  3.09,  1.92,  1.58,  2.72,  2.88,
        3.39,  1.47,  1.17,  4.67,  5.92,  1.75])

In [53]:
# 'tip' is continuous, so every distinct value becomes its own one-hot column
encoded_tip = encoder.fit_transform(df[['tip']]).toarray()

In [64]:
encoder.get_feature_names_out()

array(['tip_1.0', 'tip_1.01', 'tip_1.1', 'tip_1.17', 'tip_1.25',
       'tip_1.32', 'tip_1.36', 'tip_1.44', 'tip_1.45', 'tip_1.47',
       'tip_1.48', 'tip_1.5', 'tip_1.56', 'tip_1.57', 'tip_1.58',
       'tip_1.61', 'tip_1.63', 'tip_1.64', 'tip_1.66', 'tip_1.67',
       'tip_1.68', 'tip_1.71', 'tip_1.73', 'tip_1.75', 'tip_1.76',
       'tip_1.8', 'tip_1.83', 'tip_1.92', 'tip_1.96', 'tip_1.97',
       'tip_1.98', 'tip_2.0', 'tip_2.01', 'tip_2.02', 'tip_2.03',
       'tip_2.05', 'tip_2.09', 'tip_2.18', 'tip_2.2', 'tip_2.23',
       'tip_2.24', 'tip_2.3', 'tip_2.31', 'tip_2.34', 'tip_2.45',
       'tip_2.47', 'tip_2.5', 'tip_2.52', 'tip_2.54', 'tip_2.55',
       'tip_2.56', 'tip_2.6', 'tip_2.61', 'tip_2.64', 'tip_2.71',
       'tip_2.72', 'tip_2.74', 'tip_2.75', 'tip_2.83', 'tip_2.88',
       'tip_2.92', 'tip_3.0', 'tip_3.02', 'tip_3.06', 'tip_3.07',
       'tip_3.08', 'tip_3.09', 'tip_3.11', 'tip_3.12', 'tip_3.14',
       'tip_3.15', 'tip_3.16', 'tip_3.18', 'tip_3.21', 'tip_3.23',
       ...

In [73]:
# Pass the raw value under the column name used during fit ('tip'),
# not the generated feature name 'tip_1.83'
df_to_transform = pd.DataFrame({'tip': [1.83]})
encoder.transform(df_to_transform).toarray()

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [54]:
encoded_tip_df = pd.DataFrame(encoded_tip,columns=encoder.get_feature_names_out())

In [55]:
encoded_tip_df.head()

Unnamed: 0,tip_1.0,tip_1.01,tip_1.1,tip_1.17,tip_1.25,tip_1.32,tip_1.36,tip_1.44,tip_1.45,tip_1.47,...,tip_5.65,tip_5.85,tip_5.92,tip_6.0,tip_6.5,tip_6.7,tip_6.73,tip_7.58,tip_9.0,tip_10.0
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
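
The wide, mostly-zero table above is what happens when a continuous column is one-hot encoded: the number of output columns equals the number of distinct values. A quick way to anticipate this is to count distinct values per column before encoding. The sketch below uses a small made-up frame (`toy` is an assumption for illustration, not the tips dataset) so the counts are easy to verify.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small synthetic frame with made-up values (not the tips dataset).
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "tip": [1.01, 1.66, 3.50, 3.31],
})

# Number of distinct values per column.
cardinality = toy.nunique()
print(cardinality)

# One-hot encoding yields one output column per distinct value.
toy_encoder = OneHotEncoder()
n_sex_cols = toy_encoder.fit_transform(toy[["sex"]]).shape[1]
n_tip_cols = toy_encoder.fit_transform(toy[["tip"]]).shape[1]
print(n_sex_cols, n_tip_cols)
```

Here the categorical column produces two columns, while the numeric column produces one per row because every value is distinct.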


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers carry no inherent meaning; they may follow alphabetical order, category frequency, or any other convention (scikit-learn's LabelEncoder, used below, sorts the categories and numbers them from 0). For example, a categorical variable "color" with three possible values (red, green, blue) could be label encoded as follows:

1. Red: 1
2. Green: 2
3. Blue: 3
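
To see which integer each category actually receives, it can help to inspect the fitted encoder. The sketch below uses scikit-learn's `LabelEncoder` on a made-up list of colors; the variable names are assumptions for illustration. Note that the resulting codes start at 0 and follow sorted order, so they differ from the illustrative 1-2-3 mapping above.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample data for illustration.
colors = ["red", "green", "blue", "green"]

color_encoder = LabelEncoder()
codes = color_encoder.fit_transform(colors)

# classes_ holds the sorted categories; a category's position is its code.
mapping = {str(label): code for code, label in enumerate(color_encoder.classes_)}
print(mapping)
print(codes)

# inverse_transform maps the integer codes back to the original labels.
restored = color_encoder.inverse_transform(codes)
print(restored)
```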

In [78]:
import seaborn as sns
tips_df = sns.load_dataset('tips')

In [79]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [85]:
tips_df.reset_index(drop=True,inplace=True)
tips_df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [88]:
day_array = tips_df['day'].values
day_array

['Sun', 'Sun', 'Sun', 'Sun', 'Sun', ..., 'Sat', 'Sat', 'Sat', 'Sat', 'Thur']
Length: 244
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']

In [89]:
lbl_encoder.fit_transform(day_array)

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 3])

In [92]:
lbl_encoder.transform(['Sun'])

array([2])

In [93]:
lbl_encoder.transform(['Sat'])

array([1])

In [95]:
lbl_encoder.transform(['Thur'])  # not displayed: a cell echoes only its last expression
lbl_encoder.transform(['Fri'])

array([0])

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
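
The education-level mapping above can be written out directly as a dictionary and applied with pandas. This is a minimal sketch on made-up data; the names `order` and `rank` are assumptions for illustration, and the 1-based numbering is chosen only to match the list above.

```python
import pandas as pd

# Made-up sample data for illustration.
education = pd.Series(["college", "high school", "post-graduate", "graduate"])

# The intrinsic order, lowest to highest.
order = ["high school", "college", "graduate", "post-graduate"]

# Position in the order (starting at 1) becomes the encoded value.
rank = {level: position for position, level in enumerate(order, start=1)}

encoded_education = education.map(rank)
print(encoded_education.tolist())
```

Because the order is stated explicitly, comparisons such as "graduate is higher than college" hold for the encoded values as well.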

In [96]:
from sklearn.preprocessing import OrdinalEncoder

In [98]:
size_df = pd.DataFrame({
    'size' : ['small','medium','large','extra large']
})

In [99]:
size_df

Unnamed: 0,size
0,small
1,medium
2,large
3,extra large


In [100]:
encoder = OrdinalEncoder()  # no categories passed, so they are inferred and sorted alphabetically

In [101]:
encoder.fit_transform(size_df[['size']])

array([[3.],
       [2.],
       [1.],
       [0.]])

In [102]:
encoder.transform([['small']])



array([[3.]])
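
With no arguments, `OrdinalEncoder` infers the categories and sorts them, so for strings the codes follow alphabetical order; that is why 'small' maps to 3 in the output above. If the codes should follow the size order instead, the order can be supplied through the `categories` parameter. The sketch below assumes the intended order is small < medium < large < extra large; the variable names are chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "extra large"]})

# One list of categories per input column, given in the desired order (assumed here).
ordered_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large", "extra large"]]
)
ordered_codes = ordered_encoder.fit_transform(sizes[["size"]])
print(ordered_codes.ravel())
```

The encoded values now increase with size, so a model that treats the feature as numeric sees the ranking the categories actually have.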