# Data Encoding
Data encoding covers techniques for representing categorical data as numerical data, the form most machine learning algorithms expect.

## 1. Nominal Encoding / OHE (One-Hot Encoding)

Each category is represented as a binary vector with one position per unique category: the position matching the row's category is 1 and every other position is 0.

In [2]:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

In [3]:
# Creating a Dataset/DataFrame
df = pd.DataFrame({
    'color': ['red','blue','green','green','red','blue']
})

In [5]:
df

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |
| 5 | blue  |


In [6]:
# Creating an instance of OHE
encoder = OneHotEncoder()

In [7]:
# Performing fit and transform
encoded = encoder.fit_transform(df[['color']])

In [8]:
encoded

<6x3 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [9]:
# Convert the sparse result to a dense NumPy array so the values are visible
encoded = encoder.fit_transform(df[['color']]).toarray()

In [10]:
encoded

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [11]:
# Wrap the encoded array in a DataFrame with readable column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [12]:
encoder_df

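|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 1.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |
| 5 | 1.0        | 0.0         | 0.0       |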


In [14]:
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])

In [16]:
pd.concat([df,encoder_df],axis=1)

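|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0        | 0.0         | 1.0       |
| 1 | blue  | 1.0        | 0.0         | 0.0       |
| 2 | green | 0.0        | 1.0         | 0.0       |
| 3 | green | 0.0        | 1.0         | 0.0       |
| 4 | red   | 0.0        | 0.0         | 1.0       |
| 5 | blue  | 1.0        | 0.0         | 0.0       |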
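
To make the binary-vector idea concrete, here is a minimal hand-rolled sketch of the same encoding. It assumes the categories are ordered alphabetically, which matches the column order (`blue`, `green`, `red`) in the output above. The names `colors`, `categories`, and `one_hot` are illustrative and not part of the notebook.

```python
import numpy as np

colors = ['red', 'blue', 'green', 'green', 'red', 'blue']

# One output column per unique category, in sorted order (an assumption that
# mirrors the column order seen in the OneHotEncoder output)
categories = sorted(set(colors))

# For each row, put 1.0 in the column of its category and 0.0 elsewhere
one_hot = np.array(
    [[1.0 if value == category else 0.0 for category in categories]
     for value in colors]
)

print(categories)
print(one_hot)
```

Every row sums to 1 because each value belongs to exactly one category.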


## 2. Label Encoding

A unique integer label is assigned to each category in the variable.
(With scikit-learn's `LabelEncoder`, labels follow the sorted, i.e. alphabetical, order of the categories rather than their frequency.)

In [17]:
df

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |
| 5 | blue  |


In [18]:
lbl_encoder = LabelEncoder()

In [19]:
lbl_encoded = lbl_encoder.fit_transform(df['color'])

In [20]:
lbl_encoded

array([2, 0, 1, 1, 2, 0])

In [21]:
# LabelEncoder expects a 1-D array-like; a nested list triggers a conversion warning
lbl_encoder.transform(['red'])

array([2])

In [22]:
lbl_encoder.transform(['blue'])

array([0])

In [23]:
lbl_encoder.transform(['green'])

array([1])
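
The category-to-label mapping can be inspected directly rather than probed one value at a time. The sketch below refits a fresh `LabelEncoder` on the same colors so it runs on its own. `classes_` and `inverse_transform` are standard scikit-learn attributes and methods, while the `mapping` dictionary is only an illustrative helper.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'green', 'red', 'blue']

lbl_encoder = LabelEncoder()
codes = lbl_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; each label is its index there
mapping = {str(label): index for index, label in enumerate(lbl_encoder.classes_)}
print(mapping)

# inverse_transform turns the integer labels back into the original categories
decoded = lbl_encoder.inverse_transform(codes)
print(list(decoded))
```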

## 3. Ordinal Encoding
Used for categorical data that has an intrinsic order or ranking. Each category is assigned an integer based on its position in that order.

In [24]:
df_ord = pd.DataFrame({
    'size' : ['small', 'medium','large', 'large', 'medium', 'small']
})

In [25]:
df_ord

|   | size   |
|---|--------|
| 0 | small  |
| 1 | medium |
| 2 | large  |
| 3 | large  |
| 4 | medium |
| 5 | small  |


In [36]:
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# The categories argument fixes the rank order: small < medium < large

In [45]:
encoder.fit_transform(df_ord[['size']])

array([[0.],
       [1.],
       [2.],
       [2.],
       [1.],
       [0.]])




In [41]:
encoder.transform([['small']])

array([[0.]])
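
To see why the `categories` argument matters, the following sketch compares the default behaviour with an explicit ranking on the same sizes. Without `categories`, `OrdinalEncoder` sorts the values alphabetically, so `large` would get the lowest code. The names `df_sizes`, `default_codes`, and `ranked_codes` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_sizes = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'large', 'medium', 'small']
})

# Default: categories are sorted alphabetically (large=0, medium=1, small=2),
# which does not reflect the real ordering of sizes
default_codes = OrdinalEncoder().fit_transform(df_sizes[['size']])

# Explicit ranking: small=0, medium=1, large=2
ranked_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ranked_codes = ranked_encoder.fit_transform(df_sizes[['size']])

print(default_codes.ravel())
print(ranked_codes.ravel())
```

Only the explicitly ranked codes preserve the relationship small < medium < large.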
