### Data Encoding
1. Nominal / One-Hot Encoding (OHE)
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

In [1]:
# OHE

import pandas as pd
from sklearn.preprocessing import OneHotEncoder


In [4]:
df = pd.DataFrame({'colors':['red','blue','green','red','blue','green','red','blue','green']})

In [5]:
df

,colors
0,red
1,blue
2,green
3,red
4,blue
5,green
6,red
7,blue
8,green


In [6]:
# create instance of encoder

encoder = OneHotEncoder()

In [14]:
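# OneHotEncoder sorts the categories alphabetically (blue, green, red),
# so the hardcoded column names must follow that order
encoded = encoder.fit_transform(df[['colors']]).toarray()  # sparse matrix -> dense array
encoded_df = pd.DataFrame(encoded, columns=['color_blue', 'color_green', 'color_red'])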

In [15]:
pd.concat([df,encoded_df],axis=1)

,colors,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,red,0.0,0.0,1.0
4,blue,1.0,0.0,0.0
5,green,0.0,1.0,0.0
6,red,0.0,0.0,1.0
7,blue,1.0,0.0,0.0
8,green,0.0,1.0,0.0


### Label And Ordinal Encoding


In [16]:
from sklearn.preprocessing import LabelEncoder

In [17]:
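# nominal data: categories with no inherent order (categorised, not ranked)
# note: LabelEncoder is designed for target labels (y); it numbers the classes in sorted order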
df = pd.DataFrame({'colors':['red','blue','green','red','blue','green','red','blue','green']})

In [18]:
encoder = LabelEncoder()

In [21]:
encoder.fit_transform(df['colors'])

array([2, 0, 1, 2, 0, 1, 2, 0, 1])
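
To see where the integers in this output come from, the fitted encoder can be inspected. The sketch below uses a shortened copy of the colour data; `mapping` and `decoded` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['red', 'blue', 'green', 'red'])

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The codes follow alphabetical order only, so `red > green > blue` here carries no real meaning.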

In [23]:
# ordinal encoding, for ordinal data: categorised and ranked
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'size': ['small','medium','large','medium','large','small']
})

In [24]:
# list the categories from lowest to highest rank, so that small=0, medium=1, large=2
Ordinal_encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [29]:
Ordinal_encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [2.],
       [0.]])
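
The explicit `categories` argument matters because, without it, `OrdinalEncoder` falls back to sorted (alphabetical) order. A small comparison on the same data, as a sketch; `default_codes` and `ranked_codes` are names chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'large', 'small']})

# Default: categories are sorted alphabetically -> large=0, medium=1, small=2
default_codes = OrdinalEncoder().fit_transform(df[['size']]).ravel()

# Explicit rank order -> small=0, medium=1, large=2
ranked_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ranked_codes = ranked_encoder.fit_transform(df[['size']]).ravel()

print(default_codes)
print(ranked_codes)
```

With the default ordering, `small` receives the largest code, which reverses the intended ranking.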

### Target Guided Ordinal Encoding

Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable.
It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In target guided ordinal encoding, we replace each category with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of the model.

In [33]:
df = pd.DataFrame({
    'city':['delhi','banglore','mumbai','indore','delhi','hyderabad'],
    'price':[200,150,300,250,180,320]
})

In [36]:
# calculate the mean price for each city
mean_prices = df.groupby(by='city')['price'].mean().to_dict()

In [37]:
# replace each city with its mean price
df['city_encoded'] = df['city'].map(mean_prices)

In [38]:
df

,city,price,city_encoded
0,delhi,200,190.0
1,banglore,150,150.0
2,mumbai,300,300.0
3,indore,250,250.0
4,delhi,180,190.0
5,hyderabad,320,320.0
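
The cells above map each city directly to its mean price. One common variant, shown here only as a sketch, takes the "ordinal" part literally: sort the categories by their mean target value and replace each one with its integer rank. The names `rank_map` and `city_rank` are assumptions introduced for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['delhi', 'banglore', 'mumbai', 'indore', 'delhi', 'hyderabad'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Mean price per city, sorted from lowest to highest
mean_prices = df.groupby('city')['price'].mean().sort_values()

# Rank 0 goes to the city with the lowest mean price
rank_map = {city: rank for rank, city in enumerate(mean_prices.index)}

df['city_rank'] = df['city'].map(rank_map)
print(df)
```

Either variant keeps the encoded value monotonic in the category's mean price; the rank version simply discards the size of the gaps between cities.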
