# Target Guided Ordinal Encoding

Target guided ordinal encoding is a technique for encoding a categorical variable based on its relationship with the target variable. It is especially useful when the categorical variable has a large number of unique categories (high cardinality) and we still want to use it as a feature in a machine learning model.

In Target Guided Ordinal Encoding, we replace each category with a numerical value derived from the mean or median of the target variable for that category: either the statistic itself or the category's rank when the categories are sorted by it. Categories with a higher average target therefore receive a higher encoded value, so the encoded feature is monotonically related to the per-category target average, which can improve the predictive power of the model.
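The rank-based form of the idea described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `toy` DataFrame, its `segment` and `spend` columns, and the names `segment_mean` and `segment_rank` are all made up for this example.

```python
import pandas as pd

# Hypothetical toy data: 'segment' is the categorical feature, 'spend' is the target.
toy = pd.DataFrame({
    "segment": ["basic", "premium", "basic", "standard", "premium", "standard"],
    "spend": [10, 50, 20, 30, 70, 40],
})

# Mean target per category, sorted from lowest to highest.
segment_mean = toy.groupby("segment")["spend"].mean().sort_values()

# Give each category an integer rank (1 = lowest mean target).
segment_rank = {name: rank for rank, name in enumerate(segment_mean.index, start=1)}

# Replace each category with its rank.
toy["segment_rank"] = toy["segment"].map(segment_rank)
print(toy)
```

Integer ranks keep only the ordering of the categories, whereas using the means directly also keeps the spacing between them.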

In [57]:
import pandas as pd

In [58]:
## Create a simple DataFrame with a categorical variable (city) and a target variable (price).

df = pd.DataFrame({
    'city': ['Kolkata', 'Barcelona', 'Milan', 'Moscow', 'Kolkata', 'Milan'],
    'price': [200, 150, 300, 250, 180, 320]
})

df

,city,price
0,Kolkata,200
1,Barcelona,150
2,Milan,300
3,Moscow,250
4,Kolkata,180
5,Milan,320


In [59]:
## Mean price per city, stored as a {city: mean price} dictionary.
mean_price = df.groupby(['city'])['price'].mean().to_dict()

In [60]:
mean_price

{'Barcelona': 150.0, 'Kolkata': 190.0, 'Milan': 310.0, 'Moscow': 250.0}

In [61]:
## Series.map() with a dict looks up each city and returns its mean price
## (a city missing from the dict would become NaN).
df['city_encoded'] = df['city'].map(mean_price)

df

,city,price,city_encoded
0,Kolkata,200,190.0
1,Barcelona,150,150.0
2,Milan,300,310.0
3,Moscow,250,250.0
4,Kolkata,180,190.0
5,Milan,320,310.0


In [62]:
## Now 'city_encoded' can be used as a feature in place of 'city'.

df[['city_encoded','price']]

,city_encoded,price
0,190.0,200
1,150.0,150
2,310.0,300
3,250.0,250
4,190.0,180
5,310.0,320
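
The median, mentioned earlier as an alternative to the mean, can be swapped in with a one-line change. Here is a minimal sketch, assuming a made-up variant of the city data in which one Kolkata price is an outlier. The `listings` DataFrame, its values, and the names `city_mean` and `city_median` are hypothetical.

```python
import pandas as pd

# Hypothetical data: the price of 1000 for Kolkata is an invented outlier.
listings = pd.DataFrame({
    "city": ["Kolkata", "Barcelona", "Milan", "Kolkata", "Milan", "Kolkata"],
    "price": [200, 150, 300, 180, 320, 1000],
})

by_city = listings.groupby("city")["price"]
city_mean = by_city.mean().to_dict()      # pulled upward by the outlier
city_median = by_city.median().to_dict()  # less sensitive to the outlier

# Encode each city with the median price of its group.
listings["city_encoded"] = listings["city"].map(city_median)

print(city_mean)
print(city_median)
```

In this toy data the mean ranks Kolkata above Milan, while the median ranks it below, so the choice of statistic can change the ordering of the encoded categories.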


### Practice

In [63]:
## Encode 'time' using the mean 'total_bill' of each category.

import seaborn as sns

In [64]:
df = sns.load_dataset('tips')

df

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [65]:
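## 'time' is a categorical column; passing observed explicitly avoids the
## pandas FutureWarning about the changing default of this parameter.
bill_mean = df.groupby(['time'], observed=False)['total_bill'].mean()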

bill_mean

time
Lunch     17.168676
Dinner    20.797159
Name: total_bill, dtype: float64

In [69]:
df['time_encoded'] = df['time'].map(bill_mean)

df['time_encoded']

0      20.797159
1      20.797159
2      20.797159
3      20.797159
4      20.797159
         ...    
239    20.797159
240    20.797159
241    20.797159
242    20.797159
243    20.797159
Name: time_encoded, Length: 244, dtype: category
Categories (2, float64): [17.168676, 20.797159]
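
The `dtype: category` in the output above shows that mapping a categorical column produced another categorical column rather than a plain numeric one. If a numeric feature is wanted, an explicit cast is one option. Below is a minimal sketch on a small stand-in for the tips data. The `tips_like` DataFrame and its values are hypothetical and are used only so that the snippet runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the 'time' and 'total_bill' columns of the tips data.
tips_like = pd.DataFrame({
    "time": pd.Categorical(["Dinner", "Lunch", "Dinner", "Lunch"]),
    "total_bill": [20.0, 10.0, 30.0, 14.0],
})

# Mean bill per category; observed=True keeps only categories that occur in the data.
bill_mean = tips_like.groupby("time", observed=True)["total_bill"].mean().to_dict()

# Map each category to its mean, then cast so the result is an ordinary float column.
tips_like["time_encoded"] = tips_like["time"].map(bill_mean).astype(float)

print(tips_like["time_encoded"].dtype)
```

An alternative would be to convert `time` to strings before mapping. The cast is shown here only because it leaves the original column untouched.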

In [67]:
df[['total_bill','tip','sex','smoker','day','time_encoded','size']]

,total_bill,tip,sex,smoker,day,time_encoded,size
0,16.99,1.01,Female,No,Sun,20.797159,2
1,10.34,1.66,Male,No,Sun,20.797159,3
2,21.01,3.50,Male,No,Sun,20.797159,3
3,23.68,3.31,Male,No,Sun,20.797159,2
4,24.59,3.61,Female,No,Sun,20.797159,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,20.797159,3
240,27.18,2.00,Female,Yes,Sat,20.797159,2
241,22.67,2.00,Male,Yes,Sat,20.797159,2
242,17.82,1.75,Male,No,Sat,20.797159,2
