### Data Encoding:
1. One Hot Encoding/ Nominal Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

#### One Hot Encoding/ Nominal Encoding:
One Hot Encoding, also known as Nominal Encoding, is a technique for representing categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector in which each position corresponds to one unique category and exactly one position is set to 1. For example, a categorical variable "color" with three possible values (red, green, blue) can be represented through one hot encoding as follows:
1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
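
The idea can be sketched in plain Python before turning to a library. This is only an illustration, not scikit-learn's implementation; the helper name `one_hot` is hypothetical. It sorts the categories alphabetically, which is also what `OneHotEncoder` does by default, so the positions come out as blue, green, red rather than in the order listed above.

```python
# Plain-Python sketch of one-hot encoding (illustrative only).
colors = ["red", "green", "blue", "red"]

# Sort the unique categories so the column order is deterministic.
categories = sorted(set(colors))


def one_hot(value, categories):
    """Return a binary vector with a single 1 at the category's position."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(color, categories) for color in colors]

print(categories)
for color, vector in zip(colors, encoded):
    print(color, vector)
```

Each vector has exactly one 1, and its position is determined by the sorted category list rather than by the order in which values appear in the data.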

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
data=pd.DataFrame({'color':['red','green','blue','red','red','blue','green']})
data

Unnamed: 0,color
0,red
1,green
2,blue
3,red
4,red
5,blue
6,green


In [4]:
# creating an instance of OneHotEncoder
encoder=OneHotEncoder()

In [7]:
# perform fit and transform
encoded_array=encoder.fit_transform(data[['color']]).toarray()

In [8]:
encoded_array

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [9]:
encoded_data=pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out())
encoded_data

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,0.0,1.0,0.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0
6,0.0,1.0,0.0


In [15]:
import numpy as np
np.array(encoded_data['color_blue'])

array([0., 0., 1., 0., 0., 1., 0.])

#### Exercise

In [16]:
import seaborn as sns
df=sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [28]:
# Use a separate encoder per column so each one keeps its own fitted categories
sex_encoder=OneHotEncoder()
smoker_encoder=OneHotEncoder()
time_encoder=OneHotEncoder()
encoded_sex_array=sex_encoder.fit_transform(df[['sex']]).toarray()
encoded_smoker_array=smoker_encoder.fit_transform(df[['smoker']]).toarray()
encoded_time_array=time_encoder.fit_transform(df[['time']]).toarray()

In [29]:
encoded_sex=pd.DataFrame(encoded_sex_array, columns=sex_encoder.get_feature_names_out())

In [30]:
encoded_smoker=pd.DataFrame(encoded_smoker_array, columns=smoker_encoder.get_feature_names_out())

In [32]:
encoded_time=pd.DataFrame(encoded_time_array, columns=time_encoder.get_feature_names_out())

In [38]:
# Create a fresh encoder for the 'day' column
encoder = OneHotEncoder()

# Fit and transform the 'day' column
encoded_day_array = encoder.fit_transform(df[['day']]).toarray()

# Name the columns after the fitted categories (day_Fri, day_Sat, ...)
encoded_day = pd.DataFrame(encoded_day_array, columns=encoder.get_feature_names_out(['day']))

# Display the DataFrame
encoded_day

Unnamed: 0,day_Fri,day_Sat,day_Sun,day_Thur
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0
...,...,...,...,...
239,0.0,1.0,0.0,0.0
240,0.0,1.0,0.0,0.0
241,0.0,1.0,0.0,0.0
242,0.0,1.0,0.0,0.0


In [39]:
encoded_data=pd.concat([df, encoded_sex, encoded_smoker, encoded_day, encoded_time], axis=1)
encoded_data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


#### Label Encoding
Label Encoding and Ordinal Encoding are techniques used to encode categorical data.

Label Encoding involves assigning a unique numerical label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories; scikit-learn's LabelEncoder sorts the categories and numbers them starting from 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be represented using label encoding as follows:

Blue: 0
Green: 1
Red: 2

In [40]:
data=pd.DataFrame({'color':['Red',"Green",'Red','Red','Blue']})
data

Unnamed: 0,color
0,Red
1,Green
2,Red
3,Red
4,Blue


In [41]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [45]:
# LabelEncoder expects a 1-D input, so pass the column as a Series
encoded_data=lbl_encoder.fit_transform(data['color'])
encoded_data

array([2, 1, 2, 2, 0])

In [47]:
lbl_encoder.transform(['Red'])

array([2])

In [48]:
lbl_encoder.transform(['Green'])

array([1])

In [49]:
lbl_encoder.transform(['Blue'])

array([0])

In [52]:
# Resulting mapping (LabelEncoder sorts the classes alphabetically):
# Red -> 2
# Green -> 1
# Blue -> 0
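
The same mapping can be reproduced in plain Python, which makes it clear where the numbers come from. This is a minimal sketch and not scikit-learn's code; the names `classes` and `label_of` are chosen here for illustration.

```python
# Plain-Python sketch of label encoding (illustrative only).
colors = ["Red", "Green", "Red", "Red", "Blue"]

# Sorted unique categories; the position in this list becomes the label.
classes = sorted(set(colors))
label_of = {category: index for index, category in enumerate(classes)}

encoded = [label_of[color] for color in colors]

# Decoding is a lookup in the opposite direction.
decoded = [classes[label] for label in encoded]

print(label_of)
print(encoded)
print(decoded)
```

Because the labels depend only on alphabetical order, the numbers carry no meaning of their own: Red is not "greater" than Blue, even though 2 > 0.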

#### Ordinal Encoding:
It is used to encode categorical variables that have an intrinsic order or ranking. In this technique, each category is assigned a numerical value based on its position in the order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post graduate) can be represented using ordinal encoding as follows:
1. High School - 1
2. College - 2
3. Graduate - 3
4. Post Graduate - 4
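
The education-level mapping above can be sketched in plain Python. The sample values in `levels` are made up for illustration. The sketch numbers the categories from 1 to match the list above, whereas scikit-learn's `OrdinalEncoder` numbers them from 0.

```python
# Plain-Python sketch of ordinal encoding with an explicit category order.
education_order = ["High School", "College", "Graduate", "Post Graduate"]

# The rank is the position in the declared order, starting at 1.
rank_of = {level: rank for rank, level in enumerate(education_order, start=1)}

# Hypothetical sample data, for illustration only.
levels = ["College", "High School", "Post Graduate", "Graduate", "College"]
encoded = [rank_of[level] for level in levels]

print(rank_of)
print(encoded)
```

The order has to be supplied by hand, because sorting the category names alphabetically would not reproduce the intended ranking.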

In [53]:
from sklearn.preprocessing import OrdinalEncoder
data=pd.DataFrame({'size':['small','medium','large','medium','small','large']})
data

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [56]:
ord_encoder=OrdinalEncoder(categories=[['small','medium','large']])
encoded_data=ord_encoder.fit_transform(data[['size']])
encoded_data

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [57]:
ord_encoder.transform([['small']])



array([[0.]])

In [58]:
ord_encoder.transform([['medium']])



array([[1.]])

In [59]:
ord_encoder.transform([['large']])



array([[2.]])

#### Target Guided Ordinal Encoding
It is a technique used to encode categorical variables based on their relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. Categories with a higher target mean receive a higher value, so the encoded feature has a monotonic relationship with the target variable, which can improve the predictive power of the model.

In [1]:
import pandas as pd
df=pd.DataFrame({'city':['New York','Paris','London','Tokyo','New York','Paris'],'price':[180,200,120,340,202,243]})
df

Unnamed: 0,city,price
0,New York,180
1,Paris,200
2,London,120
3,Tokyo,340
4,New York,202
5,Paris,243


In [6]:
mean_price=df.groupby('city')['price'].mean().to_dict()
mean_price

{'London': 120.0, 'New York': 191.0, 'Paris': 221.5, 'Tokyo': 340.0}

In [8]:
df['city_encoded']=df['city'].map(mean_price)
df

Unnamed: 0,city,price,city_encoded
0,New York,180,191.0
1,Paris,200,221.5
2,London,120,120.0
3,Tokyo,340,340.0
4,New York,202,191.0
5,Paris,243,221.5


In [9]:
df[['city_encoded','price']]

Unnamed: 0,city_encoded,price
0,191.0,180
1,221.5,200
2,120.0,120
3,340.0,340
4,191.0,202
5,221.5,243
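
The cells above map each city directly to its mean price. A common variant ranks the categories by that mean and uses the rank as the encoded value, which yields small ordinal integers. The sketch below shows that variant in plain Python on the same data; it is one possible formulation, not the only one, and the names `prices_by_city` and `rank_of` are chosen here for illustration.

```python
from collections import defaultdict
from statistics import mean

# Same data as the DataFrame above.
cities = ["New York", "Paris", "London", "Tokyo", "New York", "Paris"]
prices = [180, 200, 120, 340, 202, 243]

# Group the target values by category.
prices_by_city = defaultdict(list)
for city, price in zip(cities, prices):
    prices_by_city[city].append(price)

# Mean of the target per category.
mean_price = {city: mean(values) for city, values in prices_by_city.items()}

# Rank categories from lowest to highest mean, starting at 1.
ordered = sorted(mean_price, key=mean_price.get)
rank_of = {city: rank for rank, city in enumerate(ordered, start=1)}

city_rank = [rank_of[city] for city in cities]

print(mean_price)
print(rank_of)
print(city_rank)
```

In practice the means should be computed on the training data only, so that the encoding does not leak target values from validation or test rows.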
