## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data that machine learning algorithms can work with. Each category becomes a binary vector with one position per unique category: the position of the observed category is set to 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
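
To make the vector layout above concrete, here is a minimal sketch that builds the same vectors by hand in plain Python. The helper name `one_hot` and the fixed category order are assumptions made for illustration, not part of scikit-learn.

```python
# Illustrative sketch: the list order decides which position each category owns.
categories = ["red", "green", "blue"]

def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

for color in categories:
    print(color, one_hot(color, categories))
```

The encoder used in the cells that follow does the same job, but it learns the category list from the data and orders it alphabetically, so its columns come out as blue, green, red.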

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.DataFrame({
    'color':['red','blue','green','green','red','blue']
})

In [3]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [None]:
#Create an instance of one hot encoder
encoder = OneHotEncoder()

In [8]:
#perform fit and transform
encoded = encoder.fit_transform(df[['color']]).toarray()

In [10]:
encoder_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [11]:
encoder_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [13]:
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])

In [14]:
pd.concat([df,encoder_df],axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0
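
For quick exploration, a similar table can be produced with `pd.get_dummies`. The sketch below is an alternative to the `OneHotEncoder` workflow above, not a replacement: `get_dummies` does not remember the categories it saw, so it cannot encode new rows consistently later. The variable names are illustrative, and `dtype=int` is an assumption chosen to get 0/1 integers instead of booleans.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "blue", "green", "green", "red", "blue"]})

# One indicator column per category, named <column>_<category>;
# the original 'color' column is replaced by the indicator columns.
dummies = pd.get_dummies(colors, columns=["color"], dtype=int)
print(dummies)
```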


In [18]:
## Assignment
import seaborn as sns
df_1 = sns.load_dataset('tips')
df_1

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [17]:
encoder_1 = OneHotEncoder()

In [20]:
encoded_1 = encoder_1.fit_transform(df_1[['time']]).toarray()

In [21]:
encoded_1

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...

In [24]:
encoder_df_1 = pd.DataFrame(encoded_1,columns = encoder_1.get_feature_names_out())

In [25]:
encoder_df_1

Unnamed: 0,time_Dinner,time_Lunch
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0
...,...,...
239,1.0,0.0
240,1.0,0.0
241,1.0,0.0
242,1.0,0.0


In [29]:
final_df = pd.concat([df_1,encoder_df_1],axis=1)

In [30]:
final_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,time_Dinner,time_Lunch
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,1.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0
...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,1.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,1.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,1.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,1.0,0.0


In [33]:
final_df.drop(['time'],axis=1,inplace=True)

In [34]:
final_df

Unnamed: 0,total_bill,tip,sex,smoker,day,size,time_Dinner,time_Lunch
0,16.99,1.01,Female,No,Sun,2,1.0,0.0
1,10.34,1.66,Male,No,Sun,3,1.0,0.0
2,21.01,3.50,Male,No,Sun,3,1.0,0.0
3,23.68,3.31,Male,No,Sun,2,1.0,0.0
4,24.59,3.61,Female,No,Sun,4,1.0,0.0
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,3,1.0,0.0
240,27.18,2.00,Female,Yes,Sat,2,1.0,0.0
241,22.67,2.00,Male,Yes,Sat,2,1.0,0.0
242,17.82,1.75,Male,No,Sat,2,1.0,0.0


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers may follow alphabetical order or category frequency; scikit-learn's `LabelEncoder` sorts the categories and numbers them starting from 0. For example, one possible labelling of a categorical variable "color" with three possible values (red, green, blue) is:

1. Red: 1
2. Green: 2
3. Blue: 3
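
The list above is only one possible assignment. As a rough sketch of the sorted, zero-based scheme that scikit-learn's `LabelEncoder` follows, the mapping can be reproduced in plain Python. The names `classes` and `label_of` are illustrative.

```python
colors = ["red", "blue", "green", "green", "red", "blue"]

# Sorted unique categories play the role of LabelEncoder's classes_
classes = sorted(set(colors))

# Each category gets its position in the sorted list as its label
label_of = {category: index for index, category in enumerate(classes)}
encoded = [label_of[color] for color in colors]

print(classes)
print(encoded)
```

Under this scheme blue maps to 0, green to 1, and red to 2.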

In [28]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [35]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [41]:
# LabelEncoder expects a 1-D array, so pass the column as a Series
encoded_df = le.fit_transform(df['color'])


In [42]:
encoded_df

array([2, 0, 1, 1, 2, 0])

In [40]:
le.transform(['blue'])


array([0])

In [46]:
le.transform(['green'])


array([1])

In [44]:
df.shape

(6, 1)

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
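
The mapping above can be sketched as a dictionary built from an explicit order. The names `education_order`, `rank_of`, and the sample values are illustrative. The ranks start at 1 to match the list; scikit-learn's `OrdinalEncoder` numbers categories from 0 instead.

```python
# The list order is the ranking: earlier entries get smaller values
education_order = ["high school", "college", "graduate", "post-graduate"]
rank_of = {level: rank for rank, level in enumerate(education_order, start=1)}

# Hypothetical sample values to encode
samples = ["college", "high school", "post-graduate", "graduate"]
encoded = [rank_of[level] for level in samples]

print(rank_of)
print(encoded)
```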

In [47]:
from sklearn.preprocessing import OrdinalEncoder

In [48]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [49]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [52]:
encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [53]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [54]:
encoder.transform([['large']])



array([[2.]])

### Target Guided Ordinal Encoding
Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in a machine learning model.

Each category is replaced with a numerical value derived from the mean or median of the target variable for that category. The encoded values then rise with the typical target value of each category, a monotonic relationship that can improve the predictive power of the model.
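
One variant of this idea replaces each category with its rank after sorting the categories by mean target value, rather than with the mean itself. The sketch below uses a made-up `plan`/`spend` dataset; all names and values in it are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a numeric target
df_plans = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "enterprise", "pro", "enterprise"],
    "spend": [10, 30, 20, 90, 50, 70],
})

# Mean target per category, sorted from lowest to highest
mean_spend = df_plans.groupby("plan")["spend"].mean().sort_values()

# Rank 1 goes to the category with the lowest mean target
rank_of = {plan: rank for rank, plan in enumerate(mean_spend.index, start=1)}
df_plans["plan_rank"] = df_plans["plan"].map(rank_of)

print(mean_spend)
print(df_plans)
```

Using ranks keeps the ordering of the categories but discards the size of the gaps between their means; mapping to the means directly keeps both.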

In [55]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [56]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [60]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [61]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [62]:
df['city_encoded'] = df['city'].map(mean_price)

In [63]:
df['city_encoded']

0    190.0
1    150.0
2    310.0
3    250.0
4    190.0
5    310.0
Name: city_encoded, dtype: float64

In [64]:
df[['price','city_encoded']]

Unnamed: 0,price,city_encoded
0,200,190.0
1,150,150.0
2,300,310.0
3,250,250.0
4,180,190.0
5,320,310.0


In [65]:
import seaborn as sns
df = sns.load_dataset('tips')

In [66]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [70]:
# 'day' is a categorical column; pass observed explicitly to avoid the pandas FutureWarning
dict_df = df.groupby('day', observed=True)['total_bill'].mean().to_dict()


In [71]:
dict_df

{'Thur': 17.682741935483868,
 'Fri': 17.15157894736842,
 'Sat': 20.44137931034483,
 'Sun': 21.41}

In [72]:
df['encoded_df']= df['day'].map(dict_df)

In [73]:
pd.concat([df['encoded_df'],df['day']],axis=1)

Unnamed: 0,encoded_df,day
0,21.410000,Sun
1,21.410000,Sun
2,21.410000,Sun
3,21.410000,Sun
4,21.410000,Sun
...,...,...
239,20.441379,Sat
240,20.441379,Sat
241,20.441379,Sat
242,20.441379,Sat
