## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

### Nominal/OHE Encoding

One-hot encoding (OHE), also known as nominal encoding, is a technique for representing categorical data as numerical data, which is more suitable for ML algorithms. Each category is represented as a binary vector in which each position corresponds to one unique category, and only the position of the observed category is set to 1. For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it with one-hot encoding as follows:

1. Red:[1,0,0]
2. Green:[0,1,0]
3. Blue:[0,0,1]

### Disadvantage: 

One-hot encoding adds one column per category. If a variable has many categories, say 50, encoding it creates 50 new features (columns), which makes the dataset wide and sparse.
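To make the column growth concrete, here is a minimal sketch using `pd.get_dummies` on a made-up column. The column name `feature` and the category names `cat_0` to `cat_49` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical column with 50 distinct categories (names are made up)
categories = [f"cat_{i}" for i in range(50)]
demo = pd.DataFrame({"feature": categories})

# One-hot encode with pandas; each distinct category becomes its own column
one_hot = pd.get_dummies(demo["feature"])

n_rows, n_cols = one_hot.shape
print(n_rows, n_cols)  # one column per category
```

A single categorical column has turned into as many columns as it has distinct values.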

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [5]:
## Create a simple dataframe
df = pd.DataFrame({
    'Color':['Red','Blue','Green','Green']
})

In [6]:
df.head()

Unnamed: 0,Color
0,Red
1,Blue
2,Green
3,Green


In [7]:
## Create an instance of OneHotEncoder

encoder = OneHotEncoder()

In [10]:
## Perform fit and transform

encoded=encoder.fit_transform(df[['Color']]).toarray()

In [11]:
## Build a dataframe from the encoded array, using the generated column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [12]:
encoder_df

Unnamed: 0,Color_Blue,Color_Green,Color_Red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0


In [None]:
## For new data
encoder.transform([['Blue']]).toarray()



array([[1., 0., 0.]])
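By default, `OneHotEncoder` raises an error when `transform` sees a category that was not present during `fit`. The sketch below shows one possible way to tolerate unseen values, using the `handle_unknown="ignore"` option. The value "Purple" and the names `tolerant_encoder` and `new_data` are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Green"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["Color"]])

# "Purple" was never seen during fit (hypothetical new value)
new_data = pd.DataFrame({"Color": ["Blue", "Purple"]})
encoded_new = tolerant_encoder.transform(new_data[["Color"]]).toarray()
print(encoded_new)
```

The known category keeps its usual vector, while the unseen one maps to all zeros instead of stopping the pipeline.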

In [16]:
pd.concat([df,encoder_df],axis=1)

Unnamed: 0,Color,Color_Blue,Color_Green,Color_Red
0,Red,0.0,0.0,1.0
1,Blue,1.0,0.0,0.0
2,Green,0.0,1.0,0.0
3,Green,0.0,1.0,0.0


### Label Encoding

Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories (scikit-learn's LabelEncoder sorts the categories and numbers them starting from 0). For example, if we have a categorical variable "color" with three possible values (red, green, blue), one possible label encoding is:

1. Red:1
2. Green:2
3. Blue:3

In [17]:
df.head()

Unnamed: 0,Color
0,Red
1,Blue
2,Green
3,Green


In [18]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [None]:
lbl_encoder.fit_transform(df['Color']) # LabelEncoder expects a 1-D input; labels are assigned alphabetically

array([2, 0, 1, 1])

In [20]:
lbl_encoder.transform(['Red'])

array([2])

### Disadvantage:
The label encoder assigns 'Red' to 2, 'Green' to 1, and 'Blue' to 0. A model may treat these integers as ordered and conclude that 'Red' is greater than 'Green' and 'Blue' (because 2 > 1 > 0), even though the colors have no natural order.
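To see exactly which integer each category received, the fitted encoder's `classes_` attribute can be inspected. This is a small sketch on the same color values; the variable names `demo_encoder`, `codes`, `mapping`, and `decoded` are introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Blue", "Green", "Green"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(colors)

# classes_ holds the sorted categories; the position of each one is its label
mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original categories from the integer labels
decoded = demo_encoder.inverse_transform(codes)
print(list(decoded))
```

The mapping is purely alphabetical, which is why the implied ordering carries no real meaning for nominal data.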

### Ordinal Encoding

Ordinal encoding is used for categorical data that has an intrinsic order or ranking. In this technique, each category is assigned a numerical value based on its position in the order. For example, if we have a categorical variable 'education level' with three possible values (high school, college, graduate), we can represent it using ordinal encoding as follows:

1. High school:1
2. College:2
3. Graduate:3


In [21]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [22]:
# Create a sample dataframe with an ordinal variable

df = pd.DataFrame({
    'size': ['small', 'medium', 'large','medium','small']
})
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small


In [23]:
# Create an instance of Ordinal Encoder and then fit_transform

encoder = OrdinalEncoder(categories = [['small','medium','large']])
encoder

In [24]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.]])

In [26]:
encoder.transform([['small']])



array([[0.]])
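With an explicit category list, a value outside that list causes an error at transform time by default. One possible way to handle this, sketched below, is the `handle_unknown="use_encoded_value"` option together with `unknown_value`. The value "huge", the sentinel -1, and the name `safe_encoder` are assumptions chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "medium", "small"]})

# Unknown categories are encoded as -1 instead of raising an error
safe_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
safe_encoder.fit(sizes[["size"]])

# "huge" is not in the category list (hypothetical unseen value)
new_sizes = pd.DataFrame({"size": ["small", "huge"]})
encoded_sizes = safe_encoder.transform(new_sizes[["size"]])
print(encoded_sizes)
```

Using a sentinel outside the normal range keeps unknown values distinguishable from the real ranks.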

### Target Guided Ordinal Encoding

Target guided ordinal encoding is a technique for encoding categorical variables based on their relationship with the target variable. It is useful when we have a categorical variable with a large number of unique categories and we want to use this variable as a feature in our ML model.

In target guided ordinal encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the encoded variable and the target variable, which can improve the predictive power of our model.

In [27]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable

df = pd.DataFrame({
    'city': ["New York","London", "Paris", "Tokyo","New York", "Paris"],
    'Price': [200,150,300,250,180,300]
})

In [28]:
df

Unnamed: 0,city,Price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,300


In [30]:
mean_price = df.groupby('city')['Price'].mean().to_dict()
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 300.0, 'Tokyo': 250.0}

In [32]:
df['City_encoded']= df['city'].map(mean_price)
df

Unnamed: 0,city,Price,City_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,300.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,300,300.0
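When the learned mapping is applied to new rows, a city that did not appear in the original data has no mean to look up. The sketch below shows one common fallback, which is an assumption here and not part of the notebook: fill such rows with the overall mean of the target. The city "Berlin" and the names `train`, `new_rows`, and `global_mean` are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "Price": [200, 150, 300, 250, 180, 300],
})

# Per-city mean of the target, plus the overall mean as a fallback
mean_price = train.groupby("city")["Price"].mean()
global_mean = train["Price"].mean()

# "Berlin" was not in the training data (hypothetical unseen city)
new_rows = pd.DataFrame({"city": ["Paris", "Berlin"]})
new_rows["City_encoded"] = new_rows["city"].map(mean_price).fillna(global_mean)
print(new_rows)
```

Computing the means on training data only, and then applying them to new data, also avoids leaking target information from the rows being predicted.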


In [34]:
df[['Price','City_encoded']]

Unnamed: 0,Price,City_encoded
0,200,190.0
1,150,150.0
2,300,300.0
3,250,250.0
4,180,190.0
5,300,300.0
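The example above maps each city directly to its mean price. A variant that is ordinal in the strict sense ranks the categories by their mean target value and uses the rank as the encoding. This is a minimal sketch of that variant; the names `ranked`, `rank_map`, and `city_rank` are introduced here for illustration.

```python
import pandas as pd

df_rank = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "Price": [200, 150, 300, 250, 180, 300],
})

# Sort cities by mean price, lowest first
ranked = df_rank.groupby("city")["Price"].mean().sort_values()

# Assign integer ranks in that order: the cheapest city gets 0
rank_map = {city: rank for rank, city in enumerate(ranked.index)}
df_rank["city_rank"] = df_rank["city"].map(rank_map)
print(df_rank)
```

The ranks keep the ordering induced by the target while discarding the exact mean values.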
