## Data Encoding

Data encoding techniques convert categorical data into a numerical format so that machine learning algorithms can process it.

Here are some common encoding techniques:
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding

One-hot encoding (OHE), also known as nominal encoding, represents a categorical variable as a set of binary columns, one per unique category. Each value becomes a binary vector with a 1 in the position of its category and 0 everywhere else. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
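
The mapping above can be reproduced by hand. The following is a minimal sketch in plain Python; `one_hot` is a hypothetical helper written for illustration, not a library function.

```python
# Hypothetical helper (not part of any library): build a one-hot vector by hand.
categories = ["red", "green", "blue"]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


# Encode every known category once to see the full mapping.
encoded = {value: one_hot(value, categories) for value in categories}
print(encoded)
```

The order of `categories` decides which position each value occupies, so the same list has to be reused whenever new data is encoded.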

#### Disadvantages of One Hot Encoding
1. High Dimensionality: One-hot encoding adds one column per unique category, so a variable with many unique values produces a high-dimensional feature space. This increases computational cost and memory usage.
2. Sparsity: The resulting binary vectors are sparse, meaning that most of the values are zero. Stored as dense arrays they waste memory and computation, and not every machine learning algorithm handles sparse input efficiently.
3. Loss of Ordinal Information: One-hot encoding does not capture any ordinal relationship between categories (e.g., small, medium, large). If the categories have a natural order, one-hot encoding may not be the best choice.
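
To make the first two points concrete, here is a small sketch that one-hot encodes a column with 1,000 distinct values. The `user_id` column and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 1,000 distinct values (for example, user IDs).
ids = pd.DataFrame({"user_id": [f"user_{i}" for i in range(1000)]})

# By default, OneHotEncoder returns a SciPy sparse matrix.
encoded = OneHotEncoder().fit_transform(ids)

n_rows, n_cols = encoded.shape
density = encoded.nnz / (n_rows * n_cols)  # fraction of non-zero entries
print(n_rows, n_cols, encoded.nnz, density)
```

One input column becomes as many output columns as there are unique values, and each row holds a single non-zero entry. Keeping the result sparse, rather than calling `.toarray()`, avoids storing all of the zeros.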
    

In [7]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [9]:
## Create simple dataframe
df = pd.DataFrame({
    'color': ['red', 'green', 'yellow', 'blue', 'pink', 'red']
})

In [10]:
df.head()

|   | color  |
|---|--------|
| 0 | red    |
| 1 | green  |
| 2 | yellow |
| 3 | blue   |
| 4 | pink   |


In [11]:
## Create a OneHotEncoder instance
encoder = OneHotEncoder()

In [12]:
## Fit and transform; .toarray() converts the sparse result to a dense array
encoder.fit_transform(df[['color']]).toarray()

array([[0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.]])

In [13]:
encoded = encoder.fit_transform(df[['color']]).toarray()

In [14]:
## Wrap the encoded array in a DataFrame, using the encoder's column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [16]:
encoder_df

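|   | color_blue | color_green | color_pink | color_red | color_yellow |
|---|------------|-------------|------------|-----------|--------------|
| 0 | 0.0        | 0.0         | 0.0        | 1.0       | 0.0          |
| 1 | 0.0        | 1.0         | 0.0        | 0.0       | 0.0          |
| 2 | 0.0        | 0.0         | 0.0        | 0.0       | 1.0          |
| 3 | 1.0        | 0.0         | 0.0        | 0.0       | 0.0          |
| 4 | 0.0        | 0.0         | 1.0        | 0.0       | 0.0          |
| 5 | 0.0        | 0.0         | 0.0        | 1.0       | 0.0          |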


In [None]:
## Encode new data with the already-fitted encoder
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()

array([[1., 0., 0., 0., 0.]])
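
New data can also contain a category the encoder never saw during fitting. With the default settings, `OneHotEncoder` raises an error in that case. The sketch below shows one possible way to handle it, using the `handle_unknown="ignore"` option; the value `"purple"` is a made-up unseen category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "yellow", "blue", "pink", "red"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error (the default is handle_unknown="error").
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["color"]])

new_data = pd.DataFrame({"color": ["blue", "purple"]})  # "purple" was never seen
new_encoded = tolerant_encoder.transform(new_data[["color"]]).toarray()
print(new_encoded)
```

Whether an all-zero row is acceptable depends on the model; raising an error can be the safer choice when unseen categories point to a data problem.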

In [25]:
pd.concat([df, encoder_df], axis=1)

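|   | color  | color_blue | color_green | color_pink | color_red | color_yellow |
|---|--------|------------|-------------|------------|-----------|--------------|
| 0 | red    | 0.0        | 0.0         | 0.0        | 1.0       | 0.0          |
| 1 | green  | 0.0        | 1.0         | 0.0        | 0.0       | 0.0          |
| 2 | yellow | 0.0        | 0.0         | 0.0        | 0.0       | 1.0          |
| 3 | blue   | 1.0        | 0.0         | 0.0        | 0.0       | 0.0          |
| 4 | pink   | 0.0        | 0.0         | 1.0        | 0.0       | 0.0          |
| 5 | red    | 0.0        | 0.0         | 0.0        | 1.0       | 0.0          |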
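
For quick exploration, pandas can produce the same columns in one step with `pd.get_dummies`. This is an alternative sketch, not a replacement for the fitted encoder used above.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "yellow", "blue", "pink", "red"]})

# get_dummies one-hot encodes the listed columns and replaces them in the result;
# dtype=float mirrors the 0.0/1.0 values produced by OneHotEncoder.
dummies = pd.get_dummies(colors, columns=["color"], dtype=float)

# Join back to the original column for a side-by-side view.
combined = pd.concat([colors, dummies], axis=1)
print(combined)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so calling it on new data with different values yields a different set of columns.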


In [26]:
import seaborn as sns
sns.load_dataset('tips')

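|     | total_bill | tip  | sex    | smoker | day | time   | size |
|-----|------------|------|--------|--------|-----|--------|------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    |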


### Label Encoding 

Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical (sorted) order or based on the frequency of the categories. scikit-learn's `LabelEncoder` sorts the classes and numbers them from 0, and it is intended for encoding target labels rather than input features. For example, a categorical variable "color" with three possible values (red, green, blue) could be label encoded as follows:

1. Red: 1
2. Green: 2
3. Blue: 3
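
The exact integers depend on the assignment rule. As a minimal sketch, the sorted-order rule (sort the unique categories, then number them from 0) can be written in plain Python; the `colors` list is illustrative data.

```python
colors = ["red", "green", "yellow", "blue", "pink", "red"]

# Sort the unique categories, then number them from 0.
classes = sorted(set(colors))
label_of = {category: index for index, category in enumerate(classes)}

# Replace every value with its integer label.
codes = [label_of[color] for color in colors]
print(label_of)
print(codes)
```

The integers carry no meaning beyond identity: "red" receiving a larger number than "blue" is only a side effect of alphabetical order.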

#### Disadvantages of Label Encoding
1. Ordinal Relationship: Label encoding can create an ordinal relationship between categories, which may not be appropriate for all categorical variables. For example, if we have a categorical variable "color" with values (red, green, blue), assigning numerical labels may imply an order that does not exist.
2. Model Interpretation: The numerical labels assigned to categories may not have any meaningful interpretation, which can make it difficult to interpret the results of machine learning models.

In [27]:
df.head()

|   | color  |
|---|--------|
| 0 | red    |
| 1 | green  |
| 2 | yellow |
| 3 | blue   |
| 4 | pink   |


In [28]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [29]:
## LabelEncoder expects a 1-D input, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])

array([3, 1, 4, 0, 2, 3])

In [30]:
lbl_encoder.transform(['red'])

array([3])

In [31]:
lbl_encoder.transform(['pink'])

array([2])

In [33]:
lbl_encoder.transform(['yellow'])

array([4])
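
Instead of probing one value at a time, the full mapping can be read from the fitted encoder. The sketch below uses the `classes_` attribute and `inverse_transform`; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "yellow", "blue", "pink", "red"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted categories; the position of each one is its label.
mapping = {str(category): code for code, category in enumerate(le.classes_)}

# inverse_transform converts integer labels back to the original categories.
decoded = le.inverse_transform([3, 0])

print(mapping)
print(list(decoded))
```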

### Ordinal Encoding

Ordinal encoding is used to encode categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
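
The same mapping can be applied with a plain dictionary. This is a minimal sketch using pandas; the `education` values are made up for illustration, and the order in `levels` is supplied by hand rather than inferred from the data.

```python
import pandas as pd

# Categories listed from lowest to highest; the order is supplied by us.
levels = ["high school", "college", "graduate", "post-graduate"]
rank_of = {level: position for position, level in enumerate(levels, start=1)}

education = pd.Series(["college", "high school", "post-graduate", "graduate"])

# Replace each category with its position in the order.
education_encoded = education.map(rank_of)
print(education_encoded.tolist())
```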

#### Disadvantages of Ordinal Encoding
1. Assumption of Equal Intervals: Ordinal encoding implicitly treats the intervals between consecutive categories as equal, which may not be the case in reality. For example, the difference between "high school" and "college" may not be the same as the difference between "graduate" and "post-graduate".
2. Model Interpretation: Similar to label encoding, the numerical values assigned to categories may not have any meaningful interpretation, which can make it difficult to interpret the results of machine learning models.

In [68]:
from sklearn.preprocessing import OrdinalEncoder

In [70]:
## create a sample dataframe with an ordinal variable..
df = pd.DataFrame({
    'size':['small', 'medium', 'large', 'medium', 'x-large', 'xx-large', 'small', 'large']
})

In [71]:
df.head()

|   | size    |
|---|---------|
| 0 | small   |
| 1 | medium  |
| 2 | large   |
| 3 | medium  |
| 4 | x-large |


In [72]:
## Create an OrdinalEncoder instance with the category order given explicitly, then fit_transform
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large', 'x-large', 'xx-large']])

In [73]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [3.],
       [4.],
       [0.],
       [2.]])

In [74]:
encoder.transform([['large']])



array([[2.]])

In [75]:
encoder.transform([['xx-large']])



array([[4.]])
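
The explicit `categories` argument matters. As a sketch of what happens without it, the encoder below is left at its defaults and falls back to sorted order, which does not match the real size order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({
    "size": ["small", "medium", "large", "medium", "x-large", "xx-large", "small", "large"]
})

# No `categories` argument: the encoder sorts the unique values alphabetically.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes[["size"]])

learned_order = default_encoder.categories_[0].tolist()
print(learned_order)
print(default_codes.ravel())
```

Here "large" is encoded as 0 and "small" as 2, so the numeric order no longer reflects the actual sizes.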

### Target Guided Ordinal Encoding

Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In target guided ordinal encoding, each category is replaced with a numerical value derived from the mean or median of the target variable for that category: either the statistic itself or the category's rank when the categories are sorted by that statistic. This creates a monotonic relationship between the encoded variable and the target variable, which can improve the predictive power of the model.

In [None]:
## create dataframe
df = pd.DataFrame({
    'city':['Delhi', 'Noida', 'Varansi', 'Patna', 'Delhi', 'Patna', 'Banglore'],
    'price':[500, 230, 400, 800, 700, 100, 600]
})

In [None]:
df

|   | city     | price |
|---|----------|-------|
| 0 | Delhi    | 500   |
| 1 | Noida    | 230   |
| 2 | Varansi  | 400   |
| 3 | Patna    | 800   |
| 4 | Delhi    | 700   |
| 5 | Patna    | 100   |
| 6 | Banglore | 600   |


In [55]:
df.groupby('city')['price'].mean()

city
Banglore    600.0
Delhi       600.0
Noida       230.0
Patna       450.0
Varansi     400.0
Name: price, dtype: float64

In [56]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [57]:
mean_price

{'Banglore': 600.0,
 'Delhi': 600.0,
 'Noida': 230.0,
 'Patna': 450.0,
 'Varansi': 400.0}

In [59]:
df['city_encoded'] = df['city'].map(mean_price)

In [60]:
df

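|   | city     | price | city_encoded |
|---|----------|-------|--------------|
| 0 | Delhi    | 500   | 600.0        |
| 1 | Noida    | 230   | 230.0        |
| 2 | Varansi  | 400   | 400.0        |
| 3 | Patna    | 800   | 450.0        |
| 4 | Delhi    | 700   | 600.0        |
| 5 | Patna    | 100   | 450.0        |
| 6 | Banglore | 600   | 600.0        |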
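
The cells above map each city to its mean price directly. If integer ranks are preferred instead, one possible sketch is to rank the cities by mean price. The column name `city_rank` and the choice of `method="dense"` for ties are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Noida", "Varansi", "Patna", "Delhi", "Patna", "Banglore"],
    "price": [500, 230, 400, 800, 700, 100, 600],
})

mean_price = df.groupby("city")["price"].mean()

# Rank cities by mean price; method="dense" gives tied cities the same rank
# and leaves no gaps between consecutive ranks.
city_rank = mean_price.rank(method="dense").astype(int).to_dict()

df["city_rank"] = df["city"].map(city_rank)
print(city_rank)
```

Delhi and Banglore share the same mean price, so they receive the same rank.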


In [None]:
df[['price', 'city_encoded']]  ## these columns are used for modeling

|   | price | city_encoded |
|---|-------|--------------|
| 0 | 500   | 600.0        |
| 1 | 230   | 230.0        |
| 2 | 400   | 400.0        |
| 3 | 800   | 450.0        |
| 4 | 700   | 600.0        |
| 5 | 100   | 450.0        |
| 6 | 600   | 600.0        |
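
A dictionary lookup has no entry for a city that was absent when the means were computed, so `map` returns `NaN` for it. The sketch below shows one possible fallback, the overall mean price. Both the fallback choice and the city "Mumbai" are assumptions for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["Delhi", "Noida", "Varansi", "Patna", "Delhi", "Patna", "Banglore"],
    "price": [500, 230, 400, 800, 700, 100, 600],
})

mean_price = train.groupby("city")["price"].mean().to_dict()
global_mean = train["price"].mean()  # fallback for unseen cities (an assumption)

new_cities = pd.Series(["Delhi", "Mumbai"])  # "Mumbai" is hypothetical, not in train

# Unseen cities map to NaN; replace those with the overall mean price.
new_encoded = new_cities.map(mean_price).fillna(global_mean)
print(new_encoded.tolist())
```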
