**Data Encoding**
Data Encoding means converting a set of categorical values into numerical values so that a Machine Learning algorithm can use them.
e.g. Assume you have the following data:

Experience  Degree                      Salary
(input)     (input)                     (output)
            (categorical variable)
8 years     BE
            PHD
            Masters
Based on the input variables, we have to predict the output variable, Salary.
A Machine Learning algorithm will have a hard time understanding the categorical values of Degree (BE, PHD, etc.), because most algorithms expect numbers as input.
So we convert these categorical values into numerical values. This is known as Encoding.
e.g. BE = (1,0,0), PHD = (0,1,0), Masters = (0,0,1)

**Sparse Matrix**
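- A matrix in which most of the elements are 0 is known as a Sparse Matrix.
- One Hot Encoding produces such a matrix, because each row contains a single 1 and 0's everywhere else.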
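
As a rough illustration of the definition above, the sketch below builds a small one-hot style matrix and stores it in SciPy's CSR sparse format. The matrix values and variable names are made up for this example; only the non-zero entries are actually stored.

```python
import numpy as np
from scipy import sparse

# A small one-hot style matrix: exactly one 1 per row, 0's everywhere else
dense = np.array([
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
])

# CSR format keeps only the non-zero values and their positions
sparse_matrix = sparse.csr_matrix(dense)

total_elements = dense.size            # 4 rows * 4 columns
stored_elements = sparse_matrix.nnz    # number of non-zero values actually stored
fraction_of_zeros = 1 - stored_elements / total_elements

print("stored non-zero values:", stored_elements)
print("fraction of zeros:", fraction_of_zeros)
```

Calling `.toarray()` on the sparse matrix gives back the dense version, which is what the One Hot Encoding cell further down does.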

**Over Fitting**
Overfitting happens when a machine learning model learns the training data too well, including its noise, mistakes and random fluctuations. The model memorizes the training data instead of learning general patterns, so it performs very well on training data but poorly on new/unseen data.
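
A minimal sketch of this effect, assuming synthetic data (a sine curve plus random noise) and a decision tree with no depth limit. The data, seed and model choice are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a real pattern (sine curve) plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until it has memorized the training points
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_score = deep_tree.score(X_train, y_train)   # R^2 on data the model has seen
test_score = deep_tree.score(X_test, y_test)      # R^2 on unseen data

print("train R^2:", train_score)
print("test R^2:", test_score)
```

The gap between the two scores is the sign of overfitting. Limiting the tree (for example with `max_depth`) often narrows that gap, at the cost of a lower training score.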

**Fit and Transform**
- In Machine Learning, 'fit' and 'transform' are two key steps used in data pre-processing (esp. with scikit-learn).
- fit = Learn the rules from the data
    - The transformer learns some values from your dataset.
    - e.g. mean, median, mode, standard deviation, unique categories, what values are missing
- transform = Apply the rules learnt during 'fit' to the data
- fit_transform = Do both steps in one call on the same data
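
The two steps can be sketched with scikit-learn's `StandardScaler`. The numbers below are made up; the point is that the mean and standard deviation are learnt once from the training data and then reused on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])   # data used to learn the rules
new_data = np.array([[20.0], [40.0]])        # data seen later

scaler = StandardScaler()

# fit: learn the mean and standard deviation of the training data
scaler.fit(train)
print("learnt mean:", scaler.mean_)
print("learnt std:", scaler.scale_)

# transform: apply the learnt mean/std, without re-learning anything
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new_data)
print("new data scaled with the training statistics:")
print(new_scaled)
```

Note that `new_data` is only transformed, never fitted. Fitting again on new data would change the rules and make the values inconsistent with what the model was trained on.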

**Types of Encodings**
**1) One Hot Encoding (OHE) / Nominal Encoding**
- One Hot Encoding is also known as Nominal Encoding, because it is used for nominal categories, i.e. categories that have no inherent order.
- It is a technique used to represent categorical data as numerical data, which is more suitable for machine learning algorithms.
- In this technique, each category is represented as a binary vector where each bit corresponds to a unique category.
- For Example
Red     Green       Blue    Encoded value       Color
  1        0          0        1,0,0              Red
  0        1          0        0,1,0              Green
  0        0          1        0,0,1              Blue

**Disadvantages of One Hot Encoding / Nominal Encoding**
1) Too many columns
If a categorical variable has 100 different categories, One Hot Encoding will create 100 new columns, one for each category. This makes the dataset much wider and slower to process.

2) Overfitting
With so many mostly-zero columns (a Sparse Matrix, as in the case of OHE) and comparatively few rows, the model can fit noise more easily, which can lead to Overfitting.
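
To make the first disadvantage concrete, here is a small sketch with a hypothetical 'city' column holding 100 distinct values (`city_0` to `city_99` are invented names).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 100 different categories
df_cities = pd.DataFrame({'city': [f'city_{i}' for i in range(100)]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df_cities[['city']])   # returned as a sparse matrix

n_rows, n_columns = encoded.shape
print("rows:", n_rows)
print("columns created by One Hot Encoding:", n_columns)
print("non-zero values:", encoded.nnz, "out of", n_rows * n_columns)
```

One input column turns into one column per category, and almost every value in the result is 0.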



**2) Label Encoding**
- Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

- Label encoding involves assigning a unique numerical label to each category in the variable.
- The labels are usually assigned in alphabetical order or based on the frequency of the categories. (scikit-learn's LabelEncoder uses sorted order and starts counting from 0.)
- E.g. If we have a categorical variable 'color' with three possible values (red, green, blue), we can represent it using Label Encoding as follows:
1) Red : 1
2) Green : 2
3) Blue : 3

**Disadvantages of Label Encoding**
 - In Label Encoding, if the categories Red, Green, Blue are assigned encodings 1, 2, 3 respectively, then the ML model might think that 3 is greater than 2, and 2 is greater than 1.
 - In such a case it might rank Blue higher than Red, but in our case all categories are equal.
 - For categories with no natural order, One Hot Encoding avoids this problem. When the categories do have a natural order, Ordinal Encoding lets us choose the numbers so that they match that order.
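
The sketch below shows why the numbers can mislead a model. It compares the "distance" between colors under the label encoding from the example above and under a one-hot encoding. The dictionaries are written by hand for illustration.

```python
import numpy as np

# Label encoding from the example above
label = {'Red': 1, 'Green': 2, 'Blue': 3}

# One-hot encoding of the same three colors
one_hot = {
    'Red': np.array([1, 0, 0]),
    'Green': np.array([0, 1, 0]),
    'Blue': np.array([0, 0, 1]),
}

# With labels, Blue looks twice as far from Red as Green does
label_red_green = abs(label['Red'] - label['Green'])
label_red_blue = abs(label['Red'] - label['Blue'])
print("label distances:", label_red_green, label_red_blue)

# With one-hot vectors, every pair of colors is equally far apart
ohe_red_green = np.linalg.norm(one_hot['Red'] - one_hot['Green'])
ohe_red_blue = np.linalg.norm(one_hot['Red'] - one_hot['Blue'])
print("one-hot distances:", ohe_red_green, ohe_red_blue)
```

Models that rely on magnitudes or distances (for example linear models or k-nearest neighbours) can pick up the artificial ordering in the label-encoded version.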


**3) Ordinal Encoding**
- Ordinal Encoding is used to encode categorical data that have an intrinsic order or ranking.
- In this technique, each category is assigned a numerical value based on its position in the order.
- E.g. If we have a categorical variable 'education level' with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:
1) High School : 1
2) College: 2
3) Graduate: 3
4) Post-graduate: 4

- We provide the ranking ourselves by listing the categories from lowest to highest, e.g. assigning a ranking to the sizes:
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

**4) Target Guided Ordinal Encoding**
- It is a technique used to encode a categorical variable based on its relationship with another variable, known as the 'target variable'.
- This encoding technique is useful when we have a categorical variable with a large number of unique categories and we want to use this variable as a feature in our machine learning model.
- In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category.
- This captures the relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

In [3]:
## ************ 1) One Hot Encoding using sklearn ************
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

print(df)

# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the data using One Hot Encoder
# fit_transform returns a Sparse Matrix; .toarray() converts it to a regular (dense) array
# fit will learn the unique categories from the data
# transform will apply the learned encoding to the data
# The columns follow alphabetical order: Blue first, Green second, Red third
# Blue = 1,0,0
# Green = 0,1,0
# Red = 0,0,1
# This data can now be used to train ML models
encoded_array = encoder.fit_transform(df[['Color']]).toarray()
print("encoded_array",encoded_array)

# Put the encoded array into a DataFrame with readable column names
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(['Color']))
print('encoded_df',encoded_df)

# Transform new data (Green) present in training data
encoded_df_green = encoder.transform([['Green']]).toarray()  # [[0. 1. 0.]]
print('encoded_df_green',encoded_df_green)

# Transform new data (Yellow) not present in training data
# By default (handle_unknown='error'), an unknown category that was not present in the training data raises a ValueError.
# Note: handle_unknown is an argument of the OneHotEncoder constructor, not of transform().
# encoded_df_yellow = encoder.transform([['Yellow']]).toarray()  # raises ValueError because 'Yellow' is not in the training data
# print('encoded_df_yellow',encoded_df_yellow)



   Color
0    Red
1   Blue
2  Green
3   Blue
4    Red
encoded_array [[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
encoded_df    Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         1.0          0.0        0.0
4         0.0          0.0        1.0
encoded_df_green [[0. 1. 0.]]
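
If unknown categories should not raise an error, one option is to pass `handle_unknown='ignore'` when creating the encoder. The sketch below reuses the color data from the cell above; the names `encoder_ignore` and `new_colors` are introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# handle_unknown is set on the encoder itself, before fitting
encoder_ignore = OneHotEncoder(handle_unknown='ignore')
encoder_ignore.fit(df_colors[['Color']])

# 'Green' was seen during fit, 'Yellow' was not
new_colors = pd.DataFrame({'Color': ['Green', 'Yellow']})
encoded_new = encoder_ignore.transform(new_colors).toarray()

# Green  -> [0. 1. 0.]
# Yellow -> [0. 0. 0.]  (unknown category becomes all zeros)
print(encoded_new)
```

The trade-off is that every unknown category looks identical to the model (a row of zeros), so this is a fallback rather than a real encoding of the new value.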




In [5]:
## ************ 2) Label Encoding using sklearn ************
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# Fit and transform the data using Label Encoder
# Blue will be first, green second, and red third in alphabetical order in the array below
# Red = 2
# Blue = 0
# Green = 1
# Output: array([2, 0, 1, 0, 2])
df_label_encoded_data = label_encoder.fit_transform(df['Color'])
print('df_label_encoded_data',df_label_encoded_data)

# Transform new data (Green) present in training data
df_label_encoded_green = label_encoder.transform(['Green'])  # [1]
print('df_label_encoded_green',df_label_encoded_green)

# Transform new data (Yellow) not present in training data
# An unknown category that was not present in the training data raises a ValueError.
# df_label_encoded_yellow = label_encoder.transform(['Yellow'])  # raises ValueError because 'Yellow' is not in the training data
# print('df_label_encoded_yellow',df_label_encoded_yellow)


df_label_encoded_data [2 0 1 0 2]
df_label_encoded_green [1]
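
To check which number was given to which category, the fitted `LabelEncoder` exposes `classes_`, and `inverse_transform` converts the numbers back to the original labels. A short sketch with the same colors as above:

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ lists the categories in sorted order; the position is the label
# Blue = 0, Green = 1, Red = 2
print('classes:', label_encoder.classes_)
print('codes:', codes)

# inverse_transform maps the numbers back to the original categories
decoded = label_encoder.inverse_transform(codes)
print('decoded:', decoded)
```

In scikit-learn, `LabelEncoder` is intended for encoding the target column (y). For input features, `OrdinalEncoder` or `OneHotEncoder` are the usual choices.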


In [6]:
## ************ 3) Ordinal Encoding using sklearn ************
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small','large']
})


# Assigning a ranking to the sizes (listed from lowest to highest)
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

# Fit and transform the data using Ordinal Encoder
# small = 0
# medium = 1
# large = 2
# Output: array([[0.], [1.], [2.], [1.], [0.], [2.]])
encoded_array = encoder.fit_transform(df[['size']])
print("encoded_array", encoded_array)   

# Transform new data (large) present in training data
encoded_large = encoder.transform([['large']])  # [[2.]]
print('encoded_large',encoded_large)

# Transform new data (extra large) not present in training data
# By default, an unknown category that was not present in the training data raises a ValueError.
# encoded_extralarge = encoder.transform([['extra large']])  # raises ValueError because 'extra large' is not in the training data
# print('encoded_extralarge',encoded_extralarge)

encoded_array [[0.]
 [1.]
 [2.]
 [1.]
 [0.]
 [2.]]
encoded_large [[2.]]
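
Instead of raising an error for an unseen size, `OrdinalEncoder` can map it to a placeholder value. The sketch below assumes scikit-learn 0.24 or newer, where `handle_unknown='use_encoded_value'` and `unknown_value` are available. The choice of -1 as the placeholder is arbitrary.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})

# Unknown categories are encoded as -1 instead of raising an error
encoder_unknown = OrdinalEncoder(
    categories=[['small', 'medium', 'large']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
encoder_unknown.fit(df_sizes[['size']])

# 'large' was listed in categories, 'extra large' was not
new_sizes = pd.DataFrame({'size': ['large', 'extra large']})
encoded_new_sizes = encoder_unknown.transform(new_sizes)

# large       -> 2
# extra large -> -1
print(encoded_new_sizes)
```

Note that -1 places 'extra large' below 'small' numerically, which is the opposite of its real meaning. If the new size is expected, adding it to the `categories` list in the right position is the cleaner fix.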




In [8]:
## ************ 4) Target Guided Ordinal Encoding using pandas ************
import pandas as pd

# Categorical variable = city
# Target variable = price
# e.g. we take the mean of the prices for New York (500000, 350000) and replace 'New York' in both rows with this mean = 425000
df_city_data = pd.DataFrame({
    'city':['New York', 'Los Angeles', 'Chicago', 'New York', 'Phoenix'],
    'price':[500000, 450000, 300000, 350000, 400000]
})

# Find the mean prices for each city
df_mean_prices = df_city_data.groupby('city')['price'].mean().to_dict()
print('df_mean_prices',df_mean_prices)

# Replace city names with their mean prices
# Add a new column in df_city_data called 'city_encoded' which will have the mean prices
# The 'city' is the key in this dictionary, and using this key we will put the mean price in the new column.
df_city_data['city_encoded'] = df_city_data['city'].map(df_mean_prices)
print('df_city_data updated with mean prices = ')
print(df_city_data)

df_mean_prices {'Chicago': 300000.0, 'Los Angeles': 450000.0, 'New York': 425000.0, 'Phoenix': 400000.0}
df_city_data updated with mean prices = 
          city   price  city_encoded
0     New York  500000      425000.0
1  Los Angeles  450000      450000.0
2      Chicago  300000      300000.0
3     New York  350000      425000.0
4      Phoenix  400000      400000.0
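
The cell above replaces each city with its mean price. One possible variant, sketched here as an assumption about how the "ordinal" part could be done, is to go one step further and rank the cities by their mean price, so each city gets a small integer instead of the raw mean. The column name `city_rank` is invented for this example.

```python
import pandas as pd

df_city_data = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Phoenix'],
    'price': [500000, 450000, 300000, 350000, 400000]
})

# Mean price per city, sorted from cheapest to most expensive
mean_prices = df_city_data.groupby('city')['price'].mean().sort_values()

# Rank 1 = lowest mean price, rank 4 = highest mean price
city_rank = {city: rank for rank, city in enumerate(mean_prices.index, start=1)}
print('city_rank', city_rank)

# Replace each city name with its rank
df_city_data['city_rank'] = df_city_data['city'].map(city_rank)
print(df_city_data)
```

Because the encoding is computed from the target, it should be learnt from the training rows only and then applied to validation or test rows. Otherwise information about the target leaks into the features.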
