# **Encoding Concept in ML**

In machine learning (ML), **encoding** converts categorical or otherwise non-numeric data into numerical form so that machine learning algorithms can process it. Here are some common types of encoding used in ML:

## **Label Encoding**

This is used for encoding categorical variables into numerical labels. Each unique category is assigned an integer. However, this encoding may not be suitable for nominal (unordered) data, because it imposes an arbitrary order on the categories that a model may treat as meaningful.

Here's an example of how to perform label encoding using Python, specifically using the **LabelEncoder** class from the **scikit-learn** library. Note that scikit-learn documents LabelEncoder as a tool for encoding target labels (y) rather than input features:

In [13]:
from sklearn.preprocessing import LabelEncoder

# Sample categorical data
categories = ['red', 'green', 'blue', 'red', 'blue']

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the categorical data
encoded_labels = label_encoder.fit_transform(categories)

# Print the encoded labels
print("Original categories:", categories)
print("Encoded labels:", encoded_labels)


Original categories: ['red', 'green', 'blue', 'red', 'blue']
Encoded labels: [2 1 0 2 0]
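
The output above assigns 0 to 'blue' and 2 to 'red' because the classes are sorted before being numbered. The following is a minimal pure-Python sketch of that behaviour, not scikit-learn's actual implementation; the names `classes`, `category_to_label`, and `decoded_categories` are illustrative.

```python
# Sample categorical data (same values as the LabelEncoder example)
categories = ['red', 'green', 'blue', 'red', 'blue']

# Sorted unique values; these stand in for the encoder's learned classes
classes = sorted(set(categories))

# Map each category to its position in the sorted list
category_to_label = {category: index for index, category in enumerate(classes)}

# Encode, then decode again to show that the mapping is reversible
encoded_labels = [category_to_label[category] for category in categories]
decoded_categories = [classes[label] for label in encoded_labels]

print("Classes:", classes)
print("Encoded labels:", encoded_labels)
print("Decoded categories:", decoded_categories)
```

The integers reflect alphabetical position only, which is why they should not be read as a meaningful ranking of the colours.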


## **One-Hot Encoding**

In this technique, each categorical variable is converted into a binary vector where each category is represented by a binary value (0 or 1). Each category gets its own dimension in the vector, and only one of them is "hot" (set to 1) while the others are "cold" (set to 0). One-hot encoding is suitable for categorical variables without an inherent order.

In this example, we have a list of categorical data representing different types of animals. We want to perform one-hot encoding on this data.

* We import the necessary libraries: OneHotEncoder from sklearn.preprocessing and numpy as np.

* We define our sample categorical data in the categories list.

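* We initialize the OneHotEncoder object and request a dense array as output. In scikit-learn 1.2 and later this is done with sparse_output=False; older releases used the parameter name sparse instead.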

* Since OneHotEncoder expects a 2D array as input, we reshape the categories list into a 2D numpy array using reshape(-1, 1).

* We then use the fit_transform() method of OneHotEncoder to fit the encoder to the data and transform it into one-hot encoded format.

* Finally, we print both the original categories and the one-hot encoded data.

The output will be a binary matrix where each row represents one observation (in this case, one animal) and each column represents a category. The encoder sorts the categories, so the columns appear in alphabetical order: bird, cat, dog. A value of 1 indicates the presence of that category (animal type), while a value of 0 indicates its absence.




In [3]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data
categories = ['cat', 'dog', 'bird', 'cat', 'bird']

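# Initialize the OneHotEncoder with dense (non-sparse) output
# Note: scikit-learn 1.2+ uses sparse_output; older releases used sparse=False
onehot_encoder = OneHotEncoder(sparse_output=False)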

# Reshape the data to fit the expected shape for OneHotEncoder
categories_array = np.array(categories).reshape(-1, 1)

# Fit and transform the categorical data
onehot_encoded = onehot_encoder.fit_transform(categories_array)

# Print the one-hot encoded data
print("Original categories:", categories)
print("One-hot encoded data:")
print(onehot_encoded)


Original categories: ['cat', 'dog', 'bird', 'cat', 'bird']
One-hot encoded data:
[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]]
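
To make the column layout of the matrix above explicit, here is a small pure-Python sketch of one-hot encoding. It is an illustration under the assumption that columns follow the sorted category order, and the names `columns` and `one_hot_rows` are hypothetical.

```python
# Sample categorical data (same values as the OneHotEncoder example)
categories = ['cat', 'dog', 'bird', 'cat', 'bird']

# One column per unique category, in sorted order
columns = sorted(set(categories))

# For each observation, set 1 in the matching column and 0 elsewhere
one_hot_rows = [
    [1 if category == column else 0 for column in columns]
    for category in categories
]

print("Columns:", columns)
for category, row in zip(categories, one_hot_rows):
    print(category, row)
```

Each row contains exactly one 1, and the number of columns grows with the number of distinct categories, which is the main cost of this encoding for high-cardinality variables.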




## **Ordinal Encoding**

This is similar to label encoding but preserves the ordinal relationship between categories. Each category is mapped to an integer value based on its order.

To perform ordinal encoding in Python, you can utilize the **OrdinalEncoder** class from scikit-learn.

In this example:

* We have a list of ordinal data representing different levels (e.g., 'low', 'medium', 'high').

* We specify the order of categories using the category_order list.

* We initialize the OrdinalEncoder object with the specified categories order.

* Since OrdinalEncoder expects a 2D array as input, we reshape the categories list into a 2D list using a list comprehension.

* We then use the fit_transform() method of OrdinalEncoder to fit the encoder to the data and transform it into ordinal encoded format.

* Finally, we print both the original categories and the ordinal encoded data.

The output will be a 2D array with one column, holding the ordinal code of each category in the specified order. The values are floats (0., 1., 2.) because OrdinalEncoder returns float64 by default.

In [14]:
from sklearn.preprocessing import OrdinalEncoder

# Sample ordinal data
categories = ['low', 'medium', 'high', 'low', 'medium', 'high']

# Define the order of categories
category_order = ['low', 'medium', 'high']

# Initialize the OrdinalEncoder with the specified categories order
ordinal_encoder = OrdinalEncoder(categories=[category_order])

# Reshape the data to fit the expected shape for OrdinalEncoder
categories_array = [[category] for category in categories]

# Fit and transform the ordinal data
ordinal_encoded = ordinal_encoder.fit_transform(categories_array)

# Print the ordinal encoded data
print("Original categories:", categories)
print("Ordinal encoded data:")
print(ordinal_encoded)


Original categories: ['low', 'medium', 'high', 'low', 'medium', 'high']
Ordinal encoded data:
[[0.]
 [1.]
 [2.]
 [0.]
 [1.]
 [2.]]
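
The reason for passing an explicit order can be seen in a short pure-Python sketch. It compares the intended ranking with the alphabetical ranking that results when categories are simply sorted; the names `category_rank` and `alphabetical_rank` are illustrative.

```python
# Sample ordinal data (same values as the OrdinalEncoder example)
categories = ['low', 'medium', 'high', 'low', 'medium', 'high']

# Explicit order: position in this list is the ordinal code
category_order = ['low', 'medium', 'high']
category_rank = {category: rank for rank, category in enumerate(category_order)}
ordinal_encoded = [category_rank[category] for category in categories]

# Sorting the categories instead gives an alphabetical, non-ordinal ranking
alphabetical_rank = {
    category: rank for rank, category in enumerate(sorted(set(categories)))
}

print("Ordinal codes:", ordinal_encoded)
print("Alphabetical codes:", alphabetical_rank)
```

With alphabetical sorting, 'high' would receive the smallest code, so the numeric order would no longer match the meaning of the levels.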


## **Binary Encoding**

This technique converts categorical variables into a compact binary representation. Each category is first converted into an integer label, and that label is then written in binary, with each bit typically stored in its own column. Unlike one-hot encoding, a single bit does not indicate whether a category is present; the category is identified by the full combination of bits, so n categories need only about log2(n) columns.

To perform binary encoding in Python, you can use libraries such as category_encoders or implement it manually. Here's how you can implement binary encoding manually.

In this example:

* We have a sample DataFrame df containing categorical data in the 'category' column.

* We define a dictionary category_to_binary that maps each category to a binary representation. For example, category 'A' is mapped to '00', category 'B' is mapped to '01', and category 'C' is mapped to '10'.

* We apply binary encoding by mapping each category to its corresponding binary representation using the map() function.

* Finally, we print both the original DataFrame and the binary encoded data.

The output will be a Series of bit strings, one per row. The digits are the bits of the category's numeric label (for example, 'C' has label 2, which is '10' in binary), not presence/absence flags. Before the result is passed to a model, the bits are normally split into separate numeric columns.

In [5]:
import pandas as pd

# Sample categorical data
data = {'category': ['A', 'B', 'C', 'A', 'C', 'B']}
df = pd.DataFrame(data)

# Create a dictionary to map categories to binary digits
category_to_binary = {
    'A': '00',
    'B': '01',
    'C': '10'
}

# Apply binary encoding to the categorical data
binary_encoded = df['category'].map(category_to_binary)

# Print the binary encoded data
print("Original categories:")
print(df)
print("\nBinary encoded data:")
print(binary_encoded)


Original categories:
  category
0        A
1        B
2        C
3        A
4        C
5        B

Binary encoded data:
0    00
1    01
2    10
3    00
4    10
5    01
Name: category, dtype: object
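
The mapping above is written by hand and yields strings. As a rough sketch of how the bits could be derived automatically and split into separate columns, the pure-Python example below numbers the sorted categories from 0 and extracts each bit. The helper `to_bits` and the zero-based labelling are assumptions made for illustration; libraries may number categories differently.

```python
# Sample categorical data (same values as the manual example)
categories = ['A', 'B', 'C', 'A', 'C', 'B']

# Step 1: integer label for each category (sorted order, starting at 0)
classes = sorted(set(categories))
category_to_label = {category: index for index, category in enumerate(classes)}

# Step 2: number of bits needed to represent the largest label
n_bits = max(1, (len(classes) - 1).bit_length())


def to_bits(value, width):
    """Return the bits of value as a list, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


# Step 3: one list of bits per observation; each position is one column
binary_rows = [to_bits(category_to_label[category], n_bits) for category in categories]

print("Bits per category:", n_bits)
for category, bits in zip(categories, binary_rows):
    print(category, bits)
```

Three categories fit into two bit columns here, whereas one-hot encoding would need three columns.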


## **Frequency Encoding**

In this method, categorical variables are replaced with the frequency of each category in the dataset. This can be useful when the frequency of occurrence of each category is informative.

To perform frequency encoding in Python, you can use libraries such as Pandas. Here's how you can implement frequency encoding manually.

In this example:

* We have a sample DataFrame df containing categorical data in the 'category' column.

* We calculate the frequency of each category using the value_counts() function with normalize=True to get the relative frequencies.

* We apply frequency encoding by mapping each category to its corresponding frequency using the map() function.

* Finally, we print the DataFrame, which now holds both the original column and the frequency encoded column.

The output will be a DataFrame containing the original categorical data along with a new column ('frequency_encoded') representing the relative frequency of each category in the dataset. Note that 'B' and 'C' both map to 0.3: categories that occur equally often become indistinguishable after frequency encoding.


In [6]:
import pandas as pd

# Sample categorical data
data = {'category': ['A', 'B', 'C', 'A', 'C', 'B', 'A', 'A', 'B', 'C']}
df = pd.DataFrame(data)

# Calculate the frequency of each category
frequency_map = df['category'].value_counts(normalize=True)

# Apply frequency encoding to the categorical data
df['frequency_encoded'] = df['category'].map(frequency_map)

# Print the frequency encoded data
print("Original categories:")
print(df)


Original categories:
  category  frequency_encoded
0        A                0.4
1        B                0.3
2        C                0.3
3        A                0.4
4        C                0.3
5        B                0.3
6        A                0.4
7        A                0.4
8        B                0.3
9        C                0.3
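
The same calculation can be sketched without pandas, which also makes two practical details easy to inspect: categories that share a frequency, and a category that was never seen when the frequencies were computed. The fallback value of 0.0 for unseen categories is an assumption for this sketch, not a fixed rule.

```python
from collections import Counter

# Sample categorical data (same values as the pandas example)
categories = ['A', 'B', 'C', 'A', 'C', 'B', 'A', 'A', 'B', 'C']

# Relative frequency of each category
counts = Counter(categories)
total = len(categories)
frequency_map = {category: count / total for category, count in counts.items()}

# Replace each category with its frequency
frequency_encoded = [frequency_map[category] for category in categories]

# Group categories by frequency to find values shared by several categories
by_frequency = {}
for category, frequency in frequency_map.items():
    by_frequency.setdefault(frequency, []).append(category)
collisions = [sorted(group) for group in by_frequency.values() if len(group) > 1]

# A category absent from the data gets an assumed default of 0.0
unseen_value = frequency_map.get('D', 0.0)

print("Frequency map:", frequency_map)
print("Categories sharing a frequency:", collisions)
print("Value for unseen category 'D':", unseen_value)
```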


## **Target Encoding**

Target Encoding, also known as mean encoding or likelihood encoding, is a technique used in machine learning for encoding categorical variables into numerical values based on the target variable. Unlike Label Encoding or One-Hot Encoding, which do not take into account the target variable, Target Encoding utilizes information from the target variable to encode categorical features.

Here's how Target Encoding works:

**Calculate Mean Target for Each Category:** For each category in the categorical variable, calculate the mean of the target variable (e.g., the proportion of positive class instances).

**Replace Categories with Mean Target Values:** Replace each category in the categorical variable with its corresponding mean target value.

**Handle Out-of-Sample Categories:** For categories that are not present in the training data but appear in the test data, handle them appropriately (e.g., impute with the overall mean target value or a default value).

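Target Encoding can be used for both classification and regression tasks, and is particularly useful when dealing with high-cardinality categorical variables (i.e., variables with a large number of unique categories). It can provide valuable information to machine learning models, especially when there's a strong relationship between the categorical variable and the target variable. Because the encoding is derived from the target, it should be computed from the training data only; otherwise information about the target leaks into the features and the model's performance is overestimated.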

In this example:

* We have a dataset with three columns: "city", "population_density" (categorical), and "average_house_price" (continuous target variable).

* We calculate the mean target (average house price) for each population density category using groupby and mean.

* Then, we map each population density category to its corresponding mean target value and store the result in a new column, "encoded_population_density". The original "population_density" column is left unchanged.

This information can be useful for predicting house prices based on population density using machine learning models.







In [19]:
import pandas as pd

# Sample data
data = {
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia'],
    'population_density': ['high', 'medium', 'high', 'low', 'medium', 'low'],
    'average_house_price': [850000, 600000, 450000, 300000, 400000, 350000]
}

df = pd.DataFrame(data)

# Calculate mean target for each population density
mean_target = df.groupby('population_density')['average_house_price'].mean()

# Replace population densities with mean target values
df['encoded_population_density'] = df['population_density'].map(mean_target)

print("Original data:")
print(df)



Original data:
           city population_density  average_house_price  \
0      New York               high               850000   
1   Los Angeles             medium               600000   
2       Chicago               high               450000   
3       Houston                low               300000   
4       Phoenix             medium               400000   
5  Philadelphia                low               350000   

   encoded_population_density  
0                    650000.0  
1                    500000.0  
2                    650000.0  
3                    325000.0  
4                    500000.0  
5                    325000.0  
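
The example above computes the means on the same rows it encodes and does not cover categories that are missing from the training data. The pure-Python sketch below shows one common way to address both points: the category mean is blended with the overall mean, and unseen categories fall back to the overall mean. The smoothing weight `smoothing` and the function `encode` are hypothetical names, and the value 2 is an arbitrary choice for illustration.

```python
from collections import defaultdict

# Training rows: (population_density, average_house_price)
train = [
    ('high', 850000), ('medium', 600000), ('high', 450000),
    ('low', 300000), ('medium', 400000), ('low', 350000),
]

# Overall mean of the target, used as the fallback and smoothing prior
global_mean = sum(price for _, price in train) / len(train)

# Per-category sums and counts of the target
target_sum = defaultdict(float)
target_count = defaultdict(int)
for category, price in train:
    target_sum[category] += price
    target_count[category] += 1

# Assumed smoothing weight: behaves like this many extra rows at the global mean
smoothing = 2


def encode(category):
    """Smoothed mean target; equals the global mean for unseen categories."""
    count = target_count.get(category, 0)
    total = target_sum.get(category, 0.0)
    return (total + smoothing * global_mean) / (count + smoothing)


for category in ['high', 'medium', 'low', 'coastal']:
    print(category, round(encode(category), 2))
```

Categories with few rows are pulled more strongly toward the overall mean, which limits the influence of small, noisy groups.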


## **Backward Difference Encoding**

Backward difference encoding is a technique for ordinal categorical variables that compares each category with the one immediately before it. In its standard form it is a contrast coding scheme: a variable with k levels becomes k - 1 columns, chosen so that each regression coefficient estimates the difference between the mean of the dependent variable for one level and the mean for the previous level. This encoding method captures the relative differences between adjacent categories.

The example here is a simplified pandas illustration of that idea rather than the full contrast scheme. We calculate the mean target for each category ('low', 'medium', 'high'), and then calculate the difference between the mean target for each category and the mean target for the previous category. We store these backward difference values in a new column named 'backward_difference_encoded'.

Note that the first category in the ordering has no previous category to calculate the difference from, so its value is NaN.



In [15]:
import pandas as pd

# Sample data
data = {'category': ['low', 'medium', 'high', 'low', 'medium', 'high'],
        'target': [10, 20, 30, 15, 25, 35]}  # Sample dependent variable

df = pd.DataFrame(data)

# Define the order of the ordinal categories
category_order = ['low', 'medium', 'high']

# Calculate mean target for each category, arranged in ordinal order
# (groupby sorts the index alphabetically: high, low, medium)
mean_target = df.groupby('category')['target'].mean().reindex(category_order)

# Calculate mean target for the previous category
mean_target_previous = mean_target.shift(1)

# Calculate difference between mean target for each category and mean target for the previous category
backward_difference = mean_target - mean_target_previous

# Replace categories with backward difference values
df['backward_difference_encoded'] = df['category'].map(backward_difference)

print("Original data:")
print(df)


Original data:
  category  target  backward_difference_encoded
0      low      10                          NaN
1   medium      20                         10.0
2     high      30                         10.0
3      low      15                          NaN
4   medium      25                         10.0
5     high      35                         10.0
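
For comparison, backward difference coding is more commonly expressed as a contrast matrix with one column per adjacent pair of levels, and libraries such as category_encoders offer an encoder for it. The pure-Python sketch below builds the textbook contrast values for k ordered levels. The function name `backward_difference_row` is hypothetical, and a library's output may differ in details such as an added intercept column.

```python
# Ordered levels of the ordinal variable
category_order = ['low', 'medium', 'high']
k = len(category_order)


def backward_difference_row(level_index, k):
    """Contrast values for one level (0-based index) across k - 1 columns.

    Column j (1-based) compares level j + 1 with level j: levels up to j
    get -(k - j) / k, and levels above j get j / k.
    """
    row = []
    for j in range(1, k):
        if level_index < j:
            row.append(-(k - j) / k)
        else:
            row.append(j / k)
    return row


# One row of contrast values per level
contrast = {
    category: backward_difference_row(index, k)
    for index, category in enumerate(category_order)
}

# Encode observations by looking up the row for each category
categories = ['low', 'medium', 'high', 'low', 'medium', 'high']
encoded_rows = [contrast[category] for category in categories]

for category in category_order:
    print(category, [round(value, 3) for value in contrast[category]])
```

Moving from one level to the next changes exactly one column by 1, which is why each regression coefficient on these columns can be read as the difference between adjacent level means.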


## **Embedding**

Embedding is a powerful technique used in natural language processing (NLP) and recommendation systems to represent categorical variables, such as words or items, as dense vectors in a lower-dimensional continuous space. These vectors capture semantic relationships between categories, allowing similar categories to have similar vector representations. Embeddings are learned through neural network training, either end-to-end as part of a supervised model or in an unsupervised (self-supervised) manner on unlabeled data.

Here's a high-level overview of how embeddings are used in NLP and recommendation systems:

**Word Embeddings in NLP:** In NLP tasks such as sentiment analysis, machine translation, or named entity recognition, words are represented as dense vectors in an embedding space. Word embeddings capture semantic relationships between words. Words with similar meanings or contexts are mapped to nearby points in the embedding space. Popular word embedding techniques include Word2Vec, GloVe (Global Vectors for Word Representation), and FastText. These embeddings are often pre-trained on large text corpora and then fine-tuned on specific tasks.

**Item Embeddings in Recommendation Systems:** In recommendation systems, items (e.g., movies, products) are represented as dense vectors in an embedding space. Item embeddings capture latent features of items and user preferences. Similar items are mapped to nearby points in the embedding space. Embeddings can be learned from user-item interaction data using techniques such as matrix factorization, neural collaborative filtering, or deep learning-based models. These embeddings are used to generate recommendations by finding similar items based on their vector representations.
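
Conceptually, an embedding is a lookup table with one row of numbers per category, and encoding an input means selecting the rows for its indices. The pure-Python sketch below illustrates only that lookup step with small, hypothetical sizes and random values; in a real model the table entries are parameters adjusted during training.

```python
import random

# Hypothetical small sizes for illustration
vocab_size = 10
embedding_dim = 4

# Lookup table: one randomly initialized vector per vocabulary index
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

# Two sequences of three category indices each
input_data = [[1, 5, 8], [2, 7, 0]]

# Embedding lookup: replace every index with its row from the table
embedded_data = [
    [embedding_table[index] for index in sequence]
    for sequence in input_data
]

# Result dimensions: sequences x positions x embedding_dim
print(len(embedded_data), len(embedded_data[0]), len(embedded_data[0][0]))  # 2 3 4
```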

Here's a simplified example of how to use embeddings for word representation in Python using TensorFlow/Keras. In this example, we define an embedding layer with a vocabulary size of 10,000 and an embedding dimension of 100. We then apply the embedding layer to input data consisting of word indices. The output is the embedded representation of the input data, where each word index is mapped to a dense vector in the embedding space, giving a tensor of shape (2, 3, 100). Because the layer has not been trained, the vectors are randomly initialized and do not yet carry any meaning.

In [7]:
import tensorflow as tf

# Example vocabulary
vocab_size = 10000
embedding_dim = 100

# Define the embedding layer
embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# Example input data (word indices)
input_data = tf.constant([[1, 5, 8], [2, 7, 0]])

# Apply the embedding layer to the input data
embedded_data = embedding_layer(input_data)

# Print the embedded data
print(embedded_data)


tf.Tensor(
[[[ 0.03909    -0.02228323 -0.02981289  0.0449113   0.03844407
    0.00491518  0.04882425 -0.03801579  0.02137852 -0.02503163
   -0.00622847  0.03244979 -0.02821611  0.00798281  0.01105924
    0.020074    0.04368163  0.02184993 -0.00857129 -0.00340353
   -0.04006969  0.01594497  0.04521877  0.01394308 -0.00557312
    0.04327646 -0.01530663 -0.00655278  0.03958556  0.02479544
   -0.04309788  0.03645028  0.04386966  0.00230924  0.00407517
   -0.00351539  0.02980191  0.01750701  0.04364634  0.02358819
   -0.04524968 -0.02400434 -0.01403198 -0.00854195 -0.02708879
   -0.00536134 -0.03376164  0.02267481 -0.04519984  0.04025909
   -0.00030692  0.0216408  -0.03237735  0.0287937  -0.01087565
    0.00432619  0.0060644   0.02265669  0.02737783  0.03406625
    0.02862075 -0.00952833 -0.04593113 -0.04478073  0.01866552
    0.04952565  0.0339929  -0.00795019  0.03934612  0.00547834
   -0.03424891  0.04806289  0.03041938  0.02892997  0.01106149
    0.01061774 -0.01199242 -0.00492664  0.04