# 1. Label Encoding
---

## What It Is:

- Assign each category a unique integer. 

- For example, if you have a feature Color with values ["Red", "Green", "Blue"], you might encode them as [0, 1, 2].

## When to Use:

- Often used for ordinal features where the order of categories matters (e.g., education level: “High School” = 0, “Bachelor’s” = 1, “Master’s” = 2).

- Sometimes used for non-ordinal features as a quick baseline, but can mislead many ML algorithms into thinking there is a numeric ordering.

## Pros/Cons:

Pro: Simple, space-efficient.

Con: Implies an arbitrary numeric relationship (e.g., 2 > 1 > 0) that isn’t meaningful for nominal features.
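
To make this con concrete, the following minimal sketch compares pairwise distances between categories under integer codes and under one-hot vectors. The mapping and the helper names (`label_distance`, `one_hot_distance`) are illustrative assumptions, not part of any library.

```python
import numpy as np

# Hypothetical label codes for a nominal feature (any assignment is arbitrary).
label_codes = {"Red": 0, "Green": 1, "Blue": 2}

# One-hot vectors for the same three categories.
one_hot = {name: np.eye(3)[code] for name, code in label_codes.items()}

def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

for a, b in [("Red", "Green"), ("Green", "Blue"), ("Red", "Blue")]:
    print(f"{a}-{b}: label={label_distance(a, b)}, one-hot={one_hot_distance(a, b):.3f}")
```

Under the integer codes, Red and Blue end up twice as far apart as Red and Green, even though nothing about the colors justifies that. Under one-hot vectors every pair is equally distant.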

---

# 2. One-Hot Encoding (Dummy Encoding)
---

## What It Is:

- For each category in a feature, you create a binary column (0/1). If you have 3 categories for a feature, you end up with 3 columns, each indicating whether the feature equals that category.

- Example: Color = Red, Green, Blue → 3 new columns: Color_Red, Color_Green, Color_Blue.
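
The mechanism can be sketched by hand with NumPy, as a rough illustration rather than how any particular library implements it. The `Color_` prefix and variable names are assumptions chosen to match the example above.

```python
import numpy as np

values = np.array(["Red", "Green", "Blue", "Green"])

# Sorted unique categories; each one becomes a column.
categories = np.unique(values)

# Broadcast comparison: entry (i, j) is 1 when values[i] == categories[j].
one_hot = (values[:, None] == categories[None, :]).astype(int)

column_names = [f"Color_{c}" for c in categories]
print(column_names)
print(one_hot)
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories.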

## When to Use:

- Nominal (unordered) categorical features.

- A common default for many machine learning algorithms, particularly linear models. Tree-based models accept it too, although with many categories the sparse indicator columns can make individual splits less effective.

## Dummy Variable Trap:

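If you include all columns (one for each category) in a linear model that also has an intercept, you introduce perfect multicollinearity, because the indicator columns always sum to 1.
Some libraries drop one column automatically (“dummy trap” avoidance) so that the dropped category serves as a baseline; others make it an option (e.g., `drop_first=True` in `pandas.get_dummies`).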
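
One way to see the trap is to compare the rank of the design matrix with its number of columns. The sketch below uses a small made-up `Color` series and a hypothetical `design_matrix` helper that prepends an intercept column.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Green", "Blue", "Red"], name="Color")

# Full one-hot: one indicator column per category.
full = pd.get_dummies(colors, dtype=float)
# Reduced one-hot: the first category (alphabetically, "Blue") becomes the baseline.
reduced = pd.get_dummies(colors, drop_first=True, dtype=float)

def design_matrix(dummies):
    """Prepend an intercept column of ones to the indicator columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy()])

X_full = design_matrix(full)        # intercept + 3 indicators
X_reduced = design_matrix(reduced)  # intercept + 2 indicators

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # more columns than rank: collinear
print(X_reduced.shape[1], rank_reduced)  # columns equal rank: full column rank
```

With all indicators plus an intercept, the matrix has more columns than its rank, so the least-squares coefficients are not uniquely determined. Dropping one indicator restores full column rank.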

## Pros/Cons:

Pro: Doesn’t impose a numeric relationship; purely binary indicators.

Con: Can increase dimensionality significantly if many categories exist.

---

# 3. Ordinal Encoding
---

## What It Is:

- Similar to label encoding, but specifically used for features with a true natural order.

- For example, “small” < “medium” < “large” might be encoded as [0, 1, 2].

## When to Use:

When categories have a meaningful rank or scale (e.g., “Low”, “Medium”, “High” or “Freshman”, “Sophomore”, “Junior”, “Senior”).

## Pros/Cons:

Pro: Respects the real order; beneficial if the model can use that ordinal structure (e.g., linear or logistic regression).

Con: Not applicable to truly nominal categories (e.g., colors, countries).
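
As a small sketch of why the order has to be stated explicitly, the snippet below uses a pandas ordered `Categorical`. The `sizes` data and the `size_order` list are made up for illustration.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])

# Explicit order: the position in this list becomes the integer code.
size_order = ["small", "medium", "large"]
ordered = pd.Categorical(sizes, categories=size_order, ordered=True)

codes = ordered.codes  # small -> 0, medium -> 1, large -> 2
print(list(codes))

# Without an explicit order, categories are sorted alphabetically,
# which puts "large" first and breaks the intended ranking.
alphabetical = pd.Categorical(sizes)
print(list(alphabetical.categories))
```

Any encoder that infers the order from sorted strings has the same problem, so passing the order by hand is usually the safer choice for ordinal features.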

---

# 4. Frequency / Count Encoding
---

## What It Is:

- Replace each category with its frequency or count in the dataset. For instance, if “Red” appears 50 times, “Green” 30 times, and “Blue” 20 times, you encode them as [50, 30, 20].

## When to Use:

- For high-cardinality features where one-hot encoding would blow up dimensionality.

- In some ML models, the frequency can be predictive if frequent categories differ from rare ones in relevant ways.

## Pros/Cons:

Pro: Simple, reduces dimensionality, can incorporate some notion of importance (common categories get bigger numbers).

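Con: The numeric scale could mislead algorithms that interpret magnitude as meaning (though tree-based methods often handle it reasonably). Categories that happen to share the same count also receive the same code and become indistinguishable.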
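
A minimal sketch of the idea, using the 50/30/20 proportions from the example above scaled down to ten rows. Learning the mapping on training data only and filling unseen categories with `0.0` are assumed conventions, not requirements of the technique.

```python
import pandas as pd

train = pd.Series(["Red"] * 5 + ["Green"] * 3 + ["Blue"] * 2, name="Color")
new_data = pd.Series(["Green", "Red", "Purple"], name="Color")  # "Purple" is unseen

# Learn relative frequencies from the training data only.
freq_map = train.value_counts(normalize=True)

# Apply the learned mapping; categories missing from the map get 0.0 (assumed convention).
train_encoded = train.map(freq_map)
new_encoded = new_data.map(freq_map).fillna(0.0)

print(freq_map.to_dict())
print(new_encoded.tolist())
```

Using relative frequencies instead of raw counts keeps the encoded values comparable when the training set grows or shrinks.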

---

# 5. Target Encoding

---

## What It Is:

- Each category is replaced with a statistic of the target variable computed over the rows in that category, most commonly the target mean.

- For binary classification, encode category $C$ with the mean of the target among rows with that category, i.e., the observed probability of the “1” class.

## When to Use:

- High-cardinality categorical features in supervised learning contexts (particularly in regularized forms to avoid overfitting).

- Often used in Kaggle competitions with carefully tuned cross-validation to avoid leakage.

## Pros/Cons:

Pro: Great at capturing the relationship between category and target without exploding dimensionality.

Con:

- Can lead to overfitting if done naively (the model learns “cheated” information from the target).

- Must use proper CV or regularization techniques (like smoothing, noise addition) when creating these encodings.
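
The sketch below combines two of these safeguards: out-of-fold encoding and additive smoothing toward the global mean. It is one possible arrangement under stated assumptions, not a definitive recipe. The function names, the fold assignment, and the smoothing strength `m` are all illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Green", "Blue", "Green", "Blue", "Blue", "Red", "Green"],
    "Purchased": [1, 0, 1, 1, 0, 1, 0, 1],
})

def smoothed_means(frame, prior, m):
    """Per-category target mean shrunk toward `prior` with strength `m`."""
    stats = frame.groupby("Color")["Purchased"].agg(["sum", "count"])
    return (stats["sum"] + m * prior) / (stats["count"] + m)

def out_of_fold_target_encode(frame, n_folds=4, m=2.0, seed=0):
    """Encode each row using target statistics from the other folds only."""
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(frame)) % n_folds
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    for fold in range(n_folds):
        held_out = fold_ids == fold
        fit_part = frame[~held_out]
        prior = fit_part["Purchased"].mean()
        mapping = smoothed_means(fit_part, prior, m)
        # Categories absent from the fitting folds fall back to the prior.
        values = frame.loc[held_out, "Color"].map(mapping).fillna(prior)
        encoded[held_out] = values.to_numpy()
    return encoded

df["Color_TE_oof"] = out_of_fold_target_encode(df)
print(df)
```

Because a row's own target never enters the statistic used to encode it, the encoded value cannot simply memorize that row's label. The smoothing term pulls categories with few rows toward the overall mean.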

---

# 6. Hashing Encoding (Feature Hashing)
---

## What It Is:

- Uses a hash function to map each category to one of many “hash buckets”.

- Instead of creating a column for each distinct category, it creates columns for each hash bucket.

## When to Use:

- Very high-cardinality or streaming data contexts where you cannot keep track of all unique categories (especially in text analytics).

- If you’re okay with possible collisions (different categories mapping to the same bucket).

## Pros/Cons:

Pro: Memory-efficient for large or unbounded category sets; no need to store category-index mappings.

Con: Collisions can degrade performance, and interpretation is less transparent.
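
A bare-bones sketch of the hashing trick using only the standard library. CRC32 is swapped in here purely for convenience; it is an assumption of this sketch, and library implementations generally use a different hash function. The helper names and the `item_*` categories are hypothetical.

```python
import zlib

def bucket_for(category, n_buckets):
    """Map a category string to a bucket index with a stable hash (CRC32 here)."""
    return zlib.crc32(category.encode("utf-8")) % n_buckets

def hash_encode(category, n_buckets):
    """Return a length-n_buckets indicator vector with a 1 in the category's bucket."""
    vector = [0] * n_buckets
    vector[bucket_for(category, n_buckets)] = 1
    return vector

categories = [f"item_{i}" for i in range(10)]
n_buckets = 4

buckets = {c: bucket_for(c, n_buckets) for c in categories}
print(buckets)

# With more categories than buckets, at least two categories must share a bucket.
n_used = len(set(buckets.values()))
print(f"{len(categories)} categories -> {n_used} distinct buckets")
```

No dictionary of known categories is stored: a new string can be encoded immediately, at the cost of collisions once categories outnumber buckets.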

---

# 7. Embedding (Deep Learning)
---

## What It Is:

- In deep learning, you can learn a dense vector representation (embedding) for categories. 

- Each category is mapped to a vector of real numbers that the model updates during training.


- Common in NLP and recommended for large or complex categorical features.

## When to Use:

- Neural network architectures, especially for large categories (e.g., item IDs in recommendation systems, words in language models).

- Gains from capturing semantic relationships in a lower-dimensional space.


## Pros/Cons:

- Pro: Can capture rich relationships between categories, highly flexible if enough data is available.

- Con: Requires training a neural model, is more complex to implement and interpret, and needs sufficient data.
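
Conceptually, an embedding layer is a lookup table with one row per category. The NumPy sketch below, with random numbers standing in for learned weights, shows that indexing the table is equivalent to multiplying a one-hot vector by it. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_categories = 3   # e.g., Red, Green, Blue
embed_dim = 2        # length of each vector (a free design choice)

# The embedding table: one row per category. In a real model these values
# are trainable parameters; here they are just random numbers.
embedding_table = rng.normal(size=(num_categories, embed_dim))

category_ids = np.array([0, 2, 2, 1])

# Lookup by row index...
looked_up = embedding_table[category_ids]

# ...is equivalent to multiplying one-hot rows by the table.
one_hot = np.eye(num_categories)[category_ids]
via_matmul = one_hot @ embedding_table

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The index lookup avoids materializing the one-hot matrix, which matters when there are many thousands of categories.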

---

# Example
---

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Color":   ["Red", "Green", "Blue", "Green", "Blue", "Blue", "Red", "Green"],
    "Size":    ["Small", "Medium", "Large", "Small", "Large", "Medium", "Small", "Large"],
    "Purchased": [1, 0, 1, 1, 0, 1, 0, 1]  # Binary Target
})

df

| | Color | Size | Purchased |
|---|---|---|---|
| 0 | Red | Small | 1 |
| 1 | Green | Medium | 0 |
| 2 | Blue | Large | 1 |
| 3 | Green | Small | 1 |
| 4 | Blue | Large | 0 |
| 5 | Blue | Medium | 1 |
| 6 | Red | Small | 0 |
| 7 | Green | Large | 1 |


# 1. Label Encoding
---

In [4]:
from sklearn.preprocessing import LabelEncoder

# Copy data
df_label = df.copy()

# Encode 'Color' (nominal)
le_color = LabelEncoder()
df_label["Color_Label"] = le_color.fit_transform(df_label["Color"])

# Encode 'Size' (LabelEncoder sorts classes alphabetically, so Large=0, Medium=1, Small=2;
# use OrdinalEncoder with an explicit order when the feature is truly ordinal)
le_size = LabelEncoder()
df_label["Size_Label"] = le_size.fit_transform(df_label["Size"])

print("\nLabel Encoding:")
df_label


Label Encoding:


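| | Color | Size | Purchased | Color_Label | Size_Label |
|---|---|---|---|---|---|
| 0 | Red | Small | 1 | 2 | 2 |
| 1 | Green | Medium | 0 | 1 | 1 |
| 2 | Blue | Large | 1 | 0 | 0 |
| 3 | Green | Small | 1 | 1 | 2 |
| 4 | Blue | Large | 0 | 0 | 0 |
| 5 | Blue | Medium | 1 | 0 | 1 |
| 6 | Red | Small | 0 | 2 | 2 |
| 7 | Green | Large | 1 | 1 | 0 |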


# 2. One-Hot Encoding
---

In [14]:
df_onehot = pd.get_dummies(df, columns=["Color"], prefix="Color", drop_first=False, dtype=int)
print("\nOne-Hot Encoding with pandas.get_dummies:")
df_onehot


One-Hot Encoding with pandas.get_dummies:


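| | Size | Purchased | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|---|---|
| 0 | Small | 1 | 0 | 0 | 1 |
| 1 | Medium | 0 | 0 | 1 | 0 |
| 2 | Large | 1 | 1 | 0 | 0 |
| 3 | Small | 1 | 0 | 1 | 0 |
| 4 | Large | 0 | 1 | 0 | 0 |
| 5 | Medium | 1 | 1 | 0 | 0 |
| 6 | Small | 0 | 0 | 0 | 1 |
| 7 | Large | 1 | 0 | 1 | 0 |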


# 3. Ordinal Encoding (for truly ordered categories)

In [15]:
from sklearn.preprocessing import OrdinalEncoder

df_ordinal = df.copy()

# Define an explicit ordering for the Size feature
size_categories = [["Small", "Medium", "Large"]]

ord_enc = OrdinalEncoder(categories=size_categories)
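# fit_transform returns a 2-D float array of shape (N, 1); ravel() flattens it into a single column
df_ordinal["Size_Ordinal"] = ord_enc.fit_transform(df_ordinal[["Size"]]).ravel()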

print("\nOrdinal Encoding for 'Size':")
df_ordinal


Ordinal Encoding for 'Size':


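| | Color | Size | Purchased | Size_Ordinal |
|---|---|---|---|---|
| 0 | Red | Small | 1 | 0.0 |
| 1 | Green | Medium | 0 | 1.0 |
| 2 | Blue | Large | 1 | 2.0 |
| 3 | Green | Small | 1 | 0.0 |
| 4 | Blue | Large | 0 | 2.0 |
| 5 | Blue | Medium | 1 | 1.0 |
| 6 | Red | Small | 0 | 0.0 |
| 7 | Green | Large | 1 | 2.0 |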


# 4. Frequency Encoding

In [16]:
df_freq = df.copy()

# Calculate frequencies for 'Color'
color_counts = df_freq["Color"].value_counts()
df_freq["Color_Freq"] = df_freq["Color"].map(color_counts)

print("\nFrequency Encoding for 'Color':")
df_freq


Frequency Encoding for 'Color':


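| | Color | Size | Purchased | Color_Freq |
|---|---|---|---|---|
| 0 | Red | Small | 1 | 2 |
| 1 | Green | Medium | 0 | 3 |
| 2 | Blue | Large | 1 | 3 |
| 3 | Green | Small | 1 | 3 |
| 4 | Blue | Large | 0 | 3 |
| 5 | Blue | Medium | 1 | 3 |
| 6 | Red | Small | 0 | 2 |
| 7 | Green | Large | 1 | 3 |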


# 5. Target Encoding

In [17]:
df_targetenc = df.copy()

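# Compute the mean 'Purchased' by 'Color'
# (naive version: every row, including each row's own target, feeds the mean, so this leaks target information)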
color_target_mean = df_targetenc.groupby("Color")["Purchased"].mean()

df_targetenc["Color_TargetEnc"] = df_targetenc["Color"].map(color_target_mean)

print("\nTarget Encoding for 'Color' (Mean of 'Purchased'):")
df_targetenc


Target Encoding for 'Color' (Mean of 'Purchased'):


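| | Color | Size | Purchased | Color_TargetEnc |
|---|---|---|---|---|
| 0 | Red | Small | 1 | 0.5 |
| 1 | Green | Medium | 0 | 0.666667 |
| 2 | Blue | Large | 1 | 0.666667 |
| 3 | Green | Small | 1 | 0.666667 |
| 4 | Blue | Large | 0 | 0.666667 |
| 5 | Blue | Medium | 1 | 0.666667 |
| 6 | Red | Small | 0 | 0.5 |
| 7 | Green | Large | 1 | 0.666667 |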


# 6. Hashing Encoding

In [21]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import FeatureHasher

df_hash = pd.DataFrame({
    "Color": ["Red", "Green", "Blue", "Green", "Blue", "Blue", "Red", "Green"]
})

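n_features = 4  # number of hash buckets
# alternate_sign defaults to True, so a hashed value can be +1 or -1
# (this is why Blue shows up as -1.0 in the output)
fh = FeatureHasher(n_features=n_features, input_type='string')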

# Convert each 'Color' entry to a list of one element, so we have "iterables of iterables"
hashed_features = fh.fit_transform(
    df_hash["Color"].apply(lambda x: [x])  # e.g. "Red" -> ["Red"]
)

hashed_df = pd.DataFrame(
    hashed_features.toarray(), 
    columns=[f"Color_Hash_{i}" for i in range(n_features)]
)

df_hash = pd.concat([df_hash, hashed_df], axis=1)
df_hash

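| | Color | Color_Hash_0 | Color_Hash_1 | Color_Hash_2 | Color_Hash_3 |
|---|---|---|---|---|---|
| 0 | Red | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | Green | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | Blue | 0.0 | 0.0 | -1.0 | 0.0 |
| 3 | Green | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | Blue | 0.0 | 0.0 | -1.0 | 0.0 |
| 5 | Blue | 0.0 | 0.0 | -1.0 | 0.0 |
| 6 | Red | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | Green | 0.0 | 0.0 | 0.0 | 1.0 |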


In [28]:
df_hash_new = df_hash.drop("Color",axis=1)
merged_df = pd.merge(df, df_hash_new, left_index=True, right_index=True, how='inner')
merged_df

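| | Color | Size | Purchased | Color_Hash_0 | Color_Hash_1 | Color_Hash_2 | Color_Hash_3 |
|---|---|---|---|---|---|---|---|
| 0 | Red | Small | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | Green | Medium | 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | Blue | Large | 1 | 0.0 | 0.0 | -1.0 | 0.0 |
| 3 | Green | Small | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | Blue | Large | 0 | 0.0 | 0.0 | -1.0 | 0.0 |
| 5 | Blue | Medium | 1 | 0.0 | 0.0 | -1.0 | 0.0 |
| 6 | Red | Small | 0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | Green | Large | 1 | 0.0 | 0.0 | 0.0 | 1.0 |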


# 7. Embedding (Deep Learning)

In [30]:
import torch
import torch.nn as nn
import torch.optim as optim

# Example data
data = [
    # (Color,   Size,    Purchased)
    ("Red",    "Small",  1),
    ("Green",  "Medium", 0),
    ("Blue",   "Large",  1),
    ("Green",  "Small",  1),
    ("Blue",   "Large",  0),
    ("Blue",   "Medium", 1),
    ("Red",    "Small",  0),
    ("Green",  "Large",  1),
]

# Map Color -> integer IDs
color_to_id = {"Red": 0, "Green": 1, "Blue": 2}

# "Size" is ordinal or nominal, but for illustration we can one-hot encode it:
size_to_id = {"Small": [1, 0, 0],  # e.g., [1,0,0] = Small
              "Medium": [0, 1, 0], 
              "Large": [0, 0, 1]}

# Convert data into numeric form
color_ids = []
size_onehots = []
targets = []

for (color, size, purchased) in data:
    color_ids.append(color_to_id[color])
    size_onehots.append(size_to_id[size])
    targets.append(purchased)

# Turn them into torch tensors
color_ids_t = torch.tensor(color_ids, dtype=torch.long)           # shape [N]
size_onehots_t = torch.tensor(size_onehots, dtype=torch.float32)  # shape [N, 3]
targets_t = torch.tensor(targets, dtype=torch.float32)            # shape [N]


In [31]:
class PurchasePredictor(nn.Module):
    def __init__(self, num_colors, embed_dim, size_input_dim):
        super().__init__()
        # Embedding for the Color feature
        self.color_embedding = nn.Embedding(num_embeddings=num_colors, embedding_dim=embed_dim)
        
        # A small feed-forward network
        # Input to first layer: embed_dim (from color) + size_input_dim (from size one-hot)
        self.fc1 = nn.Linear(embed_dim + size_input_dim, 4)
        self.fc2 = nn.Linear(4, 1)
        
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, color_ids, size_onehot):
        # color_ids: shape [N], size_onehot: shape [N, 3]
        # 1) Look up the embedding for each color index
        color_embeds = self.color_embedding(color_ids)  # shape [N, embed_dim]
        
        # 2) Concatenate color embedding + size one-hot
        x = torch.cat((color_embeds, size_onehot), dim=1)  # shape [N, embed_dim + size_input_dim]
        
        # 3) Pass through feed-forward network
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        
        # 4) Output a probability
        out = self.sigmoid(x).view(-1)  # shape [N]
        return out

# Instantiate the model
num_colors = len(color_to_id)   # 3 (Red, Green, Blue)
embed_dim = 2                   # you can choose any dimension
size_input_dim = 3              # 'Size' is one-hot with 3 columns

model = PurchasePredictor(num_colors, embed_dim, size_input_dim)


In [32]:
# Define a loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# We can run multiple epochs
epochs = 50
for epoch in range(epochs):
    optimizer.zero_grad()
    
    # Forward pass
    predictions = model(color_ids_t, size_onehots_t)  # shape [N]
    
    # Compute loss
    loss = criterion(predictions, targets_t)
    
    # Backprop
    loss.backward()
    optimizer.step()
    
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1:2d}/{epochs}, Loss: {loss.item():.4f}")




Epoch 10/50, Loss: 0.6589
Epoch 20/50, Loss: 0.6420
Epoch 30/50, Loss: 0.6261
Epoch 40/50, Loss: 0.6067
Epoch 50/50, Loss: 0.5874


In [34]:
print("\nLearned Color Embeddings:")
model.color_embedding.weight.data
# first row is Red (color id = 0)
# second row is Green (color id = 1)
# third row is Blue (color id = 2)


Learned Color Embeddings:


tensor([[-1.1595, -0.2128],
        [ 0.0416,  1.0284],
        [-0.8431, -0.3531]])
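
One way to inspect such a table is to compare rows with cosine similarity. The sketch below copies the values from the printed tensor above into NumPy arrays; the `cosine_similarity` helper is a hypothetical convenience function, not part of the model.

```python
import numpy as np

# Values copied from the printed tensor above (rows: Red, Green, Blue).
embeddings = {
    "Red":   np.array([-1.1595, -0.2128]),
    "Green": np.array([ 0.0416,  1.0284]),
    "Blue":  np.array([-0.8431, -0.3531]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

names = list(embeddings)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a}-{b}: {cosine_similarity(embeddings[a], embeddings[b]):.3f}")
```

In this run Red and Blue point in nearly the same direction while Green points elsewhere. With only eight rows and 50 epochs, that pattern probably reflects the random initialization more than anything learned, so it should not be read as a real relationship between the colors.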

----