# Encoding Categorical Features: Binary Encoding vs. Label Encoding
In this part, I encode the `user` and `pc` columns as numerical features.

## **Why Choose Binary Encoding Over Label Encoding?**

- **Avoids Artificial Ordinality:** Label Encoding places every category on a single integer scale, so a model can read `User3 > User1` as meaningful. Binary Encoding splits the code across several 0/1 columns, so no single feature carries that arbitrary ranking.
- **Dimensionality Efficiency:** Binary Encoding transforms categories into a compact binary format that needs roughly log2(n) columns for n categories, far fewer than the n columns of One-Hot Encoding, while preserving uniqueness.
- **Scalability:** Handles high-cardinality features (e.g., 1,000 unique users) efficiently without creating thousands of columns.
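
To make the column counts above concrete, here is a minimal sketch that compares the widths of the two encodings. It assumes zero-based integer codes (0 to n-1), as in the tables in this section; the helper names `binary_columns` and `one_hot_columns` are made up for illustration. A library that keeps the all-zero pattern spare may need one extra column when n is an exact power of two.

```python
def binary_columns(n_categories: int) -> int:
    """Bits needed to write the largest zero-based code (n - 1) in binary."""
    return max(1, (n_categories - 1).bit_length())


def one_hot_columns(n_categories: int) -> int:
    """One-Hot Encoding uses one column per category."""
    return n_categories


# Compare widths for a few cardinalities.
for n in (10, 1_000, 100_000):
    print(f"{n} categories: one-hot={one_hot_columns(n)}, binary={binary_columns(n)}")
```

For 1,000 categories this gives 10 binary columns against 1,000 one-hot columns.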

## **How They Work**

### **Label Encoding**

Assigns each unique category a distinct integer.

| Category | Label Encoded |
|----------|---------------|
| User1    | 0             |
| User2    | 1             |
| User3    | 2             |
| ...      | ...           |
| User1000 | 999           |

**Pros:**
- Simple and easy to implement.
- Low dimensionality (single column).

**Cons:**
- Introduces artificial ordinality.
- May mislead models to infer unintended relationships.
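
The mapping in the table can be sketched in plain Python. This is an illustrative stand-in, not the `LabelEncoder` used later in the notebook: the `label_encode` helper is hypothetical and simply assigns integers to the sorted unique values.

```python
def label_encode(values):
    """Map each unique value to an integer, in sorted order of the values."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], mapping


users = ["User3", "User1", "User2", "User1"]
codes, mapping = label_encode(users)
print(codes)

# The codes are plain integers, so arithmetic on them is possible even
# though it has no meaning for user identities.
print(mapping["User3"] - mapping["User1"])
```

The second print shows the artificial ordinality: the difference between two codes is a number, but it says nothing real about the two users.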

### **Binary Encoding**

Converts each category to its binary representation spread across multiple binary columns.

| Category | Binary Encoded (10 bits) |
|----------|--------------------------|
| User1    | 0000000000               |
| User2    | 0000000001               |
| User3    | 0000000010               |
| ...      | ...                      |
| User1000 | 1111100111               |

**Pros:**
- Avoids artificial ordinality.
- More compact than One-Hot Encoding (~10 columns for 1,000 categories).
- Preserves uniqueness of categories.

**Cons:**
- Slightly more complex implementation.
- Binary features are less interpretable individually.
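
The table above can be reproduced with a short plain-Python sketch. It assumes zero-based codes assigned in order of first appearance, and the `binary_encode` helper is hypothetical. The `category_encoders` output further down starts at `0000000001` for the first row, which suggests that library numbers categories from 1, so exact bit patterns can differ from this sketch.

```python
def binary_encode(categories):
    """Return {category: list of bits}, using zero-based codes in order of first appearance."""
    classes = list(dict.fromkeys(categories))  # unique values, order preserved
    width = max(1, (len(classes) - 1).bit_length())
    return {
        category: [int(bit) for bit in format(code, f"0{width}b")]
        for code, category in enumerate(classes)
    }


users = [f"User{i}" for i in range(1, 1001)]
encoded = binary_encode(users)

print(encoded["User1"])
print(encoded["User3"])
print(encoded["User1000"])
```

Each list position corresponds to one output column, so 1,000 users fit in 10 columns.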

## **Conclusion**

**Binary Encoding** offers a balanced approach for high-cardinality categorical features. It keeps every category unique without inflating dimensionality or imposing a single artificial order, which makes it a better fit than **Label Encoding** for scenarios like representing 1,000 unique users.

In [20]:
import keras
import tensorflow as tf
import os
import datetime
import pandas as pd
import numpy as np

import pickle

import string

from sklearn.preprocessing import LabelEncoder
import category_encoders as ce

In [31]:
df = pd.read_csv("./r4.2/http.csv",usecols=['user','pc'])
df

Unnamed: 0,user,pc
0,LRR0148,PC-4275
1,NGF0157,PC-6056
2,NGF0157,PC-6056
3,IRM0931,PC-7188
4,IRM0931,PC-7188
...,...,...
28434418,BRM0995,PC-0768
28434419,BRM0995,PC-0768
28434420,ZSB0649,PC-5343
28434421,BAM0636,PC-8138


In [32]:
le_user = LabelEncoder()

print("FITTING USERS")

le_user.fit(df['user'])

print("TRANSFORMING USERS")

df['user'] = le_user.transform(df['user'])

# Save the user label encoder; the with-block closes the file after writing
print("SAVING ENCODED USER")
with open("encoded_objects/user_label_encoder.pkl", 'wb') as pkl_user_label_output:
    pickle.dump(le_user, pkl_user_label_output)


# Binary encoder for user (applied to the label-encoded integer column)
user_binary_encoder = ce.BinaryEncoder(cols=['user'])
user_binary_encoder.fit(df)

# Transform the 'user' column into the binary columns user_0 ... user_9
df = user_binary_encoder.transform(df)

# Save the user binary encoder; the with-block closes the file after writing
print("SAVING ENCODED USER")
with open("encoded_objects/user_binary_encoder.pkl", 'wb') as pkl_user_binary_output:
    pickle.dump(user_binary_encoder, pkl_user_binary_output)

df

FITTING USERS
TRANSFORMING USERS
SAVING ENCODED USER
SAVING ENCODED USER


Unnamed: 0,user_0,user_1,user_2,user_3,user_4,user_5,user_6,user_7,user_8,user_9,pc
0,0,0,0,0,0,0,0,0,0,1,PC-4275
1,0,0,0,0,0,0,0,0,1,0,PC-6056
2,0,0,0,0,0,0,0,0,1,0,PC-6056
3,0,0,0,0,0,0,0,0,1,1,PC-7188
4,0,0,0,0,0,0,0,0,1,1,PC-7188
...,...,...,...,...,...,...,...,...,...,...,...
28434418,1,1,0,1,1,0,1,0,1,0,PC-0768
28434419,1,1,0,1,1,0,1,0,1,0,PC-0768
28434420,1,1,0,1,1,1,0,0,1,0,PC-5343
28434421,1,1,1,1,0,0,0,1,1,1,PC-8138


In [33]:
# PC ENCODER

le_pc = LabelEncoder()

print("FITTING PC")
le_pc.fit(df['pc'])

print("TRANSFORMING PC")
df['pc'] = le_pc.transform(df['pc'])

# Save the pc label encoder; the with-block closes the file after writing
print("SAVING ENCODED PC")
with open("encoded_objects/pc_label_encoder.pkl", 'wb') as pkl_pc_label_output:
    pickle.dump(le_pc, pkl_pc_label_output)


# Binary encoder for PC (applied to the label-encoded integer column)
pc_binary_encoder = ce.BinaryEncoder(cols=['pc'])
pc_binary_encoder.fit(df)

# Transform the 'pc' column into the binary columns pc_0 ... pc_9
df = pc_binary_encoder.transform(df)

# Save the PC binary encoder; the with-block closes the file after writing
print("SAVING ENCODED PC")
with open("encoded_objects/pc_binary_encoder.pkl", 'wb') as pkl_pc_binary_output:
    pickle.dump(pc_binary_encoder, pkl_pc_binary_output)

print("COMPLETED")

df

FITTING PC
TRANSFORMING PC
SAVING ENCODED PC
SAVING ENCODED PC
COMPLETED


Unnamed: 0,user_0,user_1,user_2,user_3,user_4,user_5,user_6,user_7,user_8,user_9,pc_0,pc_1,pc_2,pc_3,pc_4,pc_5,pc_6,pc_7,pc_8,pc_9
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1
4,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28434418,1,1,0,1,1,0,1,0,1,0,1,1,0,1,1,0,1,0,1,0
28434419,1,1,0,1,1,0,1,0,1,0,1,1,0,1,1,0,1,0,1,0
28434420,1,1,0,1,1,1,0,0,1,0,1,1,0,1,1,1,0,0,1,0
28434421,1,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,0,1,1,1
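
The encoders are pickled above, presumably so the same mapping can be reused on new data later. The sketch below shows only the save and reload round trip. It uses a plain dict as a stand-in for a fitted encoder and a temporary directory instead of `encoded_objects/`; both are assumptions for illustration, and the integer codes in the dict are made up.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted encoder (hypothetical codes, for illustration only).
stand_in_encoder = {"LRR0148": 0, "NGF0157": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "user_label_encoder.pkl")

    # Save: the with-block closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(stand_in_encoder, f)

    # Reload: the restored object should equal the one that was saved.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_encoder)
```

With the real pickled encoders, calling `transform` on the reloaded object should reproduce the mapping learned here, provided the same library versions are installed.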
