<a id="table-of-contents"></a>
# 📖 Table of Contents

- [🔢 Label Encoding](#label-encoding)
- [🟦 One-Hot Encoding](#one-hot-encoding)
- [🧱 Dummy Encoding](#dummy-encoding)
- [🧮 Ordinal Encoding](#ordinal-encoding)
- [📊 Frequency / Count Encoding](#frequency-count-encoding)
- [🎯 Target Encoding](#target-encoding)
- [#️⃣ Binary Encoding](#binary-encoding)
- [💠 Hashing Encoding](#hashing-encoding)
___


<a id="label-encoding"></a>
# 🔢 Label Encoding


<details><summary><strong>📖 Click to Expand</strong></summary>

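<p>Label encoding assigns each category a unique integer. The integers are identifiers only: scikit-learn's <code>LabelEncoder</code> assigns them in sorted order of the category values, so they do not reflect any meaningful ranking.</p>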

<ul>
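  <li>Suitable for <strong>ordinal</strong> variables only when the assigned integers happen to match the true order; otherwise use ordinal encoding with an explicit mapping</li>
  <li>Can mislead <strong>linear models</strong> if used with nominal data, because the integers imply an order and spacing that do not exist</li>
  <li>Often used as a quick baseline for tree-based models, which split on thresholds and are less sensitive to the arbitrary integer values</li>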
</ul>

</details>


In [2]:
# Toy dataset reused by every example in this notebook
import pandas as pd

df = pd.DataFrame({
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'City': ['NY', 'LA', 'SF', 'NY', 'SF'],
    'Target': [0, 1, 0, 1, 0]
})
df


Unnamed: 0,Size,Color,City,Target
0,Small,Red,NY,0
1,Medium,Blue,LA,1
2,Large,Green,SF,0
3,Medium,Blue,NY,1
4,Small,Red,SF,0


In [None]:
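from sklearn.preprocessing import LabelEncoder

# Fit on the Size column and replace each category with its integer label
le = LabelEncoder()
df['Size_LabelEncoded'] = le.fit_transform(df['Size'])
# Sorting by Size makes the sorted-order assignment visible (Large=0, Medium=1, Small=2)
df[['Size', 'Size_LabelEncoded']].sort_values(by='Size')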

Unnamed: 0,Size,Size_LabelEncoded
2,Large,0
1,Medium,1
3,Medium,1
0,Small,2
4,Small,2
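
The sorted-order assignment seen in the table (Large = 0, Medium = 1, Small = 2) can be made explicit as a lookup table. The following is a minimal sketch that reproduces the same mapping with `pd.factorize(sort=True)` rather than `LabelEncoder`; the variable names `sizes` and `label_map` are illustrative and not part of the original notebook.

```python
import pandas as pd

sizes = pd.Series(['Small', 'Medium', 'Large', 'Medium', 'Small'], name='Size')

# sort=True orders the unique values before assigning codes,
# which mirrors the sorted-order assignment shown in the table
codes, uniques = pd.factorize(sizes, sort=True)

# Explicit category -> integer lookup table
label_map = {category: code for code, category in enumerate(uniques)}
print(label_map)       # {'Large': 0, 'Medium': 1, 'Small': 2}
print(codes.tolist())  # [2, 1, 0, 1, 2]
```

Note that the codes rank Large below Small, which is why this encoding should not be read as an ordering of sizes.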


[Back to the top](#table-of-contents)
___


<a id="one-hot-encoding"></a>
# 🟦 One-Hot Encoding

<details><summary><strong>📖 Click to Expand</strong></summary>

<p>One-hot encoding creates a binary column for each category, indicating its presence.</p>

<ul>
  <li>Ideal for <strong>nominal</strong> variables with low cardinality</li>
  <li>Can cause <strong>curse of dimensionality</strong> with high cardinality</li>
  <li>Works well with <strong>linear and tree-based models</strong></li>
</ul>

</details>


In [25]:
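# One binary column per Color value; dtype=int keeps the indicators as 0/1
# (recent pandas versions return True/False by default).
# The output also shows helper columns (Size_OrdinalEncoded, City_CountEncoded,
# City_TargetEncoded) from later sections, because this cell was run after them.
df_onehot = pd.get_dummies(df, columns=['Color'], prefix='Color', dtype=int)
df_onehot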


Unnamed: 0,Size,City,Target,Size_LabelEncoded,Size_OrdinalEncoded,City_CountEncoded,City_TargetEncoded,Color_Blue,Color_Green,Color_Red
0,Small,NY,0,2,1,2,0.5,0,0,1
1,Medium,LA,1,1,2,1,1.0,1,0,0
2,Large,SF,0,0,3,2,0.0,0,1,0
3,Medium,NY,1,1,2,2,0.5,1,0,0
4,Small,SF,0,2,1,2,0.0,0,0,1
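
One practical caveat of `pd.get_dummies` is that the columns it produces depend on the categories present in the data it is given. The sketch below shows one possible way to align new data with the training columns using `reindex`; the `new_data` frame and the `'Yellow'` category are hypothetical and only serve the illustration.

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})
# Hypothetical new data containing a category ('Yellow') never seen in training
new_data = pd.DataFrame({'Color': ['Red', 'Yellow']})

train_encoded = pd.get_dummies(train, columns=['Color'], prefix='Color', dtype=int)
new_encoded = pd.get_dummies(new_data, columns=['Color'], prefix='Color', dtype=int)

# Align to the training columns: columns missing from the new data are filled
# with 0, and columns for unseen categories (Color_Yellow) are dropped
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(new_aligned)
```

With this alignment the unseen `'Yellow'` row becomes all zeros, which is one reasonable convention but not the only one.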


[Back to the top](#table-of-contents)
___


<a id="dummy-encoding"></a>
# 🧱 Dummy Encoding

<details><summary><strong>📖 Click to Expand</strong></summary>

<p>Dummy encoding is a variant of one-hot encoding where one category is dropped to serve as a baseline (reference level).</p>

<ul>
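  <li>Avoids the perfect <strong>multicollinearity</strong> (the dummy variable trap) that a full set of one-hot columns plus an intercept causes in linear models</li>
  <li>Only <strong>k-1 columns</strong> are created for k categories</li>
  <li>With <code>drop_first=True</code>, pandas drops the first level (the alphabetically first value for string columns) unless the category order is set explicitly</li>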
</ul>

</details>


In [21]:
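# Dummy encoding: one-hot encoding with drop_first=True.
# Color_Blue is dropped and becomes the reference level; dtype=int keeps 0/1 indicators.
df_dummy = pd.get_dummies(df, columns=['Color'], prefix='Color', drop_first=True, dtype=int)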
df_dummy


Unnamed: 0,Size,City,Target,Size_LabelEncoded,Size_OrdinalEncoded,City_CountEncoded,City_TargetEncoded,Color_Green,Color_Red
0,Small,NY,0,2,1,2,0.5,0,1
1,Medium,LA,1,1,2,1,1.0,0,0
2,Large,SF,0,0,3,2,0.0,1,0
3,Medium,NY,1,1,2,2,0.5,0,0
4,Small,SF,0,2,1,2,0.0,0,1
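
If a specific reference level is wanted rather than the alphabetically first one, one option is to declare the category order before encoding. This is a minimal sketch under that assumption; choosing `'Red'` as the baseline is purely illustrative.

```python
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green', 'Blue', 'Red'], name='Color')

# Put the desired baseline first in the category order;
# drop_first=True then drops that level instead of the alphabetical first one
color_dtype = pd.CategoricalDtype(categories=['Red', 'Blue', 'Green'])
colors_cat = colors.astype(color_dtype)

dummies = pd.get_dummies(colors_cat, prefix='Color', drop_first=True, dtype=int)
print(dummies)
```

Rows with `Color == 'Red'` are then encoded as all zeros, and the remaining columns are read relative to that baseline.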


[Back to the top](#table-of-contents)
___


<a id="ordinal-encoding"></a>
# 🧮 Ordinal Encoding

<details><summary><strong>📖 Click to Expand</strong></summary>

<p>Ordinal encoding maps categories to integers based on their meaningful rank.</p>

<ul>
  <li>Use when categories have a <strong>clear order</strong> (e.g., small, medium, large)</li>
  <li><strong>Manual ordering</strong> is critical to avoid misleading signals</li>
  <li>Works well with models that can leverage numeric relationships</li>
</ul>

</details>


In [None]:
# Manually map Size so the integers follow the real rank: Small < Medium < Large
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
# .map() returns NaN for any value missing from the mapping
df['Size_OrdinalEncoded'] = df['Size'].map(size_mapping)
df[['Size', 'Size_OrdinalEncoded']]

Unnamed: 0,Size,Size_OrdinalEncoded
0,Small,1
1,Medium,2
2,Large,3
3,Medium,2
4,Small,1
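
The same ranking can also be declared once as an ordered categorical type, which makes unexpected values easy to spot. This is a sketch of that alternative, not the notebook's method; the `'XL'` value is a hypothetical unseen category added for illustration.

```python
import pandas as pd

sizes = pd.Series(['Small', 'Medium', 'Large', 'Medium', 'XL'], name='Size')

# Declare the rank explicitly; values outside the list ('XL' here) become missing
size_dtype = pd.CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)
sizes_cat = sizes.astype(size_dtype)

# cat.codes starts at 0 and uses -1 for missing values;
# adding 1 matches the 1..3 scale of the manual mapping, so unknown values show up as 0
encoded = sizes_cat.cat.codes + 1
print(encoded.tolist())  # [1, 2, 3, 2, 0]
```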


[Back to the top](#table-of-contents)
___


<a id="frequency-count-encoding"></a>
# 📊 Frequency / Count Encoding

<details><summary><strong>📖 Click to Expand</strong></summary>

<p>Replaces each category with its count or frequency in the dataset.</p>

<ul>
  <li>Simple and <strong>efficient</strong> for high-cardinality features</li>
  <li>Works well with <strong>tree-based models</strong></li>
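  <li>Can mislead <strong>linear models</strong>, since the counts impose a numeric scale unrelated to the target, and categories with equal counts become indistinguishable</li>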
</ul>

</details>


In [None]:
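# Count how often each city appears, then map the counts back onto the column
count_map = df['City'].value_counts().to_dict()
df['City_CountEncoded'] = df['City'].map(count_map)
# NY and SF both appear twice, so they receive the same encoded value
df[['City', 'City_CountEncoded']]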

Unnamed: 0,City,City_CountEncoded
0,NY,2
1,LA,1
2,SF,2
3,NY,2
4,SF,2
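
The cell above uses raw counts. The frequency variant mentioned in the description can be sketched the same way by normalizing the counts; the names `freq_map` and `city_freq` are illustrative.

```python
import pandas as pd

cities = pd.Series(['NY', 'LA', 'SF', 'NY', 'SF'], name='City')

# normalize=True returns each category's share of rows instead of its raw count
freq_map = cities.value_counts(normalize=True).to_dict()
city_freq = cities.map(freq_map)
print(city_freq.tolist())  # [0.4, 0.2, 0.4, 0.4, 0.4]
```

Frequencies stay on a 0 to 1 scale regardless of dataset size, which can make them easier to compare across datasets than raw counts.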


[Back to the top](#table-of-contents)
___


<a id="target-encoding"></a>
# 🎯 Target Encoding

<details><summary><strong>📖 Click to Expand</strong></summary>

<p>Each category is replaced with the mean of the target variable for that category.</p>

<ul>
  <li>Effective for <strong>high-cardinality</strong> categorical features</li>
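  <li>Prone to <strong>data leakage</strong> and overfitting when the means are computed on the same rows the model is trained on, especially for rare categories</li>
  <li>Use <strong>regularization</strong> (smoothing toward the global mean) or means computed out-of-fold via cross-validation to reduce this risk</li>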
</ul>

</details>


In [8]:
# Mean target for each City
# (computed on the full dataset for illustration only; this leaks the target)
target_mean = df.groupby('City')['Target'].mean().to_dict()
df['City_TargetEncoded'] = df['City'].map(target_mean)
df[['City', 'City_TargetEncoded']]


Unnamed: 0,City,City_TargetEncoded
0,NY,0.5
1,LA,1.0
2,SF,0.0
3,NY,0.5
4,SF,0.0
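
The regularization mentioned in the description can take the form of smoothing each category mean toward the global mean. The following is a minimal sketch of one common formula, (n * category_mean + m * global_mean) / (n + m); the smoothing strength `m = 2` and the column name `City_TargetSmoothed` are assumptions chosen for illustration.

```python
import pandas as pd

df_te = pd.DataFrame({
    'City': ['NY', 'LA', 'SF', 'NY', 'SF'],
    'Target': [0, 1, 0, 1, 0],
})

m = 2  # hypothetical smoothing strength: acts like m pseudo-rows at the global mean
global_mean = df_te['Target'].mean()

# Per-category mean and row count
stats = df_te.groupby('City')['Target'].agg(['mean', 'count'])
# Weighted blend of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df_te['City_TargetSmoothed'] = df_te['City'].map(smoothed)
print(smoothed.round(2).to_dict())  # {'LA': 0.6, 'NY': 0.45, 'SF': 0.2}
```

LA, which has a single row, moves from 1.0 to 0.6, so rare categories are pulled hardest toward the global mean of 0.4. Smoothing alone does not remove leakage; in practice the statistics would still be computed on training folds only.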


[Back to the top](#table-of-contents)
___


<a id="binary-encoding"></a>
# #️⃣ Binary Encoding

<details><summary><strong>📖 Click to Expand</strong></summary>

<p>Assigns each category an integer code, writes that code in binary, then splits the bits into separate columns.</p>

<ul>
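  <li>Reduces dimensionality vs. one-hot: roughly log2(k) columns for k categories instead of k</li>
  <li>Good for <strong>medium cardinality</strong> data</li>
  <li>Less interpretable than one-hot or label encoding, since individual bit columns have no meaning on their own</li>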
</ul>

</details>


In [18]:
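def binary_encode(series):
    # Integer code per category (0..k-1, assigned in sorted order)
    codes = series.astype('category').cat.codes
    # Bits needed for the largest code; at least 1 so a single-category column still yields a column
    max_len = max(1, int(codes.max()).bit_length())  # cast to Python int
    # Zero-padded binary string per code, split into one integer per bit
    bits = codes.apply(lambda x: [int(b) for b in format(int(x), f'0{max_len}b')])
    # Keep the original index so the result aligns with the source DataFrame
    return pd.DataFrame(bits.tolist(), index=series.index,
                        columns=[f"{series.name}_bin_{i}" for i in range(max_len)])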

df_binary = binary_encode(df['City'])
df_binary.head()


Unnamed: 0,City_bin_0,City_bin_1
0,0,1
1,0,0
2,1,0
3,0,1
4,1,0
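
To make the dimensionality claim concrete, the sketch below compares column counts for one-hot and binary encoding. It follows the same `bit_length` rule as `binary_encode`, with a floor of one column; the helper name `n_binary_columns` is hypothetical.

```python
def n_binary_columns(n_categories: int) -> int:
    """Bits needed to write the largest code (n_categories - 1) in binary."""
    return max(1, (n_categories - 1).bit_length())

for k in (3, 10, 100, 1000):
    print(f"{k} categories: one-hot = {k} columns, binary = {n_binary_columns(k)} columns")
# 3 categories: one-hot = 3 columns, binary = 2 columns
# 10 categories: one-hot = 10 columns, binary = 4 columns
# 100 categories: one-hot = 100 columns, binary = 7 columns
# 1000 categories: one-hot = 1000 columns, binary = 10 columns
```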


[Back to the top](#table-of-contents)
___


<a id="hashing-encoding"></a>
# 💠 Hashing Encoding

<details><summary><strong>📖 Click to Expand</strong></summary>

<p>Applies a hash function to map categories to a fixed number of columns.</p>

<ul>
  <li>Useful for <strong>extremely high-cardinality</strong> data</li>
  <li>Prone to <strong>hash collisions</strong>, which may reduce signal</li>
  <li>Non-invertible: can't trace back original categories</li>
</ul>

</details>


In [26]:
import hashlib

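def hash_encode(series, n_components=4):
    def hash_string(val):
        # MD5 digest of the category string, read as one large integer
        h = int(hashlib.md5(val.encode()).hexdigest(), 16)
        # Keep only the lowest n_components bits, one bit per output column
        return [int(b) for b in bin(h)[2:].zfill(n_components)[-n_components:]]

    # Cast to str so non-string categories can be hashed as well
    hashed = series.astype(str).apply(hash_string)
    # Keep the original index so the result aligns with the source DataFrame
    return pd.DataFrame(hashed.tolist(), index=series.index,
                        columns=[f"{series.name}_hash_{i}" for i in range(n_components)])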

hash_encode(df['City'], n_components=4)

Unnamed: 0,City_hash_0,City_hash_1,City_hash_2,City_hash_3
0,1,0,0,0
1,0,0,0,1
2,1,1,0,1
3,1,0,0,0
4,1,1,0,1
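
The collision risk mentioned in the description follows directly from squeezing many categories into a fixed number of slots. The sketch below uses a different variant from the cell above: it maps each category to a single bucket index via a modulo, the form often called the hashing trick. The city codes beyond NY, LA and SF and the helper name `hash_bucket` are hypothetical.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int) -> int:
    """Map a string to a bucket index in [0, n_buckets) using MD5."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical city codes: more categories than buckets
cities = ['NY', 'LA', 'SF', 'CHI', 'BOS', 'SEA']
n_buckets = 4
buckets = {city: hash_bucket(city, n_buckets) for city in cities}
print(buckets)

# Six categories cannot fit into four buckets without sharing (pigeonhole principle)
n_collisions = len(cities) - len(set(buckets.values()))
print(f"{n_collisions} categories landed in an already occupied bucket")
```

Because the mapping is deterministic, the same category always lands in the same bucket, but two categories that share a bucket can no longer be told apart by the model.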


[Back to the top](#table-of-contents)
___
