# Label Encoding Techniques: A Senior Analyst's Guide

## 1. Introduction
**Label Encoding** converts categorical text values into integers so that machine learning models, which operate on numbers, can process them.

### The Critical Distinction: Nominal vs. Ordinal
Before encoding, you must classify your data:
* **Ordinal Data:** Categories with an inherent order (e.g., *Low, Medium, High* or *Junior, Senior, Lead*). The numerical order matters ($0 < 1 < 2$).
* **Nominal Data:** Categories with NO inherent order (e.g., *Red, Blue, Green* or *Paris, Tokyo, New York*). 

**Warning:** Applying Label Encoding to *Nominal* data can mislead models that treat feature values as magnitudes (e.g., Linear Regression) by implying an order that doesn't exist (e.g., Paris < Tokyo). For Nominal data, One-Hot Encoding is usually preferred.
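
To make the warning concrete, here is a minimal sketch that contrasts integer codes with one-hot columns. The `cities` frame is hypothetical and exists only for this illustration; `pd.get_dummies` is one of several ways to one-hot encode.

```python
import pandas as pd

# Hypothetical nominal column, used only for illustration
cities = pd.DataFrame({'City': ['Paris', 'Tokyo', 'New York', 'Tokyo']})

# Integer codes impose an arbitrary order (categories are sorted alphabetically)
label_codes = cities['City'].astype('category').cat.codes

# One-hot encoding creates one indicator column per category instead
one_hot = pd.get_dummies(cities['City'], prefix='City', dtype=int)

print(label_codes.tolist())
print(one_hot)
```

With integer codes, a linear model would read "Tokyo" as larger than "Paris". The one-hot columns carry no such ranking.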

---

```python
# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

print("Libraries imported successfully.")
```

## 2. Generating Synthetic Research Data
We will create a dataframe containing both types of categorical data to demonstrate the difference in encoding techniques.

```python
# Step 2: Create a dummy dataset
data = {
    'EmployeeID': [101, 102, 103, 104, 105],
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],                    # Nominal (no order)
    'Performance': ['Average', 'Excellent', 'Poor', 'Good', 'Excellent']   # Ordinal (order exists)
}

df = pd.DataFrame(data)

print("Original Data:")
display(df)
```

## 3. Approach 1: Manual Mapping (Pandas `map`)
**Best for:** Ordinal Data where specific order is crucial.

Automatic encoders usually sort categories alphabetically. In our data, 'Poor' ranks below 'Excellent' semantically, yet 'Excellent' comes before 'Poor' alphabetically, so an automatic encoder would assign 'Poor' the larger code. Manual mapping gives us full control.

```python
# Define the semantic order explicitly
performance_order = {
    'Poor': 0,
    'Average': 1,
    'Good': 2,
    'Excellent': 3
}

# Apply the mapping
df['Performance_Encoded_Map'] = df['Performance'].map(performance_order)

print("Mapping Results (Notice 'Poor' is 0, 'Excellent' is 3):")
display(df[['Performance', 'Performance_Encoded_Map']])
```
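
One caveat with `map`: any label missing from the dictionary is silently turned into `NaN`. The sketch below shows a simple check for that. The `raw` series and its lowercase typo are hypothetical, introduced only to trigger the behavior.

```python
import pandas as pd

performance_order = {'Poor': 0, 'Average': 1, 'Good': 2, 'Excellent': 3}

# Hypothetical data with a typo ('excellent') that the dictionary does not cover
raw = pd.Series(['Good', 'excellent', 'Poor'])
encoded = raw.map(performance_order)

# Unmapped labels come back as NaN, so list them before training a model
unmapped = raw[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

Running a check like this right after the mapping step catches spelling variants early, before they surface as missing values downstream.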

## 4. Approach 2: Scikit-Learn `LabelEncoder`
**Best for:** Target vectors (y), or nominal features when order doesn't matter and you will use tree-based models.

**The Pitfall:** `LabelEncoder` sorts categories alphabetically before encoding. Watch what happens to the 'Performance' column below.

```python
# Initialize the encoder
le = LabelEncoder()

# Fit and Transform the 'Performance' column
df['Performance_Encoded_Sklearn'] = le.fit_transform(df['Performance'])

print("LabelEncoder Results (Alphabetical Sort):")
display(df[['Performance', 'Performance_Encoded_Sklearn']])

# Check the classes to see the alphabetical order
print("\nClass Order assigned by LabelEncoder:")
for i, item in enumerate(le.classes_):
    print(f"{item} = {i}")
```

**Observation:** `LabelEncoder` assigned `Average = 0`, `Excellent = 1`, `Good = 2`, and `Poor = 3`. Because the codes follow alphabetical order, `Poor` receives the highest value and `Excellent` ranks below `Good`. This destroys the semantic meaning of "Performance"!
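
For contrast, here is a minimal sketch of the use case `LabelEncoder` is suited to: encoding a target vector and mapping predictions back to labels. The `spam`/`ham` labels and the `le_target` name are hypothetical and not part of the dataset above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels for a target vector y
y = np.array(['spam', 'ham', 'spam', 'ham', 'ham'])

le_target = LabelEncoder()
y_encoded = le_target.fit_transform(y)  # classes are sorted: ham = 0, spam = 1

# inverse_transform maps integer codes back to the original labels
y_decoded = le_target.inverse_transform(y_encoded)

print(y_encoded)
print(y_decoded)
```

For a classification target the alphabetical ordering is harmless, because classifiers treat the integers as class identifiers rather than magnitudes.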

## 5. Approach 3: Scikit-Learn `OrdinalEncoder`
**Best for:** Feature matrices (X) with multiple columns.

Unlike `LabelEncoder` (which expects a 1D array), `OrdinalEncoder` is designed for 2D feature arrays and accepts a `categories` argument that enforces the order we specify.

```python
# Define the desired order for the categories (one inner list per column)
desired_order = [['Poor', 'Average', 'Good', 'Excellent']]

# Initialize with specific categories
oe = OrdinalEncoder(categories=desired_order)

# OrdinalEncoder expects 2D input, so double brackets select a one-column DataFrame.
# fit_transform returns a 2D float array; ravel() flattens it for the column assignment.
df['Performance_Encoded_Ordinal'] = oe.fit_transform(df[['Performance']]).ravel()

print("OrdinalEncoder Results (Controlled Order):")
display(df[['Performance', 'Performance_Encoded_Ordinal']])
```
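
In production, new data may contain categories the encoder never saw. The sketch below shows one way to handle that, using the `handle_unknown` and `unknown_value` parameters of `OrdinalEncoder` (available in scikit-learn 0.24 and later). The `train` and `new` frames, the `'Outstanding'` label, and the choice of `-1` as the sentinel are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = [['Poor', 'Average', 'Good', 'Excellent']]

# Unknown categories are encoded as -1 instead of raising an error
oe_safe = OrdinalEncoder(
    categories=order,
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)

train = pd.DataFrame({'Performance': ['Poor', 'Good', 'Excellent']})
new = pd.DataFrame({'Performance': ['Average', 'Outstanding']})  # 'Outstanding' is unseen

oe_safe.fit(train)
new_codes = oe_safe.transform(new).ravel()
print(new_codes)
```

Whether a sentinel is appropriate depends on the model. In some pipelines it is safer to let the encoder raise an error so that unexpected labels get reviewed.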

## 6. Approach 4: Pandas `factorize`
**Best for:** Quick, ad-hoc analysis where you just need unique IDs for strings.

By default it assigns codes in order of first appearance; pass `sort=True` to assign them in sorted order of the unique values instead.

```python
# Using factorize on the Nominal column 'Department'
codes, uniques = pd.factorize(df['Department'])

df['Department_Encoded_Factorize'] = codes

print("Factorize Results on Department:")
display(df[['Department', 'Department_Encoded_Factorize']])
print(f"Unique Classes: {uniques}")
```
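
Because `factorize` keeps no fitted state, the same label can receive different codes in different datasets. The sketch below illustrates the problem and one possible workaround: reusing the training vocabulary through `pd.Categorical`. The `train_dept` and `test_dept` series are hypothetical.

```python
import pandas as pd

# Hypothetical train/test splits containing the same labels in different order
train_dept = pd.Series(['Sales', 'IT', 'HR'])
test_dept = pd.Series(['HR', 'Sales', 'IT'])

# Default behavior: codes follow order of first appearance in each series
train_codes, train_uniques = pd.factorize(train_dept)
test_codes, test_uniques = pd.factorize(test_dept)
print(list(train_uniques), list(test_uniques))  # 'Sales' is code 0 in train but 1 in test

# sort=True assigns codes by sorted order of the unique values
sorted_codes, sorted_uniques = pd.factorize(train_dept, sort=True)

# Reuse the training vocabulary so test codes match train codes
consistent = pd.Categorical(test_dept, categories=sorted_uniques).codes
print(consistent.tolist())
```

Labels absent from the training vocabulary receive the code `-1` under this workaround, so they remain easy to detect.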

## 7. Summary & Best Practices

| Method | Use Case | Pros | Cons |
| :--- | :--- | :--- | :--- |
| **Pandas `map`** | **Ordinal Data** (Small/Medium) | Full control over order (`Low=0, High=2`). | Requires manual dictionary creation; unmapped values become `NaN`. |
| **Sklearn `LabelEncoder`** | **Target Variable (y)** | Standard API, easy to invert with `inverse_transform`. | Always sorts classes alphabetically, with no option to set the order (bad for ordinal features). |
| **Sklearn `OrdinalEncoder`** | **Feature Matrix (X)** | Handles 2D arrays, integrates into Sklearn Pipelines. | Slightly more verbose syntax. |
| **Pandas `factorize`** | **Quick Prototyping** | Very fast, needs nothing beyond pandas. | Stores no fitted mapping, so codes can differ between datasets (e.g., train vs. test). |

### Final Recommendation
For professional ML pipelines, use **`OrdinalEncoder`** with explicitly defined categories for ordinal features, and **`OneHotEncoder`** for nominal features.
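
As a rough sketch of this recommendation, the two encoders can be combined in a `ColumnTransformer`. The `df_demo` frame mirrors the dataset used in this guide, and the transformer names (`'ordinal'`, `'onehot'`) are arbitrary. Setting `sparse_threshold=0` is a choice made here only to get a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df_demo = pd.DataFrame({
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'Performance': ['Average', 'Excellent', 'Poor', 'Good', 'Excellent'],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Ordinal feature: explicit category order
        ('ordinal',
         OrdinalEncoder(categories=[['Poor', 'Average', 'Good', 'Excellent']]),
         ['Performance']),
        # Nominal feature: one indicator column per department
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['Department']),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df_demo)
print(X)
```

The first output column holds the ordinal codes, followed by one indicator column per department in alphabetical order (HR, IT, Sales). The fitted `preprocessor` can then be placed in a scikit-learn `Pipeline` ahead of an estimator.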