# Label and Ordinal Encoding

In [1]:
import pandas as pd

In [2]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Ordinal values carry a rank (first, second, ...). LabelEncoder knows nothing about rank:
# by default it numbers the colors in alphabetical order.
df = pd.DataFrame({
    'color':['red','blue','green','green','red','blue']
})

In [4]:
df

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red
5,blue


In [5]:
# create an instance of label encoder
encoder = LabelEncoder()

In [6]:
# LabelEncoder expects a 1D input, so pass the Series rather than the 2D df[['color']]
encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

By default, `LabelEncoder` assigns integers in sorted (alphabetical) order of the labels: blue = 0, green = 1, red = 2.
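To check the mapping directly, the fitted encoder's `classes_` attribute can be inspected. This is a minimal sketch that fits a fresh encoder on the same colors; the names `le`, `codes`, and `mapping` are chosen here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'green', 'red', 'blue']

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted unique labels; each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform maps the integer codes back to the original labels
restored = [str(label) for label in le.inverse_transform(codes)]
print(restored)
```

Because the codes follow sorted order, adding a new color later can shift the existing codes when the encoder is refitted.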

Both **Ordinal Encoding** and **Label Encoding** are techniques used in **feature engineering** to convert categorical data into numerical values for machine learning models, but they are **not the same**, and using the wrong one can hurt your model's performance.

---

### 🔄 1. **Label Encoding**

* **What it does**: Assigns a unique integer to **each category** in a single column.
* **Example**:

  ```python
  from sklearn.preprocessing import LabelEncoder
  le = LabelEncoder()
  le.fit_transform(['Red', 'Green', 'Blue'])  
  # Output: [2, 1, 0]
  ```
* **Good for**:

  * Target variables (labels) in classification problems.
* **Not ideal for**:

  * Categorical features with **no inherent order** — can mislead models into thinking there's a ranking.

---

### 🔢 2. **Ordinal Encoding**

* **What it does**: Converts categories into integers **based on a defined or natural order**.
* **Example**:

  ```python
  from sklearn.preprocessing import OrdinalEncoder
  enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
  enc.fit_transform([['Medium'], ['Low'], ['High']])
  # Output: [[1.], [0.], [2.]]  (float64 by default)
  ```
* **Good for**:

  * Features with a **natural rank/order** (e.g., education level, size, ratings).
* **Bad for**:

  * Features with **no meaningful order** — use One-Hot Encoding instead.

---

### 🔍 Key Differences:

| Feature          | Label Encoding                          | Ordinal Encoding                           |
| ---------------- | --------------------------------------- | ------------------------------------------ |
| Multiple Columns | No (one 1D column at a time)            | Yes (works with 2D input, many columns)    |
| Order Preserved  | No (always sorted alphabetically)       | Yes (when `categories` gives the order)    |
| Risk of Misuse   | High (if used on nominal features)      | Low (if order is meaningful)               |
| Output Type      | Integer                                 | Float by default (configurable via `dtype`) |
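The multi-column row can be illustrated with a short sketch. The two-column frame below is hypothetical and not part of the notebook's data; `dtype=np.int64` is passed only to show that the default float output can be changed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical two-column frame, used only for illustration
data = pd.DataFrame({
    'size': ['small', 'large', 'medium'],
    'quality': ['good', 'bad', 'good'],
})

# One category list per column, each written in the rank order we want
enc = OrdinalEncoder(
    categories=[['small', 'medium', 'large'], ['bad', 'good']],
    dtype=np.int64,  # the default would be float64
)
encoded = enc.fit_transform(data)
print(encoded)
```

`LabelEncoder` would need one encoder per column for the same job, and it would not accept a custom order.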

---

### ✅ When to Use What:

| Use Case                                                      | Recommended Encoding |
| ------------------------------------------------------------- | -------------------- |
| Target labels (what we predict) in classification             | **Label Encoding**   |
| Ordered categorical features (e.g., "Low", "Medium", "High")  | **Ordinal Encoding** |
| Unordered categorical features (e.g., "Red", "Green", "Blue") | **One-Hot Encoding** |
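One way to apply the table to a frame with both kinds of feature is scikit-learn's `ColumnTransformer`. The sketch below assumes a hypothetical frame `X` with an ordered `size` column and an unordered `color` column; the step names `size_ord` and `color_ohe` are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical feature frame: one ordered column, one unordered column
X = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'color': ['red', 'blue', 'green', 'red'],
})

preprocess = ColumnTransformer([
    # ordered feature: integer codes in the stated rank order
    ('size_ord', OrdinalEncoder(categories=[['small', 'medium', 'large']]), ['size']),
    # unordered feature: one binary column per category
    ('color_ohe', OneHotEncoder(), ['color']),
])

X_enc = preprocess.fit_transform(X)
# The result can be sparse depending on its density; densify it for display
if hasattr(X_enc, 'toarray'):
    X_enc = X_enc.toarray()
print(X_enc)  # columns: size code, then blue, green, red
```

The target column would still be encoded separately with `LabelEncoder`, since it is not a feature.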

---




### 📌 **Nominal Data** (**unordered categorical** data):

These are **categories that have no inherent order or ranking**. The values are just labels — there's no “greater than” or “less than” relationship between them.

---

### ✅ Examples of Nominal Data:

* **Colors**: Red, Green, Blue
* **Cities**: Mumbai, Delhi, Bangalore
* **Genders**: Male, Female, Other
* **Animal Types**: Dog, Cat, Bird

---

### ❌ Why You Shouldn’t Use Label/Ordinal Encoding on Nominal Data:

Label or ordinal encoding will assign **integers** (e.g., Red = 0, Green = 1, Blue = 2). Models that treat inputs as magnitudes, such as linear or distance-based models, can then read the codes as an **order or magnitude**, like:

```
Blue > Green > Red
```

But in nominal data, that **ranking makes no sense**, so the model may learn **incorrect patterns**.
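A small numeric sketch makes the problem concrete. It compares distances between the integer codes from the example above with distances between one-hot vectors for the same colors; the dictionaries are hand-written for illustration and do not come from any encoder.

```python
import numpy as np

# Integer codes from the example above: Red = 0, Green = 1, Blue = 2
codes = {'Red': 0, 'Green': 1, 'Blue': 2}

# Hand-written one-hot vectors for the same three colors
one_hot = {
    'Red': np.array([1, 0, 0]),
    'Green': np.array([0, 1, 0]),
    'Blue': np.array([0, 0, 1]),
}

pairs = [('Red', 'Green'), ('Green', 'Blue'), ('Red', 'Blue')]

# Distance between integer codes: Red and Blue end up twice as far apart
code_dist = {p: abs(codes[p[0]] - codes[p[1]]) for p in pairs}

# Euclidean distance between one-hot vectors: identical for every pair
onehot_dist = {p: float(np.linalg.norm(one_hot[p[0]] - one_hot[p[1]])) for p in pairs}

for p in pairs:
    print(p, code_dist[p], round(onehot_dist[p], 3))
```

With integer codes, a distance-based model would treat Red and Blue as less similar than Red and Green, a difference that exists only because of the numbering.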

---

### ✅ What to Use Instead:

Use **One-Hot Encoding**, which creates **binary columns**:

```
Color_Red  Color_Green  Color_Blue
    1           0           0
    0           1           0
    0           0           1
```

This avoids any artificial ranking.
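A table like this can be produced with `pandas.get_dummies`, shown here as one possible approach on a three-row frame built for illustration. Note that pandas orders the new columns by category name, so they come out as Blue, Green, Red rather than in the order drawn above.

```python
import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# get_dummies creates one 0/1 column per category, named <column>_<category>
dummies = pd.get_dummies(colors, columns=['Color'], dtype=int)
print(dummies)
```

Each row has exactly one 1, so no category is numerically larger than another.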

---


In [9]:
## Ordinal Encoding

from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'size':['small','medium','large','medium','small','large']
})


In [10]:
encoder = OrdinalEncoder(categories=[['small','medium','large']])  # categories lists the values in rank order, lowest first
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
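The same ranking can also be expressed in pandas alone with an ordered categorical dtype. This is an alternative sketch, not what the cells above do; `size_dtype` and `size_codes` are names introduced here.

```python
import pandas as pd

sizes = pd.Series(['small', 'medium', 'large', 'medium', 'small', 'large'])

# An ordered categorical dtype carries the rank explicitly
size_dtype = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)

# .cat.codes gives each value's position in the category list
size_codes = sizes.astype(size_dtype).cat.codes
print(size_codes.tolist())  # [0, 1, 2, 1, 0, 2]
```

Unlike `OrdinalEncoder`, this returns integer codes for a single column and is not a fitted transformer, so it suits quick exploration more than a reusable pipeline.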