# 📘 Notes: Encoding, Scaling, and Feature Use in ML Models

---

## 🔸 One-Hot vs Label Encoding

- Use **One-Hot Encoding** when:
  - You have **nominal** categories (no inherent order)
  - Especially for **linear models** (Linear Regression, Logistic Regression, etc.)
- Use **Label Encoding** when:
  - Categories have an **ordinal relationship**
  - Or for **tree-based models** (Decision Trees, Random Forests, XGBoost)

> 💡 Tree models split on thresholds of a feature's values, so they depend on the ordering of the values rather than their scale. Label encoding therefore usually works fine.
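
As a minimal sketch of the two encodings, the snippet below uses scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The `colors` and `sizes` arrays are made-up illustration data, and `OrdinalEncoder` stands in for "label encoding" of a feature here (scikit-learn's `LabelEncoder` is intended for target labels).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical nominal feature: colors have no natural order.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# One-hot: one binary column per category.
onehot = OneHotEncoder()
colors_onehot = onehot.fit_transform(colors).toarray()
print(onehot.categories_[0])  # category order used for the columns
print(colors_onehot)

# Hypothetical ordinal feature: sizes do have an order, so pass it explicitly.
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal.fit_transform(sizes)
print(sizes_encoded.ravel())
```

Passing the category order explicitly matters for ordinal data: left to its default, the encoder would sort the strings alphabetically, which would not match small < medium < large.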

---

## 🔸 Linear Models & Label Encoding

- Linear models treat label-encoded values as **magnitudes**: with codes 0, 1, 2, the model assumes each step up changes the prediction by the same amount.
- If the categories are unordered, this imposes a **false order** (and false equal spacing) that can mislead the model.
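
The effect can be seen in a small sketch with NumPy least squares. The data is hypothetical: three unordered categories coded 0, 1, 2, where the middle code happens to have the highest target value.

```python
import numpy as np

# Hypothetical data: three unordered categories with different average targets.
# The codes 0, 1, 2 are arbitrary labels, not magnitudes.
codes = np.array([0, 0, 1, 1, 2, 2])
y = np.array([10.0, 10.0, 30.0, 30.0, 14.0, 14.0])

# Linear fit on the label-encoded column: y ~ b0 + b1 * code.
X_label = np.column_stack([np.ones(len(codes)), codes])
coef_label, *_ = np.linalg.lstsq(X_label, y, rcond=None)
pred_label = X_label @ coef_label

# Linear fit on one-hot columns: one coefficient per category.
X_onehot = np.eye(3)[codes]
coef_onehot, *_ = np.linalg.lstsq(X_onehot, y, rcond=None)
pred_onehot = X_onehot @ coef_onehot

print(pred_label.round(2))   # forced onto a straight line through the codes
print(pred_onehot.round(2))  # recovers each category's mean
```

The label-encoded fit can only draw a straight line through the codes, so it misses the peak at the middle category, while the one-hot fit gives each category its own coefficient.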

---

## 🔸 One-Hot Encoding in Tree Models?

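- It **also works**, and can help with **low-cardinality** categories, since each category gets its own column to split on.
- However, it increases dimensionality; with high-cardinality features the many sparse columns often make splits less effective.
- Label encoding is **simpler** and often sufficient for trees, since successive threshold splits can separate individual codes.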
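
A short sketch of why label encoding is often enough for trees, assuming scikit-learn's `DecisionTreeRegressor` and made-up data in which the middle code has the highest target value:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical label-encoded nominal feature (codes are arbitrary).
codes = np.array([[0], [0], [1], [1], [2], [2]])
y = np.array([10.0, 10.0, 30.0, 30.0, 14.0, 14.0])

# An unrestricted tree can use two threshold splits to isolate each code.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(codes, y)
pred = tree.predict(codes)

print(pred)                   # per-category predictions
print(tree.get_n_leaves())    # number of leaves used
```

The tree ends up with one leaf per category even though the codes carry no meaningful order, which a single linear coefficient could not do.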

---

## 🔸 Standardization vs Normalization

| Method            | Formula                                      | When to Use                                        |
|-------------------|----------------------------------------------|----------------------------------------------------|
| **Standardization** | \( z = \frac{x - \mu}{\sigma} \)             | Data is **roughly normal**; for linear models, SVM, PCA |
| **Normalization**   | \( x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \)  | **Distance-based** models (KNN, KMeans), neural networks |

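- **Standardization** centers each feature at mean 0 with standard deviation 1; the values are not bounded to a fixed range.
- **Normalization** (min-max scaling) rescales features to a common range (usually [0, 1]) and is more sensitive to outliers, because the min and max define the range.
- Both are linear rescalings, so neither changes the shape of the distribution.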
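
Both formulas can be checked directly with NumPy. This is a minimal sketch on a made-up feature `x` with one outlier; it uses the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np

# Hypothetical feature with one outlier.
x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

# Standardization: z = (x - mean) / std
z = (x - x.mean()) / x.std()

# Min-max normalization: (x - min) / (max - min)
x_norm = (x - x.min()) / (x.max() - x.min())

print(z.round(3))        # mean 0, standard deviation 1, unbounded
print(x_norm.round(3))   # squeezed into [0, 1]

# Both are linear maps of x, so they are perfectly correlated with each other.
print(np.corrcoef(z, x_norm)[0, 1])
```

Note how the outlier pushes the other min-max values close to 0, which is the outlier sensitivity that min-max scaling is known for.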

---

## 🔸 Why Use Min-Max Scaling?

To make features **comparable** in scale, especially when:
- You are using **KNN**, **KMeans**, or **Neural Networks**
- Features with large ranges would otherwise dominate the model

> Real-life example: Income (0–100,000) will dominate age (0–100) in distance calculations unless both are scaled.
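
The income/age example can be sketched with Euclidean distances in NumPy. The three people below are invented, and the min and max are taken from this tiny sample purely for illustration.

```python
import numpy as np

# Hypothetical people as [income, age].
X = np.array([
    [30_000.0, 25.0],   # person A
    [30_500.0, 70.0],   # person B: similar income, very different age
    [60_000.0, 26.0],   # person C: similar age, very different income
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: the income gap swamps the 45-year age gap.
raw_ab = dist(X[0], X[1])
raw_ac = dist(X[0], X[2])

# Min-max scale each column to [0, 1], then recompute.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_ab = dist(X_scaled[0], X_scaled[1])
scaled_ac = dist(X_scaled[0], X_scaled[2])

print(round(raw_ab, 1), round(raw_ac, 1))
print(round(scaled_ab, 3), round(scaled_ac, 3))
```

Before scaling, A looks far closer to B than to C only because income is measured in larger units; after scaling, the age gap and the income gap contribute comparably.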

---

## 🔸 Binary Features like Gender in Models

| Model Type        | Use Gender as 0/1? | Explanation |
|-------------------|-------------------|-------------|
| Tree-based Models | ✅ Yes            | Uses value for splitting (e.g., `gender == 1`) |
| Linear Regression | ✅ Yes            | Learns effect via coefficient `β₁` in `y = β₀ + β₁ * gender` |

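> If gender = 1 means male, then in this single-feature model `β₁` is the average difference in `y` between males and females. With other features in the model, it is that difference holding the other features fixed.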
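
The interpretation of `β₁` can be verified with a small least-squares fit in NumPy. The data and the 0/1 coding are hypothetical.

```python
import numpy as np

# Hypothetical coding: 1 = male, 0 = female.
gender = np.array([0, 0, 0, 1, 1, 1])
y = np.array([50.0, 52.0, 54.0, 60.0, 61.0, 65.0])

# Fit y = beta0 + beta1 * gender by ordinary least squares.
X = np.column_stack([np.ones(len(gender)), gender])
(beta0, beta1), *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare the coefficient with the difference in group means.
mean_diff = y[gender == 1].mean() - y[gender == 0].mean()

print(beta0, beta1, mean_diff)
```

Here `beta0` is the mean of the group coded 0, and `beta1` matches the gap between the two group means.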

---

## 🔸 Categorical Features with >2 Categories

- **Tree models**: Either label or one-hot encoding is fine.
- **Linear models**: Use One-Hot Encoding to avoid implying an order.

---
