# 🧠 Encoders - Summary
- Use OneHotEncoder when you want fine-grained per-column control, need to work with sklearn pipelines, or must decide explicitly how unknown categories are handled.

- Use DictVectorizer when your data is in dictionary format (e.g., JSON or from APIs) and you want to plug it into a pipeline quickly.

- Use pd.get_dummies() for quick, simple transformations when you're staying inside pandas and not building a full ML pipeline.

# OneHotEncoder vs. DictVectorizer

Two scikit‑learn transformers that both turn categorical data into numerical vectors — but they’re built around **different mental models of your data**.

| Feature | **OneHotEncoder** | **DictVectorizer** |
|---------|------------------|--------------------|
| **Data model** | *Column‑oriented table* (NumPy array or pandas DataFrame). Each column has a fixed meaning. | *Bag‑of‑features* (list/iterator of Python dicts). Each `(key, value)` pair is an independent feature. |
| **Typical raw input** | `pd.DataFrame` from CSV / SQL. | JSON‑like records, log entries, API payloads. |
| **How categories are discovered** | Per column: learns unique values *within* each column. | Each distinct `(key, value)` becomes its own feature name `key=value`. |
| **Numeric features** | Every input column is treated as categorical, so numeric columns must be routed separately (e.g., via `ColumnTransformer`). | Numeric *values* pass through unchanged; only strings are one‑hot‑encoded. |
| **Unknown categories at inference** | Controlled with `handle_unknown`: `'error'` (default), `'ignore'`, or `'infrequent_if_exist'`. | Keys and `key=value` pairs not seen during `fit` are silently ignored at `transform`; no error is raised. |
| **Drop reference level** | Built‑in (`drop='first'`, `drop='if_binary'`). | Not built‑in; slice columns manually if needed. |
| **Output format** | CSR sparse (default) or dense (`sparse_output=False`). | CSR sparse (default) or dense (`sparse=False`). |
| **Inverse transform** | ✔️ Returns an array of the original category values. | ✔️ Returns a list of dicts keyed by feature name (e.g. `color=red`), not the original records. |
| **Best for** | Stable, tidy tabular data; fine‑grained per‑column control in pipelines. | Flexible, high‑dimensional, sparse feature spaces from dict/JSON inputs. |

---

## Deeper Intuition

- **OneHotEncoder = “fixed schema, flexible values.”**  
  Think spreadsheet columns (*Color*, *Size*, *City*). The encoder catalogs the possible values **within** each column and builds block‑wise one‑hot columns.

- **DictVectorizer = “flexible schema, key‑value explosion.”**  
  Think log events: each record can carry different keys (`device=Android`, `browser=Chrome`, `temperature=21.4`). Each distinct string‑valued `(key, value)` pair seen during `fit` becomes its own feature; numeric values flow through unchanged as a single feature per key.

---

## Minimal Example



```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "color":  ["red", "blue", "red"],
    "size":   ["M",   "L",    "S" ],
    "weight": [1.1,    2.0,    1.5]
})
records = df.to_dict(orient="records")    # [{'color':'red',...}, ...]

# 1️⃣ OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
X_ohe = ohe.fit_transform(df[["color", "size"]])  # numeric handled elsewhere

# 2️⃣ DictVectorizer
dv = DictVectorizer(sparse=False)
X_dv = dv.fit_transform(records)

print("OneHotEncoder output:\n", X_ohe)
print("DictVectorizer output:\n", X_dv)
```


```text
OneHotEncoder output:
 [[0. 1. 0. 1. 0.]
 [1. 0. 1. 0. 0.]
 [0. 1. 0. 0. 1.]]
DictVectorizer output:
 [[0.  1.  0.  1.  0.  1.1]
 [1.  0.  1.  0.  0.  2. ]
 [0.  1.  0.  0.  1.  1.5]]
```


### Resulting feature names

| Encoder | Feature names |
|---------|---------------|
| **ohe** | `color_blue`, `color_red`, `size_L`, `size_M`, `size_S` |
| **dv**  | `color=blue`, `color=red`, `size=L`, `size=M`, `size=S`, **`weight`** |

`weight` automatically passes through with `DictVectorizer`.

---

### Quick Selection Guide

1. **Tidy DataFrame → OneHotEncoder** (use in a `ColumnTransformer`).
2. **List of dicts / JSON → DictVectorizer** (zero friction).
3. **Need to drop a reference level or control explicitly how unseen categories are handled → OneHotEncoder**.
4. **Just playing inside pandas, no pipeline → `pd.get_dummies()`** is quickest, though less production‑friendly.
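
One reason `pd.get_dummies()` is less production-friendly is that it has no fitted state: the output columns depend on whatever values are present in the frame at hand. The sketch below uses two invented frames, `train` and `serve`, and shows one common workaround (reindexing to the training columns); it is not the only way to handle this.

```python
import pandas as pd

# Hypothetical training and serving data with different category sets.
train = pd.DataFrame({"color": ["red", "blue"]})
serve = pd.DataFrame({"color": ["green", "red"]})

train_cols = list(pd.get_dummies(train).columns)
serve_cols = list(pd.get_dummies(serve).columns)
print(train_cols)
print(serve_cols)  # different columns: a model fitted on train_cols would break

# Workaround: force the serving frame onto the training columns.
# Unseen categories (green) are dropped, missing ones (blue) are filled with 0.
aligned = pd.get_dummies(serve).reindex(columns=train_cols, fill_value=0)
print(aligned)
```

A fitted `OneHotEncoder` or `DictVectorizer` carries this column bookkeeping for you.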

> **Rule of thumb:** choose the transformer whose **input format matches your raw data** and whose **options match your production constraints** (memory, unknown categories, multicollinearity).
