## Feature Extraction (Big Picture)

**Feature extraction** is the step where we convert **text → numbers** so machine-learning models can process it.

Models don't understand words like *"great"* or *"boring"*.
They understand **vectors** like:

```
[0, 2, 0, 1, 3]
```

Two classic techniques:

1. **Bag-of-Words (BoW)**
2. **TF-IDF**

## 1️⃣ Bag-of-Words (BoW)

### What it is

A **count-based representation** of text.

* Ignores grammar
* Ignores word order
* Keeps **how many times** each word appears

### How it works

1. Build a **vocabulary** (all unique words in the corpus)
2. Each document → vector of word counts

### Example

**Corpus**

```
Doc 1: "enjoy disappoint bore great"
Doc 2: "bore disappoint bore bore"
Doc 3: "great great enjoy"
```

**Vocabulary**

```
["enjoy", "disappoint", "bore", "great"]
```

**Feature Matrix**

```
Doc 1 → [1, 1, 1, 1]
Doc 2 → [0, 1, 3, 0]
Doc 3 → [1, 0, 0, 2]
```
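The two steps above can be sketched in plain Python. This is a minimal illustration, assuming whitespace tokenization and a vocabulary kept in first-seen order (which happens to match the vocabulary listed above); the helper names `build_vocabulary` and `bow_vector` are made up for this example.

```python
from collections import Counter

corpus = [
    "enjoy disappoint bore great",
    "bore disappoint bore bore",
    "great great enjoy",
]

def build_vocabulary(docs):
    """Collect unique tokens in first-seen order (whitespace tokenization)."""
    vocab = []
    seen = set()
    for doc in docs:
        for token in doc.split():
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vocabulary = build_vocabulary(corpus)
feature_matrix = [bow_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)      # ['enjoy', 'disappoint', 'bore', 'great']
for row in feature_matrix:
    print(row)         # [1, 1, 1, 1] / [0, 1, 3, 0] / [1, 0, 0, 2]
```

Real pipelines usually add lowercasing, punctuation handling, and a sparse matrix format, all of which are left out here to keep the idea visible.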

### Advantages

* Simple
* Fast to compute
* Works well with models like Logistic Regression, Naive Bayes

### Limitations

* Loses word order
  ("I cleaned my car" = "I my car cleaned")
* Treats all words as equally important
* Large vocab → high dimensionality
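The word-order limitation is easy to check directly. The sketch below assumes lowercasing and whitespace splitting as the tokenizer; `bow_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter

def bow_counts(sentence):
    """Lowercase, split on whitespace, and count tokens."""
    return Counter(sentence.lower().split())

a = bow_counts("I cleaned my car")
b = bow_counts("I my car cleaned")

# Same words, same counts: the two sentences are indistinguishable to BoW
same = a == b
print(same)  # True
```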

## 2️⃣ TF-IDF (Term Frequency – Inverse Document Frequency)


### Why TF-IDF exists

BoW **overvalues common words**: raw counts let frequent but uninformative words (such as "the" or "is") dominate the vector.

TF-IDF answers:

> "Is this word important **in this document** compared to the whole corpus?"

### 🔹 Term Frequency (TF)

Measures **how often a word appears in a document**.

$$
TF(t,d) = \frac{\text{count of term } t \text{ in document } d}{\text{total words in document } d}
$$

Example:

* Word: `"apple"`
* Appears **3 times**
* Document length = **100**

```
TF = 3 / 100 = 0.03
```
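The same calculation as a small function, assuming the document is already tokenized into a list of strings. The 100-token document is synthetic filler built only to reproduce the numbers above.

```python
def term_frequency(term, tokens):
    """Share of tokens in the document that equal `term`."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty document
    return tokens.count(term) / len(tokens)

# Hypothetical 100-token document in which "apple" appears 3 times
tokens = ["apple"] * 3 + ["filler"] * 97
tf_apple = term_frequency("apple", tokens)
print(tf_apple)  # 0.03
```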

### 🔹 Inverse Document Frequency (IDF)

Measures **how rare a word is across all documents**.

$$
IDF(t) = \log \left(\frac{N + 1}{DF(t) + 1}\right)
$$

Where:

* **N** = total number of documents
* **DF(t)** = number of documents containing term *t*
* `+1` prevents division by zero when a term appears in no document (DF(t) = 0)

Intuition:

* Appears in **every document** → IDF = 0
* Appears in **few documents** → high IDF

Example:

* `"apple"` appears in **10 of 1000 documents**

```
IDF = log10(1001 / 11) ≈ 1.96 ≈ 2    (base-10 log; the natural log would give ≈ 4.51)
```
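A minimal sketch of the smoothed IDF formula. It assumes a base-10 logarithm, chosen here only because it reproduces the worked example's value of roughly 2; other bases change the scale but not the ranking of terms. The function name is illustrative.

```python
import math

def inverse_document_frequency(n_docs, doc_freq):
    """Smoothed IDF: log10((N + 1) / (DF + 1))."""
    return math.log10((n_docs + 1) / (doc_freq + 1))

idf_apple = inverse_document_frequency(1000, 10)         # rare term
idf_everywhere = inverse_document_frequency(1000, 1000)  # term in every document

print(round(idf_apple, 2))  # 1.96
print(idf_everywhere)       # 0.0
```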

### 🔹 TF-IDF Score

$$
TF\text{-}IDF = TF \times IDF
$$

Example:

```
TF = 0.03
IDF = 2

TF-IDF = 0.06
```

➡️ High score = **important word for that document**
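Putting TF and IDF together on a tiny corpus shows the effect on common words. The three tokenized documents below are hypothetical and exist only for illustration; the base-10 log and the `+1` smoothing follow the formulas above.

```python
import math

# Hypothetical mini-corpus, already tokenized
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "boring"],
    ["the", "plot", "was", "great"],
]

def tf(term, tokens):
    """Term count divided by document length."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Smoothed inverse document frequency with a base-10 log."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log10((len(docs) + 1) / (df + 1))

def tf_idf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)

# Score every word of the second document
scores = {term: tf_idf(term, docs[1], docs) for term in docs[1]}
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.4f}")
```

In this corpus "the" and "was" occur in every document, so their score is 0, while "boring" (found in one document only) ranks highest for the second document.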

## BoW vs TF-IDF (Quick Comparison)

| Aspect                 | BoW      | TF-IDF           |
| ---------------------- | -------- | ---------------- |
| Word importance        | No       | Yes              |
| Uses frequency         | Yes      | Yes (normalized) |
| Penalizes common words | No       | Yes              |
| Complexity             | Very low | Slightly higher  |
| Semantic meaning       | No       | No               |

## When to Use What?

### Use **BoW** when:

* Dataset is small
* Speed matters
* Interpretability is key

### Use **TF-IDF** when:

* Large corpus
* Many common words
* Classification / search / clustering tasks