# 1️⃣ Bag of Words (BoW)
### What problem it solves

##### Convert text into numbers so ML models can process it.

### Working (step-by-step)
#### Step 1: Corpus


In [None]:
D1: "I love deep learning"
D2: "I love machine learning"

#### Step 2: Build vocabulary

In [None]:
Vocabulary = ["I", "love", "deep", "machine", "learning"]

#### Step 3: Count word frequency per document

| Document | I | love | deep | machine | learning |
| -------- | - | ---- | ---- | ------- | -------- |
| D1       | 1 | 1    | 1    | 0       | 1        |
| D2       | 1 | 1    | 0    | 1       | 1        |


#### Step 4: Vector output

##### Each document → fixed-length vector

In [None]:
D1 → [1, 1, 1, 0, 1]
D2 → [1, 1, 0, 1, 1]
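
Steps 1 to 4 can be sketched in a few lines of plain Python. This is a minimal illustration, assuming naive whitespace tokenization and the vocabulary order used above; the names `corpus`, `vocabulary`, and `bow_vector` are chosen here for the example.

```python
from collections import Counter

# Corpus and vocabulary copied from the steps above.
corpus = {
    "D1": "I love deep learning",
    "D2": "I love machine learning",
}
vocabulary = ["I", "love", "deep", "machine", "learning"]

def bow_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.split())  # naive whitespace tokenization (assumption)
    return [counts[word] for word in vocabulary]

bow = {name: bow_vector(text, vocabulary) for name, text in corpus.items()}

for name, vector in bow.items():
    print(name, vector)
# D1 [1, 1, 1, 0, 1]
# D2 [1, 1, 0, 1, 1]
```

The vocabulary order is arbitrary, but it has to stay fixed so that the same position means the same word in every document vector.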

#### Key properties

Ignores word order

Ignores meaning

High-dimensional & sparse

#### How ML uses it

Each column = feature

Model learns weights per word
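
To make "weights per word" concrete, here is a rough sketch of how a linear model would score a BoW vector. The weights and bias below are hypothetical numbers picked for illustration; a real model would learn them from labeled data.

```python
vocabulary = ["I", "love", "deep", "machine", "learning"]

# Hypothetical weights, one per vocabulary word (not learned, illustration only).
weights = [0.0, 0.5, 1.0, -1.0, 0.2]
bias = 0.0

def linear_score(vector, weights, bias):
    """Weighted sum of features: each word count is scaled by that word's weight."""
    return sum(w * x for w, x in zip(weights, vector)) + bias

d1 = [1, 1, 1, 0, 1]  # BoW vector of D1
d2 = [1, 1, 0, 1, 1]  # BoW vector of D2

score_d1 = linear_score(d1, weights, bias)  # 0.0 + 0.5 + 1.0 + 0.2
score_d2 = linear_score(d2, weights, bias)  # 0.0 + 0.5 - 1.0 + 0.2
```

The two documents differ only in the `deep` and `machine` columns, so those two weights alone decide which document scores higher.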


# 2️⃣ TF-IDF
### What problem it improves

#### BoW treats all words equally → TF-IDF adds importance weighting

### Working
#### Step 1: Term Frequency (TF)

In [None]:
TF(word, doc) = count(word in doc) / total words in doc

#### Step 2: Inverse Document Frequency (IDF)

In [None]:
IDF(word) = log(N / df(word))

N = total documents

df = documents containing the word

#### Step 3: TF-IDF score

In [None]:
TF-IDF = TF × IDF
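
The three formulas can be applied directly to the two-document corpus from the BoW section. The sketch below assumes the natural logarithm, whitespace tokenization, and no smoothing or normalization; the function names are chosen here for illustration.

```python
import math

corpus = {
    "D1": "I love deep learning".split(),
    "D2": "I love machine learning".split(),
}
vocabulary = ["I", "love", "deep", "machine", "learning"]

def tf(word, tokens):
    """Term frequency: share of the document's tokens that are `word`."""
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    """Inverse document frequency: log(N / df), natural log, no smoothing."""
    n_docs = len(corpus)
    df = sum(1 for tokens in corpus.values() if word in tokens)
    return math.log(n_docs / df)

def tfidf_vector(tokens, corpus, vocabulary):
    """TF × IDF for every vocabulary word, in vocabulary order."""
    return [tf(word, tokens) * idf(word, corpus) for word in vocabulary]

tfidf = {name: tfidf_vector(tokens, corpus, vocabulary)
         for name, tokens in corpus.items()}

for name, vector in tfidf.items():
    print(name, [round(score, 2) for score in vector])
# D1 [0.0, 0.0, 0.17, 0.0, 0.0]
# D2 [0.0, 0.0, 0.0, 0.17, 0.0]
```

With only two documents, every word that appears in both gets IDF = log(2/2) = 0 and drops out entirely. Library implementations often add smoothing and vector normalization, so their numbers will differ from this unsmoothed version.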

#### Example result

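| Document | I    | love | deep | machine | learning |
| -------- | ---- | ---- | ---- | ------- | -------- |
| D1       | 0.00 | 0.00 | 0.17 | 0.00    | 0.00     |
| D2       | 0.00 | 0.00 | 0.00 | 0.17    | 0.00     |

Computed with the formulas above (natural log) on D1 and D2: I, love, and learning appear in both documents, so IDF = log(2/2) = 0. deep and machine each appear in one document, so the score is 1/4 × log(2) ≈ 0.17.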


### What the score means

Higher score → the word is frequent in that document and rare across the corpus

Still no semantics

### How ML uses it

Same as BoW:

Vector + label → classifier/regressor

# 3️⃣ Word2Vec (Embedding Learning)
### What problem it solves

#### BoW / TF-IDF:

Sparse

No meaning

No similarity

#### Word2Vec learns semantic meaning.

#### Core idea

Words appearing in similar contexts have similar meanings.

### Architecture (Skip-Gram example)
#### Input

Center word (one-hot vector)

#### Output

Context word (one-hot vector)

In [None]:
Input → Hidden → Output

#### Hidden layer weights = embeddings

### Training process

Sentence:

"I love deep learning"


Window size = 1

Training pairs:

In [None]:
(I → love)
(love → I), (love → deep)
(deep → love), (deep → learning)
(learning → deep)
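
Pair generation is mechanical enough to sketch in code. With window size 1, each word is paired with its immediate left and right neighbours, so the words at the sentence edges contribute a single pair each. The helper name `skipgram_pairs` is an assumption made for this example.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("I love deep learning".split(), window=1)
# pairs == [("I", "love"),
#           ("love", "I"), ("love", "deep"),
#           ("deep", "love"), ("deep", "learning"),
#           ("learning", "deep")]
```

A larger window produces more pairs per center word and tends to capture broader topical similarity rather than strictly local usage.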

#### Neural Network structure

Input layer → one-hot (V size)

Hidden layer → D neurons (embedding dimension)

Output layer → V size
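
The three layers can be sketched as two matrix multiplications. This is an untrained forward pass only, assuming a plain softmax output layer, randomly initialized weights, and a hypothetical embedding dimension of 3; the names `W_in` and `W_out` are chosen for the example.

```python
import math
import random

vocabulary = ["I", "love", "deep", "learning"]
V = len(vocabulary)  # vocabulary size
D = 3                # embedding dimension (hypothetical choice)

random.seed(0)
# Untrained, randomly initialized weights; training would adjust both matrices.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]   # V x D
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]  # D x V

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocabulary]

def forward(word):
    x = one_hot(word)
    # Hidden layer: x · W_in. Because x is one-hot, this selects one row of W_in.
    hidden = [sum(x[i] * W_in[i][d] for i in range(V)) for d in range(D)]
    # Output layer: one logit per vocabulary word, turned into probabilities.
    logits = [sum(hidden[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return hidden, [e / total for e in exps]

hidden, probs = forward("deep")
```

Since the one-hot input just picks out a row, `hidden` equals `W_in[2]`, the row for "deep". That is why the rows of the input weight matrix serve as the word embeddings once training is done.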

#### Learning via backprop

Pull related words closer

Push unrelated words apart

#### After training:

Keep embedding matrix

Discard the rest of the network (the output layer)

### Output

In [None]:
Embedding("deep") = [0.12, -0.87, 0.44, ...]


### Properties

Dense

Semantic similarity

Static (same vector everywhere)
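
Semantic similarity between dense vectors is commonly measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors made up for illustration; real embeddings come from training and typically have on the order of a hundred or more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example (not trained values).
embeddings = {
    "deep":    [0.9, 0.1, 0.3],
    "machine": [0.8, 0.2, 0.4],
    "love":    [-0.2, 0.9, 0.1],
}

sim_related = cosine_similarity(embeddings["deep"], embeddings["machine"])
sim_unrelated = cosine_similarity(embeddings["deep"], embeddings["love"])
```

Here `sim_related` comes out close to 1 and `sim_unrelated` close to 0, which is the pattern a trained model would be expected to show for words used in similar versus different contexts. The lookup is a plain dictionary access, which is what "static" means: the vector for a word does not depend on the sentence around it.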

# 4️⃣ Self-Attention (Contextual Embeddings)
### What problem it solves

#### Word2Vec gives each word one fixed vector, so it cannot change meaning per sentence.

### Input

In [None]:
X ∈ ℝ^(T × n)  → embeddings of the T sentence words (n = embedding dimension)

### Working
#### Step 1: Linear projections

In [None]:
Q = X · WQ
K = X · WK
V = X · WV

#### Step 2: Attention scores

In [None]:
Scores = (Q · Kᵀ) / √dₖ

Meaning:

How much each word relates to every other word

#### Step 3: Softmax

In [None]:
Attention = softmax(Scores)

Turns each row of scores into positive weights that sum to 1.
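
Softmax is applied row by row, so each word gets its own weight distribution over all words in the sentence. A minimal sketch, using a hypothetical 2 × 3 score matrix:

```python
import math

def softmax(row):
    """Map a list of scores to positive weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores, one row per query word (illustration only).
scores = [[2.0, 1.0, 0.0],
          [0.0, 0.0, 0.0]]
attention = [softmax(row) for row in scores]
```

The first row puts most of its weight on the highest score, while the second row, with equal scores, spreads its weight evenly at one third each.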

#### Step 4: Weighted sum

In [None]:
Output = Attention · V

Each word becomes a context-aware vector
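
Steps 1 to 4 fit into one short function. The sketch below is a single attention head in plain Python, assuming random, untrained projection matrices and hypothetical sizes (T = 4 words, n = 3, dₖ = 2); it omits masking, multiple heads, and positional information.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, WQ, WK, WV):
    # Step 1: linear projections
    Q, K, V = matmul(X, WQ), matmul(X, WK), matmul(X, WV)
    # Step 2: scaled dot-product scores, shape T x T
    d_k = len(K[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    # Step 3: row-wise softmax
    attention = [softmax(row) for row in scores]
    # Step 4: weighted sum of value vectors, shape T x d_k
    return attention, matmul(attention, V)

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
T, n, d_k = 4, 3, 2  # hypothetical sizes
X = random_matrix(T, n)  # stand-in for the sentence's word embeddings
WQ, WK, WV = random_matrix(n, d_k), random_matrix(n, d_k), random_matrix(n, d_k)

attention, output = self_attention(X, WQ, WK, WV)
```

Each row of `output` mixes the value vectors of all T words, weighted by that row of `attention`. Changing any other word in the sentence changes the mix, which is how the same word ends up with different vectors in different sentences.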

#### Output

In [None]:
Same word → different vector in different sentences

#### Properties

Dense

Semantic + contextual

Computed dynamically

Not stored per word

# 5️⃣ Full comparison (working perspective)

| Method         | How vectors are created             | Meaning captured | Context aware |
| -------------- | ----------------------------------- | ---------------- | ------------- |
| BoW            | Count words                         | ❌                | ❌             |
| TF-IDF         | Count × importance                  | ❌                | ❌             |
| Word2Vec       | Learn from context windows          | ✅                | ❌             |
| Self-Attention | Compute relevance between all words | ✅                | ✅             |


In [None]:
BoW        → "Which words?"
TF-IDF     → "Which words matter?"
Word2Vec   → "What do words mean?"
Self-Attn  → "What do words mean here?"
