<a href="https://colab.research.google.com/github/fpgmina/DeepNLP/blob/main/Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Embeddings: CBOW

CBOW predicts a target word from the words around it. Suppose we **stack the context one-hot vectors as columns** of a matrix:

$$
X = [x_1, x_2, \dots, x_C] \in \mathbb{R}^{V \times C}
$$

where $V$ is vocabulary size and $C$ is the number of context words.

---

**Step 1: Multiply by embedding matrix**

$$
H = E^T X \in \mathbb{R}^{N \times C}
$$

- $E \in \mathbb{R}^{V \times N}$ is the input embedding matrix  
- $N$ is the embedding dimension  
- Each column of $H$ is the embedding of a context word
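
To make Step 1 concrete, here is a minimal sketch showing that multiplying $E^T$ by a matrix of one-hot columns simply selects rows of $E$. The sizes, the random matrix `E`, and the indices in `context_ids` are made-up values for illustration only.

```python
import torch

V, N = 5, 3                      # toy vocabulary size and embedding dimension (assumed values)
context_ids = [1, 3, 4]          # hypothetical context word indices, so C = 3

torch.manual_seed(0)
E = torch.randn(V, N)            # input embedding matrix, one row per word

# Stack the one-hot vectors as columns: X has shape [V, C]
X = torch.zeros(V, len(context_ids))
for col, idx in enumerate(context_ids):
    X[idx, col] = 1.0

H = E.T @ X                      # shape [N, C]

# Column c of H should be row context_ids[c] of E
lookup = E[context_ids].T        # shape [N, C]
print(H.shape)                   # torch.Size([3, 3])
print(torch.allclose(H, lookup)) # True
```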

---

**Step 2: Average columns to get hidden vector**

$$
h = \frac{1}{C} H \mathbf{1}_C
$$

- $\mathbf{1}_C \in \mathbb{R}^{C}$ is a column vector of ones  
- $h \in \mathbb{R}^{N}$ is the mean embedding of the context words
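
As a rough check of Step 2, the sketch below compares the matrix form $\frac{1}{C} H \mathbf{1}_C$ with a plain column mean. It also shows that averaging the one-hot vectors first and embedding afterwards gives the same $h$, because every operation is linear; this is the order the code cells in this notebook use. All sizes and indices are assumed toy values.

```python
import torch

V, N = 6, 3                                  # assumed toy sizes
context_ids = torch.tensor([0, 2, 2, 5])     # hypothetical context; repeated words are allowed
C = len(context_ids)

torch.manual_seed(0)
E = torch.randn(V, N)                        # stand-in for the input embedding matrix
X = torch.zeros(V, C)
X[context_ids, torch.arange(C)] = 1.0        # one-hot columns

H = E.T @ X                                  # Step 1: [N, C]
ones_C = torch.ones(C)                       # the vector 1_C

h = (H @ ones_C) / C                         # embed, then average: shape [N]

x_bar = (X @ ones_C) / C                     # average the one-hots first: shape [V]
h_alt = E.T @ x_bar                          # then embed

print(torch.allclose(h, H.mean(dim=1), atol=1e-6))  # True
print(torch.allclose(h, h_alt, atol=1e-6))          # True
```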

---

**Step 3: Compute output logits**

$$
u = U^T h \in \mathbb{R}^{V}
$$

- $U \in \mathbb{R}^{N \times V}$ is the output weight matrix  

---

**Step 4: Compute softmax / loss**

$$
\hat{y} = \text{softmax}(u)
$$

- $\hat{y} \in \mathbb{R}^{V}$ is the predicted probability distribution over the vocabulary  
- Loss is computed with **cross-entropy** against the target word
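
Steps 3 and 4 can be sketched as follows. The vector `h`, the matrix `U`, and the index `target` are random or arbitrary placeholders. The comparison at the end relies on the fact that PyTorch's `F.cross_entropy` takes raw logits and applies log-softmax internally.

```python
import torch
import torch.nn.functional as F

V, N = 6, 3                          # assumed toy sizes
torch.manual_seed(0)
h = torch.randn(N)                   # stand-in for the mean context embedding
U = torch.randn(N, V)                # output weight matrix
target = 2                           # hypothetical index of the true target word

u = U.T @ h                          # Step 3: logits, shape [V]
y_hat = torch.softmax(u, dim=0)      # Step 4: probabilities over the vocabulary

# Cross-entropy against a one-hot target reduces to -log(probability of the target)
loss_manual = -torch.log(y_hat[target])

# The built-in loss expects a batch dimension and raw logits
loss_builtin = F.cross_entropy(u.unsqueeze(0), torch.tensor([target]))

print(y_hat.sum())                                          # sums to 1 up to rounding
print(torch.allclose(loss_manual, loss_builtin, atol=1e-6)) # True
```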

---

### Intuition

1. **Input → embedding:** $E^T X$
2. **Aggregate context:** multiply by $\frac{1}{C} \mathbf{1}_C$
3. **Hidden → vocab:** $U^T h$
4. **Softmax:** converts logits to probability distribution

> This is exactly what the CBOW model computes, written in matrix form: fully vectorized, with no loop over the context words.


NB: Even though there is **no non-linearity** (like ReLU or tanh) between the two layers, this setup can still *learn useful embeddings*, because the training objective rewards giving similar representations to words that appear in similar contexts.  
Adding a non-linearity would not necessarily improve this. In the original **Word2Vec (CBOW)** paper, **no non-linearity** was used between the embedding and output layers.  
The model's simplicity is key: it is essentially a **multinomial logistic regression** (a softmax classifier) that predicts the target word from the averaged context embedding, with the embeddings learned jointly.
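
One way to see what the missing non-linearity implies: the two linear maps collapse into a single $V \times V$ matrix $W = EU$ whose rank is at most $N$, so the logits are a linear function of the averaged one-hot input. The sketch below checks this numerically with random placeholder matrices; the sizes and the input `x_bar` are assumptions chosen for illustration.

```python
import torch

V, N = 8, 3                                        # assumed toy sizes
torch.manual_seed(0)
E = torch.randn(V, N, dtype=torch.float64)         # input embedding matrix
U = torch.randn(N, V, dtype=torch.float64)         # output weight matrix

x_bar = torch.zeros(V, dtype=torch.float64)
x_bar[[1, 4]] = 0.5                                # hypothetical context of two words, averaged

u_two_step = U.T @ (E.T @ x_bar)                   # embed, then project to the vocabulary

W = E @ U                                          # combined map, shape [V, V]
u_one_step = W.T @ x_bar                           # same logits from a single matrix

print(torch.allclose(u_two_step, u_one_step))      # True
print(torch.linalg.matrix_rank(W))                 # cannot exceed N
```

The low rank is where the embedding comes from: the model has to route all $V \times V$ context-to-target interactions through an $N$-dimensional bottleneck.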

In [None]:
# ==========================================
# CONTINUOUS BAG OF WORDS (CBOW) MODEL
# using nn.Linear instead of nn.Embedding
# ------------------------------------------
# This version is for EDUCATIONAL PURPOSES:
# - we manually create one-hot encodings
# - we use a linear layer to simulate embeddings
# ==========================================

import torch
import torch.nn as nn
import torch.optim as optim
from itertools import chain

In [None]:
# ----------------------------------------------------
# 1. Define a tiny toy corpus
# ----------------------------------------------------
corpus = [
    "the quick brown fox jumped over the lazy dog",
    "i love playing with my dog",
    "the dog loves the fox"
]

In [None]:
# ----------------------------------------------------
# 2. Preprocess text: tokenize and build vocabulary
# ----------------------------------------------------
tokenized_corpus = [sentence.lower().split() for sentence in corpus]

# Flatten all tokens into a set of unique words
vocab = sorted(set(chain.from_iterable(tokenized_corpus)))

# Maps for word <-> index
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for word, i in word_to_ix.items()}

vocab_size = len(vocab)
print("Vocabulary:", vocab)
print("Word to index:", word_to_ix)
print("Index to word:", ix_to_word)

Vocabulary: ['brown', 'dog', 'fox', 'i', 'jumped', 'lazy', 'love', 'loves', 'my', 'over', 'playing', 'quick', 'the', 'with']
Word to index: {'brown': 0, 'dog': 1, 'fox': 2, 'i': 3, 'jumped': 4, 'lazy': 5, 'love': 6, 'loves': 7, 'my': 8, 'over': 9, 'playing': 10, 'quick': 11, 'the': 12, 'with': 13}
Index to word: {0: 'brown', 1: 'dog', 2: 'fox', 3: 'i', 4: 'jumped', 5: 'lazy', 6: 'love', 7: 'loves', 8: 'my', 9: 'over', 10: 'playing', 11: 'quick', 12: 'the', 13: 'with'}


In [None]:
# ----------------------------------------------------
# 3. Create (context, target) pairs
#    For each word, context = nearby words in a window
# ----------------------------------------------------
def make_context_windows(tokenized_corpus, window_size=2):
    data = []
    for sentence in tokenized_corpus:
        for i, word in enumerate(sentence):
            context = []
            for j in range(-window_size, window_size + 1):
                if j != 0 and 0 <= i + j < len(sentence):
                    context.append(sentence[i + j])
            if len(context) > 0:
                data.append((context, word))
    return data

data = make_context_windows(tokenized_corpus, window_size=2)
print("\nSample (context, target) pairs:")
for i in range(4):
    print(data[i])


Sample (context, target) pairs:
(['quick', 'brown'], 'the')
(['the', 'brown', 'fox'], 'quick')
(['the', 'quick', 'fox', 'jumped'], 'brown')
(['quick', 'brown', 'jumped', 'over'], 'fox')


With `window_size=2`, the context is up to 2 words on each side of the target, clipped at the sentence boundaries:

Target = "the" (first word)
Context = no words before + 2 words after → ["quick", "brown"] ✅

Target = "quick" (second word)
Context = 1 word before + 2 words after → ["the", "brown", "fox"] ✅

Target = "brown" (third word)
Context = 2 words before + 2 words after → ["the", "quick", "fox", "jumped"] ✅

In [None]:
# ----------------------------------------------------
# 4. Helper: create one-hot encodings for words
# ----------------------------------------------------
def one_hot_vector(word_idx, vocab_size):
    """Return one-hot vector for a single word index."""
    vec = torch.zeros(vocab_size)
    vec[word_idx] = 1.0
    return vec

def make_tensor(context, target):
    """Return input (context as mean of one-hots) and target (as index)."""
    context_vecs = torch.stack([one_hot_vector(word_to_ix[w], vocab_size) for w in context])
    mean_context = context_vecs.mean(dim=0)   # average context representation
    target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)
    return mean_context.unsqueeze(0), target_idx  # shape: [1, vocab_size], [1]


In [None]:
# ----------------------------------------------------
# 5. Define CBOW model using a linear layer
# ----------------------------------------------------
class CBOWLinear(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Instead of an embedding layer, we use a Linear layer
        # The linear layer maps from vocab_size (one-hot input)
        # -> embedding_dim (hidden representation)
        self.linear1 = nn.Linear(vocab_size, embedding_dim, bias=False)
        # Then another layer projects embedding back to vocab size for prediction
        self.linear2 = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, context_onehot):
        """
        context_onehot: shape [batch_size, vocab_size]
        (the average of the context words' one-hot vectors)
        Returns raw logits of shape [batch_size, vocab_size].
        """
        hidden = self.linear1(context_onehot)  # map to embedding space
        output = self.linear2(hidden)          # map to vocab-size logits (softmax is applied inside the loss)
        return output
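
The first layer of the class above plays the role that `nn.Embedding` would normally play. The following sketch, which is separate from the notebook's model, copies the weights of a bias-free `nn.Linear` into an `nn.Embedding` and checks that both routes give the same hidden vector. The sizes match the toy corpus here, and the context indices are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 14, 10      # sizes matching the toy corpus in this notebook
torch.manual_seed(0)

linear = nn.Linear(vocab_size, embedding_dim, bias=False)
embedding = nn.Embedding(vocab_size, embedding_dim)

# nn.Linear stores its weight as [out_features, in_features] = [embedding_dim, vocab_size],
# while nn.Embedding stores [vocab_size, embedding_dim], so we copy the transpose.
with torch.no_grad():
    embedding.weight.copy_(linear.weight.T)

context_ids = torch.tensor([2, 5, 7])   # hypothetical context word indices

# Route 1: average the one-hot vectors, then apply the linear layer
one_hots = F.one_hot(context_ids, num_classes=vocab_size).float()   # [3, vocab_size]
hidden_linear = linear(one_hots.mean(dim=0, keepdim=True))          # [1, embedding_dim]

# Route 2: look up the embedding rows directly, then average them
hidden_lookup = embedding(context_ids).mean(dim=0, keepdim=True)    # [1, embedding_dim]

print(torch.allclose(hidden_linear, hidden_lookup, atol=1e-6))      # True
```

The lookup avoids building vectors of length `vocab_size`, which is why `nn.Embedding` is the usual choice for a realistic vocabulary.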

### Should the loss (CrossEntropy) be part of the neural network?

No. The **loss function** should not be part of the model class.

The model’s job is to define how data flows through layers (the *forward pass*).  
The **loss function** is a *separate component* that tells the optimizer how wrong the predictions are.

Keeping the loss separate has two main benefits:
1. ✅ It allows you to **reuse the same model** with different losses (e.g., CrossEntropy, NLLLoss, or MSE).
2. ✅ It keeps your model definition **clean and modular**, focusing only on computation, not evaluation.
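
As a small illustration of the first benefit, the sketch below feeds the logits of one model to two different criteria. The `nn.Linear` model and the random inputs are stand-ins, not the notebook's CBOW model. `nn.NLLLoss` expects log-probabilities, so `log_softmax` is applied outside the model before calling it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(14, 14, bias=False)       # stand-in for any model that returns logits
inputs = torch.rand(4, 14)                  # hypothetical batch of 4 inputs
targets = torch.tensor([1, 12, 2, 7])       # hypothetical target word indices

logits = model(inputs)                      # the model knows nothing about the loss

# Criterion A: cross-entropy directly on the logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Criterion B: negative log-likelihood on log-probabilities
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, nll, atol=1e-6))   # True: the two formulations agree
```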

So the typical workflow is:
```python
outputs = model(inputs)
loss = criterion(outputs, targets)
```


In [None]:
### check that the architecture runs
vocab_size = len(vocab)
embedding_dim = 10
model = CBOWLinear(vocab_size, embedding_dim)
# --------------------------------------------------
# 1. Create a fake input: average of 3 one-hot vectors
# --------------------------------------------------
# Fake context: indices [2, 5, 7] (reuses one_hot_vector from step 4)
context_indices = [2, 5, 7]
context_vecs = torch.stack([one_hot_vector(i, vocab_size) for i in context_indices])
context_input = context_vecs.mean(dim=0).unsqueeze(0)  # shape: [1, vocab_size]
print(f"Context input: {context_input}")
# --------------------------------------------------
# 2. Forward pass
# --------------------------------------------------
output = model(context_input)
print("Output shape:", output.shape)
# Expected: [1, vocab_size] → raw logits for each word
print("Output logits:", output)

Context input: tensor([[0.0000, 0.0000, 0.3333, 0.0000, 0.0000, 0.3333, 0.0000, 0.3333, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
Output shape: torch.Size([1, 14])
Output logits: tensor([[-0.0067,  0.0452,  0.0626, -0.0505,  0.0582, -0.0175,  0.0010, -0.0029,
          0.1141,  0.0109,  0.0526,  0.0113, -0.0143,  0.0754]],
       grad_fn=<MmBackward0>)


In [None]:
# ----------------------------------------------------
# 6. Initialize model, loss, and optimizer
# ----------------------------------------------------
embedding_dim = 10
model = CBOWLinear(vocab_size, embedding_dim)

# nn.CrossEntropyLoss expects raw logits and class indices;
# it applies log-softmax internally, so the model has no softmax layer
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# ----------------------------------------------------
# 7. Training loop
# ----------------------------------------------------
print("\nTraining CBOW with nn.Linear ...")
for epoch in range(100):
    total_loss = 0.0  # sum of per-pair losses over the epoch (not an average)
    for context, target in data:
        # Prepare input tensors
        context_tensor, target_tensor = make_tensor(context, target)

        # Forward pass
        pred = model(context_tensor)

        # Compute loss
        loss = loss_function(pred, target_tensor)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Loss: {total_loss:.4f}")


Training CBOW with nn.Linear ...
Epoch   0 | Loss: 52.8511
Epoch  10 | Loss: 26.3736
Epoch  20 | Loss: 10.6341
Epoch  30 | Loss: 4.3213
Epoch  40 | Loss: 2.0580
Epoch  50 | Loss: 1.1233
Epoch  60 | Loss: 0.6902
Epoch  70 | Loss: 0.4621
Epoch  80 | Loss: 0.3285
Epoch  90 | Loss: 0.2437


In [None]:

# ----------------------------------------------------
# 8. Inspect learned embeddings
# ----------------------------------------------------
# nn.Linear stores its weight as [out_features, in_features], so the
# "embedding" for each word is the corresponding COLUMN of the first
# linear layer's weight matrix (a row once we transpose it below)
with torch.no_grad():
    embedding_matrix = model.linear1.weight.data  # shape [embedding_dim, vocab_size]
    # Transpose it so each row corresponds to a word
    embedding_matrix = embedding_matrix.T         # [vocab_size, embedding_dim]

# Show a few word vectors
print("\nSample word embeddings (rows correspond to words):")
for i, word in enumerate(vocab[:5]):
    print(f"{word:>10s} : {embedding_matrix[i].numpy()}")


Sample word embeddings (rows correspond to words):
     brown : [-0.77698165 -1.2045684  -0.2711266  -2.0864468   0.05063495 -0.21035163
  1.1458198  -2.808613   -0.69390845 -1.6719401 ]
       dog : [ 1.7378042   0.81062865 -3.9599147   2.3803916   1.0353929  -0.3023908
 -1.1362057  -1.5668218  -2.71045    -2.68095   ]
       fox : [-0.5113443  -1.1659439  -3.1197329  -1.6425169  -1.3973596  -0.04413448
 -1.0791274  -0.85332245  1.0101377  -1.8844566 ]
         i : [-2.1445022  -0.28397575  2.3767862   1.6141582   2.061398    1.2741618
 -1.9728276   2.297368    0.23704498  0.7643219 ]
    jumped : [ 2.9816546  -0.2211122   1.4346014  -3.3715289   0.20837057  3.0859766
 -0.9067423   1.3982038   1.748613   -0.29846627]


In [None]:
# ----------------------------------------------------
# 9. Compute word similarity using cosine similarity
# ----------------------------------------------------
def cosine_similarity(vec1, vec2):
    return torch.dot(vec1, vec2) / (vec1.norm() * vec2.norm())

def most_similar(word, topn=5):
    idx = word_to_ix[word]
    word_vec = embedding_matrix[idx]
    sims = []
    for other_word, j in word_to_ix.items():
        if other_word != word:
            sim = cosine_similarity(word_vec, embedding_matrix[j])
            sims.append((other_word, sim.item()))
    sims.sort(key=lambda x: -x[1])
    return sims[:topn]

print("\nMost similar words to 'dog':")
for w, s in most_similar("dog"):
    print(f"{w:>10s} : {s:.3f}")


Most similar words to 'dog':
     quick : 0.407
       fox : 0.318
      lazy : 0.133
     brown : 0.116
        my : 0.042
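
The loop in `most_similar` can also be written as a single matrix product. This is a sketch under the assumption that the embeddings sit in one `[vocab_size, dim]` tensor. The function name `most_similar_vectorized`, the mini-vocabulary `words`, and the random matrix `emb` are hypothetical stand-ins for the trained `embedding_matrix`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
words = ["brown", "dog", "fox", "i", "jumped"]   # hypothetical mini-vocabulary
emb = torch.randn(len(words), 10)                # stand-in for a trained embedding matrix

def most_similar_vectorized(query_idx, emb, topn=3):
    """Return (index, cosine similarity) pairs for the topn nearest rows of emb."""
    unit = F.normalize(emb, dim=1)               # L2-normalize each row
    sims = unit @ unit[query_idx]                # cosine similarity to every word, shape [V]
    sims[query_idx] = float("-inf")              # exclude the query word itself
    scores, indices = sims.topk(topn)
    return list(zip(indices.tolist(), scores.tolist()))

for idx, score in most_similar_vectorized(1, emb):
    print(f"{words[idx]:>10s} : {score:.3f}")
```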
