<a href="https://colab.research.google.com/github/ankita2002/LLMS/blob/main/Practice_Session_04_Student_Exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PS 04: Word Representation II (Statistical Methods)

## Learning Objectives

By the end of this practice session, you will be able to:

1. **Construct one-hot vectors** and reason about **sparsity** and **dimensionality**.
2. **Compute cosine similarity** and understand how it captures vector similarity.
3. **Review TF–IDF & LSA** and relate these to dense word embeddings.
4. **Train Word2Vec** models (Skip-gram & CBOW) with **Gensim** on a toy corpus.
5. **Visualize embeddings** with **t-SNE** and interpret clusters.
6. **Use embeddings for analogies** (e.g., *king – man + woman ≈ queen*).
7. **Implement a tiny GloVe** optimizer on global co-occurrence statistics.
8. **Compare neighbors across models** (Skip-gram, CBOW, GloVe) and discuss differences.

---

## Practice Tasks Overview

This session contains **2 main tasks**, each with two subtasks:

- **Task 1**
  - **1.1**: *Manual One-Hot Construction (+ Pairwise Similarities)* — Build sentence one-hot vectors by hand from a tiny vocabulary.
  - **1.2**: *Neighbors & Analogies (Skip-gram)* — Top-5 similar words for a target using Word2Vec (Skip-gram), and solve *king – man + woman = ?* with Skip-gram embeddings.

- **Task 2**
  - **2.1**: *CBOW Neighbors **and** CBOW vs Skip-gram Comparison* — Compare neighbors for example target words and discuss coherence.
  - **2.2**: *Optimize GloVe Hyperparameters to Hit Expected Neighbors* — Tune GloVe so that, for each target word, at least one word from an expected set appears in its top-k neighbors.


In [None]:
!pip install -q gensim

## 1. One-Hot Encoding and Cosine Similarity

### One-Hot Example (queen vs. princess)
Let a tiny vocabulary be  
$\text{V} = [\text{king},\, \text{queen},\, \text{man},\, \text{woman},\, \text{princess}]$.

Then the one-hot vectors are:
$$
\mathbf{e}_{\text{queen}} = [0,\,1,\,0,\,0,\,0],\qquad
\mathbf{e}_{\text{princess}} = [0,\,0,\,0,\,0,\,1].
$$
They are **orthogonal**: $\mathbf{e}_{\text{queen}}^\top \mathbf{e}_{\text{princess}} = 0$.

### Why One-Hot is Limited
- Cannot generalize: "queen" and "princess" are orthogonal.
- Distance does not reflect semantic similarity.
- Motivation for **dense** representations (embeddings).

### Cosine Similarity on Simple Vectors
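- Cosine similarity is invariant to positive rescaling of either vector, so it compares **direction** rather than magnitude. This makes it a natural measure of semantic direction.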

**Formula**
$$
\text{Cosine}(\mathbf{a},\mathbf{b}) \;=\;
\frac{\mathbf{a}^\top \mathbf{b}}{\lVert \mathbf{a} \rVert_2 \, \lVert \mathbf{b} \rVert_2}
$$

**For the one-hot example above**  
$$
\text{Cosine}(\mathbf{e}_{\text{queen}}, \mathbf{e}_{\text{princess}}) = 0
$$
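As a quick numerical check of the formula above, the following minimal sketch builds the two one-hot vectors over the five-word vocabulary and evaluates their cosine similarity. The helper names `one_hot` and `cosine` are illustrative and are not reused by the later cells.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman", "princess"]

def one_hot(word, vocab):
    """One-hot vector for `word` over `vocab` (illustrative helper)."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

e_queen = one_hot("queen", vocab)
e_princess = one_hot("princess", vocab)

# Distinct one-hot vectors share no active index, so the dot product is zero.
print(cosine(e_queen, e_princess))
# A vector compared with itself has cosine 1.
print(cosine(e_queen, e_queen))
```

Any two different words get the same similarity of zero under one-hot encoding, which is why the representation cannot express that "queen" is closer to "princess" than to "man".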


In [None]:
import numpy as np

# Tiny toy corpus (3 short sentences)
tiny_corpus_sentences = [
    "king and queen rule the kingdom",
    "the man and the woman walk",
    "the river flows by the old kingdom"
]

# Tokenization & Vocabulary
def tokenize_words(text: str):
    """Lowercase, whitespace-split tokenization (toy; no punctuation handling)."""
    return [w.lower() for w in text.split()]

all_tokens = [token for sent in tiny_corpus_sentences for token in tokenize_words(sent)]
vocabulary = sorted(set(all_tokens))
token_to_index = {token: idx for idx, token in enumerate(vocabulary)}
index_to_token = {idx: token for token, idx in token_to_index.items()}

print("Vocabulary (sorted):")
print(" ", vocabulary)
print(f"→ Vocabulary size: {len(vocabulary)} token types\n")

print("Token → Index mapping:")
for i, (tok, idx) in enumerate(token_to_index.items()):
    print(f"  {tok:>10} → {idx}")
print()

# One-hot encoders
def one_hot_word_vector(word: str, token_to_index: dict) -> np.ndarray:
    """Return a one-hot vector for a single word using the given vocabulary mapping."""
    vec = np.zeros(len(token_to_index), dtype=int)
    if word in token_to_index:
        word_index = token_to_index[word]
        vec[word_index] = 1
    return vec

def one_hot_sentence_union_vector(sentence: str, token_to_index: dict) -> np.ndarray:
    """
    Return a sentence-level one-hot vector (set/union).
    A position is 1 if the corresponding token appears at least once in the sentence.
    """
    vec = np.zeros(len(token_to_index), dtype=int)
    for word in tokenize_words(sentence):
        if word in token_to_index:
            word_index = token_to_index[word]
            vec[word_index] = 1
    return vec

# Demonstration: word-level one-hots
print("Example one-hot vectors (word-level):")
for word in ["king", "queen", "river", "cat"]:  # 'cat' is intentionally OOV
    vec = one_hot_word_vector(word, token_to_index)
    status = "" if word in token_to_index else " (OOV — out of vocabulary)"
    print(f"  '{word}'{status}")  # show which word the vector belongs to
    print(f"     vector: {vec.tolist()}")
print()

# Demonstration: sentence-level one-hots (union)
print("Sentence-level one-hot vectors (union of tokens in sentence):")
for i, sentence in enumerate(tiny_corpus_sentences):
    vec = one_hot_sentence_union_vector(sentence, token_to_index)
    print(f"  S{i}: '{sentence}'")
    print(f"     vector: {vec.tolist()}")
print()

# Cosine similarity demo
A_vec = np.array([1.0, 1.0])
B_vec = np.array([2.0, 2.0])  # Same direction as A, larger magnitude
C_vec = np.array([2.0, 0.0])  # Different direction

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between vectors a and b."""
    denom = (np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom > 0 else 0.0

print("Cosine similarity on simple vectors:")
print(f"  A = {A_vec.tolist()}, B = {B_vec.tolist()}, C = {C_vec.tolist()}")
print(f"  cos(A, B) = {cosine_similarity(A_vec, B_vec):.4f}")
print(f"  cos(A, C) = {cosine_similarity(A_vec, C_vec):.4f}")
print(f"  cos(B, C) = {cosine_similarity(B_vec, C_vec):.4f}")
print("  → Cosine judges A and B as identical in direction.\n")


## 2. Shared Toy Corpus (for Skip-gram, CBOW, and GloVe)

We’ll use a small, curated corpus that mixes several topical clusters so the embeddings have enough signal to learn simple relationships:

- **Royalty & gender**: e.g., *king, queen, prince, princess, palace, kingdom*  
- **Tech**: e.g., *computer, software, data, algorithms, models*  
- **Emotions**: e.g., *happy, sad, joyful, cheerful, angry*  
- **Nature / water**: e.g., *river, valley, bank, bridge, boats*

### What this section does
1. **Tokenize** each sentence with a simple lowercase + whitespace split.
2. **Build the vocabulary** from all tokens.

### Variables defined (used by later sections)
- raw_corpus → the list of corpus sentences (List[str])  
- tokenized_corpus → the tokenized sentences (List[List[str]])  
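- vocab_all → the sorted list of unique tokens in the corpus  
- word_to_index → the token-to-index mapping (Dict[str, int])  
- index_to_word → the index-to-token mapping (Dict[int, str])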



In [None]:
# Shared Toy Corpus for Skip-gram, CBOW, and GloVe

import numpy as np
from collections import Counter

# Corpus sentences (grouped by theme)
toy_corpus_sentences = [
    # royalty / gender
    "king rules the kingdom",
    "queen rules the kingdom",
    "the king is a man",
    "the queen is a woman",
    "a princess and a prince live in the palace",
    "a man and a woman walk to the palace",
    "the royal family visits the city",
    "the kingdom has a strong army",
    # tech
    "a computer runs software and processes data",
    "programmers write code in a computer laboratory",
    "deep learning models train on data",
    "algorithms improve computer performance",
    "hardware and software form a computer system",
    "laptops and desktops are kinds of computer",
    # emotions
    "he feels happy and joyful today",
    "she is very happy with the results",
    "the movie made everyone sad",
    "a cheerful smile can make people happy",
    "angry voices faded after a happy resolution",
    # nature / water
    "the river flows through the valley",
    "boats travel along the river",
    "fish live in the river and lake",
    "the river bank is covered with trees",
    "a bridge crosses the wide river"
]

# Tokenization
def tokenize_simple(text: str):
    """Lowercase + whitespace-split."""
    return [w.lower() for w in text.split()]

tokenized_sentences = [tokenize_simple(s) for s in toy_corpus_sentences]

# Vocabulary
all_tokens_flat = [tok for sent in tokenized_sentences for tok in sent]
vocabulary_all = sorted(set(all_tokens_flat))
word_to_index = {w: i for i, w in enumerate(vocabulary_all)}
index_to_word = {i: w for w, i in word_to_index.items()}

# Statistics
num_sents = len(tokenized_sentences)
vocab_size = len(vocabulary_all)
sent_lengths = [len(s) for s in tokenized_sentences]
avg_len = float(np.mean(sent_lengths)) if sent_lengths else 0.0

print(f"\nToy corpus prepared:")
print(f"  • #sentences = {num_sents}")
print(f"  • vocab size = {vocab_size}")
print(f"  • avg tokens per sentence ≈ {avg_len:.2f}")
print()

# Peek at a few sentences with their tokens
print("Sample sentences (with tokens):")
for i in range(min(3, num_sents)):
    print(f"  S{i}: '{toy_corpus_sentences[i]}'")
    print(f"     tokens: {tokenized_sentences[i]}")
print()

# Peek at a few vocabulary terms and mappings
print("Sample vocabulary terms (first 10):")
print(" ", vocabulary_all[:10])
print("Sample word→index pairs (first 10):")
for i, (w, idx) in enumerate(word_to_index.items()):
    if i >= 10:
        print("   ...")
        break
    print(f"   {w:>12} → {idx}")
print()

# Variable names (to stay aligned with later cells)
raw_corpus = toy_corpus_sentences
tokenized_corpus = tokenized_sentences
vocab_all = vocabulary_all

print(f"\nToy corpus ready: {len(tokenized_corpus)} sentences | Vocab size: {len(vocab_all)}")


## 3. Word2Vec (Skip-gram) — PyTorch on the Toy Corpus

We train a **Skip-gram** model (with **naïve softmax**) on the **shared toy corpus** prepared earlier.

**Corpus variables used (from the previous cell):**
- `tokenized_corpus` (list of token lists)
- `word_to_index` (w→i)
- `index_to_word` (i→w)

---

### Data preparation (with a few examples)

We use a symmetric window of size **$W$** on each side.

- **Robust (used in code below):** the window is **clipped at sentence boundaries**.  
  For a sentence of length $n$, and a target word at position $p$ (token $t$), its **context indices** are
  $$
  \{\max(0,p\!-\!W),\ldots,p\!-\!1\}\;\cup\;\{p\!+\!1,\ldots,\min(n\!-\!1,p\!+\!W)\}.
  $$
  We emit one training pair **$(t \rightarrow c)$** for **each** context token $c$ in that set.

- **Strict (HW style):** require a **full** window of exactly $W$ on **both** sides; otherwise **skip** the position.

Each eligible target produces up to $2W$ pairs $(\text{target } t \rightarrow \text{context } c)$.

**Example (from the provided corpus, $W=2$):**

Sentence: “**a bridge crosses the wide river**”  
Tokens: $[a,\; bridge,\; crosses,\; the,\; wide,\; river]$

- Target $t=$ “bridge” at position $p=1$  
  Context window (clipped): positions $\{0\}\cup\{2,3\}$ → contexts $c \in \{a,\; crosses,\; the\}$  

  **Strict:** **skip** (because we don’t have $W=2$ tokens on the left)

  Emitted pairs for robust mode: $(\text{bridge}\!\to\!a),\;(\text{bridge}\!\to\!\text{crosses}),\;(\text{bridge}\!\to\!\text{the})$

- Target $t=$ “wide” at position $p=4$  
  Context window (clipped): positions $\{2,3\}\cup\{5\}$ → contexts $c \in \{\text{crosses},\;\text{the},\;\text{river}\}$  

  **Strict:** **skip** (because we don’t have $W=2$ tokens on the right)

  Emitted pairs for robust mode: $(\text{wide}\!\to\!\text{crosses}),\;(\text{wide}\!\to\!\text{the}),\;(\text{wide}\!\to\!\text{river})$

(These words are mapped to integer ids via `word_to_index` before training.)
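The two windowing modes described above can be sketched in plain Python. This is an illustrative helper (the name `skipgram_pairs` and its `strict` flag are assumptions for this example, separate from the training code), applied to the same sentence with $W=2$.

```python
def skipgram_pairs(tokens, window, strict=False):
    """Return (target, context) pairs for one tokenized sentence."""
    pairs = []
    n = len(tokens)
    for p, target in enumerate(tokens):
        # Strict mode: skip positions that lack a full window on either side.
        if strict and (p - window < 0 or p + window > n - 1):
            continue
        # Robust mode: clip the window to the sentence boundaries.
        left = max(0, p - window)
        right = min(n - 1, p + window)
        for j in range(left, right + 1):
            if j != p:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "a bridge crosses the wide river".split()
robust = skipgram_pairs(tokens, window=2)
strict = skipgram_pairs(tokens, window=2, strict=True)

print([c for t, c in robust if t == "bridge"])  # contexts of "bridge" (clipped)
print([c for t, c in robust if t == "wide"])    # contexts of "wide" (clipped)
print(sorted({t for t, _ in strict}))           # targets that survive strict mode
```

In strict mode only the two middle positions of this six-token sentence have a full window, so short sentences contribute very few pairs. That is one reason the clipped variant is convenient on a tiny corpus.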

---

![Skip-gram Example](https://maelfabien.github.io/assets/images/skp_gr.png)

---

### Notation (using $t$=target, $c$=context)

Let $|V|$ be the vocabulary size and $d$ the embedding dimension.

- $W_{\text{in}} \in \mathbb{R}^{|V|\times d}$: **target/center** embedding table; for word index $t$,
  $\;\mathbf{v}_t = W_{\text{in}}[t,:]$.
- $W_{\text{out}} \in \mathbb{R}^{|V|\times d}$: **context** embedding table; for word index $c$,
  $\;\mathbf{u}_c = W_{\text{out}}[c,:]$.
- $b_{\text{out}} \in \mathbb{R}^{|V|}$: output bias vector; component $b_{\text{out},j}$ for class $j$.

---

### Model (naïve softmax)

Given a **target** index $t$ (selects $\mathbf{v}_t$ from $W_{\text{in}}$), we score **every** vocabulary item and predict **one** context index $c$:

$$
\mathbf{z} = W_{\text{out}}\,\mathbf{v}_t + b_{\text{out}},
\qquad
\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z}),
\quad
\hat{y}_j = \frac{e^{z_j}}{\sum_{k=1}^{|V|} e^{z_k}} .
$$

**Cross-entropy loss (target $\to$ context)** with a one-hot target $\mathbf{y}$ for the true context $c$:

$$
\boxed{\;{L}(t,c) = -\log \hat{y}_c\;}
$$

**Equivalent log-sum-exp form** (expanding the softmax):

$$
\boxed{\;
{L}(t,c)
= -\big(\mathbf{u}_c^\top \mathbf{v}_t + b_{\text{out},c}\big)
\;+\; \log \sum_{j=1}^{|V|} \exp\!\big(\mathbf{u}_j^\top \mathbf{v}_t + b_{\text{out},j}\big)
\;}
$$
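The equivalence of the two boxed forms can be checked numerically. The sketch below uses small random matrices as stand-ins for $W_{\text{out}}$, $b_{\text{out}}$, and $\mathbf{v}_t$; the sizes and the context index are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                      # toy vocabulary size and embedding dim (assumptions)
W_out = rng.normal(size=(V, d))  # context table, rows u_j
b_out = rng.normal(size=V)       # output bias
v_t = rng.normal(size=d)         # target embedding
c = 2                            # index of the true context word (arbitrary)

z = W_out @ v_t + b_out          # scores for every vocabulary item

# Form 1: softmax, then negative log-probability of the true context.
exp_z = np.exp(z - z.max())      # subtract the max for numerical stability
y_hat = exp_z / exp_z.sum()
loss_softmax = -np.log(y_hat[c])

# Form 2: log-sum-exp expansion of the same quantity.
log_sum_exp = z.max() + np.log(np.exp(z - z.max()).sum())
loss_lse = -z[c] + log_sum_exp

print(loss_softmax, loss_lse)    # the two values should agree up to rounding
```

Subtracting the maximum score before exponentiating does not change either result; it only avoids overflow when the scores are large.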

> **Efficiency note:** In this demo we use **naïve softmax** (summing over the **entire vocabulary**). In practice, **negative sampling** replaces the full softmax with $1$ positive $+\;K$ negatives to avoid the $\text{O}(|V|)$ normalization.

---

### Training setup (this demo)

- **Embedding dim** $d$: 50  
- **Window**: 5 (words on each side)  
- **Batch size**: 8  
- **Epochs**: 20  
- **Optimizer**: Adam (learning rate $= 0.05$)  
- **Loss**: `nn.CrossEntropyLoss()` (applies softmax + CE)

---

### After training

- We keep the **input embeddings** $W_{\text{in}}$ as the word vectors; $W_{\text{out}}$ is usually discarded or ignored downstream.

- To embed a word (token) with index $w$, its embedding is
  $$
  \mathbf{v}_w = W_{\text{in}}[w,:] \in \mathbb{R}^d.
  $$

- To embed a sentence, let a sentence be a sequence of word (token) indices
  $$
  S = (w_1, w_2, \dots, w_m).
  $$

  We define its embedding as the **mean of its word embeddings**:
  $$
  \mathrm{embed}(S)
  \;=\; \frac{1}{m} \sum_{r=1}^{m} \mathbf{v}_{w_r}
  \;=\; \frac{1}{m} \sum_{r=1}^{m} W_{\text{in}}[w_r,:].
  $$

- Nearest neighbors are computed with **cosine similarity** on rows of $W_{\text{in}}$:

$$
\cos(\mathbf{v}_a,\mathbf{v}_b) \;=\;
\frac{\mathbf{v}_a^\top \mathbf{v}_b}{\lVert \mathbf{v}_a\rVert\,\lVert \mathbf{v}_b\rVert}
$$
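The mean-pooling and cosine formulas above can be sketched with NumPy. Here `W_in_demo` is a random stand-in for a trained $W_{\text{in}}$, and the vocabulary, the helper names, and the choice to skip out-of-vocabulary tokens are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "king", "queen", "rules", "kingdom"]   # toy vocabulary (assumption)
w2i_demo = {w: i for i, w in enumerate(vocab)}
W_in_demo = rng.normal(size=(len(vocab), 3))           # stand-in for trained W_in

def embed_sentence(tokens, W, w2i):
    """Mean of the embeddings of in-vocabulary tokens; zeros if none are known."""
    idxs = [w2i[t] for t in tokens if t in w2i]
    if not idxs:
        return np.zeros(W.shape[1])
    return W[idxs].mean(axis=0)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

s1 = embed_sentence("king rules the kingdom".split(), W_in_demo, w2i_demo)
s2 = embed_sentence("queen rules the kingdom".split(), W_in_demo, w2i_demo)
print(s1.shape, cosine(s1, s2))
```

Because the mean ignores word order, two sentences with the same bag of words receive the same embedding under this scheme.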


In [None]:
# ================================
# Skip-gram (naïve softmax) — PyTorch on toy corpus
# ================================
import numpy as np
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# -----------------------------
# Reproducibility & device
# -----------------------------
RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# -----------------------------
# Skip-gram data maker (robust windowing by handling boundaries via clipping)
# -----------------------------
def skipgram_pairs_with_edges(tokenized: List[List[str]],
                              w2i: Dict[str,int],
                              window: int = 2
                             ) -> Tuple[np.ndarray, np.ndarray]:
    """
    Create Skip-gram pairs with a symmetric window of size `window`,
    **clipping the window to sentence boundaries** (handles edges).

    For each position c, we take context indices in:
        [max(0, c-window) ... c-1]  ∪  [c+1 ... min(n-1, c+window)]

    Returns:
        centers : (N,) np.int64
        contexts: (N,) np.int64
    """
    centers, contexts = [], []
    W = window
    for sent in tokenized:
        idxs = [w2i[w] for w in sent if w in w2i]
        n = len(idxs)
        for c in range(n):
            left  = max(0, c - W)
            right = min(n, c + W + 1)
            for j in range(left, right):
                if j == c:
                    continue
                centers.append(idxs[c])
                contexts.append(idxs[j])
    return np.array(centers, dtype=np.int64), np.array(contexts, dtype=np.int64)

# -----------------------------
# Dataset
# -----------------------------
class SkipGramDataset(Dataset):
    def __init__(self, centers: np.ndarray, contexts: np.ndarray):
        self.centers  = centers
        self.contexts = contexts
    def __len__(self) -> int:
        return self.centers.shape[0]
    def __getitem__(self, idx: int):
        x = torch.tensor(self.centers[idx],  dtype=torch.long)
        y = torch.tensor(self.contexts[idx], dtype=torch.long)
        return x, y

# -----------------------------
# Model (Embedding -> Linear -> CE)
# -----------------------------
class SkipGramTorch(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)  # W_in: (|V|, d)
        self.out      = nn.Linear(dim, vocab_size)     # logits: (d -> |V|)
        # Small Gaussian init, like CBOW task
        nn.init.normal_(self.in_embed.weight, mean=0.0, std=0.01)
        nn.init.normal_(self.out.weight,      mean=0.0, std=0.01)
        nn.init.zeros_(self.out.bias)

    def forward(self, center_idx: torch.LongTensor) -> torch.Tensor:
        """
        center_idx: (B,) indices of center words
        returns:
            logits: (B, |V|)
        """
        v_c = self.in_embed(center_idx)  # (B, d)
        logits = self.out(v_c)           # (B, |V|)
        return logits

@dataclass
class SkipGramModelToy:
    """Light wrapper to carry the trained model & metadata."""
    w2i: Dict[str,int]
    i2w: Dict[int,str]
    model: nn.Module
    dim: int
    window: int

# -----------------------------
# Training function (mini-batch + Adam)
# -----------------------------
def skipgram_train_toy(tokenized: List[List[str]],
                       w2i: Dict[str,int],
                       i2w: Dict[int,str],
                       dim: int        = 100,
                       window: int     = 2,
                       batch_size: int = 256,
                       epochs: int     = 10,
                       lr: float       = 1e-3,
                      ) -> SkipGramModelToy:

    centers, contexts = skipgram_pairs_with_edges(tokenized, w2i, window=window)

    print(f"[Skip-gram] data: {len(contexts)} pairs | V={len(w2i)} | d={dim} | window={window}")

    ds = SkipGramDataset(centers, contexts)
    dl = DataLoader(ds, batch_size=batch_size, shuffle=True)

    model = SkipGramTorch(vocab_size=len(w2i), dim=dim).to(DEVICE)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    for ep in range(1, epochs+1):
        model.train()
        total_loss, total_n = 0.0, 0
        for xb, yb in dl:
            xb, yb = xb.to(DEVICE), yb.to(DEVICE)
            optimizer.zero_grad()
            logits = model(xb)           # (B, |V|)
            loss = criterion(logits, yb) # targets are context indices
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * xb.size(0)
            total_n += xb.size(0)
        avg_loss = total_loss / max(1, total_n)
        if ep % max(1, epochs//5) == 0:
            print(f"  epoch {ep:02d}/{epochs} | avg_loss={avg_loss:.4f}")

    return SkipGramModelToy(w2i=w2i, i2w=i2w, model=model, dim=dim, window=window)

def cosine_similarity_numpy(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """
    Cosine(v1, v2) = (v1 · v2) / (||v1|| * ||v2||).
    If either norm is zero, return 0.0.
    """
    dot_product = float(np.dot(vec1, vec2))
    norm1 = float(np.linalg.norm(vec1))
    norm2 = float(np.linalg.norm(vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot_product / (norm1 * norm2)

# -----------------------------
# Nearest neighbors from W_in
# -----------------------------
def most_similar_from_Win(word: str, skg: SkipGramModelToy, topn: int = 5) -> List[Tuple[str, float]]:
    """
    Top-n neighbors by cosine using rows of W_in (input embeddings).
    """
    if word not in skg.w2i:
        return []

    with torch.no_grad():
        M = skg.model.in_embed.weight.detach().cpu().numpy()  # (|V|, d)

    q_idx = skg.w2i[word]
    v_q = M[q_idx]

    # Cosine similarities against all vocab rows
    sims = np.array([cosine_similarity_numpy(M[i], v_q) for i in range(M.shape[0])], dtype=float)

    # Exclude the query itself
    sims[q_idx] = -np.inf

    # Top-n
    k = topn
    order = np.argsort(-sims)[:k]
    return [(skg.i2w[i], float(sims[i])) for i in order]

# -----------------------------
# Train on the toy corpus
# -----------------------------
sg_model_pytorch = skipgram_train_toy(
    tokenized=tokenized_corpus,
    w2i=word_to_index,
    i2w=index_to_word,
    dim=50,         # embedding dim
    window=5,       # context window on each side
    batch_size=8,
    epochs=20,
    lr=0.05,
)

# Probe neighbors
probe_words = ["king", "queen", "computer", "happy", "river"]
print("\nNearest neighbors (Skip-gram PyTorch, cosine on W_in):")
for w in probe_words:
    nns = most_similar_from_Win(w, sg_model_pytorch, topn=5)
    pretty = ", ".join([f"{a} ({b:.3f})" for a, b in nns]) if nns else "[OOV]"
    print(f"  • {w:<9}: {pretty}")


### Skip-gram (gensim) + t-SNE visualization

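We now train a **Skip-gram** Word2Vec model using the **gensim** library, mirroring the hyperparameters of our toy PyTorch implementation (the configuration below uses hierarchical softmax instead of the naïve full softmax), and then visualize a subset of the learned embeddings with **t-SNE**.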

---

### t-SNE visualization notes
After training, we project a **subset of words** into 2D with **t-SNE** to visually inspect clusters.

- **What t-SNE does:** preserves **local neighborhoods** (who is near whom), not global distances.
- **Parameters used:** `n_components=2`; **perplexity** is chosen adaptively from the number of plotted words.
- **Caveats:** t-SNE is **stochastic**; absolute distances/axes are **not** directly interpretable—use it to spot **clusters** and **neighbors**, not to make metric claims.


In [None]:
# Train Skip-gram Word2Vec (gensim) on toy corpus and visualize via t-SNE
from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# -----------------------------
# Configuration (mirror toy setup)
# -----------------------------
sg_config = {
    "architecture": "Skip-gram",
    "vector_size": 50,           # embedding dim (d)
    "window": 5,                 # context window (each side)
    "min_count": 1,              # keep everything in this demo
    "sample": 0.0,               # disable subsampling (tiny corpus)
    "negative": 0,               # no negative sampling
    "hs": 1,                     # use hierarchical softmax
    "epochs": 20,                # training epochs
    "seed": RANDOM_SEED,
    "alpha": 0.05,
    "workers": 1,
}

print("Training Word2Vec (Skip-gram, gensim)")
print(f"  • sentences           : {len(tokenized_corpus)}")
print(f"  • vector_size (dims)  : {sg_config['vector_size']}")
print(f"  • window              : {sg_config['window']}")
print(f"  • min_count           : {sg_config['min_count']}")
print(f"  • negative samples    : {sg_config['negative']}")
print(f"  • epochs              : {sg_config['epochs']}")
print()

# Build the model with parameters on the constructor and train
sg_model_gensim = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=sg_config["vector_size"],
    window=sg_config["window"],
    min_count=sg_config["min_count"],
    sample=sg_config["sample"],
    sg=1,                                 # 1 for Skip-gram and 0 for CBOW
    negative=sg_config["negative"],
    hs=sg_config["hs"],
    seed=sg_config["seed"],
    alpha=sg_config["alpha"],
    epochs=sg_config["epochs"],
    workers=sg_config["workers"]
)

# Post-training summary
vocab_size = len(sg_model_gensim.wv)
print("Skip-gram model trained (gensim).")
print(f"  • vocabulary size     : {vocab_size}")
print(f"  • embedding dim       : {sg_model_gensim.wv.vector_size}")
print(f"  • #training sentences : {sg_model_gensim.corpus_count}")
print()

In [None]:
# -----------------------------
# t-SNE visualization for (man, woman) and (king, queen)
# -----------------------------

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

RANDOM_SEED = 42

# Words/pairs to visualize and connect
pairs = [("man", "woman"), ("king", "queen")]
words = sorted({w for a, b in pairs for w in (a, b)})

# In-vocab filtering
in_vocab = [w for w in words if w in sg_model_gensim.wv]
oov = [w for w in words if w not in sg_model_gensim.wv]
if oov:
  print(f"Skipping OOV words: {oov}")

if len(in_vocab) < 2:
  print("Not enough in-vocab words to plot (need at least 2).")
else:
  # Collect embeddings
  X = np.array([sg_model_gensim.wv[w] for w in in_vocab])

  # Choose a valid perplexity for tiny sets: must be < n_samples
  n = len(in_vocab)
  perplexity = max(1.0, min(30.0, n - 1.0))

  tsne = TSNE(
      n_components=2,
      perplexity=perplexity,
      random_state=RANDOM_SEED
  )
  coords = tsne.fit_transform(X)
  pos = {w: coords[i] for i, w in enumerate(in_vocab)}

  # Plot
  plt.figure(figsize=(7, 6))
  for w in in_vocab:
      x, y = pos[w]
      plt.scatter(x, y, s=70)
      plt.annotate(w, (x, y), fontsize=11, xytext=(6, 3), textcoords="offset points")

  # Helper to draw a line between two words if both are present
  def connect(a: str, b: str):
      if a in pos and b in pos:
          xs = [pos[a][0], pos[b][0]]
          ys = [pos[a][1], pos[b][1]]
          plt.plot(xs, ys, linewidth=2.2, alpha=0.85)

  # Draw the requested connections
  connect("man", "woman")
  connect("king", "queen")

  plt.title("t-SNE of Skip-gram embeddings — connections: man–woman, king–queen")
  plt.xlabel("t-SNE dim 1"); plt.ylabel("t-SNE dim 2")
  plt.grid(True, alpha=0.3)
  plt.show()


## TASK 1.1: Manual One-Hot Construction (+ Pairwise Similarities)

**Objective:**  
1) Given a tiny vocabulary and 3 sentences, construct sentence one-hot vectors *(union form)* by hand.  
2) **Compute** the **cosine similarity** **between each pair of the three sentence vectors**, and **compare** the results briefly.

**Tiny vocabulary:**  
$$
\text{V} = [\text{"king"},\ \text{"queen"},\ \text{"man"},\ \text{"woman"},\ \text{"river"},\ \text{"walk"}]
$$

**Sentences:**
1. "king and queen"  
2. "man and woman walk"  
3. "the river and the queen"


In [None]:
# Task 1.1: Manual One-Hot Construction & Pairwise Similarities

# Tiny task vocabulary and sentences
task_vocab = ["king","queen","man","woman","river","walk"]
w2i = {t:i for i,t in enumerate(task_vocab)}
task_sents = [
    "king and queen",
    "man and woman walk",
    "the river and the queen"
]

# Utilities
def _tokenize_simple(text: str):
    """Lowercase + whitespace split."""
    return [w.lower() for w in text.split()]

# TODO: Implement one-hot vector construction for each sentence
def hand_one_hot(sentence: str, w2i: dict) -> np.ndarray:
    """Union one-hot vector for a sentence based on the provided vocabulary mapping."""
    vec = np.___(___(w2i), dtype=int)
    for word in ___(sentence):
        if word in w2i:
            index = w2i[___]
            vec[___] = ___
    return vec

def cosine_similarity_vec(a, b):
    denom = (np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom > 0 else 0.0

# Build vectors
task_vecs = []
for i, sentence in enumerate(task_sents):
    vec = hand_one_hot(sentence, w2i)
    task_vecs.append(vec)
    print(f"  Sent {i}: '{sentence}'")
    print(f"    one-hot -> {vec.tolist()} (number_active_dimension={int(vec.sum())})")

# TODO: Compute pairwise similarities among the 3 sentence vectors
pairs = [(0,1), (0,2), (1,2)]
print("\nPairwise Cosine similarity (sentence one-hots, union):")
for i, j in pairs:
    vi, vj = ___[i], ___[j]
    cos_ij = ___(vi, vj)
    print(f"  S{i} ↔ S{j}:  cos={cos_ij:.4f}")

# Insights
print("\nCosine reflects pattern overlap (shared active indices).")

## TASK 1.2: Neighbors & Analogies (Skip-gram)

**Goal:** Use the trained Skip-gram model **(gensim)** to (1) inspect **top-k nearest neighbors** for several target words and (2) solve **vector analogies** (e.g., *king − man + woman ≈ queen*), then **evaluate** the results.

**Targets**: `["king", "queen", "computer", "happy", "river"]`

**What you’ll do**
1. **Neighbors:** For each target $t$, list the **top-5 neighbors** with cosine similarities.  

   - This reveals local structure: topical clusters, gender pairs, tech terms, etc.

2. **Analogies:** For each analogy $(\text{positive} - \text{negative})$, compute the vector  

   $\mathbf{v} = \sum_{p\in \text{positive}} \mathbf{v}_p \;-\; \sum_{n\in \text{negative}} \mathbf{v}_n$

   and retrieve the **top-5** closest words to $\mathbf{v}$.

3. **Evaluation:** If an **expected** answer is given (e.g., *queen*), report
   - **Hit@5** (whether expected is in top-5),
   - **Rank** of the expected word, if present.
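To make the analogy arithmetic and the two evaluation measures concrete, here is a minimal NumPy sketch on hand-set 2-D vectors. The vectors, the `analogy` helper, and the exclusion of query words from the candidates are assumptions for illustration; the sketch adds raw vectors exactly as in the formula above, and library implementations may differ in details such as normalization.

```python
import numpy as np

# Hand-set 2-D toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.9,  0.9]),
    "river":  np.array([0.1,  0.0]),
}

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def analogy(positive, negative, vectors, k=5):
    """Rank words by cosine to sum(positive) - sum(negative), excluding query words."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: -pair[1])[:k]

results = analogy(["king", "woman"], ["man"], vectors)
words = [w for w, _ in results]

expected = "queen"
hit_at_5 = expected in words                          # Hit@5: is expected in the top 5?
rank = words.index(expected) + 1 if hit_at_5 else None  # 1-based rank, None if absent
print(results)
print(f"Hit@5={hit_at_5} | rank={rank}")
```

With these idealized vectors the query lands exactly on "queen". On embeddings trained from a small corpus the expected word may rank lower or be missing from the top 5, which is what Hit@5 and rank are meant to expose.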


In [None]:
# TASK 1.2: Neighbors & Analogies

import numpy as np

# Helpers
def topn(model, word, n=5):
    """Top-n most similar words (cosine) for a single query word."""
    if (model is None) or (word not in model.wv):
        return []
    return model.wv.most_similar(word, topn=n)

def most_similar_vec(model, positive=None, negative=None, topn_k=5):
    """
    Top-n most similar words to a vector formed by (+) positives and (-) negatives.
    """
    positive = positive or []
    negative = negative or []
    if model is None:
        return []
    # Ensure all are in-vocab
    # TODO: Prepare positive and negative vector lists
    pos = [___ for ___ in ___ if w in model.___]
    neg = [___ for ___ in ___ if w in model.___]
    if not pos and not neg:
        return []
    return model.wv.most_similar(positive=pos, negative=neg, topn=topn_k)

def rank_of_word(results_list, target_word):
    """Return 1-based rank of target_word within (word, score) results; None if absent."""
    for i, (w, _) in enumerate(results_list, start=1):
        if w == target_word:
            return i
    return None

# Configuration
neighbor_targets = ["king", "queen", "computer", "happy", "river"]
analogy_specs = [
    # (positives, negatives, expected)
    (["king", "woman"], ["man"], "queen"),
    (["queen", "man"], ["woman"], "king"),
    (["programming", "computer"], ["data"], None),   # open-ended, no expected
]

# Neighbors
print("Nearest neighbors (Skip-gram, cosine similarity)")
for t in neighbor_targets:
    # TODO: Get top-5 neighbors
    sims = ___(___, ___, n=5)
    if not sims:
        print(f"  • {t:<10} → [OOV or no neighbors]")
        continue
    pretty = ", ".join([f"{w} ({s:.5f})" for w, s in sims])
    print(f"  • {t:<10} → {pretty}")
print()

# Analogies
print("Vector analogies: top-5 results (+/- lists shown)")
for positives, negatives, expected in analogy_specs:
    # Show the query
    pos_str = " + ".join(positives) if positives else "∅"
    neg_str = " + ".join(negatives) if negatives else "∅"
    print(f"  • Query: ({pos_str}) − ({neg_str})", end="")
    if expected:
        print(f"  | expected ≈ {expected}")
    else:
        print()

    # Compute results
    results = most_similar_vec(sg_model_gensim, positive=positives, negative=negatives, topn_k=5)
    if not results:
        print("      → No results (possibly OOV terms).")
        continue

    # Pretty print top-5
    print("      top-5:", ", ".join([f"{w} ({s:.5f})" for w, s in results]))

    # Evaluation if expected given
    if expected:
        r = rank_of_word(results, expected)
        hit5 = (r is not None)
        print(f"      eval : Hit@5={hit5} | rank={r if r else '—'}")
    print()


## 5. Word2Vec (CBOW)

We train a **CBOW** model (with **naïve softmax**) on the **shared toy corpus** prepared earlier.

**Corpus variables used (from the shared corpus cell in Section 2):**
- `tokenized_corpus` (list of token lists)
- `word_to_index` (w→i)
- `index_to_word` (i→w)

---

### Data preparation (with few examples)

We use a symmetric window of size **$W$** on each side.

- **Strict (HW style):** require a **full** window of exactly $W$ tokens on **both** sides; boundary positions are **skipped** (no padding/masking).
- **Robust (alternative):** clip the window at sentence boundaries and average whatever context exists (can be fewer than $2W$).

Each eligible position yields **one** training example: **(mean of context embeddings) → center**.

**Example (from the corpus, $W=2$):**

Sentence: “**a bridge crosses the wide river**”  
Tokens: $[a,\; bridge,\; crosses,\; the,\; wide,\; river]$

- Center (target) $t=$ “crosses” at position $p=2$  
  Context indices (clipped): $\{0,1\}\cup\{3,4\}$ → contexts $c \in \{a,\; bridge,\; the,\; wide\}$
  
  Emitted example for both strict and robust modes: **mean($a, bridge, the, wide$) → $t=$ crosses**

- Center (target) $t=$ “a” at position $p=0$  
  Context indices (clipped) $\{1,2\}$ → contexts $c \in \{bridge,\; crosses\}$
  
  **Strict:** **skip** (because we don’t have $W=2$ tokens on the left)
  
  Emitted example for robust mode: **mean($bridge, crosses$) → $t=$ a**

(These tokens are mapped to integer ids via `word_to_index` before batching.)
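
The two windowing modes can be sketched as follows. This is a minimal illustration rather than the notebook's actual data-preparation code: the helper name `cbow_examples` and its `strict` flag are assumptions introduced here, and it works on raw tokens instead of integer ids.

```python
from typing import List, Tuple

def cbow_examples(tokens: List[str], window: int, strict: bool) -> List[Tuple[List[str], str]]:
    """Return (context_tokens, center_token) pairs for one sentence.

    strict=True  keeps only positions with `window` tokens on both sides.
    strict=False clips the window at the sentence boundaries.
    """
    examples = []
    n = len(tokens)
    for p, center in enumerate(tokens):
        left = max(0, p - window)
        right = min(n, p + window + 1)
        # Context = clipped window around p, excluding the center itself
        context = tokens[left:p] + tokens[p + 1:right]
        if strict and len(context) < 2 * window:
            continue  # boundary position: skipped in strict mode
        if context:
            examples.append((context, center))
    return examples

sentence = "a bridge crosses the wide river".split()
strict_examples = cbow_examples(sentence, window=2, strict=True)
robust_examples = cbow_examples(sentence, window=2, strict=False)

for ctx, center in strict_examples:
    print(f"strict: mean({', '.join(ctx)}) -> {center}")
for ctx, center in robust_examples:
    print(f"robust: mean({', '.join(ctx)}) -> {center}")
```

With $W=2$, strict mode keeps only the two interior centers of this six-token sentence, while robust mode emits one example per token.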

---


![CBOW Example](https://sp-ao.shortpixel.ai/client/to_auto,q_glossy,ret_img,w_1177,h_824/https://towardsmachinelearning.org/wp-content/uploads/2022/04/CBOW1.png)

### Notation ($t$=target/center, $c$=context)

Let $|V|$ be the vocabulary size and $d$ the embedding dimension.

- $W_{\text{in}} \in \mathbb{R}^{|V|\times d}$: **input/context** embedding table; for word index $c$,  
  $\mathbf{v}_c = W_{\text{in}}[c,:]$.
- $W_{\text{out}} \in \mathbb{R}^{|V|\times d}$: **output/target** embedding table; for word index $t$,  
  $\mathbf{u}_t = W_{\text{out}}[t,:]$.
- $b_{\text{out}} \in \mathbb{R}^{|V|}$: output bias; component $b_{\text{out},j}$ for class $j$.

---

### Model (naïve softmax)

Given the set of context indices $\text{C}$ (the $W$ tokens to the left and $W$ to the right) around a center:

1) **Aggregate context** (mean of input embeddings):

$$
\mathbf{h} \;=\; \frac{1}{|\text{C}|}\sum_{c \in \text{C}} \mathbf{v}_c .
$$

2) **Score** every vocabulary item and **predict the center index** $t$:

$$
\mathbf{z} \;=\; W_{\text{out}}\,\mathbf{h} + b_{\text{out}},
\qquad
\hat{\mathbf{y}} \;=\; \mathrm{softmax}(\mathbf{z}),
\quad
\hat{y}_j \;=\; \frac{e^{z_j}}{\sum_{k=1}^{|V|} e^{z_k}} .
$$

**Cross-entropy loss (context → center)** with a one-hot target for the true center $t$:

$$
\boxed{\,L(\text{C},t) = -\log \hat{y}_t\,}
$$

**Equivalent log-sum-exp form** (expanding the softmax):

$$
\boxed{\,L(\text{C},t)
= -\big(\mathbf{u}_t^\top \mathbf{h} + b_{\text{out},t}\big)
\;+\; \log \sum_{j=1}^{|V|} \exp\!\big(\mathbf{u}_j^\top \mathbf{h} + b_{\text{out},j}\big)\,}
$$

> The $+\log \sum \exp$ term is the **normalizer** over all classes $j$ (including $t$). Intuition: make the aggregated context $\mathbf{h}$ align with the correct center vector $\mathbf{u}_t$, while the normalization distributes probability mass away from incorrect centers.

> **Efficiency note:** We use **naïve softmax** here (summing over the **entire vocabulary**). In practice, **negative sampling** often replaces the full softmax with $1$ positive $+\;K$ negatives to avoid the $\text{O}(|V|)$ normalization.
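
As a quick numerical check of the two boxed forms of the loss, the NumPy sketch below runs one forward pass. The parameters are random and the sizes (`V = 6`, `d = 4`) and context indices are assumptions chosen for illustration; this is not the training code used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                          # assumed toy sizes
W_in = rng.normal(size=(V, d))       # input/context embedding table
W_out = rng.normal(size=(V, d))      # output/target embedding table
b_out = rng.normal(size=V)           # output bias

context_ids = [0, 1, 3, 4]           # context set C (assumed indices)
t = 2                                # true center index

# 1) Aggregate context: mean of input embeddings
h = W_in[context_ids].mean(axis=0)

# 2) Score every vocabulary item, then softmax (shifted for stability)
z = W_out @ h + b_out
z_shift = z - z.max()
y_hat = np.exp(z_shift) / np.exp(z_shift).sum()

# Cross-entropy form: -log y_hat[t]
loss_softmax = -np.log(y_hat[t])

# Log-sum-exp form: -(u_t . h + b_t) + log sum_j exp(u_j . h + b_j)
loss_lse = -(W_out[t] @ h + b_out[t]) + np.log(np.exp(z).sum())

print(loss_softmax, loss_lse)
```

Both values should agree up to floating-point error, since the second form is just the first with the softmax expanded.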

---

### Training setup (this demo)

- **Embedding dim** $d$: 50  
- **Window**: 5 (words on each side)  
- **Batch size**: 8  
- **Epochs**: 20  
- **Optimizer**: Adam (learning rate $= 0.05$)  
- **Loss**: `nn.CrossEntropyLoss()` (applies softmax + CE)

---

### After training

- We use the **input embeddings** $W_{\text{in}}$ as word vectors; $W_{\text{out}}$ is usually discarded or ignored downstream.

- To embed a word (token) with index $w$, its embedding is
  $$
  \mathbf{v}_w = W_{\text{in}}[w,:] \in \mathbb{R}^d.
  $$

- To embed a sentence, let a sentence be a sequence of word (token) indices
  $$
  S = (w_1, w_2, \dots, w_m).
  $$

  We define its embedding as the **mean of its word embeddings**:
  $$
  \mathrm{embed}(S)
  \;=\; \frac{1}{m} \sum_{r=1}^{m} \mathbf{v}_{w_r}
  \;=\; \frac{1}{m} \sum_{r=1}^{m} W_{\text{in}}[w_r,:].
  $$

- Nearest neighbors are computed with **cosine similarity** on rows of $W_{\text{in}}$:

$$
\cos(\mathbf{v}_a,\mathbf{v}_b) \;=\;
\frac{\mathbf{v}_a^\top \mathbf{v}_b}{\lVert \mathbf{v}_a\rVert\,\lVert \mathbf{v}_b\rVert}
$$
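
The mean-pooled sentence embedding and the cosine formula can be tried on a hand-made table. The 3×2 matrix below is an assumption standing in for a trained $W_{\text{in}}$, and the helper names `embed_sentence` and `cosine` are introduced here only for illustration.

```python
import numpy as np

# Stand-in for a trained W_in: 3 words, 2 dimensions (assumed values)
W_in = np.array([
    [1.0, 0.0],   # word 0
    [0.0, 1.0],   # word 1
    [1.0, 1.0],   # word 2
])

def embed_sentence(word_ids, table):
    """Mean of the rows of `table` selected by `word_ids`."""
    return table[word_ids].mean(axis=0)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(a @ b / (na * nb))

sentence_vec = embed_sentence([0, 1], W_in)     # mean of words 0 and 1
sim_to_word2 = cosine(sentence_vec, W_in[2])    # same direction as word 2
sim_orthogonal = cosine(W_in[0], W_in[1])       # orthogonal rows

print(sentence_vec, sim_to_word2, sim_orthogonal)
```

Because cosine ignores vector length, the averaged vector $(0.5, 0.5)$ is maximally similar to word 2 at $(1, 1)$ even though their norms differ.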


### CBOW (gensim) + t-SNE visualization

We now train a **CBOW** Word2Vec model using the **gensim** library and then visualize a subset of the learned embeddings with **t-SNE**.

**Training setup difference compared with Skip-gram**  
- Architecture: **CBOW** (`sg=0`, `cbow_mean=1`)  


In [None]:
# Train CBOW Word2Vec using gensim and visualize with t-SNE
from gensim.models import Word2Vec
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# -----------------------------
# Configuration
# -----------------------------
cbow_config = {
    "architecture": "CBOW",
    "vector_size": 50,           # embedding dim (d)
    "window": 5,                 # context window (each side)
    "min_count": 1,              # keep everything in this demo
    "sample": 0.0,               # disable subsampling (tiny corpus)
    "negative": 0,               # no negative sampling
    "hs": 1,                     # use hierarchical softmax
    "epochs": 20,                # training epochs
    "seed": RANDOM_SEED,
    "alpha": 0.05,
    "workers": 1,
    "cbow_mean": 1,              # average the context vectors
}

print("Training Word2Vec (CBOW, gensim)")
print(f"  • sentences           : {len(tokenized_corpus)}")
print(f"  • vector_size (dims)  : {cbow_config['vector_size']}")
print(f"  • window              : {cbow_config['window']}")
print(f"  • min_count           : {cbow_config['min_count']}")
print(f"  • negative samples    : {cbow_config['negative']}")
print(f"  • epochs              : {cbow_config['epochs']}")
print()

# Build the model with specified parameters on the constructor and train
cbow_model_gensim = Word2Vec(
    sentences=tokenized_corpus,                 # corpus here
    vector_size=cbow_config["vector_size"],
    window=cbow_config["window"],
    min_count=cbow_config["min_count"],
    sample=cbow_config["sample"],
    sg=0,                                       # 0 for CBOW, 1 for Skip-gram
    negative=cbow_config["negative"],
    hs=cbow_config["hs"],
    seed=cbow_config["seed"],
    cbow_mean=cbow_config["cbow_mean"],
    alpha=cbow_config["alpha"],
    epochs=cbow_config["epochs"],
    workers=cbow_config["workers"],
)

# Post-training summary
vocab_size = len(cbow_model_gensim.wv)
print("CBOW model trained (gensim).")
print(f"  • vocabulary size     : {vocab_size}")
print(f"  • embedding dim       : {cbow_model_gensim.wv.vector_size}")
print(f"  • #training sentences : {cbow_model_gensim.corpus_count}")
print()


In [None]:
# -----------------------------
# t-SNE visualization for (man, woman) and (king, queen)
# -----------------------------

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

RANDOM_SEED = 42

# Words/pairs to visualize and connect
pairs = [("man", "woman"), ("king", "queen")]
words = sorted({w for a, b in pairs for w in (a, b)})

# In-vocab filtering
in_vocab = [w for w in words if w in cbow_model_gensim.wv]
oov = [w for w in words if w not in cbow_model_gensim.wv]
if oov:
  print(f"Skipping OOV words: {oov}")

if len(in_vocab) < 2:
  print("Not enough in-vocab words to plot (need at least 2).")
else:
  # Collect embeddings
  X = np.array([cbow_model_gensim.wv[w] for w in in_vocab])

  # Choose a valid perplexity for tiny sets: must be < n_samples
  n = len(in_vocab)
  perplexity = max(1.0, min(30.0, n - 1.0))

  tsne = TSNE(
      n_components=2,
      perplexity=perplexity,
      random_state=RANDOM_SEED
  )
  coords = tsne.fit_transform(X)
  pos = {w: coords[i] for i, w in enumerate(in_vocab)}

  # Plot
  plt.figure(figsize=(7, 6))
  for w in in_vocab:
      x, y = pos[w]
      plt.scatter(x, y, s=70)
      plt.annotate(w, (x, y), fontsize=11, xytext=(6, 3), textcoords="offset points")

  # Helper to draw a line between two words if both are present
  def connect(a: str, b: str):
      if a in pos and b in pos:
          xs = [pos[a][0], pos[b][0]]
          ys = [pos[a][1], pos[b][1]]
          plt.plot(xs, ys, linewidth=2.2, alpha=0.85)

  # Draw the requested connections
  connect("man", "woman")
  connect("king", "queen")

  plt.title("t-SNE of CBOW embeddings — connections: man–woman, king–queen")
  plt.xlabel("t-SNE dim 1"); plt.ylabel("t-SNE dim 2")
  plt.grid(True, alpha=0.3)
  plt.show()


## 6. GloVe (Global Vectors) — PyTorch on the Toy Corpus

We fit **GloVe** with **PyTorch** on the same **toy corpus**.  
Unlike Skip-gram/CBOW (predictive, **local** windows + softmax), GloVe is **regression** on the **global** co-occurrence matrix.

**Corpus variables used (from the shared toy cell):**
- `tokenized_corpus` (list of token lists)
- `word_to_index` (w→i)
- `index_to_word` (i→w)

---

### Data (how we build $X$ in this demo)
- Symmetric **window** of size $W$ on each side (we clip at sentence boundaries).  
- For each center position $i$ and each context $j$ in its window, add **$1/\text{distance}$** to $X_{ij}$.
- This captures **global** co-occurrence structure across the whole corpus.

Each non-zero $(i,j)$ becomes a **training pair** carrying $(X_{ij}, f(X_{ij}))$.

### Mini co-occurrence example for $X$

**Tiny corpus**
- “king rules the kingdom”
- “the queen rules the kingdom”

**Vocabulary**
$[\,\text{king},\ \text{kingdom},\ \text{queen},\ \text{rules},\ \text{the}\,]$

**Settings**: symmetric window $W=2$; add $1/\text{distance}$ for each context token.

#### Directed matrix $X$ (rows = center $i$, columns = context $j$)
|        | king | kingdom | queen | rules | the |
|:------:|:----:|:-------:|:-----:|:-----:|:---:|
| **king**    | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 |
| **kingdom** | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| **queen**   | 0.0 | 0.0 | 0.0 | 1.0 | 1.5 |
| **rules**   | 1.0 | 1.0 | 1.0 | 0.0 | 2.5 |
| **the**     | 0.5 | 2.0 | 1.5 | 2.5 | 0.0 |

*Interpretation:* $X_{\text{king},\text{rules}}=1.0$ because **rules** is at distance 1 from **king**;  
$X_{\text{king},\text{the}}=0.5$ because **the** is at distance 2 from **king**.
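
The table above can be reproduced in a few lines. This is a standalone sketch keyed by word strings, not the `build_cooccurrence` function used later (which works on integer ids); the variable names are chosen here for illustration.

```python
from collections import defaultdict

corpus = [
    ["king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]
vocab = sorted({w for sent in corpus for w in sent})
window = 2

# X[(center, context)] accumulates 1/distance over all in-window occurrences
X = defaultdict(float)
for sent in corpus:
    for c, center in enumerate(sent):
        lo = max(0, c - window)
        hi = min(len(sent), c + window + 1)
        for pos in range(lo, hi):
            if pos != c:
                X[(center, sent[pos])] += 1.0 / abs(pos - c)

# Print one row per center word, columns in vocabulary order
for row in vocab:
    print(f"{row:<8}", [X.get((row, col), 0.0) for col in vocab])
```

Note that the two occurrences of **the** in the second sentence are three positions apart, so with $W=2$ they never co-occur and the diagonal entry stays zero.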

---


![GloVe Example](https://www.researchgate.net/publication/337461648/figure/fig1/AS:11431281342676763@1743564919258/The-model-architecture-of-GloVe-The-input-is-a-one-hot-representation-of-a-word-The.tif)

### Notation ($i$=center word in a pair, $j$=context word in a pair)

Let $|V|$ be vocab size and $d$ the embedding dim.
- $W_{\text{in}} \in \mathbb{R}^{|V|\times d}$: **word** embedding table; $\mathbf{v}_i = W_{\text{in}}[i,:]$  
- $W_{\text{out}} \in \mathbb{R}^{|V|\times d}$: **context** embedding table; $\mathbf{u}_j = W_{\text{out}}[j,:]$  
- $b_{\text{in}} \in \mathbb{R}^{|V|}$: word **biases** for center words; $b_i$ = $b_{\text{in}}[i]$
- $b_{\text{out}} \in \mathbb{R}^{|V|}$: word **biases** for context words; $b_{\text{ctx},j}$ = $b_{\text{out}}[j]$
- $X \in \mathbb{R}_{\ge 0}^{|V|\times|V|}$: **(directed) co-occurrence**; entry $X_{ij}$ counts how often context $j$ appears around word $i$  
  (we build it with a symmetric window and **$1/\text{distance}$** weighting).

---

### Objective (weighted least squares)
For all pairs with $X_{ij}>0$, minimize
$$
J \;=\; \sum_{i,j} f(X_{ij}) \Big( \underbrace{\mathbf{v}_i^\top \mathbf{u}_j + b_i + b_{\text{ctx},j}}_{\text{prediction } s_{ij}} \;-\; \log X_{ij} \Big)^2,
$$

with the standard GloVe **weighting function**

$$
f(x) \;=\;
\begin{cases}
\left(\dfrac{x}{x_{\max}}\right)^\alpha, & x < x_{\max},\\[6pt]
1, & x \ge x_{\max}.
\end{cases}
$$

**Intuition.** Learn embeddings so that the score $s_{ij}$ **approximates** $\log X_{ij}$.  

$f(x)$ down-weights very rare pairs and caps the influence of very frequent pairs.
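
To make the objective concrete, the sketch below evaluates $J$ once for a small hand-written $X$ and random, untrained parameters. The matrix values, the sizes and the seed are assumptions for illustration; the weighting is written as $\min\big(1, (x/x_{\max})^\alpha\big)$, which is equivalent to the piecewise definition for $\alpha > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 3x3 co-occurrence matrix (zeros are skipped by the objective)
X = np.array([
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 2.5],
    [0.5, 2.5, 0.0],
])
V, d = X.shape[0], 4
W_in = rng.normal(scale=0.1, size=(V, d))    # word vectors v_i
W_out = rng.normal(scale=0.1, size=(V, d))   # context vectors u_j
b_in = np.zeros(V)                           # word biases b_i
b_out = np.zeros(V)                          # context biases b_ctx,j
x_max, alpha = 50.0, 0.75

# Keep only pairs with X_ij > 0
i_idx, j_idx = np.nonzero(X)
x = X[i_idx, j_idx]

f = np.minimum(1.0, (x / x_max) ** alpha)                    # weights f(X_ij)
s = np.sum(W_in[i_idx] * W_out[j_idx], axis=1) + b_in[i_idx] + b_out[j_idx]
J = float(np.sum(f * (s - np.log(x)) ** 2))                  # weighted squared error

print(f"pairs={len(x)}  J={J:.6f}")
```

Training then amounts to adjusting the four parameter tables so that each $s_{ij}$ moves toward $\log X_{ij}$, with $f$ deciding how much each pair counts.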

**Role of $\alpha$**

$\alpha$ controls how fast $f(x)$ grows for $x < x_{\max}$:

- If $\alpha = 1$: $f(x) = x/x_{\max}$ (linear growth).
- If $0 < \alpha < 1$ (usual case): sublinear (concave) growth → rare pairs are still down-weighted, but less severely than under linear weighting, so low and mid-range counts contribute more.

In practice, GloVe typically uses $\alpha = 0.75$.

**Role of $x_{\max}$**

$x_{\max}$ is the **co-occurrence cutoff**:

- For $x < x_{\max}$: $f(x) = (x/x_{\max})^\alpha$ increases with $x$.
- For $x \ge x_{\max}$: $f(x) = 1$ (extra frequency doesn’t increase the weight).

This prevents extremely frequent words (e.g. “the”, “of”) from dominating the loss.

In practice, $x_{\max} \in [50, 100]$ (often $x_{\max} = 100$; $50$ is also reasonable).
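
The effect of $\alpha$ and $x_{\max}$ is easy to see by evaluating $f$ on a few counts. The helper name `glove_weight_demo` and the sample counts are assumptions for this illustration; the function mirrors the piecewise definition above.

```python
import numpy as np

def glove_weight_demo(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha for x < x_max, else 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

counts = np.array([1.0, 10.0, 50.0, 100.0, 1000.0])
w_sub = glove_weight_demo(counts, alpha=0.75)   # usual sublinear setting
w_lin = glove_weight_demo(counts, alpha=1.0)    # linear growth, for comparison

for c, a, b in zip(counts, w_sub, w_lin):
    print(f"x={c:>6.0f}  f(alpha=0.75)={a:.4f}  f(alpha=1)={b:.4f}")
```

Below the cutoff the sublinear weights sit above the linear ones, and every count at or above $x_{\max}$ receives the same weight of 1.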

---

### Training setup (this demo)
- **Embedding dim** $d$: 100  
- **Window**: 5 (distance-weighted $1/\text{dist}$)  
- **Weighting**: $x_{\max}=50$, $\alpha=0.75$  
- **Optimizer**: Adam (learning rate $= 0.05$)
- **Epochs**: 100
- **Batch size**: 256  
- **Init**: small Gaussian for $W_{\text{in}}, W_{\text{out}}$ and zeros for biases

---

### After training

- Common practice is to use the **sum** of word and context embeddings as the final word vector:

  $$
  \mathbf{e}_i \;=\; \mathbf{v}_i \;+\; \mathbf{u}_i,
  $$

  where $\mathbf{v}_i = W_{\text{in}}[i,:]$ and $\mathbf{u}_i = W_{\text{out}}[i,:]$.

- Collect these into a single embedding matrix

  $$
  E \;=\; W_{\text{in}} + W_{\text{out}} \in \mathbb{R}^{|V|\times d},
  $$

  and use the $i$-th row $\mathbf{e}_i = E[i,:]$ as the embedding for word index $i$.

- To embed a word (token) with index $w$, its embedding is

  $$
  \mathbf{e}_w = E[w,:] \in \mathbb{R}^d.
  $$

- To embed a sentence, let the sentence be a sequence of word (token) indices

  $$
  S = (w_1, w_2, \dots, w_m).
  $$

  We define its embedding as the **mean of its word embeddings**:

  $$
  \mathrm{embed}(S)
  \;=\; \frac{1}{m} \sum_{r=1}^{m} \mathbf{e}_{w_r}
  \;=\; \frac{1}{m} \sum_{r=1}^{m} E[w_r,:]
  $$

- Nearest neighbors are computed with **cosine similarity** on rows of $E$:

  $$
  \cos(\mathbf{e}_a,\mathbf{e}_b) \;=\;
  \frac{\mathbf{e}_a^\top \mathbf{e}_b}{\lVert \mathbf{e}_a\rVert\,\lVert \mathbf{e}_b\rVert}
  $$
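
A vectorized way to go from the two tables to neighbor lists is sketched below. The vocabulary, the random matrices and the helper `top_neighbors` are assumptions for illustration (untrained vectors, so the neighbors carry no meaning); the code cell that follows implements the same idea for the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "river", "bridge"]       # hypothetical toy vocabulary
W_in = rng.normal(size=(len(vocab), 8))            # stand-in word table
W_out = rng.normal(size=(len(vocab), 8))           # stand-in context table

# Combined embeddings, then unit-normalize rows so dot products are cosines
E = W_in + W_out
E_unit = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
sims = E_unit @ E_unit.T                           # (|V|, |V|) cosine matrix

def top_neighbors(word, k=2):
    """Top-k (word, cosine) pairs for `word`, excluding the word itself."""
    q = vocab.index(word)
    scores = sims[q].copy()
    scores[q] = -np.inf                            # mask the query
    order = np.argsort(-scores)[:k]
    return [(vocab[i], float(scores[i])) for i in order]

neighbors = top_neighbors("king", k=2)
print(neighbors)
```

Normalizing once and taking a single matrix product avoids recomputing norms for every pair, which matters as the vocabulary grows.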



In [None]:
# ================================
# GloVe (weighted least squares) — PyTorch on the toy corpus
# ================================
import math, random
from collections import defaultdict
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from IPython.display import display

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# -----------------------------
# Reproducibility & device
# -----------------------------
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# -----------------------------
# Build directed co-occurrence X with distance weighting
# -----------------------------
def build_cooccurrence(tokenized: List[List[str]],
                       w2i: Dict[str,int],
                       window: int = 5) -> Dict[Tuple[int,int], float]:
    """
    X[(i,j)] = sum over occurrences of j in the symmetric window around i of (1 / distance)
    Window is clipped at sentence boundaries.
    """
    X = defaultdict(float)
    for sent in tokenized:
        ids = [w2i[w] for w in sent if w in w2i]
        n = len(ids)
        for c in range(n):
            i_idx = ids[c]  # center word index
            left  = max(0, c - window)
            right = min(n, c + window + 1)
            for pos in range(left, right):
                if pos == c:
                    continue
                j_idx = ids[pos]
                dist = abs(pos - c)
                X[(i_idx, j_idx)] += 1.0 / dist
    return X

# -----------------------------
# Dataset over non-zero co-occurrence pairs
# -----------------------------
class GloVePairs(Dataset):
    def __init__(self, pairs_ij, x_vals, weights):
        self.i = torch.tensor([p[0] for p in pairs_ij], dtype=torch.long)
        self.j = torch.tensor([p[1] for p in pairs_ij], dtype=torch.long)
        self.x = torch.tensor(x_vals, dtype=torch.float32)
        self.w = torch.tensor(weights, dtype=torch.float32)
    def __len__(self):
        return self.i.shape[0]
    def __getitem__(self, idx):
        return self.i[idx], self.j[idx], self.x[idx], self.w[idx]

# -----------------------------
# GloVe model: embeddings + biases
# -----------------------------
class GloVeTorch(nn.Module):
    """
    Parameters:
      W_in   (|V|, d)   → word embeddings  v_i
      W_out  (|V|, d)   → context embeddings u_j
      b_word (|V|,)     → word bias b_i
      b_ctx  (|V|,)     → context bias b_ctx,j
    Prediction: s_ij = v_i^T u_j + b_i + b_ctx,j
    Loss per pair: w_ij * (s_ij - log X_ij)^2
    """
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.in_embed  = nn.Embedding(vocab_size, dim)
        self.out_embed = nn.Embedding(vocab_size, dim)
        self.b_word    = nn.Embedding(vocab_size, 1)
        self.b_ctx     = nn.Embedding(vocab_size, 1)

        # Initialization (small Gaussian for embeddings, zeros for biases)
        nn.init.normal_(self.in_embed.weight,  mean=0.0, std=0.01)
        nn.init.normal_(self.out_embed.weight, mean=0.0, std=0.01)
        nn.init.zeros_(self.b_word.weight)
        nn.init.zeros_(self.b_ctx.weight)

    def forward(self, i_idx: torch.LongTensor, j_idx: torch.LongTensor):
        v_i = self.in_embed(i_idx)            # (B, d)
        u_j = self.out_embed(j_idx)           # (B, d)
        b_i = self.b_word(i_idx).squeeze(-1)  # (B,)
        b_j = self.b_ctx(j_idx).squeeze(-1)   # (B,)
        s_ij = (v_i * u_j).sum(dim=1) + b_i + b_j  # (B,)
        return s_ij

    def combined_matrix(self):
        """E = W_in + W_out (|V|, d) for downstream similarity."""
        return self.in_embed.weight + self.out_embed.weight

# -----------------------------
# Weighting function f(x) for GloVe
# -----------------------------
def glove_weight(x: np.ndarray, x_max: float, alpha: float) -> np.ndarray:
    w = (x / x_max) ** alpha
    w = np.where(x < x_max, w, 1.0)
    return w.astype(np.float32)

# -----------------------------
# Training function (prints prep & co-occurrence tables)
# -----------------------------
def glove_train_torch(tokenized: List[List[str]],
                      w2i: Dict[str,int],
                      i2w: Dict[int,str],
                      dim: int        = 50,
                      window: int     = 5,
                      x_max: float    = 50.0,
                      alpha: float    = 0.75,
                      batch_size: int = 256,
                      epochs: int     = 30,
                      lr: float       = 0.05):
    # Build directed co-occurrence
    X = build_cooccurrence(tokenized, w2i, window=window)

    # ---- Prints: prep + dense tables ----
    V = len(i2w)
    vocab_list = [i2w[i] for i in range(V)]
    print("\nGloVe prep:")
    print(f"  • vocab size               : {V}")
    print(f"  • nonzero co-occurrences   : {len(X)}")

    # Dense table for readability
    Xd = np.zeros((V, V), dtype=float)
    for (i, j), val in X.items():
        Xd[i, j] = val

    df_directed  = pd.DataFrame(Xd, index=vocab_list, columns=vocab_list)

    print("\nCo-occurrence matrix X (directed): values rounded to 2 decimals")
    display(df_directed.round(2))
    # -----------------------------------------------

    # Convert to training tensors
    pairs = list(X.keys())
    xvals = np.array([X[p] for p in pairs], dtype=np.float32)
    wvals = glove_weight(xvals, x_max=x_max, alpha=alpha)

    print("\nGloVe training (PyTorch):")
    print(f"  • window                 : {window}")
    print(f"  • x_max, alpha           : {x_max}, {alpha}")
    print(f"  • d, batch, epochs       : {dim}, {batch_size}, {epochs}")
    print(f"  • optimizer              : Adam (lr={lr})")

    ds = GloVePairs(pairs, xvals, wvals)
    dl = DataLoader(ds, batch_size=batch_size, shuffle=True)

    model = GloVeTorch(vocab_size=V, dim=dim).to(DEVICE)
    opt = optim.Adam(model.parameters(), lr=lr)

    for ep in range(1, epochs+1):
        model.train()
        total_loss, total_n = 0.0, 0
        for i_idx, j_idx, x_ij, w_ij in dl:
            i_idx = i_idx.to(DEVICE); j_idx = j_idx.to(DEVICE)
            x_ij  = x_ij.to(DEVICE);  w_ij  = w_ij.to(DEVICE)

            opt.zero_grad()
            s_ij = model(i_idx, j_idx)           # (B,)
            r    = s_ij - torch.log(x_ij + 1e-12) # Added tiny epsilon for stability to avoid log(0)
            loss = (w_ij * (r ** 2)).mean()
            loss.backward()
            opt.step()

            total_loss += float(loss.item()) * i_idx.size(0)
            total_n    += i_idx.size(0)
        avg = total_loss / max(1, total_n)
        if ep % max(1, epochs//5) == 0:
            print(f"  epoch {ep:02d}/{epochs} | avg_weighted_MSE≈{avg:.6f}")

    return model, X  # Return model and the co-occurrence dict for any later use

def cosine_similarity_numpy(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """
    Cosine(v1, v2) = (v1 · v2) / (||v1|| * ||v2||).
    If either norm is zero, return 0.0.
    """
    dot_product = float(np.dot(vec1, vec2))
    norm1 = float(np.linalg.norm(vec1))
    norm2 = float(np.linalg.norm(vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot_product / (norm1 * norm2)

# -----------------------------
# Nearest neighbors on E = W_in + W_out
# -----------------------------
def glove_most_similar(word: str,
                       model: GloVeTorch,
                       w2i: Dict[str,int],
                       i2w: Dict[int,str],
                       topn: int = 5):
    """
    Top-n neighbors by cosine similarity using E = W_in + W_out.
    """
    if word not in w2i:
        return []

    # Combined embeddings (E = W_in + W_out)
    with torch.no_grad():
        E = model.combined_matrix().detach().cpu().numpy()  # shape: (|V|, d)

    q_idx = w2i[word]
    v_q = E[q_idx]  # (d,)

    # Cosine similarities against all words
    sims = np.array([cosine_similarity_numpy(E[i], v_q) for i in range(E.shape[0])], dtype=float)

    # Exclude the query token itself
    sims[q_idx] = -np.inf

    # Top-n by cosine
    k = topn
    order = np.argsort(-sims)[:k]
    return [(i2w[i], float(sims[i])) for i in order]

# -----------------------------
# Train on the toy corpus & probe (PyTorch GloVe)
# -----------------------------
glove_model, X_dict = glove_train_torch(
    tokenized=tokenized_corpus,
    w2i=word_to_index,
    i2w=index_to_word,
    dim=100,
    window=5,
    x_max=50,
    alpha=0.75,
    batch_size=256,
    epochs=100,
    lr=0.05
)

probe_words = ["king", "queen", "computer", "happy", "river", "bridge"]
print("\nNearest neighbors (GloVe PyTorch, cosine on W_in + W_out):")
for w in probe_words:
    nns = glove_most_similar(w, glove_model, word_to_index, index_to_word, topn=5)
    pretty = ", ".join([f"{a} ({b:.3f})" for a,b in nns]) if nns else "[OOV]"
    print(f"  • {w:<9}: {pretty}")


### Comparison: CBOW/Skip-gram vs GloVe

- These are **static** word embeddings; they do not handle OOVs or context differences (no contextual embeddings).  
- Differences reflect **training objective & statistics**: predictive (Skip-gram/CBOW) vs. global regression (GloVe).
  - **Skip-gram/CBOW**: predictive, use **local** context windows with a softmax-style objective (naïve or hierarchical softmax in this notebook; negative sampling is common in practice).
  - **GloVe**: regression on **global** counts, fits all (non-zero) pairs jointly.
  - In practice, neighbor lists may be similar in general but can **differ**: GloVe tends to reflect **broader global statistics**, while Skip-gram may capture **finer local relations** (especially for rarer words), and CBOW may capture frequent-context neighbors better.


## TASK 2.1: CBOW vs Skip-gram Comparison

**What you’ll implement**

- **Model comparison.**  
   For `["king", "queen", "computer", "happy", "river"]`:
   - list **CBOW** and **Skip-gram** neighbors (top-5),
   - compute **overlap** (intersection size and Jaccard index),
   - report a simple **coherence** metric = average pairwise cosine **among the 5 neighbors** (higher ≈ tighter cluster).

Let
- $C_w$ = set of top-5 CBOW neighbors for word $w$,
- $S_w$ = set of top-5 Skip-gram neighbors for word $w$.

Then:

- **Overlap size**:
  $$
  \text{overlap}(w) = \bigl| C_w \cap S_w \bigr|.
  $$

- **Jaccard index**:
  $$
  J(w) = \frac{\bigl| C_w \cap S_w \bigr|}{\bigl| C_w \cup S_w \bigr|}.
  $$

Notes:
- Training on tiny corpora can produce many near-ties, which is why the similarity scores are often very close to each other.
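
As a worked instance of the two overlap formulas, consider the neighbor lists below. They are hypothetical, made up for this illustration, and are not output from the trained models.

```python
# Hypothetical top-5 neighbor lists for one target word
cbow_top5 = ["queen", "kingdom", "palace", "rules", "crown"]
sg_top5 = ["queen", "palace", "prince", "throne", "rules"]

C, S = set(cbow_top5), set(sg_top5)

overlap = sorted(C & S)                 # words in both lists
overlap_size = len(overlap)             # |C ∩ S|
jaccard = len(C & S) / len(C | S)       # |C ∩ S| / |C ∪ S|

print(overlap, overlap_size, round(jaccard, 3))
```

Here three words are shared and the union holds seven distinct words, so $J = 3/7$. With two top-5 lists the union always has between 5 and 10 words, and $J$ reaches 1 only when both lists contain the same five words.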


In [None]:
# TASK 2.1: CBOW vs Skip-gram comparison

import numpy as np

# Gensim most_similar (baseline)
def most_similar_gensim(model, query_word, topn=5):
    """
    Thin wrapper around gensim's most_similar for side-by-side comparison.
    Returns a list of (word, score). Returns [] if model/word missing.
    """
    if (model is None) or (query_word not in model.wv):
        return []
    return [(w, float(s)) for (w, s) in model.wv.most_similar(query_word, topn=topn)]

# Coherence: avg pairwise cosine among neighbors
def neighbor_coherence(model, word, topn_k=5):
    """
    Average pairwise cosine among the top-k neighbors (unordered pairs).
    Returns np.nan if fewer than 2 neighbors are available.
    """
    # Get neighbor words via the gensim wrapper defined above
    neighbors = [w for w, _ in most_similar_gensim(model, word, topn=topn_k)]
    if len(neighbors) < 2:
        return np.nan

    # Unit vectors for neighbors
    kv = model.wv
    vecs = np.array([kv[w] for w in neighbors], dtype=float) # (num_words, d)
    vecs_norm = np.linalg.norm(vecs, axis=1, keepdims=True)  # (num_words, 1)
    # We add a tiny epsilon to the denominator before normalization to avoid division by 0
    vecs = vecs / (vecs_norm + 1e-12)

    # Plain, clear average over all unordered pairs
    total, count = 0.0, 0
    n = len(neighbors)
    for i in range(n):
        for j in range(i + 1, n):
            # TODO: Compute the cosine similarity between each pair of vectors (note that the vectors are already normalized)
            total += float(np.___(___[___], ___[___]))
            count += 1

    return (total / count) if count else np.nan

# CBOW vs Skip-gram comparison
compare_words = ["king", "queen", "computer", "happy", "river"]
print("CBOW vs Skip-gram — neighbors, overlap, and coherence")

for w in compare_words:
    cb = most_similar_gensim(cbow_model_gensim, w, topn=5)
    sg = most_similar_gensim(sg_model_gensim,  w, topn=5)

    cb_words = [x for x, _ in cb]
    sg_words = [x for x, _ in sg]
    # TODO: Find the overlap between the CBOW and Skip-gram sets of neighboring words
    overlap = sorted(___(___) & ___(___))
    jaccard = (len(overlap) / len(set(cb_words) | set(sg_words))) if (cb_words and sg_words) else 0.0

    cb_coh = neighbor_coherence(cbow_model_gensim, w, topn_k=5)
    sg_coh = neighbor_coherence(sg_model_gensim,  w, topn_k=5)

    print(f"\n  ▶ Word: {w}")
    print(f"    Skip-gram : {', '.join([f'{a} ({b:.5f})' for a,b in sg])}")
    print(f"    CBOW      : {', '.join([f'{a} ({b:.5f})' for a,b in cb])}")
    print(f"    Overlap   : {overlap}  |  Jaccard={jaccard:.5f}")
    print(f"    Coherence : CBOW≈{cb_coh:.5f}  |  SG≈{sg_coh:.5f}")

print()
print("Discussion and Insights")
print("- Skip-gram often yields sharper, specific neighbors (rare-word friendly).")
print("- CBOW often yields smoother, frequent-context neighbors.")
print("- These are tendencies; the exact behavior depends on corpus, params, etc.")
print("- Consider semantic coherence and topical tightness.")


## TASK 2.2: Skip-gram vs GloVe — Neighbors and MRR

We compare **top-5 nearest neighbors** (cosine) for several targets under two models, then quantify how well each model surfaces a small **expected set** of “reasonable neighbors” using **Reciprocal Rank (RR)** and **Mean Reciprocal Rank (MRR)**.

**Models compared**
- **Skip-gram (PyTorch)** — neighbors from the input table $W_{\text{in}}$.
- **GloVe (PyTorch)** — neighbors from $E = W_{\text{in}} + W_{\text{out}}$.

**Targets**: `["king", "computer", "happy", "river", "bridge"]`

**Expected sets** (derived from our toy corpus; used for MRR)
- `king` → {`queen`, `kingdom`, `palace`}
- `computer` → {`software`, `hardware`, `system`}
- `happy` → {`joyful`, `cheerful`, `smile`}
- `river` → {`bank`, `bridge`, `lake`}
- `bridge` → {`river`, `bank`, `boats`}

### Metrics
- **RR (per target)**: let $r$ be the rank of the **first** expected word in the model’s **top-5** list; then $\text{RR}=1/r$. If no expected word appears in the top-5, $\text{RR}=0$.
- **MRR (per model)**: mean of RR over all targets.

> These sets are small, corpus-specific heuristics (not gold labels). They’re only for quick checks on this toy setup.
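
A small worked example of both metrics, using hypothetical rankings that are not produced by any model in this notebook. The helper `rr_demo` is an illustration-only name and is written differently from the function you complete in the task cell.

```python
from typing import Iterable, Set

def rr_demo(ranked: Iterable[str], expected: Set[str]) -> float:
    """Reciprocal rank of the first expected word in `ranked` (0.0 if none)."""
    hit_ranks = [rank for rank, w in enumerate(ranked, start=1) if w in expected]
    return 1.0 / hit_ranks[0] if hit_ranks else 0.0

# Hypothetical top-5 lists for two targets
demo_rankings = {
    "river": ["water", "bridge", "bank", "boats", "wide"],
    "happy": ["sad", "music", "day", "child", "sun"],
}
demo_expected = {
    "river": {"bank", "bridge", "lake"},
    "happy": {"joyful", "cheerful", "smile"},
}

rrs = [rr_demo(demo_rankings[t], demo_expected[t]) for t in demo_rankings]
mrr = sum(rrs) / len(rrs)
print(rrs, mrr)
```

For `river` the first expected word is `bridge` at rank 2, so $\text{RR}=1/2$, and the later hit `bank` at rank 3 does not change it. `happy` has no hit, so $\text{RR}=0$ and the MRR over the two targets is $0.25$.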


In [None]:
# TASK 2.2 — Neighbors + RR/MRR (SG vs GloVe)

from typing import List, Tuple, Dict, Iterable
import numpy as np

# --- Helpers to pull neighbors from existing models --------------------------
def neighbors_skipgram(word: str, topn: int = 5) -> List[Tuple[str, float]]:
    """
    Top-n neighbors from the Skip-gram PyTorch model (W_in).
    Expects `sg_model_pytorch` and `most_similar_from_Win` to exist.
    """
    if sg_model_pytorch is not None:
        return most_similar_from_Win(word, sg_model_pytorch, topn=topn)
    return []

def neighbors_glove(word: str, topn: int = 5) -> List[Tuple[str, float]]:
    """
    Top-n neighbors from the GloVe PyTorch model (E = W_in + W_out).
    Expects `glove_model`, `word_to_index`, `index_to_word`, and `glove_most_similar` to exist.
    """
    if (glove_model is not None and
      word_to_index is not None and index_to_word is not None):
        return glove_most_similar(word, glove_model, word_to_index, index_to_word, topn=topn)
    return []

targets = ["king", "computer", "happy", "river", "bridge"]

# --- Expected sets for RR/MRR (toy, heuristic) --------------------------------
expected: Dict[str, set] = {
    "king":     {"queen", "kingdom", "palace"},
    "computer": {"software", "hardware", "system"},
    "happy":    {"joyful", "cheerful", "smile"},
    "river":    {"bank", "bridge", "lake"},
    "bridge":   {"river", "bank", "boats"},
}

# --- RR/MRR utilities ---------------------------------------------------------
def reciprocal_rank(pred_names: Iterable[str], expected_set: set) -> Tuple[float, int, str]:
    """
    Return (RR, rank, matched_word) where rank is 1-based on first hit in pred_names.
    If no expected item is found, returns (0.0, 0, '').
    """
    for r, w in enumerate(pred_names, start=1):
        # TODO: Find the rank of the first match in the expected set for the target word
        if ___ in ___:
            return 1.0 / ___, r, w
    return 0.0, 0, ""

def pretty_list(pairs: List[Tuple[str, float]]) -> str:
    return ", ".join([f"{w} ({s:.3f})" for w, s in pairs])

# --- Run comparison -----------------------------------------------------------
print("\n=== Nearest neighbors + RR/MRR (Skip-gram vs GloVe) ===")
sg_rrs, gl_rrs = [], []

for t in targets:
    sg_neighbors = neighbors_skipgram(t, topn=5)
    gl_neighbors = neighbors_glove(t, topn=5)

    sg_preds = [w for w, _ in sg_neighbors]
    gl_preds = [w for w, _ in gl_neighbors]

    rr_sg, rank_sg, hit_sg = reciprocal_rank(sg_preds, expected.get(t, set()))
    rr_gl, rank_gl, hit_gl = reciprocal_rank(gl_preds, expected.get(t, set()))

    sg_rrs.append(rr_sg)
    gl_rrs.append(rr_gl)

    print(f"\nTarget: {t}")
    print(f"  Skip-gram top-5: [{pretty_list(sg_neighbors)}]")
    print(f"    RR = {rr_sg:.3f}" + (f"  (hit='{hit_sg}' at rank {rank_sg})" if hit_sg else "  (no expected hit)"))
    print(f"  GloVe     top-5: [{pretty_list(gl_neighbors)}]")
    print(f"    RR = {rr_gl:.3f}" + (f"  (hit='{hit_gl}' at rank {rank_gl})" if hit_gl else "  (no expected hit)"))

# TODO: Find the mean of reciprocal ranks previously calculated
mrr_sg = float(np.___(___)) if ___ else 0.0
mrr_gl = float(np.___(___)) if ___ else 0.0

print("\n--- MRR over all targets ---")
print(f"  Skip-gram MRR: {mrr_sg:.3f}")
print(f"  GloVe     MRR: {mrr_gl:.3f}")

# Quick comparison line
better = "Skip-gram" if mrr_sg > mrr_gl else ("GloVe" if mrr_gl > mrr_sg else "Tied")
print(f"\nResult: {better} has higher MRR on this toy dataset.")


## 7. Summary

**What You Practiced**
- Built one-hot vectors; compared and calculated **cosine similarity**.
- Trained **Skip-gram** and **CBOW** Word2Vec; explored neighbors and **analogy arithmetic**.
- Visualized embeddings with **t-SNE**.
- Implemented a **tiny GloVe** on co-occurrences.
- Compared neighbors across **Skip-gram / CBOW / GloVe** and discussed differences.

**Key Takeaways**
- One-hot is simple but lacks semantics → embeddings provide dense, shared statistics.
- **Cosine** is the go-to similarity for embeddings (direction matters).
- **Skip-gram** tends to do better on rare words and **CBOW** on frequent words; these are tendencies, not guarantees.
- **GloVe** leverages **global** statistics; Word2Vec focuses on **local** contexts.
- Visualization + nearest-neighbor inspection are useful sanity checks.