# AIO Q1 -- Word2Vec Skip-Gram with Negative Sampling


## Background

In the Word2Vec skip-gram model, the goal is to predict **context words** given a **center word**.
Each word has two embeddings:

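- **Input (center) embedding**: row $u_c$ of matrix $U \in \mathbb{R}^{V \times d}$, used when the word is the center word
- **Output (context) embedding**: row $w_o$ of matrix $W \in \mathbb{R}^{V \times d}$, used when the word is a context word or a negative sample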

Notation:
- $V$ = vocabulary size (total number of unique words)
- $d$ = embedding dimension (vector size for each word)
- $K$ = number of negative samples per positive pair

The negative sampling loss for one training pair is:

$$
L = -\log \sigma(s_{\text{pos}}) - \sum_{k=1}^K \log \sigma(-s_{\text{neg},k})
$$

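Here $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, $s_{\text{pos}} = w_o^\top u_c$ is the score of the true context word $o$, and $s_{\text{neg},k} = w_{\text{neg},k}^\top u_c$ is the score of the $k$-th negative sample.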
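
To make the formula concrete, the sketch below evaluates the loss for one hypothetical training pair. The scores (`2.0` for the positive pair, three made-up negative scores) and the helper name `neg_sampling_loss_from_scores` are assumptions chosen for illustration only. It uses a plain loop on purpose and is not the vectorized form asked for in Task 5.

```python
import math

def sigmoid(x):
    # Logistic function, as defined above.
    return 1.0 / (1.0 + math.exp(-x))

def neg_sampling_loss_from_scores(s_pos, s_negs):
    # L = -log sigma(s_pos) - sum_k log sigma(-s_neg_k)
    loss = -math.log(sigmoid(s_pos))
    for s in s_negs:
        loss -= math.log(sigmoid(-s))
    return loss

# Hypothetical scores: one positive pair, K = 3 negatives.
example_loss = neg_sampling_loss_from_scores(2.0, [-1.0, 0.0, 0.5])
print(example_loss)
```

A high positive score and strongly negative scores for the negatives both push the loss toward zero; when every score is zero, each term contributes $\log 2$.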


```python
import numpy as np  # NumPy is the only library you'll need
```

## Task 1 — Understanding Separate Embedding Matrices [8 points]

### 1.1 [3 points]
**Explain** why the operation $u_c = U[c]$ (selecting row $c$ from $U \in \mathbb{R}^{V \times d}$) requires $O(d)$ time rather than $O(Vd)$ time.

Your explanation should mention:
- How rows are stored in memory
- Why we don't need to examine all $V$ rows
<br><br>

**Answer:**

---

### 1.2 [5 points]
Let $\mathcal{M}_{\text{share}}$ denote a model using one shared embedding matrix and $\mathcal{M}_{\text{separate}}$ denote a model using separate $U$ and $W$ matrices.

**Show** that $\mathcal{M}_{\text{separate}}$ has strictly more representational capacity by constructing an explicit example: give specific $2 \times 2$ matrices $U$ and $W$ whose pairwise scores $w_o^\top u_c$ cannot be reproduced by any single shared matrix used for both roles.
<br><br>
**Answer:**


## Task 2 — Properties of the Dot Product Score [12 points]

The compatibility score is $s = w_o^\top u_c = \sum_{i=1}^d w_{o,i} \cdot u_{c,i}$.

### 2.1 [6 points]
**Prove** or **disprove** the following statement:

> *"If $\|u_c\|_2 = \|w_o\|_2 = 1$ (unit vectors), then $s \in [-1, 1]$."*

If true, provide a proof. If false, provide a counterexample.
<br><br>
**Answer:**

---

### 2.2 [6 points]
Suppose every embedding entry is drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$ with $d = 300$, where $\sigma = 0.01$ here denotes the standard deviation (not the sigmoid function).

**Compute** the expected value and variance of the dot product $s = w_o^\top u_c$ at initialization, assuming $u_c$ and $w_o$ are independent.

**Given**: For independent random variables $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y \sim \mathcal{N}(0, \sigma_Y^2)$:
- $\mathbb{E}[XY] = 0$
- $\text{Var}(XY) = \sigma_X^2 \sigma_Y^2$
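
The two given identities can be sanity-checked numerically. The sketch below is a Monte Carlo check for a single product $XY$, using hypothetical standard deviations (`0.5` and `2.0`) that are deliberately different from the values in the task. The seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x, sigma_y = 0.5, 2.0   # hypothetical standard deviations, not the task's values
n = 1_000_000                 # number of Monte Carlo samples

x = rng.normal(0.0, sigma_x, size=n)
y = rng.normal(0.0, sigma_y, size=n)
prod = x * y

empirical_mean = prod.mean()
empirical_var = prod.var()
theoretical_var = sigma_x**2 * sigma_y**2   # sigma_X^2 * sigma_Y^2

print(empirical_mean, empirical_var, theoretical_var)
```

The empirical mean should be close to zero and the empirical variance close to `theoretical_var`, up to sampling noise.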

<br><br>
**Answer:**


## Task 3 — Sigmoid Function Analysis [10 points]

### 3.1 [5 points]
**Prove** that:

$$
\lim_{x \to \infty} \sigma(x) = 1 \quad \text{and} \quad \lim_{x \to -\infty} \sigma(x) = 0
$$

using the definition $\sigma(x) = \frac{1}{1 + e^{-x}}$.
<br><br>
**Answer:**

---

### 3.2 [5 points]
**Derive** the derivative $\frac{d\sigma}{dx}$ and express it in terms of $\sigma(x)$ itself.

Show your work using the quotient rule or chain rule.

<br><br>
**Answer:**


## Task 4 — Computational Complexity Analysis [10 points]

### 4.1 [5 points]
Let $W \in \mathbb{R}^{V \times d}$ and $u_c \in \mathbb{R}^d$.

**Determine** the computational complexity of computing all scores $s = Wu_c$ (returning a vector of length $V$).<br> Express your answer in $\Theta(\cdot)$ notation and explain your reasoning.
<br><br>
**Answer:**

---

### 4.2 [5 points]
During negative sampling, we only need $K+1$ scores (1 positive, $K$ negative) where $K \ll V$.
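
As a rough illustration of "only the needed scores", the sketch below gathers the $K+1$ relevant rows of $W$ and checks that the result matches the corresponding entries of the full product. The sizes, the seed, and the indices are small hypothetical values, not the ones to evaluate in this task.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, K = 1000, 16, 5                 # small hypothetical sizes
W = rng.normal(size=(V, d))
u_c = rng.normal(size=d)

o = 3                                  # hypothetical positive context index
negs = rng.integers(0, V, size=K)      # hypothetical negative sample indices
idx = np.concatenate(([o], negs))      # (K+1,) indices actually needed

needed_scores = W[idx] @ u_c           # dot products for K+1 rows only
all_scores = W @ u_c                   # dot products for all V rows

# Both routes agree on the entries that matter.
assert np.allclose(needed_scores, all_scores[idx])
```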

**Compute** the exact number of floating-point operations (multiplications + additions) needed for:
1. Computing only the needed $K+1$ scores
2. Computing the full matrix-vector product $Wu_c$

Evaluate for $V = 200{,}000$, $d = 300$, $K = 10$.
<br><br>
**Answer:**


## Task 5 — Vectorized Loss Implementation [20 points]

Complete the vectorized implementation with **no Python loops**:


```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def neg_sampling_loss(U, W, c, o, negs):
    """
    Compute: L = -log σ(s_pos) - Σ_k log σ(-s_neg,k)

    Args:
        U: (V, d) center embeddings
        W: (V, d) context embeddings
        c: int, center word index
        o: int, positive context word index
        negs: (K,) negative sample indices
    Returns:
        loss: scalar
    """
    u_c = U[c]          # (d,)
    w_o = W[o]          # (d,)
    w_negs = W[negs]    # (K, d)

    # Scores (provided)
    s_pos = w_o @ u_c                 # scalar
    s_negs = w_negs @ u_c             # (K,)

    # --- Your code below (3–4 lines) ---
    sig_pos = ...
    sig_negs = ...
    loss = ...
    # -----------------------------------

    return loss
```


## Task 6 — Vectorized Gradient Update Implementation [30 points]

The gradients for negative sampling loss are:

$$
\frac{\partial L}{\partial u_c} = (\sigma(s_{\text{pos}}) - 1) w_o + \sum_{k=1}^K \sigma(s_{\text{neg},k}) w_{\text{neg},k}
$$

$$
\frac{\partial L}{\partial w_o} = (\sigma(s_{\text{pos}}) - 1) u_c
$$

$$
\frac{\partial L}{\partial w_{\text{neg},k}} = \sigma(s_{\text{neg},k}) u_c
$$

where $s_{\text{pos}} = w_o^\top u_c$ and $s_{\text{neg},k} = w_{\text{neg},k}^\top u_c$.
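
Formulas like these can be checked against finite differences before they are used in an update. The sketch below does this for $\partial L / \partial u_c$ only, under stated assumptions: random small vectors, a central-difference step `eps=1e-6`, and hypothetical helper names (`loss_wrt_u`, `analytic_grad_u`, `numeric_grad_u`). It is written with explicit loops for readability and is not the vectorized implementation requested in 6.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_wrt_u(u, w_o, w_negs):
    # L = -log sigma(w_o . u) - sum_k log sigma(-(w_neg_k . u)), loop form
    total = -np.log(sigmoid(w_o @ u))
    for w in w_negs:
        total -= np.log(sigmoid(-(w @ u)))
    return total

def analytic_grad_u(u, w_o, w_negs):
    # (sigma(s_pos) - 1) w_o + sum_k sigma(s_neg_k) w_neg_k, as stated above
    grad = (sigmoid(w_o @ u) - 1.0) * w_o
    for w in w_negs:
        grad = grad + sigmoid(w @ u) * w
    return grad

def numeric_grad_u(u, w_o, w_negs, eps=1e-6):
    # Central differences, one coordinate of u at a time.
    grad = np.zeros_like(u)
    for i in range(u.size):
        step = np.zeros_like(u)
        step[i] = eps
        grad[i] = (loss_wrt_u(u + step, w_o, w_negs)
                   - loss_wrt_u(u - step, w_o, w_negs)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
d, K = 5, 3                        # small hypothetical sizes
u = rng.normal(size=d)
w_o = rng.normal(size=d)
w_negs = rng.normal(size=(K, d))

max_abs_diff = np.max(np.abs(analytic_grad_u(u, w_o, w_negs)
                             - numeric_grad_u(u, w_o, w_negs)))
print(max_abs_diff)                # expected to be tiny if the formula is right
```

The same pattern, perturbing $w_o$ or one $w_{\text{neg},k}$ instead of $u_c$, could be used to check the other two gradients.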

### 6.1 [30 points]
Implement the gradient descent update:


```python
def neg_sampling_update(U, W, c, o, negs, lr):
    """
    Perform gradient descent: θ ← θ - η·∇L

    Args:
        U: (V, d) center embeddings (modified in place)
        W: (V, d) context embeddings (modified in place)
        c: int, center word index
        o: int, positive context word index
        negs: (K,) negative sample indices
        lr: float, learning rate η
    """
    # Step 1: Extract embeddings
    u_c = U[c]          # (d,)
    w_o = W[o]          # (d,)
    w_negs = W[negs]    # (K, d)

    # Step 2: Compute scores
    # s_pos = 
    # s_negs = 

    # Step 3: Compute sigmoid values
    # sig_pos = 
    # sig_negs = 

    # Step 4: Compute gradients
    # grad_u_c = 
    # grad_w_o = 
    # grad_w_negs = 

    # Step 5: Update embeddings (gradient descent)
    # U[c] -= 
    # W[o] -= 
    # W[negs] -= 
```
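
One NumPy detail is relevant to Step 5 if the same index can appear more than once in `negs`: an in-place update through an index array does not accumulate over repeated indices, while `np.subtract.at` does. The sketch below uses a hypothetical draw with a repeated index and a dummy update of ones. Whether repeated negatives should count twice is a design decision the task statement leaves open.

```python
import numpy as np

negs = np.array([2, 2, 4])             # hypothetical draw with index 2 repeated
step = np.ones((3, 2))                 # dummy per-sample update, shape (K, d)

W_buffered = np.zeros((5, 2))
W_buffered[negs] -= step               # row 2 receives only one of its two updates

W_accum = np.zeros((5, 2))
np.subtract.at(W_accum, negs, step)    # row 2 receives both updates

print(W_buffered[2], W_accum[2])       # [-1. -1.] [-2. -2.]
```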


## Task 7 — Sampling Distribution Analysis [10 points]

Word2Vec draws negative samples from a noise distribution $P_n(w) \propto f(w)^{\alpha}$, where $f(w)$ is the unigram frequency of word $w$ and $\alpha = 3/4$.
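
A minimal sketch of how such a distribution might be built and sampled from is shown below. The word counts are made up, and the use of `numpy.random.Generator.choice` is one possible way to draw the indices, not a claim about how the reference Word2Vec implementation does it.

```python
import numpy as np

counts = np.array([5000, 1200, 300, 40, 5], dtype=float)  # hypothetical word counts
alpha = 0.75

f = counts / counts.sum()          # unigram frequencies f(w)
weights = f ** alpha               # unnormalized f(w)^alpha
p_noise = weights / weights.sum()  # normalized P_n(w)

rng = np.random.default_rng(0)
K = 10
negs = rng.choice(len(counts), size=K, p=p_noise)   # K negative sample indices

print(f)
print(p_noise)
print(negs)
```

Comparing the printed `f` and `p_noise` row by row shows how the exponent reshapes the distribution.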

### 7.1 [5 points]
Let $R(\alpha)$ be the ratio of sampling probabilities between a high-frequency word ($f_h = 0.05$) and a low-frequency word ($f_l = 0.00001$):

$$
R(\alpha) = \frac{P_n(w_h)}{P_n(w_l)} = \left(\frac{f_h}{f_l}\right)^{\alpha}
$$

**Compute** $R(1)$, $R(3/4)$, and $R(0)$.
Show your calculations.
<br><br>

**Answer:**

---

### 7.2 [5 points]
**Prove** that for any $0 < \alpha < 1$ and frequencies $f_1 > f_2 > 0$:

$$
\left(\frac{f_1}{f_2}\right)^{\alpha} < \frac{f_1}{f_2}
$$

**Interpret** this result: Why does using $\alpha = 3/4$ instead of $\alpha = 1$ help with the rare word problem?
<br><br>
**Answer:**
