In [1]:
import numpy as np
import random

# Make sure the utils folder is available next to this notebook
from utils.gradcheck import gradcheck_naive, grad_tests_softmax, grad_tests_negsamp
from utils.utils import normalizeRows, softmax

## Assignment 2: Word2Vec

### Estimated Time: ~10 hours

**Quick note**: This assignment may feel overwhelming. Set aside a significant block of time so that you can work through it slowly. The objective is for you to understand the math behind <code>word2vec</code>, which is a good foundation for understanding other NLP embedding algorithms. We will also implement the math in code to deepen our understanding.

Let’s have a quick refresher on the word2vec algorithm. For full details, you may want to rewatch the zoom video we did in our first two lectures.  

The key insight behind word2vec is that *a word is known by the company it keeps*. Concretely, suppose we have a **center** word $c$ and a contextual window around it. We refer to the words that lie in this contextual window as **outside words**, denoted $o$. For example, in Figure 1 the center word $c$ is *banking*. Since the context window size is 2, the outside words are *turning*, *into*, *crises*, and *as*.

The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O|C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P (O = o|C = c)$, which is the probability that word $o$ is an *outside* word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.

<img src = "img/word2vec.png" width=400>

In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:

$$P (O = o|C = c) = \displaystyle\frac{\exp({u_o^{T} v_c})}{\sum_{w \in V} \exp({u_w^{T} v_c})}$$

Here, $u_o$ is the *outside* vector representing outside word $o$, and $v_c$ is the *center* vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the *outside* vectors $u_w$. The columns of $V$ are all of the *center* vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in \text{Vocabulary}$. (The code in Part 2 stores these vectors as rows, i.e., it works with the transposes of $U$ and $V$.)
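
To make the softmax formula concrete, here is a minimal NumPy sketch that computes $P(O|C = c)$ for a toy vocabulary. The sizes, the random vectors, and the names `U_rows`, `v_c`, and `p_outside_given_c` are assumptions chosen only for illustration; the vectors are stored as rows, as in the code later in this notebook.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)

# One word vector per row (the transpose of the column convention in the text).
U_rows = rng.normal(size=(vocab_size, embedding_dim))  # outside vectors u_w
v_c = rng.normal(size=embedding_dim)                   # one center vector

scores = U_rows @ v_c            # u_w^T v_c for every w, shape (vocab_size,)
scores = scores - scores.max()   # shift for numerical stability; softmax is unchanged
p_outside_given_c = np.exp(scores) / np.exp(scores).sum()

print(p_outside_given_c.shape)   # (5,)
print(p_outside_given_c.sum())   # 1 up to floating-point error
```

Each entry of `p_outside_given_c` is the model's probability that the corresponding word is an outside word for this center word.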

Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:

$$\mathbf{J}_\text{naive-softmax}(v_c, o, U) = -\log P(O=o |C=c)$$

We can view this loss as the cross-entropy between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. 

Furthermore, the $k$th entry in these vectors indicates the conditional probability of the $k$th word being an *outside word* for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and $0$ everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P (O|C = c)$ given by our model in the softmax equation above.

## Part 1:  Math behind word2vec

### Question 1 (1pt)

#### <font color="red">Answer the following questions</font> 

1. What is $U$ and the shape of $U$?
2. What is $V$ and the shape of $V$?
3. What is $u_o$ and the shape of $u_o$?
4. What is $v_c$ and the shape of $v_c$?
5. What is $y$ and the shape of $y$?
6. What is $\hat{y}$ and the shape of $\hat{y}$?
7. What is the numeric range of the softmax function $P(O = o|C = c)$?
8. Why use $\log$ after the softmax function?

Solution:

1. $U$ is the matrix of outside word vectors. With one vector per column, as defined above, its shape is <code>(embedding_dim, vocab_size)</code>; the code in Part 2 stores its transpose, of shape <code>(vocab_size, embedding_dim)</code>
2. $V$ is the matrix of center word vectors. With one vector per column, its shape is <code>(embedding_dim, vocab_size)</code>; the code in Part 2 stores its transpose, of shape <code>(vocab_size, embedding_dim)</code>
3. $u_o$ is the embedding vector of a particular outside word $o$, of shape <code>(embedding_dim, )</code>
4. $v_c$ is the embedding vector of a particular center word $c$, of shape <code>(embedding_dim, )</code> 
5. $y$ is the true distribution, a one-hot vector with 1 for the true outside word $o$ and 0 everywhere else; it has shape <code>(vocab_size, )</code>
6. $\hat{y}$ is the predicted probability distribution $P(O|C = c)$ given by our model (in general it is not one-hot); it has shape <code>(vocab_size, )</code>
7. Each probability lies strictly between 0 and 1, and the probabilities sum to 1 over the vocabulary.
8. $\log$ has several useful properties: (1) it helps numerically, because the product of a large number of small probabilities can easily underflow, which is avoided by computing the sum of the log probabilities instead; (2) it cancels nicely with $\exp$; (3) it is a monotonically increasing function, so applying it does not change the optimum of the objective function.

### Question 2 (1pt)

Show that the naive-softmax loss is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that

$$-\sum_{w \in V}y_w \log(\hat{y}_w) = -\log(\hat{y}_o)$$

#### <font color="red">Write your answer here.</font> 
*(you may need to study latex to write your answers)*

Because $y$ is a one-hot encoded vector with zeros everywhere except at the index $w=o$, where $y_o = 1$, every term of the sum vanishes except the one for $o$:

$$-\sum_{w \in V} y_w \log(\hat{y}_w) = -\left(y_1 \log(\hat{y}_1) + \cdots + y_o \log(\hat{y}_o) + \cdots + y_{|V|} \log (\hat{y}_{|V|})\right) = -\log(\hat{y}_o)$$
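
The identity can also be checked numerically. The following small sketch uses a made-up predicted distribution `y_hat` and a made-up true index `o`; both are assumptions for illustration only.

```python
import numpy as np

y_hat = np.array([0.1, 0.2, 0.6, 0.1])  # a made-up predicted distribution
o = 2                                    # index of the true outside word
y = np.zeros_like(y_hat)
y[o] = 1.0                               # one-hot true distribution

cross_entropy = -np.sum(y * np.log(y_hat))  # full sum over the vocabulary
naive_softmax_loss = -np.log(y_hat[o])      # only the term for the true word

print(np.isclose(cross_entropy, naive_softmax_loss))  # True
```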

### Question 3 (1pt)

Compute the partial derivative of $\mathbf{J}_{\text{naive-softmax}}$ with respect to $v_c$.

#### <font color="red">Write your answer here.</font> 

$$\begin{align*}
\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} &= \frac{\partial}{\partial v_c}[-\log(\hat{y}_o)]\\
&= \frac{\partial}{\partial v_c}[-\log \displaystyle\frac{\exp({u_o^{T} v_c})}{\sum_{w \in V} \exp({u_w^{T} v_c})}]\\
&= -\frac{\partial}{\partial v_c}[\log{\exp({u_o^{T} v_c})} - \log(\sum_{w \in V} \exp({u_w^{T} v_c}))]\\
&= -\frac{\partial}{\partial v_c}[\log{\exp({u_o^{T} v_c})}] + \frac{\partial}{\partial v_c} [\log(\sum_{w \in V} \exp({u_w^{T} v_c}))]\\ 
&= -\frac{\partial}{\partial v_c}[u_o^{T} v_c] + \frac{\partial}{\partial v_c} [\log(\sum_{w \in V} \exp({u_w^{T} v_c}))]\\
&= -u_o + \frac{1}{\sum_{w \in V} \exp({u_w^{T} v_c})}\sum_{x \in V}u_x \cdot \exp({u_x^T v_c})\\
&= -u_o + \sum_{x \in V}\frac{\exp({u_x^T v_c})}{\sum_{w \in V} \exp({u_w^{T} v_c})}u_x\\
&= -u_o + \sum_{x \in V} P(O = x | C = c)\, u_x \\
&= -u_o + \sum_{x \in V} \hat{y}_x u_x
\end{align*}$$

This says that the gradient of the loss function w.r.t. the center word vector is the expected outside word vector under our model, $\sum_{x \in V} \hat{y}_x u_x$, minus the observed outside word vector $u_o$. The gradient is zero exactly when the two agree.
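
As a sanity check on the result $-u_o + \sum_{x \in V} \hat{y}_x u_x$, the sketch below compares it against central finite differences on random toy data. The helper names `softmax_1d` and `loss`, the sizes, and the seed are assumptions for illustration; this is not the provided `utils.softmax` or the gradient checker used later.

```python
import numpy as np

def softmax_1d(z):
    """Softmax of a 1-D score vector (local helper for this sketch)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(v_c, o, U_rows):
    """Naive-softmax loss for center vector v_c and true outside index o."""
    return -np.log(softmax_1d(U_rows @ v_c)[o])

rng = np.random.default_rng(1)
U_rows = rng.normal(size=(6, 4))   # one outside vector per row
v_c = rng.normal(size=4)
o = 3

y_hat = softmax_1d(U_rows @ v_c)
analytic = -U_rows[o] + U_rows.T @ y_hat   # -u_o + sum_x y_hat_x u_x

# Central differences, one coordinate of v_c at a time.
eps = 1e-6
numeric = np.zeros_like(v_c)
for i in range(v_c.size):
    step = np.zeros_like(v_c)
    step[i] = eps
    numeric[i] = (loss(v_c + step, o, U_rows) - loss(v_c - step, o, U_rows)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```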

### Question 4 (1pt)

Compute the partial derivative of $\mathbf{J}_{\text{naive-softmax}}$ with respect to each of the outside word vectors $u_w$'s.  There will be two cases:  when $w = o$ , the true outside word vector, and $w \neq o$ for all other words.

#### <font color="red">Write your answer here.</font> 

**Case 1** - the outside word vector is the true context word vector:

$$\begin{align*}
\frac{\partial J_{\text{naive-softmax}}}{\partial u_{w=o}} &= \frac{\partial}{\partial  u_{w=o}}[-\log(\hat{y}_o)]\\
&= \frac{\partial}{\partial u_{w=o}}[-\log \displaystyle\frac{\exp({u_o^{T} v_c})}{\sum_{w \in V} \exp({u_w^{T} v_c})}]\\
&= -\frac{\partial}{\partial u_{w=o}}[u_o^Tv_c] + \frac{\partial}{\partial u_{w=o}}[\log \sum_{w \in V} \exp({u_w^{T} v_c})]\\
&= -v_c + \frac{1}{\sum_{w \in V} \exp({u_w^{T} v_c})}\left(\exp({u_o^{T} v_c}) \cdot v_c\right)\\
&= -v_c + \frac{\exp({u_o^{T} v_c})}{\sum_{w \in V} \exp({u_w^{T} v_c})}\cdot v_c\\
&= -v_c + \hat{y}_o \cdot v_c = (\hat{y}_o - 1)\, v_c
\end{align*}$$

**Case 2** - the outside word vector is NOT the true context word vector:

$$\begin{align*}
\frac{\partial J_{\text{naive-softmax}}}{\partial u_{w\neq o}} &= \frac{\partial}{\partial  u_{w \neq o}}[-\log(\hat{y}_o)]\\
&= \frac{\partial}{\partial u_{w \neq o}}[-\log \displaystyle\frac{\exp({u_o^{T} v_c})}{\sum_{w \in V} \exp({u_w^{T} v_c})}]\\
&= -\frac{\partial}{\partial u_{w \neq o}}[u_o^T v_c] +  \frac{\partial}{\partial u_{w \neq o}}[\log \sum_{w \in V} \exp{(u_w^T v_c)}]\\
&= 0 + \frac{1}{\sum_{w \in V} \exp({u_w^{T} v_c})}\left(\exp({u_{w \neq o}^{T} v_c}) \cdot v_c\right)\\
&= v_c \cdot \hat{y}_{w \neq o}
\end{align*}$$
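
The two cases can be written as one expression: row $w$ of the gradient is $(\hat{y}_w - y_w)\, v_c$, where $y$ is the one-hot true distribution. The sketch below builds that matrix with an outer product and compares it against central finite differences. The helper names, sizes, and seed are assumptions for illustration, and the outside vectors are stored as rows.

```python
import numpy as np

def softmax_1d(z):
    """Softmax of a 1-D score vector (local helper for this sketch)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(U_rows, v_c, o):
    """Naive-softmax loss as a function of the outside vectors."""
    return -np.log(softmax_1d(U_rows @ v_c)[o])

rng = np.random.default_rng(2)
U_rows = rng.normal(size=(5, 3))   # one outside vector per row
v_c = rng.normal(size=3)
o = 1

y_hat = softmax_1d(U_rows @ v_c)
y = np.zeros_like(y_hat)
y[o] = 1.0

# Row o is (y_hat_o - 1) v_c (Case 1); every other row is y_hat_w v_c (Case 2).
grad_U = np.outer(y_hat - y, v_c)  # shape (5, 3)

# Central-difference check of every entry of grad_U.
eps = 1e-6
numeric = np.zeros_like(U_rows)
for idx in np.ndindex(*U_rows.shape):
    step = np.zeros_like(U_rows)
    step[idx] = eps
    numeric[idx] = (loss(U_rows + step, v_c, o) - loss(U_rows - step, v_c, o)) / (2 * eps)

print(np.allclose(grad_U, numeric, atol=1e-6))
```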

### Question 5 (1pt)

Compute the derivative of the sigmoid function given by 

$$ g(x) = \frac{1}{1+e^{-x}} $$

#### <font color="red">Write your answer here.</font> 

The derivative of sigmoid function is

$$
\begin{aligned}
    \frac{dg}{dx} &= \frac{0(1 + e^{-x}) - (-1)(e^{-x})}{(1 + e^{-x})^2} \\
    &= \frac{e^{-x}}{(1 + e^{-x})^2}  = \frac{e^{-x} + 1 - 1}{(1 + e^{-x})^2} \\
    &= \frac{1}{(1 + e^{-x})} - \frac{1}{(1 + e^{-x})^2} \\
    &= \frac{1}{(1 + e^{-x})} \big(1 - \frac{1}{(1 + e^{-x})}\big)\\
    &= g(1 - g)
\end{aligned}
$$
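
A quick numerical check of $g'(x) = g(x)(1 - g(x))$, comparing the closed form against central finite differences on a few sample points. The sample grid and step size are arbitrary choices for this sketch.

```python
import numpy as np

def g(x):
    """Sigmoid function."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)                      # a few sample points
eps = 1e-6
numeric = (g(x + eps) - g(x - eps)) / (2 * eps)    # central difference
analytic = g(x) * (1.0 - g(x))                     # closed form derived above

print(np.allclose(numeric, analytic, atol=1e-8))
```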

### Question 6 (1pt)

Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1,w_2,\cdots,w_K$ and their outside vectors as $u_1,\cdots,u_K$. For this question, assume that the $K$ negative samples are distinct. In other words, $i \neq j$ implies $w_i \neq w_j$ for $i,j \in \{1,\cdots,K\}$. Note that $o \notin \{w_1,\cdots,w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

$$\mathbf{J}_{\text{neg-sample}}(v_c, o, U) = -\log(\sigma(u_o^Tv_c)) - \sum_{k=1}^K\log(\sigma(-u_k^Tv_c))$$

Compute the partial derivatives of $\mathbf{J}_{\text{neg-sample}}$ with respect to $v_c, u_o, \text{ and } u_k$.  Please write your answers in terms of the vectors $u_o, v_c, \text{and } u_k$.  

After this, explain with one sentence why this loss function is much more efficient to compute than the naive-softmax loss.

#### <font color="red">Write your answer here.</font> 

(1) **w.r.t. $v_c$**:

$$\frac{\partial \mathbf{J}_{\text{neg-sample}}}{\partial v_c} = \frac{\partial}{\partial v_c}[-\log(\sigma(u_o^Tv_c))] - \frac{\partial}{\partial v_c}[\sum_{k=1}^K \log(\sigma(-u_k^Tv_c))]$$

For the first term, let $S = \sigma(u_o^Tv_c)$:

$$\begin{align*}
\frac{\partial}{\partial v_c}[-\log(\sigma(u_o^Tv_c))]  &= \frac{\partial [-\log(S)]}{\partial S} \cdot \frac{\partial S}{\partial v_c}\\
&= (-\frac{1}{\sigma(u_o^Tv_c)})(u_o\sigma(u_o^Tv_c)(1 - \sigma(u_o^Tv_c)))\\
&= -u_o(1 - \sigma(u_o^Tv_c))
\end{align*}$$

For the second term, let $S = \sigma(-u_k^Tv_c)$:

$$\begin{align*}
\frac{\partial}{\partial v_c}[\sum_{k=1}^K \log(\sigma(-u_k^Tv_c))] &= \sum_{k=1}^{K} \frac{\partial [\log(S)]}{\partial S} \cdot \frac{\partial S}{\partial v_c}\\
&= \sum_{k=1}^{K} \frac{-u_k\sigma(-u_k^Tv_c)(1 - \sigma(-u_k^Tv_c))}{\sigma(-u_k^Tv_c)}\\
&= -\sum_{k=1}^{K} u_k(1 - \sigma(-u_k^Tv_c))
\end{align*}$$

Combining the two answers, we get:

$$ \frac{\partial \mathbf{J}_{\text{neg-sample}}}{\partial v_c} = -u_o(1 - \sigma(u_o^Tv_c)) + \sum_{k=1}^{K} u_k(1 - \sigma(-u_k^Tv_c))$$

(2) **w.r.t. $u_o$**:

$$\begin{align*}
\frac{\partial \mathbf{J}_{\text{neg-sample}}}{\partial u_o} &= \frac{\partial}{\partial u_o}[-\log(\sigma(u_o^Tv_c))] - \frac{\partial}{\partial u_o}[\sum_{k=1}^K \log(\sigma(-u_k^Tv_c))]\\
&= \frac{\partial}{\partial u_o}[-\log(\sigma(u_o^Tv_c))] - 0\\
&= -[\frac{1}{\sigma(u_o^Tv_c)}][\sigma(u_o^Tv_c)(1 - \sigma(u_o^Tv_c))v_c]\\
&= -v_c(1 - \sigma(u_o^Tv_c))
\end{align*}$$

(3) **w.r.t. $u_k$**:

$$\begin{align*}
\frac{\partial \mathbf{J}_{\text{neg-sample}}}{\partial u_k} &= \frac{\partial}{\partial u_k}[-\log(\sigma(u_o^Tv_c))] - \frac{\partial}{\partial u_k}[\sum_{j=1}^K \log(\sigma(-u_j^Tv_c))]\\
&= 0 - \frac{\partial}{\partial u_k}[\log(\sigma(-u_k^Tv_c))]\\
&= -\frac{-v_c\sigma(-u_k^Tv_c)(1 - \sigma(-u_k^Tv_c))}{\sigma(-u_k^Tv_c)}\\
&= v_c(1 - \sigma(-u_k^Tv_c))
\end{align*}$$

This loss function is much more efficient because it only touches the true outside vector and the $K$ negative samples, which costs $O(K)$ dot products, whereas the naive softmax has to normalize over all word vectors in the vocabulary, which costs $O(|V|)$.
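
The negative-sampling gradients can be checked the same way. The sketch below writes the gradient with respect to $v_c$ as $-(1 - \sigma(u_o^Tv_c))\,u_o + \sum_k (1 - \sigma(-u_k^Tv_c))\,u_k$, together with the gradients for $u_o$ and $u_k$, and compares the first against central finite differences. The names `sigma`, `neg_sample_loss`, and `U_neg`, the sizes, and the seed are assumptions for illustration, with $K = 3$ distinct negative vectors stored as rows.

```python
import numpy as np

def sigma(x):
    """Sigmoid function (local helper for this sketch)."""
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(v_c, u_o, U_neg):
    """Loss for one (center, outside) pair; negative vectors are the rows of U_neg."""
    return -np.log(sigma(u_o @ v_c)) - np.sum(np.log(sigma(-U_neg @ v_c)))

rng = np.random.default_rng(3)
v_c = rng.normal(size=4)
u_o = rng.normal(size=4)
U_neg = rng.normal(size=(3, 4))    # K = 3 distinct negative samples

pos = 1.0 - sigma(u_o @ v_c)       # scalar
neg = 1.0 - sigma(-U_neg @ v_c)    # shape (K,)

grad_v_c = -pos * u_o + U_neg.T @ neg   # shape (4,)
grad_u_o = -pos * v_c                   # shape (4,)
grad_U_neg = np.outer(neg, v_c)         # row k is v_c * (1 - sigma(-u_k^T v_c))

# Central differences with respect to v_c.
eps = 1e-6
numeric_v_c = np.zeros_like(v_c)
for i in range(v_c.size):
    step = np.zeros_like(v_c)
    step[i] = eps
    numeric_v_c[i] = (neg_sample_loss(v_c + step, u_o, U_neg)
                      - neg_sample_loss(v_c - step, u_o, U_neg)) / (2 * eps)

print(np.allclose(grad_v_c, numeric_v_c, atol=1e-6))
```

Note that only $K + 1$ dot products are needed here, regardless of the vocabulary size.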

### Question 7 (1pt)

Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \cdots, w_{t-1}, w_t, w_{t+1}, \cdots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:

$$\mathbf{J}_{\text{skip-gram}}(v_c, w_{t-m}, \cdots, w_{t+m}, U) = \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \mathbf{J} (v_c, w_{t+j}, U)$$

Here, $\mathbf{J}(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c=w_t$ and outside word $w_{t+j}$.  $\mathbf{J}(v_c, w_{t+j}, U)$ could be $\mathbf{J}_{\text{naive-softmax}}$ or $\mathbf{J}_{\text{neg-sample}}$ depending on your implementation.

Write down three partial derivatives:

- (i) $\displaystyle\frac{\partial {\mathbf{J}_{\text{skip-gram}}} (v_c, w_{t-m}, \cdots w_{t+m}, U)}{\partial U}$
- (ii) $\displaystyle\frac{\partial {\mathbf{J}_{\text{skip-gram}}} (v_c, w_{t-m}, \cdots w_{t+m}, U)}{\partial v_c}$
- (iii) $\displaystyle\frac{\partial {\mathbf{J}_{\text{skip-gram}}} (v_c, w_{t-m}, \cdots w_{t+m}, U)}{\partial v_w} \text{ where } w \neq c$

Write your answers in terms of $\partial \mathbf{J}(v_c, w_{t+j}, U)/\partial U$ and $\partial \mathbf{J}(v_c, w_{t+j}, U)/\partial v_c$.  This is very simple, so don't overthink it; each solution should be one line.  Writing these out will make things clearer when you implement them.

#### <font color="red">Write your answer here.</font> 

- (i) $\displaystyle\sum_{-m \leq j \leq m, j \neq 0} \partial \mathbf{J}(v_c, w_{t+j}, U)/\partial U$
- (ii) $\displaystyle\sum_{-m \leq j \leq m, j \neq 0} \partial \mathbf{J}(v_c, w_{t+j}, U)/\partial v_c$
- (iii) 0

## Part 2:  Code

Now that you are done with the math, you are ready to implement <code>word2vec</code>!  Please complete the implementation below.

### Question 1 Implement the sigmoid function (1pt)

This should be fairly easy.  Recall that sigmoid function is given by:

$$ g(x) = \frac{1}{1+e^{-x}} $$

In [2]:
def sigmoid(x):
    """
    Compute the sigmoid function for the input here.
    Arguments:
    x -- A scalar or numpy array.
    Return:
    s -- sigmoid(x)
    """

    # ------------------
    # Write your implementation here (~1 line).
    
    s = 1 / (1 + np.exp(-x))
    
    # ------------------

    return s

In [3]:
def test_sigmoid():
    """ Test sigmoid function """
    print("=== Sanity check for sigmoid ===")
    assert sigmoid(0) == 0.5
    assert np.allclose(sigmoid(np.array([0])), np.array([0.5]))
    assert np.allclose(sigmoid(np.array([1,2,3])), np.array([0.73105858, 0.88079708, 0.95257413]))
    print("Tests for sigmoid passed!")

In [4]:
test_sigmoid() #turn on when you are ready to test

=== Sanity check for sigmoid ===
Tests for sigmoid passed!


### Question 2 Implement the gradient computation of naive softmax (1pt)

This function returns the loss and its gradients with respect to $v_c$ and $U$.

1. For **loss**, recall that the loss is given by 

$$\mathbf{J}_\text{naive-softmax}(v_c, o, U) = -\log P(O=o |C=c)$$

where 

$$P (O = o|C = c) = \displaystyle\frac{\exp({u_o^{T} v_c})}{\sum_{w \in V} \exp({u_w^{T} v_c})}$$

*Implementation consideration* - use a matrix-vector product to avoid an unnecessary <code>for</code> loop, i.e., <code>yhat = softmax(outsideVectors @ centerWordVec)</code> computes the dot product of every outside word vector with the center word vector and turns the scores into probabilities.  To get the loss for a specific $u_o$, index the softmax output with <code>outsideWordIdx</code>.  We have provided the softmax function, so please use it. Last, make sure that the loss is a single scalar value, not an array.

2. For **gradient with respect to $v_c$**, the gradient that you have calculated should be something like this:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial v_c} = -u_o + \sum_{x \in V} \hat{y}_x u_x$$

*Implementation consideration* - since the shape of $v_c$ is <code>(embedding_dim, )</code>, its gradient has the same shape.  If you are struggling, it should look something like <code>-trueOutsideVec + np.sum(outsideVectors * y_hat, axis=0)</code>, where <code>trueOutsideVec</code> is simply <code>outsideVectors[outsideWordIdx]</code> and <code>y_hat</code> has shape <code>(vocab_size, 1)</code> so that it broadcasts across the rows.

3. For **gradient with respect to $U$**, the gradient for true outside vector that you have calculated should be something like this:

$$\frac{\partial J_{\text{naive-softmax}}}{\partial u_{w=o}} = -v_c + \hat{y}_o \cdot v_c$$

For every other outside vector ($w \neq o$), the gradient is similar:

$$v_c \cdot \hat{y}_{w \neq o}$$

*Implementation consideration* - the equations above are for a single outside word, but one outer product handles all words at once, i.e., <code>gradOutsideVecs = np.dot(y_hat, centerWordVec[:, np.newaxis].T)</code>, with <code>y_hat</code> of shape <code>(vocab_size, 1)</code>, gives the gradient for every word except the true outside word.  Subtracting the center vector from that row, <code>gradOutsideVecs[outsideWordIdx] -= centerWordVec</code>, then gives the gradient for the true outside word vector.  As above, since the shape of $U$ in the code is <code>(vocab_size, embedding_dim)</code>, its gradient has the same shape.

Last, you can run <code>test_naiveSoftmaxLossAndGradient()</code> to see whether your work can pass the test.  Note that gradient checking is a sanity test that only checks whether the gradient and loss values produced by your implementation are consistent with each other. Gradient check passing on its own doesn’t guarantee that you have the correct gradients. It will pass, for example, if both the loss and gradient values produced by your implementation are 0s.
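
To illustrate what such a check does, and why all-zero outputs pass it, here is a minimal central-difference checker. It is a sketch written for this explanation, not the provided `gradcheck_naive`; the function name `numeric_grad_check`, its tolerance, and the three toy functions are assumptions.

```python
import numpy as np

def numeric_grad_check(f, x, eps=1e-4, tol=1e-5):
    """Return True if the gradient from f(x) -> (loss, grad) matches central differences."""
    _, grad = f(x)
    x = x.astype(float).copy()          # work on a copy so the caller's array is untouched
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        loss_plus, _ = f(x)
        x[idx] = old - eps
        loss_minus, _ = f(x)
        x[idx] = old                    # restore the entry before moving on
        numeric = (loss_plus - loss_minus) / (2 * eps)
        rel_err = abs(numeric - grad[idx]) / max(1.0, abs(numeric), abs(grad[idx]))
        if rel_err > tol:
            return False
    return True

def quadratic(x):
    return float(np.sum(x ** 2)), 2 * x           # correct gradient

def wrong(x):
    return float(np.sum(x ** 2)), x               # gradient off by a factor of 2

def all_zero(x):
    return 0.0, np.zeros_like(x)                  # consistent with itself, but useless

x0 = np.array([[1.0, -2.0], [0.5, 3.0]])
print(numeric_grad_check(quadratic, x0))  # True
print(numeric_grad_check(wrong, x0))      # False
print(numeric_grad_check(all_zero, x0))   # True
```

The last case shows the limitation: the check only confirms that the loss and the gradient agree with each other, not that either one is the quantity you meant to compute.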

In [5]:
def naiveSoftmaxLossAndGradient(
    centerWordVec,
    outsideWordIdx,
    outsideVectors,
    dataset
):
    """ Naive Softmax loss & gradient function for word2vec models

    Implement the naive softmax loss and gradients between a center word's 
    embedding and an outside word's embedding. This will be the building block
    for our word2vec models. For those unfamiliar with numpy notation, note 
    that a numpy ndarray with a shape of (x, ) is a one-dimensional array, which
    you can effectively treat as a vector with length x.

    Arguments:
    centerWordVec -- numpy ndarray, center word's embedding
                    in shape (embedding_dim, )
                    (v_c in our part 1)
    outsideWordIdx -- integer, the index of the outside word
                    (o of u_o in our part 1)
    outsideVectors -- numpy ndarray, outside vectors (as rows)
                    in shape (vocab_size, embedding_dim) 
                    for all words in vocab (transpose of U in our part 1)
    dataset -- needed for negative sampling, unused here.

    Return:
    loss -- naive softmax loss
    gradCenterVec -- the gradient with respect to the center word vector
                     in shape (embedding_dim, )
                     (dJ / dv_c in part 1)
    gradOutsideVecs -- the gradient with respect to all the outside word vectors
                    in shape (vocab_size, embedding_dim) 
                    (dJ / dU)
    """
    
    # ------------------
    # Write your implementation here.
    ### Please use the provided softmax function
    
    
    scores = outsideVectors @ centerWordVec  # U^T v_c : [vocab_size, embed] @ [embed, ] = [vocab_size, ]
    y_hat = softmax(scores)[:, np.newaxis]  # y_hat: [vocab_size, 1]
    loss = float(-np.log(y_hat[outsideWordIdx, 0]))  # naive-softmax loss: index a single element so float() receives a scalar

    #dJ / dv_c : [embed, ]
    trueOutsideVec = outsideVectors[outsideWordIdx] #trueOutsideVec:  [embed, ]
    gradCenterVec = -trueOutsideVec + np.sum(outsideVectors * y_hat, axis=0) # [embed, ] + [embed, ] = [embed, ]  (refer to broadcasting rule if you don't understand here)

    #dJ / dU : [vocab_size, embed]
    gradOutsideVecs = np.dot(y_hat, centerWordVec[:, np.newaxis].T) # y_hat @ centerWordVec : [vocab_size, 1] @ [1, embed] = [vocab_size, embed]
    gradOutsideVecs[outsideWordIdx] -= centerWordVec #[vocab_size, embed]
    
    
    # ------------------

    return loss, gradCenterVec, gradOutsideVecs

In [6]:
def test_naiveSoftmaxLossAndGradient():
    """ Test naiveSoftmaxLossAndGradient """
    dataset, dummy_vectors, dummy_tokens = getDummyObjects()

    print("==== Gradient check for naiveSoftmaxLossAndGradient ====")
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = naiveSoftmaxLossAndGradient(vec, 1, dummy_vectors, dataset)
        return loss, gradCenterVec
    gradcheck_naive(temp, np.random.randn(3), "naiveSoftmaxLossAndGradient gradCenterVec")

    centerVec = np.random.randn(3)
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = naiveSoftmaxLossAndGradient(centerVec, 1, vec, dataset)
        return loss, gradOutsideVecs
    gradcheck_naive(temp, dummy_vectors, "naiveSoftmaxLossAndGradient gradOutsideVecs")

In [7]:
def getDummyObjects():
    """ Helper method for naiveSoftmaxLossAndGradient and negSamplingLossAndGradient tests """

    def dummySampleTokenIdx():
        return random.randint(0, 4)

    def getRandomContext(C):
        tokens = ["a", "b", "c", "d", "e"]
        return tokens[random.randint(0,4)], \
            [tokens[random.randint(0,4)] for i in range(2*C)]

    dataset = type('dummy', (), {})()
    dataset.sampleTokenIdx = dummySampleTokenIdx
    dataset.getRandomContext = getRandomContext

    random.seed(31415)
    np.random.seed(9265)
    dummy_vectors = normalizeRows(np.random.randn(10,3))
    dummy_tokens = dict([("a",0), ("b",1), ("c",2),("d",3),("e",4)])

    return dataset, dummy_vectors, dummy_tokens

In [8]:
test_naiveSoftmaxLossAndGradient() #turn on when you are ready to test

==== Gradient check for naiveSoftmaxLossAndGradient ====
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.


### Question 3 Implement the gradient computation using negative sampling loss (1pt)

1. For **loss**, recall that the negative sampling loss is

$$\mathbf{J}_{\text{neg-sample}}(v_c, o, U) = -\log(\sigma(u_o^Tv_c)) - \sum_{k=1}^K\log(\sigma(-u_k^Tv_c))$$

*Coding implementation*:  a list of indices is given, where the first index belongs to the true outside word and the remaining $K$ indices belong to the negative samples.  For negative sampling, we have provided the function <code>getNegativeSamples</code>, so please use it.  One good way to proceed is to first calculate the dot product between the center word vector and all outside word vectors at the selected indices, like this: <code>scores = (outsideVectors[indices] @ centerWordVec)[:, np.newaxis]</code>.  Then, for the first term of the equation, use <code>scores[0]</code>, and for the second term, use <code>-scores[1:]</code>.  The rest should be easy: apply the already implemented <code>sigmoid</code> function, then <code>np.log</code> and <code>np.sum</code> accordingly.  As a final reminder, the loss is a single scalar value.

2. For **gradient with respect to $v_c$**, the gradient that you have calculated should be something like this:

$$ \frac{\partial \mathbf{J}_{\text{neg-sample}}}{\partial v_c} = -u_o(1 - \sigma(u_o^Tv_c)) + \sum_{k=1}^{K} u_k(1 - \sigma(-u_k^Tv_c))$$

*Coding implementation*: for the first term of the equation, you may want to use <code>outsideVectors[outsideWordIdx]</code>, and for the second term, use <code>outsideVectors[negSampleWordIndices]</code>.  Other than that, this should be fairly simple.  Remember that the output shape is <code>(embedding_dim, )</code>

3. For **gradient with respect to $U$**, there are two parts. The gradient for the true outside vector that you have calculated should be something like this:

$$\frac{\partial \mathbf{J}_{\text{neg-sample}}}{\partial u_o} = -v_c(1 - \sigma(u_o^Tv_c))$$

The gradient for each negative sample vector that you have calculated should be something like this:

$$\frac{\partial \mathbf{J}_{\text{neg-sample}}}{\partial u_k} = v_c(1 - \sigma(-u_k^Tv_c))$$

*Coding implementation*:  Both gradients are simple to implement using the indexing approach from before.  There is one technicality: the same word may be negatively sampled multiple times. For example, if an outside word is sampled twice, you have to count the gradient with respect to this word twice; three times if it was sampled three times, and so forth.  A good way to do this is to first count the occurrences of the indices like this: <code>indexCount = np.bincount(indices)[:, np.newaxis]</code>, then loop through all distinct indices and multiply each gradient by its number of occurrences like this: <code>for i in np.unique(indices): gradOutsideVecs[i] *= indexCount[i]</code>
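
The counting step is needed because NumPy's fancy-indexed `+=` does not accumulate over repeated indices. The sketch below shows this on made-up data and compares the count-and-multiply approach with `np.add.at`, which does accumulate. The arrays `buffered`, `accumulated`, and `update` are hypothetical names for illustration.

```python
import numpy as np

indices = [3, 1, 3, 3]                     # word 3 was drawn three times
update = np.ones((len(indices), 2))        # identical per-occurrence updates

buffered = np.zeros((5, 2))
buffered[indices] += update                # repeated indices are applied only once
print(buffered[3])                         # [1. 1.]

accumulated = np.zeros((5, 2))
np.add.at(accumulated, indices, update)    # unbuffered: every occurrence is added
print(accumulated[3])                      # [3. 3.]

# Count-and-multiply gives the same result when the updates for a repeated
# index are identical, which holds for repeated negative samples.
counts = np.bincount(indices, minlength=5)[:, np.newaxis]
print(np.allclose(buffered * counts, accumulated))  # True
```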

In [9]:
#we have provided the function for getting negative samples
def getNegativeSamples(outsideWordIdx, dataset, K):
    """ Samples K indexes which are not the outsideWordIdx """

    negSampleWordIndices = [None] * K
    for k in range(K):
        newidx = dataset.sampleTokenIdx()
        while newidx == outsideWordIdx:
            newidx = dataset.sampleTokenIdx()
        negSampleWordIndices[k] = newidx
    return negSampleWordIndices #[K, ]

In [10]:
def negSamplingLossAndGradient(
    centerWordVec,
    outsideWordIdx,
    outsideVectors,
    dataset,
    K=10
):
    """ Negative sampling loss function for word2vec models

    Implement the negative sampling loss and gradients for a centerWordVec
    and the outside word given by outsideWordIdx, as a building block for
    word2vec models. K is the number of negative samples to take.

    Note: The same word may be negatively sampled multiple times. For
    example if an outside word is sampled twice, you shall have to
    double count the gradient with respect to this word. Thrice if
    it was sampled three times, and so forth.

    Arguments/Return Specifications: same as naiveSoftmaxLossAndGradient
    """

    # Negative sampling of words is done for you.
    negSampleWordIndices = getNegativeSamples(outsideWordIdx, dataset, K)
    indices = [outsideWordIdx] + negSampleWordIndices

    # ------------------
    # Write your implementation here
    
    scores = (outsideVectors[indices] @ centerWordVec)[:, np.newaxis]  # [K + 1, embed] @ [embed, ] = [K + 1, ] => (newaxis) => [K + 1, 1]
    trueOutsideWordScore = scores[0, 0]  # scalar
    probFromCorpus = sigmoid(trueOutsideWordScore)  # scalar between 0 and 1
    negSampleScores = -scores[1:]  # [K, 1]
    probNotFromCorpus = sigmoid(negSampleScores)  # [K, 1], each entry between 0 and 1

    loss = float(-np.log(probFromCorpus) -
                 np.sum(np.log(probNotFromCorpus)))  # scalar
    
    #dJ / dv_c : [embed, ]
    gradFromCorpus = -outsideVectors[outsideWordIdx] * (1 - probFromCorpus) #[embed, ] * (1 - [1, ]) => #[embed, ] using broadcasting rule
    gradNotFromCorpus = np.sum(
        outsideVectors[negSampleWordIndices] * (1 - probNotFromCorpus), axis=0) #sum ( [K, embed] * (1 - [K, 1])  , axis = 0) => [embed, ]
    gradCenterVec = gradFromCorpus + gradNotFromCorpus # [embed, ] + [embed, ] = [embed, ]
    
    #dJ / dU : [vocab_size, embed]
    gradOutsideVecs = np.zeros(outsideVectors.shape) #[vocab_size, embed]
    gradOutsideVecs[outsideWordIdx] = -centerWordVec * (1 - probFromCorpus) #[embed, ] * scalar = [embed, ]
    gradOutsideVecs[negSampleWordIndices] += centerWordVec * (1 - probNotFromCorpus) # [K, embed] + ([embed, ] * [K, 1])  => [K, embed] + [K, embed]  => [K, embed] adding all gradients of the negative samples
    
    # Factor in repeatedly drawn negative samples.
    indexCount = np.bincount(indices)[:, np.newaxis]  #count occurrences
    for i in np.unique(indices):
        gradOutsideVecs[i] *= indexCount[i]  # multiply the gradient by the number of occurrences

    # ------------------

    return loss, gradCenterVec, gradOutsideVecs

In [11]:
def test_negSamplingLossAndGradient():
    """ Test negSamplingLossAndGradient """
    dataset, dummy_vectors, dummy_tokens = getDummyObjects()

    print("==== Gradient check for negSamplingLossAndGradient ====")
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = negSamplingLossAndGradient(vec, 1, dummy_vectors, dataset)
        return loss, gradCenterVec
    gradcheck_naive(temp, np.random.randn(3), "negSamplingLossAndGradient gradCenterVec")

    centerVec = np.random.randn(3)
    def temp(vec):
        loss, gradCenterVec, gradOutsideVecs = negSamplingLossAndGradient(centerVec, 1, vec, dataset)
        return loss, gradOutsideVecs
    gradcheck_naive(temp, dummy_vectors, "negSamplingLossAndGradient gradOutsideVecs")

In [12]:
test_negSamplingLossAndGradient() #turn on when you are ready to test

==== Gradient check for negSamplingLossAndGradient ====
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.


### Question 4 Implement the skip-gram model (5pts)

In [13]:
def skipgram(currentCenterWord, windowSize, outsideWords, word2Ind,
             centerWordVectors, outsideVectors, dataset,
             word2vecLossAndGradient=naiveSoftmaxLossAndGradient):
    """ Skip-gram model in word2vec

    Implement the skip-gram model in this function.

    Arguments:
    currentCenterWord -- a string of the current center word
    windowSize -- integer, context window size
    outsideWords -- list of no more than 2*windowSize strings, the outside words
    word2Ind -- a dictionary that maps words to their indices in
              the word vector list
    centerWordVectors -- center word vectors (as rows) is in shape 
                        (num words in vocab, word vector length) 
                        for all words in vocab (V in pdf handout)
    outsideVectors -- outside vectors is in shape 
                        (num words in vocab, word vector length) 
                        for all words in vocab (transpose of U in the pdf handout)
    word2vecLossAndGradient -- the loss and gradient function for
                               a prediction vector given the outsideWordIdx
                               word vectors, could be one of the two
                               loss functions you implemented above.

    Return:
    loss -- the loss function value for the skip-gram model
            (J in the pdf handout)
    gradCenterVecs -- the gradient with respect to the center word vectors
                     in shape (num words in vocab, word vector length);
                     only the row of the current center word is nonzero
                     (dJ / dv_c in the pdf handout)
    gradOutsideVectors -- the gradient with respect to all the outside word vectors
                    in shape (num words in vocab, word vector length) 
                    (dJ / dU)
    """

    loss = 0.0
    gradCenterVecs = np.zeros(centerWordVectors.shape)
    gradOutsideVectors = np.zeros(outsideVectors.shape)

    # ------------------
    # Write your implementation here
    
    
    currCenterWordIdx = word2Ind[currentCenterWord]
    centerWordVec = centerWordVectors[currCenterWordIdx]

    for outsideWord in outsideWords:
        outsideWordIdx = word2Ind[outsideWord]
        currLoss, currGradCenter, currGradOutside = word2vecLossAndGradient(
            centerWordVec, outsideWordIdx, outsideVectors, dataset)
        loss += currLoss
        # Only the current center word's row receives a gradient; all other rows stay zero.
        gradCenterVecs[currCenterWordIdx] += currGradCenter
        gradOutsideVectors += currGradOutside
    

    # ------------------
    
    return loss, gradCenterVecs, gradOutsideVectors

In [14]:
def test_skipgram():
    """ Test skip-gram with naiveSoftmaxLossAndGradient """
    dataset, dummy_vectors, dummy_tokens = getDummyObjects()

    print("==== Gradient check for skip-gram with naiveSoftmaxLossAndGradient ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, naiveSoftmaxLossAndGradient),
        dummy_vectors, "naiveSoftmaxLossAndGradient Gradient")
    grad_tests_softmax(skipgram, dummy_tokens, dummy_vectors, dataset)

    print("==== Gradient check for skip-gram with negSamplingLossAndGradient ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, negSamplingLossAndGradient),
        dummy_vectors, "negSamplingLossAndGradient Gradient")
    grad_tests_negsamp(skipgram, dummy_tokens, dummy_vectors, dataset, negSamplingLossAndGradient)

In [15]:
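def word2vec_sgd_wrapper(word2vecModel, word2Ind, wordVectors, dataset,
                         windowSize,
                         word2vecLossAndGradient=naiveSoftmaxLossAndGradient):
    """ Average the loss and gradient of word2vecModel over a batch of random contexts.

    wordVectors stacks the center word vectors (first half of the rows)
    on top of the outside word vectors (second half of the rows).
    """
    batchsize = 50
    loss = 0.0
    grad = np.zeros(wordVectors.shape)
    N = wordVectors.shape[0]
    half = N // 2
    centerWordVectors = wordVectors[:half, :]
    outsideVectors = wordVectors[half:, :]
    for i in range(batchsize):
        # Sample a window size, then a center word with its context words.
        windowSize1 = random.randint(1, windowSize)
        centerWord, context = dataset.getRandomContext(windowSize1)

        c, gin, gout = word2vecModel(
            centerWord, windowSize1, context, word2Ind, centerWordVectors,
            outsideVectors, dataset, word2vecLossAndGradient
        )
        # Accumulate batch averages; gin and gout fill the two halves of grad.
        loss += c / batchsize
        grad[:half, :] += gin / batchsize
        grad[half:, :] += gout / batchsize

    return loss, grad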
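
The wrapper relies on one stacked matrix holding both sets of vectors. A small sketch of just that bookkeeping, with made-up numbers and constant stand-in gradients in place of a real model call, may make the row layout easier to follow. The values of `gin` and `gout` here are assumptions for illustration, not outputs of `skipgram`.

```python
import numpy as np

# Made-up stacked matrix: 5 center rows on top of 5 outside rows.
wordVectors = np.arange(30, dtype=float).reshape(10, 3)
N = wordVectors.shape[0]
half = N // 2

# Basic slices are views into wordVectors, not copies.
centerWordVectors = wordVectors[:half, :]
outsideVectors = wordVectors[half:, :]

# Stand-in per-sample gradients (a real run would get these from the model).
gin = np.ones_like(centerWordVectors)
gout = 2 * np.ones_like(outsideVectors)

batchsize = 50
grad = np.zeros_like(wordVectors)
for _ in range(batchsize):
    # Each sample contributes 1/batchsize of its gradient to the matching half.
    grad[:half, :] += gin / batchsize
    grad[half:, :] += gout / batchsize

print(grad.shape)
```

After the loop, the top half of `grad` holds the batch-averaged center gradient and the bottom half the batch-averaged outside gradient, in the same row order as `wordVectors`.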

In [16]:
test_skipgram()  # run this once your skipgram implementation is ready to test

==== Gradient check for skip-gram with naiveSoftmaxLossAndGradient ====
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.
The first test passed!
The second test passed!
The third test passed!
All 3 tests passed!
==== Gradient check for skip-gram with negSamplingLossAndGradient ====
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.
The first test passed!
The second test passed!
The third test passed!
All 3 tests passed!
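
If you are curious what a gradient check does in principle, the following is a minimal sketch of a central-difference check. It is not the code in `utils.gradcheck`; the function name `numeric_gradient`, the step size `h`, and the test function are all assumptions chosen for illustration. The idea is to nudge each parameter up and down by a small `h`, measure the change in the loss, and compare that slope to the analytic gradient.

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Hypothetical helper: central-difference estimate of df/dx for scalar-valued f."""
    grad = np.zeros_like(x)
    for ix in np.ndindex(x.shape):
        old = x[ix]
        x[ix] = old + h
        f_plus = f(x)
        x[ix] = old - h
        f_minus = f(x)
        x[ix] = old                      # restore the original value
        grad[ix] = (f_plus - f_minus) / (2 * h)
    return grad

# Deterministic toy function with a known gradient: f(x) = sum(x^2), df/dx = 2x.
def f(x):
    return np.sum(x ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
analytic = 2 * x
numeric = numeric_gradient(f, x)
max_diff = np.max(np.abs(numeric - analytic))
print(max_diff < 1e-6)
```

A check like this only makes sense when `f` returns the same loss for the same input. A function that samples random contexts, such as `word2vec_sgd_wrapper`, would need its random state reset before every evaluation for the finite differences to be meaningful.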


In [17]:
def test_word2vec():
    """ Test the two word2vec implementations, before running on Stanford Sentiment Treebank """
    dataset = type('dummy', (), {})()

    def dummySampleTokenIdx():
        return random.randint(0, 4)

    def getRandomContext(C):
        tokens = ["a", "b", "c", "d", "e"]
        return tokens[random.randint(0, 4)], \
            [tokens[random.randint(0, 4)] for i in range(2*C)]
    dataset.sampleTokenIdx = dummySampleTokenIdx
    dataset.getRandomContext = getRandomContext

    random.seed(31415)
    np.random.seed(9265)
    dummy_vectors = normalizeRows(np.random.randn(10, 3))
    dummy_tokens = dict([("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)])

    print("==== Gradient check for skip-gram with naiveSoftmaxLossAndGradient ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, naiveSoftmaxLossAndGradient),
        dummy_vectors, "naiveSoftmaxLossAndGradient Gradient")

    print("==== Gradient check for skip-gram with negSamplingLossAndGradient ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
        skipgram, dummy_tokens, vec, dataset, 5, negSamplingLossAndGradient),
        dummy_vectors, "negSamplingLossAndGradient Gradient")

    print("\n=== Results ===")
    print("Skip-Gram with naiveSoftmaxLossAndGradient")

    print("Your Result:")
    print("Loss: {}\nGradient wrt Center Vectors (dJ/dV):\n {}\nGradient wrt Outside Vectors (dJ/dU):\n {}\n".format(
        *skipgram("c", 3, ["a", "b", "e", "d", "b", "c"],
                  dummy_tokens, dummy_vectors[:5, :], dummy_vectors[5:, :], dataset)
    )
    )

    print("Expected Result: Value should approximate these:")
    print("""Loss: 11.16610900153398
Gradient wrt Center Vectors (dJ/dV):
 [[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [-1.26947339 -1.36873189  2.45158957]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]
Gradient wrt Outside Vectors (dJ/dU):
 [[-0.41045956  0.18834851  1.43272264]
 [ 0.38202831 -0.17530219 -1.33348241]
 [ 0.07009355 -0.03216399 -0.24466386]
 [ 0.09472154 -0.04346509 -0.33062865]
 [-0.13638384  0.06258276  0.47605228]]
    """)

    print("Skip-Gram with negSamplingLossAndGradient")
    print("Your Result:")
    print("Loss: {}\nGradient wrt Center Vectors (dJ/dV):\n {}\n Gradient wrt Outside Vectors (dJ/dU):\n {}\n".format(
        *skipgram("c", 1, ["a", "b"], dummy_tokens, dummy_vectors[:5, :],
                  dummy_vectors[5:, :], dataset, negSamplingLossAndGradient)
    )
    )
    print("Expected Result: Value should approximate these:")
    print("""Loss: 16.15119285363322
Gradient wrt Center Vectors (dJ/dV):
 [[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [-4.54650789 -1.85942252  0.76397441]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]
 Gradient wrt Outside Vectors (dJ/dU):
 [[-0.69148188  0.31730185  2.41364029]
 [-0.22716495  0.10423969  0.79292674]
 [-0.45528438  0.20891737  1.58918512]
 [-0.31602611  0.14501561  1.10309954]
 [-0.80620296  0.36994417  2.81407799]]
    """)

In [18]:
test_word2vec()

==== Gradient check for skip-gram with naiveSoftmaxLossAndGradient ====
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.
==== Gradient check for skip-gram with negSamplingLossAndGradient ====
Gradient check passed!. Read the docstring of the `gradcheck_naive` method in utils.gradcheck.py to understand what the gradient check does.

=== Results ===
Skip-Gram with naiveSoftmaxLossAndGradient
Your Result:
Loss: 11.16610900153398
Gradient wrt Center Vectors (dJ/dV):
 [[ 0.          0.          0.        ]
 [ 0.          0.          0.        ]
 [-1.26947339 -1.36873189  2.45158957]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]
Gradient wrt Outside Vectors (dJ/dU):
 [[-0.41045956  0.18834851  1.43272264]
 [ 0.38202831 -0.17530219 -1.33348241]
 [ 0.07009355 -0.03216399 -0.24466386]
 [ 0.09472154 -0.04346509 -0.33062865]
 [-0.13638384  0.06258276  0.47605228]]

Expected