# Word2Vec

## Skip-gram Model

The neural bi-gram model that we have discussed so far does produce word embeddings (word vectors of fixed size, the size of the hidden layer). However, it does a poor job of capturing semantic relationships between words, because it is trained only to predict the next word given the preceding word.

The skip-gram model extends the bi-gram model with the goal of capturing a larger context around each word.

Let's look at an example:

```
Alice was beginning to get very **tired** of sitting by her sister on the bank.
```

The bi-gram model would produce the following training sample for the word "tired":

- (tired, of)

The skip-gram model looks at a set of nearby words (a context window of size $m$). For $m = 2$ it produces the following training data for the word "tired":

- (tired, get)
- (tired, very)
- (tired, of)
- (tired, sitting)

In this sense it looks very much like the bi-gram model, except that it skips a few words between the words in the training samples. This is why it is called the skip-gram model.
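The pair generation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and a symmetric window; the function name `skipgram_pairs` is hypothetical and not part of any library.

```python
def skipgram_pairs(tokens, m=2):
    """Return (center, context) pairs for a symmetric window of size m."""
    pairs = []
    for i, center in enumerate(tokens):
        # Positions i-m .. i+m, clipped to the sentence boundaries.
        for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "alice was beginning to get very tired of sitting by her sister on the bank"
tokens = sentence.split()

# Keep only the pairs whose center word is "tired".
tired_pairs = [pair for pair in skipgram_pairs(tokens, m=2) if pair[0] == "tired"]
print(tired_pairs)
```

Words near the beginning or end of the sentence get fewer than $2m$ pairs, because the window is cut off at the sentence boundary.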

You can think about the model in two equivalent ways:

1. One training sample with four targets (and therefore four softmax classifiers)
    - tired -> (very, get, of, sitting)
2. Four training samples with the same input word and four different target words
    - tired -> very
    - tired -> get
    - tired -> of
    - tired -> sitting


![Skip-gram Model](../figures/skip-gram-architecture.webp)

Different from the bi-gram model, the skip-gram drops the non-linear activation function (tanh) and uses the identity function instead. Apart from making the model simpler, this change makes the dot-product between the word-vector of the center word and the word-vectors of the context words the only factor that determines the probability of the context words given the center word. Remember that the dot-product is a measure of similarity between two vectors, so the skip-gram model will produce word vectors that are closer in the vector space for words that are commonly used in similar contexts.

Let's denote the weight matrices of the skip-gram model as $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ for the input and output layers, respectively. The hidden layer is given by:

$$
h = \mathbf{W}^{(1)T} x
$$

where $x$ is the one-hot encoded input vector of length $V$ (the size of the vocabulary) and $h$ is the hidden layer of size $D$ (the size of the word embedding).

Note that because the input vector is one-hot encoded, the dot-product is equivalent to selecting the row of the weight matrix that corresponds to the index of the input word.
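A small NumPy check of this row-selection property, assuming a toy vocabulary of five words and an embedding size of three (both values are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 3                     # toy vocabulary and embedding sizes (assumed)
W1 = rng.normal(size=(V, D))    # input weights: one row per word

idx = 2                         # index of the input word
x = np.zeros(V)
x[idx] = 1.0                    # one-hot encoding of the input word

h = W1.T @ x                    # hidden layer computed by matrix multiplication
print(np.allclose(h, W1[idx]))  # compares with a plain row lookup
```

In practice the lookup `W1[idx]` is what implementations use, since it avoids multiplying by a vector that is almost entirely zeros.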

The output layer is given by:

$$
\hat{y} = \hat{P}(y | \text{input}) = \text{softmax}(\mathbf{W}^{(2)T} h)
$$

where $\hat{y}$ is the predicted probability distribution over the vocabulary and is a vector of length $V$.
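The two formulas above can be combined into a short forward pass. This is a sketch under assumed toy sizes and random weights; the helper `softmax` is written out here rather than taken from a library.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    z = z - z.max()             # shifting the scores does not change the result
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(1)
V, D = 6, 4                                 # toy sizes (assumed)
W1 = rng.normal(scale=0.1, size=(V, D))     # input weights, V x D
W2 = rng.normal(scale=0.1, size=(D, V))     # output weights, D x V

center_idx = 3
h = W1[center_idx]              # hidden layer: the word vector of the center word
y_hat = softmax(W2.T @ h)       # predicted distribution over the vocabulary

print(y_hat.shape, y_hat.sum())
```

Each score in `W2.T @ h` is the dot product between the center word vector and one output word vector, which is why the dot product alone determines the predicted probabilities.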

## Negative Sampling

Training the skip-gram model in practice is computationally expensive. Consider what happens if we train the model on a large corpus with a vocabulary of, e.g., $V = 250000$ words and word vectors of size $D = 300$. Each of the two weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ will have $250000 \times 300 = 7.5 \times 10^7$ parameters. With so many parameters we need a huge amount of training data to avoid overfitting, so the model takes a long time to train. Furthermore, for every training sample the softmax requires a score for each of the $V$ words in the vocabulary, which is also computationally expensive.

There are several approaches to work around the softmax computational problem.

- Hierarchical Softmax (not covered here)
- Negative Sampling

The idea behind negative sampling is to just throw away the softmax function and replace it with binary classification problems. Instead of predicting the probability of each word in the vocabulary, we predict the probability of the correct context word and the probability of a set of words that are not in the context of the center word. These words are called negative samples.

A natural question is how to choose the negative examples. The authors of the skip-gram model suggest sampling them from the uni-gram distribution raised to the 3/4 power and then re-normalized.

The uni-gram distribution is simply the relative frequency of each word in the corpus.

$$
p(\text{"tiger"}) = \frac{\text{frequency of "tiger"}}{\text{total number of words}}
$$

The uni-gram distribution raised to the 3/4 power is:

$$
p^1(\text{"tiger"}) = \frac{p(\text{"tiger"})^{3/4}}{\sum_{w} p(w)^{3/4}}
$$

where the sum in the denominator runs over all words in the vocabulary, each relative frequency raised to the 3/4 power, so that $p^1$ is again a probability distribution over the vocabulary. Raising to the 3/4 power reduces the probability of sampling very frequent words and increases it for rare words. The exponent is a hyperparameter of the model.
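The following sketch computes the modified distribution for a tiny made-up corpus and draws a few negative samples from it. The corpus, the variable names and the number of samples are assumptions chosen for illustration only.

```python
import numpy as np
from collections import Counter

# A tiny made-up corpus, only for illustration.
tokens = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(tokens)
vocab = sorted(counts)
freq = np.array([counts[w] for w in vocab], dtype=float)

p = freq / freq.sum()       # uni-gram distribution (relative frequencies)
p_mod = p ** 0.75           # raise to the 3/4 power ...
p_mod /= p_mod.sum()        # ... and re-normalize so that it sums to one

rng = np.random.default_rng(0)
negatives = rng.choice(vocab, size=4, p=p_mod)   # draw four negative samples

for w, a, b in zip(vocab, p, p_mod):
    print(f"{w:>4}  p={a:.3f}  p_mod={b:.3f}")
print(negatives)
```

In this toy corpus the most frequent word ("the") loses probability mass under the modified distribution, while the words that occur only once gain some.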

## Training the Skip-gram Model with Negative Sampling

Consider again the training samples:

- (tired, get)
- (tired, very)
- (tired, of)
- (tired, sitting)

The skip-gram model with negative sampling will produce the following training samples:


- (tired, get, 1)
- (tired, very, 1)
- (tired, of, 1)
- (tired, sitting, 1)
- (tired, cat, 0)
- (tired, egg, 0)
- (tired, sky, 0)
- (tired, smile, 0)

where the third element in the tuple is the label of the training sample. The label is 1 if the word is in the context of the center word and 0 if it is not. Now we can train the model using binary cross-entropy loss instead of the softmax loss.
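A minimal sketch of how such labeled triples could be assembled. The function `labeled_samples`, the small vocabulary and the uniform sampling probabilities are all hypothetical. Excluding the center word and its context words from the negative candidates is a simplifying assumption of this sketch.

```python
import numpy as np


def labeled_samples(center, context_words, vocab, probs, k, rng):
    """Return (center, word, label) triples: label 1 for context words, 0 for negatives."""
    positives = [(center, w, 1) for w in context_words]
    # Assumption: never draw the center word or a true context word as a negative.
    excluded = set(context_words) | {center}
    candidates = [w for w in vocab if w not in excluded]
    weights = np.array([probs[w] for w in candidates], dtype=float)
    weights /= weights.sum()                      # re-normalize over the candidates
    drawn = rng.choice(candidates, size=k, replace=True, p=weights)
    return positives + [(center, str(w), 0) for w in drawn]


# Hypothetical vocabulary with uniform sampling probabilities.
vocab = ["tired", "get", "very", "of", "sitting", "cat", "egg", "sky", "smile"]
probs = {w: 1.0 / len(vocab) for w in vocab}

samples = labeled_samples(
    "tired", ["get", "very", "of", "sitting"], vocab, probs, k=4,
    rng=np.random.default_rng(0),
)
for s in samples:
    print(s)
```

Because the negatives are drawn at random, the same word can appear more than once among them.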

We will have to predict the following probabilities:

$$
\begin{align*}
P(\text{get} | \text{tired}) & = \sigma(W^{(2)T}_{\text{get}} W^{(1)}_{\text{tired}}) \\
P(\text{very} | \text{tired}) & = \sigma(W^{(2)T}_{\text{very}} W^{(1)}_{\text{tired}}) \\
P(\text{of} | \text{tired}) & = \sigma(W^{(2)T}_{\text{of}} W^{(1)}_{\text{tired}}) \\
P(\text{sitting} | \text{tired}) & = \sigma(W^{(2)T}_{\text{sitting}} W^{(1)}_{\text{tired}}) \\
P(\text{cat} | \text{tired}) & = \sigma(W^{(2)T}_{\text{cat}} W^{(1)}_{\text{tired}}) \\
P(\text{egg} | \text{tired}) & = \sigma(W^{(2)T}_{\text{egg}} W^{(1)}_{\text{tired}}) \\
P(\text{sky} | \text{tired}) & = \sigma(W^{(2)T}_{\text{sky}} W^{(1)}_{\text{tired}}) \\
P(\text{smile} | \text{tired}) & = \sigma(W^{(2)T}_{\text{smile}} W^{(1)}_{\text{tired}})
\end{align*}
$$

where $\sigma$ is the sigmoid function.

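Given these predicted probabilities, we can compute the binary cross-entropy loss. Remember that the binary cross-entropy loss for a single training sample is:

$$
L = -\left[ y_{\text{out}} \log P(y_{\text{out}} = 1 | x_{\text{in}}) + (1 - y_{\text{out}}) \log (1 - P(y_{\text{out}} = 1 | x_{\text{in}})) \right]
$$

where $y_{\text{out}}$ is the true label (1 or 0) and $P(y_{\text{out}} = 1 | x_{\text{in}})$ is the predicted probability that the output word is in the context of the input word. For the eight training samples above, the individual losses are: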

$$
\begin{align*}
L(\text{get}) & = -\log(P(\text{get} | \text{tired})) \\
L(\text{very}) & = -\log(P(\text{very} | \text{tired})) \\
L(\text{of}) & = -\log(P(\text{of} | \text{tired})) \\
L(\text{sitting}) & = -\log(P(\text{sitting} | \text{tired})) \\
L(\text{cat}) & = -\log(1 - P(\text{cat} | \text{tired})) \\
L(\text{egg}) & = -\log(1 - P(\text{egg} | \text{tired})) \\
L(\text{sky}) & = -\log(1 - P(\text{sky} | \text{tired})) \\
L(\text{smile}) & = -\log(1 - P(\text{smile} | \text{tired}))
\end{align*}
$$

The total loss is the sum of the losses:

$$
L = L(\text{get}) + L(\text{very}) + L(\text{of}) + L(\text{sitting}) + L(\text{cat}) + L(\text{egg}) + L(\text{sky}) + L(\text{smile})
$$

We can write the loss more generally as:

$$
L = -\sum_{w \in \text{Context}} \log(P(y_{\text{w}} = 1 | x_{\text{in}})) - \sum_{w \in \text{Neg. samples}} \log(1 - P(y_{\text{w}} = 1 | x_{\text{in}}))
$$

where the first sum runs over the context words of the center word and the second sum runs over the $k$ negative samples.
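The general loss can be written as a short NumPy function. This is a sketch that assumes the output word vectors are passed in as the rows of two arrays; the function name and the random test vectors are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def neg_sampling_loss(v_in, U_context, U_negative):
    """Loss for one center word.

    v_in:       (D,)   word vector of the center word
    U_context:  (C, D) output vectors of the context words (label 1)
    U_negative: (k, D) output vectors of the negative samples (label 0)
    """
    p_context = sigmoid(U_context @ v_in)
    p_negative = sigmoid(U_negative @ v_in)
    return -np.sum(np.log(p_context)) - np.sum(np.log(1.0 - p_negative))


rng = np.random.default_rng(0)
D = 5
v_in = rng.normal(size=D)
U_context = rng.normal(size=(4, D))
U_negative = rng.normal(size=(4, D))

loss_value = neg_sampling_loss(v_in, U_context, U_negative)
print(loss_value)
```

A quick sanity check: if the center word vector is all zeros, every predicted probability is 0.5 and the loss equals $(C + k) \log 2$.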

Now that we have the loss function, we can derive the gradients so that we can update the weights of the model using gradient descent.

Let's first consider the updates for the output layer weights $\mathbf{W}^{(2)}$. For a single training sample $(x_{\text{in}}, y_{\text{out}})$, the gradient of the loss with respect to the output layer weights is:

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}^{(2)}_{\text{out}}} & = W^{(1)}_{\text{in}} (P(y_{\text{out}} = 1 | x_{\text{in}}) - y_{\text{out}}) \end{align*}
$$

The multiplication here is well defined, because for a single training sample $P(y_{\text{out}} = 1 | x_{\text{in}}) - y_{\text{out}}$ is a scalar that simply scales the word vector $W^{(1)}_{\text{in}}$.

:::{.callout-important}
## Weight Update for the Output Layer

Note that the weight update for the output layer only affects a single word vector in the weight matrix $\mathbf{W}^{(2)}$ (one column in the notation above). This is because, for a given input word, the predicted probability depends only on the word vector of that output word.

:::

Now let's consider a batch of three training samples:

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}^{(2)}_{\text{of}}} & = W^{(1)}_{\text{tired}} (P(y_{\text{of}} = 1 | x_{\text{tired}}) - y_{\text{of}}) \\
\frac{\partial L}{\partial \mathbf{W}^{(2)}_{\text{sitting}}} & = W^{(1)}_{\text{tired}} (P(y_{\text{sitting}} = 1 | x_{\text{tired}}) - y_{\text{sitting}}) \\
\frac{\partial L}{\partial \mathbf{W}^{(2)}_{\text{cat}}} & = W^{(1)}_{\text{tired}} (P(y_{\text{cat}} = 1 | x_{\text{tired}}) - y_{\text{cat}}) \\
\end{align*}
$$

The weight updates only affect the word vectors in the weight matrix $\mathbf{W}^{(2)}$ that correspond to the output words in the training samples. In order to vectorize the weight updates, consider the vector of prediction errors:

$$
\begin{align*}
(P - Y) & = \begin{bmatrix} P(y_{\text{of}} = 1 | x_{\text{tired}}) - y_{\text{of}} \\ P(y_{\text{sitting}} = 1 | x_{\text{tired}}) - y_{\text{sitting}} \\ P(y_{\text{cat}} = 1 | x_{\text{tired}}) - y_{\text{cat}} \end{bmatrix}
\end{align*}
$$

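You can obtain the $D \times 3$ (sub-)matrix of gradients for the output layer weights by taking the _outer_ product of the word vector of the input word (a vector of length $D$, the embedding size) and the vector of prediction errors (of length 3). Each column of the result is the gradient for the word vector of one output word: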

$$
\frac{\partial L}{\partial \mathbf{W}^{(2)}} = W^{(1)}_{\text{tired}} \otimes (P - Y)
$$

where $\otimes$ is the outer product.
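The outer product can be sketched with `np.outer`. The vectors below are random stand-ins for the word vectors of "tired", "of", "sitting" and "cat", and the embedding size of four is an assumption made to keep the output small.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
D = 4                                   # toy embedding size (assumed)
v_tired = rng.normal(size=D)            # input word vector of the center word
U = rng.normal(size=(3, D))             # output vectors of "of", "sitting", "cat"
y = np.array([1.0, 1.0, 0.0])           # labels: two context words, one negative

errors = sigmoid(U @ v_tired) - y       # vector of prediction errors (P - Y)
grad = np.outer(v_tired, errors)        # shape (D, 3): one column per output word

print(grad.shape)
```

Column `j` of `grad` equals `v_tired * errors[j]`, which is exactly the single-sample gradient from above applied to the `j`-th output word.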

We can derive the gradient for the hidden layer weights $\mathbf{W}^{(1)}$ by using the chain rule. We must also keep in mind that for a single batch (center word, context words and negative samples) we only need to update one row of the weight matrix $\mathbf{W}^{(1)}$ corresponding to the center (input) word.

Let us review how we derived the gradient in the logistic regression model. Here the derivation is even simpler, because the hidden layer is just a linear transformation of the input layer.

$$
\begin{align*}
\frac{\partial L}{\partial z_n} & = - \frac{\partial}{\partial z_n} (y_n \log(\sigma(z_n)) + (1 - y_n) \log(1 - \sigma(z_n))) \\
& = - \frac{y_n}{\sigma(z_n)} \frac{\partial \sigma(z_n)}{\partial z_n} + \frac{1 - y_n}{1 - \sigma(z_n)} \frac{\partial \sigma(z_n)}{\partial z_n} \\
& = - \frac{y_n}{\sigma(z_n)} \sigma(z_n)(1 - \sigma(z_n)) + \frac{1 - y_n}{1 - \sigma(z_n)} \sigma(z_n)(1 - \sigma(z_n)) \\
& = - y_n (1 - \sigma(z_n)) + (1 - y_n) \sigma(z_n) \\
& = - y_n + y_n \sigma(z_n) + \sigma(z_n) - y_n \sigma(z_n) \\
& = \sigma(z_n) - y_n \\
& = P(y_n = 1 | x_{\text{in}}) - y_n
\end{align*}
$$
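The result $\partial L / \partial z_n = \sigma(z_n) - y_n$ can be checked numerically with central finite differences. The value of `z` and the step size `eps` below are arbitrary choices for this sketch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(z, y):
    """Binary cross-entropy as a function of the pre-activation z."""
    return -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))


z, eps = 0.7, 1e-6
checks = []
for y in (0.0, 1.0):
    numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)   # finite difference
    analytic = sigmoid(z) - y                                   # derived formula
    checks.append((y, numeric, analytic))
    print(f"y={y:.0f}  numeric={numeric:.6f}  analytic={analytic:.6f}")
```

The two columns should agree up to the error of the finite-difference approximation.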

The output layer value $z_n$ is just the dot product of the hidden layer and the output layer weights:

$$
z_n = W^{(2)T}_{\text{out}(n)} W^{(1)}_{\text{in}}
$$

One thing that we need to keep in mind is that the gradient of the loss with respect to the word vector of the input word is a sum over all $N$ output words in the batch (context words and negative samples). The reason is that the same input word vector is shared across all output words, so the predicted probability of every output word depends on it.

$$
\begin{align*}
\frac{\partial L}{\partial W^{(1)}_{\text{in}}} & = \sum_{n = 1}^{N} \frac{\partial L}{\partial z_n} \frac{\partial z_n}{\partial W^{(1)}_{\text{in}}} \\
& = \sum_{n = 1}^{N} (P(y_n = 1 | x_{\text{in}}) - y_n) \frac{\partial \left( W^{(2)T}_{\text{out}(n)} W^{(1)}_{\text{in}} \right)}{\partial W^{(1)}_{\text{in}}} \\
& = \sum_{n = 1}^{N} (P(y_n = 1 | x_{\text{in}}) - y_n) W^{(2)}_{\text{out}(n)}
\end{align*}
$$
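As a sanity check, the gradient with respect to the input word vector can be compared with a finite-difference approximation. The sketch below assumes random word vectors, eight output words (four context words and four negative samples) and stores the output vectors as the rows of `U`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(v_in, U, y):
    """Total binary cross-entropy over all output words for one center word."""
    p = sigmoid(U @ v_in)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


rng = np.random.default_rng(0)
D, N = 4, 8
v_in = rng.normal(size=D)                           # input vector of the center word
U = rng.normal(size=(N, D))                         # one output vector per row
y = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

# Analytic gradient: sum over n of (p_n - y_n) * U[n].
grad_v = (sigmoid(U @ v_in) - y) @ U

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (loss(v_in + eps * e, U, y) - loss(v_in - eps * e, U, y)) / (2 * eps)
    for e in np.eye(D)
])

print(np.max(np.abs(grad_v - numeric)))
```

A gradient descent step would then update only this one word vector, for example `v_in -= learning_rate * grad_v` with a learning rate of your choice.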






## Subsampling Frequent Words

A pattern occurring in most human language texts is that some words are much more frequent than others. There is an empirical finding known as Zipf's law that states that the frequency of a word is approximately inversely proportional to its rank in the frequency table. For example, the most frequent word will occur roughly twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.

$$
\text{frequency} \propto \frac{1}{\text{rank}}
$$


```python
import numpy as np
import nltk
import spacy
from nltk.corpus import gutenberg

# Download the Project Gutenberg sample corpus (skipped if already present).
nltk.download("gutenberg")

# Load the small English spaCy pipeline.
nlp = spacy.load("en_core_web_sm")

# Raw text of "Alice's Adventures in Wonderland".
alice = gutenberg.raw(fileids="carroll-alice.txt")
```


```python
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Tokenize, lowercase, and keep only purely alphabetic tokens.
# Note: word_tokenize requires the NLTK tokenizer models to be installed.
words = word_tokenize(alice)
words = [word.lower() for word in words if word.isalpha()]

# The 20 most frequent words with their counts.
FreqDist(words).most_common(20)
```

```
[('the', 1616),
 ('and', 810),
 ('to', 720),
 ('a', 620),
 ('she', 544),
 ('it', 539),
 ('of', 499),
 ('said', 462),
 ('alice', 396),
 ('was', 366),
 ('i', 364),
 ('in', 359),
 ('you', 356),
 ('that', 284),
 ('as', 256),
 ('her', 248),
 ('at', 209),
 ('on', 191),
 ('had', 184),
 ('with', 179)]
```
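As a rough check of Zipf's law, the sketch below compares the first few counts from the list above with the values that a strict $1/\text{rank}$ law would predict from the count of the most frequent word. Only the counts are taken from the output; the rest is an illustrative assumption.

```python
# Counts of the five most frequent words, copied from the list above.
top_counts = [("the", 1616), ("and", 810), ("to", 720), ("a", 620), ("she", 544)]

first_count = top_counts[0][1]
rows = []
for rank, (word, count) in enumerate(top_counts, start=1):
    predicted = first_count / rank      # strict Zipf: frequency proportional to 1 / rank
    rows.append((rank, word, count, predicted))
    print(f"{rank}  {word:<4} observed={count:>5}  predicted={predicted:8.1f}")
```

The second-ranked word matches the prediction closely, while the lower ranks occur more often than a strict $1/\text{rank}$ law would suggest, so the law should be read as an approximation.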

Because some words are used very often (e.g., "the"), they provide little information about the context. Furthermore, the training will spend a lot of time updating the word vectors of these words.

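A solution is to drop some occurrences of the frequent words from the training data. The authors of the skip-gram model suggest dropping each occurrence of a word $w$ with probability:

$$
P_{\text{drop}}(w) = 1 - \sqrt{\frac{t}{p(w)}}
$$

where $p(w)$ is the relative frequency of the word in the corpus (the uni-gram distribution) and $t$ is a threshold, typically around $10^{-5}$. Words with $p(w) \leq t$ are never dropped, because the expression is not positive for them.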
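The drop probability is easy to evaluate for a few values. In the sketch below the word probabilities are hypothetical, and clipping negative values to zero is an assumption that expresses "rare words are never dropped".

```python
import numpy as np


def drop_probability(p, t=1e-5):
    """Probability of dropping an occurrence of a word with probability p."""
    p = np.asarray(p, dtype=float)
    # Negative values (for p < t) are clipped to zero: such words are always kept.
    return np.clip(1.0 - np.sqrt(t / p), 0.0, 1.0)


# Hypothetical word probabilities, from very frequent to very rare.
word_probs = np.array([0.05, 1e-3, 1e-5, 1e-6])
p_drop = drop_probability(word_probs)

for p, d in zip(word_probs, p_drop):
    print(f"p={p:.0e}  P_drop={d:.4f}")
```

With $t = 10^{-5}$ a word that makes up 5% of the corpus is dropped almost every time it occurs, while a word at or below the threshold is always kept.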

Consider our running example:

```
Alice was beginning to get very **tired** of sitting by her sister on the bank.
```

and suppose that we drop the words "get" and "of" from the sentence. The sentence will become:

```
Alice was beginning to very **tired** sitting by her sister on the bank.
```

What is the effect of this technique on the context window size?
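One way to explore this question is to compare the context of "tired" before and after dropping the two words. The helper `context` below is hypothetical and assumes a window of $m = 2$ and whitespace tokenization.

```python
original = "alice was beginning to get very tired of sitting by her sister on the bank".split()
dropped = {"get", "of"}
kept = [w for w in original if w not in dropped]


def context(tokens, center, m=2):
    """Return up to m words on each side of the first occurrence of center."""
    i = tokens.index(center)
    return tokens[max(0, i - m):i] + tokens[i + 1:i + 1 + m]


context_before = context(original, "tired")
context_after = context(kept, "tired")
print(context_before)
print(context_after)
```

After subsampling, the window of the same nominal size reaches words that were further away from "tired" in the original sentence.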
