# The Neural Bi-gram Model

The logistic regression model that we have considered thus far has several considerable disadvantages.

It is slow and expensive to train, even though it is a linear model. Consider the $V \times V$ weight matrix: it has $V^2$ parameters. For a vocabulary of 100,000 words, that is 10 billion parameters to estimate, so it is not surprising that training takes a long time.
  

We can try to work around this by compressing the inputs into a lower dimensional space, which is exactly what neural networks do. We can think of the neural network as a non-linear function that maps the inputs into a lower dimensional space. The neural network is then a linear model in this lower dimensional space.

![Hidden Layer Network](03-one-hidden-layer.png)

This model has two sets of weights: $W^h$ ($V \times D$) and $W^o$ ($D \times V$). Choosing $D$ to be much smaller than $V$ reduces the number of parameters that we need to estimate. With $D = 100$ and $V = 100{,}000$, each matrix has 10 million parameters (20 million in total), far fewer than the 10 billion of the $V \times V$ model.
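
The comparison can be checked with a few lines of arithmetic. The helper name `parameter_counts` is made up for this illustration; it simply counts the entries of the weight matrices for the 100,000-word vocabulary used above.

```python
def parameter_counts(V, D):
    """Return (parameters of the V x V linear model, parameters of the hidden-layer model)."""
    linear = V * V            # one V x V weight matrix
    hidden = V * D + D * V    # W^h (V x D) plus W^o (D x V)
    return linear, hidden


linear, hidden = parameter_counts(V=100_000, D=100)
print(f"{linear:,} vs {hidden:,}")  # 10,000,000,000 vs 20,000,000
```

The hidden-layer model grows linearly in $V$ instead of quadratically, which is where the savings come from.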

Let's define the neural network model:
The hidden layer will have $D = 30$ neurons and a $\tanh$ activation.

$$
\begin{align}
\mathbf{H} &= \tanh(\mathbf{X} \mathbf{W}^h) \\
\hat{Y} &= \text{softmax}(\mathbf{H}\mathbf{W}^o)
\end{align}
$$

The loss function is the same as before:

$$
J(\mathbf{W}^h, \mathbf{W}^o) = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^V Y_{ij} \log \hat{Y}_{ij}
$$

The weights are updated using gradient descent:

$$
\begin{align}
\mathbf{W}^{h, \text{new}} = \mathbf{W}^{h, \text{old}} - \eta \nabla_{\mathbf{W}^h} J \\
\mathbf{W}^{o, \text{new}} = \mathbf{W}^{o, \text{old}} - \eta \nabla_{\mathbf{W}^o} J \\
\end{align}
$$

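The gradients are computed using the chain rule and can be shown to be:

$$
\begin{align}
\nabla_{\mathbf{W}^o} J &= \frac{1}{N} \mathbf{H}^T (\hat{Y} - Y) \\
\nabla_{\mathbf{W}^h} J &= \frac{1}{N} \mathbf{X}^T \left[ \left( (\hat{Y} - Y) \mathbf{W}^{oT} \right) \odot (1 - \mathbf{H}^2) \right]
\end{align}
$$

In the above, the $\odot$ operator is the element-wise product of two matrices and $\mathbf{H}^2$ is the element-wise square, so $1 - \mathbf{H}^2$ is the derivative of $\tanh$. The constant factor $\frac{1}{N}$ only rescales the learning rate and is omitted in the code below.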
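
One way to gain confidence in such gradient formulas is to compare them with finite differences on a tiny random problem. The sketch below does this under assumed toy sizes (`N`, `V`, `D` are arbitrary); it applies the $\tanh$ derivative element-wise to the error propagated back through $\mathbf{W}^o$ and includes the $\frac{1}{N}$ factor of the loss. The helper names `loss_and_grads` and `numeric_grad` are made up for this check.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def loss_and_grads(X, Y, Wh, Wo):
    """Cross-entropy loss of the one-hidden-layer model and its analytic gradients."""
    N = X.shape[0]
    H = np.tanh(X @ Wh)
    Y_hat = softmax(H @ Wo)
    J = -np.sum(Y * np.log(Y_hat)) / N
    delta = (Y_hat - Y) / N                         # N x V
    grad_Wo = H.T @ delta                           # D x V
    grad_Wh = X.T @ ((delta @ Wo.T) * (1 - H**2))   # V x D
    return J, grad_Wh, grad_Wo


def numeric_grad(f, W, eps=1e-6):
    """Central finite differences of the scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
N, V, D = 5, 7, 3                                   # toy sizes, chosen arbitrarily
X = np.eye(V)[rng.integers(0, V, size=N)]           # one-hot inputs
Y = np.eye(V)[rng.integers(0, V, size=N)]           # one-hot targets
Wh = rng.normal(size=(V, D)) / np.sqrt(V)
Wo = rng.normal(size=(D, V)) / np.sqrt(D)

J, grad_Wh, grad_Wo = loss_and_grads(X, Y, Wh, Wo)
num_Wh = numeric_grad(lambda W: loss_and_grads(X, Y, W, Wo)[0], Wh)
num_Wo = numeric_grad(lambda W: loss_and_grads(X, Y, Wh, W)[0], Wo)

print(np.allclose(grad_Wh, num_Wh, atol=1e-6), np.allclose(grad_Wo, num_Wo, atol=1e-6))
```

Rows of `grad_Wh` that belong to words absent from `X` are exactly zero, because a one-hot input only touches the row of its own word.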

In [None]:
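# NumPy is the only dependency of the training code
import numpy as np

# The same softmax function as before, with the row-wise maximum
# subtracted for numerical stability (this does not change the result)
def softmax(x):
    exp_x = np.exp(x - x.max(axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)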

def train_neural_bigram(sentences: list, V: int, D: int = 100, learning_rate: float = 0.01, epochs: int = 100):
    """
    Train a neural bi-gram model with tanh activation function
    
    :param sentences: A list of sentences. Each sentence is a list of integers corresponding to the indices of the words in the vocabulary
    :param V: The size of the vocabulary
    :param D: The size of the hidden layer (size of the word vectors)
    :param learning_rate: The learning rate of the gradient descent algorithm
    :param epochs: Number of epochs to train the model
    :return: Hidden and output weights and a list of losses at each iteration
    """
    # initialize weights
    Wh = np.random.randn(V, D) / np.sqrt(V)
    Wo = np.random.randn(D, V) / np.sqrt(D)
    
    # A list to store the loss at each iteration
    losses = []
    
    for epoch in range(epochs):
        # shuffle sentences at each epoch
        np.random.shuffle(sentences)
        
        j = 0 # keep track of iterations
        for sentence in sentences:
            # convert sentence into one-hot encoded inputs and targets
            n = len(sentence)
            inputs = np.zeros((n - 1, V))
            targets = np.zeros((n - 1, V))
            inputs[np.arange(n - 1), sentence[:n-1]] = 1
            targets[np.arange(n - 1), sentence[1:]] = 1
            
            # get output predictions
            hidden = np.tanh(inputs.dot(Wh))
            predictions = softmax(hidden.dot(Wo))
            
            # do a gradient descent step; both gradients are computed
            # with the current weights before either matrix is updated
            delta = predictions - targets
            dhidden = delta.dot(Wo.T) * (1 - hidden * hidden)
            Wo = Wo - learning_rate * hidden.T.dot(delta)
            Wh = Wh - learning_rate * inputs.T.dot(dhidden)
            
            # keep track of the loss
            loss = -np.sum(targets * np.log(predictions)) / (n - 1)
            losses.append(loss)
            
            if j % 10 == 0:
                print("epoch:", epoch, "sentence: %s/%s" % (j, len(sentences)), "loss:", loss)
            j += 1
            
    return Wh, Wo, losses

# The Skip-Gram Model


In the previous model, we used the first word in a bi-gram to predict the second word. 

$$
P(w_{t+1} | w_t) = \text{softmax}(\mathbf{W}_{o}^T f(\mathbf{W}_{h}^T x_t))
$$

The model already produces word vectors (embeddings) for each word in the vocabulary. A problem with the model, however, is that these word vectors perform poorly on word similarity and other NLP tasks.

The skip-gram model is a modification of the bi-gram model that produces better word vectors. Instead of predicting only the next word, it uses the current (center) word to predict the words surrounding it. The training data consists of pairs of a word and one of its neighbours, and the model is trained to maximize the probability of the surrounding words given the center word.

Furthermore, it uses the dot product between the two sets of weights (word vectors) to compute the probability of the surrounding words. The dot product is closely related to cosine similarity (the two coincide for vectors of unit length), so training pushes the vector of the center word and the vectors of its surrounding words to point in similar directions.

Mathematically, this implies that the activation function of the hidden layer is the identity function.

Let's look at an example:

```
Alice was beginning to get very **tired** of sitting by her sister on the bank.
```

The bi-gram model would produce the following training data for the word "tired":

- (tired, of)

The skip-gram model looks at a set of nearby words (a context window of $m$ words on each side) and produces the following training data for the word "tired" with $m = 2$:

- (tired, get)
- (tired, very)
- (tired, of)
- (tired, sitting)
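
Generating these pairs is a simple sliding-window operation. The sketch below assumes a symmetric window of `m` words on each side and a sentence that is already lower-cased and split into tokens; the function name `skipgram_pairs` is made up for this illustration.

```python
def skipgram_pairs(tokens, m=2):
    """Return (center, context) pairs for a window of m words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - m)                  # clip the window at the sentence start
        hi = min(len(tokens), i + m + 1)    # and at the sentence end
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "alice was beginning to get very tired of sitting by her sister on the bank".split()
pairs = skipgram_pairs(sentence, m=2)
tired_pairs = [p for p in pairs if p[0] == "tired"]
print(tired_pairs)
# [('tired', 'get'), ('tired', 'very'), ('tired', 'of'), ('tired', 'sitting')]
```

Words near the sentence boundaries simply get fewer pairs, since the window is clipped rather than padded.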

You can think about the model in two ways:

1. One sample with four targets
    - tired -> (very, get, of, sitting)
2. Four samples with a single target each
    - tired -> very
    - tired -> get
    - tired -> of
    - tired -> sitting

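In both views, and assuming that the context words are independent of each other given the center word, the model factorizes as:

$$
p(\text{context words} | \text{center word}) = \prod_{j \in \text{context}} \text{softmax}(\mathbf{W}^h_i \mathbf{W}^o)_j
$$

where $\mathbf{W}^h_i$ is the row of $\mathbf{W}^h$ that belongs to the center word $i$ and the subscript $j$ selects the probability assigned to context word $j$.

In terms of training data, the skip-gram model is an extension of the bi-gram model. Because it scores word pairs directly by the dot product of their vectors, it drops the $\tanh$ activation function that we used in the previous example and replaces it with the identity function.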

The equations for the skip-gram model are:

$$
\begin{align}
H &= X \mathbf{W}^h \\
\hat{Y} &= \text{softmax}(H \mathbf{W}^o)
\end{align}
$$

and the prediction of the model is based on the dot-products (similarities) between the center word vector (input) and the context word vectors (output).
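
A small numerical sketch makes the role of the dot products concrete. The sizes `V` and `D` and the random weights are assumptions for illustration only. With a one-hot input and an identity activation, the hidden layer is just a row lookup in $W^h$, and every score is the dot product of that row with a column of $W^o$.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
V, D = 6, 3                          # toy vocabulary and vector sizes
Wh = rng.normal(size=(V, D))         # center (input) word vectors, one per row
Wo = rng.normal(size=(D, V))         # context (output) word vectors, one per column

center = 2
X = np.zeros((1, V))
X[0, center] = 1.0                   # one-hot encoding of the center word

H = X @ Wh                           # identity activation: no tanh
Y_hat = softmax(H @ Wo)              # probabilities over all V words

# The same scores, computed as explicit dot products
scores = np.array([Wh[center] @ Wo[:, j] for j in range(V)])

print(np.allclose(H[0], Wh[center]), np.allclose(H @ Wo, scores[None, :]))
```

In practice the one-hot multiplication is therefore replaced by indexing `Wh[center]`, which avoids building the $V$-dimensional input vector.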

## Negative Sampling

The practical application of the skip-gram model is computationally expensive. Consider what happens if we train the model on a large corpus with a vocabulary of, e.g., $V = 250000$ words and a word vector size of $D = 300$. Each of the two sets of weights $W^h$ and $W^o$ will have $250000 \times 300 = 7.5 \times 10^7$ parameters. With so many parameters we need a huge amount of training data to avoid overfitting, and for every training pair the softmax has to compute and normalize a score for each of the $V$ words. This means that the model will take a long time to train.

A practical solution to this problem is to use negative sampling. To illustrate the idea, let's consider the following example:

For each training pair, the output layer produces a vector of probabilities, one for each of the $V$ words in the vocabulary. A perfect prediction has 1 at the index of the correct context word and zeros everywhere else.

$$
(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
$$

Instead of computing a probability for all words in the vocabulary, we can select a sample of words from that vocabulary and use them as negative examples. The model is then trained to maximize the probability of the correct context word and minimize the probability of the negative examples.

If we choose 5 negative examples, the perfect prediction will be a 6-dimensional vector with 1 at the index of the correct context word and zeros everywhere else.

$$
(0, 0, 0, 0, 0, 1)
$$

This will massively reduce the computational cost of the model. A natural question is how to choose the negative examples. The authors of the skip-gram model suggest sampling them at random from the uni-gram distribution raised to the 3/4 power.
 
The uni-gram distribution is simply the frequency of each word in the corpus. 

$$
p(\text{"tiger"}) = \frac{\text{frequency of "tiger"}}{\text{total number of words}}
$$

The uni-gram distribution raised to the 3/4 power, renormalized so that it sums to one, is:

$$
p^1(\text{"tiger"}) = \frac{p(\text{"tiger"})^{3/4}}{\sum_{w \in \text{vocabulary}} p(w)^{3/4}}
$$


The 3/4 power is used to reduce the probability of sampling very frequent words and is a hyperparameter of the model.
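
The effect of the exponent is easy to see on a toy example. In the sketch below the word counts are invented for illustration, and the function name `negative_sampling_distribution` is hypothetical. It raises the counts to the 3/4 power, renormalizes, and then draws five negative examples with NumPy's random generator.

```python
import numpy as np


def negative_sampling_distribution(counts, power=0.75):
    """Uni-gram distribution raised to `power` and renormalized to sum to one."""
    counts = np.asarray(counts, dtype=float)
    weights = counts ** power
    return weights / weights.sum()


counts = np.array([1600, 400, 10, 1])      # toy word counts (assumed)
p = counts / counts.sum()                  # plain uni-gram distribution
p_mod = negative_sampling_distribution(counts)

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=p_mod)   # indices of sampled words

print(np.round(p, 4))
print(np.round(p_mod, 4))
```

The most frequent word loses probability mass and the rarest word gains some, while the ranking of the words stays the same. This sketch does not exclude the center word or its true context words from the draw, which a full implementation may want to do.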


## Multi-class and Binary Classification

The first version of the skip-gram model that we presented solves a multi-class classification problem. The output of the model is a vector of probabilities for each of the $V$ words in the vocabulary.

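The idea of negative sampling is to transform the multi-class classification problem into a sequence of binary classification problems (logistic regressions). Instead of predicting the probability of each word in the vocabulary, we predict the probability of the correct context word and the probability of the negative examples.

The multi-class model assigns to a candidate word $c$, given the center word $x$, the probability

$$
p(c | x) = \frac{\exp(W^o_c W^h_x)}{\sum_{j=1}^V \exp(W^o_j W^h_x)}
$$

where $W^o_c W^h_x$ denotes the dot product between the output vector of word $c$ and the hidden (input) vector of word $x$. The binary model instead predicts whether $c$ is a true context word of $x$ ($y = 1$) or not ($y = 0$):

$$
p(y = 1 | x, c) = \sigma(W^o_c W^h_x)
$$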

In our example the logistic regressions for the center word "tired" will be:

- (tired, very): $p(\text{very} | \text{tired}) = \sigma(W^o_{\text{very}} W^h_{\text{tired}})$
- (tired, get): $p(\text{get} | \text{tired}) = \sigma(W^o_{\text{get}} W^h_{\text{tired}})$
- (tired, of): $p(\text{of} | \text{tired}) = \sigma(W^o_{\text{of}} W^h_{\text{tired}})$
- (tired, sitting): $p(\text{sitting} | \text{tired}) = \sigma(W^o_{\text{sitting}} W^h_{\text{tired}})$

By sampling from the modified uni-gram distribution we can select a small number of negative examples for each center word (the number is a hyperparameter). For example, we can sample the words "cat", "dog", "mouse", "rabbit", and "horse" as negative examples for the center word "tired". The logistic regressions for the negative examples will be:

- (tired, cat): $p(\text{cat} | \text{tired}) = \sigma(W^o_{\text{cat}} W^h_{\text{tired}})$
- (tired, dog): $p(\text{dog} | \text{tired}) = \sigma(W^o_{\text{dog}} W^h_{\text{tired}})$
- (tired, mouse): $p(\text{mouse} | \text{tired}) = \sigma(W^o_{\text{mouse}} W^h_{\text{tired}})$
- (tired, rabbit): $p(\text{rabbit} | \text{tired}) = \sigma(W^o_{\text{rabbit}} W^h_{\text{tired}})$
- (tired, horse): $p(\text{horse} | \text{tired}) = \sigma(W^o_{\text{horse}} W^h_{\text{tired}})$


The loss function with negative sampling is the negative log-likelihood of these binary classifications:

$$
\begin{align}
J = -\big[ &\log p(\text{very} | \text{tired}) + \log p(\text{get} | \text{tired}) + \log p(\text{of} | \text{tired}) + \log p(\text{sitting} | \text{tired}) \\
&+ \log (1 - p(\text{cat} | \text{tired})) + \log (1 - p(\text{dog} | \text{tired})) + \log (1 - p(\text{mouse} | \text{tired})) \\
&+ \log (1 - p(\text{rabbit} | \text{tired})) + \log (1 - p(\text{horse} | \text{tired})) \big]
\end{align}
$$

Generalizing the loss for a center word $i$ will give us

$$
J = -\sum_{j \in \text{pos. examples}} \log \sigma(W^o_jW^h_i) - \sum_{j \in \text{neg. examples}} \log (1 - \sigma(W^o_jW^h_i))
$$

We can use a property of the sigmoid function to simplify the loss function:

$$
\sigma(x) + \sigma(-x) = 1
$$

$$
J = -\sum_{j \in \text{pos. examples}} \log \sigma(W^o_jW^h_i) - \sum_{j \in \text{neg. examples}} \log \sigma(-W^o_jW^h_i)
$$
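
The simplified form translates directly into code. The sketch below uses random toy vectors (an assumption for illustration) for one center word, four positive and five negative examples, and a hypothetical helper `negative_sampling_loss`. It returns the negative of the summed log-probabilities, so that smaller values are better, and checks that $\log \sigma(-x)$ and $\log(1 - \sigma(x))$ give the same number.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(center_vec, pos_vecs, neg_vecs):
    """Negative log-likelihood for one center word.

    center_vec: (D,) hidden vector of the center word
    pos_vecs:   (P, D) output vectors of the true context words
    neg_vecs:   (K, D) output vectors of the sampled negative words
    """
    pos_scores = pos_vecs @ center_vec
    neg_scores = neg_vecs @ center_vec
    return -(np.sum(np.log(sigmoid(pos_scores))) + np.sum(np.log(sigmoid(-neg_scores))))


rng = np.random.default_rng(0)
D = 4                                      # toy vector size
center_vec = rng.normal(size=D)
pos_vecs = rng.normal(size=(4, D))         # e.g. very, get, of, sitting
neg_vecs = rng.normal(size=(5, D))         # e.g. cat, dog, mouse, rabbit, horse

loss = negative_sampling_loss(center_vec, pos_vecs, neg_vecs)

# The same loss written with log(1 - sigma(x)) for the negative examples
loss_alt = -(np.sum(np.log(sigmoid(pos_vecs @ center_vec)))
             + np.sum(np.log(1 - sigmoid(neg_vecs @ center_vec))))

print(np.isclose(loss, loss_alt))
```

Only nine dot products are needed per center word here, compared with $V$ of them for the full softmax.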


## Subsampling Frequent Words

A pattern occurring in most human language texts is that some words are much more frequent than others. There is an empirical finding known as Zipf's law that states that the frequency of a word is inversely proportional to its rank in the frequency table. For example, the most frequent word will occur twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.

$$
\text{frequency} \propto \frac{1}{\text{rank}}
$$


In [4]:
import numpy as np
import nltk

nltk.download("gutenberg")
from nltk.corpus import gutenberg
import spacy

nlp = spacy.load("en_core_web_sm")
gutenberg.fileids()
alice = gutenberg.raw(fileids="carroll-alice.txt")



In [5]:
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
words = word_tokenize(alice)
words = [word.lower() for word in words if word.isalpha()]

FreqDist(words).most_common(20)

[('the', 1616),
 ('and', 810),
 ('to', 720),
 ('a', 620),
 ('she', 544),
 ('it', 539),
 ('of', 499),
 ('said', 462),
 ('alice', 396),
 ('was', 366),
 ('i', 364),
 ('in', 359),
 ('you', 356),
 ('that', 284),
 ('as', 256),
 ('her', 248),
 ('at', 209),
 ('on', 191),
 ('had', 184),
 ('with', 179)]

Because some words are used very often (e.g., "the"), they provide little information about the context. Furthermore, training will spend a lot of time updating the word vectors of these words.

A solution is to drop some of the frequent words from the training data. The authors of the skip-gram model suggest dropping each occurrence of a word $w$ with a probability of:

$$
P_{drop}(w) = 1 - \sqrt{\frac{t}{f(w)}}
$$

where $t$ is a threshold (the authors suggest $t = 10^{-5}$) and $f(w)$ is the relative frequency of the word in the corpus. Words with $f(w) \le t$ are never dropped.
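
A minimal sketch of the subsampling rule, taking the relative frequency of a word as input. The frequencies below are invented for illustration, the function names are hypothetical, and clipping negative values to zero (so that rare words are always kept) is an assumption of this sketch.

```python
import numpy as np


def drop_probability(freq, t=1e-5):
    """Probability of dropping an occurrence of a word with relative frequency `freq`."""
    freq = np.asarray(freq, dtype=float)
    return np.clip(1.0 - np.sqrt(t / freq), 0.0, 1.0)


def subsample(tokens, freqs, t=1e-5, rng=None):
    """Randomly drop tokens; `freqs` maps each token to its relative frequency."""
    rng = rng if rng is not None else np.random.default_rng()
    p_drop = drop_probability([freqs[w] for w in tokens], t)
    keep = rng.random(len(tokens)) >= p_drop
    return [w for w, k in zip(tokens, keep) if k]


# Toy relative frequencies (assumed): a very frequent, a common and two rare words
p_drop = drop_probability([0.05, 1e-3, 1e-5, 1e-6])
print(np.round(p_drop, 4))

freqs = {"the": 0.05, "tired": 1e-6}
kept = subsample(["the", "tired", "the", "tired"], freqs, rng=np.random.default_rng(0))
print(kept)
```

With $t = 10^{-5}$, a word that makes up 5% of the corpus is dropped almost every time, while a word at or below the threshold is never dropped.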


Consider our running example:

```
Alice was beginning to get very **tired** of sitting by her sister on the bank.
```

and suppose that we drop the words "get" and "of" from the sentence. The sentence will become:

```
Alice was beginning to very **tired** sitting by her sister on the bank.
```

What is the effect of this technique on the context window size?


# Negative Sampling

The basic idea is to restrict the number of cases where a word is considered wrong.

$$
p(\text{very} | \text{tired}) = \frac{\exp(W^o_{\text{very}}W^h_{\text{tired}})}{\sum_{j=1}^V \exp(W^o_jW^h_{\text{tired}})}
$$

Let's say that we have selected a number of words that did not occur in the context of the word "tired" as negative examples: "boat", "egg", "horse".

Then we can formulate a series of binary classification problems to maximize the probability that the words "very", "sitting", "by", "her" are in the context of the word "tired" and minimize the probability that the words "boat", "egg", "horse" are in the context of the word "tired".

For the positive samples (words that occurred in the context of the word "tired") we have:

$$
\begin{align}
p(\text{very} | \text{tired}) &= \sigma(W^o_{\text{very}}W^h_{\text{tired}}) \\
p(\text{sitting} | \text{tired}) &= \sigma(W^o_{\text{sitting}}W^h_{\text{tired}}) \\
p(\text{by} | \text{tired}) &= \sigma(W^o_{\text{by}}W^h_{\text{tired}}) \\
p(\text{her} | \text{tired}) &= \sigma(W^o_{\text{her}}W^h_{\text{tired}})
\end{align}
$$

For the negative samples (words that did not occur in the context of the word "tired") we have:

$$
\begin{align}
p(\text{boat} | \text{tired}) &= \sigma(W^o_{\text{boat}}W^h_{\text{tired}}) \\
p(\text{egg} | \text{tired}) &= \sigma(W^o_{\text{egg}}W^h_{\text{tired}}) \\
p(\text{horse} | \text{tired}) &= \sigma(W^o_{\text{horse}}W^h_{\text{tired}})
\end{align}
$$

The loss function for this example is the negative log-likelihood of the seven binary classifications:

$$
\begin{align}
J = -\big[ &\log p(\text{very} | \text{tired}) + \log p(\text{sitting} | \text{tired}) + \log p(\text{by} | \text{tired}) + \log p(\text{her} | \text{tired}) \\
&+ \log (1 - p(\text{boat} | \text{tired})) + \log (1 - p(\text{egg} | \text{tired})) + \log (1 - p(\text{horse} | \text{tired})) \big]
\end{align}
$$

Generalizing the loss function will give us:

$$
J = -\sum_{c \in \text{context}} \log \sigma(W^o_cW^h_{\text{tired}}) - \sum_{j \in \text{neg. examples}} \log (1 - \sigma(W^o_jW^h_{\text{tired}}))
$$


## Weights Update with Negative Sampling

### Updating the Output Weights

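The loss function is given by:

$$
J = -\sum_{n = 1}^{N} \left[ t_n \log p_n + (1 - t_n) \log (1 - p_n) \right]
$$

where $t_n$ is the target of the $n$-th binary classification (1 for a context word, 0 for a negative example) and $p_n$ is the predicted probability.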

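From our notes on logistic regression we know (XXX, TODO: check if this is included) that the gradient of the loss function with respect to the weights is:

$$
\frac{\partial J}{\partial W^{o}} = H^T (P - T)
$$

where $H$ holds the hidden-layer values (the center word vectors), $P$ the predicted probabilities $p_n$ and $T$ the targets $t_n$. Only the columns of $W^o$ that belong to the sampled words receive a non-zero gradient.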

### Updating the Hidden Weights

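The gradient of the loss function with respect to the hidden weights is:

$$
\frac{\partial J}{\partial W^{h}} = X^T (P - T) W^{oT}
$$

This is the gradient of the bi-gram network without the $\tanh$ derivative, because the hidden layer of the skip-gram model uses the identity activation. Here $W^o$ is restricted to the columns of the sampled words, so that $(P - T) W^{oT}$ has one row of length $D$ per center word.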
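
Putting both updates together, one gradient descent step for a single center word could be sketched as follows. The function name `sgns_step`, the toy sizes and the learning rate are assumptions for illustration. The one-hot product with $X^T$ is replaced by indexing the row of the center word, the sampled word indices are assumed to be unique, and both gradients are computed before either matrix is changed.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_step(Wh, Wo, center, words, targets, learning_rate=0.1):
    """One update for one center word; modifies Wh and Wo in place.

    words:   unique indices of the sampled words (positives and negatives)
    targets: 1.0 for true context words, 0.0 for negative examples
    Returns the loss before the update.
    """
    h = Wh[center].copy()                 # (D,) hidden layer = center word vector
    Wo_s = Wo[:, words]                   # (D, K) sampled output vectors (a copy)
    p = sigmoid(h @ Wo_s)                 # (K,) predicted probabilities
    loss = -np.sum(targets * np.log(p) + (1 - targets) * np.log(1 - p))

    err = p - targets                     # (K,) corresponds to P - T
    grad_Wo_s = np.outer(h, err)          # H^T (P - T), shape (D, K)
    grad_h = Wo_s @ err                   # (P - T) W^{oT}, shape (D,)

    Wo[:, words] -= learning_rate * grad_Wo_s
    Wh[center] -= learning_rate * grad_h
    return loss


rng = np.random.default_rng(0)
V, D = 10, 4                              # toy sizes
Wh = rng.normal(size=(V, D)) * 0.1
Wo = rng.normal(size=(D, V)) * 0.1
Wh_before, Wo_before = Wh.copy(), Wo.copy()

center = 0
words = np.array([1, 2, 3, 4, 5, 6, 7])   # four positives followed by three negatives
targets = np.array([1, 1, 1, 1, 0, 0, 0], dtype=float)

losses = [sgns_step(Wh, Wo, center, words, targets) for _ in range(20)]
print(losses[0], losses[-1])
```

Each step touches a single row of `Wh` and only the sampled columns of `Wo`, which is what makes negative sampling cheap compared with updating all $V$ output vectors.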

