# The Bi-gram Neural Model

[Open in Colab](https://colab.research.google.com/github/febse/ta2024/blob/main/03-Probabilistic-Language-Models/04-The-Bigram-Neural-Model.ipynb)

The logistic regression model that we used to estimate the probability of a word given its preceding word produced a vector representation of each word in the vocabulary (the weights of the model). However, these vectors have a serious drawback: their length equals the size of the vocabulary, which can be very large. The number of parameters in the model therefore equals the square of the vocabulary size, which makes the model computationally expensive.

We can try to alleviate this problem by using a neural network model with one hidden layer where the dimension of the hidden layer is much smaller than the size of the vocabulary. For a vocabulary of 3 words and a hidden layer of size 2, the model would look like this:

```mermaid
graph LR
    A1[1] -->|wh11| H1[h1]
    A1 -->|wh12| H2[h2]
    A2[0] -->|wh21| H1 
    A2 -->|wh22| H2
    A3[0] -->|wh31| H1
    A3 -->|wh32| H2
    H1 -->|wo11| O1[z1]
    H1 -->|wo21| O2[z2]
    H1 -->|wo31| O3[z3]
    H2 -->|wo12| O1
    H2 -->|wo22| O2
    H2 -->|wo32| O3
    O1 --> SM
    O2 --> SM
    O3 --> SM
    SM[Softmax]  --> P1[P1] --> Y1[0]
    SM --> P2[P2] --> Y2[1]
    SM --> P3[P3] --> Y3[0]
```

Notice that the number of parameters in this model is equal to the number of weights connecting the input layer to the hidden layer plus the number of weights connecting the hidden layer to the output layer. Here we have 6 weights connecting the input layer to the hidden layer and 6 weights connecting the hidden layer to the output layer, for a total of 12 weights. For such a tiny vocabulary the hidden layer actually increases the number of parameters, because the logistic regression would have only $3 \cdot 3 = 9$ parameters.

However, with a vocabulary of 1000 words and a hidden layer of size 100, the number of parameters in the neural network model is $1000 \cdot 100 + 100 \cdot 1000 = 200000$, which is much smaller than the one million ($1000 \cdot 1000$) parameters in the logistic regression model.
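The two parameter counts can be checked with a few lines of Python. This is only a small sketch: the helper names `logistic_params` and `neural_params` are ours, and bias terms are ignored, as in the text.

```python
def logistic_params(V: int) -> int:
    """Weights of the bi-gram logistic regression: a single V x V matrix."""
    return V * V


def neural_params(V: int, H: int) -> int:
    """Weights of the one-hidden-layer model: a V x H matrix plus an H x V matrix."""
    return V * H + H * V


for V, H in [(3, 2), (1000, 100)]:
    print(f"V={V}, H={H}: logistic={logistic_params(V)}, neural={neural_params(V, H)}")
# V=3, H=2: logistic=9, neural=12
# V=1000, H=100: logistic=1000000, neural=200000
```

The hidden layer pays off only when $2 \cdot V \cdot H < V^2$, that is, when $H < V / 2$.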


To complete the model we must specify an activation function for the hidden layer. As an exercise, let us choose the `tanh` function. The `tanh` function is defined as:

$$
\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$
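As a quick sanity check, the formula can be compared with NumPy's built-in `np.tanh`. The helper name `tanh_from_definition` is ours and only serves this illustration.

```python
import numpy as np


def tanh_from_definition(x):
    """tanh computed directly from the exponential formula above."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))


x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(tanh_from_definition(x), np.tanh(x)))  # True
print(np.all(np.abs(np.tanh(x)) < 1))                    # True: outputs lie in (-1, 1)
```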

## The Forward Pass

For a single bi-gram with input $x$ and output $y$, the forward pass of the model is given by:

$$
\begin{align*}
a & = W^{(1) T} \cdot x \\
h & = \text{tanh}(a) \\
z & = W^{(2) T} \cdot h \\
\hat{y} & = \text{Softmax}(z)
\end{align*}
$$

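where $W^{(1)}$ is the $V \times H$ matrix of weights connecting the input layer to the hidden layer, $W^{(2)}$ is the $H \times V$ matrix of weights connecting the hidden layer to the output layer, $x$ is the one-hot (column) vector of the first word of the bi-gram, and $\hat{y}$ is the predicted probability distribution over the vocabulary. Here $V$ is the size of the vocabulary and $H$ is the size of the hidden layer.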
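The four steps translate almost line by line into NumPy. The following is a minimal sketch for the toy network from the diagram (3 words, 2 hidden units); the random weights and the seed are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 3, 2                      # vocabulary size and hidden layer size, as in the diagram

W1 = rng.normal(size=(V, H))     # input -> hidden weights, shape (V, H)
W2 = rng.normal(size=(H, V))     # hidden -> output weights, shape (H, V)

x = np.array([1.0, 0.0, 0.0])    # one-hot vector of the first word of the bi-gram

a = W1.T @ x                     # hidden pre-activation, shape (H,)
h = np.tanh(a)                   # hidden activation, shape (H,)
z = W2.T @ h                     # output scores, shape (V,)
y_hat = np.exp(z - z.max())      # softmax, shifted by max(z) for numerical stability
y_hat /= y_hat.sum()

print(y_hat.shape, np.isclose(y_hat.sum(), 1.0))
print(np.allclose(a, W1[0]))     # the one-hot input simply selects the first row of W1
```

Because $x$ is one-hot, the product $W^{(1) T} x$ just picks out one row of $W^{(1)}$, so each row of $W^{(1)}$ acts as an $H$-dimensional vector for one word.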

## The Backward Pass

The loss function for the model is the cross-entropy loss function just as in the logistic regression model.

$$
L(y, \hat{y}) = - \sum_{i} y_i \log(\hat{y}_i)
$$
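Since $y$ is one-hot, only one term of the sum survives: the loss is the negative log-probability that the model assigns to the word that actually follows. A small sketch with made-up numbers:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])        # one-hot target: the second word follows
y_hat = np.array([0.2, 0.7, 0.1])    # made-up predicted distribution

loss = -np.sum(y * np.log(y_hat))
print(np.isclose(loss, -np.log(0.7)))  # True
```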

This time we have two sets of weights to update, $W^{(1)}$ and $W^{(2)}$, so we need the gradient of the loss function with respect to each of them.

The gradient of the loss function with respect to the output layer weights has the same form as in the logistic regression model, only this time the input to the output layer is the hidden layer activation $h$ instead of the input vector $x$.

$$
\underbrace{\frac{\partial L}{\partial W^{(2)}}}_{H \times V} = \underbrace{h}_{H \times 1} \cdot \underbrace{(\hat{y} - y)^T}_{1 \times V}
$$

We can find the gradient of the loss function with respect to the hidden layer weights by applying the chain rule:

$$
\begin{align*}
\frac{\partial L}{\partial W^{(1)}} & = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial h} \cdot \frac{\partial h}{\partial a} \cdot \frac{\partial a}{\partial W^{(1)}}
\end{align*}
$$

We already know the derivative of the cross-entropy loss function with respect to the output layer scores $z$. It is a vector of size $V$, where $V$ is the size of the vocabulary, containing the prediction errors.

$$
\frac{\partial L}{\partial z} = \hat{y} - y
$$


The next derivative is the derivative of the output layer scores with respect to the activation of the hidden layer. Because $z = W^{(2) T} \cdot h$, this is the $V \times H$ matrix

$$
\frac{\partial z}{\partial h} = W^{(2) T}
$$

The next derivative is the derivative of the hidden layer activation $h$ with respect to the pre-activation $a$, that is, the derivative of the `tanh` function.

Because `tanh` is applied element-wise, $h_i$ depends only on $a_i$, so the derivative of the vector $h$ with respect to the vector $a$ is a diagonal matrix of dimension $H \times H$ where $H$ is the size of the hidden layer. See @exr-tanh-derivative for the derivative of the `tanh` function.

$$
h = \text{tanh}(a) = \begin{bmatrix} \text{tanh}(a_1) \\ \text{tanh}(a_2) \\ \vdots \\ \text{tanh}(a_H) \end{bmatrix}
$$

$$
\frac{\partial h}{\partial a} = \begin{bmatrix} 1 - \text{tanh}^2(a_1) & 0 & \cdots & 0 \\ 0 & 1 - \text{tanh}^2(a_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 - \text{tanh}^2(a_H) \end{bmatrix}
$$
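Since this matrix is diagonal, multiplying a row vector by it is the same as an element-wise product with the vector $1 - h^2$, so the $H \times H$ matrix never has to be built in practice. A short sketch with random numbers (the vector `v` is just a stand-in for whatever arrives from the output layer):

```python
import numpy as np

rng = np.random.default_rng(1)
H = 4                            # hidden layer size (arbitrary for this check)
a = rng.normal(size=H)           # hidden pre-activations
h = np.tanh(a)
v = rng.normal(size=H)           # stand-in for the row vector coming from the output layer

J = np.diag(1.0 - h**2)          # the H x H matrix dh/da
print(np.allclose(v @ J, v * (1.0 - h**2)))  # True
```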

Multiplying the derivatives up to the last term we get:

$$
\begin{align}
\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial h} \cdot \frac{\partial h}{\partial a} & = (\hat{y} - y)^T W^{(2) T} \cdot \text{diag}(1 - \text{tanh}^2(a))\\
& = \underbrace{\left((\hat{y} - y)^T W^{(2) T}\right) \odot (1 - h^2)^T}_{1 \times H}
\end{align}
$${#eq-bi-gram-hidden-weights-partial}

For the last derivative we need to follow the same step as in the logistic regression model, so the full derivative is the outer product of @eq-bi-gram-hidden-weights-partial and the input vector.

$$
\underbrace{\frac{\partial L}{\partial W^{(1)}}}_{V \times H} = \underbrace{x}_{V \times 1} \underbrace{\left((\hat{y} - y)^T W^{(2) T}\right) \odot (1 - h^2)^T}_{1 \times H}
$$


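For $n$ bi-grams, we stack the input vectors as the rows of a matrix $X$ and the output vectors as the rows of a matrix $Y$ (both $n \times V$). The predictions $\hat{Y}$ and the hidden activations are stacked in the same way; the matrix of hidden activations is also written $H$ and has dimension $n \times H$. The gradients of the loss function with respect to the weights are then given by:

$$
\begin{align*}
\frac{\partial L}{\partial W^{(2)}} & = \underbrace{H^T}_{H \times n} \underbrace{(\hat{Y} - Y)}_{n \times V} \\
\frac{\partial L}{\partial W^{(1)}} & = \underbrace{X^T}_{V \times n} \underbrace{\left((\hat{Y} - Y) \cdot W^{(2) T}\right) \odot (1 - H^2)}_{n \times H}
\end{align*}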
$$
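A useful habit when deriving gradients by hand is to compare them with finite differences. The sketch below does this for a tiny random problem. All helper names (`loss_fn`, `analytic_grads`, `numeric_grad`) and the sizes are ours, and the analytic gradient uses $W^{(2) T}$ in the hidden-layer term, as the matrix dimensions require.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def loss_fn(W1, W2, X, Y):
    """Total cross-entropy over the n rows of X and Y."""
    Y_hat = softmax(np.tanh(X @ W1) @ W2)
    return -np.sum(Y * np.log(Y_hat))


def analytic_grads(W1, W2, X, Y):
    """Gradients from the matrix formulas above."""
    Hid = np.tanh(X @ W1)                          # hidden activations, (n, D)
    Y_hat = softmax(Hid @ W2)                      # predictions, (n, V)
    gW2 = Hid.T @ (Y_hat - Y)                      # (D, V)
    gW1 = X.T @ (((Y_hat - Y) @ W2.T) * (1.0 - Hid**2))  # (V, D)
    return gW1, gW2


def numeric_grad(f, W, eps=1e-6):
    """Central finite differences, perturbing one entry of W at a time."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f()
        W[idx] = old - eps
        f_minus = f()
        W[idx] = old
        G[idx] = (f_plus - f_minus) / (2 * eps)
    return G


rng = np.random.default_rng(0)
V, D, n = 5, 3, 4                                  # toy sizes
W1 = rng.normal(size=(V, D))
W2 = rng.normal(size=(D, V))
X = np.eye(V)[rng.integers(0, V, size=n)]          # random one-hot inputs
Y = np.eye(V)[rng.integers(0, V, size=n)]          # random one-hot targets

gW1, gW2 = analytic_grads(W1, W2, X, Y)
nW1 = numeric_grad(lambda: loss_fn(W1, W2, X, Y), W1)
nW2 = numeric_grad(lambda: loss_fn(W1, W2, X, Y), W2)
print(np.allclose(gW1, nW1, atol=1e-5), np.allclose(gW2, nW2, atol=1e-5))
```

If both comparisons hold, the formulas and their implementation agree with the numerical derivative of the loss.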

:::{#exr-tanh-derivative}
## Derivative of the tanh Function

Calculate the derivative of the `tanh` function with respect to its input $x$.

:::

:::{.callout-note}
## Solution (click to expand)

We just need to apply the quotient rule to the `tanh` function. The derivative of the `tanh` function is:

$$
\begin{align*}
\frac{d}{dx} \text{tanh}(x) & = \frac{d}{dx} \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right) \\
& = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2} \\
& = \frac{(e^x + e^{-x})^2}{(e^x + e^{-x})^2} - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} \\
& = \frac{\cancel{(e^x + e^{-x})^2}}{\cancel{(e^x + e^{-x})^2}} - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2 \\
& = 1 - \text{tanh}^2(x)
\end{align*}
$$

:::
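The result $1 - \text{tanh}^2(x)$ can also be verified numerically with a central finite difference. The grid of points and the step size `eps` below are arbitrary choices for this check.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
eps = 1e-6

numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)  # finite-difference slope
analytic = 1.0 - np.tanh(x) ** 2                             # closed-form derivative

print(np.allclose(numeric, analytic, atol=1e-8))  # True
```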





## Implementation

Note that the neural bi-gram model is just a minor modification of the logistic regression model. We can literally copy the code from the logistic regression model and modify it to include the hidden layer. We need to change two things:

- Instead of a single large matrix of weights connecting the input layer to the output layer, we now have two matrices of weights, one connecting the input layer to the hidden layer and one connecting the hidden layer to the output layer.
- We need to add the `tanh` function to the forward pass and its derivative to the backward pass.

```python
# We copy the helper functions from the previous notebook here

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize_doc(text: str) -> tuple[list[list[str]], dict[str, int], dict[int, str]]:
    """Split the text into sentences of lowercase tokens and build the vocabulary."""
    sentences = []
    text_doc = nlp(text)

    # The special tokens marking sentence boundaries get the first two indices
    word2idx = {
        "BEGINNING": 0,
        "END": 1
    }
    idx2word = {
        0: "BEGINNING",
        1: "END"
    }
    
    for sentence in text_doc.sents:
        tokens = ["BEGINNING"]
        
        for token in sentence:
            # Skip whitespace and punctuation
            if token.is_space or token.is_punct:
                continue

            token_normalized = token.lower_ 
            tokens.append(token_normalized)
            
            # Assign the next free index to words we have not seen before
            if token_normalized not in word2idx:            
                idx = len(word2idx)
                word2idx[token_normalized] = idx
                idx2word[idx] = token_normalized
        
        tokens.append("END")
        sentences.append(tokens)

    return sentences, word2idx, idx2word

def tokens_to_index(tokens: list[str], word2idx: dict[str, int]) -> list[int]:
    """Map a list of tokens to their vocabulary indices."""
    indexes_list = []
    
    for token in tokens:
        indexes_list.append(word2idx[token])
        
    return indexes_list
```

```python
# Again, define the softmax function and the training function

def softmax(x):
    # Subtracting the row-wise maximum avoids overflow in np.exp
    # and does not change the result
    exp_x = np.exp(x - x.max(axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)
    
def train_neural_bi_gram_model(
        sentences: list[list[int]],
        D: int,
        V: int,
        learning_rate: float = 0.01,
        epochs: int = 100
        ):
    """
    Train a neural bi-gram model using the gradient descent algorithm.

    Args:
    sentences: list of lists of integers representing the sentences in the corpus
               (the list is shuffled in place at each epoch)
    D: the size of the hidden layer (called H in the formulas above)
    V: the size of the vocabulary
    learning_rate: the learning rate of the gradient descent algorithm
    epochs: the number of epochs to train the model

    Returns:
    W1: the input-to-hidden weights, shape (V, D)
    W2: the hidden-to-output weights, shape (D, V)
    losses: the average loss per bi-gram for each processed sentence
    """

    # As before, we will store the loss values in a list for plotting later
    losses = []

    # Initialize the weights to small random values
    # We divide by the square root of the number of inputs of each layer
    # (V for W1, D for W2) to make the weights smaller

    W1 = np.random.randn(V, D) / np.sqrt(V)
    W2 = np.random.randn(D, V) / np.sqrt(D)
    
    for epoch in range(epochs):
        # shuffle sentences at each epoch so that the model sees the sentences in a different order
        # at each epoch

        np.random.shuffle(sentences)
        
        # Set up a counter to keep track of the number of sentences processed
        i = 0
        
        for sentence in sentences:
            # convert sentence into one-hot encoded inputs and targets
            
            # An example sentence has the form ["BEGINNING", "hello", "my", "name", "is", "john", "END"]
            # Only with the word indices instead of the words
            # It has n = 7 words and therefore n - 1 = 6 bi-grams
            # So the inputs and targets matrices have the shape (n - 1, V),
            # with one one-hot row per bi-gram
            
            # This is the same as in the logistic regression model
            n = len(sentence)
            
            X = np.zeros((n - 1, V))
            Y = np.zeros((n - 1, V))
            X[np.arange(n - 1), sentence[:n-1]] = 1
            Y[np.arange(n - 1), sentence[1:]] = 1
            
            # Compute the predicted probabilities (forward pass)
            
            # DIFFERENT FROM LOGISTIC REGRESSION
            # Now the prediction is done in two steps

            # First, we compute the hidden layer pre-activation
            A = X.dot(W1)
            
            # Then the hidden layer activation
            H = np.tanh(A)

            # We map the hidden layer activation to the output layer
            z = H.dot(W2)

            # We pass the output layer through the softmax function
            # to produce the predicted probabilities
            Y_hat = softmax(z)
            

            # AGAIN, DIFFERENT FROM LOGISTIC REGRESSION

            # Compute the gradients of the loss with respect to the weights

            # The gradient of the loss with respect to the output layer weights
            # is the difference between the predicted probabilities and the actual targets
            # multiplied by the hidden layer activation
            gW2 = H.T.dot(Y_hat - Y)

            # For the hidden layer weights, the errors are propagated back through W2,
            # multiplied element-wise by the tanh derivative (1 - H^2),
            # and then multiplied by the inputs
            gW1 = X.T.dot((Y_hat - Y).dot(W2.T) * (1 - H * H))

            # Once we have the gradients, we update the weights

            W2 = W2 - learning_rate * gW2
            W1 = W1 - learning_rate * gW1
                        
            # THE LOSS COMPUTATION IS THE SAME AS IN THE LOGISTIC REGRESSION MODEL

            # Compute the average loss per bi-gram (using the predictions made
            # before the update) and store it so that we can plot it later
            loss = -np.sum(Y * np.log(Y_hat)) / (n - 1)
            losses.append(loss)

            # Print some information about the training process every 10 sentences            
            if i % 10 == 0:
                print("epoch:", epoch, "sentence: %s/%s" % (i, len(sentences)), "loss:", loss)
            
            # Increment the sentence counter
            i += 1
    
    return W1, W2, losses
```
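
Once training has finished, the two weight matrices can be used to look up the next-word distribution for any word. The sketch below is one possible way to do this and is not part of the original notebook: `next_word_distribution` is a hypothetical helper, and random matrices stand in for trained weights so that the block runs on its own.

```python
import numpy as np


def next_word_distribution(W1, W2, word_idx):
    """Return P(next word | word_idx) for weights of shape (V, D) and (D, V)."""
    h = np.tanh(W1[word_idx])     # same as tanh(one_hot.dot(W1)): a simple row lookup
    z = h.dot(W2)                 # output scores, shape (V,)
    e = np.exp(z - z.max())       # numerically stable softmax
    return e / e.sum()


# Random stand-ins for trained weights (hypothetical sizes)
rng = np.random.default_rng(0)
V, D = 6, 3
W1 = rng.normal(size=(V, D)) / np.sqrt(V)
W2 = rng.normal(size=(D, V)) / np.sqrt(D)

# Index 0 is the "BEGINNING" token in the vocabulary built by tokenize_doc
p = next_word_distribution(W1, W2, word_idx=0)
print(p.round(3), p.argmax())
```

Because the input is one-hot, the matrix product with `W1` reduces to selecting a single row, so the rows of `W1` serve as `D`-dimensional word vectors that are much shorter than the `V`-dimensional vectors of the logistic regression model.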