# The Bi-gram Neural Model

[Open in Colab](https://colab.research.google.com/github/febse/ta2024/blob/main/03-Probabilistic-Language-Models/04-The-Bigram-Neural-Model.ipynb)

The logistic regression model that we employed to estimate the probability of a word given its preceding word produced a vector representation of each word in the vocabulary (the weights of the model). However, these vectors suffer from some drawbacks. First of all, each vector has as many components as there are words in the vocabulary, which can be very large. As a result, the number of parameters in the model is equal to the square of the size of the vocabulary, which makes the model computationally expensive.

We can try to alleviate this problem by using a neural network model with one hidden layer where the dimension of the hidden layer is much smaller than the size of the vocabulary. For a vocabulary of 3 words and a hidden layer of size 2, the model would look like this:

```mermaid
graph LR
    A1[1] -->|wh11| H1[h1]
    A1 -->|wh12| H2[h2]
    A2[0] -->|wh21| H1 
    A2 -->|wh22| H2
    A3[0] -->|wh31| H1
    A3 -->|wh32| H2
    H1 -->|wo11| O1[z1]
    H1 -->|wo21| O2[z2]
    H1 -->|wo31| O3[z3]
    H2 -->|wo12| O1
    H2 -->|wo22| O2
    H2 -->|wo32| O3
    O1 --> SM
    O2 --> SM
    O3 --> SM
    SM[Softmax]  --> P1[P1] --> Y1[0]
    SM --> P2[P2] --> Y2[1]
    SM --> P3[P3] --> Y3[0]
```

Notice that the number of parameters in this model is equal to the number of weights connecting the input layer to the hidden layer plus the number of weights connecting the hidden layer to the output layer. Here we have 6 weights connecting the input layer to the hidden layer and 6 weights connecting the hidden layer to the output layer, for a total of 12 weights. For such a small vocabulary the neural network actually needs more parameters than the logistic regression, which would have only $3 \cdot 3 = 9$ parameters.

However, with a vocabulary of 1000 words and a hidden layer of size 100, the number of parameters in the neural network model is $2 \cdot 1000 \cdot 100 = 200{,}000$, which is much smaller than the one million ($1000^2$) parameters in the logistic regression model.
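The two parameter counts can be compared with a few lines of Python. This is only a sketch: the function names `logistic_params` and `neural_params` are made up for illustration, and bias terms are ignored, as in the text.

```python
def logistic_params(vocab_size: int) -> int:
    """Weights of the bi-gram logistic regression: one V x V matrix."""
    return vocab_size * vocab_size


def neural_params(vocab_size: int, hidden_size: int) -> int:
    """Weights of the one-hidden-layer model: V x H input weights plus H x V output weights."""
    return vocab_size * hidden_size + hidden_size * vocab_size


# The two cases discussed in the text: (V, H) = (3, 2) and (1000, 100).
for V, H in [(3, 2), (1000, 100)]:
    print(V, H, logistic_params(V), neural_params(V, H))
# prints:
# 3 2 9 12
# 1000 100 1000000 200000
```

The neural model has fewer parameters whenever $2 V H < V^2$, that is, whenever the hidden layer is smaller than half the vocabulary.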


To complete the model we must specify an activation function for the hidden layer. As an exercise, let us choose the `tanh` function. The `tanh` function is defined as:

$$
\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$
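As a quick sanity check, the definition above can be evaluated directly and compared with the `tanh` function from Python's standard library. The helper name `tanh_from_definition` is hypothetical and only used here.

```python
import math


def tanh_from_definition(x: float) -> float:
    """Evaluate tanh literally as (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


# Compare the hand-written formula with the library implementation.
for x in [-2.0, 0.0, 0.5, 3.0]:
    print(x, tanh_from_definition(x), math.tanh(x))
```

The literal formula overflows for large $|x|$, so in practice a library implementation such as `math.tanh` or `numpy.tanh` is the safer choice.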

## The Forward Pass

For a single bi-gram with one-hot encoded input $x$ and output $y$, the forward pass of the model is given by:

$$
\begin{align*}
a & = W_{h}^T \cdot x \\
h & = \text{tanh}(a) \\
z & = W_{o}^T \cdot h \\
\hat{y} & = \text{Softmax}(z)
\end{align*}
$$

where $W_{h}$ is the $V \times H$ matrix of weights connecting the input layer to the hidden layer, $W_{o}$ is the $H \times V$ matrix of weights connecting the hidden layer to the output layer, and $\hat{y}$ is the predicted probability distribution over the vocabulary. Here $V$ is the size of the vocabulary and $H$ is the size of the hidden layer.
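The four steps of the forward pass translate almost literally into NumPy. The sketch below assumes that $W_h$ is stored as a $V \times H$ array and $W_o$ as an $H \times V$ array (the shapes that the products $W_h^T x$ and $W_o^T h$ require); the random initialisation and the helper names `softmax` and `forward` are illustrative choices, not part of the model definition.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def forward(x, W_h, W_o):
    """Forward pass for a single one-hot input vector x."""
    a = W_h.T @ x        # hidden pre-activation, length H
    h = np.tanh(a)       # hidden layer output, length H
    z = W_o.T @ h        # output scores, length V
    y_hat = softmax(z)   # predicted distribution over the vocabulary
    return a, h, z, y_hat


rng = np.random.default_rng(0)
V, H = 3, 2                                # sizes from the diagram above
W_h = rng.normal(scale=0.1, size=(V, H))   # assumed random initialisation
W_o = rng.normal(scale=0.1, size=(H, V))

x = np.array([1.0, 0.0, 0.0])              # one-hot vector of the first word
a, h, z, y_hat = forward(x, W_h, W_o)
print(y_hat, y_hat.sum())
```

Because $x$ is one-hot, `W_h.T @ x` simply picks out one row of `W_h`. That row of length $H$ is the low-dimensional vector representation of the input word.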

## The Backward Pass

The loss function for the model is the cross-entropy loss function just as in the logistic regression model.

$$
L(y, \hat{y}) = - \sum_{i} y_i \log(\hat{y}_i)
$$
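For a one-hot target the sum collapses to a single term, the negative log-probability assigned to the correct word. A minimal illustration in plain Python, with made-up probabilities and a hypothetical helper name `cross_entropy`:

```python
import math


def cross_entropy(y, y_hat):
    """Cross-entropy between a target distribution y and a prediction y_hat."""
    # Terms with y_i = 0 contribute nothing, so they are skipped.
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, y_hat) if y_i > 0)


y = [0.0, 1.0, 0.0]       # the second word is the correct next word
y_hat = [0.2, 0.7, 0.1]   # made-up predicted probabilities
loss = cross_entropy(y, y_hat)
print(loss, -math.log(0.7))   # the two numbers coincide
```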

This time we have two sets of weights to update, $W_{h}$ and $W_{o}$.

The gradient of the loss function with respect to the output layer weights has the same form as in the logistic regression model, only this time the output layer receives the hidden layer output $h$ instead of the input vector $x$. With $W_{o}$ of dimension $H \times V$ as in the forward pass, the gradient is the outer product

$$
\underbrace{\frac{\partial L}{\partial W_{o}}}_{H \times V} = \underbrace{h}_{H \times 1} \underbrace{(\hat{y} - y)^T}_{1 \times V}
$$
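One way to gain confidence in this gradient is to compare it with central finite differences. The sketch below assumes $W_o$ is stored with shape $H \times V$ (so that $z = W_o^T h$), which means the gradient is arranged as the outer product of $h$ and the prediction error in that same shape. All names and the random test values are illustrative.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def loss_given_h(W_o, h, y):
    """Cross-entropy loss as a function of the output weights, for fixed h."""
    return -np.sum(y * np.log(softmax(W_o.T @ h)))


rng = np.random.default_rng(1)
H, V = 2, 3
W_o = rng.normal(size=(H, V))
h = np.tanh(rng.normal(size=H))      # some hidden layer output
y = np.array([0.0, 1.0, 0.0])        # one-hot target

# Analytic gradient: entry (i, j) is h_i * (y_hat_j - y_j).
y_hat = softmax(W_o.T @ h)
grad_W_o = np.outer(h, y_hat - y)    # shape H x V, same as W_o

# Numerical gradient by central differences, one weight at a time.
eps = 1e-6
num_grad = np.zeros_like(W_o)
for i in range(H):
    for j in range(V):
        step = np.zeros_like(W_o)
        step[i, j] = eps
        num_grad[i, j] = (loss_given_h(W_o + step, h, y)
                          - loss_given_h(W_o - step, h, y)) / (2 * eps)

max_diff = np.max(np.abs(grad_W_o - num_grad))
print(max_diff)   # should be tiny
```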

We can find the gradient of the loss function with respect to the hidden layer weights by applying the chain rule:

$$
\begin{align*}
\frac{\partial L}{\partial W_{h}} & = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial h} \cdot \frac{\partial h}{\partial a} \cdot \frac{\partial a}{\partial W_{h}}
\end{align*}
$$

We already know the derivative of the cross-entropy loss function with respect to the input of the softmax function, $z$ (the first two factors of the chain combined). It is a vector of size $V$, where $V$ is the size of the vocabulary, containing the prediction errors.

$$
\frac{\partial L}{\partial z} = \hat{y} - y
$$


The next derivative is the derivative of the output layer input $z$ with respect to the output of the hidden layer $h$. Because $z = W_{o}^T \cdot h$, this is the $V \times H$ matrix

$$
\frac{\partial z}{\partial h} = W_{o}^T
$$

The next derivative is the derivative of the hidden layer output $h$ with respect to its pre-activation $a$. This is the derivative of the `tanh` function.

Note that because `tanh` is applied element-wise, each $h_i$ depends only on $a_i$, so the derivative is a diagonal matrix of dimension $H \times H$ where $H$ is the size of the hidden layer. See @exr-tanh-derivative for the derivative of the `tanh` function.

$$
h = \text{tanh}(a) = \begin{bmatrix} \text{tanh}(a_1) \\ \text{tanh}(a_2) \\ \vdots \\ \text{tanh}(a_H) \end{bmatrix}
$$

$$
\frac{\partial h}{\partial a} = \begin{bmatrix} 1 - \text{tanh}^2(a_1) & 0 & \cdots & 0 \\ 0 & 1 - \text{tanh}^2(a_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 - \text{tanh}^2(a_H) \end{bmatrix}
$$
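In code there is no need to build this diagonal matrix explicitly: multiplying a row vector by a diagonal matrix gives the same result as an element-wise product with the diagonal. A small NumPy sketch, where `v` is a placeholder for whatever row vector arrives from the later layers:

```python
import numpy as np

rng = np.random.default_rng(2)
H = 4
a = rng.normal(size=H)            # hidden pre-activations
h = np.tanh(a)                    # hidden layer output
v = rng.normal(size=H)            # placeholder upstream row vector of length H

jacobian = np.diag(1.0 - h**2)    # the H x H diagonal matrix dh/da
via_matrix = v @ jacobian         # row vector times diagonal matrix
via_elementwise = v * (1.0 - h**2)

print(np.allclose(via_matrix, via_elementwise))
```

The element-wise version needs $H$ multiplications instead of $H^2$, which matters once the hidden layer is large.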

Multiplying the derivatives up to the last term (with the prediction error written as a row vector) we get:

$$
\begin{aligned}
\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial h} \cdot \frac{\partial h}{\partial a} & = (\hat{y} - y)^T W_{o}^T \cdot \text{diag}(1 - \text{tanh}^2(a))\\
& = \underbrace{\left((\hat{y} - y)^T W_{o}^T\right) \odot (1 - h^2)^T}_{1 \times H}
\end{aligned}
$$ {#eq-bi-gram-hidden-weights-partial}

Here $h^2$ denotes the element-wise square of $h$ and $\odot$ the element-wise product.

For the last derivative we need to follow the same step as in the logistic regression model: because $a = W_{h}^T \cdot x$, the full derivative is the outer product of the input vector and @eq-bi-gram-hidden-weights-partial.

$$
\underbrace{\frac{\partial L}{\partial W_{h}}}_{V \times H} = \underbrace{x}_{V \times 1} \underbrace{\left((\hat{y} - y)^T W_{o}^T\right) \odot (1 - h^2)^T}_{1 \times H}
$$
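The single-example gradient with respect to $W_h$ can be checked numerically as well. The sketch below assumes $W_h$ has shape $V \times H$ and $W_o$ has shape $H \times V$, as required by $a = W_h^T x$ and $z = W_o^T h$; the variable names and random values are illustrative only.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def loss(W_h, W_o, x, y):
    """Cross-entropy loss of the full model for one bi-gram."""
    h = np.tanh(W_h.T @ x)
    return -np.sum(y * np.log(softmax(W_o.T @ h)))


rng = np.random.default_rng(3)
V, H = 3, 2
W_h = rng.normal(size=(V, H))
W_o = rng.normal(size=(H, V))
x = np.array([1.0, 0.0, 0.0])   # one-hot input word
y = np.array([0.0, 1.0, 0.0])   # one-hot next word

# Analytic gradient via the chain rule.
h = np.tanh(W_h.T @ x)
y_hat = softmax(W_o.T @ h)
delta_a = (W_o @ (y_hat - y)) * (1.0 - h**2)   # dL/da, length H
grad_W_h = np.outer(x, delta_a)                # shape V x H

# Numerical gradient by central differences.
eps = 1e-6
num_grad = np.zeros_like(W_h)
for i in range(V):
    for j in range(H):
        step = np.zeros_like(W_h)
        step[i, j] = eps
        num_grad[i, j] = (loss(W_h + step, W_o, x, y)
                          - loss(W_h - step, W_o, x, y)) / (2 * eps)

max_diff = np.max(np.abs(grad_W_h - num_grad))
print(max_diff)   # should be tiny
```

Since $x$ is one-hot, only the row of `grad_W_h` that belongs to the input word is non-zero: a single bi-gram updates the representation of one word only.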


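For a matrix of input vectors $X$ and output vectors $Y$ with one example per row ($n$ rows), let $\hat{Y}$ be the matrix of predicted probabilities and $H = \text{tanh}(X W_{h})$ the $n \times H$ matrix of hidden layer outputs (in the dimension labels $H$ still denotes the size of the hidden layer). The gradients of the loss function, summed over all examples, with respect to the weights are given by:

$$
\begin{align*}
\frac{\partial L}{\partial W_{o}} & = \underbrace{H^T}_{H \times n} \underbrace{(\hat{Y} - Y)}_{n \times V} \\
\frac{\partial L}{\partial W_{h}} & = \underbrace{X^T}_{V \times n} \underbrace{\left((\hat{Y} - Y) \cdot W_{o}^T\right) \odot (1 - H^2)}_{n \times H}
\end{align*}
$$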
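The batched gradients can be sketched in NumPy as follows. The code assumes one example per row, $W_h$ of shape $V \times H$, $W_o$ of shape $H \times V$, and a loss that is summed (not averaged) over the batch. The names `Hid`, `H_size`, `softmax_rows`, `gradients` and `numerical_grad` are made up for this illustration, and the finite-difference comparison at the end is just a sanity check.

```python
import numpy as np


def softmax_rows(Z):
    """Row-wise numerically stable softmax."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def forward(X, W_h, W_o):
    Hid = np.tanh(X @ W_h)             # n x H hidden layer outputs
    Y_hat = softmax_rows(Hid @ W_o)    # n x V predicted probabilities
    return Hid, Y_hat


def total_loss(X, Y, W_h, W_o):
    _, Y_hat = forward(X, W_h, W_o)
    return -np.sum(Y * np.log(Y_hat))  # summed over the batch


def gradients(X, Y, W_h, W_o):
    Hid, Y_hat = forward(X, W_h, W_o)
    err = Y_hat - Y                                   # n x V
    grad_W_o = Hid.T @ err                            # H x V
    grad_W_h = X.T @ ((err @ W_o.T) * (1 - Hid**2))   # V x H
    return grad_W_h, grad_W_o


def numerical_grad(f, W, eps=1e-6):
    """Central differences of the scalar f() with respect to W (perturbed in place)."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old
        G[idx] = (up - down) / (2 * eps)
    return G


rng = np.random.default_rng(4)
V, H_size, n = 3, 2, 5
W_h = rng.normal(size=(V, H_size))
W_o = rng.normal(size=(H_size, V))
X = np.eye(V)[rng.integers(0, V, size=n)]   # random one-hot inputs
Y = np.eye(V)[rng.integers(0, V, size=n)]   # random one-hot targets

grad_W_h, grad_W_o = gradients(X, Y, W_h, W_o)
f = lambda: total_loss(X, Y, W_h, W_o)
print(np.allclose(grad_W_h, numerical_grad(f, W_h), atol=1e-6))
print(np.allclose(grad_W_o, numerical_grad(f, W_o), atol=1e-6))
```

If the loss is averaged over the batch instead of summed, both gradients are simply divided by $n$.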

:::{#exr-tanh-derivative}
## Derivative of the tanh Function

Calculate the derivative of the `tanh` function with respect to its input $x$.

:::

:::{.callout-note}
## Solution (click to expand)

We just need to apply the quotient rule to the `tanh` function. The derivative of the `tanh` function is:

$$
\begin{align*}
\frac{d}{dx} \text{tanh}(x) & = \frac{d}{dx} \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right) \\
& = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2} \\
& = \frac{(e^x + e^{-x})^2}{(e^x + e^{-x})^2} - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} \\
& = \frac{\cancel{(e^x + e^{-x})^2}}{\cancel{(e^x + e^{-x})^2}} - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2 \\
& = 1 - \text{tanh}^2(x)
\end{align*}
$$

:::
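The result of the exercise, $1 - \text{tanh}^2(x)$, can be checked numerically against a central finite difference. This is only a sanity check in plain Python; the helper names are hypothetical.

```python
import math


def tanh_derivative(x: float) -> float:
    """Closed-form derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def central_difference(f, x: float, eps: float = 1e-6) -> float:
    """Numerical approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# The closed form and the numerical approximation should agree closely.
for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    print(x, tanh_derivative(x), central_difference(math.tanh, x))
```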





:::{.callout-info}
## Assignment

Use the functions in the logistic regression model and adapt them to train the neural bi-gram model.

:::