# Logistic Regression Model

In the previous notebook we counted the number of times each bi-gram appeared in the text. We can also formulate this problem as a classification problem and use a logistic regression model to predict the second word of a bi-gram given the first word.

Let's start by looking at a simple sentence (the same one we used in the previous notebook) and see how we can formulate the classification problem.

```
Be brave as brave can be.
```

We convert this sentence into lower case and remove the punctuation to get the following bi-grams:

- BEGINNING be
- be brave
- brave as
- as brave
- brave can
- can be
- be END

In the counting approach we did not need a vector representation of the words but for the logistic regression model the inputs and the outputs must be numeric. We can use a one-hot encoding to represent the words. In this encoding each word is represented by a vector of zeros with a one at the index of the word in the vocabulary. In the example we have the following 6 tokens in the vocabulary:

```
vocabulary = ['BEGINNING', 'END', 'be', 'brave', 'as', 'can']
```
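As a minimal sketch of the two steps above, the snippet below builds the bi-grams for the example sentence and one-hot encodes a word. The helper names `make_bigrams` and `one_hot` are illustrative and not part of the notebook, and the `split`-based tokenization is an assumption that is only adequate for this toy sentence.

```python
import string

import numpy as np

vocabulary = ["BEGINNING", "END", "be", "brave", "as", "can"]
word2idx = {word: idx for idx, word in enumerate(vocabulary)}


def make_bigrams(sentence: str) -> list[tuple[str, str]]:
    """Lower-case, strip punctuation, and pair each token with its successor."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = ["BEGINNING"] + cleaned.split() + ["END"]
    return list(zip(tokens[:-1], tokens[1:]))


def one_hot(word: str) -> np.ndarray:
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vector = np.zeros(len(vocabulary))
    vector[word2idx[word]] = 1.0
    return vector


bigrams = make_bigrams("Be brave as brave can be.")
print(bigrams)           # seven (first word, second word) pairs
print(one_hot("brave"))  # 1 at index 3, zeros elsewhere
```

Each bi-gram then becomes one training example: the one-hot vector of the first word is the input and the one-hot vector of the second word is the target.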

## Implementation

The logistic regression model should predict the probability of the second word in the bi-gram given the first word. To make the example clear, consider a vocabulary of only 3 words: ["be", "brave", "as"] and the bi-gram ["be", "brave"].

```mermaid
graph LR
    A1[1] -->|w11| B1[z1] --> SM
    A1 -->|w12| B2[z2] --> SM
    A1 -->|w13| B3[z3] --> SM
    A2[0] -->|w21| B1 
    A2 -->|w22| B2
    A2 -->|w23| B3
    A3[0] -->|w31| B1
    A3 -->|w32| B2
    A3 -->|w33| B3
    SM[Softmax]  --> P1[P1] --> Y1[0]
    SM --> P2[P2] --> Y2[1]
    SM --> P3[P3] --> Y3[0]
```

The one-hot encoding of the first word "be" is `[1, 0, 0]` and the one-hot encoding of the second word "brave" is `[0, 1, 0]`. Given the input vector `x = [1, 0, 0]` the model should output a vector of length 3 with the probabilities of each word in the vocabulary. The output vector `y = [P1, P2, P3]` is the result of the softmax function applied to the product of the transposed weight matrix `W` and the input vector `x`.

## The Model for a Single Bi-gram

Let $x$ be the one-hot-encoded vector of the first word in the bi-gram and $y$ be the one-hot-encoded vector of the second word in the bi-gram. The model for a single bi-gram is:

$$
\begin{align*}
z & = W^T x \in \mathbb{R}^3 \\
\hat{y} & = \text{softmax}(z) \in \mathbb{R}^3
\end{align*}
$$

where $W$ is the weight matrix of size $3 \times 3$.
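The forward computation for the bi-gram ["be", "brave"] can be sketched in NumPy as follows. The weights are random, untrained values drawn with a fixed seed purely for illustration, so the resulting probabilities carry no meaning.

```python
import numpy as np

x = np.array([1.0, 0.0, 0.0])  # one-hot vector of the first word, "be"

rng = np.random.default_rng(0)  # fixed seed, illustration only
W = rng.normal(size=(3, 3))     # arbitrary (untrained) weights

z = W.T @ x                     # scores, one per word in the vocabulary
exp_z = np.exp(z - z.max())     # shifting by the maximum does not change the softmax
y_hat = exp_z / exp_z.sum()     # probabilities [P1, P2, P3]

print(z)
print(y_hat, y_hat.sum())
```

Because `x` is one-hot, `W.T @ x` simply selects the row of `W` that belongs to the input word, so each row of `W` holds the scores for the words that may follow that word.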

## Backward Pass (Gradient Descent)

Now that we have computed the loss for the batch in the forward pass, we can compute the gradients of the loss with respect to the weights. Here it is convenient to use matrix notation and vectorized operations.

For a single observation $i$, the predicted probabilities are:

$$
\begin{align*}
z & = W^T x \\
\hat{y} & = \text{softmax}(z)
\end{align*}
$$

The cross-entropy loss is a scalar value that depends on the weights $W$. The gradient of the loss with respect to the weights is a matrix of the same shape as $W$.

$$
\text{CE}(W) = - \sum_{k=1}^K y_{k} \log \hat{y}_{k}
$$

The chain rule tells us that the gradient of the loss with respect to the weights is:

$$
\frac{\partial \text{CE}(W)}{\partial W} = \frac{\partial \text{CE}(W)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial W}
$$

The derivative of the cross-entropy loss with respect to $z$ is actually quite simple. As we are differentiating a mapping from $\mathbb{R}^{K \times 1}$ to $\mathbb{R}$, the derivative is a vector of the same shape as $z$. It is easiest to derive it with respect to a single element of $z$, say $z_{k'}$. The first thing to notice is that the loss depends on $z_{k'}$ through every $\hat{y}_{k}$: each $\hat{y}_{k}$ depends on $z_{1}$, $z_{2}$, ..., $z_{K}$ because the softmax function divides each exponentiated element by the _sum of all exponentiated elements_. The chain rule tells us that the derivative of the loss with respect to $z_{k'}$ is:

$$
\frac{\partial \text{CE}(W)}{\partial z_{k'}} = - \sum_{k=1}^K \frac{y_{k}}{\hat{y}_{k}}\frac{\partial \hat{y}_{k}}{\partial z_{k'}}
$$

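Note that we are using $k'$ as the index of the class in the derivative to avoid confusion with the summation index $k$.

To evaluate $\partial \hat{y}_{k} / \partial z_{k'}$ we need to consider two cases. The first case is when $k = k'$. Using the quotient rule, the derivative is:

$$
\begin{align*}
\frac{\partial}{\partial z_{k}} \hat{y}_{k} & = \frac{\partial}{\partial z_{k}} \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \\
& = \frac{e^{z_{k}}\sum_{j=1}^K e^{z_{j}} - e^{z_{k}}e^{z_{k}}}{\left(\sum_{j=1}^K e^{z_{j}}\right)^2} \\
& = \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \left(1 - \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}}\right) \\
& = \hat{y}_{k} (1 - \hat{y}_{k})
\end{align*}
$$

The second case is when $k \neq k'$. In this case, the derivative is: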

$$
\begin{align*}
\frac{\partial}{\partial z_{k'}} \hat{y}_{k} & = \frac{\partial}{\partial z_{k'}} \frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \\
& = \frac{0 - e^{z_{k}}e^{z_{k'}}}{\left(\sum_{j=1}^K e^{z_{j}}\right)^2} \\
& = -\frac{e^{z_{k}}}{\sum_{j=1}^K e^{z_{j}}} \frac{e^{z_{k'}}}{\sum_{j=1}^K e^{z_{j}}} \\
& = -\hat{y}_{k} \hat{y}_{k'}
\end{align*}
$$


We can combine both cases into a single expression using the Kronecker delta $\delta_{kk'}$ which is 1 if $k = k'$ and 0 otherwise:

$$
\delta_{kk'} = \begin{cases}
1 & \text{if } k = k' \\
0 & \text{if } k \neq k'
\end{cases}
$$

$$
\begin{align*}
\frac{\partial}{\partial z_{k'}} \hat{y}_{k} & = \hat{y}_{k} (\delta_{kk'} - \hat{y}_{k'})
\end{align*}
$$

You can check that this expression is correct by verifying that it gives the correct results for the two cases we considered above.
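The expression can also be checked numerically. The sketch below compares the analytic Jacobian with a central finite-difference approximation; the scores in `z` and the step size `eps` are arbitrary choices made for this illustration.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax of a single score vector."""
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()


z = np.array([0.5, -1.0, 2.0])  # arbitrary scores
y_hat = softmax(z)
K = len(z)

# Analytic Jacobian: entry (k, k') is y_hat[k] * (delta_kk' - y_hat[k'])
analytic = np.diag(y_hat) - np.outer(y_hat, y_hat)

# Numerical Jacobian: perturb one score at a time (central differences)
eps = 1e-6
numeric = np.zeros((K, K))
for k_prime in range(K):
    step = np.zeros(K)
    step[k_prime] = eps
    numeric[:, k_prime] = (softmax(z + step) - softmax(z - step)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```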

Now we are ready to substitute this derivative into the expression for the derivative of the loss with respect to $z_{k'}$:

$$
\begin{align*}
\frac{\partial \text{CE}(W)}{\partial z_{k'}} & = - \sum_{k=1}^K \frac{y_{k}}{\hat{y}_{k}}\frac{\partial \hat{y}_{k}}{\partial z_{k'}} \\
& = - \sum_{k=1}^K \frac{y_{k}}{\cancel{\hat{y}_{k}}} \cancel{\hat{y}_{k}} (\delta_{kk'} - \hat{y}_{k'}) \\
& = - \sum_{k = 1}^{K} y_{k} (\delta_{kk'} - \hat{y}_{k'})
\end{align*}
$$

The sum simplifies nicely because of the special structure of $\delta_{kk'}$ and $y_{k}$. Expanding it gives

$$
\sum_{k = 1}^{K} y_{k} (\delta_{kk'} - \hat{y}_{k'}) = \sum_{k = 1}^{K} y_{k} \delta_{kk'} - \sum_{k = 1}^{K} y_{k} \hat{y}_{k'}
$$

Only two observations are needed now. In the first sum we are multiplying $y_{k}$ by $\delta_{kk'}$. The Kronecker delta is 1 only when $k = k'$, so only the term with $k = k'$ survives.

$$
\sum_{k = 1}^{K} y_{k} \delta_{kk'} = y_{k'}
$$

For the second sum, notice that the $\hat{y}_{k'}$ does not depend on the summation index $k$. Therefore, we can take it out of the sum.

$$
\sum_{k = 1}^{K} y_{k} \hat{y}_{k'} = \hat{y}_{k'} \sum_{k = 1}^{K} y_{k} = \hat{y}_{k'}
$$

The last equality is true because $y$ is a one-hot encoded vector marking the true class of the observation. Therefore, the sum over all of its elements is 1.

In the end, the derivative of the loss with respect to the scores $z$ is:

$$
\frac{\partial \text{CE}(W)}{\partial z_{k'}} = (-y_{k'} + \hat{y}_{k'}) = \hat{y}_{k'} - y_{k'}
$$

So the $k'$-th element of the gradient of the loss with respect to $z$ is the difference between the predicted probability and the true label. We can write this in vector form:

$$
\frac{\partial \text{CE}(W)}{\partial z} = \hat{y} - y
$$
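As a small worked example with assumed numbers (not taken from the notebook), suppose the model predicts $\hat{y} = (0.2, 0.7, 0.1)^T$ and the true class is the second one, so $y = (0, 1, 0)^T$. The loss is $-\log 0.7 \approx 0.357$ and the gradient with respect to the scores is

$$
\frac{\partial \text{CE}(W)}{\partial z} = \begin{bmatrix}
0.2 \\
0.7 \\
0.1
\end{bmatrix} - \begin{bmatrix}
0 \\
1 \\
0
\end{bmatrix} = \begin{bmatrix}
0.2 \\
-0.3 \\
0.1
\end{bmatrix}
$$

The entry for the true class is negative, so a gradient descent step increases $z_2$, while the scores of the two wrong classes are decreased in proportion to the probability they received.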

What remains now is to compute the derivative of $z = W^T x$ with respect to $W$. Here it is helpful to consider the derivative with respect to a single weight $W_{ij}$ and consider a small example.

Let $W$ be a $3 \times 4$ matrix:

$$
z = W^T x = \begin{bmatrix}
w_{11} & w_{21} & w_{31} \\
w_{12} & w_{22} & w_{32} \\
w_{13} & w_{23} & w_{33} \\
w_{14} & w_{24} & w_{34}
\end{bmatrix} \begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix} = \begin{bmatrix}
w_{11}x_1 + w_{21}x_2 + w_{31}x_3 \\
w_{12}x_1 + w_{22}x_2 + w_{32}x_3 \\
w_{13}x_1 + w_{23}x_2 + w_{33}x_3 \\
w_{14}x_1 + w_{24}x_2 + w_{34}x_3
\end{bmatrix} = 
\begin{bmatrix}
z_1 \\
z_2 \\
z_3 \\
z_4
\end{bmatrix}
$$

The derivative of $z_{k}$ with respect to $W_{ij}$ is:

$$
\frac{\partial z_{k}}{\partial W_{ij}} = \begin{cases}
x_{i} & \text{if } j = k \\
0 & \text{if } j \neq k
\end{cases}
$$

because $z_{k} = \sum_{i} W_{ik} x_{i}$ involves only the $k$-th column of $W$. So the derivative of the whole vector $z$ with respect to a single weight, say $W_{i1}$, is again a vector of the same shape as $z$ (and analogously for $W_{i4}$):

$$
\frac{\partial z}{\partial W_{i1}} = \begin{bmatrix}
x_{i} \\
0 \\
0 \\
0
\end{bmatrix}
\quad
\frac{\partial z}{\partial W_{i2}} = \begin{bmatrix}
0 \\
x_{i} \\
0 \\
0
\end{bmatrix}
\quad
\frac{\partial z}{\partial W_{i3}} = \begin{bmatrix}
0 \\
0 \\
x_{i} \\
0
\end{bmatrix}
$$

Combining this result with the chain rule, only the term with $k = j$ survives:

$$
\frac{\partial \text{CE}(W)}{\partial W_{ij}} = \sum_{k=1}^K \frac{\partial \text{CE}(W)}{\partial z_{k}} \frac{\partial z_{k}}{\partial W_{ij}} = (\hat{y}_{j} - y_{j}) x_{i}
$$

The gradient of the loss with respect to the weights is therefore a matrix of the same shape as $W$ whose $ij$-th element is the product of the $i$-th element of the input vector $x$ and the $j$-th element of the prediction error $\hat{y} - y$. This is the outer product of the input vector and the prediction error:

$$
\frac{\partial \text{CE}(W)}{\partial W} = x (\hat{y} - y)^T
$$
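A finite-difference check is a simple way to gain confidence in a gradient formula. The sketch below uses the $3 \times 4$ shape from the example and arranges the analytic gradient in the same shape as $W$, so that entry $(i, j)$ equals $x_i (\hat{y}_j - y_j)$. The random weights, the input vector, and the helper name `loss_fn` are assumptions made for this illustration only.

```python
import numpy as np


def loss_fn(W: np.ndarray, x: np.ndarray, y: np.ndarray):
    """Cross-entropy of softmax(W^T x) against the one-hot target y."""
    z = W.T @ x
    exp_z = np.exp(z - z.max())
    y_hat = exp_z / exp_z.sum()
    return -np.sum(y * np.log(y_hat)), y_hat


rng = np.random.default_rng(0)      # fixed seed, illustration only
W = rng.normal(size=(3, 4))         # 3 inputs, 4 classes
x = np.array([0.2, -0.5, 1.0])      # not one-hot, so every weight gets a gradient
y = np.array([0.0, 0.0, 1.0, 0.0])  # true class is the third one

_, y_hat = loss_fn(W, x, y)
analytic = np.outer(x, y_hat - y)   # shape (3, 4), same as W

# Perturb one weight at a time and measure the change in the loss
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(W + step, x, y)[0] - loss_fn(W - step, x, y)[0]) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```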


In [50]:
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

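def tokenize_doc(text: str) -> tuple[list[list[str]], dict[str, int], dict[int, str]]:
    """Split the text into sentences of lower-cased tokens and build the vocabulary mappings."""
    sentences = []
    text_doc = nlp(text)
    # The two special tokens always occupy the first two indices
    word2idx = {
        "BEGINNING": 0,
        "END": 1
    }
    idx2word = {
        0: "BEGINNING",
        1: "END"
    }
    
    for sentence in text_doc.sents: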
        tokens = ["BEGINNING"]
        
        for token in sentence:
            if token.is_space or token.is_punct:
                continue

            token_normalized = token.lower_ 
            tokens.append(token_normalized)
            
            if token_normalized not in word2idx:            
                idx = len(word2idx)
                word2idx[token_normalized] = idx
                idx2word[idx] = token_normalized
        
        tokens.append("END")
        sentences.append(tokens)

    return sentences, word2idx, idx2word

In [51]:
sample_sentences, sample_word2idx, sample_idx2word = tokenize_doc("Be brave as brave can be.")

In [52]:
sample_sentences

[['BEGINNING', 'be', 'brave', 'as', 'brave', 'can', 'be', 'END']]

In [53]:
# This will return a one-hot encoded vector with 1 at the index of idx
def one_hot_encode_word(idx: int, vocab_size: int):
    v = np.zeros(vocab_size)
    v[idx] = 1
    return v

In [54]:
# Vector of "be"

idx_be = sample_word2idx["be"]
print(idx_be)

one_hot_encode_word(idx_be, len(sample_word2idx))

2


array([0., 0., 1., 0., 0., 0.])

In [55]:
# It is convenient to have a function that processes the raw text and returns the word indices
def text_to_indexed_sentences(sentences: list, word2idx: dict):
    sentences_with_idx = []
    
    for sentence in sentences:
        sentence_with_idx = []
        
        for word in sentence:
            idx = word2idx[word]    
            sentence_with_idx.append(idx)
            
        sentences_with_idx.append(sentence_with_idx)
    
    return sentences_with_idx

In [56]:
sample_sentences_indexed_list = text_to_indexed_sentences(sample_sentences, sample_word2idx)
sample_sentences_indexed_list

[[0, 2, 3, 4, 3, 5, 2, 1]]

In [57]:
sample_sentence_indexed = sample_sentences_indexed_list[0]

V = len(sample_word2idx)
sample_n = len(sample_sentence_indexed)

sample_inputs = np.zeros((sample_n - 1, V))
sample_targets = np.zeros((sample_n - 1, V))
sample_inputs[np.arange(sample_n - 1), sample_sentence_indexed[:sample_n-1]] = 1
sample_targets[np.arange(sample_n - 1), sample_sentence_indexed[1:]] = 1

In [61]:
# Print the number of words in the sentence and in the vocabulary

print("Number of words in the sentence: ", sample_n)
print("Number of words in the vocabulary: ", V)

Number of words in the sentence:  8
Number of words in the vocabulary:  6


In [None]:
sample_inputs

array([[1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.]])

In [None]:
# The input matrix has one row less than the sentence length (this is the number of bi-grams) and has as many columns as the vocabulary size
# (this is the length of the one-hot encoded vectors).
sample_inputs.shape

(7, 6)

In [60]:
# The target matrix has the same shape as the input matrix.
sample_targets

array([[0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.]])

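Using the cross-entropy loss averaged over the $N$ bi-grams

$$
J(W) = -\frac{1}{N}\sum_{i = 1}^{N} \sum_{j = 1}^{V} y_{ij} \log \hat{y}_{ij}
$$

the gradient descent update rule for the weights is

$$
W^{\text{new}} = W^{\text{old}} - \eta \nabla_{W} J(W)
$$

where $\eta$ is the learning rate. Stacking the one-hot inputs and targets as the rows of the $N \times V$ matrices $X$ and $Y$, and writing $\hat{Y}$ for the row-wise softmax of $XW$, the gradient of the loss with respect to the weights is the average of the single-observation gradients:

$$
\nabla_{W} J = \frac{1}{N} X^T (\hat{Y} - Y)
$$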
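To make the update rule concrete, here is a sketch of a single gradient descent step on the seven bi-grams of the example sentence. The gradient is divided by $N$ to match the $1/N$ factor in $J$. The helper names `softmax_rows` and `cross_entropy`, the seed, and the learning rate of 0.1 are choices made for this illustration.

```python
import numpy as np


def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


def cross_entropy(Y: np.ndarray, Y_hat: np.ndarray) -> float:
    """Mean cross-entropy over the rows (bi-grams)."""
    return -np.sum(Y * np.log(Y_hat)) / Y.shape[0]


# Indexed example sentence: BEGINNING be brave as brave can be END
sentence = [0, 2, 3, 4, 3, 5, 2, 1]
V = 6
X = np.eye(V)[sentence[:-1]]  # one-hot first words, shape (7, 6)
Y = np.eye(V)[sentence[1:]]   # one-hot second words, shape (7, 6)

rng = np.random.default_rng(0)
W = rng.normal(size=(V, V)) / np.sqrt(V)
eta = 0.1

Y_hat = softmax_rows(X @ W)
loss_before = cross_entropy(Y, Y_hat)

grad = X.T @ (Y_hat - Y) / X.shape[0]  # same shape as W
W = W - eta * grad

loss_after = cross_entropy(Y, softmax_rows(X @ W))
print(loss_before, loss_after)         # the loss should decrease after the step
```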



In [None]:
# Next we will define the softmax function that will take a np.array of shape (N, D) and return a np.array of shape (N, D) where each row is
# the softmax of the corresponding row in the input array

def softmax(x):
    # Subtracting the row-wise maximum leaves the result unchanged but avoids overflow in np.exp
    exp_x = np.exp(x - x.max(axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)


def train_logistic(sentences: list[list[int]], V: int, learning_rate: float = 0.01, epochs: int = 100):    
    losses = []

    # Initialize the weights
    W = np.random.randn(V, V) / np.sqrt(V)
    
    for epoch in range(epochs):
        # shuffle sentences at each epoch
        np.random.shuffle(sentences)
        
        j = 0 # keep track of iterations
        for sentence in sentences:
            # convert sentence into one-hot encoded inputs and targets
            
            # An example sentence has the form ["BEGINNING", "hello", "my", "name", "is", "john", "END"]
            # Only with the word indices instead of the words
            # It has n = 7 words and therefore n - 1 = 6 bi-grams
            # So the inputs and targets matrices will each have the shape (n - 1, vocab_size)
            
            n = len(sentence)
            
            inputs = np.zeros((n - 1, V))
            targets = np.zeros((n - 1, V))
            inputs[np.arange(n - 1), sentence[:n-1]] = 1
            targets[np.arange(n - 1), sentence[1:]] = 1
            
            # Compute the loss and update the weights
            # (one possible implementation: mean cross-entropy over the bi-grams of the sentence)
            predictions = softmax(inputs @ W)
            loss = -np.sum(targets * np.log(predictions)) / (n - 1)
            losses.append(loss)
            
            # Gradient descent step using the gradient X^T (Y_hat - Y) / N
            W = W - learning_rate * inputs.T @ (predictions - targets) / (n - 1)
            
            # if j % 10 == 0:
            #     print("epoch:", epoch, "sentence: %s/%s" % (j, len(sentences)), "loss:", loss)
            # j += 1
    
    return W, losses

In [None]:
full_sentences, full_word2idx, full_idx2word = tokenize_doc(alice)
full_sentences_idx = text_to_indexed_sentences(full_sentences, full_word2idx)

In [None]:
train_logistic(full_sentences_idx, len(full_word2idx))
