# Chapter 69 (Recurrent Networks)

### This code runs simulations for Examples 4 and 7 in Chapter 69: Recurrent Neural Networks (vol. III)
TEXT: A. H. Sayed, INFERENCE AND LEARNING FROM DATA, Cambridge University Press, 2022.

<div style="text-align: justify">
DISCLAIMER:  This computer code is  provided  "as is"   without  any  guarantees.
Practitioners  should  use it  at their own risk.  While  the  codes in  the text 
are useful for instructional purposes, they are not intended to serve as examples 
of full-blown or optimized designs. The author has made no attempt at optimizing 
the codes, perfecting them, or even checking them for absolute accuracy. In order 
to keep the codes at a level  that is  easy to follow by students, the author has 
often chosen to  sacrifice  performance or even programming elegance in  lieu  of 
simplicity. Students can use the computer codes to run variations of the examples 
shown in the text. 
</div>

The Jupyter notebook and python codes are developed by Eduardo Faria Cabrera

required libraries:
    
1. numpy
2. matplotlib
3. scipy
4. torch
5. tqdm

In [68]:
import scipy
import numpy as np
import torch
from torch import nn

## Example 69.4 (Sentence Analysis)

In this example, we apply the RNN construction to analyze the structure of a sentence, as well as the mood that is reflected by the same sentence. By structure we mean that the RNN will classify the individual words into nouns, verbs, adjectives, adverbs, pronouns, and so forth. By mood we mean that the RNN will detect whether the sentence is reflecting a good or bad sentiment.

We consider a dictionary consisting of $D=15$ sentences; this is of course a small training set and is only meant for illustration purposes. At the end of each sentence, we indicate its mood or sentiment (good or bad) for later use:

$$
\begin{array}{ll}
\textnormal{  <sos> It was a great play . <eos>  }&\textnormal{ (good)}\\
\textnormal{  <sos> I had a bad experience with the rental car . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> Her flight was delayed for over five hours . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> She passed her exam easily . <eos>   }&\textnormal{ (good)}\\
\textnormal{  <sos> He regularly exercises and goes to the beach . <eos>   }&\textnormal{ (good)}\\
\textnormal{  <sos> The examination was hard and he is nervous . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> They are happy their team won the championship game . <eos>   }&\textnormal{ (good)}\\
\textnormal{  <sos> I do not like action movies . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> I love juicy fruits . <eos>   }&\textnormal{ (good)}\\
\textnormal{  <sos> He was nervous about his visit to the doctor . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> We enjoyed walking in the park . <eos>   }&\textnormal{ (good)}\\
\textnormal{  <sos> The plot in this book is terrible . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> The chair I bought is comfortable and affordable . <eos>   }&\textnormal{ (good)}\\
\textnormal{  <sos> The desk fit splendidly into the room . <eos>   }&\textnormal{ (good)}\\
\textnormal{  <sos> It rained and the program was unfortunately canceled . <eos>}&\textnormal{ (bad) }
\end{array} \tag{69.80}
$$

The beginning and end of each sentence are marked by the commands <sos> (start-of-sentence) and <eos>  (end-of-sentence). We parse through the sentences and collect the individual words (in lower-case letters) into a collection of  alphabetically ordered words (including the punctuation mark "." and the <eos> and <sos> symbols):

$$
\textnormal{ words}=\\
\begin{array}{l}
\Bigl\{\textnormal{ 
 ".",     "eos",     "sos",     "a",     "about",     "action",     "affordable",     "and",     "are",     "bad",}\\ \\
\quad \textnormal{  "beach",     "book",     "bought",     "canceled",     "car",     "chair",     "championship",     "comfortable",}\\ \\
\quad\textnormal{  "delayed",     "desk",     "do",     "doctor",     "easily",     "enjoyed",     "exam",     "examination",}\\ \\
\quad\textnormal{  "exercises",     "experience",     "fit",     "five",     "flight",     "for",     "fruits",     "game",     "goes",}\\ \\
\quad\textnormal{  "great",     "had",     "happy",     "hard",     "he",     "her",     "his",     "hours",     "i",     "in",}\\ \\
\quad\textnormal{  "into",     "is",     "it",     "juicy",     "like",     "love",     "movies",     "nervous",     "not",     "over",}\\ \\
\quad\textnormal{  "park",     "passed",     "play",     "plot",     "program",     "rained",     "regularly",     "rental",     "room",}\\ \\
\quad\textnormal{  "she",     "splendidly",     "team",     "terrible",     "the",     "their",     "they",     "this",     "to",}\\ \\
\quad\textnormal{  "unfortunately",     "visit",     "walking",     "was",     "we",     "with",     "won"}\Bigr\}
\end{array} \tag{69.81}
$$

There is a total of $M=80$ words in this collection. We use one-hot encoding to represent the words (i.e., we use the basis vectors from $\mathbb{R}^{80}$). For example, the word "about" is the fifth word in the collection and it will be represented by the feature vector $h=e_5$. In this way, if we refer to the first sentence corresponding to $d=1$: 

$$
\textnormal{<sos> It was a great play . <eos> } \tag{69.82}
$$

we find that it consists of a sequence of $N_1=8$ words with one-hot encoded vectors given by

$$
\begin{array}{lcll}
h_0&=& e_3,&\;\;\;(\textnormal{ sos})\\
h_1&=& e_{48},&\;\;\;(\textnormal{ it})\\
h_2&=& e_{77},&\;\;\;(\textnormal{ was})\\
h_3&=& e_4,&\;\;\;(\textnormal{ a})\\
h_4&=& e_{36},&\;\;\;(\textnormal{ great})\\
h_5&=& e_{58},&\;\;\;(\textnormal{ play})\\
h_6&=& e_1,&\;\;\;(\textnormal{ .})\\
h_7&=& e_2,&\;\;\;(\textnormal{ eos})
\end{array} \tag{69.83}
$$

where the $\{e_m\}$ refer to basis vectors in $\mathbb{R}^{80}$. We further associate a label with each word. There are $11$ label types:

$$
\textnormal{ labels}
=\left\{\begin{array}{l}\textnormal{ start, article, verb, noun, pronoun, adjective, adverb,}\\
\textnormal{ preposition, conjunction, punctuation, end}
\end{array}\right\} \tag{69.84}
$$

We again use one-hot encoding to represent each label by using the basis vectors from $\mathbb{R}^{11}$. For example, the word "beach" is a noun and it will be associated with the label $\gamma=e_4$, where $e_4$ is the fourth basis vector in $\mathbb{R}^{11}$. In this way, if we return to the same sentence 
(69.82), we find that its constituent words will be represented by the following feature vectors and labels:

$$
\begin{array}{c|c|c|c|c|c|c|c|c}\hline
&\textnormal{ <sos>}& \textnormal{ It} &\textnormal{ was} &\textnormal{ a} &\textnormal{ great}& \textnormal{ play} &{ .} &\textnormal{ <eos>}\\\hline\hline
n: &0&1&2&3&4&5&6&7 \\
h_n: &e_3&e_{48}&e_{77}&e_4&e_{36}&e_{58}&e_1&e_2\\
\gamma_n: &e_1&e_5&e_3&e_2&e_6&e_4&e_{10}&e_{11}\\\hline
\textnormal{ labels}:&\textnormal{ start}&\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ article}&\textnormal{ adjective}&\textnormal{ noun}&\textnormal{ punct.}&\textnormal{ end}\\\hline
\end{array} \tag{69.85}
$$

We also associate with each sentence of index $d$ a mood label given by

$$
\gamma_d=\begin{bmatrix}1\\0\end{bmatrix}\;\; (\textnormal{ bad sentiment}),\;\;\;\;
\gamma_d=\begin{bmatrix}0\\1\end{bmatrix}\;\; (\textnormal{ good sentiment}) \tag{69.86}
$$

In this way, by following this construction, each sentence $d$ will have some length $N_d$ (number of words including start, end, and punctuation), and will be  represented by $N_d$ feature vectors $\{h_n\}\in\mathbb{R}^{80}$ (one for each word) with $n=0,1,\ldots, N_{d}-1$, as well as $N_d$ labels $\{\gamma_n\}\in\mathbb{R}^{11}$ (one for each word), and a single mood label $\gamma_d\in\mathbb{R}^{2}$. All representations employ one-hot encoding. 

The first simulation trains an RNN with $P=30$ internal nodes using algorithm 
(\ref{kajasxafAAAL.rnn.2}) applied to a cross-entropy empirical risk with $\mu=0.01$ and $\rho=0.0001$. We perform $2000$ runs over the $D=15$ sentences; at the start of each run, we reshuffle the sentences randomly. At the end of the training phase, we employ the parameters $\{W^{\star}, U^{\star}, V^{\star}, \theta^{\star}, \alpha^{\star}\}$ to perform testing. We feed the following four test sentences into the RNN: 

$$
\begin{array}{ll}
\textnormal{  <sos> I do not like fruits . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> He is happy with his car . <eos>  }&\textnormal{ (good)}\\
\textnormal{  <sos> She is nervous about the exam . <eos>  }&\textnormal{ (bad)}\\
\textnormal{  <sos> It was a comfortable game . <eos>   }&\textnormal{ (good)}
\end{array} \tag{69.87}
$$

We use the RNN to determine the predicted labels $\widehat{\gamma}_n$ for each word in every sentence. For a given estimate $\widehat{\gamma}_n\in\mathbb{R}^{11}$, the index of its highest entry determines the type of the word. The result of this simulation is shown below, where it is seen that the RNN successfully categorizes the word types in the four sentences (for brevity, we are removing the <sos> and <eos> word categories, which have been correctly identified as well):

$$
\begin{array}{lllllllllllll}
\textnormal{ I}&\textnormal{  do}&\textnormal{ not}&
\textnormal{  like}&\textnormal{  fruits}&\textnormal{  .} \\
\textnormal{ pronoun}&\textnormal{  verb}&\textnormal{ adverb}&
\textnormal{  verb}&\textnormal{  noun}&\textnormal{  punctuation} \\\\
\textnormal{ He}&\textnormal{ is}&\textnormal{ happy}&\textnormal{ with}&\textnormal{ his}&\textnormal{ car}&\textnormal{ .}\\
\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ adjective}&\textnormal{ preposition}&\textnormal{ pronoun}&\textnormal{ noun}&\textnormal{ punctuation}&\\\\
\textnormal{ She}&\textnormal{ is}&\textnormal{ nervous}&\textnormal{ about}&\textnormal{ the}&\textnormal{ exam}&\textnormal{ .}\\
\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ adjective}&\textnormal{ preposition}&\textnormal{ article}&\textnormal{ noun}&\textnormal{ punctuation}\\\\
\textnormal{ It}&\textnormal{ was}&\textnormal{ a}&\textnormal{ comfortable}&\textnormal{ game}&\textnormal{ .}\\
\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ article}&\textnormal{ adjective}&
\textnormal{ noun}&\textnormal{ punctuation}
\end{array}
$$

The second simulation uses algorithm (69.77) to perform sentiment analysis with $\mu=0.001$. We perform $5000$ runs over the $D=15$ sentences; at the start of each run, we reshuffle the sentences randomly. At the end of the training phase, we employ the parameters $\{W^{\star}, U^{\star}, V^{\star}, \theta^{\star}, \alpha^{\star}\}$ to perform sentiment analysis and compute $\widehat{\gamma}_d$. The index of the largest entry on $\widehat{\gamma}_d$ decides whether the mood is bad (first entry) or good (second entry).  We feed the above four test sentences and arrive at the following predicted moods:

$$
\begin{array}{l|l|l}\hline\textnormal{ Sentence}&\textnormal{ Actual mood}&\textnormal{ Predicted mood}\\\hline\hline
\textnormal{  <sos> I do not like fruits . <eos>  }&\textnormal{ (bad)} & \color{red}\textnormal{ (good)}\\
\textnormal{  <sos> He is happy with his car . <eos>  }&\textnormal{ (good)} & \textnormal{ (good)}\\
\textnormal{  <sos> She is nervous about the exam . <eos>  }&\textnormal{ (bad)} & \textnormal{ (bad)}\\
\textnormal{  <sos> It was a comfortable game . <eos>   }&\textnormal{ (good)} & \textnormal{ (good)}\\\hline
\end{array} \tag{69.89}
$$

\smallskip

\noindent There is one error in the prediction of the mood of the first sentence. This is a contrived example that uses a small sample set for training and it is expected that performance can be improved for more extensive training with larger datasets. We obtained similar simulation results by performing 10,000 runs of the bidirectional RNN algorithm from Section 69.3. 

In [4]:
data = scipy.io.loadmat("data/datafile.mat")

In [26]:
n_train = data["D"].item()
n_test = data["T"].item()
word_types = data["TW"].item()
features_size = data["M"].item()
P = 30 # size of hidden layer for RNN

In [62]:
H_train = torch.tensor([[idx.item() if idx.shape == (1, 1) else 0 for idx in sentence] for sentence in data["H_train"]])
H_test = torch.tensor([[idx.item() if idx.shape == (1, 1) else 0 for idx in sentence] for sentence in data["H_test"]])

labels_train = torch.tensor([[idx.item() if idx.shape == (1, 1) else 0 for idx in sentence] for sentence in data["labels_train"]])
labels_test = torch.tensor([[idx.item() if idx.shape == (1, 1) else 0 for idx in sentence] for sentence in data["labels_test"]])

In [81]:
class SentenceAnalysis(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(
            input_size=1, hidden_size=features_size, num_layers=1
        )
        self.linear = nn.Linear(features_size, word_types)
    
    def forward(self, x):
        x_ = self.rnn(x)
        logits = self.linear(x_[0])

        return logits

In [119]:
model = SentenceAnalysis()
lr = 1e-5
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss = nn.CrossEntropyLoss()
epochs = 4000

In [120]:
for epoch in range(epochs):
    train_loss = []
    train_acc = []
    test_loss = []
    test_acc = []
    model.train()
    for x, y in zip(H_train, labels_train):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)
        y_ = y[:final_len].long()-1

        logits = model(x)

        loss_out = loss(logits, y_)
        loss_out.backward()
        optimizer.step()

        class_hat = torch.round(nn.functional.softmax(logits, dim=-1))
        classes = nn.functional.one_hot(y_)

        errors = (classes != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[0] - errors)/x.shape[0]*100


        train_loss.append(loss_out.item())
        train_acc.append(acc)
    
    model.eval()
    for x, y in zip(H_test, labels_test):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)
        y_ = y[:final_len].long()-1

        logits = model(x)

        loss_out = loss(logits, y_)

        class_hat = torch.round(nn.functional.softmax(logits, dim=-1))
        classes = nn.functional.one_hot(y_)

        errors = (classes != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[0] - errors)/x.shape[0]*100


        test_loss.append(loss_out.item())
        test_acc.append(acc)

    if epoch % 100 == 0:
        print(f"Epoch: {epoch}/{epochs}. Training Loss: {round(sum(train_loss)/len(train_loss), 3)}. Training Accuracy: {round(sum(train_acc)/len(train_acc), 3)}. Test Loss: {round(sum(test_loss)/len(test_loss), 3)}. Test Accuracy: {round(sum(test_acc)/len(test_acc), 3)}.")

Epoch: 0/4000. Training Loss: 2.537. Training Accuracy: 0.0. Test Loss: 2.493. Test Accuracy: 0.0.
Epoch: 100/4000. Training Loss: 1.698. Training Accuracy: 8.919. Test Loss: 1.608. Test Accuracy: 11.806.
Epoch: 200/4000. Training Loss: 1.508. Training Accuracy: 18.915. Test Loss: 1.385. Test Accuracy: 23.611.
Epoch: 300/4000. Training Loss: 1.338. Training Accuracy: 21.785. Test Loss: 1.201. Test Accuracy: 29.514.
Epoch: 400/4000. Training Loss: 1.193. Training Accuracy: 27.797. Test Loss: 1.071. Test Accuracy: 35.764.
Epoch: 500/4000. Training Loss: 1.086. Training Accuracy: 42.079. Test Loss: 0.959. Test Accuracy: 50.347.
Epoch: 600/4000. Training Loss: 1.007. Training Accuracy: 46.693. Test Loss: 0.863. Test Accuracy: 61.806.
Epoch: 700/4000. Training Loss: 0.94. Training Accuracy: 49.497. Test Loss: 0.8. Test Accuracy: 64.583.
Epoch: 800/4000. Training Loss: 0.885. Training Accuracy: 52.006. Test Loss: 0.775. Test Accuracy: 70.833.
Epoch: 900/4000. Training Loss: 0.833. Training A

In [158]:
sentence = data["sentences"][0][0].item().split(" ")
sentence_idx = H_train[0][:8]
labels = labels_train[0][:8]
labels_hat = torch.round(nn.functional.softmax(model(sentence_idx.float().unsqueeze(-1)), dim=-1))

print("Word; Word Id; Label; Label Hat")
for word, word_id, label, label_hat in zip(sentence, sentence_idx, labels, labels_hat):
    print(word, word_id.item(), label.item(), (label_hat==1).nonzero().item()+1)

Word; Word Id; Label; Label Hat
START 3 1 1
It 48 5 5
was 77 3 3
a 4 2 2
great 36 6 6
play 58 4 4
. 1 10 10
END 2 11 11


In [203]:
labels_train_sentiment = torch.tensor([sentence.item() for sentence in data["sentiment_train"]])
labels_test_sentiment = torch.tensor([sentence.item() for sentence in data["sentiment_test"]])

In [163]:
class SentimentAnalysis(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(
            input_size=1, hidden_size=features_size, num_layers=1
        )
        self.linear = nn.Linear(features_size, 1)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        x_ = self.rnn(x)
        logits = self.linear(x_[0][-1])
        y = self.sigmoid(logits)

        return y

In [215]:
model_sentiment = SentimentAnalysis()
lr = 1e-5
optimizer = torch.optim.Adam(model_sentiment.parameters(), lr=lr)
loss = nn.BCELoss()
epochs = 1000

In [216]:
for epoch in range(epochs):
    train_loss = []
    train_acc = []
    test_loss = []
    test_acc = []
    model_sentiment.train()
    for x, y in zip(H_train, labels_train_sentiment):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)

        y_hat = model_sentiment(x)

        loss_out = loss(y_hat, y.unsqueeze(-1).float())
        loss_out.backward()
        optimizer.step()

        class_hat = torch.round(y_hat)

        errors = (y != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[-1] - errors)/x.shape[-1]*100


        train_loss.append(loss_out.item())
        train_acc.append(acc)
    
    model_sentiment.eval()
    for x, y in zip(H_test, labels_test_sentiment):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)

        y_hat = model_sentiment(x)

        loss_out = loss(y_hat, y.unsqueeze(-1).float())

        class_hat = torch.round(y_hat)

        errors = (y != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[-1] - errors)/x.shape[-1]*100


        test_loss.append(loss_out.item())
        test_acc.append(acc)

    if epoch % 100 == 0:
        print(f"Epoch: {epoch}/{epochs}. Training Loss: {round(sum(train_loss)/len(train_loss), 3)}. Training Accuracy: {round(sum(train_acc)/len(train_acc), 3)}. Test Loss: {round(sum(test_loss)/len(test_loss), 3)}. Test Accuracy: {round(sum(test_acc)/len(test_acc), 3)}.")

Epoch: 0/1000. Training Loss: 0.702. Training Accuracy: 46.667. Test Loss: 0.697. Test Accuracy: 50.0.
Epoch: 100/1000. Training Loss: 0.593. Training Accuracy: 66.667. Test Loss: 0.561. Test Accuracy: 75.0.
Epoch: 200/1000. Training Loss: 0.343. Training Accuracy: 93.333. Test Loss: 0.365. Test Accuracy: 75.0.
Epoch: 300/1000. Training Loss: 0.053. Training Accuracy: 100.0. Test Loss: 0.239. Test Accuracy: 75.0.
Epoch: 400/1000. Training Loss: 0.004. Training Accuracy: 100.0. Test Loss: 0.598. Test Accuracy: 75.0.
Epoch: 500/1000. Training Loss: 0.001. Training Accuracy: 100.0. Test Loss: 1.002. Test Accuracy: 75.0.
Epoch: 600/1000. Training Loss: 0.0. Training Accuracy: 100.0. Test Loss: 1.04. Test Accuracy: 75.0.
Epoch: 700/1000. Training Loss: 0.0. Training Accuracy: 100.0. Test Loss: 1.091. Test Accuracy: 75.0.
Epoch: 800/1000. Training Loss: 0.0. Training Accuracy: 100.0. Test Loss: 1.247. Test Accuracy: 75.0.
Epoch: 900/1000. Training Loss: 0.0. Training Accuracy: 100.0. Test Lo

In [186]:
sentence = data["sentences"][0][0].item()
sentence_idx = H_train[0][:8]
label = labels_train_sentiment[0]
labels_hat = torch.round(model_sentiment(sentence_idx.float().unsqueeze(-1)))

In [217]:
for i in range(4):
    sentence = data["sentences"][0][i].item()
    final_len = (H_train[i] == 2).nonzero().squeeze().item()+1
    sentence_idx = H_train[i][:final_len]
    label = labels_train_sentiment[i]
    label_hat = torch.round(model_sentiment(sentence_idx.float().unsqueeze(-1)))
    print(sentence, label.item(), label_hat.item())

START It was a great play . END 1 1.0
START I had a bad experience with the rental car . END 0 0.0
START Her flight was delayed for over five hours . END 0 0.0
START She passed her exam easily . END 1 1.0


## Example 69.8 (Predicting word types)

We reconsider the problem from Example 69.4 involving a small training set with $D=15$ sentences for illustration purposes. Enhanced performance would require a larger amount of training data. In this example, we employ an LSTM implementation to predict the type of future words in a sentence. For example, consider the first sentence corresponding to $d=1$:

$$
\textnormal{ <sos> It was a great play . <eos> } \tag{69.230}
$$

The objective is for the network to predict that the word following the start-of-sentence <sos> will be a pronoun, and the one following the pronoun will be a verb, and the one following the verb will be an article, and so forth. We will do so by changing the label $\gamma_n$ that is assigned to each word of index $n$.
For example, if we refer to the tabular data in (69.85) for the first sentence, the entries in the last two rows corresponding to the values of $\gamma_n$ (and their interpretation) will be shifted to the left and take the form shown below:

$$
\begin{array}{c|c|c|c|c|c|c|c|c}\hline
&\textnormal{ <sos>}& \textnormal{ It} &\textnormal{ was} &\textnormal{ a} &\textnormal{ great}& \textnormal{ play} &{ .} &\textnormal{ <eos>}\\\hline\hline
n: &0&1&2&3&4&5&6&7 \\
h_n: &e_3&e_{48}&e_{77}&e_4&e_{36}&e_{58}&e_1&e_2\\
\gamma_n: &e_5&e_3&e_2&e_6&e_4&e_{10}&e_{11}&--\\\hline
\textnormal{ labels}:&\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ article}&\textnormal{ adjective}&\textnormal{ noun}&\textnormal{ punct.}&\textnormal{ end}&--\\\hline
\end{array} \tag{69.231}
$$

For instance, the label associated with the word "great" will now be $e_4$, which corresponds to the type "noun". This means that the next word in the sentence is of the type "noun". We therefore adjust the labels for all words in the $D=15$ sentences in this manner and perform $5000$ runs of algorithm (69.220) over the $D=15$ sentences assuming a regularized cross-entropy risk function (where we use instead the expression $\lambda_n=\widehat{\gamma}_n-\gamma_n$ in the listing of the algorithm and the output layer involves a softmax construction). At the start of each run, we reshuffle the sentences randomly. At the end of the training phase, we employ the learned parameters  to perform testing. We feed the same four test sentences into the trained LSTM: 

$$
\begin{array}{l}
\textnormal{  <sos> I do not like fruits . <eos>  }\\
\textnormal{  <sos> He is happy with his car . <eos>  }\\
\textnormal{  <sos> She is nervous about the exam . <eos>  }\\
\textnormal{  <sos> It was a comfortable game . <eos>   }
\end{array} \tag{69.232}
$$

\noindent We use the LSTM to predict the word type of future words in every sentence. For a given estimate $\widehat{\gamma}_n\in\mathbb{R}^{11}$, the index of its highest entry determines the type predicted for the next word. The result of this simulation is shown below. Some errors marked in color, occur due to the limited amount of data used to train the network in this example; the intent is to  illustrate the operation of the LSTM network and its training algorithm:

$$
\begin{array}{llllllllllllll}
\textnormal{ <sos>}&\textnormal{ I}&\textnormal{  do}&\textnormal{ not}&
\textnormal{  like}&\textnormal{  fruits}&\textnormal{  .} \\
\textnormal{ pronoun}&\textnormal{  verb}&\textnormal{ adverb}&
\textnormal{  verb}&\textnormal{  noun}&\textnormal{  punctuation} &\\\\
\textnormal{ <sos>}& \textnormal{ He}&\textnormal{ is}&\textnormal{ happy}&\textnormal{ with}&\textnormal{ his}&\textnormal{ car}&\textnormal{ .}\\
\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ adjective}&\color{red}\textnormal{ pronoun}&\color{red}\textnormal{ noun}&\textnormal{ noun}&\textnormal{ punctuation}&\\\\
\textnormal{ <sos>}& \textnormal{ She}&\textnormal{ is}&\textnormal{ nervous}&\textnormal{ about}&\textnormal{ the}&\textnormal{ exam}&\textnormal{ .}\\
\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ adjective}&\textnormal{ preposition}&\color{red}\textnormal{ pronoun}&\textnormal{ noun}&\color{red}\textnormal{ adverb}&\\\\
\textnormal{ <sos>}& \textnormal{ It}&\textnormal{ was}&\textnormal{ a}&\textnormal{ comfortable}&\textnormal{ game}&\textnormal{ .}\\
\textnormal{ pronoun}&\textnormal{ verb}&\textnormal{ article}&\textnormal{ adjective}&\color{red}\textnormal{ conjunction}&\color{red}\textnormal{ noun}&
\end{array} \tag{69.233}
$$


In [231]:
class SentenceAnalysisLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(
            input_size=1, hidden_size=features_size, num_layers=1
        )
        self.linear = nn.Linear(features_size, word_types)
    
    def forward(self, x):
        x_ = self.rnn(x)
        logits = self.linear(x_[0])

        return logits

In [232]:
lstm_model = SentenceAnalysisLSTM()
lr = 1e-5
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=lr)
loss = nn.CrossEntropyLoss()
epochs = 4000

In [229]:
for epoch in range(epochs):
    train_loss = []
    train_acc = []
    test_loss = []
    test_acc = []
    model.train()
    for x, y in zip(H_train, labels_train):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)
        y_ = y[:final_len].long()-1

        logits = lstm_model(x)

        loss_out = loss(logits, y_)
        loss_out.backward()
        optimizer.step()

        class_hat = torch.round(nn.functional.softmax(logits, dim=-1))
        classes = nn.functional.one_hot(y_)

        errors = (classes != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[0] - errors)/x.shape[0]*100


        train_loss.append(loss_out.item())
        train_acc.append(acc)
    
    model.eval()
    for x, y in zip(H_test, labels_test):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)
        y_ = y[:final_len].long()-1

        logits = lstm_model(x)

        loss_out = loss(logits, y_)

        class_hat = torch.round(nn.functional.softmax(logits, dim=-1))
        classes = nn.functional.one_hot(y_)

        errors = (classes != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[0] - errors)/x.shape[0]*100


        test_loss.append(loss_out.item())
        test_acc.append(acc)

    if epoch % 100 == 0:
        print(f"Epoch: {epoch}/{epochs}. Training Loss: {round(sum(train_loss)/len(train_loss), 3)}. Training Accuracy: {round(sum(train_acc)/len(train_acc), 3)}. Test Loss: {round(sum(test_loss)/len(test_loss), 3)}. Test Accuracy: {round(sum(test_acc)/len(test_acc), 3)}.")

Epoch: 0/4000. Training Loss: 2.419. Training Accuracy: 0.0. Test Loss: 2.399. Test Accuracy: 0.0.
Epoch: 100/4000. Training Loss: 2.01. Training Accuracy: 0.0. Test Loss: 1.973. Test Accuracy: 0.0.
Epoch: 200/4000. Training Loss: 1.734. Training Accuracy: 10.131. Test Loss: 1.642. Test Accuracy: 11.806.
Epoch: 300/4000. Training Loss: 1.57. Training Accuracy: 19.656. Test Loss: 1.471. Test Accuracy: 23.611.
Epoch: 400/4000. Training Loss: 1.447. Training Accuracy: 19.656. Test Loss: 1.35. Test Accuracy: 23.611.
Epoch: 500/4000. Training Loss: 1.338. Training Accuracy: 23.09. Test Loss: 1.228. Test Accuracy: 26.389.
Epoch: 600/4000. Training Loss: 1.245. Training Accuracy: 25.826. Test Loss: 1.114. Test Accuracy: 29.167.
Epoch: 700/4000. Training Loss: 1.161. Training Accuracy: 31.281. Test Loss: 1.027. Test Accuracy: 40.972.
Epoch: 800/4000. Training Loss: 1.081. Training Accuracy: 32.443. Test Loss: 0.955. Test Accuracy: 44.097.
Epoch: 900/4000. Training Loss: 1.01. Training Accuracy

In [230]:
sentence = data["sentences"][0][0].item().split(" ")
sentence_idx = H_train[0][:8]
labels = labels_train[0][:8]
labels_hat = torch.round(nn.functional.softmax(lstm_model(sentence_idx.float().unsqueeze(-1)), dim=-1))

print("Word; Word Id; Label; Label Hat")
for word, word_id, label, label_hat in zip(sentence, sentence_idx, labels, labels_hat):
    print(word, word_id.item(), label.item(), (label_hat==1).nonzero().item()+1)

Word; Word Id; Label; Label Hat
START 3 1 1
It 48 5 5
was 77 3 3
a 4 2 2
great 36 6 6
play 58 4 4
. 1 10 10
END 2 11 11


In [236]:
class SentenceAnalysisBiderectionalLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(
            input_size=1, hidden_size=features_size, num_layers=1, bidirectional=True
        )
        self.linear = nn.Linear(features_size*2, word_types)
    
    def forward(self, x):
        x_ = self.rnn(x)
        logits = self.linear(x_[0])

        return logits

In [246]:
bidirectional_lstm_model = SentenceAnalysisBiderectionalLSTM()
lr = 1e-5
bidirectional_lstm_optimizer = torch.optim.Adam(bidirectional_lstm_model.parameters(), lr=lr)
loss = nn.CrossEntropyLoss()
epochs = 4000

In [247]:
for epoch in range(epochs):
    train_loss = []
    train_acc = []
    test_loss = []
    test_acc = []
    model.train()
    for x, y in zip(H_train, labels_train):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)
        y_ = y[:final_len].long()-1

        logits = bidirectional_lstm_model(x)

        loss_out = loss(logits, y_)
        loss_out.backward()
        bidirectional_lstm_optimizer.step()

        class_hat = torch.round(nn.functional.softmax(logits, dim=-1))
        classes = nn.functional.one_hot(y_)

        errors = (classes != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[0] - errors)/x.shape[0]*100


        train_loss.append(loss_out.item())
        train_acc.append(acc)
    
    model.eval()
    for x, y in zip(H_test, labels_test):
        final_len = (x == 2).nonzero().squeeze().item()+1

        x = x[:final_len].float().unsqueeze(-1)
        y_ = y[:final_len].long()-1

        logits = bidirectional_lstm_model(x)

        loss_out = loss(logits, y_)

        class_hat = torch.round(nn.functional.softmax(logits, dim=-1))
        classes = nn.functional.one_hot(y_)

        errors = (classes != class_hat).int().sum(axis=-1).nonzero().shape[0]
        acc = (x.shape[0] - errors)/x.shape[0]*100


        test_loss.append(loss_out.item())
        test_acc.append(acc)

    if epoch % 100 == 0:
        print(f"Epoch: {epoch}/{epochs}. Training Loss: {round(sum(train_loss)/len(train_loss), 3)}. Training Accuracy: {round(sum(train_acc)/len(train_acc), 3)}. Test Loss: {round(sum(test_loss)/len(test_loss), 3)}. Test Accuracy: {round(sum(test_acc)/len(test_acc), 3)}.")

Epoch: 0/4000. Training Loss: 2.37. Training Accuracy: 0.0. Test Loss: 2.385. Test Accuracy: 0.0.
Epoch: 100/4000. Training Loss: 1.763. Training Accuracy: 0.0. Test Loss: 1.704. Test Accuracy: 0.0.
Epoch: 200/4000. Training Loss: 1.386. Training Accuracy: 29.18. Test Loss: 1.271. Test Accuracy: 35.417.
Epoch: 300/4000. Training Loss: 1.191. Training Accuracy: 30.898. Test Loss: 1.072. Test Accuracy: 38.194.
Epoch: 400/4000. Training Loss: 1.057. Training Accuracy: 41.152. Test Loss: 0.956. Test Accuracy: 55.903.
Epoch: 500/4000. Training Loss: 0.955. Training Accuracy: 45.817. Test Loss: 0.864. Test Accuracy: 55.903.
Epoch: 600/4000. Training Loss: 0.876. Training Accuracy: 49.363. Test Loss: 0.81. Test Accuracy: 62.153.
Epoch: 700/4000. Training Loss: 0.81. Training Accuracy: 52.242. Test Loss: 0.764. Test Accuracy: 62.153.
Epoch: 800/4000. Training Loss: 0.751. Training Accuracy: 55.028. Test Loss: 0.74. Test Accuracy: 62.153.
Epoch: 900/4000. Training Loss: 0.697. Training Accuracy

In [248]:
sentence = data["sentences"][0][0].item().split(" ")
sentence_idx = H_train[0][:8]
labels = labels_train[0][:8]
labels_hat = torch.round(nn.functional.softmax(bidirectional_lstm_model(sentence_idx.float().unsqueeze(-1)), dim=-1))

print("Word; Word Id; Label; Label Hat")
for word, word_id, label, label_hat in zip(sentence, sentence_idx, labels, labels_hat):
    print(word, word_id.item(), label.item(), (label_hat==1).nonzero().item()+1)

Word; Word Id; Label; Label Hat
START 3 1 1
It 48 5 5
was 77 3 3
a 4 2 2
great 36 6 6
play 58 4 4
. 1 10 10
END 2 11 11
