# Lab 5: More on Recurrent Neural Networks (LSTM)
```
- [S25] Advanced Machine Learning, Innopolis University
- Teaching Assistant: Gcinizwe Dlamini
```
<hr>


```
Lab Plan
1. LSTM basics
2. Application of LSTM
3. Self practice tasks
```

<hr>


## 0. Recap

![](http://karpathy.github.io/assets/rnn/diags.jpeg)

## Sample Data

In [None]:
import torch
from torch import nn
import torch.nn.functional as F

simple_sequence = torch.Tensor([[0.3,1.9,4.5],[0.4,0.1,0.23],[0.7,0.91,0.43], [0.34,0.01,0.002]])
simple_sequence = simple_sequence.unsqueeze(0)
simple_sequence.shape

torch.Size([1, 4, 3])

## 1. LSTM basics

The `simple_sequence` variable represents a sequence of length 4, where each element (time step) is a feature vector of length 3. The LSTM computations are defined as:

![](https://media.licdn.com/dms/image/v2/C5612AQH5Im8XrvLmYQ/article-cover_image-shrink_600_2000/article-cover_image-shrink_600_2000/0/1564974698831?e=2147483647&v=beta&t=4sP9wrqZVaKsUt8NLXwuN4hfYc0m8RKI3a5g_jUW2xc)


$$i_t = \sigma\left(W_{ii}x_t + b_{ii} + W_{hi}h_{t-1} + b_{hi}\right)$$
$$f_t = \sigma\left(W_{if}x_t + b_{if} + W_{hf}h_{t-1} + b_{hf}\right)$$
$$g_t = \tanh\left(W_{ig}x_t + b_{ig} + W_{hg}h_{t-1} + b_{hg}\right)$$
$$o_t = \sigma\left(W_{io}x_t + b_{io} + W_{ho}h_{t-1} + b_{ho}\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $h_t$ is the hidden state at time $t$, $c_t$ is the cell state at time $t$, $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden state of the layer at time $t-1$ (or the initial hidden state at time 0), and $i_t$, $f_t$, $g_t$, $o_t$ are the input, forget, cell, and output gates, respectively. $\sigma$ is the sigmoid function and $\odot$ is the element-wise (Hadamard) product.

 <br>
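Let's see what's inside PyTorch and compare it with the theory.

**Note:** For simplicity, the biases are set to zero and the input-to-hidden weights are replaced with seeded random values drawn from a standard normal distribution. The hidden-to-hidden weights keep their default initialization.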

In [None]:
torch.manual_seed(20)
hidden_size = 1
simple_lstm_layer = torch.nn.LSTM(input_size=3, hidden_size=hidden_size, bidirectional=False, num_layers=1, batch_first=True)


share_weight = torch.randn(simple_lstm_layer.weight_ih_l0.shape, dtype = torch.float)
simple_lstm_layer.weight_ih_l0 = torch.nn.Parameter(share_weight)

# bias set to zeros
simple_lstm_layer.bias_ih_l0 = torch.nn.Parameter(torch.zeros(simple_lstm_layer.bias_ih_l0.shape))
simple_lstm_layer.bias_hh_l0 = torch.nn.Parameter(torch.zeros(simple_lstm_layer.bias_hh_l0.shape))

lstm_pytorch_output = simple_lstm_layer(simple_sequence[0][0].unsqueeze(dim=0).unsqueeze(dim=0))
simple_lstm_layer.state_dict()

OrderedDict([('weight_ih_l0',
              tensor([[-0.9475, -0.6130, -0.1291],
                      [-0.4107,  1.3931, -0.0984],
                      [ 1.6791, -0.9381, -0.4899],
                      [ 0.2811, -0.2813,  0.4779]])),
             ('weight_hh_l0',
              tensor([[ 0.8846],
                      [-0.4928],
                      [ 0.4776],
                      [ 0.0807]])),
             ('bias_ih_l0', tensor([0., 0., 0., 0.])),
             ('bias_hh_l0', tensor([0., 0., 0., 0.]))])

### 1.1 Whole sequence output

In [None]:
output, (hidden, cell) = simple_lstm_layer(simple_sequence)

In [None]:
output, hidden

### 1.2 Extract / define the calculation variables (weights and biases)

In [None]:
W_ii, W_if, W_ig, W_io = simple_lstm_layer.weight_ih_l0.split(hidden_size, dim=0)
b_ii, b_if, b_ig, b_io = simple_lstm_layer.bias_ih_l0.split(hidden_size, dim=0)

W_hi, W_hf, W_hg, W_ho = simple_lstm_layer.weight_hh_l0.split(hidden_size, dim=0)
b_hi, b_hf, b_hg, b_ho = simple_lstm_layer.bias_hh_l0.split(hidden_size, dim=0)

### 1.3 Calculations

$i_t = \sigma\left(W_{ii}x_t + b_{ii} + W_{hi}h_{t-1} + b_{hi}\right)$ <br>
$f_t = \sigma\left(W_{if}x_t + b_{if} + W_{hf}h_{t-1} + b_{hf}\right)$ <br>
$g_t = \tanh\left(W_{ig}x_t + b_{ig} + W_{hg}h_{t-1} + b_{hg}\right)$ <br>
$o_t = \sigma\left(W_{io}x_t + b_{io} + W_{ho}h_{t-1} + b_{ho}\right)$ <br>
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$ <br>
$h_t = o_t \odot \tanh(c_t)$ <br>

In [None]:
input_x = simple_sequence[0][0].unsqueeze(0)
prev_h = torch.zeros((1, hidden_size))
prev_c = torch.zeros((1, hidden_size))

i_t = torch.sigmoid(F.linear(input_x,W_ii,b_ii )+ F.linear(prev_h, W_hi,b_hi))
f_t = torch.sigmoid(F.linear(input_x,W_if,b_if )+ F.linear(prev_h, W_hf,b_hf))
g_t = torch.tanh(F.linear(input_x,W_ig,b_ig )+ F.linear(prev_h, W_hg,b_hg))
o_t = torch.sigmoid(F.linear(input_x,W_io,b_io )+ F.linear(prev_h, W_ho,b_ho))
c_t = f_t * prev_c + i_t * g_t
h_t = o_t * torch.tanh(c_t)

### 1.4 Compare manual calculations with the PyTorch implementation

In [None]:
output.squeeze(0)[0], h_t

(tensor([-0.0976], grad_fn=<SelectBackward0>),
 tensor([[-0.0976]], grad_fn=<MulBackward0>))

**Task:** Manually calculate the outputs for the remaining time steps of `simple_sequence` and compare them with the PyTorch output.

In [None]:
simple_sequence.squeeze(0).squeeze(0)[1]

tensor([0.4000, 0.1000, 0.2300])

In [None]:
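prev_h = torch.zeros((1, hidden_size))
prev_c = torch.zeros((1, hidden_size))

for i in range(simple_sequence.shape[1]):
  input_x = simple_sequence[0][i].unsqueeze(0)
  # Same gate equations as above, fed with the previous hidden state
  i_t = torch.sigmoid(F.linear(input_x, W_ii, b_ii) + F.linear(prev_h, W_hi, b_hi))
  f_t = torch.sigmoid(F.linear(input_x, W_if, b_if) + F.linear(prev_h, W_hf, b_hf))
  g_t = torch.tanh(F.linear(input_x, W_ig, b_ig) + F.linear(prev_h, W_hg, b_hg))
  o_t = torch.sigmoid(F.linear(input_x, W_io, b_io) + F.linear(prev_h, W_ho, b_ho))
  prev_c = f_t * prev_c + i_t * g_t  # new cell state
  prev_h = o_t * torch.tanh(prev_c)  # new hidden state, carried to the next step
  print(prev_h)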


tensor([[-0.0976]], grad_fn=<MulBackward0>)
tensor([[0.0470]], grad_fn=<MulBackward0>)
tensor([[0.0490]], grad_fn=<MulBackward0>)
tensor([[0.1372]], grad_fn=<MulBackward0>)


In [None]:
output

tensor([[[-0.0976],
         [ 0.0470],
         [ 0.0490],
         [ 0.1372]]], grad_fn=<TransposeBackward0>)
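The recurrence can also be checked without PyTorch. The sketch below plugs the rounded weights from the `state_dict` printed earlier into the LSTM equations using only the standard library. It assumes `hidden_size = 1` and zero biases, as in this lab; the helper names (`sigmoid`, `dot`, `lstm_step`) are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Rounded weights copied from the printed state_dict.
# PyTorch stacks the gate rows in the order: input, forget, cell, output.
W_ih = {
    "i": [-0.9475, -0.6130, -0.1291],
    "f": [-0.4107, 1.3931, -0.0984],
    "g": [1.6791, -0.9381, -0.4899],
    "o": [0.2811, -0.2813, 0.4779],
}
W_hh = {"i": 0.8846, "f": -0.4928, "g": 0.4776, "o": 0.0807}

sequence = [[0.3, 1.9, 4.5], [0.4, 0.1, 0.23], [0.7, 0.91, 0.43], [0.34, 0.01, 0.002]]

def lstm_step(x, h_prev, c_prev):
    """One LSTM step for hidden_size = 1 with all biases equal to zero."""
    pre = {k: dot(W_ih[k], x) + W_hh[k] * h_prev for k in "ifgo"}
    i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g_t = math.tanh(pre["g"])
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

h, c = 0.0, 0.0  # initial hidden and cell states
hidden_states = []
for x in sequence:
    h, c = lstm_step(x, h, c)
    hidden_states.append(h)
    print(f"{h:.3f}")
```

Because the weights are rounded to four decimals, the results should agree with the PyTorch output only up to a small rounding error.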

## 2. Application of LSTM (Sentiment Analysis)

### 2.1 Dataset Description

The [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains 50K movie reviews for binary sentiment classification (positive or negative), split evenly into 25K training and 25K test reviews. It provides substantially more data than earlier sentiment benchmark datasets.

In [1]:
!pip install datasets

import collections

import datasets
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm



### 2.2 Get Dataset and preprocess

In [2]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])


### 2.3 Tokenize Dataset

In [None]:
max_text_length = 128

def tokenize_function(examples):
    # Whitespace tokenization, truncated to max_text_length tokens per review
    return {"tokens": [text.split()[:max_text_length] for text in examples["text"]]}

train_data = train_data.map(tokenize_function, batched=True)
test_data = test_data.map(tokenize_function, batched=True)

### 2.4 Create Vocabulary

In [None]:
min_freq = 5
special_tokens = ["<unk>", "<pad>"]

# Build vocabulary: count how often each token occurs in the training set
token_counts = collections.Counter()
for example in train_data["tokens"]:
    token_counts.update(example)

# Filter tokens by min_freq and add special tokens
filtered_tokens = [token for token, count in token_counts.items() if count >= min_freq]
vocab_list = special_tokens + filtered_tokens

# Map each token to its integer index
vocab = {token: idx for idx, token in enumerate(vocab_list)}
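To see what the frequency filter does, here is a minimal sketch on a toy corpus. The data, the threshold `toy_min_freq`, and the `lookup` helper are made up for illustration; only the construction pattern mirrors the cell above.

```python
from collections import Counter

# Toy corpus standing in for the tokenized reviews (hypothetical data)
toy_tokens = [
    ["a", "great", "movie"],
    ["a", "boring", "movie"],
    ["great", "acting"],
]
toy_min_freq = 2
special_tokens = ["<unk>", "<pad>"]

token_counts = Counter()
for tokens in toy_tokens:
    token_counts.update(tokens)

# Keep only tokens that occur at least toy_min_freq times
kept = [tok for tok, n in token_counts.items() if n >= toy_min_freq]
toy_vocab = {tok: idx for idx, tok in enumerate(special_tokens + kept)}
unk_index = toy_vocab["<unk>"]

def lookup(token):
    # Rare or unseen tokens fall back to the <unk> index
    return toy_vocab.get(token, unk_index)

print(toy_vocab)  # {'<unk>': 0, '<pad>': 1, 'a': 2, 'great': 3, 'movie': 4}
print(lookup("great"), lookup("boring"))  # 3 0
```

Tokens below the threshold ("boring", "acting") never get their own index, so the embedding table stays small and rare words share the `<unk>` vector.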

### 2.5 Encode Data

In [None]:
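train_data = None  # TODO: map each token list to vocabulary indices (use <unk> for unknown tokens)
test_data = None  # TODO: apply the same encoding to the test split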
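The encoding step is left open above. One possible sketch, assuming a vocabulary dictionary with `<unk>` and `<pad>` entries, maps tokens to indices and pads every sequence to a fixed length. The function name `encode_and_pad` and the toy vocabulary are hypothetical.

```python
def encode_and_pad(tokens, vocab, max_length, unk_token="<unk>", pad_token="<pad>"):
    """Map tokens to vocabulary indices, truncate, and right-pad to max_length."""
    unk_index, pad_index = vocab[unk_token], vocab[pad_token]
    ids = [vocab.get(tok, unk_index) for tok in tokens[:max_length]]
    return ids + [pad_index] * (max_length - len(ids))

toy_vocab = {"<unk>": 0, "<pad>": 1, "a": 2, "great": 3, "movie": 4}
encoded = encode_and_pad(["a", "great", "film"], toy_vocab, max_length=5)
print(encoded)  # [2, 3, 0, 1, 1]
```

A function like this could be applied per example with the dataset's `map` method, storing the result under a new column such as `"ids"` (the column name is an assumption).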

### 2.6 Creating Dataloaders


In [None]:
batch_size = 64

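train_data_loader = None  # TODO: DataLoader over the encoded training data (shuffled)
test_data_loader = None  # TODO: DataLoader over the encoded test data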
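As a rough sketch of how the loaders might be built, the example below batches variable-length id lists with a custom `collate_fn`. The example records, the `"ids"`/`"label"` keys, and `pad_index = 1` are assumptions for illustration; the lab's actual data format may differ.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical encoded examples: a list of token ids plus a 0/1 label
toy_examples = [
    {"ids": [2, 3, 4], "label": 1},
    {"ids": [2, 5], "label": 0},
    {"ids": [3, 4, 6, 7], "label": 1},
]
pad_index = 1  # assumed index of the <pad> token

def collate_fn(batch):
    # Pad all sequences in the batch to the length of the longest one
    ids = [torch.tensor(ex["ids"], dtype=torch.long) for ex in batch]
    ids = torch.nn.utils.rnn.pad_sequence(ids, batch_first=True, padding_value=pad_index)
    labels = torch.tensor([ex["label"] for ex in batch], dtype=torch.long)
    return {"ids": ids, "label": labels}

loader = DataLoader(toy_examples, batch_size=2, shuffle=False, collate_fn=collate_fn)
first_batch = next(iter(loader))
print(first_batch["ids"].shape, first_batch["label"])
```

Padding per batch keeps short batches small. If all sequences are already padded to one fixed length, the default collation may be enough.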

### 2.7 Define LSTM model

In [None]:
class SentimentLSTM(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, pad_index, n_layers=1, bidirectional=False):
    super().__init__()
    # NOTE: `bidirectional` is accepted but not used yet; the LSTM below is unidirectional
    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, batch_first=True)
    self.fc = nn.Linear(hidden_dim, output_dim)

  def forward(self, ids):
    embedded = self.embedding(ids)                # (batch, seq_len, embedding_dim)
    output, (hidden, cell) = self.lstm(embedded)  # hidden: (n_layers, batch, hidden_dim)
    prediction = self.fc(hidden[-1])              # final hidden state of the last layer
    return prediction
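The model classifies from `hidden[-1]`. A small shape check, with arbitrary illustrative sizes, shows why: for a unidirectional LSTM, the final hidden state of the top layer is the same tensor as the last time step of `output`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, emb_dim, hid_dim, n_layers = 2, 5, 8, 4, 2  # illustrative sizes

lstm = nn.LSTM(emb_dim, hid_dim, n_layers, batch_first=True)
embedded = torch.randn(batch, seq_len, emb_dim)  # stands in for the embedding output
output, (hidden, cell) = lstm(embedded)

print(output.shape)  # (batch, seq_len, hid_dim): top-layer state at every step
print(hidden.shape)  # (n_layers, batch, hid_dim): final state of every layer
print(torch.allclose(output[:, -1, :], hidden[-1]))
```

This equivalence no longer holds as-is for bidirectional LSTMs, where the backward direction finishes at the first time step.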

### 2.8 Model training parameters

In [None]:
vocab_size = len(vocab)
embedding_dim = 300
hidden_dim = 32
output_dim = None # TODO
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lr = 5e-4

model = SentimentLSTM(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    hidden_dim=hidden_dim,
    output_dim=output_dim,
    pad_index=vocab["<pad>"],
).to(device)

optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

### 2.9 Model Evaluation

In [None]:
def get_accuracy(prediction, label):
  batch_size, _ = prediction.shape
  predicted_classes = prediction.argmax(dim=-1)
  correct_predictions = predicted_classes.eq(label).sum()
  accuracy = correct_predictions / batch_size
  return accuracy

def evaluate(dataloader, model, criterion, device):
  model.eval()
  epoch_losses = []
  epoch_accs = []
  # TODO: Write your code here
  return np.mean(epoch_losses), np.mean(epoch_accs)
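The body of `evaluate` is left as a TODO above. A minimal sketch of what it might look like follows. It assumes each batch is a dictionary with `"ids"` and `"label"` tensors, which the lab does not specify. The `TinyClassifier` model and the toy batch only serve to make the example runnable.

```python
import torch
import torch.nn as nn

def evaluate_sketch(dataloader, model, criterion, device):
    model.eval()  # switch off dropout and similar training-only behaviour
    epoch_losses, epoch_accs = [], []
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch in dataloader:
            ids = batch["ids"].to(device)      # assumed batch key
            label = batch["label"].to(device)  # assumed batch key
            prediction = model(ids)
            loss = criterion(prediction, label)
            accuracy = (prediction.argmax(dim=-1) == label).float().mean()
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return sum(epoch_losses) / len(epoch_losses), sum(epoch_accs) / len(epoch_accs)

class TinyClassifier(nn.Module):
    """Stand-in model: mean of token embeddings followed by a linear layer."""
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(10, 4)
        self.fc = nn.Linear(4, 2)

    def forward(self, ids):
        return self.fc(self.embedding(ids).mean(dim=1))

torch.manual_seed(0)
toy_batches = [{"ids": torch.randint(0, 10, (3, 5)), "label": torch.tensor([0, 1, 1])}]
loss, acc = evaluate_sketch(toy_batches, TinyClassifier(), nn.CrossEntropyLoss(), torch.device("cpu"))
print(f"loss={loss:.3f}, acc={acc:.3f}")
```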

### 2.10 Model training Loop

**Task** : Add model evaluation (use `test_data_loader`)

In [None]:
n_epochs = 10
for ep in range(n_epochs):
  model.train()
  epoch_losses = []
  epoch_accs = []
  for batch in tqdm.tqdm(train_data_loader, desc="training..."):
    optimizer.zero_grad()
    ids = None    # TODO: token ids of the batch, moved to `device`
    label = None  # TODO: labels of the batch, moved to `device`

    prediction = model(ids)
    loss = criterion(prediction, label)
    accuracy = get_accuracy(prediction, label)

    loss.backward()
    optimizer.step()

    epoch_losses.append(loss.item())
    epoch_accs.append(accuracy.item())
  test_loss, test_acc = evaluate(test_data_loader, model, criterion=criterion, device=device)
  print(f'[Epoch {ep}] Train Loss: {np.mean(epoch_losses):.3f}, Train Acc: {np.mean(epoch_accs):.3f}, Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.3f}')

[Epoch 0] Train Loss: 0.694, Train Acc: 0.520, Test Loss: 0.692, Test Acc: 0.529
[Epoch 1] Train Loss: 0.673, Train Acc: 0.587, Test Loss: 0.685, Test Acc: 0.550
[Epoch 2] Train Loss: 0.639, Train Acc: 0.647, Test Loss: 0.678, Test Acc: 0.576
[Epoch 3] Train Loss: 0.569, Train Acc: 0.720, Test Loss: 0.669, Test Acc: 0.613
[Epoch 4] Train Loss: 0.494, Train Acc: 0.775, Test Loss: 0.671, Test Acc: 0.644
[Epoch 5] Train Loss: 0.406, Train Acc: 0.829, Test Loss: 0.735, Test Acc: 0.627
[Epoch 6] Train Loss: 0.375, Train Acc: 0.847, Test Loss: 0.693, Test Acc: 0.665
[Epoch 7] Train Loss: 0.287, Train Acc: 0.892, Test Loss: 0.713, Test Acc: 0.686
[Epoch 8] Train Loss: 0.228, Train Acc: 0.919, Test Loss: 0.771, Test Acc: 0.686
[Epoch 9] Train Loss: 0.202, Train Acc: 0.932, Test Loss: 0.831, Test Acc: 0.692





## 3. Tasks

```
Task 1
Implement and train an LSTM neural network for sentiment analysis using the IMDb dataset and the following architecture:
- LSTM should be bidirectional
- LSTM should be multi-layered
- LSTM should use regularization (e.g., dropout)
```

<hr>

```
Task 2
Implement, train, and test an LSTM model for the part-of-speech tagging task.
```

**Task 2 Datasets**: [Train](https://www.dropbox.com/s/x9n6f9o9jl7pno8/train_pos.txt?dl=1), [Test](https://www.dropbox.com/s/v8nccvq7jewcl8s/test_pos.txt?dl=1)


## Task 1

In [28]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM,TimeDistributed,  Dense, Dropout, Bidirectional

In [4]:
vocab_size = 20000  # Dictionary size
max_length = 200  # Maximum review length
embedding_dim = 128  # Size of the word vector representation
lstm_units = 64  # Number of neurons in the LSTM layer
dropout_rate = 0.5  # Regularization
num_layers = 2  # Number of LSTM layers
batch_size = 64
epochs = 5

In [5]:
# Load IMDb dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [6]:
# Padding sequences to ensure uniform input size
x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

In [7]:
# Building the model
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_length))


In [8]:
# Adding multiple LSTM layers with Bidirectional wrapper
for _ in range(num_layers - 1):
    model.add(Bidirectional(LSTM(lstm_units, return_sequences=True)))
    model.add(Dropout(dropout_rate))


In [9]:
# Final LSTM layer
model.add(Bidirectional(LSTM(lstm_units)))
model.add(Dropout(dropout_rate))


In [11]:
# Output layer
model.add(Dense(1, activation='sigmoid'))

In [12]:
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [13]:
# Train the model
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_test, y_test))


In [14]:
# Evaluate model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc:.4f}")

Test Accuracy: 0.8302
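Since the first part of this lab uses PyTorch, the same three requirements (bidirectional, multi-layered, dropout) could also be sketched with `nn.LSTM`. The class name `BiLSTMClassifier` and all sizes below are illustrative assumptions, not part of the solution above.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, pad_index,
                 n_layers=2, dropout_rate=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        # `dropout` in nn.LSTM is applied between stacked layers (needs n_layers > 1)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, bidirectional=True,
                            dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        # Forward and backward states are concatenated, hence hidden_dim * 2
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, ids):
        embedded = self.dropout(self.embedding(ids))
        _, (hidden, _) = self.lstm(embedded)
        # hidden: (n_layers * 2, batch, hidden_dim); the last two entries are the
        # forward and backward final states of the top layer
        final = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        return self.fc(self.dropout(final))

model_sketch = BiLSTMClassifier(vocab_size=50, embedding_dim=16, hidden_dim=8,
                                output_dim=2, pad_index=1)
logits = model_sketch(torch.randint(0, 50, (4, 12)))
print(logits.shape)  # one row of class scores per review in the batch
```

With `output_dim=2` the logits pair naturally with `nn.CrossEntropyLoss`, whereas the Keras model above uses a single sigmoid unit with binary cross-entropy.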


## Task 2

In [15]:
import numpy as np

In [16]:
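def load_data(filename):
    """Read a file with one "word tag" pair per line and blank lines between sentences."""
    sentences, sentence, tags, tag_seq = [], [], [], []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                if sentence:  # skip repeated blank lines
                    sentences.append(sentence)
                    tags.append(tag_seq)
                    sentence, tag_seq = [], []
            else:
                word, tag = line.split()
                sentence.append(word)
                tag_seq.append(tag)
    if sentence:  # keep the last sentence when the file has no trailing blank line
        sentences.append(sentence)
        tags.append(tag_seq)
    return sentences, tags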
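The loader above implies a simple file layout: one whitespace-separated word and tag per line, with a blank line between sentences. The sketch below parses an in-memory sample in that layout. The helper `parse_pos_lines` and the sample words and tags are made up for illustration.

```python
def parse_pos_lines(lines):
    """Group "word tag" lines into sentences; blank lines separate sentences."""
    sentences, tags = [], []
    words, tag_seq = [], []
    for line in lines:
        line = line.strip()
        if line:
            word, tag = line.split()
            words.append(word)
            tag_seq.append(tag)
        elif words:
            sentences.append(words)
            tags.append(tag_seq)
            words, tag_seq = [], []
    if words:  # flush the last sentence when there is no trailing blank line
        sentences.append(words)
        tags.append(tag_seq)
    return sentences, tags

sample = """The DT
cat NN
sleeps VBZ

Dogs NNS
bark VBP"""

sentences, tags = parse_pos_lines(sample.splitlines())
print(sentences)  # [['The', 'cat', 'sleeps'], ['Dogs', 'bark']]
print(tags)       # [['DT', 'NN', 'VBZ'], ['NNS', 'VBP']]
```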

In [20]:
train_sentences, train_tags = load_data("lab5/train_pos.txt")
test_sentences, test_tags = load_data("lab5/test_pos.txt")

In [21]:
# === 2. Creating Token dictionaries ===
word2idx = {"<PAD>": 0, "<UNK>": 1}  # Special tokens
tag2idx = {"<PAD>": 0}
for sent in train_sentences:
    for word in sent:
        if word not in word2idx:
            word2idx[word] = len(word2idx)

for tag_seq in train_tags:
    for tag in tag_seq:
        if tag not in tag2idx:
            tag2idx[tag] = len(tag2idx)

idx2tag = {i: tag for tag, i in tag2idx.items()}

In [22]:
# === 3. Converting to numeric tensors ===
max_len = max(len(s) for s in train_sentences)  # Maximum sentence length

X_train = [[word2idx.get(word, 1) for word in sent] for sent in train_sentences]
X_test = [[word2idx.get(word, 1) for word in sent] for sent in test_sentences]

y_train = [[tag2idx[tag] for tag in tags] for tags in train_tags]
y_test = [[tag2idx[tag] for tag in tags] for tags in test_tags]

In [25]:
# === 4. Padding Sequences ===
X_train = pad_sequences(X_train, maxlen=max_len, padding="post")
X_test = pad_sequences(X_test, maxlen=max_len, padding="post")

y_train = pad_sequences(y_train, maxlen=max_len, padding="post")
y_test = pad_sequences(y_test, maxlen=max_len, padding="post")

# One-hot encode the labels
y_train = [to_categorical(i, num_classes=len(tag2idx)) for i in y_train]
y_test = [to_categorical(i, num_classes=len(tag2idx)) for i in y_test]

y_train = np.array(y_train)
y_test = np.array(y_test)

In [26]:
# === 5. Creating a BiLSTM model ===
vocab_size = len(word2idx)
tag_size = len(tag2idx)
embedding_dim = 128
lstm_units = 64
dropout_rate = 0.3

In [29]:
with tf.device('/GPU:0'):  # Running on the GPU
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len),
        Bidirectional(LSTM(units=lstm_units, return_sequences=True)),
        Dropout(dropout_rate),
        TimeDistributed(Dense(tag_size, activation="softmax"))  # Output layer of POS tags
    ])

    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])


In [30]:
# === 6. Model Training ===
model.fit(X_train, y_train, batch_size=32, epochs=5, validation_data=(X_test, y_test))



In [31]:
# === 7. Testing ===
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {test_acc:.4f}")

Test Accuracy: 0.9842
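Note that the model above does not mask padding, so the reported accuracy also counts the `<PAD>` positions, which are easy to predict and inflate the score. A sketch of an accuracy that ignores padded positions is shown below on toy tag ids. The function name `masked_accuracy` and the data are hypothetical.

```python
def masked_accuracy(true_ids, pred_ids, pad_id=0):
    """Token-level accuracy that skips positions whose true tag is padding."""
    correct = total = 0
    for true_seq, pred_seq in zip(true_ids, pred_ids):
        for t, p in zip(true_seq, pred_seq):
            if t == pad_id:
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# Two padded toy sentences: 0 is the <PAD> tag id
true_ids = [[3, 5, 2, 0, 0, 0], [4, 2, 0, 0, 0, 0]]
pred_ids = [[3, 1, 2, 0, 0, 0], [4, 4, 0, 0, 0, 0]]

n_positions = sum(len(seq) for seq in true_ids)
flat = sum(int(t == p) for ts, ps in zip(true_ids, pred_ids) for t, p in zip(ts, ps)) / n_positions
masked = masked_accuracy(true_ids, pred_ids)
print(f"with padding: {flat:.3f}, without padding: {masked:.3f}")
# with padding: 0.833, without padding: 0.600
```

For the Keras model, the id sequences could be obtained with `model.predict(X_test).argmax(-1)` and `y_test.argmax(-1)`.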
