# Lab 4: Recurrent neural networks basics
```
- [S25] Advanced Machine Learning, Innopolis University
- Teaching Assistant: Gcinizwe Dlamini
```
<hr>


```
Lab Plan
1. Recap (Basics)
2. Different recurrent neural network architectures
3. Application of RNN
4. Self practice task
```

<hr>


## 1. Basics (Sequences and RNN)


Each rectangle is a vector and each arrow represents a function (e.g., a matrix multiplication). Input vectors are red, output vectors are blue, and green vectors hold the RNN's state. Recurrent networks are attractive because they let us operate over sequences of vectors: sequences in the input, in the output, or in both.

![](http://karpathy.github.io/assets/rnn/diags.jpeg)

### More details on the RNN cell


![](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/experimental/assets/sentiment7.png?raw=1)

## 1.1 Simple RNN cell

In [1]:
import torch
from torch import nn

simple_sequence = torch.Tensor([[0.3,1.9,4.5], [0.4,0.1,0.23], [0.7,0.91,0.43], [0.34,0.01,0.002]])
simple_sequence = simple_sequence.unsqueeze(0)
simple_sequence.shape

torch.Size([1, 4, 3])

The `simple_sequence` variable holds a batch of one sequence of length 4, where each element (time step) is a feature vector of length 3.


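$$a^{(t)} = b + Wh^{(t-1)} + Ux^{(t)}$$
$$h^{(t)} = \tanh(a^{(t)})$$

PyTorch's `nn.RNN` uses the same recurrence, with the bias split into two terms:

$$h_t = \tanh(W_{ih}x_t + b_{ih} + W_{hh}h_{t-1} + b_{hh})$$

where $h_t$ is the hidden state at time $t$, and the two notations are related by $U = W_{ih}$, $W = W_{hh}$ and $b = b_{ih} + b_{hh}$.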
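For a hidden size larger than 1, the scalar products in the recurrence become matrix-vector products. The following is a minimal sketch of that general case, assuming a hypothetical layer with `hidden_size=2` and a random input sequence (the sizes, the seed and the variable names are chosen only for illustration). It unrolls the recurrence by hand and compares the result with what `nn.RNN` returns.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only to make this sketch reproducible

# Hypothetical sizes chosen for illustration
rnn = nn.RNN(input_size=3, hidden_size=2, batch_first=True)
x_seq = torch.randn(1, 4, 3)  # (batch, time, features)

W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0  # shapes (2, 3) and (2, 2)
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0      # shape (2,) each

h = torch.zeros(2)  # nn.RNN starts from a zero hidden state when none is given
manual_states = []
for t in range(x_seq.shape[1]):
    # h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
    h = torch.tanh(W_ih @ x_seq[0, t] + b_ih + W_hh @ h + b_hh)
    manual_states.append(h)
manual_states = torch.stack(manual_states)  # (time, hidden)

reference, _ = rnn(x_seq)
matches = torch.allclose(manual_states, reference[0], atol=1e-5)
print(matches)
```

The comparison uses a tolerance because the hand-written loop and the library kernel may round intermediate values slightly differently.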


Let's see what is inside PyTorch and compare it with the theory.

In [3]:
simple_rnn_layer = nn.RNN(input_size=3, hidden_size=1, num_layers = 1, bias = True, batch_first=True)
simple_rnn_layer.state_dict()

OrderedDict([('weight_ih_l0', tensor([[ 0.9805, -0.9531,  0.3402]])),
             ('weight_hh_l0', tensor([[0.5527]])),
             ('bias_ih_l0', tensor([-0.3233])),
             ('bias_hh_l0', tensor([0.7690]))])

## 1.2 Feedforward

In [5]:
output_all, output_last = simple_rnn_layer(simple_sequence)

wih = simple_rnn_layer.weight_ih_l0.squeeze(0)
whh = simple_rnn_layer.weight_hh_l0.squeeze(0)

bih = simple_rnn_layer.bias_ih_l0
bhh = simple_rnn_layer.bias_hh_l0

x = simple_sequence[0][0]  # Feature vector at the first time step of the first sequence

# Computing the hidden state for time = 1 (the initial hidden state h_0 is zero)
h0 = torch.zeros(1)
h1 = torch.tanh(torch.dot(x, wih) + bih + torch.dot(whh, h0) + bhh)

# Compare with a tolerance: exact float equality is fragile
assert torch.allclose(h1, output_all[0][0])

In [7]:
output_all

tensor([[[0.4298],
         [0.7850],
         [0.6884],
         [0.8180]]], grad_fn=<TransposeBackward1>)

**Task** : Compute all the other hidden states

In [13]:
result = []

h_previous = torch.Tensor([0.0])

for i in range(simple_sequence.shape[1]):
    # TODO: Compute and print the hidden states using the given example
    x = simple_sequence[0][i]  # Input vector
    
    h_t = torch.tanh(torch.dot(x, wih) + bih + torch.dot(whh, h_previous) + bhh)
    
    result.append(h_t.item())  # saving results
    h_previous = h_t # update hidden layer

result

[0.4298314154148102,
 0.7850464582443237,
 0.6883629560470581,
 0.8179722428321838]

## 2.1  Bidirectional RNN
The concept behind a bidirectional RNN is simple. As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the **last to the first** (a backward RNN). At time step $t$, the forward RNN is processing word $x_t$, and the backward RNN is processing word $x_{T-t+1}$.

$$\hat{y}=f(h_T^\rightarrow, h_T^\leftarrow)$$

In [17]:
bi_rnn_layer = nn.RNN(input_size=3, hidden_size=1, num_layers = 1, bidirectional=True, bias = True, batch_first=True)
bi_rnn_layer.state_dict()

OrderedDict([('weight_ih_l0', tensor([[-0.1223, -0.7003,  0.7123]])),
             ('weight_hh_l0', tensor([[0.9007]])),
             ('bias_ih_l0', tensor([0.2881])),
             ('bias_hh_l0', tensor([-0.7138])),
             ('weight_ih_l0_reverse', tensor([[0.9240, 0.6898, 0.8998]])),
             ('weight_hh_l0_reverse', tensor([[0.9774]])),
             ('bias_ih_l0_reverse', tensor([0.6255])),
             ('bias_hh_l0_reverse', tensor([-0.5311]))])

In [19]:
output_all, output_last = bi_rnn_layer(simple_sequence)
output_last

tensor([[[-0.7058]],

        [[ 1.0000]]], grad_fn=<StackBackward0>)

In [21]:
output_all

tensor([[[ 0.8880,  1.0000],
         [ 0.3961,  0.9342],
         [-0.4506,  0.9728],
         [-0.7058,  0.3946]]], grad_fn=<TransposeBackward1>)

**Task** : Compute all the other hidden states

In [25]:
result = []

wih_fwd = bi_rnn_layer.weight_ih_l0.squeeze(0)
whh_fwd = bi_rnn_layer.weight_hh_l0.squeeze(0)
bih_fwd = bi_rnn_layer.bias_ih_l0
bhh_fwd = bi_rnn_layer.bias_hh_l0

wih_bwd = bi_rnn_layer.weight_ih_l0_reverse.squeeze(0)
whh_bwd = bi_rnn_layer.weight_hh_l0_reverse.squeeze(0)
bih_bwd = bi_rnn_layer.bias_ih_l0_reverse
bhh_bwd = bi_rnn_layer.bias_hh_l0_reverse
h_fwd = torch.Tensor([0.0])  
h_bwd = torch.Tensor([0.0])

for i in range(simple_sequence.shape[1]):
    # TODO: Compute and print the hidden states using the given example
    x_fwd = simple_sequence[0][i]  
    x_bwd = simple_sequence[0][-i - 1]  

    h_fwd = torch.tanh(torch.dot(x_fwd, wih_fwd) + bih_fwd + torch.dot(whh_fwd, h_fwd) + bhh_fwd)
    h_bwd = torch.tanh(torch.dot(x_bwd, wih_bwd) + bih_bwd + torch.dot(whh_bwd, h_bwd) + bhh_bwd)

    # h_fwd belongs to position i; h_bwd is stored in processing order and belongs to position T-1-i
    result.append((h_fwd.item(), h_bwd.item()))

result

[(0.8880150318145752, 0.39461836218833923),
 (0.3961097002029419, 0.9727751016616821),
 (-0.4506370723247528, 0.93424391746521),
 (-0.7057655453681946, 0.9999966025352478)]
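Note that the backward values in `result` appear in processing order, which is the reverse of the order in `output_all`: PyTorch stores the backward state next to the input position it was computed for. The sketch below illustrates this alignment under stated assumptions: it builds its own small bidirectional layer with a random input (seed, input and variable names are hypothetical), recomputes the backward states by hand, reverses them, and compares them with the second half of the output features.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed for this sketch

seq = torch.randn(1, 4, 3)  # hypothetical input: (batch, time, features)
bi = nn.RNN(input_size=3, hidden_size=1, bidirectional=True, batch_first=True)
out, h_n = bi(seq)

w_ih = bi.weight_ih_l0_reverse.squeeze(0)  # shape (3,)
w_hh = bi.weight_hh_l0_reverse.squeeze(0)  # shape (1,)
b = bi.bias_ih_l0_reverse + bi.bias_hh_l0_reverse

h = torch.zeros(1)
backward_states = []
for t in reversed(range(seq.shape[1])):  # the backward RNN reads the last element first
    h = torch.tanh(torch.dot(seq[0, t], w_ih) + torch.dot(w_hh, h) + b)
    backward_states.append(h)

# States were collected in processing order (T..1); reverse them to index by input position
backward_by_position = torch.stack(backward_states[::-1])  # (time, 1)

aligned = torch.allclose(backward_by_position, out[0, :, 1:], atol=1e-5)
# The final backward hidden state sits at the first input position of the output
final_is_first = torch.allclose(h_n[1, 0], out[0, 0, 1:], atol=1e-5)
print(aligned, final_is_first)
```

This is consistent with the outputs shown earlier, where the backward entry of `output_last` equals the backward feature at the first time step of `output_all`.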

## 2.2 Multi-layer / Stacked-layer RNN

The idea is that we add additional RNNs on top of the initial standard RNN, where each RNN added is another *layer*. The hidden state output by the first (bottom) RNN at time-step $t$ will be the input to the RNN above it at time step $t$. The prediction is then made from the final hidden state of the final (highest) layer.

In [27]:
multi_rnn_layer = nn.RNN(input_size=3, hidden_size=1, num_layers = 2, bidirectional=True, bias = True, batch_first=True)
multi_rnn_layer.state_dict()

OrderedDict([('weight_ih_l0', tensor([[ 0.3051, -0.3719, -0.9945]])),
             ('weight_hh_l0', tensor([[-0.0343]])),
             ('bias_ih_l0', tensor([0.4285])),
             ('bias_hh_l0', tensor([-0.5050])),
             ('weight_ih_l0_reverse', tensor([[-0.3493,  0.9412, -0.5624]])),
             ('weight_hh_l0_reverse', tensor([[0.1537]])),
             ('bias_ih_l0_reverse', tensor([-0.2837])),
             ('bias_hh_l0_reverse', tensor([-0.6106])),
             ('weight_ih_l1', tensor([[-0.0176,  0.2359]])),
             ('weight_hh_l1', tensor([[0.1203]])),
             ('bias_ih_l1', tensor([0.9325])),
             ('bias_hh_l1', tensor([-0.6230])),
             ('weight_ih_l1_reverse', tensor([[-0.3068, -0.3985]])),
             ('weight_hh_l1_reverse', tensor([[-0.5722]])),
             ('bias_ih_l1_reverse', tensor([0.6833])),
             ('bias_hh_l1_reverse', tensor([0.5232]))])

In [29]:
output_all, output_last = multi_rnn_layer(simple_sequence)
output_last

tensor([[[ 0.0404]],

        [[-0.9534]],

        [[ 0.1514]],

        [[ 0.8913]]], grad_fn=<StackBackward0>)

In [31]:
output_all

tensor([[[0.1018, 0.8913],
         [0.1308, 0.8127],
         [0.1988, 0.7946],
         [0.1514, 0.9049]]], grad_fn=<TransposeBackward1>)
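With a stacked bidirectional layer, `output_all` contains only the top layer (forward and backward features concatenated), while `output_last` has one entry per layer and direction; the second layer receives the concatenated forward and backward states of the first, which is why `weight_ih_l1` has two columns. A minimal sketch of how the two tensors relate, assuming a freshly created layer and a random input (all names and the seed are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed for this sketch

num_layers, hidden_size = 2, 1
seq = torch.randn(1, 4, 3)  # hypothetical input: (batch, time, features)
stacked = nn.RNN(input_size=3, hidden_size=hidden_size, num_layers=num_layers,
                 bidirectional=True, batch_first=True)
out, h_n = stacked(seq)

print(out.shape)  # (batch, time, 2 * hidden_size)
print(h_n.shape)  # (num_layers * 2, batch, hidden_size)

# Separate the layer and direction axes: (num_layers, directions, batch, hidden)
h_n_by_layer = h_n.reshape(num_layers, 2, seq.shape[0], hidden_size)
top_fwd = h_n_by_layer[-1, 0]  # top layer, forward direction
top_bwd = h_n_by_layer[-1, 1]  # top layer, backward direction

# Forward final state = last time step; backward final state = first time step
fwd_matches = torch.allclose(top_fwd, out[:, -1, :hidden_size], atol=1e-6)
bwd_matches = torch.allclose(top_bwd, out[:, 0, hidden_size:], atol=1e-6)
print(fwd_matches, bwd_matches)
```

For classification, a common choice is to concatenate `top_fwd` and `top_bwd` before the final linear layer.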

## 3. Application (News classification)

### 3.1 Dataset Description

The AG News dataset consists of 120,000 training and 7,600 test news articles, categorized into four labels: `World`, `Sports`, `Business`, and `Science/Technology`. Each article is paired with its corresponding label, and the goal is to predict the correct category based on the content.

In [35]:
!pip install datasets

from datasets import load_dataset
import re
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import torch.optim as optim
import tqdm


ImportError: The pyarrow installation is not built with support for the Parquet file format (DLL load failed while importing _parquet: The specified procedure could not be found.)

### 3.2 Get Dataset and preprocess

Here are the steps:

1. **Load the Dataset**: Use the `datasets` library to load the dataset (AG News).
2. **Tokenize the Text**: Split the text into smaller units (e.g., words or subwords).
3. **Create a Vocabulary**: Build a vocabulary based on the tokens from the training dataset.
4. **Encode the Text**: Convert the text into numerical representations, applying padding or truncation to ensure consistent input length.
5. **Organize Data into Batches**: Use a DataLoader to batch the data for efficient training
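Steps 2 to 4 can be tried on a toy corpus before running them on the full dataset. The sketch below is only an illustration: the two sentences, the `toy_` names, the frequency threshold of 2 and the maximum length of 6 are all made up for this example.

```python
import re
from collections import Counter

import torch

def toy_tokenize(text):
    # Lowercase, strip punctuation, split on whitespace
    return re.sub(r"[^\w\s]", "", text.lower()).split()

# Made-up sentences, not taken from AG News
toy_corpus = ["Stocks rally as markets open.", "Markets fall; stocks drop!"]

# Count tokens and keep those that appear at least twice
counts = Counter(tok for doc in toy_corpus for tok in toy_tokenize(doc))
toy_vocab = {"<unk>": 0, "<pad>": 1}
for word, freq in counts.items():
    if freq >= 2:
        toy_vocab[word] = len(toy_vocab)

def toy_encode(text, max_len=6):
    # Map tokens to indices, then truncate or pad to a fixed length
    ids = [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in toy_tokenize(text)][:max_len]
    return ids + [toy_vocab["<pad>"]] * (max_len - len(ids))

batch = torch.tensor([toy_encode(doc) for doc in toy_corpus])
print(toy_vocab)
print(batch)
```

Rare words map to `<unk>` and short sentences are filled with `<pad>`, so every row of `batch` has the same length and can be stacked into one tensor.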

In [37]:
# Load the AG News dataset
dataset = load_dataset("ag_news")

# Check the dataset structure (AG News provides train and test splits)
print(dataset)

NameError: name 'load_dataset' is not defined

### 3.3 Tokenize & Encode Dataset

In [None]:
# Custom tokenization function (basic whitespace and punctuation removal)
def tokenize(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.split()  # Tokenize by whitespace

# Build vocabulary from the dataset
def build_vocab(dataset, min_freq=5):
    counter = Counter()
    for example in dataset:
        tokens = tokenize(example['text'])
        counter.update(tokens)

    # Create vocabulary dictionary (word -> index)
    vocab = {'<unk>': 0, '<pad>': 1}  # Special tokens
    idx = 2
    for word, count in counter.items():
        if count >= min_freq:
            vocab[word] = idx
            idx += 1

    return vocab

# Build vocabulary from the training set
train_data = dataset['train']
vocab = build_vocab(train_data)

# Tokenize the dataset and encode into integer indices
def encode_text(text, vocab):
    tokens = tokenize(text)
    return [vocab.get(token, vocab['<unk>']) for token in tokens]

# Encoding the train and test dataset
def encode_data(dataset, vocab):
    encoded_data = []
    for example in dataset:
        input_ids = encode_text(example['text'], vocab)
        label = example['label']
        encoded_data.append({'input_ids': input_ids, 'label': label})
    return encoded_data

train_encoded = encode_data(dataset['train'], vocab)
test_encoded = encode_data(dataset['test'], vocab)

### 3.4 Creating Dataloaders

In [None]:
from torch.utils.data import Dataset, DataLoader

class AGNewsDataset(Dataset):
    def __init__(self, data, vocab, max_len=256):
        self.data = data
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        input_ids = example['input_ids']
        label = example['label']

        # Padding or truncating to max_len
        input_ids = input_ids[:self.max_len]  # Truncate to max_len
        padding_len = self.max_len - len(input_ids)
        input_ids = input_ids + [self.vocab['<pad>']] * padding_len  # Pad with <pad> token

        return torch.tensor(input_ids), torch.tensor(label)

# Create PyTorch datasets for train and test
train_dataset = AGNewsDataset(train_encoded, vocab)
test_dataset = AGNewsDataset(test_encoded, vocab)

# Create dataloaders
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

### 3.5 Define RNN model

The steps for defining the Model:
1. **Embedding Layer**: Create an embedding layer to convert token indices into dense vector representations.
2. **RNN Layer**: Use an RNN (or LSTM/GRU) to process the sequence of embeddings and capture temporal dependencies in the data.
3. **Fully Connected Layer**: Add a fully connected layer that outputs one score (logit) per class; `nn.CrossEntropyLoss` applies the softmax internally.
4. **Apply Dropout**: Implement dropout to prevent overfitting by randomly dropping units during training.
5. **Forward Pass**: Define the forward pass, which processes input sequences through the embedding, RNN, and fully connected layers.

In [None]:
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout):
        super(RNNModel, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=n_layers, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)
        rnn_out, hidden = self.rnn(embedded)
        hidden = hidden[-1, :, :]  # Use the last hidden state
        output = self.fc(self.dropout(hidden))
        return output
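Because every article is padded to `max_len`, the last hidden state of the model above is computed after the RNN has also read the trailing `<pad>` tokens. One possible variant, sketched here under assumptions, is to pack the batch with `pack_padded_sequence` so the final hidden state corresponds to the last real token. The class name `PackedRNNModel`, the toy sizes and the toy batch are hypothetical; the sketch assumes right-padded inputs and a `<pad>` index of 1, as in the vocabulary built earlier.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

PAD_IDX = 1  # assumed to match '<pad>' in the vocabulary built above

class PackedRNNModel(nn.Module):  # hypothetical variant, single layer for brevity
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=PAD_IDX)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_ids):
        # True lengths; clamp so that an all-padding row still has length 1
        lengths = (input_ids != PAD_IDX).sum(dim=1).clamp(min=1)
        embedded = self.embedding(input_ids)
        packed = pack_padded_sequence(embedded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, hidden = self.rnn(packed)  # hidden: state after the last real token
        return self.fc(hidden[-1])    # logits, shape (batch, output_dim)

toy_model = PackedRNNModel(vocab_size=10, embedding_dim=4, hidden_dim=8, output_dim=4)
toy_batch = torch.tensor([[2, 3, 4, 1, 1], [5, 6, 1, 1, 1]])  # right-padded with PAD_IDX
logits = toy_model(toy_batch)
print(logits.shape)
```

With packing, appending more padding to the batch should leave the logits unchanged, which is not the case when the RNN reads the padded positions.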


### 3.6 Model training parameters

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Initialize model, loss function, and optimizer
embedding_dim = 100
hidden_dim = 256
output_dim = 4  # 4 classes in AG News dataset
n_layers = 2
dropout = 0.5
learning_rate = 0.001

model = RNNModel(len(vocab), embedding_dim, hidden_dim, output_dim, n_layers, dropout)
model.to(device)

criterion = nn.CrossEntropyLoss()  # For multi-class classification
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
def train_model():
    num_epochs = 5
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for input_ids, label in train_loader:
            input_ids, label = input_ids.to(device), label.to(device)
            optimizer.zero_grad()
            output = model(input_ids)
            loss = criterion(output, label)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        print(f'Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.4f}')

train_model()

### 3.7 Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score

# Evaluate the model
def evaluate_model():
    model.eval()
    predictions, labels = [], []

    with torch.no_grad():
        for input_ids, label in test_loader:
            input_ids, label = input_ids.to(device), label.to(device)
            output = model(input_ids)
            preds = torch.argmax(output, dim=1)
            predictions.extend(preds.cpu().numpy())
            labels.extend(label.cpu().numpy())

    accuracy = accuracy_score(labels, predictions)
    print(f'Accuracy: {accuracy:.4f}')

evaluate_model()

## 4. Tasks

```
Task 1
Implement and train a non-RNN neural network for news classification using the AG News dataset and the following architecture:
- Embedding layer -> 1D convolution -> Flattening layer -> Fully connected layer -> Output layer
- For the flattening layer, use max-pooling or average-pooling
- For the embedding layer, use pre-trained word embeddings (e.g., FastText, GloVe)
```
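As a starting point, the architecture in Task 1 could look roughly like the sketch below. It is not a reference solution: the class name `TextCNN`, all sizes and the dummy batch are hypothetical, and the embedding is randomly initialised here (pre-trained vectors could be plugged in with `nn.Embedding.from_pretrained`). Global max-pooling over the time axis plays the role of the flattening layer.

```python
import torch
from torch import nn

class TextCNN(nn.Module):  # hypothetical name and sizes
    def __init__(self, vocab_size, embedding_dim, num_filters, kernel_size,
                 hidden_dim, output_dim, pad_idx=1):
        super().__init__()
        # Randomly initialised; swap for pre-trained vectors to meet the task requirement
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.conv = nn.Conv1d(embedding_dim, num_filters, kernel_size,
                              padding=kernel_size // 2)
        self.fc = nn.Linear(num_filters, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_ids):
        x = self.embedding(input_ids)   # (batch, seq_len, embedding_dim)
        x = x.permute(0, 2, 1)          # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))    # (batch, num_filters, seq_len)
        x = x.max(dim=2).values         # global max-pooling over time
        x = torch.relu(self.fc(x))      # (batch, hidden_dim)
        return self.out(x)              # logits, (batch, output_dim)

cnn = TextCNN(vocab_size=50, embedding_dim=16, num_filters=8, kernel_size=3,
              hidden_dim=12, output_dim=4)
dummy_ids = torch.randint(0, 50, (5, 20))  # random token indices for a shape check
cnn_logits = cnn(dummy_ids)
print(cnn_logits.shape)
```

Such a model can be trained with the same loop and loss as the RNN in section 3, since it also maps a batch of token indices to one logit vector per article.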

<hr>

```
Task 2
Implement, train and test an RNN model for the part-of-speech tagging task with the following requirements:
  - RNN should be bidirectional
  - RNN should be multi-layered
  - RNN should use regularization (e.g., Dropout)
```

**Task 2 Datasets**: [Train](https://www.dropbox.com/s/x9n6f9o9jl7pno8/train_pos.txt?dl=1), [Test](https://www.dropbox.com/s/v8nccvq7jewcl8s/test_pos.txt?dl=1)

[**Pretrained Word Embeddings**](https://pytorch.org/text/stable/vocab.html#pretrained-word-embeddings)

[**1D convolutions**](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html)


## Self Tasks

### Task 1

In [53]:
!pip uninstall -y pyarrow datasets packaging
!pip install pyarrow datasets packaging



In [19]:
!pip install scipy


### Task 2

In [51]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import nltk
from nltk.tokenize import word_tokenize

In [58]:
def load_data(file_path):
    sentences = []
    labels = []
    with open(file_path, "r", encoding="utf-8") as f:
        sentence = []
        label = []
        for line in f:
            line = line.strip()
            if not line:  # A blank line marks the start of a new sentence
                if sentence:
                    sentences.append(sentence)
                    labels.append(label)
                    sentence = []
                    label = []
            else:
                word, tag = line.split()
                sentence.append(word)
                label.append(tag)
                
        # Add the last sentence if the file does not end with a blank line
        if sentence:
            sentences.append(sentence)
            labels.append(label)
    
    return sentences, labels

# Load the data
train_texts, train_labels = load_data("train_pos.txt")
test_texts, test_labels = load_data("test_pos.txt")

print(train_texts[:2])  
print(train_labels[:2])  

[['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.'], ['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', "'s", 'restated', 'commitment', 'to', 'a', 'firm', 'monetary', 'policy', 'has', 'helped', 'to', 'prevent', 'a', 'freefall', 'in', 'sterling', 'over', 'the', 'past', 'week', '.']]
[['NN', 'IN', 'DT', 'NN', 'VBZ', 'RB', 'VBN', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NN', 'NNS', 'IN', 'NNP', ',', 'JJ', 'IN', 'NN', 'NN', ',', 'VB', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NNP', 'CC', 'NNP', 'POS', 'JJ', 'NNS', '.'], ['NNP', 'IN', 'DT', 'NNP', 'NNP', 'NNP', 'POS', 'VBN', 'NN', 'TO', 'DT', 'NN', 'JJ', 'NN', 'VBZ', 'VBN', 'TO', 'VB', 'DT', 'NN', 'IN', 'NN', 'IN', 'DT', 'JJ', 'NN', '.']]


In [55]:
# 2.  Load pre-trained GloVe embeddings (300-dimensional)
def load_glove_embeddings(file_path, embedding_dim=300):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Load GloVe embeddings (download and specify the path)
glove_embeddings = load_glove_embeddings('glove.6B.300d.txt')

In [57]:
# 3. Prepare the data (convert words to indices and labels to indices)
def preprocess_data(texts, glove_embeddings, embedding_dim=300):
    word2idx = {'<PAD>': 0, '<UNK>': 1}  # Padding and unknown words
    idx2word = {0: '<PAD>', 1: '<UNK>'}

    # Build word-to-index and index-to-word dictionaries
    for sentence in texts:
        for word in sentence:
            if word not in word2idx:
                idx = len(word2idx)
                word2idx[word] = idx
                idx2word[idx] = word

    # Size the embedding matrix from the actual vocabulary so no index can overflow it
    embedding_matrix = np.zeros((len(word2idx), embedding_dim), dtype='float32')
    for word, idx in word2idx.items():
        if word == '<PAD>':
            continue  # keep the padding vector at zero
        # glove.6B vectors are uncased, so look up the lowercased word
        vector = glove_embeddings.get(word.lower())
        if vector is None:
            vector = np.random.randn(embedding_dim)  # Random initialization for unknown words
        embedding_matrix[idx] = vector

    return word2idx, idx2word, embedding_matrix

# Preprocess the training texts
word2idx, idx2word, embedding_matrix = preprocess_data(train_texts, glove_embeddings)

# Convert POS tags to indices (collected from both splits so a test-only tag cannot raise KeyError)
unique_tags = sorted(set(tag for sentence in train_labels + test_labels for tag in sentence))
tag2idx = {tag: idx for idx, tag in enumerate(unique_tags)}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

# def convert_labels(labels, tag2idx):
#     return [[tag2idx[tag] for tag in sentence] for sentence in labels]

# labels_idx = convert_labels(labels, tag2idx)


In [None]:
# 4. Convert text data into sequences of indices
def text_to_sequence(texts, word2idx, max_length):
    sequences = []
    for sentence in texts:
        seq = [word2idx.get(word, 1) for word in sentence]
        seq = seq[:max_length] 
        seq += [0] * (max_length - len(seq))  
        sequences.append(seq)
    return np.array(sequences)

max_length = max(len(sentence) for sentence in train_texts)

In [None]:
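PAD_TAG_IDX = -100  # not a valid tag index; nn.CrossEntropyLoss ignores it by default

def tags_to_sequence(labels, tag2idx, max_length):
    # Truncate or pad each tag sequence to max_length, mirroring text_to_sequence
    sequences = []
    for sentence in labels:
        seq = [tag2idx[tag] for tag in sentence][:max_length]
        sequences.append(seq + [PAD_TAG_IDX] * (max_length - len(seq)))
    return sequences

X_train = torch.tensor(text_to_sequence(train_texts, word2idx, max_length), dtype=torch.long)
y_train = torch.tensor(tags_to_sequence(train_labels, tag2idx, max_length), dtype=torch.long)

X_test = torch.tensor(text_to_sequence(test_texts, word2idx, max_length), dtype=torch.long)
y_test = torch.tensor(tags_to_sequence(test_labels, tag2idx, max_length), dtype=torch.long)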

train_data = TensorDataset(X_train, y_train)
test_data = TensorDataset(X_test, y_test)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)
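Since the tag sequences are padded to `max_length`, padded positions should normally be excluded from both the loss and the accuracy. The sketch below shows one way to do this, assuming padded labels are marked with -100 (the default `ignore_index` of `nn.CrossEntropyLoss`); the logits and targets are made-up tensors used only to illustrate the shapes.

```python
import torch
from torch import nn

IGNORE = -100  # default ignore_index of nn.CrossEntropyLoss

num_tags = 5
logits = torch.randn(2, 4, num_tags)  # made-up scores: (batch, seq_len, num_tags)
targets = torch.tensor([[0, 3, 1, IGNORE],
                        [2, IGNORE, IGNORE, IGNORE]])  # padded tag sequences

# Flatten batch and time so every token is one classification example
criterion = nn.CrossEntropyLoss(ignore_index=IGNORE)
loss = criterion(logits.reshape(-1, num_tags), targets.reshape(-1))

# Equivalent formulation: select the real tokens explicitly
mask = targets != IGNORE
manual = nn.functional.cross_entropy(logits[mask], targets[mask])

# Token-level accuracy over real tokens only
preds = logits.argmax(dim=-1)
token_accuracy = (preds[mask] == targets[mask]).float().mean().item()
print(loss.item(), manual.item(), token_accuracy)
```

Both loss values agree because the mean is taken over the four real tokens only; without masking, padded positions would be scored as if they carried a real tag.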

In [None]:
# 5. Define the RNN model (bidirectional, multi-layered, with dropout)
class POS_RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, embedding_matrix, num_classes, hidden_dim=256, num_layers=2, dropout=0.5):
        super(POS_RNN, self).__init__()

        # Embedding layer (pretrained GloVe embeddings, fine-tuned during training)
        self.embedding = nn.Embedding.from_pretrained(torch.tensor(embedding_matrix, dtype=torch.float32), freeze=False, padding_idx=0)

        # Bidirectional multi-layer LSTM; dropout is applied between the stacked layers
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True, bidirectional=True, dropout=dropout)

        # Fully connected layers applied to every time step
        self.fc1 = nn.Linear(hidden_dim * 2, 128)  # Bidirectional LSTM outputs hidden_dim * 2
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.embedding(x)      # (batch, sequence_length, embedding_dim)
        rnn_out, _ = self.rnn(x)   # (batch, sequence_length, hidden_dim * 2)

        # POS tagging needs one prediction per token, so every time step is kept
        # (no pooling over time and no selection of the last step)
        x = torch.relu(self.fc1(rnn_out))
        return self.fc2(x)         # (batch, sequence_length, num_classes)

In [None]:
# 6. Initialize and train the model
model = POS_RNN(vocab_size=len(word2idx), embedding_dim=300, embedding_matrix=embedding_matrix, num_classes=len(tag2idx))

criterion = nn.CrossEntropyLoss(ignore_index=-100)  # Loss function; targets equal to -100 are skipped
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Optimizer

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)  # (batch, sequence_length, num_classes)
        # Ignore padded positions (word index 0 is <PAD>) in the loss
        target = target.masked_fill(data == 0, -100)
        # Flatten batch and time so each token is one classification example
        loss = criterion(output.reshape(-1, output.size(-1)), target.reshape(-1))
        loss.backward()
        optimizer.step()

    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for data, target in test_loader:
            preds = model(data).argmax(dim=-1)
            mask = data != 0  # score real tokens only
            all_preds.extend(preds[mask].tolist())
            all_labels.extend(target[mask].tolist())

    accuracy = accuracy_score(all_labels, all_preds)
    print(f'Epoch {epoch+1}/{num_epochs}, Test accuracy: {accuracy:.4f}')


In [None]:
# Set model to evaluation mode
model.eval()
with torch.no_grad():
    test_preds = model(X_test).argmax(dim=-1)  # one predicted tag per token
    mask = X_test != 0  # score real tokens only (word index 0 is <PAD>)
    print("Test Accuracy:", accuracy_score(y_test[mask].numpy(), test_preds[mask].numpy()))