### Web Intelligence - Exercise 11

In this exercise, we will explore the core principles and practical implementations of Recurrent Neural Networks (RNNs), one of the fundamental architectures for sequential data modeling. We will begin by building a simple RNN to understand its mechanics, including how hidden states evolve and how sequential dependencies are captured. 

**Question 1.** Classifying Names with a Character-Level RNN

We will implement a character-level Recurrent Neural Network (RNN) to classify names by their language of origin in PyTorch. A dataset containing names from different languages can be downloaded from [here](https://download.pytorch.org/tutorial/data.zip).

Import the required packages and set the data folder path

In [1]:
import os
import string
import torch
from tqdm.notebook import tqdm

# Define the data folder path
data_folder = "./data/names"

In [12]:
# Print possible languages
possible_languages = [filename[:-3] for filename in os.listdir(data_folder)]
print(possible_languages)

['Arabic.', 'Chinese.', 'Czech.', 'Dutch.', 'English.', 'French.', 'German.', 'Greek.', 'Irish.', 'Italian.', 'Japanese.', 'Korean.', 'Polish.', 'Portuguese.', 'Russian.', 'Scottish.', 'Spanish.', 'Vietnamese.']


In [17]:
# Choose languages and read the files.
languages = ["Spanish", "French"]
# Set the maximum length of names (discard names whose length larger than the threshold)
max_name_length = 15


# Read the names stored in the language files.
names = []
for language in languages:
    file_path = os.path.join(data_folder, f"{language}.txt")
    with open(file_path, "r", encoding="utf-8") as f:
        current_language_names = [line.strip() for line in f if len(line) <= max_name_length]
    names.append(current_language_names)

Map letters to character indices and represent names and letters as one-hot encodings.

**Note:** Indexing can be started from $1$ and $0$ can be used for unknown characters.

In [18]:
# Map letters to indices. Note that for unknown characters, we can use index 0.
letter2idx = {letter: idx+1 for idx, letter in enumerate(string.ascii_letters)}

letters_num = len(letter2idx) + 1 # +1 for unknown letters

# Map given letter to one-hot encoding
def letter2onehot(letter):
    one_hot_vect = torch.zeros(1, letters_num)
    idx = letter2idx.get(letter, 0)
    if idx != 0:
        one_hot_vect[0][idx] = 1
    return one_hot_vect

# Convert the 
def name2onehot(name):
    assert len(name) < max_name_length, f"Given name is longer than {max_name_length}!"
    one_hot_enc = torch.zeros(max_name_length, letters_num)
    for pos, letter in enumerate(name):
        # if letter not in letter2idx:
        #     raise ValueError(f"Letter {letter} = {name}")
        one_hot_enc[-len(name)+pos:, :] = letter2onehot(letter)
    return one_hot_enc

Define the dataset tensors

In [20]:
train_data = torch.stack(
    [name2onehot(name) for language_names in names for name in language_names]
)
train_labels = torch.as_tensor(
    [label for label, language_names in enumerate(names) for _ in language_names],
    dtype=torch.long
)

Define an RNN architecture

In [21]:
class RNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = torch.nn.Linear(input_size, hidden_size)
        self.h2h = torch.nn.Linear(hidden_size, hidden_size)
        self.h2o = torch.nn.Linear(hidden_size, output_size)
        self.log_softmax = torch.nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        hidden = torch.nn.functional.tanh(self.i2h(input) + self.h2h(hidden))
        output = self.h2o(hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

For training, define the model and optimization parameters

In [22]:
# Define the number of languages
class_num = len(languages)
# Define the model hyper-parameters
n_hidden = 128
# Define the optimization parameter
lr = 0.01
epochs_num = 100

# Initialize the model, loss, and optimizer
model = RNN(letters_num, n_hidden, class_num)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

Train the model.

In [23]:
# Set the platform that we will use for training.
#torch.device("cuda" if torch.cuda.is_available() else "cpu") (for Mac M1/M2, torch.device('mps'))
# device = torch.device('mps') 
device = torch.device('cpu') 

# Move the model parameters
model.to(device)

# Training loop
for epoch in tqdm(range(epochs_num)):
    
    epoch_loss = 0
    for idx in tqdm(torch.randperm(train_data.shape[0]), desc="Current epoch:"):
        
        input_mat = train_data[idx].to(device)
        label = train_labels[idx].to(device).unsqueeze(0)
    
        # Forward pass
        optimizer.zero_grad()
        
        # Initialize the hidden state of the model
        hidden = model.initHidden().to(device)
        # 
        for letter_enc in input_mat:
            output, hidden = model(letter_enc, hidden)
        
        # Calculate loss
        loss = loss_function(output, label)
        loss.backward()
        # optimizer.step()
        for p in model.parameters():
            p.data.add_(p.grad.data, alpha=-lr)
        
        epoch_loss += loss.item()
    
    print(f"Epoch {epoch+1}, Loss: {epoch_loss/train_data.shape[0]:.4f}")  
        
# Transfer the model back to the cpu.
# model.to('cpu')

  0%|          | 0/100 [00:00<?, ?it/s]

Current epoch::   0%|          | 0/575 [00:00<?, ?it/s]

Epoch 1, Loss: 0.5951


Current epoch::   0%|          | 0/575 [00:00<?, ?it/s]

Epoch 2, Loss: 0.4312


Current epoch::   0%|          | 0/575 [00:00<?, ?it/s]

Epoch 3, Loss: 0.3577


Current epoch::   0%|          | 0/575 [00:00<?, ?it/s]

Epoch 4, Loss: 0.3207


Current epoch::   0%|          | 0/575 [00:00<?, ?it/s]

KeyboardInterrupt: 

Implement an evaluation to measure accuracy on the dataset.

In [9]:
# Evaluate the model
accuracies = torch.zeros(class_num, class_num)
with torch.no_grad():
    
    for class_idx, language_names in enumerate(names):
        success_count = 0
        for name in language_names:
            
            hidden = model.initHidden()
            for letter_enc in name2onehot(name):
                output, hidden = model(letter_enc, hidden)
            
            pred = output.argmax(dim=1, keepdim=True)        
            accuracies[class_idx][pred] += 1

print(accuracies)

tensor([[296.,   2.],
        [  8., 269.]])


**Question 2.** Sentiment Analysis

We will implement a Recurrent Neural Network (RNN) to perform sentiment analysis on the [IMDB movie review](https://huggingface.co/datasets/stanfordnlp/imdb) dataset, classifying reviews as either positive or negative. 


In [10]:
import torch
from torch.utils.data import TensorDataset, DataLoader
from datasets import load_dataset # 
from nltk.tokenize import word_tokenize
from tqdm.notebook import tqdm
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
import re

Load the dataset and prepare the training and testing sets

In [11]:
imdb = load_dataset("imdb")
train_data = imdb['train']["text"]
train_labels = imdb["train"]["label"]
test_data = imdb['test']["text"]
test_labels = imdb["test"]["label"]

Preprocess the data by lowercasing, removing punctuation, special characters, and stop words, and tokenizing the text. 

In [12]:
stop_words = set(stopwords.words("english"))

def preprocess_texts(texts, stop_words):
    
    # Initialize the lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    cleaned_texts = []
    for text in tqdm(texts, desc="Pre-processing text"):
        
        # 1. Lowercase the text
        text = text.lower()
        
        # 3. Remove punctuation and special characters
        text = re.sub(r"[^a-z\s]", "", text)
        
        # Remove html tags
        clean = re.compile('<.*?>')
        text = re.sub(clean, '', text)
        
        # 4. Tokenize the text
        words = word_tokenize(text)
        
        # 5. Remove stopwords
        words = [word for word in words if word not in stop_words]
        
        # 6. Lemmatize the tokens 
        words = [lemmatizer.lemmatize(word) for word in words]
        
        # Join tokens back into a single string
        cleaned_texts.append(words)
        
    return cleaned_texts

train_corpus = preprocess_texts(train_data, stop_words)
test_corpus = preprocess_texts(test_data, stop_words)

Pre-processing text:   0%|          | 0/25000 [00:00<?, ?it/s]

Pre-processing text:   0%|          | 0/25000 [00:00<?, ?it/s]

Map the words to integers and pad the sequences to ensure fixed-length inputs.

**Note:** Indexing can be started from $1$ and $0$ can be used for tokens that do not appear in the vocabulary.

In [13]:
## Build a dictionary that maps words to integers
counts = Counter([word for text in train_corpus+test_corpus for word in text])
# Define the vocabulary
vocab = sorted(counts, key=counts.get, reverse=True)
# 0 might be used for padding so start from 1.
word2int = {word: idx for idx, word in enumerate(vocab, 1)}

print("Vocab size:", len(word2int))
print("Training set:", len(train_data))
print("Most frequent 5 words:", vocab[:5] )

Vocab size: 164578
Training set: 25000
Most frequent 5 words: ['br', 'movie', 'film', 'one', 'like']


Convert word sequences to integer sequences

In [14]:
train_int_corpus = [[word2int[word] for word in text] for text in train_corpus]
test_int_corpus = [[word2int[word] for word in text] for text in test_corpus]

def add_padding(corpus, max_length):

    # Initialize the output matrix
    output = torch.zeros(size=(len(corpus), max_length), dtype=int)

    # Add padding and discard the remaining part if it is longer than the 'max_length'
    for idx, text in tqdm(enumerate(corpus), desc="Adding pad"):
        output[idx, -len(text):] = torch.as_tensor(text, dtype=int)[:max_length]
    
    return output

# Define the maximum length for each input text.
# Discard the remaining part of the texts if they are longer than the threshold value.
max_length = 200
train_int_corpus = add_padding(train_int_corpus, max_length=max_length)
test_int_corpus = add_padding(test_int_corpus, max_length=max_length)


Adding pad: 0it [00:00, ?it/s]

Adding pad: 0it [00:00, ?it/s]

Implement a RNN model in PyTorch, including:
- an embedding layer to convert words into dense vector representations.
- a recurrent layer to capture sequential patterns.
- a fully connected output layer for output.

In [15]:
class SentimentRNN(torch.nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        super(SentimentRNN, self).__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_size)
        self.rnn = torch.nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = torch.nn.Linear(hidden_size, output_size)

    def forward(self, input_ids, attention_mask=None):
        embedded = self.embedding(input_ids)
        output, hidden = self.rnn(embedded)
        # Use the last hidden state
        logits = self.fc(hidden.squeeze(0))
        return logits

Set the model and optimization hyperparameters, and implement the evaluation function.

In [16]:
vocab_size = len(vocab)
embed_size = 100
hidden_size = 50
output_size = 2
batch_size = 64
lr = 0.01
epochs_num = 10

# Initialize the model, loss, and optimizer
model = SentimentRNN(vocab_size, embed_size, hidden_size, output_size)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Construct data loaders
train_int_corpus = torch.as_tensor(train_int_corpus)
test_int_corpus = torch.as_tensor(test_int_corpus)

train_labels = torch.as_tensor(train_labels, dtype=torch.long)
test_labels = torch.as_tensor(test_labels, dtype=torch.long)

train_data = TensorDataset(train_int_corpus, train_labels)
test_data = TensorDataset(test_int_corpus, test_labels)

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

Define the evaluation function that can be used to compute the loss and to count the number of correct predictions for each mini-batch.

In [17]:
# Evaluation function
def batch_evaluate(inputs, labels, model, loss_function):
    correct = 0
    
    logits = model(inputs)
    loss = loss_function(logits, labels)
    
    preds = logits.argmax(dim=1)
    correct += (preds == labels).sum().item()
    
    return correct, loss


In [18]:
# Set the platform that we will use for training.
#torch.device("cuda" if torch.cuda.is_available() else "cpu") (for Mac M1/M2, torch.device('mps'))
device = torch.device('mps') 

# Move the model parameters
model.to(device)

# Training loop
for epoch in tqdm(range(epochs_num)):
    model.train()
    
    train_loss, train_total_correct, total_count = 0, 0, 0
    for batch in tqdm(train_loader, desc="Batch:"):
        
        # Forward pass
        optimizer.zero_grad()
        
        train_inputs, train_labels = batch
        train_inputs, train_labels = train_inputs.to(device), train_labels.to(device)
        
        batch_correct, batch_loss = batch_evaluate(train_inputs, train_labels, model, loss_function)
        
        # Backward pass
        optimizer.zero_grad()
        batch_loss.backward()
        optimizer.step()

        train_loss += batch_loss.item()
        train_total_correct += batch_correct
        
        total_count += len(train_labels)
        
    accuracy = train_total_correct / total_count
    print(f"Epoch {epoch+1}/{epochs_num}, Train Loss: {train_loss:.4f}, Accuracy: {accuracy:.4f}")
        
    # Evaluate test
    test_loss, test_total_correct, total_count = 0, 0, 0
    for batch in tqdm(test_loader, desc="Test Batch:"):
        
        with torch.no_grad():
        
            test_inputs, test_labels = batch
            test_inputs, test_labels = test_inputs.to(device), test_labels.to(device)
            
            batch_correct, batch_loss = batch_evaluate(test_inputs, test_labels, model, loss_function)
        
            test_loss += batch_loss.item()
            test_total_correct += batch_correct
            total_count += len(test_labels)
        
    print(f"Test Loss: {train_loss:.4f}, Accuracy: {accuracy:.4f}")

# Transfer the model back to the cpu.
model.to('cpu')

  0%|          | 0/10 [00:00<?, ?it/s]

Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 1/10, Train Loss: 258.0152, Accuracy: 0.6076


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 258.0152, Accuracy: 0.6076


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 2/10, Train Loss: 210.2010, Accuracy: 0.7414


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 210.2010, Accuracy: 0.7414


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 3/10, Train Loss: 185.3649, Accuracy: 0.7794


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 185.3649, Accuracy: 0.7794


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 4/10, Train Loss: 180.2875, Accuracy: 0.7895


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 180.2875, Accuracy: 0.7895


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 5/10, Train Loss: 155.0892, Accuracy: 0.8257


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 155.0892, Accuracy: 0.8257


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 6/10, Train Loss: 138.8202, Accuracy: 0.8509


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 138.8202, Accuracy: 0.8509


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 7/10, Train Loss: 132.1910, Accuracy: 0.8593


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 132.1910, Accuracy: 0.8593


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 8/10, Train Loss: 124.9832, Accuracy: 0.8686


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 124.9832, Accuracy: 0.8686


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 9/10, Train Loss: 118.4366, Accuracy: 0.8762


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 118.4366, Accuracy: 0.8762


Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Epoch 10/10, Train Loss: 112.4382, Accuracy: 0.8829


Test Batch::   0%|          | 0/391 [00:00<?, ?it/s]

Test Loss: 112.4382, Accuracy: 0.8829


SentimentRNN(
  (embedding): Embedding(164578, 100)
  (rnn): RNN(100, 50, batch_first=True)
  (fc): Linear(in_features=50, out_features=2, bias=True)
)