
**Task Description:**
You have learned about transformers and their applications in natural language processing. In this assignment, you will apply your knowledge by implementing a transformer-based model to solve a text classification task.


**Dataset:**
You will be using the IMDB movie review dataset, which contains movie reviews labeled as positive or negative sentiment. The dataset will be downloaded and loaded using Python's file handling capabilities.

**Task:**

Your task is to build a transformer-based model using the torch.nn.Transformer module to classify movie reviews as positive or negative sentiment. You can use the provided dataset for training and evaluation.

**Instructions:**

(1) Download and Extract the IMDB Dataset:Run the following script to download and extract the IMDB dataset:

In [6]:
import os
import tarfile
import urllib.request

# Function to download and extract IMDB dataset
def download_extract_imdb(root="./imdb_data"):
    if not os.path.exists(root):
        os.makedirs(root)

    url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    filename = os.path.join(root, "aclImdb_v1.tar.gz")
    urllib.request.urlretrieve(url, filename)

    # Extract the tar.gz file
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall(root)

# Download and extract IMDB dataset
download_extract_imdb()


(2) Load and Preprocess the Dataset:Use the following script to load the IMDB dataset, preprocess it, and tokenize the reviews:

In [7]:
import os
from torchtext.data.utils import get_tokenizer

# Set up tokenizer
tokenizer = get_tokenizer("basic_english")

# Load training data
def load_imdb_data(root="./imdb_data/aclImdb"):
    train_data = []
    for label in ["pos", "neg"]:
        label_dir = os.path.join(root, "train", label)
        for filename in os.listdir(label_dir):
            with open(os.path.join(label_dir, filename), "r", encoding="utf-8") as file:
                review = file.read()
                # Tokenize review
                tokenized_review = tokenizer(review)
                train_data.append((tokenized_review, 1 if label == "pos" else 0))
    return train_data

# Load training data
train_data = load_imdb_data()

# Load testing data
def load_test_data(root="./imdb_data/aclImdb"):
    test_data = []
    for label in ["pos", "neg"]:
        label_dir = os.path.join(root, "test", label)
        for filename in os.listdir(label_dir):
            with open(os.path.join(label_dir, filename), "r", encoding="utf-8") as file:
                review = file.read()
                # Tokenize review
                tokenized_review = tokenizer(review)
                test_data.append((tokenized_review, 1 if label == "pos" else 0))
    return test_data

# Load testing data
test_data = load_test_data()


In [8]:
# Display tokenized positive and negative examples
print("Tokenized Positive Example:")
print(train_data[0][0])
print("Tokenized Negative Example:")
print(train_data[len(train_data)//2][0])

Tokenized Positive Example:
['for', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'imagine', 'a', 'movie', 'where', 'joe', 'piscopo', 'is', 'actually', 'funny', '!', 'maureen', 'stapleton', 'is', 'a', 'scene', 'stealer', '.', 'the', 'moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'watch', 'for', 'alan', 'the', 'skipper', 'hale', 'jr', '.', 'as', 'a', 'police', 'sgt', '.']
Tokenized Negative Example:
['working', 'with', 'one', 'of', 'the', 'best', 'shakespeare', 'sources', ',', 'this', 'film', 'manages', 'to', 'be', 'creditable', 'to', 'it', "'", 's', 'source', ',', 'whilst', 'still', 'appealing', 'to', 'a', 'wider', 'audience', '.', 'branagh', 'steals', 'the', 'film', 'from', 'under', 'fishburne', "'", 's', 'nose', ',', 'and', 'there', "'", 's', 'a', 'talented', 'cast', 'on', 'good', 'form', '.']


In [9]:
# Display tokenized examples with labels for training dataset
print("Training Dataset:")
for review, label in train_data[:3]:
    print("Label:", "Positive" if label == 1 else "Negative")
    print("Tokenized Review:", review)
    print()

# Display tokenized examples with labels for testing dataset
print("Testing Dataset:")
for review, label in test_data[:3]:
    print("Label:", "Positive" if label == 1 else "Negative")
    print("Tokenized Review:", review)
    print()


Training Dataset:
Label: Positive
Tokenized Review: ['for', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'imagine', 'a', 'movie', 'where', 'joe', 'piscopo', 'is', 'actually', 'funny', '!', 'maureen', 'stapleton', 'is', 'a', 'scene', 'stealer', '.', 'the', 'moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'watch', 'for', 'alan', 'the', 'skipper', 'hale', 'jr', '.', 'as', 'a', 'police', 'sgt', '.']

Label: Positive
Tokenized Review: ['bizarre', 'horror', 'movie', 'filled', 'with', 'famous', 'faces', 'but', 'stolen', 'by', 'cristina', 'raines', '(', 'later', 'of', 'tv', "'", 's', 'flamingo', 'road', ')', 'as', 'a', 'pretty', 'but', 'somewhat', 'unstable', 'model', 'with', 'a', 'gummy', 'smile', 'who', 'is', 'slated', 'to', 'pay', 'for', 'her', 'attempted', 'suicides', 'by', 'guarding', 'the', 'gateway', 'to', 'hell', '!', 'the', 'scenes', 'with', 'raines', 'modeling', 'are',

This script loads the IMDB dataset, tokenizes the reviews using the basic_english tokenizer, and displays tokenized examples for both positive and negative sentiment reviews.

(3) Implement the Transformer Model:Implement the Transformer model using the torch.nn.Transformer module.

(4)Train the Model:Define loss function and optimizer, and train the model on the training dataset.

(5) Evaluate the Model:Evaluate the trained model on the testing dataset.

(6) Calculate accuracy and other relevant metrics.

Submission:Submit your implementation along with a brief report describing your model architecture, training procedure, evaluation results, and any insights gained.

# 1st Model

In [None]:
# code helper
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.data.utils import get_tokenizer
from torch.utils.data import DataLoader, TensorDataset
from torchtext.vocab import build_vocab_from_iterator
from collections import Counter


# Your code here: implement the Transformer model
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, num_encoder_layers, hidden_dim, dropout):
        super(TransformerModel, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_size, 
            nhead=num_heads, 
            dim_feedforward=hidden_dim, 
            dropout=dropout,
            batch_first=True  # Set batch_first to True
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
        self.fc = nn.Linear(embed_size, 2)  # Assuming 2 classes for your output

    def forward(self, x):
        x = self.embed(x)  # [batch_size, seq_len, embed_size]
        x = x.transpose(0, 1)  # Transpose to [seq_len, batch_size, embed_size] for Transformer
        x = self.transformer_encoder(x)  # [seq_len, batch_size, embed_size]
        x = x.mean(dim=0)  # Average pooling over the sequence dimension [batch_size, embed_size]
        x = self.fc(x)  # [batch_size, 2]
        return x
    
# Define loss function and optimizer
tokenizer = get_tokenizer("basic_english")

# Flatten all token lists into a single list and count occurrences
# all_train_tokens = [token for token_list, _ in train_data for token in token_list]
# vocab_counter = Counter(all_train_tokens)
# # The vocabulary size is the number of unique tokens
# vocab_size = len(vocab_counter)

embed_size = 128
num_heads = 2
num_encoder_layers = 2
hidden_dim = 256
dropout = 0.2

# Your code here: define loss function and optimizer
model = TransformerModel(vocab_size, embed_size, num_heads, num_encoder_layers, hidden_dim, dropout)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


# Build vocab from train data
vocab = build_vocab_from_iterator((token_list for token_list, _ in train_data), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

vocab_size = len(vocab)

# Helper function to encode and pad token lists
def encode_and_pad(token_list, vocab, pad_index, max_length=None):
    indices = [vocab[token] for token in token_list]
    if max_length is None:
        max_length = max(len(t) for t, _ in train_data)
    padded_indices = indices + [pad_index] * (max_length - len(indices))
    return padded_indices

# Encode and pad the token lists from train and test data
pad_index = vocab["<pad>"]
max_length = max(max(len(t) for t, _ in train_data), max(len(t) for t, _ in test_data))  # Get max length if needed

train_encoded = [torch.tensor(encode_and_pad(token_list, vocab, pad_index, max_length), dtype=torch.long) for token_list, _ in train_data]
train_labels = torch.tensor([label for _, label in train_data], dtype=torch.long)
test_encoded = [torch.tensor(encode_and_pad(token_list, vocab, pad_index, max_length), dtype=torch.long) for token_list, _ in test_data]
test_labels = torch.tensor([label for _, label in test_data], dtype=torch.long)

# Create TensorDatasets
train_dataset = TensorDataset(torch.stack(train_encoded), train_labels)
test_dataset = TensorDataset(torch.stack(test_encoded), test_labels)

# Create DataLoaders
batch_size = 32 # You can adjust the batch size
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


### BUG FIX
max_index = max([max(seq) for seq in train_encoded if len(seq) > 0])
if max_index >= vocab_size:
    raise ValueError(f"Index {max_index} out of range with vocab size {vocab_size}")

for token_list, _ in train_data:
    for token in token_list:
        index = vocab[token]
        if index >= vocab_size:  # vocab_size is the size set in the embedding layer
            print(f"Out-of-range token '{token}' with index {index}")
### BUG FIX

# Your code here: train the model
# Train the model
def train(model, loader):
    model.train()
    total_loss = 0
    count = 0

    for batch in loader:
        if count % 100==0:
            print(count)

        count += 1
        inputs, targets = batch
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

# Your code here: evaluate the model
def evaluate(model, loader):
    model.eval()
    count = 0
    correct, total = 0, 0
    with torch.no_grad():
        for batch in loader:
            inputs, targets = batch
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += targets.size(0)
            correct += (predicted == targets).sum().item()
    return correct / total

# Run training and evaluation
for epoch in range(10):  # Number of epochs
    train_loss = train(model, train_loader)
    accuracy = evaluate(model, test_loader)
    print(f"Epoch: {epoch}, Loss: {train_loss:.4f}, Accuracy: {accuracy:.2f}")


0
100
200
300
400
500
600
700
Epoch: 0, Loss: 0.6790, Accuracy: 0.67
0
100
200
300
400
500
600
700
Epoch: 1, Loss: 0.5858, Accuracy: 0.72
0
100
200
300
400
500
600
700
Epoch: 2, Loss: 0.5149, Accuracy: 0.74
0
100
200
300
400
500
600
700
Epoch: 3, Loss: 0.4858, Accuracy: 0.75
0
100
200
300
400
500
600
700
Epoch: 4, Loss: 0.4697, Accuracy: 0.74
0
100
200
300
400
500
600
700
Epoch: 5, Loss: 0.4638, Accuracy: 0.74
0
100
200
300
400
500
600
700
Epoch: 6, Loss: 0.4290, Accuracy: 0.77
0
100
200
300
400
500
600
700
Epoch: 7, Loss: 0.4196, Accuracy: 0.78
0
100
200
300
400
500
600
700
Epoch: 8, Loss: 0.6297, Accuracy: 0.57
0
100
200
300
400
500
600
700
Epoch: 9, Loss: 0.6778, Accuracy: 0.51


# 2nd Model

In [25]:
import os
import tarfile
import urllib.request
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, random_split


# Function to download and extract IMDB dataset
def download_extract_imdb(root="./imdb_data"):
    if not os.path.exists(root):
        os.makedirs(root)

    url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    filename = os.path.join(root, "aclImdb_v1.tar.gz")
    urllib.request.urlretrieve(url, filename)

    # Extract the tar.gz file
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall(root)

# Check if the dataset is downloaded and extracted
if not os.path.exists("./imdb_data/aclImdb"):
    download_extract_imdb()

# Tokenizer
tokenizer = get_tokenizer("basic_english")

# Load data
def load_imdb_data(root="./imdb_data/aclImdb"):
    data = []
    for label in ["pos", "neg"]:
        label_dir = os.path.join(root, "train", label)
        for filename in os.listdir(label_dir):
            with open(os.path.join(label_dir, filename), "r", encoding="utf-8") as file:
                review = file.read()
                # Tokenize review
                tokenized_review = tokenizer(review)
                data.append((tokenized_review, 1 if label == "pos" else 0))
    return data

# Load training data
train_data = load_imdb_data()

# Build vocabulary
def build_vocab(data, unk_token="<unk>", pad_token="<pad>"):
    vocab = set()
    for tokens, _ in data:
        vocab.update(tokens)
    vocab = list(vocab)
    vocab.insert(0, pad_token)  # padding token
    vocab.insert(0, unk_token)  # unknown token
    vocab_to_idx = {word: idx for idx, word in enumerate(vocab)}
    return vocab_to_idx, vocab

vocab_to_idx, vocab = build_vocab(train_data)

# Model
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, num_layers, num_classes, dropout=0.5):
        super(TransformerModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.transformer_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim, dropout=dropout),
            num_layers=num_layers
        )
        self.fc = nn.Linear(embed_dim, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        embedded = self.embedding(text)
        embedded = embedded.permute(1, 0, 2)
        transformer_output = self.transformer_encoder(embedded)
        pooled_output = torch.mean(transformer_output, dim=0)
        pooled_output = self.dropout(pooled_output)
        logits = self.fc(pooled_output)
        return logits

# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
VOCAB_SIZE = len(vocab_to_idx)
EMBED_DIM = 60
NUM_HEADS = 2
HIDDEN_DIM = 60
NUM_LAYERS = 1
NUM_CLASSES = 2

model = TransformerModel(VOCAB_SIZE, EMBED_DIM, NUM_HEADS, HIDDEN_DIM, NUM_LAYERS, NUM_CLASSES).to(device)

# Training
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()



# Define the collate function to pad sequences
def collate_batch(batch):
    text, labels = zip(*batch)
    labels = torch.tensor(labels)
    # Find the maximum length of text in the batch
    max_length = max(len(item) for item in text)
    # Create a tensor to hold the padded sequences
    padded_text = torch.zeros((len(text), max_length), dtype=torch.long)
    for i, item in enumerate(text):
        # Fill the tensor with the sequences, leaving the remaining space as padding
        padded_text[i, :len(item)] = torch.tensor([vocab_to_idx[token] for token in item])
    return padded_text, labels

# Split dataset into training and validation sets
train_size = int(0.8 * len(train_data))
val_size = len(train_data) - train_size
train_dataset, val_dataset = random_split(train_data, [train_size, val_size])

# Define batch size
BATCH_SIZE = 6

# Create data loaders
train_iterator = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
val_iterator = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

# Define the number of epochs
N_EPOCHS = 10

# Loop over epochs
for epoch in range(N_EPOCHS):
    # Training Phase
    print("epoch", epoch)
    model.train()  # Set the model to training mode
    epoch_train_loss = 0
    correct_train = 0  # Initialize correct prediction counter for training

    for text, labels in train_iterator:
        text, labels = text.to(device), labels.to(device)  # Move data to GPU if available

        optimizer.zero_grad()  # Zero the gradients

        # Forward pass
        predictions = model(text)

        # Compute the loss
        train_loss = criterion(predictions, labels)

        # Backward pass
        train_loss.backward()

        # Update the parameters
        optimizer.step()

        # Accumulate the training loss for this batch
        epoch_train_loss += train_loss.item()

        # Calculate training accuracy
        _, predicted = torch.max(predictions.data, 1)
        correct_train += (predicted == labels).sum().item()

    # Compute average training loss and accuracy for the epoch
    average_train_loss = epoch_train_loss / len(train_iterator)
    train_accuracy = 100 * correct_train / len(train_iterator.dataset)

    # Validation Phase
    model.eval()  # Set the model to evaluation mode
    epoch_val_loss = 0
    correct_val = 0  # Initialize correct prediction counter for validation

    with torch.no_grad():  # No gradient computation during validation
        for text, labels in val_iterator:
            text, labels = text.to(device), labels.to(device)  # Move data to GPU if available

            # Forward pass
            predictions = model(text)

            # Compute the loss
            val_loss = criterion(predictions, labels)

            # Accumulate the validation loss for this batch
            epoch_val_loss += val_loss.item()

            # Calculate validation accuracy
            _, predicted = torch.max(predictions.data, 1)
            correct_val += (predicted == labels).sum().item()

    # Compute average validation loss and accuracy for the epoch
    average_val_loss = epoch_val_loss / len(val_iterator)
    val_accuracy = 100 * correct_val / len(val_iterator.dataset)

    # Print epoch information
    print(f'Epoch: {epoch+1:02} | Train Loss: {average_train_loss:.3f} | Train Acc: {train_accuracy:.2f}% | Val. Loss: {average_val_loss:.3f} | Val Acc: {val_accuracy:.2f}%')

epoch 0
Epoch: 01 | Train Loss: 0.484 | Train Acc: 75.44% | Val. Loss: 0.359 | Val Acc: 86.28%
epoch 1
Epoch: 02 | Train Loss: 0.257 | Train Acc: 89.84% | Val. Loss: 0.344 | Val Acc: 88.50%
epoch 2
Epoch: 03 | Train Loss: 0.171 | Train Acc: 93.67% | Val. Loss: 0.405 | Val Acc: 88.68%
epoch 3
Epoch: 04 | Train Loss: 0.112 | Train Acc: 96.20% | Val. Loss: 0.467 | Val Acc: 87.98%
epoch 4
Epoch: 05 | Train Loss: 0.074 | Train Acc: 97.61% | Val. Loss: 0.614 | Val Acc: 88.14%
epoch 5
Epoch: 06 | Train Loss: 0.050 | Train Acc: 98.40% | Val. Loss: 0.610 | Val Acc: 87.32%
epoch 6
Epoch: 07 | Train Loss: 0.033 | Train Acc: 98.96% | Val. Loss: 0.871 | Val Acc: 87.18%
epoch 7
Epoch: 08 | Train Loss: 0.022 | Train Acc: 99.30% | Val. Loss: 0.959 | Val Acc: 87.42%
epoch 8
Epoch: 09 | Train Loss: 0.015 | Train Acc: 99.39% | Val. Loss: 1.119 | Val Acc: 87.34%
epoch 9
Epoch: 10 | Train Loss: 0.014 | Train Acc: 99.56% | Val. Loss: 0.960 | Val Acc: 86.56%


# **BRIEF REPORT**

Transformer-Based Models for Sentiment Analysis:


**Model 1:**

- Architecture: Multi-head encoding, manual token handling and vocabulary management.
- Training Time: Extensive, around 9 hours.
- Performance: Achieved low accuracy (51%) with high training time, indicating inefficiency.
- Insight: The substantial training duration of approximately 9 hours and a drop in accuracy from 77% to 51% in later epochs for Model 1 emphasize the inefficiencies in its manual data preprocessing approach and the need for more streamlined data management techniques to optimize performance.


**Model 2:**

- Architecture: Simplified and efficient, utilizing built-in functions for data handling and model operations, multi-head encoding within its Transformer architecture. Streamlines data handling by using a custom collate function to manage padding and batch creation efficiently.
- Training Time: Significantly shorter, around 30 minutes.
- Performance: High initial accuracy (up to 88%) but showed signs of overfitting as indicated by increasing validation loss despite improving training accuracy.
- Insight: Model 2’s efficiency is evident with a training time of approximately 34 minutes and an impressive peak accuracy of 88.68%. However, the rise in validation loss during later epochs highlights the need for adjustments in model regularization to prevent overfitting.

The significant difference in training times between the two transformer models is mainly due to Model 1's manual data processing and complex architecture, which increase computational overhead and extend training duration. In contrast, Model 2 uses optimized torchtext functions for efficient data handling and a simpler architecture, along with a smaller batch size, enhancing training speed, thereby substantially reducing its training time. 

Model 2 is preferable for practical applications due to its efficiency and simplicity, though it may require adjustments to prevent overfitting, such as fine-tuning dropout rates or epoch numbers. Model 1, while detailed and informative for understanding model internals, needs optimization to reduce training time and improve stability.

**Submission Instructions:**

Submit your Python code in a single notebook file, show your work in detail.