
**Task Description:**
You have learned about transformers and their applications in natural language processing. In this assignment, you will apply your knowledge by implementing a transformer-based model to solve a text classification task.


**Dataset:**
You will be using the IMDB movie review dataset, which contains movie reviews labeled as positive or negative sentiment. You will download the archive and read the review files from disk with Python's standard library (urllib, tarfile, and os).

**Task:**

Your task is to build a transformer-based model using the torch.nn.Transformer module to classify movie reviews as positive or negative sentiment. You can use the provided dataset for training and evaluation.

**Instructions:**

(1) Download and Extract the IMDB Dataset: Run the following script.

In [10]:
import os
import tarfile
import urllib.request

# Download the IMDB archive into `root` and extract it there
def download_extract_imdb(root="./imdb_data"):
    os.makedirs(root, exist_ok=True)

    url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
    filename = os.path.join(root, "aclImdb_v1.tar.gz")
    # Skip the download if the archive is already on disk
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

    # Extract the tar.gz file
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall(root)

# Download and extract IMDB dataset
download_extract_imdb()
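
Before moving on, it can help to confirm that the extraction produced the expected directory layout. The following is a minimal sketch, not part of the original assignment: the helper name `count_reviews` is hypothetical, and it assumes the archive was extracted to `./imdb_data/aclImdb` as in the script above. The full dataset contains 25,000 training and 25,000 test reviews, split evenly between `pos` and `neg`.

```python
import os

def count_reviews(root="./imdb_data/aclImdb"):
    """Count the .txt review files in each (split, label) directory."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            label_dir = os.path.join(root, split, label)
            if os.path.isdir(label_dir):
                n = sum(1 for name in os.listdir(label_dir) if name.endswith(".txt"))
            else:
                n = 0  # a missing directory suggests the extraction did not finish
            counts[(split, label)] = n
    return counts
```

Calling `count_reviews()` after the download should report 12,500 files for each of the four directories. A lower number usually points to an interrupted download or extraction.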


(2) Load and Preprocess the Dataset: Use the following script to load the IMDB dataset and tokenize the reviews:

In [21]:
import os
from torchtext.data.utils import get_tokenizer

# Set up tokenizer
tokenizer = get_tokenizer("basic_english")

# Load training data
def load_imdb_data(root="./imdb_data/aclImdb"):
    train_data = []
    for label in ["pos", "neg"]:
        label_dir = os.path.join(root, "train", label)
        for filename in os.listdir(label_dir):
            with open(os.path.join(label_dir, filename), "r", encoding="utf-8") as file:
                review = file.read()
                # Tokenize review
                tokenized_review = tokenizer(review)
                train_data.append((tokenized_review, 1 if label == "pos" else 0))
    return train_data

# Load training data
train_data = load_imdb_data()

# Load testing data
def load_test_data(root="./imdb_data/aclImdb"):
    test_data = []
    for label in ["pos", "neg"]:
        label_dir = os.path.join(root, "test", label)
        for filename in os.listdir(label_dir):
            with open(os.path.join(label_dir, filename), "r", encoding="utf-8") as file:
                review = file.read()
                # Tokenize review
                tokenized_review = tokenizer(review)
                test_data.append((tokenized_review, 1 if label == "pos" else 0))
    return test_data

# Load testing data
test_data = load_test_data()




In [12]:
# Display tokenized positive and negative examples
print("Tokenized Positive Example:")
print(train_data[0][0])
print("Tokenized Negative Example:")
print(train_data[len(train_data)//2][0])

Tokenized Positive Example:
['.', '.', '.', 'for', 'paris', 'is', 'a', 'moveable', 'feast', '.', 'ernest', 'hemingway', 'it', 'is', 'impossible', 'to', 'count', 'how', 'many', 'great', 'talents', 'have', 'immortalized', 'paris', 'in', 'paintings', ',', 'novels', ',', 'songs', ',', 'poems', ',', 'short', 'but', 'unforgettable', 'quotes', ',', 'and', 'yes', '-', 'movies', '.', 'the', 'celebrated', 'film', 'director', 'max', 'ophüls', 'said', 'about', 'paris', ',', 'it', 'offered', 'the', 'shining', 'wet', 'boulevards', 'under', 'the', 'street', 'lights', ',', 'breakfast', 'in', 'montmartre', 'with', 'cognac', 'in', 'your', 'glass', ',', 'coffee', 'and', 'lukewarm', 'brioche', ',', 'gigolos', 'and', 'prostitutes', 'at', 'night', '.', 'everyone', 'in', 'the', 'world', 'has', 'two', 'fatherlands', 'his', 'own', 'and', 'paris', '.', 'paris', 'is', 'always', 'associated', 'with', 'love', 'and', 'romance', ',', 'and', 'paris', ',', 'je', 't', "'", 'aime', 'which', 'is', 'subtitled', 'petite', 

In [15]:
# Display tokenized examples with labels for training dataset
print("Training Dataset:")
for review, label in train_data[:3]:
    print("Label:", "Positive" if label == 1 else "Negative")
    print("Tokenized Review:", review)
    print()

# Display tokenized examples with labels for testing dataset
print("Testing Dataset:")
for review, label in test_data[:3]:
    print("Label:", "Positive" if label == 1 else "Negative")
    print("Tokenized Review:", review)
    print()


Training Dataset:
Label: Positive
Tokenized Review: ['.', '.', '.', 'for', 'paris', 'is', 'a', 'moveable', 'feast', '.', 'ernest', 'hemingway', 'it', 'is', 'impossible', 'to', 'count', 'how', 'many', 'great', 'talents', 'have', 'immortalized', 'paris', 'in', 'paintings', ',', 'novels', ',', 'songs', ',', 'poems', ',', 'short', 'but', 'unforgettable', 'quotes', ',', 'and', 'yes', '-', 'movies', '.', 'the', 'celebrated', 'film', 'director', 'max', 'ophüls', 'said', 'about', 'paris', ',', 'it', 'offered', 'the', 'shining', 'wet', 'boulevards', 'under', 'the', 'street', 'lights', ',', 'breakfast', 'in', 'montmartre', 'with', 'cognac', 'in', 'your', 'glass', ',', 'coffee', 'and', 'lukewarm', 'brioche', ',', 'gigolos', 'and', 'prostitutes', 'at', 'night', '.', 'everyone', 'in', 'the', 'world', 'has', 'two', 'fatherlands', 'his', 'own', 'and', 'paris', '.', 'paris', 'is', 'always', 'associated', 'with', 'love', 'and', 'romance', ',', 'and', 'paris', ',', 'je', 't', "'", 'aime', 'which', 'is',

This script loads the IMDB dataset, tokenizes the reviews using the basic_english tokenizer, and displays tokenized examples for both positive and negative sentiment reviews.
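
The tokenized reviews are still lists of strings, while an embedding layer expects fixed-length sequences of integer ids. One possible way to bridge that gap is sketched below in plain Python. The helper names `build_vocab` and `encode`, the special tokens `<pad>` and `<unk>`, and the defaults `min_freq=2` and `max_len=256` are all assumptions made for illustration, not part of the assignment.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(tokenized_reviews, min_freq=2):
    """Map each token seen at least `min_freq` times to an integer id."""
    counter = Counter(tok for review in tokenized_reviews for tok in review)
    vocab = {PAD: 0, UNK: 1}  # id 0 is reserved for padding, id 1 for unknown tokens
    for tok, freq in counter.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len=256):
    """Convert tokens to ids, truncating or right-padding to `max_len`."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:max_len]]
    ids += [vocab[PAD]] * (max_len - len(ids))
    return ids
```

With the data loaded above, `vocab = build_vocab(tokens for tokens, _ in train_data)` followed by `encode(tokens, vocab)` for each review would give equal-length id lists that can be stacked into a tensor. Building the vocabulary from the training split only keeps test-set words from leaking into the model.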

(3) Implement the Transformer Model: Implement the Transformer model using the torch.nn.Transformer module.

(4) Train the Model: Define the loss function and optimizer, and train the model on the training dataset.

(5) Evaluate the Model: Evaluate the trained model on the testing dataset.

(6) Calculate accuracy and other relevant metrics.
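
For step (6), accuracy alone can hide how the model treats each class, so precision, recall, and F1 are common companions. The sketch below computes them from two lists of 0/1 labels without extra libraries. The function name `binary_metrics` is hypothetical, and it treats label 1 (positive sentiment) as the positive class.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Each ratio falls back to 0.0 when its denominator is zero
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because the IMDB test set is balanced, accuracy is a reasonable headline number here. The other three metrics become more informative if the model is biased toward one label.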

Submission: Submit your implementation along with a brief report describing your model architecture, training procedure, evaluation results, and any insights gained.

In [13]:
# Your code here: implement the Transformer model

# Your code here: define loss function and optimizer

# Your code here: train the model

# Your code here: evaluate the model
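
As a starting point for the first placeholder, here is one possible shape for an encoder-only classifier. It is a sketch under several assumptions rather than the expected solution: it uses `nn.TransformerEncoder` (the encoder half of the `torch.nn.Transformer` family) instead of the full encoder-decoder module, it takes batch-first inputs of shape `[batch_size, seq_len]`, it adds a learned positional embedding, and it assumes padding uses id 0. The class name `ReviewClassifier` and the `max_len` and `pad_idx` parameters are hypothetical.

```python
import torch
import torch.nn as nn

class ReviewClassifier(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, embed_size=128, num_heads=2, num_layers=2,
                 hidden_dim=256, dropout=0.2, max_len=256, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)
        # Learned positional embedding: self-attention has no built-in notion of order
        self.pos_embedding = nn.Embedding(max_len, embed_size)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=num_heads, dim_feedforward=hidden_dim,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False)
        self.classifier = nn.Linear(embed_size, 2)  # two logits: negative, positive

    def forward(self, x):
        # x: [batch_size, seq_len] of token ids, with seq_len <= max_len
        pad_mask = x == self.pad_idx                       # True where padded
        positions = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        h = self.embedding(x) + self.pos_embedding(positions)
        h = self.encoder(h, src_key_padding_mask=pad_mask)  # [batch, seq, embed]
        # Mean-pool over real tokens only, ignoring padded positions
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)                      # [batch_size, 2]
```

Masking the padded positions in both the attention call and the pooling step keeps short reviews from being dominated by padding. A model that follows a `[seq_len, batch_size]` layout instead would drop `batch_first=True` and pool over dimension 0.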


**Submission Instructions:**

Submit your Python code in a single notebook file and show your work in detail.

In [None]:
# code helper
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.data.utils import get_tokenizer
from torch.utils.data import DataLoader, TensorDataset


# Define the Transformer model architecture
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, num_encoder_layers, hidden_dim, dropout):
        super(TransformerModel, self).__init__()

        # Define the embedding layer


        # Define the transformer encoder layers


        # Define the linear layer for classification
        # Output dimension is 2 for binary classification

    def forward(self, x):
        # x: [seq_len, batch_size]

        # Embedding layer
           # [seq_len, batch_size, embed_size]

        # Transformer encoder layers
           # [seq_len, batch_size, embed_size]

        # Average pooling over the sequence dimension
          # [batch_size, embed_size]

        # Classification
           # [batch_size, 2]

        pass

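# Build the vocabulary from the training tokens. The basic_english tokenizer
# is a plain function and has no get_vocab() method.
vocab = {"<pad>": 0, "<unk>": 1}
for tokens, _ in train_data:
    for token in tokens:
        vocab.setdefault(token, len(vocab))

# Define hyperparameters, loss function, and optimizer
vocab_size = len(vocab)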
embed_size = 128
num_heads = 2
num_encoder_layers = 2
hidden_dim = 256
dropout = 0.2

model = TransformerModel(vocab_size, embed_size, num_heads, num_encoder_layers, hidden_dim, dropout)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model


# Test the accuracy of the model
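
The two remaining placeholders could be filled with a conventional PyTorch loop. The sketch below is one way to structure it and makes a few assumptions: the helper names `train_one_epoch` and `evaluate` are hypothetical, each batch from the `DataLoader` is an `(inputs, labels)` pair of tensors, and the model accepts `inputs` in whatever layout the loader yields (transpose first if the model expects `[seq_len, batch_size]`).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer, device="cpu"):
    """Run one pass over `loader` and return the mean training loss."""
    model.train()
    total_loss, seen = 0.0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()              # clear gradients from the previous step
        loss = criterion(model(inputs), labels)
        loss.backward()                    # backpropagate
        optimizer.step()                   # update parameters
        total_loss += loss.item() * labels.size(0)
        seen += labels.size(0)
    return total_loss / seen

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    """Return classification accuracy of `model` over `loader`."""
    model.eval()                           # disable dropout for evaluation
    correct, seen = 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        preds = model(inputs).argmax(dim=1)
        correct += (preds == labels).sum().item()
        seen += labels.size(0)
    return correct / seen
```

A typical run would wrap the encoded reviews in a `TensorDataset`, create a shuffled `DataLoader` for training, call `train_one_epoch` for a few epochs, and then call `evaluate` on a loader built from the test split.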


