In [None]:
import torch
from torch import nn
from utils import *
import collections

##  3.1 Exploring the Dataset

First, download the IMDb review dataset and extract it
to the path `./data/aclImdb`.

In [None]:
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
data_dir = download_extract(url)

We start by loading the training set, which pairs each review with its sentiment label, and then print a few examples to see what the reviews and labels look like.

A positive review might look like: "The movie was fantastic, and I thoroughly enjoyed it." The label for this review would be 1.

Conversely, a negative review might look like: "The plot was confusing, and the acting was subpar." The label for this review would be 0.


In [None]:
train_data = read_imdb(data_dir, is_train=True)
print('# training examples:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[:150])

Please complete the following function. It should take the text lines as input and split each line into individual word tokens.

In the case of word tokenization, a sentence like "The quick brown fox" would be split into individual tokens: ["The", "quick", "brown", "fox"].
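
As a rough illustration of the behavior described above, the sketch below splits each line on whitespace. The name `tokenize_words` is hypothetical, and whitespace splitting is only one possible choice (it keeps punctuation attached to words and preserves case).

```python
def tokenize_words(lines):
    """Split each text line into a list of word tokens on whitespace."""
    return [line.split() for line in lines]

# Toy input: two short "reviews".
tokens = tokenize_words(["The quick brown fox", "I loved it"])
print(tokens)  # [['The', 'quick', 'brown', 'fox'], ['I', 'loved', 'it']]
```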

In [None]:
def tokenize(lines):
    """Split text lines into word tokens."""
    ### START CODE HERE ###

    ### END CODE HERE ###

Please plot the distribution of review lengths in the training data as a histogram. The x-axis should show the length of a review (in tokens), and the y-axis should show the number of reviews with that length.
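
One way to approach this is sketched below on toy data. The helper names (`review_lengths`, `bin_counts`, `plot_length_histogram`) are hypothetical, and the plotting helper assumes matplotlib is installed; it is imported lazily so the counting part runs without it.

```python
from collections import Counter

def review_lengths(tokenized_reviews):
    """Return the number of tokens in each tokenized review."""
    return [len(tokens) for tokens in tokenized_reviews]

def bin_counts(lengths, bin_width):
    """Count how many lengths fall into each bin [k*w, (k+1)*w)."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))

def plot_length_histogram(lengths, num_bins=50):
    """Draw the histogram (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt
    plt.hist(lengths, bins=num_bins)
    plt.xlabel('# tokens per review')
    plt.ylabel('count')
    plt.show()

# Toy stand-in for the tokenized training reviews.
toy_reviews = [['good'] * 3, ['bad'] * 12, ['fine'] * 14, ['long'] * 27]
lengths = review_lengths(toy_reviews)
print(lengths)                  # [3, 12, 14, 27]
print(bin_counts(lengths, 10))  # {0: 1, 10: 2, 20: 1}
```

In the notebook, the same lengths could be computed from the tokenized `train_data[0]` and passed to the plotting helper.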

In [None]:
### START CODE HERE ###

### END CODE HERE ###

The **Vocab** class maps tokens to integer indices and back. It counts token frequencies, keeps the tokens that occur at least `min_freq` times together with any reserved tokens, and maps every unknown token to the index of `<unk>`.

In [None]:
class Vocab:
    """Vocabulary for text."""
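    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        # Default to None rather than [] so calls never share a mutable default.
        tokens = [] if tokens is None else tokens
        reserved_tokens = [] if reserved_tokens is None else reserved_tokens
        # Flatten a list of token lists into a single list of tokens.
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]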
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        self.idx_to_token = list(sorted(set(['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs if freq >= min_freq])))
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if hasattr(indices, '__len__') and len(indices) > 1:
            return [self.idx_to_token[int(index)] for index in indices]
        return self.idx_to_token[indices]

    @property
    def unk(self): 
        return self.token_to_idx['<unk>']


In [None]:
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad sequences."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad
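
A quick check of both branches on toy index lists may help. The function is repeated here only so that the snippet runs on its own; the padding value `0` is an arbitrary choice for illustration.

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad sequences."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad

print(truncate_pad([5, 8, 2, 9, 4], 3, 0))  # [5, 8, 2]
print(truncate_pad([5, 8], 4, 0))           # [5, 8, 0, 0]
```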

Please complete the **load_data_imdb** function, which returns data loaders for the training and test sets together with the vocabulary of the dataset. Use the functions and classes defined above (`read_imdb`, `tokenize`, `Vocab`, and `truncate_pad`) to build them.

**Note: To process reviews in minibatches, we fix the length of each review to 500 tokens using truncation and padding.**
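
The sketch below shows, on toy data, how padded index sequences might be wrapped in a PyTorch `DataLoader`. The dictionary `toy_token_to_idx`, the `<pad>` token, and the tiny `num_steps` are assumptions made for illustration; in the notebook the indices would come from the `Vocab` object and `num_steps` would be 500.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a list of indices to exactly num_steps entries."""
    if len(line) > num_steps:
        return line[:num_steps]
    return line + [padding_token] * (num_steps - len(line))

# Hypothetical toy vocabulary and tokenized reviews.
toy_token_to_idx = {'<pad>': 0, '<unk>': 1, 'great': 2, 'movie': 3, 'bad': 4}
toy_reviews = [['great', 'movie'], ['bad', 'movie', 'really', 'bad']]
toy_labels = [1, 0]
num_steps = 3

# Map tokens to indices (unknown tokens -> 1), then truncate or pad each review.
features = torch.tensor([
    truncate_pad([toy_token_to_idx.get(t, 1) for t in review], num_steps, 0)
    for review in toy_reviews])
labels = torch.tensor(toy_labels)

loader = DataLoader(TensorDataset(features, labels), batch_size=2, shuffle=False)
X, y = next(iter(loader))
print(X.shape, y.shape)  # torch.Size([2, 3]) torch.Size([2])
```

For the training set one would normally pass `shuffle=True`; it is disabled here only to keep the toy output deterministic.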

In [None]:
def load_data_imdb(batch_size, num_steps=500):
    """Return data loaders and the vocabulary of the IMDb review dataset."""
    ### START CODE HERE ###

    ### END CODE HERE ###
    return train_loader, test_loader, vocab

In [None]:
batch_size =  ### YOUR CODE HERE ###
train_iter, test_iter, vocab = load_data_imdb(batch_size)

##  3.2 Using Bidirectional Recurrent Neural Networks for Sentiment Analysis

Please design a multilayer bidirectional RNN to process the IMDb dataset.
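
For orientation, here is a minimal sketch of one common design, not necessarily the intended solution: an embedding layer, a bidirectional multilayer LSTM, and a linear classifier over the concatenated outputs of the first and last time steps. The class name `BiRNNSketch`, the choice of LSTM, and the toy sizes are all assumptions.

```python
import torch
from torch import nn

class BiRNNSketch(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # bidirectional=True doubles the feature size of each output step.
        self.encoder = nn.LSTM(embed_size, num_hiddens,
                               num_layers=num_layers, bidirectional=True)
        # First and last steps are concatenated: 2 * (2 * num_hiddens) features.
        self.decoder = nn.Linear(4 * num_hiddens, 2)

    def forward(self, inputs):
        # inputs: (batch, steps) -> embeddings: (steps, batch, embed_size)
        embeddings = self.embedding(inputs.T)
        outputs, _ = self.encoder(embeddings)  # (steps, batch, 2 * num_hiddens)
        encoding = torch.cat((outputs[0], outputs[-1]), dim=1)
        return self.decoder(encoding)          # (batch, 2)

sketch_net = BiRNNSketch(vocab_size=50, embed_size=8, num_hiddens=16, num_layers=2)
logits = sketch_net(torch.randint(0, 50, (4, 12)))
print(logits.shape)  # torch.Size([4, 2])
```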

In [None]:
class BiRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens,
                 num_layers, **kwargs):
        ### START CODE HERE ###

        ### END CODE HERE ###

    def forward(self, inputs):
        ### START CODE HERE ###

        ### END CODE HERE ###


In [None]:
embed_size, num_hiddens, num_layers =  ### YOUR CODE HERE ###
net =   ### YOUR CODE HERE ###

Please initialize the weights of the model you defined above.
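
As a hedged illustration, the sketch below applies Xavier (Glorot) uniform initialization to linear layers and to the weight matrices of an LSTM. Xavier initialization is just one reasonable choice, and the name `init_weights_sketch` is hypothetical.

```python
import math
import torch
from torch import nn

def init_weights_sketch(module):
    """Xavier-initialize Linear weights and all LSTM weight matrices."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
    elif isinstance(module, nn.LSTM):
        for name, param in module.named_parameters():
            if 'weight' in name:  # skip the bias vectors
                nn.init.xavier_uniform_(param)

toy = nn.Sequential(nn.Linear(4, 2))
toy.apply(init_weights_sketch)  # apply() visits every submodule

# Xavier uniform samples from [-b, b] with b = sqrt(6 / (fan_in + fan_out)).
bound = math.sqrt(6 / (4 + 2))
print(toy[0].weight.abs().max().item() <= bound)
```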

In [None]:
def init_weights(module):
    ### START CODE HERE ###

    ### END CODE HERE ###

net.apply(init_weights);

Next, we load pretrained 100-dimensional GloVe embeddings for the tokens in the vocabulary. GloVe (Global Vectors for Word Representation) provides pretrained dense vectors for a large English vocabulary, so they can be used directly in natural language processing (NLP) applications. The embeddings are available in several sizes (50, 100, 200, or 300 dimensions), and the chosen size must match the model's `embed_size`, which is 100 here. Starting from pretrained vectors gives the model useful word representations before it has seen any review, which can improve its performance on the task.

You can download the **glove.6b.100d** embeddings from https://nlp.stanford.edu/data/glove.6B.zip
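
GloVe text files store one token per line, followed by its vector components separated by spaces. The sketch below parses that format from a small in-memory sample. The helper names are hypothetical, the sample vectors are made up, and returning a zero vector for unknown tokens is an assumption about how a loader such as `TokenEmbedding` might behave.

```python
def parse_glove_lines(lines):
    """Map each token to its vector, given lines of 'token v1 v2 ... vd'."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) > 1:  # skip empty or malformed lines
            table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def lookup(table, tokens, dim):
    """Return one vector per token; unknown tokens get a zero vector."""
    return [table.get(token, [0.0] * dim) for token in tokens]

# Made-up 3-dimensional sample in the GloVe text format.
sample = ["the 0.1 -0.2 0.3", "movie 0.4 0.5 -0.6"]
table = parse_glove_lines(sample)
print(lookup(table, ['movie', 'zzz'], dim=3))
# [[0.4, 0.5, -0.6], [0.0, 0.0, 0.0]]
```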

In [None]:
glove_embedding = TokenEmbedding('glove.6b.100d', './data')
embeds = glove_embedding[vocab.idx_to_token]

Please use these pretrained word vectors to represent the tokens in the reviews, and ensure that the vectors are not updated during training.
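
A minimal sketch of the idea, using a toy embedding layer: copy the pretrained matrix into the layer's weight and turn off gradient tracking for it. The tensor `pretrained` stands in for `embeds`, and in the notebook the layer would be whatever embedding attribute your model defines (for example `net.embedding`, which is an assumption).

```python
import torch
from torch import nn

embedding = nn.Embedding(5, 3)

# Stand-in for the GloVe matrix: one row per vocabulary index.
pretrained = torch.arange(15, dtype=torch.float32).reshape(5, 3)

embedding.weight.data.copy_(pretrained)  # overwrite the random initialization
embedding.weight.requires_grad = False   # freeze: no gradient, no updates

trainable = [p for p in embedding.parameters() if p.requires_grad]
print(len(trainable))                # 0
print(embedding(torch.tensor([1])))  # tensor([[3., 4., 5.]])
```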

In [None]:
### START CODE HERE ###
### END CODE HERE ### 

Finally, we train the network. In the following function, train your network and evaluate it on the test set after each epoch. You should also plot the training loss, training accuracy, and test accuracy for each epoch.
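
The core of such a function can be sketched as one training pass plus one evaluation pass per epoch. The helpers below (`accuracy`, `train_epoch`, `evaluate`) are hypothetical names, and the toy linear model and random data only serve to make the snippet runnable. Because the loss is created with `reduction="none"`, it returns one value per example and has to be reduced before calling `backward()`.

```python
import torch
from torch import nn

def accuracy(y_hat, y):
    """Number of correct predictions in a batch."""
    return (y_hat.argmax(dim=1) == y).float().sum().item()

def train_epoch(net, data_iter, loss, trainer):
    net.train()
    total_loss, total_correct, total = 0.0, 0.0, 0
    for X, y in data_iter:
        trainer.zero_grad()
        y_hat = net(X)
        l = loss(y_hat, y)   # per-example losses (reduction="none")
        l.mean().backward()  # reduce to a scalar before backpropagating
        trainer.step()
        total_loss += l.sum().item()
        total_correct += accuracy(y_hat, y)
        total += y.numel()
    return total_loss / total, total_correct / total

def evaluate(net, data_iter):
    net.eval()
    correct, total = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            correct += accuracy(net(X), y)
            total += y.numel()
    return correct / total

torch.manual_seed(0)
toy_net = nn.Linear(4, 2)
toy_data = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]
toy_loss = nn.CrossEntropyLoss(reduction="none")
toy_trainer = torch.optim.Adam(toy_net.parameters(), lr=0.01)

# One (train loss, train accuracy, test accuracy) tuple per epoch.
history = [train_epoch(toy_net, toy_data, toy_loss, toy_trainer)
           + (evaluate(toy_net, toy_data),) for _ in range(2)]
print(history)
```

The per-epoch tuples in `history` are what one would then plot, for example with matplotlib.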

In [None]:
def train(net, train_iter, test_iter, loss, trainer, num_epochs):
    ### START CODE HERE ###

    ### END CODE HERE ### 

In [None]:
lr, num_epochs = 0.01, 5
trainer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss(reduction="none")
train(net, train_iter, test_iter, loss, trainer, num_epochs)

##  3.3 Using TextCNN for Sentiment Analysis

Please design a TextCNN to process the IMDb dataset.
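
As a rough guide, a TextCNN typically runs several one-dimensional convolutions with different kernel widths over the embedded sequence, applies max-over-time pooling to each, and classifies the concatenated result. The sketch below follows that outline with a single embedding layer. The class name `TextCNNSketch`, the dropout rate, and the toy sizes are assumptions rather than the required design.

```python
import torch
from torch import nn

class TextCNNSketch(nn.Module):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # One Conv1d per (output channels, kernel width) pair.
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_size, c, k)
            for c, k in zip(num_channels, kernel_sizes)])
        self.pool = nn.AdaptiveMaxPool1d(1)  # max-over-time pooling
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Linear(sum(num_channels), 2)

    def forward(self, inputs):
        # (batch, steps) -> (batch, steps, embed) -> (batch, embed, steps)
        embeddings = self.embedding(inputs).permute(0, 2, 1)
        pooled = [self.pool(torch.relu(conv(embeddings))).squeeze(-1)
                  for conv in self.convs]
        encoding = torch.cat(pooled, dim=1)  # (batch, sum(num_channels))
        return self.decoder(self.dropout(encoding))

cnn = TextCNNSketch(vocab_size=20, embed_size=8,
                    kernel_sizes=[3, 4, 5], num_channels=[6, 6, 6])
out = cnn(torch.randint(0, 20, (4, 10)))
print(out.shape)  # torch.Size([4, 2])
```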

In [None]:
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        ### START CODE HERE ###

        ### END CODE HERE ### 

    def forward(self, inputs):
        ### START CODE HERE ###

        ### END CODE HERE ### 

Instantiate the network first, so that `net` refers to the TextCNN rather than the BiRNN from Section 3.2 when the weights are initialized and the embeddings are loaded.

In [None]:
embed_size, kernel_sizes, nums_channels =  ### YOUR CODE HERE ###
net = ### YOUR CODE HERE ###

Please initialize the weights of the model you just created.

In [None]:
def init_weights(module):
    ### START CODE HERE ###

    ### END CODE HERE ### 

net.apply(init_weights);

Next, we will load the pretrained 100-dimensional GloVe embeddings for tokens in the vocabulary.

In [None]:
glove_embedding = TokenEmbedding('glove.6b.100d', './data')
embeds = glove_embedding[vocab.idx_to_token]

Please use these pretrained word vectors to represent tokens in the reviews.

In [None]:
### START CODE HERE ###
### END CODE HERE ### 

Finally, we will proceed with training our network.

In [None]:
lr, num_epochs = 0.001, 5
trainer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss(reduction="none")
train(net, train_iter, test_iter, loss, trainer, num_epochs)

Finally, complete the function below to predict the sentiment of a text sequence with the trained model, then try it on two short examples.
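
The prediction step can be sketched as: split the sequence into tokens, map them to indices, run the model on a batch of one, and take the argmax over the two classes (1 for positive, 0 for negative, as in the dataset labels). Everything below is a toy stand-in: `predict_sentiment_sketch`, the plain dictionary used as a vocabulary, and the hand-set `nn.EmbeddingBag` model are assumptions chosen so that the snippet runs without training.

```python
import torch
from torch import nn

def predict_sentiment_sketch(net, token_to_idx, sequence, unk_idx=0):
    """Return 'positive' or 'negative' for a whitespace-tokenized sequence."""
    net.eval()
    indices = torch.tensor(
        [token_to_idx.get(token, unk_idx) for token in sequence.split()])
    with torch.no_grad():
        label = net(indices.reshape(1, -1)).argmax(dim=1)  # batch of one
    return 'positive' if label.item() == 1 else 'negative'

# Toy vocabulary: index 0 is reserved for unknown tokens.
toy_token_to_idx = {'<unk>': 0, 'great': 1, 'bad': 2}

# Toy "model": averages 2-d token vectors, so the output acts as class scores.
toy_net = nn.EmbeddingBag(3, 2, mode='mean')
with torch.no_grad():
    toy_net.weight.copy_(torch.tensor([[0., 0.],    # <unk>: neutral
                                       [0., 1.],    # great: votes for class 1
                                       [1., 0.]]))  # bad: votes for class 0

print(predict_sentiment_sketch(toy_net, toy_token_to_idx, 'this movie is so great'))  # positive
print(predict_sentiment_sketch(toy_net, toy_token_to_idx, 'this movie is so bad'))    # negative
```

With a real model on a GPU, the index tensor would also need to be moved to the model's device before the forward pass.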

In [None]:
def predict_sentiment(net, vocab, sequence):
    """Predict the sentiment of a text sequence."""
    ### START CODE HERE ###
    ### END CODE HERE ### 

In [None]:
predict_sentiment(net, vocab, 'this movie is so great')

In [None]:
predict_sentiment(net, vocab, 'this movie is so bad')