# RTML Final 2021 - 2.5 Hrs (9:13 - 11:43)

In this exam, we'll have some practical exercises using RNNs and some short answer questions regarding the Transformer/attention
and reinforcement learning.

Consider the AGNews text classification dataset:

In [1]:
# !wget http://www.cs.ait.ac.th/~mdailey/data.zip

In [2]:
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab

train_iter = AG_NEWS(split='train')
tokenizer = get_tokenizer('basic_english')
counter = Counter()

def clean(line):
    line = line.replace('\\', ' ')
    return line

labels = {}
for (label, line) in train_iter:
    if label in labels:
        labels[label] += 1
    else:
        labels[label] = 1
    counter.update(tokenizer(clean(line)))

vocab = Vocab(counter, min_freq=1)

print('Label frequencies:', labels)
print('A few token frequencies:', vocab.freqs.most_common(5))
print('Label meanings: 1: World news, 2: Sports news, 3: Business news, 4: Sci/Tech news')

Label frequencies: {3: 30000, 4: 30000, 2: 30000, 1: 30000}
A few token frequencies: [('.', 225971), ('the', 205040), (',', 165685), ('to', 119817), ('a', 110942)]
Label meanings: 1: World news, 2: Sports news, 3: Business news, 4: Sci/Tech news


Here's how we can get a sequence of token indices for a sentence with the cleaner, tokenizer, and vocabulary:

In [3]:
[vocab[token] for token in tokenizer(clean('Bangkok, or The Big Mango, is one of the great cities of Asia'))]

[4248, 4, 116, 3, 244, 46857, 4, 23, 62, 7, 3, 812, 2009, 7, 989]

Let's make pipelines for processing a news story and a label:

In [4]:
text_pipeline = lambda x: [vocab[token] for token in tokenizer(clean(x))]
label_pipeline = lambda x: int(x) - 1

Here's how to create dataloaders for the training and test datasets:

In [5]:
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, length_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        length_list.append(processed_text.shape[0])
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = pad_sequence(text_list, padding_value=0)
    length_list = torch.tensor(length_list, dtype=torch.int64)
    return label_list.to(device), text_list.to(device), length_list.to(device)

train_iter = AG_NEWS(split='train')
train_dataset = list(train_iter)
test_iter = AG_NEWS(split='test')
test_dataset = list(test_iter)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)

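Here's how to get a batch from one of these dataloaders. Because of `enumerate`, the printed tuple starts with the batch index. The batch itself
has three entries: a 1D tensor of labels (8 values between 0 and 3), a 2D tensor representing the padded stories with dimension T x B
(number of tokens x batch size), and a 1D tensor giving the unpadded length of each story.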

In [6]:
batch = next(enumerate(train_dataloader))
print(batch)

(0, (tensor([2, 1, 2, 0, 3, 1, 1, 1], device='cuda:1'), tensor([[ 1470,   761,  4465,  1381,   840,  8809,   426, 26032],
        [  188,  8055,  3937,  3846,  2019,  1648,  1482,  9958],
        [ 2279,  1816,   728,  5084,   135,    20,  9688,  3566],
        [   11,  2217,     5,  1438, 10870,   629,  3473,  7800],
        [ 1564,    20,    45,  8168,   137,  1406,    12, 26032],
        [    4,    52,   284,     8,  6056,    13, 13021,     7],
        [ 1771,     2,    11,   497,    37,    10,  3121,     3],
        [ 4233,    10,    45,  5931,   840,   343,    90,    89],
        [   67,     2,    35,  3846,    40,   232,  1696,   160],
        [ 1569,  5497,  4447,  4769,  2019,   629,  1159,  6186],
        [  145,   283,  1604,  5021,   135,  1406,  2984, 18865],
        [ 1030,  2604,  1334,  2839,   456,  1648,  9688, 36067],
        [  247,   358,  1678,    11,  2010,   142,    34, 31190],
        [ 1470,    14,   426,    65,   137,   213,    38,     7],
        [    9,    2

## Question 1, 10 points

The vocabulary currently is too large for a simple one-hot embedding. Let's reduce the vocabulary size
so that we can use one-hot. First, add a step that removes tokens from a list of "stop words" to the `text_pipeline` function.
You probably want to remove punctuation ('.', ',', '-', etc.) and articles ("a", "the").

Once you've removed stop words, modify the vocabulary to include only the most frequent 1000 tokens (including 0 for an unknown/infrequent word).

Write your revised code in the cell below and output the 999 top words with their frequencies:

In [70]:
# Place code for Question 1 here
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab

train_iter = AG_NEWS(split='train')
tokenizer = get_tokenizer('basic_english')
counter = Counter()

def clean(line):
    line = line.replace('\\', ' ')
    ### Most frequent tokens before the replacements below
    # [('.', 225971),('the', 205040),(',', 165685),('a', 110942),('s', 61915),('on', 57279),('for',50417),('#39',44316),('(',41106),(')',40787),('-',39212),("'",32235),('that',28167),('with',26801),('as', 25324),('at', 24999)]
    ### Replace stop words with a space so that neighboring words stay separated
    line = line.replace('.',' ').replace(',',' ').replace('-',' ')
    line = line.replace(' the ',' ').replace('The ',' ').replace(' a ',' ').replace('A ',' ')
    ### Most frequent tokens after the replacements
    # I keep 's' and '#39' because I don't know what they are, and '(' ')' "'" because I think they contribute to the context.
    # [('to', 120680), ('of', 98652), ('in', 96422), ('and', 69670), ('s', 62116), ('on', 57667), ('for', 50674), ('#39', 44316), ('(', 41106), (')', 40787), ("'", 32235), ('that', 28169), ('with', 26812), ('as', 25381), ('at', 25234), ('its', 22123), ('is', 22106), ('new', 21393), ('by', 20942), ('it', 20537)]
    return line

labels = {}
for (label, line) in train_iter:
    if label in labels:
        labels[label] += 1
    else:
        labels[label] = 1
    counter.update(tokenizer(clean(line)))

# Original
# vocab = Vocab(counter, min_freq=1, max_size=None, specials=('<unk>', '<pad>'))
# Modified
vocab = Vocab(counter, min_freq=1, max_size=1000-1, specials=('<unk>',))

print('Label frequencies:', labels)
print('A few token frequencies:', vocab.freqs.most_common(5))
print('Label meanings: 1: World news, 2: Sports news, 3: Business news, 4: Sci/Tech news')

Label frequencies: {3: 30000, 4: 30000, 2: 30000, 1: 30000}
A few token frequencies: [('to', 120680), ('of', 98652), ('in', 96422), ('and', 69670), ('s', 62116)]
Label meanings: 1: World news, 2: Sports news, 3: Business news, 4: Sci/Tech news


In [75]:
# Write your revised code in the cell below and output the 999 top words with their frequencies:
print("Top 999 freq\n",vocab.freqs.most_common(999))

Top 999 freq


In [76]:
print("1000 words in vocab\n",vocab.stoi)

1000 words in vocab


In [46]:
text_pipeline = lambda x: [vocab[token] for token in tokenizer(clean(x))]
label_pipeline = lambda x: int(x) - 1
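
A token-level alternative to the string replacements in `clean` can be sketched as follows. This is only a minimal sketch under assumptions: `STOP_WORDS`, `remove_stop_words`, `build_top_k_vocab`, and `encode` are hypothetical names that are not part of torchtext, and the toy token lists stand in for the output of `tokenizer(clean(line))`. Filtering after tokenization avoids side effects of character-level replacement; for example, `replace('A ', ' ')` also turns "NASA " into "NAS ".

```python
from collections import Counter

# Hypothetical stop-word list (punctuation and articles); extend as needed.
STOP_WORDS = {'.', ',', '-', 'a', 'an', 'the'}

def remove_stop_words(tokens):
    """Drop stop words from an already tokenized, lowercased sentence."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

def build_top_k_vocab(token_lists, k=1000):
    """Keep the k-1 most frequent tokens; index 0 is reserved for unknown words."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(remove_stop_words(tokens))
    stoi = {'<unk>': 0}
    for tok, _ in counter.most_common(k - 1):
        stoi[tok] = len(stoi)
    return stoi, counter

def encode(tokens, stoi):
    """Map tokens to indices, sending out-of-vocabulary tokens to 0."""
    return [stoi.get(tok, 0) for tok in remove_stop_words(tokens)]

# Toy data standing in for tokenized AG_NEWS stories.
docs = [
    ['the', 'probe', 'landed', '.'],
    ['a', 'probe', 'landed', 'on', 'mars', '.'],
    ['the', 'probe', '.'],
]
stoi, counter = build_top_k_vocab(docs, k=3)
print(stoi)
print(encode(['the', 'probe', 'reached', 'mars'], stoi))
```

With `k=1000`, the resulting indices fit a 1000-dimensional one-hot vector, which is what the later questions rely on.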

## Question 2, 30 points

Next, let's build a simple RNN for classification of the AGNews dataset. Use a one-hot embedding of the vocabulary
entries and the basic RNN from Lab 10. Use the lengths tensor (the third element in the batch returned by the dataloaders)
to determine which output to apply the loss to.

Place your training code below, and plot the training and test accuracy as a
function of epoch. Finally, output a confusion matrix for the test set.

*Do not spend a lot of time on the training! A few minutes is enough. The point is to show that the model is
learning, not to get the best possible performance.*

In [8]:
# Place code for Question 2 here
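
One way to use the lengths tensor is to index the RNN outputs at position `length - 1` for each batch element, so that the loss is applied to the last real token rather than to padding. The sketch below assumes PyTorch's built-in `nn.RNN` as a stand-in for the Lab 10 RNN (which is not shown here); `OneHotRNNClassifier` and its hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneHotRNNClassifier(nn.Module):
    """Hypothetical classifier: one-hot input -> nn.RNN -> linear layer."""

    def __init__(self, vocab_size=1000, hidden_size=64, num_classes=4):
        super().__init__()
        self.vocab_size = vocab_size
        self.rnn = nn.RNN(vocab_size, hidden_size)  # input shape (T, B, V)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, text, lengths):
        # text: (T, B) token indices; lengths: (B,) unpadded lengths
        x = F.one_hot(text, num_classes=self.vocab_size).float()  # (T, B, V)
        outputs, _ = self.rnn(x)                                  # (T, B, H)
        batch_idx = torch.arange(text.shape[1], device=text.device)
        # Pick the output at the last real (unpadded) step of each sequence.
        last = outputs[lengths - 1, batch_idx]                    # (B, H)
        return self.fc(last)                                      # (B, classes)

model = OneHotRNNClassifier(vocab_size=10, hidden_size=8)
text = torch.tensor([[1, 2], [3, 4], [5, 0]])  # T=3, B=2; second story is padded
lengths = torch.tensor([3, 2])
logits = model(text, lengths)
loss = F.cross_entropy(logits, torch.tensor([0, 3]))
print(logits.shape, loss.item())
```

Because the RNN is causal, the selected output for the second story does not depend on what value is stored in its padded position.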

## Question 3, 10 points

Next, replace the SRNN from Question 2 with a single-layer LSTM. Give the same output (training and testing accuracy as a function of epoch, as well as confusion
matrix for the test set). Comment on the differences you observe between the two models.

In [9]:
# Place code for Question 3 here
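
Questions 2 and 3 both ask for a confusion matrix on the test set. A minimal sketch in plain Python is given below, assuming labels are integers 0 to 3 as produced by `label_pipeline`; `confusion_matrix` and `accuracy` are hypothetical helper names, and the label lists are made-up toy values rather than model results.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=4):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Toy labels for illustration only.
y_true = [0, 1, 2, 3, 1, 2]
y_pred = [0, 1, 2, 2, 1, 0]
cm = confusion_matrix(y_true, y_pred)
for row in cm:
    print(row)
print('accuracy:', accuracy(cm))
```

Computing accuracy from the same matrix keeps the per-epoch accuracy curve and the final confusion matrix consistent with each other.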

## Question 4, 10 points

Explain how you could use the Transformer model to perform the same task you explored in Questions 2 and 3.
How would attention be useful for this text classification task? Give a precise and detailed answer. Be sure to discuss what
parts of the original Transformer you would use and what you would have to remove.

*Write your answer here.*

## Question 5, 10 points

In Lab 13, you implemented a DQN model for tic-tac-toe. Your method learned to play against a fairly dumb `expert_action` opponent, however. Also,
DQN has proven to be less stable than other methods such as Double DQN, also discussed in Lab 13.

Explain below how you would apply double DQN and self-play to improve your tic-tac-toe agent.
Provide pseudocode for the algorithm below.

*Write your explanation and pseudocode here.*
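
For reference, the core difference between DQN and Double DQN is in how the bootstrap target is computed: the online network selects the next action, and the target network evaluates it. The sketch below illustrates only that target, using plain Python lists in place of network outputs; `double_dqn_target` and its arguments are hypothetical names, and restricting the argmax to legal moves is an assumption suited to tic-tac-toe rather than something taken from Lab 13.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next,
                      legal_actions, gamma=0.99):
    """Double DQN target for one transition.

    q_online_next / q_target_next: Q-values of the next state from the
    online and target networks, indexed by action.
    """
    if done or not legal_actions:
        return reward  # no bootstrapping from terminal states
    # Online network selects the action ...
    best_action = max(legal_actions, key=lambda a: q_online_next[a])
    # ... and the target network evaluates it.
    return reward + gamma * q_target_next[best_action]

q_online_next = [0.2, 0.9, 0.5]
q_target_next = [0.1, 0.3, 0.8]
y = double_dqn_target(0.0, False, q_online_next, q_target_next,
                      legal_actions=[0, 1, 2], gamma=0.9)
print(y)
```

In this toy case, plain DQN would bootstrap from the target network's own maximum (0.8), while Double DQN uses the target network's value for the action chosen by the online network (0.3), which is what reduces overestimation.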

## Question 6, 30 points

Based on your existing DQN implementation, implement the double DQN and self-play training method
you just described. After some training (don't spend too much time on training -- again, we just want to see that the model can
learn), show the result of you playing a game against your learned agent.

In [10]:
# Code for training and playing goes here