<a href="https://colab.research.google.com/github/danielsaggau/deep-learning-for-nlp/blob/main/exercise7_tagginghyper_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7: Tagging and hyperparameter optimization

In this exercise, you will implement and train a bidirectional GRU for Part-Of-Speech tagging. You will also do some hyperparameter tuning.

You should complete the parts of the exercise that are marked as **TODO**.
A correctly completed **TODO** gives 2 bonus points. Partially correct answers give 1 bonus point.
Some **TODO**s are inside a comment in a code block: Here, you should complete the line of code.
Other **TODO**s are inside a text block: Here, you should write a few sentences to answer the question.

**Important:** Some students were under the impression that you have to complete a TODO in a _single_ line of code. That is not the case: you can use as many lines as you need.

**Submission deadline:** 13.01.2021, 23:59 Central European Time

**Instructions for submission:** After completing the exercise, save a copy of the notebook as exercise7_tagginghyper_MATRIKELNUMMER.ipynb, where MATRIKELNUMMER is your student ID number. Then upload the notebook to moodle (submission exercise sheet 7).

In order to understand the code, it can be helpful to experiment a bit during development, e.g., to print tensors or their shapes. But please remove these changes before submitting the notebook. If we cannot run your notebook, or if a print statement is congesting stdout too much, then we cannot grade it. 

To make the most of this exercise, you should try to read and understand the entire code, not just the parts that contain a **TODO**. If you have questions, write them down for the exercise session, which takes place in the week after the submission deadline.

**CUDA:** You can use a GPU for this exercise (on colab: Runtime -> Change Runtime Type -> GPU). This is not mandatory, but it will speed up training epochs, thereby allowing you to test more hyperparameters.

# Required libraries

In [None]:
!pip install -q numpy==1.18.0
!pip install -q torch==1.7.0
!pip install -q matplotlib==3.2.2
!pip install -q nltk==3.2.5
!pip install -q ipywidgets==7.5.1
!pip install -q tqdm==4.41.1

ERROR: tensorflow 2.4.0 has requirement numpy~=1.19.2, but you'll have numpy 1.18.0 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.

In [None]:
import numpy as np
np.random.seed(0)

import torch
torch.manual_seed(0)

if torch.cuda.is_available():
  torch.cuda.manual_seed(0)

import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from torch.nn.utils.rnn import pad_sequence

import nltk
import copy
from tqdm.notebook import tqdm, trange
from collections import defaultdict

# Data

We will do Part-Of-Speech tagging, which is the task of assigning Part-Of-Speech tags (e.g., NOUN, VERB) to the words of a sentence.

We will use the Wall Street Journal portion of the Penn Treebank, in the form of the sample that ships with NLTK (`nltk.corpus.treebank`).
State-of-the-art models reach very high accuracies on this benchmark.
With our simple model and small training set, we are aiming for a test set accuracy of around 80-85%.

We are using the dataset split from https://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf

In [None]:
WSJ_FILEIDS = {'train': [f'wsj_{i:04d}.mrg' for i in range(1, 19)],
               'dev': [f'wsj_{i:04d}.mrg' for i in range(19, 22)],
               'test': [f'wsj_{i:04d}.mrg' for i in range(22, 25)]}

nltk.download('treebank')
nltk.download('universal_tagset')
corpus = nltk.corpus.treebank

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


## Vocabularies

Here, we build the word and tag vocabularies. Note that we are not using any pretrained embeddings or models, so we can only learn words that exist in the training set. This will severely limit our test set performance, but that's okay for the purpose of this exercise.

In [None]:
word2idx = {'!PAD!': 0, '!UNK!': 1}
tag2idx = {'!PAD!': 0}

for word, tag in corpus.tagged_words(WSJ_FILEIDS['train'], tagset='universal'):
  word2idx[word] = word2idx.get(word, len(word2idx))
  tag2idx[tag] = tag2idx.get(tag, len(tag2idx))
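As a minimal sketch of the indexing idiom used above, the following toy example shows how `dict.get` assigns a fresh index only on a word's first occurrence, and how an unseen word falls back to `!UNK!` at lookup time. The word list and the names `toy_train`, `toy_word2idx` and `encoded` are made up for illustration and are not part of the exercise.

```python
# Toy training data; these words are illustrative, not read from the treebank.
toy_train = ['the', 'board', 'the', 'director']

toy_word2idx = {'!PAD!': 0, '!UNK!': 1}
for word in toy_train:
  # Keep the existing index if the word is known, else assign the next free one.
  toy_word2idx[word] = toy_word2idx.get(word, len(toy_word2idx))

print(toy_word2idx)
# {'!PAD!': 0, '!UNK!': 1, 'the': 2, 'board': 3, 'director': 4}

# Words that never occurred in training map to the !UNK! index.
encoded = [toy_word2idx.get(w, toy_word2idx['!UNK!']) for w in ['the', 'chairman']]
print(encoded)
# [2, 1]
```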

## Encoding and padding

Here, we translate the sentences into sequences of word indices (inputs) and tag indices (targets).

Since the sequences have different lengths, we must pad the shorter ones with zeros.
This ensures that PyTorch can put them into matrix format.
We will later ignore the padded positions when calculating the loss.

You should pad the inputs and targets with zeros, so that each of them becomes an integer matrix of shape $N \times J$, where $N$ is the number of sentences (datapoints) and $J$ is the length of the longest sentence (sequence length).
The padding function has already been imported above. 

**Important:** Set batch_first=True when using the padding function, otherwise the batch and sequence axes will be reversed.
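To illustrate the effect of `batch_first`, here is a small sketch on two made-up index sequences (the values and the names `toy_seqs`, `padded`, `padded_seq_first` are assumptions for illustration only). It prints the padded matrix and compares the resulting shapes with and without `batch_first=True`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy "sentences" of different lengths (the indices are made up).
toy_seqs = [torch.tensor([2, 3, 4]), torch.tensor([5])]

# batch_first=True gives shape (N, J): one row per sentence.
padded = pad_sequence(toy_seqs, batch_first=True, padding_value=0)
print(padded.shape)
print(padded)

# Without batch_first=True, the axes come out as (J, N).
padded_seq_first = pad_sequence(toy_seqs, padding_value=0)
print(padded_seq_first.shape)
```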

In [None]:
datasets = {}
for dsetname, fileids in WSJ_FILEIDS.items():
  inputs, targets = [], []
  for i, sent in enumerate(corpus.tagged_sents(fileids, tagset='universal')):
    words = [word2idx.get(word, word2idx['!UNK!']) for word, tag in sent]
    tags = [tag2idx[tag] for word, tag in sent]
    inputs.append(torch.tensor(words))
    targets.append(torch.tensor(tags))

  # TODO: Pad inputs and targets with zeros
  inputs_padded = pad_sequence(inputs, batch_first=True, padding_value=0) # EXAMPLE
  targets_padded = pad_sequence(targets, batch_first=True, padding_value=0) # EXAMPLE
  
  datasets[dsetname] = data.TensorDataset(inputs_padded, targets_padded)

Let's look at some data.
We show each datapoint first as a sequence of indices, and then as a sequence of words and tags (word|tag).

Since we are not using pretrained embeddings, we have many cases of unknown (!UNK!) words in the dev and test sets.

In [None]:
idx2word = sorted(list(word2idx.keys()), key=lambda word:word2idx[word])
idx2tag = sorted(list(tag2idx.keys()), key=lambda tag:tag2idx[tag])

for dsetname in ('train', 'dev', 'test'):
  print(dsetname, 'with', len(datasets[dsetname]), 'sentences')
  for (inputs, targets), _ in zip(datasets[dsetname], range(3)):
    inputs, targets = inputs.numpy(), targets.numpy()
    print(' '.join(f'{word}|{tag}' \
                   for word, tag in zip(inputs, targets) if word != 0))
    print(' '.join(f'{idx2word[word]}|{idx2tag[tag]}'\
                   for word, tag in zip(inputs, targets) if word != 0))
    print()

train with 203 sentences
2|1 3|1 4|2 5|3 6|1 7|4 4|2 8|5 9|5 10|6 11|1 12|7 13|6 14|4 15|1 16|1 17|3 18|2
Pierre|NOUN Vinken|NOUN ,|. 61|NUM years|NOUN old|ADJ ,|. will|VERB join|VERB the|DET board|NOUN as|ADP a|DET nonexecutive|ADJ director|NOUN Nov.|NOUN 29|NUM .|.

19|1 3|1 20|5 21|1 22|7 23|1 24|1 4|2 10|6 25|1 26|5 27|1 18|2
Mr.|NOUN Vinken|NOUN is|VERB chairman|NOUN of|ADP Elsevier|NOUN N.V.|NOUN ,|. the|DET Dutch|NOUN publishing|VERB group|NOUN .|.

28|1 29|1 4|2 30|3 6|1 7|4 31|8 32|4 21|1 22|7 33|1 34|1 35|1 36|1 4|2 37|5 38|5 39|9 13|6 14|4 15|1 22|7 40|6 41|4 42|4 43|1 18|2
Rudolph|NOUN Agnew|NOUN ,|. 55|NUM years|NOUN old|ADJ and|CONJ former|ADJ chairman|NOUN of|ADP Consolidated|NOUN Gold|NOUN Fields|NOUN PLC|NOUN ,|. was|VERB named|VERB *-1|X a|DET nonexecutive|ADJ director|NOUN of|ADP this|DET British|ADJ industrial|ADJ conglomerate|NOUN .|.

dev with 38 sentences
1141|1 1263|1 1|1 4|2 1354|3 6|1 7|4 4|2 37|5 38|5 1|9 669|4 690|1 375|1 376|1 31|8 1143|1 1153|5 1144|1 4|2 

# Model

The model consists of four layers.
You should instantiate all layers in the init function, using the appropriate hyperparameters from the config dictionary.

- A **word embedding lookup layer**, which transforms our inputs into a tensor $\mathbf{E} \in \mathbb{R}^{B \times J \times D_{emb}}$, where $B$ is the batch size, $J$ is the sequence length, and $D_{emb}$ is our embedding_dim hyperparameter
- A **single-layer bidirectional GRU**, which transforms $\mathbf{E}$ into a tensor $\mathbf{H} \in \mathbb{R}^{B \times J \times 2D_{hidden}}$, where $D_{hidden}$ is our hidden_size hyperparameter. The factor $2$ stems from the fact that the GRU is bidirectional. **Important:** When instantiating the GRU layer, you should set batch_first=True, otherwise the layer will expect the sequence axis to come before the batch axis.
- A **linear layer**, which transforms $\mathbf{H}$ into a tensor $\mathbf{O} \in \mathbb{R}^{B \times J \times C}$, where $C$ is the number of tags (output classes).
- A **dropout layer**, which zeros out neurons with probability $P$ during training (where $P$ is our dropout hyperparameter). This is a form of regularization, which is applied between layers. During inference and evaluation, the dropout layer will do nothing.



You should also implement the forward function.
The intended order of the layers is:

- word embeddings -> dropout -> GRU -> dropout -> linear

**Note:** The forward function of the GRU layer will return two tensors as a tuple.
Look [here](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) to find out which of them you should pass to the next layer.
Remember that this is a tagging task, i.e., we want to classify all words in the sequence. 

There is no nonlinearity after the final linear layer. We will use nn.CrossEntropyLoss() later, which has a built-in softmax.
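The shapes described above can be checked on random data. The following is only a sketch with standalone layers and made-up sizes (`B`, `J`, `V`, `D_emb`, `D_hidden` and `C` are toy values, not the exercise's hyperparameters); it omits dropout and is not meant as the model class itself.

```python
import torch
import torch.nn as nn

B, J = 2, 5                          # toy batch size and sequence length
V, D_emb, D_hidden, C = 10, 4, 3, 7  # toy vocabulary size and layer sizes

embedding = nn.Embedding(num_embeddings=V, embedding_dim=D_emb)
gru = nn.GRU(input_size=D_emb, hidden_size=D_hidden,
             bidirectional=True, batch_first=True)
linear = nn.Linear(in_features=2 * D_hidden, out_features=C)

inputs = torch.randint(low=0, high=V, size=(B, J))
E = embedding(inputs)   # (B, J, D_emb)
H, h_n = gru(E)         # H: one state per position; h_n: final state per direction
O = linear(H)           # (B, J, C): one score vector per word

print(E.shape, H.shape, h_n.shape, O.shape)
```

Only the first return value of the GRU has a sequence axis of length `J`, which is what a per-word classifier needs; the second return value holds one final state per direction and is not batch-first.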

In [None]:
class GRUTagger(nn.Module):
  def __init__(self, config):
    super().__init__()
    # TODO: Instantiate the layers, using the hyperparameters in the config dictionary.

    self.embedding = nn.Embedding(num_embeddings=config['num_embeddings'],
                                  embedding_dim=config['embedding_dim']) # EXAMPLE
    self.gru = nn.GRU(input_size=config['embedding_dim'],
                      hidden_size=config['hidden_size'],
                      bidirectional=True,
                      batch_first=True) # EXAMPLE
    self.linear = nn.Linear(in_features=config['hidden_size']*2,
                            out_features=config['num_classes']) # EXAMPLE
    self.dropout = nn.Dropout(p=config['dropout']) # EXAMPLE


  def forward(self, inputs):
    # TODO: Complete the forward function.
    embedded = self.embedding(inputs) # EXAMPLE
    all_states, _ = self.gru(self.dropout(embedded)) # EXAMPLE
    logits = self.linear(self.dropout(all_states)) # EXAMPLE
    return logits

# Training

## Single step
The do_step function does the forward pass and (if necessary) backward pass and gradient update on a single batch.

Remember that we padded our inputs with zeros.
Predicting the tag of a padding token is trivial (!PAD! is always tagged as !PAD!), so if we include the padded positions in our loss and accuracy calculations, we will overestimate our performance.
Therefore, we must get rid of them.
To do this, you should define a boolean mask, which is False in all positions where the inputs are padded (0) and True everywhere else.
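As a rough illustration of how such a mask behaves, the sketch below applies boolean indexing to made-up tensors (all values and the toy class count of 4 are assumptions). Indexing a `(B, J, C)` tensor with a `(B, J)` boolean mask keeps only the selected positions and flattens them into a single axis.

```python
import torch

inputs = torch.tensor([[2, 3, 4], [5, 0, 0]])    # 0 marks padding
targets = torch.tensor([[1, 1, 2], [3, 0, 0]])
logits = torch.randn(2, 3, 4)                     # (B, J, C) with C = 4 toy classes

pad_mask = inputs != 0                            # (B, J), False at padded positions
print(pad_mask)

# Boolean indexing flattens the selected positions into one axis.
print(logits[pad_mask].shape)   # (number of non-pad tokens, C)
print(targets[pad_mask])        # 1-D tensor with the gold tags of those tokens
```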

In [None]:
def do_step(model, inputs, targets, optimizer = None):
  device = next(model.parameters()).device
  inputs, targets = inputs.to(device=device), targets.to(device=device)
  logits = model(inputs)

  pad_mask = inputs != 0 # EXAMPLE # TODO: Define the boolean mask

  loss_func = nn.CrossEntropyLoss() # CrossEntropyLoss = LogSoftmax + NLLLoss
  loss = loss_func(logits[pad_mask], targets[pad_mask])

  if optimizer is not None:
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

  # Note: We do not average the accuracy at this point.
  # That is because there may be different numbers of non-pad tokens per batch,
  # so if we average within batches first, we will weight batches differently.
  # Instead, we return the number of correct predictions and the number
  # of non-padding tokens, so we can calculate the unweighted accuracy later.

  correct = (logits[pad_mask].argmax(-1) == targets[pad_mask])
  num_correct = correct.to(device='cpu', dtype=torch.float32).sum().detach()
  num_non_pad = pad_mask.to(device='cpu', dtype=torch.float32).sum().detach()
  loss_sum = (loss * inputs.shape[0]).to(device='cpu').detach()

  return loss_sum.numpy(), num_correct.numpy(), num_non_pad.numpy()

## Single epoch

The do_epoch function does a single epoch and calculates the overall loss and accuracy.
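To see why `do_step` returns counts rather than a per-batch accuracy, consider this small sketch with made-up counts for two batches of very different sizes. Averaging the per-batch accuracies gives a different number than dividing the total number of correct predictions by the total number of non-padding tokens.

```python
# (num_correct, num_non_pad) per batch; the counts are made up.
batches = [(9, 10), (1, 2)]

# Averaging per-batch accuracies weights both batches equally.
mean_of_batch_accs = sum(c / n for c, n in batches) / len(batches)

# Summing the counts first weights every token equally.
token_level_acc = sum(c for c, _ in batches) / sum(n for _, n in batches)

print(mean_of_batch_accs, token_level_acc)
```

Here the mean of batch accuracies is 0.7, while the token-level accuracy is 10/12, because the small batch no longer counts as much as the large one.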

In [None]:
def do_epoch(model, dataloader, optimizer=None):
  if optimizer is None:
    model.eval() # this disables dropout during evaluation
  else:
    model.train()
    
  total_loss, total_correct, total_non_pad = 0.0, 0.0, 0.0
  for inputs, targets in dataloader:
    loss, num_correct, num_non_pad = do_step(model, inputs, targets, 
                                             optimizer=optimizer)
    total_loss += loss
    total_correct += num_correct
    total_non_pad += num_non_pad
  
  return total_loss / len(dataloader.dataset), total_correct / total_non_pad

## Training with early stopping

The do_training_with_early_stopping function trains the model for at most a specified number of epochs.
It also keeps track of the model's dev set (validation set) accuracy.

We tune the number of epochs by doing early stopping, where we return the model from the best epoch. Training stops as soon as the dev set accuracy has not improved for more than `patience` epochs.
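The stopping rule can be traced on a plain list of numbers. The sketch below assumes a made-up sequence of per-epoch dev accuracies and a hypothetical helper `best_epoch_with_patience`; it mirrors the comparison `epoch - best_epoch > patience` without training any model.

```python
def best_epoch_with_patience(dev_accs, patience):
  """Return (best_epoch, best_acc, last_epoch_run) for per-epoch dev accuracies."""
  best_epoch, best_acc = 0, float('-inf')
  last_epoch = -1
  for epoch, acc in enumerate(dev_accs):
    last_epoch = epoch
    if acc > best_acc:
      best_epoch, best_acc = epoch, acc
    # Stop once more than `patience` epochs have passed without improvement.
    if epoch - best_epoch > patience:
      break
  return best_epoch, best_acc, last_epoch

accs = [0.60, 0.70, 0.69, 0.68, 0.71, 0.65]  # made-up dev accuracies
print(best_epoch_with_patience(accs, patience=1))  # (1, 0.7, 3)
print(best_epoch_with_patience(accs, patience=2))  # (4, 0.71, 5)
```

With `patience=1`, training stops at epoch 3 and never reaches the better epoch 4. A larger patience finds it at the cost of more epochs.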

In [None]:
def do_training_with_early_stopping(model, 
                                    optimizer, 
                                    dataloaders, 
                                    epochs, 
                                    patience):
  best_model, best_epoch, best_dev_acc = None, 0, -np.inf

  for epoch in trange(epochs):
    _, _ = do_epoch(model, dataloaders['train'], optimizer=optimizer)
    _, dev_acc = do_epoch(model, dataloaders['dev'], optimizer=None)

    if dev_acc > best_dev_acc:
      best_epoch = epoch
      best_dev_acc = dev_acc
      best_model = copy.deepcopy(model) 
      # We want to return the model from the best epoch, not from the last epoch

    if epoch - best_epoch > patience:
      break

  return best_model, best_dev_acc

# Hyperparameter search

## Ranges

Here you should define ranges (permissible values) for the hyperparameters, by filling in the list for each hyperparameter key in the CONFIG_RANGES dictionary. There should be about 3 or 4 values per key.

**Important:** The optimizers are strings. Learning rate, weight decay and dropout are floats, the others are integers. Remember that dropout is a probability.

**Note:** Use your domain knowledge to choose a sensible range for each hyperparameter. 
Keep in mind that we are dealing with a small dataset, and that overfitting will be an issue. To get an idea of what sensible learning rates and weight decays are, look at the defaults [here](https://pytorch.org/docs/stable/optim.html).

In [None]:
# TODO: Fill in some reasonable hyperparameter values for each hyperparameter
CONFIG_RANGES = {'embedding_dim': [16, 32, 64, 128], # EXAMPLE
                 'hidden_size': [16, 32, 64, 128], # EXAMPLE
                 'batch_size': [8, 16, 32, 64], # EXAMPLE
                 'optimizer': ['adamw', 'adagrad', 'sgd', 'adadelta'], # EXAMPLE
                 'learning_rate': [0.01, 0.05, 0.1], # EXAMPLE
                 'weight_decay': [0.0, 0.01, 0.05], # EXAMPLE
                 'dropout': [0.0, 0.2, 0.4]} # EXAMPLE

## Search strategy

Now, we implement our hyperparameter search strategy.
We will do simple uniform sampling.

The input to the sample_configs function consists of a budget (number of configurations that you will sample), and a dictionary of ranges, e.g.:
```
{
  'embedding_dim': [16, 32, ...], 
  'optimizer': ['adamw', 'sgd', ...], 
  ...
}
```

The output is a list of config dictionaries. 
For every config dictionary, pick one random value per hyperparameter, e.g.:
```
[
  {'embedding_dim': 32, 'optimizer': 'adamw', ...}, 
  {'embedding_dim': 16, 'optimizer': 'sgd', ...}, 
  {'embedding_dim': 64, 'optimizer': 'sgd', ...},
  ...
]
```

The length of the output list corresponds to the budget.
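To get a feeling for how a sampling budget relates to the full grid, the following sketch counts the grid size for a small made-up search space and draws a few uniform samples. The ranges in `toy_ranges` are illustrative, and the sketch uses the standard library's `random.Random` instead of `np.random` purely to stay self-contained.

```python
import random

# Made-up ranges for illustration; not the ranges of the exercise.
toy_ranges = {'hidden_size': [16, 32, 64],
              'dropout': [0.0, 0.2],
              'optimizer': ['adamw', 'sgd']}

# An exhaustive grid search would have to evaluate every combination.
grid_size = 1
for values in toy_ranges.values():
  grid_size *= len(values)
print(grid_size)  # 3 * 2 * 2 = 12

# Uniform sampling: pick one value per hyperparameter, independently.
rng = random.Random(0)
budget = 5
samples = [{key: rng.choice(values) for key, values in toy_ranges.items()}
           for _ in range(budget)]
for sample in samples:
  print(sample)
```

Because values are drawn independently, the same configuration may be sampled more than once; with a budget far below the grid size this is usually acceptable.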

In [None]:
def sample_configs(budget, config_ranges):
  configs = []
  for _ in range(budget):
    configs.append({key: r[np.random.randint(len(r))] \
                    for key, r in config_ranges.items()})
  return configs #EXAMPLE

# Hyperparameter optimization

## Optimization run

This function does a single hyperparameter optimization run. It takes as input a hyperparameter config dictionary and:
- instantiates a model, according to the config
- instantiates an optimizer, according to the config (TODO!)
- instantiates data loaders with the specified batch size
- trains the model and returns the best dev set accuracy and model

In [None]:
def do_optimization_run(config, datasets):
  assert 'test' not in datasets # the test set must not be used for model selection
  
  OPTIMIZER_CLASSES = {'adamw': optim.AdamW, 
                       'adam': optim.Adam,
                       'adagrad': optim.Adagrad,
                       'rmsprop': optim.RMSprop,
                       'adadelta': optim.Adadelta,
                       'sgd': optim.SGD}

  model = GRUTagger(config)
  if torch.cuda.is_available():
    model = model.to(device='cuda')

  # TODO: Instantiate the selected optimizer with the selected learning rate and
  # weight decay, according to the config dictionary
  optimizer_class = OPTIMIZER_CLASSES[config['optimizer']] # EXAMPLE
  optimizer = optimizer_class(model.parameters(),
                              lr=config['learning_rate'],
                              weight_decay=config['weight_decay']) # EXAMPLE
  
  dataloaders = {dsetname: data.DataLoader(datasets[dsetname], 
                                           batch_size=config['batch_size'],
                                           shuffle=dsetname=='train')
                 for dsetname in datasets}
  
  return do_training_with_early_stopping(model, 
                                         optimizer, 
                                         dataloaders, 
                                         epochs=config['epochs'],
                                         patience=config['patience'])

## Outer loop

This is the outer loop of the hyperparameter optimization:
We loop over the configurations, store their development set accuracies, and remember which model performed best.
You should implement the logic that decides which model we keep as the best model.
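One possible way to phrase the "new best" decision is a strict comparison against a running maximum. The sketch below uses made-up run names and accuracies, with `-inf` standing for a run that raised an exception; it is an illustration of the bookkeeping, not necessarily the condition expected in the exercise.

```python
import math

# (name, dev_acc) pairs; -inf stands for a run that failed with an exception.
runs = [('a', -math.inf), ('b', 0.78), ('c', 0.84), ('d', 0.84), ('e', 0.80)]

best_name, best_acc = None, -math.inf
for name, dev_acc in runs:
  # Strict comparison: failed runs (-inf) and ties never replace the current best.
  if dev_acc > best_acc:
    best_name, best_acc = name, dev_acc

print(best_name, best_acc)  # c 0.84
```

With a strict comparison, the earliest of several tied runs is kept, and a failed run can never be reported as the best model.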

In [None]:
def do_hyperparameter_optimization(configs, datasets):
  dev_accs = []
  best_model = None
  for i, config in enumerate(configs):
    print(f'Config {i+1}/{len(configs)}:', 
          ' '.join(f'{key}:{val}' for key, val in config.items()))

    # we put a try/except around the optimization runs
    # this is in case of GPU memory errors or similar issues
    # if every run fails and you end up with best_model=None, 
    # assume that something is wrong with your code
    try:
      model, dev_acc = do_optimization_run(config, datasets)
    except Exception as e:
      print('Unsuccessful run threw exception:', e)
      model, dev_acc = None, -np.inf

    dev_accs.append(dev_acc)

    if model is not None and dev_acc == np.max(dev_accs): # EXAMPLE # TODO: Condition under which the current model is the new best model
      best_model = model
      print(f'New best dev acc: {dev_acc:.4}')

  return best_model, dev_accs

# Let's go!

Feel free to increase the BUDGET parameter for a higher chance of finding a good configuration. A higher budget means that you will wait longer for the result.

If you are on a GPU, every epoch will be faster, so you can afford a higher budget and a higher number of epochs.

In [None]:
BUDGET = 50 if torch.cuda.is_available() else 15

configs = sample_configs(budget=BUDGET, config_ranges=CONFIG_RANGES)

# some additional model and training parameters
for config in configs:
  config['num_classes'] = len(tag2idx)
  config['num_embeddings'] = len(word2idx)
  config['epochs'] = 250 if torch.cuda.is_available() else 75
  config['patience'] = 25

train_and_dev = {dsetname: datasets[dsetname] for dsetname in ('train', 'dev')}
best_model, dev_accs = do_hyperparameter_optimization(configs, train_and_dev)

Config 1/50: embedding_dim:16 hidden_size:128 batch_size:16 optimizer:adamw learning_rate:0.05 weight_decay:0.01 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


New best dev acc: 0.7782
Config 2/50: embedding_dim:16 hidden_size:128 batch_size:32 optimizer:adamw learning_rate:0.01 weight_decay:0.0 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


New best dev acc: 0.8383
Config 3/50: embedding_dim:32 hidden_size:64 batch_size:64 optimizer:adadelta learning_rate:0.1 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 4/50: embedding_dim:32 hidden_size:32 batch_size:16 optimizer:adamw learning_rate:0.05 weight_decay:0.0 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 5/50: embedding_dim:128 hidden_size:32 batch_size:32 optimizer:adadelta learning_rate:0.01 weight_decay:0.05 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 6/50: embedding_dim:32 hidden_size:128 batch_size:16 optimizer:adadelta learning_rate:0.1 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 7/50: embedding_dim:32 hidden_size:32 batch_size:64 optimizer:adamw learning_rate:0.1 weight_decay:0.0 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 8/50: embedding_dim:128 hidden_size:64 batch_size:64 optimizer:adamw learning_rate:0.1 weight_decay:0.0 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 9/50: embedding_dim:16 hidden_size:32 batch_size:16 optimizer:sgd learning_rate:0.01 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 10/50: embedding_dim:128 hidden_size:16 batch_size:16 optimizer:sgd learning_rate:0.1 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 11/50: embedding_dim:32 hidden_size:128 batch_size:16 optimizer:adagrad learning_rate:0.1 weight_decay:0.05 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 12/50: embedding_dim:128 hidden_size:16 batch_size:32 optimizer:adadelta learning_rate:0.05 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 13/50: embedding_dim:64 hidden_size:16 batch_size:64 optimizer:adamw learning_rate:0.1 weight_decay:0.0 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 14/50: embedding_dim:128 hidden_size:16 batch_size:8 optimizer:adamw learning_rate:0.01 weight_decay:0.05 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 15/50: embedding_dim:128 hidden_size:64 batch_size:64 optimizer:adadelta learning_rate:0.05 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 16/50: embedding_dim:16 hidden_size:32 batch_size:16 optimizer:adagrad learning_rate:0.01 weight_decay:0.01 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 17/50: embedding_dim:16 hidden_size:32 batch_size:32 optimizer:adamw learning_rate:0.1 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


New best dev acc: 0.8433
Config 18/50: embedding_dim:128 hidden_size:64 batch_size:32 optimizer:adagrad learning_rate:0.01 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 19/50: embedding_dim:128 hidden_size:16 batch_size:32 optimizer:sgd learning_rate:0.1 weight_decay:0.05 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 20/50: embedding_dim:64 hidden_size:64 batch_size:64 optimizer:sgd learning_rate:0.1 weight_decay:0.05 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 21/50: embedding_dim:32 hidden_size:64 batch_size:64 optimizer:sgd learning_rate:0.05 weight_decay:0.05 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 22/50: embedding_dim:16 hidden_size:64 batch_size:32 optimizer:adadelta learning_rate:0.01 weight_decay:0.05 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 23/50: embedding_dim:16 hidden_size:64 batch_size:8 optimizer:sgd learning_rate:0.1 weight_decay:0.05 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 24/50: embedding_dim:16 hidden_size:16 batch_size:16 optimizer:sgd learning_rate:0.01 weight_decay:0.01 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 25/50: embedding_dim:128 hidden_size:64 batch_size:32 optimizer:adadelta learning_rate:0.05 weight_decay:0.0 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 26/50: embedding_dim:16 hidden_size:128 batch_size:64 optimizer:adamw learning_rate:0.1 weight_decay:0.05 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


New best dev acc: 0.8606
Config 27/50: embedding_dim:16 hidden_size:128 batch_size:8 optimizer:adagrad learning_rate:0.1 weight_decay:0.0 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 28/50: embedding_dim:128 hidden_size:32 batch_size:64 optimizer:adamw learning_rate:0.1 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 29/50: embedding_dim:32 hidden_size:128 batch_size:8 optimizer:adamw learning_rate:0.01 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


New best dev acc: 0.8657
Config 30/50: embedding_dim:64 hidden_size:32 batch_size:8 optimizer:adamw learning_rate:0.05 weight_decay:0.05 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


New best dev acc: 0.8728
Config 31/50: embedding_dim:32 hidden_size:128 batch_size:16 optimizer:adamw learning_rate:0.01 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 32/50: embedding_dim:64 hidden_size:32 batch_size:64 optimizer:adagrad learning_rate:0.01 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 33/50: embedding_dim:64 hidden_size:16 batch_size:32 optimizer:adadelta learning_rate:0.1 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 34/50: embedding_dim:32 hidden_size:128 batch_size:32 optimizer:adamw learning_rate:0.01 weight_decay:0.01 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 35/50: embedding_dim:128 hidden_size:64 batch_size:64 optimizer:adadelta learning_rate:0.1 weight_decay:0.0 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 36/50: embedding_dim:128 hidden_size:64 batch_size:32 optimizer:adagrad learning_rate:0.05 weight_decay:0.0 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 37/50: embedding_dim:16 hidden_size:64 batch_size:32 optimizer:adadelta learning_rate:0.05 weight_decay:0.01 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 38/50: embedding_dim:128 hidden_size:16 batch_size:32 optimizer:adadelta learning_rate:0.05 weight_decay:0.01 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 39/50: embedding_dim:16 hidden_size:32 batch_size:16 optimizer:adadelta learning_rate:0.05 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 40/50: embedding_dim:128 hidden_size:128 batch_size:32 optimizer:sgd learning_rate:0.1 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 41/50: embedding_dim:64 hidden_size:128 batch_size:16 optimizer:adadelta learning_rate:0.05 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 42/50: embedding_dim:16 hidden_size:64 batch_size:32 optimizer:adamw learning_rate:0.05 weight_decay:0.01 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 43/50: embedding_dim:128 hidden_size:16 batch_size:8 optimizer:adadelta learning_rate:0.1 weight_decay:0.01 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 44/50: embedding_dim:32 hidden_size:64 batch_size:8 optimizer:sgd learning_rate:0.01 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 45/50: embedding_dim:32 hidden_size:16 batch_size:64 optimizer:adamw learning_rate:0.01 weight_decay:0.0 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25


Config 46/50: embedding_dim:16 hidden_size:32 batch_size:64 optimizer:adadelta learning_rate:0.01 weight_decay:0.05 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25



Config 47/50: embedding_dim:32 hidden_size:32 batch_size:16 optimizer:sgd learning_rate:0.05 weight_decay:0.01 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25



Config 48/50: embedding_dim:32 hidden_size:64 batch_size:16 optimizer:adagrad learning_rate:0.01 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25




Config 49/50: embedding_dim:32 hidden_size:64 batch_size:64 optimizer:adadelta learning_rate:0.05 weight_decay:0.05 dropout:0.4 num_classes:13 num_embeddings:1682 epochs:250 patience:25



Config 50/50: embedding_dim:32 hidden_size:64 batch_size:64 optimizer:sgd learning_rate:0.01 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25



Here is a ranking of all the configurations that we evaluated:

In [None]:
print('Configs ranked:')
for idx in np.argsort(dev_accs)[::-1]:
  print(f'Dev acc: {dev_accs[idx]:.4};', 
        ' '.join(f'{key}:{val}' for key, val in configs[idx].items()))

Configs ranked:
Dev acc: 0.8728; embedding_dim:64 hidden_size:32 batch_size:8 optimizer:adamw learning_rate:0.05 weight_decay:0.05 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25
Dev acc: 0.8657; embedding_dim:32 hidden_size:128 batch_size:8 optimizer:adamw learning_rate:0.01 weight_decay:0.01 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25
Dev acc: 0.8606; embedding_dim:16 hidden_size:128 batch_size:64 optimizer:adamw learning_rate:0.1 weight_decay:0.05 dropout:0.0 num_classes:13 num_embeddings:1682 epochs:250 patience:25
Dev acc: 0.8505; embedding_dim:64 hidden_size:32 batch_size:64 optimizer:adagrad learning_rate:0.01 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25
Dev acc: 0.8464; embedding_dim:32 hidden_size:128 batch_size:16 optimizer:adamw learning_rate:0.01 weight_decay:0.0 dropout:0.2 num_classes:13 num_embeddings:1682 epochs:250 patience:25
Dev acc: 0.8433; embedding_dim:16 hidden_size:32 batch

# Evaluation on test set

As a final step, we evaluate the model on the test set.
We will also look at some predictions. 

Again, remember that the test set contains many !UNK! words, which makes it hard to reach a high accuracy.
In a realistic setting, we would use pretrained embeddings or a pretrained model such as BERT.
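
To see how much the !UNK! words contribute to the errors, token accuracy can be reported separately for known and unknown words. The following is a minimal sketch and not part of the exercise code: the helper `accuracy_by_unk` is hypothetical, and the indices in the toy example (0 for padding, 1 for !UNK!) are assumptions, so look up the actual !UNK! index in your vocabulary before using it.

```python
import numpy as np

def accuracy_by_unk(inputs, targets, predictions, unk_idx, pad_idx=0):
    """Token accuracy for known and !UNK! words, ignoring padding.

    All arguments are 1-D integer sequences of equal length (one sentence,
    or several sentences concatenated). unk_idx is an assumed vocabulary
    index and must be taken from the actual vocabulary.
    """
    inputs = np.asarray(inputs)
    correct = np.asarray(targets) == np.asarray(predictions)
    not_pad = inputs != pad_idx
    is_unk = (inputs == unk_idx) & not_pad
    is_known = (inputs != unk_idx) & not_pad

    def mean_or_nan(mask):
        # Avoid a warning when a group contains no tokens.
        return float(correct[mask].mean()) if mask.any() else float("nan")

    return {"known": mean_or_nan(is_known), "unk": mean_or_nan(is_unk)}

# Toy example with hypothetical indices: 0 = padding, 1 = !UNK!.
toy_inputs = [1, 5, 1, 7, 0]
toy_targets = [2, 3, 4, 3, 0]
toy_predictions = [2, 3, 2, 3, 0]
result = accuracy_by_unk(toy_inputs, toy_targets, toy_predictions, unk_idx=1)
```

In the toy example both known words are tagged correctly, while only one of the two !UNK! words is. A large gap of this kind on the real test set would support the point about unknown words made above.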

In [None]:
test_loader = data.DataLoader(datasets['test'], batch_size=1)
test_loss, test_acc = do_epoch(best_model, test_loader, optimizer=None)

print(f'Final test acc: {test_acc:.4}')
print()

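# The model's device does not change, so look it up once.
device = next(best_model.parameters()).device

# Show targets and predictions for the first three test sentences.
for (inputs, targets), _ in zip(test_loader, range(3)):
  inputs, targets = inputs.to(device), targets.to(device)
  predictions = best_model(inputs).argmax(-1)
  # Move to CPU and drop the batch dimension (batch_size=1).
  inputs, targets, predictions = [x.detach().cpu().numpy().squeeze(0)
                                  for x in (inputs, targets, predictions)]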

  print('Target:')
  print(' '.join(f'{idx2word[word]}|{idx2tag[tag]}' 
                for word, tag in zip(inputs, targets) if word != 0))
  print('Prediction:')
  print(' '.join(f'{idx2word[word]}|{idx2tag[tag]}' 
                for word, tag in zip(inputs, predictions) if word != 0))
  print()

Final test acc: 0.8677

Target:
!UNK!|. !UNK!|ADP its|PRON !UNK!|NOUN year|NOUN ,|. The|NOUN !UNK!|NOUN !UNK!|NOUN Journal|NOUN will|VERB report|VERB events|NOUN of|ADP the|DET past|ADJ !UNK!|NOUN that|DET !UNK!|X !UNK!|VERB as|ADP !UNK!|NOUN of|ADP !UNK!|ADJ business|NOUN !UNK!|NOUN .|. !UNK!|.
Prediction:
!UNK!|NOUN !UNK!|NOUN its|PRON !UNK!|NOUN year|NOUN ,|. The|DET !UNK!|NOUN !UNK!|NOUN Journal|NOUN will|VERB report|VERB events|NOUN of|ADP the|DET past|ADJ !UNK!|NOUN that|ADP !UNK!|NOUN !UNK!|NOUN as|ADP !UNK!|NOUN of|ADP !UNK!|NOUN business|NOUN !UNK!|NOUN .|. !UNK!|NOUN

Target:
!UNK!|NUM !UNK!|NOUN !UNK!|DET !UNK!|X !UNK!|VERB the|DET face|NOUN of|ADP !UNK!|ADJ !UNK!|NOUN were|VERB !UNK!|VERB !UNK!|X in|ADP !UNK!|NUM .|.
Prediction:
!UNK!|NOUN !UNK!|NOUN !UNK!|NOUN !UNK!|NOUN !UNK!|NOUN the|DET face|NOUN of|ADP !UNK!|NOUN !UNK!|NOUN were|VERB !UNK!|VERB !UNK!|NOUN in|ADP !UNK!|NOUN .|.

Target:
That|DET year|NOUN the|DET !UNK!|NOUN !UNK!|NOUN ,|. !UNK!|NOUN !UNK!|NOUN and|CONJ 



**TODO:** Write a short report (approx. 4 sentences) about how you optimized the hyperparameters in this experiment.
Include: The absolute and relative sizes of the train, dev and test sets, your method of hyperparameter search, the number of hyperparameter configurations (budget), the metric by which you chose the best model, what the best configuration was, and the final dev and test set accuracy.

**EXAMPLE:** We trained on 203 sentences (about 75% of the data), evaluated our hyperparameters on 38 sentences (about 14%), and used 29 sentences (about 11%) as our test set. We created 50 hyperparameter configurations by random sampling from the following values: embedding dim (16, 32, 64, 128), hidden size (16, 32, 64, 128), batch size (8, 16, 32, 64), optimizer (AdamW, Adagrad, SGD, Adadelta), learning rate (0.01, 0.05, 0.1), weight decay (0.0, 0.01, 0.05), dropout (0.0, 0.2, 0.4); in addition, we optimized the number of epochs by early stopping (at most 250 epochs, patience 25). Our metric for choosing the best model was development set accuracy. The best configuration was: embedding dim 64, hidden size 32, batch size 8, optimizer AdamW, learning rate 0.05, weight decay 0.05, dropout 0.2; the final development and test accuracies were 0.8728 and 0.8677, respectively.
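
The numbers in such a report can be reproduced with a few lines of code. The sketch below shows one plausible way to draw random configurations from the search space listed in the example and to compute the relative split sizes. The names `SEARCH_SPACE`, `sample_configs` and `split_fractions` are hypothetical, and the sampling code actually used in this notebook may differ, for example in how it handles duplicate configurations.

```python
import random

# Search space as listed in the example report.
SEARCH_SPACE = {
    "embedding_dim": [16, 32, 64, 128],
    "hidden_size": [16, 32, 64, 128],
    "batch_size": [8, 16, 32, 64],
    "optimizer": ["adamw", "adagrad", "sgd", "adadelta"],
    "learning_rate": [0.01, 0.05, 0.1],
    "weight_decay": [0.0, 0.01, 0.05],
    "dropout": [0.0, 0.2, 0.4],
}

def sample_configs(space, budget, seed=0):
    """Draw `budget` configurations uniformly at random (duplicates possible)."""
    rng = random.Random(seed)
    return [{key: rng.choice(values) for key, values in space.items()}
            for _ in range(budget)]

def split_fractions(sizes):
    """Return each split's share of the total number of sentences."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}

sampled = sample_configs(SEARCH_SPACE, budget=50)
fractions = split_fractions({"train": 203, "dev": 38, "test": 29})

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

Fixing the seed makes the sampled configurations reproducible, which helps when a report has to state exactly which configurations were evaluated.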

