# Assignment 3: Machine Translation

TA contact for this assignment: Raymond Xiong (raymond.xiong@duke.edu), Xinchang Xiong (xinchang.xiong@duke.edu)


---

In this assignment you will implement an LSTM-based sequence-to-sequence model for machine translation.
* We have provided an implementation of the encoder. You will need to implement an LSTM-based decoder and then use it to train a basic sequence-to-sequence model.
* Next, you will extend the decoder with attention and implement beam search decoding (we have provided a greedy decoder as reference).
* Lastly, you can try to improve the model using extensions such as back-translation or data augmentation.

**Warning**: Attention and beam search can be tricky to implement. We expect this assignment to take longer than the CRF one. Please don't start the day before it is due!

We will use the Multi30k dataset for this assignment, which consists of roughly 30k German-English sentence pairs.

**Note**: When implementing beam search, to keep things simple we will not use batching (beam search runs on one sentence at a time). However, when implementing the decoders, please use batching.

**Grading Rubric**
- 70% results
  - 20% seq2seq_predictions_baseline.json (meets target)
  - 20% seq2seq_predictions_attention.json (meets target)
  - 20% beam_seqs.json (correctness)
  - 10% seq2seq_predictions_attention.json (improvement)
  
- 30% writeup
  - 12.5% clarity
  - 12.5% correctness
  - 5% interestingness of ideas

## Imports

Feel free to add other libraries here (that don't already implement what you are supposed to!)

In [1]:
%%capture
!pip install --upgrade sacrebleu sentencepiece gdown
# Standard library imports
import json
import math
import random

# Third party imports
import matplotlib.pyplot as plt
import numpy as np
import sacrebleu
import sentencepiece
import torch
import torch.nn as nn
import torch.nn.functional as F
import tqdm.notebook

Before proceeding, let's verify that we're connected to a GPU runtime and that `torch` can detect the GPU. You can change the runtime type from the Runtime tab in Colab.
We'll define a variable `device` here to use throughout the code so that we can easily switch to CPU for debugging.

Note that training on a CPU will be much slower (likely around 20 times slower, depending on the CPU), so using a GPU is recommended. Be sure to manage your GPU usage so that you don't run out of quota.

In [2]:
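if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    # Apple-silicon GPU backend
    device = torch.device("mps")
else:
    # Fall back to CPU (slow, but useful for debugging)
    device = torch.device("cpu")
print("Using device:", device)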

Using device: cuda


## Data

The data for this assignment comes from the [Multi30K dataset](https://arxiv.org/abs/1605.00459), which contains English and German captions for images from Flickr. We can download it using `gdown`. We use the Multi30K dataset because it is simpler than standard translation benchmark datasets and allows for models to be trained and evaluated in a matter of minutes rather than days using a GPU.

We will be translating from German to English in this assignment, but the same techniques apply equally well to any language pair.



**You do not need to modify anything in this section.**

First, let's download the data and look at a few examples:

In [3]:
!gdown 1ll4fDiPLQ0u9osdtSlsUcehK_p_2dykV
!gdown 1OEBVpX9F2FX0Mqj17jOWJKI2efUN_HBR
!gdown 1zZF8EXtzcd3oXSGEfyKywkSMosX_T6Jo

Downloading...
From: https://drive.google.com/uc?id=1ll4fDiPLQ0u9osdtSlsUcehK_p_2dykV
To: /storage/ice1/3/4/pponnusamy7/Yash/training_data.json
100%|███████████████████████████████████████| 4.28M/4.28M [00:00<00:00, 185MB/s]
Downloading...
From: https://drive.google.com/uc?id=1OEBVpX9F2FX0Mqj17jOWJKI2efUN_HBR
To: /storage/ice1/3/4/pponnusamy7/Yash/validation_data.json
100%|████████████████████████████████████████| 152k/152k [00:00<00:00, 24.7MB/s]


In [None]:
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)
print("Number of training examples:", len(list(training_data)))
print("Number of validation examples:", len(list(validation_data)))
print("Number of test examples:", len(list(test_data)))
print()

for example in training_data[:10]:
  print(example[0])
  print(example[1])
  print()

Vocabulary:
Now we can use `sentencepiece` to create a joint German-English subword vocabulary from the training corpus. Subwords are the pieces obtained by splitting words into smaller units. They usually give better performance because they exploit segments shared across different words (see more at https://huggingface.co/docs/transformers/en/tokenizer_summary) and they handle out-of-vocabulary (OOV) words much better (https://blog.octanove.org/guide-to-subword-tokenization/). Because the number of training examples is small, we choose a smaller vocabulary size than would be used for large-scale NMT.

Let's download the English and German training corpora to construct our vocabulary. The two files downloaded here contain the English and German sentences from the training data we downloaded above, split into separate files for use with the `sentencepiece` library.

In [None]:
!gdown 1bO7SVCjvVzp__ibwED8wbMRSiJQNNP52
!gdown 1A2w-F6kmUXNtuFtG2qdpfw0dx2Mk7phR

We will use a unigram language model for subword segmentation (https://aclanthology.org/P18-1007.pdf). There are other techniques, such as BPE or character-level segmentation, but we won't explore them here (you can consider trying different subword strategies to improve your model later in the improvement section).

In [None]:
args = {
    "pad_id": 0,
    "bos_id": 1,
    "eos_id": 2,
    "unk_id": 3,
    "input": "train.de,train.en",
    "vocab_size": 8000,
    "model_prefix": "Multi30k",
    # "model_type": "word",
}
combined_args = " ".join(
    "--{}={}".format(key, value) for key, value in args.items())
sentencepiece.SentencePieceTrainer.Train(combined_args)

This creates two files: `Multi30k.model` and `Multi30k.vocab`. The first is a binary file containing the relevant data for the vocabulary. The second is a human-readable listing of each subword and its associated score. The score is the log probability of the subword in the corpus, so a higher score means that the subword appears more frequently.

The `sentencepiece` trainer essentially finds a set of subwords that maximizes the joint probability of the corpus. The question then becomes how to find the segmentation of words into subwords that maximizes this joint probability. This can be done with the Viterbi algorithm (which you implemented in the last assignment!). You don't need to know exactly how this is done, but if you are interested you can look into this [paper](https://aclanthology.org/P18-1007.pdf).
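
To make the segmentation step concrete, here is a minimal sketch of Viterbi segmentation under a unigram model. The toy vocabulary and its log-probability scores are made up for illustration and are not taken from `Multi30k.vocab`. The function name `viterbi_segment` is hypothetical, and the real `sentencepiece` implementation differs in many details.

```python
import math

# Hypothetical toy vocabulary: piece -> log probability (made-up values).
toy_scores = {
    "play": -3.0, "ing": -2.5, "pla": -6.0, "y": -5.0,
    "p": -5.5, "l": -5.5, "a": -5.5, "i": -5.5, "n": -5.5, "g": -5.5,
}

def viterbi_segment(text, scores):
    """Return the highest-scoring segmentation of `text` into known pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in scores and best[start] + scores[piece] > best[end]:
                best[end] = best[start] + scores[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

pieces, score = viterbi_segment("playing", toy_scores)
print(pieces, score)
```

With these toy scores, splitting into `play` and `ing` (total score -5.5) beats every alternative, such as `pla`, `y`, `ing` (total score -13.5).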

You can preview some of the subword scores:

In [None]:
!head -n 30 Multi30k.vocab

As we can see, the vocabulary consists of four special tokens (`<pad>` for padding, `<s>` for beginning of sentence (BOS), `</s>` for end of sentence (EOS), `<unk>` for unknown) and a mixture of German and English words and subwords. To ensure reversibility, word boundaries are encoded with a special Unicode character "▁" (U+2581).
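
The boundary marker is what makes detokenization lossless: concatenating the pieces and turning each marker back into a space recovers the original text. The sketch below illustrates the idea in plain Python. The helper names and the example pieces are hypothetical, and this is a simplification of what `sentencepiece` actually does (for instance, it ignores whitespace normalization).

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker

def pieces_to_text(pieces):
    """Concatenate pieces, then turn boundary markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").lstrip(" ")

def text_to_marked(text):
    """Mark word boundaries: prefix the text and replace each space with the marker."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

# Hypothetical segmentation of a short caption.
pieces = ["▁Two", "▁dog", "s", "▁play", "ing"]
text = pieces_to_text(pieces)
print(text)

# Round trip: the marked text is exactly the concatenation of the pieces.
assert "".join(pieces) == text_to_marked(text)
```

Because spaces are part of the pieces themselves, no language-specific detokenization rules are needed to reverse the segmentation.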

To use the vocabulary, we first need to load it from the binary file produced above.

In [None]:
vocab = sentencepiece.SentencePieceProcessor()
vocab.Load("Multi30k.model")

The vocabulary object includes a number of methods for working with full sequences or individual pieces. We explore the most relevant ones below. A complete interface can be found on [GitHub](https://github.com/google/sentencepiece/tree/master/python#usage) for reference.

In [None]:
print("Vocabulary size:", vocab.GetPieceSize())
print()

for example in training_data[:3]:
  sentence = example[1]
  pieces = vocab.EncodeAsPieces(sentence)
  indices = vocab.EncodeAsIds(sentence)
  print(sentence)
  print(pieces)
  print(vocab.DecodePieces(pieces))
  print(indices)
  print(vocab.DecodeIds(indices))
  print()

piece = vocab.EncodeAsPieces("the")[0]
index = vocab.PieceToId(piece)
print(piece)
print(index)
print(vocab.IdToPiece(index))

We define some constants here for the first three special tokens that you may find useful in the following sections.

In [None]:
pad_id = vocab.PieceToId("<pad>")
bos_id = vocab.PieceToId("<s>")
eos_id = vocab.PieceToId("</s>")
print(f"<pad>: {pad_id}, <s>: {bos_id}, </s>: {eos_id}")

Note that these tokens will be stripped from the output when converting from word pieces to text. This may be helpful when implementing greedy search and beam search.

In [None]:
sentence = training_data[0][1]
indices = vocab.EncodeAsIds(sentence)
indices_augmented = [bos_id] + indices + [eos_id, pad_id, pad_id, pad_id]
print(vocab.DecodeIds(indices))
print(vocab.DecodeIds(indices_augmented))
print(vocab.DecodeIds(indices) == vocab.DecodeIds(indices_augmented))

Code for saving your results for submission. You don't have to read this.

In [None]:
# Please do not change the code below
def generate_predictions_file_for_submission(filepath, model, dataset, method, batch_size=64):
    assert method in {"greedy", "beam"}
    source_sentences = [example[0] for example in dataset]
    model.eval()
    predictions = []
    with torch.no_grad():
      for start_index in range(0, len(source_sentences), batch_size):
        if method == "greedy":
          prediction_batch = predict_greedy(
              model, source_sentences[start_index:start_index + batch_size])
          prediction_batch = [[x] for x in prediction_batch]
        else:
          prediction_batch = predict_beam(
              model, source_sentences[start_index:start_index + batch_size])
        predictions.extend(prediction_batch)
    with open(filepath, "w") as outfile:
        json.dump(predictions, outfile, indent=2)
    print("Finished writing predictions to {}.".format(filepath))


## Seq2Seq Machine Translation Model

Now let's implement a sequence-to-sequence machine translation model. We will first implement an `encode` and a `decode` method and then put these together in a `Seq2seqBaseline` class.

We have implemented the `encode` method below, which encodes input sequences using a bidirectional LSTM. A bi-LSTM consists of two LSTM networks, one which processes the sequence in the forward direction and another which processes it in the reverse direction. The output hidden states from both are concatenated to get the representation at each position. Further, we sum the final states from the two directions before returning.
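
To get a feel for the tensor shapes involved, the following sketch runs a small bidirectional `nn.LSTM` on random inputs. All sizes are arbitrary toy values, and the final states of the two directions are combined by summing, as the provided `encode` does. The reshape follows the layout documented for `nn.LSTM` (layers first, then directions), which coincides with the reshape in the provided encoder for the default `num_layers = 2`.

```python
import torch
import torch.nn as nn

seq_len, batch_size, emb_dim, hidden_dim, num_layers = 5, 3, 8, 4, 2  # toy sizes
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, bidirectional=True)

x = torch.randn(seq_len, batch_size, emb_dim)
output, (h_n, c_n) = lstm(x)

# Forward and backward states are concatenated at every position.
print(output.shape)  # (seq_len, batch_size, 2 * hidden_dim)
# One final state per layer and per direction.
print(h_n.shape)     # (num_layers * 2, batch_size, hidden_dim)

# Separate the layer and direction axes, then combine the two directions.
h_dirs = h_n.view(num_layers, 2, batch_size, hidden_dim)
h_combined = h_dirs.sum(dim=1)
print(h_combined.shape)  # (num_layers, batch_size, hidden_dim)
```

The combined state has the shape a unidirectional decoder LSTM with the same `hidden_dim` and `num_layers` expects as its initial state.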



### **Implementation Task \# 1**

Let's begin by defining a batch iterator for the training data. Given a dataset and a batch size, it will iterate over the dataset and yield pairs of tensors containing the subword indices for the source and target sentences in the batch, respectively. We have filled in `make_batch` below. We advise reading the code to get a sense of how sentences are tokenized and batched.

**Note**: This may differ from previous assignments: we keep the batch size as the **2nd** dimension.

In [None]:
def make_batch(sentences):
  """Convert a list of sentences into a batch of subword indices.

  Args:
    sentences: A list of sentences, each of which is a string.

  Returns:
    A LongTensor of size (max_sequence_length, batch_size) containing the
    subword indices for the sentences, where max_sequence_length is the length
    of the longest sentence as encoded by the subword vocabulary and batch_size
    is the number of sentences in the batch. A beginning-of-sentence token
    should be included before each sequence, and an end-of-sentence token should
    be included after each sequence. Empty slots at the end of shorter sequences
    should be filled with padding tokens. The tensor should be located on the
    device defined at the beginning of the notebook.
  """

  batch_indices = []
  for sentence in sentences:
    indices = vocab.EncodeAsIds(sentence)
    indices_augmented = [bos_id] + indices + [eos_id]
    indices_augmented = torch.LongTensor(indices_augmented)
    batch_indices.append(indices_augmented)

  batched_seq = torch.nn.utils.rnn.pad_sequence(batch_indices, padding_value=pad_id).to(device)
  return batched_seq

def make_batch_iterator(dataset, batch_size, shuffle=False):
  """Make a batch iterator that yields source-target pairs.

  Args:
    dataset: A list of (source, target) sentence pairs, as loaded from the JSON files.
    batch_size: An integer batch size.
    shuffle: A boolean indicating whether to shuffle the examples.

  Yields:
    Pairs of tensors constructed by calling the make_batch function on the
    source and target sentences in the current group of examples. The max
    sequence length can differ between the source and target tensor, but the
    batch size will be the same. The final batch may be smaller than the given
    batch size.
  """

  examples = list(dataset)
  if shuffle:
    random.shuffle(examples)

  for start_index in range(0, len(examples), batch_size):
    example_batch = examples[start_index:start_index + batch_size]
    source_sentences = [example[0] for example in example_batch]
    target_sentences = [example[1] for example in example_batch]
    yield make_batch(source_sentences), make_batch(target_sentences)

test_batch = make_batch(["a test input", "a longer input than the first"])
print("Example batch tensor:")
print(test_batch)
assert test_batch.shape[1] == 2
assert test_batch[0, 0] == bos_id
assert test_batch[0, 1] == bos_id
assert test_batch[-1, 0] == pad_id
assert test_batch[-1, 1] == eos_id

Implement an LSTM-based decoder below. The decoder should be similar to the encoder, which is already implemented, except that it will accept a `state` tuple with the initial values of the `h_n` and `c_n` states (which will be the final states from the encoder above). The inputs to the decoder will also be embedded using the same embedder as the encoder. We will also return the final state from the decoder since we will need it for inference later.

In [None]:
class Seq2seqBaseline(nn.Module):
  def __init__(self, hidden_dim, word_vector_dim, dropout, num_layers):
    """
    Args:
      hidden_dim: hidden state size of the LSTM
      word_vector_dim: dimension of the word embeddings
      dropout: dropout probability applied to the output of the LSTM
      num_layers: number of stacked layers in the encoder LSTM
    """
    super().__init__()
    ### Encoder Params. Please do not change these functions at all.
    self.dropout = nn.Dropout(dropout)
    # Embedding table over input vocabulary
    self.embedder = nn.Embedding(vocab.GetPieceSize(), word_vector_dim)
    self.lstm = nn.LSTM(word_vector_dim, hidden_dim, bidirectional=True, num_layers=num_layers)
    self.layer = nn.Linear(hidden_dim*num_layers*2, hidden_dim*num_layers)
    self.layer2 = nn.Linear(hidden_dim*num_layers*2, hidden_dim*num_layers)
    self.num_layers = num_layers
    ### Decoder Params.
    self.dropout2 = nn.Dropout(dropout)
    self.lstm2 = nn.LSTM(word_vector_dim, hidden_dim, num_layers=2)
    self.output_layer2 = nn.Linear(hidden_dim, vocab.GetPieceSize())
    self.log_softmax_layer = nn.LogSoftmax(dim=2)

  def encode(self, source):
    """Encode the source batch using a bidirectional LSTM encoder.

    Args:
      source: An integer tensor with shape (max_source_sequence_length,
        batch_size) containing subword indices for the source sentences.

    Returns:
      A tuple with three elements:
        encoder_output: The output of the bidirectional LSTM with shape
          (max_source_sequence_length, batch_size, 2 * hidden_size).
        encoder_mask: A boolean tensor with shape (max_source_sequence_length,
          batch_size) indicating which encoder outputs correspond to padding
          tokens. Its elements should be True at positions corresponding to
          padding tokens and False elsewhere.
        encoder_hidden: The final hidden states of the bidirectional LSTM (after
          a suitable projection) that will be used to initialize the decoder.
          This should be a pair of tensors (h_n, c_n), each with shape
          (num_layers, batch_size, hidden_size). Note that the hidden state
          returned by the LSTM cannot be used directly. Its initial dimension is
          twice the required size because it contains state from two directions.

    The first two return values are not required for the baseline model and will
    only be used later in the attention model. If desired, they can be replaced
    with None for the initial implementation.
    """

    # Using packed sequences to more easily work
    # with the variable-length sequences represented by the source tensor.
    # See https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.PackedSequence.


    # Compute a tensor containing the length of each source sequence.
    lengths = torch.sum(source != pad_id, axis=0).cpu()

    seq_len, batch_size = source.size()

    # embedded_sentence: seq_len x batch_size x word_vector_dim
    embedded_sentence = self.embedder(source)

    # pack it for rnn input
    embedded_sentence = torch.nn.utils.rnn.pack_padded_sequence(embedded_sentence,lengths,enforce_sorted=False)

    # lstm_out: seq_len x batch_size x 2 * hidden_dim
    # h_n, c_n: num_lay*2 x batch_size x hidden_dim
    lstm_out, (h_n, c_n) = self.lstm(embedded_sentence)

    # Take sum of states across forward and reverse directions.
    h_n = h_n.view(2, -1, batch_size, h_n.shape[-1]).sum(1)
    c_n = c_n.view(2, -1, batch_size, c_n.shape[-1]).sum(1)

    encoder_mask = source == pad_id

    lstm_out, lens_unpacked = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
    lstm_out = self.dropout(lstm_out)

    return lstm_out, encoder_mask,(h_n, c_n)



  def decode(self, decoder_input, initial_hidden, encoder_output, encoder_mask):
    """Run the decoder LSTM starting from an initial hidden state.

    The third and fourth arguments are not used in the baseline model, but are
    included for compatibility with the attention model in the next section.

    Args:
      decoder_input: An integer tensor with shape (max_decoder_sequence_length,
        batch_size) containing the subword indices for the decoder input. During
        evaluation, where decoding proceeds one step at a time, the initial
        dimension should be 1.
      initial_hidden: A pair of tensors (h_0, c_0) representing the initial
        state of the decoder, each with shape (2, batch_size,
        hidden_size).
      encoder_output: The output of the encoder with shape
        (max_source_sequence_length, batch_size, 2 * hidden_size).
      encoder_mask: The output mask from the encoder with shape
        (max_source_sequence_length, batch_size). Encoder outputs at positions
        with a True value correspond to padding tokens and should be ignored.

    Returns:
      A tuple with three elements:
        log_probs: A tensor with shape (max_decoder_sequence_length, batch_size,
          vocab_size) containing scores for the next-word
          predictions at each position.
        decoder_hidden: A pair of tensors (h_n, c_n) with the same shape as
          initial_hidden representing the updated decoder state after processing
          the decoder input.
        attention_weights: This will be implemented later in the attention
          model, but in order to maintain compatible type signatures, we also
          include it here. This can be None or any other placeholder value.
    """

    # The baseline model ignores encoder_output and encoder_mask.
    del encoder_output
    del encoder_mask

    # Embed the decoder input: (seq_len, batch_size, word_vector_dim).
    embedded = self.embedder(decoder_input)
    # Run the decoder LSTM from the provided initial state;
    # outputs has shape (seq_len, batch_size, hidden_dim).
    outputs, hidden = self.lstm2(embedded, initial_hidden)
    outputs = self.dropout2(outputs)
    # Project to the vocabulary and normalize to log-probabilities.
    logits = self.output_layer2(outputs)
    log_probs = self.log_softmax_layer(logits)
    # The baseline has no attention weights, so return None in their place.
    return log_probs, hidden, None


  def compute_loss(self, source, target):
    """Run the model on the source and compute the loss on the target.

    Args:
      source: An integer tensor with shape (max_source_sequence_length,
        batch_size) containing subword indices for the source sentences.
      target: An integer tensor with shape (max_target_sequence_length,
        batch_size) containing subword indices for the target sentences.

    Returns:
      A scalar float tensor representing cross-entropy loss on the current batch.
    """

    # Note that for a target sequence like <s> A B C </s>, you would
    # want to run the decoder on the prefix <s> A B C and have it predict the
    # suffix A B C </s>.

    _, batch_size = source.size()
    enc_output, encoder_mask, curr_state = self.encode(source)

    lengths = torch.sum(target != pad_id, axis=0).cpu()-1
    target_prefix = torch.clone(target).cpu()
    target_prefix[lengths,torch.arange(target_prefix.size(1))] = pad_id
    decoder_input = target_prefix[:-1,:].to(device)

    log_probs, _, _ = self.decode(decoder_input, curr_state, enc_output, encoder_mask)

    criterion = nn.NLLLoss(ignore_index=pad_id)
    target = target[1:,:]
    loss = criterion(log_probs.reshape(-1, vocab.GetPieceSize()), target.reshape(-1))


    return loss



We define the following functions for training.  This code will run as provided.

In [None]:
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """
  optimizer = torch.optim.Adam(model.parameters())
  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy

We can now train the baseline model.

Since we haven't yet defined a decoding method to output an entire string, we will measure performance for now by computing perplexity and the accuracy of predicting the next token given a gold prefix of the output. A correct implementation should get a validation token accuracy above 55%. The training code will automatically save the model with the highest validation accuracy and reload that checkpoint's parameters at the end of training.
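
As a small worked example of the two metrics, the sketch below computes perplexity and next-token accuracy by hand for three target positions over a three-symbol vocabulary. The probabilities and gold labels are invented for illustration. `evaluate_next_token` computes the same quantities over batches while ignoring padding positions.

```python
import math

# Hypothetical per-position log-probabilities over a 3-symbol vocabulary,
# one row per target position (values chosen for illustration).
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.6), math.log(0.3)],
    [math.log(0.5), math.log(0.25), math.log(0.25)],
]
gold = [0, 1, 2]  # gold next-token ids

# Cross-entropy: negative log-probability of each gold token, summed.
total_cross_entropy = -sum(row[g] for row, g in zip(log_probs, gold))
# Perplexity: exponential of the mean cross-entropy per token.
perplexity = math.exp(total_cross_entropy / len(gold))

# Accuracy: how often the highest-scoring token matches the gold token.
predicted = [max(range(len(row)), key=row.__getitem__) for row in log_probs]
accuracy = 100 * sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(f"perplexity = {perplexity:.3f}, accuracy = {accuracy:.1f}%")
```

Here the model is right at the first two positions and wrong at the third, so accuracy is two out of three, while perplexity also reflects how confident the model was in each gold token.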

In [None]:
# We tuned these to be fairly optimal, so ideally you don't have to change the parameters below.
# But you are welcome to adjust these parameters.

num_epochs = 10
batch_size = 16
hidden_dim = 256
word_vector_dim = 256
num_layers = 2
dropout = 0.3

baseline_model = Seq2seqBaseline(hidden_dim,word_vector_dim,dropout,num_layers).to(device)
train(baseline_model, num_epochs, batch_size, "baseline_model.pt")

For evaluation, we also need to be able to generate entire strings from the model. We provide a greedy decoding algorithm here. Later on, we'll implement beam search.

A correct implementation of the baseline model with greedy decoding should get above 19 BLEU on the validation set.
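
To illustrate what greedy decoding does independently of the neural model, here is a minimal sketch on a hand-written toy "model" that conditions only on the previous token. The probability table, the token names, and the function `greedy_decode` are all hypothetical. The real `predict_greedy` below works on batches of tensors and carries the LSTM state forward.

```python
import math

BOS, EOS = "<s>", "</s>"

# Hypothetical next-token distributions conditioned only on the previous token.
toy_model = {
    BOS:    {"a": 0.6, "the": 0.4},
    "a":    {"dog": 0.5, "cat": 0.3, EOS: 0.2},
    "the":  {"dog": 0.9, EOS: 0.1},
    "dog":  {"runs": 0.55, EOS: 0.45},
    "cat":  {EOS: 1.0},
    "runs": {EOS: 1.0},
}

def greedy_decode(model, max_length=10):
    """Pick the single most probable next token at every step."""
    tokens, log_prob, prev = [], 0.0, BOS
    for _ in range(max_length):
        next_token, prob = max(model[prev].items(), key=lambda kv: kv[1])
        log_prob += math.log(prob)
        if next_token == EOS:
            break
        tokens.append(next_token)
        prev = next_token
    return tokens, log_prob

tokens, log_prob = greedy_decode(toy_model)
print(" ".join(tokens), math.exp(log_prob))
```

In this toy model greedy decoding commits to `a` at the first step and ends with a sequence of probability 0.165, even though `the dog runs` has probability 0.198. Locally best choices are not always globally best, which is the kind of error beam search is meant to reduce.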

In [None]:
def predict_greedy(model, sentences, max_length=100):
  """Make predictions for the given inputs using greedy inference.

  Args:
    model: A sequence-to-sequence model.
    sentences: A list of input sentences, represented as strings.
    max_length: The maximum length at which to truncate outputs in order to
      avoid non-terminating inference.

  Returns:
    A list of predicted translations, represented as strings.
  """

  model.eval()
  batch_size = len(sentences)
  indices = make_batch(sentences)
  pred_translations = torch.zeros(max_length,batch_size, dtype=torch.long) # max_seq_length x batch_size

  enc_output, encoder_mask, curr_state = model.encode(indices)
  input = torch.LongTensor([bos_id] * batch_size).view(1, -1).to(device)
  # Track which sequences have finished, since some may finish earlier than others in the same batch.
  finished_mask = torch.zeros(batch_size, dtype=torch.bool).to(device)
  for i in range(0, max_length):
      log_probs, hidden, _ = model.decode(input, curr_state, enc_output, encoder_mask)

      # Prevent finished sequences from producing non-padding tokens.
      log_probs[:, finished_mask, pad_id] = 1e9

      # Get the most likely next token and its index.
      _, next_tokens = log_probs.squeeze(0).max(dim=1)
      pred_translations[i] = next_tokens

      # Update the input for the next decoding step.
      input = next_tokens.unsqueeze(0)

      # Update the state and finished masks.
      curr_state = hidden

      finished_mask = finished_mask | next_tokens.eq(eos_id)
      if finished_mask.all():
          break

  pred_translation_str = []
  for i in range(batch_size):
      string = vocab.DecodeIds(pred_translations[:,i].detach().cpu().numpy().astype(int).tolist())
      pred_translation_str.append(string)
  return pred_translation_str


def evaluate(model, dataset, batch_size=64, method="greedy"):
  assert method in {"greedy", "beam"}
  source_sentences = [example[0] for example in dataset]
  target_sentences = [example[1] for example in dataset]
  model.eval()
  predictions = []
  with torch.no_grad():
    for start_index in range(0, len(source_sentences), batch_size):
      if method == "greedy":
        prediction_batch = predict_greedy(
            model, source_sentences[start_index:start_index + batch_size])
      else:
        prediction_batch = predict_beam(
            model, source_sentences[start_index:start_index + batch_size])
        prediction_batch = [candidates[0] for candidates in prediction_batch]
      predictions.extend(prediction_batch)
  return sacrebleu.corpus_bleu(predictions, [target_sentences]).score

print("Baseline model validation BLEU using greedy search:",
      evaluate(baseline_model, validation_data))

### Generate the predictions for the baseline model using greedy decoding on the test_data.
generate_predictions_file_for_submission("seq2seq_predictions_baseline.json", baseline_model, test_data, "greedy")

In [None]:
def show_predictions(model, num_examples=4, num_beam=5,include_beam=False):
  for example in validation_data[:num_examples]:
    print("Input:")
    print(" ", example[0])
    print("Target:")
    print(" ", example[1])
    print("Greedy prediction:")
    print(" ", predict_greedy(model, [example[0]])[0])
    if include_beam:
      print(f"Beam predictions (showing top {num_beam}):")
      for candidate in predict_beam(model, [example[0]])[0][:num_beam]:
        print(" ", candidate)
    print()

print("Baseline model sample predictions:")
print()
show_predictions(baseline_model)

## Attention
We'll now improve our seq2seq translation model by adding [Luong et al. (2015)](https://arxiv.org/pdf/1508.04025.pdf) style attention. In particular, (largely following the notation in the paper) let $\bar{\mathbf{h}}_1, \ldots, \bar{\mathbf{h}}_S$ be the sequence of *encoder* RNN states obtained by running the encoder RNN over $S$ source tokens (i.e., the German subwords of the input sentence in our case). Also let $\mathbf{h}_t$ be the *decoder* RNN state after it consumes the $t$-th target token (i.e., the $t$-th subword of the English translation). Then $\boldsymbol{\alpha}_t$, the attention vector at time $t$, has $S$ elements defined as follows:
$$
\alpha_{t,s} = \mathrm{softmax} ([\bar{\mathbf{h}}_{1}^{\top}; \ldots; \bar{\mathbf{h}}_{S}^{\top}] \mathbf{W}_a^{\top} \mathbf{h}_t)_s = \frac{\exp(\mathbf{h}_t^{\top} \mathbf{W}_a \bar{\mathbf{h}}_s)}{\sum_{s'=1}^S\exp(\mathbf{h}_t^{\top} \mathbf{W}_a \bar{\mathbf{h}}_{s'})},
$$
where vectors are assumed to be column-vectors and $;$ represents vertical stacking. Note $\mathbf{W}_a \in \mathbb{R}^{d_{dec} \times d_{enc}}$ is a learnable parameter matrix.

Given $\boldsymbol{\alpha}_t$, we can then compute a "context vector" $\mathbf{c}_t$ that is a weighted average of the encoder states:
$$
\mathbf{c}_t = [\bar{\mathbf{h}}_{1}, \ldots, \bar{\mathbf{h}}_{S}] \boldsymbol{\alpha}_t
$$
where $,$ represents horizontal stacking.

Finally, we concatenate $\mathbf{c}_t$ and $\mathbf{h}_t$ to arrive at a modified decoder state at time $t$ defined as follows:
$$
\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_c [\mathbf{c}_t; \mathbf{h}_t]),
$$
where $\mathbf{W}_c \in \mathbb{R}^{d_{dec} + d_{enc} \times d_{dec} + d_{enc}} $ is some learned projection. We can then obtain our logits for each word type as usual with $\mathbf{V} \tilde{\mathbf{h}}_t + \mathbf{b}$.
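
As a worked instance of the attention equations for a single decoder step, the following NumPy sketch computes $\boldsymbol{\alpha}_t$ and $\mathbf{c}_t$. All sizes and values are random toy data and the variable names are illustrative. The padding mask is an addition suggested by the `encoder_mask` argument of `decode()` rather than part of the equations. It is unbatched and is not meant as the solution to the implementation task.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d_enc, d_dec = 4, 6, 3  # toy sizes

H_enc = rng.normal(size=(S, d_enc))    # rows are the encoder states h-bar_1 .. h-bar_S
h_t = rng.normal(size=d_dec)           # decoder state at time t
W_a = rng.normal(size=(d_dec, d_enc))  # stands in for the learned matrix W_a
is_padding = np.array([False, False, False, True])  # pretend the last position is padding

# score_s = h_t^T W_a h-bar_s for every source position s
scores = H_enc @ W_a.T @ h_t
scores = np.where(is_padding, -np.inf, scores)  # padding gets zero attention weight

# Softmax over source positions (shifted by the max for numerical stability).
weights = np.exp(scores - scores.max())
alpha_t = weights / weights.sum()

# Context vector: attention-weighted average of the encoder states.
c_t = alpha_t @ H_enc

print(alpha_t.round(3), c_t.shape)
```

The attention weights sum to one over the non-padding positions, and the context vector has the same dimension as a single encoder state.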


### **Implementation Task \# 2**
Complete the `decode()` function of the `Seq2seqAttention` module below so that it implements the attention scheme described above. The function should return log probabilities and the final decoder state and cell just as the `decode()` function of the `Seq2seqBaseline` module above does.

**Hint:** The most efficient implementations will make use of [`torch.bmm`](https://pytorch.org/docs/stable/generated/torch.bmm.html) in computing attention.
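
For reference, here is a small sketch of what `torch.bmm` computes, using random toy tensors. The names `queries` and `keys` are illustrative. Note that `torch.bmm` expects the batch dimension first, whereas the tensors in this notebook keep the batch as the second dimension, so a `transpose(0, 1)` would presumably be needed before and after.

```python
import torch

batch_size, T, S, d = 2, 3, 5, 4  # toy sizes

queries = torch.randn(batch_size, T, d)  # e.g. decoder states, batch first
keys = torch.randn(batch_size, S, d)     # e.g. projected encoder states, batch first

# (batch, T, d) x (batch, d, S) -> (batch, T, S): one score per (target, source) pair
scores = torch.bmm(queries, keys.transpose(1, 2))

# The same result computed one batch element at a time.
looped = torch.stack([queries[b] @ keys[b].T for b in range(batch_size)])
print(scores.shape, torch.allclose(scores, looped, atol=1e-6))
```

A single batched call replaces the Python loop over batch elements, which is where the efficiency gain comes from.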

In [None]:
class Seq2seqAttention(Seq2seqBaseline):
  # Note that this class inherits from Seq2seqBaseline, so all the parameters in Seq2seqBaseline are initialized when this class is
  # initialized.
  def __init__(self, hidden_dim, enc_output_size, word_vector_dim, dropout,num_layers):
    super().__init__(hidden_dim, word_vector_dim, dropout,num_layers)


    # Initialize any additional parameters needed for this model that are not
    # already included in the baseline model.

    ### YOUR CODE HERE !!!!!
    # Initialize additional parameters for Luong-style attention.
    self.W_a = nn.Linear(enc_output_size, hidden_dim, bias=False)  # * Added: projects encoder outputs to hidden_dim.
    self.W_c = nn.Linear(hidden_dim + enc_output_size, hidden_dim)   # * Added: projects concatenated [h_t; c_t] to hidden_dim.
    self.out = nn.Linear(hidden_dim, vocab.GetPieceSize())           # * Added: final output layer for vocabulary logits.
    
    ### END YOUR CODE HERE !!!!!

  def decode(self, decoder_input, initial_hidden, encoder_output, encoder_mask):
    """Run the decoder LSTM starting from an initial hidden state.

    Args:
      decoder_input: An integer tensor with shape (max_decoder_sequence_length,
        batch_size) containing the subword indices for the decoder input. During
        evaluation, where decoding proceeds one step at a time, the initial
        dimension should be 1.
      initial_hidden: A pair of tensors (h_0, c_0) representing the initial
        state of the decoder, each with shape (num_layers, batch_size,
        hidden_size).
      encoder_output: The output of the encoder with shape
        (max_source_sequence_length, batch_size, 2 * hidden_size).
      encoder_mask: The output mask from the encoder with shape
        (max_source_sequence_length, batch_size). Encoder outputs at positions
        with a True value correspond to padding tokens and should be ignored.

    Returns:
      A tuple with three elements:
        logits: A tensor with shape (max_decoder_sequence_length, batch_size,
          vocab_size) containing scores for the next-word
          predictions at each position.
        decoder_hidden: A pair of tensors (h_n, c_n) with the same shape as
          initial_hidden representing the updated decoder state after processing
          the decoder input.
        attention_weights: A tensor with shape (max_decoder_sequence_length,
          batch_size, max_source_sequence_length) representing the normalized
          attention weights. This should sum to 1 along the last dimension.
    """

    # Implementation tip: use a large negative number like -1e9 instead of
    # float("-inf") when masking logits to avoid numerical issues.

    # Implementation tip: the function torch.bmm may be useful here.
    # See https://pytorch.org/docs/stable/generated/torch.bmm.html

    ### YOUR CODE HERE !!!!!
    embedded = self.embedder(decoder_input)  # (T, B, word_vector_dim); * Embed decoder input.
    outputs, hidden = self.lstm2(embedded, initial_hidden)  # (T, B, hidden_dim); * Run LSTM2.
    outputs = self.dropout2(outputs)  # * Apply dropout.
    
    # Ensure encoder outputs match the current batch size.
    B_current = decoder_input.size(1)
    if encoder_output.size(1) != B_current:
        encoder_output = encoder_output.expand(-1, B_current, -1).contiguous()  # * Expand encoder output if needed.
        encoder_mask = encoder_mask.expand(-1, B_current).contiguous()  # * Likewise for encoder mask.
    
    # Prepare encoder outputs: (S, B, enc_output_size) -> (B, S, enc_output_size)
    enc_out = encoder_output.transpose(0, 1)
    projected_enc = self.W_a(enc_out)  # (B, S, hidden_dim); * Apply W_a to encoder outputs.
    dec_out = outputs.transpose(0, 1)   # (B, T, hidden_dim)
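    scores = torch.bmm(dec_out, projected_enc.transpose(1, 2))  # (B, T, S); * Compute attention scores.
    # Mask padding positions (True in encoder_mask) so they receive ~zero attention weight.
    scores = scores.masked_fill(encoder_mask.transpose(0, 1).unsqueeze(1), -1e9)  # mask: (B, 1, S), broadcast over T
    attn_weights = F.softmax(scores, dim=2)  # (B, T, S); * Normalize to get attention weights.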
    context = torch.bmm(attn_weights, enc_out)  # (B, T, enc_output_size); * Compute context vectors.
    concat = torch.cat([dec_out, context], dim=2)  # (B, T, hidden_dim+enc_output_size); * Concatenate decoder state and context.
    attended = torch.tanh(self.W_c(concat))  # (B, T, hidden_dim); * Apply tanh after linear projection.
    logits = self.out(attended)  # (B, T, vocab_size); * Compute vocabulary logits.
    log_probs = F.log_softmax(logits, dim=2)  # * Compute log probabilities.
    return log_probs.transpose(0, 1), hidden, attn_weights.transpose(0, 1)  # * Return with time dimension first.

    ### END YOUR CODE HERE !!!!!

As before, we can train an attention model using the provided training code.

A correct implementation should get a validation token accuracy above 67 and a validation BLEU above 36 with greedy search.

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16
hidden_dim = 256
enc_output_size = hidden_dim * 2
word_vector_dim = 256
num_layers = 2
dropout = 0.3

attention_model = Seq2seqAttention(hidden_dim,enc_output_size, word_vector_dim,dropout,num_layers).to(device)
train(attention_model, num_epochs, batch_size, "attention_model.pt")
print("Attention model validation BLEU using greedy search:",
      evaluate(attention_model, validation_data))
# Generate the predictions for the attention model using greedy decoding on the test_data.
# A correct implementation of the baseline model and the attention model should get you full credit here.
generate_predictions_file_for_submission("seq2seq_predictions_attention.json", attention_model, test_data, "greedy")

## Beam Search

We will now try to improve our model's predictions by decoding with beam search, rather than greedily. Beam search maintains a `beam_size`-length list of hypotheses at each step of decoding. A hypothesis is just a prefix of a full prediction, and is represented in the code below by a `Hyp` object.

### What beam search does, in (mostly) words
Beam search starts with a hypothesis consisting just of the BOS token, and then proceeds for `output_max_len` steps. At each step $t$, beam search considers adding every possible next-word to the hypotheses/prefixes from step $t-1$ (for a total of `beam_size`*V hypotheses). It then takes the highest scoring `beam_size` of these candidate hypotheses to be the hypotheses at step $t$. The score of a hypothesis of length $t$ is:
$$
\mathrm{score}(w_1, \ldots, w_t) = \sum_{i=1}^t \log p(w_i|w_1, \ldots, w_{i-1}, x)
$$
where $x$ is the source sentence. The log probabilities above are just the standard ones output by your RNN decoder.
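
As a small worked example of this score, the snippet below sums per-token log-probabilities for two hypotheses. The probabilities are made up for illustration and `hypothesis_score` is a hypothetical helper, not part of the provided code.

```python
import math

def hypothesis_score(token_probs):
    """Sum of log p(w_i | w_1..w_{i-1}, x) over the tokens of a hypothesis."""
    return sum(math.log(p) for p in token_probs)

# Two-token prefix with conditional probabilities 0.5 and 0.4: score = log(0.2).
short = hypothesis_score([0.5, 0.4])
# Extending it by a token with probability 0.9: score = log(0.18).
longer = hypothesis_score([0.5, 0.4, 0.9])
```

Since every log-probability is at most zero, extending a hypothesis can never raise its score, which is why an unnormalized sum tends to favor shorter outputs.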

A hypothesis is finished when it ends with an EOS token.

With beam search, you should get an improvement of at least 1 BLEU over greedy search, and should reach above 21 BLEU without attention and above 38 BLEU with attention.

**Tips:**

1) A good general strategy when writing complex code like this is to carefully annotate each line with a comment saying what each dimension represents.

2) You should only need one call to topk per step. You do not need to have a topk just over vocabulary first, you can directly go from vocab_size*beam_size to beam_size items.

3) Be sure you are correctly keeping track of which beam item a candidate is selected from and updating the beam states, such as LSTM hidden state, accordingly. A single state from the previous time step may need to be used for multiple new beam items or not at all. This includes all state associated with a beam, including all past tokens output by the beam and any extra tensors such as ones remembering when a beam is finished.

4) Once an EOS token has been generated, save the hypothesis and take it out of the beam for the next timestep.
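
The procedure and tips above can be sketched on a toy problem without any neural network. In the example below, the "model" is a hypothetical lookup table (`TABLE`) whose next-token probabilities depend only on the last token, and `sorted(...)[:k]` stands in for a single `topk` call over the flattened `beam_size * V` candidates. All names and numbers are assumptions for illustration; this is not the reference solution for `predict_beam`.

```python
import math

EOS = 0
V = 3  # toy vocabulary: 0 = EOS, 1 = "a", 2 = "b"

# Hypothetical toy "model": probabilities of the next token given the last one.
TABLE = {
    None: [0.1, 0.5, 0.4],   # after BOS
    1:    [0.4, 0.3, 0.3],   # after "a"
    2:    [0.9, 0.05, 0.05], # after "b"
}

def next_log_probs(prefix):
    last = prefix[-1] if prefix else None
    return [math.log(p) for p in TABLE[last]]

def toy_beam_search(k, max_length=5):
    beam = [((), 0.0)]   # live hypotheses: (prefix tokens, cumulative log-prob)
    finished = []
    for _ in range(max_length):
        # Score every extension of every live hypothesis: len(beam) * V candidates.
        flat = []
        for prefix, score in beam:
            flat.extend(score + lp for lp in next_log_probs(prefix))
        # One top-k over the flattened candidates.
        top = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k]
        new_beam = []
        for i in top:
            parent, token = divmod(i, V)  # which hypothesis, which token
            hyp = (beam[parent][0] + (token,), flat[i])
            if token == EOS:
                finished.append(hyp)      # finished hypotheses leave the beam
            else:
                new_beam.append(hyp)
        if not new_beam:
            break
        beam = new_beam
    # Fall back to the live beam if nothing finished within max_length.
    return sorted(finished or beam, key=lambda h: h[1], reverse=True)

greedy = toy_beam_search(k=1)
beam2 = toy_beam_search(k=2)
```

With `k=1` the search commits to "a" first and ends with probability 0.5 × 0.4 = 0.20, while `k=2` keeps "b" alive and finds the sequence "b EOS" with probability 0.4 × 0.9 = 0.36.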

### **Implementation Task \# 3**
Fill in the missing code in the `predict_beam` function below. You are not implementing batched beam search, so you only need to consider one sentence at a time.

In [None]:
class Hyp:
  """
  A helper class representing a hypothesis (i.e., the prefix of a prediction) on the beam,
  using a linked list.
  """
  def __init__(self, token_id: int, parent, score: float):
    """
    Args:
      token_id: an integer representing the most recent token added to this hypothesis.
      parent: the Hyp object representing the prefix to which we've added this token.
      score: the cumulative log-probability score of this hypothesis.
    """
    self.token_id = token_id
    self.parent = parent
    self.score = score

  def trace(self):
    """
    Traces backward through the linked list to recover the whole hypothesis.
    
    Returns:
      A list of token IDs representing the entire hypothesis.
    """
    pred = []
    temp = self
    while temp is not None:
      # Append token if it exists (this avoids error when temp is None)
      if temp.token_id is not None:
        pred.append(temp.token_id)
      temp = temp.parent
    return pred[::-1]

@torch.no_grad()  # Inference only: skip building autograd graphs during the search.
def predict_beam(model, sentences, k=5, max_length=100):
    """Output the beam search result for the given sentences.
    
    Args:
      model: The model that will be used to generate the beams.
      sentences: A list of sentences (str) that the model will encode and do
        beam search over. For simplicity, this list has length 1.
      k: Beam size.
      max_length: Maximum timesteps you will generate. If it exceeds this timestep, stop.
    
    Returns:
      A list containing a single list of decoded generations (strings) sorted by their scores in descending order.
    """
    model.eval()
    V = vocab.GetPieceSize()
    # Encode the input sentence (batch_size will be 1)
    indices = make_batch(sentences)
    enc_output, encoder_mask, encoder_state = model.encode(indices)
    
    # Initialize the beam.
    # Start with a hypothesis that contains the BOS token.
    init_hyp = Hyp(bos_id, None, 0.0)
    beam = [init_hyp]
    # The corresponding hidden state for the beam is the encoder state.
    beam_states = [encoder_state]  # Each element is a tuple (h, c) with shape (num_layers, 1, hidden_dim)
    
    finished_hyps = []  # List to hold finished hypotheses

    # Run beam search for a maximum of max_length steps.
    for t in range(max_length):
        # Prepare the current input tokens for all hypotheses in the beam.
        # Each input is the last token generated by that hypothesis, read directly
        # from the Hyp node rather than re-tracing the whole linked list.
        current_input = torch.tensor([[hyp.token_id for hyp in beam]], device=device)  # (1, beam_size)
        
        # Prepare the hidden states by concatenating each beam’s hidden state along the batch dimension.
        h_list = [state[0] for state in beam_states]  # each: (num_layers, 1, hidden_dim)
        c_list = [state[1] for state in beam_states]
        h_beam = torch.cat(h_list, dim=1)  # (num_layers, beam_size, hidden_dim)
        c_beam = torch.cat(c_list, dim=1)
        current_hidden = (h_beam, c_beam)
        
        # Decode one timestep for the entire beam.
        # log_probs: (1, beam_size, V)
        log_probs, new_hidden, _ = model.decode(current_input, current_hidden, enc_output, encoder_mask)
        log_probs = log_probs.squeeze(0)  # now (beam_size, V)
        
        # For each beam candidate, add its current cumulative score to the new log_probs.
        beam_scores = torch.tensor([hyp.score for hyp in beam], device=device).unsqueeze(1)  # (beam_size, 1)
        total_scores = beam_scores + log_probs  # (beam_size, V)
        
        # Flatten the scores to shape (beam_size * V) and select the top k candidates.
        total_scores_flat = total_scores.view(-1)
        topk_scores, topk_indices = torch.topk(total_scores_flat, k)
        
        # Prepare new lists for the beam and their hidden states.
        new_beam = []
        new_beam_states = []
        
        # Process each of the top k candidates.
        for score, flat_index in zip(topk_scores.tolist(), topk_indices.tolist()):
            # Determine which beam candidate (row) and which token (column) this corresponds to.
            prev_beam_idx = flat_index // V
            token_id = flat_index % V
            
            # Extract the new hidden state for the candidate.
            candidate_hidden = (
                new_hidden[0][:, prev_beam_idx:prev_beam_idx+1, :],
                new_hidden[1][:, prev_beam_idx:prev_beam_idx+1, :]
            )
            
            # Create a new hypothesis that extends the previous one with the new token.
            parent_hyp = beam[prev_beam_idx]
            new_hyp = Hyp(token_id, parent_hyp, score)
            
            # If the candidate token is EOS, add it to finished hypotheses.
            if token_id == eos_id:
                finished_hyps.append(new_hyp)
            else:
                new_beam.append(new_hyp)
                new_beam_states.append(candidate_hidden)
        
        # If no candidates remain in the beam (all ended with EOS), exit early.
        if len(new_beam) == 0:
            break
        
        # Update the beam with the new candidates for the next timestep.
        beam = new_beam
        beam_states = new_beam_states

    # If no hypothesis finished with EOS, use the current beam as finished candidates.
    if len(finished_hyps) == 0:
        finished_hyps = beam
    
    # Sort finished hypotheses by their cumulative score in descending order.
    finished_hyps = sorted(finished_hyps, key=lambda h: h.score, reverse=True)
    
    # Decode each hypothesis into a string.
    decoded_sentences = []
    for hyp in finished_hyps:
        token_ids = hyp.trace()
        # Remove the initial BOS token if present.
        if token_ids and token_ids[0] == bos_id:
            token_ids = token_ids[1:]
        # Cut off at EOS if it appears.
        if eos_id in token_ids:
            token_ids = token_ids[:token_ids.index(eos_id)]
        decoded_sentence = vocab.DecodeIds(token_ids)
        decoded_sentences.append(decoded_sentence)
    
    return [decoded_sentences]

# Testing beam search with baseline model (for example)
print("Baseline model validation BLEU using beam search:",
      evaluate(baseline_model, validation_data, batch_size=1, method="beam"))
print()
print("Baseline model sample predictions:")
print()
show_predictions(baseline_model, include_beam=True)


In [None]:
print("Attention model validation BLEU using beam search:",
      evaluate(attention_model, validation_data, batch_size=1, method="beam"))
print()
print("Attention model sample predictions:")
print()
show_predictions(attention_model, include_beam=True)

Run the cells below to generate the `beam_seqs.json` file, which is required for submission and is used to check the correctness of your beam search.


In [None]:
!gdown 1zKM1vgKkRye1COYh4IlH_m0xDCq7chFF

In [None]:
device = torch.device("cpu")  # Force CPU usage.
hidden_dim = 100
word_vector_dim = 100
num_layers = 1
dropout = 0.3

# Create the model and move it to CPU.
special_model = Seq2seqBaseline(hidden_dim, word_vector_dim, dropout, num_layers).to(device)
# Load the state dictionary with map_location set to CPU.
sd = torch.load("special_model_beam_search.pt", map_location=device)
special_model.load_state_dict(sd)

V = vocab.GetPieceSize()
nsrcs, srcsize = 11, 6
special_preds = {}
for beam_size in [1, 5, 10, 15]:
    torch.manual_seed(beam_size)
    srcs = [(vocab.DecodeIds(torch.LongTensor(srcsize).random_(0, V).numpy().tolist()),
             'filler target sentence filler target sentence filler target sentence') 
            for _ in range(nsrcs)]
    predictions = []
    source_sentences = [x[0] for x in srcs]
    for start_index in range(0, len(source_sentences), 1):
        prediction_batch = predict_beam(
            special_model, 
            source_sentences[start_index:start_index + 1], 
            k=beam_size,
            max_length=50
        )
        predictions.extend(prediction_batch)
    special_preds[beam_size] = predictions

with open("beam_seqs.json", "w") as f:
    json.dump(special_preds, f)


If you implemented beam search correctly, you can save the beam search results for the attention model by running the code below. Beam search should give a higher BLEU score, but greedy decoding is enough to earn the full 20%.

In [None]:
# Ensure the attention model is on CPU.
attention_model = attention_model.to(torch.device("cpu"))
generate_predictions_file_for_submission("seq2seq_predictions_attention.json", attention_model, test_data, "beam", batch_size=1)


# Experimentation: 1-Page Report

Now it's time for you to experiment. Try to improve translation quality on the validation set further and aim for a validation BLEU score of 42. Feel free to modify the code above directly or copy it into new cells below.

**NOTE:** We will award at least 7 of the 10 points if your improved model reaches a BLEU score of at least 40 on the hidden test cases on Gradescope.

Here are some ideas to try out:
* **Back translation**: Since the training dataset is small, another strategy for improving performance might be to generate more training instances. One popular technique used in machine translation is called [back translation](https://arxiv.org/abs/1511.06709). Train another model from English to German and use it to construct more training data.  
For this extension, you are allowed to use external datasets with monolingual text in either English or German. You cannot use datasets containing both English and German texts.

* **Word embeddings**: You can try initializing the input word embeddings with [GloVe vectors](https://nlp.stanford.edu/projects/glove/). You should try both fine-tuning these embeddings and keeping them fixed.
* **Regularization**: You can also try some of the regularization techniques we tried in the language modeling assignment; however, these may or may not help.
* **Tokenization**: Exploring how tokenization is done or how large the vocabulary is can also be helpful. We used unigram language model subword tokenization with a vocabulary of size 8000. You can consider using BPE (a very popular tokenization technique), characters, or other schemes.
* **Hyperparameter tuning**: Finally you can try playing around with hyperparameters to see if there is a better configuration than what we have provided (we only did a modest amount of tuning).
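
As one concrete instance of the regularization idea, label smoothing replaces the one-hot training target with a slightly softened distribution. The sketch below is a minimal pure-Python illustration under a common convention: `1 - smoothing` of the probability mass goes to the gold token and the remaining `smoothing` is spread uniformly over the other `vocab_size - 1` tokens. The function names are hypothetical, and whether this helps on Multi30k is something to verify empirically.

```python
import math

def smoothed_targets(gold, vocab_size, smoothing=0.1):
    """Target distribution with 1 - smoothing on the gold token."""
    off_value = smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[gold] = 1.0 - smoothing
    return dist

def smoothed_nll(log_probs, gold, smoothing=0.1):
    """Cross-entropy between the smoothed target and the model's log-probs."""
    dist = smoothed_targets(gold, len(log_probs), smoothing)
    return -sum(q * lp for q, lp in zip(dist, log_probs))

# Toy model output over a 3-token vocabulary; the gold token is index 0.
log_probs = [math.log(p) for p in [0.7, 0.2, 0.1]]
plain = smoothed_nll(log_probs, gold=0, smoothing=0.0)   # ordinary NLL
smooth = smoothed_nll(log_probs, gold=0, smoothing=0.1)  # smoothed loss
```

With `smoothing=0.0` this reduces to the ordinary negative log-likelihood; with smoothing the loss also penalizes very low probabilities on non-gold tokens, which discourages overconfident predictions.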

For this section, you will submit a write-up describing the extensions and/or modifications that you tried. Your write-up should be **1-page maximum** in length and should be submitted in PDF format. You may use any editor you like, but we recommend using LaTeX and working in an environment like Overleaf.

For full credit, your write-up should include:
1.   A concise and precise description of the extension that you tried.
2.   A motivation for why you believed this approach might improve your model.
3.   A discussion of whether the extension was effective and/or an analysis of the results. This will generally involve some combination of tables, learning curves, etc.
4.   A bottom-line summary of your results comparing the scores of your improvement to the original model.

The purpose of this exercise is to experiment, so feel free to try/ablate multiple of the suggestions above as well as any others you come up with!

When you submit the file, please name it `report.pdf`.



In [None]:
###########################
# Improved Model Training #
###########################

# Define a custom label smoothing loss function.
def label_smoothing_loss(log_probs, target, smoothing=0.1, vocab_size=vocab.GetPieceSize(), ignore_index=pad_id):
    # log_probs: (N, vocab_size), target: (N)
    with torch.no_grad():
        true_dist = torch.zeros_like(log_probs)
        true_dist.fill_(smoothing / (vocab_size - 1))
        mask = (target == ignore_index)
        target = target.clone()
        target[mask] = 0  # Arbitrary index for ignore positions.
        true_dist.scatter_(1, target.unsqueeze(1), 1.0 - smoothing)
        true_dist[mask] = 0
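    loss = torch.sum(-true_dist * log_probs, dim=1)
    # Average over non-padding positions only; padded rows contribute zero loss.
    return loss.sum() / (~mask).sum().clamp(min=1)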

# Create an improved model class that overrides compute_loss with label smoothing.
class ImprovedSeq2seqAttention(Seq2seqAttention):
    def compute_loss(self, source, target):
        """Compute loss using label smoothing instead of plain NLLLoss."""
        _, batch_size = source.size()
        enc_output, encoder_mask, curr_state = self.encode(source)
        
        # Create the decoder input: shift target right (remove the last token).
        lengths = torch.sum(target != pad_id, axis=0).cpu() - 1
        target_prefix = torch.clone(target).cpu()
        target_prefix[lengths, torch.arange(target_prefix.size(1))] = pad_id
        decoder_input = target_prefix[:-1, :].to(device)
        
        log_probs, _, _ = self.decode(decoder_input, curr_state, enc_output, encoder_mask)
        # Flatten the predictions and targets.
        log_probs_flat = log_probs.reshape(-1, vocab.GetPieceSize())
        target_flat = target[1:, :].reshape(-1)
        loss = label_smoothing_loss(log_probs_flat, target_flat)
        return loss

# Hyperparameters for the improved model.
improved_hidden_dim = 512
improved_word_vector_dim = 512
improved_num_layers = 2
improved_dropout = 0.5
num_epochs = 15
batch_size = 32

# Create the improved model.
improved_model = ImprovedSeq2seqAttention(improved_hidden_dim, 
                                          improved_hidden_dim * 2,  # encoder output size = 2 * hidden_dim
                                          improved_word_vector_dim, 
                                          improved_dropout, 
                                          improved_num_layers).to(device)

# Set up optimizer and a learning rate scheduler.
optimizer = torch.optim.Adam(improved_model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=2, factor=0.5, verbose=True)

# Training loop.
best_val_acc = 0.0
for epoch in range(num_epochs):
    improved_model.train()
    total_loss = 0.0
    num_batches = 0
    for source, target in make_batch_iterator(training_data, batch_size, shuffle=True):
        source, target = source.to(device), target.to(device)
        optimizer.zero_grad()
        loss = improved_model.compute_loss(source, target)
        loss.backward()
        # Gradient clipping to prevent exploding gradients.
        torch.nn.utils.clip_grad_norm_(improved_model.parameters(), max_norm=5)
        optimizer.step()
        total_loss += loss.item()
        num_batches += 1
    avg_loss = total_loss / num_batches
    val_perplexity, val_accuracy = evaluate_next_token(improved_model, validation_data)
    print(f"Epoch {epoch+1}, Loss: {avg_loss:.4f}, Val Perplexity: {val_perplexity:.4f}, Val Accuracy: {val_accuracy:.2f}")
    scheduler.step(avg_loss)
    # Save the model if validation accuracy improves.
    if val_accuracy > best_val_acc:
         best_val_acc = val_accuracy
         torch.save(improved_model.state_dict(), "improved_model.pt")
         print("New best model saved.")

# Reload best model.
improved_model.load_state_dict(torch.load("improved_model.pt"))
print("Improved model BLEU with greedy search:", evaluate(improved_model, validation_data))

# Generate predictions for submission using beam search.
generate_predictions_file_for_submission("seq2seq_predictions_attention.json", improved_model, test_data, "beam", batch_size=1)


In [None]:
### For the improvement:
# If you implemented your own improvements, submit the predictions using that model on the test data instead:

#generate_predictions_file_for_submission("seq2seq_predictions_attention.json", improved_model, test_data, "beam", batch_size=1)

### Submission

Upload a submission with the following files to Gradescope:
* proj_3.ipynb (rename to match this exactly)
* seq2seq_predictions_baseline.json (baseline_model with greedy decoding would suffice)
* seq2seq_predictions_attention.json (if you have an improved model, use that model to generate this file; otherwise, submitting the attention model with greedy decoding will get you full points (20%) but probably not the points for the improvement evaluation (10%))
* beam_seqs.json
* report.pdf

You can upload files individually or as part of a zip file, but if using a zip file be sure you are zipping the files directly and not a folder that contains them.
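
If you build the zip archive programmatically, one way to keep the files at the top level of the archive is to pass an explicit `arcname` when adding each file. A minimal sketch using Python's standard `zipfile` module follows; the archive name `submission.zip` and the helper `make_flat_zip` are assumptions for illustration.

```python
import os
import zipfile

def make_flat_zip(paths, zip_path="submission.zip"):
    """Write the given files to the root of a zip archive (no enclosing folder)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # arcname drops directory components so the file sits at the top level.
            zf.write(path, arcname=os.path.basename(path))
    return zip_path

# Example call (assumes these files exist in the current directory):
# make_flat_zip(["proj_3.ipynb", "seq2seq_predictions_baseline.json",
#                "seq2seq_predictions_attention.json", "beam_seqs.json",
#                "report.pdf"])
```

You can confirm the layout afterwards with `zipfile.ZipFile("submission.zip").namelist()`, which should list bare file names with no folder prefix.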

Be sure to check the output of the autograder after it runs.  It should confirm that no files are missing and that the output files have the correct format.  Note that the test set accuracies shown by the autograder are on different data from your validation set.  We will compare your score on the test set to our model's score and assign points based on that.