# HW 3: Neural Machine Translation

In this homework you will build a full neural machine translation system using an attention-based encoder-decoder network to translate from German to English. The encoder-decoder network with attention forms the backbone of many current text generation systems. See [Neural Machine Translation and Sequence-to-sequence Models: A Tutorial](https://arxiv.org/pdf/1703.01619.pdf) for an excellent tutorial that also contains many modern advances.

## Goals


1. Build a non-attentional baseline model (pure seq2seq as in [ref](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)). 
2. Incorporate attention into the baseline model ([ref](https://arxiv.org/abs/1409.0473) but with dot-product attention as in class notes).
3. Implement beam search: review/tutorial [here](http://www.phontron.com/slides/nlp-programming-en-13-search.pdf)
4. Visualize the attention distribution for a few examples. 

Consult the papers provided for hyperparameters, and the course notes for formal definitions.

This will be the most time-consuming assignment in terms of difficulty/training time, so we recommend that you get started early!

## Setup

This notebook provides a working definition of the setup of the problem itself. Feel free to construct your models inline, or use an external setup (preferred) to build your system.

In [1]:
# !pip install -q torch torchtext spacy opt_einsum
# !pip install -qU git+https://github.com/harvardnlp/namedtensor
# !python -m spacy download en
# !python -m spacy download de

In [2]:
# Torch
import torch
torch.set_default_tensor_type('torch.cuda.FloatTensor')

# Text processing library and methods for pretrained word embeddings
from torchtext import data, datasets

# Named Tensor wrappers
from namedtensor import ntorch, NamedTensor
from namedtensor.text import NamedField

# Word vectors
from torchtext.vocab import GloVe, FastText

# utilities for logging time
import time
from tqdm import tqdm_notebook as tqdm

We first need to process the raw data using a tokenizer. We are going to be using spacy, but you can use your own tokenization rules if you prefer (e.g. a simple `split()` in addition to some rules to acccount for punctuation), but we recommend sticking to the above.

In [3]:
import spacy
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

Note that we need to add the beginning-of-sentence token `<s>` and the end-of-sentence token `</s>` to the 
target so we know when to begin/end translating. We do not need to do this on the source side.

In [4]:
BOS_WORD = '<s>'
EOS_WORD = '</s>'

DE = NamedField(names=('srcSeqlen',), tokenize=tokenize_de)

# only target needs BOS/EOS
EN = NamedField(
    names=('trgSeqlen',), tokenize=tokenize_en,
    init_token = BOS_WORD, eos_token = EOS_WORD
)

Let's download the data. This may take a few minutes.

While this dataset of 200K sentence pairs is relatively small compared to others, it will still take some time to train. We only expect you to work with sentences of length at most 20 for this homework. You are expected to train on at least this reduced dataset for this homework, but are free to experiment with the rest of the training set as well.

**We encourage you to start with `MAX_LEN=20` but encourage experimentation after getting reasonable results with the filtered data. The baseline scores are based on models train on the filtered data.**

In [5]:
MAX_LEN = 20
def filter_pred(x):
    return len(vars(x)['src']) <= MAX_LEN and len(vars(x)['trg']) <= MAX_LEN
    
train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(DE, EN), filter_pred=filter_pred,
)
print(train.fields)
print(len(train))
print(vars(train[0]))

{'src': <namedtensor.text.torch_text.NamedField object at 0x7f908dddc8d0>, 'trg': <namedtensor.text.torch_text.NamedField object at 0x7f908dddc860>}
119076
{'src': ['David', 'Gallo', ':', 'Das', 'ist', 'Bill', 'Lange', '.', 'Ich', 'bin', 'Dave', 'Gallo', '.'], 'trg': ['David', 'Gallo', ':', 'This', 'is', 'Bill', 'Lange', '.', 'I', "'m", 'Dave', 'Gallo', '.']}


In [6]:
!head -n 1 .data/iwslt/de-en/train.de-en.de
!head -n 1 .data/iwslt/de-en/train.de-en.en

David Gallo: Das ist Bill Lange. Ich bin Dave Gallo.
David Gallo: This is Bill Lange. I'm Dave Gallo.


In [7]:
src = open("valid.src", "w")
trg = open("valid.trg", "w")
for example in val:
    print(" ".join(example.src), file=src)
    print(" ".join(example.trg), file=trg)
src.close()
trg.close()

Now we build the vocabulary and convert the text corpus into indices. We are going to be replacing tokens that occurred less than 5 times with `<unk>` tokens, and take the rest as our vocab.

In [8]:
MIN_FREQ = 5
DE.build_vocab(train.src, min_freq=MIN_FREQ)
EN.build_vocab(train.trg, min_freq=MIN_FREQ)

print("Most common German words:", DE.vocab.freqs.most_common(10))
print("Size of German vocab", len(DE.vocab))
print("\n")

print("Most common English words:", EN.vocab.freqs.most_common(10))
print("Size of English vocab", len(EN.vocab))
print("\n")

print(EN.vocab.stoi["<s>"], EN.vocab.stoi["</s>"]) #vocab index for <s>, </s>

Most common German words: [('.', 113253), (',', 67237), ('ist', 24189), ('die', 23778), ('das', 17102), ('der', 15727), ('und', 15622), ('Sie', 15085), ('es', 13197), ('ich', 12946)]
Size of German vocab 13353


Most common English words: [('.', 113433), (',', 59512), ('the', 46029), ('to', 29177), ('a', 27548), ('of', 26794), ('I', 24887), ('is', 21775), ("'s", 20630), ('that', 19814)]
Size of English vocab 11560


2 3


In [9]:
# Loading word vectors
EN.vocab.load_vectors(vectors=GloVe("840B"))
DE.vocab.load_vectors(vectors=FastText(language="de"))

Now we split our data into batches as usual. Batching for MT is slightly tricky because source/target will be of different lengths. Fortunately, `torchtext` lets you do this by allowing you to pass in a `sort_key` function. This will minimizing the amount of padding on the source side, but since there is still some padding you will inadvertendly "attend" to these padding tokens. 

One way to get rid of padding is to pass a binary `mask` vector to your attention module so its attention score (before the softmax) is minus infinity for the padding token. Another way (which is how we do it for some of our projects) is to manually sort data into batches so that each batch has exactly the same source length (this means that some batches will be less than the desired batch size, though).

However, for this homework padding won't matter too much, so it's fine to ignore it.

Let's check to see that the BOS/EOS token is indeed appended to the target (English) sentence.

In [10]:
import random

def batcher(data, batch_size):
    # sort first by src len, then by trg len
    data = sorted(data, key=lambda x: (len(x.src), len(x.trg)))
    curr_batch = []
    curr_lengths = None
    
    # all batches have the same src_len and trg_len
    for ex in data:
        lengths = (len(ex.src), len(ex.trg))
        if (lengths != curr_lengths and curr_batch) or len(curr_batch) == batch_size:
            yield curr_batch
            curr_batch = []
        curr_lengths = lengths
        curr_batch.append(ex)
        
    if curr_batch:
        yield curr_batch
    
class GoodBucketIterator(data.Iterator):
    """Defines an iterator that batches examples of similar lengths together.
    Minimizes amount of padding needed while producing freshly shuffled
    batches for each new epoch. See pool for the bucketing procedure used.
    """
    def create_batches(self):
        self.batches = list(batcher(self.data(), self.batch_size))
        random.shuffle(self.batches)
        
    def __len__(self):
        if hasattr(self, 'batches'):
            return len(self.batches)
        return super().__len__()

In [11]:
BATCH_SIZE = 128
device = torch.device('cuda:0')
train_iter, val_iter = GoodBucketIterator.splits(
    (train, val), batch_size=BATCH_SIZE, device=device, repeat=False
)

In [12]:
batch = next(iter(train_iter))

print("Source:", batch.src.shape)
print("Target:", batch.trg.shape)

Source: OrderedDict([('srcSeqlen', 2), ('batch', 52)])
Target: OrderedDict([('trgSeqlen', 5), ('batch', 52)])


Success! Now that we've processed the data, we are ready to begin modeling.

## Assignment

Now it is your turn to build the models described at the top of the assignment. 

When a model is trained, use the following test function to produce predictions, and then upload to the kaggle competition: https://www.kaggle.com/c/harvard-cs287-s19-hw3/

For the final Kaggle test, we will provide the source sentence, and you are to predict the **first three words of the target sentence**. The source sentence can be found under `source_test.txt`

In [13]:
!curl -Os https://raw.githubusercontent.com/harvard-ml-courses/cs287-s18/master/HW3/source_test.txt
!head -n 2 source_test.txt

Als ich in meinen 20ern war , hatte ich meine erste Psychotherapie-Patientin .
Ich war Doktorandin und studierte Klinische Psychologie in Berkeley .


Similar to HW1, you are to predict the 100 most probable 3-gram that will begin the target sentence. The submission format will be as follows, where each word in the 3-gram will be separated by "|", and each 3-gram will be separated by space. For example, here is what an example submission might look like with 5 most-likely 3-grams (instead of 100).

```
Id,Predicted
0,Newspapers|talk|about When|I|was Researchers|call|the Twentysomethings|like|Alex But|before|long
1,That|'s|what Newspapers|talk|about You|have|robbed It|'s|realizing My|parents|wanted
2,We|forget|how We|think|about Proust|actually|links Does|any|other This|is|something
3,But|what|do And|it|'s They|'re|on My|name|is It|only|happens
```

When you print out your data, you will need to escape quotes and commas with the following command so that Kaggle does not complain. 

In [14]:
def escape(l):
    return l.replace("\"", "<quote>").replace(",", "<comma>")

You should perform your hyperparameter search/early stopping/write-up based on perplexity, not the above metric. In practice, people use a metric called [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf), which is roughly a geometric average of 1-gram, 2-gram, 3-gram, 4-gram precision, with a brevity penalty for producing translations that are too short.

The test data associated with `source_test.txt` can be found [here](https://gist.githubusercontent.com/justinchiu/c4340777fa86facd820c59ff4d84c078/raw/e6ec7daba76446bc1000813680f4722060e51900/gistfile1.txt). Compute the BLEU score of your conditional de-en model with the `multi-bleu.perl` script found [here](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl). Please submit your BLEU scores on test with your final writeup using the template provided in the repository:  https://github.com/harvard-ml-courses/nlp-template. 



# Non-Attentional Seq2Seq Model

In [15]:
EN_VECS = EN.vocab.vectors
DE_VECS = DE.vocab.vectors

EN_embed_size = EN_VECS.shape[1]
DE_embed_size = DE_VECS.shape[1]
print(EN_embed_size, DE_embed_size)

EN_VOCAB_LEN = len(EN.vocab)

300 300


In [16]:
class EncoderRNN(ntorch.nn.Module):
    def __init__(self, num_layers, hidden_size, emb_dropout=0.1, lstm_dropout=0.1):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.emb_dropout = ntorch.nn.Dropout(p=emb_dropout)
        self.embeddings = ntorch.nn.Embedding.from_pretrained(DE_VECS.clone(), freeze=False)
        self.lstm = ntorch.nn.LSTM(DE_embed_size, hidden_size, num_layers, dropout=lstm_dropout) \
                             .spec("embedding", "srcSeqlen", "hidden")
        
    def forward(self, x, hidden=None):
        emb = self.emb_dropout(self.embeddings(x))
        output, hidden = self.lstm(emb, hidden)
        return output, hidden
    
    
# TODO: remove duplicated code
class DecoderRNN(ntorch.nn.Module):
    def __init__(self, num_layers, hidden_size, emb_dropout=0.1, lstm_dropout=0.1):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.emb_dropout = ntorch.nn.Dropout(p=emb_dropout)
        self.embeddings = ntorch.nn.Embedding.from_pretrained(EN_VECS.clone(), freeze=False)
        self.lstm = ntorch.nn.LSTM(DE_embed_size, hidden_size, num_layers, dropout=lstm_dropout) \
                             .spec("embedding", "trgSeqlen", "hidden")
        
    def forward(self, x, hidden):
        emb = self.emb_dropout(self.embeddings(x))
        output, hidden = self.lstm(emb, hidden)
        return output, hidden

In [208]:
def flip(ntensor, dim):
    ntensor = ntensor.clone()
    idx = ntensor._schema._names.index(dim)
    ntensor._tensor = ntensor._tensor.flip(idx)
    return ntensor

class Seq2Seq(ntorch.nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.out = ntorch.nn.Linear(decoder.hidden_size, EN_VOCAB_LEN) \
                            .spec("hidden", "vocab")
    
    def _shift_tgt(self, tgt):
        start_of_sent = [[EN.vocab.stoi[BOS_WORD]] * tgt.shape['batch']]
        start_of_sent = ntorch.tensor(start_of_sent, names=('trgSeqlen', 'batch'))
        end_of_sent =  tgt[{'trgSeqlen': slice(0, tgt.shape['trgSeqlen'] - 1)}]
        shifted = ntorch.cat((start_of_sent, end_of_sent), 'trgSeqlen')
        return shifted
    
    # this function should only be used in training/evaluation
    def forward(self, src, tgt):
        # TODO: reverse src before encoding
        src = flip(src, 'srcSeqlen')
        _, enc_hidden = self.encoder(src)
        dec_output, _ = self.decoder(self._shift_tgt(tgt), enc_hidden)
        out = self.out(dec_output).log_softmax('vocab')
        return out
    
    # this function should implement beam search to translate the src
    # src should be (seqLen,) NamedTensor        
    def translate(self, src, max_len=30):
        model.eval()
        with torch.no_grad():
            # TODO: reverse src before encoding
            src = ntorch.tensor(src.values.unsqueeze(0), ('batch', 'srcSeqlen'))
            src = flip(src, 'srcSeqlen')
            _, enc_hidden = encoder.forward(src)
            
            dec_input = ntorch.tensor([[EN.vocab.stoi[BOS_WORD]]], ('batch', 'trgSeqlen'))                    
            dec_hidden = enc_hidden

            translated_sent = []
            for i in range(max_len):
                dec_output, dec_hidden = decoder.forward(dec_input, dec_hidden)                
                prediction = self.out(dec_output).argmax('vocab')
                if prediction.item() == EN.vocab.stoi[EOS_WORD]:
                    break
                else:
                    translated_sent.append(prediction.item())
                    dec_input = prediction

            return torch.tensor(translated_sent)

In [201]:
def evaluate(model, batches):
    model.eval()
    with torch.no_grad():
        loss_fn = ntorch.nn.NLLLoss(reduction="sum").spec("vocab")
        tot_loss = 0
        num_ex = 0
        for batch in batches:
            log_probs = model.forward(batch.src, batch.trg)
            tot_loss += loss_fn(log_probs, batch.trg).values
            num_ex += batch.trg.shape['batch'] * batch.trg.shape['trgSeqlen']

        # TODO: compute bleu
        return torch.exp(tot_loss / num_ex), 0

def train_model(model, num_epochs=300, learning_rate=0.001, weight_decay=0, log_freq=1):
    model.train()
    opt = torch.optim.Adam(
        model.parameters(), lr=learning_rate, weight_decay=weight_decay
    )
        
    loss_fn = ntorch.nn.NLLLoss().spec("vocab")
    start_time = time.time()
    
    best_params = {k: p.detach().clone() for k, p in model.named_parameters()}
    best_val_loss = float('inf')
    
    for i in range(num_epochs):
        try:
            for batch in tqdm(train_iter, total=len(train_iter)):
                opt.zero_grad()

                log_probs = model.forward(batch.src, batch.trg)
                loss = loss_fn(log_probs, batch.trg)

                # compute gradients and update weights
                loss.backward()
                opt.step()

            # evaluate performance on entire sets
            model.eval()
            train_loss, train_bleu = evaluate(model, train_iter)
            val_loss, val_bleu = evaluate(model, val_iter)
            model.train()
            
            # saving the parameters with the best validation loss
            if val_loss < best_val_loss:
                best_params = {k: p.detach().clone() for k, p in model.named_parameters()}
                best_val_loss = val_loss
            
            # logging
            if i == 0 or i == num_epochs - 1 or (i + 1) % log_freq == 0:
                msg = f"{round(time.time() - start_time)} sec: Epoch {i + 1}"
                print(f'{msg}\n{"=" * len(msg)}')
                print(f'Train Perplexity: {train_loss:.5f}\t Train BLEU: {train_bleu:.2f}%')
                print(f'Val Perplexity: {val_loss:.5f}\t Val BLEU: {val_bleu:.2f}%\n')

        except KeyboardInterrupt:
            print(f'\nStopped training after {i} epochs...')
            break

    model.eval()
    model.load_state_dict(best_params)
                      
    msg = f"{round(time.time() - start_time)} sec: Final Results"
    print(f'{msg}\n{"=" * len(msg)}')

    train_loss, train_bleu = evaluate(model, train_iter)
    val_loss, val_bleu = evaluate(model, val_iter)
    print(f'Train Perplexity: {train_loss:.5f}\t Train BLEU: {train_bleu:.2f}%')
    print(f'Val Perplexity: {val_loss:.5f}\t Val BLEU: {val_bleu:.2f}%\n')


In [202]:
encoder = EncoderRNN(num_layers=3, hidden_size=100, emb_dropout=0.5, lstm_dropout=0.5)
decoder = DecoderRNN(num_layers=3, hidden_size=100, emb_dropout=0.5, lstm_dropout=0.5)
model = Seq2Seq(encoder, decoder)

In [203]:
# train_model(model, num_epochs=100, learning_rate=0.001, weight_decay=0, log_freq=1)
# torch.save(model.state_dict(), "basic_seq2seq_weights")

In [204]:
model.load_state_dict(torch.load("basic_seq2seq_weights"))

In [205]:
batch = next(iter(train_iter))
src = batch.src[{'batch': 0}]
trg = batch.trg[{'batch': 0}]

translated = model.translate(src)

german = ' '.join([DE.vocab.itos[i] for i in src.values])
english_translation = ' '.join([EN.vocab.itos[i] for i in translated])
english_actual = ' '.join([EN.vocab.itos[i] for i in trg.values])

print('German:', german)
print('English Translated:', english_translation)
print('English Actual:', english_actual)

German: Der Wendepunkt kam unserer Meinung nach , als wir den ersten Computer besaßen .
English Translated: <s> The whole thing was that we had to think about the <unk> of our first <unk> .
English Actual: <s> And the turning point was probably , in our terms , when we had the first computer . </s>


In [206]:
batch.src.unbind('batch')[0].shape

OrderedDict([('srcSeqlen', 14)])

In [207]:
import subprocess

def bleu(target_file, predictions_file):
    cmd = f"./multi-bleu.perl {target_file} < {predictions_file} " \
          "-h 2> /dev/null | cut -d' ' -f3 | cut -d',' -f1"
    return float(subprocess.check_output(cmd, shell=True))
    

def make_translation_predictions(model, file='test_predictions.txt'):
    print('Generating translations')
    with open(file, 'w') as outfile:
        p = 0
        with open('source_test.txt', 'r') as infile:
            for line in tqdm(list(infile)):
                tokens = [DE.vocab.stoi[w] for w in tokenize_de(line.strip())]
                src = ntorch.tensor(tokens, names="srcSeqlen")
                translation = model.translate(src).tolist()
                assert translation[0] == EN.vocab.stoi[BOS_WORD]
                sent = ' '.join(EN.vocab.itos[i] for i in translation[1:])
                outfile.write(escape(sent) + '\n')
                p += 1
                if p < 10:
                    print(escape(sent))
    
def make_kaggle_predictions(model, file='kaggle_predictions.txt'):
    # TODO
    pass

make_translation_predictions(model, 'test_predictions.txt')
bleu("test_predictions.txt", "target_test.txt")

Generating translations


HBox(children=(IntProgress(value=0, max=800), HTML(value='')))

When I was my first birthday <comma> I was the first one .
I was born and raised in Ghana in Ghana in my 20s .
She was a name called his grandmother .
And when I was thinking <comma> I was sorry .
My <unk> was a <unk> of the first <unk> of my first <unk> .
And I heard my mother to read her girl in the audience .
And that 's what I was going to do .
But I do n't have any <unk> .
And then <comma> later <comma> I started <comma> and she went back to the day <comma> she went to <unk> her .


5.64

In [189]:
!head test_predictions.txt

( <unk> ] I was my first slogan was <comma> I was <unk> . </s>
First <comma> <unk> and 2005 and the early <unk> . </s>
... ... <unk> <comma> <quote> <unk> <unk> . <quote> </s>
It was me <comma> but I was a bit of it . </s>
Another way <comma> and the first <unk> of the first <unk> was a <unk> . </s>
They wanted to talk about the parents that I was wearing a girl . </s>
Sorry <comma> I thought <comma> I 'm sorry . </s>
It did n't know what I did . </s>
She was <comma> after the first time <comma> she said <comma> <quote> Go back to school <comma> and then I went back to school . </s>
but I was lucky . I had to see her colleagues . </s>


# Incorporating Attention

# Beam Search

In [None]:
# TODO

# Visualization