## Transformer Models
    

Transformer models took the NLP community by storm by achieving state-of-the-art results in machine translation in the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf). A very good tutorial on this architecture can be found [here](http://jalammar.github.io/illustrated-transformer/). As you can tell from the name, this model is based on attention! It replaces standard recurrent neural networks used for translation with a self-attention-based network. 

The model is an encoder-decoder model, as shown below:


<center><img src="http://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png" alt="mlp" align="middle"></center>

The encoder and decoder each consist of modules that are repeated N times (N=6 in the original paper). Words are transformed into embeddings before being fed into these modules. These embeddings are summed with positional embeddings, which give the model a notion of input position (remember that there is no recurrence to keep track of order sequentially). In the paper, the positional embeddings are sine and cosine functions of different frequencies:

$$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right)$$

$$PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

Thus, each dimension of the positional encoding corresponds to a sinusoid. Note that the implementation below uses learned positional embeddings (`nn.Embedding`) rather than these fixed sinusoids.
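
As a rough illustration of the two formulas, the sketch below builds a small table of sinusoidal encodings in plain Python. The function name `positional_encoding` and the toy sizes are assumptions made for this example; they do not appear in the notebook.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            i = dim // 2
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy sizes chosen only for illustration.
pe = positional_encoding(max_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1; later positions rotate each sine/cosine pair at its own frequency.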





The encoder consists of a self-attention layer (multi-head attention in the first figure) which is then followed by a feed-forward network. 
<center><img src="http://jalammar.github.io/images/t/encoder_with_tensors_2.png" alt="mlp" align="middle"></center>








The self-attention matrix calculation is shown below: 

<center><img src="http://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png" alt="mlp" align="middle"></center>


See the tutorial for a more in-depth explanation. At a high level, each row of the K matrix (K = Keys) represents one word in the sentence. Each row of the Q matrix (Q = Query) uses one word as a query against every key to score how relevant that key is to it. These scores are divided by a constant and passed through a softmax, and the resulting weights are multiplied by the matrix V (V = Value), giving a weighted sum of the values that plays the role of the *weighted* vector from the previous notebooks. The result of these calculations is a matrix called z: $$z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
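
The calculation described above can be sketched for a single head in plain Python. This is only a minimal illustration on nested lists, not the batched PyTorch version used later in the notebook; the helper names `softmax` and `attention` and the toy matrices are assumptions made for this example.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy inputs: two tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Z = attention(Q, K, V)
print(Z)
```

Because the first query matches the first key best, the first output row lies closer to the first value row, and symmetrically for the second.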

**TODO** In your readme file, please explain what the role of the denominator in the self-attention equation is (check the original paper). 

**Answer:** $d_k$ is the dimensionality of the queries and keys, and the dot products are scaled by $1/\sqrt{d_k}$. Dot-product attention is faster and more space-efficient in practice than additive attention (it relies on highly optimized matrix multiplication), but without scaling it performs worse than additive attention for larger values of $d_k$. This is because for large $d_k$ the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. The denominator counteracts this effect.
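
A quick numerical check of this argument, as a sketch: assuming the components of the query and key vectors are independent with mean 0 and variance 1 (the assumption the paper makes in its footnote), the dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings its spread back to roughly 1. The helper `random_dot` and the sample count are hypothetical choices for this example.

```python
import math
import random
import statistics

random.seed(0)
d_k = 64

def random_dot(d):
    # Dot product of two vectors with independent standard-normal components.
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d))

raw = [random_dot(d_k) for _ in range(2000)]
scaled = [s / math.sqrt(d_k) for s in raw]

# The raw spread should be near sqrt(d_k) = 8, the scaled spread near 1.
print(f"std of raw dot products:    {statistics.pstdev(raw):.2f}")
print(f"std of scaled dot products: {statistics.pstdev(scaled):.2f}")
```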

The paper then introduces multiple z matrices (8 in the original paper). 

**TODO** In your readme file, please explain what the motivation is behind using multiple z matrices. 

**Answer:** This is referred to as multi-head attention. The motivation is that it allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

You may have noticed in the first figure above that there is an extra input arrow pointing to the *Add & Norm* module. This is called a residual connection. 

**TODO** In your readme file, explain what the benefits of using residual connections are here (and in neural networks in general). 

**Answer:** Residual connections give gradients a direct path through the network: the identity branch lets them bypass the non-linear sub-layers. This mitigates vanishing gradients and makes deep stacks of layers easier to optimize.
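
To make the gradient argument concrete, here is a toy scalar sketch (not the notebook's model): it applies the chain rule through a stack of `tanh` layers with and without a skip connection. The function `grad_through_stack`, the weight `w = 0.5`, and the depth of 20 are all assumptions chosen for illustration.

```python
import math

def grad_through_stack(x, depth, w=0.5, residual=False):
    """Derivative of the stack output with respect to its input (chain rule)."""
    grad = 1.0
    for _ in range(depth):
        pre = w * x
        local = w * (1.0 - math.tanh(pre) ** 2)  # d/dx tanh(w * x)
        if residual:
            x = x + math.tanh(pre)
            grad *= 1.0 + local  # the identity path contributes the 1
        else:
            x = math.tanh(pre)
            grad *= local
    return grad

plain = grad_through_stack(1.0, depth=20)
skip = grad_through_stack(1.0, depth=20, residual=True)
print(f"plain stack gradient:    {plain:.3e}")
print(f"residual stack gradient: {skip:.3e}")
```

Without the skip connection every layer multiplies the gradient by a factor of at most 0.5, so it shrinks geometrically with depth; with the skip connection each factor is at least 1, so the signal from the identity path survives.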

### Preprocessing

This is basically the same as before, with some slight modifications. 

In [0]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import torchtext
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

import spacy

import random
import math
import os
import time

In [0]:
SEED = 1

random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.enabled = False 
torch.backends.cudnn.deterministic = True

In [0]:
%%capture
! python -m spacy download en
! python -m spacy download de
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

In [0]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [0]:
SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True, batch_first=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True, batch_first=True)

In [6]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))


In [0]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [0]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size=BATCH_SIZE,
     device=device)

**TODO** 

1. Apply embeddings over the source, scale these embeddings, add positional embeddings and then at the end apply dropout to everything. 



In [0]:
class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, n_layers, n_heads, pf_dim, encoder_layer, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()

        self.input_dim = input_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.pf_dim = pf_dim
        self.encoder_layer = encoder_layer
        self.self_attention = self_attention
        self.positionwise_feedforward = positionwise_feedforward
        self.dropout = dropout
        self.device = device
        
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(1000, hid_dim)
        
        self.layers = nn.ModuleList([encoder_layer(hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device) 
                                     for _ in range(n_layers)])
        
        self.do = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, src, src_mask):
        # src = [batch size, src sent len]
        # src_mask = [batch size, 1, 1, src sent len]
        
        pos = torch.arange(0, src.shape[1]).unsqueeze(0).repeat(src.shape[0], 1).to(self.device)
        # TODO create position-aware embeddings
        src = self.do((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
        # src = [batch size, src sent len, hid dim]
        
        for layer in self.layers:
            src = layer(src, src_mask)
            
        return src

In [0]:
class EncoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()
        
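        # a single LayerNorm is shared by both sub-layers here; reference
        # implementations typically use a separate LayerNorm per sub-layer
        self.ln = nn.LayerNorm(hid_dim)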
        self.sa = self_attention(hid_dim, n_heads, dropout, device)
        self.pf = positionwise_feedforward(hid_dim, pf_dim, dropout)
        self.do = nn.Dropout(dropout)
        
    def forward(self, src, src_mask):
        # src = [batch size, src sent len, hid dim]
        # src_mask = [batch size, 1, 1, src sent len]
        src = self.ln(src + self.do(self.sa(src, src, src, src_mask)))
        src = self.ln(src + self.do(self.pf(src)))
        
        return src

**TODO**

1. Modify the size of the Q, K and V matrices to be of size (batch size, n heads, sent len, hid dim // n heads). You will find the view and permute functions from torch helpful. This line will be the same for each of the three matrices. 

2. Matrix multiply Q and K and scale the output following the equation above. 

3. Matrix multiply attention and V

4. Change the shape of x to match the desired output shape. 
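
For intuition about steps 1 and 4, the sketch below mimics the head split and merge for a single sentence using nested lists. It is a plain-Python analogue of the `view`/`permute` reshaping, not a replacement for it; `split_heads` and `merge_heads` are hypothetical helper names used only here.

```python
def split_heads(x, n_heads):
    """x: [sent len][hid dim] -> [n heads][sent len][hid dim // n heads]."""
    hid_dim = len(x[0])
    assert hid_dim % n_heads == 0
    head_dim = hid_dim // n_heads
    # Each head receives one contiguous slice of every token's vector.
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(n_heads)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head slices of each token."""
    sent_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(sent_len)]

# Toy input: 2 tokens, hid dim 4, split into 2 heads of size 2.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(x, n_heads=2)
print(heads)               # one [sent len][head dim] block per head
print(merge_heads(heads))  # back to the original layout
```

Splitting does not create new information: each head simply attends over its own slice of the hidden dimension, and merging concatenates the slices back together.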

In [0]:
class SelfAttention(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        
        assert hid_dim % n_heads == 0
        
        self.w_q = nn.Linear(hid_dim, hid_dim)
        self.w_k = nn.Linear(hid_dim, hid_dim)
        self.w_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc = nn.Linear(hid_dim, hid_dim)
        
        self.do = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim // n_heads])).to(device)
        
    def forward(self, query, key, value, mask=None):
        bsz = query.shape[0]
        
        # query = key = value [batch size, sent len, hid dim]
                
        Q = self.w_q(query)
        K = self.w_k(key)
        V = self.w_v(value)
        
        # Q, K, V = [batch size, sent len, hid dim]
        # TODO 1:
        Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        # Q, K, V = [batch size, n heads, sent len, hid dim // n heads]
        
        # TODO 2: 
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        # energy = [batch size, n heads, sent len, sent len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = self.do(F.softmax(energy, dim=-1))
        # attention = [batch size, n heads, sent len, sent len]
        
        # TODO 3: 
        x = torch.matmul(attention, V)
        # x = [batch size, n heads, sent len, hid dim // n heads]
        
        x = x.permute(0, 2, 1, 3).contiguous()
        # x = [batch size, sent len, n heads, hid dim // n heads]
        
        # TODO 4
        x = x.view(bsz, -1, self.n_heads * (self.hid_dim // self.n_heads))
        # x = [batch size, src sent len, hid dim]
        
        x = self.fc(x)
        # x = [batch size, sent len, hid dim]
        
        return x

In [0]:
class PositionwiseFeedforward(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.pf_dim = pf_dim
        
        self.fc_1 = nn.Conv1d(hid_dim, pf_dim, 1)
        self.fc_2 = nn.Conv1d(pf_dim, hid_dim, 1)
        
        self.do = nn.Dropout(dropout)
        
    def forward(self, x):
        # x = [batch size, sent len, hid dim]
        x = x.permute(0, 2, 1)
        # x = [batch size, hid dim, sent len]
        
        x = self.do(F.relu(self.fc_1(x)))
        # x = [batch size, pf dim, sent len]
        
        x = self.fc_2(x)
        # x = [batch size, hid dim, sent len]
        
        x = x.permute(0, 2, 1)
        # x = [batch size, sent len, hid dim]
        
        return x

**TODO** 

1. (same as above) -- Apply embeddings over the source, scale these embeddings, add positional embeddings and then at the end apply dropout to everything. 


In [0]:
class Decoder(nn.Module):
    def __init__(self, output_dim, hid_dim, n_layers, n_heads, pf_dim, decoder_layer, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.pf_dim = pf_dim
        self.decoder_layer = decoder_layer
        self.self_attention = self_attention
        self.positionwise_feedforward = positionwise_feedforward
        self.dropout = dropout
        self.device = device
        
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(1000, hid_dim)
        
        self.layers = nn.ModuleList([decoder_layer(hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device)
                                     for _ in range(n_layers)])
        
        self.fc = nn.Linear(hid_dim, output_dim)
        
        self.do = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, trg, src, trg_mask, src_mask):
        # trg = [batch_size, trg sent len]
        # src = [batch_size, src sent len]
        # trg_mask = [batch size, 1, trg sent len, trg sent len]
        # src_mask = [batch size, 1, 1, src sent len]
        
        pos = torch.arange(0, trg.shape[1]).unsqueeze(0).repeat(trg.shape[0], 1).to(self.device)
        
        # TODO 
        trg = self.do((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
        # trg = [batch size, trg sent len, hid dim]
        
        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)
            
        return self.fc(trg)

In [0]:
class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, self_attention, positionwise_feedforward, dropout, device):
        super().__init__()
        
        self.ln = nn.LayerNorm(hid_dim)
        self.sa = self_attention(hid_dim, n_heads, dropout, device)
        self.ea = self_attention(hid_dim, n_heads, dropout, device)
        self.pf = positionwise_feedforward(hid_dim, pf_dim, dropout)
        self.do = nn.Dropout(dropout)
        
    def forward(self, trg, src, trg_mask, src_mask):
        # trg = [batch size, trg sent len, hid dim]
        # src = [batch size, src sent len, hid dim]
        # trg_mask = [batch size, 1, trg sent len, trg sent len]
        # src_mask = [batch size, 1, 1, src sent len]
                
        trg = self.ln(trg + self.do(self.sa(trg, trg, trg, trg_mask)))
        trg = self.ln(trg + self.do(self.ea(trg, src, src, src_mask)))
        trg = self.ln(trg + self.do(self.pf(trg)))
        
        return trg

The Seq2seq model itself doesn't change much. Yay modular code. 

In [0]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, pad_idx, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.pad_idx = pad_idx
        self.device = device
        
    def make_masks(self, src, trg):
        # src = [batch size, src sent len]
        # trg = [batch size, trg sent len]
        
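        src_mask = (src != self.pad_idx).unsqueeze(1).unsqueeze(2)
        # src_mask = [batch size, 1, 1, src sent len]
        trg_pad_mask = (trg != self.pad_idx).unsqueeze(1).unsqueeze(3)
        # trg_pad_mask = [batch size, 1, trg sent len, 1]
        trg_len = trg.shape[1]
        # lower-triangular mask; cast to the padding mask's dtype (bool or uint8,
        # depending on the PyTorch version) so the & below combines matching types
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device=self.device)).to(trg_pad_mask.dtype)
        trg_mask = trg_pad_mask & trg_sub_mask
        # trg_mask = [batch size, 1, trg sent len, trg sent len]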
        
        return src_mask, trg_mask
    
    def forward(self, src, trg):
        # src = [batch size, src sent len]
        # trg = [batch size, trg sent len]
                
        src_mask, trg_mask = self.make_masks(src, trg)
        
        enc_src = self.encoder(src, src_mask)
        # enc_src = [batch size, src sent len, hid dim]
                
        out = self.decoder(trg, enc_src, trg_mask, src_mask)
        # out = [batch size, trg sent len, output dim]
        
        return out
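
To see what the target mask built in `make_masks` looks like, here is a plain-Python sketch for one sentence. It mirrors the combination of the padding mask (applied per row, as in the code above) with the lower-triangular mask. The padding index `PAD = 1`, the token ids, and the helper `make_trg_mask` are assumptions made for this example.

```python
PAD = 1  # hypothetical padding index, chosen for this example

def make_trg_mask(trg_row, pad_idx=PAD):
    """Combine a padding mask with a lower-triangular (causal) mask."""
    n = len(trg_row)
    # mask[i][j] is True when row i is a real token and column j is not
    # in the future, mirroring `trg_pad_mask & trg_sub_mask`.
    return [[trg_row[i] != pad_idx and j <= i for j in range(n)]
            for i in range(n)]

# Toy sentence of made-up token ids, ending in one padding token.
trg = [2, 5, 6, PAD]
mask = make_trg_mask(trg)
for row in mask:
    print([int(v) for v in row])
# prints:
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [0, 0, 0, 0]
```

Positions where the mask is 0 receive a large negative energy before the softmax, so each target token can only attend to itself and earlier tokens.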

In [0]:
input_dim = len(SRC.vocab)
hid_dim = 512
n_layers = 6
n_heads = 8
pf_dim = 2048
dropout = 0.1

enc = Encoder(input_dim, hid_dim, n_layers, n_heads, pf_dim, EncoderLayer, SelfAttention, PositionwiseFeedforward, dropout, device)

In [0]:
output_dim = len(TRG.vocab)
hid_dim = 512
n_layers = 6
n_heads = 8
pf_dim = 2048
dropout = 0.1

dec = Decoder(output_dim, hid_dim, n_layers, n_heads, pf_dim, DecoderLayer, SelfAttention, PositionwiseFeedforward, dropout, device)

In [0]:
pad_idx = SRC.vocab.stoi['<pad>']

model = Seq2Seq(enc, dec, pad_idx, device).to(device)

In [20]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 55,205,125 trainable parameters


In [0]:
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

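The paper also introduces a new learning-rate schedule, used together with the Adam optimizer: the rate increases linearly for the first `warmup` steps and then decays proportionally to the inverse square root of the step number, $$lrate = d_{model}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup^{-1.5}\right)$$ The wrapper below applies this schedule (multiplied by `factor`) before every optimizer step.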

In [0]:
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
        
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup ** (-1.5)))

In [0]:
optimizer = NoamOpt(hid_dim, 1, 2000,
            torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))
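
To get a feel for the schedule with the settings used here (model size 512, factor 1, warmup 2000), the following standalone sketch evaluates the same expression as `NoamOpt.rate` at a few steps. The function name `noam_rate` is an assumption for this example; steps start at 1 because the wrapper increments its counter before computing the rate.

```python
def noam_rate(step, model_size=512, factor=1.0, warmup=2000):
    """Same expression as NoamOpt.rate, written as a free function."""
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 500, 1000, 2000, 4000, 8000):
    print(f"step {step:5d}: lr = {noam_rate(step):.6f}")
```

The rate grows linearly until `step == warmup`, where both arguments of `min` coincide and the rate peaks at `(model_size * warmup) ** -0.5`, and then decays as the inverse square root of the step.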

In [0]:
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

In [0]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        
        optimizer.optimizer.zero_grad()
        output = model(src, trg[:,:-1])
                
        # output = [batch size, trg sent len - 1, output dim]
        # trg = [batch size, trg sent len]
        output = output.contiguous().view(-1, output.shape[-1])
        trg = trg[:,1:].contiguous().view(-1)
        # output = [batch size * (trg sent len - 1), output dim]
        # trg = [batch size * (trg sent len - 1)]
            
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [0]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg

            output = model(src, trg[:,:-1])
            # output = [batch size, trg sent len - 1, output dim]
            # trg = [batch size, trg sent len]
            output = output.contiguous().view(-1, output.shape[-1])
            trg = trg[:,1:].contiguous().view(-1)
            # output = [batch size * (trg sent len - 1), output dim]
            # trg = [batch size * (trg sent len - 1)]
            
            loss = criterion(output, trg)
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

**TODO** Train for 5 epochs

In [28]:
N_EPOCHS = 5
CLIP = 1
SAVE_DIR = 'models'
MODEL_SAVE_PATH = os.path.join(SAVE_DIR, 'transformer-seq2seq.pt')

best_valid_loss = float('inf')

os.makedirs(SAVE_DIR, exist_ok=True)

for epoch in range(N_EPOCHS):
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_SAVE_PATH)
    
    print(f'| Epoch: {epoch+1:03} | Time: {epoch_mins}m {epoch_secs}s| Train Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f} | Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f} |')

| Epoch: 001 | Time: 3m 21s| Train Loss: 5.948 | Train PPL: 383.051 | Val. Loss: 4.108 | Val. PPL:  60.820 |
| Epoch: 002 | Time: 3m 28s| Train Loss: 3.771 | Train PPL:  43.420 | Val. Loss: 3.200 | Val. PPL:  24.543 |
| Epoch: 003 | Time: 3m 28s| Train Loss: 3.132 | Train PPL:  22.909 | Val. Loss: 2.809 | Val. PPL:  16.598 |
| Epoch: 004 | Time: 3m 27s| Train Loss: 2.762 | Train PPL:  15.834 | Val. Loss: 2.575 | Val. PPL:  13.128 |
| Epoch: 005 | Time: 3m 27s| Train Loss: 2.503 | Train PPL:  12.218 | Val. Loss: 2.406 | Val. PPL:  11.095 |


In [29]:
model.load_state_dict(torch.load(MODEL_SAVE_PATH))
test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 2.392 | Test PPL:  10.939 |
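
The PPL columns above are `math.exp` of the average cross-entropy loss. As a small sketch of what that number means, the hypothetical helper `perplexity` below computes it from the probabilities a model assigns to the correct tokens; the probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always spreads its mass uniformly over 4 choices
# is "as confused as" a 4-way guess.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# More confident correct predictions give a lower perplexity.
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform, confident)
```

Read this way, a perplexity of about 11 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 11 words at each step.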


## Submission

Now that you have completed the assignment, follow the steps below to submit your assignment:
1. Click __Runtime__  > __Run all__ to generate the output for all cells in the notebook.
2. Save the notebook, with the output from all of its cells, by clicking __File__ > __Download .ipynb__.
3. Copy the model's training and test output, your answers to all short questions, and the shareable link of this notebook to a `README.txt` file.
4. Put the .ipynb file and `README.txt` under your hidden directory on the Zoo server `~/hidden/<YOUR_PIN>/Homework5/`.
5. As a final step, run a script that will set up the permissions to your homework files, so we can access and run your code to grade it. Make sure the command runs without errors, and do not make any changes or run the code again. If you do run the code again or make any changes, you need to run the permissions script again. Submissions without the correct permissions may incur some grading penalty.
`/home/classes/cs477/bash_files/hw5_set_permissions.sh <YOUR_PIN>`