## Welcome to the Tutorial
More to come

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
import torchtext
import portalocker

### Loading the dataset
The dataset used for training is [Multi30k](https://aclanthology.org/W16-3210.pdf), a collection of English image descriptions with German translations. Fortunately for us, `torchtext` has this dataset built in! We can load it and wrap each split in PyTorch's `DataLoader` so the rest of PyTorch's API can access our data.

In [2]:
train, valid, test = torchtext.datasets.multi30k.Multi30k() 
train_loader = torch.utils.data.DataLoader(train)
valid_loader = torch.utils.data.DataLoader(valid)
test_loader  = torch.utils.data.DataLoader(test)

## Print out the first few lines of the training dataset below!
## Don't be scared to look at the DataLoader documentation
print([sentence_pair for sentence_pair in list(train_loader)[:3]])

[[('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',), ('Two young, White males are outside near many bushes.',)], [('Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.',), ('Several men in hard hats are operating a giant pulley system.',)], [('Ein kleines Mädchen klettert in ein Spielhaus aus Holz.',), ('A little girl climbing into a wooden playhouse.',)]]


### Building A Transformer, One Block at a Time
#### Scaled Dot-Product Attention
The smallest building block of the transformer model presented in *Attention Is All You Need* is scaled dot-product attention. This class takes in queries, keys, and values and applies a scaled dot product each time it's called. PyTorch uses the `forward` method of its Modules much like `__call__` in pure Python (while keeping track of all the fine details like gradients in the background).

In [3]:
class ScaledDotProductAttention(torch.nn.Module):
    def __init__(self, temperature: float, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature
        self.dropout = torch.nn.Dropout(attn_dropout)
    
    
    # Called during training/testing/validation for each datapoint that passes through the model.
    def forward(self, 
                query: torch.Tensor, 
                key: torch.Tensor, 
                value: torch.Tensor, 
                mask: torch.Tensor=None) -> Tuple[torch.Tensor, torch.Tensor]:
        
        # Swap the last two axes of the keys so the matmul computes query-key dot products
        transposed_keys = key.transpose(2, 3)
        ## Scale queries down by the temperature and multiply by our transposed keys
        scaled_queries = query / self.temperature
        attention = torch.matmul(scaled_queries, transposed_keys)

        # Apply mask if needed: masked positions get a large negative score, so softmax sends them to ~0
        if mask is not None:
            attention.masked_fill_(mask == 0, -1e9)

        ## Apply softmax over the last dimension (the keys, dim=-1), then dropout
        ## - Save the attention layer in a variable called dropout_attention -
        softmax_attention = F.softmax(attention, dim=-1)
        dropout_attention = self.dropout(softmax_attention)

        output = torch.matmul(dropout_attention, value)

        return output, dropout_attention
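
To make the arithmetic concrete, here is a minimal framework-free sketch of the same computation for a single head, softmax(QK^T / sqrt(d_k)) V, written with plain Python lists. The function names and toy inputs are illustrative assumptions rather than part of the tutorial's code, and dropout is left out so the result is deterministic.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values, mask=None):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    queries: len_q vectors of size d_k
    keys:    len_k vectors of size d_k
    values:  len_k vectors of size d_v
    mask:    optional len_q x len_k grid; 0 marks positions to ignore
    """
    temperature = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        if mask is not None:
            # Same trick as masked_fill_: push ignored positions to a huge negative score
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors
        outputs.append([sum(w[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return outputs, weights


# Toy inputs: two queries that each line up with exactly one key
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]

out, attn = scaled_dot_product_attention(q, k, v)
# Hide the second key from the first query only
masked_out, masked_attn = scaled_dot_product_attention(q, k, v, mask=[[1, 0], [1, 1]])
```

Each row of `attn` sums to 1, and masking a key drives its weight to zero, so the first row of `masked_out` is just the first value vector.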

#### Multi-Head Attention
While our scaled dot-product attention works well for a single head, we really want to apply that concept to multiple attention heads at once in parallel! Enter: the MultiHeadAttention module! This module does a lot of things at once. Given a tensor of multiple queries, keys, and values it:
1. Projects the inputs and splits them into separate blocks for each attention head
2. Passes the split tensors into the scaled dot-product attention module
3. Concatenates the heads, merges them with a fully connected layer, and applies dropout
4. Adds the original queries back (a residual connection) and applies layer normalization

This gets a bit complicated, so I haven't left anything as an exercise for the tutorial, but hopefully the comments are helpful!

In [4]:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, 
                 num_heads: int, 
                 dim_model: int, 
                 dim_keys: int, 
                 dim_values: int, 
                 dropout=0.1):
        super().__init__()
        self.num_heads = num_heads
        self.dim_keys = dim_keys
        self.dim_values = dim_values

        # Initialize the linear projections for queries, keys, and values (one block of columns per head)
        self.queries_weights = torch.nn.Linear(dim_model, num_heads * dim_keys, bias=False)  # Recall keys and queries same shape
        self.keys_weights    = torch.nn.Linear(dim_model, num_heads * dim_keys, bias=False)
        self.values_weights  = torch.nn.Linear(dim_model, num_heads * dim_values, bias=False)
        # Fully connected layer to merge the output of all our attention heads
        self.fully_connected = torch.nn.Linear(num_heads * dim_values, dim_model, bias=False)

        # Initialize an attention layer in the module (as a class attribute) with temperature sqrt(dim_keys)
        self.attention = ScaledDotProductAttention(temperature=dim_keys ** 0.5)
        
        # Dropout and normalization layers
        self.dropout = torch.nn.Dropout(dropout)
        self.layer_norm = torch.nn.LayerNorm(dim_model, eps=1e-6)

    def forward(self, 
                queries: torch.Tensor, 
                keys: torch.Tensor, 
                values: torch.Tensor, 
                mask: torch.Tensor=None):
    
        dim_keys, dim_values, num_heads = self.dim_keys, self.dim_values, self.num_heads
        batch_size, num_queries, num_keys, num_values = queries.size(0), queries.size(1), keys.size(1), values.size(1)

        residual = queries  # Saving this value for later (queries = original queries + new queries)

        # Pass through the pre-attention projection: b x lq x (n*dk) for queries and keys, b x lv x (n*dv) for values
        # Separate different heads: b x lq x n x dk (values: b x lv x n x dv)
        # Recall Tensor.view reshapes a tensor without copying its data
        queries = self.queries_weights(queries).view(batch_size, num_queries, num_heads, dim_keys)
        keys = self.keys_weights(keys).view(batch_size, num_keys, num_heads, dim_keys)
        values = self.values_weights(values).view(batch_size, num_values, num_heads, dim_values)

        # Transpose for attention dot product: b x n x lq x dv
        queries, keys, values = queries.transpose(1, 2), keys.transpose(1, 2), values.transpose(1, 2)

        if mask is not None:
            mask = mask.unsqueeze(1)   # For head axis broadcasting.

        queries, attention = self.attention(queries, keys, values, mask=mask)

        # Transpose to move the head dimension back: b x lq x n x dv
        # Combine the last two dimensions to concatenate all the heads together: b x lq x (n*dv)
        queries = queries.transpose(1, 2).contiguous().view(batch_size, num_queries, -1)
        queries_dropout = self.dropout(self.fully_connected(queries))
        queries_dropout += residual

        normalized_queries = self.layer_norm(queries_dropout)

        return normalized_queries, attention
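
The `view` and `transpose(1, 2)` calls above are easier to follow on a tiny example. The sketch below mimics them with plain Python lists for one batch element: each projected vector is cut into `num_heads` chunks, the chunks are regrouped so that the head axis comes first, and the inverse step concatenates the heads again. The helper names are hypothetical and only serve the illustration.

```python
def split_heads(vector, num_heads):
    """Cut one projected vector of length num_heads * dim into num_heads chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    dim = len(vector) // num_heads
    return [vector[h * dim:(h + 1) * dim] for h in range(num_heads)]


def to_head_major(sequence, num_heads):
    """Reorder [position][head][dim] into [head][position][dim].

    Mirrors view(...) followed by transpose(1, 2) for a single batch element.
    """
    per_position = [split_heads(vec, num_heads) for vec in sequence]
    return [[per_position[t][h] for t in range(len(sequence))]
            for h in range(num_heads)]


def merge_heads(head_major):
    """Inverse of to_head_major: concatenate every head's chunk per position."""
    num_positions = len(head_major[0])
    return [[x for head in head_major for x in head[t]]
            for t in range(num_positions)]


# Two positions, each a projected vector of length 6 = 3 heads x 2 dims
sequence = [[0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15]]
heads = to_head_major(sequence, num_heads=3)
restored = merge_heads(heads)
```

Here `heads[0]` holds the first two features of every position, and `restored` equals the original `sequence`, which is the round trip that `transpose(1, 2).contiguous().view(batch_size, num_queries, -1)` performs after attention.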

#### Layers, Layers, and more Layers
We now need to build up our encoder and decoder from our multi-head attention! The only thing standing between us and that is the setup of the individual layers of the encoders and decoders.

In [6]:
from helpers.PositionwiseFeedForward import PositionwiseFeedForward  
class EncoderLayer(nn.Module):
    ''' Compose with two layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super().__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        # Check out the helpers folder to see the implementation of this if you're interested
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(self, enc_input, slf_attn_mask=None):
        enc_output, enc_slf_attn = self.slf_attn(
            enc_input, enc_input, enc_input, mask=slf_attn_mask)
        enc_output = self.pos_ffn(enc_output)
        return enc_output, enc_slf_attn


class DecoderLayer(nn.Module):
    ''' Compose with three layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.enc_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(
            self, dec_input, enc_output,
            slf_attn_mask=None, dec_enc_attn_mask=None):
        dec_output, dec_slf_attn = self.slf_attn(
            dec_input, dec_input, dec_input, mask=slf_attn_mask)
        dec_output, dec_enc_attn = self.enc_attn(
            dec_output, enc_output, enc_output, mask=dec_enc_attn_mask)
        dec_output = self.pos_ffn(dec_output)
        return dec_output, dec_slf_attn, dec_enc_attn

#### Encoder and Decoder, Separately!

In [None]:
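from helpers.PositionalEncoding import PositionalEncoding  # Used by both Encoder and Decoder below
class Encoder(nn.Module):
    ''' An encoder model with self attention mechanism. '''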

    def __init__(
            self, n_src_vocab, d_word_vec, n_layers, n_head, d_k, d_v,
            d_model, d_inner, pad_idx, dropout=0.1, n_position=200, scale_emb=False):

        super().__init__()

        self.src_word_emb = nn.Embedding(n_src_vocab, d_word_vec, padding_idx=pad_idx)
        self.position_enc = PositionalEncoding(d_word_vec, n_position=n_position)
        self.dropout = nn.Dropout(p=dropout)
        self.layer_stack = nn.ModuleList([
            EncoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
            for _ in range(n_layers)])
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
        self.scale_emb = scale_emb
        self.d_model = d_model

    def forward(self, src_seq, src_mask, return_attns=False):

        enc_slf_attn_list = []

        # -- Forward
        enc_output = self.src_word_emb(src_seq)
        if self.scale_emb:
            enc_output *= self.d_model ** 0.5
        enc_output = self.dropout(self.position_enc(enc_output))
        enc_output = self.layer_norm(enc_output)

        for enc_layer in self.layer_stack:
            enc_output, enc_slf_attn = enc_layer(enc_output, slf_attn_mask=src_mask)
            enc_slf_attn_list += [enc_slf_attn] if return_attns else []

        if return_attns:
            return enc_output, enc_slf_attn_list
        return enc_output,


class Decoder(nn.Module):
    """ A decoder model with self attention mechanism."""

    def __init__(
            self, n_trg_vocab, d_word_vec, n_layers, n_head, d_k, d_v,
            d_model, d_inner, pad_idx, n_position=200, dropout=0.1, scale_emb=False):

        super().__init__()

        self.trg_word_emb = nn.Embedding(n_trg_vocab, d_word_vec, padding_idx=pad_idx)
        self.position_enc = PositionalEncoding(d_word_vec, n_position=n_position)
        self.dropout = nn.Dropout(p=dropout)
        self.layer_stack = nn.ModuleList([
            DecoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
            for _ in range(n_layers)])
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
        self.scale_emb = scale_emb
        self.d_model = d_model

    def forward(self, trg_seq, trg_mask, enc_output, src_mask, return_attns=False):

        dec_slf_attn_list, dec_enc_attn_list = [], []

        # -- Forward
        dec_output = self.trg_word_emb(trg_seq)
        if self.scale_emb:
            dec_output *= self.d_model ** 0.5
        dec_output = self.dropout(self.position_enc(dec_output))
        dec_output = self.layer_norm(dec_output)

        for dec_layer in self.layer_stack:
            dec_output, dec_slf_attn, dec_enc_attn = dec_layer(
                dec_output, enc_output, slf_attn_mask=trg_mask, dec_enc_attn_mask=src_mask)
            dec_slf_attn_list += [dec_slf_attn] if return_attns else []
            dec_enc_attn_list += [dec_enc_attn] if return_attns else []

        if return_attns:
            return dec_output, dec_slf_attn_list, dec_enc_attn_list
        return dec_output,
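
Both the encoder and the decoder add a `PositionalEncoding` to the word embeddings, but that helper lives in the helpers folder and isn't shown here. Assuming it follows the sinusoidal scheme from *Attention Is All You Need* (an assumption, not something this notebook confirms), the table it adds could be sketched like this; `sinusoid_table` is a hypothetical name.

```python
import math


def sinusoid_table(n_position, d_model):
    """Build an n_position x d_model table of sinusoidal position encodings.

    Even columns use sine, odd columns use cosine, and each sin/cos pair
    shares one frequency: pos / 10000 ** (2 * (i // 2) / d_model).
    """
    table = []
    for pos in range(n_position):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# n_position=200 matches the default used by Encoder and Decoder; d_model is kept tiny for readability
table = sinusoid_table(n_position=200, d_model=8)
```

Every position gets a distinct row with values in [-1, 1], which is what lets the otherwise order-blind attention layers tell positions apart.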



#### Encoder and Decoder, Together - The Transformer Module

In [8]:
from helpers.mask_helpers import get_pad_mask, get_subsequent_mask
from helpers.PositionalEncoding import PositionalEncoding


class Transformer(nn.Module):
    ''' A sequence to sequence model with attention mechanism. '''

    def __init__(
            self, n_src_vocab, n_trg_vocab, src_pad_idx, trg_pad_idx,
            d_word_vec=512, d_model=512, d_inner=2048,
            n_layers=6, n_head=8, d_k=64, d_v=64, dropout=0.1, n_position=200,
            trg_emb_prj_weight_sharing=True, emb_src_trg_weight_sharing=True,
            scale_emb_or_prj='prj'):

        super().__init__()

        self.src_pad_idx, self.trg_pad_idx = src_pad_idx, trg_pad_idx

        assert scale_emb_or_prj in ['emb', 'prj', 'none']
        scale_emb = (scale_emb_or_prj == 'emb') if trg_emb_prj_weight_sharing else False
        self.scale_prj = (scale_emb_or_prj == 'prj') if trg_emb_prj_weight_sharing else False
        self.d_model = d_model

        self.encoder = Encoder(
            n_src_vocab=n_src_vocab, n_position=n_position,
            d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner,
            n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v,
            pad_idx=src_pad_idx, dropout=dropout, scale_emb=scale_emb)

        self.decoder = Decoder(
            n_trg_vocab=n_trg_vocab, n_position=n_position,
            d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner,
            n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v,
            pad_idx=trg_pad_idx, dropout=dropout, scale_emb=scale_emb)

        self.trg_word_prj = nn.Linear(d_model, n_trg_vocab, bias=False)

        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

        assert d_model == d_word_vec, "Model dimensions must be the same at all outputs"

        if trg_emb_prj_weight_sharing:
            # Share the weight between target word embedding & last dense layer
            self.trg_word_prj.weight = self.decoder.trg_word_emb.weight

        if emb_src_trg_weight_sharing:
            self.encoder.src_word_emb.weight = self.decoder.trg_word_emb.weight


    def forward(self, src_seq, trg_seq):

        src_mask = get_pad_mask(src_seq, self.src_pad_idx)
        trg_mask = get_pad_mask(trg_seq, self.trg_pad_idx) & get_subsequent_mask(trg_seq)

        enc_output, *_ = self.encoder(src_seq, src_mask)
        dec_output, *_ = self.decoder(trg_seq, trg_mask, enc_output, src_mask)
        seq_logit = self.trg_word_prj(dec_output)
        if self.scale_prj:
            seq_logit *= self.d_model ** -0.5

        return seq_logit.view(-1, seq_logit.size(2))
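
The `forward` method above combines a padding mask with a "subsequent" mask for the target sequence. The helpers themselves aren't shown, so the following is only a sketch of what they plausibly compute, using plain Python booleans: `pad_mask`, `subsequent_mask`, `target_mask`, and the padding index `PAD` are all hypothetical names chosen for this illustration.

```python
def pad_mask(token_ids, pad_idx):
    """True for real tokens, False for padding."""
    return [tok != pad_idx for tok in token_ids]


def subsequent_mask(length):
    """Lower-triangular grid: position i may look at positions 0..i only."""
    return [[j <= i for j in range(length)] for i in range(length)]


def target_mask(token_ids, pad_idx):
    """Combine both masks: entry [i][j] is True when position i may attend to j."""
    keep = pad_mask(token_ids, pad_idx)
    causal = subsequent_mask(len(token_ids))
    return [[causal[i][j] and keep[j] for j in range(len(token_ids))]
            for i in range(len(token_ids))]


PAD = 0  # hypothetical padding index
mask = target_mask([5, 7, 9, PAD], pad_idx=PAD)
for row in mask:
    print(["x" if visible else "." for visible in row])
```

The causal part stops the decoder from peeking at later target words during training, and the padding part keeps filler tokens from receiving any attention weight.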

## Training the Model

This section contains a basic PyTorch training loop: it loads our Transformer model into memory, then loops for the number of epochs we set, training on the training dataset and validating on the validation dataset. I've again written some performance metrics in the helpers file, which I've commented on below. Since this tutorial doesn't focus on this part of the

In [None]:
## Constants - feel free to change to see how they affect the model!

### Training
num_epochs = 10             # How many times to loop through all train/validation data
batch_size = 2048           # Batch size (how many sentences to pass to the model before updating weights)
warmup = 4000               # Number of warmup steps for the learning rate
lr_mul = 2.                 # Multiplier for learning rate
seed = 3621                 # For reproducibility. Set to None to train without seed
dropout = 0.1               # Dropout probability 
cuda = True                 # Train on a cuda-enabled GPU?
label_smoothing = True      # Apply label smoothing?


### Model
d_model = 512               # Model dimensions
d_inner_hid = 2048          # Feed forward dimensions
d_k = 64                    # Key dimensions
d_v = 64                    # Value dimensions

### Attention 
n_head = 8                  # Number of attention heads (want )
n_layers = 6                # Number of stacked encoder and decoder layers
share_emb_weight = True     # Shared embedding weight?
share_proj_weight = True    # Shared projection weight?
scale_emb_or_prj = 'prj'    # Apply scaling to the embedding or projection


### Logging
output_dir = "./model/"     # Where to save the model
use_tb = False              # Log with TensorBoard (requires additional dependencies)
save_mode = 'best'          # Save only the best model (can be changed to 'all' to save all models)
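
The `warmup` and `lr_mul` constants hint at the learning-rate schedule from *Attention Is All You Need*, where the rate rises linearly for `warmup` steps and then decays with the inverse square root of the step count. The training loop isn't shown in this notebook, so treat the sketch below as an assumption about how these constants are used; `warmup_learning_rate` is a hypothetical name.

```python
def warmup_learning_rate(step, d_model=512, warmup=4000, lr_mul=2.0):
    """Learning rate at a given optimizer step (steps start at 1).

    lr = lr_mul * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    The two terms inside min() are equal at step == warmup, which is the peak.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return lr_mul * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Sample the schedule before, at, and after the warmup boundary
for step in (1, 2000, 4000, 8000):
    print(step, warmup_learning_rate(step))
```

With the defaults above, raising `warmup` lowers and delays the peak, while `lr_mul` scales the whole curve.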

In [None]:
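import os
import random

# Seed every random number generator we rely on so runs are repeatable
if seed is not None:
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)

os.makedirs(output_dir, exist_ok=True)  # Create the output directory if it does not exist
# Fall back to the CPU when CUDA was requested but no GPU is available
device = torch.device('cuda' if cuda and torch.cuda.is_available() else 'cpu')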

transformer = Transformer()

## Testing the Model
We still have our test data, so let's see the model's predictions on some of the test set! To save time, I've run the training loop myself earlier for 10 epochs.

In [9]:
import os

In [13]:
[sentence_pair[0][0] for sentence_pair in list(train_loader) ]

['Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.',
 'Ein kleines Mädchen klettert in ein Spielhaus aus Holz.',
 'Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.',
 'Zwei Männer stehen am Herd und bereiten Essen zu.',
 'Ein Mann in grün hält eine Gitarre, während der andere Mann sein Hemd ansieht.',
 'Ein Mann lächelt einen ausgestopften Löwen an.',
 'Ein schickes Mädchen spricht mit dem Handy während sie langsam die Straße entlangschwebt.',
 'Eine Frau mit einer großen Geldbörse geht an einem Tor vorbei.',
 'Jungen tanzen mitten in der Nacht auf Pfosten.',
 'Eine Ballettklasse mit fünf Mädchen, die nacheinander springen.',
 'Vier Typen, von denen drei Hüte tragen und einer nicht, springen oben in einem Treppenhaus.',
 'Ein schwarzer Hund und ein gefleckter Hund kämpfen.',
 'Ein Mann in einer neongrünen und orangefarbenen Uniform fährt auf einem grünen Traktor.',
 'Mehrere 

In [18]:
train

ShardingFilterIterDataPipe