In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import seaborn as sns
import matplotlib.pyplot as plt

# Transformers

## Introduction

Sequence to sequence models up to 2017:
- Recurrent Neural Network
- Long Short Term Memory
- Gated Recurrent Unit

<center><img src = "https://stanford.edu/~shervine/teaching/cs-230/illustrations/architecture-rnn-ltr.png?9ea4417fc145b9346a3e288801dbdfdc" width = "60%"/></center>

Two main problems:
1. sequential input flow $\rightarrow$ slow training and prediction
2. many multiplications $\rightarrow$ exploding/vanishing gradients $\rightarrow$ loss of information + reduced window capacity

Transformers (Vaswani et al., 2017) are sequence-to-sequence encoder-decoder models that map an input sequence $x$ to an output sequence $y$.\
They rely on a mechanism called attention, which allows both parallelization ($\rightarrow$ faster training) and direct access to every position in the context window ($\rightarrow$ no information lost over long distances within that window).

<center><img src="https://pytorch.org/tutorials/_images/transformer_architecture.jpg" width="30%"/></center>

## Transformers from scratch

The majority of the code is taken from https://github.com/ajhalthor/Transformer-Neural-Network .


In [2]:
max_sequence_len = 60 #number of tokens in each sentence
d_model = 6 #dimension of embeddings

### Input

<center><img src="img/word_embedding_0.svg" /></center>

#### Vocabularies lookup tables and Embeddings

In [3]:
START_TOKEN = '<START>'
PADDING_TOKEN = '<PADDING>'
END_TOKEN = '<END>'

#loading vocabulary through spacy
import spacy
nlp_en = spacy.load('en_core_web_lg')
nlp_it = spacy.load('it_core_news_lg')
italian_vocab = [START_TOKEN, PADDING_TOKEN] + list(nlp_it.vocab.strings) + [END_TOKEN]
english_vocab = [START_TOKEN, PADDING_TOKEN] + list(nlp_en.vocab.strings) + [END_TOKEN]

#lookup tables
index_to_italian = {k:v for k,v in enumerate(italian_vocab)}
italian_to_index = {v:k for k,v in enumerate(italian_vocab)}
index_to_english = {k:v for k,v in enumerate(english_vocab)}
english_to_index = {v:k for k,v in enumerate(english_vocab)}

In [4]:
italian_to_index['procioni'], english_to_index['racoons']

(559485, 660403)

From these vocabularies we can build $\texttt{nn.Embedding}$ layers, which behave as lookup tables: they take an index as input and return the corresponding vector.\
The vectors are the rows of a weight matrix that is initialized with random values and updated by backpropagation.\
We also specify the dimension of each embedding. In the original paper, $d_{model} = 512$; for simplicity, here it is set to 6.
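
To make the "lookup table" behaviour concrete, here is a minimal sketch on a toy vocabulary. The names `toy_vocab`, `toy_to_index` and `toy_embedding` are made up for this illustration and are unrelated to the spaCy vocabularies loaded above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the sketch reproducible

# Hypothetical toy vocabulary, not the spaCy one
toy_vocab = ['<START>', '<PADDING>', 'raccoon', 'sleeps', '<END>']
toy_to_index = {word: idx for idx, word in enumerate(toy_vocab)}

# 5 rows (one per word), each row is a 6-dimensional vector
toy_embedding = nn.Embedding(len(toy_vocab), 6)

idx = torch.tensor(toy_to_index['raccoon'])
vector = toy_embedding(idx)

# The layer returns exactly the row of its weight matrix at that index
same_row = torch.equal(vector, toy_embedding.weight[idx])
print(vector.shape, same_row)
```

Since the weight matrix is a trainable parameter, the rows that are looked up during training receive gradients and are updated like any other weight.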

In [5]:
english_embedding = nn.Embedding(len(english_vocab), d_model)
italian_embedding = nn.Embedding(len(italian_vocab), d_model)

en_w, it_w = 660403, 559485
print(f"Embedding for the word {index_to_english[en_w]}: {english_embedding(torch.tensor(en_w))}")
print(f"Embedding for the word {index_to_italian[it_w]}: {italian_embedding(torch.tensor(it_w))}")


Embedding for the word racoons: tensor([-1.0445, -2.0593, -0.4914, -1.3520,  0.0834,  0.9723],
       grad_fn=<EmbeddingBackward0>)
Embedding for the word procioni: tensor([ 0.8912, -0.1354,  0.6910,  0.5355,  0.3882, -0.1405],
       grad_fn=<EmbeddingBackward0>)


#### Data Preprocessing and Batching

dataset: https://www.statmt.org/europarl/

Next steps:
1. Preprocess each sentence by removing the trailing '\n' and applying $\texttt{lower()}$
2. Filter those sentences which are too long and/or have unknown words
3. Take 10000 of those filtered sentences
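
The filtering logic of these steps can be sketched on a toy example. This is only an illustration: the function `keep_sentence` and the whitespace split are assumptions standing in for the real tokenizers used in the cells below.

```python
def keep_sentence(sentence, vocab, max_len):
    """Return the cleaned sentence if it passes both checks, else None."""
    cleaned = sentence.rstrip('\n').lower()
    tokens = cleaned.split()  # assumption: whitespace split instead of a real tokenizer
    if len(tokens) < max_len and all(tok in vocab for tok in tokens):
        return cleaned
    return None

toy_vocab = {'the', 'raccoon', 'sleeps'}
lines = ['The raccoon sleeps\n', 'The raccoon snores\n', 'The the the the the\n']

# Second line has an unknown word, third line is too long for max_len=5
kept = [s for s in (keep_sentence(l, toy_vocab, max_len=5) for l in lines) if s is not None]
print(kept)  # ['the raccoon sleeps']
```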

In [6]:
#utility functions for checking sentence validity
def is_valid_tokens(sentence_tokenized, vocab):
    for token in sentence_tokenized:
        if isinstance(token, str):
            w = token 
        else:
            w = token.text
        if w not in vocab:
            return False
    return True

def is_valid_length(sentence_tokenized, max_sequence_length):
    return len(sentence_tokenized) < (max_sequence_length)

def filter_and_preprocess(sent, tokenizer, vocab, max_s_len):
    tokenized = tokenizer(sent)
    if is_valid_length(tokenized, max_s_len) and is_valid_tokens(tokenized, vocab):
        return str(sent)
    return False

In [7]:
#taking only the first 10000 sentences meeting the requirements
MAX_SENTENCES = 10000

In [8]:
#use nltk word_tokenizer to speed up the process
from nltk.tokenize import word_tokenize

In [9]:
it_tokenizer = lambda text: word_tokenize(text, language='italian')
en_tokenizer = lambda text: word_tokenize(text, language='english')

In [10]:
english_file = "/Users/flint/Data/europarl/it-en/europarl-v7.it-en.en"
italian_file = "/Users/flint/Data/europarl/it-en/europarl-v7.it-en.it"

en_vocab = set(english_vocab)
it_vocab = set(italian_vocab)

#loading corpus for training
count = 0
english_lines, italian_lines = [], []
english_sentences, italian_sentences = [], []

print('Reading english sentences...')
with open(english_file,'rt') as f:
    english_lines = f.readlines()

print('Reading italian sentences...')   
with open(italian_file,'rt') as f:
    italian_lines = f.readlines()
    
for (sentence_en, sentence_it) in zip(english_lines, italian_lines):
    if count < MAX_SENTENCES:
        preprocessed_sent_en = filter_and_preprocess(sentence_en.lower()[:-1], 
                                                    en_tokenizer, 
                                                    en_vocab, 
                                                    max_sequence_len)
        if preprocessed_sent_en:
            preprocessed_sent_it = filter_and_preprocess(sentence_it.lower()[:-1], 
                                                it_tokenizer, 
                                                it_vocab, 
                                                max_sequence_len)
            if preprocessed_sent_it: 
                english_sentences.append(preprocessed_sent_en)
                italian_sentences.append(preprocessed_sent_it)
                count += 1
                print(count, end='\r')
    else: break

Reading english sentences...
Reading italian sentences...
10000

In [11]:
english_sentences[:2]

['resumption of the session',
 'i declare resumed the session of the european parliament adjourned on friday 17 december 1999, and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.']

In [12]:
italian_sentences[:2]

['ripresa della sessione',
 'dichiaro ripresa la sessione del parlamento europeo, interrotta venerdì 17 dicembre e rinnovo a tutti i miei migliori auguri nella speranza che abbiate trascorso delle buone vacanze.']

Next, we build a custom $\texttt{Dataset}$ class to store our input pairs and to obtain a nice interface for doing batching.

In [13]:
#we build a dataset class for our specific MT task
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):

    def __init__(self, english_sentences, italian_sentences):
        self.english_sentences = english_sentences
        self.italian_sentences = italian_sentences

    def __len__(self):
        return len(self.english_sentences)

    def __getitem__(self, idx):
        return self.english_sentences[idx], self.italian_sentences[idx]

In [14]:
#here is our dataset instance, which will work flawlessly with other pytorch modules
dataset = TextDataset(english_sentences, italian_sentences)

#saving the dataset
torch.save(dataset, '/Users/flint/Data/europarl/it-en/dataset.pt')

In [15]:
dataset[34]

('so parliament should send a message, since that is the wish of the vast majority.',
 'il parlamento dovrebbe pertanto inviare un messaggio, come auspica la stragrande maggioranza dei deputati.')

In [16]:
#loading dataset
dataset = torch.load('/Users/flint/Data/europarl/it-en/dataset.pt')

Let's see how to batch our data: we simply create a $\texttt{DataLoader}$ instance to iterate through the dataset in batches of size 3.

In [17]:
#we can also batch our data to make training faster
batch_size = 3 
train_loader = DataLoader(dataset, batch_size)
iterator = iter(train_loader)
batch = next(iterator)

In [18]:
batch

[('resumption of the session',
  'i declare resumed the session of the european parliament adjourned on friday 17 december 1999, and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.',
  "please rise, then, for this minute' s silence."),
 ('ripresa della sessione',
  'dichiaro ripresa la sessione del parlamento europeo, interrotta venerdì 17 dicembre e rinnovo a tutti i miei migliori auguri nella speranza che abbiate trascorso delle buone vacanze.',
  'vi invito pertanto ad alzarvi in piedi per osservare appunto un minuto di silenzio.')]

#### Tokenization

For this step, we replace each word with its corresponding index in the vocabulary lookup tables.\
Then, we add \<START\>, \<END\> and \<PADDING\> tokens accordingly. 
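
As a small sketch of this step on a toy lookup table (the names `toy_to_index` and `to_padded_ids` are hypothetical, and the input is already split into words):

```python
START, PAD, END = '<START>', '<PADDING>', '<END>'
toy_to_index = {START: 0, PAD: 1, 'the': 2, 'raccoon': 3, 'sleeps': 4, END: 5}

def to_padded_ids(words, word_to_index, max_len, start_token=True, end_token=True):
    ids = [word_to_index[w] for w in words]           # words -> indices
    if start_token:
        ids.insert(0, word_to_index[START])
    if end_token:
        ids.append(word_to_index[END])
    ids += [word_to_index[PAD]] * (max_len - len(ids))  # right-pad up to max_len
    return ids

ids = to_padded_ids(['the', 'raccoon', 'sleeps'], toy_to_index, max_len=8)
print(ids)  # [0, 2, 3, 4, 5, 1, 1, 1]
```

Padding every sentence to the same length is what later allows the sentences of a batch to be stacked into a single tensor.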

In [19]:
#now we build our tokenizer, using the previously-made lookup tables
def tokenize(sentence, language_to_index, lang_tokenizer, start_token=True, end_token=True):
    sentence_word_ids = [language_to_index[token.text] for token in lang_tokenizer(sentence)]
    if start_token:
        sentence_word_ids.insert(0, language_to_index[START_TOKEN])
    if end_token:
        sentence_word_ids.append(language_to_index[END_TOKEN])
    for _ in range(len(sentence_word_ids), max_sequence_len):
        sentence_word_ids.append(language_to_index[PADDING_TOKEN])
    return torch.tensor(sentence_word_ids)

In [20]:
tokenize('my favourite animal is the raccoon.', english_to_index, nlp_en.tokenizer, start_token=False, end_token=True)

tensor([611090, 494930, 394675, 546092, 721448, 660268,   4897, 776471,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1])

In [22]:
tokenize('il mio animale preferito è il procione.', italian_to_index, nlp_it.tokenizer, start_token=True, end_token=True)

tensor([     0, 452149, 508323, 307925, 556355, 677483, 452149, 559484,   3016,
        681836,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1])

In [23]:
eng_tokenized, it_tokenized = [], []
for sentence_num in range(batch_size):
    eng_sentence, it_sentence = batch[0][sentence_num], batch[1][sentence_num]
    eng_tokenized.append(tokenize(eng_sentence, english_to_index, nlp_en.tokenizer, start_token=False, end_token=True))
    #start and end tokens are required for beginning and ending in the generation phase
    it_tokenized.append(tokenize(it_sentence, italian_to_index, nlp_it.tokenizer, start_token=True, end_token=True)) 
eng_tokenized = torch.stack(eng_tokenized)
it_tokenized = torch.stack(it_tokenized)

In [24]:
it_tokenized

tensor([[     0, 582931, 386197, 606005, 681836,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1],
        [     0, 390276, 582931, 479202, 606005, 385670, 538666, 412693,   1125,
         464576, 659121,  13442, 390172, 400642, 581859, 290097, 650364, 449638,
         506537, 506751, 320905, 520299, 620428, 356586, 291150, 645799, 386283,
         342076, 655743,   3016, 681836,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,      1,      1,      1,      1,
              1,      1,      1,      1,      1,   

#### Putting it all together: fancy class for Sentence Embeddings

In [None]:
def get_device():
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

class SentenceEmbedding(nn.Module):
    "For a given sentence, create an embedding"
    def __init__(self, max_sequence_length, d_model, language_to_index, lang_tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN):
        super().__init__()
        self.vocab_size = len(language_to_index)
        self.max_sequence_length = max_sequence_length
        self.embedding = nn.Embedding(self.vocab_size, d_model)
        self.language_to_index = language_to_index
        self.language_tokenizer = lang_tokenizer
        self.dropout = nn.Dropout(p=0.1)
        self.START_TOKEN = START_TOKEN
        self.END_TOKEN = END_TOKEN
        self.PADDING_TOKEN = PADDING_TOKEN
    
    def batch_tokenize(self, batch, start_token=True, end_token=True):

        def tokenize(sentence, start_token=True, end_token=True):
            sentence_word_ids = [self.language_to_index[token.text] for token in self.language_tokenizer(sentence)]
            if start_token:
                sentence_word_ids.insert(0, self.language_to_index[self.START_TOKEN])
            if end_token:
                sentence_word_ids.append(self.language_to_index[self.END_TOKEN])
            for _ in range(len(sentence_word_ids), self.max_sequence_length):
                sentence_word_ids.append(self.language_to_index[self.PADDING_TOKEN])
            return torch.tensor(sentence_word_ids)

        tokenized = []
        for sentence_num in range(len(batch)):
            tokenized.append(tokenize(batch[sentence_num], start_token, end_token))
        tokenized = torch.stack(tokenized)
        return tokenized.to(get_device())
    
    def forward(self, x, start_token = True, end_token=True): # sentence
        x = self.batch_tokenize(x, start_token, end_token)
        x = self.embedding(x)
        return x


In [None]:
sentence_embedding = SentenceEmbedding(max_sequence_len, d_model, english_to_index, nlp_en.tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
english_batch = next(iterator)[0]
print(f"Input batch: {english_batch}\nOutput embeddings shapes: {[embedding.size() for embedding in sentence_embedding(english_batch, start_token = False, end_token = True)]}\nOutput embeddings: {sentence_embedding(english_batch)}")

### Positional Encoding

In [None]:
#only for displaying purposes, we will lower 
#max_sequence_len to 10
max_sequence_len = 10

$$
PE(\text{position}, 2i) = \sin\bigg( \frac{ \text{position} }{10000^\frac{2i}{d_{model}}} \bigg) 
$$

$$
PE(\text{position}, 2i+1) = \cos\bigg( \frac{ \text{position} }{10000^\frac{2i}{d_{model}}} \bigg)
$$
where $i$ is the embedding dimension index, $position$ is the position of the word in the sentence, $d_{model}$ is the size of the embedding.
- sin and cos allow for constrained values
- sin and cos are periodic -> easier to attend to relative and distant positions


Note that, whenever the embedding dimension index is odd ($2i+1$), the cosine still uses $2i$ in the exponent of its denominator, i.e. $(2i+1)-1 = 2i$. For this reason, we use the $\texttt{repeat\_interleave()}$ function to repeat each even index twice, so that the tensor $\texttt{i}$ below directly holds the value $2i$ for every embedding dimension.
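
The two formulas can also be written as an explicit loop over the embedding dimensions, which makes the pairing of each odd dimension with its even neighbour visible. This is a plain-Python sketch for a single position; the helper name `positional_encoding` is an assumption and is not used elsewhere in the notebook.

```python
import math

def positional_encoding(position, d_model):
    """Explicit-loop version of the two PE formulas for a single position."""
    pe = []
    for dim in range(d_model):
        even_dim = dim - (dim % 2)  # 2i: an odd dimension reuses the exponent of its even neighbour
        angle = position / (10000 ** (even_dim / d_model))
        pe.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0, 6)
pe3 = positional_encoding(3, 6)
print(pe0)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(pe3)
```

Because each (sin, cos) pair shares the same angle, the squares of the two values in a pair always sum to 1, and every entry stays in $[-1, 1]$.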

In [None]:
i = torch.arange(0,d_model,2, dtype=torch.float).repeat_interleave(2)[:d_model]
i

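The next code computes $10000^\frac{2i}{d_{model}}$. Since the tensor $\texttt{i}$ already contains the even indices $2i$ (0, 0, 2, 2, ...), the exponent is simply $\texttt{i/d\_model}$.

In [None]:
#i already stores the even index 2i, so it must not be doubled again
denominator = torch.pow(10000, i/d_model)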
denominator

This code computes $\text{position}$.


In [None]:
position = torch.arange(max_sequence_len, dtype=torch.float).reshape(max_sequence_len, 1)
position

And this code computes $\frac{ \text{position} }{10000^\frac{2i}{d_{model}}}$.

In [None]:
sin_cos_argument = position/denominator
sin_cos_argument

Now we are ready to apply the $\sin$ function to even vector positions ($\texttt{[:, 0::2]}$) and the $\cos$ function to odd vector positions ($\texttt{[:, 1::2]}$).\
This computation gives us the final positional encoding ($\texttt{PE}$).

In [None]:
PE = torch.zeros(size = sin_cos_argument.shape)
#even positions
PE[:, 0::2] = torch.sin(sin_cos_argument[:, 0::2])
#odd positions
PE[:, 1::2] = torch.cos(sin_cos_argument[:, 1::2])
PE

#### Fancy class for Positional Encoding

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.max_sequence_length = max_sequence_length
        self.d_model = d_model

    def forward(self):
        i = torch.arange(0,self.d_model,2, dtype=torch.float).repeat_interleave(2)[:self.d_model]
        #i already stores the even index 2i, so it must not be doubled again
        denominator = torch.pow(10000, i/self.d_model)
        position = torch.arange(self.max_sequence_length).reshape(self.max_sequence_length, 1)
        sin_cos_argument = position/denominator
        PE = torch.zeros(size = sin_cos_argument.shape)
        PE[:, 0::2] = torch.sin(sin_cos_argument[:, 0::2])
        PE[:, 1::2] = torch.cos(sin_cos_argument[:, 1::2])
        return PE

### Input Embeddings + Positional Encoding

Now we will update the $\texttt{SentenceEmbedding}$ class to output embeddings with encoded position.

In [None]:
#resetting max_sequence_len
max_sequence_len = 60

In [None]:
class SentenceEmbedding(nn.Module):
    "For a given sentence, create an embedding"
    def __init__(self, max_sequence_length, d_model, language_to_index, lang_tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN):
        super().__init__()
        self.vocab_size = len(language_to_index)
        self.max_sequence_length = max_sequence_length
        self.embedding = nn.Embedding(self.vocab_size, d_model)
        self.language_to_index = language_to_index
        self.language_tokenizer = lang_tokenizer
        self.position_encoder = PositionalEncoding(d_model, max_sequence_length)
        self.dropout = nn.Dropout(p=0.1)
        self.START_TOKEN = START_TOKEN
        self.END_TOKEN = END_TOKEN
        self.PADDING_TOKEN = PADDING_TOKEN
    
    def batch_tokenize(self, batch, start_token=True, end_token=True):

        def tokenize(sentence, start_token=True, end_token=True):
            sentence_word_ids = [self.language_to_index[token.text] for token in self.language_tokenizer(sentence)]
            if start_token:
                sentence_word_ids.insert(0, self.language_to_index[self.START_TOKEN])
            if end_token:
                sentence_word_ids.append(self.language_to_index[self.END_TOKEN])
            for _ in range(len(sentence_word_ids), self.max_sequence_length):
                sentence_word_ids.append(self.language_to_index[self.PADDING_TOKEN])
            return torch.tensor(sentence_word_ids)

        tokenized = []
        for sentence_num in range(len(batch)):
            tokenized.append(tokenize(batch[sentence_num], start_token, end_token))
        tokenized = torch.stack(tokenized)
        return tokenized.to(get_device())
    
    def forward(self, x, start_token = True, end_token=True): # sentence
        x = self.batch_tokenize(x, start_token, end_token)
        x = self.embedding(x)
        pos = self.position_encoder().to(get_device())
        x = self.dropout(x + pos)
        return x


In [None]:
sentence_embedding = SentenceEmbedding(max_sequence_len, d_model, english_to_index, nlp_en.tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
english_batch = next(iterator)[0]
print(f"Input batch: {english_batch}\nOutput embeddings shapes: {[embedding.size() for embedding in sentence_embedding(english_batch, start_token = False, end_token = True)]}\nOutput embeddings: {sentence_embedding(english_batch)}")

In [None]:
en_sentence_embedding = SentenceEmbedding(max_sequence_len, d_model, english_to_index, nlp_en.tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
it_sentence_embedding = SentenceEmbedding(max_sequence_len, d_model, italian_to_index, nlp_it.tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)

In [None]:
english_batch, italian_batch = next(iterator)

### Attention

#### Simple Attention

<center><img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2019/11/image2.png" width = "50%"/></center>

<center><img src="img/attention.svg"/></center>

Properties:
1. No access to "future" information $\rightarrow$ autoregressive generation
2. Each computation is independent $\rightarrow$ parallelization

Steps:
1. Comparison
$$
\text{score}(\mathbf{x_i}, \mathbf{x_j}) = \mathbf{x_i} \cdot \mathbf{x_j}, \quad \forall j \leq i
$$
2. Normalization
$$
\alpha_{ij} = \text{softmax}(\text{score}(\mathbf{x}_i, \mathbf{x}_j)) = \frac{\exp(\text{score}(\mathbf{x}_i, \mathbf{x}_j))}{\sum_{k=1}^{i}\exp(\text{score}(\mathbf{x}_i, \mathbf{x}_k))}, \quad \forall j \leq i
$$
3. Weighted sum
$$
\mathbf{y}_i = \sum_{j\leq i}\alpha_{ij}\mathbf{x}_j
$$
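
The three steps above can be sketched with an explicit loop over positions. The vectors in `x` are random stand-ins for embeddings, and the toy sizes (4 tokens, dimension 6) are chosen only for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 6)  # 4 toy token vectors of size 6

outputs = []
for i in range(x.size(0)):
    scores = x[: i + 1] @ x[i]            # 1. comparison with x_j for j <= i
    alpha = torch.softmax(scores, dim=0)  # 2. normalization over those j
    outputs.append(alpha @ x[: i + 1])    # 3. weighted sum of the visible vectors
y = torch.stack(outputs)
print(y.shape)  # torch.Size([4, 6])
```

The first output equals the first input, since position 0 can only attend to itself. Each iteration of the loop reads only `x`, never a previous output, which is why the computation can be parallelized.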

#### Attention in transformers

Three different roles in attention process:
1. current focus of attention $\rightarrow$ **query**
2. input being compared $\rightarrow$ **key**
3. output for current focus $\rightarrow$ **value**
<center><img src="img/query-key-value.svg"/></center>

where 
$$
\mathbf{Q} = \mathbf{XW}^Q, \quad \mathbf{K} = \mathbf{XW}^K, \quad \mathbf{V} = \mathbf{XW}^V; \quad \mathbf{Q}, \mathbf{K} \in \mathbb{R}^{N \times d_k}, \mathbf{V} \in \mathbb{R}^{N \times d_v}
$$
with $N$ being the number of input tokens and $d_k = d_v = d_{model}$ the dimensionality of input embeddings.

The three steps are the same, but with matrices (parallel computation):
1. Comparison
$$
\text{score}(\mathbf{Q}, \mathbf{K}) = \mathbf{Q}\mathbf{K}^\top
$$
2. Normalization (& scaling for stabilization)
$$
\text{Attention}(\mathbf{Q},\mathbf{K}) =\text{softmax}\left(\frac{\mathbf{QK^\top}}{\sqrt{d_k}}\right)
$$
3. Weighted sum
$$
\text{Output}= \text{softmax}\left(\frac{\mathbf{QK^\top}}{\sqrt{d_k}}\right)\mathbf{V}
$$
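
A quick empirical check of why the scores are divided by $\sqrt{d_k}$. The sketch assumes query and key entries are independent with zero mean and unit variance, which is an idealization and not a property of the embeddings in this notebook; the value `d_k = 64` is the per-head size quoted from the original paper.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10000, d_k)  # 10000 random query vectors
k = torch.randn(10000, d_k)  # 10000 random key vectors

raw = (q * k).sum(dim=-1)     # one dot product per (query, key) pair
scaled = raw / math.sqrt(d_k)

# Under the assumption above, the variance of raw grows like d_k,
# while the scaled scores stay close to unit variance
print(raw.var().item(), scaled.var().item())
```

Large unscaled scores push the softmax towards a one-hot output with very small gradients, so keeping the variance near 1 helps to stabilize training.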

In [None]:
d_k, d_v =  d_model, d_model
w_q = torch.randn(d_k, d_k)
w_k = torch.randn(d_k, d_k)
w_v = torch.randn(d_v, d_v)
input = en_sentence_embedding(english_batch, start_token = False, end_token = True)[0] #embeddings
tokenized_sentence = en_sentence_embedding.batch_tokenize(english_batch, start_token = False, end_token = True)[0] #textual representation of sentence, tokenized
tokenized_sentence_pad = [index_to_english[t.item()] for t in tokenized_sentence]
tokenized_sentence_nopad = [index_to_english[t.item()] for t in tokenized_sentence if t != english_to_index[PADDING_TOKEN]]
print(f"Query weight matrix shape:\t{w_q.size()}")
print(f"Key weight matrix shape:\t{w_k.size()}")
print(f"Value weight matrix shape:\t{w_v.size()}")
print(f"Input matrix shape:\t\t{input.size()}")

#computing Q, K and V
q = torch.matmul(input, w_q)
k = torch.matmul(input, w_k)
v = torch.matmul(input, w_v)
print(f"Query matrix shape:\t\t{q.size()}")
print(f"Key matrix shape:\t\t{k.size()}")
print(f"Value matrix shape:\t\t{v.size()}")

In [None]:
input_img, q_img, k_img, v_img = input[:len(tokenized_sentence_nopad),:] ,q[:len(tokenized_sentence_nopad),:], k[:len(tokenized_sentence_nopad),:], v[:len(tokenized_sentence_nopad),:]

In [None]:
def softmax(x):
  return (torch.exp(x).T / torch.sum(torch.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v):
  d_k = q.shape[-1]
  scaled = torch.matmul(q, k.T) / math.sqrt(d_k)
  attention = softmax(scaled)
  output = torch.matmul(attention, v)
  return output, attention

In [None]:
o, att = scaled_dot_product_attention(q, k, v)
print(f"Output shape:\t\t{o.size()}")
print(f"Attention shape:\t{att.size()}")

In [None]:
o, att = scaled_dot_product_attention(input_img, input_img, v_img)
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(att.detach().numpy(), xticklabels=tokenized_sentence_nopad, yticklabels=tokenized_sentence_nopad, ax=ax)
plt.tight_layout()
plt.show()

In [None]:
#quick visualization
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
model_type = 'bert'
model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=True)
show(model, model_type, tokenizer, sentence_a, sentence_b, layer=4, head=3)

Attention in Transformers is used in three different ways:
- as the encoder's multi-head *self-attention*
- as the decoder's **masked** multi-head *self-attention*
- as the decoder's multi-head *cross-attention*

They differ in the mask applied to the scores and in the source of $\mathbf{Q}, \mathbf{K}$ and $\mathbf{V}$. Let's see why and how.

##### Masking

By 'masking' we mean adding a very large negative number (ideally $-\infty$) to a score before the softmax. The corresponding attention weight then becomes 0, which is exactly what we want.
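
A minimal sketch of this effect on made-up scores. It uses the finite value `-1e9`, the same constant as in the next cell, and also shows what can go wrong with a true $-\infty$.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 1.0, 0.5],
                       [0.3, 0.2, 0.1]])
mask = torch.tensor([[0.0, 0.0, -1e9],     # hide only the last position
                     [-1e9, -1e9, -1e9]])  # a row that is hidden everywhere

weights = F.softmax(scores + mask, dim=-1)
print(weights)  # masked entry of row 0 is 0; row 1 falls back to uniform weights

# With a true -inf, a fully masked row has no finite entry left and softmax returns NaN
nan_row = F.softmax(torch.full((1, 3), float('-inf')), dim=-1)
print(nan_row)
```

Rows that are masked everywhere typically correspond to padding positions, whose outputs are ignored anyway; the finite constant just keeps NaN values out of the computation.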

In [None]:
#masking value: a very large negative (but finite) number, so that the softmax
#gives ~0 weight to masked positions; a true -inf would turn rows that are
#masked everywhere into 0/0 = NaN
NEG_INFTY = -1e9

In [None]:
english_batch[0]

In [None]:
italian_batch[0]

The encoder's multi-head self-attention and the decoder's multi-head cross-attention do not need a look-ahead mask, since we want them to use both left and right context. The only mask we apply blocks attention to and from \<PADDING\> tokens.
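
The same kind of padding mask can be sketched with broadcasting on a toy sequence of indices. `PAD_ID = 1` mirrors the position of the padding token in the lookup tables above; the other indices are arbitrary.

```python
import torch

PAD_ID = 1  # index of <PADDING> in the lookup tables above
ids = torch.tensor([7, 8, 9, PAD_ID, PAD_ID])  # 3 real tokens, 2 padding tokens

is_pad = ids == PAD_ID
# True where the key (column) OR the query (row) is a padding token
blocked = is_pad.unsqueeze(0) | is_pad.unsqueeze(1)
mask = torch.zeros(blocked.shape).masked_fill(blocked, -1e9)
print(mask)
```

Only the top-left 3x3 block, where both tokens are real, is left at 0 and therefore free to attend.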

In [None]:
def create_encoder_padding_mask(eng_batch):
    masks = []
    tokenized_batch = en_sentence_embedding.batch_tokenize(eng_batch, start_token = False, end_token = True)
    for sentence in tokenized_batch:
        row = torch.where(sentence == en_sentence_embedding.language_to_index[en_sentence_embedding.PADDING_TOKEN],
                        True, False)
        mask = torch.tile(row.unsqueeze(0), (len(sentence), 1))
        mask = torch.logical_or(mask, mask.T)
        mask = torch.where(mask, NEG_INFTY, 0)
        masks.append(mask)        
    masks = torch.stack(masks)
    return masks.to(get_device())

In [None]:
def create_decoder_padding_mask(eng_batch, it_batch):
    tokenized_it_batch = it_sentence_embedding.batch_tokenize(it_batch, start_token = True, end_token = True)
    tokenized_en_batch = en_sentence_embedding.batch_tokenize(eng_batch, start_token = False, end_token = True)
    masks = []
    for en_sentence, it_sentence in zip(tokenized_en_batch, tokenized_it_batch):
        en_row = torch.where(en_sentence == en_sentence_embedding.language_to_index[en_sentence_embedding.PADDING_TOKEN],
                        True, False)
        en_mask = torch.tile(en_row.unsqueeze(0), (len(en_sentence), 1))
        it_row = torch.where(it_sentence == it_sentence_embedding.language_to_index[it_sentence_embedding.PADDING_TOKEN],
                        True, False)
        it_mask = torch.tile(it_row.unsqueeze(0), (len(it_sentence), 1))
        mask = torch.logical_or(it_mask, en_mask.T)
        mask = torch.where(mask, NEG_INFTY, 0)
        masks.append(mask)        
    masks = torch.stack(masks)  
    return masks.to(get_device())

In [None]:
m = create_encoder_padding_mask(english_batch)[0]
fig, ax = plt.subplots(figsize=(13,10))  
sns.heatmap(m, xticklabels=tokenized_sentence_pad, yticklabels=tokenized_sentence_pad)

Decoder's cross attention:

In [None]:
tokenized_sentence_it = it_sentence_embedding.batch_tokenize(italian_batch, start_token = True, end_token = True)[0]
tokenized_sentence_it_pad = [index_to_italian[t.item()] for t in tokenized_sentence_it]
m = create_decoder_padding_mask(english_batch, italian_batch)[0]
fig, ax = plt.subplots(figsize=(13,10))  
sns.heatmap(m, xticklabels=tokenized_sentence_it_pad, yticklabels=tokenized_sentence_pad)

The decoder's masked multi-head self-attention needs a look-ahead mask because the decoder must learn to predict the next token: leaving future tokens visible would be cheating. More formally, we need to preserve the auto-regressive property of attention to be able to generate text.\
The decoder's self-attention also needs the \<PADDING\> token mask.
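
A small sketch of the look-ahead part alone, on a toy sequence of 4 tokens with identical scores, so that the effect of the mask is easy to read off the weights.

```python
import torch

n = 4
# True strictly above the diagonal: position i must not see positions j > i
look_ahead = torch.triu(torch.ones(n, n), diagonal=1).bool()

scores = torch.zeros(n, n).masked_fill(look_ahead, -1e9)  # equal scores, future hidden
weights = torch.softmax(scores, dim=-1)
print(weights)
```

Row $i$ spreads its weight uniformly over the first $i+1$ positions and gives 0 to everything after them.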

In [None]:
def create_decoder_masked_attention_mask(it_batch):
    tokenized_batch = it_sentence_embedding.batch_tokenize(it_batch, start_token = True, end_token = True)
    
    masks = []
    for sentence in tokenized_batch:
        look_ahead_mask = torch.full([len(sentence), len(sentence)] , True)
        look_ahead_mask = torch.triu(look_ahead_mask, diagonal=1)
        row = torch.where(sentence == it_sentence_embedding.language_to_index[it_sentence_embedding.PADDING_TOKEN],
                        True, False)
        padding_mask = torch.tile(row.unsqueeze(0), (len(sentence), 1))
        padding_mask = torch.logical_or(padding_mask, padding_mask.T)
        mask = torch.logical_or(padding_mask, look_ahead_mask)
        mask = torch.where(mask, NEG_INFTY, 0)
        masks.append(mask)        
    masks = torch.stack(masks)
    
    return masks.to(get_device())
        

In [None]:
m = create_decoder_masked_attention_mask(italian_batch)[0]
fig, ax = plt.subplots(figsize=(13,10))  
sns.heatmap(m, xticklabels=tokenized_sentence_it_pad, yticklabels=tokenized_sentence_it_pad)

##### Final Attention Function

Let's redefine the attention function so that it accepts masks.

In [None]:
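def scaled_dot_product_attention(q, k, v, mask=None):
  d_k = q.shape[-1]
  #torch.matmul keeps the computation in PyTorch (np.matmul would leave it and break autograd)
  scaled = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
  if mask is not None:
    scaled = scaled + mask #masked positions get a very large negative score
  #F.softmax is numerically stable, also for rows that are masked everywhere
  attention = F.softmax(scaled, dim=-1)
  out = torch.matmul(attention, v)
  return out, attention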

### Multi-head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.\
This gives an enriched contextual representation: how do different words relate to each other simultaneously?\
It consists of a set of self-attention layers, called **heads**, that reside in parallel at the same depth in a model, each with its own set of parameters. Given these distinct sets of parameters, each head can learn different aspects of the relationships that exist among inputs at the same level of abstraction.

Each head takes a split of the original $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ matrices. In the original paper, the authors employ $h = 8$ heads, where each head has a reduced dimensionality $d_{model}/h = 64$, with $d_{model}=512$.\
For the $i$-th attention head, 
$$
\mathbf{Q}_i = \mathbf{XW}_i^Q, \mathbf{K}_i = \mathbf{XW}_i^K, \mathbf{V}_i = \mathbf{XW}_i^V; \quad \mathbf{W}_i^Q \in \mathbb{R}^{d_{model}\times d_k}, \mathbf{W}_i^K \in \mathbb{R}^{d_{model}\times d_k}, \mathbf{W}_i^V \in \mathbb{R}^{d_{model}\times d_v} 
$$
$$
\mathbf{head}_i = \text{SelfAttention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)
$$
Then, we simply concatenate all $h$ outputs and project them with $\mathbf{W}^O$:
$$
\text{MultiHeadAttn}(\mathbf{X}) = (\mathbf{head}_1 \oplus \mathbf{head}_2 \oplus \dots \oplus \mathbf{head}_h) \mathbf{W}^O, \quad \mathbf{W}^O \in \mathbb{R}^{hd_{v}\times d_{model}} 
$$
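
These formulas can be sketched literally, with one projection per head and an explicit loop. All sizes and the random weights are toy assumptions for illustration; the cells below use a single stacked linear layer instead of separate per-head matrices.

```python
import math
import torch

torch.manual_seed(0)
n_tokens, d_model, h = 5, 8, 2      # toy sizes, not the notebook's settings
d_k = d_v = d_model // h

x = torch.randn(n_tokens, d_model)
w_q = torch.randn(h, d_model, d_k)  # one W_i^Q per head
w_k = torch.randn(h, d_model, d_k)  # one W_i^K per head
w_v = torch.randn(h, d_model, d_v)  # one W_i^V per head
w_o = torch.randn(h * d_v, d_model)

heads = []
for i in range(h):
    q_i, k_i, v_i = x @ w_q[i], x @ w_k[i], x @ w_v[i]
    weights = torch.softmax(q_i @ k_i.T / math.sqrt(d_k), dim=-1)
    heads.append(weights @ v_i)     # head_i: [n_tokens, d_v]

concat = torch.cat(heads, dim=-1)   # [n_tokens, h * d_v]
out = concat @ w_o                  # [n_tokens, d_model]
print(concat.shape, out.shape)
```

The output has the same shape as the input $\mathbf{X}$, so attention blocks can be stacked on top of each other.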

Let's say $h=4$: then
<center><img src="img/multihead-attn.svg" /></center>

In [None]:
input = en_sentence_embedding(english_batch, start_token = False, end_token = True)
sequence_length = max_sequence_len
print(f"X shape: {input.size()}")

We will compute the Q, K and V matrices with one single linear layer, which outputs them stacked along the last dimension.

In [None]:
qkv_layer = nn.Linear(d_model , 3 * d_model)
qkv = qkv_layer(input)
print(f"QKV stacked matrix shape: {qkv.size()}")

In [None]:
num_heads = 2
head_dim = d_model // num_heads
qkv = qkv.reshape(batch_size, sequence_length, num_heads, 3 * head_dim) #split the last dimension in two, [num_heads and 3 * head_dim]
print(f"QKV stacked matrix shape after split data for {num_heads} heads: {qkv.size()}")

Notice that this shape is equal to $\texttt{[batch\_size, sequence\_length, num\_heads, 3*head\_dim]}$.\
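We want it to be $\texttt{[batch\_size, num\_heads, sequence\_length, 3*head\_dim]}$, so that the last two dimensions describe one head and the matrix products below run independently for each batch element and head.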

In [None]:
qkv = qkv.permute(0, 2, 1, 3) # [batch_size, num_heads, sequence_length, 3*head_dim]
print(f"QKV stacked matrix shape after split data for {num_heads} heads, dimensions permuted: {qkv.size()}")

Now we only have to (actually) split this matrix in 3 sub-matrices to compute attention.

In [None]:
q, k, v = qkv.chunk(3, dim=-1) #breaking down last dimension
print(f"Q matrix from QKV stacked matrix: {q.size()}")
print(f"K matrix from QKV stacked matrix: {k.size()}")
print(f"V matrix from QKV stacked matrix: {v.size()}")

We are ready to compute attention. First, we must determine $d_k$.

In [None]:
d_k = q.size()[-1] #last dimension of q, equal to head_dim = d_model // num_heads

Notice that the matrix multiplication $\mathbf{QK}^\top$ must transpose only the last two dimensions of $\mathbf{K}$, a tensor with shape $\texttt{[batch\_size, num\_heads, sequence\_length, head\_dim]}$, so that the batch and head dimensions are left in place.

In [None]:
scaled = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
print(f"Scaled QK matrix: {scaled.size()}")
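
The division by $\sqrt{d_k}$ can be motivated numerically. The sketch below assumes query and key entries with zero mean and unit variance (an idealization) and uses a toy `d_k_demo = 64`: the raw dot products then have a variance close to $d_k$, while the scaled ones stay close to 1, which keeps the softmax away from saturation.

```python
import math
import torch

torch.manual_seed(0)
d_k_demo = 64
q_demo = torch.randn(10_000, d_k_demo)  # entries with mean 0 and variance 1
k_demo = torch.randn(10_000, d_k_demo)

raw = (q_demo * k_demo).sum(dim=-1)          # 10,000 independent dot products
scaled_scores = raw / math.sqrt(d_k_demo)

print(f"variance of raw dot products:    {raw.var().item():.1f}")
print(f"variance of scaled dot products: {scaled_scores.var().item():.2f}")
```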

Let's compute and apply the mask for this encoder self-attention.

In [None]:
mask = create_encoder_padding_mask(english_batch)
mask = torch.tile(mask.unsqueeze(1), (1, num_heads, 1, 1)) #add a dummy dimension to replicate mask for each split
scaled = scaled + mask

Now that we have the masked scaled input, let's normalize the values through softmax.

In [None]:
attention_weights = F.softmax(scaled, dim=-1) #normalize over the last dimension (key positions), so each query's weights sum to 1
print(f"Attention weights matrix: {attention_weights.size()}")
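
To see why an additive mask works, here is a small sketch on hand-written scores. It assumes the convention of 0 for visible positions and $-\infty$ for padded ones; `create_encoder_padding_mask` is defined earlier in the notebook and may use a different constant.

```python
import torch
import torch.nn.functional as F

# scores of one query over 4 key positions; the last two keys are padding
scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])
pad_mask = torch.tensor([[0.0, 0.0, float("-inf"), float("-inf")]])  # assumed convention

weights = F.softmax(scores + pad_mask, dim=-1)
print(weights)  # padded positions receive exactly zero weight
```

Note that a row in which every position is $-\infty$ would produce NaN values after the softmax, which is one reason why a large finite negative number is sometimes used instead.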

Let's compute the output of each head in multihead attention.

In [None]:
heads_output = torch.matmul(attention_weights, v)
print(f"Output matrix for all heads: {heads_output.size()}")

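The final output is obtained by moving the head dimension back next to the per-head features, reshaping so that the outputs of all heads are concatenated for each token, and mixing them through a linear layer ($\mathbf{W}^O$).

In [None]:
heads_output = heads_output.permute(0, 2, 1, 3) # [batch_size, sequence_length, num_heads, head_dim]
heads_output = heads_output.reshape(input.size()) #concatenate the heads of each token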
print(f"Output matrix for all heads after reshaping, ready for averaging: {heads_output.size()}")
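
Merging the heads back needs some care with the order of the dimensions. The toy check below, with sizes chosen only for illustration, splits a tensor into heads and merges it back in two ways: undoing the permutation before reshaping recovers the original tensor, while reshaping alone does not.

```python
import torch

B, S, H, D = 2, 3, 2, 4                      # toy batch, sequence, heads, head_dim
x = torch.arange(B * S * H * D, dtype=torch.float32).reshape(B, S, H * D)

split = x.reshape(B, S, H, D).permute(0, 2, 1, 3)            # [B, H, S, D]

merged_ok = split.permute(0, 2, 1, 3).reshape(B, S, H * D)   # undo the permute first
merged_mixed = split.reshape(B, S, H * D)                     # reshape only

print(torch.equal(merged_ok, x))     # True
print(torch.equal(merged_mixed, x))  # False: rows now mix different positions
```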

In [None]:
linear_layer = nn.Linear(d_model, d_model)
output = linear_layer(heads_output)
print(f"Final output matrix after the output projection: {output.size()}")

#### Fancy class for Multi-head Attention

In [None]:
def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size()[-1]
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    if mask is not None:
        scaled += mask
    attention = F.softmax(scaled, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention

class MultiheadAttention(nn.Module):

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv_layer = nn.Linear(d_model , 3 * d_model)
        self.linear_layer = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size, sequence_length, input_dim = x.size()
        qkv = self.qkv_layer(x)
        qkv = qkv.reshape(batch_size, sequence_length, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)
        q, k, v = qkv.chunk(3, dim=-1)
        values, attention = scaled_dot_product(q, k, v, mask)
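        values = values.permute(0, 2, 1, 3) #back to [batch_size, sequence_length, num_heads, head_dim]
        values = values.reshape(batch_size, sequence_length, self.num_heads * self.head_dim) #concatenate the heads of each token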
        out = self.linear_layer(values)
        return out

In [None]:
model = MultiheadAttention(d_model, num_heads)
out = model(input)
out
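
As a sanity check on shapes, PyTorch's built-in `nn.MultiheadAttention` can be run on a tensor with the same layout. This is a sketch with toy sizes; the built-in module has its own initialization and masking conventions, so its values are not expected to match the class above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_demo = torch.randn(2, 5, 6)  # [batch, sequence, d_model], toy sizes

builtin_mha = nn.MultiheadAttention(embed_dim=6, num_heads=2, batch_first=True)
attn_out, attn_weights = builtin_mha(x_demo, x_demo, x_demo)  # self-attention: query = key = value

print(attn_out.shape)                    # same shape as the input
print(attn_weights.shape)                # [batch, sequence, sequence], averaged over heads
print(builtin_mha.in_proj_weight.shape)  # fused Q/K/V projection: [3 * d_model, d_model]
```

The fused `in_proj_weight` plays the same role as a single linear layer mapping `d_model` to `3 * d_model`.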

### Layer Normalization

Repeated matrix multiplications can easily make activations explode or vanish.\
Layer normalization mitigates this problem by keeping the values in a range that facilitates gradient-based training.

Given a hidden layer with dimensionality $d_h$, the mean $\mu$ and the standard deviation $\sigma$ of its values are calculated as follows
$$
\mu = \frac{1}{d_h}\sum_{i = 1}^{d_h}x_i
$$
$$
\sigma = \sqrt{\frac{1}{d_h}\sum_{i = 1}^{d_h}(x_i - \mu)^2}
$$
and the normalized values will be computed as 
$$
\hat{\mathbf{x}} = \frac{\mathbf{x}-\mu}{\sigma}
$$
Finally, in the standard implementation of layer normalization, two learnable parameters $\gamma$ and $\beta$, representing gain and offset values, are introduced
$$
\text{LayerNorm} = \gamma \hat{\mathbf{x}}+\beta
$$

We can compute the mean as

In [None]:
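#here we average over the last two dimensions (sequence and embedding);
#the per-token version in the formulas above would use dim=-1 only
mean = out.mean(dim = (-1, -2), keepdim=True) #keepdim=True keeps the reduced dimensions with size 1, so the result broadcasts against out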
print(mean.size())
print(mean)

In [None]:
#same for the standard deviation (torch.std divides by N-1 by default, while the formula above divides by N)
std = out.std(dim = (-1, -2), keepdim=True) #keepdim=True keeps the reduced dimensions with size 1, so the result broadcasts against out
print(std.size())
print(std)

And the normalized output is

In [None]:
parameter_shape = out.size()[-2:]
print(parameter_shape)
gamma = nn.Parameter(torch.ones(parameter_shape))
beta =  nn.Parameter(torch.zeros(parameter_shape))

out_norm = (out - mean)/std
out_norm = gamma * out_norm + beta
print(out_norm.size())

In [None]:
print(f"Old mean and variance:\n{out.mean(dim = (-1, -2), keepdim=True)}\n{out.var(dim = (-1, -2), keepdim=True)}")
print()
print(f"New mean and variance:\n{out_norm.mean(dim = (-1, -2), keepdim=True)}\n{out_norm.var(dim = (-1, -2), keepdim=True)}")

#### Fancy class for Layer Normalization

In [None]:
class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape=parameters_shape
        self.eps=eps
        self.gamma = nn.Parameter(torch.ones(parameters_shape))
        self.beta =  nn.Parameter(torch.zeros(parameters_shape))

    def forward(self, input):
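        dims = tuple(-(i + 1) for i in range(len(self.parameters_shape))) #normalize over the trailing dimensions covered by parameters_shape
        mean = input.mean(dim=dims, keepdim=True)
        var = ((input - mean) ** 2).mean(dim=dims, keepdim=True)
        std = (var + self.eps).sqrt() #adding epsilon to avoid division by zero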
        y = (input - mean) / std
        out = self.gamma * y + self.beta
        return out

In [None]:
out.size()

In [None]:
layer_norm = LayerNormalization(out.size()[-2:])
output_norm = layer_norm(out)
output_norm
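
As a cross-check of the formulas in this section, the per-token computation (statistics over the last dimension only, with $\gamma = 1$ and $\beta = 0$) can be compared with `torch.nn.functional.layer_norm`. The tensor sizes are toy values chosen for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_demo = torch.randn(2, 4, 6)   # [batch, sequence, d_h], toy sizes
eps = 1e-5

mu = x_demo.mean(dim=-1, keepdim=True)                  # one mean per token
var = ((x_demo - mu) ** 2).mean(dim=-1, keepdim=True)   # biased variance (divide by d_h)
x_hat = (x_demo - mu) / torch.sqrt(var + eps)

reference = F.layer_norm(x_demo, normalized_shape=(6,), eps=eps)
print(torch.allclose(x_hat, reference, atol=1e-6))
```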

### Residual Connections

Residual connections pass information from a lower layer to a higher layer without going through the intermediate layer: the input of a sublayer is added to its output, $\mathbf{x} + \text{Sublayer}(\mathbf{x})$.\
They are essential to strengthen the signals in deep neural networks as gradients could vanish when performing a high number of matrix multiplications.
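
A minimal sketch of the idea follows. `ResidualBlock` is a hypothetical helper written for this example, and the sublayer is zero-initialized on purpose, as an extreme case in which it contributes no signal: the input and its gradient still pass through the skip path.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hypothetical helper: adds the sublayer's input back to its output."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(x)

sublayer = nn.Linear(4, 4)
nn.init.zeros_(sublayer.weight)   # extreme case: the sublayer passes no signal at all
nn.init.zeros_(sublayer.bias)

x_demo = torch.randn(3, 4, requires_grad=True)
res_out = ResidualBlock(sublayer)(x_demo)
res_out.sum().backward()

print(torch.allclose(res_out, x_demo))                         # the input passes through unchanged
print(torch.allclose(x_demo.grad, torch.ones_like(x_demo)))    # and so does the gradient
```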

## Encoder

In [None]:
def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size()[-1]
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    if mask is not None:
        scaled = scaled.permute(1, 0, 2, 3) + mask #swap batch and head dimensions to match dimensions with mask
        scaled = scaled.permute(1, 0, 2, 3) #resetting
    attention = F.softmax(scaled, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv_layer = nn.Linear(d_model , 3 * d_model)
        self.linear_layer = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size, sequence_length, input_dim = x.size()
        qkv = self.qkv_layer(x)
        qkv = qkv.reshape(batch_size, sequence_length, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)
        q, k, v = qkv.chunk(3, dim=-1)
        values, attention = scaled_dot_product(q, k, v, mask)
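        values = values.permute(0, 2, 1, 3) #back to [batch_size, sequence_length, num_heads, head_dim]
        values = values.reshape(batch_size, sequence_length, self.num_heads * self.head_dim) #concatenate the heads of each token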
        out = self.linear_layer(values)
        return out


class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape=parameters_shape
        self.eps=eps
        self.gamma = nn.Parameter(torch.ones(parameters_shape))
        self.beta =  nn.Parameter(torch.zeros(parameters_shape))

    def forward(self, input):
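        dims = tuple(-(i + 1) for i in range(len(self.parameters_shape))) #normalize over the trailing dimensions covered by parameters_shape
        mean = input.mean(dim=dims, keepdim=True)
        var = ((input - mean) ** 2).mean(dim=dims, keepdim=True)
        std = (var + self.eps).sqrt() #adding epsilon to avoid division by zero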
        y = (input - mean) / std
        out = self.gamma * y + self.beta
        return out

   
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, hidden, drop_prob=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, hidden)
        self.linear2 = nn.Linear(hidden, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)

    def forward(self, x):
        x = self.linear1(x)
        #non-linearity, so the network can model non-linear patterns
        x = self.relu(x)
        #dropout for regularization
        x = self.dropout(x)
        #project back to d_model dimensions
        x = self.linear2(x)
        return x

  
class EncoderLayer(nn.Module):
    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
        self.norm1 = LayerNormalization(parameters_shape=[d_model])
        self.dropout1 = nn.Dropout(p=drop_prob)
        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm2 = LayerNormalization(parameters_shape=[d_model])
        self.dropout2 = nn.Dropout(p=drop_prob)

    def forward(self, x, self_attention_mask):
        residual_x = x
        x = self.attention(x, mask= self_attention_mask)
        x = self.dropout1(x)
        x = self.norm1(x + residual_x)
        residual_x = x
        x = self.ffn(x)
        x = self.dropout2(x)
        x = self.norm2(x + residual_x)
        return x

class SequentialEncoder(nn.Sequential):
    def forward(self, *inputs):
        x, self_attention_mask  = inputs
        for module in self._modules.values():
            x = module(x, self_attention_mask)
        return x

class Encoder(nn.Module):
    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob, num_layers):
        super().__init__()
        #more layers for better vector representation for words and context
        self.layers = SequentialEncoder(*[EncoderLayer(d_model, ffn_hidden, num_heads, drop_prob)
                                      for _ in range(num_layers)])

    def forward(self, x, self_attention_mask):
        x = self.layers(x, self_attention_mask)
        return x
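
The feed-forward sublayer is called position-wise because the same two linear layers are applied to every token independently. The sketch below illustrates this with a stand-in network (no dropout, toy sizes, both assumptions made for a deterministic example): applying it to the whole tensor gives the same result as applying it token by token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ffn_demo = nn.Sequential(nn.Linear(6, 24), nn.ReLU(), nn.Linear(24, 6))  # stand-in, no dropout

x_demo = torch.randn(2, 5, 6)                 # [batch, sequence, d_model]
whole = ffn_demo(x_demo)                      # applied to the whole tensor at once
per_token = torch.stack([ffn_demo(x_demo[:, t]) for t in range(x_demo.size(1))], dim=1)

print(torch.allclose(whole, per_token, atol=1e-6))
```

Tokens therefore exchange information only inside the attention sublayer.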

In [None]:
d_model = 512
num_heads = 8
drop_prob = 0.1
batch_size = 30
max_sequence_length = 200
ffn_hidden = 2048 #inner dimension of the feed-forward layer, as in the paper
num_layers = 5 #number of stacked encoder layers (the paper uses 6)

encoder = Encoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers)
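
To get a feeling for the size of this encoder, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer layout defined above (a fused Q/K/V projection, an output projection, a two-layer feed-forward network, and two layer normalizations with parameters of shape `[d_model]`); the helper names are made up for the example.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def encoder_layer_params(d_model, ffn_hidden):
    attention = linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)
    feed_forward = linear_params(d_model, ffn_hidden) + linear_params(ffn_hidden, d_model)
    layer_norms = 2 * (2 * d_model)   # gamma and beta, for each of the two norms
    return attention + feed_forward + layer_norms

per_layer = encoder_layer_params(d_model=512, ffn_hidden=2048)
print(per_layer)       # 3152384
print(5 * per_layer)   # 15761920 for five stacked layers
```

The number of heads does not appear in the count: splitting `d_model` across heads changes how the projections are used, not how many weights they hold.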

In [None]:
en_sentence_embedding = SentenceEmbedding(max_sequence_length, d_model, english_to_index, nlp_en.tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
english_batch = next(iterator)[0]
input = en_sentence_embedding(english_batch, start_token = False, end_token = True)
out_encoder = encoder(input, create_encoder_padding_mask(english_batch))
out_encoder

## Decoder

In [None]:
class MultiHeadCrossAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.kv_layer = nn.Linear(d_model , 2 * d_model)
        self.q_layer = nn.Linear(d_model , d_model)
        self.linear_layer = nn.Linear(d_model, d_model)
    
    def forward(self, x, y, mask):
        batch_size, sequence_length, d_model = x.size()
        kv = self.kv_layer(x)
        q = self.q_layer(y)
        kv = kv.reshape(batch_size, sequence_length, self.num_heads, 2 * self.head_dim)
        q = q.reshape(batch_size, sequence_length, self.num_heads, self.head_dim)
        kv = kv.permute(0, 2, 1, 3)
        q = q.permute(0, 2, 1, 3)
        k, v = kv.chunk(2, dim=-1)
        values, attention = scaled_dot_product(q, k, v, mask)
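        values = values.permute(0, 2, 1, 3) #back to [batch_size, sequence_length, num_heads, head_dim]
        values = values.reshape(batch_size, sequence_length, d_model) #concatenate the heads of each token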
        out = self.linear_layer(values)
        return out


class DecoderLayer(nn.Module):
    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob):
        super(DecoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
        self.layer_norm1 = LayerNormalization(parameters_shape=[d_model])
        self.dropout1 = nn.Dropout(p=drop_prob)

        self.encoder_decoder_attention = MultiHeadCrossAttention(d_model=d_model, num_heads=num_heads)
        self.layer_norm2 = LayerNormalization(parameters_shape=[d_model])
        self.dropout2 = nn.Dropout(p=drop_prob)

        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.layer_norm3 = LayerNormalization(parameters_shape=[d_model])
        self.dropout3 = nn.Dropout(p=drop_prob)

    def forward(self, x, y, self_attention_mask, cross_attention_mask):
        _y = y.clone()
        y = self.self_attention(y, mask=self_attention_mask)
        y = self.dropout1(y)
        y = self.layer_norm1(y + _y)

        _y = y.clone()
        y = self.encoder_decoder_attention(x, y, mask=cross_attention_mask)
        y = self.dropout2(y)
        y = self.layer_norm2(y + _y)

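        _y = y.clone()
        y = self.ffn(y)
        y = self.dropout3(y) #dropout3 is defined in __init__ and applied here, as after the other sublayers
        y = self.layer_norm3(y + _y)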
        return y


class SequentialDecoder(nn.Sequential):
    def forward(self, *inputs):
        x, y, self_attention_mask, cross_attention_mask = inputs
        for module in self._modules.values():
            y = module(x, y, self_attention_mask, cross_attention_mask)
        return y

class Decoder(nn.Module):
    def __init__(self, 
                 d_model, 
                 ffn_hidden, 
                 num_heads, 
                 drop_prob, 
                 num_layers):
        super().__init__()
        self.layers = SequentialDecoder(*[DecoderLayer(d_model, ffn_hidden, num_heads, drop_prob) for _ in range(num_layers)])

    def forward(self, x, y, self_attention_mask, cross_attention_mask):
        y = self.layers(x, y, self_attention_mask, cross_attention_mask)
        return y
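
In cross-attention the queries come from the decoder while keys and values come from the encoder. The class above reads a single `sequence_length` from `x`, which works here because both sequences are padded to the same maximum length. In general the two lengths may differ, as this shape-only sketch with toy sizes shows.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, head_dim_demo = 2, 4, 8
src_len, tgt_len = 7, 5                                    # deliberately different

q_demo = torch.randn(batch, heads, tgt_len, head_dim_demo)  # queries come from the decoder
k_demo = torch.randn(batch, heads, src_len, head_dim_demo)  # keys and values come from the encoder
v_demo = torch.randn(batch, heads, src_len, head_dim_demo)

scores = q_demo @ k_demo.transpose(-2, -1) / math.sqrt(head_dim_demo)  # [batch, heads, tgt_len, src_len]
weights = F.softmax(scores, dim=-1)   # each target token distributes its weight over the source tokens
values = weights @ v_demo             # [batch, heads, tgt_len, head_dim]

print(scores.shape, values.shape)
```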

In [None]:
d_model = 512
num_heads = 8
drop_prob = 0.1
batch_size = 30
max_sequence_length = 200
ffn_hidden = 2048
num_layers = 5

it_sentence_embedding = SentenceEmbedding(max_sequence_length, d_model, italian_to_index, nlp_it.tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
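italian_batch = next(iterator)[1] #note: this advances the iterator, so these sentences are not the translations of english_batch; fine for a shape check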
input_it = it_sentence_embedding(italian_batch, start_token = True, end_token = True)
self_attention_decoder_mask = create_decoder_masked_attention_mask(italian_batch)
cross_attention_mask = create_decoder_padding_mask(english_batch, italian_batch)
decoder = Decoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers)
out_decoder = decoder(out_encoder, input_it, self_attention_decoder_mask, cross_attention_mask)
out_decoder
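
The decoder's self-attention mask prevents each position from attending to later ones. A minimal sketch of such a look-ahead mask is shown below; it assumes the additive convention of 0 for allowed positions and $-\infty$ for forbidden ones, and it ignores padding, which `create_decoder_masked_attention_mask` (defined earlier in the notebook) presumably handles as well.

```python
import torch
import torch.nn.functional as F

seq_len = 4
# -inf strictly above the diagonal, 0 elsewhere (assumed convention)
look_ahead = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.zeros(seq_len, seq_len)            # uniform scores, to make the effect visible
weights = F.softmax(scores + look_ahead, dim=-1)
print(weights)
```

Row $t$ of `weights` spreads its mass uniformly over positions $0, \dots, t$ and assigns zero to every later position.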

## Putting it all together

In [None]:
class SentenceEmbedding(nn.Module):
    def __init__(self, max_sequence_length, d_model, language_to_index, lang_tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN):
        super().__init__()
        self.vocab_size = len(language_to_index)
        self.max_sequence_length = max_sequence_length
        self.embedding = nn.Embedding(self.vocab_size, d_model)
        self.language_to_index = language_to_index
        self.language_tokenizer = lang_tokenizer
        self.position_encoder = PositionalEncoding(d_model, max_sequence_length)
        self.dropout = nn.Dropout(p=0.1)
        self.START_TOKEN = START_TOKEN
        self.END_TOKEN = END_TOKEN
        self.PADDING_TOKEN = PADDING_TOKEN
    
    def batch_tokenize(self, batch, start_token, end_token):

        def tokenize(sentence, start_token, end_token):
            try:
                sentence_word_ids = [self.language_to_index[token.text] for token in self.language_tokenizer(sentence)]
            except KeyError as err:
                raise KeyError(f'Invalid input token: {err.args[0]!r} is not in the vocabulary') from err
            if start_token:
                sentence_word_ids.insert(0, self.language_to_index[self.START_TOKEN])
            if end_token:
                sentence_word_ids.append(self.language_to_index[self.END_TOKEN])
            for _ in range(len(sentence_word_ids), self.max_sequence_length):
                sentence_word_ids.append(self.language_to_index[self.PADDING_TOKEN])
            
            return torch.tensor(sentence_word_ids)

        tokenized = []
        for sentence in batch:
            tokenized.append(tokenize(sentence, start_token, end_token))
        tokenized = torch.stack(tokenized)
        return tokenized.to(get_device())
    
    def forward(self, x, start_token, end_token):
        x = self.batch_tokenize(x, start_token, end_token)
        x = self.embedding(x)
        pos = self.position_encoder().to(get_device())
        x = self.dropout(x + pos)
        return x
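
The tokenization step can be tried in isolation with plain Python. The sketch below is a simplified stand-in: it uses whitespace splitting instead of the spaCy tokenizer, a toy vocabulary, and a hypothetical function name `encode_sentence`. The explicit length check is an addition; without one, a sentence longer than the maximum length yields a row of a different size, which can make `torch.stack` fail.

```python
def encode_sentence(sentence, token_to_index, max_len,
                    start="<START>", end="<END>", pad="<PADDING>",
                    add_start=False, add_end=True):
    """Map a sentence to a fixed-length list of token ids (whitespace tokenizer as a stand-in)."""
    ids = [token_to_index[tok] for tok in sentence.split()]
    if add_start:
        ids.insert(0, token_to_index[start])
    if add_end:
        ids.append(token_to_index[end])
    if len(ids) > max_len:
        raise ValueError(f"sentence needs {len(ids)} tokens, but max_len is {max_len}")
    return ids + [token_to_index[pad]] * (max_len - len(ids))

toy_vocab = ["<START>", "<PADDING>", "i", "saw", "a", "raccoon", "<END>"]
toy_index = {tok: i for i, tok in enumerate(toy_vocab)}

print(encode_sentence("i saw a raccoon", toy_index, max_len=8))
# [2, 3, 4, 5, 6, 1, 1, 1]
print(encode_sentence("i saw a raccoon", toy_index, max_len=8, add_start=True))
# [0, 2, 3, 4, 5, 6, 1, 1]
```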


In [None]:
class Encoder(nn.Module):
    def __init__(self, 
                 d_model, 
                 ffn_hidden, 
                 num_heads, 
                 drop_prob, 
                 num_layers,
                 max_sequence_length,
                 language_to_index,
                 language_tokenizer,
                 START_TOKEN,
                 END_TOKEN, 
                 PADDING_TOKEN):
        super().__init__()
        self.sentence_embedding = SentenceEmbedding(max_sequence_length, d_model, language_to_index, language_tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
        self.layers = SequentialEncoder(*[EncoderLayer(d_model, ffn_hidden, num_heads, drop_prob)
                                      for _ in range(num_layers)])

    def forward(self, x, self_attention_mask):
        #self_attention_mask is built by the caller from the raw sentences, before x is replaced by its embeddings
        x = self.sentence_embedding(x, start_token = False, end_token = True)
        x = self.layers(x, self_attention_mask)
        return x

class Decoder(nn.Module):
    def __init__(self, 
                 d_model, 
                 ffn_hidden, 
                 num_heads, 
                 drop_prob, 
                 num_layers,
                 max_sequence_length,
                 language_to_index, 
                 language_tokenizer,
                 START_TOKEN,
                 END_TOKEN, 
                 PADDING_TOKEN):
        super().__init__()
        self.sentence_embedding = SentenceEmbedding(max_sequence_length, d_model, language_to_index, language_tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
        self.layers = SequentialDecoder(*[DecoderLayer(d_model, ffn_hidden, num_heads, drop_prob) for _ in range(num_layers)])

    def forward(self, x, y, decoder_self_attention_mask, decoder_cross_attention_mask):
        y = self.sentence_embedding(y, start_token = True, end_token = True)
        y = self.layers(x, y, decoder_self_attention_mask, decoder_cross_attention_mask)
        return y

In [None]:
class Transformer(nn.Module):
    def __init__(self, 
                d_model, 
                ffn_hidden, 
                num_heads, 
                drop_prob, 
                num_layers,
                max_sequence_length, 
                english_to_index,
                english_tokenizer,
                italian_to_index,
                italian_tokenizer,
                index_to_italian,
                START_TOKEN, 
                END_TOKEN, 
                PADDING_TOKEN,
                logging = False
                ):
        super().__init__()
        self.encoder = Encoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers, max_sequence_length, english_to_index, english_tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
        self.decoder = Decoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers, max_sequence_length, italian_to_index, italian_tokenizer, START_TOKEN, END_TOKEN, PADDING_TOKEN)
        self.linear = nn.Linear(d_model, len(italian_to_index))
        self.softmax = nn.Softmax(dim = -1)
        self.device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        self.index_to_italian = index_to_italian
        if logging:
            print('TRANSFORMER - architecture')
            print(f'Internal embeddings dimension (d_model): {d_model}')
            print(f'Hidden feedforward layer dimension (ffn_hidden): {ffn_hidden}')
            print(f'Number of attention heads in Multi-head attention (num_heads): {num_heads}')
            print(f'Dropout probability in Dropout layer (drop_prob): {drop_prob}')
            print(f'Number of Encoder/Decoder layers (num_layers): {num_layers}')
            print(f'Maximum numbers of tokens in an input sequence (max_sequence_length): {max_sequence_length}')

    def forward(self, 
                x, 
                y):
        encoder_self_attention_mask = create_encoder_padding_mask(x)
        decoder_self_attention_mask = create_decoder_masked_attention_mask(y)
        decoder_cross_attention_mask = create_decoder_padding_mask(x, y)
        x = self.encoder(x, encoder_self_attention_mask)
        out = self.decoder(x, y, decoder_self_attention_mask, decoder_cross_attention_mask)
        out = self.linear(out)
        out = self.softmax(out)
        out = torch.argmax(out, dim = -1) #greedy choice: most probable token id at each position
        out_sentences = []
        for sentence in out:
            out_sentences.append(' '.join([self.index_to_italian[idx.item()] for idx in sentence]))
        return out, out_sentences
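
A small remark on the last steps of `forward`: since the softmax is monotonic, taking the argmax of the probabilities selects the same token ids as taking the argmax of the raw linear outputs. The sketch below checks this on random values with toy sizes.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(2, 5, 10)   # [batch, sequence, vocabulary], toy sizes

from_probs = torch.argmax(torch.softmax(logits, dim=-1), dim=-1)
from_logits = torch.argmax(logits, dim=-1)

print(torch.equal(from_probs, from_logits))
print(from_logits.shape)   # one token id per position: [batch, sequence]
```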

In [None]:
d_model = 512
ffn_hidden = 2048
num_heads = 8
drop_prob = 0.1
num_layers = 6
max_sequence_length = 200

In [None]:
transformer_model = Transformer(d_model,
                                ffn_hidden,
                                num_heads, 
                                drop_prob,
                                num_layers,
                                max_sequence_length,
                                english_to_index,
                                nlp_en.tokenizer,
                                italian_to_index,
                                nlp_it.tokenizer,
                                index_to_italian,
                                START_TOKEN,
                                END_TOKEN,
                                PADDING_TOKEN,
                                logging=True)

In [None]:
dataset_1 = TextDataset(['my favourite animal is the raccoon.',
                        'yesterday i saw a raccoon.'], 
                        ['',
                        ''])
batch_size_1 = 2
train_loader_1 = DataLoader(dataset_1, batch_size_1)
iterator_1 = iter(train_loader_1)
input_batch_1 = next(iterator_1)

In [None]:
output = transformer_model(*input_batch_1)

In [None]:
output[0].size()

In [None]:
output[1]

## References

Theory & Intuition
- **[Vaswani et al., 2017]** Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 2017. https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- *"Speech and Language Processing"* (2022), Dan Jurafsky and James H. Martin, Chapters 9, 10, 11
- *"Illustrated Guide to Transformers Neural Network: A step by step explanation"*: https://youtu.be/4Bdc55j80l8
- *"Transformers, explained: Understand the model behind GPT, BERT, and T5"*: https://youtu.be/SZorAJ4I-sA
- *"Transformer Neural Networks - EXPLAINED! (Attention is all you need)"*: https://youtu.be/TQQlZhbC5ps
- *"The complete guide to Transformer neural Networks!"*: https://youtu.be/Nw_PJdmydZY

Code
- *"Transformers from scratch"* playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4