# Attention Architecture

In this notebook we dive into the attention mechanism in sequence-to-sequence (Seq2Seq) models to understand how it works and how it can be implemented in PyTorch.

This is needed so that we can later, in another notebook, understand the Transformer model and apply it correctly.
To read up on the attention mechanism, I recommend [this blog article](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/), which explains the concept in a visual yet detailed way.
Additionally, the PyTorch tutorials give a good explanation of the subject: [Link](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#training)

In [1]:
import pandas as pd
import torch
import torch.nn.functional as F
import torch.optim as optim
from torch import nn
# import sys
# sys.path.insert(0, '.') # searches this directory too -> ensures that imports work fine.

# from modules.tatoeba import TatoebaDataset
from modules.tokenizer import Tokenizer

In [2]:
from modules.tatoeba import TatoebaDataset

In [3]:
data = pd.read_csv('data/deu-eng/deu.txt', sep="\t", names=['en', 'de', 'license'])
ttds = TatoebaDataset(data, max_length=10)

In [4]:
en, de = ttds[0]

>> s='go .'


RuntimeError: Could not infer dtype of NoneType

## The Encoder Architecture

The encoder takes our "word-ids" (remember the Tokenizer) and encodes them into a vector: the hidden representation. Later on, our decoder network will use this hidden representation to generate the needed output.

Let's first have a look at an important component for many NLP-tasks: The `Embedding`-Layer.

In [5]:
vocabulary_size = ttds.vocab_size()[0] # number of tokens in English
hidden_dimension = 16
query_index = 250
emb = nn.Embedding(vocabulary_size, hidden_dimension)

sample_sentence_eng, sample_sentence_deu = ttds[query_index]
print(f"Sample Sentence: {ttds.df.iloc[query_index,0]} -> {ttds.df.iloc[query_index, 3]}\n{sample_sentence_eng}\nShape: {sample_sentence_eng.shape}")

output = emb(sample_sentence_eng)
print(f"\nEmbedding Output: \n{output} \nShape: {output.shape}")

>> s=array(['keep it .', 'behalt es !'], dtype=object)
Sample Sentence: Keep it. -> keep it .
tensor([[3153],
        [3889],
        [3952],
        [   1]])
Shape: torch.Size([4, 1])

Embedding Output: 
tensor([[[ 0.8552,  1.4405, -0.2495, -0.1888,  0.5496, -0.0926, -0.9891,
          -0.4037, -0.1124, -1.5778, -0.3150, -0.2862,  1.1753,  0.2026,
           0.1486, -0.3349]],

        [[ 0.6217, -0.7885, -1.2634, -1.0896,  0.3202,  0.3956,  0.7785,
           2.0619,  0.7042,  0.8439, -0.0092,  0.5039, -0.0093,  0.6477,
           1.4604, -1.4763]],

        [[-0.5785,  0.1268,  2.0047, -1.0455,  0.4273,  1.5664, -0.1494,
          -1.2897, -1.0611, -0.8506,  0.7665,  1.4411, -0.2000, -1.2499,
          -0.3104,  0.4136]],

        [[ 1.2055, -0.6775, -0.9319, -1.1549,  1.1883,  0.3843,  1.3659,
           1.3357, -1.6840,  0.3782,  1.0045, -0.8648,  1.2299,  1.7879,
           0.6085,  0.3778]]], grad_fn=<EmbeddingBackward>) 
Shape: torch.Size([4, 1, 16])


What we can see here is that the variable-length input, in our case 4 token IDs (the 3 tokens of the sentence plus the end-of-sentence token with ID 1), gets transformed into 4 vectors of size 16 (= hidden_dimension).

These 16-dimensional vectors are initialized randomly, but they are not fixed: during training, the model adjusts them so that they encode our tokens in a way that is useful for the task.
(If we have time, we can look into the resulting embeddings. Spoiler: Similar tokens get similar vectors. How 'similar' is defined remains open here.)
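
To make the lookup-table behaviour of `nn.Embedding` more concrete, here is a minimal sketch. The toy sizes (`vocab_size = 10`, `hidden_dim = 4`) and the token ids are made up for illustration and are not taken from our dataset.

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible random initialization

vocab_size, hidden_dim = 10, 4  # toy sizes, chosen only for illustration
emb = nn.Embedding(vocab_size, hidden_dim)

token_ids = torch.tensor([3, 7, 3])  # the same id appears twice
vectors = emb(token_ids)             # shape: (3, hidden_dim)

# Each output row is simply the row of the weight matrix selected by the id.
same_as_lookup = torch.equal(vectors, emb.weight[token_ids])
# Both occurrences of id 3 are mapped to the identical vector.
repeated_ids_match = torch.equal(vectors[0], vectors[2])

print(vectors.shape, same_as_lookup, repeated_ids_match)
```

Because `emb.weight` is an ordinary trainable parameter, the optimizer can update these rows during training like any other weight.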

### PyTorch Implementation of the Encoder Architecture

In [6]:
from torch import nn

In [7]:
class Encoder(nn.Module):

    def __init__(self, input_size, hidden_dimension):
        """
        param: input_size: The size of our vocabulary.
        param: hidden_dimension: The size of our latent space (-> how much our model can memorize)
        """

        # call the init method of the super-class `nn.Module` to initialize important things
        super(Encoder, self).__init__()
        self.hidden_dimension = hidden_dimension # store for later

        # create Embedding layer
        self.embedding_layer = nn.Embedding(input_size, hidden_dimension)

        # create a GRU cell / layer
        self.recurrent_layer = nn.GRU(hidden_dimension, hidden_dimension)

    def forward(self, x, h):
        """
        Computes one step of the forward pass through our encoder.
        Note: We have two inputs because a recurrent network needs the hidden
        state of the previous timestep (h) in addition to the current token (x).
        """

        embedding = self.embedding_layer(x).view(1,1, -1)
        x_new, h_new = self.recurrent_layer(embedding, h)

        return x_new, h_new
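
The encoder above is called once per token and carries the hidden state from call to call. As a small sanity check of that idea, the sketch below compares stepping through a sequence token by token with handing the whole sequence to `nn.GRU` in a single call. The sizes and the random stand-in for embedded tokens are assumptions made only for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_dim, seq_len = 8, 5  # toy sizes for illustration
gru = nn.GRU(hidden_dim, hidden_dim)

# Stand-in for already-embedded tokens: (seq_len, batch=1, hidden_dim).
embedded = torch.randn(seq_len, 1, hidden_dim)

# Variant A: one token per call, carrying the hidden state forward.
h = torch.zeros(1, 1, hidden_dim)
step_outputs = []
for t in range(seq_len):
    out, h = gru(embedded[t].view(1, 1, -1), h)
    step_outputs.append(out)
step_outputs = torch.cat(step_outputs, dim=0)

# Variant B: the whole sequence in a single call.
all_outputs, h_final = gru(embedded, torch.zeros(1, 1, hidden_dim))

outputs_match = torch.allclose(step_outputs, all_outputs, atol=1e-5)
hidden_match = torch.allclose(h, h_final, atol=1e-5)
print(outputs_match, hidden_match)
```

Both variants should agree up to floating-point tolerance. The token-by-token form is kept in this notebook because it makes the role of the hidden state explicit.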

## The Decoder Architecture

Now that we have explored the Encoder, the Decoder holds few surprises.
It basically performs the same steps, but in reverse:

Given a vector of a certain dimensionality, it tries to produce values that can be mapped back to our tokens. Of course, one important question is: how does it reproduce tokens?

Easy answer: the last layer of our Decoder is a simple fully connected linear layer whose output vector has `vocab_size` entries.
That means that for the right token, for example "hello" -> 2143, the resulting vector should have its largest value at the right position (here: index 2143).
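
The following sketch illustrates how such an output layer can be read. It assumes a log-softmax on top of the linear layer and uses toy sizes, a random stand-in for the hidden state, and a hypothetical target id (7); none of these values come from our dataset.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_dim, vocab_size = 16, 50  # toy sizes for illustration

out_layer = nn.Linear(hidden_dim, vocab_size)
log_softmax = nn.LogSoftmax(dim=1)

hidden_state = torch.randn(1, hidden_dim)         # stand-in for the GRU output
log_probs = log_softmax(out_layer(hidden_state))  # shape: (1, vocab_size)

predicted_id = torch.argmax(log_probs, dim=1)     # the position that "lights up"
total_prob = log_probs.exp().sum().item()         # the probabilities sum to 1

# NLLLoss reads off the negative log-probability of the correct token.
target_id = torch.tensor([7])                     # hypothetical correct token id
loss = nn.NLLLoss()(log_probs, target_id)

print(predicted_id, total_prob, loss.item())
```

With an untrained layer the predicted id is arbitrary. Training pushes the log-probability at the target position up, which is the same as pushing this loss down.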

In [8]:
class Decoder(nn.Module):
    def __init__(self, hidden_dimension, output_size):
        """
        param: output_size: The size of our target language's vocabulary.
        """
        super(Decoder, self).__init__()
        self.hidden_dimension = hidden_dimension

        # create embedding layer
        self.embedding_layer = nn.Embedding(output_size, hidden_dimension)

        # another recurrent layer (so we look at the last input again)
        self.gru_layer = nn.GRU(hidden_dimension, hidden_dimension)

        # our output layer for the token
        self.out_layer = nn.Linear(hidden_dimension, output_size)

        # log-softmax turns the scores into log-probabilities (the input nn.NLLLoss expects)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, h):
        """
        Performs the forward computation in our Decoder.
        """
        x_new = self.embedding_layer(x).view(1, 1, -1)
        x_new = F.relu(x_new)
        x_new, hidden = self.gru_layer(x_new, h)
        x_new = self.softmax(self.out_layer(x_new[0]))
        return x_new, hidden

## Test: One forward pass through our networks.

In [9]:
eng_size, de_size = ttds.vocab_size()
hidden_dimension = 16
query_index = 250

encoder = Encoder(eng_size, hidden_dimension)
decoder = Decoder(hidden_dimension, de_size)

sample_sentence_eng, sample_sentence_deu = ttds[query_index]
sentence_length_eng = sample_sentence_eng.shape[0] # number of tokens
print(f"Sample: {ttds.df.iloc[query_index, 0]}")

>> s=array(['keep it .', 'behalt es !'], dtype=object)
Sample: Keep it.


In [10]:
h_enc = torch.zeros(1,1, hidden_dimension) #  initial hidden vector is just zeros

for token_idx in range(sentence_length_eng):
    token = sample_sentence_eng[token_idx]
    x_enc, h_enc = encoder(token, h_enc)

x_enc, x_enc.shape

(tensor([[[ 0.0870,  0.2710, -0.4172,  0.2653, -0.3467,  0.2383, -0.2118,
           -0.5551,  0.3691, -0.3275, -0.4928,  0.1960, -0.1272,  0.0613,
            0.0291,  0.1308]]], grad_fn=<StackBackward>),
 torch.Size([1, 1, 16]))

In [11]:
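h_dec = h_enc  # the decoder starts from the encoder's final hidden state
x_dec = torch.tensor([ttds.t1['SOS']], dtype=torch.long)  # start-of-sentence token

sentence_length_deu = sample_sentence_deu.shape[0]

for token_idx in range(sentence_length_deu):
    output, h_dec = decoder(x_dec, h_dec)
    # print the predicted token id; the decoder is untrained, so the ids are arbitrary
    print(torch.argmax(output))
    x_dec = sample_sentence_deu[token_idx]  # feed the correct token as the next input
x_dec.shape

torch.Size([1])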

## Putting it all together - for now

Although you may already see points where we can improve our model, I'd like to take a short break and test things as they are right now.

This means putting together our training pipeline to really "learn" from the samples we provide the model.

In [12]:
MAX_LENGTH = 15
HIDDEN_DIMENSION = 32

In [13]:
def training_iteration(input_x, target, encoder, decoder, encoder_optim, decoder_optim, loss_function):

    # initial hidden vector of the encoder is just zeros
    h_enc = torch.zeros(1, 1, encoder.hidden_dimension)

    # reset optimizers and loss value
    encoder_optim.zero_grad()
    decoder_optim.zero_grad()
    loss = 0

    # ENCODER: iterate over input tokens
    outputs = torch.zeros(MAX_LENGTH, encoder.hidden_dimension)
    for idx in range(input_x.shape[0]):
        token = input_x[idx]
        x_enc, h_enc = encoder(token, h_enc)
        outputs[idx] = x_enc[0, 0]  # store the output of each step (not used yet)

    # DECODER
    # the start-of-sentence token must be an integer (long) tensor for the embedding layer
    x_dec = torch.tensor([ttds.t1['SOS']], dtype=torch.long)
    # init hidden representation of decoder with resulting hidden representation of encoder
    h_dec = h_enc
    for idx in range(target.shape[0]):
        output, h_dec = decoder(x_dec, h_dec)
        loss += loss_function(output, target[idx]) # accumulate loss for decoder
        x_dec = target[idx] # teacher forcing: feed the correct token as the next input

    # OPTIMIZATION
    loss.backward() # propagate the accumulated loss backwards through decoder and encoder
    encoder_optim.step()
    decoder_optim.step()

    # average loss per token
    return loss.item() / target.shape[0]
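
The decoder loop above uses teacher forcing: the next decoder input is the correct target token, not the token the model just predicted. The sketch below contrasts the two choices. The helper `next_decoder_input` and the toy probabilities are hypothetical and exist only for this illustration.

```python
import torch

def next_decoder_input(log_probs, target_token, teacher_forcing):
    """Select the token that is fed to the decoder at the next step.

    log_probs:     (1, vocab_size) output of the decoder for the current step
    target_token:  (1,) ground-truth token id for the current step
    """
    if teacher_forcing:
        # Training: continue from the correct token, even if the model was wrong.
        return target_token
    # Inference: continue from the model's own best guess.
    # detach() makes explicit that no gradient flows through the chosen id.
    return torch.argmax(log_probs, dim=1).detach()

# Toy example: the model prefers token 2, the correct token is 4.
log_probs = torch.log(torch.tensor([[0.1, 0.1, 0.6, 0.1, 0.1]]))
target = torch.tensor([4])

forced = next_decoder_input(log_probs, target, teacher_forcing=True)
greedy = next_decoder_input(log_probs, target, teacher_forcing=False)
print(forced, greedy)  # tensor([4]) tensor([2])
```

Teacher forcing tends to make early training more stable because one wrong prediction does not derail the rest of the sentence. At inference time no target is available, so the model has to continue from its own predictions.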

In [14]:
def train(encoder, decoder, dataset, iterations=1000):

    # create optimizers as well as loss function
    encoder_optim = optim.SGD(encoder.parameters(), lr=0.01)
    decoder_optim = optim.SGD(decoder.parameters(), lr=0.01)
    loss_function = nn.NLLLoss()

    num_samples = len(dataset)

    for iteration in range(iterations):
        # choose a random training pair; .item() converts the one-element tensor to a plain int,
        # so that indexing returns a single row instead of a one-row DataFrame
        query_idx = torch.randint(low=0, high=num_samples, size=(1,)).item()
        print(f"Querying sentence at {query_idx}")
        s0, s1 = dataset.df.loc[query_idx, [dataset.langs[0] + '_clean', dataset.langs[1] + '_clean']].values
        print(f"{s0=} and {s1=}")
        source_tensor, target_tensor = dataset[query_idx]

        loss = training_iteration(source_tensor, target_tensor, 
                encoder, decoder, encoder_optim, decoder_optim, loss_function
        )
        print(f"Iteration {iteration:03} | Loss: {loss}")

In [15]:
del TatoebaDataset
from modules.tatoeba import TatoebaDataset

In [16]:
dataset = TatoebaDataset(data, max_length=MAX_LENGTH, debug=True)
eng_size, deu_size = dataset.vocab_size()
encoder = Encoder(eng_size, HIDDEN_DIMENSION)
decoder = Decoder(HIDDEN_DIMENSION, deu_size)

In [17]:
train(encoder, decoder, dataset)

In [110]:
torch.Tensor([1])

tensor([1.])

In [22]:
df = dataset.df

In [35]:
v0 = [dataset.t0[s] for s in s0.split(" ") if s != ""] + [dataset.t0['EOS']]
v0

[46036, 45500, 29177, 46952, 1]

In [39]:
[dataset.t0[s] for s in s0.split(" ") if s != ""]

[46036, 45500, 29177, 46952]

In [23]:
s0, s1 = dataset.df.loc[2146, [dataset.langs[0] + '_clean', dataset.langs[1] + '_clean']].values
print(f"{s0=} and {s1=}")

s0='come get it .' and s1='komm und hol es dir .'


In [28]:
dataset.df.iloc[2143, 3:5].values

array(['come closer .', 'komm naher .'], dtype=object)

In [6]:
s = [['we re famished .', 'wir sind am verhungern .']]