> Attention Is All You Need

<a href="https://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder-and-decoder-stacks" target="_blank">Harvard Notebook</a>

<a href="https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec">Towardsdatascience Notebook</a>

# Transformer

A somewhat better explanation of the Transformer: <a href="https://glassboxmedicine.com/2019/08/15/the-transformer-attention-is-all-you-need/">article</a>.

The embedding weight matrix $W$ has shape (number of embeddings) x (embedding dimension).

Number of embeddings = `vocab_size`: each word in the vocabulary has its own embedding.

Embedding dimension = `d_model`: it is your choice; in the paper it is 512.

In `pytorch` you use the `nn.Embedding` module. You can load pre-trained embeddings (e.g. GloVe), or initialize them randomly and train them. The `Transformer` learns its own word embeddings:


In [1]:
from typing import Type, List
import torch
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.d_model = d_model
    def forward(self, x):
        return self.embed(x)

## Positional Encoding
In the Transformer all the words of a sentence are processed simultaneously, so there is no inherent word ordering as there is in an RNN. The authors of the Transformer propose adding a "positional encoding" to address this problem. There were 2 options for the positional encoding vectors:
1. learning the positional encoding vectors (add trainable parameters).
2. calculating the positional encoding vectors using an equation (requires no trainable parameters).

The authors opted for option 2. Formulas:

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$

$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$

where `pos` is the position of the word in the sentence, and `i` indexes the pairs of embedding dimensions. With `d_model = 512`, `i` ranges from 0 to 512 / 2 - 1 = 255.
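
As a rough illustration of the two formulas, the sketch below computes the encoding for a single position in plain Python. The function name `positional_encoding` and the tiny `d_model = 8` are assumptions made for readability; they do not come from the paper or from the notebooks linked above.

```python
import math

def positional_encoding(pos: int, d_model: int) -> list:
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # even dimensions use sine
        pe[2 * i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

pe0 = positional_encoding(0, 8)  # toy d_model, the paper uses 512
pe1 = positional_encoding(1, 8)
print(pe0)  # at position 0 every pair is (sin 0, cos 0) = (0, 1)
print([round(v, 4) for v in pe1])
```

Each pair of dimensions shares one frequency, and the frequency drops as `i` grows, so low dimensions change quickly with `pos` and high dimensions change slowly.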

This means that an English sentence: "I like trees" will be converted to 3 vectors (one vector for each word): <img src='../img/transformer/wordembedding_positional.png' width="500" height="500">
We do exactly the same thing for the correct output target sequence "Me gustan los árboles".


The shape of the input tensor (and of the output tensor) once embedding and positional encoding are applied is (`n_batches`, `L`, `512`). `n_batches` is the batch size, `L` is the length of the sequence (here 3 for the input and 4 for the output), and `512` is the embedding dimension (the number of columns in each word vector).
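
A quick shape check can make this concrete. The sketch below assumes a toy vocabulary of 100 tokens and made-up token ids; only `nn.Embedding` itself is taken from the text above.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 512   # hypothetical toy vocabulary size
embed = nn.Embedding(vocab_size, d_model)

# one batch holding 2 sentences of 3 token ids each (made-up ids)
tokens = torch.tensor([[5, 17, 42],
                       [7, 7, 99]])
out = embed(tokens)

print(embed.weight.shape)  # the weight matrix W: (vocab_size, d_model)
print(out.shape)           # (n_batches, L, d_model)
```

The same id always maps to the same row of $W$, so the two `7` tokens in the second sentence get identical vectors until the positional encoding is added.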


## Encoder
The encoder has 6 identical layers. What goes in is an English sentence, e.g. "I like trees", represented in the "word embeddings + positional encodings" format we just talked about. What comes out is a different representation of the sentence.

Each layer contains 2 sub-layers:
1. Multi-head attention
2. feed-forward network

<img src="../img/transformer/encoder.png" width="400" height="400">

In [2]:
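import copy

def clones(module, N):
    """Produce N identical, independent copies of a module."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    """Encoder: a stack of N identical layers (N = 6 in the paper)."""

    def __init__(self, layer: nn.Module, N: int):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        # pass the input (and mask) through each layer in turn
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class EncoderLayer(nn.Module):
    """Self-attention and feed forward."""
    def __init__(self, size: int = 512, self_attn: nn.Module = None,
                 feed_forward: nn.Module = None, dropout: float = 0.1):
        super(EncoderLayer, self).__init__()
        # skeleton only: the forward pass is omitted here, the complete
        # EncoderLayer is defined in the Code section below
        self.size = size
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.dropout = nn.Dropout(dropout)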

`EncoderLayer` has `size = d_model`, columns of the embeddings. `size = d_model = 512`.

### Multi-head attention
`x`, the embedding plus positional encoding, is taken in by the first `EncoderLayer`. Each of the other 5 layers then picks up a different representation of `x`: the output of the previous `EncoderLayer`.

$$Attention(Q,K,V) = softmax( \frac{QK^T}{ \sqrt{d_k}})  V$$

In the attention mechanism we split `d_model` across `heads`. Originally `heads=8`. This means the tensors inside the attention have a different shape from the `size` they come in with.

$Q = Tensor(n_{batches}, 8, L_q, 64)$

$K = Tensor(n_{batches}, 8, L_k, 64)$

$V = Tensor(n_{batches}, 8, L_v, 64)$

$Q$, $K$ and $V$ each have their own weight matrix. The weights are initialized randomly (or, better, with Xavier initialization).

_Note_

_Think of the Q, K, V dimensions this way: you have a `batch` of sentences; each sentence is split into 8 `heads`; each head spans the full length of the sentence but only `d_model / 8 = 64` columns._

`mask` is a `Tensor`. In the run further down, the encoder padding mask has shape (`n_batches`, 1, `L`) and the decoder mask has shape (`n_batches`, `L`, `L`). `unsqueeze` inserts a dimension of size 1 at the given index: for example `unsqueeze(1)` turns shape [1, 4] into [1, 1, 4].

Whatever enters `MultiHeadedAttention` must come out with the same `shape`: (`n_batches`, `L`, `512`).

Each of the 8 `heads` performs its own matrix multiplication; this is what is meant by _multi-headed_ attention. The extra _heads_ dimension allows us to have multiple _representation subspaces_. It gives us 8 different ways of considering the same sentence.
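
The reshaping between (`n_batches`, `L`, `512`) and (`n_batches`, 8, `L`, 64) can be sketched as follows. The random input and the small batch and sentence sizes are assumptions for illustration; the linear projections for `Q`, `K`, `V` are left out so only the split and merge are visible.

```python
import torch

n_batches, L, d_model, heads = 2, 3, 512, 8
d_k = d_model // heads  # 512 / 8 = 64 columns per head

x = torch.randn(n_batches, L, d_model)  # stand-in for a projected Q, K or V

# split the last dimension into (heads, d_k), then move heads next to batch
split = x.view(n_batches, L, heads, d_k).transpose(1, 2)
print(split.shape)   # (n_batches, heads, L, d_k)

# undo the split: move heads back and merge them into d_model again
merged = split.transpose(1, 2).contiguous().view(n_batches, L, d_model)
print(merged.shape)  # (n_batches, L, d_model)
```

Splitting and merging only rearrange the numbers, so the round trip returns exactly the original tensor; this is why the module can hand back the same shape it received.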


<a href="https://www.lesswrong.com/posts/qscAeYE67GoSffDDA/walkthrough-the-transformer-architecture-part-1-2">Article about transformer</a>: The heart of the Transformer is the attention mechanism. Most people illustrate attention with a matrix whose rows and columns are words. If the same sentence labels both the rows and the columns, we have the self-attention mechanism:

<img src='../img/transformer/self_attention.png' width="500" height="500">

With it we see which parts of the sentence are relevant to the other parts.
Similarly, with CNNs we frequently describe the network as having some sort of structured model of the images it classifies. This is why authors sometimes talk about the model recognizing smaller pieces of the picture, like wheels, and then combining these smaller features into a coherent whole, such as a car.

The way I think of attention is similar. The hope is that the neural network will be able to capture those parts of the text that are related to each other.

## Decoder
The `Decoder` has 3 sublayers:
1. Masked multi-head attention: masks future positions.
2. Multi-head attention: encoder-decoder multi-head attention.
3. Feed-forward network. The final linear layer and softmax come after the whole decoder stack, not inside each layer.

__Masked multi-head attention__

We prevent positions from attending to subsequent positions.

__Encoder-Decoder multi-head attention__

`x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))`

`x` comes from the previous `DecoderLayer`, `m` comes from the `Encoder` output (`encoderLayer6`).
In the `Encoder-Decoder attention` layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence (the encoder itself has attended to all positions of the input sequence, with the mask covering only the padded words).
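
Because queries and keys come from different sequences, the score matrix in this sublayer is rectangular. The sketch below uses random tensors as stand-ins for the already projected and split `Q`, `K`, `V`; the lengths 5 and 3 are made up.

```python
import math
import torch
import torch.nn.functional as F

n_batches, heads, d_k = 2, 8, 64
L_trg, L_src = 5, 3   # hypothetical target and source lengths

q = torch.randn(n_batches, heads, L_trg, d_k)  # queries: from the decoder
k = torch.randn(n_batches, heads, L_src, d_k)  # keys: from the encoder output
v = torch.randn(n_batches, heads, L_src, d_k)  # values: from the encoder output

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
weights = F.softmax(scores, dim=-1)   # one distribution over source positions per target position
out = torch.matmul(weights, v)

print(scores.shape)  # (n_batches, heads, L_trg, L_src)
print(out.shape)     # (n_batches, heads, L_trg, d_k)
```

The output keeps the target length, so the decoder's shape is unchanged even though it has read from a source sentence of a different length.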

### _Important notes_
We run the `Encoder` once to get the output of the encoder stack, which is the representation of the sentence "I like trees". Then we run the decoder multiple times so that it can predict the Spanish words "Me gustan los árboles" one at a time.

The last layer expands the output of the decoder stack into a huge vector whose length is `vocab_size`. The _softmax_ turns this vector into a probability distribution over the vocabulary, and we then select the element with the highest probability (_greedy decoding_).
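
A minimal sketch of that last step, in plain Python. The five-word vocabulary and the scores are invented for illustration; a real model would produce one score per entry of the full target vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<eos>", "me", "gustan", "los", "arboles"]  # hypothetical toy vocabulary
logits = [0.1, 2.0, 0.3, -1.0, 0.5]                  # made-up output of the last linear layer

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # greedy decoding: take the argmax
print(vocab[best])
```

Greedy decoding is only one option; the softmax output is a full distribution, so other strategies can pick from it differently.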

During training the decoder may not be very good yet, so it can produce incorrect predictions of the next word. If the decoder produces junk, we don't want to feed that junk back into the decoder for the next step. So, during training, we use a process called _teacher forcing_: we feed in the correct translation up to the previous position. The loss then compares the probability distribution over possible next words that the decoder actually produced with the distribution it should have produced, so the loss is still based on the decoder's actual predictions.
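
In practice teacher forcing comes down to shifting the target sentence by one position. The sketch below uses a made-up token list; the same shift appears in the training loop later in this notebook as `trg[:, :-1]` and `trg[:, 1:]`.

```python
# hypothetical tokenized target sentence with start and end markers
trg = ["<sos>", "me", "gustan", "los", "arboles", "<eos>"]

trg_input = trg[:-1]  # what the decoder is fed: everything except the last token
targets = trg[1:]     # what it must predict: everything except the first token

for step, want in enumerate(targets):
    # at each step the decoder sees the correct prefix, never its own guesses
    print(step, trg_input[: step + 1], "->", want)
```

Together with the mask over future positions, this lets all steps of a sentence be trained in one forward pass.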

In [None]:
class DecoderLayer(nn.Module):
    pass  # placeholder; the full DecoderLayer is defined in the Code section below

In [15]:
a = torch.arange(0,10).reshape(2,5)

In [16]:
a.size()

torch.Size([2, 5])

In [17]:
a.unsqueeze(1).size()

torch.Size([2, 1, 5])

In [18]:
a.size(0)

2

In [8]:
def get_clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])

In [10]:
d_model = 512
N = 6
linear = get_clones(nn.Linear(d_model, d_model), N); linear

ModuleList(
  (0): Linear(in_features=512, out_features=512, bias=True)
  (1): Linear(in_features=512, out_features=512, bias=True)
  (2): Linear(in_features=512, out_features=512, bias=True)
  (3): Linear(in_features=512, out_features=512, bias=True)
  (4): Linear(in_features=512, out_features=512, bias=True)
  (5): Linear(in_features=512, out_features=512, bias=True)
)

In [3]:
path = '../img/transformer/'

!ls {path}

encoder.png                  wordembedding_positional.png


In [4]:
torch.arange(0, 555).unsqueeze(-1).size()

torch.Size([555, 1])

In [5]:
torch.arange(0, 555).unsqueeze(1).size()

torch.Size([555, 1])

In [6]:
# !pip install torch

In [7]:
import math, copy, time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from typing import Type
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context='talk')

In [5]:
class Norm(nn.Module):
    """Layer normalization over the last dimension, with a learnable scale and bias."""
    def __init__(self, features, eps=1e-6):
        super(Norm, self).__init__()
        self.a = nn.Parameter(torch.ones(features))
        self.bias = nn.Parameter(torch.zeros(features))
        self.eps = eps
        
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a * (x - mean) / (std + self.eps) + self.bias
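
To see what this normalization does to actual numbers, here is a stripped-down sketch without the learnable scale and bias. The helper name `layer_norm` is an assumption; the first input row reuses the `[2, 6]` example from the scratch cells.

```python
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize over the last dimension, without learnable scale or shift."""
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)  # torch.std is unbiased (divides by n - 1) by default
    return (x - mean) / (std + eps)

x = torch.tensor([[2.0, 6.0],
                  [1.0, 2.0]])
y = layer_norm(x)
print(y)  # each row is centred on 0 and rescaled independently
```

Note that `nn.LayerNorm` differs slightly from this formula: it uses the biased variance and adds `eps` inside the square root, so the two give close but not identical values.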

In [6]:
x = np.array([2,6])

In [7]:
x

array([2, 6])

In [8]:
mean = x.mean(axis=-1, keepdims=True)

In [9]:
x - mean

array([-2.,  2.])

# Code
## _Attention_
Amazing explanation at: https://www.tensorflow.org/tutorials/text/transformer


Attention mechanism is the result of a few operations:

1. the dot product between Query $Q$ and Key $K$. 

$$Q \bullet K^T$$

2. It is then scaled: divided by the square root of $d_k$, the number of columns in $K$ (or $Q$).

$$\frac{Q \bullet K^T}{\sqrt{d_k}}$$

3. After this we apply a softmax and multiply by $V$.

$$Attention(Q, K, V) = softmax(\frac{Q \bullet K^T}{\sqrt{d_k}}) \bullet V$$
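
One way to see why step 2 divides by $\sqrt{d_k}$: for vectors with unit-variance entries, the dot product grows with the number of columns. The sketch below checks this empirically with random vectors, which are only stand-ins for real queries and keys.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10000, d_k)  # unit-variance stand-ins for query vectors
k = torch.randn(10000, d_k)  # unit-variance stand-ins for key vectors

raw = (q * k).sum(-1)            # 10000 independent dot products
scaled = raw / math.sqrt(d_k)

print(raw.std().item())     # close to sqrt(64) = 8
print(scaled.std().item())  # close to 1
```

Without the scaling, large scores would push the softmax towards a near one-hot output, where gradients become very small.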

In [10]:
def attention(query: torch.Tensor, key: torch.Tensor,
              value: torch.Tensor, mask=None, dropout=None):
    """Compute the scaled dot product attention"""
    
    d_k = query.size(-1)  # column size
    print(f'd_k value {d_k}')
    print(f'{query.size()} , {key.size()}, {value.size()}')
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    print(f'Score shape -- {scores.size()}')
    if mask is not None:
        print(f'Mask ---> {mask.size()}')
        mask = mask.unsqueeze(1)
        print(f'Mask new --> {mask.size()}')
        scores = scores.masked_fill(mask == 0, -1e9)
    
    p_attn = F.softmax(scores, dim = -1)

    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value)

In [11]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, heads, d_model = 512, dropout=0.1):
        """
        :params
        -------
        heads : int
            Number of heads. Heads = 8, in the original paper.
        d_model : int
            Embedding dimension of the model. The `d_model` = 512 in the original paper 
            "Attention is all you need". 
        dropout : float
            Dropout value.
        
        :return
        -------
        output : torch.Tensor
            Tensor with attention matrix.
        """
        super(MultiHeadedAttention, self).__init__()
        
        assert d_model % heads == 0, 'There must be no remainder!'
        # Split the dimensionality into `h` heads. 512 / 8 = 64
        self.d_k = d_model // heads
        self.h = heads
        self.d_model = d_model
        
        # 4 layers: `Q`, `K`, `V`, `O`
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(d_model, d_model)
        
    def forward(self, q, k, v, mask):
        
        batch_size = q.size(0)  # batch size: first dimension of `q`
        
        # linear transformation and split into h heads
        q = self.q_linear(q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        k = self.k_linear(k).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        v = self.v_linear(v).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        
        # calculate attention
        # Attention tells you which words are important and to which we have to give more 
        # value: https://www.tensorflow.org/tutorials/text/transformer
        # `attention` returns softmax(QK^T / sqrt(d_k)) V, i.e. the weighted values,
        # not the attention weights themselves
        attn_output = attention(q, k, v, mask=mask, dropout=self.dropout)
        
        # concatenate heads
        concat = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        # Multiplying the attention weights by V keeps the words you want to focus on
        # and flushes out the irrelevant words.
        return self.output(concat)

## _Positional Encoding_

At the very beginning of the encoder.

In [12]:
class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len = 200, dropout = 0.1):
        super().__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)
        # create constant 'pe' matrix with values dependent on 
        # pos and i
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            # i steps over the even columns, so i here equals 2i in the formula
            for i in range(0, d_model, 2):
                angle = pos / (10000 ** (i / d_model))
                pe[pos, i] = math.sin(angle)
                pe[pos, i + 1] = math.cos(angle)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # make embeddings relatively larger
        x = x * math.sqrt(self.d_model)
        # add the constant encoding to the embedding; 'pe' is a registered
        # buffer, so it has no gradient and moves with the module's device
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]
        return self.dropout(x)

## _Masks_

We use masks in 2 cases:
1. In the encoder and decoder: to zero attention outputs wherever there is just padding in the input sentences.
2. In the decoder: to prevent the decoder from peeking ahead at the rest of the translated sentence when predicting the next word.
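
Both kinds of mask can be sketched on a toy batch. The pad index `0` and the token ids are assumptions, and the look-ahead mask is built here with `torch.tril` rather than the `np.triu(...) == 0` construction used in the cells below; the two give the same lower-triangular pattern.

```python
import torch

pad = 0  # hypothetical index of the <pad> token
seq = torch.tensor([[4, 9, 7, pad],
                    [5, 3, pad, pad]])   # (n_batches=2, L=4), made-up token ids

# padding mask: True where there is a real token
pad_mask = (seq != pad).unsqueeze(1)     # (2, 1, 4)

# look-ahead mask: position i may attend to positions <= i only
L = seq.size(1)
nopeek = torch.tril(torch.ones(1, L, L)).bool()   # (1, 4, 4)

trg_mask = pad_mask & nopeek             # broadcasts to (2, 4, 4)
print(trg_mask.shape)
print(trg_mask[1].int())
```

In the combined mask a position is visible only if it is a real token and not in the future, which is why the second sentence (two real tokens) never exposes more than its first two columns.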

### _Torchtext_

In [13]:
path = '../data/nlp/transformer/fr-en'
!ls {path}

europarl_en = open(f'{path}/europarl-v7.fr-en.en', encoding='utf-8').read().split('\n')
europarl_fr = open(f'{path}/europarl-v7.fr-en.fr', encoding='utf-8').read().split('\n')

#!python3 -m spacy download en
#!python3 -m spacy download fr
#!pip install torchtext==0.6.0
import spacy
import torchtext
from torchtext.data import Field, BucketIterator, TabularDataset
from sklearn.model_selection import train_test_split
import pandas as pd

en = spacy.load('en')
fr = spacy.load('fr')

def tokenize_en(sentence):
    return [tok.text for tok in en.tokenizer(sentence)]

def tokenize_fr(sentence):
    return [tok.text for tok in fr.tokenizer(sentence)]

# First create the fields; at the end, build the vocab from these two Fields
EN_TEXT = Field(tokenize=tokenize_en)
FR_TEXT = Field(tokenize=tokenize_fr, init_token='<sos>', eos_token='<eos>')

# csv format
raw_data = {'English' : [line for line in europarl_en[:5000]], 'French': [line for line in europarl_fr[:5000]]}

df = pd.DataFrame(raw_data, columns=["English", "French"])

# remove very long sentences and sentences where translations are 
# not of roughly equal length
df['eng_len'] = df['English'].str.count(' ')
df['fr_len'] = df['French'].str.count(' ')
df = df.query('fr_len < 80 & eng_len < 80')
df = df.query('fr_len < eng_len * 1.5 & fr_len * 1.5 > eng_len')

# create train and validation set
train, val = train_test_split(df, test_size=0.1)

train.to_csv('train.csv', index=False)
val.to_csv('val.csv', index=False)

data_fields = [('English', EN_TEXT), ('French', FR_TEXT)]
train, val = TabularDataset.splits(path='./', train='train.csv', validation='val.csv', 
                                   format='csv', fields=data_fields)

FR_TEXT.build_vocab(train, val)
EN_TEXT.build_vocab(train, val)

train_iter = BucketIterator(train, batch_size=512, 
                            sort_key=lambda x: len(x.French), shuffle=True)

europarl-v7.fr-en.en europarl-v7.fr-en.fr


You can see that the sentences are split into batches. `torchtext` builds the vocabulary from the tokenized text, and its size gives the `vocab_size` of your embeddings.
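
Conceptually, building a vocabulary means mapping each distinct token to an integer id. The sketch below is a plain-Python illustration of that idea, not `torchtext`'s actual implementation; the toy corpus, the choice of special tokens, and the ordering rule are all assumptions.

```python
from collections import Counter

# toy tokenized corpus (hypothetical)
sentences = [["i", "like", "trees"], ["i", "like", "rain"]]

counts = Counter(tok for sent in sentences for tok in sent)

# special tokens first, then words by descending frequency (ties broken alphabetically)
itos = ["<unk>", "<pad>"] + sorted(counts, key=lambda w: (-counts[w], w))
stoi = {w: i for i, w in enumerate(itos)}

# unknown words fall back to the <unk> id
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["i", "like", "snow"]]
print(len(itos), ids)
```

`len(itos)` plays the role of `vocab_size`: it fixes the number of rows of the embedding matrix.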

In [14]:
for i in train_iter:
    print(i)


[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 83x512]
	[.French]:[torch.LongTensor of size 87x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 87x512]
	[.French]:[torch.LongTensor of size 92x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 88x512]
	[.French]:[torch.LongTensor of size 92x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 92x512]
	[.French]:[torch.LongTensor of size 96x512]

[torchtext.data.batch.Batch of size 3]
	[.English]:[torch.LongTensor of size 37x3]
	[.French]:[torch.LongTensor of size 41x3]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 88x512]
	[.French]:[torch.LongTensor of size 92x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 89x512]
	[.French]:[torch.LongTensor of size 87x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of 

In [15]:
for i in iter(train_iter):
    print(i)


[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 83x512]
	[.French]:[torch.LongTensor of size 88x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 92x512]
	[.French]:[torch.LongTensor of size 96x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 89x512]
	[.French]:[torch.LongTensor of size 93x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 88x512]
	[.French]:[torch.LongTensor of size 95x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 87x512]
	[.French]:[torch.LongTensor of size 91x512]

[torchtext.data.batch.Batch of size 3]
	[.English]:[torch.LongTensor of size 25x3]
	[.French]:[torch.LongTensor of size 30x3]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of size 91x512]
	[.French]:[torch.LongTensor of size 88x512]

[torchtext.data.batch.Batch of size 512]
	[.English]:[torch.LongTensor of 

In [None]:
for batch in iter(train_iter):
    print(batch.English.transpose(0, 1).size())

In [17]:
len(EN_TEXT.vocab.itos)

8181

In [18]:
len(FR_TEXT.vocab.itos)

10242

### _Encoder_

In [19]:
# ENCODER MASKS
batch = next(iter(train_iter))
input_seq = batch.English.transpose(0,1)
input_pad = EN_TEXT.vocab.stoi['<pad>']
input_msk = (input_seq != input_pad).unsqueeze(1)

In [20]:
input_msk.shape

torch.Size([512, 1, 91])

### _Decoder_

In [23]:
# DECODER MASKS
target_seq = batch.French.transpose(0,1)
target_pad = FR_TEXT.vocab.stoi['<pad>']
target_msk = (target_seq != target_pad).unsqueeze(1)
size = target_seq.size(1) # get seq_len for matrix
nopeak_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
nopeak_mask = Variable(torch.from_numpy(nopeak_mask) == 0)
target_msk = target_msk & nopeak_mask

In [24]:
target_msk.shape

torch.Size([512, 91, 91])

## _Embedder_
The `Embedder` takes `vocab_size` and `d_model` as parameters. The `vocab_size` is the number of unique tokens in the vocabulary built from the text; here `len(EN_TEXT.vocab.itos)` gives that value (8181 for English). The same applies to the French vocabulary in the decoder (10242). The `d_model` is set to 512 in the paper and here too.

In [25]:
class Embedder(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
    def forward(self, x):
        return self.embed(x)

## _Feed-Forward_

In [26]:
class FeedForward(nn.Module):
    """Two linear transformations with a ReLU (and dropout) in between"""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super(FeedForward, self).__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
        
    def forward(self, x):
        x = self.dropout(F.relu(self.linear_1(x)))
        return self.linear_2(x)

## Layers _Encoder - Decoder_

In [27]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.attn = MultiHeadedAttention(heads, d_model)
        self.ff = FeedForward(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        print(f'this is x in EncoderLayer-----> {x}')
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2, x2, x2, mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x
        

class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.norm_3 = Norm(d_model)
        
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        
        self.attn_1 = MultiHeadedAttention(heads, d_model)
        self.attn_2 = MultiHeadedAttention(heads, d_model)
        self.ff = FeedForward(d_model)
        
    def forward(self, x, e_outputs, src_mask, trg_mask):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs, src_mask))
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x

In [29]:
def get_clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads, dropout):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model, dropout=dropout)
        self.layers = get_clones(EncoderLayer(d_model, heads, dropout), N)
        self.norm = Norm(d_model)
    def forward(self, src, mask):
        x = self.embed(src)
        
        x = self.pe(x)
        print(f'--Encoder x after PE : {x}')
        for i in range(self.N):
            print(f'loop within encoder. Iteration -- {i}')
            x = self.layers[i](x, mask)
        return self.norm(x)
    
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads, dropout):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model, dropout=dropout)
        self.layers = get_clones(DecoderLayer(d_model, heads, dropout), N)
        self.norm = Norm(d_model)
    def forward(self, trg, e_outputs, src_mask, trg_mask):
        x = self.embed(trg)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, e_outputs, src_mask, trg_mask)
        return self.norm(x)

## _Transformer_

In [30]:
class Transformer(nn.Module):
    """Transformer architecture with encoder, decoder and Linear layer of the output."""
    def __init__(self, src_vocab, trg_vocab, d_model, N, heads, dropout = 0.1):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads, dropout)
        self.decoder = Decoder(trg_vocab, d_model, N, heads, dropout)
        self.out = nn.Linear(d_model, trg_vocab)
    
    def forward(self, src, trg, src_mask, trg_mask):
        e_outputs = self.encoder(src, src_mask)
        d_output = self.decoder(trg, e_outputs, src_mask, trg_mask)
        output = self.out(d_output)
        return output

## _Model_

In [31]:
def nopeak_mask(size):
    np_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    np_mask =  Variable(torch.from_numpy(np_mask) == 0)
    return np_mask

def create_masks(src, trg):
    
    source_pad = EN_TEXT.vocab.stoi['<pad>']
    src_mask = (src != source_pad).unsqueeze(-2)

    if trg is not None:
        target_pad = FR_TEXT.vocab.stoi['<pad>']
        trg_mask = (trg != target_pad).unsqueeze(-2)
        size = trg.size(1) # get seq_len for matrix
        np_mask = nopeak_mask(size)
        trg_mask = trg_mask & np_mask
    else:
        trg_mask = None
    return src_mask, trg_mask

In [41]:
d_model = 512
heads = 8
N = 6
src_vocab = len(EN_TEXT.vocab)
trg_vocab = len(FR_TEXT.vocab)

model = Transformer(src_vocab, trg_vocab, d_model, N, heads, dropout = 0.1)

for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
        
optim = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

def train_model(epochs, print_every=100):
    
    model.train()
    
    start = time.time()
    temp = start
    
    total_loss = 0
    
    for epoch in range(epochs):
        print(f'Epoch - {epoch}')
        for i, batch in enumerate(train_iter):
            src = batch.English.transpose(0,1)
            trg = batch.French.transpose(0,1)
            # the French sentence we input has all words except
            # the last, as it is using each word to predict the next
            trg_input = trg[:, :-1]
            
            # the words we are trying to predict
            targets = trg[:, 1:].contiguous().view(-1)
            
            # build the source and target masks with create_masks defined above
            src_mask, trg_mask = create_masks(src, trg_input)
            
            preds = model(src, trg_input, src_mask, trg_mask)
            
            optim.zero_grad()
            print(f'predictions size : {preds.view(-1, preds.size(-1)).size()}')
            print(f'targets size : {targets.size()}')
            loss = F.cross_entropy(preds.view(-1, preds.size(-1)), targets, 
                                   ignore_index=target_pad)
            loss.backward()
            optim.step()
            print(f'Loss.data ----> = {loss.item()}')
            total_loss += loss.item()
            if (i + 1) % print_every == 0:
                loss_avg = total_loss / print_every
                print("time = %dm, epoch %d, iter = %d, loss = %.3f, %ds per %d iters" % ((time.time() - start) // 60, epoch + 1, i + 1, loss_avg, time.time() - temp, print_every))
                total_loss = 0
                temp = time.time()

In [42]:
train_model(epochs=3)

Epoch - 0
--Encoder x after PE : tensor([[[-0.0851,  0.9312, -0.2281,  ...,  1.0702, -0.1880,  1.0272],
         [ 0.4032,  0.4439,  1.3056,  ...,  0.5143, -0.6426,  0.7896],
         [ 1.6677, -0.8952,  1.6084,  ...,  0.0000, -0.2544,  0.0000],
         ...,
         [-0.2319,  0.5529, -0.4102,  ...,  1.0595,  0.0000,  0.5406],
         [-0.8427, -0.4381, -0.2582,  ...,  1.0595,  0.0000,  0.0000],
         [-0.4383, -1.2356,  0.0000,  ...,  1.0595,  0.1775,  0.0000]],

        [[ 0.3753,  0.5082, -0.4723,  ...,  1.5692, -0.3355,  0.8619],
         [ 0.6853,  0.2364,  0.0000,  ...,  0.0000, -0.6110,  0.5476],
         [ 1.4878, -0.0000,  1.2968,  ...,  1.5232,  0.5940,  1.3578],
         ...,
         [-0.2319,  0.0000, -0.4102,  ...,  1.0595,  0.1775,  0.5406],
         [-0.8427, -0.4381, -0.2582,  ...,  1.0595,  0.0000,  0.5406],
         [-0.4383, -1.2356,  0.6229,  ...,  1.0595,  0.1775,  0.5406]],

        [[-0.5318,  0.9221,  0.4146,  ...,  0.5143, -0.6426,  0.7896],
         [ 1

d_k value 64
torch.Size([512, 8, 82, 64]) , torch.Size([512, 8, 82, 64]), torch.Size([512, 8, 82, 64])
Score shape -- torch.Size([512, 8, 82, 82])
Mask ---> torch.Size([512, 1, 82])
Mask new --> torch.Size([512, 1, 1, 82])
loop within encoder. Iteration -- 2
this is x in EncoderLayer-----> tensor([[[ 1.2373,  2.3735,  1.1324,  ..., -2.2191,  0.6009, -0.5536],
         [ 1.7796,  0.2496,  0.9270,  ..., -1.8907, -0.7796, -0.0823],
         [ 3.6477, -1.6047,  2.9605,  ..., -2.2155,  1.5546, -1.9528],
         ...,
         [ 1.7254,  1.0542, -0.1385,  ..., -0.8589,  1.2703, -1.6588],
         [ 0.2856, -0.4140, -0.2438,  ..., -0.9472,  1.7512, -1.1925],
         [ 0.2425,  0.7415, -0.1423,  ..., -2.3492,  1.3482, -1.8761]],

        [[ 1.0044,  0.9614,  0.1505,  ..., -0.2430,  2.1106, -1.6201],
         [ 0.8209,  0.7809,  0.1609,  ..., -2.1277,  1.1611, -2.1375],
         [ 2.6876, -0.1629,  3.0896,  ..., -0.2157,  3.3580,  0.1907],
         ...,
         [ 0.5951,  1.6866,  1.0686,  ..

d_k value 64
torch.Size([512, 8, 82, 64]) , torch.Size([512, 8, 82, 64]), torch.Size([512, 8, 82, 64])
Score shape -- torch.Size([512, 8, 82, 82])
Mask ---> torch.Size([512, 1, 82])
Mask new --> torch.Size([512, 1, 1, 82])
loop within encoder. Iteration -- 5
this is x in EncoderLayer-----> tensor([[[ 2.3736, -1.0162,  0.3866,  ..., -1.9666, -1.3029, -0.2442],
         [ 1.9583, -3.2455, -1.9080,  ..., -0.1589, -3.6666,  0.8351],
         [ 3.8433, -2.2394,  3.1165,  ..., -0.8794, -0.7147, -0.0499],
         ...,
         [ 1.3359, -0.6574, -0.4339,  ...,  0.7388, -0.1017, -2.0900],
         [ 1.1404, -2.7856, -0.8308,  ..., -0.8665, -1.4210, -0.2405],
         [ 0.6618, -2.0113, -1.9549,  ..., -0.8180, -1.5113, -0.4708]],

        [[ 0.9547, -1.3453, -1.3675,  ..., -0.8117, -0.0519, -0.9876],
         [ 2.4575, -3.6740,  0.1999,  ..., -2.2720, -2.2707, -1.7620],
         [ 3.5713, -3.5491,  4.1238,  ..., -0.1193, -0.2347, -0.6394],
         ...,
         [ 2.5983, -2.4574,  1.8107,  ..

d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-2.5324,  3.7967,  0.0216,  ...,  0.1651,  1.2800,  0.3328],
         [-0.2523,  2.2308,  1.2020,  ..., -0.2065,  1.0067,  0.0549],
         [ 1.6817, -0.0502,  0.7137,  ...,  0.4918,  1.3011,  0.5103],
         ...,
         [-0.0726,  0.3920,  2.4521,  ..., -0.1285,  1.1017, -0.8300],
         [-0.0763,  2.1565,  2.3440,  ..., -0.8168,  0.0890, -0.9893],
         [-0.9927,  2.3537,  0.8408,  ..., -0.2144,  1.4347, -0.3247]],

        [[-0.8982,  2.7631,  1.0055,  ..., -0.0910,  0.8583,  0.5200],
         [ 0.1012,  1.8462,  1.1131,  ..., -1.1167,  0.2531,  0.8317],
         [-0.3562,  1.0680,  1.5284,  ..., -0.5806,  1.5817,  1.5641],
         ...,
         [-0.0596,  1.1810,  2.2705,  ..

d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-1.1884,  5.1994,  0.1242,  ..., -3.6180, -1.2333, -2.5598],
         [ 1.5916,  2.8187,  1.3063,  ..., -4.1959, -1.9222, -3.3988],
         [ 3.3879,  3.3241, -0.2021,  ..., -3.9558, -0.1180, -3.2054],
         ...,
         [ 1.3802,  1.4840,  0.7675,  ..., -3.8833, -1.3094, -2.0298],
         [ 0.4350,  2.9146,  2.1709,  ..., -4.7154, -3.6078, -2.8101],
         [-0.1253,  4.4905, -1.2041,  ..., -4.4684, -0.4421, -0.9889]],

        [[-0.2393,  4.2441, -0.2232,  ..., -4.4183, -0.4282, -2.1799],
         [ 1.0243,  3.6881, -0.1542,  ..., -5.6110, -1.0574, -2.0070],
         [ 1.0418,  1.5724,  0.4145,  ..., -3.3089,  0.1020, -1.5662],
         ...,
         [ 0.6121,  2.1321,  0.8328,  ..

d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 95, 64]), torch.Size([512, 8, 95, 64])
Score shape -- torch.Size([512, 8, 95, 95])
Mask ---> torch.Size([512, 95, 95])
Mask new --> torch.Size([512, 1, 95, 95])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 95, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 95, 64]), torch.Size([512, 8, 95, 64])
Score shape -- torch.Size([512, 8, 95, 95])
Mask ---> torch.Size([512, 95, 95])
Mask new --> torch.Size([512, 1, 95, 95])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 95, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 95, 64]), torch.Size([512, 8, 95, 64])
S

loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-0.9212,  2.9854,  0.7876,  ..., -0.2346,  0.7811, -0.0275],
         [ 0.7595,  2.1049,  0.6342,  ...,  0.4106,  2.0050,  0.7623],
         [-0.0246,  1.6343,  0.9550,  ..., -1.6762,  0.8438, -0.9817],
         ...,
         [-1.2049,  2.1306, -0.1078,  ..., -0.3383,  1.3275, -0.5474],
         [-1.0445,  2.5198, -0.4640,  ..., -0.3621,  1.4420, -0.4822],
         [-1.6986,  1.2432,  0.2937,  ..., -0.3456,  1.2754, -0.8856]],

        [[-1.4204,  3.0450, -0.5899,  ..., -0.8554,  1.6811, -0.1548],
         [-0.7509,  1.5633,  1.6354,  ...,  0.6904,  2.3747,  0.7515],
         [-0.1989,  1.5467,  0.8351,  ...,  0.9388,  1.9755, -0.2252],
         ...,
         [-0.6955,  2.6723,  0.4140,  ..., -0.3296,  1.5581,  0.3699],
         [-0.6544,  1.8320, -0.8550,  ..., -0.3681,  1.4773,  0.4093],
         [-1.3406,  0.9330, -0.0635,  ..., -0.4743,  0.1170, -1.0938]],

        [[ 0.7598,  3.3448,  0.0450,  ..., -1.03

d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 88, 88])
Mask ---> torch.Size([512, 1, 88])
Mask new --> torch.Size([512, 1, 1, 88])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[ 0.4230,  5.3170, -1.2567,  ..., -3.0115, -0.2290, -2.6088],
         [ 1.3656,  5.0687, -0.8841,  ..., -4.8667, -0.4553, -2.2197],
         [ 1.6236,  4.1512, -0.8767,  ..., -7.7559, -1.3243, -4.4726],
         ...,
         [ 0.1264,  5.8278, -1.1133,  ..., -2.7081, -0.9205, -3.0394],
         [ 0.5023,  5.7114, -2.9146,  ..., -4.0175, -0.1017, -2.4068],
         [-0.7948,  4.1885, -1.9223,  ..., -6.0852,  0.7595, -2.7241]],

        [[ 0.1012,  6.6271, -3.3067,  ..., -3.7281, -0.7016, -3.8786],
         [ 0.7379,  5.0841,  0.1455,  ..., -2.4506, -0.3192, -2.6866],
         [ 0.8000,  4.8496, -0.9287,  ..., -2.4407,  2.4267, -4.5445],
         ...,
         [-0.5350,  3.3730, -0.6320,  ..

d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 91, 88])
Mask ---> torch.Size([512, 1, 88])
Mask new --> torch.Size([512, 1, 1, 88])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 91, 91])
Mask new --> torch.Size([512, 1, 91, 91])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 91, 88])
Mask ---> torch.Size([512, 1, 88])
Mask new --> torch.Size([512, 1, 1, 88])
predictions size : tensor([[-4.0364e-01, -3.3938e-01,  2.6769e-01,  ..., -4.7148e-01,
         -7.7419e-03,  4.5100e-04],
        [-3.9131e-01, -3.4477e-01,  3.4359e-01,  ..., -2.4829e-01,
         -1.6565e-01, -1.1348e-01],
        [-3.8753e-01, -2.3814e-01,  2.5845e-01,  ..., -3.4388e-01,
         -3.0029e-01, -5.1941e-02],
 

d_k value 64
torch.Size([512, 8, 82, 64]) , torch.Size([512, 8, 82, 64]), torch.Size([512, 8, 82, 64])
Score shape -- torch.Size([512, 8, 82, 82])
Mask ---> torch.Size([512, 1, 82])
Mask new --> torch.Size([512, 1, 1, 82])
loop within encoder. Iteration -- 2
this is x in EncoderLayer-----> tensor([[[-2.0127e-01,  2.9940e+00,  5.8654e-01,  ..., -2.5922e+00,
           2.4272e+00, -2.2104e-01],
         [ 1.8526e-01,  4.0921e+00,  7.0172e-01,  ..., -3.4250e+00,
           2.3046e+00, -1.1830e-01],
         [-2.7542e-01,  1.3119e+00,  1.3217e+00,  ...,  8.3479e-01,
           1.5584e-01, -1.2507e+00],
         ...,
         [-6.1290e-01,  5.4906e+00, -7.4561e-01,  ..., -4.9907e-01,
           1.9229e+00, -1.1916e+00],
         [-1.8130e+00,  3.6446e+00, -5.8858e-02,  ..., -1.5086e+00,
           4.4020e-01, -1.0754e+00],
         [ 5.2264e-01,  2.1503e+00, -2.9628e-02,  ..., -1.8970e+00,
           2.6707e+00, -1.1634e+00]],

        [[-1.3693e+00,  3.2432e+00, -4.7163e-02,  ..., -2.3966e

d_k value 64
torch.Size([512, 8, 82, 64]) , torch.Size([512, 8, 82, 64]), torch.Size([512, 8, 82, 64])
Score shape -- torch.Size([512, 8, 82, 82])
Mask ---> torch.Size([512, 1, 82])
Mask new --> torch.Size([512, 1, 1, 82])
loop within encoder. Iteration -- 5
this is x in EncoderLayer-----> tensor([[[ 1.6358e+00,  2.6326e+00, -2.5437e+00,  ..., -6.6480e+00,
           2.9328e-01, -6.0283e-02],
         [ 3.5679e+00,  5.2126e+00, -3.0062e+00,  ..., -8.6629e+00,
           2.0411e+00, -2.1027e+00],
         [ 3.4621e+00,  1.9185e+00, -6.1669e-02,  ..., -3.1337e+00,
          -1.1824e+00, -1.9760e+00],
         ...,
         [ 2.1793e+00,  5.9917e+00, -3.4664e+00,  ..., -5.3463e+00,
          -1.3047e+00, -2.4314e+00],
         [ 1.2031e+00,  2.4098e+00, -1.4002e+00,  ..., -4.4303e+00,
          -2.3371e+00, -1.7180e+00],
         [ 2.7791e+00,  1.8332e+00, -2.8679e+00,  ..., -4.7335e+00,
          -2.6253e-01, -1.5966e+00]],

        [[ 1.5248e+00,  3.0616e+00, -2.9142e+00,  ..., -8.3204e

d_k value 64
torch.Size([512, 8, 89, 64]) , torch.Size([512, 8, 89, 64]), torch.Size([512, 8, 89, 64])
Score shape -- torch.Size([512, 8, 89, 89])
Mask ---> torch.Size([512, 1, 89])
Mask new --> torch.Size([512, 1, 1, 89])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-2.0135,  2.5520, -0.1131,  ..., -1.2710,  1.8633, -0.2739],
         [-2.0812,  3.0580,  0.4592,  ...,  0.1324,  1.7807,  0.3157],
         [-2.1077,  1.4106,  0.1977,  ...,  0.3285,  1.9876, -0.8802],
         ...,
         [-0.6062,  1.9885, -0.3880,  ..., -0.6373,  2.5931,  0.5406],
         [-2.4410,  1.7600, -0.5587,  ...,  0.9660,  1.6841, -0.2827],
         [-1.3418, -1.3217,  0.5597,  ...,  1.1674,  2.3946, -1.2341]],

        [[-1.6370,  3.7617,  0.1969,  ..., -1.0491,  0.7131, -0.8424],
         [-0.7455,  2.2737,  0.5637,  ..., -0.7190,  2.3634,  0.4798],
         [-0.2356,  2.6226,  0.4773,  ..., -1.7187,  2.1879, -0.6213],
         ...,
         [-1.9803,  1.3019, -0.7044,  ..

d_k value 64
torch.Size([512, 8, 89, 64]) , torch.Size([512, 8, 89, 64]), torch.Size([512, 8, 89, 64])
Score shape -- torch.Size([512, 8, 89, 89])
Mask ---> torch.Size([512, 1, 89])
Mask new --> torch.Size([512, 1, 1, 89])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[ 0.1366,  6.5112, -4.0824,  ..., -4.4230,  0.9173, -4.8347],
         [ 0.0519,  7.0721, -2.9689,  ..., -4.1291, -0.0349, -4.3464],
         [ 1.5174,  3.8615, -3.6953,  ..., -4.1525, -0.3673, -4.5138],
         ...,
         [ 0.8380,  5.4576, -4.0401,  ..., -4.9609,  0.8088, -1.2684],
         [-1.8025,  5.8112, -3.5701,  ..., -2.9787, -1.1814, -3.3055],
         [ 0.3472,  1.8960, -3.5660,  ..., -1.7422, -0.2908, -4.8780]],

        [[-0.2917,  6.5753, -2.5234,  ..., -4.1192, -1.5435, -4.7013],
         [ 0.4137,  6.0222, -2.2722,  ..., -4.2955,  1.0639, -3.6393],
         [ 1.2127,  6.0859, -1.6165,  ..., -5.8198,  0.8402, -4.4724],
         ...,
         [ 0.2031,  5.2023, -2.5272,  ..

d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 89, 64]), torch.Size([512, 8, 89, 64])
Score shape -- torch.Size([512, 8, 91, 89])
Mask ---> torch.Size([512, 1, 89])
Mask new --> torch.Size([512, 1, 1, 89])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 91, 91])
Mask new --> torch.Size([512, 1, 91, 91])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 89, 64]), torch.Size([512, 8, 89, 64])
Score shape -- torch.Size([512, 8, 91, 89])
Mask ---> torch.Size([512, 1, 89])
Mask new --> torch.Size([512, 1, 1, 89])
predictions size : tensor([[-0.4157, -0.3107,  0.2332,  ..., -0.3386, -0.2376,  0.0113],
        [-0.3351, -0.3189,  0.2486,  ..., -0.1919, -0.0660,  0.0520],
        [-0.3039, -0.3591,  0.1641,  ..., -0.1754, -0.0855,  0.0020],
        ...,
        [-0.3722, -0.3032,  0.3244,  ..., -0.2556,  0.0653,  0.1369],
        [-0.2886,

loop within encoder. Iteration -- 5
this is x in EncoderLayer-----> tensor([[[  0.5796,   5.3307,  -2.3741,  ...,  -5.2855,   1.1693,  -4.4926],
         [  2.0855,   4.2141,  -0.7107,  ...,  -8.4357,  -0.6645,  -2.5086],
         [  1.4739,   4.8020,  -1.6681,  ...,  -5.0307,   2.4531,  -1.9225],
         ...,
         [ -0.7318,   4.3292,  -1.6974,  ...,  -8.9682,   0.7243,  -2.5219],
         [  2.3301,   0.5871,  -1.2639,  ...,  -7.5478,  -0.5319,  -0.1926],
         [  0.9651,   3.5720,  -0.7260,  ...,  -5.8632,   0.4134,  -3.6682]],

        [[  1.4904,   4.7306,  -1.5127,  ...,  -9.3468,   3.7137,  -1.5286],
         [  1.6074,   3.5344,  -1.4772,  ...,  -7.3805,   0.4317,  -1.2360],
         [  1.3271,   3.1477,   0.7420,  ...,  -8.7358,   1.0437,   0.1583],
         ...,
         [ -0.2570,   4.0135,  -1.6634,  ...,  -9.0268,   3.6578,  -3.3951],
         [  2.0639,   2.4502,  -3.2367,  ...,  -7.8102,   1.6374,  -2.3330],
         [  2.2615,   1.5472,  -0.1701,  ..., -10.1188,

d_k value 64
torch.Size([512, 8, 84, 64]) , torch.Size([512, 8, 84, 64]), torch.Size([512, 8, 84, 64])
Score shape -- torch.Size([512, 8, 84, 84])
Mask ---> torch.Size([512, 1, 84])
Mask new --> torch.Size([512, 1, 1, 84])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-1.8483e+00,  8.5939e-01,  1.1991e+00,  ..., -4.6804e-01,
           1.0693e+00,  4.4452e-01],
         [ 1.1190e+00,  2.8090e+00,  9.7838e-01,  ..., -1.2101e+00,
           2.9374e+00,  4.0845e-01],
         [-1.0753e+00,  1.3349e+00,  5.3119e-01,  ..., -3.1184e-01,
           2.7698e+00, -1.1215e-01],
         ...,
         [-1.7659e+00, -5.7324e-01,  7.3938e-01,  ..., -2.4080e+00,
           2.5366e+00, -1.0820e+00],
         [-1.0609e+00,  8.5476e-01,  9.2553e-01,  ..., -7.4749e-01,
           2.4329e+00, -1.4874e+00],
         [ 2.2365e+00,  2.2036e+00,  4.6491e-01,  ..., -1.9855e+00,
           2.7575e+00, -1.0860e+00]],

        [[-2.5009e+00,  2.5003e+00, -3.5937e-01,  ..., -6.9279e

d_k value 64
torch.Size([512, 8, 84, 64]) , torch.Size([512, 8, 84, 64]), torch.Size([512, 8, 84, 64])
Score shape -- torch.Size([512, 8, 84, 84])
Mask ---> torch.Size([512, 1, 84])
Mask new --> torch.Size([512, 1, 1, 84])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-1.1779,  3.3045, -1.6909,  ..., -5.4941, -0.9482, -5.4169],
         [ 2.6787,  7.4531, -1.9464,  ..., -7.2802,  0.3412, -3.3576],
         [ 1.6108,  3.6567, -2.0009,  ..., -3.0147,  0.1593, -4.4318],
         ...,
         [-0.3409,  3.1617, -2.5720,  ..., -7.5859,  0.1931, -6.0781],
         [-1.0342,  4.4838, -1.2932,  ..., -6.3567,  0.4925, -4.5262],
         [ 1.4465,  5.7535, -3.0328,  ..., -6.5524,  1.0951, -4.0748]],

        [[-1.5054,  6.1048, -3.5137,  ..., -5.1140,  1.8527, -5.0693],
         [ 0.1561,  5.8857, -1.9652,  ..., -3.2950,  1.1398, -3.7584],
         [ 0.6877,  4.1058,  0.3140,  ..., -4.7345,  0.8290, -4.2519],
         ...,
         [-0.9499,  4.9641, -1.6780,  ..

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 84, 64]), torch.Size([512, 8, 84, 64])
Score shape -- torch.Size([512, 8, 87, 84])
Mask ---> torch.Size([512, 1, 84])
Mask new --> torch.Size([512, 1, 1, 84])
d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 87, 87])
Mask new --> torch.Size([512, 1, 87, 87])
d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 84, 64]), torch.Size([512, 8, 84, 64])
Score shape -- torch.Size([512, 8, 87, 84])
Mask ---> torch.Size([512, 1, 84])
Mask new --> torch.Size([512, 1, 1, 84])
predictions size : tensor([[-0.2292, -0.4846,  0.1083,  ..., -0.4339, -0.1529, -0.0040],
        [-0.2858, -0.2494,  0.0538,  ..., -0.2331, -0.3444,  0.1076],
        [-0.3601, -0.4306,  0.1255,  ..., -0.1607, -0.1027,  0.1816],
        ...,
        [-0.2934, -0.4505,  0.1123,  ..., -0.2398, -0.1601,  0.1946],
        [-0.4554,

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
loop within encoder. Iteration -- 2
this is x in EncoderLayer-----> tensor([[[-1.5303,  5.7767, -0.1095,  ..., -0.9917,  1.2820, -0.2591],
         [-1.0521,  5.0119,  0.8585,  ..., -3.6522,  3.5069, -1.2366],
         [ 0.8268,  4.4365,  0.7392,  ..., -3.8971,  3.3313, -2.1980],
         ...,
         [ 0.2010,  4.9547,  0.3678,  ..., -2.8921,  2.9247, -0.5680],
         [-1.4846,  4.9882, -0.6139,  ..., -3.2665,  1.7647, -1.8071],
         [-1.9431,  4.0397, -1.8353,  ..., -3.3405,  2.7648, -2.0182]],

        [[-0.7663,  5.0205, -0.9938,  ..., -3.6702,  1.7254, -2.3296],
         [-1.6223,  4.2500,  1.3567,  ..., -3.2403,  2.0877, -0.2585],
         [-1.8263,  2.8464, -0.3914,  ..., -2.4129,  3.0069, -1.7037],
         ...,
         [ 0.6428,  3.1420, -1.2523,  ..

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
loop within encoder. Iteration -- 5
this is x in EncoderLayer-----> tensor([[[-0.4961,  7.4471, -2.0857,  ..., -5.3875, -0.2889, -2.6684],
         [ 0.1346,  7.3953, -1.1653,  ..., -9.2316,  2.2773, -2.3897],
         [ 4.1295,  6.5391, -0.8660,  ..., -8.6035, -0.1690, -4.5922],
         ...,
         [ 2.0030,  6.9904,  0.0881,  ..., -8.6200,  1.8019, -3.0543],
         [-0.7775,  6.4845, -0.1669,  ..., -8.9162, -0.5249, -3.9168],
         [-0.3307,  5.4973, -3.5991,  ..., -9.0150,  2.3285, -3.8469]],

        [[ 0.8564,  7.9400, -2.6708,  ..., -7.2824,  0.2890, -5.3064],
         [-0.1970,  6.2343,  0.2761,  ..., -5.5239, -0.2665, -2.4603],
         [ 0.0178,  3.9900, -0.7009,  ..., -7.5142,  0.4871, -4.5655],
         ...,
         [ 3.3425,  5.9099, -2.2782,  ..

d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 1, 91])
Mask new --> torch.Size([512, 1, 1, 91])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-2.3786e+00,  3.4904e+00, -6.1885e-01,  ..., -1.0992e+00,
           1.6910e+00,  8.9896e-02],
         [-1.8986e+00,  3.1674e+00,  6.3392e-01,  ..., -8.4930e-01,
           1.9712e+00, -1.1411e+00],
         [-1.7856e+00,  2.0645e+00, -9.0796e-01,  ..., -3.0593e-01,
           3.5770e+00,  4.6832e-01],
         ...,
         [-2.1120e+00,  2.1417e+00,  5.3600e-01,  ..., -1.1635e+00,
           2.6544e+00, -1.2760e+00],
         [ 1.2314e+00,  8.2228e-01,  1.0837e+00,  ..., -9.0137e-01,
           2.2056e+00, -1.5374e+00],
         [-6.6109e-01,  3.0258e+00,  1.3899e+00,  ..., -2.0994e+00,
           1.0549e+00, -1.0730e+00]],

        [[-4.2422e-01,  4.2658e+00, -3.1623e-01,  ..., -1.4103e

d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 1, 91])
Mask new --> torch.Size([512, 1, 1, 91])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-1.9927,  8.2514, -3.3745,  ..., -4.2828, -0.8031, -5.5509],
         [-1.6663,  6.7537, -1.8185,  ..., -5.3142, -0.1682, -6.4529],
         [-0.4363,  7.1586, -4.1643,  ..., -5.8870,  1.8929, -6.1169],
         ...,
         [-1.6139,  7.0020, -2.6688,  ..., -5.6776, -0.2857, -5.7776],
         [ 2.0611,  4.3866, -2.0962,  ..., -4.6720, -0.3388, -5.2463],
         [ 0.1626,  6.8051, -1.9176,  ..., -6.3181, -0.4288, -5.0390]],

        [[-0.7189,  9.2789, -2.8796,  ..., -7.4125,  0.5197, -6.1723],
         [-0.6316,  7.7914, -3.2696,  ..., -6.7290,  0.1756, -4.8673],
         [-0.6257,  7.0094, -1.5667,  ..., -5.8023,  0.1173, -4.8981],
         ...,
         [-0.5727,  4.7273, -0.8158,  ..

d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 92, 92])
Mask new --> torch.Size([512, 1, 92, 92])
d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 92, 91])
Mask ---> torch.Size([512, 1, 91])
Mask new --> torch.Size([512, 1, 1, 91])
d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 92, 92])
Mask new --> torch.Size([512, 1, 92, 92])
d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 92, 91])
Mask ---> torch.Size([512, 1, 91])
Mask new --> torch.Size([512, 1, 1, 91])
predictions size : tensor([[-0.4811, -0.4970, -0.0055,  ..., -0.3388, -0.4049,  0.0539],
        [-0.298

d_k value 64
torch.Size([512, 8, 85, 64]) , torch.Size([512, 8, 85, 64]), torch.Size([512, 8, 85, 64])
Score shape -- torch.Size([512, 8, 85, 85])
Mask ---> torch.Size([512, 1, 85])
Mask new --> torch.Size([512, 1, 1, 85])
loop within encoder. Iteration -- 2
this is x in EncoderLayer-----> tensor([[[-1.5456,  3.4415,  0.5971,  ..., -4.2432,  2.6462, -1.9300],
         [ 0.5021,  3.6522,  1.2839,  ..., -3.9985,  2.6825, -2.0972],
         [-1.0661,  4.4074,  0.2980,  ..., -1.4099,  3.0225, -1.4004],
         ...,
         [-0.0927,  3.6681,  0.6742,  ..., -4.2212,  2.3097, -1.7871],
         [ 0.8433,  2.4944,  0.6114,  ..., -3.6708,  2.0371, -1.4314],
         [ 1.2709,  4.1242, -0.4078,  ..., -3.7407,  3.0307, -2.5579]],

        [[ 1.8706,  6.0957, -0.3039,  ..., -2.4526,  2.7149, -2.3709],
         [ 0.9577,  3.9885,  0.0420,  ..., -3.8407,  2.5430, -2.6174],
         [-0.6922,  0.8737,  0.6911,  ..., -3.2363,  1.6924, -0.6518],
         ...,
         [-1.0996,  4.3412,  0.1011,  ..

d_k value 64
torch.Size([512, 8, 85, 64]) , torch.Size([512, 8, 85, 64]), torch.Size([512, 8, 85, 64])
Score shape -- torch.Size([512, 8, 85, 85])
Mask ---> torch.Size([512, 1, 85])
Mask new --> torch.Size([512, 1, 1, 85])
loop within encoder. Iteration -- 5
this is x in EncoderLayer-----> tensor([[[-1.0913e+00,  6.1053e+00,  1.4995e-01,  ..., -8.7613e+00,
           2.7894e+00, -4.2606e+00],
         [ 3.5400e-01,  5.2658e+00, -4.9791e-01,  ..., -9.3482e+00,
           3.6528e+00, -5.4061e+00],
         [-1.8299e-01,  6.7693e+00, -7.2480e-01,  ..., -7.6650e+00,
           1.6800e+00, -2.1314e+00],
         ...,
         [ 6.2900e-01,  7.5455e+00,  8.7472e-01,  ..., -8.9913e+00,
          -1.7647e-01, -4.4380e+00],
         [ 1.4120e+00,  3.9050e+00, -1.1472e+00,  ..., -7.3497e+00,
           3.0271e-01, -2.6759e+00],
         [ 1.7626e+00,  6.3417e+00, -3.5968e+00,  ..., -8.7077e+00,
           9.6415e-01, -4.7166e+00]],

        [[ 3.1205e+00,  8.5167e+00, -1.5639e+00,  ..., -7.5433e

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-1.8714e+00,  2.4629e+00, -2.9420e-01,  ..., -1.7496e+00,
           2.4240e+00, -1.8258e-03],
         [-1.7859e+00,  2.1182e+00, -4.0490e-01,  ..., -6.4488e-01,
           2.9770e+00,  3.1390e-01],
         [-1.2353e+00,  2.3624e+00,  5.8408e-01,  ..., -5.2998e-01,
           3.7392e+00,  7.3354e-01],
         ...,
         [ 1.8916e+00,  3.2726e+00,  8.7419e-01,  ..., -1.8915e+00,
           3.0639e+00, -5.2878e-01],
         [-1.4541e+00,  3.9330e+00, -7.1348e-01,  ..., -1.6847e+00,
           2.6949e+00, -1.0865e+00],
         [-2.1342e+00,  3.2897e+00, -6.6573e-01,  ...,  7.3700e-01,
           3.1483e+00, -5.6115e-01]],

        [[-1.8698e+00,  2.4449e+00, -5.7292e-01,  ..., -1.0023e

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
loop within encoder. Iteration -- 3
this is x in EncoderLayer-----> tensor([[[-1.3193e+00,  4.9188e+00,  3.1480e-01,  ..., -5.9032e+00,
           1.8303e+00, -3.7092e+00],
         [-1.0265e+00,  5.0405e+00, -7.7577e-01,  ..., -5.8742e+00,
           2.7929e+00, -3.0687e+00],
         [ 3.8471e-01,  5.5965e+00, -1.3990e-01,  ..., -4.6790e+00,
           2.8035e+00, -3.5361e+00],
         ...,
         [ 3.5557e+00,  6.0842e+00, -1.8293e-03,  ..., -7.4530e+00,
           1.8839e+00, -3.5025e+00],
         [-7.0567e-01,  6.0751e+00, -1.4318e+00,  ..., -6.8296e+00,
           1.5271e+00, -2.8095e+00],
         [-1.5709e+00,  4.9421e+00, -8.0466e-01,  ..., -4.5618e+00,
           1.0730e+00, -2.6790e+00]],

        [[-1.3374e+00,  6.3974e+00, -1.3536e+00,  ..., -6.9152e

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
loop within encoder. Iteration -- 5
this is x in EncoderLayer-----> tensor([[[ -1.3903,   5.6126,  -0.3704,  ...,  -7.8654,   1.0943,  -3.3908],
         [ -1.6647,   5.7798,  -2.0917,  ...,  -8.0098,   2.8262,  -1.3325],
         [ -0.0587,   6.3204,  -0.9151,  ...,  -7.6041,   4.4730,  -5.1742],
         ...,
         [  3.4992,   7.2594,  -0.5016,  ...,  -9.6756,   1.1729,  -5.7119],
         [ -1.1531,   7.8293,  -2.5328,  ...,  -8.8767,   1.1402,  -4.5389],
         [ -1.2629,   5.5799,  -0.3538,  ...,  -6.8994,   0.3492,  -4.5422]],

        [[ -1.6970,   7.7797,  -1.8598,  ...,  -8.1274,   0.8591,  -4.7080],
         [ -0.5731,   8.3185,  -0.6785,  ...,  -7.2480,   1.3575,  -4.3442],
         [  0.2278,   3.9958,  -0.1327,  ...,  -8.3495,  -0.8866,  -4.1920],


d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 88, 88])
Mask ---> torch.Size([512, 1, 88])
Mask new --> torch.Size([512, 1, 1, 88])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-1.9452e+00,  5.2834e+00, -6.5812e-01,  ..., -1.3562e+00,
           2.4810e+00, -3.3064e-01],
         [-8.3755e-01,  4.2378e+00,  9.3318e-01,  ...,  1.6966e-01,
           2.7990e+00, -1.2550e+00],
         [-1.0501e+00,  2.8490e+00,  1.4614e+00,  ..., -1.9033e-01,
           2.0257e+00, -5.1548e-02],
         ...,
         [-1.8453e+00,  3.3252e+00, -1.1286e+00,  ...,  1.7830e-01,
           2.8037e+00, -1.3425e+00],
         [-3.1085e+00,  2.3967e+00, -8.9296e-01,  ..., -1.1040e+00,
           1.8486e+00, -5.1663e-01],
         [-2.8494e+00,  1.2356e+00, -1.1534e+00,  ...,  1.0595e+00,
           2.1760e+00, -6.9222e-01]],

        [[-2.1932e+00,  2.8168e+00,  1.1945e-01,  ..., -1.6359e

d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 88, 88])
Mask ---> torch.Size([512, 1, 88])
Mask new --> torch.Size([512, 1, 1, 88])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-1.8776,  8.6270, -3.4407,  ..., -7.8015, -0.3668, -5.6261],
         [ 0.2214,  8.3490, -1.7243,  ..., -4.8817,  1.2672, -7.3220],
         [-0.6025,  8.7304, -2.1545,  ..., -5.8917, -1.3418, -5.9684],
         ...,
         [-1.4944,  9.7496, -3.8171,  ..., -6.2474,  0.0135, -6.8935],
         [-2.5098,  9.0932, -4.6295,  ..., -6.3548,  0.7252, -5.7898],
         [-1.5093,  7.2598, -4.9967,  ..., -4.8404, -0.8477, -5.8635]],

        [[-1.5729,  7.4058, -2.2549,  ..., -6.4863,  2.0264, -6.3333],
         [-0.6868,  6.0668, -1.4837,  ..., -4.8126,  0.7645, -3.6697],
         [-1.0229,  6.8784, -0.9867,  ..., -4.6031, -0.7931, -3.6531],
         ...,
         [-1.0224,  9.1051, -3.3147,  ..

d_k value 64
torch.Size([512, 8, 94, 64]) , torch.Size([512, 8, 94, 64]), torch.Size([512, 8, 94, 64])
Score shape -- torch.Size([512, 8, 94, 94])
Mask ---> torch.Size([512, 94, 94])
Mask new --> torch.Size([512, 1, 94, 94])
d_k value 64
torch.Size([512, 8, 94, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 94, 88])
Mask ---> torch.Size([512, 1, 88])
Mask new --> torch.Size([512, 1, 1, 88])
d_k value 64
torch.Size([512, 8, 94, 64]) , torch.Size([512, 8, 94, 64]), torch.Size([512, 8, 94, 64])
Score shape -- torch.Size([512, 8, 94, 94])
Mask ---> torch.Size([512, 94, 94])
Mask new --> torch.Size([512, 1, 94, 94])
d_k value 64
torch.Size([512, 8, 94, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 94, 88])
Mask ---> torch.Size([512, 1, 88])
Mask new --> torch.Size([512, 1, 1, 88])
predictions size : tensor([[-0.5431, -0.6141, -0.0941,  ..., -0.5479, -0.4186, -0.0912],
        [-0.614

d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 1, 91])
Mask new --> torch.Size([512, 1, 1, 91])
loop within encoder. Iteration -- 2
this is x in EncoderLayer-----> tensor([[[-2.6449,  5.1748, -0.9969,  ..., -3.8290,  1.8258, -2.8344],
         [-2.1691,  3.4980, -0.9987,  ..., -2.2050,  2.4439, -0.9558],
         [-1.4020,  4.7874, -0.4158,  ..., -2.7465,  2.1503, -3.1683],
         ...,
         [-1.7118,  0.7396, -0.4798,  ..., -4.5157,  2.9608, -3.2554],
         [-1.5777,  4.0100, -1.1794,  ..., -3.7457,  2.2052, -2.8601],
         [-0.2008,  5.5793, -0.1632,  ..., -2.7820,  2.2708, -3.0655]],

        [[-1.2175,  5.2090,  2.0965,  ..., -5.1404,  3.9378, -3.9468],
         [ 1.0390,  4.1585,  1.0347,  ..., -4.1727,  2.0023, -3.3988],
         [ 0.5790,  3.0890,  1.1473,  ..., -3.7335,  2.1928, -3.0502],
         ...,
         [ 1.3996,  2.8312, -0.9723,  ..

d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 1, 91])
Mask new --> torch.Size([512, 1, 1, 91])
loop within encoder. Iteration -- 5
this is x in EncoderLayer-----> tensor([[[ -2.1863,   8.0491,  -1.7766,  ...,  -8.7726,  -1.3795,  -5.2643],
         [ -1.8059,   7.2329,   0.6603,  ...,  -5.5070,   2.3880,  -3.0699],
         [ -0.5995,   8.0245,  -1.2959,  ...,  -7.6227,   2.1897,  -6.1441],
         ...,
         [ -1.4187,   3.2671,  -0.3478,  ...,  -9.8617,   0.1211,  -6.3570],
         [ -1.6488,   6.8139,  -2.3745,  ...,  -8.8127,  -1.9079,  -4.9700],
         [ -0.5214,   8.3847,  -1.7060,  ...,  -7.9625,   1.7992,  -5.2928]],

        [[  0.0895,   7.4381,  -0.5348,  ..., -10.9240,   1.9352,  -5.9091],
         [  2.9703,   7.1427,   0.2976,  ...,  -7.6295,   0.0724,  -5.2040],
         [  2.6488,   6.0379,  -0.5372,  ..., -10.7564,   0.0385,  -5.4028],


d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-1.4086,  4.1158, -0.2836,  ..., -1.5342,  2.4786, -0.0483],
         [ 0.0000,  3.1911, -0.7216,  ..., -1.6402,  2.5061, -0.4230],
         [-1.1287,  2.7834,  0.6090,  ..., -1.2901,  2.3230,  0.6143],
         ...,
         [-0.4633,  2.4606,  0.7785,  ..., -1.4066,  2.0725, -1.3408],
         [-1.0705,  3.6249,  1.2449,  ..., -1.7326,  1.6873, -1.7975],
         [-2.0130,  1.9748,  0.8033,  ..., -2.4801,  1.4893, -2.1408]],

        [[ 0.9579,  3.2353, -0.8412,  ..., -1.4823,  3.4379,  0.0705],
         [-1.2698,  2.4843,  0.5506,  ...,  1.2735,  2.9973, -1.6000],
         [-0.8395,  2.9840,  0.6271,  ..., -0.9484,  2.5812, -1.1096],
         ...,
         [-0.2770,  1.9582,  2.0508,  ..

d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-0.6121,  9.9768, -3.6174,  ..., -7.7469, -2.5564, -5.5744],
         [ 1.0015,  7.2330, -4.1955,  ..., -6.3740,  0.6102, -6.3771],
         [-0.8968,  8.1064, -2.2652,  ..., -6.9423,  0.7422, -4.0610],
         ...,
         [-1.8110,  6.8103, -1.7582,  ..., -8.2376, -0.8630, -6.2015],
         [-0.7005, 10.5129, -2.0811,  ..., -7.5168, -1.6788, -7.1722],
         [-1.6995,  5.6743, -2.2728,  ..., -9.0631, -2.0386, -5.5632]],

        [[ 1.5937,  9.2377, -4.6533,  ..., -6.9635,  0.4655, -6.9765],
         [-0.2970,  7.6094, -1.9076,  ..., -4.1504,  0.3305, -7.4949],
         [ 0.6528,  6.6810, -1.9632,  ..., -7.2652, -0.1232, -5.3838],
         ...,
         [ 0.8282,  7.1942, -0.9925,  ..

d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 95, 64]), torch.Size([512, 8, 95, 64])
Score shape -- torch.Size([512, 8, 95, 95])
Mask ---> torch.Size([512, 95, 95])
Mask new --> torch.Size([512, 1, 95, 95])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 95, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 95, 64]), torch.Size([512, 8, 95, 64])
Score shape -- torch.Size([512, 8, 95, 95])
Mask ---> torch.Size([512, 95, 95])
Mask new --> torch.Size([512, 1, 95, 95])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 95, 92])
Mask ---> torch.Size([512, 1, 92])
Mask new --> torch.Size([512, 1, 1, 92])
d_k value 64
torch.Size([512, 8, 95, 64]) , torch.Size([512, 8, 95, 64]), torch.Size([512, 8, 95, 64])
S

d_k value 64
torch.Size([512, 8, 86, 64]) , torch.Size([512, 8, 86, 64]), torch.Size([512, 8, 86, 64])
Score shape -- torch.Size([512, 8, 86, 86])
Mask ---> torch.Size([512, 1, 86])
Mask new --> torch.Size([512, 1, 1, 86])
loop within encoder. Iteration -- 2
this is x in EncoderLayer-----> tensor([[[-2.0316e+00,  3.5005e+00, -2.2128e-01,  ...,  1.1462e-01,
           1.1863e+00, -3.0005e+00],
         [-4.9837e-01,  5.6812e+00, -1.0985e+00,  ..., -4.2968e+00,
           1.7931e+00, -2.5892e+00],
         [-1.4153e+00,  3.8224e+00,  1.7933e-01,  ..., -3.8981e+00,
           2.4583e+00, -2.2093e-01],
         ...,
         [-5.7911e-01,  4.7085e+00,  2.0488e-01,  ..., -3.9183e+00,
           1.8028e+00, -1.5705e+00],
         [-6.9561e-01,  4.9688e+00,  1.0569e-01,  ..., -2.4402e+00,
           2.3169e-01, -3.0089e+00],
         [-1.8460e+00,  5.7032e+00, -1.8515e+00,  ..., -3.0403e+00,
           1.1858e+00, -2.8055e+00]],

        [[-1.1250e+00,  3.7216e+00, -1.0692e-01,  ..., -9.7649e

d_k value 64
torch.Size([512, 8, 86, 64]) , torch.Size([512, 8, 86, 64]), torch.Size([512, 8, 86, 64])
Score shape -- torch.Size([512, 8, 86, 86])
Mask ---> torch.Size([512, 1, 86])
Mask new --> torch.Size([512, 1, 1, 86])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-2.8716e+00,  5.5103e+00, -3.0064e+00,  ..., -3.5344e+00,
          -1.9971e+00, -7.2000e+00],
         [-5.5943e-01,  9.0508e+00, -2.9907e+00,  ..., -7.8927e+00,
           1.1162e-01, -6.2085e+00],
         [-1.9651e+00,  7.3072e+00, -2.0956e+00,  ..., -7.8737e+00,
           1.7796e+00, -5.2073e+00],
         ...,
         [-2.7060e-01,  6.9442e+00, -2.0952e+00,  ..., -4.4289e+00,
           3.3677e-01, -4.6016e+00],
         [-6.2382e-01,  8.9369e+00, -2.8345e+00,  ..., -6.0939e+00,
          -2.9148e+00, -6.5951e+00],
         [-1.7327e+00,  9.3810e+00, -4.5499e+00,  ..., -5.9721e+00,
          -1.3991e+00, -7.0934e+00]],

        [[-1.2907e+00,  8.1314e+00, -3.1882e+00,  ..., -3.4846e

d_k value 64
torch.Size([512, 8, 86, 64]) , torch.Size([512, 8, 86, 64]), torch.Size([512, 8, 86, 64])
Score shape -- torch.Size([512, 8, 86, 86])
Mask ---> torch.Size([512, 1, 86])
Mask new --> torch.Size([512, 1, 1, 86])
d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 92, 92])
Mask new --> torch.Size([512, 1, 92, 92])
d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 86, 64]), torch.Size([512, 8, 86, 64])
Score shape -- torch.Size([512, 8, 92, 86])
Mask ---> torch.Size([512, 1, 86])
Mask new --> torch.Size([512, 1, 1, 86])
d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 92, 64]), torch.Size([512, 8, 92, 64])
Score shape -- torch.Size([512, 8, 92, 92])
Mask ---> torch.Size([512, 92, 92])
Mask new --> torch.Size([512, 1, 92, 92])
d_k value 64
torch.Size([512, 8, 92, 64]) , torch.Size([512, 8, 86, 64]), torch.Size([512, 8, 86, 64])
S

d_k value 64
torch.Size([512, 8, 81, 64]) , torch.Size([512, 8, 81, 64]), torch.Size([512, 8, 81, 64])
Score shape -- torch.Size([512, 8, 81, 81])
Mask ---> torch.Size([512, 1, 81])
Mask new --> torch.Size([512, 1, 1, 81])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-2.0419,  3.9815, -0.2360,  ..., -1.1338,  1.4594, -0.2515],
         [-1.1442,  2.8425,  0.6401,  ..., -2.8283,  3.5570, -1.1334],
         [-1.7587,  2.3385, -0.6952,  ..., -2.4235,  4.3995, -0.3795],
         ...,
         [ 1.4079,  3.2096, -0.2205,  ..., -1.6534,  2.6366, -0.5810],
         [-2.2446,  3.3128, -1.0514,  ..., -3.3716,  2.9466, -1.3682],
         [-2.4763,  3.4620, -1.2882,  ..., -1.8136,  2.8623, -0.4536]],

        [[-2.1825,  5.3771,  0.3891,  ..., -2.8643,  2.2036, -0.9574],
         [-1.2655,  3.6055,  0.7777,  ..., -1.0037,  3.2276, -1.5651],
         [-1.0119,  2.8212,  1.3036,  ..., -1.5994,  2.9707, -0.3736],
         ...,
         [-0.1609,  2.3399,  1.1191,  ..

d_k value 64
torch.Size([512, 8, 81, 64]) , torch.Size([512, 8, 81, 64]), torch.Size([512, 8, 81, 64])
Score shape -- torch.Size([512, 8, 81, 81])
Mask ---> torch.Size([512, 1, 81])
Mask new --> torch.Size([512, 1, 1, 81])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-1.8451e+00,  1.0369e+01, -3.9081e+00,  ..., -5.9169e+00,
          -2.5256e-03, -6.5457e+00],
         [-1.2132e+00,  8.1060e+00, -8.9740e-01,  ..., -6.5003e+00,
          -3.8514e-01, -7.1204e+00],
         [-1.3921e+00,  8.9127e+00, -4.1315e+00,  ..., -8.0476e+00,
           3.0215e+00, -6.5487e+00],
         ...,
         [ 7.5937e-01,  8.2163e+00, -3.5631e+00,  ..., -7.2516e+00,
           4.6499e-01, -6.6895e+00],
         [-2.7515e+00,  7.1603e+00, -4.5648e+00,  ..., -8.5904e+00,
          -2.1502e-01, -7.5339e+00],
         [-3.0169e+00,  7.5358e+00, -2.3504e+00,  ..., -6.7524e+00,
          -1.1137e+00, -5.9869e+00]],

        [[-1.0595e+00,  1.1124e+01, -2.6021e+00,  ..., -9.0521e

d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 81, 64]), torch.Size([512, 8, 81, 64])
Score shape -- torch.Size([512, 8, 88, 81])
Mask ---> torch.Size([512, 1, 81])
Mask new --> torch.Size([512, 1, 1, 81])
d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 88, 88])
Mask ---> torch.Size([512, 88, 88])
Mask new --> torch.Size([512, 1, 88, 88])
d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 81, 64]), torch.Size([512, 8, 81, 64])
Score shape -- torch.Size([512, 8, 88, 81])
Mask ---> torch.Size([512, 1, 81])
Mask new --> torch.Size([512, 1, 1, 81])
d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 88, 64]), torch.Size([512, 8, 88, 64])
Score shape -- torch.Size([512, 8, 88, 88])
Mask ---> torch.Size([512, 88, 88])
Mask new --> torch.Size([512, 1, 88, 88])
d_k value 64
torch.Size([512, 8, 88, 64]) , torch.Size([512, 8, 81, 64]), torch.Size([512, 8, 81, 64])
S

loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[ 0.0084,  2.8232, -1.0755,  ..., -2.8687,  1.7354,  0.1416],
         [-0.6254,  3.6351, -0.5473,  ..., -2.2876,  2.4121, -0.4240],
         [-1.5210,  0.3798, -0.4451,  ..., -3.2654,  3.8527, -0.5071],
         ...,
         [-0.4385,  3.6921,  0.4902,  ..., -1.7264,  2.5072, -1.2673],
         [ 0.0660,  3.2967, -0.5366,  ..., -1.5814,  2.7425, -0.7737],
         [-2.8068,  1.8081, -0.8008,  ..., -0.8910,  2.3511, -0.3533]],

        [[-1.0291,  3.9112, -1.1216,  ..., -2.1113,  3.0937, -1.1448],
         [-0.5831,  4.2026,  1.3910,  ..., -1.3356,  3.0821, -1.0080],
         [-1.5293,  3.1737,  0.3857,  ..., -0.8122,  4.4894, -0.1521],
         ...,
         [-0.4043,  3.6340,  0.6489,  ..., -1.7954,  4.0369, -0.0728],
         [-1.6148,  3.8397, -0.8891,  ..., -1.6617,  3.2306, -1.2735],
         [-2.0064,  2.1160, -1.5851,  ..., -2.0451,  0.9496, -0.5718]],

        [[-2.7727,  4.9227, -0.4013,  ..., -2.24

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-7.4129e-01,  9.2650e+00, -2.6889e+00,  ..., -8.5543e+00,
          -1.3876e+00, -6.0387e+00],
         [ 2.7539e-01,  1.0774e+01, -3.7933e+00,  ..., -7.2091e+00,
           8.0265e-01, -7.3942e+00],
         [-1.6493e+00,  6.4160e+00, -3.3589e+00,  ..., -1.0230e+01,
           6.3865e-02, -7.4399e+00],
         ...,
         [ 9.6786e-03,  8.9863e+00, -2.3740e+00,  ..., -8.1008e+00,
          -1.3254e+00, -7.1479e+00],
         [ 4.3467e-01,  9.8898e+00, -4.0513e+00,  ..., -7.5778e+00,
          -1.1391e+00, -4.7385e+00],
         [-4.9800e-01,  5.4470e+00, -2.0796e+00,  ..., -6.9329e+00,
           2.0740e+00, -5.1602e+00]],

        [[ 1.0762e-01,  9.4679e+00, -3.9569e+00,  ..., -6.7164e

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 91, 91])
Mask new --> torch.Size([512, 1, 91, 91])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 91, 87])
Mask ---> torch.Size([512, 1, 87])
Mask new --> torch.Size([512, 1, 1, 87])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 91, 64]), torch.Size([512, 8, 91, 64])
Score shape -- torch.Size([512, 8, 91, 91])
Mask ---> torch.Size([512, 91, 91])
Mask new --> torch.Size([512, 1, 91, 91])
d_k value 64
torch.Size([512, 8, 91, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
S

Score shape -- torch.Size([3, 8, 47, 47])
Mask ---> torch.Size([3, 47, 47])
Mask new --> torch.Size([3, 1, 47, 47])
d_k value 64
torch.Size([3, 8, 47, 64]) , torch.Size([3, 8, 38, 64]), torch.Size([3, 8, 38, 64])
Score shape -- torch.Size([3, 8, 47, 38])
Mask ---> torch.Size([3, 1, 38])
Mask new --> torch.Size([3, 1, 1, 38])
d_k value 64
torch.Size([3, 8, 47, 64]) , torch.Size([3, 8, 47, 64]), torch.Size([3, 8, 47, 64])
Score shape -- torch.Size([3, 8, 47, 47])
Mask ---> torch.Size([3, 47, 47])
Mask new --> torch.Size([3, 1, 47, 47])
d_k value 64
torch.Size([3, 8, 47, 64]) , torch.Size([3, 8, 38, 64]), torch.Size([3, 8, 38, 64])
Score shape -- torch.Size([3, 8, 47, 38])
Mask ---> torch.Size([3, 1, 38])
Mask new --> torch.Size([3, 1, 1, 38])
d_k value 64
torch.Size([3, 8, 47, 64]) , torch.Size([3, 8, 47, 64]), torch.Size([3, 8, 47, 64])
Score shape -- torch.Size([3, 8, 47, 47])
Mask ---> torch.Size([3, 47, 47])
Mask new --> torch.Size([3, 1, 47, 47])
d_k value 64
torch.Size([3, 8, 47, 6

d_k value 64
torch.Size([512, 8, 83, 64]) , torch.Size([512, 8, 83, 64]), torch.Size([512, 8, 83, 64])
Score shape -- torch.Size([512, 8, 83, 83])
Mask ---> torch.Size([512, 1, 83])
Mask new --> torch.Size([512, 1, 1, 83])
loop within encoder. Iteration -- 1
this is x in EncoderLayer-----> tensor([[[-3.2083,  2.7163, -0.4514,  ..., -1.0945,  2.0085, -0.7440],
         [-1.2089,  3.4290, -0.2336,  ...,  0.2839,  3.2408, -0.1589],
         [-1.2459, -0.0802, -0.2212,  ..., -0.1571,  3.2763, -1.8929],
         ...,
         [-2.6418,  3.2105, -1.0993,  ..., -1.9193,  1.5238, -0.5554],
         [-2.4292,  2.1361,  1.5128,  ..., -0.0404,  2.7486, -1.6413],
         [-0.9099,  1.7086,  0.5694,  ..., -2.0088,  2.3590, -1.0589]],

        [[-1.7956,  4.3440, -0.3535,  ..., -0.6036,  2.4987,  0.8666],
         [-1.6941,  3.1758,  0.5202,  ..., -1.4637,  2.5755, -0.5400],
         [-1.1819,  2.0917,  0.9725,  ..., -1.1468,  3.0812,  0.3517],
         ...,
         [-1.9340,  3.0544, -0.8645,  ..

d_k value 64
torch.Size([512, 8, 83, 64]) , torch.Size([512, 8, 83, 64]), torch.Size([512, 8, 83, 64])
Score shape -- torch.Size([512, 8, 83, 83])
Mask ---> torch.Size([512, 1, 83])
Mask new --> torch.Size([512, 1, 1, 83])
loop within encoder. Iteration -- 4
this is x in EncoderLayer-----> tensor([[[-4.1193,  6.6429, -3.8450,  ..., -7.2528, -1.2976, -6.2127],
         [-2.1995,  9.5656, -3.7744,  ..., -3.4927, -1.5488, -5.9292],
         [ 0.1318,  5.3354, -2.0962,  ..., -6.0749,  1.8140, -7.6801],
         ...,
         [-2.5728, 10.1480, -2.2318,  ..., -6.7148, -3.0298, -6.0873],
         [-1.9343,  8.2127, -1.1278,  ..., -5.7940, -0.7903, -6.3595],
         [-1.2767,  7.9044, -2.0837,  ..., -8.1152, -2.0378, -6.9550]],

        [[ 0.4073,  9.8701, -2.8329,  ..., -5.2840, -2.0701, -5.7113],
         [-0.8818,  7.9978, -1.6759,  ..., -7.1254, -0.9227, -7.2638],
         [ 0.5060,  8.1224, -1.2687,  ..., -6.8281, -0.8565, -5.6385],
         ...,
         [-0.0898,  9.0096, -4.4428,  ..

d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 87, 87])
Mask new --> torch.Size([512, 1, 87, 87])
d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 83, 64]), torch.Size([512, 8, 83, 64])
Score shape -- torch.Size([512, 8, 87, 83])
Mask ---> torch.Size([512, 1, 83])
Mask new --> torch.Size([512, 1, 1, 83])
d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 87, 64]), torch.Size([512, 8, 87, 64])
Score shape -- torch.Size([512, 8, 87, 87])
Mask ---> torch.Size([512, 87, 87])
Mask new --> torch.Size([512, 1, 87, 87])
d_k value 64
torch.Size([512, 8, 87, 64]) , torch.Size([512, 8, 83, 64]), torch.Size([512, 8, 83, 64])
Score shape -- torch.Size([512, 8, 87, 83])
Mask ---> torch.Size([512, 1, 83])
Mask new --> torch.Size([512, 1, 1, 83])
predictions size : tensor([[-0.9196, -1.0088, -0.5155,  ..., -0.6532, -0.7871, -0.2975],
        [-0.978

KeyboardInterrupt: 
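
The shapes printed in the log above can be reproduced in isolation. This is a minimal sketch, not the notebook's own attention code: the helper name `attention_scores` is hypothetical, the batch size is shrunk from 512 to 2 to keep it cheap, and it assumes the usual scaled dot-product form with `d_k = d_model / h = 512 / 8 = 64`.

```python
import math
import torch

def attention_scores(query, key):
    """Scaled dot-product scores.

    (batch, h, len_q, d_k) x (batch, h, len_k, d_k) -> (batch, h, len_q, len_k)
    """
    d_k = query.size(-1)
    return torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

batch, heads, d_k = 2, 8, 64   # batch is 512 in the log; 2 here (assumption, for speed)
src_len, tgt_len = 92, 95      # lengths taken from one logged batch

enc = torch.randn(batch, heads, src_len, d_k)   # stands in for projected encoder states
dec = torch.randn(batch, heads, tgt_len, d_k)   # stands in for projected decoder states

self_scores = attention_scores(enc, enc)    # encoder self-attention
cross_scores = attention_scores(dec, enc)   # decoder queries against encoder keys

print(self_scores.shape)    # torch.Size([2, 8, 92, 92])
print(cross_scores.shape)   # torch.Size([2, 8, 95, 92])
```

The last two dimensions of the score tensor are always (query length, key length), which is why the log alternates between square shapes for self-attention and rectangular shapes such as `[512, 8, 95, 92]` for encoder-decoder attention.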

In [None]:
matrix = torch.randn(1, 16).reshape(4, 4)  # 4x4 matrix of standard-normal samples

In [None]:
matrix.dim()

In [None]:
matrix.size()

In [None]:
matrix

In [None]:
matrix_2 = matrix.unsqueeze(-2)  # insert a new axis before the last one: (4, 4) -> (4, 1, 4)

In [None]:
matrix_2.size()  # 4 batches, 1 row, 4 columns

In [None]:
matrix_2

In [None]:
matrix_3 = matrix.unsqueeze(-1)  # insert a new trailing axis: (4, 4) -> (4, 4, 1)
matrix_3.size()

In [None]:
matrix_3  # 4 batches, 4 rows, 1 column

In [None]:
matrix.unsqueeze(-3).size()  # 1 batch, 4 rows, 4 columns
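
The `unsqueeze` cells above mirror the `Mask --->` / `Mask new -->` lines in the training log, where a padding mask of shape `(batch, 1, src_len)` becomes `(batch, 1, 1, src_len)` so that it broadcasts over all 8 heads and all query positions. The sketch below illustrates that step under two assumptions that are not confirmed by this section: the mask is `True` for real tokens, and it is applied with `masked_fill` before the softmax. The shapes are taken from the small batch in the log.

```python
import torch

batch, heads, tgt_len, src_len = 3, 8, 47, 38   # shapes from the small batch in the log

scores = torch.randn(batch, heads, tgt_len, src_len)

# Padding mask: True for real tokens, False for padding (assumed convention).
src_mask = torch.ones(batch, 1, src_len, dtype=torch.bool)
src_mask[0, 0, -5:] = False   # pretend the last 5 tokens of sentence 0 are padding

mask = src_mask.unsqueeze(1)               # (3, 1, 38) -> (3, 1, 1, 38)
masked = scores.masked_fill(~mask, -1e9)   # broadcasts over heads and query positions
weights = torch.softmax(masked, dim=-1)    # padded keys end up with zero weight

print(mask.shape)      # torch.Size([3, 1, 1, 38])
print(weights.shape)   # torch.Size([3, 8, 47, 38])
```

Without the extra axis, a `(3, 1, 38)` mask would be aligned against the last three dimensions `(8, 47, 38)` of the scores, so the batch dimension of the mask would be matched against the head dimension and the broadcast would fail.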