# Explanation

This notebook tests the classes implemented in the transformer_classes.py file.

Remember:
We need to implement the following, as described in the "Attention Is All You Need" paper (available at https://arxiv.org/pdf/1706.03762.pdf):

- Transformer Architecture
- Scaled Dot-Product Attention
- Multi-Head Attention

Plus everything else they require, such as positional encoding.


## I. Build Vocab from GloVe & Embedding

First we load and set up GloVe and our vocab to get the first "brick" for the embedding.

For this test, I load the GloVe pretrained weights and build the vocab from them.

#### 1st Version

Uses GloVe and the same data cleaning method as the main branch, which leaves many unknown words.

In [2]:
# Main/global imports
import torch
import numpy as np
import pandas as pd

from textfn import *
from classes import *
from tranformer_classes import *

%load_ext autoreload
%autoreload 2

In [52]:
df = loadDts('dataset/train_processed.csv')

In [56]:
d_model = 50  # Renamed from embedding_dim
glove_path = 'glove_pretrained/glove.6B.{}d.txt'.format(d_model)
vocab_size = 10000
max_seq_length = 20

# Pre-allocate the embedding matrix; rows that are never filled stay at zero
embeddings = np.zeros((vocab_size+2, d_model))
word_to_index = {}
index = 0
with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        # Stop once vocab_size-2 GloVe words have been loaded
        if index >= vocab_size-2:
            break
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float16")
        embeddings[index] = vector
        word_to_index[word] = index
        index += 1
# <unk> and <pad> get all-zero vectors (row `index` itself is left unused)
embeddings[index+1] = np.zeros(d_model)
embeddings[index+2] = np.zeros(d_model)
word_to_index['<unk>'] = index+1
word_to_index['<pad>'] = index+2
vocab_size += 2

In [57]:
data = TranformerGloveDataset(df, max_seq_length, word_to_index, train=True)
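
To make the role of `<unk>` and `<pad>` concrete, here is a minimal sketch of how a sentence could be turned into a fixed-length list of indices. The helper `encode_sentence` and the toy vocab are hypothetical and only for illustration; the actual dataset class may tokenize and pad differently.

```python
def encode_sentence(sentence, word_to_index, max_seq_length):
    """Map a whitespace-tokenized sentence to a fixed-length list of indices.

    Hypothetical helper: assumes lowercase whitespace tokenization and that
    word_to_index contains the '<unk>' and '<pad>' entries.
    """
    unk = word_to_index['<unk>']
    pad = word_to_index['<pad>']
    # Truncate sentences longer than max_seq_length
    tokens = sentence.lower().split()[:max_seq_length]
    # Words missing from the vocab fall back to <unk>
    ids = [word_to_index.get(tok, unk) for tok in tokens]
    # Right-pad shorter sentences with <pad>
    return ids + [pad] * (max_seq_length - len(ids))


toy_vocab = {'the': 0, 'cat': 1, '<unk>': 2, '<pad>': 3}
print(encode_sentence('The cat sleeps', toy_vocab, 5))  # [0, 1, 2, 3, 3]
```

With a cleaning method that leaves many words outside the vocab, most of those words end up mapped to the single `<unk>` index, which is the limitation noted for this first version.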

### I. i. Embedding Handling

Now that the vocab is built, we need to handle the embedded values of the words.

#### Method 1:

Use GloVe's pretrained weights and freeze the layer in the Embedder class.

#### Method 2:

Same as Method 1, but done directly in the Encoder class using its embedding layer.

For now, we use Method 1.
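
As a rough illustration of Method 1, the sketch below wraps pretrained vectors in a frozen `nn.Embedding`. The class name `FrozenEmbedder`, the `pad_index` argument, and the random toy matrix are assumptions made for this example; this is not the `Embedder` class from the project.

```python
import numpy as np
import torch
import torch.nn as nn


class FrozenEmbedder(nn.Module):
    """Hypothetical embedding layer initialized from a pretrained matrix."""

    def __init__(self, embeddings, pad_index=None):
        super().__init__()
        weights = torch.as_tensor(embeddings, dtype=torch.float32)
        # freeze=True sets requires_grad=False, so the vectors are not trained
        self.embed = nn.Embedding.from_pretrained(
            weights, freeze=True, padding_idx=pad_index
        )

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, seq_len, d_model)
        return self.embed(token_ids)


# Toy stand-in for the GloVe matrix: 12 "words" with 50 dimensions each
toy_embeddings = np.random.rand(12, 50)
embedder = FrozenEmbedder(toy_embeddings, pad_index=11)
batch = torch.randint(0, 12, (4, 20))
print(embedder(batch).shape)  # torch.Size([4, 20, 50])
```

Freezing keeps the pretrained vectors intact during training. The trade-off is that the `<unk>` and `<pad>` rows, which start at zero, also stay at zero.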

## II. Positional Encoding

Positional encoding is a matrix that encodes the position of each word in the sentence.

It is defined as:

$$
PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}})
$$
and
$$
PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}})
$$

As stated in the original paper:
"The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed."


So the PE matrix has one row per position and one column per embedding dimension: its shape is **sentence size** by **embedding size** ($d_{model}$).
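
The two formulas can be sketched as a small function that builds the whole matrix at once. The function name `sinusoidal_positional_encoding` is hypothetical, and the sketch assumes an even $d_{model}$; it is not necessarily how the project's class computes it.

```python
import math
import torch


def sinusoidal_positional_encoding(max_seq_length, d_model):
    """Build a (max_seq_length, d_model) sinusoidal PE matrix.

    Illustrative sketch; assumes d_model is even.
    """
    # Column vector of positions: shape (max_seq_length, 1)
    position = torch.arange(max_seq_length, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(2i / d_model), computed in log space for each even index 2i
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns: cosine
    return pe


pe = sinusoidal_positional_encoding(10, 6)
x = torch.randn(5, 10, 6)  # (batch, sentence size, d_model)
out = x + pe               # pe broadcasts over the batch dimension
print(out.shape)           # torch.Size([5, 10, 6])
```

Because the matrix depends only on the sentence size and $d_{model}$, it can be computed once and reused for every batch.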

##### Method 1

Create a class that computes the positional encoding.

In [8]:
# Example to check that the positional encoder works
data = torch.randn(5, 10, 6)
pos_enc = PositionalEncoder(10, 6)
encoded_input = pos_enc(data)

print(encoded_input.size())

torch.Size([5, 10, 6])


## III. Multi-Head Attention & Scaled Dot-Product Attention 