# Explanation

This notebook is used to test the classes implemented in the transformer_classes.py file.

Remember:
We need to implement the following (from the "Attention Is All You Need" paper, available at https://arxiv.org/pdf/1706.03762.pdf):

- Transformer Architecture
- Scaled Dot-Product Attention
- Multi-Head Attention

And everything else required, such as positional encoding.


# 1. Function Classes
## I. Build Vocab from Glove & Embedding

First, we'll load GloVe and set up our vocab to get the first "brick" for the embedding.

For this test, I'll load the GloVe pretrained weights and build the vocab from them.

#### 1st Version

Using GloVe and the same data cleaning method as the one in the main branch, which leaves lots of unknown words.

In [1]:
# Main/global imports
import torch
import numpy as np
import pandas as pd

from textfn import *
from classes import *
from tranformer_classes import *

from torch.utils.data import DataLoader
from torchmetrics import F1Score

%load_ext autoreload
%autoreload 2

In [2]:
df = loadDts('dataset/train_processed.csv')

In [3]:
d_model = 50 # Renamed from embedding_dim
glove_path = 'glove_pretrained/glove.6B.{}d.txt'.format(d_model)
vocab_size = 10000
max_seq_length = 20

embedding_weights = np.zeros((vocab_size+2, d_model))
word_to_index = {}
index=0
with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        if index >= vocab_size-2:
            break
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float16")
        embedding_weights[index] = vector
        word_to_index[word] = index
        index += 1
embedding_weights[index+1] = np.zeros(d_model)
embedding_weights[index+2] = np.zeros(d_model)
word_to_index['<unk>'] = index+1
word_to_index['<pad>'] = index+2
vocab_size+=2
embedding_weights = torch.tensor(embedding_weights)


In [4]:
data = TranformerGloveDataset(df, max_seq_length, word_to_index, train=True)

In [5]:
data[2]

(tensor([ 3699,  1712,  5125,   241,  9999,   941,  6904,  5125,   241,   460,
          1543, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]),
 tensor(1.))
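
The item above is a fixed-length sequence of vocab indices followed by its label, where 9999 is `<unk>` and 10000 is `<pad>`. A minimal sketch of how such a sequence could be produced is given below. The `encode_sentence` helper and the toy vocab are hypothetical and only illustrate the idea; they are not the actual `TranformerGloveDataset` code.

```python
def encode_sentence(sentence, word_to_index, max_seq_length):
    """Map a whitespace-tokenized sentence to a fixed-length list of indices."""
    unk = word_to_index["<unk>"]
    pad = word_to_index["<pad>"]
    # Words missing from the vocab fall back to <unk>; long sentences are truncated.
    ids = [word_to_index.get(tok, unk) for tok in sentence.lower().split()]
    ids = ids[:max_seq_length]
    # Short sentences are right-padded with <pad>.
    return ids + [pad] * (max_seq_length - len(ids))


# Toy vocabulary (hypothetical): regular words first, then <unk> and <pad>.
toy_vocab = {"forest": 0, "fire": 1, "near": 2, "<unk>": 3, "<pad>": 4}
encoded = encode_sentence("Forest fire near LaRonge", toy_vocab, 6)
print(encoded)  # [0, 1, 2, 3, 4, 4]
```

The more words fall back to `<unk>`, the less information the embedding layer receives, which is why the cleaning method matters here.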

### I. i. Embedding Handling

As we built the vocab, we'll need to handle the embedded values of the words.

#### Method 1:

Using Glove's pretrained weights and freezing the layer in the class Embedder

#### Method 2:

Same as Method 1, but placed directly in the Encoder class using the embedding layer

For now, using Method 1 
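
As a rough illustration of Method 1, the sketch below loads a pretrained matrix into a frozen `nn.Embedding`. The random `toy_weights` tensor and `pad_index` are stand-ins (assumptions) for the GloVe matrix and `<pad>` index built earlier, and this is not necessarily how the `Embedder` class does it.

```python
import torch
import torch.nn as nn

# Stand-in for the GloVe matrix: 5 words + <unk> + <pad>, with d_model = 4.
toy_weights = torch.randn(7, 4, dtype=torch.float64)
pad_index = 6  # hypothetical index of <pad> in this toy vocab

# freeze=True keeps the pretrained vectors fixed during training.
# .float() casts to float32, the default dtype of the other layers.
embed = nn.Embedding.from_pretrained(
    toy_weights.float(), freeze=True, padding_idx=pad_index
)

tokens = torch.tensor([[0, 3, 6, 6]])   # (batch, seq_len)
vectors = embed(tokens)                 # (batch, seq_len, d_model)
print(vectors.shape, embed.weight.requires_grad)
```

Since the layer is frozen, the optimizer never updates these vectors; only the rest of the network is trained.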

## II. Positional Encoding

Positional Encoding is a matrix that encodes the position of each word in the sentence.

It's defined as:

$$
PE_{(pos, 2i)} = \sin\left(pos/10000^{2i/d_{model}}\right)
$$
And
$$
PE_{(pos, 2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right)
$$

As stated in the original paper: 
"The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed."


So the dimensions of the PE matrix are the **sentence size** and the **embedding size**, i.e. $d_{model}$.

##### Method 1

Create a class that computes the positional encoding

In [6]:
# # Example to check if positional encoder work
# data = torch.randn(5, 10, 6)
# pos_enc = PositionalEncoder(10, 6)
# encoded_input = pos_enc(data)

# print(encoded_input.size())
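
The two formulas above can be sketched as a standalone function. This is only an illustration under the assumption that $d_{model}$ is even; the function name `positional_encoding` is hypothetical and is not the `PositionalEncoder` class from the project.

```python
import math
import torch


def positional_encoding(max_seq_length, d_model):
    """Return the (max_seq_length, d_model) sinusoidal matrix (d_model assumed even)."""
    pos = torch.arange(max_seq_length, dtype=torch.float32).unsqueeze(1)  # (L, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)              # 2i = 0, 2, 4, ...
    div = torch.exp(-math.log(10000.0) * two_i / d_model)                  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_seq_length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even columns: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd columns: cosine
    return pe


pe = positional_encoding(10, 6)
x = torch.randn(5, 10, 6)        # (batch, seq_len, d_model), same shapes as the test above
encoded = x + pe.unsqueeze(0)    # broadcast the same matrix over the batch
print(encoded.shape)             # torch.Size([5, 10, 6])
```

Because the matrix depends only on the position and the dimension, it can be computed once and reused for every batch.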

## III. Multi-Head Attention & Scaled Dot-Product Attention 

Multi-Head Attention is one of the "main" components of the transformer network.

It uses a set of trained matrices, each handling a specific role in the network:
- **Queries (Q)**: What each token is looking for, i.e. its relationships and dependencies with the other tokens in the sequence.
- **Keys (K)**: The information each token exposes, which the queries are compared against when computing scores.
- **Values (V)**: The content that gets combined in the weighted sum, using the attention scores as weights.

Those matrices are computed from the inputs' embedding vectors (with the positional encoding added) through learned linear projections.


In Multi-Head Attention, we split the embedding into multiple parts (or **heads**), where $N$ is the number of heads. $d_k$ refers to the last dimension of each head, with $d_k = d_{model}/N$, so $d_{model}$ must be divisible by $N$.

**Dropout**: As the original paper states: "_We apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model we use a rate of_ $P_{drop}=0.1$"

### III. i Scaled Dot-Product Attention

As stated in the original paper, this attention is computed as: $$Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Keep in mind that both a mask and dropout can be added to this computation.
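
A minimal sketch of this formula with the optional mask and dropout follows. The function name `attention` and its arguments are assumptions made for illustration, and the mask convention (0 marks positions to ignore) is one common choice, not necessarily the one used in the project.

```python
import math
import torch
import torch.nn.functional as F


def attention(q, k, v, mask=None, dropout_p=0.0, training=False):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional mask and dropout."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., len_q, len_k)
    if mask is not None:
        # Positions where mask == 0 (e.g. <pad>) get -inf, hence a weight of 0.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    weights = F.dropout(weights, p=dropout_p, training=training)
    return weights @ v, weights


q = k = v = torch.randn(2, 4, 8)                        # (batch, seq_len, d_k)
mask = torch.tensor([[1, 1, 1, 0]] * 2).unsqueeze(1)    # last token is padding, shape (2, 1, 4)
out, w = attention(q, k, v, mask=mask)
print(out.shape, w[0, 0])   # the last weight of each row is 0
```

Dividing by $\sqrt{d_k}$ keeps the scores in a range where the softmax does not saturate as $d_k$ grows.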

#### Method 1 

Make the Multi-Head Attention class and implement the Attention (Scaled Dot-Product Attention) as a function inside it.

#### Method 2 

Same as Method 1, but with the Attention in a separate class.
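
To make the head split concrete, here is a compact sketch in the spirit of Method 1, with the attention computed inside the class. The class name `MultiHeadAttentionSketch` and its layer names are hypothetical; it is an illustration, not the project's implementation.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttentionSketch(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        assert d_model % heads == 0, "d_model must be divisible by the number of heads"
        self.heads, self.d_k = heads, d_model // heads
        # Learned projections producing Q, K and V, plus the final output projection.
        self.q_lin = nn.Linear(d_model, d_model)
        self.k_lin = nn.Linear(d_model, d_model)
        self.v_lin = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        b, s, _ = x.shape
        return x.view(b, s, self.heads, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        q = self.split(self.q_lin(q))
        k = self.split(self.k_lin(k))
        v = self.split(self.v_lin(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        ctx = weights @ v                                   # (batch, heads, seq, d_k)
        b, _, s, _ = ctx.shape
        # Concatenate the heads back into a single d_model-sized vector per token.
        ctx = ctx.transpose(1, 2).contiguous().view(b, s, self.heads * self.d_k)
        return self.out(ctx)


mha = MultiHeadAttentionSketch(d_model=50, heads=5).eval()
x = torch.randn(3, 20, 50)
print(mha(x, x, x).shape)   # torch.Size([3, 20, 50])
```

In this sketch, `d_model = 50` only accepts head counts that divide 50 (1, 2, 5, 10, 25, 50); a value such as 6 would fail the divisibility check.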




## IV. Feed-Forward Network

The Feed-Forward "layer" serves to deepen the whole network by using Linear layers.

As stated in the original paper: "_This consists of two linear transformations with a ReLU activation in between_"

The default value of $d_{ff}$ is stated in the original paper: "_[...] and the inner-layer has dimensionality_ $d_{ff} = 2048$"
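
The description above maps almost directly to code. The sketch below is one possible layout; the class name `FeedForwardSketch` is hypothetical, and placing a dropout between the two linear layers is an assumption borrowed from common implementations, not something the quotes above require.

```python
import torch
import torch.nn as nn


class FeedForwardSketch(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to the inner dimension d_ff
            nn.ReLU(),
            nn.Dropout(dropout),        # assumption: dropout between the two layers
            nn.Linear(d_ff, d_model),   # project back to d_model
        )

    def forward(self, x):
        # Applied to every position independently: only the last dimension changes.
        return self.net(x)


ff = FeedForwardSketch(d_model=50)
out = ff(torch.randn(3, 20, 50))
print(out.shape)   # torch.Size([3, 20, 50])
```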



## V. Normalization 

Normalization is important for our network: it keeps the values from varying too much, so the model can train faster and better.

The original paper states that it uses Layer Normalization.
To implement LN, we need to implement the following:
$$LN(z;\alpha,\beta) = \frac{z-\mu}{\sigma}\odot \alpha + \beta$$

where $\mu$ and $\sigma$ are the mean and standard deviation of $z$ over its features, and $\alpha$ and $\beta$ are the learned gain and bias.

This can be found in the "Layer Normalization" paper, page 13, formulas 15 and 16, [HERE](https://arxiv.org/pdf/1607.06450.pdf)
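
A direct transcription of this formula could look like the sketch below. The class name `LayerNormSketch` is hypothetical, and the small `eps` added to $\sigma$ is an assumption for numerical stability that does not appear in the formula itself.

```python
import torch
import torch.nn as nn


class LayerNormSketch(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(d_model))   # gain
        self.beta = nn.Parameter(torch.zeros(d_model))   # bias
        self.eps = eps  # assumption: avoids a division by zero

    def forward(self, z):
        # Statistics are computed per token, over the feature dimension.
        mu = z.mean(dim=-1, keepdim=True)
        sigma = ((z - mu) ** 2).mean(dim=-1, keepdim=True).sqrt()
        return (z - mu) / (sigma + self.eps) * self.alpha + self.beta


z = torch.randn(3, 20, 50)
out = LayerNormSketch(50)(z)
# Each token now has (approximately) zero mean and unit standard deviation.
print(out.mean(-1).abs().max().item())
```

PyTorch's built-in `nn.LayerNorm` computes the same quantity, except that it adds `eps` inside the square root, so the two outputs differ only marginally.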



# 2. Main blocks and Architecture

## I. Encoder/Decoder & Problem

The Encoder and Decoder blocks differ slightly:
- The Encoder has "only" 1 Multi-Head Attention and 1 Feed Forward
- The Decoder has 2 Multi-Head Attentions and 1 Feed Forward, and receives the Encoder output

What the two have in common is the skip (residual) connections and the layers used.
So there is no particular difficulty in implementing the blocks.

With all previous classes implemented, we can build the Encoder/Decoder classes by adding our embeddings/positional encoding and using copy.deepcopy() together with nn.ModuleList() to create multiple independent blocks/modules for our model to work with.
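
The cloning idea can be sketched as follows. `get_clones` and `ResidualBlock` are hypothetical names, and the block is a deliberately reduced stand-in (one linear sub-layer with a skip connection, normalization applied before the sub-layer), not the real encoder block.

```python
import copy
import torch
import torch.nn as nn


def get_clones(module, n):
    """Return n independent copies of `module`, registered as sub-modules."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])


class ResidualBlock(nn.Module):
    """Reduced stand-in for an encoder block: x + sublayer(norm(x))."""

    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.ff(self.norm(x))   # skip connection around the sub-layer


blocks = get_clones(ResidualBlock(50), 5)
x = torch.randn(3, 20, 50)
for block in blocks:
    x = block(x)

# Each copy owns its parameters: equal at initialization, but distinct tensors.
print(len(blocks), blocks[0].ff.weight is blocks[1].ff.weight)   # 5 False
```

Using `nn.ModuleList` rather than a plain Python list matters: it registers the copies so that `model.parameters()` and `.to(device)` see them.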

Although the current transformer is "finished", the reference used implemented it for sequence-to-sequence tasks, whereas Disaster Tweet is a Sentiment Analysis task, so we need to change a few things to adapt it.

## II. Changes for Sentiment Analysis

### i. Transformer class

As sentiment analysis is a Many-to-One setup, we don't need the Decoder part of the transformer.
So we'll create a new class for sentiment analysis without the decoder, and adapt the output Linear layer and the forward computation for a binary output.

#### a. Different setups

There are different setups in NLP:
- **Many-to-One**: Takes a sequence and maps it to a single output, such as one of two or more classes (Ex: Sentiment Analysis)
- **Many-to-Many**: Both input and output are sequences (Ex: Machine Translation)
- **One-to-Many**: Input is a single value and output is a sequence (Ex: Image Captioning) 

In this case it's Many-to-One, so we don't need the Decoder blocks.

#### b. Parameters

Change the parameters to take a vocab as input and the number of expected classes as output.
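
One way to turn the encoder output into a Many-to-One prediction is sketched below. The `ClassificationHead` class, the mean pooling over non-padded tokens, and the toy `<pad>` index 99 are all assumptions made for illustration; the actual `SentimentAnalysisTransformer` may pool and project differently.

```python
import torch
import torch.nn as nn


class ClassificationHead(nn.Module):
    """Hypothetical head: pools the encoder output and maps it to probabilities."""

    def __init__(self, d_model, num_classes=1):
        super().__init__()
        self.out = nn.Linear(d_model, num_classes)

    def forward(self, enc_out, pad_mask):
        # enc_out: (batch, seq, d_model); pad_mask: (batch, seq), True where <pad>
        keep = (~pad_mask).unsqueeze(-1).float()
        # Mean over the real tokens only, so padding does not dilute the result.
        pooled = (enc_out * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return torch.sigmoid(self.out(pooled)).squeeze(-1)   # (batch,), values in (0, 1)


tokens = torch.tensor([[5, 8, 13, 99, 99], [1, 2, 3, 4, 5]])   # 99 = toy <pad> index
enc_out = torch.randn(2, 5, 50)      # stand-in for the encoder stack output
head = ClassificationHead(d_model=50)
probs = head(enc_out, tokens == 99)
print(probs.shape)   # torch.Size([2])
```

A sigmoid output in $(0, 1)$ with a single output unit is what a binary cross-entropy loss on probabilities expects.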

---

# 3. Putting it all together

In [35]:
# Parameters

# Global
epochs = 100
batch_size = 32 
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")
split_seed = 42
train_dev_split = 0.65

# Model
N = 5
num_classes = 1
heads = 6  # note: standard multi-head attention expects d_model % heads == 0 (here 50 % 6 != 0)

# Optimizer
optim_lr = 0.01


In [22]:
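train_data, dev_data = torch.utils.data.random_split(
    data, [train_dev_split, 1 - train_dev_split],
    generator=torch.Generator().manual_seed(split_seed),  # use split_seed so the split is reproducible
)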

In [23]:
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
dev_loader   = DataLoader(dev_data,   batch_size=batch_size, shuffle=False)  # no shuffling needed for evaluation

In [32]:
model = SentimentAnalysisTransformer(vocab_size, max_seq_length, num_classes, d_model, N, heads, embedding_weights).to(device)

In [33]:
opt     = torch.optim.Adam(model.parameters(), lr=optim_lr)
loss_fn = torch.nn.BCELoss()
metric  = F1Score(task='binary').to(device)

In [36]:
t_size = len(train_loader)
d_size = len(dev_loader)
for e in range(epochs):
    train_acc = 0.0
    train_loss = 0.0
    dev_acc = 0.0
    dev_loss = 0.0

    # Training pass
    model.train()
    for X, Y in train_loader:
        X = X.to(device)
        Y = Y.to(device)
        opt.zero_grad()
        y_hat = model(X, None)
        loss = loss_fn(y_hat, Y)
        loss.backward()
        opt.step()
        # `metric` is F1Score, so the "Acc" values printed below are F1 scores
        train_acc += metric(y_hat, Y).item()
        train_loss += loss.item()

    # Evaluation pass: no gradient tracking and no parameter update
    model.eval()
    with torch.no_grad():
        for X, Y in dev_loader:
            X = X.to(device)
            Y = Y.to(device)
            y_hat = model(X, None)
            loss = loss_fn(y_hat, Y)
            dev_acc += metric(y_hat, Y).item()
            dev_loss += loss.item()

    print('[Epoch {} - TRAIN] - Loss: {} - Acc: {} \n[Epoch {} - DEV]   - Loss: {} - Acc: {}'.format(
            e,
            train_loss/t_size,
            train_acc/t_size, e,
            dev_loss/d_size, 
            dev_acc/d_size, 
            ))

[Epoch 0 - TRAIN] - Loss: 57.21390169205204 - Acc: 0.5939000248908997 
[Epoch 0 - DEV]   - Loss: 56.65922619047619 - Acc: 0.5988211035728455
[Epoch 1 - TRAIN] - Loss: 57.16109830794796 - Acc: 0.5942593216896057 
[Epoch 1 - DEV]   - Loss: 56.65922619047619 - Acc: 0.5996611714363098
[Epoch 2 - TRAIN] - Loss: 57.17165898969097 - Acc: 0.595155656337738 
[Epoch 2 - DEV]   - Loss: 56.882440476190474 - Acc: 0.5965134501457214
[Epoch 3 - TRAIN] - Loss: 57.18221967143397 - Acc: 0.5931157469749451 
[Epoch 3 - DEV]   - Loss: 56.436011904761905 - Acc: 0.6003729104995728
[Epoch 4 - TRAIN] - Loss: 57.19278035317698 - Acc: 0.5945325493812561 
[Epoch 4 - DEV]   - Loss: 56.54761904761905 - Acc: 0.60007643699646
[Epoch 5 - TRAIN] - Loss: 57.18221967143397 - Acc: 0.5942016243934631 
[Epoch 5 - DEV]   - Loss: 56.99404761904762 - Acc: 0.5953826308250427
[Epoch 6 - TRAIN] - Loss: 57.24558371267011 - Acc: 0.5949082374572754 
[Epoch 6 - DEV]   - Loss: 56.770833333333336 - Acc: 0.5977334976196289
[Epoch 7 - TR

# 4. Errors encountered

## I. Exploding gradient

### a. Explanation

When running the training loop with different parameters, I ran into the following CUDA issue:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1 

When setting the optimizer's learning rate to 0.1, the final linear layer ends up outputting values around -1e+23.
This results in the model outputting "NaN" values, which cause issues for CUDA or any other computation.

### b. Resolution 

When setting the LR to 0.01, the problem "vanishes" and stops happening.

### c. Analysis & Other fixes

**Hypothesis**: Starting with a very high LR might make the optimizer "jump" too far and end up around a saddle point where the values are extremely low, which ultimately causes computation issues.

This could be verified by adding an LR scheduler and comparing the results.
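
A possible way to test this is sketched below: the warmup schedule from the original paper, $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup^{-1.5})$, combined with gradient-norm clipping. The toy model, the `warmup_steps` value and the clipping threshold are arbitrary assumptions, and this has not been tried on the actual model.

```python
import torch
import torch.nn as nn

d_model, warmup_steps = 50, 400   # warmup_steps is an arbitrary choice for this sketch


def noam_factor(step):
    step = max(step, 1)   # LambdaLR starts counting at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


toy_model = nn.Linear(d_model, 1)   # stand-in for the real model
# Base lr of 1.0: the lambda alone then determines the effective learning rate.
opt = torch.optim.Adam(toy_model.parameters(), lr=1.0)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=noam_factor)

lrs = []
for step in range(1000):
    loss = toy_model(torch.randn(32, d_model)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    # Rescale the gradients if their global norm exceeds 1.0, before the update.
    torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
    lrs.append(sched.get_last_lr()[0])

print(max(lrs), lrs[-1])   # the rate rises during warmup, then decays
```

If the NaN outputs disappear with a small initial rate and clipped gradients, even when the peak rate is raised, that would support the hypothesis that the first large steps are the culprit.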


## II. 

## Annex

For this notebook and its creation, I've used multiple sources:
- [How to code The Transformer in Pytorch - Towards Data Science - Samuel Lynn-Evans](https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec) (As reference for sequence to sequence implementation)
- [Attention is All You Need](https://arxiv.org/abs/1706.03762) (As a base)
- [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf) (For layer norm)
- [ChatGPT](https://openai.com/chatgpt) (For comprehension/questions and quick alternatives)
- Many Kaggle notebooks and Medium/Towards Data Science articles (To complement ChatGPT's responses) 
