# Explanation

This notebook tests the classes implemented in the transformer_classes.py file.

Remember:
We need to implement the following (from the "Attention Is All You Need" paper, available at https://arxiv.org/pdf/1706.03762.pdf):

- Transformer Architecture
- Scaled Dot-Product Attention
- Multi-Head Attention

Plus everything else required, such as positional encoding.


# 1. Function Classes
## I. Build Vocab from Glove & Embedding

First we'll load and set up GloVe and our vocab to get the first "brick" for the embedding.

For this test I'll load the pretrained GloVe weights and build the vocab from them.

#### 1st Version

Using GloVe and the same data cleaning method as the one in the main branch, which leaves lots of unknown words.

In [51]:
# Main/global imports
import torch
import numpy as np
import pandas as pd

from textfn import *
from classes import *
from tranformer_classes import *

from torch.utils.data import DataLoader
from torchmetrics import F1Score

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [52]:
df = loadDts('dataset/train_processed.csv')

In [53]:
d_model = 50 # Renamed from embedding_dim
glove_path = 'glove_pretrained/glove.6B.{}d.txt'.format(d_model)
vocab_size = 10000
max_seq_length = 20

# Rows 0..vocab_size-3 hold GloVe vectors; '<unk>' and '<pad>' are appended below.
# Note: row `index` itself and the last row stay unused (all zeros).
embedding_weights = np.zeros((vocab_size+2, d_model))
word_to_index = {}
index = 0
with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        if index >= vocab_size-2:
            break
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float16")
        embedding_weights[index] = vector
        word_to_index[word] = index
        index += 1
# No explicit f.close() needed: the `with` block closes the file.
embedding_weights[index+1] = np.ones(d_model)   # '<unk>' vector (all ones)
embedding_weights[index+2] = np.zeros(d_model)  # '<pad>' vector (all zeros)
word_to_index['<unk>'] = index+1
word_to_index['<pad>'] = index+2
vocab_size += 2
embedding_weights = torch.tensor(embedding_weights)


In [54]:
data = TranformerGloveDataset(df, max_seq_length, word_to_index, train=True)
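
The dataset class itself is imported from the project modules and is not shown in this notebook. As a rough sketch of what such a dataset has to do with `word_to_index` and `max_seq_length`, the helper below maps tokens to indices, falls back to `<unk>` for out-of-vocabulary words, and truncates or pads to a fixed length. `encode_sentence` and `toy_vocab` are hypothetical names used for illustration only, not the notebook's actual code.

```python
def encode_sentence(tokens, word_to_index, max_seq_length):
    """Map tokens to vocabulary indices, then truncate/pad to max_seq_length."""
    unk = word_to_index['<unk>']
    pad = word_to_index['<pad>']
    # Unknown words fall back to the '<unk>' index
    ids = [word_to_index.get(tok, unk) for tok in tokens[:max_seq_length]]
    # Right-pad with '<pad>' up to the fixed length
    return ids + [pad] * (max_seq_length - len(ids))

# Toy vocabulary (illustrative, not the GloVe vocab built above)
toy_vocab = {'the': 0, 'fire': 1, '<unk>': 3, '<pad>': 4}
encoded = encode_sentence(['the', 'fire', 'spreads'], toy_vocab, 5)
print(encoded)  # [0, 1, 3, 4, 4]
```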

In [55]:
def get_padding_mask(input, padding_idx):
    padding_mask = (input == padding_idx)
    return padding_mask

### I. i. Embedding Handling

Now that the vocab is built, we need to handle the embedded values of the words.

#### Method 1:

Use GloVe's pretrained weights and freeze the layer in the Embedder class.

#### Method 2:

Same as Method 1, but placed directly in the Encoder class using the embedding layer.

For now, using Method 1.
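
One possible way to get a frozen pretrained embedding layer is `nn.Embedding.from_pretrained`. This is a minimal sketch on a toy weight matrix, and the actual `Embedder` class may be written differently. `toy_weights` and `pad_idx` are illustrative assumptions. Note that a matrix built with `np.zeros` is float64, so a real run may need `.float()` before it is mixed with float32 layers.

```python
import torch
import torch.nn as nn

# Toy stand-in for the GloVe matrix built above: 6 words, d_model = 4
toy_weights = torch.randn(6, 4)
pad_idx = 5  # hypothetical index of '<pad>' in this toy vocabulary

# freeze=True sets requires_grad=False, so the optimizer leaves the vectors untouched
embed = nn.Embedding.from_pretrained(toy_weights, freeze=True, padding_idx=pad_idx)

tokens = torch.tensor([[0, 2, 5, 5]])   # (batch=1, seq_len=4)
vectors = embed(tokens)                 # (1, 4, 4)
print(vectors.shape, embed.weight.requires_grad)
```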

## II. Positional Encoding

Positional Encoding is a matrix that encodes the position of each word in the sentence.

It's defined as:

$$
PE_{(pos, 2i)} = \sin\left(pos/10000^{2i/d_{model}}\right)
$$
and
$$
PE_{(pos, 2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right)
$$

where $pos$ is the position in the sentence and $i$ indexes the embedding dimension.

As stated in the original paper:
"The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed."


So the dimensions of the PE matrix are the **sentence size** and the **embedding size** $d_{model}$.
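
The two formulas can be sketched as a function that fills even columns with sines and odd columns with cosines. This is an illustrative version that assumes an even $d_{model}$. The function name is hypothetical, and the project's `PositionalEncoder` class may differ (for example by scaling the embeddings or applying dropout).

```python
import math
import torch

def sinusoidal_positional_encoding(max_seq_length, d_model):
    """Return a (max_seq_length, d_model) matrix; assumes d_model is even."""
    position = torch.arange(max_seq_length, dtype=torch.float32).unsqueeze(1)  # (L, 1)
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns: cosine
    return pe

pe = sinusoidal_positional_encoding(max_seq_length=20, d_model=50)
x = torch.randn(2, 20, 50)   # (batch, seq_len, d_model) embeddings
x = x + pe.unsqueeze(0)      # broadcast the same encoding over the batch
print(pe.shape)              # torch.Size([20, 50])
```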

##### Method 1

Create a class that computes the positional encoding.

In [6]:
# # Example to check if positional encoder work
# data = torch.randn(5, 10, 6)
# pos_enc = PositionalEncoder(10, 6)
# encoded_input = pos_enc(data)

# print(encoded_input.size())

## III. Multi-Head Attention & Scaled Dot-Product Attention 

Multi-Head Attention is one of the "main" components of the transformer network.

It uses a set of trained matrices, each handling a specific role in the network:
- **Queries (Q)**: What each token is looking for in the other tokens of the sequence (relationships & dependencies).
- **Keys (K)**: What each token exposes for comparison; queries are compared against keys to compute the scores.
- **Values (V)**: The content of each token; the output is the sum of the values weighted by those scores.

These matrices are computed from the input embeddings (with the positional encoding added) through learned linear layers.


In Multi-Head Attention, we split the embedding dimension into multiple parts (or **heads**). With $h$ the number of heads (`heads` in the code, distinct from the model parameter `N`), $d_k$ refers to the last dimension of each head, where $d_k = d_{model}/h$, so $d_{model}$ must be divisible by $h$.

**Dropout**: As the original paper states: "_[...] we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of_ $P_{drop}=0.1$"
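
The head split described above amounts to a reshape followed by a transpose. The sizes below are illustrative assumptions, chosen so that the model dimension is divisible by the number of heads; the notebook's own `d_model = 50` works with `heads = 1`.

```python
import torch

batch, seq_len, d_model, heads = 2, 20, 48, 6   # illustrative sizes
assert d_model % heads == 0, "d_model must be divisible by the number of heads"
d_k = d_model // heads

x = torch.randn(batch, seq_len, d_model)
# (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
split = x.view(batch, seq_len, heads, d_k).transpose(1, 2)
# Inverse operation: concatenate the heads back into d_model
merged = split.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(split.shape, torch.equal(merged, x))
```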

### III. i Scaled Dot-Product Attention

As stated in the original paper, this attention is computed as: $$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

We can also add a mask (applied to the scores before the softmax) and dropout to this.

#### Method 1 

Implement the Multi-Head Attention class, with the Attention (Scaled Dot-Product Attention) as a function inside it.

#### Method 2 

Same as Method 1, but with the Attention in a separate class.
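
A minimal sketch of the scaled dot-product attention formula as a standalone function. Several choices here are assumptions, not taken from the project code: the mask convention (`True` marks a key to ignore, which matches what `get_padding_mask` returns), the fill value `-1e9`, and applying dropout to the attention weights.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None, dropout=None):
    """q, k, v: (batch, heads, seq_len, d_k). mask: True where a key must be ignored."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (B, H, L, L)
    if mask is not None:
        # A large negative score becomes a weight of ~0 after the softmax
        scores = scores.masked_fill(mask, -1e9)
    weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        weights = dropout(weights)
    return torch.matmul(weights, v), weights

q = k = v = torch.randn(2, 1, 4, 8)
mask = torch.tensor([[False, False, True, True]] * 2).view(2, 1, 1, 4)
out, weights = scaled_dot_product_attention(q, k, v, mask)
print(out.shape)
```

Using a finite fill value rather than `-inf` keeps the softmax defined even if every key of a sequence is masked.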




## IV. Feed-Forward Network

The Feed-Forward "layer" deepens the network by applying Linear layers to each position.

As stated in the original paper: "_This consists of two linear transformations with a ReLU activation in between_"

The default value of $d_{ff}$ is also given in the original paper: "_[...] and the inner-layer has dimensionality_ $d_{ff} = 2048$"
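
The two quotes translate into a small module like the one below. This is a sketch, not necessarily the class in the project files: the dropout between the two linear layers is an added assumption (the quoted sentences do not mention it), and the attribute names are illustrative.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Two linear transformations with a ReLU in between."""
    def __init__(self, d_model, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)   # expand to the inner dimension
        self.dropout = nn.Dropout(dropout)         # assumption: not part of the quoted formula
        self.linear_2 = nn.Linear(d_ff, d_model)   # project back to d_model

    def forward(self, x):
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

ff = FeedForward(d_model=50)
y = ff(torch.randn(2, 20, 50))
print(y.shape)
```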



## V. Normalization 

Normalization is important for our network: it keeps the values from varying too much between layers, so the model can train faster and better.

The original paper states that it uses Layer Normalization.
To implement LN, we need to implement the following:
$$LN(z;\alpha,\beta) = \frac{z-\mu}{\sigma}\odot \alpha + \beta$$

where $\mu$ and $\sigma$ are the mean and standard deviation of $z$ over its features, and $\alpha$, $\beta$ are learned gain and bias vectors.

This can be found in the "Layer Normalization" paper, page 13, formulas 15 & 16, [HERE](https://arxiv.org/pdf/1607.06450.pdf)
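
The formula can be sketched as a small module and checked against PyTorch's built-in `F.layer_norm`. The `eps` term under the square root is an added assumption for numerical safety (it is not in the formula above), and the class here is illustrative rather than the project's own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(d_model))   # learned gain
        self.beta = nn.Parameter(torch.zeros(d_model))   # learned bias
        self.eps = eps

    def forward(self, z):
        mu = z.mean(dim=-1, keepdim=True)
        var = z.var(dim=-1, keepdim=True, unbiased=False)  # population variance
        return (z - mu) / torch.sqrt(var + self.eps) * self.alpha + self.beta

z = torch.randn(2, 20, 50)
norm = LayerNorm(50)
out = norm(z)
reference = F.layer_norm(z, (50,), eps=1e-6)
print(torch.allclose(out, reference, atol=1e-5))
```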



# 2. Main blocks and Architecture

## I. Encoder/Decoder & Problem

The Encoder & Decoder blocks differ a bit:
- Encoder has "only" 1 Multi-Head Attention and 1 Feed Forward
- Decoder has 2 Multi-Head Attention and 1 Feed Forward, and receives the Encoder output

What the two have in common is the skip (residual) connections and the layers used.
So there is no particular difficulty in implementing the blocks.

With all previous classes implemented, we can build the Encoder/Decoder classes by adding our embeddings/positional encoding and using either copy.deepcopy() or nn.ModuleList() to create multiple independent blocks/modules for our model to work with.
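
A common way to combine the two tools is a small helper that deep-copies a block several times and registers the copies in an `nn.ModuleList`. `get_clones` is a hypothetical helper name, and `nn.Linear` stands in for an encoder block in this sketch.

```python
import copy
import torch
import torch.nn as nn

def get_clones(module, n):
    """Return n independent deep copies registered in a ModuleList."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

layers = get_clones(nn.Linear(50, 50), 3)
# Copies start with equal values but do not share storage:
# zeroing the first one leaves the others unchanged.
with torch.no_grad():
    layers[0].weight.zero_()
print(len(layers), torch.equal(layers[0].weight, layers[1].weight))
```

Registering the copies in a `ModuleList` (rather than a plain Python list) is what makes their parameters visible to `model.parameters()` and `.to(device)`.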

Although the current transformer is "finished", the reference I used implements it for sequence-to-sequence tasks, whereas Disaster Tweet is a Sentiment Analysis (classification) task, so we need to change a few things to adapt it.

## II. Changes for Sentiment Analysis

### i. Transformer class

As sentiment analysis is a Many-to-One setup, we don't need the Decoder part of the transformer.
So we'll create a new class for sentiment analysis without the decoder, and adapt the output Linear layer and the forward computation for a binary output.

#### a. Different setups

There are different setups in NLP:
- **Many-to-One**: Takes a sequence and maps it to a single output, e.g. one of two or more classes (Ex: Sentiment Analysis)
- **Many-to-Many**: Both input and output are sequences (Ex: Machine Translation)
- **One-to-Many**: Input is a single value and output is a sequence (Ex: Image Captioning)

In this case it's Many-to-One, so we don't need the Decoder blocks.

#### b. Parameters

Change the parameters so the class takes the vocab (size and embedding weights) as input and the number of expected classes as output.
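
One possible shape for the Many-to-One output stage is sketched below: average the encoder outputs over the non-masked positions, then project to `num_classes` and apply a sigmoid. `ClassificationHead` and the pooling strategy are assumptions made for illustration; the project's `SentimentAnalysisTransformer` may reduce the sequence differently.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Hypothetical head: average the non-masked token vectors, then project to num_classes."""
    def __init__(self, d_model, num_classes=1):
        super().__init__()
        self.out = nn.Linear(d_model, num_classes)

    def forward(self, encoded, mask):
        # encoded: (batch, seq_len, d_model); mask: (batch, seq_len), True = ignore
        keep = (~mask).unsqueeze(-1).float()
        pooled = (encoded * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return torch.sigmoid(self.out(pooled))   # probabilities, as BCELoss expects

head = ClassificationHead(d_model=50)
encoded = torch.randn(4, 20, 50)                 # stand-in for the encoder output
mask = torch.zeros(4, 20, dtype=torch.bool)
mask[:, 15:] = True                              # pretend the last 5 tokens are padding
probs = head(encoded, mask)
print(probs.shape)
```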

---

# 3. Putting it all together

In [130]:
# Parameters

# Global
epochs = 100
batch_size = 2048
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")
split_seed = 42
train_dev_split = 0.65

# Model
N = 1
num_classes = 1
heads = 1

# Optimizer
optim_lr = 0.01
step_size = 60
gamma = 0.1


In [131]:
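# Seed the split so that train/dev membership is reproducible (uses split_seed defined above)
train_data, dev_data = torch.utils.data.random_split(
    data,
    [train_dev_split, 1-train_dev_split],
    generator=torch.Generator().manual_seed(split_seed),
)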

In [132]:
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
dev_loader   = DataLoader(dev_data,   batch_size=batch_size, shuffle=False)  # order does not matter for evaluation

In [133]:
model = SentimentAnalysisTransformer(vocab_size, max_seq_length, num_classes, d_model, N, heads, embedding_weights).to(device)

In [134]:
opt     = torch.optim.Adam(model.parameters(), lr=optim_lr)
loss_fn = torch.nn.BCELoss()
metric  = F1Score(task='binary').to(device)
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size, gamma=gamma)

In [135]:
t_size = len(train_loader)
d_size = len(dev_loader)
p = True
for e in range(epochs):
    train_acc = 0.0
    train_loss = 0.0
    dev_acc = 0.0
    dev_loss = 0.0
    model.train()
    for X, Y in train_loader:
        opt.zero_grad()
        X = X.to(device)
        Y = Y.to(device)
        # True where the token is '<pad>' or '<unk>', so attention can ignore both
        padding_mask = get_padding_mask(X, word_to_index['<pad>'])
        unknown_mask = get_padding_mask(X, word_to_index['<unk>'])
        mask = padding_mask | unknown_mask
        y_hat = model(X, mask)
        loss = loss_fn(y_hat, Y)
        loss.backward()
        opt.step()
        train_acc += metric(y_hat, Y)  # F1 score, reported as "Acc" below
        train_loss += loss.item()
    # StepLR counts epochs: step once per epoch, not once per batch
    scheduler.step()

    # for X, Y in dev_loader:
    #     model.eval()
    #     opt.zero_grad()
    #     X = X.to(device)
    #     Y = Y.to(device)
    #     with torch.set_grad_enabled(False):
    #         mask = get_padding_mask(X, word_to_index['<pad>'])
    #         y_hat = model(X, mask)
    #         loss = loss_fn(y_hat, Y)
    #         dev_acc += metric(y_hat, Y)
    #         dev_loss += loss.item()      
    # print('[Epoch {} - TRAIN] - Loss: {} - Acc: {} \n[Epoch {} - DEV]   - Loss: {} - Acc: {}'.format(
    #         e,
    #         train_loss/t_size,
    #         train_acc/t_size, e,
    #         dev_loss/d_size, 
    #         dev_acc/d_size, 
    #         ))
    
    print('[Epoch {} - TRAIN] - Loss: {} - Acc: {}'.format(
            e,
            train_loss/t_size,
            train_acc/t_size
            ))

[Epoch 0 - TRAIN] - Loss: 1.0444627006848652 - Acc: 0.34815382957458496
[Epoch 1 - TRAIN] - Loss: 1.2803421815236409 - Acc: 0.5050440430641174
[Epoch 2 - TRAIN] - Loss: 0.7323805689811707 - Acc: 0.21492740511894226
[Epoch 3 - TRAIN] - Loss: 0.689518948396047 - Acc: 0.24925921857357025
[Epoch 4 - TRAIN] - Loss: 0.6845297813415527 - Acc: 0.15036167204380035
[Epoch 5 - TRAIN] - Loss: 0.6866154074668884 - Acc: 0.0
[Epoch 6 - TRAIN] - Loss: 0.6843780279159546 - Acc: 0.0017590150237083435
[Epoch 7 - TRAIN] - Loss: 0.6813175876935323 - Acc: 0.02915460616350174
[Epoch 8 - TRAIN] - Loss: 0.6810479561487833 - Acc: 0.06472470611333847
[Epoch 9 - TRAIN] - Loss: 0.6820141474405924 - Acc: 0.37557944655418396
[Epoch 10 - TRAIN] - Loss: 0.6842714746793112 - Acc: 0.2120002806186676
[Epoch 11 - TRAIN] - Loss: 0.6899519761403402 - Acc: 0.13237914443016052
[Epoch 12 - TRAIN] - Loss: 0.6950146555900574 - Acc: 0.3364526927471161
[Epoch 13 - TRAIN] - Loss: 0.6817589004834493 - Acc: 0.00452845823019743
[Epoch

# 4. Errors encountered

## I. Exploding gradient

### a. Explanation
When running the training loop with different parameters, I ran into the following CUDA error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1 

With the optimizer's learning rate set to 0.1, the final linear layer ends up outputting values around -1e+23.
This results in the model outputting NaN values, which breaks the loss computation on CUDA (or on any device).

### b. Resolution 

With the LR set to 0.01, the problem "vanishes" and stops happening.

### c. Analysis & Other fixes

**Hypothesis**: Starting with a really high LR might make the optimizer "jump" and end up around a saddle point where the outputs take extremely low values, which ultimately causes computation issues.

This could be verified by adding an LR scheduler and trying again.
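
Independently of the learning rate, a few guards can make this kind of failure easier to diagnose. `BCELoss` expects inputs in [0, 1], and NaN inputs violate that, which on GPU tends to surface as a device-side assert. The sketch below uses a toy linear model and shows three optional safeguards that are suggestions rather than part of the notebook: checking for non-finite outputs, using `BCEWithLogitsLoss` on raw logits (the model would then need to drop its final sigmoid), and clipping the gradient norm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(50, 1)                   # toy stand-in that returns raw logits
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()           # applies the sigmoid internally, in a stable way

X = torch.randn(8, 50)
Y = torch.randint(0, 2, (8, 1)).float()

opt.zero_grad()
logits = model(X)
if not torch.isfinite(logits).all():
    raise RuntimeError("non-finite model output; stop before calling the loss")
loss = loss_fn(logits, Y)
loss.backward()
# Rescale gradients so their global norm is at most 1.0 before the update
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
print(loss.item(), grad_norm.item())
```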


## II. Random Accuracy
### a. Explanation

After running the model for roughly 100 epochs with parameters:
- epochs = 100 
- batch_size = 32 
- N = 5
- heads = 6
- optim_lr = 0.01

The maximum of both training AND dev accuracy (F1 Score) peaks at 60%.
This basically means the model is "guessing" instead of predicting.

I'm not really sure of the source of the problem right now.

**Hypothesis**: Here are different paths to look into:
- Transformer model: issue in the implementation
- Data prep not good enough
- Not using a mask
- Parameters not right
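
One way to check whether an F1 of about 60% really amounts to "guessing" is to compare it with a trivial baseline: a classifier that always predicts the positive class. Its recall is 1 and its precision equals the share of positive samples, so its F1 depends only on the class balance. The function name and the value `0.43` below are illustrative assumptions; the real share should be read from `df`.

```python
def f1_always_positive(positive_share):
    """F1 of a classifier that predicts the positive class for every sample."""
    precision = positive_share   # TP / (TP + FP) = share of positives among all samples
    recall = 1.0                 # every positive sample is found
    return 2 * precision * recall / (precision + recall)

# 0.43 is an illustrative positive share, not measured on this dataset
print(round(f1_always_positive(0.43), 3))   # 0.601
```

If the measured share gives a baseline close to the model's score, the model has probably collapsed to a constant prediction.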

### b. Resolution 

Until accuracy or loss improves, I'm "deleting" (commenting out) the evaluation part of the training loop.

**Data**: For now, I'm putting this issue aside to focus on the model only.

**Model**: The implementation of the model, after some review, looks good.

**Mask**: We can start using masks so that our model can handle the padding tokens.
After adding it, I'm getting the same F1.
The previous call to scaled dot-product attention was passing mask=None, so the earlier F1 score drop was due to other sources.

Issue found:
While computing the mask, I found that the _scores_ tensor has shape torch.Size([64, 1, 20, 20]), which with the current setup can only mean (batch_size, num_heads, seq_len, seq_len).

After reviewing, I had forgotten about the matmul: this shape is simply the output of $QK^T$, with one score per (query position, key position) pair.
To apply the mask to this shape, we need two "unsqueeze" calls to add the 2 missing dimensions to the (batch_size, seq_len) mask, so that it can be broadcast against the scores.

The final mask is the combination of a padding mask and an unknown-token mask.
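
The mask construction and the two "unsqueeze" calls can be sketched on a toy batch as follows. The token indices, the positions of the new dimensions, and the fill value `-1e9` are assumptions for illustration.

```python
import torch

pad_idx, unk_idx = 9, 8   # toy indices standing in for '<pad>' and '<unk>'
X = torch.tensor([[1, 8, 3, 9, 9],
                  [4, 5, 6, 7, 9]])          # (batch=2, seq_len=5)

mask = (X == pad_idx) | (X == unk_idx)        # (2, 5), True = ignore this token
mask = mask.unsqueeze(1).unsqueeze(2)         # (2, 1, 1, 5)

scores = torch.zeros(2, 1, 5, 5)              # (batch, heads, seq_len, seq_len)
scores = scores.masked_fill(mask, -1e9)       # broadcasts over heads and query positions
weights = torch.softmax(scores, dim=-1)
print(mask.shape, weights[0, 0, 0])
```

In the first sentence only tokens 0 and 2 are real words, so each query spreads its attention evenly over those two keys.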

**Parameters**: When giving higher values for N and heads, the accuracy drops significantly.
With N=1 and heads=1, the training accuracy now peaks at 72% (around a constant 70% on dev).
Since the task is not really complex, this seems understandable.
This is the new base I'm using for now:
- epochs = 100
- batch_size = 64 
- N = 1
- heads = 1
- optim_lr = 0.01

#### Further parameters check:

Since we have better results, I won't change the N and heads values for now. What I'll check and change instead:
- Batch size
- Learning rate
- Optimizer
- Loss fn (Since BCE is the most common for binary, I might stick with it)

Batch size: After some testing, I'm reaching the top F1 score (0.72) pretty quickly with a batch size of 2048.




##### Learning rate

Given the previous problem, it looks like our training starts near a saddle point.
Tuning the learning rate might be one of the easiest ways to gain accuracy on our model.

**Ways**:
- Adding a scheduler
- Adding momentum to the optimizer

**Scheduler**:

With the schedulers tested, I'm getting the same results but slower.
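
When comparing schedulers, it helps to know which learning rate is actually in effect at a given point. `StepLR` multiplies the learning rate by `gamma` every `step_size` calls to `scheduler.step()`, so the schedule depends on whether `step()` is called per epoch or per batch. The helper below reproduces that rule in plain Python; the function name and the figure of 3 batches per epoch are illustrative assumptions.

```python
def step_lr(initial_lr, step_size, gamma, n_steps):
    """Learning rate after n_steps calls to scheduler.step() under StepLR's rule."""
    return initial_lr * gamma ** (n_steps // step_size)

# Stepping once per epoch: the first decay happens at epoch 60
print(step_lr(0.01, 60, 0.1, 59), step_lr(0.01, 60, 0.1, 60))
# Stepping once per batch with, e.g., 3 batches per epoch:
# 60 steps are already reached after 20 epochs
print(step_lr(0.01, 60, 0.1, 3 * 20))
```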


I'll try to set a scheduler and train the model overnight to see if it's just a matter of training time (although I really doubt it is).

The next quick check is to take an existing, ready-made transformer network and train it on the same dataset to see if there is any difference.

This way I can save time by knowing whether to focus on my transformer model or on the data processing.


## Annex

For this notebook, I used multiple sources:
- [How to code The Transformer in Pytorch - Towards Data Science - Samuel Lynn-Evans](https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec) (As reference for sequence to sequence implementation)
- [Attention is All You Need](https://arxiv.org/abs/1706.03762) (As a base)
- [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf) (For layer norm)
- [ChatGPT](https://openai.com/chatgpt) (For comprehension/questions and quick alternatives)
- Many Kaggle notebooks and Medium/Towards Data Science articles (To complement ChatGPT's responses)
