**Table of contents**<a id='toc0_'></a>    
- [Decoder (GPT)](#toc1_)    
  - [Constants](#toc1_1_)    
  - [Reproducibility](#toc1_2_)    
  - [Utilities](#toc1_3_)    
  - [Config](#toc1_4_)    
  - [Dataset](#toc1_5_)    
    - [Refactor](#toc1_5_1_)    
  - [DataLoader](#toc1_6_)    
    - [Example](#toc1_6_1_)    
  - [Model](#toc1_7_)    
    - [Masks](#toc1_7_1_)    
      - [Padding Mask](#toc1_7_1_1_)    
      - [Look-Ahead Mask (Future Mask)](#toc1_7_1_2_)    
      - [Using Both Masks in the Decoder](#toc1_7_1_3_)    
  - [2-Digits Addition](#toc1_8_)    
  - [Adder Decoder Walkthrough](#toc1_9_)    
    - [Target Padding Mask (`target_padding_mask`)](#toc1_9_1_)    
    - [Future Mask (`future_mask`)](#toc1_9_2_)    
    - [Example of Source Padding and Future Masks](#toc1_9_3_)    
      - [First Sample First Token](#toc1_9_3_1_)    
      - [First Sample Fourth Token](#toc1_9_3_2_)    
    - [Further Add a Singleton Dimension in Masks](#toc1_9_4_)    
    - [MultiHeadAttention](#toc1_9_5_)    
      - [A Primer](#toc1_9_5_1_)    
      - [An Example](#toc1_9_5_2_)    
    - [AddNorm (Residual Connection + Layer Normalization)](#toc1_9_6_)    
      - [Residual Block](#toc1_9_6_1_)    
      - [Layer Normalization](#toc1_9_6_2_)    
      - [Combining Both](#toc1_9_6_3_)    
    - [The Head and the Softmax Layer](#toc1_9_7_)    
  - [Potential to use Module Dict?](#toc1_10_)    
  - [Training with GPT-like Model](#toc1_11_)    
    - [Loss Computation](#toc1_11_1_)    
    - [Example](#toc1_11_2_)    
    - [Confusion: Training versus Inference](#toc1_11_3_)    
  - [Questions](#toc1_12_)    
    - [Why Masked == 0 in some?](#toc1_12_1_)    
    - [what is the reason of setting the attention scores's mask indexes to negative infinity](#toc1_12_2_)    
    - [Why do we need both ignore index in Loss and also negative infinity mask](#toc1_12_3_)    
    - [Target and Preds/Logits Shape](#toc1_12_4_)    
    - [Why do we flatten prediction and target (logits)?](#toc1_12_5_)    
      - [Background](#toc1_12_5_1_)    
      - [Traditional Loss Computation](#toc1_12_5_2_)    
      - [Why Flatten?](#toc1_12_5_3_)    
      - [Step-by-step Flattening](#toc1_12_5_4_)    
    - [Why sometimes unsqueeze masks?](#toc1_12_6_)    
    - [Why does sequence length differ for source and target, usually I thought it is just all L, same.](#toc1_12_7_)    
    - [Am i right to assume that the core idea of autoregressive model like decoder only (GPT like) is that for a given sample, there will eventually be L rows where L is the seq length, and therefore I can intuitively view it as 1 sample having L samples, since for each row, we will compute the loss. Am I right in my understanding? Do not hesistate to correct me.](#toc1_12_8_)    
    - [QKV Again](#toc1_12_9_)    
      - [Background and Assumptions](#toc1_12_9_1_)    
      - [Context Vector](#toc1_12_9_2_)    
      - [Query (Q), Key (K), and Value (V)](#toc1_12_9_3_)    
      - [Mathematical Description](#toc1_12_9_4_)    
  - [TODO](#toc1_13_)    
  - [References and Further Readings](#toc1_14_)    


# <a id='toc1_'></a>[Decoder (GPT)](#toc0_)

In [1]:
from __future__ import annotations

import os
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler
import numpy as np
import rich
import torch
import torch.nn as nn
from rich.pretty import pprint
from torch.utils.data import DataLoader
from tqdm import tqdm
from enum import Enum

## <a id='toc1_1_'></a>[Constants](#toc0_)

In [2]:
class Constants(Enum):
    """Generic constants class."""
    # fmt: off
    SEED : int  = 42
    DEBUG: bool = True
    # fmt: on

In [3]:
SEED   = 42
DEBUG  = True
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## <a id='toc1_2_'></a>[Reproducibility](#toc0_)

In [5]:
def seed_all(seed: Optional[int] = 1992, seed_torch: bool = True) -> int:
    """
    Seed all random number generators.

    Parameters
    ----------
    seed : int, optional
        Seed number to be used, by default 1992.
    seed_torch : bool, optional
        Whether to seed PyTorch or not, by default True.

    Returns
    -------
    seed: int
        The seed number.
    """
    # fmt: off
    os.environ["PYTHONHASHSEED"] = str(seed)       # set PYTHONHASHSEED env var at fixed value
    np.random.seed(seed)                           # numpy pseudo-random generator
    random.seed(seed)                              # python's built-in pseudo-random generator

    if seed_torch:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)           # pytorch (both CPU and CUDA)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.enabled = False
    # fmt: on
    return seed

In [6]:
seed_all(SEED, seed_torch=True)

42

## <a id='toc1_3_'></a>[Utilities](#toc0_)

In [7]:
def forward_hook(
    module: nn.Module, input: Tuple[torch.Tensor], output: torch.Tensor
) -> None:
    """Custom hook function to print layer information."""
    if not hasattr(module, "has_printed"):
        module.has_printed = False

    if not module.has_printed:
        print(f"Layer: {module.__class__.__name__}")
        print(f"Input shape: {str(input[0].shape)}")
        print(f"Output shape: {str(output.shape)}")
        module.has_printed = True


def are_both_models_same(
    state_dict_1: Dict[str, torch.Tensor], state_dict_2: Dict[str, torch.Tensor]
) -> bool:
    """Return True if both state dicts have the same keys, shapes and values."""
    # Check if both models have the same keys
    if set(state_dict_1.keys()) != set(state_dict_2.keys()):
        return False

    # Check if all tensors have the same shape and values
    for key in state_dict_1.keys():
        if state_dict_1[key].shape != state_dict_2[key].shape:
            return False
        if not torch.allclose(state_dict_1[key], state_dict_2[key]):
            return False

    return True

## <a id='toc1_4_'></a>[Config](#toc0_)

In [8]:
@dataclass
class MultiHeadedAttentionConfig:
    attention: Attention
    d_model: int
    H: int
    dropout: float = 0.1


@dataclass
class PositionwiseFeedForwardConfig:
    d_model: int
    d_ff: int
    activation: Any = field(default_factory=lambda: nn.ReLU())
    dropout: float = 0.1
    bias: bool = True


@dataclass
class AddNormConfig:
    feature_dim: int
    dropout: float


@dataclass
class DecoderBlockConfig:
    masked_self_attention_mha: MultiHeadedAttentionConfig
    feed_forward: PositionwiseFeedForwardConfig
    add_norm_1: AddNormConfig
    add_norm_2: AddNormConfig


@dataclass
class ModelConfig:
    d_model: int
    vocab_size: int
    max_seq_len: int
    num_layers: int
    dropout: float
    decoder: DecoderBlockConfig

## <a id='toc1_5_'></a>[Dataset](#toc0_)

In [9]:
# fmt: off
PLUS_SIGN  = 10
MUL_SIGN   = 11
MINUS_SIGN = 12
EQUAL_SIGN = 13
EOS        = 14
BOS        = 15
PAD        = 16
UNK        = 17
# fmt: on

# map tokens to their corresponding index
token_to_index: Dict[str, int] = {
    "0": 0,
    "1": 1,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 7,
    "8": 8,
    "9": 9,
    "+": PLUS_SIGN,
    "*": MUL_SIGN,
    "-": MINUS_SIGN,
    "=": EQUAL_SIGN,
    "<EOS>": EOS,
    "<BOS>": BOS,
    "<pad>": PAD,
    "??": UNK,
}
index_to_token: Dict[int, str] = dict((v, k) for k, v in token_to_index.items())
vocab_size: int = len(token_to_index)


def pad_number(num: int, length: int) -> str:
    """
    Pad numbers with zeros in front so that they have uniform length.

    Note, if a + b = c and num digits allowed to add is 2, then for
    a and b we always pad to length 2, but for c we always pad to length 3.

    Example
    -------
    6 + 90 = 96 -> 06 + 90 = 096

    Parameters
    ----------
    num : int
        Number to be padded.
    length : int
        Length of the resulting padded number string.

    Returns
    -------
    str
        Padded number string.
    """
    return str(num).zfill(length)


def equation_to_string(a: int, b: int, c: int, num_digits: int) -> str:
    """
    Formats the addition equation as a string.

    Parameters
    ----------
    a : int
        First addend.
    b : int
        Second addend.
    c : int
        Sum of a and b.
    num_digits : int
        Number of digits each number in the equation should have.

    Returns
    -------
    str
        Formatted equation string.
    """
    padded_a = pad_number(a, num_digits)
    padded_b = pad_number(b, num_digits)
    padded_c = pad_number(c, num_digits + 1) # note the padding here!
    return f"{padded_a}+{padded_b}={padded_c}"

def decode_equation(equation: List[int]) -> str:
    """
    Convert an equation in list format to string format.

    Parameters
    ----------
    equation : List[int]
        The equation in list format.

    Returns
    -------
    str
        The equation in string format.
    """
    if isinstance(equation, torch.Tensor):
        equation = equation.tolist()
    # fall back to the "??" token string (not the integer UNK) for unknown indices
    res = "".join(index_to_token.get(x, index_to_token[UNK]) for x in equation)
    return res.replace("<BOS>", "").replace("<EOS>", "")

def encode_equation(equation: str, num_digits: int) -> torch.Tensor:
    """
    Encode an equation string, up to and including the equal sign, as a tensor
    of token indices prefixed with BOS.

    Parameters
    ----------
    equation : str
        The equation in string format.
    num_digits : int
        Number of digits each number in the equation should have.

    Returns
    -------
    torch.Tensor
        The equation in list format as a tensor.
    """
    plus_idx = equation.index("+")
    equal_idx = equation.index("=")

    a = pad_number(int(equation[:plus_idx]), num_digits)
    b = pad_number(int(equation[plus_idx + 1:equal_idx]), num_digits)

    new_equation = f"{a}+{b}="

    return torch.tensor(
        [BOS] + [token_to_index.get(n, UNK) for n in new_equation],
        dtype=torch.int
    ).to(DEVICE)

In [10]:
def create_add_dataset(
    num_digits: int, dataset_size: int, rng_seed: int = 1337
) -> Tuple[List[torch.Tensor], List[str]]:
    rng = torch.Generator()
    rng.manual_seed(rng_seed)

    max_num = 10**num_digits - 1

    dataset_str = []
    for _ in range(dataset_size):
        a = torch.randint(low=0, high=max_num + 1, size=(1,), generator=rng).item()
        b = torch.randint(low=0, high=max_num + 1, size=(1,), generator=rng).item()
        c = a + b

        equation = equation_to_string(a, b, c, num_digits)

        dataset_str.append(equation)

    dataset_tensor = [
        torch.tensor([BOS] + [token_to_index.get(n, UNK) for n in x] + [EOS])
        for x in dataset_str
    ]
    return dataset_tensor, dataset_str

In [11]:
dataset_tensor, dataset_str = create_add_dataset(num_digits=2, dataset_size=4)
pprint(dataset_tensor)
pprint(dataset_str)

decode_equation(dataset_tensor[0]), decode_equation([15, 1, 5, 10, 5, 7, 13,  0, 7,  2, 14])

('15+57=072', '15+57=072')

Some notes:

1. We included operations other than addition for future use, so they may seem redundant for now.
2. Karpathy's version is more compact: an expression such as `15+87=102` is encoded as `1587102`. This works because `num_digits` is fixed. If `num_digits=2`, every operand has at most 2 digits, so `1587102` is always read as: first 2 digits = first number, next 2 digits = second number, last 3 digits = the answer (the sum of the first two). Two more examples to appreciate this:
   1. `0639045 <> 6 + 39 = 45`
   2. `5101052 <> 51 + 1 = 52`
3. He also encodes the answer backwards because it is easier for the GPT model to learn, but for a first pass I will not do so to avoid confusion.

Karpathy:

As one more example, the problem 6 + 39 = 45 would be encoded as:

"0639054"

where you will notice that we are padding with zeros to make sure that we always
produce strings of the exact same size: n + n + (n + 1). When n=2, this is 7.
At test time, we will feed in an addition problem by giving the first 2n digits,
and hoping that the GPT model completes the sequence with the next (n+1) digits
correctly.
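
To make the fixed-width idea concrete, here is a minimal sketch of the compact encoding described in note 2 (answer not reversed). The helper names `encode_compact` and `decode_compact` are hypothetical and are not part of this notebook or of Karpathy's code.

```python
from typing import Tuple


def encode_compact(a: int, b: int, num_digits: int) -> str:
    """Concatenate zero-padded a, b and a + b, dropping '+' and '='."""
    c = a + b
    # a and b use num_digits characters each, the sum uses num_digits + 1
    return str(a).zfill(num_digits) + str(b).zfill(num_digits) + str(c).zfill(num_digits + 1)


def decode_compact(s: str, num_digits: int) -> Tuple[int, int, int]:
    """Split a compact string back into (a, b, c) using the fixed field widths."""
    n = num_digits
    assert len(s) == n + n + (n + 1), "expected n + n + (n + 1) characters"
    return int(s[:n]), int(s[n : 2 * n]), int(s[2 * n :])


print(encode_compact(6, 39, num_digits=2))
print(decode_compact("1587102", num_digits=2))
```

Because every field has a known width, the operators carry no extra information, which is why they can be dropped.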

TODO: to make it inherit Dataset.

### <a id='toc1_5_1_'></a>[Refactor](#toc0_)

Here, the refactor should be done such that there exists a class that
inherits `Dataset` and the input and outputs are generated on the fly.

## <a id='toc1_6_'></a>[DataLoader](#toc0_)

In [12]:
def collate_fn(
    batch: List[torch.Tensor],
    batch_first: bool,
    padding_value: int,
):
    input_padded = torch.nn.utils.rnn.pad_sequence(
        batch, batch_first=batch_first, padding_value=padding_value
    )
    return input_padded


@dataclass
class DataLoaders:
    # fmt: off
    rng_seed    : int
    num_digits  : int
    dataset_size: int
    batch_size: int
    train_set = None
    val_set = None
    test_set = None
    train_loader = None
    val_loader  = None
    test_loader = None
    train_size = None
    val_size = None
    test_size = None
    # fmt: on

    def __post_init__(self) -> None:
        pass

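    # a tuple default avoids the shared mutable-default pitfall of a list
    def split_data(self, split: Tuple[float, float, float] = (0.7, 0.1, 0.2)) -> None:
        # compare with a tolerance since e.g. 0.7 + 0.1 + 0.2 is not exactly 1.0 in floating point
        if abs(sum(split) - 1.0) > 1e-6:
            raise ValueError(f"Split ratios should sum to 1 but got {sum(split)}.")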

        # fmt: off
        self.train_size = round(self.dataset_size * split[0])
        self.val_size   = round(self.dataset_size * split[1])
        self.test_size  = self.dataset_size - self.train_size - self.val_size

        dataset, _ = create_add_dataset(self.num_digits, self.dataset_size)
        self.train_set, self.val_set, self.test_set = torch.utils.data.random_split(
            dataset,
            [self.train_size, self.val_size, self.test_size],
            generator=torch.Generator().manual_seed(self.rng_seed),
        )

        self.train_loader = DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True, collate_fn=lambda batch: collate_fn(batch, batch_first=True, padding_value=PAD))
        self.val_loader   = DataLoader(self.val_set, batch_size=self.batch_size, shuffle=False, collate_fn=lambda batch: collate_fn(batch, batch_first=True, padding_value=PAD))
        self.test_loader  = DataLoader(self.test_set, batch_size=self.batch_size, shuffle=False, collate_fn=lambda batch: collate_fn(batch, batch_first=True, padding_value=PAD))
        # fmt: on


In [13]:
dataloaders = DataLoaders(rng_seed = 1337, num_digits=2, dataset_size=8, batch_size=2)
dataloaders.split_data(split=[1,0,0])

Note that the padding in `collate_fn` is "redundant" here: our earlier code
zero-pads every number, so all samples have the same number of characters.
For example, `23+3=26` becomes `23+03=026`. Consequently,
all samples $\in$ batch have the same length by construction.
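
As a quick check of this claim, the sketch below calls `pad_sequence` on equal-length samples (where it should reduce to a plain stack) and on made-up variable-length samples (where padding actually happens). The token values are illustrative only.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 16  # same padding index as in the vocabulary above

# Equal-length samples: padding adds nothing, the result equals torch.stack.
equal = [torch.tensor([15, 1, 2, 14]), torch.tensor([15, 3, 4, 14])]
padded_equal = pad_sequence(equal, batch_first=True, padding_value=PAD)

# Variable-length samples (hypothetical): shorter rows are right-padded with PAD.
ragged = [torch.tensor([15, 1, 14]), torch.tensor([15, 3, 4, 5, 14])]
padded_ragged = pad_sequence(ragged, batch_first=True, padding_value=PAD)

print(padded_equal)
print(padded_ragged)
```

Keeping the padding in the collate function costs nothing here and would matter as soon as samples of different lengths are introduced.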

In [14]:
seed_all(133338, seed_torch=True)

for batch in dataloaders.train_loader:
    for sample in batch:
        print(sample)
        print(decode_equation(sample))
        print(len(sample))
    print("-" * 80)

tensor([15,  9,  5, 10,  5,  3, 13,  1,  4,  8, 14])
95+53=148
11
tensor([15,  1,  5, 10,  1,  0, 13,  0,  2,  5, 14])
15+10=025
11
--------------------------------------------------------------------------------
tensor([15,  3,  4, 10,  9,  0, 13,  1,  2,  4, 14])
34+90=124
11
tensor([15,  9,  2, 10,  0,  0, 13,  0,  9,  2, 14])
92+00=092
11
--------------------------------------------------------------------------------
tensor([15,  1,  2, 10,  2,  0, 13,  0,  3,  2, 14])
12+20=032
11
tensor([15,  9,  7, 10,  8,  6, 13,  1,  8,  3, 14])
97+86=183
11
--------------------------------------------------------------------------------
tensor([15,  9,  0, 10,  3,  8, 13,  1,  2,  8, 14])
90+38=128
11
tensor([15,  1,  5, 10,  5,  7, 13,  0,  7,  2, 14])
15+57=072
11
--------------------------------------------------------------------------------


### <a id='toc1_6_1_'></a>[Example](#toc0_)

In [15]:
# import torch
# from typing import List
# from torch.utils.data import DataLoader, TensorDataset

# sequences = [
#     torch.tensor([1, 2]),
#     torch.tensor([3, 4, 5]),
#     torch.tensor([6, 7, 8, 9]),
#     torch.tensor([2, 3]),
# ]
# # Let's say PAD is represented by the integer 16
# PAD = 16
# sample_dataloader = DataLoader(
#     sequences,
#     batch_size=4,
#     collate_fn=lambda b: collate_fn(b, batch_first=True, padding_value=PAD),
# )
# batch = next(iter(sample_dataloader))
# pprint(batch)

# # Create the pad mask
# pad_mask = batch == PAD
# print(pad_mask)


TODO: distinguish between the GPT decoder-only model and the encoder-decoder seq2seq model.

In [16]:
def construct_future_mask(seq_len: int) -> torch.Tensor:
    """
    Construct a binary mask that contains 1's for all valid connections and 0's for
    all outgoing future connections. This mask will be applied to the attention
    logits in decoder self-attention such that all logits with a 0 mask are set to
    -inf.

    :param seq_len: length of the input sequence
    :return: (seq_len,seq_len) mask
    """
    future_mask = torch.triu(torch.ones((seq_len, seq_len)), diagonal=1).to(torch.bool)
    future_mask = future_mask.contiguous()
    return future_mask == 0

construct_future_mask(seq_len=3)[:, None, None, :]

tensor([[[[ True, False, False]]],


        [[[ True,  True, False]]],


        [[[ True,  True,  True]]]])
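
The `triu(...) == 0` construction above should be equivalent to taking the lower triangle directly. A minimal sketch to confirm this, assuming a small `seq_len` chosen only for illustration:

```python
import torch

seq_len = 4  # illustrative length

# Construction used in construct_future_mask: strict upper triangle, then invert.
via_triu = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) == 0

# Alternative: lower triangle (including the diagonal) cast to bool.
via_tril = torch.tril(torch.ones(seq_len, seq_len)).bool()

print(via_tril)
```

Row `i` of either mask is `True` for columns `0..i`, meaning position `i` may attend to itself and to earlier positions only.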

In [17]:
def construct_batches(x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    # drops last column of the input, meaning remove last element of each row
    input = x[:, :-1].long()
    equal_sign_loc = [(equation==EQUAL_SIGN).nonzero(as_tuple=True)[0].item() for equation in x]


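    # Targets are the inputs shifted left by one; positions whose target is at or
    # before the equal sign are set to PAD so that they can be ignored by the loss
    target = [
        torch.cat(
            (torch.full((equal_sign_loc[i],), PAD, dtype=torch.long, device=x.device), x[i, equal_sign_loc[i] + 1:])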
        ) for i in range(len(x))
    ]
    target = torch.stack(target).long()

    batch_size, seq_len = input.size()
    future_mask = construct_future_mask(seq_len)
    # future mask has shape (L, L) but we want it to be (B, L, L) then (B, 1, L, L)
    future_mask = future_mask.view(1, seq_len, seq_len).expand(size=(batch_size, -1, -1)).unsqueeze(1)

    # padding_mask before view has shape: (batch_size, seq_len)
    # we want it to be (B, L, L) then (B, 1, L, L)
    padding_mask = input != PAD
    padding_mask = padding_mask.view(batch_size, 1, 1, seq_len).expand(size=(batch_size, 1, seq_len, seq_len))
    return input, target, padding_mask, future_mask

In PyTorch, `input == PAD` performs an element-wise comparison between each element of the `input` tensor and the constant `PAD`, the integer that represents the padding token. In NLP tasks, padding tokens are often used to make all sequences in a batch the same length.

If `input` is a tensor of shape `(Batch Size, Sequence Length)`, then `input == PAD` returns a boolean tensor of the same shape, where the element at position `(i, j)` is `True` if `input[i, j]` equals `PAD` and `False` otherwise. This boolean tensor identifies where the padding tokens are located in the original `input` tensor. Note that `construct_batches` above uses the complement, `input != PAD`, so its mask is `True` at real tokens and `False` at padding, which matches the `mask == 0` check applied to the attention scores.
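
A small sketch of the two comparisons on a made-up padded batch (the token values are illustrative, not taken from the dataset):

```python
import torch

PAD = 16
# hypothetical padded batch of shape (2, 5)
inputs = torch.tensor([[15, 1, 2, 16, 16],
                       [15, 3, 4, 5, 16]])

is_pad = inputs == PAD  # True where the token IS padding
keep = inputs != PAD    # True where the token is real (the convention in construct_batches)

print(is_pad)
print(keep)
```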

---

The `src_mask` term in the context of transformers typically refers to the "source mask," which is designed to prevent the self-attention mechanism from considering certain tokens in the source sequence. This mask is applied to the attention scores before the softmax operation during the calculation of self-attention. The primary purposes of using such a mask are:

1. Padding Masking: When sequences are batched together, shorter sequences are often padded with special tokens (usually denoted by zeros or a specific padding token) to match the length of the longest sequence in the batch. The `src_mask` helps the model ignore these padding tokens by setting the corresponding attention scores to a large negative value (usually `-inf`), so that they become zero after the softmax operation.

2. Future Information Masking: In some tasks like sequence-to-sequence prediction, it's important that a token does not attend to future tokens in the sequence. This is another use-case for the mask, although this is more commonly seen in the target mask (`tgt_mask`) rather than the source mask (`src_mask`).

Whether `src_mask` in a given implementation is the same as the padding mask produced by the dataloader depends on where it is defined and used. If the mask is only meant to filter out padding tokens, it is typically the same thing.

Implementations vary, but the concept is the same: the mask is applied to the attention scores right before the softmax, so the masked positions receive zero attention weight.
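
The effect of filling masked scores with `-inf` before the softmax can be seen on a single row of made-up attention scores. This is a sketch, not the model code:

```python
import torch

scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])     # raw scores of one query against 4 keys
keep = torch.tensor([[True, True, True, False]])  # assume the last key is a padding token

masked = scores.masked_fill(~keep, float("-inf"))
weights = masked.softmax(dim=-1)

# The masked key gets exactly zero weight; the rest renormalise among themselves.
print(weights)
```

Since `exp(-inf) = 0`, the masked position contributes nothing to the softmax denominator, so the remaining weights are the same as a softmax over the unmasked scores alone.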

---

What is `pad_mask`?

The primary reason for using a padding mask (`pad_mask`) is to ensure that the model does not consider padding tokens during training or inference. Padding tokens are usually added to sequences to make them have a uniform length, but they don't carry any meaningful information. Ignoring them is crucial for several reasons:

1. **Attention Mechanisms**: If your model uses attention, the mask ensures that attention scores for padding tokens are set to a very low value (often negative infinity), so that these scores don't affect the weighted sum of the input sequence.

2. **Loss Computation**: When computing loss, padding tokens should not contribute. Including them could mislead the model during training, as they don't represent genuine mistakes in prediction.

3. **Output Interpretation**: When the model makes predictions, padding tokens should not be taken into account for tasks like sequence-to-sequence translation, summarization, etc.

4. **Computational Efficiency**: In some models or algorithms, knowing which tokens are padding can speed up computation by allowing the model to skip unnecessary operations.

In summary, the padding mask is a utility to ensure that the model focuses only on the meaningful parts of the input sequence, thereby improving both accuracy and computational efficiency.
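
For the loss-computation point, one common way to exclude padded positions is the `ignore_index` argument of PyTorch's cross-entropy. The sketch below uses random logits and a made-up target, and assumes `PAD = 16` and `vocab_size = 18` as defined earlier:

```python
import torch
import torch.nn.functional as F

PAD = 16
torch.manual_seed(0)

logits = torch.randn(4, 18)              # 4 positions, vocab_size = 18
target = torch.tensor([PAD, PAD, 7, 2])  # first two positions should not count

# Positions whose target equals ignore_index are skipped and excluded from the mean.
loss_ignore = F.cross_entropy(logits, target, ignore_index=PAD)

# Same result as computing the loss on the non-PAD positions only.
loss_manual = F.cross_entropy(logits[2:], target[2:])

print(loss_ignore.item(), loss_manual.item())
```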

In [18]:
batch = next(iter(dataloaders.train_loader))
batch

tensor([[15,  9,  0, 10,  3,  8, 13,  1,  2,  8, 14],
        [15,  1,  5, 10,  5,  7, 13,  0,  7,  2, 14]])

Why concatenate the target instead of stacking it normally?

In [19]:
construct_batches(batch)[0].shape, construct_batches(batch)[1].shape, construct_batches(batch)[2].shape

(torch.Size([2, 10]), torch.Size([2, 10]), torch.Size([2, 1, 10, 10]))
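
Since `construct_batches` returns the padding mask and the future mask separately, a natural next step is to combine them. A minimal sketch, assuming the two masks are merged with a logical AND (the input batch here is made up and contains one PAD token for illustration):

```python
import torch

PAD = 16
inputs = torch.tensor([[15, 1, 2, 16]])  # hypothetical (B=1, L=4) batch, last token is PAD
B, L = inputs.shape

# (L, L) lower-triangular mask, broadcast to (B, 1, L, L)
future_mask = torch.tril(torch.ones(L, L)).bool().view(1, 1, L, L).expand(B, 1, L, L)

# (B, L) keep-mask, broadcast over the query dimension to (B, 1, L, L)
padding_mask = (inputs != PAD).view(B, 1, 1, L).expand(B, 1, L, L)

# A position may be attended to only if it is neither in the future nor padding.
combined = padding_mask & future_mask

print(combined[0, 0])
```

The singleton second dimension lets the same mask broadcast across all attention heads.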

## <a id='toc1_7_'></a>[Model](#toc0_)

In [20]:
import copy
import math
import unittest
from abc import ABC, abstractmethod
from typing import Optional, Tuple

import numpy as np
import rich
import torch
# from d2l import torch as d2l
from rich.pretty import pprint
from torch import nn
from dataclasses import dataclass

In [21]:
class Attention(ABC, nn.Module):
    """
    Base class for attention mechanisms.

    This abstract class provides a scaffold for attention mechanisms, with a
    dropout layer for regularization included. Subclasses are expected to
    implement the `forward` method.

    Attributes
    ----------
    dropout : The dropout layer applied to the attention scores.
        type: nn.Dropout

    Note
    ----
    1.  ABC method might be redundant since inheritance from nn.Module ensures
        forward method to be implemented.
    """

    def __init__(self, dropout: float = 0.0) -> None:
        super().__init__()
        self.dropout = nn.Dropout(p=dropout, inplace=False)

    @abstractmethod
    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.BoolTensor] = None,
    ) -> torch.Tensor:
        raise NotImplementedError(
            "The forward method must be implemented by the subclass."
        )


class ScaledDotProductAttention(Attention):
    """
    Implements scaled dot-product attention mechanism.

    This class is a derived instance of the Attention class that computes the
    scaled dot-product attention, defined by the following operation:

    .. math::
        \\text{Attention}(Q, K, V) = \\text{softmax} \\left( \\frac{QK^T}{\\sqrt{d_k}} \\right) V

    where:
    - Q is the query matrix
    - K is the key matrix
    - V is the value matrix
    - d_k is the dimension of the keys

    The attention mechanism can be applied in two different contexts: self-attention
    and cross-attention. Self-attention allows the model to integrate information
    from the entire sequence, while cross-attention allows the model to focus on
    information from a different sequence (e.g., encoder outputs).

    Methods
    -------
    forward(query, key, value, mask)
        Computes the forward pass for the scaled dot-product attention.
    """

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.BoolTensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Perform the forward pass for scaled dot-product attention.

        This function applies the attention mechanism on the input tensors `query`,
        `key`, and `value`. It's worth noting that for cross-attention, the sequence
        lengths of `query` and `key`/`value` may differ. This is because `query` is
        usually projected from the decoder's states, while `key` and `value` are from
        the encoder's states.

        Notations
        ---------
        - `B`  : Batch size
        - `D`  : Embedding dimension
        - `H`  : Number of heads
        - `d_k`: Dimension of the keys    = D // H
        - `d_q`: Dimension of the queries = D // H
        - `d_v`: Dimension of the values  = D // H
        - `N`  : Batch size (same as `B`; the two are used interchangeably below)
        - `T`  : Sequence length for `query`
        - `S`  : Sequence length for `key` and `value`
        - `L`  : Sequence length for `query`, `key` and `value` generic.

        NOTE: We use `L` in our notes instead of `T` and `S` since we assume all query,
        key and value are of same length.

        TODO: which shape is for cross-self?

        Parameters
        ----------
        query:  A tensor of query vectors representing the set of elements each sequence
                is seeking to attend to. It contains a batch of sequences, each with a set of
                vectors across multiple attention heads.
                    type :  torch.Tensor
                    shape: `(N, H, T, d_q)` where `d_q = D // H`

        key  :  A tensor of key vectors that are paired with values to form a mapping. The
                dot product of a query with these keys determines the attention weight for the
                corresponding values.
                    type :  torch.Tensor
                    shape: `(N, H, S, d_k)` where `d_k = D // H` (`S = T` in self-attention)
        value: A tensor of value vectors that are aggregated based on the attention
               weights to form the output of the attention mechanism.
                    type :  torch.Tensor
                    shape: `(N, H, S, d_v)` where `d_v = D // H` (`S = T` in self-attention)
        mask : An optional boolean mask tensor that can be used to mask out certain positions
               from the attention mechanism. For self-attention, the mask shape is typically
               `(B, T, T)`. For cross-attention, the mask typically has a shape of `(B, T, S)`
               allowing different target positions to attend to different source positions.
               Here, `T` is the sequence length of the queries and `S` is the sequence
               length of the keys and values, which equals `T` in self-attention and is
               the length of the source (encoder) sequence in cross-attention.

               However, to cater to the head dimension `H`, right after the `B` dimension, we
               will need to add (unsqueeze) a dimension to have `(B, 1, T, T)` for
               self-attention and `(B, 1, T, S)` for cross-attention.

        Returns
        -------
        Tuple[torch.Tensor, torch.Tensor]
            The context vectors and the attention weights. The context vectors are the weighted sum
            of the `value` vectors, representing the information to be attended to.
            The attention weights represent the attention probabilities.

            - Context Vectors shape:   `(N, H, T, d_v)`
            - Attention Weights shape: `(N, H, T, S)`
        """
        # fmt: off
        d_q               = query.size(dim=-1)

        attention_scores  = torch.matmul(query, key.transpose(dim0=-2, dim1=-1)) / torch.sqrt(torch.tensor(d_q).float())
        attention_scores  = attention_scores.masked_fill(mask == 0, float("-inf")) if mask is not None else attention_scores

        attention_weights = attention_scores.softmax(dim=-1)
        attention_weights = self.dropout(attention_weights)

        context_vector    = torch.matmul(attention_weights, value)
        # fmt: on
        return context_vector, attention_weights
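
To check the shapes involved, here is a standalone sketch of the same computation outside the class (no dropout), using made-up sizes and a causal mask. It is only meant to illustrate the tensor shapes, not to replace `ScaledDotProductAttention`.

```python
import math

import torch

torch.manual_seed(0)
B, H, L, d_k = 2, 4, 5, 8  # hypothetical sizes
Q = torch.randn(B, H, L, d_k)
K = torch.randn(B, H, L, d_k)
V = torch.randn(B, H, L, d_k)

# causal mask of shape (1, 1, L, L), broadcast over batch and heads
mask = torch.tril(torch.ones(L, L)).bool().view(1, 1, L, L)

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (B, H, L, L)
scores = scores.masked_fill(mask == 0, float("-inf"))  # hide future positions
weights = scores.softmax(dim=-1)                       # (B, H, L, L), rows sum to 1
context = weights @ V                                  # (B, H, L, d_k)

print(weights.shape, context.shape)
```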

In [22]:
class MultiHeadedAttention(nn.Module):
    __slots__ = [
        "d_model",
        "d_k",
        "d_q",
        "d_v",
        "H",
        "W_Q",
        "W_K",
        "W_V",
        "W_O",
        "attention",
        "dropout",
    ]

    def __init__(
        self,
        attention: Attention,
        H: int,
        d_model: int,
        dropout: float = 0.1,
        bias: bool = False,
    ) -> None:
        super().__init__()
        assert d_model % H == 0

        # fmt: off
        self.d_model   = d_model       # D
        self.d_k       = d_model // H  # stay true to notations
        self.d_q       = d_model // H
        self.d_v       = d_model // H

        self.H         = H             # number of heads

        # shadow my notations, actually they are of shape D x D.
        self.W_Q       = nn.Linear(self.d_model, self.d_q * self.H, bias=bias)  # D x D
        self.W_K       = nn.Linear(self.d_model, self.d_k * self.H, bias=bias)
        self.W_V       = nn.Linear(self.d_model, self.d_v * self.H, bias=bias)
        self.W_O       = nn.Linear(self.d_model, self.d_model, bias=bias)

        self.attention = attention
        self.dropout   = nn.Dropout(p=dropout, inplace=False)
        # fmt: on

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: Optional[torch.BoolTensor] = None,
    ) -> torch.Tensor:
        """
        Notations
        ---------
        B:      Batch size
        S or L: Source sequence length
        T or L: Target sequence length
        D:      Embedding dimension

        Parameters
        ----------
        query:  Although named as query, it is the embeddings Z from the token_embedding + positional_embedding layer.
                type:  torch.Tensor
                shape: (B, S or T, D)
        key:    Although named as key, it is the embeddings Z from the token_embedding + positional_embedding layer.
                type:  torch.Tensor
                shape: (B, S or T, D)
        value:  Although named as value, it is the embeddings Z from the token_embedding + positional_embedding layer.
                type:  torch.Tensor
                shape: (B, S or T, D)
        mask:   Mask to be applied to the attention scores.
                type:  torch.BoolTensor
                shape: (B, 1, S or T, S or T)

        Variables
        ---------
        W_Q.weight (D, D)

        """
        # fmt: off
        if mask is not None:
            assert mask.ndim     == 4, f"Mask should have 4 dimensions but got {mask.ndim}."
            assert mask.shape[0] == query.shape[0], ("Batch size of mask and query must match.")
            assert mask.shape[1] == 1, ("Mask should have shape (batch_size, 1, seq_len, seq_len).")
            assert mask.shape[2] == mask.shape[3] == query.shape[1], ("Mask should have shape (batch_size, 1, seq_len, seq_len).")


        Q = self.W_Q(query).contiguous() # Z @ W_Q -> LxD @ DxD = LxD
        K = self.W_K(key).contiguous()   # Z @ W_K
        V = self.W_V(value).contiguous() # Z @ W_V


        Q = self.transpose_qkv(Q)        # [B, H, L, D / H]
        K = self.transpose_qkv(K)
        V = self.transpose_qkv(V)

        # Attention
        # same as the other code: x = torch.matmul(p_atten, value)
        context_vector, attention_weights = self.attention(Q, K, V, mask)
        context_vector_concat             = self.reverse_transpose_qkv(context_vector)
        # fmt: on
        return self.W_O(context_vector_concat)

    def _reset_parameters(self) -> None:
        """See PyTorch's code for inspiration!"""
        ...

    def transpose_qkv(self, q_or_k_or_v: torch.Tensor) -> torch.Tensor:
        """Transposition for parallel computation of multiple attention heads.

        After the permute, the head dimension H sits right after the batch dimension,
        so a single batched matmul over (B, H) computes every head at once.
        """
        # fmt: off
        # 1. q_or_k_or_v is shape (B, L, D)
        # 2. aim to make it of shape (B, L, H, D / H = d_qkv)
        batch_size, seq_len, _ = q_or_k_or_v.shape
        q_or_k_or_v            = q_or_k_or_v.view(batch_size, seq_len, self.H, self.d_model // self.H)

        # 3. switch H from 3rd to 2nd dimension, or in python swap 2nd to 1st
        q_or_k_or_v            = q_or_k_or_v.permute(0, 2, 1, 3)
        # fmt: on
        return q_or_k_or_v

    def reverse_transpose_qkv(self, q_or_k_or_v: torch.Tensor) -> torch.Tensor:
        """Reverse the transposition operation for concatenating multiple attention heads."""
        # fmt: off
        # 1. q_or_k_or_v is shape (B, H, L, D / H = d_qkv)
        # 2. aim to make it of shape (B, L, H, D / H = d_qkv)
        q_or_k_or_v = q_or_k_or_v.permute(0, 2, 1, 3)

        # 3. Merge H and d_qkv into D
        batch_size, seq_len, _, _ = q_or_k_or_v.shape
        q_or_k_or_v = q_or_k_or_v.contiguous().view(batch_size, seq_len, self.d_model)
        # fmt: on
        return q_or_k_or_v

class ResidualBlock(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor],
    ) -> torch.Tensor:
        return x + sublayer(x)


class LayerNorm(nn.Module):
    def __init__(self, feature_dim: int, eps: float = 1e-6) -> None:
        super().__init__()
        # fmt: off
        self.gamma = nn.Parameter(torch.ones(feature_dim))
        self.beta  = nn.Parameter(torch.zeros(feature_dim))
        self.eps   = eps
        # fmt: on

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last (feature) dimension.
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization ("Add & Norm").

    The name AddNorm is apt since the Transformer diagram labels this step Add + Norm.
    Some call it a sublayer connection (Harvard) and compute
    x + dropout(sublayer(layernorm(x))), i.e. they normalize first.

    To stay true to the diagram, we apply the residual connection first and the
    layer norm second, as d2l does. Note that this implementation applies dropout
    to the sublayer's input, whereas d2l applies it to the sublayer's output.
    """

    def __init__(self, feature_dim: int, dropout: float) -> None:
        super().__init__()
        # fmt: off
        self.dropout    = nn.Dropout(p=dropout, inplace=False)
        self.layer_norm = nn.LayerNorm(normalized_shape=feature_dim)
        # fmt: on

    def forward(
        self, x: torch.Tensor, sublayer: Callable[[torch.Tensor], torch.Tensor]
    ) -> torch.Tensor:
        """G(x + F(dropout(x))) where G = layer norm and F = sublayer."""
        return self.layer_norm(x + sublayer(self.dropout(x)))

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(
        self,
        d_model: int,
        d_ff: int,
        activation: nn.Module = nn.ReLU(),
        dropout: float = 0.1,
        bias: bool = True,
    ) -> None:
        super().__init__()
        # fmt: off
        self.dense_1    = nn.Linear(d_model, d_ff, bias=bias)
        self.dense_2    = nn.Linear(d_ff, d_model, bias=bias)
        self.activation = activation
        self.dropout    = nn.Dropout(p=dropout, inplace=False)
        self.ffn        = nn.Sequential(self.dense_1, self.activation, self.dropout, self.dense_2)
        # fmt: on

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.ffn(z)

In [23]:
class GPTDecoderBlock(nn.Module):
    """GPTDecoderBlock focuses on masked self-attention and feed-forward layers.

    The architecture follows the GPT-style decoder, which only has masked
    self-attention and position-wise feed-forward layers, omitting the
    encoder-decoder cross-attention.
    """

    def __init__(self, config: ModelConfig) -> None:
        super().__init__()
        # fmt: off
        self.masked_self_attention_mha = MultiHeadedAttention(**config.decoder.masked_self_attention_mha.__dict__)
        # self.encoder_decoder_cross_attention_mha = MultiHeadedAttention(**config.decoder.encoder_decoder_cross_attention_mha)

        self.feed_forward              = PositionwiseFeedForward(**config.decoder.feed_forward.__dict__)

        self.add_norm_1                = AddNorm(**config.decoder.add_norm_1.__dict__)
        self.add_norm_2                = AddNorm(**config.decoder.add_norm_2.__dict__)

        # self.feed_forward.register_forward_hook(forward_hook)
        # fmt: on

    def forward(self, z: torch.Tensor, target_mask: torch.BoolTensor) -> torch.Tensor:
        """
        Parameters
        ----------
        z:           Input sequence.
                     type:  torch.Tensor
                     shape: (B, S or T, D)
        target_mask: Target mask.
                     type:  torch.BoolTensor
                     shape: (B, 1, S or T, S or T)

        Returns
        -------
        z:           Output tensor after masked self-attention and feed-forward layers.
                     type:  torch.Tensor
                     shape: (B, S or T, D)
        """
        z = self.add_norm_1(
            z,
            lambda z: self.masked_self_attention_mha(
                query=z, key=z, value=z, mask=target_mask
            ),
        )
        z = self.add_norm_2(z, self.feed_forward)
        return z


class GPTDecoder(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        # fmt: off
        self.d_model       : int                   = config.d_model
        self.tok_embed     : nn.Embedding          = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_embed     : nn.Parameter          = nn.Parameter(torch.zeros(1, config.max_seq_len, config.d_model))
        self.decoder_blocks: nn.ModuleList         = nn.ModuleList([GPTDecoderBlock(config) for _ in range(config.num_layers)])

        self.dropout       : nn.Dropout            = nn.Dropout(config.dropout)
        self.layer_norm    : nn.LayerNorm          = nn.LayerNorm(config.d_model)
        self.head          : nn.Linear             = nn.Linear(config.d_model, config.vocab_size)  # last layer
        # fmt: on

        self._reset_parameters()

    def _reset_parameters(self) -> None:
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(
        self,
        input_tokens: torch.LongTensor,
        target_padding_mask: torch.BoolTensor,
        future_mask: torch.BoolTensor,
    ) -> torch.FloatTensor:
        """
        Notations
        ---------
        B:      Batch size
        S or L: Source sequence length
        T or L: Target sequence length
        D:      Embedding dimension
        V:      Vocabulary size (Class size)

        Parameters
        ----------
        input_tokens:        Input sequence.
                             type:  torch.Tensor
                             shape: (B, S or T)
        target_padding_mask: Target padding mask.
                             type:  torch.BoolTensor
                             shape: (B, 1, S or T, S or T)
        future_mask:         Future mask.
                             type:  torch.BoolTensor
                             shape: (B, 1, S or T, S or T)

        Variables
        ---------
        z:                   Input sequence after token and position embedding.
                             type:  torch.Tensor
                             shape: (B, S or T, D)
        target_mask:         Target mask.
                             type:  torch.BoolTensor
                             shape: (B, 1, S or T, S or T)
        logits:              Output logits.
                             type:  torch.FloatTensor
                             shape: (B, S or T, V)
        """
        seq_len = input_tokens.size(1)
        target_mask: torch.BoolTensor = torch.logical_and(
            target_padding_mask, future_mask
        )

        # fmt: off
        z = self.tok_embed(input_tokens) # * math.sqrt(self.d_model) for better optimization landscape
        z = z + self.pos_embed[:, :seq_len, :]
        z = self.dropout(z)

        for decoder_block in self.decoder_blocks:
            z  = decoder_block(z, target_mask)

        z      = self.layer_norm(z)
        logits = self.head(z)
        # fmt: on
        return logits

### <a id='toc1_7_1_'></a>[Masks](#toc0_)

In Transformer models, especially in the decoder, two types of masks are commonly used: padding masks and look-ahead masks (or future masks). Here's why each is important and why you might need both:

#### <a id='toc1_7_1_1_'></a>[Padding Mask](#toc0_)

1. **Why it's needed**: When you're dealing with sequences of different lengths, you pad the shorter sequences with a dedicated padding token so that every sequence in the batch has the same length as the longest one. These padded positions carry no information and should not contribute to the output of the attention mechanism.
  
2. **Where it's used**: Both the encoder and the decoder use padding masks.

3. **How it works**: The padding mask marks the padded positions so that they can be excluded from contributing to the attention mechanism. In practice, you'll typically set the corresponding attention scores to negative infinity before applying the softmax operation.

#### <a id='toc1_7_1_2_'></a>[Look-Ahead Mask (Future Mask)](#toc0_)

1. **Why it's needed**: In the decoder, each position can only attend to itself and to positions that come before it in the sequence, which maintains the auto-regressive property. This is different from the encoder, where all positions can attend to all other positions.

2. **Where it's used**: This mask is specifically for the decoder.

3. **How it works**: The look-ahead mask is used to mask out future positions (i.e., positions that come after the current position) so that they don't contribute to the current attention scores. Before the softmax operation, the scores at these positions are set to negative infinity, so their attention weights become exactly zero.

#### <a id='toc1_7_1_3_'></a>[Using Both Masks in the Decoder](#toc0_)

The decoder uses both types of masks together, since they address different requirements:

- The padding mask ignores padded positions.
- The look-ahead mask ensures that each position only attends to itself and to positions before it in the sequence.
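
As a minimal sketch of how the two masks act on a single sequence, consider the toy example below. The scores are made up (all zeros), the sequence length of 4 is arbitrary, and `True` means "may attend", as elsewhere in this notebook.

```python
import torch

# Hypothetical raw attention scores for one sequence of length 4 (query x key).
scores = torch.zeros(4, 4)

# Padding mask: the last token is padding, so no query may attend to it.
padding_mask = torch.tensor([True, True, True, False])  # shape (4,)

# Look-ahead mask: query i may attend to keys 0..i only.
look_ahead_mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))  # shape (4, 4)

# Broadcast the padding mask over the query dimension and combine both masks.
combined_mask = padding_mask.unsqueeze(0) & look_ahead_mask  # shape (4, 4)

# Masked positions get -inf so that softmax assigns them zero weight.
masked_scores = scores.masked_fill(~combined_mask, float("-inf"))
attention_weights = torch.softmax(masked_scores, dim=-1)
print(attention_weights)
```

Every row still sums to one, but the weight is spread only over the positions that both masks allow.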



## <a id='toc1_8_'></a>[2-Digits Addition](#toc0_)

- `max_len` example: `<BOS>90+38=128<EOS>`
  - Another reminder: the max length is the same in this example, but this is on purpose.

In [24]:
rng_seed = 1992
num_digits = 2
dataset_size = 10000
batch_size = 256

# max_len = 1 + num_digits + 1 + num_digits + 1 + num_digits + 1 + 1,
# where the 1s represent BOS, plus sign, equal sign, the extra digit in the sum, and EOS, respectively.
max_len = 1 + 1 + 1 + 1 + 2 * num_digits + (num_digits + 1)

dataloaders = DataLoaders(rng_seed, num_digits, dataset_size, batch_size=batch_size)
dataloaders.split_data(split=[0.7, 0.2, 0.1])

train_size, val_size, test_size = dataloaders.train_size, dataloaders.val_size, dataloaders.test_size

In [25]:
# Create individual component configurations
masked_self_attention_mha_config = MultiHeadedAttentionConfig(
    attention=ScaledDotProductAttention(),
    d_model=128, H=4, dropout=0.1
)

feed_forward_config = PositionwiseFeedForwardConfig(
    d_model=128, d_ff=256, activation=nn.ReLU(), dropout=0.1, bias=True
)

add_norm_config_1 = AddNormConfig(feature_dim=128, dropout=0.1)
add_norm_config_2 = AddNormConfig(feature_dim=128, dropout=0.1)

# Create DecoderBlockConfig
decoder_block_config = DecoderBlockConfig(
    masked_self_attention_mha=masked_self_attention_mha_config,
    feed_forward=feed_forward_config,
    add_norm_1=add_norm_config_1,
    add_norm_2=add_norm_config_2,
)

# Create the overall ModelConfig
model_config = ModelConfig(
    d_model=128,
    vocab_size=vocab_size,  # You'll need to specify this based on your dataset
    max_seq_len=max_len,  # Assuming max_len is defined elsewhere in your code
    num_layers=2,
    dropout=0.1,
    decoder=decoder_block_config,
)

model = GPTDecoder(model_config).to(DEVICE)
model_size = sum([p.numel() for p in model.parameters()])
print(f'model_size: {model_size}, train_set_size: {train_size}')

model_size: 270226, train_set_size: 7000


In [26]:
warmup_steps = 3*len(dataloaders.train_loader)
# lr first increases during the warmup steps, and then decreases
lr_fn        = lambda step: model_config.d_model**(-0.5) * min([(step+1)**(-0.5), (step+1)*warmup_steps**(-1.5)])
optimizer    = torch.optim.Adam(model.parameters(), lr=0.2, betas=(0.9, 0.98), eps=1e-9)
scheduler    = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_fn)
criterion    = nn.CrossEntropyLoss(ignore_index=PAD, reduction="mean")

def train_epoch(
    model: torch.nn.Module,
    dataloader: DataLoader,
    criterion: torch.nn.Module,
    optimizer: Optimizer,
    scheduler: Optional[_LRScheduler] = None,
    grad_norm_clip: float = 1.0,
    device: str = "cuda",
) -> float:
    """
    input: This is the input sequence (the last token is removed).
        shape: [B, S or T]
    target: This is the input sequence shifted by one time step to the left (the first token, BOS, is removed).
        shape: [B, S or T]
    target_padding_mask:
        shape: [B, 1, S or T, S or T]
    future_mask:
        shape: [B, 1, S or T, S or T]
    logits:
        shape: [B, S or T, V]
        shape: [B, V, S or T] when passed in to loss as pytorch enforces the 2nd dim to be class/vocab.
    """
    model.to(
        device=device,
        dtype=next(model.parameters()).dtype,
        non_blocking=True,
    )
    model.train()

    # fmt: off
    losses:       List[float] = []
    num_batches:  int         = len(dataloader)
    progress_bar: tqdm        =  tqdm(enumerate(dataloader, start=1), total=num_batches)
    # fmt: on

    for _batch_index, x in progress_bar:
        input, target, target_padding_mask, future_mask = construct_batches(x)
        # No flattening is needed here: the loss below takes logits as (B, V, T) and target as (B, T).
        # Move every tensor, including the target used by the loss, to the device.
        input, target, target_padding_mask, future_mask = (
            input.to(device),
            target.to(device),
            target_padding_mask.to(device),
            future_mask.to(device),
        )

        optimizer.zero_grad()

        logits = model(
            input,
            target_padding_mask=target_padding_mask,
            future_mask=future_mask,
        )
        #logits = logits.view(size=(-1, logits.size(-1)))

        loss = criterion(logits.permute(0, 2, 1).contiguous(), target.contiguous())
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_norm_clip)
        optimizer.step()
        if scheduler:
            scheduler.step()

        losses.append(loss.item())

        # `scheduler` is optional, so only report the learning rate when it exists.
        if scheduler and _batch_index % 50 == 0:
            progress_bar.set_description(
                f"ep: {scheduler.last_epoch // num_batches}, train loss={loss.item():.3f}, lr={scheduler.get_last_lr()[0]:.5f}"
            )

    return float(np.mean(losses))

In [27]:


def train(model, dataloaders, epochs):
    for ep in range(epochs):
        train_loss = train_epoch(model, dataloaders.train_loader, criterion, optimizer, scheduler, grad_norm_clip=1.0, device=DEVICE)
        val_loss = validate(model, dataloaders.val_loader)
        print(f"ep {ep}: train_loss: {train_loss:.5f}, val_loss: {val_loss:.5f}")

    return train_loss, val_loss


def validate(model, dataloader):
    "Function for computing the loss on the validation set."
    model.eval()
    losses = []
    with torch.no_grad():
        for x in dataloader:
            input, target, target_padding_mask, future_mask = construct_batches(x)
            # Flatten so that the loss sees (B * T, V) logits against (B * T,) targets.
            target = target.view(-1)
            pred = model(input, target_padding_mask=target_padding_mask, future_mask=future_mask).to(DEVICE)

            pred = pred.view(-1, pred.size(-1))
            losses.append(criterion(pred, target).item())
    return np.mean(losses)


@torch.no_grad()
def compute_sum(model, x):
    "Function for computing the sum of two numbers."
    for i in range(num_digits + 2):
        pad_mask = (x != PAD).view(1, 1, 1, x.size(-1)).to(DEVICE)
        future_mask = construct_future_mask(seq_len=x.size(1))
        # input, target, target_padding_mask, future_mask = construct_batches(x)
        #logits = model(x, target_padding_mask=target_padding_mask, future_mask=future_mask).to(DEVICE)

        logits = model(x, pad_mask, future_mask)
        last_output = logits.argmax(-1)[:, -1].view(1, 1)
        x = torch.cat((x, last_output), 1).to(DEVICE)
        if last_output.item() == EOS:
            break
    return x[0]


def evaluate(model, dataloader, num_batch=None):
    """Function for evaluating the model.
    This function takes equations, truncates them right after the equal sign, and feeds them to the
    model to get the predictions. It then compares them with the correct answers and returns the accuracy.
    """
    model.eval()
    acc, count = 0, 0
    num_wrong_to_display = 5
    for idx, x in enumerate(dataloader):
        for equation in x:
            loc_equal_sign = equation.tolist().index(EQUAL_SIGN)
            loc_EOS = equation.tolist().index(EOS)
            input = equation[0 : loc_equal_sign + 1].view(1, -1).to(DEVICE)
            ans = equation[: loc_EOS + 1].tolist()
            ans_pred = compute_sum(model, input)
            count += 1

            if ans == ans_pred.tolist():
                acc += 1
            else:
                if num_wrong_to_display > 0:
                    print(
                        f'correct equation: {decode_equation(equation).replace("<pad>","")}'
                    )
                    print(f"predicted:        {decode_equation(ans_pred)}")
                    num_wrong_to_display -= 1
        if num_batch and idx > num_batch:
            break
    return acc / count


def what_is(question: str) -> str:
    "function for computing the sum of two numbers with input in literal string format"
    pred = compute_sum(model, encode_equation(question, num_digits).view(1, -1))
    pred = decode_equation(pred)
    pred = pred[pred.index("=") + 1 :]
    return question + pred

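@dataclass
class OptimizerConfig:
    # Fields read by create_optimizer below; the values are left to the caller.
    learning_rate: float
    weight_decay: float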

def create_optimizer(model: torch.nn.Module, opt_config: OptimizerConfig):
    """
    This long function is unfortunately doing something very simple and is being very defensive:
    We are separating out all parameters of the model into two buckets: those that will experience
    weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
    We are then returning the PyTorch optimizer object.
    """

    # separate out all parameters to those that will and won't experience regularizing weight decay
    decay = set()
    no_decay = set()
    whitelist_weight_modules = (torch.nn.Linear,)
    blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
    for mn, m in model.named_modules():
        for pn, p in m.named_parameters():
            fpn = "%s.%s" % (mn, pn) if mn else pn  # full param name
            # random note: because named_modules and named_parameters are recursive
            # we will see the same tensors p many many times. but doing it this way
            # allows us to know which parent module any tensor p belongs to...
            if pn.endswith("bias"):
                # all biases will not be decayed
                no_decay.add(fpn)
            elif pn.endswith("weight") and isinstance(m, whitelist_weight_modules):
                # weights of whitelist modules will be weight decayed
                decay.add(fpn)
            elif pn.endswith("in_proj_weight"):
                # MHA projection layer
                decay.add(fpn)
            elif pn.endswith("weight") and isinstance(m, blacklist_weight_modules):
                # weights of blacklist modules will NOT be weight decayed
                no_decay.add(fpn)
            elif pn.endswith(("pos_emb", "pos_embed")):
                # positional embedding (named `pos_embed` in GPTDecoder) shouldn't be decayed
                no_decay.add(fpn)

    # validate that we considered every parameter
    param_dict = {pn: p for pn, p in model.named_parameters()}
    inter_params = decay & no_decay
    union_params = decay | no_decay
    assert (
        len(inter_params) == 0
    ), "parameters %s made it into both decay/no_decay sets!" % (str(inter_params),)
    assert (
        len(param_dict.keys() - union_params) == 0
    ), "parameters %s were not separated into either decay/no_decay set!" % (
        str(param_dict.keys() - union_params),
    )

    # create the pytorch optimizer object
    optim_groups = [
        {
            "params": [param_dict[pn] for pn in sorted(list(decay))],
            "weight_decay": opt_config.weight_decay,
        },
        {
            "params": [param_dict[pn] for pn in sorted(list(no_decay))],
            "weight_decay": 0.0,
        },
    ]
    optimizer = torch.optim.AdamW(
        optim_groups, lr=opt_config.learning_rate, betas=(0.9, 0.95)
    )
    return optimizer


1. `input` is indeed `[bs, 10]`: the max len is 11 and the last token is removed.
2. `target` should also be `[bs, 10]`, namely the original sequence shifted left by one token, but somehow I got 11.
3. Think of vocab size to be num classes in my classification problem. But the
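
The shapes in the notes above can be checked with a small sketch. Everything here is illustrative: `PAD = 0`, the vocabulary size of 6 and the toy batch are made-up values, and random logits stand in for the model output. It also shows that permuting the logits to `(B, V, T)` and flattening to `(B * T, V)` are two routes to the same loss.

```python
import torch
from torch import nn

PAD = 0         # hypothetical padding id, for illustration only
vocab_size = 6  # hypothetical vocabulary size (number of classes)

# One batch of two sequences of length 5; the second one is padded.
batch = torch.tensor([[1, 2, 3, 4, 5],
                      [1, 3, 4, PAD, PAD]])

inputs = batch[:, :-1]   # drop the last token  -> (B, L - 1)
targets = batch[:, 1:]   # drop the first token -> (B, L - 1)

torch.manual_seed(0)
# Stand-in for the model output, shape (B, L - 1, V).
logits = torch.randn(batch.size(0), inputs.size(1), vocab_size)

criterion = nn.CrossEntropyLoss(ignore_index=PAD, reduction="mean")

# Option 1: move the class dimension to position 1 -> (B, V, L - 1).
loss_permuted = criterion(logits.permute(0, 2, 1), targets)

# Option 2: flatten batch and time -> (B * (L - 1), V) against (B * (L - 1),).
loss_flat = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

print(inputs.shape, targets.shape, loss_permuted.item(), loss_flat.item())
```

With a max len of 11, the same slicing gives `[bs, 10]` for both `inputs` and `targets`.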

In [28]:
if DEBUG:
    train_loss, val_loss = train(model, dataloaders, epochs=2)
    # torch.save(model.state_dict(), 'model_debug.pt')
    # model_debug = torch.load('./model_debug.pt')
    # if are_both_models_same(model.state_dict(), model_debug):
    #     print("Pass")
    # else:
    #     print("Fail")
else:
    train_loss, val_loss = train(model, dataloaders, epochs=30)

    #torch.save(model.state_dict(), 'model_non_debug.pt')

100%|██████████| 28/28 [00:01<00:00, 17.50it/s]


ep 0: train_loss: 2.10042, val_loss: 1.27964


100%|██████████| 28/28 [00:01<00:00, 19.53it/s]


ep 1: train_loss: 1.24153, val_loss: 1.11355


```
100%|██████████| 28/28 [00:03<00:00,  8.07it/s]
ep 0: train_loss: 2.10012, val_loss: 1.28574
100%|██████████| 28/28 [00:03<00:00,  7.26it/s]
ep 1: train_loss: 1.23693, val_loss: 1.10578
```

```
ep 0: train_loss: 2.10042, val_loss: 1.27964
100%|██████████| 28/28 [00:01<00:00, 19.49it/s]
ep 1: train_loss: 1.24153, val_loss: 1.11355
```


In [None]:
test_loss = validate(model, dataloaders.test_loader)
print('training set examples the model gives an incorrect result:')
train_acc = evaluate(model, dataloaders.train_loader, 20)
print('validation set examples the model gives an incorrect result:')
val_acc = evaluate(model, dataloaders.val_loader)
print('test set examples the model gives an incorrect result:')
test_acc = evaluate(model, dataloaders.test_loader)
result = f'''train_size: {train_size}, train_loss: {train_loss},
                val_loss: {val_loss}, test_loss: {test_loss},
                test_acc: {test_acc}, val_acc: {val_acc}, train_acc: {train_acc}
                '''
print(result)

Output of the non-debug run (30 epochs):

```
correct equation: 24+86=110
predicted:        24+86=100
correct equation: 84+26=110
predicted:        84+26=100
validataion set examples the model gives an incorrect result:
test set examples the model gives an incorrect result:
train_size: 7000, train_loss: 0.013642309483007662,
                val_loss: 0.0008140208410623018, test_loss: 0.00040599027124699205,
                test_acc: 1.0, val_acc: 1.0, train_acc: 0.9996448863636364
```

## <a id='toc1_9_'></a>[Adder Decoder Walkthrough](#toc0_)

In a decoder-only model like GPT, the input sequence is essentially the target.
The model generates the tokens that come after the given input, treating it as
the "history" or "context" for the generation task. In an encoder-decoder model
like the original Transformer, the encoder processes a source sequence and the
decoder generates a target sequence. A decoder-only model works solely with what
would traditionally be considered the target sequence.

Consequently, the padding mask applied to the input (the "source", to beginners
like me) is named `target_padding_mask`. The name stays consistent with the
original Transformer architecture, where the decoder's job is to generate the
target sequence, and it makes clear that the mask operates on what is
essentially the target of the model's generation task.

### <a id='toc1_9_1_'></a>[Target Padding Mask (`target_padding_mask`)](#toc0_)

- Definition: An attention mask that ignores pad tokens in the input sequence. In a decoder-only model, this input (source) sequence is the target.
- Shape     : `(B, S)` or `(B, L)`.

In [None]:
pad_token_id = 16
target_batch = torch.tensor(
    [
        [5, 7, 9, 16, 16],
        [8, 6, 16, 16, 16],
        [3, 12, 4, 11, 16],
        [2, 1, 4, 16, 16],
    ]
)

batch_size, seq_len = target_batch.size()

target_padding_mask = target_batch != pad_token_id

pprint(target_padding_mask)
pprint(target_padding_mask.shape)

### <a id='toc1_9_2_'></a>[Future Mask (`future_mask`)](#toc0_)

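- Definition: An attention mask that prevents each position from attending to future positions.
- Shape     : `(L, L)`.
- Note      : Independent of the batch size; every sample shares the same mask.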

In [None]:
seq_len = 5
future_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
future_mask = future_mask == 0

pprint(future_mask)
pprint(future_mask.shape)

One thing to keep in mind is that the context vectors are a matrix product of `attention_weights` and the values (a projection of the input sequence). The word "weights" is apt here: they are indeed derived from the model's weights, although not explicitly. The attention weights are the softmax of the preceding `attention_scores`. We fill the masked positions of `attention_scores` with `-inf` because the softmax of `-inf` is zero, which effectively zeroes out the masked logits.

Let's consider what zeroing out these masked logits actually does. The attention
mechanism can be thought of as a weighted average of all the tokens in the input
sequence. Each token is assigned a weight, with higher weights indicating more
relevance to the token under consideration. If a certain token should not be
considered at all (e.g., it's a future token that should not be visible to the
current decoder step, or it's a padding token), its weight should be zero.

In the case of a masked self-attention mechanism, as is often used in the
decoder of a transformer, there are two main scenarios where masking comes into
play:

1. **Padding Tokens**: You don't want the attention mechanism to consider
   padding tokens as they carry no useful information. If it did, it could skew
   the resulting weighted average.

2. **Future Tokens in Decoding**: In autoregressive decoding, the model
   shouldn't have access to future tokens in the sequence when making
   predictions. Otherwise, the model would cheat by peeking ahead.

By setting the corresponding positions in the attention scores tensor to `-inf`
and then applying a softmax, you effectively get a zero at those positions in
the attention weights tensor. This results in completely ignoring those tokens
when taking the weighted sum of the value vectors, thus implementing the desired
masking behavior.

To summarize, zeroing out masked logits ensures that the tokens corresponding to
those logits do not contribute to the computed context, whether because they are
padding or because they are future tokens that should not be visible to the
model at a given time step.
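
A minimal sketch of this claim, using random tensors and a hypothetical `masked_attention` helper (not the notebook's `ScaledDotProductAttention`): once a position is masked, changing its value vector leaves every context vector unchanged.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 4, 8
queries = torch.randn(seq_len, d_k)
keys = torch.randn(seq_len, d_k)
values = torch.randn(seq_len, d_k)

# True = may attend. Causal mask, with the last position treated as padding.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
mask[:, -1] = False


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with a boolean mask (illustrative helper)."""
    scores = q @ k.T / math.sqrt(q.size(-1))           # attention_scores, (L, L)
    scores = scores.masked_fill(~mask, float("-inf"))  # hide masked positions
    weights = torch.softmax(scores, dim=-1)            # attention_weights, (L, L)
    return weights @ v, weights                        # context vectors, (L, d_k)


context, weights = masked_attention(queries, keys, values, mask)

# Overwrite the value vector of the masked position with something very different.
values_changed = values.clone()
values_changed[-1] = 1000.0
context_changed, _ = masked_attention(queries, keys, values_changed, mask)

print(torch.allclose(context, context_changed))
```

The masked column of `weights` is exactly zero, so the overwritten value vector never enters the weighted sum.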


The purpose of applying `logical_and` between `target_padding_mask` and `future_mask` is to combine the constraints from both masks when calculating self-attention scores in the transformer's decoder. The `target_padding_mask` is designed to mask out the padding tokens in the input sequence, while the `future_mask` ensures that a given position cannot attend to future positions in the sequence. By combining these masks, you can perform the necessary masking for both padding and future tokens in a single step.

Here's how it works:

1. `target_padding_mask`: Masks out the padding tokens so that they don't contribute to the attention calculations. `True` means "attend to this token" and `False` means "ignore this token".

2. `future_mask`: Masks out future tokens so that a token at a given position can only attend to itself and to positions that come before it. It is a lower triangular matrix: the lower triangle, including the diagonal, is `True`, and the upper triangle is `False`. Again, `True` means "attend to this token" and `False` means "ignore this token".

3. `logical_and(target_padding_mask, future_mask)`: Combines the two masks. A `True` in the resulting mask means that both the padding condition and the future condition allow attention.

With the two masks combined, the decoder obeys the autoregressive property, since it never sees future tokens, while also ignoring padding tokens in the input sequence. The code calls the combined mask `target_mask`.

The same mask is applied to all `H` heads.
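
The combination can be sketched with broadcasting. This is one possible layout, not necessarily the one `construct_batches` produces: the padding mask is reshaped to `(B, 1, 1, L)`, mirroring the `(1, 1, 1, L)` shape used in `compute_sum`, and `H = 4` and the all-zero scores are assumptions for illustration.

```python
import torch

pad_token_id = 16  # same illustrative padding id as in the cell above
target_batch = torch.tensor([[5, 7, 9, 16, 16],
                             [8, 6, 16, 16, 16]])
B, L = target_batch.shape
H = 4  # hypothetical number of heads

# (B, L) -> (B, 1, 1, L): one flag per key position, broadcast over heads and queries.
target_padding_mask = (target_batch != pad_token_id).view(B, 1, 1, L)

# (L, L) -> (1, 1, L, L): shared by every sample and every head.
future_mask = torch.tril(torch.ones(L, L, dtype=torch.bool)).view(1, 1, L, L)

# Broadcasting yields a combined mask of shape (B, 1, L, L).
target_mask = torch.logical_and(target_padding_mask, future_mask)

# The singleton head dimension lets the same mask apply to all H heads.
scores = torch.zeros(B, H, L, L)
masked_scores = scores.masked_fill(~target_mask, float("-inf"))

print(target_mask.shape, masked_scores.shape)
```

Because the head dimension of `target_mask` has size 1, `masked_fill` reuses the same `(L, L)` pattern for each of the `H` heads of a sample.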



### <a id='toc1_9_3_'></a>[Example of Source Padding and Future Masks](#toc0_)

#### <a id='toc1_9_3_1_'></a>[First Sample First Token](#toc0_)

- `target_padding_mask` has size `[4, 5]`.
  - We zoom in on the first row (sample), which has length 5.
  - This length 5 is the sequence length. The row is `T, T, T, F, F`, indicating that the last 2 tokens are padding.
- `future_mask` has size `[5, 5]`.
  - Note that this is independent of the batch size. Each sample has the same future mask of shape `[L, L]`.
  - This `L=5` must equal the sequence length in `target_padding_mask`.
- First, consider one batch of 4 samples. We broadcast `future_mask` to `[4, 5, 5]` because we want each sample (row) in the batch to have the same future mask, as shown below:

In [None]:
pprint(future_mask)
future_mask = future_mask.view(1, seq_len, seq_len).expand(size=(batch_size, -1, -1))
pprint(future_mask)
pprint(future_mask.shape)

- Now, we can zoom in to one particular sample since both `target_padding_mask` and `future_mask` have the same first dimension of batch size.
- What remains is to broadcast `target_padding_mask` so that it has the same shape as `future_mask`. This means we broadcast `[4, 5]` to `[4, 5, 5]`. But why?
- For simplicity, we slice the first sample of both below.
- The first row of the `future_mask` of the first sample is `T, F, F, F, F`. This corresponds to what? This is the future mask of the first token in the sequence. Well, that is confusing, because it apparently has 5 elements, and has "information" about the other 4 tokens in the sequence. Let's explain in detail below:
  - Regarding the first row of the `future_mask` in the first sample, which is `[T, F, F, F, F]`, it might initially seem confusing why there are 5 elements. Each of these elements, in fact, corresponds to whether the first token can attend to other tokens at each respective position in the sequence. Here's how to interpret it:
    - The first element (`True`) indicates that the first token can attend to itself.
    - The next four elements (`False`) specify that the first token should not attend to any of the future tokens in the sequence.
- Consequently, what does the `target_padding_mask` say? Recall that the first sample's `target_padding_mask` is `T, T, T, F, F`: position `j` is `T` if token `j` is a real token and `F` if it is padding.
- What do we want to achieve here? We want to make sure that no token **attends** to positions in the sequence that are masked with `False`.
- In other words, for the first token of the first sample we have a `future_mask` row of `T, F, F, F, F` (which positions it may look at causally) and the padding row `T, T, T, F, F` (which positions hold real tokens).
- This is why `target_padding_mask` is reshaped with `view(batch_size, 1, seq_len)` and expanded to `[4, 5, 5]`: the same padding row `T, T, T, F, F` is repeated for each of the 5 token positions, so that every token gets a full row of 5 entries to align with its row in `future_mask`.
- For the first token, the first position is a real token (`T`) and the first token may attend to itself (`T`), hence `T` and `T` give `T`. Every later position is already `F` in the `future_mask`, so the padding mask changes nothing for this row.
- Tricky: after broadcasting, every row of the `[5, 5]` padding mask is identical. The entries in a row describe the tokens being attended to (the columns), not the token doing the attending (the row). The element-wise `AND` with the `future_mask` then gives `[T, T, T, F, F] & [T, F, F, F, F] = [T, F, F, F, F]` for the first token.

In [None]:
pprint(target_padding_mask)
pprint(target_padding_mask[0])

In [None]:
pprint(target_padding_mask)
target_padding_mask = target_padding_mask.view(batch_size, 1, seq_len).expand(size=(batch_size, seq_len, seq_len))
pprint(target_padding_mask)
pprint(target_padding_mask.shape)

In [None]:
pprint(target_padding_mask[0])
pprint(future_mask[0])
pprint(target_padding_mask[0] & future_mask[0])

#### <a id='toc1_9_3_2_'></a>[First Sample Fourth Token](#toc0_)

Now let's look at another example: the 4th token in the sequence, where `target_padding_mask = [T, T, T, F, F]` and `future_mask` is a lower triangular matrix of `True`s.

1. **4th Token's target_padding_mask**: The 4th token has a value of `F` in `target_padding_mask`, indicating it's a padding token.
   
2. **4th Row of future_mask**: The 4th row in `future_mask` is `[True, True, True, True, False]`. This means the 4th token is causally allowed to attend to all the previous tokens in the sequence and itself, but not to any future token.

3. **Broadcast target_padding_mask**: With the `view(batch_size, 1, seq_len)` broadcast used above, the 4th row of the expanded `target_padding_mask` is the same padding row as every other row, `[T, T, T, F, F]`. The `F` of the 4th token therefore shows up in the 4th *column* of every row, not as a row of `F`s.

4. **Element-wise AND with future_mask**: The element-wise AND between `[T, T, T, F, F]` and `[True, True, True, True, False]` results in `[T, T, T, F, F]`.

5. **Interpretation**: No token attends to the 4th token, because its column is `F` in every row. The 4th token itself, as a query, still attends to the three real tokens; its output is meaningless, but it can be discarded later by the `ignore_index` of the loss. Masking its whole row instead would leave softmax with nothing but `-inf` entries and produce `NaN`.

So, the masks are doing their jobs correctly: the `target_padding_mask` indicates whether each token is a padding token or not, and `future_mask` dictates the "rules" of attention regarding what each token can attend to. Combining them ensures that both conditions are met.
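
To check the two walkthroughs numerically, here is a minimal sketch that rebuilds both masks for a single sample. The tensor names mirror the cells above; the hard-coded padding row `[T, T, T, F, F]` comes from the example, while the batch size of 1 and the variable `combined` are assumptions made for illustration.

```python
import torch

seq_len = 5
# Padding row of the first sample: the last two tokens are padding.
target_padding_mask = torch.tensor([[True, True, True, False, False]])  # [1, L]
batch_size = target_padding_mask.size(0)

# Lower-triangular future mask, shared by every sample in the batch.
future_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()           # [L, L]
future_mask = future_mask.view(1, seq_len, seq_len).expand(batch_size, -1, -1)

# Same broadcast as in the cells above: the padding row is repeated
# once per token position, giving a [1, L, L] tensor.
padding_mask = target_padding_mask.view(batch_size, 1, seq_len).expand(
    batch_size, seq_len, seq_len
)

combined = padding_mask & future_mask  # [1, L, L]
print(combined[0, 0])  # first token:  [True, False, False, False, False]
print(combined[0, 3])  # fourth token: [True, True, True, False, False]
```

Rows index the attending position and columns index the attended position, so the two padded tokens appear as two all-`False` columns on the right.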

### <a id='toc1_9_4_'></a>[Further Add a Singleton Dimension in Masks](#toc0_)

Now both masks are of shape `(B, L, L)`, but we need to insert a singleton dimension at position 1 (right after the batch dimension) to make them `(B, 1, L, L)`.

In deep learning frameworks like PyTorch, the dimensions of the tensors involved
in operations like matrix multiplication or attention mechanisms often have
specific semantic meanings. In the context of attention mechanisms, especially
in the transformer architecture, the attention mask usually has a shape that is
compatible with the attention logits for element-wise operations such as `masked_fill`.

In the transformer model, the attention logits are often computed as a dot
product between query and key vectors, resulting in a tensor of shape
`(Batch size, Num heads, Sequence length, Sequence length)` or `(B, H, L, L)`.
Here, `B` is the batch size, `H` is the number of attention heads, and `L` is
the sequence length.

To make the mask tensor compatible for element-wise operations with this 4D
tensor, it needs to have a shape that can be broadcasted to `(B, H, L, L)`. A
mask of shape `(B, 1, L, L)` fulfills this requirement.

The singleton dimension is added so that the mask can be easily broadcast to the
shape of the attention logits tensor during the computation. When a tensor with
shape `(B, 1, L, L)` is combined element-wise with a tensor of shape
`(B, H, L, L)`, the singleton dimension (the `1`) allows the mask to be used for
each attention head without explicitly replicating the mask `H` times. This is
more memory-efficient and often faster.

Thus, adding a singleton dimension in masks is a preparatory step that allows
for efficient element-wise operations later in the model's forward pass.
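
As a rough illustration of that broadcast, the sketch below applies a `(B, 1, L, L)` mask to random `(B, H, L, L)` attention logits. The sizes, the random `scores` tensor and the use of `masked_fill` are assumptions for the demo, not the notebook's actual forward pass.

```python
import torch

B, H, L = 2, 4, 5
torch.manual_seed(0)
scores = torch.randn(B, H, L, L)                  # attention logits, one [L, L] map per head

mask = torch.tril(torch.ones(L, L)).bool()        # [L, L], True = attend
mask = mask.view(1, 1, L, L).expand(B, 1, L, L)   # [B, 1, L, L], singleton head dimension

# The singleton dimension is broadcast over all H heads; the mask is never copied H times.
masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = masked_scores.softmax(dim=-1)

print(masked_scores.shape)  # torch.Size([2, 4, 5, 5])
```

Every head ends up with the same masking pattern, which is the "same mask applied to all heads" behaviour described earlier.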


In [None]:
target_padding_mask = target_padding_mask.unsqueeze(1)
pprint(target_padding_mask.shape)

future_mask = future_mask.unsqueeze(1)
pprint(future_mask.shape)

target_mask = target_padding_mask & future_mask
pprint(target_mask.shape)

### <a id='toc1_9_5_'></a>[MultiHeadAttention](#toc0_)

We start off by understanding the rationale of the following block:

```python
Q = self.W_Q(query).contiguous() # Z @ W_Q -> BxLxD @ DxD = BxLxD
K = self.W_K(key).contiguous()   # Z @ W_K
V = self.W_V(value).contiguous() # Z @ W_V
```

#### <a id='toc1_9_5_1_'></a>[A Primer](#toc0_)

In the context of the Transformer architecture and self-attention mechanism, the
matrices $\mathbf{W}^{Q}, \mathbf{W}^{K},$ and $\mathbf{W}^{V}$ are learnable
parameters designed to project the input embeddings $\mathbf{Z}$ into distinct
subspaces tailored for attention calculations. Let's explore their purpose and
their resulting transformations:

1. **The Role of Weights**:

   - $\mathbf{W}^{Q}$: Projects input embeddings into a query subspace,
     determining the type of information each token seeks from others.
   - $\mathbf{W}^{K}$: Positions the embeddings in a key subspace, highlighting
     the token features that others would search for.
   - $\mathbf{W}^{V}$: Transforms embeddings into a value subspace, showcasing
     the actual token content to be aggregated by the attention scores.

2. **Intuitive & Mathematical Interpretations**:

   - **Query Transformation** ($\mathbf{Z} \mathbf{W}^{Q}$): Intuitively, it
     tailors the raw embeddings to optimally question the rest of the sequence.
     Mathematically, it's a linear transformation of the embedding space into
     the query space, akin to a high-dimensional rotation and scaling,
     emphasizing aspects relevant to querying.

   - **Key Transformation** ($\mathbf{Z} \mathbf{W}^{K}$): Intuitively, it
     accentuates token features that other tokens might seek. Mathematically,
     it's another linear transformation emphasizing aspects that make tokens
     searchable.

   - **Value Transformation** ($\mathbf{Z} \mathbf{W}^{V}$): Intuitively, it
     prepares tokens to share their intrinsic content when beckoned by the
     attention mechanism. Mathematically, it's a linear transformation
     accentuating token content aspects.

3. **Creating Q, K, V**:

   - $\mathbf{Q} = \mathbf{Z} \mathbf{W}^{Q}$
   - $\mathbf{K} = \mathbf{Z} \mathbf{W}^{K}$
   - $\mathbf{V} = \mathbf{Z} \mathbf{W}^{V}$

   These operations recast the embedded tokens into roles for the attention
   mechanism:

   - $\mathbf{Q}$: Information seekers. The queries are seeking information, and
     the computation $Q @ K^T$ finds how much each key (each information holder)
     should be attended to.
   - $\mathbf{K}$: Information gatekeepers. The keys hold the information being
     sought, and their arrangement in space defines the subspace that the
     queries are projected onto to find these relevance scores.
   - $\mathbf{V}$: Information providers. The values contain the content that
     needs to be retrieved, and once we have the attention weights, we know how
     much of each value to retrieve and combine to form the output.

   Mathematically, the resulting matrices ($\mathbf{Q}, \mathbf{K}, \mathbf{V}$)
   have rows that represent different aspects (querying, key, value) of the
   original tokens.

4. **Relevance to Self-Attention**:

   The transformations set the stage for attention score calculations. In this
   step, each query vector in $\mathbf{Q}$ computes its similarity (via dot
   product) against all key vectors in $\mathbf{K}$. This score matrix reveals
   the attention weightage for each token regarding every other token in the
   sequence.

   Specifically, $\mathbf{Q} @ \mathbf{K}^T$ calculates how each token (query)
   aligns with every other token (key). It's akin to measuring the relevance of
   each word to every other word in the sequence.

   After normalizing these scores (typically with softmax), we get the attention
   weights. These weights guide how the value vectors in $\mathbf{V}$ are
   aggregated. The outcome is a new matrix where each row aggregates
   contextually relevant information from the entire sequence. This enriched
   output feeds into subsequent transformer layers for further processing.

Overall, by using the $\mathbf{W}^{Q}, \mathbf{W}^{K},$ and $\mathbf{W}^{V}$
matrices, the transformer fine-tunes its focus on inter-token relationships,
enabling the model to capture intricate contextual nuances within a given
sequence.
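
The primer can be condensed into a few lines of PyTorch. This is a minimal single-head sketch without any mask; the sizes, the random stand-in for $\mathbf{Z}$ and the bias-free `nn.Linear` projections are assumptions, and the division by $\sqrt{D}$ is the usual scaled dot-product scaling rather than something stated above.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)
B, L, D = 2, 6, 8                 # batch size, sequence length, embedding dimension
Z = torch.randn(B, L, D)          # stand-in for the input embeddings

# Learnable projections playing the role of W^Q, W^K and W^V.
W_Q = nn.Linear(D, D, bias=False)
W_K = nn.Linear(D, D, bias=False)
W_V = nn.Linear(D, D, bias=False)

Q, K, V = W_Q(Z), W_K(Z), W_V(Z)                   # each [B, L, D]
scores = Q @ K.transpose(-2, -1) / math.sqrt(D)    # [B, L, L], query-key similarities
weights = scores.softmax(dim=-1)                   # each row is a distribution over keys
context = weights @ V                              # [B, L, D], weighted sum of values

print(weights.sum(dim=-1))  # every entry is (numerically) 1
```

Each row of `context` is the new, context-aware representation of one token.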

#### <a id='toc1_9_5_2_'></a>[An Example](#toc0_)

Let's use the sentence "The cat walks by the bank" to walk through the
self-attention mechanism with analogies and to clarify how it works step by
step.

**Setting the Scene (Embedding the Sentence):** Imagine each word in the
sentence is a person at a party (our tokens). They start by telling a basic fact
about themselves (their initial embedding).

**The Roles:**

- **Q (Seekers)**: Each person (word) is curious about the stories (contexts) of
  others at the party. They have their own perspective or question (Q vector).
- **K (Holders)**: At the same time, each person has a name tag with keywords
  that describe their story (K vector).
- **V (Retrievers)**: They also hold a bag of their experiences (V vector),
  ready to share.

**Transformations (Applying W Matrices):** We give each person a set of glasses
(the matrices $W_Q, W_K, W_V$) that changes how they see the world (the space
they project to).

- With $W_Q$ glasses, they focus on what they want to know from others.
- With $W_K$ glasses, they highlight their name tag details, making some
  features stand out more.
- With $W_V$ glasses, they prepare to share the contents of their bag
  effectively.

**Attention (Calculating Q @ K.T):** Now, each person looks around the room
(sequence) with their $W_Q$ glasses and sees the highlighted name tags (after
$W_K$ transformation) of everyone else. They measure how similar their question
is to the others' name tags—this is the dot product $Q @ K^T$.

For "cat," let’s say it’s curious about the notion of "walking" and "bank." It
will measure the similarity (attention scores) between its curiosity and the
name tags of "walks," "by," "the," "bank."

**Normalization (Softmax):** After measuring, "cat" decides how much to focus on
each story—this is softmax. Some stories are very relevant ("walks"), some
moderately ("by," "the"), and some might be highly relevant depending on context
("bank" — is it a river bank or a financial institution?).

**Retrieval (Applying Attention to V):** Now "cat" decides to listen to the
stories in proportion to its focus. It takes pieces (weighted by attention
scores) from each person's experience bag (V vectors) and combines them into a
richer, contextual understanding of itself in the sentence. This combination
gives us the new representation of "cat," informed by the entire context of the
sentence.

In essence:

- **Q (Query):** What does "cat" want to know?
- **K (Key):** Who has relevant information to "cat"’s curiosity?
- **V (Value):** What stories does "cat" gather from others, and how much does
  it take from each to understand its role in the sentence?

The output of self-attention for "cat" now encapsulates not just "cat" but its
relationship and relevance to "walks," "by," "the," "bank" in a way that no
single word could convey alone. This output then becomes the input to the next
layer, where the process can repeat, enabling the model to develop an even more
nuanced understanding.

<img src="transformer.png" width="600">


### <a id='toc1_9_6_'></a>[AddNorm (Residual Connection + Layer Normalization)](#toc0_)

- https://www.d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html#residual-connection-and-layer-normalization
- https://nlp.seas.harvard.edu/annotated-transformer

#### <a id='toc1_9_6_1_'></a>[Residual Block](#toc0_)

A residual block takes an input $X$ and a sub-layer (or function) $f$, and computes $X + f(X)$.

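```python
from typing import Callable

import torch
from torch import nn


class ResidualBlock(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(
        self,
        x: torch.Tensor,
        sublayer: Callable[[torch.Tensor], torch.Tensor],
    ) -> torch.Tensor:
        # Add the input back onto the sub-layer's output: X + f(X).
        return x + sublayer(x)
```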

The intuition behind a residual block is to facilitate the training of deeper networks by providing a "shortcut" or "skip connection" that allows the gradient to be directly backpropagated to earlier layers. Essentially, in a standard deep learning model, each layer transforms its input. As the network depth increases, these transformations can degrade the network's performance, mainly due to the vanishing or exploding gradient problems. This makes it challenging to train very deep networks.

The residual block aims to address this problem. It adds the original input back to the output of the network layer: if $x$ is the input and $F(x)$ is the transformed version, the residual block computes $F(x) + x$ instead of just $F(x)$.

This architecture has a few advantages:

1. **Easier Learning**: During training, if the best transformation is an identity map (i.e., the output should be the same as the input), the residual block can easily learn this. The layers in $F(x)$ only need to learn to approximate zero in this case, which is generally easier than learning an identity map in a traditional stack of layers.

2. **Mitigating Vanishing/Exploding Gradients**: The skip connections provide an unobstructed path for the gradients to flow, which can help mitigate the vanishing or exploding gradient problems in very deep networks.

3. **Enabling Deeper Networks**: Because of the above advantages, residual blocks make it possible to train very deep networks effectively. Deep networks can represent very complex functions, which can be advantageous for many tasks.

4. **Parameter Efficiency**: Residual blocks often require fewer parameters to achieve similar performance compared to traditional deep networks, making them more parameter-efficient.

In summary, the residual block is a simple yet effective idea that has enabled the training of much deeper networks, thereby pushing the boundaries of what is achievable in various machine learning tasks.
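
The "easier learning" and "gradient flow" points can be seen in a tiny experiment. In this sketch the sub-layer is an `nn.Linear` whose weights are zeroed by hand, which is an artificial setup chosen only to make $F(x) = 0$.

```python
import torch
from torch import nn

torch.manual_seed(0)
sublayer = nn.Linear(4, 4)
# Zero the sub-layer so that F(x) = 0 for every input.
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 3, 4, requires_grad=True)
out = x + sublayer(x)        # residual connection: F(x) + x

out.sum().backward()
print(torch.equal(out, x))   # True: with F(x) = 0 the block is an identity map
print(x.grad.unique())       # tensor([1.]): the skip path passes the gradient through unchanged
```

Without the skip connection, the same zeroed layer would output zeros and send a zero gradient back to `x`.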

#### <a id='toc1_9_6_2_'></a>[Layer Normalization](#toc0_)

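Layer normalization normalizes each token's features across the feature dimension. Given the input $X$ with shape $[B, L, D]$ (where $B$ is the batch size, $L$ is the sequence length, and $D$ is the feature dimension), the mean and variance are computed over the last dimension $D$, separately for every position in every sequence, and layer normalization computes: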

$$
\text{Norm}(X) = \frac{X - \text{mean}(X)}{\sqrt{\text{var}(X) + \epsilon}} \times \gamma + \beta
$$

Where $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant for numerical stability.

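```python
import torch
from torch import nn


class LayerNorm(nn.Module):
    def __init__(self, feature_dim: int, eps: float = 1e-6) -> None:
        super().__init__()
        # fmt: off
        self.gamma = nn.Parameter(torch.ones(feature_dim))   # learnable scale
        self.beta  = nn.Parameter(torch.zeros(feature_dim))  # learnable shift
        self.eps   = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Statistics are taken over the last (feature) dimension.
        mean = x.mean(dim=-1, keepdim=True)
        std  = x.std(dim=-1, keepdim=True)
        # fmt: on
        # Note: this divides by (std + eps), using torch's default unbiased std,
        # which differs slightly from sqrt(var + eps) in the formula above.
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
```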
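
As a sanity check on the formula, the sketch below evaluates it by hand with $\gamma = 1$ and $\beta = 0$ and compares the result with PyTorch's built-in `torch.nn.functional.layer_norm`. The tensor sizes and `eps` value are arbitrary choices; the biased variance (`unbiased=False`) is what the built-in uses.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, L, D = 2, 3, 8
x = torch.randn(B, L, D)
eps = 1e-5

# Statistics are taken over the feature dimension, separately for each token.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + eps)       # gamma = 1, beta = 0

reference = F.layer_norm(x, normalized_shape=(D,), eps=eps)
print(torch.allclose(manual, reference, atol=1e-5))  # True
```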

#### <a id='toc1_9_6_3_'></a>[Combining Both](#toc0_)

Finally, we can combine these into a single `AddNorm` block.

```python
class AddNorm(nn.Module):
    def __init__(self, feature_dim, dropout_rate):
        super(AddNorm, self).__init__()
        self.dropout = nn.Dropout(dropout_rate)
        self.layer_norm = LayerNorm(feature_dim)

    def forward(self, x, sublayer_output):
        return self.layer_norm(x + self.dropout(sublayer_output))
```

This `AddNorm` class applies dropout to the output of the sub-layer, adds it to the original input, and then applies layer normalization. Note that this version doesn't include an embedded layer normalization operation in the residual block; instead, it utilizes a separate layer normalization class, which is then used in the `AddNorm` class.
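
A possible way to use such a block is sketched below. To keep the snippet self-contained it swaps in PyTorch's built-in `nn.LayerNorm` for the custom `LayerNorm`, and the small feed-forward network used as the sub-layer is a made-up example.

```python
import torch
from torch import nn


class AddNorm(nn.Module):
    def __init__(self, feature_dim: int, dropout_rate: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout_rate)
        # Assumption: nn.LayerNorm stands in for the custom LayerNorm class.
        self.layer_norm = nn.LayerNorm(feature_dim)

    def forward(self, x: torch.Tensor, sublayer_output: torch.Tensor) -> torch.Tensor:
        return self.layer_norm(x + self.dropout(sublayer_output))


torch.manual_seed(0)
B, L, D = 2, 5, 16
x = torch.randn(B, L, D)

# Hypothetical sub-layer: a position-wise feed-forward network.
feed_forward = nn.Sequential(nn.Linear(D, 4 * D), nn.ReLU(), nn.Linear(4 * D, D))
add_norm = AddNorm(feature_dim=D, dropout_rate=0.1).eval()  # eval() disables dropout

out = add_norm(x, feed_forward(x))
print(out.shape)  # torch.Size([2, 5, 16])
```

The output keeps the input shape `[B, L, D]`, which is what allows these blocks to be stacked.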

### How Loss is Computed?

In [59]:
# Assuming we have B = batch size, L = sequence length, V = vocab size
B, L, V = 2, 3, 4  # Example dimensions

# Instantiate the CrossEntropyLoss
# By default, it reduces by averaging the losses over each observation in the input
criterion = nn.CrossEntropyLoss(reduction="mean")


In [63]:
# Randomly generated logits and target for illustration
# Logits are typically obtained from the last linear layer of your model
logits = torch.randn(B, L, V, generator=torch.Generator().manual_seed(SEED))
targets = torch.randint(low=0, high=V, size=(B, L), generator=torch.Generator().manual_seed(SEED))  # Randomly generated target indices

pprint(logits)
pprint(targets)
pprint(logits[0]) # logits for the first sequence [L=3, V=4]
pprint(targets[0]) # target for the first sequence [L=3]

In [64]:
# Permute logits to shape [B, V, L]: CrossEntropyLoss expects the class dimension at index 1
logits_permuted = logits.permute(0, 2, 1)

In [66]:
loss = criterion(logits_permuted, targets)
pprint(loss)

In [58]:
first_sequence_logits = logits[0]
first_sequence_targets = targets[0]

first_sequence_first_token_logits = first_sequence_logits[0]
first_sequence_first_token_target = first_sequence_targets[0]

pprint(first_sequence_first_token_logits)
pprint(first_sequence_first_token_target)


In [72]:
total_loss = 0

for i in range(B):
    for j in range(L):
        pprint(logits[i, j].unsqueeze(0))
        pprint(targets[i, j].unsqueeze(0))
        total_loss += criterion(logits[i, j].unsqueeze(0), targets[i, j].unsqueeze(0))

In [69]:
total_loss /= (B * L)

In [70]:
total_loss

tensor(1.7008)
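
The same equivalence can be checked without the explicit loop. The sketch below uses its own seed and freshly sampled tensors, so its numbers will not match the output above; it only shows that the permuted, flattened and hand-computed versions agree.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, L, V = 2, 3, 4
logits = torch.randn(B, L, V)
targets = torch.randint(low=0, high=V, size=(B, L))
criterion = nn.CrossEntropyLoss(reduction="mean")

# Option 1: move the class dimension to index 1 -> [B, V, L] against [B, L].
loss_permuted = criterion(logits.permute(0, 2, 1), targets)

# Option 2: flatten batch and sequence -> [B*L, V] against [B*L].
loss_flat = criterion(logits.reshape(B * L, V), targets.reshape(B * L))

# Option 3: by hand, the mean negative log-probability of the correct token.
log_probs = logits.log_softmax(dim=-1)
loss_manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

print(loss_permuted, loss_flat, loss_manual)  # three identical values
```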

## <a id='toc1_10_'></a>[Potential to use Module Dict?](#toc0_)

In [None]:
class ModelModuleDict(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleDict({
            'fc1': nn.Linear(2, 5),
            'relu': nn.ReLU(),
            'fc2': nn.Linear(5, 1)
        })

    def forward(self, x):
        for layer in self.layers.values():
            x = layer(x)
        return x

# Initialize a random tensor as input
input_tensor = torch.randn(1, 2)

In [None]:
seed_all(1, seed_torch=True)
model_sequential = nn.Sequential(
    nn.Linear(2, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)
# Forward pass using nn.Sequential model
model_sequential(input_tensor)


In [None]:
seed_all(1, seed_torch=True)
model_moduledict = ModelModuleDict()
model_moduledict(input_tensor)
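
A compact way to compare the two containers is sketched below. It uses `torch.manual_seed` in place of the notebook's `seed_all` helper, which is an assumption made to keep the snippet self-contained.

```python
import torch
from torch import nn

torch.manual_seed(1)
sequential = nn.Sequential(nn.Linear(2, 5), nn.ReLU(), nn.Linear(5, 1))

torch.manual_seed(1)
module_dict = nn.ModuleDict(
    {"fc1": nn.Linear(2, 5), "relu": nn.ReLU(), "fc2": nn.Linear(5, 1)}
)

x = torch.randn(1, 2)

out_sequential = sequential(x)
out_module_dict = x
for layer in module_dict.values():   # insertion order is preserved
    out_module_dict = layer(out_module_dict)

print(torch.allclose(out_sequential, out_module_dict))  # True: same seed, same init order
print(list(module_dict.keys()))                         # ['fc1', 'relu', 'fc2']
```

The practical difference is access: `module_dict["fc1"]` looks a layer up by name, whereas `nn.Sequential` is indexed by position and supplies the `forward` pass for you.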

## <a id='toc1_11_'></a>[Training with GPT-like Model](#toc0_)

If you're working with a GPT-like model, which is a decoder-only architecture, the training mechanics differ slightly compared to the encoder-decoder models like seq2seq. In a GPT-style model, the entire sequence (input and output) is provided to the model at once, and each token is predicted based on the tokens that came before it. The model is still autoregressive, but there's no separate encoder to produce an intermediate representation; the "encoding" is effectively built into the ongoing autoregressive decoding process.

In your case, if the equations are like `90+38=128`, you can think of `90+38=` as the input and the remaining part `128` as the expected output, potentially along with special tokens to demarcate sequence boundaries or to flag the equation/result parts. Just like the decoder of an encoder-decoder model, which is fed the correct output shifted by one position during training (also known as "teacher forcing"), a GPT-like model is fed the ground-truth sequence during training, and every token is predicted based solely on the tokens preceding it.

In such a setup, you can definitely feed the entire equation to the model and try to predict each subsequent token based on the preceding tokens. For example, given `90+38=`, the model should predict `1`, `2`, `8` in succession.

### <a id='toc1_11_1_'></a>[Loss Computation](#toc0_)

For training a GPT-like model, you'd usually use a standard loss function like cross-entropy loss for each token's prediction. You'd compare the token predicted by the model to the actual token in the target sequence to compute the loss. This is calculated for each token and then averaged over the sequence or batch, depending on your implementation.

### <a id='toc1_11_2_'></a>[Example](#toc0_)

In a GPT-like model, each token in the sequence is used to predict the next token. The model takes a sequence of tokens and produces a new sequence of the same length where each new token is predicted based on all the preceding tokens in the input sequence. The loss is then computed between the predicted sequence and the target sequence.

Let's take a closer look at an example:

- The original tensor: `[15, 9, 0, 10, 3, 8, 13, 1, 2, 8, 14]`, which corresponds to `<SOS>90+38=128<EOS>`
- Input tensor: `[15, 9, 0, 10, 3, 8, 13, 1, 2, 8]`, which corresponds to `<SOS>90+38=128` without `<EOS>`
- Target tensor: `[9, 0, 10, 3, 8, 13, 1, 2, 8, 14]`, which is the original tensor shifted left by one
- Masked target tensor: `[16, 16, 16, 16, 16, 16, 1, 2, 8, 14]`, where the question part is replaced by the masking token `16`

During training:

1. **First Timestep**: The model takes `[15]` (the `<SOS>` token) and tries to predict the next token. Ideally, it should predict `9`. But here, the target sequence starts with masked tokens (`16`, the masking token), so the loss would be computed between the predicted token and the masked token `16`. Since `CrossEntropyLoss` has an `ignore_index` argument (now you know what it is for, right!), you can set it to `16` (the default is `-100`, so otherwise you would need to change the masking number), and whenever the ground truth is `16`, that position is excluded from the loss. This allows the model to focus on learning from the relevant parts of the sequence while ignoring the masked portions.

2. **Second Timestep**: The model takes `[15, 9]` and predicts the next token, which should be `0`. Again, the target is a masked token `16`.

3. **...**

4. **Seventh Timestep**: The model takes `[15, 9,  0,  10, 3,  8,  13]` (which is `<SOS>90+38=`) and predicts the next token. Now the target is `1`, so the loss is computed between the predicted token and `1`. There is no mask anymore here, so the loss will be computed.
5. **Eighth Timestep**: The model takes `[15, 9,  0,  10, 3,  8,  13, 1]` (which is `<SOS>90+38=1`) and predicts the next token. Now the target is `2`, so the loss is computed between the predicted token and `2`.
   1. Here's an important thing for beginners (me): in a typical GPT-like architecture used for sequence-to-sequence tasks like this one, the model doesn't use its own predictions as input during training. Instead, it uses the original, ground-truth input sequence. This is known as "teacher forcing." In teacher forcing, even if the model predicts a wrong token at some timestep, it doesn't affect the input sequence for subsequent timesteps. The model continues to get the original input sequence throughout training.
   2. So if the model predicts a `3` during the seventh timestep, where the ground truth is `1`, the model would simply incur a higher loss for that prediction. However, the input for the eighth timestep would still be the ground truth sequence up to that point, regardless of what the model predicted at the seventh timestep.
   3. But it is noted that this behaviour is still autoregressive.
6. **Ninth Timestep**: The model takes `[15, 9,  0,  10, 3,  8,  13, 1, 2]` and predicts the next token, which is `8`.
7. **Tenth (Last) Timestep**: The model takes `[15, 9,  0,  10, 3,  8,  13, 1, 2, 8]` and predicts the next token, which is `14`, the `EOS`.
   1. The reason you need to predict `EOS` is intuitively simple: without an `EOS` token, the model would not know when to stop.

This goes on until the entire sequence is processed. Note that the model never actually "sees" the target tokens during the prediction. It is solely relying on the tokens that came before the current token in the input sequence. After the model makes its prediction, then the predicted tokens are compared to the target tokens to compute the loss, which is then backpropagated to update the model weights.
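
The bookkeeping in this example can be sketched in a few lines. The token ids are taken from the example above; the vocabulary size of 17 and the random `logits` standing in for the model output are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

EQUAL, IGNORE = 13, 16   # ids from the example: "=" and the masking token
VOCAB_SIZE = 17          # assumed: ids 0..16

sequence = torch.tensor([15, 9, 0, 10, 3, 8, 13, 1, 2, 8, 14])  # <SOS>90+38=128<EOS>
inputs = sequence[:-1]           # drop <EOS>
targets = sequence[1:].clone()   # shift left by one

# Mask every target that belongs to the question part (up to and including "=").
equal_pos = (inputs == EQUAL).nonzero().item()
targets[:equal_pos] = IGNORE
print(targets)  # tensor([16, 16, 16, 16, 16, 16,  1,  2,  8, 14])

torch.manual_seed(0)
logits = torch.randn(inputs.size(0), VOCAB_SIZE)   # stand-in for the model output, [L, V]

# Positions whose target equals ignore_index contribute nothing to the loss.
loss = F.cross_entropy(logits, targets, ignore_index=IGNORE)

# Equivalent: average only over the four unmasked positions.
loss_answer_only = F.cross_entropy(logits[equal_pos:], targets[equal_pos:])
print(torch.allclose(loss, loss_answer_only))  # True
```

With `reduction="mean"` (the default), the average is taken over the non-ignored positions only, so the masked question tokens neither add loss nor dilute it.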

### <a id='toc1_11_3_'></a>[Confusion: Training versus Inference](#toc0_)

The statement "it generates one token at a time and uses its own previously generated tokens as context for generating subsequent tokens" is generally true for GPT-like models during the inference stage, not during training. During inference (or generation), the model does indeed use its own previously generated tokens to produce the next token, since there is no ground truth sequence to rely on. In that case, if the model makes an incorrect prediction at a certain timestep, that incorrect token is used as part of the context for the following timestep.

During training, however, the model uses the ground truth tokens for the preceding sequence as context for predicting each next token, as in the example above. This is teacher forcing: the ground truth, rather than the model's own predictions, is used to guide training.

So there's no contradiction, but the behavior is context-dependent:

- During training, the ground truth sequence is used for context.
- During inference, the model's own previously generated tokens are used for context.

Both approaches are consistent with the autoregressive nature of the model: in both cases, the token at each position is generated based on the tokens at all previous positions. The difference lies in whether those preceding tokens come from the ground truth (during training) or from the model's own previous outputs (during inference).

### Training vs Inference

In an autoregressive model like a Transformer decoder, the concept of "learning
the representation of the sequence as it goes" does not refer to the model
processing one token at a time during actual forward passes. Instead, it refers
to the model's ability to generate or predict one token at a time during
inference, while training on a full sequence in a batched manner.

During training:

- All tokens are processed in parallel for efficiency. This is possible because
  the entire sequence is known beforehand (it's the training data).
- The "autoregressive" property is enforced by using masks in the self-attention
  mechanism. This masking ensures that the prediction for each token can only
  depend on the tokens that precede it, not on future tokens, which the model
  has no access to during inference. This is how the model learns the
  conditional probability distribution of each token given the previous tokens,
  despite the parallel processing of tokens.

During inference:

- The model starts with an initial token (such as a start-of-sequence token) and
  generates the next token based on this single input.
- Then, the model uses both the initial token and the newly generated token to
  predict the third token, and so on.
- This process is sequential and each new token is predicted based on the
  previously generated tokens, creating a sequence one token at a time.

So, when we say that the model learns the representation of the sequence as it
goes, we mean that the model is trained to handle sequences in such a way that
it can generate them one piece at a time, respecting the causal order inherent
to the task (e.g., language modeling). The parallel processing during training
does not contradict the autoregressive nature of the model; it is simply a
computational efficiency that is enabled by knowing the full sequence in
advance.
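
The contrast between the two modes can be sketched with a toy model. `ToyLM` below is a hypothetical stand-in (an embedding followed by a linear head, with no attention at all), used only to show the call pattern: one parallel pass over the full sequence for training versus a token-by-token loop for inference.

```python
import torch
from torch import nn


class ToyLM(nn.Module):
    """Hypothetical stand-in for the decoder: token ids [B, L] -> logits [B, L, V]."""

    def __init__(self, vocab_size: int, d_model: int = 8) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(tokens))


torch.manual_seed(0)
VOCAB_SIZE, EOS = 17, 14
model = ToyLM(VOCAB_SIZE).eval()

# Training-style pass: the whole ground-truth sequence goes in at once.
inputs = torch.tensor([[15, 9, 0, 10, 3, 8, 13, 1, 2, 8]])
train_logits = model(inputs)                      # [1, 10, V], one prediction per position


@torch.no_grad()
def greedy_decode(model: nn.Module, prompt: torch.Tensor, max_new_tokens: int = 5) -> torch.Tensor:
    """Inference-style loop: feed the model's own outputs back in, one token at a time."""
    tokens = prompt.clone()
    for _ in range(max_new_tokens):
        next_token = model(tokens)[:, -1].argmax(dim=-1, keepdim=True)  # [B, 1]
        tokens = torch.cat([tokens, next_token], dim=1)
        if (next_token == EOS).all():
            break
    return tokens


prompt = torch.tensor([[15, 9, 0, 10, 3, 8, 13]])  # <SOS>90+38=
generated = greedy_decode(model, prompt)
print(train_logits.shape, generated.shape)
```

The untrained toy model generates arbitrary tokens; the point is only that training needs a single forward call, while generation needs one call per new token.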


## <a id='toc1_12_'></a>[Questions](#toc0_)

### <a id='toc1_12_1_'></a>[Why Masked == 0 in some?](#toc0_)

The use of `mask == 0` in the `masked_fill` operation is a result of how the mask is constructed. Essentially, different implementations may represent masks differently:

1. **Boolean Masking with True/False**: In some implementations, the mask might be a Boolean tensor where `True` denotes the positions to mask (set to negative infinity) and `False` for the positions to keep. In such cases, you can directly use the mask in `masked_fill`:

    ```python
    attention_scores = attention_scores.masked_fill(mask, float("-inf"))
    ```

    Here, if `mask[i][j]` is `True`, `attention_scores[i][j]` would be set to `-inf`.

2. **Integer Masking with 1/0**: In other implementations, the mask might be an integer tensor where `1` denotes the positions to keep and `0` denotes the positions to mask. In such cases, you'll often find the mask is inverted (`mask == 0`) before using `masked_fill`:

    ```python
    attention_scores = attention_scores.masked_fill(mask == 0, float("-inf"))
    ```

    Here, if `mask[i][j]` is `0`, `attention_scores[i][j]` would be set to `-inf`.

The core functionality, masking certain positions in the attention scores, is the same in both cases. The difference lies in how the mask tensor is constructed and interpreted. So, if you find an implementation using `mask == 0`, it's likely using a mask where `0` (or `False`) signifies positions to mask, whereas if it's directly using `mask`, it's probably a Boolean mask where `True` signifies positions to mask. The masks in this notebook use `True` for "attend to this token", so they follow the keep-convention and need the `mask == 0` inversion (equivalently `~mask`) before `masked_fill`.
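
A small check that the conventions really are interchangeable; the 3x3 sizes and the variable names are arbitrary choices for this sketch.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(3, 3)

keep_int = torch.tril(torch.ones(3, 3))   # integer-style mask: 1 = keep, 0 = mask
keep_bool = keep_int.bool()               # Boolean mask: True = keep
drop_bool = ~keep_bool                    # Boolean mask: True = mask

masked_a = scores.masked_fill(keep_int == 0, float("-inf"))   # 1/0 convention
masked_b = scores.masked_fill(keep_bool == 0, float("-inf"))  # True = keep, inverted via == 0
masked_c = scores.masked_fill(drop_bool, float("-inf"))       # True = mask, used directly

print(torch.equal(masked_a, masked_b), torch.equal(masked_a, masked_c))  # True True
```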

### <a id='toc1_12_2_'></a>[what is the reason of setting the attention scores's mask indexes to negative infinity](#toc0_)


In the attention mechanism, particularly in the Scaled Dot-Product Attention, attention scores are computed for each query-key pair and then passed through a softmax function to obtain attention weights. These weights are used to take a weighted sum of the value vectors, resulting in the final output or the context vectors. The purpose of the mask is to prevent certain tokens (like padding tokens) from being attended to.

The reason for setting masked attention scores to negative infinity (`-inf`) lies in the properties of the softmax function:

1. **Softmax Behavior**: The softmax function transforms its input (the attention scores in this case) into a probability distribution. Mathematically, the softmax function for a given vector $x$ is defined as:

$$
\text{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}
$$

2. **Impact of Negative Infinity**: When a score $x_i$ is negative infinity, its numerator $e^{-\infty}$ is exactly zero, so the masked position gets zero weight. This holds as long as at least one score in the row is finite; if every score is `-inf`, the denominator is also zero and the result is undefined (`nan`).

$$
\text{Softmax}(x)_i = \frac{e^{-\infty}}{\sum_{j=1}^{N} e^{x_j}} = 0 \quad \text{when } x_i = -\infty
$$

3. **Avoiding Unwanted Attention**: The point of setting these specific positions to `-inf` is to ensure that when softmax is applied, these positions get zero attention weights. This is a way of making sure that the model does not attend to the positions we've masked (like padding tokens or future tokens in the sequence, depending on the mask).

In summary, setting the masked attention scores to `-inf` and then passing them through a softmax effectively nullifies the contribution of the masked positions in the resulting attention-weighted sum of the value vectors. This is a commonly used trick to impose a certain structure (like masking out future information in the decoder) or to handle variable-length sequences with padding.
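The effect is easy to verify by hand. The following is a small pure-Python sketch of the softmax formula above, with an illustrative score vector whose last entry is masked (the helper `softmax` is written here for illustration only):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, float("-inf")]  # last position is masked
weights = softmax(scores)
print(weights[2])  # 0.0
print(sum(weights))
```

The remaining weights are renormalized over the unmasked positions only, which is why the masked value vector contributes nothing to the weighted sum.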

### <a id='toc1_12_3_'></a>[Why do we need both ignore index in Loss and also negative infinity mask](#toc0_)

Using an "ignore index" in PyTorch's `CrossEntropyLoss` removes the effect of certain tokens (like padding tokens) from the loss computation. However, the mask in the attention mechanism and the "ignore index" in the loss function serve different roles, and they operate at different stages of the computational graph.

1. **Ignore Index in Loss Function**: The "ignore index" in the loss function ensures that the model's output at certain positions (typically corresponding to padding tokens) does not contribute to the loss. This happens at the very end of the forward pass, just before backpropagation begins.

2. **Mask in Attention Mechanism**: The mask in the attention mechanism, on the other hand, operates during the forward pass at the time when attention scores are computed. This is a more "internal" operation and ensures that certain positions do not contribute to the output at all, not just during the loss computation but actually in the intermediate representations (i.e., context vectors) that the model computes.

To put it another way, even if you're ignoring certain tokens in your loss calculation, those tokens can still influence the model's output unless they're masked out in the attention mechanism itself.

For example, consider a decoder in a sequence-to-sequence model:
- If you don't use a mask in the attention mechanism, future tokens could influence the output at the current timestep, which is not desirable.
- Even if you use an "ignore index" in your loss function, it doesn't prevent the model from "cheating" by peeking at the future tokens if they are not masked in the attention mechanism.

So in summary, using an "ignore index" in `CrossEntropyLoss` is not a replacement for using attention masks. Both have specific roles in the model, and they are often used together to ensure both that the model attends to the right tokens and that it is trained properly.
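To see why the loss alone cannot stop the leak, here is a minimal sketch of single-head self-attention with and without a future mask. It assumes $Q = K = V = x$ with no learned weights, and the helper `self_attention` is hypothetical, written only for this illustration:

```python
import math
import torch

def self_attention(x, causal):
    """Single-head self-attention with Q = K = V = x (no learned weights)."""
    T, d = x.shape
    scores = x @ x.T / math.sqrt(d)
    if causal:
        # True above the diagonal marks future positions.
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

x = torch.eye(4)          # 4 tokens with 4-dimensional embeddings
x_changed = x.clone()
x_changed[3] += 1.0       # perturb only the last (future) token

# Does the output at position 0 stay the same when a future token changes?
same_without_mask = torch.allclose(
    self_attention(x, False)[0], self_attention(x_changed, False)[0]
)
same_with_mask = torch.allclose(
    self_attention(x, True)[0], self_attention(x_changed, True)[0]
)
print(same_without_mask)  # False: position 0 saw the future token
print(same_with_mask)     # True: the mask blocks the leak
```

The leak happens inside the forward pass, before any loss is computed, so an `ignore_index` applied afterwards cannot undo it.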

### <a id='toc1_12_4_'></a>[Target and Preds/Logits Shape](#toc0_)

The target tensor for the cross-entropy loss function should typically have a shape of `[batch_size, sequence_length]` where each entry in the tensor is an integer representing the index of the true class (i.e., the actual word/token from the vocabulary) for that position in the sequence. Here `batch_size` refers to the number of sequences in each batch, and `sequence_length` is the length of each sequence.

Let's break it down step-by-step:

1. **Last Linear Layer of Decoder**: The last linear layer of the decoder projects each position's hidden state to `vocab_size` logits. For a single time step the output has shape `[bs, vocab_size]`; over the whole sequence it has shape `[bs, sequence_length, vocab_size]`. The logits are unnormalized scores, and a softmax turns them into a probability distribution over the next token.

2. **Target Shape**: In comparison, your target tensor should contain the actual words (as integers) that appear at each position in your sequence for each example in the batch. The target tensor does not need to have a `vocab_size` dimension because it is not a distribution; it contains the indices of the actual next words. Thus, it should have a shape `[bs, sequence_length]`.

3. **Cross-Entropy Loss**: The decoder produces logits of shape `[bs, sequence_length, vocab_size]`, and the target has shape `[bs, sequence_length]`. PyTorch's `CrossEntropyLoss` expects the class dimension at index 1, so the logits must either be permuted to `[bs, vocab_size, sequence_length]` or flattened to `[bs * sequence_length, vocab_size]` (with the target flattened to `[bs * sequence_length]`) before being passed in. The loss function internally applies a log-softmax to the logits and then computes the negative log-likelihood of the target class.

To sum up, the decoder's last linear layer yields `[bs, vocab_size]` logits per time step and `[bs, sequence_length, vocab_size]` over the whole sequence, the target tensor has shape `[bs, sequence_length]`, and the logits must be permuted or flattened so that `vocab_size` sits in dimension 1 when you feed them into the cross-entropy loss function.
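As a shape check, the sketch below feeds random logits to `nn.CrossEntropyLoss` in two ways. The sizes `B`, `T`, `V` are arbitrary illustrative values; note that PyTorch expects the class (vocabulary) dimension at index 1, so `[B, T, V]` logits are permuted or flattened first:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, V = 2, 3, 5  # batch size, sequence length, vocabulary size
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
criterion = nn.CrossEntropyLoss()

# Option 1: move the class dimension to index 1 -> [B, V, T].
loss_permuted = criterion(logits.permute(0, 2, 1), targets)

# Option 2: flatten batch and time -> [B*T, V] and [B*T].
loss_flat = criterion(logits.reshape(-1, V), targets.reshape(-1))

print(torch.allclose(loss_permuted, loss_flat))  # True
```

Both options average the same per-token terms, so they return the same scalar.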

### <a id='toc1_12_5_'></a>[Why do we flatten prediction and target (logits)?](#toc0_)

Flattening both the predicted logits and the target labels serves a specific purpose when using the cross-entropy loss function for sequence data. Let's dig into each component to understand why this is done:

#### <a id='toc1_12_5_1_'></a>[Background](#toc0_)

1. **Logits Tensor**: In a sequence-to-sequence model, you usually generate a sequence of logits for each item in your batch. The logits for each position in the sequence form a vector of size `vocab_size`, which becomes a probability distribution across all possible tokens once a softmax is applied.
  
   Shape: `[batch_size, sequence_length, vocab_size]`

2. **Targets Tensor**: Your ground truth data, the `targets`, are integers representing the correct class labels (or tokens) at each sequence position.

   Shape: `[batch_size, sequence_length]`

#### <a id='toc1_12_5_2_'></a>[Traditional Loss Computation](#toc0_)

Typically, the cross-entropy loss between predicted probabilities and target labels for one data point is computed, and then you average over all data points. In sequence-to-sequence models, you can think of each position in the sequence as a separate data point.

#### <a id='toc1_12_5_3_'></a>[Why Flatten?](#toc0_)
1. **Batch and Sequence Unification**: The idea of flattening both logits and targets is to treat each `(batch, sequence_position)` pair as an independent data point. Instead of having a batch of sequences, you have a "flattened" batch of tokens. This simplifies the application of the loss function by converting the 3D logits tensor and 2D targets tensor into 2D and 1D tensors, respectively.

2. **Efficiency**: Loss computations often benefit from vectorization for computational efficiency. By flattening the tensors, you enable a more efficient matrix operation, which is generally faster than using nested loops over each sequence and batch.

3. **Alignment**: The key is to ensure that each row in the flattened logits corresponds to the same position in the flattened targets. This alignment is crucial for the correct computation of the loss.

#### <a id='toc1_12_5_4_'></a>[Step-by-step Flattening](#toc0_)

1. **Logits Flattening**: `logits.view(-1, logits.size(-1))` will take the 3D tensor `[batch_size, seq_length, vocab_size]` and reshape it into a 2D tensor of shape `[batch_size * seq_length, vocab_size]`.

2. **Targets Flattening**: `targets.view(-1)` will take the 2D tensor `[batch_size, seq_length]` and convert it into a 1D tensor of shape `[batch_size * seq_length]`.

3. **Loss Calculation**: Both flattened tensors are then used in the cross-entropy loss function. The loss between each row in the flattened logits and the corresponding element in the flattened targets is computed.

By flattening the tensors this way, you maintain the correspondence between each logit and its corresponding target, enabling you to correctly compute the loss for each token across all sequences and batches.
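The correspondence can be confirmed against an explicit loop. This is a small sketch with random tensors (the sizes are arbitrary), comparing the flattened call with one cross-entropy term per `(batch, position)` pair:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 3, 5  # batch size, sequence length, vocabulary size
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# Flattened call: [B*T, V] logits against [B*T] targets.
flat_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

# Explicit loop: negative log-probability of the true token at each position.
log_probs = F.log_softmax(logits, dim=-1)
terms = [-log_probs[b, t, targets[b, t]] for b in range(B) for t in range(T)]
loop_loss = torch.stack(terms).mean()

print(torch.allclose(flat_loss, loop_loss))  # True
```

One caveat worth keeping in mind: `view` requires a contiguous tensor, so `reshape` is the safer choice if the logits were permuted or sliced beforehand.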

### <a id='toc1_12_6_'></a>[Why sometimes unsqueeze masks?](#toc0_)

The `unsqueeze` operation is used to add an additional dimension to the tensor. In attention mechanisms, particularly the scaled dot-product attention used in models like the Transformer, the masks usually need to have the same number of dimensions as the attention logits for proper broadcasting.

For instance, let's say your source tensor (`src`) has a shape of $B \times L$ where $B$ is the batch size and $L$ is the sequence length. The attention logit tensor resulting from the query-key dot product would then have shape $B \times N \times L \times L$, where $N$ is the number of attention heads.

The mask needs to align with the $L \times L$ dimensions of this 4D tensor. In order to accomplish that, you add singleton dimensions to make it compatible with the attention logit tensor. By unsqueezing the mask tensor from $B \times L$ to $B \times 1 \times 1 \times L$, you enable broadcasting such that the mask effectively gets expanded to $B \times N \times L \times L$ during the attention calculation, perfectly aligning with the attention logits.

That's why the line:
```python
self.src_mask = (src != pad).unsqueeze(-2)
```
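adds a singleton dimension, converting the shape from $B \times L$ to $B \times 1 \times L$. A second singleton dimension for the heads (for example via `mask.unsqueeze(1)` inside the multi-head attention module) then gives $B \times 1 \times 1 \times L$ for proper broadcasting during the attention computations.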
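The shapes can be traced with a short sketch. The padding id `PAD = 0`, the token ids, and the all-zero stand-in for the attention scores are assumptions made for illustration. Note that `unsqueeze(-2)` alone yields `[B, 1, L]`, and one more `unsqueeze(1)` is needed to reach `[B, 1, 1, L]`:

```python
import torch

B, N, L = 2, 4, 5   # batch size, number of heads, sequence length
PAD = 0             # hypothetical padding token id
src = torch.tensor([[7, 8, 9, PAD, PAD],
                    [3, 4, 5, 6, PAD]])

mask_3d = (src != PAD).unsqueeze(-2)   # [B, 1, L]
mask_4d = mask_3d.unsqueeze(1)         # [B, 1, 1, L]

scores = torch.zeros(B, N, L, L)       # stand-in for attention scores
# The mask broadcasts over heads (N) and query positions (first L).
masked = scores.masked_fill(mask_4d == 0, float("-inf"))

print(mask_3d.shape, mask_4d.shape, masked.shape)
```

Every head and every query row of the first sample ends up with `-inf` in the last two key columns, which are exactly its padding positions.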

### <a id='toc1_12_7_'></a>[Why does the sequence length differ for source and target? I thought it was always the same L.](#toc0_)

In many sequence-to-sequence tasks, the source and target sequences can have different lengths. Here are a few scenarios where this happens:

1. **Machine Translation**: A sentence in one language may require more or fewer words when translated into another language.
  
2. **Text Summarization**: The source text is usually much longer than the summarized target text.

3. **Question Answering**: The source document could be quite lengthy, while the target answer might be a short sentence or even a single word.

4. **Code Generation**: Given a natural language query, the corresponding code snippet could be of varying length that doesn't directly correlate with the length of the query.

5. **Dialogue Systems**: The system's response may not be of the same length as the user's query.

So, in general, source and target sequence lengths (`S` and `T` in the function signature) can differ, and the attention mechanism accommodates that by allowing `key` and `query` tensors with different sequence lengths. In a decoder-only model there is only one sequence, so a single length suffices; Lilian Weng, for instance, uses `L` for the sequence length.
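A quick shape check shows that nothing in the attention computation requires the two lengths to match. This sketch uses random tensors and arbitrary sizes, with queries taken from the target side and keys and values from the source side:

```python
import math
import torch

torch.manual_seed(0)
B, S, T, d = 2, 7, 4, 8  # batch size, source length, target length, model dim
query = torch.randn(B, T, d)   # from the target side
key = torch.randn(B, S, d)     # from the source side
value = torch.randn(B, S, d)

scores = query @ key.transpose(-2, -1) / math.sqrt(d)  # [B, T, S]
context = torch.softmax(scores, dim=-1) @ value         # [B, T, d]

print(scores.shape, context.shape)
```

The score matrix is rectangular (`T` rows, `S` columns), and the output has one context vector per target position.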

### <a id='toc1_12_8_'></a>[Am I right to assume that the core idea of autoregressive model like decoder only (GPT like) is that for a given sample, there will eventually be L rows where L is the seq length, and therefore I can intuitively view it as 1 sample having L samples, since for each row, we will compute the loss. Am I right in my understanding? Do not hesitate to correct me.](#toc0_)


Your description captures an important aspect of autoregressive models like decoder-only Transformers (e.g., GPT). Specifically, you're talking about how the model treats a sequence during training. Let's break down your understanding step by step.

1. **Sequence Length**: When you mention "L rows," where "L" is the sequence length, you're essentially pointing out that each sequence is divided into "L" time steps (or tokens). Each time step becomes an input-output pair for training the model.

2. **One Sequence as Multiple Samples**: You're correct to intuit that a single sequence of length "L" can be treated like "L" samples, at least in the context of loss calculation. This is because, during training, the model computes the loss at each time step by comparing the predicted token with the actual next token in the sequence.

3. **Loss Computation**: The loss is often computed at each position and then averaged over the sequence length or summed up, depending on the specific loss function or training regime.

However, it's crucial to clarify that although a single sequence may contribute "L" terms to the loss function, this is not equivalent to having "L" independent samples. The key difference lies in the autoregressive property: the prediction at each time step is conditioned on the preceding tokens. This introduces a temporal dependency across the "L" positions, making them not entirely independent samples.

In other words, while it's accurate to say that a single sequence contributes multiple terms to the loss function, these terms are correlated because they come from the same sequence and are generated in an autoregressive manner.

To summarize, you're mostly correct in your understanding that a single sequence is broken down into multiple steps for the purpose of loss computation, but it's important to remember that these steps are not independent samples due to the autoregressive nature of the model.
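The "one sequence, `L` loss terms" view can be made concrete with plain Python. The token list below is a made-up example; the inputs and targets are the same sequence shifted by one position:

```python
tokens = ["<bos>", "a", "b", "c", "<eos>"]  # hypothetical sequence

inputs, targets = tokens[:-1], tokens[1:]   # shift by one position

# Position t sees the prefix inputs[: t + 1] and must predict targets[t].
pairs = [(inputs[: t + 1], targets[t]) for t in range(len(inputs))]
for context, target in pairs:
    print(context, "->", target)
```

Each pair contributes one term to the loss, but the contexts are nested prefixes of the same sequence, which is why the terms are not independent samples.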

### <a id='toc1_12_9_'></a>[QKV Again](#toc0_)

#### <a id='toc1_12_9_1_'></a>[Background and Assumptions](#toc0_)

The Transformer architecture, introduced by Vaswani et al. in 2017, has become a cornerstone in NLP and many other machine learning tasks. It is built around the concept of self-attention, which allows the model to weigh different parts of the input when making predictions or transformations. The context vector, as well as Q, K, and V vectors, play a crucial role in this architecture.

#### <a id='toc1_12_9_2_'></a>[Context Vector](#toc0_)

The term "context vector" is commonly used to refer to the weighted sum of value vectors (`V`), after the attention scores have been computed. The purpose of this vector is to encode information from different parts of the input sequence in a way that is most useful for the task at hand. In self-attention, each word (or token) gets a context vector that aggregates information from the tokens it is allowed to attend to (all tokens in an encoder, only the current and preceding tokens in a masked decoder), weighted by their relevance or "attention score".

#### <a id='toc1_12_9_3_'></a>[Query (Q), Key (K), and Value (V)](#toc0_)

1. **Query (Q):** This is a representation of the element for which we are calculating the context. The query is used to find relevant keys, which in turn helps in identifying relevant values. Mathematically, we take the dot product of the Query with each Key to get a raw attention score, before the $\sqrt{d_k}$ scaling.

   $$
   \text{Raw Attention Score} = Q \cdot K^T
   $$

2. **Key (K):** Keys serve as a set of indicators, helping the model identify which values should be attended to when forming the context vector for each query.

3. **Value (V):** Values hold the actual content that will be used to form the context vector. Once the attention scores have been computed using Q and K, these scores are used to weigh the Value vectors before summing them up to get the final context vector.

#### <a id='toc1_12_9_4_'></a>[Mathematical Description](#toc0_)

To obtain the context vector, we first calculate the attention scores for each Query-Key pair, scaling the dot products by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors:

$$
\text{Attention Score} = \frac{Q \cdot K^T}{\sqrt{d_k}}
$$

We then take the softmax of these scores:

$$
\text{Softmax Score} = \text{Softmax}(\text{Attention Score})
$$

Finally, we use these softmax scores to compute a weighted sum of the Value vectors:

$$
\text{Context Vector} = \text{Softmax Score} \cdot V
$$

In summary, Q, K, and V vectors are instrumental in the computation of the context vector. The Query helps to identify relevant Keys, which in turn are used to weigh the Values, culminating in a context vector that holds the contextual representation useful for a given task.
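The three equations map directly onto three lines of PyTorch. This is a minimal single-head sketch with random `Q`, `K`, `V` and arbitrary sizes; it has no mask, no batch dimension and no learned projections, all of which the full model adds on top:

```python
import math
import torch

torch.manual_seed(0)
L, d_k, d_v = 4, 8, 6  # sequence length, query/key dim, value dim
Q = torch.randn(L, d_k)
K = torch.randn(L, d_k)
V = torch.randn(L, d_v)

attention_score = Q @ K.T / math.sqrt(d_k)              # [L, L]
softmax_score = torch.softmax(attention_score, dim=-1)  # each row sums to 1
context_vector = softmax_score @ V                      # [L, d_v]

print(context_vector.shape)  # torch.Size([4, 6])
```

Row `i` of `context_vector` is the weighted sum of all value vectors, with weights given by row `i` of `softmax_score`.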

## <a id='toc1_13_'></a>[TODO](#toc0_)

1. Add Positional Encoding
2. Add LR Scheduler
3. Check why we need to use `torch.nn.utils.clip_grad_norm_` to clip gradients
4. Why unsqueeze mask?
5. Can you init weights inside Encoder instead of outside?
6. Add Epoch and Batch State (see my old code).
7. Important: use a `Vocab` class like the one in https://github.com/jsbaan/transformer-from-scratch/blob/main/vocabulary.py.

https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html



## <a id='toc1_14_'></a>[References and Further Readings](#toc0_)

- https://slds-lmu.github.io/seminar_nlp_ss20/