# Explanation

Building on the Seq2Seq and attention papers covered in the previous section, the Transformer improves on the state of the art in machine translation.

The paper's title, _Attention Is All You Need_, reflects the main idea behind the Transformer architecture: the attention mechanism, which an earlier paper had introduced alongside recurrence, is sufficient on its own and actually performs better on machine translation without recurrence.

The Transformer architecture is built around this idea: it removes recurrence entirely and relies on attention alone.

### Resources

Given the popularity of transformers, there are lots of great resources for understanding both the intuition behind them and how they are implemented in code.

I'd especially recommend watching these videos:
- [But what is a GPT? Visual intro to transformers](https://www.youtube.com/watch?v=wjZofJX0v4M)
- [Attention in transformers, visually explained](https://www.youtube.com/watch?v=eMlx5fFNoYc)
- [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)
- [Let's build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE)

### Intuition

The earlier architecture that introduced attention used the encoder to create a set of _annotations_, one for each word. It then computed a context vector as a weighted sum of these annotations, summarizing the information most relevant to predicting the next word.

In other words, each time a new word needed to be generated, the decoder would look through all the words in the input sequence to determine which ones were relevant to the meaning of the next word.

The Transformer takes this a step further by enabling not just the next word, but all words in the sequence, to attend to each other and enhance each other's meaning.

Every word (technically, every token) in the sequence gets a chance to search for other tokens in the sequence that are relevant to it, and then folds parts of those tokens' meanings into its own.

To understand how this works specifically, we can look at the math.

### Math

A single attention head is computed as follows:

$$
\textrm{Attention}(Q,K,V) = \textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$

Here, $Q$, $K$, and $V$ are matrices of queries, keys, and values, each containing one row of information per token in the sequence.

Specifically, each query describes what kinds of tokens are relevant to a given token, each key describes what kinds of tokens a given token is relevant to, and each value carries information about the token's meaning.

By computing $QK^T$, we score the relevance of every token to every other token. Then, when we multiply the softmaxed scores by the values via $\textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$, each token contributes its meaning to every other token in proportion to its relevance.

Thus, after an attention head is applied, each token's meaning is enriched with the context of the important information around it.

The softmax is taken row by row: for each token's query, it normalizes that token's $QK^T$ scores against all keys into weights that sum to 1, which emphasizes the most relevant tokens.

We divide $QK^T$ by $\sqrt{d_k}$ so that these scores don't saturate the softmax (i.e., they stay out of the region where the softmax is so flat that gradients start to vanish).
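
To make the formula concrete, here is a minimal, dependency-free sketch of a single attention head operating on plain Python lists. The helper names (`softmax`, `attention`) and the toy numbers are assumptions made for illustration; real implementations use batched matrix multiplications instead of loops.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one weight per token, summing to 1
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 3 tokens with d_k = d_v = 2 (numbers are made up).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
```

Each row of `result` is a weighted average of the rows of `V`, so every output coordinate stays between the smallest and largest value in the corresponding column of `V`.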

# My Notes

## 📜 [Attention Is All You Need](https://arxiv.org/pdf/1706.03762)

> We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Building on existing encoder-decoder architectures, the Transformer strips out recurrence and convolutions and uses attention as the only mechanism for relating positions in the sequence.

> On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

This is a striking result that shows the efficiency gain from the parallelization the Transformer offers: better performance in a fraction of the training time.

> Recurrent models typically factor computation along the symbol positions of the input and output sequences. […] The fundamental constraint of sequential computation, however, remains.

The fundamental computational constraints of recurrent models make them inefficient, despite optimization attempts.

> In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.

Motivated by the inefficiency of recurrence, this paper completely gets rid of recurrence and uses only attention to form representations.

> The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

> In the Transformer [the number of operations required to relate signals from two arbitrary input or output positions] is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention

In the Transformer, a single attention head takes a weighted average of the information at all positions to compute the context used to update each word's embedding vector, which can reduce resolution. To counteract this, multi-head attention gives the Transformer several independent ways to focus on different types of important information for each word.

> Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

Individual words in a sequence can soak up context from each other to enrich their own meanings.

### Model Architecture

> The Transformer follows this overall architecture [of an encoder-decoder model] using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.

![Screenshot 2024-05-15 at 11.49.10 PM.png](../../images/Screenshot_2024-05-15_at_11.49.10_PM.png)

**1. Encoder and Decoder Stacks**

> The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

Each of the $N$ layers contains a multi-headed attention block for the words in the input sequence to self-attend to each other and soak up meanings from each other, as well as a feed forward network to use and interpret those meanings.

Additionally, each sub-layer is wrapped in a residual connection followed by layer normalization, which makes the deep stack easier to optimize.
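
The paper writes the output of each sub-layer as LayerNorm(x + Sublayer(x)). Below is a minimal sketch of that wrapper for a single token vector. The function names and the `double` stand-in sub-layer are hypothetical, and the learned gain and bias of a real layer norm are left out to keep the example short.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token vector to zero mean and (roughly) unit variance.
    # The learned gain and bias of a real LayerNorm are omitted here.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_with_residual(x, sublayer):
    # Post-norm form from the paper: LayerNorm(x + Sublayer(x)).
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical stand-in for attention or the feed-forward network.
def double(x):
    return [2.0 * v for v in x]

out = sublayer_with_residual([1.0, 2.0, 3.0, 4.0], double)
```

Note that the implementation later in these notes applies the layer norm before each sub-layer instead (the "pre-norm" variant), which is a different ordering from the one sketched here.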

> The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

Masked multi-head attention is used in the decoder to ensure that output words can’t attend to words that follow them.
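
As a rough illustration of the masking idea, the sketch below sets the scores of future positions to negative infinity before the softmax, so they receive exactly zero weight. The function name `causal_softmax` is an assumption for this example, and the scores are made up.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Future positions get -inf, which becomes exp(-inf) = 0 in the softmax.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal scores, each token spreads its attention evenly over itself
# and the tokens before it.
w = causal_softmax([[0.0] * 3 for _ in range(3)])
```

The first token can only attend to itself, the second splits its attention between the first two positions, and so on.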

**2. Attention**

> An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

**2.1. Scaled Dot-Product Attention**

![Screenshot 2024-05-15 at 11.54.43 PM.png](../../images/Screenshot_2024-05-15_at_11.54.43_PM.png)

$$
\textrm{Attention}(Q,K,V) = \textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$

Both dot-product and additive attention were viable choices. However:

> Dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

> We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has
> extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

The scaling factor keeps the inputs to the softmax in a range where its gradient doesn’t vanish, regardless of the size of $d_k$ (it does this by shrinking the differences between the dot products).
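
The paper's reasoning is that if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$, so raw scores spread out as $d_k$ grows. The sketch below uses made-up scores to show what that spread does to the softmax, with and without the $\frac{1}{\sqrt{d_k}}$ scaling.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 8.0, 0.0, -8.0]  # made-up unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # divide by 8

unscaled = softmax(raw_scores)
scaled = softmax(scaled_scores)

# Without scaling, nearly all the weight lands on one token (saturation).
# With scaling, the weights stay spread out, so gradients remain usable.
print([round(p, 4) for p in unscaled])
print([round(p, 4) for p in scaled])
```

In the unscaled case the largest weight is almost exactly 1, which is the flat region of the softmax where the gradient with respect to the scores is close to zero.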

**2.2. Multi-Head Attention**

> Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

The primary effect of multi-headed attention is to enable the model to soak up context from multiple distinct information sources for each word, something that couldn’t happen as easily with a single head because each head produces only one weighted average of the values.

$$
\textrm{MultiHead}(Q,K,V) = \textrm{Concat}(\textrm{head}_1,...,\textrm{head}_h)W^O \\
\textrm{where head}_i = \textrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

Each multi-headed attention block concatenates the results of all its heads into a single vector, which is projected by $W^O$ before being passed on to the feed-forward layer.

This means that the feed-forward layer isn't receiving just a single opinion about each word's enriched context, but a different perspective on the word's meaning from each head.

In the paper's model, which uses $h = 8$ heads, that means the feed-forward layer receives 8 different perspectives on the enriched meaning of each token.

> Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
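
A quick piece of arithmetic shows why the cost is similar. The sketch below uses the paper's base configuration ($d_{\textrm{model}} = 512$, $h = 8$, so $d_k = d_v = 64$) and counts only the projection weights, ignoring biases; the variable names are assumptions for this example.

```python
d_model = 512  # base model width from the paper
h = 8          # number of heads in the paper's base model
d_k = d_v = d_model // h  # 64 dimensions per head

# Per-head projections W_i^Q, W_i^K, W_i^V, summed over all heads.
multi_head_params = h * (d_model * d_k + d_model * d_k + d_model * d_v)

# One head at full dimensionality: three d_model x d_model projections.
single_head_params = 3 * d_model * d_model

# The output projection W^O maps the concatenated heads back to d_model.
concat_width = h * d_v
output_proj_params = concat_width * d_model
```

Because each head works in a space that is `h` times smaller, the eight heads together use the same number of projection weights as one full-width head.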

**3. Position-wise Feed-Forward Networks**

> In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$
\textrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

> While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1.

Because the same weight matrices are applied to every enriched token coming out of the multi-headed attention layer, this can be thought of as a convolution with kernel size 1 applied at each position.
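
Here is a small sketch of what "position-wise" means in practice: one set of weights, applied to each token vector on its own. The function name `ffn` and all the weights are made up for illustration, with $d_{\textrm{model}} = 2$ and a hidden width of 3.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

# Made-up weights: d_model = 2, hidden width = 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
# The same weights are applied to every position independently.
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]
```

Since no information flows between positions inside this layer, identical input vectors (the first and third tokens here) always produce identical outputs, wherever they sit in the sequence.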

**4. Embeddings and Softmax**

> We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.

**5. Positional Encodings**

> Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.

This is what enables Transformers to still maintain information about word positions without explicit recurrence.

> In this work, we use sine and cosine functions of different frequencies:

$$
PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{\textrm{model}}}) \\
PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{\textrm{model}}})
$$
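
A minimal sketch of these sinusoidal encodings follows, using the base of 10000 from the paper. The function name `positional_encoding` and the small table size are assumptions for the example. Note that the loop variable `i` is the even dimension index itself, which corresponds to $2i$ in the paper's notation.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index (2i in the paper's notation).
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=4, d_model=6)
```

Each row of the table is added to the token embedding at that position, so low dimensions oscillate quickly with position and high dimensions oscillate slowly.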

### Why Self-Attention

> Motivating our use of self-attention we consider three desiderata. One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network.

> Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.

Transformers make it far easier for the network to learn long-range dependencies between words, since there’s no recurrence stretching the path between distant positions across many time steps.

> To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position.

This is the motivation behind limited context windows.

> As side benefit, self-attention could yield more interpretable models. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.

Multi-headed attention appears to model highly interpretable behaviors, something uncharacteristic of most models.

### Training

> We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs.

> Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.

> We trained our models on one machine with 8 NVIDIA P100 GPUs. […] Each training step took about 0.4 seconds.

> We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$.

> We apply dropout to the output of each sub-layer before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.

> During training, we employed label smoothing of value $\epsilon_{l_s} = 0.1$.
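
For intuition, here is a sketch of one common formulation of label smoothing, in which the one-hot target is mixed with a uniform distribution over all classes. The exact variant used in the paper is not spelled out in these notes, so treat the formula and the function name `smooth_labels` as assumptions.

```python
def smooth_labels(target_index, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    dist = [uniform] * num_classes
    dist[target_index] += 1.0 - eps
    return dist

dist = smooth_labels(target_index=2, num_classes=5, eps=0.1)
```

The correct class still gets most of the probability mass, but the model is no longer pushed toward putting all of its confidence on a single token.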

### Results

> Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

> To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes.

> Despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar.

### Conclusion

> In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

> For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.


# Implementation

This implementation is based on Andrej Karpathy's transformer tutorial: [let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY). All of his tutorials are great for building intuitions and he explains all the concepts in depth.

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-05-19 05:20:30--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-05-19 05:20:30 (163 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("length of dataset in characters", len(text))
print("---\n", text[:500])

length of dataset in characters 1115394
---
 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [None]:
# gpu
batch_size = 64
block_size = 256
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # fall back to CPU when no GPU is available
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

In [None]:
torch.manual_seed(1337)
torch.set_default_device(device)

In [None]:
# get all the unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)

# create a mapping from characters to integers (tokenization)
stoi = { ch: i for i,ch in enumerate(chars) }
itos = { i: ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
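
To see the character-level tokenizer in action, here is a small round trip on a toy string instead of the full dataset. The `toy_` names are hypothetical and only mirror the mappings defined above.

```python
sample = "hello world"               # stand-in for the dataset text
toy_chars = sorted(set(sample))      # unique characters, in a fixed order
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    # map each character to its integer id
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    # map each integer id back to its character
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
roundtrip = toy_decode(ids)
```

Encoding followed by decoding returns the original string, and any character that never appears in the source text would raise a `KeyError` in the encoder.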

In [None]:
# get the encoded dataset
data = torch.tensor(encode(text), dtype=torch.long)

# split into train and validation sets
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [None]:
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # the targets y are the inputs x shifted one position ahead, so y[t] is the token that follows x[t]
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    x, y = x.to(device), y.to(device)
    return x, y
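
The offset between `x` and `y` is easier to see on a toy sequence. The sketch below uses plain lists and hypothetical `toy_` names, with a fixed start offset instead of a random one.

```python
toy_data = list(range(10, 20))  # stand-in for encoded token ids
toy_block_size = 4
start = 3                       # a hypothetical sampled start offset

x = toy_data[start:start + toy_block_size]          # inputs
y = toy_data[start + 1:start + toy_block_size + 1]  # targets: same window shifted by one

# Each prefix of x is paired with the token that follows it.
pairs = [(x[:t + 1], y[t]) for t in range(toy_block_size)]
```

A single window of length `block_size` therefore yields `block_size` training examples, one for every prefix length.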

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, y = get_batch(split)
            # get losses from the model itself
            logits, loss = model(X, y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [None]:
# a single attention head
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()

        # q, k, v
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # used for masked attention (mask future tokens)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)

        # QK^T / sqrt(d_k), where d_k is the head size
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # causal mask: each position can only attend to itself and earlier positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)

        # softmax(QK^T / sqrt(d_k))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)

        # weighted aggregation of values
        v = self.value(x) # (B, T, head_size)

        # full attention: softmax(QK^T / sqrt(d_k)) V
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

# multi-headed attention, just concatenating the result of multiple heads
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        # store all the attention heads
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # projection back into the residual pathway
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # concatenate the result of all attention heads into one vector
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

# feed forward layers that come after multiheaded attention
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            # projection layer into the residual pathway
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        # each block has its own multi-headed attention + feed-forward
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # add residuals to improve optimization
        # apply layernorms before going into modules
        x = x + self.sa(self.ln1(x)) # self-attention
        x = x + self.ffwd(self.ln2(x)) # feed-forward
        return x

class GPT(nn.Module):
    def __init__(self):
        super().__init__()
        # each token id looks up its embedding vector in a learned table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # create the position embeddings for each position
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        # embedding for each token
        tok_emb = self.token_embedding_table(idx) # (B, T, C)

        # positional embedding for each token
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, C)

        # x has token embedding + position
        x = tok_emb + pos_emb # (B, T, C)

        # apply all blocks w/ attention and feed forward layers
        x = self.blocks(x)

        # final layer norm before the output projection
        x = self.ln_f(x) # (B, T, C)

        # get logits by using the final linear layer to predict the next token
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)

            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens (context window!)
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
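
The sampling loop in `generate` can be sketched without PyTorch. In the example below, `toy_model` is a hypothetical stand-in for the network: it deterministically predicts (last token + 1) mod 3, which makes the result easy to follow. The standard library's `random.Random.choices` plays the role of `torch.multinomial`.

```python
import random

def generate(next_token_probs, context, max_new_tokens, block_size, rng):
    """Autoregressive sampling with a fixed-size context window."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        window = tokens[-block_size:]      # crop to the context window
        probs = next_token_probs(window)   # distribution over the vocabulary
        # Sample one token id in proportion to its probability.
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
    return tokens

# Hypothetical stand-in model over a 3-token vocabulary: it always
# predicts (last token + 1) mod 3 with probability 1.
def toy_model(window):
    probs = [0.0, 0.0, 0.0]
    probs[(window[-1] + 1) % 3] = 1.0
    return probs

sampled = generate(toy_model, context=[0], max_new_tokens=5, block_size=2,
                   rng=random.Random(0))
```

Every new token is appended to the running sequence and fed back in on the next step, while the crop keeps the input within the number of positions the model was trained on.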

In [None]:
model = GPT()
m = model.to(device)
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3) # note: the learning_rate hyperparameter defined above (3e-4) is not used here

In [None]:
for iter in range(max_iters):
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# switch to eval mode so dropout is disabled while sampling
m.eval()
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.5183, val loss 4.5049
step 500: train loss 1.7811, val loss 1.9220
step 1000: train loss 1.4491, val loss 1.6516
step 1500: train loss 1.3143, val loss 1.5616
step 2000: train loss 1.2365, val loss 1.5349
step 2500: train loss 1.1772, val loss 1.5225
step 3000: train loss 1.1199, val loss 1.5229
step 3500: train loss 1.0634, val loss 1.5410
step 4000: train loss 1.0079, val loss 1.5487
step 4500: train loss 0.9487, val loss 1.5795

Since thy birt: till thy blood, sits and after the other
fires; then's the wind--though I fear, the staged what I
shall say, by need with transporturney of all,
the shepherd, a doth eten one thing
purchase to his temperant throw.

MENENIUS:
What was ha sometime overed man? They say, my lords.
O unlike and leepend the good wall!
Hour the news several days. This mother day!
O montal, I'll peer it soject
To see him to-night.

CORIOLANUS:
Where is the four weak poised art
To make so her worshipped l


With the model trained, we can generate some text from it. The output looks far better than the random characters the model starts with, although it isn't perfect, since both the dataset and the model are small.

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))


DORSET:
Readen truth that makes me levish of Bushy,
Shall I strawber in all this brare heir leap.
Then my part in her cheeks my wife unto your bed.
If Burgunderies her and cursed by lent
Hath upon me, and descend her brothers there were last,
Thy heads and frame to she diet, to see her petty ear.
Feeping this, see, her lords, all help all one:
We have been sole alraady, and I was before her
combardent, and by Romeo are the common taxe,
are the ruse-for fee-and every pine of services
That speeding days: in her purpose what bought were insatial
Marchans keeps. I have reasy
Come, or tell us plate, but stimate:
Only sins be here none beyed on you;
His idlen and resol majesty over yet:
The noble mother usages my head, my lord and
If consisted worms, and he may be, the laid was.

QUEEN MARGARET:
Involt, daughter, nothing capame into made.

QUEEN MARGARET:
Blind time, and lessen title peril thou fae.

QUEEN MARGARET:
No, not In King Henry shall go.
To help them, and then take again.
I canst 