# RNNs to Transformers

In the context of deep learning models, Recurrent Neural Networks (RNNs) were traditionally used to learn temporal dependencies. In an RNN, the hidden state from the previous time step is fed back into the network, allowing it to maintain a “memory” of past inputs. This made RNNs well suited to tasks with short sequences, such as natural language processing and time-series prediction. With a Seq2Seq architecture, the input sequence is fed into the Encoder, and the Decoder predicts the output words one after another.


But these networks come with their share of challenges.
 ![rnn_arch](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/RNN_arch.png?raw=1)

**Challenges with RNNs**
*   **Slow to train** : RNNs are sequential models that process data one element at a time, maintaining an internal hidden state that is updated at each step. They operate in a recurrent manner, where the output at each step depends on the previous hidden state and the current input, leaving no scope for parallel computation.
*   **Cannot handle large sequences** : Exploding and vanishing gradients limit how well RNNs model long sequences. Variants such as LSTM and GRU mitigate this problem, but they still struggle with very long sequences.
* **Fixed vector length** : A fixed-length context vector limits the representation and decoding of a long input sequence. Moreover, it is challenging to differentiate sentences with similar words but different meanings.
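
To make the first challenge concrete, here is a minimal sketch of the recurrence using PyTorch's `nn.RNNCell`. The sizes and the random input are illustrative assumptions, not values used elsewhere in this tutorial. Step `t` cannot start until step `t-1` has produced its hidden state, which is why the loop cannot be parallelized over time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the tutorial)
seq_len, batch, input_size, hidden_size = 6, 2, 4, 8

cell = nn.RNNCell(input_size, hidden_size)
x = torch.randn(seq_len, batch, input_size)   # one input vector per time step
h = torch.zeros(batch, hidden_size)           # initial hidden state ("memory")

hidden_states = []
for t in range(seq_len):
    # h at step t depends on h at step t-1, so the steps must run in order
    h = cell(x[t], h)
    hidden_states.append(h)

hidden_states = torch.stack(hidden_states)    # (seq_len, batch, hidden_size)
print(hidden_states.shape)
```

The whole input sequence ends up summarized in the final `h`, which is the fixed-length context vector mentioned in the last bullet.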

![RNNVsTransformers](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/RNNVsTransformer.png?raw=1)






# Transformers

"Attention is all you need" was the game-changing 2017 paper that revolutionized the NLP world. The Transformer relies entirely on attention mechanisms, which makes it parallelizable and therefore faster to train. As the figure above shows, while an RNN processes the sequence one element at a time, the Transformer processes the entire sequence at once.

![TransArch](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/TransformerArch.png?raw=1)

**Architecture overview :**

For a machine translation task (say English to French), the English sentence is the input to the encoder. Each word's embedding (which captures its meaning) is combined with a positional vector (which encodes the word's position in the sentence) and fed into the encoder attention block. This block computes an attention vector for every word, and a feed-forward network then processes these vectors in parallel. The output is a set of encoded vectors, one per word.

The Decoder receives the French word(s) generated so far, together with the attention vectors of the entire English sentence, and generates the next French word. The embedded French word vectors are fed into the first decoder attention block, the masked attention block, which computes attention vectors for the current and prior words. The attention vectors from the Encoder and the Decoder are fed into the next attention block, which generates attention mapping vectors relating every English and French word. These vectors are passed through the feed-forward layer, a linear layer, and a softmax layer to predict the next French word. We repeat this process, one word at a time, until the “end of sentence” token is generated.

Though the original Transformer had both encoder and decoder blocks, subsequent architectures, specifically the GPT-based models, are "decoder-only" models. Left alone, the trained model rambles on its own (technically called generating unconditional samples). Alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a. generating interactive conditional samples). In the rambling case, we can simply hand it the start token and have it start generating words (the trained model uses `<|endoftext|>` as its start token). For illustration purposes, we use a decoder-only model here, with masked self-attention and no second multi-head attention (cross-attention) in the stack.
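
The difference between unconditional and conditional sampling can be sketched with a toy stand-in for a trained model. Everything here is an assumption made for illustration: the vocabulary size, the `START` id, and the fixed `logit_table`, which only looks at the last token, whereas a real decoder computes its logits from the whole context.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 10          # assumed toy vocabulary
START = 0                # assumed id of the start token

# Stand-in for a trained model: a fixed table of next-token logits.
# A real decoder would compute these logits from the whole context.
logit_table = torch.randn(vocab_size, vocab_size)

def sample(context, max_new_tokens):
    """Autoregressively extend `context` (a list of token ids)."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        logits = logit_table[tokens[-1]]          # logits for the next token
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_id)                    # feed the sample back in
    return tokens

unconditional = sample([START], max_new_tokens=5)      # only the start token
conditional = sample([START, 3, 7], max_new_tokens=5)  # start token plus a prompt
print(unconditional)
print(conditional)
```

In both cases the loop is the same; the only difference is how much context the model is handed before it starts generating.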

In [None]:
## IMPORTS AND HYPERPARAMETERS

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
#max_iters = 5000
#eval_interval = 100
#learning_rate = 1e-3
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
#eval_iters = 200
n_embd = 16
n_head = 4
n_layer = 4
dropout = 0.0
vocab_size = 65
# ------------

torch.manual_seed(1337)



<torch._C.Generator at 0x7f5367f75b10>

# Embedding Layer

Word embeddings (or word vectors) represent words or text as numeric vectors in which words with similar meanings have similar representations. Think of the embedding layer as a look-up table of size **vocab_size x embedding size**. On the forward pass, the embedding layer simply looks up the rows of the underlying weight array that correspond to the input token indices. Note below that the entire embedding layer is trainable, with `requires_grad` set to True. To reserve an index whose embedding stays at all 0s (for example for padding or out-of-dictionary words), you can use `padding_idx`.

In [None]:
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)
print(token_embedding_table)
print(token_embedding_table.weight)
x = torch.tensor([1,3,15,4,7,1,4,9])
x = token_embedding_table(x) + position_embedding_table(torch.arange(block_size))

Embedding(65, 16)
Parameter containing:
tensor([[ 0.2347, -1.2034, -0.1935,  ...,  0.7385, -0.4643, -0.2165],
        [ 0.7709,  0.6257, -0.2308,  ..., -0.0188,  0.0890, -0.6228],
        [-0.2247, -0.2451, -0.1528,  ..., -1.1393,  0.0502, -1.1192],
        ...,
        [ 0.1421,  0.3871,  1.4001,  ..., -0.6893,  0.5360,  0.0078],
        [-0.5903, -0.7970, -1.8383,  ..., -0.3543,  1.5475,  0.5125],
        [ 1.1790, -0.4160, -0.8031,  ...,  1.2491, -0.8952,  0.7180]],
       requires_grad=True)
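
The look-up behaviour and the effect of `padding_idx` can be checked directly. This is a small sketch with toy sizes; reserving index 0 as the padding index is an assumption made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(5, 3, padding_idx=0)   # toy sizes; index 0 reserved (assumption)

idx = torch.tensor([2, 4, 2])
out = emb(idx)

# The forward pass is a row look-up into the weight matrix
print(torch.equal(out, emb.weight[idx]))

# The row at padding_idx is initialized to zeros
print(emb.weight[0])

# ...and it receives no gradient, so it is not updated during training
out_all = emb(torch.tensor([0, 1]))
out_all.sum().backward()
print(emb.weight.grad[0])   # gradient for the padding row
print(emb.weight.grad[1])   # gradient for an ordinary row
```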


In [None]:
import torch
inputs = torch.tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets = torch.tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

# Masked Self Attention
Attention is a communication mechanism. It gives each time step access to all tokens, so the model can determine which words are more important than others in a specific context. It can also be viewed as nodes in a directed graph looking at each other, where each node aggregates information as a weighted sum over all nodes that point to it, with data-dependent weights.


The decoder architecture uses masked self-attention, in which future tokens are masked: the self-attention calculation blocks information from tokens to the right of the position being computed. The mask is implemented with a lower-triangular matrix (`tril`).
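
As a small sketch of the masking step on its own, the example below uses equal raw scores and a toy sequence length (both assumptions for illustration). With equal scores, row `i` of the result spreads its weight evenly over positions `0..i` and gives zero weight to every later position.

```python
import torch
from torch.nn import functional as F

T = 4                                    # toy sequence length (assumption)
scores = torch.zeros(T, T)               # equal raw affinities, for illustration
tril = torch.tril(torch.ones(T, T))      # ones on and below the diagonal

masked = scores.masked_fill(tril == 0, float('-inf'))  # hide future positions
weights = F.softmax(masked, dim=-1)      # each row sums to 1 over visible tokens
print(weights)
```

Filling with `-inf` before the softmax, rather than zeroing afterwards, keeps every row a valid probability distribution.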


![MSA](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/decoder_only_block.png?raw=1)


There are three components to the self attention, namely

*   **Query**: The query is a representation of the current token used to score against all the other words (using their keys). We only care about the query of the token we’re currently processing.

*   **Key**: Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words or context.

*   **Value**: Value vectors are the actual word representations. Once we’ve scored how relevant each word is, these are the values we add up (weighted by those scores) to represent the current word.
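
The three roles can be sketched for a single token. The sizes and random vectors are assumptions for illustration, and masking is left out to keep the example short.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, head_size = 5, 4                 # toy sizes (assumptions)
q = torch.randn(head_size)          # query of the token being processed
K = torch.randn(T, head_size)       # one key per token in the segment
V = torch.randn(T, head_size)       # one value per token in the segment

scores = K @ q * head_size**-0.5    # (T,) relevance of every token to the query
weights = F.softmax(scores, dim=-1) # (T,) non-negative, sums to 1
out = weights @ V                   # (head_size,) weighted sum of the values
print(weights, out.shape)
```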


![ScaledDP](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/scaledDotProduct.png?raw=1)


In [None]:
#Self Attention
# version 4: self-attention!
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# here we want wei to be data dependent: gather information from the past, weighted by the data itself

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16) each of the B*T tokens produces a key and a query in parallel and independently
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) * head_size**-0.5 # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# For each sample we now have a T*T matrix of affinities. The aggregation weights depend on the
# keys and queries of the tokens, so wei[0] will differ from wei[1].
# In wei[0], the last row belongs to the 8th token. Its query encodes its position and content
# (for example, "I am this vowel and I am looking for that").
# The keys of all the other tokens carry similar information, so when a query and a key match,
# their dot product gives a high affinity. Wherever the values in the last row are high
# (say for the 4th, 7th, and 8th tokens), softmax ends up aggregating a lot of those tokens'
# information into the 8th position.

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf')) # this line is only present in a decoder block, not in an encoder
wei = F.softmax(wei, dim=-1) # exponentiate and normalize, giving a distribution that sums to one in each row
# wei now tells us, in a data-dependent manner, how much information to aggregate from each past token

v = value(x)
out = wei @ v

In [None]:
wei[0,:,:]
out.shape


torch.Size([4, 8, 16])

Notes


*   There is no notion of space in attention. It simply acts over a set of vectors. This is why we need to positionally encode tokens.

*   Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other.

*   "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)


*   "Scaled" attention additionally multiplies wei by 1/sqrt(head_size). When the queries and keys have unit variance, this keeps the variance of wei close to one, so softmax stays diffuse instead of saturating towards a one-hot vector, which could otherwise lead to numerical instability or affect the model’s ability to converge during training.

*   In an "encoder" attention block just delete the single line that does masking with tril, allowing all tokens to communicate.
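
The effect of the scaling can be checked numerically. This is a sketch under the assumption of unit-variance random queries and keys; the sizes, the example scores, and the factor of 8 are illustrative.

```python
import torch

torch.manual_seed(0)
head_size = 64                          # illustrative size (assumption)
q = torch.randn(10000, head_size)       # unit-variance queries
k = torch.randn(10000, head_size)       # unit-variance keys

raw = (q * k).sum(dim=-1)               # dot products, variance close to head_size
scaled = raw * head_size**-0.5          # variance close to 1 after scaling
print(raw.var().item(), scaled.var().item())

# Large scores push softmax towards a one-hot vector
scores = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(torch.softmax(scores, dim=-1))      # fairly diffuse
print(torch.softmax(scores * 8, dim=-1))  # much more peaked
```

Without the scaling, each token would mostly attend to a single other token at initialization, which makes the weighted aggregation less useful.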




# Multi Head Attention

Until now we have seen what happens inside a single scaled dot-product attention: the input sequence is transformed using three matrices representing the query, key, and value. Together, these three matrices form a single attention head. Multi-head attention runs multiple such heads in parallel, each with its own query, key, and value matrices, and concatenates their outputs.

![MHA](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/MultiHeadAttention.png?raw=1)
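
A minimal sketch of the idea: run several independent masked heads and concatenate their outputs along the channel dimension. The sizes are toy assumptions and `one_head` is a helper introduced only for this example. The sketch leaves out the output projection and dropout.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, n_embd, n_head = 2, 5, 16, 4      # toy sizes (assumptions)
head_size = n_embd // n_head
x = torch.randn(B, T, n_embd)
tril = torch.tril(torch.ones(T, T))

def one_head(x, key, query, value):
    """Masked scaled dot-product attention for a single head."""
    k, q, v = key(x), query(x), value(x)                 # each (B, T, head_size)
    wei = q @ k.transpose(-2, -1) * head_size**-0.5      # (B, T, T)
    wei = wei.masked_fill(tril == 0, float('-inf'))      # hide future tokens
    wei = F.softmax(wei, dim=-1)
    return wei @ v                                       # (B, T, head_size)

# Each head owns its own key, query, and value projections
heads = [
    [nn.Linear(n_embd, head_size, bias=False) for _ in range(3)]
    for _ in range(n_head)
]
out = torch.cat([one_head(x, *h) for h in heads], dim=-1)  # (B, T, n_embd)
print(out.shape)
```

Choosing `head_size = n_embd // n_head` means the concatenated output has the same width as the input, which is what allows a residual connection around the attention block.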

Add 4. Multi head attention, residual connections, layer norm

## Full finished code, for reference

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a small MLP: linear layer, ReLU non-linearity, then a linear layer back to n_embd """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# decoder-only Transformer language model (despite the class name, this is not a plain bigram model)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position look-up tables, each mapping an index to an n_embd-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
