# RNNs to Transformers

 In the context of deep learning models, Recurrent Neural Newtorks(RNNs) were traditionally used to learn temporal dependencies. In RNNs, the hidden state from the previous time step is fed back into the network, allowing it to maintain a “memory” of past inputs. They were ideal for tasks with short sequences such as natural language processing and time-series prediction. With a Seq2Seq architecture, input sequences are fed into the Encoder, and Decoder predicts each word one after another.


 But these networks comes with its share of challenges.
 ![rnn_arch](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/RNN_arch.png?raw=1)

 Source: [OpenSourceWikiMedia](https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg),
    [baeldung](https://www.baeldung.com/cs/rnns-transformers-nlp)



**Challenges with RNNs**
*   **Slow to train** : RNNs are sequential models that process data one element at a time, maintaining an internal hidden state that is updated at each step. They operate in a recurrent manner, where the output at each step depends on the previous hidden state and the current input, leaving no scope for parallel computation.
*   **Cannot handle large sequences** : Exploding and vanishing gradients limit the RNN modelling of long sequences. Though some of the variants of RNNs like LSTM and GRU addressed this problem successfully, they cannot engage with very large sequences.
* **Fixed vector length** : A fixed length  context vector limits the representation and decodinng of a long input sequence. Moreever it is challenging to differentiate sentences with similar words but with different meanings

<img src="https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/RNNVsTransformer.png?raw=1" width="500">

Source: [jinglescode](https://jinglescode.github.io/2020/05/27/illustrated-guide-transformer/)









# Transformers

"Attention is all you need" was the game changing paper in 2017 that revolutionized the NLP world. Transformer relies entirely on Attention mechanisms to boost its speed by being parallelizable. As you can see, while RNNs processes each part of the sequence one by one, the transformer on the other hand processes the entire sequence at a time.

<img src="https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/TransformerArch.png?raw=1" width="400">

Source: [Attention Is all you need](https://arxiv.org/pdf/1706.03762.pdf)

**Architecture overview :**

For a machine translation task (say English to French), input the English sentence to the encoder. Word embedding (provides meaning) along with the positional vector (adds context of the word in the sentence) are fed into Encoder attention block, which computes the attention vectors for every word, followed by a feed-forward network in parallel, and the output will be a set of encoded vectors for every word.

The Decoder receives input of the French word(s) and attention vectors of the entire English sentence, to generate the next French word. The encoded word vectors are fed into the first decoder attention block, the masked attention block. The masked attention block computes the Attention vectors for current and prior words. Attention vectors from the Encoder and Decoder are fed into the next attention block, which generates attention mapping vectors for every English and French word. These vectors are passed into the feed-forward layer linear layer and the softmax layer to predict the next French word. We repeat this process to generate the next word until the “end of sentence” token is generated.

Though the original transformer had both encoder and decoder blocks, subsequent architectures, specifically the GPT based models are "decoder only" models, where the trained model rambles on its own (which is technically called generating unconditional samples). Alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a generating interactive conditional samples). In the rambling case, we can simply hand it the start token and have it start generating words (the trained model uses <|endoftext|> as its start token. For illustration purpose here, we use a decoder only model that uses masked self attention and no second multi head attention (cross attention) in the stack.

In [10]:
## IMPORTS AND HYPERPARAMETERS

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
#max_iters = 5000
#eval_interval = 100
#learning_rate = 1e-3
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
#eval_iters = 200
n_embd = 16
n_head = 4
n_layer = 4
dropout = 0.0
vocab_size = 65
# ------------

torch.manual_seed(1337)



<torch._C.Generator at 0x7fd0647753b0>

# Embedding Layer

Word embedding or word vectors help us represent words or text as a numeric vector input where words with similar meanings have the similar representation. Think Embedding layer as a look-up table of size **vocab_size x embedding size**. On the forward pass, the embedding layer will simply query the rows of the underlying weight array based on the input token indices. Note below that the entire embedding layer is a trainable with requires_grad set to True. For any out of dictionary words, you can set the embedding to be all 0s using 'padding_idx'

In [11]:
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)
print(token_embedding_table)
print(token_embedding_table.weight)
x = torch.tensor([1,3,15,4,7,1,4,9])
x = token_embedding_table(x) + position_embedding_table(torch.arange(block_size))

Embedding(65, 16)
Parameter containing:
tensor([[ 0.1808, -0.0700, -0.3596,  ..., -0.2398, -0.9211,  1.5433],
        [ 1.3488, -0.1396,  0.2858,  ..., -0.8016,  1.5236,  2.5086],
        [-0.6631, -0.2513,  1.0101,  ...,  0.5718, -0.5974, -0.6937],
        ...,
        [ 0.9187,  0.2998,  0.6106,  ..., -0.6686, -0.4831, -0.2298],
        [ 0.9043,  0.7631, -0.1606,  ...,  0.8282, -0.4826,  1.8330],
        [ 0.3421,  0.2154, -0.1029,  ...,  0.5812, -0.5356, -1.7944]],
       requires_grad=True)


In [12]:
import torch
inputs = torch.tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets = torch.tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])

# Masked Self Attention
Self-attention is a mechanism that enhances the information content of an input embedding by including information about the input's context. In other words, the self-attention mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output. It assigns scores to how relevant each token in the sequence is, and adds up to the vector representation.

<!--- <img src="https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/AttentionGraph.png?raw=1">
![graph](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/AttentionGraph.png?raw=1)
[MSA](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/decoder_only_block.png?raw=1)
--->
As an example shown below, the self-attention layer in the top block is paying attention to “a robot” when it processes the word “it”. The vector it will pass to its neural network is a sum of the vectors for each of the three words multiplied by their scores.

<img src="https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/SelfAttentionEx.png?raw=1" width="600">

Source: [jalammar](http://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention)



Attention can be viewed as a communication mechanism, where tokens are viewed as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
<img src="https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/AttentionGraph.png?raw=1" width="400">

The decoder architecture uses masked self attention where the future tokens are masked by interfering in the self-attention calculation blocking information from tokens that are to the right of the position being calculated. The masked attention is implemented using a lower triangular masking.


<img src="https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/decoder_only_block.png?raw=1" width="400">

Source: [Attention Is all you need](https://arxiv.org/pdf/1706.03762.pdf)


There are three components to the self attention, namely

*   **Query**: The query is a representation of the current token used to score against all the other words (using their keys). We only care about the query of the token we’re currently processing.

*   **Key**: Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words or context.

*   **Value**: Value vectors are actual word representations, once we’ve scored how relevant each word is, these are the values we add up to represent the current word.

Once we have created the above vectors, we need to calculate the "attention scores" for each word of the input sentence against the current word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the **dot product** of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #3 (say robot in the above example), the first score would be the dot product of q3 and k1(key for start of sequence). The second score would be the dot product of q3 and k2(key for word "a") and the third dot product of q3 and k3(key for word "robot"). The attention scores are then normalized by a **softmax function**. This is followed by multiplying the scores with each of the value vector and aggregating them to represent the final embedding of the current word.

![ScaledDP](https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/scaledDotProduct.png?raw=1)

Source: [Sebastian Raschka](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html)

In the above example, we can see that the weighted blend of value vectors results in a vector that paid 50% of its “attention” to the word robot, 30% to the word a, and 18% to the word it.



In [13]:
#Self Attention
# version 4: self-attention!
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# here we want the wei to be data dependent, gather info from the past but in a data dependant way

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16) # each token here (totally B*T) produce a key and query in parallel and independently
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) * head_size**-0.5 # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# so now for each entry/sample, we have T*T matrix of affinities. weighted aggregation now is
# a func of in  data dependent manner between K and Q of the tokens, so wei[0] will be diff from wei[1]
# in wei[0], the last row is for the 8th token, it knows at what pos it is and what is it contents
# so the 8th token can have the info in the channel saying I am this vowel and I am looking for this
# all the other tokens have similar info in the keys and when you do the dot prod, they can find each other
# and create a high affinity, so if you consider the last row which maps to 8th token, where ever the values are high
# for ex 4th, 7th and 8th token have more affinity to 8th token. And through softmax you end up aggregating a lot of its info
# in to that position.

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf')) # this is only present in a decoder block not in an encoder
wei = F.softmax(wei, dim=-1) # exponentiate and normalize giving a nice distibution that sums to one and
# now it tells us that in a data dependent manner how much of info to aggregate from any of the past tokens

v = value(x)
out = wei @ v # aggregate the attention scores and value vector.

In [14]:
wei[0,:,:]
out.shape


torch.Size([4, 8, 16])

Notes


*   There is no notion of space in attention. It simply acts over a set of vectors. This is why we need to positionally encode tokens.

*   Each example across batch dimension is of course processed completely independently and never "talk" to each other

*   "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)


*   "Scaled" attention additional divides wei by 1/sqrt(head_size). This helps prevent the attention weights from becoming too small or too large, which could lead to numerical instability or affect the model’s ability to converge during training. So essentially you would ideally want the result of attention to be fairly diffused especially upon initialization before applying the softmax function. Hence, scaling is used control the variance at initialization.

*   In an "encoder" attention block just delete the single line that does masking with tril, allowing all tokens to communicate.




# Multi Head Attention

Until now we have seen what happens under a single scaled dot-product attention, the input sequence was transformed using three matrices representing the query, key, and value. These three matrices can be considered as a single attention head in the context of multi-head attention. Multi-head attention involves multiple such heads in parallel, each consisting of its own query, key, and value matrices and finally concatenating the results. It expands the model’s ability to focus on different positions and provides multiple “representation subspaces”.

<img src="https://github.com/brettin/llm_tutorial/blob/main/tutorials/01-LLMs101/Images/MultiHeadAttention.png?raw=1" width="200">

Source: [Attention Is all you need](https://arxiv.org/pdf/1706.03762.pdf)


In [15]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

# Feed Forward Network

Now with all the communication completed with multi headed self attention, we follow it with a feed forward network that is a simple multi layer perceptron(MLP). This layer is applyed on a per token basis with each token processed(thinks on the data) independently. Usually the inner layer dimensionality has a multiple factor of 4 as compared to the input of the feedforward layer.

In [17]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

## Residual Connection

Skip Connection or residual connections are added both the attention as well as the feedforward layers. The skip connections combat the vanishing gradient problem, thus enabling deeper networks. For more information on skip connections [refer](https://arxiv.org/pdf/1512.03385.pdf).

## Layer norm

In the original paper, the layer norm is applied after the attention or feedforward layer. But now-a-days it is more common to have the normalization applied before the attention and feed-forward layers. The layer-norm normalizes over the entire row.

In [10]:
    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # Communication / Attention
        x = x + self.ffwd(self.ln2(x))  # Computation
        return x

In [11]:
import torch
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # batch mean
    xvar = x.var(1, keepdim=True) # batch variance
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta] # trainable params

torch.manual_seed(1337)
module = LayerNorm1d(5)
x = torch.randn(3, 5) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([3, 5])

## Full finished code, for reference

In [12]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), # Projection layer going back into the residual pathway
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # Communication
        x = x + self.ffwd(self.ln2(x))  # Computation
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


--2023-11-28 04:12:09--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-11-28 04:12:09 (14.1 MB/s) - ‘input.txt’ saved [1115394/1115394]

0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5091, val loss 2.5060
step 300: train loss 2.4199, val loss 2.4337
step 400: train loss 2.3500, val loss 2.3563
step 500: train loss 2.2961, val loss 2.3126
step 600: train loss 2.2408, val loss 2.2501
step 700: train loss 2.2053, val loss 2.2187
step 800: train loss 2.1636, val loss 2.1870
step 900: train loss 2.1226, val loss 2.1483
step 1000: 

# References


*   http://jalammar.github.io/illustrated-gpt2/
*   Vaswani et al, Attention is all you need. arXiv:1706.03762 [cs.CL]
*   https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
*   http://jalammar.github.io/illustrated-transformer/
*   https://jinglescode.github.io/2020/05/27/illustrated-guide-transformer/
*   https://towardsdatascience.com/getting-started-with-recurrent-neural-network-rnns-ad1791206412
*  https://www.youtube.com/watch?v=kCc8FmEb1nY
* https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing





