# Imports

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

In [2]:
import os
if not os.path.exists('data'):
    os.makedirs('data')
!wget -O data/shakespeare.txt  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt 

--2024-06-04 10:23:33--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘data/shakespeare.txt’


2024-06-04 10:23:34 (1.97 MB/s) - ‘data/shakespeare.txt’ saved [1115394/1115394]



In [3]:
with open('data/shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


# Fundamentals

## Tokenization


The vocabulary is the set of characters that the model can see or emit.

In [None]:
chars = sorted(list(set(text))) # vocabulary of our transformer (see or emit)
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

Convert the raw text, which is a string, into a sequence of integers according to some vocabulary of possible elements. To do this, we create a mapping (an encoder) from characters to integers, and the inverse mapping (a decoder) from integers back to characters.

In [None]:
stoi = { char:idx for idx, char in enumerate(chars) } # dictionary: character -> index
itos = { idx:char for idx, char in enumerate(chars) } # dictionary: index -> character

In [None]:
encode = lambda s: [stoi[c] for c in s] # encode a string (e.g. a word) into a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decode a list of integers into a string

In [None]:
print(encode('Best Transformer ever'))
print(decode(encode('Best Transformer ever')))

In fact, you can use whichever encoder you want (it does not have to work character by character). ChatGPT uses [TikToken](https://github.com/openai/tiktoken) as its encoder/decoder, which is a sub-word tokenizer. A tokenizer can be more complex than a simple split; check [this video](https://www.youtube.com/watch?v=zduSFxRajkE) for more information.
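
To make the trade-off between tokenizers concrete, here is a minimal sketch that compares a character-level and a word-level encoding of the same toy sentence. The sentence and the names `sample`, `char_vocab` and `word_vocab` are assumptions made for this illustration only; sub-word tokenizers sit somewhere between these two extremes.

```python
# Toy sentence (hypothetical, only for illustration).
sample = "to be or not to be"

# Character-level: small vocabulary, long sequence of ids.
char_vocab = sorted(set(sample))
char_ids = [char_vocab.index(c) for c in sample]

# Word-level: one id per word, so a much shorter sequence
# (on real text the vocabulary becomes far larger).
words = sample.split()
word_vocab = sorted(set(words))
word_ids = [word_vocab.index(w) for w in words]

print("characters:", len(char_vocab), "symbols,", len(char_ids), "ids")
print("words:     ", len(word_vocab), "symbols,", len(word_ids), "ids")

# Both encodings can be decoded back to the original text.
assert "".join(char_vocab[i] for i in char_ids) == sample
assert " ".join(word_vocab[i] for i in word_ids) == sample
```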

In [None]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)

## Train/Validation Set

In [None]:
# first 90% will be train, rest val
train = data[:int(0.9*len(data)) ]
val = data[int(0.9*len(data)) :]

In [None]:
batch_size = 4 # number of independent sequences processed in parallel
block_size = 8 # maximum context length for predictions, i.e. the "time" dimension (number of characters per sequence)

def get_batch(split):
    # batch_size and block_size are read from the globals at call time,
    # so a later change (e.g. a bigger batch_size before training) takes effect
    data = train if split == 'train' else val
    idx = torch.randint(len(data) - block_size, (batch_size,)) # random start offsets into the chosen split
    x = torch.stack([data[i:i+block_size] for i in idx])
    y = torch.stack([data[i+1:i+block_size+1] for i in idx])
    return x.to(device), y.to(device) # input, target

x, y = get_batch('train')
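
To see how one row of `x` and the matching row of `y` relate, the sketch below unrolls the training examples hidden in a single chunk: at each position `t`, the context is everything up to `t` and the target is the next token. The token ids in `chunk` are made up for the illustration; only the slicing mirrors `get_batch`.

```python
# Hypothetical token ids for one chunk of block_size + 1 characters.
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]
block_size = 8

x_row = chunk[:block_size]        # what get_batch puts in one row of x
y_row = chunk[1:block_size + 1]   # the same row of y: x shifted by one position

pairs = []
for t in range(block_size):
    context = x_row[:t + 1]       # all tokens up to and including position t
    target = y_row[t]             # the token that follows this context
    pairs.append((context, target))
    print(f"context {context} -> target {target}")
```

One chunk of `block_size + 1` characters therefore yields `block_size` training examples, with context lengths from 1 up to `block_size`.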

## Create the initial model

In [None]:
class ShakespeareLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None): # None because we use self() in the generate()

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C) batch, time, channel(dimension of your embedding)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape # (batch_size=4, block_size=8, vocab_size=65)
            logits = logits.view(B*T, C) # need to reshape because of cross_entropy torch function implementation
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # negative log likelihood 

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token indices; with torch.zeros((1, 1), dtype=torch.long) it starts as a single token
        for _ in range(max_new_tokens):
            # get the logits for every position
            logits, loss = self(idx)
            # focus only on the last time step
            # (Here we feed the whole block but only use the last position to generate the next character.)
            # (Not really smart for now, but it will make sense with self-attention.)
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to turn the logits into probabilities (non-negative values that sum to 1)
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution to get an index
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1) because num_samples equals 1
            # append the sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = ShakespeareLanguageModel(vocab_size).to(device)
logits, loss = m(x, y)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long).to(device), max_new_tokens=100)[0].tolist())) # zeros((1,1)) to start generating from the first character of the vocabulary
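
A useful sanity check for the printed loss: a model with no knowledge at all would assign the same probability to every character, and the cross-entropy of that uniform guess is $\ln(\text{vocab size})$. The sketch below computes that reference value, assuming the vocabulary size of 65 mentioned in the `forward` comment.

```python
import math

vocab_size = 65                      # assumed from the comment in forward()
uniform_prob = 1.0 / vocab_size      # probability of each character under a uniform guess
expected_loss = -math.log(uniform_prob)

print(f"loss of a uniform prediction: {expected_loss:.4f}")
```

An untrained model usually starts somewhat above this value, because its randomly initialised logits are not exactly uniform; training should bring the loss below it.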

**Let's delve into the embedding layer**

Embeddings are a way to represent data, such as tokens, as vectors in a continuous space. Here the space is $\mathbb{R}^{\text{vocab size}}$, because the second parameter of `nn.Embedding` (the dimension of each vector) is the vocabulary size. The layer takes integer token indices as input and returns the corresponding rows of its weight matrix, which is equivalent to multiplying a one-hot vector by that matrix; this is why the first parameter (the number of rows) must also be the vocabulary size. Training this layer shifts each vector within this space.

One of the simplest ways to visualize this is to imagine scoring words on two axes: positive versus negative, and formal versus colloquial. Project your words (or tokens) onto a two-dimensional plane where each axis of the canonical basis represents one of these properties. For instance, if a vector falls within $\mathbb{R}^{+} \times \mathbb{R}^{+}$, the word is both positive and formal.

In the task of predicting the next word, you can use the block of encoded words as the input. By adding all their embedding vectors together, you can then decode the token whose embedding is nearest to the resulting sum in this space, which gives the output.
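
The link between integer indices and one-hot vectors can be checked directly. This is a minimal sketch with made-up sizes (10 tokens, 3 dimensions): looking up rows of the embedding table gives the same result as multiplying one-hot vectors by the weight matrix.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
embedding = nn.Embedding(10, 3)                     # 10 tokens, 3-dimensional vectors
idx = torch.tensor([[1, 2, 4, 5], [4, 3, 2, 9]])    # (B=2, T=4) integer indices

# Usual path: the layer looks up one row of its weight matrix per index.
looked_up = embedding(idx)                          # (2, 4, 3)

# Equivalent path: one-hot encode the indices, then multiply by the weights.
one_hot = F.one_hot(idx, num_classes=10).float()    # (2, 4, 10)
projected = one_hot @ embedding.weight              # (2, 4, 3)

print(torch.allclose(looked_up, projected))
```

The lookup is simply the cheaper way to compute the same product, since it never builds the one-hot tensor.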

In [None]:
"""
>>> # an Embedding module containing 10 tensors of dimension 3
>>> embedding = nn.Embedding(10, 3)
>>> # a batch of 2 samples of 4 indices each
>>> input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])
>>> embedding(input)
tensor([[[-0.0251, -1.6902,  0.7172],
         [-0.6431,  0.0748,  0.6969],
         [ 1.4970,  1.3448, -0.9685],
         [-0.3677, -2.7265, -0.1685]],

        [[ 1.4970,  1.3448, -0.9685],
         [ 0.4362, -0.4004,  0.9400],
         [-0.6431,  0.0748,  0.6969],
         [ 0.9124, -2.3616,  1.1151]]])
"""

## Train the model

In [None]:
batch_size = 64 
epochs = 3000

In [None]:
@torch.no_grad() # used as a decorator: no gradients are tracked inside the function
def estimate_loss(model=m, epochs=epochs):
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(epochs)
        for k in range(epochs):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [None]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
for step in range(epochs):
    
    # report the losses every 500 steps and at the very last step
    if step % 500 == 0 or step == epochs - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        
    x, y = get_batch('train')
    
    logits, loss = m(x,y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long).to(device), max_new_tokens=100)[0].tolist())) # zeros((1,1)) to start generating from the first character of the vocabulary

Much better! But for the moment, each token is predicted from the previous one only, instead of from all the previous ones: we need to add attention!

## Attention is all you need !

In [None]:
B, T, C = 4, 8, 2
z = torch.randn(B,T,C)

### 1st step : get the average of the preceding tokens ("bag of words")

In [None]:
# Example
zbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        zprev = z[b,:t+1]
        zbow[b, t] = torch.mean(zprev, 0)
print(z[0])
print(zbow[0])

Remember that zbow lives in the embedding space. In my example (in the paragraph explaining the embedding layer above), replace the sum with the per-component mean to understand what the output is.

Let's optimize this code with a mathematical trick:

In [None]:
# Example
a = torch.tril(torch.ones(3,3))
a = a / torch.sum(a, 1, keepdim=True)
print(a) # so row t of a @ b is the average of rows 0..t of b

In [None]:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
zbow2 = wei @ z # torch broadcasts the (T, T) matrix over the batch: (T, T) @ (B, T, C) is computed as (B, T, T) @ (B, T, C) --> (B, T, C)
torch.allclose(zbow, zbow2)
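
The same averaging weights can also be written with a mask and a softmax, which is the form that self-attention builds on. This is a small sketch, assuming all-zero scores as the starting point: masking the future with `-inf` and applying softmax row by row gives exactly the normalised lower-triangular matrix.

```python
import torch
from torch.nn import functional as F

T = 8
tril = torch.tril(torch.ones(T, T))

# Averaging weights as before: normalise each row of the lower-triangular matrix.
wei_avg = tril / tril.sum(1, keepdim=True)

# Same weights via masking + softmax, starting from scores that are all zero.
scores = torch.zeros(T, T)
scores = scores.masked_fill(tril == 0, float('-inf'))  # future positions get -inf
wei_softmax = F.softmax(scores, dim=-1)                # exp(-inf) = 0, so the future gets weight 0

print(torch.allclose(wei_avg, wei_softmax))
```

Once the scores are no longer all zero but depend on the tokens, the softmax turns them into a data-dependent weighted average.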

But with averaging, we lose a lot of information, such as the position or importance of a token in relation to the one to be predicted. Let's try adapting this weight matrix to modify the weights of previous tokens to predict the next one. Let's add some self-attention.

### 2nd step : self-attention !

**Let's delve into every single step of self-attention**

Each token will emit independently two new vectors : 
- `query` : "what I am looking for"
- `key` : "what do I contain"

So, to check whether one token's query matches another token's key, we check whether these two vectors are LITERALLY aligned, and the dot product measures exactly that. The entries of the weight matrix are now the dot products between the query of the token to predict and the keys of all the preceding ones.

Note that the `query` and `key` vectors are computed from the embedding vector of the token, not directly from the token.

Let's see a single head perform self-attention!

In [None]:
head_size = 16 # the length of the vector of the query/key

In [None]:
key = nn.Linear(C, head_size, bias=False) # no bias: the projection is a plain matrix multiplication
query = nn.Linear(C, head_size, bias=False)
k = key(z) # (B, T, 16)
q = query(z) # (B, T, 16)

In [None]:
# the dot product <x,y> can be written as x @ y.T for row vectors
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -->  (B, T, T)
print(wei.shape)
wei[0]

Cool, but the token can still interact with the following ones. Let's mask the future tokens and renormalize the weights!

In [None]:
tril = torch.tril(torch.ones(T, T))
print(tril)
wei = wei.masked_fill(tril == 0, float('-inf'))
print(wei[0])
wei = F.softmax(wei, dim=-1) # each row is now a distribution that sums to one
print(wei[0])

out = wei @ z

Cool, but even when the inputs Q and K have unit variance, the variance of wei grows with the head size, and softmax no longer stays diffuse: it saturates (approaching a one-hot vector, which means the target token gets information from a single other token). To keep the variance of wei close to one, we scale the scores by $1/\sqrt{d_k}$, where $d_k$ is the head size.

In [None]:
wei = q @ k.transpose(-2, -1)* head_size**-0.5 # (B, T, 16) @ (B, 16, T) -->  (B, T, T)
print(wei.shape)
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1) # each row is now a distribution that sums to one

out = wei @ z
out.shape
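
The saturation effect is easy to reproduce without any tensor library. In this sketch the scores and the factor 8 are arbitrary values chosen for the illustration: multiplying the same scores by a large factor, which is what unscaled dot products over a large head size amount to, pushes the softmax towards a one-hot vector.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, -0.2, 0.3, -0.2, 0.5]        # hypothetical attention scores
diffuse = softmax(scores)                    # moderate scores: weights stay spread out
sharp = softmax([s * 8 for s in scores])     # same scores, 8 times larger: one weight dominates

print([round(p, 3) for p in diffuse])
print([round(p, 3) for p in sharp])
```

Dividing the scores by $\sqrt{d_k}$ keeps them in the moderate regime, so each token keeps aggregating information from several other tokens.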

In fact, each token emits one more vector:
- `value` : "what I will communicate to a token if it finds me interesting"

The output is then the matrix product between the weights and the values.

In [None]:
value = nn.Linear(C, head_size, bias=False)
v = value(z)

out = wei @ v
out.shape

**Understand why the value layer is necessary with an example**

Imagine the sequence “My black cat died yesterday. In his coffin, he looked TARGET”. Here, we are looking for an adjective to describe the cat. In theory, the weights of “black” and “died” could be very close, since both describe the cat. The result could therefore be either an adjective close to black or an adjective close to dead. That is why we add the value layer: here, death carries much more information than black. Even though we are looking for an adjective, we want one that correlates with death, so “died” has to carry a higher value than “black”.

**CONCLUSION OF SELF-ATTENTION**

To really understand what's going on here:
1. In the time block, at each instant t, i.e. for each new token to be predicted, you take its query vector and search for the most (literally) aligned previous tokens in the block using the dot product with their key vectors. The weight matrix is the matrix of all these dot products.
2. In the embedding space, wei @ z represents a shift toward the **weighted** average of the preceding tokens for the token to predict.
3. But we want more freedom to capture what matters (and not just what aligns) in a sentence. So we add a new layer named value, which represents the value of each token in the sequence. The output becomes wei @ value(z).

So, to contribute strongly to the output vector, a token has to be interesting for the prediction (represented by the weight) AND add a large value to the sequence (represented by the value).

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, **allowing all tokens to communicate (past and future)**. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)

In [None]:
class OneHead(nn.Module):
    def __init__(self, dim_emb, head_size):
        super().__init__()
        self.key = nn.Linear(dim_emb, head_size, bias=False)
        self.query = nn.Linear(dim_emb, head_size, bias=False)
        self.value = nn.Linear(dim_emb, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) # a buffer is not a parameter (it is not trained)
        
    def forward(self, x):
        B, T, C = x.shape
        
        k = self.key(x) # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # scale by 1/sqrt(head_size), i.e. 1/sqrt(d_k)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        
        v=self.value(x)

        out = wei @ v

        return out
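
A quick way to convince yourself that the triangular mask works is to change a future token and check that earlier outputs do not move. The sketch below is a stripped-down, functional version of a single head; the random matrices `Wq`, `Wk`, `Wv` and the helper `causal_head` are assumptions for this illustration and stand in for the linear layers of `OneHead`.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, C, head_size = 6, 4, 8
Wq, Wk, Wv = (torch.randn(C, head_size) for _ in range(3))  # stand-ins for query/key/value layers
tril = torch.tril(torch.ones(T, T))

def causal_head(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv                   # (T, head_size) each
    wei = q @ k.T * head_size**-0.5                     # (T, T) scaled scores
    wei = wei.masked_fill(tril == 0, float('-inf'))     # hide the future
    return F.softmax(wei, dim=-1) @ v                   # (T, head_size)

x = torch.randn(T, C)
x_changed = x.clone()
x_changed[-1] = torch.randn(C)                          # alter only the last token

out, out_changed = causal_head(x), causal_head(x_changed)
print(torch.allclose(out[:-1], out_changed[:-1]))       # earlier positions are unaffected
print(torch.allclose(out[-1], out_changed[-1]))         # the last position does change
```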

## Improve the model (part 1)

1. Change the size of the embedding space ! 
2. And get information from position of the tokens ! 
3. 1.+2.
4. Add attention
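
Item 2 matters because attention by itself ignores token order. The following sketch, which uses random matrices `Wq`, `Wk`, `Wv` as hypothetical stand-ins for the learned projections and no mask, shows that shuffling the input tokens merely shuffles the outputs in the same way: without a position signal, the layer cannot tell which token came first.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, C = 5, 4
x = torch.randn(T, C)                                   # one sequence of T token vectors
Wq, Wk, Wv = torch.randn(C, C), torch.randn(C, C), torch.randn(C, C)

def attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    wei = F.softmax(q @ k.T * C**-0.5, dim=-1)          # no mask, no position information
    return wei @ v

perm = torch.randperm(T)                                # a random reordering of the tokens
out = attention(x)
out_perm = attention(x[perm])

# Reordering the inputs only reorders the outputs.
print(torch.allclose(out[perm], out_perm, atol=1e-4))
```

Adding a position embedding to each token embedding breaks this symmetry, because the same character at two different positions no longer produces the same vector.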

In [None]:
dim_emb = 32
head_size = 16

In [None]:
class ShakePT(nn.Module):

    def __init__(self, vocab_size, dim_emb, head_size):
        super().__init__()
        #### 1. change the dimension of the token embedding ####
        self.token_embedding_table = nn.Embedding(vocab_size, dim_emb)
        #### 2. create embedding for the position of the tokens ####
        self.position_embedding_table = nn.Embedding(block_size, dim_emb)
        #### 4. Attention ####
        self.head = OneHead(dim_emb, head_size=dim_emb) # head_size=dim_emb for the moment (to match the dimensions), because multi-head attention is not introduced yet
        self.linear = nn.Linear(dim_emb, vocab_size)
        
    def forward(self, idx, targets=None):
        #### 1. #####
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        # note: a linear layer is now needed to map the embedding to (B, T, vocab_size) logits for cross_entropy
        #### 2. #####
        B, T = idx.shape
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        #### 3. ####
        x = tok_emb + pos_emb
        #### 4. ####
        x = self.head(x)
        
        logits = self.linear(x)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) 
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_crop = idx[:, -block_size:] # keep at most block_size tokens, the maximum the position_embedding_table supports
            logits, loss = self(idx_crop)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1) 
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = ShakePT(vocab_size, dim_emb, head_size).to(device)
logits, loss = m(x, y)
print(logits.shape)
print(loss)

context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

# Optimization

## Improve the attention

To do so, run multiple attention heads in parallel and concatenate the results, instead of using one attention head with a huge head size.

In [None]:
class MultipleHeadAttention(nn.Module):
    def __init__(self, num_heads, dim_emb, head_size):
        super().__init__()
        self.heads = nn.ModuleList([OneHead(dim_emb, head_size) for _ in range(num_heads)])
        
    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1) # concatenation over the channel dimension

In [None]:
class ShakePT(nn.Module):

    def __init__(self, vocab_size, dim_emb, head_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, dim_emb)
        self.position_embedding_table = nn.Embedding(block_size, dim_emb)
        self.head = MultipleHeadAttention(4, dim_emb, dim_emb//4) # instead of one big attention head, we split it into 4 smaller heads and concatenate their outputs
        self.linear = nn.Linear(dim_emb, vocab_size)
        
    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.head(x)
        logits = self.linear(x)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) 
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_crop = idx[:, -block_size:]
            logits, loss = self(idx_crop)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1) 
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = ShakePT(vocab_size, dim_emb, head_size).to(device)
logits, loss = m(x, y)
print(logits.shape)
print(loss)

context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

## Improve the model (part 2)

We add a new layer because so far the model cannot really decide by itself which prediction to make, i.e. it just decodes the attention output vector as the target. This new layer lets the model decide whether this vector is a good choice or not. It is just an MLP with a ReLU activation function (OpenAI uses GELU).

In [None]:
class FeedForward(nn.Module):
    def __init__(self, dim_emb):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_emb, dim_emb),
            nn.ReLU(),
        )
        
    def forward(self, x):
        return self.net(x)

In [None]:
class ShakePT(nn.Module):

    def __init__(self, vocab_size, dim_emb, head_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, dim_emb)
        self.position_embedding_table = nn.Embedding(block_size, dim_emb)
        self.head = MultipleHeadAttention(4, dim_emb, dim_emb//4)
        self.fforward = FeedForward(dim_emb) # neeeeew
        self.linear = nn.Linear(dim_emb, vocab_size)
        
    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.head(x)
        x = self.fforward(x) # neeeeew
        logits = self.linear(x)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) 
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_crop = idx[:, -block_size:]
            logits, loss = self(idx_crop)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1) 
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = ShakePT(vocab_size, dim_emb, head_size).to(device)
logits, loss = m(x, y)
print(logits.shape)
print(loss)

context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

## Let's group and scale up this model  

Let's try to go deeper by stacking this block several times. The problem of vanishing gradients then appears, so we use three optimization tools:
- skip connections / residual connections (as in ResNet)
- layer normalization (applied BEFORE each sub-layer, the pre-norm formulation, unlike the original paper)
- dropout layers

In [None]:
class OneHead(nn.Module):
    def __init__(self, dim_emb, head_size):
        super().__init__()
        self.key = nn.Linear(dim_emb, head_size, bias=False)
        self.query = nn.Linear(dim_emb, head_size, bias=False)
        self.value = nn.Linear(dim_emb, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(0.1)
        
    def forward(self, x):
        B, T, C = x.shape
        
        k = self.key(x) 
        q = self.query(x) 
        
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # scale by 1/sqrt(head_size), i.e. 1/sqrt(d_k)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei) #add dropout
        
        v=self.value(x)

        out = wei @ v

        return out

In [None]:
class MultipleHeadAttention(nn.Module):
    def __init__(self, num_heads, dim_emb, head_size):
        super().__init__()
        self.heads = nn.ModuleList([OneHead(dim_emb, head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(dim_emb, dim_emb) # add a projection layer 
        self.dropout = nn.Dropout(0.1) # add a dropout layer 
        
    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [None]:
class FeedForward(nn.Module):
    def __init__(self, dim_emb):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_emb, 4*dim_emb), # the paper multiply by 4 the inner-layer
            nn.ReLU(),
            nn.Linear(4*dim_emb, dim_emb), # add a linear layer
            nn.Dropout(0.1), # add dropout
        )
        
    def forward(self, x):
        return self.net(x)

In [None]:
class Block(nn.Module):
    def __init__(self, dim_emb):
        super().__init__()
        self.heads = MultipleHeadAttention(4, dim_emb, dim_emb//4)
        self.fforward = FeedForward(dim_emb)
        self.ln1 = nn.LayerNorm(dim_emb)
        self.ln2 = nn.LayerNorm(dim_emb)
        
    def forward(self, x):
        x_ = self.ln1(x) # layer norm
        x = x + self.heads(x_) # skip connection (out-of-place, so autograd keeps the tensor saved by the layer norm)
        x_ = self.ln2(x) # layer norm
        x = x + self.fforward(x_) # skip connection, fed with the normalized input
        return x
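
The pattern used twice in the block, `x + sublayer(LayerNorm(x))`, can be isolated in a small wrapper. This is only a sketch: `PreNormResidual` is a hypothetical helper, not part of the notebook, and the sub-layer is a zero-initialised linear layer chosen so that the behaviour is easy to verify. With a sub-layer that outputs zero, the block is the identity and the gradient still reaches the input through the skip path.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Hypothetical helper computing x + sublayer(LayerNorm(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        # Out-of-place addition: the tensor saved by LayerNorm stays intact for backward.
        return x + self.sublayer(self.norm(x))

torch.manual_seed(0)
dim = 32
layer = nn.Linear(dim, dim)
nn.init.zeros_(layer.weight)    # make the sub-layer output exactly zero
nn.init.zeros_(layer.bias)
block = PreNormResidual(dim, layer)

x = torch.randn(4, 8, dim, requires_grad=True)
out = block(x)
out.sum().backward()

print(torch.allclose(out, x))                            # zero sub-layer: the block is the identity
print(torch.allclose(x.grad, torch.ones_like(x)))        # the gradient flows through the skip path
```

This is the intuition behind residual connections in deep stacks: each block starts close to the identity and only has to learn a correction.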

In [None]:
class ShakePT(nn.Module):

    def __init__(self, vocab_size, dim_emb):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, dim_emb)
        self.position_embedding_table = nn.Embedding(block_size, dim_emb)
        self.blocks = nn.Sequential(
            Block(dim_emb),
            Block(dim_emb),
            Block(dim_emb),
            nn.LayerNorm(dim_emb)
        )
        self.linear = nn.Linear(dim_emb, vocab_size)
        
    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x) # use the blocks
        logits = self.linear(x)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C) 
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_crop = idx[:, -block_size:]
            logits, loss = self(idx_crop)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1) 
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

m = ShakePT(vocab_size, dim_emb).to(device)
logits, loss = m(x, y)
print(logits.shape)
print(loss)

context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

That's it! You have built your own decoder-only transformer. You can now see a "clean" version of the code in *ShakePT.py*.

**SOME NOTES TO READ :**
- we created a decoder-only transformer instead of a complete (encoder + decoder) transformer, because we only want to generate text in one language. If you want to translate, you have to "encode" the sentence in the source language into the embedding space, and its keys and values are sent to the multi-head attention of the decoder (cross-attention). Be careful: in the encoder, every token has access to the whole block of tokens.
- many more optimizations are possible (such as computing all the attention heads in parallel in one batched operation instead of looping over them and concatenating)
- we have created a general pre-trained transformer, not an assistant. So if you ask a question, it could reply with other related questions. You have to specialize this transformer into an assistant with **fine-tuning**.
- OpenAI also uses a reward model after fine-tuning to push ChatGPT to reply with high-reward answers.
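
As an illustration of the optimization mentioned in the second note, here is one possible sketch of multi-head attention where all heads are computed in a single batched operation. The class `BatchedCausalSelfAttention` and its fused `qkv` layer are assumptions made for this example, not the code of *ShakePT.py*; dropout is left out to keep it short.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BatchedCausalSelfAttention(nn.Module):
    """Hypothetical sketch: every head is computed in one batched matrix product."""
    def __init__(self, dim_emb, num_heads, block_size):
        super().__init__()
        assert dim_emb % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim_emb, 3 * dim_emb, bias=False)   # queries, keys and values of all heads
        self.proj = nn.Linear(dim_emb, dim_emb)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.num_heads                                  # head size
        q, k, v = self.qkv(x).split(C, dim=2)                     # each (B, T, C)
        # (B, T, C) -> (B, num_heads, T, hs): the head index becomes a batch dimension
        q = q.view(B, T, self.num_heads, hs).transpose(1, 2)
        k = k.view(B, T, self.num_heads, hs).transpose(1, 2)
        v = v.view(B, T, self.num_heads, hs).transpose(1, 2)
        wei = q @ k.transpose(-2, -1) * hs**-0.5                  # (B, num_heads, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        out = wei @ v                                             # (B, num_heads, T, hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)      # heads concatenated again
        return self.proj(out)

torch.manual_seed(0)
attn = BatchedCausalSelfAttention(dim_emb=32, num_heads=4, block_size=8)
out = attn(torch.randn(4, 8, 32))
print(out.shape)
```

The computation is the same as running four separate heads and concatenating them; the difference is that the loop over heads is replaced by an extra tensor dimension.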