<a href="https://colab.research.google.com/github/anas1IA-art/site/blob/main/Build_chat_gpt_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building ChatGPT based on the paper "Attention Is All You Need"

**GPT (Generative Pre-trained Transformer)** is a type of large language model developed by OpenAI that uses the transformer architecture to generate human-like text from input prompts. It was introduced in the paper "[Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)", published by OpenAI in 2018.

The groundbreaking transformer model, which GPT builds upon, was introduced in the paper "[Attention is All You Need](https://arxiv.org/abs/1706.03762)" by Vaswani et al., published in 2017.


This notebook contains notes on **Andrej Karpathy**'s video [Let's Build GPT: From Scratch, in Code, Spelled Out](https://www.youtube.com/watch?v=kCc8FmEb1nY), published on his YouTube channel.


## Let's prepare the dataset

In this tutorial, we use the [Tiny Shakespeare dataset](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), a 1.06 MB text file containing concatenated works of William Shakespeare.


In [419]:
!wget  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-06-10 11:10:06--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.5’


2024-06-10 11:10:06 (127 MB/s) - ‘input.txt.5’ saved [1115394/1115394]



In [420]:
# Passing `encoding='utf-8'` makes the file decode the same way on every system.
# Without it, Python falls back to the platform's default encoding, which can
# garble or reject non-ASCII characters.
with open('./input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [421]:
print(len(text))

1115394


In [422]:
print(text[:11])

First Citiz


In [423]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
voc_size = len(chars)
print(''.join(chars))



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [424]:
# Create mappings between characters and integers. `enumerate` pairs each
# character in `chars` with its index, counting from 0.
int_char = {i: ch for i, ch in enumerate(chars)}
char_int = {ch: i for i, ch in enumerate(chars)}
int_char[9], char_int['3']

('3', 9)

In [425]:
sentence = 'Anas Nouri'
# encoder: turn a string into a list of integers (its numeric representation)
encoder = lambda s: [char_int[a] for a in s]
# decoder: turn a list of integers back into a string
decoder = lambda s: ''.join([int_char[a] for a in s])
decoder(encoder(sentence))

'Anas Nouri'

In [426]:
encoder(text[:10])

[18, 47, 56, 57, 58, 1, 15, 47, 58, 47]
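
One property of the character-level mapping above is worth making explicit: it only covers characters that occur in the corpus, so encoding anything else fails with a `KeyError`. The sketch below illustrates this on a hypothetical miniature corpus (`"hello world"` and the `encode`/`decode` helper names are assumptions for illustration, not part of the notebook).

```python
# Hypothetical miniature corpus; the notebook builds the same kind of tables from input.txt.
corpus = "hello world"
chars = sorted(set(corpus))
char_int = {ch: i for i, ch in enumerate(chars)}
int_char = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # look up each character's integer id
    return [char_int[c] for c in s]

def decode(ids):
    # map integer ids back to characters and join them
    return "".join(int_char[i] for i in ids)

round_trip = decode(encode("hold"))  # every character is in the vocabulary

try:
    encode("hey")                    # 'y' never occurs in the corpus
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(round_trip, unknown_ok)
```

For Tiny Shakespeare this is harmless as long as prompts stay within the 65 characters listed earlier.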

In [427]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch

In [428]:
dataset = torch.tensor(encoder(text),dtype = torch.long)
dataset

tensor([18, 47, 56,  ..., 45,  8,  0])

In [429]:
dataset.unsqueeze(0)

tensor([[18, 47, 56,  ..., 45,  8,  0]])

In [430]:
dataset.squeeze(0)

tensor([18, 47, 56,  ..., 45,  8,  0])

In [431]:
len(dataset)
int(len(dataset)*0.8)

892315

In [432]:
# Let's now split up the data into train and validation sets
n = int(0.8*len(dataset))
train_dataset = dataset[:n]
validation_dataset = dataset[n:]

In [433]:
Block_size = 10
# inputs are the first Block_size tokens; targets are the same tokens shifted by one
train_data = dataset[:Block_size]
target_data = dataset[1:Block_size+1]
train_data

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])

In [434]:
target_data

tensor([47, 56, 57, 58,  1, 15, 47, 58, 47, 64])

In [435]:
len(target_data),len(train_data)

(10, 10)

In [436]:
for t in range(Block_size):
    x = train_data[:t+1]
    y = target_data[t]
    print(f"input is {x} and target is {y}")

input is tensor([18]) and target is 47
input is tensor([18, 47]) and target is 56
input is tensor([18, 47, 56]) and target is 57
input is tensor([18, 47, 56, 57]) and target is 58
input is tensor([18, 47, 56, 57, 58]) and target is 1
input is tensor([18, 47, 56, 57, 58,  1]) and target is 15
input is tensor([18, 47, 56, 57, 58,  1, 15]) and target is 47
input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) and target is 58
input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]) and target is 47
input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47]) and target is 64


In [437]:
torch.manual_seed(1337)
batch_size = 4  # how many independent sequences will we process in parallel?
block_size = 8  # what is the maximum context length for predictions?

In [438]:
ix = torch.randint(len(dataset) - block_size, (batch_size,))

In [439]:
ix

tensor([1078327,  453969,   41646,  671252])

In [440]:
len(train_data)

10

In [441]:
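torch.manual_seed(1337)

def get_batch(split):
    # pick the split to sample from
    data = train_dataset if split == 'train' else validation_dataset
    # random start offsets, leaving room for block_size inputs plus one target
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # inputs: block_size tokens starting at each offset
    x = torch.stack([data[i:i + block_size] for i in ix])
    # targets: the same windows shifted one position to the right
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

xt, yt = get_batch('train')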


In [442]:
xt.shape

torch.Size([4, 8])

In [443]:
yt

tensor([[63,  8,  0,  0, 19, 24, 27, 33],
        [59, 45, 46, 58,  1, 46, 43,  1],
        [43, 57,  1, 53, 50, 42,  1, 46],
        [41, 47, 43, 52, 58,  1, 56, 47]])
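
To make the relationship between inputs and targets easier to see, here is a minimal sketch of the same sampling logic on stand-in data. It assumes `torch.arange(100)` in place of the encoded text, so every target is simply its input plus one; the seed and the variable names are illustrative only.

```python
import torch

torch.manual_seed(0)
data = torch.arange(100)        # stand-in token ids so the shift is easy to see
block_size, batch_size = 8, 4   # same values as in the notebook

# random start offsets, leaving room for the extra target token
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([data[i:i + block_size] for i in ix])
y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

# y is x shifted left by one position
shift_ok = torch.equal(x[:, 1:], y[:, :-1])
# with arange data, each target is exactly its input plus one
plus_one_ok = torch.equal(y, x + 1)

print(x.shape, y.shape, shift_ok, plus_one_ok)
```

Each row therefore holds `block_size` training examples at once, one per prefix of the row, as in the loop shown earlier.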

In [444]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [445]:
token_embedding_table = nn.Embedding(65, 10)
token_embedding_table

Embedding(65, 10)

In [446]:
logits = token_embedding_table(xt)
logits

tensor([[[-1.3924e+00,  1.3114e+00,  1.3805e+00,  3.5440e-01,  3.5861e-01,
          -8.7726e-01,  3.2667e-01, -5.6669e-01, -1.6618e-01,  1.3520e+00],
         [ 1.1708e+00, -2.6099e-01, -9.5709e-01,  6.3591e-01, -1.4204e-01,
          -1.5207e-01,  6.5104e-02, -1.7956e+00,  1.3145e+00,  1.7042e+00],
         [-1.5935e+00, -1.2706e+00,  6.9033e-01, -1.9614e-01,  6.1403e-03,
          -9.6651e-01,  3.5829e-01,  1.0729e-01, -4.1896e-01, -4.3699e-01],
         [ 6.2577e-01,  2.5510e-02,  9.5451e-01,  6.4349e-02, -5.0240e-01,
          -2.0255e-01, -1.5671e+00, -1.0980e+00,  2.3596e-01, -2.3978e-01],
         [ 6.2577e-01,  2.5510e-02,  9.5451e-01,  6.4349e-02, -5.0240e-01,
          -2.0255e-01, -1.5671e+00, -1.0980e+00,  2.3596e-01, -2.3978e-01],
         [-4.3205e-01, -1.4938e+00,  1.0785e+00, -6.1495e-01, -4.5885e-01,
           5.6748e-01,  9.5883e-02, -1.5700e+00,  3.7396e-01, -1.4207e-01],
         [ 4.2716e-01, -2.8192e-01, -1.2773e-02, -8.7792e-01,  1.3680e+00,
          -7.9199e-

In [447]:
B,T,C = logits.shape
log =logits.view(B*T,C)
log

tensor([[-1.3924e+00,  1.3114e+00,  1.3805e+00,  3.5440e-01,  3.5861e-01,
         -8.7726e-01,  3.2667e-01, -5.6669e-01, -1.6618e-01,  1.3520e+00],
        [ 1.1708e+00, -2.6099e-01, -9.5709e-01,  6.3591e-01, -1.4204e-01,
         -1.5207e-01,  6.5104e-02, -1.7956e+00,  1.3145e+00,  1.7042e+00],
        [-1.5935e+00, -1.2706e+00,  6.9033e-01, -1.9614e-01,  6.1403e-03,
         -9.6651e-01,  3.5829e-01,  1.0729e-01, -4.1896e-01, -4.3699e-01],
        [ 6.2577e-01,  2.5510e-02,  9.5451e-01,  6.4349e-02, -5.0240e-01,
         -2.0255e-01, -1.5671e+00, -1.0980e+00,  2.3596e-01, -2.3978e-01],
        [ 6.2577e-01,  2.5510e-02,  9.5451e-01,  6.4349e-02, -5.0240e-01,
         -2.0255e-01, -1.5671e+00, -1.0980e+00,  2.3596e-01, -2.3978e-01],
        [-4.3205e-01, -1.4938e+00,  1.0785e+00, -6.1495e-01, -4.5885e-01,
          5.6748e-01,  9.5883e-02, -1.5700e+00,  3.7396e-01, -1.4207e-01],
        [ 4.2716e-01, -2.8192e-01, -1.2773e-02, -8.7792e-01,  1.3680e+00,
         -7.9199e-01, -8.8244e-0

In [448]:
logits[:,-1,:]

tensor([[-0.1269, -0.3490,  0.7520, -0.0629, -0.7111,  0.9810,  1.5095, -1.5489,
         -1.0653,  1.0056],
        [ 0.0531, -0.6655, -1.1730,  2.5181,  1.6212, -1.8134, -0.1020,  0.1283,
         -0.4133, -1.2003],
        [-0.9211,  1.5433, -0.3676, -0.7483, -0.1006,  0.7307, -2.0371,  0.4931,
          1.4870,  0.5910],
        [ 0.2990,  0.1199, -1.2433,  1.7859, -0.2789, -0.4232, -0.6174,  0.2643,
         -0.3542,  0.6690]], grad_fn=<SliceBackward0>)

In [449]:
probabilities = F.softmax(log, dim=-1)
probabilities

tensor([[1.3901e-02, 2.0763e-01, 2.2248e-01, 7.9738e-02, 8.0074e-02, 2.3268e-02,
         7.7557e-02, 3.1742e-02, 4.7378e-02, 2.1624e-01],
        [1.7481e-01, 4.1756e-02, 2.0817e-02, 1.0239e-01, 4.7031e-02, 4.6561e-02,
         5.7855e-02, 9.0003e-03, 2.0181e-01, 2.9798e-01],
        [2.3811e-02, 3.2884e-02, 2.3368e-01, 9.6303e-02, 1.1789e-01, 4.4573e-02,
         1.6766e-01, 1.3044e-01, 7.7067e-02, 7.5689e-02],
        [1.7678e-01, 9.6995e-02, 2.4559e-01, 1.0084e-01, 5.7211e-02, 7.7215e-02,
         1.9729e-02, 3.1538e-02, 1.1971e-01, 7.4393e-02],
        [1.7678e-01, 9.6995e-02, 2.4559e-01, 1.0084e-01, 5.7211e-02, 7.7215e-02,
         1.9729e-02, 3.1538e-02, 1.1971e-01, 7.4393e-02],
        [6.2540e-02, 2.1630e-02, 2.8326e-01, 5.2086e-02, 6.0886e-02, 1.6992e-01,
         1.0603e-01, 2.0043e-02, 1.4002e-01, 8.3578e-02],
        [1.1280e-01, 5.5509e-02, 7.2654e-02, 3.0587e-02, 2.8902e-01, 3.3331e-02,
         3.0448e-02, 1.3938e-01, 2.2535e-01, 1.0920e-02],
        [5.6394e-02, 4.5160

In [450]:
probabilities[-1 ,:]

tensor([0.0921, 0.0770, 0.0197, 0.4075, 0.0517, 0.0447, 0.0368, 0.0890, 0.0479,
        0.1334], grad_fn=<SliceBackward0>)

In [451]:
log[:,0]

tensor([-1.3924,  1.1708, -1.5935,  0.6258,  0.6258, -0.4320,  0.4272, -0.1269,
         0.4138,  0.0067, -1.4546,  0.2123, -1.3924, -0.9211,  0.2123,  0.0531,
        -0.5744,  0.0531,  0.7535, -0.9211, -0.1016, -1.0835,  0.6071, -0.9211,
        -1.0023,  0.1747, -1.0206,  0.0531, -1.0023, -1.3924, -0.9211,  0.2990],
       grad_fn=<SelectBackward0>)

In [452]:
F.softmax(log[:, 0], dim=0).sum()

tensor(1., grad_fn=<SumBackward0>)
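
The cell above normalizes a single column of `log`, that is, across the 32 positions, which sums to 1 but is not the distribution used for prediction. For next-token prediction the softmax runs over the last dimension, so that each position gets its own distribution over candidate tokens. A small sketch, assuming a hypothetical `scores` tensor with 3 positions and 5 candidate tokens:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
scores = torch.randn(3, 5)  # hypothetical: 3 positions, 5 candidate tokens

# softmax by hand: exponentiate, then normalize along the last dimension
manual = scores.exp() / scores.exp().sum(dim=-1, keepdim=True)
builtin = F.softmax(scores, dim=-1)

same = torch.allclose(manual, builtin)
row_sums = builtin.sum(dim=-1)  # one value per position, each close to 1
col_sums = builtin.sum(dim=0)   # generally not 1: columns are not distributions

print(same, row_sums, col_sums)
```

Passing `dim` explicitly also avoids the deprecation warning that PyTorch emits when the dimension is left implicit.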

In [477]:
from torch.nn import functional as F

In [490]:
class Bigrammodel(nn.Module):

    def __init__(self, voc_size):
        super().__init__()
        # each token id looks up a row of next-token logits
        self.token_embedding_table = nn.Embedding(voc_size, voc_size)

    def forward(self, idx, targets=None):
        # idx and targets are (B, T) tensors of token ids
        logits = self.token_embedding_table(idx)  # (B, T, C)

        if targets is None:
            loss = None
        else:
            # F.cross_entropy expects (N, C) logits and (N,) targets
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_token):
        # idx is the (B, T) context; append max_token sampled tokens to it
        for _ in range(max_token):
            logits, loss = self(idx)
            log = logits[:, -1, :]  # keep only the last time step: (B, C)
            pred = F.softmax(log, dim=-1)  # logits -> probabilities
            dx_next = torch.multinomial(pred, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, dx_next), dim=1)  # (B, T + 1)
        return idx


In [491]:
test = Bigrammodel(65)

In [492]:
log,loss = test(xt,yt)

In [493]:
loss

tensor(4.7580, grad_fn=<NllLossBackward0>)
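
As a sanity check on this number: a model that assigns equal probability to all 65 characters would have a loss of ln(65) ≈ 4.174. The untrained model scores worse than that because its random initial logits are confidently wrong for some characters. The sketch below reproduces the uniform baseline and also computes cross-entropy by hand; the example targets and random logits are assumptions chosen only for illustration.

```python
import math
import torch
from torch.nn import functional as F

vocab = 65                               # vocabulary size found earlier
targets = torch.tensor([0, 17, 64, 3])   # arbitrary example targets

# A model with no preference assigns equal logits to every character.
uniform_logits = torch.zeros(len(targets), vocab)
uniform_loss = F.cross_entropy(uniform_logits, targets).item()
expected = math.log(vocab)               # -ln(1/65)

# Cross-entropy by hand: mean negative log-probability of the target tokens.
torch.manual_seed(0)
logits = torch.randn(len(targets), vocab)
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()
builtin = F.cross_entropy(logits, targets)

print(uniform_loss, expected, torch.allclose(manual, builtin))
```

A loss that drops below ln(65) during training is thus a first sign that the model has learned something about character statistics.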

In [504]:
decoder(test.generate(idx = torch.zeros((1,1), dtype = torch.long), max_token = 100)[0].tolist())

'\njVVhmTHVjv-bT3p-EuS.HhxL!GFapfRI-iA t:.h3hSU;wGKRM BrPtLhM!,efdOKaRdsdnL!WWy,DX\nYNEqvinL,SX.3BsV&-Ez'
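
The output is gibberish because the model has not been trained yet. Note also that `generate` samples each next character with `torch.multinomial` instead of always taking the most likely one. A minimal sketch of that step, using a hypothetical hand-written `probs` tensor in place of the model's softmax output:

```python
import torch

torch.manual_seed(0)
# one probability distribution per batch row (hypothetical values)
probs = torch.tensor([[0.1, 0.2, 0.7],
                      [0.0, 0.0, 1.0]])

samples = torch.multinomial(probs, num_samples=1)  # random draw per row, shape (2, 1)
greedy = probs.argmax(dim=-1, keepdim=True)        # always picks the top entry

idx = torch.zeros((2, 1), dtype=torch.long)        # running context, as in generate
idx = torch.cat((idx, samples), dim=1)             # append along the time dimension

print(samples.shape, idx.shape, samples[1].item())
```

Sampling keeps the generated text varied from run to run, whereas the greedy choice would produce the same continuation for a given context every time.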