## Building ChatGPT Based on the Paper "Attention Is All You Need"

**GPT (Generative Pre-trained Transformer)** is a type of large language model developed by OpenAI that uses the transformer architecture to generate human-like text from input prompts. The first GPT model was introduced in the paper "[Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)", published by OpenAI in 2018.

The groundbreaking transformer model, which GPT builds upon, was introduced in the paper "[Attention is All You Need](https://arxiv.org/abs/1706.03762)" by Vaswani et al., published in 2017.


This notebook contains notes on **Andrej Karpathy**'s video [Let's Build GPT: From Scratch, in Code, Spelled Out](https://www.youtube.com/watch?v=kCc8FmEb1nY), published on his YouTube channel.


## Let's Prepare the Dataset

In this tutorial, we use the Tiny Shakespeare dataset, a roughly 1.06 MB [text file](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) that concatenates works by William Shakespeare.


In [None]:
!wget  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-06-09 15:28:47--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-06-09 15:28:47 (18.6 MB/s) - ‘input.txt’ saved [1115394/1115394]
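On systems where `wget` is not available, the same file can be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request

URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

def download_if_missing(url, path):
    """Download `url` to `path` unless the file already exists; return the path."""
    # Hypothetical helper: skips the network request when the file is already on disk.
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Usage (needs network access the first time it runs):
# download_if_missing(URL, "input.txt")
```

Skipping the download when the file exists keeps repeated notebook runs from fetching the data again.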



In [None]:
# Read the file as UTF-8 explicitly. Without `encoding`, Python falls back to a
# platform-dependent default, which can cause inconsistencies or errors with non-ASCII characters.
with open('./input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print(len(text))

1115394


In [None]:
print(text[:11])

First Citiz


In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
voc_size = len(chars)
print(''.join(chars))



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [None]:
# Build lookup tables between characters and integer ids.
# `enumerate` yields (index, value) pairs, so each character is mapped to its position in the sorted `chars` list.
int_char = {i: ch for i, ch in enumerate(chars)}
char_int = {ch: i for i, ch in enumerate(chars)}
int_char[9], char_int['3']

('3', 9)

In [None]:
sentence = 'Anas Nouri'
# encoder: transform a string into its numeric representation (a list of integer ids)
encoder = lambda s: [char_int[c] for c in s]
# decoder: transform a list of integer ids back into a string
decoder = lambda ids: ''.join([int_char[i] for i in ids])
decoder(encoder(sentence))

'Anas Nouri'

In [None]:
encoder(text[:10])

[18, 47, 56, 57, 58, 1, 15, 47, 58, 47]
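The character-level scheme above can also be wrapped in a small class. This is a minimal, dependency-free sketch on a toy corpus; the class name `CharTokenizer` and its method names are hypothetical and do not appear in the original notebook.

```python
class CharTokenizer:
    """Character-level tokenizer: one integer id per distinct character."""

    def __init__(self, corpus):
        # The vocabulary is the sorted set of characters seen in the corpus.
        self.chars = sorted(set(corpus))
        self.char_to_id = {ch: i for i, ch in enumerate(self.chars)}
        self.id_to_char = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, s):
        # Raises KeyError for any character that never appeared in the corpus.
        return [self.char_to_id[ch] for ch in s]

    def decode(self, ids):
        return ''.join(self.id_to_char[i] for i in ids)

tokenizer = CharTokenizer("hello world")
ids = tokenizer.encode("hello")
print(ids)                    # [3, 2, 4, 4, 5]
print(tokenizer.decode(ids))  # hello
```

Because the vocabulary is built only from the corpus, encoding a character that was never seen fails, which is worth keeping in mind when prompting the model with new text.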

In [None]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch

In [None]:
dataset = torch.tensor(encoder(text),dtype = torch.long)
dataset

tensor([18, 47, 56,  ..., 45,  8,  0])

In [None]:
# unsqueeze(0) adds a leading dimension of size 1: shape (N,) becomes (1, N)
dataset.unsqueeze(0)

tensor([[18, 47, 56,  ..., 45,  8,  0]])

In [None]:
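# squeeze(0) removes dimension 0 only if it has size 1; `dataset` has shape (N,), so it is returned unchanged
dataset.squeeze(0)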

tensor([18, 47, 56,  ..., 45,  8,  0])

In [None]:
# Let's now split up the data into train and validation sets
n = int(0.8*len(dataset))
train_data = dataset[:n]
validation_data = dataset[n:]
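As a quick check of the split arithmetic, the sizes of the two parts can be worked out from the dataset length printed earlier (1115394 characters). This is a small sketch in plain Python; the variable names `train_size` and `validation_size` are illustrative only.

```python
total_chars = 1115394          # value printed by len(text) above
n = int(0.8 * total_chars)     # int() truncates 892315.2 down to 892315

train_size = n                        # tokens in dataset[:n]
validation_size = total_chars - n     # tokens in dataset[n:]

print(train_size, validation_size)    # 892315 223079
```

The validation part is held out so that the loss measured on it reflects how well the model generalizes rather than how well it memorizes the training text.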

In [None]:
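block_size = 10
# Slice from the training split under new names, so the train/validation split above is not overwritten.
# The targets are the same window shifted one position to the right.
input_data = train_data[:block_size]
target_data = train_data[1:block_size+1]
input_data

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])

In [None]:
target_data

tensor([47, 56, 57, 58,  1, 15, 47, 58, 47, 64])

In [None]:
len(target_data),len(input_data)

(10, 10)

In [None]:
# Each prefix of the input window is one training example whose target is the next token.
for t in range(block_size):
    x = input_data[:t+1]
    y = target_data[t]
    print(f"input is {x} and target is {y}")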

input is tensor([18]) and target is 47
input is tensor([18, 47]) and target is 56
input is tensor([18, 47, 56]) and target is 57
input is tensor([18, 47, 56, 57]) and target is 58
input is tensor([18, 47, 56, 57, 58]) and target is 1
input is tensor([18, 47, 56, 57, 58,  1]) and target is 15
input is tensor([18, 47, 56, 57, 58,  1, 15]) and target is 47
input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) and target is 58
input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]) and target is 47
input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47]) and target is 64
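The same context/target pairs are easier to read as characters than as integer ids. The sketch below repeats the loop on the first 11 characters of the text (`First Citiz`, as printed earlier) using plain strings; the names `snippet` and `pairs` are illustrative.

```python
snippet = "First Citiz"   # first 11 characters of the dataset
block_size = 10

# For each position t, the context is the first t+1 characters
# and the target is the character that follows them.
pairs = [(snippet[:t + 1], snippet[t + 1]) for t in range(block_size)]

for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

A single window of `block_size + 1` characters therefore yields `block_size` training examples, with context lengths ranging from 1 up to `block_size`.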


In [None]:
torch.manual_seed(1337)
batch_size = 6 # how many independent sequences will we process in parallel?
block_size = 10 # what is the maximum context length for predictions?

In [None]:
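# Draw batch_size random start offsets; the upper bound leaves room for a full block plus one target token.
# Note: this samples from the whole `dataset`; for training, sample from the training split instead.
ix = torch.randint(len(dataset) - block_size, (batch_size,))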

In [None]:
ix

tensor([1080343,  458285,   42868,  672888, 1083415,  245809])
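The random offsets are meant to be used as start positions of the windows that make up one batch. The following is a dependency-free sketch of that idea using Python lists and the `random` module as stand-ins for `torch.randint` and tensor slicing; the function name `get_batch` and the toy data are assumptions for illustration, not the notebook's code.

```python
import random

def get_batch(data, block_size, batch_size, rng):
    """Return (x, y): batch_size input windows and their next-token targets."""
    # Valid starts leave room for block_size inputs plus one extra target token.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    # Targets are the same windows shifted one position to the right.
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

toy_data = list(range(100))   # stand-in for the encoded dataset
rng = random.Random(1337)     # seeded for reproducibility
xb, yb = get_batch(toy_data, block_size=10, batch_size=6, rng=rng)

print(len(xb), len(xb[0]))    # 6 10
```

With tensors, the per-offset slices would typically be combined with `torch.stack` into two tensors of shape `(batch_size, block_size)`.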