<a href="https://colab.research.google.com/github/abhimanyuyadav627/LLM-From-Scratch/blob/main/Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [13]:
!pip install tiktoken



**What is an Embedding?**
* Embedding - **A mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space.**
* Although pretrained word embeddings can be used, LLMs commonly learn their own embeddings, which are part of the input layer and are updated during training.
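
As a rough illustration of the definition above, an embedding can be pictured as a lookup table with one row of floats per discrete id. The sketch below uses plain Python lists; the toy vocabulary, the dimension of 3, and the helper name `embed` are assumptions made for illustration and are not part of this notebook's model.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary: each discrete token gets an integer id.
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_dim = 3

# One randomly initialised row per token id. In a real model these
# values would be updated during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(tokens):
    """Map each token to its row in the embedding table."""
    return [embedding_table[vocab[tok]] for tok in tokens]

vectors = embed(["the", "cat", "the"])
print(len(vectors), len(vectors[0]))
print(vectors[0] == vectors[2])  # the same token always maps to the same vector
```

Note that the lookup alone assigns identical vectors to both occurrences of "the", regardless of where they appear in the sequence.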

### PREPROCESSING STEPS FOR CREATING EMBEDDINGS

##### TOKENIZING TEXT

Important Considerations:


1. When developing a simple tokenizer, whether to encode whitespace as separate characters or simply remove it depends on the application and its requirements. **Removing whitespace reduces the memory and computing requirements. However, keeping whitespace can be useful if we train models that are sensitive to the exact structure of the text (for example Python code, which is sensitive to indentation and spacing).**

2. Adding special tokens - to handle unknown words that are not part of the vocabulary (not needed with BPE), and to mark a separator between two unrelated text sources.
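
The two considerations above can be sketched with a small regex-based tokenizer. This is a minimal illustration under stated assumptions, not the tokenizer used in this notebook (which is tiktoken's BPE): the split pattern, the sample text, the special-token strings `<|unk|>` and `<|endoftext|>`, and the helper name `encode` are all chosen for the example.

```python
import re

text = "def f(x):\n    return x + 1"

# Split on whitespace runs and a few punctuation characters; the capturing
# group makes re.split keep the delimiters in the result.
pieces = re.split(r'(\s+|[():+])', text)

# Option 1: keep whitespace tokens (indentation is preserved).
with_ws = [p for p in pieces if p != ""]

# Option 2: drop whitespace tokens (shorter sequence, layout is lost).
without_ws = [p for p in pieces if p.strip() != ""]

print(len(with_ws), len(without_ws))
print("".join(with_ws) == text)  # the text is recoverable only when whitespace is kept

# Build a vocabulary and append two special tokens (names are assumptions):
# one for out-of-vocabulary words, one to separate unrelated texts.
vocab_tokens = sorted(set(without_ws)) + ["<|endoftext|>", "<|unk|>"]
token_to_id = {tok: i for i, tok in enumerate(vocab_tokens)}

def encode(tokens):
    """Map tokens to ids, falling back to <|unk|> for unseen tokens."""
    unk = token_to_id["<|unk|>"]
    return [token_to_id.get(tok, unk) for tok in tokens]

ids = encode(["return", "y", "<|endoftext|>", "def"])
print(ids)
```

Here `"y"` never appeared in the sample text, so it is mapped to the `<|unk|>` id. A BPE tokenizer avoids this fallback by breaking unseen words into smaller known subword units.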









In [14]:
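# Import the submodule explicitly: a bare `import importlib` does not
# guarantee that `importlib.metadata` is loaded.
import importlib.metadata
import tiktoken
print("tiktoken version:", importlib.metadata.version("tiktoken"))
tokenizer = tiktoken.get_encoding("gpt2")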

tiktoken version: 0.6.0


In [15]:
import torch
from torch.utils.data import Dataset, DataLoader

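class GPTDataset(Dataset):
  """Sliding-window dataset of (input, target) token-id pairs for next-token prediction."""

  def __init__(self,txt,tokenizer,max_length,stride):
    self.tokenizer = tokenizer
    self.input_ids = []
    self.target_ids = []

    # Tokenize the entire text once.
    token_ids = tokenizer.encode(txt)

    # Slide a window of max_length tokens over the text in steps of stride;
    # the target is the same window shifted one token to the right.
    for i in range(0,len(token_ids) - max_length, stride):
      input_chunk = token_ids[i:i + max_length]
      target_chunk = token_ids[i + 1:i + max_length + 1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    # Number of (input, target) windows.
    return len(self.input_ids)

  def __getitem__(self,idx):
    return self.input_ids[idx], self.target_ids[idx]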
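
To make the windowing in `GPTDataset` easier to follow, here is a minimal sketch of the same loop on a toy list of integers, without PyTorch or a tokenizer. The helper name `sliding_windows` and the stand-in token ids are assumptions for illustration only.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same loop bounds as GPTDataset."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]  # shifted by one
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # stand-in for real tokenizer output

# stride == max_length: windows neither overlap nor leave gaps.
no_overlap = sliding_windows(token_ids, max_length=4, stride=4)

# stride < max_length: consecutive windows overlap.
overlapping = sliding_windows(token_ids, max_length=4, stride=2)

# stride > max_length: some tokens are never used as inputs.
with_gap = sliding_windows(token_ids, max_length=4, stride=5)

for inputs, targets in no_overlap:
    print(inputs, "->", targets)
```

With `stride=5` and `max_length=4`, token `4` appears only as a target and never as an input, which is the trade-off to keep in mind when the stride exceeds the window length.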

##### CREATING A DATA LOADER

In [16]:
def create_data_loader(txt,batch_size = 4, max_length = 256, stride = 128, shuffle = True):
  tokenizer = tiktoken.get_encoding("gpt2")
  dataset = GPTDataset(txt, tokenizer, max_length, stride)
  dataloader = DataLoader(
      dataset, batch_size = batch_size, shuffle = shuffle
  )
  return dataloader

In [17]:
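max_length = 4
# Read the raw text from a text file.
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()
# Note: stride (5) exceeds max_length (4), so the token between consecutive
# windows is never used as an input. Use stride = max_length to avoid gaps.
dataloader = create_data_loader(
    raw_text,batch_size = 8, max_length = max_length, stride = 5, shuffle = False
    )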
data_iter = iter(dataloader)
inputs,targets = next(data_iter)

##### CREATING TOKEN EMBEDDINGS

In [18]:
output_dim = 256    # size of each embedding vector
vocab_size = 50257  # size of the GPT-2 BPE vocabulary

torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


##### ENCODING WORD POSITIONS

While token embeddings provide a consistent vector representation for each token, they carry no information about the token's position in a sequence **(the self-attention mechanism is itself position-agnostic)**. Positional embeddings address this, and they come in two main types: absolute and relative. **OpenAI's GPT models use absolute positional embeddings that are added to the token embedding vectors and optimized during model training.**
  * **Absolute Positional Embeddings** - directly associated with specific positions in a sequence.
  * **Relative Positional Embeddings** - focus on the relative position or distance between tokens, so the model **learns relationships in terms of "how far apart" rather than "at which exact position."**
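
The difference between the two schemes can be sketched by looking only at the indices each one would feed into an embedding table. This is a simplified illustration: the helper names and the window offsets are assumptions, and real relative-position schemes differ in how they actually use these offsets inside attention.

```python
def absolute_positions(start, seq_len):
    """Absolute index of every token in a window that begins at `start`."""
    return list(range(start, start + seq_len))

def relative_offsets(positions):
    """Matrix of pairwise distances: entry [i][j] is positions[j] - positions[i]."""
    return [[pj - pi for pj in positions] for pi in positions]

seq_len = 4
window_a = absolute_positions(0, seq_len)    # window at the start of a text
window_b = absolute_positions(10, seq_len)   # same length, shifted by 10 tokens

print(window_a)
print(window_b)

# Absolute indices change when the window moves ...
print(window_a == window_b)

# ... but the "how far apart" matrix is identical for both windows.
print(relative_offsets(window_a) == relative_offsets(window_b))
for row in relative_offsets(window_a):
    print(row)
```

In the absolute scheme each index selects its own learned vector, so the model ties information to a specific slot. In the relative scheme only the offsets matter, so shifting the whole window leaves the positional signal unchanged.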


In [19]:
block_size = max_length
pos_embedding_layer = torch.nn.Embedding(block_size, output_dim)
# The input to pos_embedding_layer is typically torch.arange(block_size), i.e. the position indices 0, 1, ..., block_size - 1.
pos_embeddings = pos_embedding_layer(torch.arange(block_size))
print(pos_embeddings.shape)

torch.Size([4, 256])


##### PREPARING FINAL INPUT EMBEDDINGS TO BE FED TO GPT

In [20]:
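# pos_embeddings has shape [4, 256] and is broadcast across the batch
# dimension of token_embeddings, which has shape [8, 4, 256].
input_embeddings = token_embeddings + pos_embeddings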
print(input_embeddings.shape)

torch.Size([8, 4, 256])
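
The addition above works even though the two tensors have different shapes, because the positional embeddings are reused for every sequence in the batch. The sketch below spells out that arithmetic with plain nested lists on a tiny example (batch of 2, sequence length 3, dimension 2). The values are made up for illustration, and the explicit loops only mimic what PyTorch broadcasting does internally.

```python
# Toy token embeddings, shape [batch=2, seq=3, dim=2].
token_emb = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]

# Toy positional embeddings, shape [seq=3, dim=2]: one vector per position.
pos_emb = [[0.5, 0.25], [1.5, 1.25], [2.5, 2.25]]

# Add the same positional vector to the token at that position
# in every sequence of the batch.
input_emb = [
    [
        [t + p for t, p in zip(tok_vec, pos_vec)]
        for tok_vec, pos_vec in zip(sequence, pos_emb)
    ]
    for sequence in token_emb
]

print(len(input_emb), len(input_emb[0]), len(input_emb[0][0]))
print(input_emb[0][0], input_emb[1][0])
```

The result keeps the shape of the token embeddings, and position 0 in both sequences receives the same positional offset.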
