# Stage1: Preprocessing



## 1. Tokenization and Encoding

**Tokenization and encoding** is the very first step of the LLM pipeline: the input text is broken down into tokens, and each token is translated into an integer ID. GPT-2 uses Byte Pair Encoding (BPE) as its tokenizer.
Let's see how it works.

In [None]:
import importlib.metadata  # the metadata submodule must be imported explicitly
import tiktoken # OpenAI's tokenizer library

tokenizer = tiktoken.get_encoding("gpt2")

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


In [None]:
text1 = "Hello, do you like tea. Is this-- a test?"
text2 = "In the sunlit terraces of the palace."

# Join the two texts with <|endoftext|> as a separator.
text = " <|endoftext|> ".join((text1, text2))

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(f"Encoded into Integers:\n {integers}")
print("\n")

strings = tokenizer.decode(integers)

print(f"Decoded into Text:\n {strings}")

Encoded into Integers:
 [15496, 11, 466, 345, 588, 8887, 13, 1148, 428, 438, 257, 1332, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 262, 20562, 13]


Decoded into Text:
 Hello, do you like tea. Is this-- a test? <|endoftext|> In the sunlit terraces of the palace.
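
For intuition about what BPE does, here is a toy sketch of the merge-learning loop on a made-up word-frequency corpus. It works on characters and learns its own merges, whereas GPT-2's tokenizer works on bytes and ships with a pre-trained merge table. The function names `most_frequent_pair` and `merge_pair` and the corpus are illustrative assumptions, not tiktoken's implementation.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (hypothetical): word -> frequency, each word split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges:", merges)
print("Words after merging:", list(words))
```

Frequent character sequences such as "est" end up as single symbols after a couple of merges, which is why common words become one token while rare words are split into several pieces.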


### Sampling / Tensors with Sliding Window
An LLM works by always predicting the next token, so it must be trained in the same manner.

For example, let's say we have to train the LLM on one line of text: "LLM learns to predict one word at a time".

We have to break this one line into multiple input sub-chunks, each paired with a target that is the same window shifted one position to the right. This is shown below for a context window of 3, treating each word as one token for simplicity.

|Input Tokens|Expected Output Tokens (input shifted by one)|
|---|---|
|LLM learns to|learns to predict|
|learns to predict|to predict one|
|to predict one|predict one word|
|predict one word|one word at|
|one word at|word at a|
|word at a|at a time|
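
The table above can be reproduced with a few lines of plain Python. This is a minimal sketch that treats whitespace-separated words as tokens (a simplification; the real pipeline uses BPE token IDs), and `shifted_pairs` is a hypothetical helper name.

```python
def shifted_pairs(tokens, context_size):
    """Return (input, target) windows; the target is the input shifted right by one."""
    pairs = []
    for i in range(len(tokens) - context_size):
        inputs = tokens[i:i + context_size]
        targets = tokens[i + 1:i + context_size + 1]
        pairs.append((inputs, targets))
    return pairs

words = "LLM learns to predict one word at a time".split()
pairs = shifted_pairs(words, context_size=3)

for inputs, targets in pairs:
    print(" ".join(inputs), "->", " ".join(targets))

# Within one pair, every prefix of the input predicts the token that follows it.
inputs, targets = pairs[0]
for k in range(1, len(inputs) + 1):
    print(inputs[:k], "->", targets[k - 1])
```

The second loop shows why the target is a whole shifted window and not a single word: one pair of length 3 carries three next-token predictions.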

Note also that the input could be a very long text which might not fit into the model's context window. When training a model like GPT-2, the model can only process a limited number of tokens at a time (for example, the context window of GPT-2 is 1024 tokens). So we must split the long text into manageable chunks (tensors).

- **max_length**: the number of tokens in each chunk. It can be at most the context window.
- **stride**: the number of tokens the window advances between consecutive chunks. It controls how much overlap there is between consecutive training chunks: they share `max_length - stride` tokens when `stride < max_length`.

```
Suppose:
max_length = 10
stride = 5
and your tokens are [t1, t2, t3, …, t26].

Then your dataset chunks will be:
Input Chunk 1: t1  to t10                                    -->  Output Chunk 1: t2  to t11
Input Chunk 2: t6  to t15  (overlaps Chunk 1 on t6  to t10)  -->  Output Chunk 2: t7  to t16
Input Chunk 3: t11 to t20  (overlaps Chunk 2 on t11 to t15)  -->  Output Chunk 3: t12 to t21
Input Chunk 4: t16 to t25  (overlaps Chunk 3 on t16 to t20)  -->  Output Chunk 4: t17 to t26
```

This is achieved with a dataset and a data loader, which break the input text into chunks and group them into batches.
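
Before looking at that code, the chunk arithmetic can be checked in isolation. The sketch below uses the same loop bounds as the dataset class in the next cell, `range(0, n_tokens - max_length, stride)`, and prints 1-based token ranges. `chunk_ranges` is a hypothetical helper introduced only for this illustration.

```python
def chunk_ranges(n_tokens, max_length, stride):
    """Return 1-based (input_start, input_end, target_start, target_end) per chunk."""
    ranges = []
    for i in range(0, n_tokens - max_length, stride):
        ranges.append((i + 1, i + max_length, i + 2, i + max_length + 1))
    return ranges

for n_tokens in (25, 26):
    ranges = chunk_ranges(n_tokens, max_length=10, stride=5)
    print(f"{n_tokens} tokens -> {len(ranges)} chunks")
    for in_start, in_end, tgt_start, tgt_end in ranges:
        print(f"  input t{in_start}..t{in_end}  target t{tgt_start}..t{tgt_end}")

# Consecutive inputs share max_length - stride tokens.
print("overlap:", 10 - 5)
```

Because every target is shifted by one, a chunk is only emitted when there is one more token after its input. With `max_length = 10` and `stride = 5`, a chunk covering t16 to t25 needs a 26th token for its target.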


In [None]:
import importlib
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

tokenizer = tiktoken.get_encoding("gpt2")

class SlidingWindowDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    ## max_length --> how many tokens will be placed in each tensor
    ## stride --> how many tokens the window moves forward between consecutive tensors
    dataset = SlidingWindowDataset(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,   # number of samples/chunks per batch
        shuffle=shuffle,         # whether to shuffle the data at every epoch
        drop_last=drop_last,     # true means that the last batch is dropped if it's smaller than batch_size
        num_workers=num_workers  # >0 is used for parallel data loading
    )

    return dataloader
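
To see how `max_length`, `stride`, `batch_size`, and `drop_last` interact, the number of samples and batches can be worked out without PyTorch. This is a small sketch under the assumption of a 24-token text; `count_samples` and `count_batches` are hypothetical helpers, not part of `torch.utils.data`.

```python
def count_samples(n_tokens, max_length, stride):
    """Number of windows produced by range(0, n_tokens - max_length, stride)."""
    return len(range(0, n_tokens - max_length, stride))

def count_batches(n_samples, batch_size, drop_last):
    """Number of batches; a short final batch is kept only if drop_last is False."""
    full, remainder = divmod(n_samples, batch_size)
    return full if drop_last or remainder == 0 else full + 1

n_tokens = 24  # hypothetical token count, chosen for illustration
n_samples = count_samples(n_tokens, max_length=5, stride=2)

print("samples:", n_samples)                                               # 10
print("batches (drop_last=True): ", count_batches(n_samples, 3, True))     # 3
print("batches (drop_last=False):", count_batches(n_samples, 3, False))    # 4
```

With `drop_last=True` the leftover sample is discarded, which keeps every batch the same shape at the cost of a little data per epoch.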

In [None]:
batch_size=3
max_length=5
stride=2

# with open("../../_data/the-verdict.txt", "r", encoding="utf-8") as f:
#     raw_text = f.read()

raw_text = "The sun slipped behind the hills as the air turned colder. A stray dog wandered along the empty street searching for food."

# First tokenizes the raw text into token IDs using the BPE tokenizer,
# then breaks the token array into smaller chunks (tensors)
dataloader = create_dataloader_v1(
    raw_text, batch_size=batch_size, max_length=max_length, stride=stride, shuffle=False
)

data_iter = iter(dataloader)
print("First Batch:")
print("------------------------------------\n")
input1, target1 = next(data_iter)
print("Input Tensor:", input1)
print("Decoded Tensor Seq1:", tokenizer.decode(input1[0].tolist()))
print("Decoded Tensor Seq2", tokenizer.decode(input1[1].tolist()))
print("Decoded Tensor Seq3:", tokenizer.decode(input1[2].tolist()))
print("\n")
print("Expected Output Tensor:", target1)
print("Decoded Tensor Seq1 expected output:", tokenizer.decode(target1[0].tolist()))
print("Decoded Tensor Seq2 expected output:", tokenizer.decode(target1[1].tolist()))
print("Decoded Tensor Seq3 expected output:", tokenizer.decode(target1[2].tolist()))
print("\n")

print("Second Batch:")
print("------------------------------------\n")
input2, target2 = next(data_iter)
print("Input Tensor:", input2)
print("Decoded Tensor Seq1:", tokenizer.decode(input2[0].tolist()))
print("Decoded Tensor Seq2:", tokenizer.decode(input2[1].tolist()))
print("Decoded Tensor Seq3:", tokenizer.decode(input2[2].tolist()))
print("\n")
print("Expected Output Tensor:", target2)
print("Decoded Tensor Seq1 expected output:", tokenizer.decode(target2[0].tolist()))
print("Decoded Tensor Seq2 expected output:", tokenizer.decode(target2[1].tolist()))
print("Decoded Tensor Seq3 expected output:", tokenizer.decode(target2[2].tolist()))
print("\n")

First Batch:
------------------------------------

Input Tensor: tensor([[  464,  4252, 18859,  2157,   262],
        [18859,  2157,   262, 18639,   355],
        [  262, 18639,   355,   262,  1633]])
Decoded Tensor Seq1: The sun slipped behind the
Decoded Tensor Seq2  slipped behind the hills as
Decoded Tensor Seq3:  the hills as the air


Expected Output Tensor: tensor([[ 4252, 18859,  2157,   262, 18639],
        [ 2157,   262, 18639,   355,   262],
        [18639,   355,   262,  1633,  2900]])
Decoded Tensor Seq1 expected output:  sun slipped behind the hills
Decoded Tensor Seq2 expected output:  behind the hills as the
Decoded Tensor Seq3 expected output:  hills as the air turned


Second Batch:
------------------------------------

Input Tensor: tensor([[  355,   262,  1633,  2900, 38427],
        [ 1633,  2900, 38427,    13,   317],
        [38427,    13,   317, 28583,  3290]])
Decoded Tensor Seq1:  as the air turned colder
Decoded Tensor Seq2:  air turned colder. A
Decoded Tens

## 2. Embedding
Now that you have the token tensors (input tensors + expected output tensors) produced with sliding windows, the next step is to vectorize them, that is, to convert each token ID into an embedding vector.

In [None]:
import torch
import tiktoken #BytePairEncoder.

vocab_size = tokenizer.n_vocab  # Because we used tiktoken for encoding, we should use its vocabulary size.
                                # The GPT-2 BPE tokenizer has a vocabulary size of 50,257
embedding_dim = 768             # Embedding dimension of GPT-2 (smallest model) is 768
# The embedding layer converts each token ID into a 768-dimensional vector.
token_embedding_layer = torch.nn.Embedding(
    num_embeddings=vocab_size,      # size of the vocabulary. That is, total number of unique tokens
    embedding_dim=embedding_dim     # dimension of the embedding vector for each token
)

In [None]:
# Example token IDs. Let's say we have a tensor of only 4 tokens.
# Each token must be converted into a vector,
# so the result will be a 4 x 768 matrix.
token_ids = torch.tensor([15496, 11, 703, 389])
token_ids.shape

# Get embeddings
print(f'Input is a 1D tensor: {token_ids.shape}')
embeddings = token_embedding_layer(token_ids)
print(f"Output 2D tensor size: {embeddings.shape}. Each token is converted to a 768 dimension vector.")
# print(token_embedding_layer.weight)
print(embeddings)

Input is a 1D tensor: torch.Size([4])
Output 2D tensor size: torch.Size([4, 768]). Each token is converted to a 768 dimension vector.
tensor([[ 0.8774, -1.4247,  0.2718,  ...,  1.3046, -0.4417,  0.8319],
        [ 0.1590,  0.2387, -2.3928,  ..., -1.3388,  0.3896, -1.4132],
        [ 0.2799, -1.0826, -0.4743,  ...,  1.6107, -0.6543, -0.4180],
        [ 0.3794, -0.0688, -0.6316,  ...,  1.9465, -1.0704, -0.2395]],
       grad_fn=<EmbeddingBackward0>)
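
An embedding layer is a lookup table: one row of weights per token ID, and the output for a token is simply its row. The plain-Python sketch below illustrates that idea with toy sizes. It is not how `torch.nn.Embedding` is implemented internally, and the names `weight` and `embed` are assumptions made for the example.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4   # toy sizes (GPT-2 uses 50,257 and 768)

# The "layer" is just a table with one row of weights per token ID,
# initialised with random values that training would later adjust.
weight = [[random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
          for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one row per token ID; no arithmetic is involved."""
    return [weight[t] for t in token_ids]

token_ids = [7, 2, 7]
vectors = embed(token_ids)

print(len(vectors), len(vectors[0]))   # 3 4  (sequence length, embedding_dim)
print(vectors[0] == vectors[2])        # True: the same ID always gives the same row
```

This is also why the values printed above look random: they are freshly initialised weights that only become meaningful through training.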


In [None]:
# ---------------------------------------------------------
# Pass batch1 token IDs through embeddings
# ---------------------------------------------------------
print("\n===== Embeddings for First Batch =====\n")

# Shape: [batch_size, max_length, embedding_dim]
print(f'Input is a 2D tensor: {input1.shape}')
batch1_embeddings = token_embedding_layer(input1)

print(f"Output 3D tensor size: {batch1_embeddings.shape}. Each token is converted to a 768 dimension vector.")
print(batch1_embeddings)


===== Embeddings for First Batch =====

Input is a 2D tensor: torch.Size([3, 5])
Output 3D tensor size: torch.Size([3, 5, 768]). Each token is converted to a 768 dimension vector.
tensor([[[-0.5329, -1.3247, -0.1321,  ...,  0.7444,  1.9264,  1.4444],
         [-0.5785, -1.3702,  1.0191,  ...,  0.5402,  0.3909,  0.6542],
         [ 0.7275,  0.6341,  1.9347,  ..., -0.9212,  0.2855,  0.0110],
         [ 0.1893, -0.4733, -0.9623,  ...,  0.0555, -0.9752, -0.4528],
         [-0.2465,  0.3179, -0.5723,  ...,  0.0317, -0.9859,  0.3813]],

        [[ 0.7275,  0.6341,  1.9347,  ..., -0.9212,  0.2855,  0.0110],
         [ 0.1893, -0.4733, -0.9623,  ...,  0.0555, -0.9752, -0.4528],
         [-0.2465,  0.3179, -0.5723,  ...,  0.0317, -0.9859,  0.3813],
         [ 0.1323,  1.0493, -0.7841,  ...,  0.0362, -0.3318, -0.1653],
         [-0.6942, -1.4064, -1.7145,  ...,  0.0332, -1.1418,  0.0596]],

        [[-0.2465,  0.3179, -0.5723,  ...,  0.0317, -0.9859,  0.3813],
         [ 0.1323,  1.0493, -0.784

## 3. Add Positional Embedding

In an input sequence like "the fox jumps over the fox", the encoder generates the same token ID for both occurrences of " fox". The same token ID translates into the same embedding vector. The concern now is how to distinguish the fox near the beginning of the sequence from the fox at the end of the sequence.

This is where positional embedding plays a role. It defines an embedding for each position in the context window.

GPT-2 has a context window of 1024 tokens; that is the maximum it can process at once. Each token is a 768-dimensional vector.

This means every input sequence to GPT-2 is a tensor of at most 1024 x 768.

The positional embedding table is of the same size, 1024 x 768: one vector per position. In GPT-2 these vectors are learned during training, just like the token embeddings. Once training is done they are fixed, and they depend only on the position, not on the input text or the token embeddings.

The positional embedding is simply added to the token embedding to produce the input to the transformer.

In our example here, we use a context window (max_length) of 5 for simplicity. When building the final GPT-2, we will use 1024.
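
The effect of the addition can be seen in a small plain-Python sketch with toy sizes. The token IDs, table sizes, and the helper names `random_table` and `input_embeddings` are hypothetical and chosen only to illustrate the idea.

```python
import random

random.seed(0)
embedding_dim, max_length = 4, 5   # toy sizes (the text uses 768 and 5)

def random_table(rows, cols):
    """Build a rows x cols table of random weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(10, embedding_dim)             # one row per token ID
position_table = random_table(max_length, embedding_dim)  # one row per position

def input_embeddings(token_ids):
    """Add the position row for index pos to the token row found at index pos."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Hypothetical IDs: token 3 appears at position 0 and again at position 4.
ids = [3, 8, 1, 6, 3]
out = input_embeddings(ids)

print(token_table[ids[0]] == token_table[ids[4]])  # True: identical token vectors
print(out[0] == out[4])                            # False: the positions differ
```

The repeated token starts from the same vector, but after the position vectors are added the two occurrences are no longer identical, which is what lets later layers tell them apart.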

In [None]:
import torch

embedding_dim = 768             # Embedding dimension of GPT-2 (smallest model) is 768
# The position embedding layer maps each position index to a 768-dimensional vector.
position_embedding_layer = torch.nn.Embedding(
    num_embeddings=max_length,      # this is the total number of tokens you can pass in a sequence.
                                    # It is also called the context window. It's 1024 for GPT-2
    embedding_dim=embedding_dim     # dimension of the embedding vector for each position
)

In [None]:
pos_embeddings = position_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([5, 768])


In [None]:
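# Input embeddings = token embeddings + positional embeddings
# pos_embeddings [5, 768] is broadcast across the batch dimension of batch1_embeddings [3, 5, 768]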
input_embeddings_batch1 = batch1_embeddings + pos_embeddings

print(input_embeddings_batch1)

tensor([[[ 1.4296, -1.4895, -1.0155,  ...,  0.7876,  5.0135,  0.6227],
         [-1.6781, -4.3282,  2.4288,  ...,  1.0637, -0.4990,  1.4549],
         [ 1.9781,  1.8971,  2.1632,  ..., -1.4088,  1.9509,  0.2565],
         [ 1.1339,  0.0960, -2.1792,  ..., -2.0248, -0.6697,  0.8989],
         [ 1.4439,  0.8378, -2.0558,  ...,  0.3818, -0.3582,  1.0698]],

        [[ 2.6900,  0.4693,  1.0513,  ..., -0.8781,  3.3726, -0.8107],
         [-0.9103, -3.4313,  0.4474,  ...,  0.5790, -1.8651,  0.3478],
         [ 1.0041,  1.5809, -0.3437,  ..., -0.4559,  0.6796,  0.6268],
         [ 1.0769,  1.6185, -2.0009,  ..., -2.0440, -0.0263,  1.1864],
         [ 0.9962, -0.8865, -3.1980,  ...,  0.3833, -0.5142,  0.7481]],

        [[ 1.7160,  0.1531, -1.4557,  ...,  0.0749,  2.1012, -0.4403],
         [-0.9673, -1.9087,  0.6257,  ...,  0.5597, -1.2217,  0.6353],
         [ 0.5564, -0.1434, -1.4859,  ..., -0.4544,  0.5236,  0.3051],
         [ 0.6981,  0.8871, -1.7891,  ..., -2.0486, -0.6804,  1.7331],
  

## 4. Summary
The final output of the 3 steps (tokenization, embedding, and positional embedding) is a batch of sequences.

- A batch has "batch_size" sequences.
- Each sequence has "max_length" tokens. (This is at most the context window of the LLM; the GPT-2 context window is 1024.)
- Each token is an embedding vector with "vector_dimension" values.

**Basically, preprocessing produces a series of batches, and each batch is a tensor of size [batch_size x max_length x vector_dimension].** The vector dimension in GPT-2 is 768. It means each token within GPT-2 is represented in 768 dimensions.
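
As a rough sanity check on these shapes, the arithmetic below counts the values in one batch and the weights in the two embedding tables. It is a back-of-the-envelope sketch that assumes 4-byte float32 values; `n_elements` is a hypothetical helper.

```python
def n_elements(batch_size, max_length, embedding_dim):
    """Number of values in a [batch_size x max_length x embedding_dim] tensor."""
    return batch_size * max_length * embedding_dim

toy_batch = n_elements(3, 5, 768)        # the example used in this section
full_batch = n_elements(3, 1024, 768)    # same batch size at GPT-2's full context

token_table_params = 50257 * 768         # vocabulary size x embedding dimension
position_table_params = 1024 * 768       # context window x embedding dimension

print(toy_batch)                         # 11520
print(full_batch)                        # 2359296
print(full_batch * 4 / 2**20, "MiB")     # 9.0 MiB, assuming float32
print(token_table_params)                # 38597376
print(position_table_params)             # 786432
```

The token embedding table is far larger than the positional one because it has one row per vocabulary entry, not one per position.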



# Stage1: Attention