# Lab 1a - Chapter 2 
> Author : Badr TAJINI - Large Language model (LLMs) - ESIEE 2024-2025

# The core implementation:

This notebook condenses the chapter to its main takeaway: the complete data loading pipeline, without the intermediate steps.

#### 1. Import libraries


In [None]:
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

- tiktoken: OpenAI's tokenizer library for efficient text encoding
- torch: The PyTorch deep learning framework
- Dataset and DataLoader: PyTorch's data handling utilities

#### 2. The GPTDatasetV1 class implementation:

In [None]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

This class implements a custom dataset that:

- Accepts raw text and tokenizes it in a single pass
- Slides a window of `max_length` tokens over the token IDs, advancing by `stride` tokens at each step
- Creates input-target pairs in which the target is the input shifted by one position (crucial for next-token prediction)
- Converts each token sequence to a PyTorch tensor
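
To make the one-position shift concrete, here is a minimal sketch of the same sliding-window loop applied to a toy list of integers standing in for token IDs. The helper name `sliding_windows` and the toy values are assumptions for illustration; plain Python lists replace tensors so the example runs without a tokenizer.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same loop bounds as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]  # input shifted by one
        pairs.append((input_chunk, target_chunk))
    return pairs


# Toy "token IDs" (assumed values, not real tokenizer output)
toy_ids = list(range(10, 20))  # [10, 11, ..., 19]

pairs = sliding_windows(toy_ids, max_length=4, stride=2)
for inp, tgt in pairs:
    print(inp, "->", tgt)
```

With `stride=2` and `max_length=4`, consecutive inputs share two tokens, and every target equals its input with the first token dropped and the next token appended.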

#### 3. The dataloader function:

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader

This function encapsulates the dataset creation process and configures the dataloader with:

- Customizable batch size and sequence length
- Adjustable stride for controlling overlap between sequences
- Options for shuffling, dropping the last incomplete batch (`drop_last`), and parallel data loading (`num_workers`)
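
The effect of `stride` and `drop_last` on the amount of data can be estimated without building a dataset. The sketch below reuses the loop bounds of `GPTDatasetV1` to count samples and then applies the usual batching arithmetic; the helper `count_samples_and_batches` and the corpus length of 5000 tokens are hypothetical.

```python
import math


def count_samples_and_batches(num_tokens, max_length, stride, batch_size, drop_last):
    # One sample per start index in range(0, num_tokens - max_length, stride)
    num_samples = len(range(0, num_tokens - max_length, stride))
    if drop_last:
        num_batches = num_samples // batch_size  # incomplete final batch is discarded
    else:
        num_batches = math.ceil(num_samples / batch_size)
    return num_samples, num_batches


num_tokens = 5000  # assumed corpus length, for illustration only

no_overlap = count_samples_and_batches(num_tokens, 256, 256, 4, drop_last=True)
half_overlap = count_samples_and_batches(num_tokens, 256, 128, 4, drop_last=True)
keep_last = count_samples_and_batches(num_tokens, 256, 128, 4, drop_last=False)

print("stride=256, drop_last=True :", no_overlap)
print("stride=128, drop_last=True :", half_overlap)
print("stride=128, drop_last=False:", keep_last)
```

Halving the stride roughly doubles the number of samples, at the cost of each token appearing in more than one window.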

#### 4. The model initialization and data preparation

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("gpt2")
encoded_text = tokenizer.encode(raw_text)

vocab_size = 50257
output_dim = 256
context_length = 1024

This section loads the data and sets the hyperparameters used by the embedding layers:

- Loads the training text
- Initializes the GPT-2 tokenizer
- Defines the vocabulary size (50257, the size of the GPT-2 BPE vocabulary), the embedding dimension (256), and the maximum context length (1024)
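
As a quick sanity check on these values, the short calculation below estimates how many weights each embedding table will hold, given that a `torch.nn.Embedding` layer stores one row of size `embedding_dim` per entry. The variable names `token_table_params` and `pos_table_params` are introduced here for illustration only.

```python
vocab_size = 50257
output_dim = 256
context_length = 1024

# One row of output_dim weights per token ID / per position index
token_table_params = vocab_size * output_dim
pos_table_params = context_length * output_dim

print(f"token embedding weights:      {token_table_params:,}")
print(f"positional embedding weights: {pos_table_params:,}")
```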

#### 5. The embedding layers setup

In [None]:
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length)

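This cell sets up the two embedding layers and the dataloader:

- Token embeddings: map each token ID to a dense vector of size `output_dim`
- Positional embeddings: map each position index (up to `context_length`) to a vector of the same size
- Dataloader: yields batches of 8 sequences of 4 tokens each; because `stride` equals `max_length`, consecutive windows do not overlap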
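
An embedding layer is essentially a lookup table: calling it with token IDs returns the corresponding rows of its weight matrix. The sketch below illustrates this on a deliberately tiny table; the sizes and the names `toy_embedding`, `looked_up`, and `indexed_rows` are assumptions chosen for readability.

```python
import torch

torch.manual_seed(123)

toy_vocab_size, toy_dim = 6, 3  # small assumed sizes, not the notebook's values
toy_embedding = torch.nn.Embedding(toy_vocab_size, toy_dim)

token_ids = torch.tensor([2, 5, 2])

looked_up = toy_embedding(token_ids)            # forward pass of the layer
indexed_rows = toy_embedding.weight[token_ids]  # direct row indexing

print(looked_up.shape)
print(torch.equal(looked_up, indexed_rows))
```

The repeated ID `2` yields the same vector twice, which is why position information has to be added separately.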

#### 6. The forward pass demonstration:

In [2]:
for batch in dataloader:
    x, y = batch

    token_embeddings = token_embedding_layer(x)
    pos_embeddings = pos_embedding_layer(torch.arange(max_length))

    input_embeddings = token_embeddings + pos_embeddings

    break

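This showcases the embedding process on the first batch only:

- Unpacks the input-target pair `(x, y)` from the batch
- Computes token embeddings for the inputs
- Looks up positional embeddings for positions `0` to `max_length - 1`, so only the first 4 of the 1024 rows are used
- Adds the positional embeddings to the token embeddings to form the input embeddings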
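
The addition works because PyTorch broadcasts the positional tensor of shape `[sequence, dim]` across the batch dimension of the token tensor of shape `[batch, sequence, dim]`. The following sketch shows this with small hand-made tensors in place of learned embeddings; all sizes and the `toy_` names are assumptions.

```python
import torch

batch_size, seq_len, dim = 2, 3, 4  # small assumed sizes

# Stand-in for token embeddings: all ones, shape [batch, sequence, dim]
toy_token_emb = torch.ones(batch_size, seq_len, dim)

# Stand-in for positional embeddings: row p is filled with the value p,
# shape [sequence, dim]
toy_pos_emb = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1).expand(seq_len, dim)

# Broadcasting adds the same positional rows to every sequence in the batch
combined = toy_token_emb + toy_pos_emb

print(combined.shape)
print(combined[0])
```

Every sequence in the batch receives the same positional offsets, so `combined[0]` and `combined[1]` are identical here.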

Finally, verify the output shape:

In [3]:
print(input_embeddings.shape)

torch.Size([8, 4, 256])


**The output confirms the expected tensor dimensions:**

- Batch size: 8
- Sequence length: 4
- Embedding dimension: 256

This implementation demonstrates a foundational approach to preparing text data for language model training: sliding-window chunking of the tokenized text, batching with a `DataLoader`, and the sum of token embeddings and learned positional embeddings.


END.