# Chapter 2 - Lab 1a - Exercise
> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

# Exercise 2.1

### Exploring Byte Pair Encoding (BPE) Tokenization with Unknown Words

In this exercise, you will explore how the Byte Pair Encoding (BPE) tokenizer from the `tiktoken` library processes unknown words. BPE is a subword tokenization technique that constructs its vocabulary by iteratively merging frequent sequences of characters or subwords. This approach allows BPE tokenizers to handle previously unseen words by decomposing them into smaller, known subunits.

### Objective
Answer the following questions based on your experimentation with the tokenizer:

1. How does the BPE tokenizer decompose the input phrase **"Akwirw ier"** into token IDs?
2. What are the subwords or characters corresponding to each token ID in the tokenized output?
3. Can the tokenizer's decoding process successfully reconstruct the original input phrase **"Akwirw ier"** from the token IDs? Why or why not?

### Theoretical Background
Byte Pair Encoding begins with a minimal vocabulary of single characters, such as **"a," "b," "c,"** and so on. The tokenizer extends this base by iteratively merging frequently co-occurring characters into subwords, and then merging frequent subwords into longer subwords or complete words. Which merges are kept is determined by a frequency cutoff. The GPT-2 encoding shipped with `tiktoken` applies the same idea at the byte level, so its base vocabulary covers every possible byte and any input string can be encoded.

For example:
- In the initial stage, the characters **"d"** and **"e"** might frequently appear together in a corpus. The tokenizer merges them into the subword **"de"** if their co-occurrence count exceeds the frequency cutoff.
- This subword then becomes part of the tokenizer's vocabulary and is used to tokenize words where it occurs, such as **"define," "depend," "made,"** and **"hidden."**

This hierarchical merging enables the BPE tokenizer to strike a balance between granularity and generalization, efficiently encoding both common words and rare or unknown words by breaking them into smaller units.
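
The merging idea described above can be sketched with a toy trainer. This is a minimal illustration, not the `tiktoken` implementation: the corpus and its word counts are invented for the example, the helper names (`most_frequent_pair`, `merge_pair`, `learn_merges`) are hypothetical, and the loop stops after a fixed number of merges instead of applying a frequency cutoff.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words; return (pair, count) or None."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0] if pairs else None


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(words)
        if best is None:
            break
        pair, _count = best
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy corpus (word -> number of occurrences).
toy_corpus = {"define": 5, "depend": 4, "made": 3, "hidden": 2}

merges, words = learn_merges(toy_corpus, num_merges=1)
print("First merge:", merges[0])
print("Segmented words:", list(words))
```

With these counts the pair `("d", "e")` occurs 14 times, more often than any other pair, so it is merged first and `"de"` becomes a single symbol in all four words.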

### Task Steps

1. **Tokenization**:
   - Use the `tiktoken` BPE tokenizer to process the unknown input string **"Akwirw ier"**.
   - Print the token IDs generated for this input.

2. **Subword Decoding**:
   - For each token ID in the resulting list, use the tokenizer's `decode` function to convert the ID back into its corresponding subword or character.

3. **Reconstruction**:
   - Apply the `decode` method to the entire list of token IDs to reconstruct the original input string. Verify whether the reconstructed string matches the initial input, **"Akwirw ier"**.



### Questions - Exercise 2.1
1. What sequence of token IDs does the BPE tokenizer generate for the input **"Akwirw ier"**?
2. What subwords or characters correspond to each token ID in the sequence?
3. Does the reconstructed output from the token IDs match the original input? Explain your observations and reasoning.



```python
import tiktoken
import torch
```

```python
txt = "Akwirw ier"

tokenizer = tiktoken.get_encoding("gpt2")
```

```python
ids = tokenizer.encode(txt)

ids
```

```plaintext
[33901, 86, 343, 86, 220, 959]
```

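```python
# Decode each ID on its own to see the subword it stands for.
# A dict keeps one entry per distinct ID, so the repeated ID 86 ("w") is listed once.
tokens = {token_id: tokenizer.decode([token_id]) for token_id in ids}

tokens
```

```plaintext
{33901: 'Ak', 86: 'w', 343: 'ir', 220: ' ', 959: 'ier'}
```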

```python
tokenizer.decode(ids)
```

```plaintext
'Akwirw ier'
```

The decoded output is identical to the input. The space is encoded as well, as its own token (ID 220).
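
One way to see why the round trip is lossless: the GPT-2 encoding is a byte-level BPE, so in the worst case a string can always fall back to its individual UTF-8 bytes. The sketch below uses only the Python standard library and does not call `tiktoken`. The printed values are raw byte values, not `tiktoken` token IDs.

```python
text = "Akwirw ier"

# Every string has a UTF-8 byte representation, whether or not it is a real word.
byte_values = list(text.encode("utf-8"))
print(byte_values)

# Decoding the same bytes restores the original string exactly.
restored = bytes(byte_values).decode("utf-8")
print(restored == text)
```

BPE merges only group these bytes into larger units, so concatenating the decoded tokens gives back the same byte sequence and therefore the same text.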

---

# Exercise 2.2

**Exercise: Exploring Data Loader Behavior with Different Parameters**


In this exercise, you will investigate how the parameters of a data loader (such as `max_length`, `stride`, and `batch_size`) affect the preparation of input-output pairs for training large language models (LLMs). By experimenting with these settings, you will gain a practical understanding of their impact on the data batching process and their implications for model training.

### Objective
You will:
1. Observe how the data loader generates input-output pairs with different configurations of `max_length` and `stride`.
2. Analyze how increasing the batch size changes the structure of the data and discuss the tradeoffs involved.
3. Experiment with a batch size greater than 1 to understand how it impacts memory usage and input-output organization.

---

### Theoretical Background

A data loader processes raw text data into smaller sequences suitable for training LLMs. Key parameters that influence its behavior are:

1. **`max_length`**: Specifies the maximum sequence length for each input-output pair. Shorter sequences may simplify computation but can limit the context available to the model.
   
2. **`stride`**: Determines the step size for sliding the window over the text when creating sequences. A smaller stride increases the overlap between consecutive sequences, and therefore the redundancy. A larger stride reduces the overlap; once `stride` equals `max_length`, consecutive sequences no longer share any input tokens.

3. **`batch_size`**: Controls the number of sequences in a batch:
   - **Small batches** (e.g., `batch_size=1`) are easier to process and require less memory. However, they can produce noisier gradient updates during training.
   - **Larger batches** improve gradient stability but require more memory and computational power. This parameter is an important hyperparameter to tune during training.

These parameters are central to efficient and effective preprocessing of data for training deep learning models.
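
The effect of `max_length` and `stride` can be previewed without a tokenizer. The sketch below mirrors the sliding-window loop used by `GPTDatasetV1` later in this notebook, but works on plain Python lists. The function name `sliding_windows` and the toy IDs `0..9` are assumptions made for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Hypothetical stand-in for real token IDs.
toy_ids = list(range(10))

for max_length, stride in [(2, 2), (4, 1), (4, 4)]:
    pairs = sliding_windows(toy_ids, max_length, stride)
    print(f"max_length={max_length}, stride={stride}: {len(pairs)} pairs")
    for inputs, targets in pairs[:2]:
        print("   ", inputs, "->", targets)
```

On ten tokens, `max_length=4` produces six pairs with `stride=1` but only two with `stride=4`, which shows how the stride trades redundancy against the number of training samples.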

---

### Task Steps

1. **Experimenting with `max_length` and `stride`**:
   - Run the data loader with two configurations:
     - `max_length=2` and `stride=2`
     - `max_length=8` and `stride=2`
   - Observe the structure of the input-output pairs for each configuration and note how they differ.

2. **Increasing Batch Size**:
   - Experiment with a batch size of 8 using the following configuration:
     ```python
     dataloader = create_dataloader_v1(
         raw_text, batch_size=8, max_length=4, stride=4,
         shuffle=False
     )
     data_iter = iter(dataloader)
     inputs, targets = next(data_iter)
     print("Inputs:\n", inputs)
     print("\nTargets:\n", targets)
     ```
   - Examine the resulting `inputs` and `targets`. Consider how the data is structured when `batch_size` is increased compared to a batch size of 1.

3. **Avoiding Overlap**:
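   - Analyze the effect of the `stride=4` setting. Because the stride equals `max_length=4` in this configuration, consecutive input sequences do not overlap: each token appears in exactly one input sequence. This minimizes redundancy and can reduce the risk of overfitting caused by seeing the same tokens repeatedly.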
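
The relationship between window length and stride can be written as simple arithmetic. The two helper functions below are hypothetical names introduced for this sketch. They assume consecutive windows start `stride` tokens apart, as in the data loader used in this lab.

```python
def shared_tokens(max_length, stride):
    """Input tokens that two consecutive windows have in common."""
    return max(0, max_length - stride)


def skipped_tokens(max_length, stride):
    """Tokens between two consecutive windows that no input window covers."""
    return max(0, stride - max_length)


for max_length, stride in [(8, 2), (8, 4), (4, 4), (4, 8)]:
    print(
        f"max_length={max_length}, stride={stride}: "
        f"shared={shared_tokens(max_length, stride)}, "
        f"skipped={skipped_tokens(max_length, stride)}"
    )
```

A stride smaller than `max_length` repeats tokens across windows, a stride equal to `max_length` uses every token exactly once as input, and a stride larger than `max_length` leaves part of the text unused.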

---

### Questions - Exercise 2.2

1. How do changes in `max_length` and `stride` affect the input-output mappings produced by the data loader?  
2. What differences do you observe in the data when using a batch size of 8 compared to a batch size of 1?  
3. How does using a larger stride (e.g., `stride=4`) influence the coverage of the dataset and the overlap between sequences?  

---

### Example Output

Using the configuration:
```python
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

**Inputs**:
```plaintext
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
```

**Targets**:
```plaintext
tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
```
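
The structure of this example output can be checked with plain Python lists copied from the tensors above. The sketch verifies two properties: each target row is its input row shifted by one token, and, because `stride` equals `max_length`, the last target of one row is the first input of the next row. The variable names are chosen for this illustration only.

```python
inputs = [
    [40, 367, 2885, 1464],
    [1807, 3619, 402, 271],
    [10899, 2138, 257, 7026],
    [15632, 438, 2016, 257],
    [922, 5891, 1576, 438],
    [568, 340, 373, 645],
    [1049, 5975, 284, 502],
    [284, 3285, 326, 11],
]
targets = [
    [367, 2885, 1464, 1807],
    [3619, 402, 271, 10899],
    [2138, 257, 7026, 15632],
    [438, 2016, 257, 922],
    [5891, 1576, 438, 568],
    [340, 373, 645, 1049],
    [5975, 284, 502, 284],
    [3285, 326, 11, 287],
]

# Property 1: targets are the inputs shifted left by one position.
shifted_by_one = all(
    row_in[1:] == row_tgt[:-1] for row_in, row_tgt in zip(inputs, targets)
)

# Property 2: rows are contiguous, so there is neither overlap nor a gap.
contiguous = all(
    targets[i][-1] == inputs[i + 1][0] for i in range(len(inputs) - 1)
)

print("Targets shifted by one:", shifted_by_one)
print("Rows contiguous:", contiguous)
```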

---

### Expected Learning Outcomes

By completing this exercise, you should:
1. Understand how varying `max_length` and `stride` impacts the input-output pairs produced by the data loader.
2. Appreciate the tradeoffs involved in choosing different batch sizes for training deep learning models.
3. Gain insight into how stride settings can minimize redundancy and optimize dataset utilization.

```python
# Importing Required Modules
from torch.utils.data import Dataset, DataLoader

# Defining the Custom Dataset Class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```

```python
# Creating the Data Loader Function
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
```
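
To estimate how many samples and batches a configuration yields, the window loop of `GPTDatasetV1` and the `drop_last` option can be expressed as two small helpers. This is a sketch in plain Python: the helper names and the token counts (100 and 101) are hypothetical and are not taken from `the-verdict.txt`.

```python
def num_samples(num_tokens, max_length, stride):
    """Number of windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))


def num_batches(n_samples, batch_size, drop_last=True):
    """Batches per epoch; with drop_last=True an incomplete final batch is discarded."""
    full, remainder = divmod(n_samples, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


for num_tokens in (100, 101):
    n = num_samples(num_tokens, max_length=4, stride=4)
    print(
        f"{num_tokens} tokens -> {n} samples, "
        f"{num_batches(n, 8, drop_last=True)} batches with drop_last=True, "
        f"{num_batches(n, 8, drop_last=False)} with drop_last=False"
    )
```

With `drop_last=True`, which is the default in `create_dataloader_v1`, samples that do not fill a complete batch are never seen in that epoch. This keeps all batches the same shape at the cost of a few samples.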

```python
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    txt = f.read()
```

```python
dataloader = create_dataloader_v1(
    txt, batch_size=1, max_length=2, stride=2,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

```plaintext
Inputs:
 tensor([[ 40, 367]])

Targets:
 tensor([[ 367, 2885]])
```


```python
dataloader = create_dataloader_v1(
    txt, batch_size=1, max_length=8, stride=2,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

```plaintext
Inputs:
 tensor([[  40,  367, 2885, 1464, 1807, 3619,  402,  271]])

Targets:
 tensor([[  367,  2885,  1464,  1807,  3619,   402,   271, 10899]])
```


```python
dataloader = create_dataloader_v1(
    txt, batch_size=8, max_length=8, stride=2,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

```plaintext
Inputs:
 tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],
        [ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138],
        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],
        [  402,   271, 10899,  2138,   257,  7026, 15632,   438],
        [10899,  2138,   257,  7026, 15632,   438,  2016,   257],
        [  257,  7026, 15632,   438,  2016,   257,   922,  5891],
        [15632,   438,  2016,   257,   922,  5891,  1576,   438],
        [ 2016,   257,   922,  5891,  1576,   438,   568,   340]])

Targets:
 tensor([[  367,  2885,  1464,  1807,  3619,   402,   271, 10899],
        [ 1464,  1807,  3619,   402,   271, 10899,  2138,   257],
        [ 3619,   402,   271, 10899,  2138,   257,  7026, 15632],
        [  271, 10899,  2138,   257,  7026, 15632,   438,  2016],
        [ 2138,   257,  7026, 15632,   438,  2016,   257,   922],
        [ 7026, 15632,   438,  2016,   257,   922,  5891,  1576],
        [  438,  2016,   257,   922,  5891,  1576,   438,   568],
        [  257,   922,  5891,  1576,   438,   568,   340,   373]])
```

```python
dataloader = create_dataloader_v1(
    txt, batch_size=8, max_length=8, stride=4,
    shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```

Inputs:
 tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],
        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],
        [10899,  2138,   257,  7026, 15632,   438,  2016,   257],
        [15632,   438,  2016,   257,   922,  5891,  1576,   438],
        [  922,  5891,  1576,   438,   568,   340,   373,   645],
        [  568,   340,   373,   645,  1049,  5975,   284,   502],
        [ 1049,  5975,   284,   502,   284,  3285,   326,    11],
        [  284,  3285,   326,    11,   287,   262,  6001,   286]])

Targets:
 tensor([[  367,  2885,  1464,  1807,  3619,   402,   271, 10899],
        [ 3619,   402,   271, 10899,  2138,   257,  7026, 15632],
        [ 2138,   257,  7026, 15632,   438,  2016,   257,   922],
        [  438,  2016,   257,   922,  5891,  1576,   438,   568],
        [ 5891,  1576,   438,   568,   340,   373,   645,  1049],
        [  340,   373,   645,  1049,  5975,   284,   502,   284],
        [ 5975,   284,   502,   284,  3285,   326,    1