# Reading in a short story as a text sample in Python

## Step 1: Creating Tokens

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training.
Note that it is common to process millions of articles and hundreds of thousands of books (many gigabytes of text) when working with LLMs. However, for this project we will be working with smaller text samples to illustrate the main points.
</div>

In [3]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text) # splitting at white spaces

print(result) 

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [5]:
result = re.split(r'([,.]|\s)', text) #splitting at , and . now as well as white spaces

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [6]:
result = [item for item in result if item.strip()] # remove redundant white spaces from list
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Whether to remove whitespace or keep it as separate tokens depends on the application and its requirements. Removing whitespace reduces the memory and computing requirements. However, keeping whitespace can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). For now, we will remove whitespace for simplicity.
</div>
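To make this trade-off concrete, here is a minimal sketch that tokenizes a short, made-up Python snippet twice: once keeping whitespace tokens and once dropping them. The sample string and the variable names `with_ws` and `without_ws` are illustrative assumptions and are not part of the short story dataset.

```python
import re

# Hypothetical sample: indentation carries meaning in Python code.
code = "if x:\n    return y"

# Split on whitespace and keep every separator as its own token
# (empty strings produced by adjacent separators are discarded).
with_ws = [tok for tok in re.split(r'(\s)', code) if tok != ""]

# Same tokens, but drop the ones that consist only of whitespace.
without_ws = [tok for tok in with_ws if tok.strip()]

print(with_ws)
print(without_ws)

# Only the whitespace-preserving token list can rebuild the original text.
print("".join(with_ws) == code)
print("".join(without_ws) == code)
```

The second token list is shorter, but the newline and the four-space indentation are lost, which is exactly the information a model of Python code would need.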

In [7]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text) # we want to tokenize ? , . : ; ...etc separately
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'Is', ' ', 'this', '--', '', ' ', 'a', ' ', 'test', '?', '']


In [8]:
result = [item for item in result if item.strip()] # remove whitespace from list
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


**Now that we have a basic tokenizer, we apply it to Edith Wharton's entire short story**

In [10]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()] # remove whitespace
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [11]:
print(len(preprocessed))

4690


## Step 2: Creating Token IDs

In [12]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1130


**After determining that the vocabulary size is 1,130, we create the vocabulary and print the first 51 entries for illustration purposes**

In [13]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [14]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


**The dictionary contains individual tokens associated with unique integer labels**

**Later when we want to convert outputs of LLM from numbers back to text, we also need a way to turn token IDs into text.**

**To do this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.**

**Now we will implement a complete tokenizer class in Python.**

**The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs.**

**We will also implement a decode method that carries out the reverse integer-to-string mapping to convert token IDs back to their tokens.**

In [32]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

**Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from the short story**

In [33]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [34]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Note that many words do not appear in this vocabulary, because it is built from a single short story. Working with LLMs requires large and diverse training sets to extend the vocabulary.
</div>
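The following sketch illustrates what happens when a tokenizer with a fixed vocabulary meets a word it has never seen. It uses a toy vocabulary (hypothetical, far smaller than the one built above) and a helper named `encode_strict`, which is an assumption made for illustration and only mirrors the direct dictionary lookup in `SimpleTokenizerV1.encode`.

```python
import re

# Toy vocabulary (illustrative only; the real one has 1,130 entries).
toy_vocab = {tok: i for i, tok in enumerate(sorted({"the", "last", "he", "painted", ","}))}

def encode_strict(text, vocab):
    """Split the text like SimpleTokenizerV1 and look up every token directly."""
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [vocab[t] for t in tokens]  # raises KeyError for unseen tokens

# Every token is in the toy vocabulary, so encoding succeeds.
print(encode_strict("the last he painted", toy_vocab))

# "Hello" is not in the toy vocabulary, so the lookup fails.
try:
    encode_strict("Hello, the last he painted", toy_vocab)
except KeyError as err:
    missing = err.args[0]
    print("Token not in vocabulary:", missing)
```

A direct lookup like this fails on the first unseen token, so a tokenizer of this kind needs some fallback for out-of-vocabulary words.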

## Adding Special Context Tokens

One way to deal with unknown words is to use special context tokens.

We will modify the vocabulary and extend SimpleTokenizerV1 into SimpleTokenizerV2.

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
We can modify the tokenizer to use an <|unk|> token when it encounters a word that is not part of the vocabulary.

Furthermore, we add an <|endoftext|> token between unrelated texts.

For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source.
</div>

In [35]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [36]:
len(vocab.items())

1132

**As an additional check, let's print the last 5 entries of the updated vocabulary**

In [37]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


**Now we build a simple text tokenizer that handles unknown words**

**It replaces unknown words with <|unk|> tokens**

**It removes spaces before the specified punctuation marks**

In [39]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [40]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [41]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [42]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

Based on comparing the de-tokenized text above with original input text, we know the training dataset, Edith Wharton's The Verdict, did not contain the words "Hello" and "palace".

**Other special tokens considered:**

[BOS] (beginning of sequence): This token marks the start of a text. It signals to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful for concatenating multiple unrelated texts.

[PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.
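As a minimal sketch of padding, the snippet below extends a batch of made-up token-ID sequences to a common length. The sequences, the `PAD_ID` value of 0, and the helper name `pad_batch` are assumptions for illustration; a real tokenizer would reserve its own ID for the [PAD] token.

```python
# Hypothetical token-ID sequences of different lengths.
batch = [[5, 12, 7], [9, 3], [4, 8, 15, 16]]

PAD_ID = 0  # assumed ID reserved for the [PAD] token in this sketch

def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

padded = pad_batch(batch, PAD_ID)
for row in padded:
    print(row)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.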

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; it only uses an <|endoftext|> token for simplicity.

The tokenizer used for GPT models also does not use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units.
</div>

# Byte Pair Encoding (BPE)

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
This section covers a more sophisticated tokenization scheme based on a concept called byte pair encoding (BPE).

**BPE is a subword tokenization algorithm**

The BPE tokenizer covered here was used to train LLMs like GPT-2, GPT-3, and the original model used in ChatGPT.
</div>

**BPE can be relatively complicated to implement, so we will use an existing open-source Python library called tiktoken (https://github.com/openai/tiktoken)**

The library implements the BPE algorithm very efficiently, with its core written in Rust.

In [1]:
! pip3 install tiktoken



In [3]:
import importlib.metadata  # import the submodule explicitly so importlib.metadata is always available
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Once installed, we can instantiate the BPE tokenizer from tiktoken as follows
</div>

In [4]:
tokenizer = tiktoken.get_encoding("gpt2")

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
This tokenizer is used much like the SimpleTokenizerV2 we implemented previously, via an encode method:
</div>

In [6]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2.
</div>

In [9]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


<div style="background-color: white; color: orange; padding: 15px; border: 2px solid orange; border-radius: 5px;">
The BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID (50256, as seen in the output above).
</div>

<div style="background-color: white; color: orange; padding: 15px; border: 2px solid orange; border-radius: 5px;">
The BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace", correctly.

The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?

The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subwords or even individual characters.
This enables it to handle out-of-vocabulary words.
Thanks to the BPE algorithm, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.
</div>
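To build some intuition for how subword merging works, here is a simplified, character-level sketch of BPE in plain Python. It is not how tiktoken is implemented (the GPT-2 tokenizer works on bytes and uses a much larger, pre-trained merge table); the function names `learn_merges`, `apply_merge`, and `segment`, as well as the tiny training word list, are assumptions made for illustration.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # word as symbol tuple -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(apply_merge(symbols, best))] += freq
        corpus = new_corpus
    return merges

def segment(word, merges):
    """Split any word into subwords by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

training_words = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_merges(training_words, num_merges=6)

# "lowest" never appears in the training words, yet it is still segmented.
pieces = segment("lowest", merges)
print(pieces)

# A word with no matching merges falls back to single characters.
fallback = segment("xyz", merges)
print(fallback)
```

Because every word can always be split into single characters, no input is ever out of vocabulary, which is why this scheme does not need an <|unk|> token.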

**Another example to illustrate how the BPE tokenizer deals with unknown words**

In [10]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[33901, 86, 343, 86, 220, 959]
Akwirw ier


## Creating input-target pairs

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
In this section we implement a data loader that fetches the input-target pairs using a sliding window approach.

To get started, we will first tokenize the entire short story The Verdict, which we worked with earlier, using the BPE tokenizer introduced in the previous section.
</div>

In [11]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
Executing the code above returns 5145. This is the total number of tokens in the training set after applying the BPE tokenizer to The Verdict; it is not the vocabulary size.
</div>

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Next, we remove the first 50 tokens from the dataset for demonstration purposes, as this results in a slightly more interesting text passage.
</div>

In [12]:
enc_sample = enc_text[50:]

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
One of the easiest and most intuitive ways to create input-target pairs for the next-word prediction task is to create x and y variables, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:
</div>

<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
The context size determines how many tokens are included in the input
</div>

In [14]:
context_size = 4 #length of input
#The context_size of 4 means that the model is trained to look at a sequence of 4 words (or tokens)
#to predict the next word in the sequence.
#The input x is the first 4 tokens [1,2,3,4], and the target y is the next 4 tokens [2,3,4,5]

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:
</div>

In [15]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.
</div>

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
For illustration purposes, we repeat the previous code but convert the token IDs into text:
</div>

In [16]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


<div style="background-color: white; color: orange; padding: 15px; border: 2px solid orange; border-radius: 5px;">
We have now created the input-target pairs that we can use for LLM training.

The next task, before we can turn tokens into embeddings, is to implement an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

We are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.
</div>

## Implementing a Data Loader

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes
</div>

<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
Step 1: Tokenize entire text

Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length

Step 3: Return the total number of rows in the dataset

Step 4: Return a single row from the dataset
</div>

In [20]:
import torch  # needed for torch.tensor below
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

<div style="background-color: white; color: orange; padding: 15px; border: 2px solid orange; border-radius: 5px;">
The GPTDatasetV1 class above is based on the PyTorch Dataset class.

It defines how individual rows are fetched from the dataset.

Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.
</div>
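The chunking logic can also be viewed in isolation, without PyTorch. The sketch below applies the same loop to a list of stand-in token IDs so the one-position shift between input and target is easy to see. The function name `sliding_window_pairs` and the toy IDs are assumptions for illustration.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

# Stand-in token IDs 0..9 so the shifts are easy to read.
toy_ids = list(range(10))

pairs = sliding_window_pairs(toy_ids, max_length=4, stride=2)
for inp, tgt in pairs:
    print(inp, "---->", tgt)
```

The loop stops at `len(token_ids) - max_length` so that every input chunk still has a full-length target chunk one position to its right.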

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
The following code uses GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader:
</div>

<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
Step 1: Initialize the tokenizer

Step 2: Create dataset

Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training

Step 4: The number of CPU processes to use for preprocessing
</div>

In [21]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader 

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
We test the dataloader with a batch size of 1 for an LLM with the context size of 4.
This helps develop an intuition for how the GPTDatasetV1 class and the create_dataloader_v1 function work together:
</div>

In [23]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
Convert dataloader into a Python iterator to fetch the next entry via Python's built-in next() function
</div>

In [26]:
import torch
print("PyTorch version:", torch.__version__)
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

PyTorch version: 2.6.0
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


<div style="background-color: white; color: orange; padding: 15px; border: 2px solid orange; border-radius: 5px;">
The first_batch variable contains two tensors: the first tensor stores the input token IDs and the second tensor stores the target token IDs.

Since max_length is set to 4, each of the two tensors contains 4 token IDs.

Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256.
</div>

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
To illustrate the meaning of stride=1, let's fetch another batch from this dataset:
</div>

In [25]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


<div style="background-color: white; color: orange; padding: 15px; border: 2px solid orange; border-radius: 5px;">
If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch.

For example, the second ID in the first batch's input is 367, which is the first ID of the second batch's input.

The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach
</div>

<div style="background-color: white; color: orange; padding: 15px; border: 2px solid orange; border-radius: 5px;">
Batch sizes of 1, such as we have sampled from the data loader so far, are useful for illustration purposes.

Small batch sizes require less memory during training but lead to more noisy model updates.

**Batch size is a trade-off and hyperparameter to experiment with when training LLMs.**
</div>

<div style="background-color: #d4edda; color: #155724; padding: 15px; border: 2px solid #28a745; border-radius: 5px;">
Let's have a look at how we can use the data loader to sample with a batch size greater than 1:
</div>

In [27]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


<div style="background-color: white; color: blue; padding: 15px; border: 2px solid blue; border-radius: 5px;">
Note that we increase the stride to 4, which matches max_length. This uses the dataset fully (we don't skip a single token) while avoiding any overlap between consecutive inputs, since more overlap could lead to increased overfitting.
</div>
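As a rough sketch of how these settings interact, the snippet below counts how many samples and batches the sliding window yields for a few stride values. The helper names `count_samples` and `count_batches` are assumptions for illustration; the arithmetic mirrors the `range(0, len(token_ids) - max_length, stride)` loop in GPTDatasetV1 and the `drop_last=True` default of create_dataloader_v1.

```python
def count_samples(num_tokens, max_length, stride):
    """Number of windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))

def count_batches(num_samples, batch_size, drop_last=True):
    """Number of batches, optionally discarding a final incomplete batch."""
    full, remainder = divmod(num_samples, batch_size)
    return full if drop_last or remainder == 0 else full + 1

num_tokens = 5145  # token count of The Verdict reported earlier
max_length = 4

for stride in (1, 2, 4):
    samples = count_samples(num_tokens, max_length, stride)
    overlap = max(max_length - stride, 0)  # tokens shared by consecutive inputs
    batches = count_batches(samples, batch_size=8)
    print(f"stride={stride}: samples={samples}, overlap={overlap}, batches={batches}")
```

A smaller stride produces more samples from the same text, but consecutive inputs then share most of their tokens; a stride equal to max_length gives the fewest samples with no shared tokens.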