<a href="https://colab.research.google.com/github/achalbajpai/llm-scratch/blob/main/LLMsFromScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Creating Tokens

<div class="alert alert-block alert-success">

The print command prints the total number of characters, followed by the first 99
characters of the file for illustration purposes. </div>

In [1]:
# Open the file "cr7.txt" in read mode with UTF-8 encoding
# The `with` statement ensures the file is properly closed after reading
with open("cr7.txt", "r", encoding="utf-8") as f:
    # Read the entire content of the file into the variable `raw_text`
    raw_text = f.read()

# Print the total number of characters in the file
# `len(raw_text)` returns the length of the string `raw_text`, which is the number of characters
print("Total number of characters:", len(raw_text))

# Print the first 99 characters of the file
# `raw_text[:99]` slices the string `raw_text` to get the first 99 characters
print(raw_text[:99])

Total number of characters: 4348
Cristiano Ronaldo dos Santos Aveiro, born on 5 February 1985, is a Portuguese professional football


# Tokenizer


<div class="alert alert-block alert-success">

Our goal is to tokenize this 4348-character text into individual words and special
characters that we can then turn into embeddings for LLM training. </div>

<div class="alert alert-block alert-warning">

Note that it's common to process millions of articles and hundreds of thousands of
books -- many gigabytes of text -- when working with LLMs. However, for educational
purposes, it's sufficient to work with smaller text samples like a single book to
illustrate the main ideas behind the text processing steps and to make it possible to
run it in reasonable time on consumer hardware. </div>

<div class="alert alert-block alert-success">

How can we best split this text to obtain a list of tokens? For this, we go on a small
excursion and use Python's regular expression library re for illustration purposes. (Note
that you don't have to learn or memorize any regular expression syntax since we will
transition to a pre-built tokenizer later in this chapter.) </div>

In [2]:
# Import the `re` module, which provides support for regular expressions in Python
import re

# Define a sample text string to be processed
text = "Hello, world. This, is a test."

# Use the `re.split()` function to split the text into tokens and separators
# The regular expression `([,.:;?_!"()\']|--|\s)` matches:
# - Punctuation marks: , . : ; ? _ ! " ( ) '
# - Double hyphens: --
# - Whitespace: \s
# The parentheses `()` around the regex ensure that the separators are included in the result
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Print the initial result of the split
print(result)


['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


<div class="alert alert-block alert-info">
The result is a list of individual words, whitespaces, and punctuation characters, plus empty strings wherever two separators are adjacent.
</div>


<div class="alert alert-block alert-warning">

The list still contains whitespace-only and empty strings. Let's strip each item and
drop the ones that end up empty:</div>

In [3]:
# Clean the result list:
# 1. Use a list comprehension to iterate over each item in `result`
# 2. Apply `item.strip()` to remove leading and trailing whitespace from each item
# 3. Filter out any empty strings using `if item.strip()`
result = [item.strip() for item in result if item.strip()]

# Print the cleaned result
print(result)


['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


<div class="alert alert-block alert-success">

REMOVING WHITESPACES OR NOT


When developing a simple tokenizer, whether we should encode whitespaces as
separate characters or just remove them depends on our application and its
requirements. Removing whitespaces reduces the memory and computing
requirements. However, keeping whitespaces can be useful if we train models that
are sensitive to the exact structure of the text (for example, Python code, which is
sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
that includes whitespaces.

</div>
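To make the trade-off concrete, here is a minimal sketch that tokenizes a small indented Python snippet twice: once keeping whitespace tokens and once dropping them. The sample string `code` and the helper `split_tokens` are illustrative names that are not part of this notebook; the pattern is the same one used above.

```python
import re

# Same splitting pattern as used elsewhere in this notebook
PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text, keep_whitespace):
    """Hypothetical helper: split `text` and optionally keep whitespace tokens."""
    # Drop only the empty strings that re.split produces between adjacent separators
    pieces = [p for p in re.split(PATTERN, text) if p != ""]
    if keep_whitespace:
        return pieces
    # Additionally drop tokens that consist only of whitespace
    return [p for p in pieces if p.strip()]

# Illustrative sample: indentation carries meaning in Python code
code = "if x:\n    return y"

with_ws = split_tokens(code, keep_whitespace=True)
without_ws = split_tokens(code, keep_whitespace=False)

print(len(with_ws), with_ws)
print(len(without_ws), without_ws)

# With whitespace kept, joining the tokens reproduces the input exactly
print("".join(with_ws) == code)
```

Keeping whitespace makes the token list longer but lets the original text be rebuilt exactly; dropping it gives a shorter list at the cost of losing the indentation.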

<div class="alert alert-block alert-warning">

The tokenization scheme we devised above works well on the simple sample text. The
regular expression already lists other types of punctuation, such as question marks,
quotation marks, and double-dashes, along with additional special characters, so let's
check that it handles them on a second sample: </div>

In [5]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [4]:
# Optional extra pass (redundant here, since the list comprehension in In [3]
# already removed empty strings). This cell was executed right after In [3],
# so its output below shows the first sample text rather than the one from In [5].
result = [item for item in result if item.strip()]

# Print the final result
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


In [6]:
# Use the `re.split()` function to split the `raw_text` string into tokens and separators
# The regular expression `([,.:;?_!"()\']|--|\s)` matches:
# - Punctuation marks: , . : ; ? _ ! " ( ) '
# - Double hyphens: --
# - Whitespace: \s
# The parentheses `()` around the regex ensure that the separators are included in the result
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)

# Clean the `preprocessed` list:
# 1. Use a list comprehension to iterate over each item in `preprocessed`
# 2. Apply `item.strip()` to remove leading and trailing whitespace from each item
# 3. Filter out any empty strings using `if item.strip()`
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Print the first 30 items of the cleaned `preprocessed` list
# This is useful for inspecting the initial part of the processed data
print(preprocessed[:30])

# Print the total number of items in the cleaned `preprocessed` list
# This gives an idea of the size of the processed data
print(len(preprocessed))

['Cristiano', 'Ronaldo', 'dos', 'Santos', 'Aveiro', ',', 'born', 'on', '5', 'February', '1985', ',', 'is', 'a', 'Portuguese', 'professional', 'footballer', '.', 'He', 'plays', 'as', 'a', 'forward', 'for', 'and', 'captains', 'both', 'the', 'Saudi', 'Pro']
882


# **Step 2 - Token to Token ID**


<div class="alert alert-block alert-warning">

In the previous section, we tokenized the CR7 text and assigned the result to a
Python variable called preprocessed. Let's now create a list of all unique tokens and sort
them alphabetically to determine the vocabulary size:</div>

In [7]:
# Create a sorted list of unique words (tokens) from the `preprocessed` list
# 1. `set(preprocessed)` creates a set of unique items from `preprocessed`.
# 2. `sorted()` sorts the unique items by Unicode code point, so digits come before
#    uppercase letters, which come before lowercase letters.
all_words = sorted(set(preprocessed))

# Calculate the size of the vocabulary (number of unique words)
# `len(all_words)` returns the number of items in the `all_words` list
vocab_size = len(all_words)

# Print the size of the vocabulary
# This gives the total number of unique words in the preprocessed data
print(vocab_size)

303


<div class="alert alert-block alert-success">

After determining that the vocabulary size is 303 via the above code, we create the
vocabulary and print its first 51 entries for illustration purposes:

</div>

In [8]:
# Create a vocabulary dictionary that maps each unique token to a unique integer
# 1. `enumerate(all_words)` generates pairs of (index, token) for each token in `all_words`.
#    - `index` is the position of the token in the sorted list (starting from 0).
#    - `token` is the word itself.
# 2. A dictionary comprehension is used to create the `vocab` dictionary:
#    - The key is the `token` (word).
#    - The value is the `integer` (index).
vocab = {token: integer for integer, token in enumerate(all_words)}

In [9]:
# Iterate over the items in the `vocab` dictionary using `enumerate`
# `enumerate(vocab.items())` generates pairs of (index, (token, integer)) for each item in `vocab`
# - `index` is the position of the item in the iteration (starting from 0).
# - `item` is a tuple of (token, integer), where:
#   - `token` is the word (key in the dictionary).
#   - `integer` is the corresponding ID (value in the dictionary).
for i, item in enumerate(vocab.items()):
    # Print the current item (tuple of (token, integer))
    print(item)

    # Stop the loop after printing 51 items (indices 0 through 50)
    # This is useful for inspecting the first few items in a large dictionary
    if i >= 50:
        break

("'", 0)
('(', 1)
(')', 2)
(',', 3)
('.', 4)
('1', 5)
('100', 6)
('135', 7)
('14', 8)
('140', 9)
('18', 10)
('183', 11)
('1985', 12)
('200', 13)
('2003', 14)
('2004', 15)
('2008', 16)
('2009', 17)
('2013', 18)
('2014', 19)
('2015', 20)
('2016', 21)
('2017', 22)
('2018', 23)
('2019', 24)
('2020', 25)
('2021', 26)
('2022', 27)
('2023', 28)
('2024', 29)
('217', 30)
('23', 31)
('30', 32)
('33', 33)
('42', 34)
('5', 35)
('8', 36)
('900', 37)
('A', 38)
('Additionally', 39)
('Al', 40)
('At', 41)
('Aveiro', 42)
('Awards', 43)
('Bale', 44)
('Ballon', 45)
('Ballons', 46)
('Benzema', 47)
('Boot', 48)
('CP', 49)
('Champions', 50)


<div class="alert alert-block alert-info">
As we can see, based on the output above, the dictionary contains individual tokens
associated with unique integer labels.
</div>

<div class="alert alert-block alert-success">

Later in this book, when we want to convert the outputs of an LLM from numbers back into
text, we also need a way to turn token IDs into text.

For this, we can create an inverse
version of the vocabulary that maps token IDs back to corresponding text tokens.

</div>
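As a minimal sketch of this idea, the inverse mapping can be built by swapping keys and values. The `toy_vocab` below is a made-up four-token vocabulary used only for illustration, not the 303-token vocabulary built above.

```python
# Toy vocabulary for illustration only (assumed, not the notebook's vocabulary)
toy_vocab = {",": 0, ".": 1, "Ronaldo": 2, "scores": 3}

# Swap keys and values so that token IDs map back to tokens
inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

# Forward lookup: tokens -> IDs
ids = [toy_vocab[t] for t in ["Ronaldo", "scores", "."]]

# Reverse lookup: IDs -> tokens
tokens = [inverse_vocab[i] for i in ids]

print(ids)     # [2, 3, 1]
print(tokens)  # ['Ronaldo', 'scores', '.']
```

The swap is only safe because every token has a distinct ID; if two tokens shared an ID, one of them would be lost in the inverse dictionary.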

<div class="alert alert-block alert-success">

Let's implement a complete tokenizer class in Python.

The class will have an encode method that splits
text into tokens and carries out the string-to-integer mapping to produce token IDs via the
vocabulary.

In addition, we implement a decode method that carries out the reverse
integer-to-string mapping to convert the token IDs back into text.

</div>

<div class="alert alert-block alert-info">
    
Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods
    
Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Replace spaces before the specified punctuation

</div>



In [10]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        """
        Initialize the tokenizer with a vocabulary mapping.

        Args:
            vocab (dict): A dictionary mapping tokens (strings) to unique integers (IDs).
        """
        # Map tokens to IDs (e.g., {"Hello": 1, "world": 2})
        self.str_to_int = vocab

        # Create the inverse mapping: IDs to tokens (e.g., {1: "Hello", 2: "world"})
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        """
        Convert a text string into a list of token IDs.

        Args:
            text (str): The input text to encode.

        Returns:
            list: A list of token IDs corresponding to the input text.
        """
        import re

        # Step 1: Preprocess the text
        # Split the text into tokens and separators (like punctuation and spaces)
        # using a regular expression.
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        # Step 2: Clean the preprocessed list
        # Remove leading/trailing spaces from each item and filter out empty strings.
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]

        # Step 3: Convert tokens to IDs
        # Use the str_to_int mapping to convert each token to its corresponding ID.
        ids = [self.str_to_int[s] for s in preprocessed]

        # Return the list of token IDs
        return ids

    def decode(self, ids):
        """
        Convert a list of token IDs back into a text string.

        Args:
            ids (list): A list of token IDs to decode.

        Returns:
            str: The decoded text string.
        """
        # Step 1: Convert IDs back to tokens
        # Use the int_to_str mapping to convert each ID to its corresponding token.
        tokens = [self.int_to_str[i] for i in ids]

        # Step 2: Join tokens into a single string with spaces in between
        text = " ".join(tokens)

        # Step 3: Postprocess the text
        # Fix spacing issues around punctuation marks (e.g., "Hello , world !" -> "Hello, world!")
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)

        # Return the decoded text
        return text

<div class="alert alert-block alert-success">

Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a
passage from the CR7 text to try it out in practice:
</div>

In [12]:
tokenizer = SimpleTokenizerV1(vocab)

# Input text
text = """Ronaldo has won five Ballon d'Or awards, a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes. He was named the world's best player by FIFA five times, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship, and the UEFA Nations League."""

# Encode the text
ids = tokenizer.encode(text)
print(ids)

[100, 191, 296, 177, 45, 163, 0, 92, 144, 3, 123, 249, 273, 114, 88, 0, 253, 93, 230, 268, 121, 43, 3, 134, 184, 63, 72, 104, 4, 73, 289, 227, 268, 297, 0, 253, 149, 240, 155, 65, 177, 276, 3, 268, 225, 155, 123, 63, 240, 4, 73, 191, 296, 33, 286, 200, 197, 159, 3, 203, 259, 211, 278, 3, 177, 114, 50, 84, 3, 268, 114, 63, 51, 3, 134, 268, 114, 91, 83, 4]


In [13]:
tokenizer.decode(ids)


"Ronaldo has won five Ballon d' Or awards, a record three UEFA Men' s Player of the Year Awards, and four European Golden Shoes. He was named the world' s best player by FIFA five times, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship, and the UEFA Nations League."

<div class="alert alert-block alert-info">
    
Based on the output above, we can see that the decode method converted the
token IDs back into text that matches the original except for an extra space after each
apostrophe (d' Or, Men' s, world' s).
</div>
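The post-processing step in decode only removes whitespace that comes before a punctuation mark. The sketch below replays that step on a handful of tokens, which shows where the space after each apostrophe in the decoded output comes from. The token list is an illustrative assumption; the regular expression is copied from SimpleTokenizerV1.

```python
import re

# A few tokens as encode would produce them for "Men's Player,"
tokens = ["Men", "'", "s", "Player", ","]

# Step 1: join with single spaces, as decode does
joined = " ".join(tokens)

# Step 2: remove whitespace *before* punctuation (same pattern as SimpleTokenizerV1.decode)
fixed = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)  # Men ' s Player ,
print(fixed)   # Men' s Player,
```

The space before the apostrophe is removed, but the space after it stays, because the pattern never looks at whitespace that follows a punctuation mark.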

<div class="alert alert-block alert-success">

So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing
text based on a snippet from the training set.

Let's now apply it to a new text sample that
is not contained in the training set:
</div>

In [None]:
# text = "Hello, do you like tea?"
# print(tokenizer.encode(text))

<div class="alert alert-block alert-info">
    
The problem is that the word "Hello" does not appear in the CR7 text.

Hence, it is not contained in the vocabulary, and looking it up in encode raises a
KeyError (which is why the cell above is commented out).

This highlights the need to consider large and diverse
training sets to extend the vocabulary when working on LLMs.

</div>
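The failure can be reproduced safely with a toy vocabulary. This is a minimal sketch: `toy_vocab` and `encode_strict` are hypothetical names that mimic the dictionary lookup in SimpleTokenizerV1.encode without depending on the notebook state.

```python
# Toy vocabulary for illustration only (assumed)
toy_vocab = {"Ronaldo": 0, "scores": 1, ".": 2}

def encode_strict(tokens, vocab):
    """Look up every token directly, like SimpleTokenizerV1.encode does."""
    return [vocab[t] for t in tokens]

try:
    encode_strict(["Hello", "Ronaldo"], toy_vocab)
    failed_token = None
except KeyError as err:
    # The KeyError carries the missing key as its first argument
    failed_token = err.args[0]

print("Token missing from vocabulary:", failed_token)
```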

### ADDING SPECIAL CONTEXT TOKENS

In the previous section, we implemented a simple tokenizer and applied it to a passage
from the training set.

In this section, we will modify this tokenizer to handle unknown
words.


In particular, we will modify the vocabulary and the tokenizer we implemented in the
previous section, SimpleTokenizerV1, to support two new tokens, <|unk|> and
<|endoftext|>.

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary.

Furthermore, we add an <|endoftext|> token between
unrelated texts.

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source.

</div>



<div class="alert alert-block alert-success">

Let's now modify the vocabulary to include these two special tokens, <|unk|> and
<|endoftext|>, by adding these to the list of all unique words that we created in the
previous section:
</div>

In [None]:
# Create a sorted list of unique tokens from the preprocessed list
# 1. `set(preprocessed)` creates a set of unique tokens from the `preprocessed` list.
# 2. `list(set(preprocessed))` converts the set back into a list.
# 3. `sorted()` sorts the list of unique tokens alphabetically.
all_tokens = sorted(list(set(preprocessed)))

# Add special tokens to the list of all tokens
# 1. `<|endoftext|>` is a special token used to mark the end of a text sequence.
# 2. `<|unk|>` is a special token used to represent unknown or out-of-vocabulary tokens.
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

# Create a vocabulary dictionary that maps each token to a unique integer
# 1. `enumerate(all_tokens)` generates pairs of (index, token) for each token in `all_tokens`.
#    - `index` is the position of the token in the sorted list (starting from 0).
#    - `token` is the word or special token itself.
# 2. A dictionary comprehension is used to create the `vocab` dictionary:
#    - The key is the `token` (word or special token).
#    - The value is the `integer` (index).
vocab = {token: integer for integer, token in enumerate(all_tokens)}

In [None]:
len(vocab.items())


305

<div class="alert alert-block alert-info">
    
Based on the output above, the new vocabulary size is 305 (the
vocabulary size in the previous section was 303).

</div>



<div class="alert alert-block alert-success">

As an additional quick check, let's print the last 5 entries of the updated vocabulary:
</div>

In [None]:
# Iterate over the last 5 items in the `vocab` dictionary
# 1. `vocab.items()` returns a view of the dictionary's (token, integer) pairs.
# 2. `list(vocab.items())` converts the view into a list of (token, integer) tuples.
# 3. `[-5:]` slices the list to get the last 5 items.
for item in list(vocab.items())[-5:]:
    # Print the current item (tuple of (token, integer))
    print(item)

('£88', 300)
('€100', 301)
('€94', 302)
('<|endoftext|>', 303)
('<|unk|>', 304)


<div class="alert alert-block alert-success">

A simple text tokenizer that handles unknown words</div>



<div class="alert alert-block alert-info">
    
Step 1: Replace unknown words by <|unk|> tokens
    
Step 2: Replace spaces before the specified punctuations

</div>


In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        """
        Initialize the tokenizer with a vocabulary mapping.

        Args:
            vocab (dict): A dictionary mapping tokens (strings) to unique integers (IDs).
        """
        # Map tokens to IDs (e.g., {"Hello": 1, "world": 2})
        self.str_to_int = vocab

        # Create the inverse mapping: IDs to tokens (e.g., {1: "Hello", 2: "world"})
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        """
        Convert a text string into a list of token IDs.

        Args:
            text (str): The input text to encode.

        Returns:
            list: A list of token IDs corresponding to the input text.
        """
        import re

        # Step 1: Preprocess the text
        # Split the text into tokens and separators (like punctuation and spaces)
        # using a regular expression.
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        # Step 2: Clean the preprocessed list
        # Remove leading/trailing spaces from each item and filter out empty strings.
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        # Step 3: Handle unknown tokens
        # Replace any token not in the vocabulary with the special "<|unk|>" token.
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        # Step 4: Convert tokens to IDs
        # Use the str_to_int mapping to convert each token to its corresponding ID.
        ids = [self.str_to_int[s] for s in preprocessed]

        # Return the list of token IDs
        return ids

    def decode(self, ids):
        """
        Convert a list of token IDs back into a text string.

        Args:
            ids (list): A list of token IDs to decode.

        Returns:
            str: The decoded text string.
        """
        # Step 1: Convert IDs back to tokens
        # Use the int_to_str mapping to convert each ID to its corresponding token.
        tokens = [self.int_to_str[i] for i in ids]

        # Step 2: Join tokens into a single string with spaces in between
        text = " ".join(tokens)

        # Step 3: Postprocess the text
        # Fix spacing issues around punctuation marks (e.g., "Hello , world !" -> "Hello, world!")
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

        # Return the decoded text
        return text

In [None]:
# Create a tokenizer instance using the vocabulary
tokenizer = SimpleTokenizerV2(vocab)

# Define two example text strings
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

# Combine the two text strings into a single string, separated by the special token "<|endoftext|>"
# The `join()` method inserts "<|endoftext|>" between `text1` and `text2`.
text = " <|endoftext|> ".join((text1, text2))

# Print the combined text
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [None]:
tokenizer.encode(text)


[304, 3, 304, 304, 304, 304, 304, 303, 74, 268, 304, 304, 230, 268, 304, 4]

In [None]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, <|unk|> <|unk|> <|unk|> <|unk|> <|unk|> <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.'


<div class="alert alert-block alert-info">
    
Based on comparing the de-tokenized text above with the original input text, we know that
the training dataset, the CR7 text, did not contain the words "Hello" and "palace," nor
any of the other tokens that were replaced by <|unk|> (including the question mark).

</div>
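One quick way to quantify how much of the input fell outside the vocabulary is to count how often the <|unk|> ID appears. The sketch below reuses the ID list printed above; `UNK_ID` and `unknown_fraction` are names introduced here for illustration.

```python
# Token IDs copied from the encode output above
ids = [304, 3, 304, 304, 304, 304, 304, 303, 74, 268, 304, 304, 230, 268, 304, 4]

# ID of the <|unk|> token in the extended vocabulary (see the last vocabulary entries above)
UNK_ID = 304

# Count unknown tokens and express them as a fraction of all tokens
unknown_count = ids.count(UNK_ID)
unknown_fraction = unknown_count / len(ids)

print(f"{unknown_count} of {len(ids)} tokens are unknown ({unknown_fraction:.0%})")
```

A high fraction like this one suggests that the vocabulary is too small for the text being encoded.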


<div class="alert alert-block alert-warning">

So far, we have discussed tokenization as an essential step in processing text as input to
LLMs. Depending on the LLM, some researchers also consider additional special tokens such
as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It
signifies to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text,
and is especially useful when concatenating multiple unrelated texts,
similar to <|endoftext|>. For instance, when combining two different
Wikipedia articles or books, the [EOS] token indicates where one article
ends and the next one begins.

[PAD] (padding): When training LLMs with batch sizes larger than one,
the batch might contain texts of varying lengths. To ensure all texts have
the same length, the shorter texts are extended or "padded" using the
[PAD] token, up to the length of the longest text in the batch.

</div>
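As a minimal sketch of padding, the function below extends every sequence in a batch to the length of the longest one. The helper `pad_batch`, the value of `PAD_ID`, and the example IDs are assumptions made for illustration; real training code usually also tracks which positions are padding so they can be ignored.

```python
def pad_batch(batch, pad_id):
    """Hypothetical helper: right-pad each ID sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Assumed ID reserved for the [PAD] token
PAD_ID = 0

# Three tokenized texts of different lengths (made-up IDs)
batch = [[5, 8, 2], [7], [4, 9]]

padded = pad_batch(batch, PAD_ID)
print(padded)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```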


<div class="alert alert-block alert-warning">

Note that the tokenizer used for GPT models does not need any of the tokens mentioned
above; it only uses an <|endoftext|> token for simplicity.

</div>

<div class="alert alert-block alert-warning">

The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary
words. Instead, GPT models use a byte pair encoding tokenizer, which breaks
down words into subword units.
</div>
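To give a feel for how subword units arise, here is a toy, character-level sketch of the core byte pair encoding idea: repeatedly merge the most frequent adjacent pair of symbols. This is not the GPT tokenizer (which works on bytes and is trained on a large corpus); the helper names and the sample string are assumptions for illustration.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters
symbols = list("lowlowlower")

# Apply two merge steps
for _ in range(2):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)

print(symbols)  # ['low', 'low', 'low', 'e', 'r']
```

Because unseen words can always fall back to smaller units (ultimately single characters or bytes), a tokenizer built this way never needs an <|unk|> token.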