<a href="https://colab.research.google.com/github/drpetros11111/Bert-Transformers/blob/main/ch02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 2: Working with Text Data

Packages that are being used in this notebook:

In [None]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.5.1
tiktoken version: 0.7.0


- This chapter covers data preparation and sampling to get input data "ready" for the LLM

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/01.webp?timestamp=1" width="500px">

## 2.1 Understanding word embeddings

- No code in this section

- There are many forms of embeddings; we focus on text embeddings in this book

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/02.webp" width="500px">

- LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions)
- Since we can't visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/03.webp" width="300px">

## 2.2 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/04.webp" width="300px">

- Load raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [None]:
import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)

- (If you encounter an `ssl.SSLCertVerificationError` when executing the previous code cell, it might be due to using an outdated Python version; you can find [more information here on GitHub](https://github.com/rasbt/LLMs-from-scratch/pull/403))

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above
- The following regular expression will split on whitespaces

In [None]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


# Explaining the Code for Text Tokenization
This code snippet demonstrates a basic approach to tokenizing text.

Tokenization is the process of breaking down a sequence of text into smaller units called tokens.

These tokens can be words, subwords, or even individual characters.




---


Here's a breakdown of the code:

    import re

This line imports the re module, which provides support for regular expressions in Python.

Regular expressions are powerful tools for pattern matching within strings.

    text = "Hello, world. This, is a test."

This line creates a string variable called text and assigns it a sample sentence.

    result = re.split(r'(\s)', text)

This is the core of the tokenization in this specific example.

- `re.split()` is a function from the `re` module that splits a string at each occurrence of a pattern.
- `r'(\s)'` is the regular expression pattern:
  - `r''` marks a raw string literal. Raw strings are common in regular expressions because Python then treats backslashes (`\`) literally rather than as escape characters. Many regex sequences start with a backslash (like `\s`), and a raw string ensures the backslash reaches the regular expression engine unchanged.
  - `(...)` creates a capturing group. In `re.split()`, wrapping part of the pattern in parentheses means the matched text is also included in the resulting list, in addition to serving as a delimiter.
  - `\s` is a special sequence that matches any whitespace character, including:
    - Space (` `)
    - Tab (`\t`)
    - Newline (`\n`)
    - Carriage return (`\r`)
    - Form feed (`\f`)
    - Vertical tab (`\v`)

So, given the pattern `r'(\s)'` and a string, `re.split()` will:

- Find every whitespace character (`\s`) in the string.
- Split the string at each of those positions.
- Include each whitespace character as a separate item in the result, because `\s` sits inside a capturing group.

This is why the output above includes the spaces as separate elements in the list.

If the parentheses were omitted (`re.split(r'\s', text)`), the spaces would still act as delimiters, but they would not appear in the resulting list.
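To make the effect of the capturing group concrete, here is a minimal sketch that runs both variants on a short string. The variable names (`sample`, `with_group`, `without_group`) are chosen for illustration only.

```python
import re

sample = "Hello, world."

# With a capturing group, the matched whitespace is kept in the result.
with_group = re.split(r'(\s)', sample)

# Without the group, whitespace only acts as a delimiter and is dropped.
without_group = re.split(r'\s', sample)

print(with_group)     # ['Hello,', ' ', 'world.']
print(without_group)  # ['Hello,', 'world.']
```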

A regular expression, often shortened to regex or regexp, is a sequence of characters that defines a search pattern.

These patterns are primarily used for string matching and manipulation, including operations like:

- Searching: Finding specific patterns within a larger string.
- Replacing: Substituting parts of a string that match a pattern with something else.
- Validating: Checking whether a string conforms to a specific format (like an email address or phone number).
- Parsing: Extracting structured data from unstructured text.

Think of a regular expression as a mini-language for describing patterns in text. It uses a combination of literal characters and special characters (called metacharacters) to create flexible and powerful patterns.


For example:

- The regular expression `hello` matches the exact string "hello".
- The regular expression `h.llo` matches "hello", "hallo", "hxllo", and so on, because the `.` metacharacter matches any single character (except a newline).
- The regular expression `\d+` matches one or more digits (`\d` matches a digit, and `+` matches one or more of the preceding element), so it could match "1", "123", "98765", and so on.
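The example patterns above can be tried directly with Python's `re` module. The following is a small sketch; the test strings are made up for illustration.

```python
import re

# A literal pattern matches only the exact text.
literal_match = re.fullmatch(r'hello', "hello") is not None

# "." matches any single character except a newline,
# so "hllo" (one character too short) is rejected.
candidates = ["hello", "hallo", "hxllo", "hllo"]
dot_matches = [w for w in candidates if re.fullmatch(r'h.llo', w)]

# "\d+" matches runs of one or more digits.
digit_runs = re.findall(r'\d+', "order 123 shipped 98765 items in 1 box")

print(literal_match)  # True
print(dot_matches)    # ['hello', 'hallo', 'hxllo']
print(digit_runs)     # ['123', '98765', '1']
```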

Regular expressions are a fundamental tool in text processing and are supported by many programming languages (like Python, Java, JavaScript, etc.) and text editors.

They can be incredibly powerful for complex pattern matching, but they can also be quite concise and sometimes difficult to read for beginners.

Regular expressions are rooted in formal language theory and automata theory. They define patterns in text through a set of rules and syntax, not through statistical properties or probabilities of character sequences.

Here's why they are not statistical:

- Deterministic pattern matching: A regular expression precisely defines a pattern. A given string either matches the pattern or it doesn't; no probability is involved in the match itself.
- Rule-based: The behavior of a regular expression is determined by the specific characters and metacharacters in the pattern, following defined rules (e.g., `.` matches any single character, `*` matches zero or more of the preceding element).
- No learning from data: A regular expression is static unless you explicitly modify the pattern string. It doesn't learn patterns from a corpus of text the way statistical models do.

In contrast, statistical methods for text analysis often involve:

- Probabilistic models: Estimating the likelihood of words or sequences from observed frequencies in a dataset (e.g., n-gram models).
- Machine learning: Training models on large datasets to identify patterns, which often involves statistical inference and optimization.
- Variability and generalization: Statistical models can often handle variations and generalize to unseen data based on learned distributions, which is not the core function of regular expressions.

While regular expressions can be used together with statistical methods (e.g., using regex to extract features that then feed a statistical model), the regular expression language itself is a non-statistical method for pattern matching.

---

To recap the pattern used in the code:

- `r''` marks a raw string, which is commonly used for regular expressions to avoid issues with backslashes.
- `\s` is a special sequence that matches any whitespace character (space, tab, newline, etc.).
- The parentheses `()` around `\s` tell `re.split()` to include the matched whitespace characters in the resulting list. Without them, the whitespace characters would act as delimiters but not appear in the output.
- The `text` variable is the string that will be split.

The result of this operation is a list of strings: the original text split at each whitespace character, with the whitespace characters themselves included as separate items.

    print(result)

This line prints the result list to the console.

This will show you the list of tokens generated by splitting the original text based on whitespace.

In [None]:
print(result)

- We don't only want to split on whitespaces but also commas and periods, so let's modify the regular expression to do that as well

In [None]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


- As we can see, this creates empty strings, let's remove them

In [None]:
# Keep only items that are non-empty after stripping whitespace
# (the kept items themselves are not modified here).
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


- This looks pretty good, but let's also handle other types of punctuation, such as colons, question marks, quotation marks, and double dashes

In [None]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


- This is pretty good, and we are now ready to apply this tokenization to the raw text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/05.webp" width="350px">

In [None]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


# Complex Regular Expression
This line uses the `re.split()` function again, but with a more complex regular expression pattern: `r'([,.:;?_!"()\']|--|\s)'`.

- `r''`: Again, this marks a raw string.
- `(...)`: A capturing group, so whatever the pattern inside the parentheses matches is included in the resulting list.
- `[,.:;?_!"()\']`: Square brackets define a character set, which matches any single character listed inside the brackets. Here that is a comma (`,`), a period (`.`), a colon (`:`), a semicolon (`;`), a question mark (`?`), an underscore (`_`), an exclamation mark (`!`), a double quote (`"`), an opening parenthesis (`(`), a closing parenthesis (`)`), or a single quote (`'`). The backslash before the single quote keeps it from ending the Python string literal.
- `|`: The OR operator in regular expressions. It means "match the pattern on the left OR the pattern on the right".
- `--`: Matches the literal two characters `--`.
- `|`: Another OR operator.
- `\s`: Matches any whitespace character, as discussed before.

So, the entire pattern `([,.:;?_!"()\']|--|\s)` tells `re.split()` to split the `raw_text` string whenever it encounters any of the following:

- Any single character from the set `,.:;?_!"()'`.
- The literal string `--`.
- Any whitespace character.

Because the entire pattern is within a capturing group `()`, the delimiters (the punctuation, `--`, or whitespace) are also included as separate items in the `preprocessed` list.

    preprocessed = [item.strip() for item in preprocessed if item.strip()]

This line uses a list comprehension to refine the `preprocessed` list. It has three parts:

- `for item in preprocessed`: Iterates through each item in the list generated by the `re.split()` call.
- `if item.strip()`: A filter. `item.strip()` removes leading and trailing whitespace from the item. If the stripped item is a non-empty string, the condition is truthy and the item is kept. Items that are empty or consist only of whitespace become `''`, which is falsy, so they are filtered out.
- `item.strip()`: For the items that pass the filter, the stripped string (with leading and trailing whitespace removed) is placed in the new `preprocessed` list.

This step matters because `re.split()` with a capturing group produces empty strings whenever two delimiters are adjacent (for example, a comma followed by a space) or a delimiter sits at the very start or end of the text.

The `if item.strip()` part removes these empty strings along with the whitespace-only items.

    print(preprocessed[:30])

Finally, this line prints the first 30 items of the preprocessed list.

This gives you a peek at the beginning of the tokenized text to see how the splitting and filtering worked.
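The interplay of splitting and filtering is easier to see on a short string. The sketch below uses a made-up sample sentence (`sample`) and shows the raw pieces before and after the filter.

```python
import re

sample = "Wait--really? Yes."

# Splitting keeps the delimiters and leaves empty strings where two
# delimiters are adjacent ("?" then " ") and after the final ".".
raw_pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# The filter drops empty and whitespace-only pieces.
tokens = [item.strip() for item in raw_pieces if item.strip()]

print(raw_pieces)  # ['Wait', '--', 'really', '?', '', ' ', 'Yes', '.', '']
print(tokens)      # ['Wait', '--', 'really', '?', 'Yes', '.']
```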

- Let's calculate the total number of tokens

In [None]:
print(len(preprocessed))

4690


## 2.3 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/06.webp" width="500px">

- From these tokens, we can now build a vocabulary that consists of all the unique tokens

In [None]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1130


# Converting Tokens to Token IDs
This section of the code focuses on preparing the tokenized text for use in a large language model (LLM) by converting the text tokens into numerical identifiers, or token IDs. LLMs typically process numerical data, not raw text.

---

First, let's look at the code block:

    all_words = sorted(set(preprocessed))
    vocab_size = len(all_words)

    print(vocab_size)

---

# Building a Vocabulary

The goal of this code is to create a vocabulary from the `preprocessed` list, which contains all the individual tokens (words and punctuation) extracted from the text.

- `set(preprocessed)`: Converts the list into a set, which is a collection of unique items. This removes any duplicate tokens and leaves only the unique words and punctuation marks present in the text. This collection of unique tokens forms our vocabulary.
- `sorted(...)`: Sorts the unique tokens. Python compares strings by Unicode code point, so punctuation comes first, then capitalized words, then lowercase words. Sorting isn't strictly necessary for building a vocabulary, but it makes the token order (and therefore the IDs assigned later) reproducible, since a set has no guaranteed order.
- `all_words = ...`: Assigns the sorted list of unique tokens to the variable `all_words`, which now holds every unique token found in the original text.
- `vocab_size = len(all_words)`: Uses `len()` to count the items in `all_words`. This count is the total number of unique tokens in our vocabulary, referred to as `vocab_size`.
- `print(vocab_size)`: Prints the calculated `vocab_size` to the console.

After executing this code, the `vocab_size` variable contains the number of distinct tokens identified in the text, which is an important value for subsequent steps like creating an embedding layer for the LLM.
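As a small illustration of the steps above, the following sketch builds a vocabulary list from a handful of made-up tokens (`tokens` is a hypothetical stand-in for `preprocessed`). Note how punctuation and capitalized words sort before lowercase words.

```python
# Hypothetical token list standing in for `preprocessed`.
tokens = ["the", "cat", ",", "The", "dog", ",", "the", "end", "."]

# set() removes duplicates; sorted() gives a reproducible order.
unique_tokens = sorted(set(tokens))
vocab_size = len(unique_tokens)

print(unique_tokens)  # [',', '.', 'The', 'cat', 'dog', 'end', 'the']
print(vocab_size)     # 7
```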

In [None]:
vocab = {token:integer for integer,token in enumerate(all_words)}

# Creating a Vocabulary Dictionary
This line of code creates a Python dictionary called `vocab`. Dictionaries are useful for storing key-value pairs.

    vocab = {token:integer for integer,token in enumerate(all_words)}

Let's break down what this code does:

- `all_words`: The list containing all the unique tokens from the text processed earlier. The elements in this list are strings (the tokens).
- `enumerate(all_words)`: The built-in `enumerate()` function takes an iterable (like our `all_words` list) and returns an iterator that produces pairs of index and value. So, for each token in `all_words`, `enumerate` provides a number (starting from 0) and the token itself. For example, if `all_words` is `['.', ',', 'Hello', 'world']`, `enumerate` yields `(0, '.')`, then `(1, ',')`, then `(2, 'Hello')`, and so on.
- `for integer, token in enumerate(all_words)`: The loop part of the dictionary comprehension. It iterates through the pairs generated by `enumerate(all_words)`. In each iteration, the index is assigned to the variable `integer`, and the token is assigned to the variable `token`.
- `token:integer`: The key-value pair added to the dictionary in each iteration. The token (the string representing the unique word or punctuation) becomes the key, and the integer (the numerical index assigned by `enumerate`) becomes the value.
- `{...}`: The curly braces indicate that we are creating a dictionary using a dictionary comprehension.

---

In summary, this code iterates through the sorted list of unique tokens (`all_words`) and assigns a unique integer ID to each token.

The resulting `vocab` dictionary maps each token (string) to its corresponding integer ID.

This is a crucial step in preparing text data for machine learning models, as models typically work with numerical representations rather than raw text.

This dictionary will be used later to convert sequences of tokens into sequences of integer IDs, which can then be processed by embedding layers.
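Here is the same construction on the four-token example list mentioned above, as a quick sketch. The names `all_words_demo` and `vocab_demo` are illustrative and not part of the notebook's code.

```python
# Tiny stand-in for `all_words` (not sorted, purely for illustration).
all_words_demo = ['.', ',', 'Hello', 'world']

# enumerate() yields (index, token) pairs starting at 0.
pairs = list(enumerate(all_words_demo))

# The comprehension flips each pair so the token becomes the key.
vocab_demo = {token: integer for integer, token in pairs}

print(pairs)       # [(0, '.'), (1, ','), (2, 'Hello'), (3, 'world')]
print(vocab_demo)  # {'.': 0, ',': 1, 'Hello': 2, 'world': 3}
```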

- Below are the first 51 entries in this vocabulary (IDs 0 through 50):

In [None]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/07.webp?123" width="500px">

- Putting it now all together into a tokenizer class

In [None]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


# SimpleTokenizerV1 Class
This code defines a Python class called `SimpleTokenizerV1`.

This class is a custom tokenizer designed to convert text into a sequence of numerical IDs (encoding) and convert those IDs back into text (decoding).

It uses a predefined vocabulary to perform these conversions.

## `__init__` Method

    class SimpleTokenizerV1:
        def __init__(self, vocab):
            self.str_to_int = vocab
            self.int_to_str = {i:s for s,i in vocab.items()}

The `__init__` method is the constructor for the `SimpleTokenizerV1` class. It's called when you create a new instance of the class.

- `self`: Refers to the instance of the class being created.
- `vocab`: An input parameter that is expected to be a dictionary. This dictionary represents the vocabulary, where keys are the text tokens (strings) and values are their corresponding integer IDs.

Inside the `__init__` method:

- `self.str_to_int = vocab`: Stores the input `vocab` dictionary in an instance variable called `str_to_int`. This dictionary is used to look up the integer ID for a given text token during encoding.
- `self.int_to_str = {i:s for s,i in vocab.items()}`: Creates a new dictionary called `int_to_str` and stores it as an instance variable. This dictionary is the inverse of the `vocab` dictionary. It's created using a dictionary comprehension that iterates through the items (key-value pairs) of `vocab`. For each item `(s, i)`, where `s` is the string token and `i` is the integer ID, it creates a new key-value pair `i:s`. This dictionary is used to look up the text token for a given integer ID during decoding.

---

In essence, the `__init__` method initializes the tokenizer with a vocabulary and creates two dictionaries: one to map strings to integers and another to map integers back to strings.
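The inversion step can be checked in isolation. This is a minimal sketch with a hypothetical four-entry vocabulary (`vocab_demo`); it relies on every ID being unique, which holds when the IDs come from `enumerate`.

```python
# Hypothetical miniature vocabulary: token -> ID.
vocab_demo = {"!": 0, ",": 1, "Hello": 2, "world": 3}

# Swap keys and values to get the ID -> token mapping.
int_to_str = {i: s for s, i in vocab_demo.items()}

# Looking up a token's ID and mapping it back returns the same token.
round_trip_ok = all(int_to_str[i] == s for s, i in vocab_demo.items())

print(int_to_str)     # {0: '!', 1: ',', 2: 'Hello', 3: 'world'}
print(round_trip_ok)  # True
```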

## `encode` Method

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

The `encode` method takes a string of text as input and converts it into a list of integer IDs based on the tokenizer's vocabulary.

- `self`: Refers to the instance of the class.
- `text`: The input string to be encoded.

Inside the `encode` method:

- `preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)`: Uses the `re.split()` function from the `re` module (which must already be imported) to split the input text into a list of tokens. The regular expression pattern `r'([,.:;?_!"()\']|--|\s)'` is used as the delimiter. It splits the text at various punctuation marks (comma, period, colon, etc.), double hyphens (`--`), or whitespace characters (`\s`). The parentheses around the pattern create a capturing group, which means the delimiters themselves are also included in the resulting `preprocessed` list.
- `preprocessed = [item.strip() for item in preprocessed if item.strip()]`: A list comprehension that cleans up the `preprocessed` list. It iterates through each item in the list, and `item.strip()` removes leading and trailing whitespace from the item. The `if item.strip()` part acts as a filter, keeping the item only if it is not an empty string after stripping. This removes the empty and whitespace-only strings produced by the splitting process.
- `ids = [self.str_to_int[s] for s in preprocessed]`: Another list comprehension. It iterates through the cleaned list of text tokens (`s`). For each token, it looks up the corresponding integer ID in the `self.str_to_int` dictionary (the vocabulary mapping strings to integers) and adds that ID to a new list called `ids`.
- `return ids`: Returns the final list of integer IDs, which is the encoded representation of the input text.
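The encoding steps can also be sketched as a standalone function, which makes them easy to try on a toy vocabulary. The function `encode` and the dictionary `vocab_demo` below are illustrative stand-ins, not the notebook's class.

```python
import re

# Hypothetical miniature vocabulary for illustration only.
vocab_demo = {",": 0, ".": 1, "Hello": 2, "world": 3}

def encode(text, str_to_int):
    """Split text into tokens and map each token to its integer ID."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    # Raises KeyError if a token is missing from the vocabulary.
    return [str_to_int[t] for t in tokens]

ids = encode("Hello, world.", vocab_demo)
print(ids)  # [2, 0, 3, 1]
```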
## `decode` Method

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

The `decode` method takes a list of integer IDs as input and converts it back into a text string using the tokenizer's vocabulary.

- `self`: Refers to the instance of the class.
- `ids`: The input list of integer IDs to be decoded.

Inside the `decode` method:

- `text = " ".join([self.int_to_str[i] for i in ids])`: Converts the list of integer IDs back into a string. A list comprehension iterates through each integer ID (`i`) in the `ids` list and looks up the corresponding text token in the `self.int_to_str` dictionary (the vocabulary mapping integers back to strings). The `" ".join(...)` call then concatenates these text tokens into a single string, using a space character `" "` as the separator between tokens. This initially reconstructs the text with spaces between every token, including spaces before punctuation.
- `text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)`: Uses the `re.sub()` function to clean up the spacing before punctuation.
  - `r'\s+([,.?!"()\'])'`: The regular expression pattern to search for.
    - `\s+`: Matches one or more whitespace characters.
    - `([,.?!"()\'])`: A capturing group that matches any single character from the set: comma, period, question mark, exclamation mark, double quote, opening parenthesis, closing parenthesis, or single quote.
  - `r'\1'`: The replacement string. `\1` refers to the content of the first capturing group in the pattern.

  The `re.sub()` function finds all occurrences of the pattern (one or more whitespace characters followed by one of the specified punctuation marks) and replaces them with just the punctuation mark itself (captured in group 1). This removes the spaces before the punctuation marks that were introduced by the `" ".join()` operation.
- `return text`: Returns the reconstructed and cleaned text string.
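The two decoding steps can be observed on a plain list of tokens, without any vocabulary lookup. In this sketch the `tokens` list is made up; it also shows why decoded text can differ from the input, since only the space before punctuation is removed, not the space after it.

```python
import re

# Tokens as they would come out of the ID -> string lookup.
tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', '"']

# Step 1: join everything with single spaces.
joined = " ".join(tokens)

# Step 2: remove whitespace that precedes the listed punctuation marks.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)   # " It ' s the last , you know "
print(cleaned)  # " It' s the last, you know"
```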

---

In summary, the `SimpleTokenizerV1` class provides basic text tokenization functionality using a predefined vocabulary, allowing for the conversion of text to numerical IDs and back.

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/08.webp?123" width="500px">

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers can later be embedded and used as input to the LLM

In [None]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


- We can decode the integers back into text

In [None]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [None]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

## 2.4 Adding special context tokens

- It's useful to add some "special" tokens for unknown words and to denote the end of a text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/09.webp?123" width="500px">

- Some tokenizers use special tokens to help the LLM with additional context
- Some of these special tokens are
  - `[BOS]` (beginning of sequence) marks the beginning of text
  - `[EOS]` (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books, and so on)
  - `[PAD]` (padding) is used when we train LLMs with a batch size greater than 1 (a batch may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)
  - `[UNK]` (unknown) represents words that are not included in the vocabulary

- Note that GPT-2 does not need any of these tokens mentioned above but only uses an `<|endoftext|>` token to reduce complexity
- The `<|endoftext|>` is analogous to the `[EOS]` token mentioned above
- GPT also uses the `<|endoftext|>` token for padding (since we typically use a mask when training on batched inputs, we would not attend to padded tokens anyway, so it does not matter what these tokens are)
- GPT-2 does not use an `<UNK>` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks down words into subword units, which we will discuss in a later section
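To illustrate the padding idea described above, here is a minimal sketch in plain Python. The token IDs, the padding ID `pad_id`, and the mask layout are all assumptions made for this example; they are not values from GPT-2 or from this notebook's vocabulary.

```python
# Hypothetical token-ID sequences of different lengths.
batch = [[5, 12, 7], [8, 3], [4, 9, 1, 6]]
pad_id = 0  # assumed ID of the padding token

# Pad every sequence on the right up to the longest length.
max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# A mask marks real tokens (1) versus padding (0), so the padded
# positions can be ignored regardless of which ID fills them.
mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(padded)  # [[5, 12, 7, 0], [8, 3, 0, 0], [4, 9, 1, 6]]
print(mask)    # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```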



- We use the `<|endoftext|>` tokens between two independent sources of text:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/10.webp" width="500px">

- Let's see what happens if we tokenize the following text:

In [None]:
tokenizer = SimpleTokenizerV1(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)

KeyError: 'Hello'

- The above produces an error because the word "Hello" is not contained in the vocabulary
- To deal with such cases, we can add special tokens like `"<|unk|>"` to the vocabulary to represent unknown words
- Since we are already extending the vocabulary, let's add another token called `"<|endoftext|>"`, which is used in GPT-2 training to denote the end of a text (and it's also used between concatenated texts, for example when our training dataset consists of multiple articles, books, etc.)
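The failure and the fallback idea can be sketched on a toy vocabulary. Everything here (`vocab_demo`, the token list, and the use of `dict.get`) is illustrative; `dict.get` is just one possible way to express the fallback.

```python
# Hypothetical miniature vocabulary that already contains "<|unk|>".
vocab_demo = {"do": 0, "you": 1, "like": 2, "tea": 3, "<|unk|>": 4}

tokens = ["Hello", "do", "you", "like", "tea"]

# A plain dictionary lookup fails on the first out-of-vocabulary token.
try:
    ids = [vocab_demo[t] for t in tokens]
except KeyError as err:
    missing = err.args[0]
    print("Unknown token:", missing)  # Unknown token: Hello

# Falling back to the "<|unk|>" ID keeps encoding from failing.
unk_id = vocab_demo["<|unk|>"]
ids = [vocab_demo.get(t, unk_id) for t in tokens]
print(ids)  # [4, 0, 1, 2, 3]
```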

In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [None]:
len(vocab.items())

1132

In [None]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


- We also need to adjust the tokenizer accordingly so that it knows when and how to use the new `<|unk|>` token

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

### The `SimpleTokenizerV2` class

The code above defines a Python class named `SimpleTokenizerV2`.

This class converts text into a sequence of integer IDs (encoding) and converts those IDs back into text (decoding).

It builds on `SimpleTokenizerV1` by adding support for an "unknown" token (`<|unk|>`) that stands in for words that are not present in the vocabulary.

    class SimpleTokenizerV2:
        def __init__(self, vocab):
            self.str_to_int = vocab
            self.int_to_str = { i:s for s,i in vocab.items()}

---------------------------------------
### The `__init__` method

The `__init__` method is the constructor of the `SimpleTokenizerV2` class. It is called whenever a new instance of the class is created.

- `self`: refers to the instance of the class being created
- `vocab`: an input parameter that is expected to be a dictionary representing the vocabulary, where the keys are the text tokens (strings) and the values are their corresponding integer IDs

Inside the `__init__` method:

- `self.str_to_int = vocab` stores the input `vocab` dictionary in an instance variable called `str_to_int`; this dictionary is used to look up the integer ID for a given text token during encoding
- `self.int_to_str = { i:s for s,i in vocab.items()}` creates the inverse of the `vocab` dictionary and stores it in the instance variable `int_to_str`; the dictionary comprehension iterates over the items `(s, i)` of `vocab`, where `s` is the string token and `i` is the integer ID, and adds the pair `i:s` to the new dictionary, which is used to look up the text token for a given integer ID during decoding

In essence, the `__init__` method initializes the tokenizer with a vocabulary and creates two dictionaries: one that maps strings to integers and one that maps integers back to strings.
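
The two mappings can be tried on a toy vocabulary. The four-entry `vocab` below is made up for illustration (the vocabulary built in this notebook has 1,132 entries); the comprehension is the same one used in `__init__`.

```python
# Toy vocabulary (hypothetical, for illustration only)
vocab = {"do": 0, "like": 1, "tea": 2, "<|unk|>": 3}

# Same dictionary comprehension as in SimpleTokenizerV2.__init__
int_to_str = {i: s for s, i in vocab.items()}

print(int_to_str)                 # {0: 'do', 1: 'like', 2: 'tea', 3: '<|unk|>'}
print(vocab[int_to_str[2]] == 2)  # True: the two lookups undo each other
```

The inversion only round-trips cleanly because every token has a unique ID; if two tokens shared an ID, the later one would overwrite the earlier one in `int_to_str`.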

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

### The `encode` method

The `encode` method takes a string of text as input and converts it into a list of integer IDs based on the tokenizer's vocabulary.

- `self`: refers to the instance of the class
- `text`: the input string to be encoded

Inside the `encode` method:

`preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)`:
This line uses the `re.split()` function from the `re` module (which is assumed to be imported) to split the input text into a list of tokens. The regular expression `r'([,.:;?_!"()\']|--|\s)'` serves as the delimiter: it matches the listed punctuation characters (comma, period, colon, and so on), a double hyphen (`--`), or a whitespace character (`\s`). The parentheses around the pattern create a capturing group, which means the delimiters themselves are also included in the resulting `preprocessed` list.

`preprocessed = [item.strip() for item in preprocessed if item.strip()]`:
This list comprehension cleans up the `preprocessed` list. `item.strip()` removes leading and trailing whitespace from each item, and the `if item.strip()` part acts as a filter that keeps an item only if it is not an empty string after stripping. This removes the empty and whitespace-only strings produced by the split.

`preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]`:
This list comprehension handles out-of-vocabulary words. For each item in the cleaned list, it checks whether the item exists as a key in the `self.str_to_int` dictionary (the vocabulary). If it does, the item itself is kept; if it does not, the string `"<|unk|>"` is used instead. The result is a new list of tokens in which unknown words are replaced by the `<|unk|>` token.

`ids = [self.str_to_int[s] for s in preprocessed]`:
This is the final list comprehension of the encoding process. It iterates through the `preprocessed` list (which now contains only vocabulary tokens, including `<|unk|>`), looks up the integer ID of each token `s` in `self.str_to_int`, and collects the IDs in a new list called `ids`.

`return ids`:
The method returns the final list of integer IDs, which is the encoded representation of the input text.
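
The four steps of `encode` can be followed on a short sentence. This is a minimal sketch with a hypothetical seven-entry vocabulary, so the IDs differ from those produced with the notebook's real `vocab`.

```python
import re

text = "Hello, do you like tea?"
# Hypothetical toy vocabulary; "Hello" is deliberately missing
vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5, "<|unk|>": 6}

# Step 1: split on punctuation, "--", or whitespace; the capturing group keeps delimiters
parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(parts)
# ['Hello', ',', '', ' ', 'do', ' ', 'you', ' ', 'like', ' ', 'tea', '?', '']

# Step 2: strip whitespace and drop empty strings
tokens = [item.strip() for item in parts if item.strip()]
print(tokens)  # ['Hello', ',', 'do', 'you', 'like', 'tea', '?']

# Step 3: replace out-of-vocabulary tokens with <|unk|>
tokens = [item if item in vocab else "<|unk|>" for item in tokens]
print(tokens)  # ['<|unk|>', ',', 'do', 'you', 'like', 'tea', '?']

# Step 4: map tokens to integer IDs
ids = [vocab[s] for s in tokens]
print(ids)     # [6, 0, 2, 5, 3, 4, 1]
```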

-------------------
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

### The `decode` method

The `decode` method takes a list of integer IDs as input and converts it back into a text string using the tokenizer's vocabulary.

- `self`: refers to the instance of the class
- `ids`: the input list of integer IDs to be decoded

Inside the `decode` method:

`text = " ".join([self.int_to_str[i] for i in ids])`:
This line converts the list of integer IDs back into a string. The list comprehension iterates through each integer ID `i` in `ids` and looks up the corresponding text token in the `self.int_to_str` dictionary (the mapping from integers back to strings). `" ".join(...)` then concatenates these tokens into a single string with a space between tokens. This initially reconstructs the text with a space between every pair of tokens, including spaces before punctuation.

`text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)`:
This line uses the `re.sub()` function to clean up the spacing around punctuation.

- `r'\s+([,.:;?!"()\'])'` is the regular expression pattern to search for; `\s+` matches one or more whitespace characters, and `([,.:;?!"()\'])` is a capturing group that matches any single character from the specified set of punctuation
- `r'\1'` is the replacement string; `\1` refers to the content of the first capturing group in the pattern (the matched punctuation character)

`re.sub()` finds all occurrences of the pattern (one or more whitespace characters followed by one of the specified punctuation marks) and replaces each with just the punctuation mark. This removes the spaces before punctuation that were introduced by `" ".join()`.

`return text`:
The method returns the reconstructed and cleaned text string.
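
The two decoding steps can be checked on a small example. The `int_to_str` mapping below is a hypothetical toy mapping, not the notebook's real vocabulary.

```python
import re

# Hypothetical toy mapping from IDs back to tokens
int_to_str = {0: ",", 1: "?", 2: "do", 3: "like", 4: "tea", 5: "you", 6: "<|unk|>"}
ids = [6, 0, 2, 5, 3, 4, 1]

# Step 1: join tokens with single spaces (this also puts spaces before punctuation)
joined = " ".join(int_to_str[i] for i in ids)
print(joined)   # <|unk|> , do you like tea ?

# Step 2: remove whitespace that directly precedes a punctuation character
cleaned = re.sub(r'\s+([,.:;?!"()\'])', r'\1', joined)
print(cleaned)  # <|unk|>, do you like tea?

# The same rule also removes the space before an opening parenthesis
print(re.sub(r'\s+([,.:;?!"()\'])', r'\1', "he said ( quietly )"))  # he said( quietly)
```

The last line shows a limitation of this simple rule: it cannot tell opening from closing punctuation, so decoding does not always restore the original spacing.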

--------
In summary, the `SimpleTokenizerV2` class provides basic text tokenization with a predefined vocabulary, including handling of unknown tokens, so that text can be converted to integer IDs and back.

- Let's try to tokenize text with the modified tokenizer:

In [None]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [None]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [None]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

## 2.5 BytePair encoding

- GPT-2 used BytePair encoding (BPE) as its tokenizer
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges
- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
- In this chapter, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance
- I created a notebook in the [./bytepair_encoder](../02_bonus_bytepair-encoder) that compares these two implementations side-by-side (tiktoken was about 5x faster on the sample text)
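
To make the idea of "trained BPE merges" more concrete, here is a minimal sketch of how ranked merge rules split a word into subwords. The function `bpe_segment` and the merge list `toy_merges` are hypothetical; the real GPT-2 tokenizer works on bytes and uses merge rules learned from data, so its splits will differ.

```python
def bpe_segment(word, merges):
    """Split `word` into subwords by repeatedly applying the best-ranked merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank (highest priority)
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no merge rule applies anymore
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, listed from highest to lowest priority
toy_merges = [("w", "o"), ("wo", "r"), ("wor", "d"), ("u", "n")]

print(bpe_segment("unword", toy_merges))  # ['un', 'word']
print(bpe_segment("xyz", toy_merges))     # ['x', 'y', 'z']
```

A word with no applicable merges falls back to individual characters, which is why a BPE tokenizer never needs an unknown-word token.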

In [None]:
# pip install tiktoken

In [None]:
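# Import the submodule explicitly; a bare `import importlib` does not
# guarantee that `importlib.metadata` is loaded
from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))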

tiktoken version: 0.7.0


### tiktoken

- tiktoken is an open-source library from OpenAI, which means its code is publicly available for use and modification
- It provides a BytePair encoding (BPE) tokenizer, that is, an algorithm that breaks text into smaller units (tokens); BPE handles out-of-vocabulary words by splitting them into subword units
- Its core algorithms are implemented in Rust, a programming language known for performance and memory safety, which makes tiktoken computationally efficient
- It provides the BPE encoding used by GPT-2 (loaded via `tiktoken.get_encoding("gpt2")` in this notebook), so it is a relevant tool for preparing text data for models like GPT-2
- It is faster than the original GPT-2 tokenizer implementation: the comparison notebook mentioned above found tiktoken to be about 5x faster on the sample text

In essence, tiktoken is a high-performance, open-source library from OpenAI for tokenizing text with the BPE algorithm for models like GPT-2.

In [None]:
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [None]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


- BPE tokenizers break down unknown words into subwords and individual characters:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp" width="300px">

## 2.6 Data sampling with a sliding window

- We train LLMs to generate one word at a time, so we prepare the training data such that the next word in a sequence is the target to predict:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp" width="400px">

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


- For each text chunk, we want the inputs and targets
- Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right

In [None]:
enc_sample = enc_text[50:]

In [None]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


- One by one, the predictions would look as follows:

In [None]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [None]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


- We will take care of the next-word prediction in a later chapter, after we have covered the attention mechanism
- For now, we implement a simple data loader that iterates over the input dataset and returns the inputs and targets shifted by one

- Install and import PyTorch (see Appendix A for installation tips)

In [None]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.5.1


- We use a sliding window approach, changing the position by +1:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp?123" width="500px">

- Create dataset and dataloader that extract chunks from the input text dataset

In [None]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
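
The chunking loop in `GPTDatasetV1.__init__` can be tried in isolation on a toy list of token IDs. The function `sliding_windows` below is a hypothetical stand-alone sketch that mirrors that loop but uses plain Python lists instead of tensors and a tokenizer.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunk pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10, 20))  # ten made-up token IDs: 10, 11, ..., 19

for x, y in sliding_windows(toy_ids, max_length=4, stride=4):
    print(x, "->", y)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```

Tokens at the end that cannot fill a complete window plus its shifted target (here 19 as a target, and 18 and 19 as inputs) are dropped.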

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

- Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4:

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [None]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [None]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


- An example using a stride equal to the context length (here: 4) is shown below:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/14.webp" width="500px">

- We can also create batched outputs
- Note that we increase the stride here so that we don't have overlaps between the batches, since more overlap could lead to increased overfitting

In [None]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
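
The effect of the stride on the dataset size can be worked out with a little arithmetic. The sketch below assumes the 5,145 token IDs reported for "the-verdict.txt" earlier in this section; `count_samples` is a hypothetical helper that counts the windows produced by the loop in `GPTDatasetV1`.

```python
def count_samples(n_tokens, max_length, stride):
    """Number of windows produced by range(0, n_tokens - max_length, stride)."""
    return len(range(0, n_tokens - max_length, stride))


n_tokens = 5145   # token count printed earlier for "the-verdict.txt"
max_length = 4
batch_size = 8

for stride in (1, 4):
    n_samples = count_samples(n_tokens, max_length, stride)
    overlap = max(0, max_length - stride)  # input tokens shared by consecutive windows
    n_batches = n_samples // batch_size    # full batches only, as with drop_last=True
    print(f"stride={stride}: samples={n_samples}, overlap={overlap}, batches={n_batches}")
# stride=1: samples=5141, overlap=3, batches=642
# stride=4: samples=1286, overlap=0, batches=160
```

With a stride of 1, consecutive inputs share three of their four tokens; with a stride equal to `max_length`, every token appears in exactly one input chunk.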


## 2.7 Creating token embeddings

- The data is already almost ready for an LLM
- As a last step, we embed the tokens in a continuous vector representation using an embedding layer
- Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp" width="400px">

- Suppose we have the following four input examples with input ids 2, 3, 5, and 1 (after tokenization):

In [None]:
input_ids = torch.tensor([2, 3, 5, 1])

- For the sake of simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:

In [None]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- This would result in a 6x3 weight matrix:

In [None]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


- For those who are familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully-connected layer, which is described in the supplementary code in [./embedding_vs_matmul](../03_bonus_embedding-vs-matmul)
- Because the embedding layer is just a more efficient implementation that is equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation
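
The equivalence can be verified by hand in plain Python. This sketch copies the weight values printed above (rounded to four decimals) into a nested list; `one_hot` and `vec_matmul` are hypothetical helpers written for illustration, not PyTorch functions.

```python
# Weight matrix copied from the printed embedding_layer.weight (6 x 3, rounded)
W = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def vec_matmul(vec, matrix):
    """Row vector times matrix, written out with plain loops."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]


token_id = 3
via_matmul = vec_matmul(one_hot(token_id, len(W)), W)  # one-hot encoding + matmul
via_lookup = W[token_id]                               # what the embedding layer does

print(via_matmul == via_lookup)  # True
```

The one-hot product multiplies every row except one by zero, so the lookup obtains the same result without the wasted multiplications.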

- To convert a token with id 3 into a 3-dimensional vector, we do the following:

In [None]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


- Note that the above is the 4th row in the `embedding_layer` weight matrix
- To embed all four `input_ids` values above, we do

In [None]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


- An embedding layer is essentially a look-up operation:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp?123" width="500px">

- **You may be interested in the bonus content comparing embedding layers with regular linear layers: [../03_bonus_embedding-vs-matmul](../03_bonus_embedding-vs-matmul)**

## 2.8 Encoding word positions

- An embedding layer converts token IDs into identical vector representations regardless of where they are located in the input sequence:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp" width="400px">

- Positional embeddings are combined with the token embedding vector to form the input embeddings for a large language model:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp" width="500px">

- The BytePair encoder has a vocabulary size of 50,257:
- Suppose we want to encode the input tokens into a 256-dimensional vector representation:

In [None]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

- If we sample data from the dataloader, we embed the tokens in each batch into a 256-dimensional vector
- If we have a batch size of 8 with 4 tokens each, this results in an 8 x 4 x 256 tensor:

In [None]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [None]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [None]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

# uncomment & execute the following line to see what the embeddings look like
# print(token_embeddings)

torch.Size([8, 4, 256])


- GPT-2 uses absolute position embeddings, so we just create another embedding layer:

In [None]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# uncomment & execute the following line to see what the embedding layer weights look like
# print(pos_embedding_layer.weight)

In [None]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

# uncomment & execute the following line to see what the embeddings look like
# print(pos_embeddings)

torch.Size([4, 256])


- To create the input embeddings used in an LLM, we simply add the token and the positional embeddings:

In [None]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

# uncomment & execute the following line to see what the embeddings look like
# print(input_embeddings)

torch.Size([8, 4, 256])
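
In the addition above, PyTorch broadcasts the `[4, 256]` positional embeddings across the batch dimension of the `[8, 4, 256]` token embeddings, so the same position vector is added at a given position in every sequence. The plain-Python sketch below shows this on made-up tables with tiny, hypothetical sizes (3 tokens, context length 2, embedding size 2).

```python
# Hypothetical toy tables; the values are chosen so the sums are easy to read
token_table = [[0.5, 0.25], [1.5, 2.0], [3.0, 4.0]]  # one row per token ID
pos_table = [[10.0, 20.0], [30.0, 40.0]]             # one row per position


def embed(batch):
    """Add the vector for position p to the token vector found at position p."""
    return [
        [
            [t + p for t, p in zip(token_table[tok], pos_table[pos])]
            for pos, tok in enumerate(seq)
        ]
        for seq in batch
    ]


batch = [[2, 2], [0, 1]]  # token ID 2 appears at positions 0 and 1
out = embed(batch)

print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2 (batch, tokens, embedding size)
print(out[0][0])  # [13.0, 24.0]  token 2 at position 0
print(out[0][1])  # [33.0, 44.0]  token 2 at position 1
```

The same token ID ends up with different input embeddings depending on its position, which is the information the token embedding alone cannot provide.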


- In the initial phase of the input processing workflow, the input text is segmented into separate tokens
- Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp" width="400px">

# Summary and takeaways

See the [./dataloader.ipynb](./dataloader.ipynb) code notebook, which is a concise version of the data loader that we implemented in this chapter and will need for training the GPT model in upcoming chapters.

See [./exercise-solutions.ipynb](./exercise-solutions.ipynb) for the exercise solutions.

See the [Byte Pair Encoding (BPE) Tokenizer From Scratch](../02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb) notebook if you are interested in learning how the GPT-2 tokenizer can be implemented and trained from scratch.