<div style="background-color: #ffffff; color: #000000; padding: 10px;">
<div style="display: flex; justify-content: space-between; align-items: center; background-color: #ffffff; color: #000000; padding: 10px;">
    <img src="../media/logo_kisz.png" height="80" style="margin-right: auto;" alt="Logo of the AI Service Center Berlin-Brandenburg.">
    <img src="../media/logo_bmbf.jpeg" height="150" style="margin-left: auto;" alt="Logo of the German Federal Ministry of Education and Research: Gefördert vom Bundesministerium für Bildung und Forschung.">
</div>
<h1> Video Search
<h2> Finding Locations in Audio and Video with Automatic Speech Recognition and Semantic Search
</div>

<div style="background-color: #f6a800; color: #ffffff; padding: 10px;">
    <h2> Part 2 - Byte Pair Encoding
</div>

When working with OpenAI's Whisper or Large Language Models (LLMs) in general, we need to represent text in a format that neural networks can understand. Word embeddings provide this format by converting discrete language units (e.g., words, punctuation) into continuous vectors of real numbers (e.g., `[-0.342, 0.234, 0.633, 0.451]`). These vectors, so-called *embeddings*, capture semantic and syntactic information, enabling models to recognise patterns, similarities, and relationships in language data.

The making of embeddings can be divided into two steps:

1. Tokenisation of training data (e.g., text), so that each unique token (e.g., word or punctuation character) can be represented with a number.
2. Computing embeddings with one of various methods (e.g., next word prediction), which is often part of the model training.

This notebook introduces the tokenisation of text and in particular the Byte Pair Encoder (BPE), which is used by many contemporary LLMs. The next notebook - [Part 3: Embeddings](03_embeddings.ipynb) - describes how to compute embeddings.

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>1. Tokenisation - Breaking down Text into small Units
</div>

In order to compute embeddings, we first need to tokenise the text that constitutes the training data. Tokenisation is the process of breaking down continuous texts into smaller units such as words and punctuation characters. Each unique token will get its own ID. In the following, tokenisation is illustrated in the context of the novel *Moby Dick*, which can be downloaded from the *Gutenberg Corpus*.

In [1]:
import nltk
nltk.download('gutenberg', quiet=True)

True

If the download does not work, try the solution from Stack Overflow: https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed

In [9]:
from nltk.corpus import gutenberg

# get text from novel Moby Dick
moby_dick = gutenberg.words('melville-moby_dick.txt')
moby_dick = ' '.join(moby_dick)

# print a few words to understand the data's structure
print(f'"{moby_dick[20000:20100]}"')
print("Total number of unique characters in the novel:", len(set(moby_dick)))
print("Total number of unique words in the novel:", len(set(moby_dick.split())))

"meet a whale - ship on the ocean without being struck by her near appearance . The vessel under shor"
Total number of unique characters in the novel: 80
Total number of unique words in the novel: 19317


In [8]:
# Split the text into sentences using "." as separator
sentences = moby_dick.split(".")

# Remove empty sentences (e.g., trailing after last ".")
sentences = [s.strip() for s in sentences if s.strip()]

# Calculate average number of characters per sentence
avg_chars = sum(len(s) for s in sentences) / len(sentences)

# Calculate average number of words per sentence (split on space)
avg_words = sum(len(s.split()) for s in sentences) / len(sentences)

print(f"Average number of characters per sentence: {avg_chars:.2f}")
print(f"Average number of words per sentence: {avg_words:.2f}")

Average number of characters per sentence: 164.71
Average number of words per sentence: 33.79


To create tokens, we first split the text into words and punctuation characters. Each of these *units* will be the basis for a token. These units will not be our final tokens, because some units will be further decomposed, as detailed below.

In [3]:
import re

# split the text into tokens; each space-separated character string and each punctuation character is a unit
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', moby_dick)

# remove empty tokens
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# investigate
print(preprocessed[:30])

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')', 'The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ',']


Next, we convert the text tokens into token IDs, representing our vocabulary. The vocabulary consists of each unique token, sorted alphabetically. Each unique token is then mapped onto its ID.

In [4]:
# create the vocabulary
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(f'vocabulary size: {vocab_size} unique tokens.')

# print the first items of the vocabulary
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 10:
        break

vocabulary size: 19243 unique tokens.
('!', 0)
('"', 1)
('$', 2)
('&', 3)
("'", 4)
('(', 5)
(')', 6)
('*', 7)
(',', 8)
('-', 9)
('--', 10)


<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>2. Special Context Tokens
</div>

Often, special tokens are added to the vocabulary for different reasons.
- `<|UNK|>` (unknown word) denotes out-of-vocabulary words
- `<|endoftext|>` or `<|EOS|>` (end of sequence) are used, among others, when multiple texts such as newspaper articles are concatenated
- `<|beginningoftext|>` or `<|BOS|>` denote beginnings analogous to `<|endoftext|>` and `<|EOS|>`
- `<|PAD|>` (padding) is used when training LLMs with batch sizes greater than 1 (i.e., inputs with different lengths are 'padded' to the length of the longest input; e.g., `['Never', 'mind', '<|PAD|>', '<|PAD|>', '<|PAD|>']` has the same length as `['Actions', 'speak', 'louder', 'than', 'words']`)
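
The padding described in the last bullet can be sketched in a few lines. This is only an illustration: the helper name `pad_batch` is hypothetical and not part of any library used in this notebook.

```python
# Minimal sketch: pad tokenised inputs to the length of the longest input.
# `pad_batch` is a hypothetical helper, used here only for illustration.
def pad_batch(batch, pad_token="<|PAD|>"):
    max_len = max(len(tokens) for tokens in batch)
    return [tokens + [pad_token] * (max_len - len(tokens)) for tokens in batch]

batch = [
    ["Never", "mind"],
    ["Actions", "speak", "louder", "than", "words"],
]
padded = pad_batch(batch)
for tokens in padded:
    print(tokens)
```

After padding, all inputs in the batch have the same length and can be stacked into a single tensor.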

Next, the `<|UNK|>` token is added to our vocabulary. Without this special token, any out-of-vocabulary word would cause errors or be ignored during tokenisation and model inference. Because of this, almost every word-level tokenisation approach uses an `<|UNK|>` token.


In [5]:
# add <|UNK|> to vocabulary
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|UNK|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

# check whether it worked
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('zone', 19239)
('zoned', 19240)
('zones', 19241)
('zoology', 19242)
('<|UNK|>', 19243)


We can now define `encode` and `decode` functions for a simple tokeniser. In the AI domain, encoding refers to the process of transforming an input sequence into tokens, whereas decoding refers to the reverse transformation, that is, tokens to output sequence. In the case of this notebook, input and output sequences are texts. With the `encode` and the `decode` function, we can turn text into IDs and vice versa.

In [6]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    # split text into tokens and add <|UNK|> for unknown tokens
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int 
            else "<|UNK|>" for item in preprocessed # add <|UNK|> for unknown tokens
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    # decode ids back to text
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

The tokeniser can be used as follows:

In [7]:
# create a small vocabulary for testing
sentence = sorted(set(['this', 'sentence', 'is', 'a', 'test', 'sentence', '.']))
vocab_small = {token:integer for integer,token in enumerate(sentence)}

# create tokenizer with the small vocabulary
tokenizer = SimpleTokenizer(vocab_small)

# tokenise the example sentence and decode it back to text
ids = tokenizer.encode('this sentence is a test sentence.')
print(tokenizer.decode(ids))
print(ids)

this sentence is a test sentence.
[5, 3, 2, 1, 4, 3, 0]


After sorting alphabetically, the "." is the first token, which gets the ID 0. The second token (ID 1) is "a". The token "sentence", which occurs twice, gets the ID 3. Using the small vocabulary, the relationship between tokens and IDs is straightforward. However, in real texts that are much longer, it is no longer possible to see this relationship at a glance.

In [8]:
# build tokenizer with the vocabulary from Moby Dick
tokenizer = SimpleTokenizer(vocab)

# encode a part of the text and decode it back to text
text = moby_dick[20015:20200]
ids = tokenizer.encode(text)
print(tokenizer.decode(ids))
print(ids)

ship on the ocean without being struck by her near appearance. The vessel under short sail, with look - outs at the mast - heads, eagerly scanning the wide expanse around them, has
[15641, 12867, 17275, 12802, 19050, 4821, 16694, 5471, 9992, 12548, 4227, 11, 3296, 18487, 17981, 15683, 15152, 8, 19037, 11657, 9, 12993, 4435, 17275, 11960, 9, 9897, 8, 7843, 15255, 17275, 18965, 8389, 4322, 17280, 8, 9839]


Out-of-vocabulary words are encoded with the ID that represents the special token `<|UNK|>`.

In [9]:
tokenizer = SimpleTokenizer(vocab)

text = "Hello World"

print(tokenizer.decode(tokenizer.encode(text)))
print(tokenizer.encode(text))

<|UNK|> World
[19243, 3640]


<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>3. Byte Pair Encoding - Motivation
</div>

When training LLMs, the way we represent text as tokens has a significant impact on both model performance and computational efficiency. Using a simple byte-level encoding, where each character is assigned its own token, results in very long input sequences, which increases memory usage and slows down training. In contrast, Byte Pair Encoding (BPE) builds a vocabulary of frequently occurring character sequences (subwords), allowing common words and patterns to be represented by fewer tokens. For instance, the words "runs" and "running" could be represented as ["run", "s"] and ["run", "ing"]. This has the following advantages:

- **Optimised vocabulary size and computing costs:**  
  Representing text with subword tokens rather than single characters shortens input sequences and speeds up training. At the same time, a subword vocabulary can stay much smaller than a full word-level vocabulary, which reduces the size of the embedding and output layers and thus leads to lower memory requirements and faster computations. BPE learns a manageable vocabulary of subwords, keeping both sequence lengths and model size reasonable.

- **Smart usage of repetitive patterns in words:**  
  Many words share common roots or affixes (e.g., "run", "runs", "runner", "running"). BPE exploits these repetitive patterns by merging frequent character pairs into subword units. This enables the model to efficiently represent related words using shared subword tokens, improving generalisation and reducing the number of unique tokens needed.

- **Efficiently handling infrequent words:**  
  Natural language exhibits a long-tail distribution: a small number of words are very common, while many words are rare or unique. This is especially true of languages such as German or Dutch, where long compounds can be built spontaneously (e.g., "Kaffeevollautomatreinigungsset" or "Aansprakelijkheidsverzekering"). BPE helps address this by breaking rare or unseen words into known subword units, allowing the model to handle out-of-vocabulary words gracefully without needing an excessively large vocabulary.
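
To make the link between vocabulary size and model size concrete, the following sketch counts the parameters of a token embedding matrix, which has one row per vocabulary entry. The embedding width of 768 is an assumption (it corresponds to the smallest GPT-2 model); the word-level vocabulary size is the one obtained for *Moby Dick* above, including `<|UNK|>`.

```python
# Rough sketch: an embedding matrix has vocab_size x embedding_dim parameters.
# embedding_dim=768 is an assumed value (the width of the smallest GPT-2 model).
def embedding_parameters(vocab_size, embedding_dim=768):
    return vocab_size * embedding_dim

vocabularies = [
    ("byte-level", 256),
    ("word-level (Moby Dick)", 19_244),
    ("GPT-2 BPE", 50_257),
]
for name, vocab_size in vocabularies:
    print(f"{name:>24}: {embedding_parameters(vocab_size):>12,} parameters")
```

The same count applies again to the output layer if it is not shared with the embedding layer, which is why vocabulary size is a cost factor in its own right.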

The BPE tokeniser that OpenAI implemented for training the original GPT models can be found [here](https://github.com/openai/gpt-2/blob/master/src/encoder.py). The BPE algorithm itself was first described in 1994: "[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)" by Philip Gage.

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>4. The Fundament of BPE - Bits and Bytes
</div>

The BPE tokeniser relies heavily on the notion of bits and bytes. So, what are bytes? A byte consists of eight bits, so there are 2<sup>8</sup> = 256 possible values that a single byte can represent, ranging from 0 to 255. This is illustrated in the following code, which creates byte arrays for incrementally longer sequences. As soon as a sequence contains a value greater than 255 (i.e., once it is longer than 256 elements), an error is produced.

In [10]:
# Try to create bytearrays for incrementally longer sequences, print error when it occurs
for n in range(0, 300):
    try:
        ba = bytearray(range(0, n))
    except ValueError as e:
        print(f"Error at n={n}: {e}")
        break

Error at n=257: byte must be in range(0, 256)


In principle, any text can be converted into a byte array, because an encoding such as UTF-8 maps every character to one or more bytes. This would be an easy approach, because converting a byte array into a list yields one ID per byte (and thus one ID per character for plain ASCII text), similar to the `encode` function that we defined above.

In [11]:
# convert text into byte array
text = "Hello World"
byte_ary = bytearray(text, "utf-8")

# encode text
ids = list(byte_ary)
print("IDs::", ids)

IDs:: [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]


The downside of this approach is that each character gets its own ID. Thus, representing our 11-character sentence with byte-based tokens results in a vector of length eleven, whereas representing this sentence with word-based tokens results in a vector of length two.

In [12]:
# investigate length of encodings with byte-based and word-based tokens
print("Number of characters:", len(text))
print("Number of token IDs (byte-based):", len(ids))
print("Number of token IDs (word-based):", len(tokenizer.encode(text)))


Number of characters: 11
Number of token IDs (byte-based): 11
Number of token IDs (word-based): 2
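
One detail worth keeping in mind: in UTF-8, only ASCII characters map to a single byte. Other characters, such as German umlauts, are encoded as two or more bytes, so the number of byte-based token IDs can exceed the number of characters. A small illustration (the example word is chosen arbitrarily):

```python
# ASCII characters occupy one byte in UTF-8; "ä" occupies two bytes.
word = "Käse"
byte_ids = list(word.encode("utf-8"))

print("Number of characters:", len(word))
print("Number of byte IDs:", len(byte_ids))
print("Byte IDs:", byte_ids)
```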


<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>5. BPE Algorithm
</div>

The BPE tokenisation algorithm builds a vocabulary consisting of character tokens, subword units, and words. Thus, commonly occurring subwords like `ent`, which can be found in, among others, "entangle", "entertain", "enter", "entrance", have their own token ID. The BPE algorithm strives to find the optimal balance between character and subword units. This is achieved as follows:

**1. Identify frequent pairs**
- In each iteration, scan the text to find the most commonly occurring pair of bytes (or characters)

**2. Replace and record**

- Replace that pair with a new placeholder ID (one not already in use, e.g., if we start with 0...255, the first placeholder would be 256)
- Record this mapping in a lookup table
- The size of the lookup table is a hyperparameter, also called "vocabulary size" (for GPT-2, that's 50,257)

**3. Repeat until no gains**

- Keep repeating steps 1 and 2, continually merging the most frequent pairs
- Stop when no further compression is possible (e.g., no pair occurs more than once)


### Concrete example of the encoding part (steps 1 & 2)

Suppose we have the text (training dataset) `the cat in the hat` from which we want to build the vocabulary for a BPE tokeniser.

**Iteration 1**

1. Identify frequent pairs
  - In this text, "th" appears twice (at the beginning and before the second "e")

2. Replace and record
  - replace "th" with a new token ID that is not already in use, e.g., 256
  - the new text is: `<256>e cat in <256>e hat`
  - the new vocabulary is

```
  0: ...
  ...
  256: "th"
```

**Iteration 2**

1. **Identify frequent pairs**  
   - In the text `<256>e cat in <256>e hat`, the pair `<256>e` appears twice

2. **Replace and record**  
   - replace `<256>e` with a new token ID that is not already in use, for example, `257`.  
   - The new text is:
     ```
     <257> cat in <257> hat
     ```
   - The updated vocabulary is:
     ```
     0: ...
     ...
     256: "th"
     257: "<256>e"
     ```

**Iteration 3**

1. **Identify frequent pairs**  
   - In the text `<257> cat in <257> hat`, the pair `<257> ` appears twice (once at the beginning and once before “hat”).

2. **Replace and record**  
   - replace `<257> ` with a new token ID that is not already in use, for example, `258`.  
   - the new text is:
     ```
     <258>cat in <258>hat
     ```
   - The updated vocabulary is:
     ```
     0: ...
     ...
     256: "th"
     257: "<256>e"
     258: "<257> "
     ```
     
- and so forth
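
The three iterations above can be reproduced with a few lines of Python. This is a minimal sketch, not the implementation used later in this notebook: the helper names `most_frequent_pair` and `merge_pair` are chosen for illustration, characters are mapped to IDs via `ord`, and ties between equally frequent pairs are resolved in favour of the pair that occurs first in the text.

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent pairs; max() returns the first pair with the highest count
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`
    merged, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

ids = [ord(ch) for ch in "the cat in the hat"]
merges = {}
for new_id in range(256, 259):  # three iterations, as in the example
    pair = most_frequent_pair(ids)
    ids = merge_pair(ids, pair, new_id)
    merges[pair] = new_id
    print(f"{new_id}: merged {pair}, sequence length is now {len(ids)}")
```

Here `(116, 104)` corresponds to "th", `(256, 101)` to `<256>e`, and `(257, 32)` to `<257>` followed by a space.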

### Concrete example of the decoding part

- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair in the reverse order they were introduced
- Start with the final compressed text: `<258>cat in <258>hat`
- Substitute `<258>` → `<257> `: `<257> cat in <257> hat`
- Substitute `<257>` → `<256>e`: `<256>e cat in <256>e hat`
- Substitute `<256>` → "th": `the cat in the hat`
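
The reverse substitution can be sketched as follows. The merge table is written out by hand from the example (with characters given as byte values, e.g., 116 for "t"), and the function name `expand` is hypothetical.

```python
# Merge table from the example: new token ID -> the pair it replaced
merges = {256: (116, 104), 257: (256, 101), 258: (257, 32)}

def expand(token_id):
    # IDs below 256 are plain byte values; higher IDs are expanded recursively
    if token_id < 256:
        return [token_id]
    left, right = merges[token_id]
    return expand(left) + expand(right)

compressed = [258, 99, 97, 116, 32, 105, 110, 32, 258, 104, 97, 116]
restored = bytes(b for tid in compressed for b in expand(tid)).decode("utf-8")
print(restored)
```

Because every merged ID ultimately expands to raw byte values, decoding never needs an `<|UNK|>` token.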

This algorithm produces token IDs that represent characters, subwords, and words. For illustration, have a look at the following cell.

In [13]:
# use the GPT-2 tokeniser, which is based on BPE
import tiktoken
gpt2_tokenizer = tiktoken.get_encoding("gpt2")

# 298 is the token "ent"
print(f'{298}: {gpt2_tokenizer.decode([298])}')

# here are a few words, that have their own token IDs
for i in [318, 617, 1212, 2420]:
    decoded = gpt2_tokenizer.decode([i])
    print(f"{i}: {decoded}")

298: ent
318:  is
617:  some
1212: This
2420:  text


Below is an implementation of the BPE algorithm as a Python class that mimics the `tiktoken` Python user interface.

In [14]:
from collections import Counter, deque
from functools import lru_cache


class BPETokenizerSimple:
    def __init__(self):
        # Maps token_id to token_str (e.g., {11246: "some"})
        self.vocab = {}
        # Maps token_str to token_id (e.g., {"some": 11246})
        self.inverse_vocab = {}
        # Dictionary of BPE merges: {(token_id1, token_id2): merged_token_id}
        self.bpe_merges = {}

        # For the official OpenAI GPT-2 merges, use a rank dict:
        # of form {(string_A, string_B): rank}, where lower rank = higher priority
        self.bpe_ranks = {}

    def train(self, text, vocab_size, allowed_special={"<|endoftext|>"}):
        """
        Train the BPE tokenizer from scratch.

        Args:
            text (str): The training text.
            vocab_size (int): The desired vocabulary size.
            allowed_special (set): A set of special tokens to include.
        """
        
        processed_text = text

        # Initialize vocab with unique characters
        # Start with the first 256 Unicode code points (which include the ASCII characters)
        unique_chars = [chr(i) for i in range(256)]
        unique_chars.extend(
            char for char in sorted(set(processed_text))
            if char not in unique_chars
        )

        self.vocab = {i: char for i, char in enumerate(unique_chars)}
        self.inverse_vocab = {char: i for i, char in self.vocab.items()}

        # Add allowed special tokens
        if allowed_special:
            for token in allowed_special:
                if token not in self.inverse_vocab:
                    new_id = len(self.vocab)
                    self.vocab[new_id] = token
                    self.inverse_vocab[token] = new_id

        # Tokenize the processed_text into token IDs
        token_ids = [self.inverse_vocab[char] for char in processed_text]

        # BPE steps 1-3: Repeatedly find and replace frequent pairs
        for new_id in range(len(self.vocab), vocab_size):
            pair_id = self.find_freq_pair(token_ids, mode="most")
            if pair_id is None:
                break
            token_ids = self.replace_pair(token_ids, pair_id, new_id)
            self.bpe_merges[pair_id] = new_id

        # Build the vocabulary with merged tokens
        for (p0, p1), new_id in self.bpe_merges.items():
            merged_token = self.vocab[p0] + self.vocab[p1]
            self.vocab[new_id] = merged_token
            self.inverse_vocab[merged_token] = new_id

    def encode(self, text, allowed_special=None):
        """
        Encode the input text into a list of token IDs, with tiktoken-style handling of special tokens.
    
        Args:
            text (str): The input text to encode.
            allowed_special (set or None): Special tokens to allow passthrough. If None, special handling is disabled.
    
        Returns:
            List of token IDs.
        """
    
        token_ids = []
    
        # If special token handling is enabled
        if allowed_special is not None and len(allowed_special) > 0:
            # Build regex to match allowed special tokens
            special_pattern = (
                "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special, key=len, reverse=True)) + ")"
            )
    
            last_index = 0
            for match in re.finditer(special_pattern, text):
                prefix = text[last_index:match.start()]
                token_ids.extend(self.encode(prefix, allowed_special=None))  # Encode prefix without special handling
    
                special_token = match.group(0)
                if special_token in self.inverse_vocab:
                    token_ids.append(self.inverse_vocab[special_token])
                else:
                    raise ValueError(f"Special token {special_token} not found in vocabulary.")
                last_index = match.end()
    
            text = text[last_index:]  # Remaining part to process normally
    
            # Check if any disallowed special tokens are in the remainder
            disallowed = [
                tok for tok in self.inverse_vocab
                if tok.startswith("<|") and tok.endswith("|>") and tok in text and tok not in allowed_special
            ]
            if disallowed:
                raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")
    
        # If no special tokens, or remaining text after special token split:
        tokens = []
        lines = text.split("\n")
        for i, line in enumerate(lines):
            if i > 0:
                tokens.append("\n")
            words = line.split()
            for j, word in enumerate(words):
                if j == 0 and i <= 0:
                    tokens.append(word)
                else:
                    tokens.append(" " + word)
    
        for token in tokens:
            if token in self.inverse_vocab:
                token_ids.append(self.inverse_vocab[token])
            else:
                token_ids.extend(self.tokenize_with_bpe(token))
    
        return token_ids

    def tokenize_with_bpe(self, token):
        """
        Tokenize a single token using BPE merges.

        Args:
            token (str): The token to tokenize.

        Returns:
            List[int]: The list of token IDs after applying BPE.
        """
        # Tokenize the token into individual characters (as initial token IDs)
        token_ids = [self.inverse_vocab.get(char, None) for char in token]
        if None in token_ids:
            missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]
            raise ValueError(f"Characters not found in vocab: {missing_chars}")

        can_merge = True
        while can_merge and len(token_ids) > 1:
            can_merge = False
            new_tokens = []
            i = 0
            while i < len(token_ids) - 1:
                pair = (token_ids[i], token_ids[i + 1])
                if pair in self.bpe_merges:
                    merged_token_id = self.bpe_merges[pair]
                    new_tokens.append(merged_token_id)
                    i += 2  # Skip the next token as it's merged
                    can_merge = True
                else:
                    new_tokens.append(token_ids[i])
                    i += 1
            if i < len(token_ids):
                new_tokens.append(token_ids[i])
            token_ids = new_tokens
        return token_ids

    def decode(self, token_ids):
        """
        Decode a list of token IDs back into a string.

        Args:
            token_ids (List[int]): The list of token IDs to decode.

        Returns:
            str: The decoded string.
        """
        decoded_string = ""
        for i, token_id in enumerate(token_ids):
            if token_id not in self.vocab:
                raise ValueError(f"Token ID {token_id} not found in vocab.")
            token = self.vocab[token_id]
            if token == "\n":
                if decoded_string and not decoded_string.endswith(" "):
                    decoded_string += " "  # Add space if not present before a newline
                decoded_string += token
            else:
                decoded_string += token
        return decoded_string

    @staticmethod
    def find_freq_pair(token_ids, mode="most"):
        pairs = Counter(zip(token_ids, token_ids[1:]))

        if not pairs:
            return None

        if mode == "most":
            return max(pairs.items(), key=lambda x: x[1])[0]
        elif mode == "least":
            return min(pairs.items(), key=lambda x: x[1])[0]
        else:
            raise ValueError("Invalid mode. Choose 'most' or 'least'.")

    @staticmethod
    def replace_pair(token_ids, pair_id, new_id):
        dq = deque(token_ids)
        replaced = []

        while dq:
            current = dq.popleft()
            if dq and (current, dq[0]) == pair_id:
                replaced.append(new_id)
                # Remove the 2nd token of the pair, 1st was already removed
                dq.popleft()
            else:
                replaced.append(current)

        return replaced

Next, the BPE tokeniser is instantiated and trained with a vocabulary size of 1,000. Note that the vocabulary already contains 256 entries by default due to the byte values discussed earlier, plus one special token, so we are only "learning" 743 vocabulary entries. For comparison, the GPT-2 vocabulary is 50,257 tokens, the GPT-4 vocabulary is 100,256 tokens (`cl100k_base` in tiktoken), and GPT-4o uses 199,997 tokens (`o200k_base` in tiktoken); they all have much larger training sets than our example text.

In [15]:
# train the tokeniser, which takes a minute or two, depending on your processing capacities
tokenizer = BPETokenizerSimple()
tokenizer.train(moby_dick, vocab_size=1000, allowed_special={"<|endoftext|>"})

In [16]:
# investigate length of vocabulary (should be as long as our parameter for vocab_size, i.e., 1000)
print(len(tokenizer.vocab))

# investigate last five items in vocab
print(list(tokenizer.vocab.items())[-5:])

1000
[(995, '," '), (996, 'bra'), (997, 'ct'), (998, 'ive '), (999, 'see ')]


This vocabulary is created by merging 743 times (`= 1000 - len(range(0, 256)) - len(special_tokens) = 1000 - 256 - 1 = 743`). This means that the first 256 entries are single-character tokens. Next, let's use the created merges via the `encode` method to encode some text:

In [17]:
input_text = "A person who never made a mistake never tried anything new."
token_ids = tokenizer.encode(input_text)
print(token_ids)

[65, 32, 411, 677, 110, 32, 295, 111, 32, 900, 404, 32, 479, 464, 32, 97, 32, 836, 289, 578, 101, 32, 900, 404, 258, 345, 336, 32, 266, 121, 277, 272, 32, 900, 119, 46]


We can enable the `<|endoftext|>` token as follows. Note that the last token ID represents this special token.

In [18]:
input_text = "A person who never made a mistake never tried anything new.<|endoftext|> "
token_ids = tokenizer.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)

[65, 32, 411, 677, 110, 32, 295, 111, 32, 900, 404, 32, 479, 464, 32, 97, 32, 836, 289, 578, 101, 32, 900, 404, 258, 345, 336, 32, 266, 121, 277, 272, 32, 900, 119, 46, 256]


The 73-character sentence was encoded into 37 token IDs, effectively cutting the input length roughly in half. How many byte pairs should be merged in order to yield the best performing vocabulary depends on the training data and use case.

In [19]:
print("Number of characters:", len(input_text))
print("Number of token IDs:", len(token_ids))

Number of characters: 73
Number of token IDs: 37


The `decode()` method can be used to map token IDs back onto text.

In [20]:
print(tokenizer.decode(token_ids))

A person who never made a mistake never tried anything new.<|endoftext|>


Iterating over each token ID can give us a better understanding of how the token IDs are decoded via the vocabulary:

In [21]:
for token_id in token_ids:
    print(f"{token_id} -> {tokenizer.decode([token_id])}")

65 -> A
32 ->  
411 -> per
677 -> so
110 -> n
32 ->  
295 -> wh
111 -> o
32 ->  
900 -> ne
404 -> ver
32 ->  
479 -> ma
464 -> de
32 ->  
97 -> a
32 ->  
836 -> mi
289 -> st
578 -> ak
101 -> e
32 ->  
900 -> ne
404 -> ver
258 ->  t
345 -> ri
336 -> ed
32 ->  
266 -> an
121 -> y
277 -> th
272 -> ing
32 ->  
900 -> ne
119 -> w
46 -> .
256 -> <|endoftext|>


As we can see, most token IDs represent short subwords of two or three characters; that is because the training text is small compared with the corpora used for LLMs, and because we used a relatively small vocabulary size.

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>6. OpenAI's Open-Source tiktoken Library
</div>

The BPE algorithm implementation above focuses on readability for educational purposes and is not recommended for training LLMs. Instead, I highly recommend using [tiktoken](https://github.com/openai/tiktoken), which implements its core algorithms in Rust to improve computational performance. The following cells illustrate how the tiktoken library can be used.

In [22]:
# load the library
import importlib.metadata

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [23]:
# build the tokeniser
tokenizer = tiktoken.get_encoding("gpt2")

In [24]:
# use a sample text
text = "A Byte Pair Encoding (BPE) tokeniser is a subword tokenisation algorithm that iteratively " \
"merges the most frequent pairs of characters or character sequences in a text to build a vocabulary of " \
"common subword units, enabling efficient and flexible representation of words."

# encode the text
tiktoken_IDs = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(f'IDs: {tiktoken_IDs}')

decoded_text = tokenizer.decode(tiktoken_IDs)
print(f'Decoded IDs: {decoded_text}')

IDs: [32, 30589, 39645, 14711, 7656, 357, 33, 11401, 8, 11241, 5847, 318, 257, 850, 4775, 11241, 5612, 11862, 326, 11629, 9404, 4017, 3212, 262, 749, 10792, 14729, 286, 3435, 393, 2095, 16311, 287, 257, 2420, 284, 1382, 257, 25818, 286, 2219, 850, 4775, 4991, 11, 15882, 6942, 290, 12846, 10552, 286, 2456, 13]
Decoded IDs: A Byte Pair Encoding (BPE) tokeniser is a subword tokenisation algorithm that iteratively merges the most frequent pairs of characters or character sequences in a text to build a vocabulary of common subword units, enabling efficient and flexible representation of words.


<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>7. Retrieving Embeddings from GloVe
</div>

In this section, we retrieve **GloVe embeddings** for the tokens produced by our tokeniser.

**GloVe** (Global Vectors for Word Representation) is a widely used set of pre-trained word embeddings developed at Stanford University. It represents words as dense vectors in a high-dimensional space, capturing semantic relationships and similarities between words based on their co-occurrence statistics in large text corpora.

In modern language models, embeddings are normally learned as part of model training: the model learns to map each token to a vector representation that suits its task. In this workshop, we do **not** train embeddings ourselves, as this is beyond our current scope. Instead, we use pre-trained GloVe embeddings. Note that GloVe has a word-level vocabulary, whereas BPE produces subword tokens, so only tokens that happen to match a GloVe entry (e.g., `ale`) receive a meaningful vector; all other tokens fall back to a zero vector in the lookup code.
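To make "similarities between words" concrete, the following minimal sketch compares vectors with cosine similarity, a common measure for embeddings. The three 4-dimensional vectors are invented for illustration and are not real GloVe values; the helper name `cosine_similarity` is likewise an assumption, and only the standard library is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for illustration only
toy_embeddings = {
    "whale": [0.9, 0.8, 0.1, 0.0],
    "fish": [0.8, 0.9, 0.2, 0.1],
    "ship": [0.1, 0.2, 0.9, 0.8],
}

sim_whale_fish = cosine_similarity(toy_embeddings["whale"], toy_embeddings["fish"])
sim_whale_ship = cosine_similarity(toy_embeddings["whale"], toy_embeddings["ship"])

print(f"whale vs fish: {sim_whale_fish:.3f}")
print(f"whale vs ship: {sim_whale_ship:.3f}")
```

With these toy values, "whale" scores higher against "fish" than against "ship". The same function could be applied to real GloVe vectors once they are loaded.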

If you are interested in a detailed, hands-on workshop on embeddings please see:  
[https://github.com/aihpi/kisz-nlp-embeddings](https://github.com/aihpi/kisz-nlp-embeddings)

In [27]:
# download GloVe embeddings and unzip them into the GloVe/ directory
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d GloVe
!rm glove.6B.zip

--2025-07-07 10:57:56--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-07-07 10:57:57--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-07-07 10:57:58--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [None]:
# Create token strings from an example text
text = moby_dick
encoded_text = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
token_strings = [tokenizer.decode([tid]) for tid in encoded_text]

In [32]:
import numpy as np

# Load GloVe embeddings
glove_path = 'GloVe/glove.6B.300d.txt'
glove_embeddings = {}
with open(glove_path, 'r', encoding='utf8') as f:
    for line in f:
        parts = line.strip().split()
        word = parts[0]
        vec = np.array(parts[1:], dtype=np.float32)
        glove_embeddings[word] = vec
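The loading loop relies on the plain-text layout of the GloVe files: one word per line, followed by its vector components separated by spaces. The sketch below parses two made-up lines from an in-memory file to show that layout; the values and the name `toy_glove` are assumptions for illustration, and plain Python lists stand in for NumPy arrays.

```python
import io

# Two made-up lines in the GloVe text layout (real vectors are much longer)
sample_file = io.StringIO(
    "whale 0.12 -0.40 0.33\n"
    "ale -0.19 -0.37 -0.35\n"
)

toy_glove = {}
for line in sample_file:
    parts = line.rstrip().split(" ")
    # First field is the word, the remaining fields are the vector components
    toy_glove[parts[0]] = [float(value) for value in parts[1:]]

print(toy_glove["whale"])
```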

In [42]:
# Get embedding for each token string (lowercased, as GloVe is lowercased)
embedding_dim = 300
embeddings = []
for token in token_strings:
    # Remove whitespace and lowercase for GloVe lookup
    key = token.strip().lower()
    if key in glove_embeddings:
        embeddings.append(glove_embeddings[key])
    else:
        # Token not in the GloVe vocabulary: fall back to a zero vector
        embeddings.append(np.zeros(embedding_dim))

embeddings = np.stack(embeddings)
print(embeddings.shape)  # (num_tokens, 300)

(290799, 300)
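Because tokens without a GloVe entry silently become zero vectors, it can be useful to check how many tokens were actually matched. The following sketch does this on toy data; `toy_glove`, `toy_tokens`, and `glove_coverage` are hypothetical stand-ins for the notebook's `glove_embeddings` and `token_strings`, not part of the original code.

```python
# Hypothetical stand-ins for the notebook's `glove_embeddings` and `token_strings`
toy_glove = {"the": [0.1, 0.2], "whale": [0.3, 0.4], "ale": [0.5, 0.6]}
toy_tokens = ["The", " wh", "ale", " sw", "ims", " the", " whale"]


def glove_coverage(tokens, vocabulary):
    """Return the fraction of tokens found in the vocabulary and the sorted unique misses."""
    # Same normalisation as the lookup: strip whitespace and lowercase
    keys = [token.strip().lower() for token in tokens]
    hits = sum(key in vocabulary for key in keys)
    misses = sorted({key for key in keys if key not in vocabulary})
    return hits / len(keys), misses


coverage, missing = glove_coverage(toy_tokens, toy_glove)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

Calling `glove_coverage(token_strings, glove_embeddings)` on the real data would show how often the subword tokens fail to match a word-level GloVe entry.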


In [None]:
# Find the index of a token and print its embedding
# (other tokens to try: "fish", "whale", "wh")
my_token = "ale"

try:
    idx = token_strings.index(my_token)
    idx_embedding = embeddings[idx]
    print(f"Embedding for '{my_token}':", idx_embedding)
except ValueError:
    print(f"'{my_token}' not found in tokens")

Embedding for 'ale': [-1.94409996e-01 -3.65069985e-01 -3.45169991e-01  4.92989987e-01
  9.82309971e-03  3.84810001e-01 -5.14100015e-01  9.30480003e-01
  3.82220000e-01  5.22710025e-01  3.29620004e-01 -1.14629996e+00
 -9.59580004e-01  1.48450002e-01  6.34609997e-01  3.42689991e-01
 -3.60170007e-01  2.94299990e-01 -4.70560014e-01 -3.04760009e-01
  2.06379995e-01 -7.02880025e-02 -5.22329986e-01  6.26590014e-01
 -7.09309995e-01 -3.03290009e-01 -1.82789996e-01 -8.41019955e-03
 -5.58499992e-01  3.18789989e-01  4.81970012e-02 -2.08139997e-02
 -2.67840009e-02 -1.60840005e-01 -3.65869999e-01  6.82219982e-01
  4.71410006e-01  7.21440017e-01  2.03740001e-01  8.18630010e-02
  1.33589998e-01  2.05090001e-01  3.46089989e-01 -1.39920004e-02
  5.48330009e-01 -5.43470025e-01 -1.40400007e-01 -6.86999977e-01
  6.22420013e-02  2.56579995e-01 -3.30989987e-01 -5.27249992e-01
  3.24649990e-01  2.70020008e-01  2.40099996e-01  3.60870004e-01
  3.16729993e-01  2.57349998e-01 -6.90100014e-01  4.76429999e-01
  2.

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>8. Summary
</div>

In this notebook, we explored the fundamentals of tokenisation for LLMs, focusing on the motivation and mechanics behind Byte Pair Encoding. We began by examining basic tokenisation approaches, such as splitting text into words or characters, and discussed the limitations of character-level encoding, particularly the inefficiency caused by long input sequences. We then introduced BPE, which builds a vocabulary of frequently occurring subword units, allowing for more efficient and flexible text representation. The notebook demonstrated how BPE exploits repetitive patterns in language and handles rare or out-of-vocabulary words by breaking them into known subwords. We also implemented a simple BPE tokeniser, compared its output to byte- and word-based approaches, and reviewed the vocabulary sizes of popular LLMs. We then introduced OpenAI's `tiktoken` library as a high-performance, production-ready solution for BPE tokenisation. Finally, we looked up pre-trained GloVe embeddings for the resulting tokens as a first step towards the next notebook. Overall, this notebook provided both theoretical background and practical examples to illustrate why BPE is used by many contemporary LLMs.