---
title: "Tokenization"
description: ""
jupyter: python3
categories: [tokenization]
image: https://substackcdn.com/image/fetch/$s_!Aj2N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb8e0ce-1111-4896-88ec-4e630d6471ed_1182x488.png
---

# Tokenization

<iframe width="560" height="315" src="https://www.youtube.com/embed/zduSFxRajkE?si=7U1IDwsVn20kq2Xe" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Here is a live Tokenizer called [Tiktokenizer](https://tiktokenizer.vercel.app/)

> Tokens are the fundemantal unit, the “atom” of Large Language Models (LLMs). Tokenization is the process of translating strings (i.e. text) and converting them into sequences of tokens and vice versa.

In [83]:
text = "The scaling up of AI models has two major consequences. First, AI models are becoming more powerful and capable of more tasks, enabling more applications. More people and teams leverage AI to increase productivity, create economic value, and improve quality of life. Second, training large language models (LLMs) requires data, compute resources, and specialized talent that only a few organizations can afford. This has led to the emergence of model as a service: models developed by these few organizations are made available for others to use as a service. Anyone who wishes to leverage AI to build applications can now use these models to do so without having to invest up front in building a model. In short, the demand for AI applications has increased while the barrier to entry for building AI applications has decreased. This has turned AI engineering—the process of building applications on top of readily available models—into one of the fastest-growing engineering disciplines. Building applications on top of machine learning (ML) models isn’t new. Long before LLMs became prominent, AI was already powering many applications, including product recommendations, fraud detection, and churn prediction. While many principles of productionizing AI applications remain the same, the new generation of large-scale, readily available models brings about new possibilities and new challenges, which are the focus of this book. This chapter begins with an overview of foundation models, the key catalyst behind the explosion of AI engineering. I’ll then discuss a range of successful AI use cases, each illustrating what AI is good and not yet good at. As AI’s capabilities expand daily, predicting its future possibilities becomes increasingly challenging. However, existing application patterns can help uncover opportunities today and offer clues about how AI may continue to be used in the future. To close out the chapter, I’ll provide an overview of the new AI stack, including what has changed with foundation models, what remains the same, and how the role of an AI engineer today differs from that of a traditional ML engineer."
tokens = text.encode('utf-8') # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers 0-255

print("---")
print(text)
print("length of text:", len(text))
print("---")
print(tokens)
print("length of tokens:", len(tokens))


---
The scaling up of AI models has two major consequences. First, AI models are becoming more powerful and capable of more tasks, enabling more applications. More people and teams leverage AI to increase productivity, create economic value, and improve quality of life. Second, training large language models (LLMs) requires data, compute resources, and specialized talent that only a few organizations can afford. This has led to the emergence of model as a service: models developed by these few organizations are made available for others to use as a service. Anyone who wishes to leverage AI to build applications can now use these models to do so without having to invest up front in building a model. In short, the demand for AI applications has increased while the barrier to entry for building AI applications has decreased. This has turned AI engineering—the process of building applications on top of readily available models—into one of the fastest-growing engineering disciplines. Buildi

# Byte-Pair Encoding Algorithm

- find the pair of bytes that occur most frequently
- mint a new token representation of them

In [84]:
def get_most_frequent_pair(ids):
    # count the frequency of each pair of bytes
    pair_counts = {}
    for pair in zip(ids, ids[1:]): # zip() pairs each byte with the next one
        pair_counts[pair] = pair_counts.get(pair, 0) + 1 # get() returns the value of the pair, or 0 if it doesn't exist
    return pair_counts

most_frequent_pair = get_most_frequent_pair(tokens)
# print(most_frequent_pair)
print(sorted(((value, key) for key, value in most_frequent_pair.items()), reverse=True))

[(57, (101, 32)), (52, (115, 32)), (48, (105, 110)), (40, (32, 116)), (40, (32, 97)), (35, (110, 103)), (31, (111, 110)), (29, (32, 111)), (28, (116, 105)), (26, (114, 101)), (26, (97, 116)), (25, (116, 104)), (24, (101, 114)), (24, (97, 110)), (23, (100, 32)), (23, (44, 32)), (22, (104, 101)), (21, (101, 115)), (21, (32, 109)), (20, (105, 111)), (20, (103, 32)), (19, (111, 100)), (19, (110, 32)), (19, (99, 97)), (19, (32, 99)), (18, (111, 102)), (18, (108, 105)), (18, (100, 101)), (18, (32, 65)), (17, (116, 32)), (17, (110, 101)), (17, (108, 101)), (17, (105, 108)), (17, (104, 97)), (17, (102, 32)), (16, (110, 100)), (16, (101, 110)), (16, (97, 115)), (16, (65, 73)), (15, (111, 32)), (15, (110, 115)), (15, (109, 111)), (15, (105, 99)), (15, (73, 32)), (15, (46, 32)), (15, (32, 112)), (15, (32, 98)), (14, (121, 32)), (14, (116, 111)), (14, (115, 101)), (14, (111, 114)), (14, (101, 108)), (14, (100, 105)), (14, (32, 102)), (13, (112, 108)), (13, (97, 112)), (13, (32, 105)), (13, (32, 10

In [85]:
chr(101), chr(32) # most common tokens changed back into characters

('e', ' ')

In [86]:
top_ranking_pair = max(most_frequent_pair, key=most_frequent_pair.get) # key=most_frequent_pair.get ranks the keys by the value of the pair
top_ranking_pair

(101, 32)

In [87]:
def merge_tokens(ids, pair, idx):
    # in the list of ints (ids), replace all consecutive occurrences of the pair with a new token
    newids = []
    i = 0
    while i < len(ids):
        # if we are not at the very last position AND the pair matches, replace it
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2 # skip the next two bytes
        else:
            newids.append(ids[i])
            i += 1
    return newids

print("length of tokens before merge:", len(tokens))
print("---Merging tokens---")
tokens_after_merge = merge_tokens(tokens, top_ranking_pair, 255)
print("---")
print(tokens_after_merge)
print("length of tokens after merge:", len(tokens_after_merge))



length of tokens before merge: 2153
---Merging tokens---
---
[84, 104, 255, 115, 99, 97, 108, 105, 110, 103, 32, 117, 112, 32, 111, 102, 32, 65, 73, 32, 109, 111, 100, 101, 108, 115, 32, 104, 97, 115, 32, 116, 119, 111, 32, 109, 97, 106, 111, 114, 32, 99, 111, 110, 115, 101, 113, 117, 101, 110, 99, 101, 115, 46, 32, 70, 105, 114, 115, 116, 44, 32, 65, 73, 32, 109, 111, 100, 101, 108, 115, 32, 97, 114, 255, 98, 101, 99, 111, 109, 105, 110, 103, 32, 109, 111, 114, 255, 112, 111, 119, 101, 114, 102, 117, 108, 32, 97, 110, 100, 32, 99, 97, 112, 97, 98, 108, 255, 111, 102, 32, 109, 111, 114, 255, 116, 97, 115, 107, 115, 44, 32, 101, 110, 97, 98, 108, 105, 110, 103, 32, 109, 111, 114, 255, 97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 115, 46, 32, 77, 111, 114, 255, 112, 101, 111, 112, 108, 255, 97, 110, 100, 32, 116, 101, 97, 109, 115, 32, 108, 101, 118, 101, 114, 97, 103, 255, 65, 73, 32, 116, 111, 32, 105, 110, 99, 114, 101, 97, 115, 255, 112, 114, 111, 100, 117, 99, 116, 105, 118, 

Keep iterating over the next largest pair to mint

In [88]:
vocab_size = 276 # desired final volcabulary size 
number_of_merges = vocab_size - 256 # number of merges to perform
ids = list(tokens) # copy so we don't destory original list

# building a BINARY TREE of merges (starting not from root but from the leaves)
merges = {} # (child1, child2) -> mapping to new token
for i in range(number_of_merges):
  most_frequent_pair = get_most_frequent_pair(ids) # find most commonly occuring pair
  pair = max(most_frequent_pair, key=most_frequent_pair.get)
  idx = 256 + i
  print(f"merging {pair} into new token {idx}")
  ids = merge_tokens(ids, pair, idx)
  merges[pair] = idx


merging (101, 32) into new token 256
merging (115, 32) into new token 257
merging (105, 110) into new token 258
merging (111, 110) into new token 259
merging (32, 116) into new token 260
merging (32, 97) into new token 261
merging (97, 116) into new token 262
merging (258, 103) into new token 263
merging (101, 114) into new token 264
merging (105, 259) into new token 265
merging (111, 100) into new token 266
merging (111, 102) into new token 267
merging (114, 101) into new token 268
merging (105, 108) into new token 269
merging (104, 256) into new token 270
merging (65, 73) into new token 271
merging (101, 110) into new token 272
merging (100, 32) into new token 273
merging (32, 109) into new token 274
merging (46, 32) into new token 275


Note that some prior minted tokens are also eligible for merging

In [89]:
print(len(tokens))
print(len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")

2153
1653
compression ratio: 1.30X


[The Technical User's Introduction to LLM Tokenization - Christopher Samiullah](https://christophergs.com/blog/understanding-llm-tokenization)

The **Tokenizer** is completely seperate, independent module to the LLM.
It has its own training dataset of text, on which on **train the vocabulary** using the **Byte-Pair Encoding Algorithm**.

The Tokenizer then translates back and forth between raw text and sequences of tokens. The LLM later only ever sees the tokens and never deals with any text

![](https://substackcdn.com/image/fetch/$s_!Aj2N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb8e0ce-1111-4896-88ec-4e630d6471ed_1182x488.png)

### Decoding

Given the integers, what is the text?

In [90]:
# dictionary mapping token ids to bytes
vocab = {idx: bytes([idx]) for idx in range(256)}
# in order of all the merges, populate the vocab list
# run in order of insertion of merges into the dictionary
for (p0, p1), idx in merges.items():
    # concatenate the bytes
    vocab[idx] = vocab[p0] + vocab[p1]
    print(vocab[idx])
def decode(ids):
    # given ids, return the text
    # get the tokens
    # concat all bytes together
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode('utf-8', errors='replace') # utf-8 needs the replace erroring to fix the issue of some tokens not being valid utf-8
    return text
print("----")
print(decode(tokens))
print("----")
print(decode([257]))

b'e '
b's '
b'in'
b'on'
b' t'
b' a'
b'at'
b'ing'
b'er'
b'ion'
b'od'
b'of'
b're'
b'il'
b'he '
b'AI'
b'en'
b'd '
b' m'
b'. '
----
The scaling up of AI models has two major consequences. First, AI models are becoming more powerful and capable of more tasks, enabling more applications. More people and teams leverage AI to increase productivity, create economic value, and improve quality of life. Second, training large language models (LLMs) requires data, compute resources, and specialized talent that only a few organizations can afford. This has led to the emergence of model as a service: models developed by these few organizations are made available for others to use as a service. Anyone who wishes to leverage AI to build applications can now use these models to do so without having to invest up front in building a model. In short, the demand for AI applications has increased while the barrier to entry for building AI applications has decreased. This has turned AI engineering—the process

In [91]:
merges

{(101, 32): 256,
 (115, 32): 257,
 (105, 110): 258,
 (111, 110): 259,
 (32, 116): 260,
 (32, 97): 261,
 (97, 116): 262,
 (258, 103): 263,
 (101, 114): 264,
 (105, 259): 265,
 (111, 100): 266,
 (111, 102): 267,
 (114, 101): 268,
 (105, 108): 269,
 (104, 256): 270,
 (65, 73): 271,
 (101, 110): 272,
 (100, 32): 273,
 (32, 109): 274,
 (46, 32): 275}

In [92]:
most_frequent_pair

{(84, 270): 1,
 (270, 115): 3,
 (115, 99): 4,
 (99, 97): 9,
 (97, 108): 11,
 (108, 263): 2,
 (263, 32): 11,
 (32, 117): 6,
 (117, 112): 2,
 (112, 32): 5,
 (32, 267): 9,
 (267, 32): 9,
 (32, 271): 12,
 (271, 274): 3,
 (274, 266): 8,
 (266, 101): 12,
 (101, 108): 14,
 (108, 257): 7,
 (257, 104): 5,
 (104, 97): 12,
 (97, 257): 8,
 (257, 116): 7,
 (116, 119): 1,
 (119, 111): 1,
 (111, 274): 1,
 (274, 97): 4,
 (97, 106): 1,
 (106, 111): 1,
 (111, 114): 14,
 (114, 32): 4,
 (32, 99): 9,
 (99, 259): 4,
 (259, 115): 1,
 (115, 101): 5,
 (101, 113): 1,
 (113, 117): 3,
 (117, 272): 1,
 (272, 99): 2,
 (99, 101): 6,
 (101, 115): 11,
 (115, 46): 3,
 (46, 32): 15,
 (32, 70): 1,
 (70, 105): 1,
 (105, 114): 1,
 (114, 115): 1,
 (115, 116): 8,
 (116, 44): 3,
 (44, 32): 15,
 (257, 97): 9,
 (97, 114): 6,
 (114, 256): 9,
 (256, 98): 1,
 (98, 101): 6,
 (101, 99): 8,
 (99, 111): 5,
 (111, 109): 7,
 (109, 263): 1,
 (263, 274): 3,
 (274, 111): 3,
 (256, 112): 5,
 (112, 111): 5,
 (111, 119): 7,
 (119, 264): 2,
 (

## Encoding 

Given a string, what are the tokens?

In [93]:
def encode(text):
    tokens = list(text.encode('utf-8'))
    while True:
        # this time we dont care for frequency of pairs
        raw_pairs_in_sequence = get_most_frequent_pair(tokens)
        # get the lowest index pair (earliest merges first
        # all pairs inside raw_pairs_in_sequence, get index, get pair with min number index
        # float inf is fallback for non merging pairs
        # pair returns the most eligible merging pair from merged list
        pair = min(raw_pairs_in_sequence, key=lambda p: merges.get(p, float('inf')))
        if pair not in merges:
            break  # nothing else can be merged
        idx = merges[pair]
        tokens = merge_tokens(tokens, pair, idx)
    return tokens
print(encode("hello, world!"))

[104, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33]
