# Unicode Code Point

In [1]:
ord("h")

104

In [2]:
ord("拼")

25340

In [3]:
ord("あ")

12354

In [6]:
example = "您好！😊 hello"

In [7]:
print([ord(c) for c in example])

[24744, 22909, 65281, 128522, 32, 104, 101, 108, 108, 111]


Why not use Unicode code points directly as tokens? => The vocabulary would be very large, and the Unicode standard is not stable: it keeps changing as new characters are added.
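
To make the size of the code point space concrete, here is a minimal sketch using only the standard library. `sys.maxunicode` and `unicodedata.unidata_version` are real CPython attributes; the version string depends on the Python build, so no fixed output is shown.

```python
import sys
import unicodedata

# Highest code point Python can represent is U+10FFFF.
code_space = sys.maxunicode + 1
print(f"Code point space: {code_space:,} values")

# Version of the Unicode database bundled with this interpreter.
# It differs between Python releases because the standard keeps growing.
print(f"Unicode database version: {unicodedata.unidata_version}")

# ord() and chr() are inverses of each other.
assert chr(ord("😊")) == "😊"
```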

## UTF-8 / UTF-16 / UTF-32

In [8]:
example.encode("utf-8")

b'\xe6\x82\xa8\xe5\xa5\xbd\xef\xbc\x81\xf0\x9f\x98\x8a hello'

In [13]:
# ASCII characters are encoded as a single byte each, while non-English characters and emojis take several bytes each.
# Therefore, the encoded byte sequence is longer than the original string.
print(len(example))
print(len(example.encode("utf-8")))

10
19


In [9]:
print(list(example.encode("utf-8")))

[230, 130, 168, 229, 165, 189, 239, 188, 129, 240, 159, 152, 138, 32, 104, 101, 108, 108, 111]


In [10]:
"""
The first two bytes (255, 254) are the byte order mark. Not efficient: every ASCII character carries one zero padding byte.
"""
print(list(example.encode("utf-16")))

[255, 254, 168, 96, 125, 89, 1, 255, 61, 216, 10, 222, 32, 0, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0]


In [11]:
"""
The first four bytes (255, 254, 0, 0) are the byte order mark. Not efficient: every ASCII character carries three zero padding bytes.
"""
print(list(example.encode("utf-32")))

[255, 254, 0, 0, 168, 96, 0, 0, 125, 89, 0, 0, 1, 255, 0, 0, 10, 246, 1, 0, 32, 0, 0, 0, 104, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0]
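
The padding and the leading marker bytes in the outputs above can be isolated with a small sketch. It assumes the explicit little-endian codec names `utf-16-le` and `utf-32-le`, which fix the byte order and omit the byte order mark.

```python
text = "h😊"

# Plain "utf-16" prepends a 2-byte byte order mark; "utf-16-le" does not.
with_bom = text.encode("utf-16")
without_bom = text.encode("utf-16-le")
print(len(with_bom), len(without_bom))

# ASCII "h" takes 2 bytes in UTF-16 and 4 bytes in UTF-32; the extra bytes are zeros.
h16 = list("h".encode("utf-16-le"))
h32 = list("h".encode("utf-32-le"))
print(h16, h32)

# U+1F60A lies above U+FFFF, so UTF-16 stores it as a surrogate pair (4 bytes).
emoji16 = list("😊".encode("utf-16-le"))
print(emoji16)
```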


Why not use raw UTF-8 bytes as tokens? => The vocabulary has only 256 entries, so every text becomes a long sequence of tiny tokens. Given a fixed context length, the model can then attend to much less text.
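
A rough illustration of that cost, assuming each UTF-8 byte is one token. The sample strings are made up for this sketch.

```python
samples = {
    "English": "hello world",
    "Chinese": "您好世界",
    "Emoji": "😊😂",
}

for name, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    # With a byte-level vocabulary, n_bytes is the number of tokens the model sees.
    print(f"{name}: {n_chars} chars -> {n_bytes} byte tokens ({n_bytes / n_chars:.1f} per char)")
```

Non-ASCII text consumes the context window several times faster than ASCII text under this scheme.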

# Byte-Pair Encoding (BPE)

### Example

In [28]:
import random

random.seed(42)
a = ['a', 'b', 'c', 'd']
tiny_example = []

for _ in range(20):
    tiny_example.extend(random.sample(a, 1))

tiny_example = ''.join(tiny_example)
print(tiny_example)

aacbbbaadaaabbabdbdc


In [29]:
counts = {}
for lead, follower in zip(tiny_example, tiny_example[1:]):
    counts[(lead, follower)] = counts.get((lead, follower), 0) + 1

In [24]:
# Keep counts as a dict: the cells below still call counts.items() and counts.get.
sorted_counts = sorted(counts.items(), key=lambda entry: entry[1], reverse=True)
print(sorted_counts)

[(('a', 'a'), 4), (('b', 'b'), 3), (('b', 'a'), 2), (('a', 'b'), 2), (('b', 'd'), 2), (('a', 'c'), 1), (('c', 'b'), 1), (('a', 'd'), 1), (('d', 'a'), 1), (('d', 'b'), 1), (('d', 'c'), 1)]


In [32]:
max(counts.items(), default=None, key=lambda entry: entry[1])

(('a', 'a'), 4)

In [33]:
max(counts, key=counts.get)

('a', 'a')

In [35]:
chr(65)

'A'

In [42]:
new_token_index = 65
# The most frequent pair does not change during the scan, so look it up once.
pair = max(counts, key=counts.get)
i = 0
token_ids = []
while i < len(tiny_example):
    if i < len(tiny_example) - 1 and tiny_example[i] == pair[0] and tiny_example[i+1] == pair[1]:
        # merge
        token_ids.append(new_token_index)
        i += 2
    else:
        token_ids.append(ord(tiny_example[i]))
        i += 1

print(token_ids)
print(''.join([chr(token_id) for token_id in token_ids]))


[65, 99, 98, 98, 98, 65, 100, 65, 97, 98, 98, 97, 98, 100, 98, 100, 99]
AcbbbAdAabbabdbdc


### Training

In [74]:
# Train BPE
import typing as tp

def get_stats(token_ids: tp.List[int]) -> tp.Dict[tp.Tuple[int, int], int]:
    counts = {}
    for lead, follower in zip(token_ids, token_ids[1:]):
        counts[(lead, follower)] = counts.get((lead, follower), 0) + 1
    return counts


def merge_with_pair(token_ids: tp.List[int], pair: tp.Tuple[int, int], new_token_index: int) -> tp.List[int]:
    i = 0
    new_token_ids = []
    while i < len(token_ids):
        if i < len(token_ids) - 1 and token_ids[i] == pair[0] and token_ids[i+1] == pair[1]:
            # merge
            new_token_ids.append(new_token_index)
            i += 2
        else:
            new_token_ids.append(token_ids[i])
            i += 1
    
    return new_token_ids

def merge(token_ids: tp.List[int], num_merges: int, start_index: int) -> tp.Tuple[tp.List[int], tp.Dict]:
    new_token_mapping = {}
    for i in range(num_merges):
        stats = get_stats(token_ids)
        pair = max(stats, key=stats.get)
        new_token_index = start_index + i
        print(f"Merging {pair} to a new token {new_token_index}")
        token_ids = merge_with_pair(token_ids, pair, new_token_index)
        new_token_mapping[pair] = new_token_index

    return token_ids, new_token_mapping




In [75]:
# Training data (ahaha)
input = """Now, Unicode does also include many “precomposed” code points, each representing a letter with some combination of diacritics already applied, such as U+00C1 “Á” latin capital letter a with acute or U+1EC7 “ệ” latin small letter e with circumflex and dot below. I suspect these are mostly inherited from older encodings that were assimilated into Unicode, and kept around for compatibility. In practice, there are precomposed code points for most of the common letter-with-diacritic combinations in European-script languages, so they don’t use dynamic composition that much in typical text."""

# Training
ori_token_ids = list(map(int, input.encode("utf-8")))
print(f"Original token ids: {len(ori_token_ids)}")
# new_token_mapping is the tokenizer that we trained.
token_ids, new_token_mapping = merge(ori_token_ids, num_merges=10, start_index=256)

print(f"After BPE: {len(token_ids)}")
print(f"Compression rate: {len(ori_token_ids) / len(token_ids):.2f}x")

Original token ids: 607
Merging (101, 32) to a new token 256
Merging (105, 110) to a new token 257
Merging (99, 111) to a new token 258
Merging (32, 97) to a new token 259
Merging (105, 116) to a new token 260
Merging (101, 114) to a new token 261
Merging (97, 116) to a new token 262
Merging (32, 116) to a new token 263
Merging (226, 128) to a new token 264
Merging (258, 109) to a new token 265
After BPE: 510
Compression rate: 1.19x


### Decoding

In [76]:
new_token_mapping

{(101, 32): 256,
 (105, 110): 257,
 (99, 111): 258,
 (32, 97): 259,
 (105, 116): 260,
 (101, 114): 261,
 (97, 116): 262,
 (32, 116): 263,
 (226, 128): 264,
 (258, 109): 265}

In [81]:
"""
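vocab maps every token id to its byte string; it is essential to decoding.
"""
vocab = {i: bytes([i]) for i in range(256)}
# Relies on dict insertion order: each merge only refers to tokens that were defined before it.
for pair, new_token_index in new_token_mapping.items():
    vocab[new_token_index] = vocab[pair[0]] + vocab[pair[1]]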
print(len(vocab))
print(vocab[257])
print(vocab[105])
print(vocab[110])

266
b'in'
b'i'
b'n'


In [82]:
import typing as tp


def decode(token_ids: tp.List[int]) -> str:
    tokens = b"".join(vocab[token_id] for token_id in token_ids)
    text = tokens.decode("utf-8")
    return text

In [86]:
# token_ids was produced from the training text, so decoding it should reproduce that text.
decode(token_ids) == input

True

In [87]:
# 97 is ASCII character 'a'
decode([97])

'a'

In [88]:
# 128 (0x80) is a UTF-8 continuation byte, so it is not valid on its own.
decode([128])

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

In [91]:
import typing as tp


def decode_with_fallback(token_ids: tp.List[int]) -> str:
    tokens = b"".join(vocab[token_id] for token_id in token_ids)
    text = tokens.decode("utf-8", errors="replace")
    return text

decode_with_fallback([128])

'�'
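
The same failure appears whenever a token sequence ends in the middle of a multi-byte character. A minimal sketch using only built-in `bytes.decode`; the variable names are illustrative.

```python
raw = "您".encode("utf-8")  # three bytes: e6 82 a8
partial = raw[:2]            # cut in the middle of the character

# Strict decoding rejects the truncated sequence.
try:
    partial.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD instead of raising.
lenient = partial.decode("utf-8", errors="replace")
print(strict_ok, repr(lenient))
```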

### Encoding

In [93]:
"""
new_token_mapping is essential to encoding.
"""
import typing as tp

def encode(input: str) -> tp.List[int]:
    token_ids = list(map(int, input.encode("utf-8")))
    while len(token_ids) > 1: # stats will be empty if len(token_ids) == 1.
        stats = get_stats(token_ids)
        # Among the pairs present, pick the one that was learned earliest (smallest new token id).
        pair = min(stats, key=lambda p: new_token_mapping.get(p, float('inf')))
        if pair not in new_token_mapping:
            # No remaining pair is in the mapping, so there is nothing left to merge.
            break
        token_ids = merge_with_pair(token_ids, pair, new_token_mapping[pair])
    return token_ids

encode("hello world!")

[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]

In [94]:
# Test Single Character
encode("h")

[104]

In [95]:
# ASCII-only text (not part of the training data)
decode(encode("hello world!"))

'hello world!'

In [96]:
# Training data
decode(encode(input)) == input

True

In [97]:
# Picked from https://en.wikipedia.org/wiki/Emoji.
validation_data = """Originally meaning pictograph, the word emoji comes from Japanese e (絵, 'picture') + moji (文字, 'character'); the resemblance to the English words emotion and emoticon is purely coincidental.[4] The first emoji set was created by Japanese phone carrier SoftBank in 1997,[5] with emoji becoming increasingly popular worldwide in the 2010s after Unicode began encoding emoji into the Unicode Standard.[6][7][8] They are now considered to be a large part of popular culture in the West and around the world.[9][10] In 2015, Oxford Dictionaries named the Face with Tears of Joy emoji (😂) the word of the year.[11][12]"""
decode(encode(validation_data)) == validation_data

True
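
Why `encode` applies merges in the order they were learned can be seen on a toy case. This is a self-contained sketch: `toy_merges`, `apply_merge`, and `toy_encode` are hypothetical names, and the merge table is invented for illustration rather than taken from the training run above.

```python
# Hypothetical merge table, learned in this order:
# "aa" -> 256, then (256, "a") -> 257.
toy_merges = {(97, 97): 256, (256, 97): 257}


def apply_merge(ids, pair, new_id):
    """Replace non-overlapping occurrences of pair with new_id, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def toy_encode(text, merges):
    ids = list(text.encode("utf-8"))
    while len(ids) > 1:
        pairs = set(zip(ids, ids[1:]))
        # Pick the candidate pair that was learned earliest (lowest new token id).
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = apply_merge(ids, pair, merges[pair])
    return ids


print(toy_encode("aaa", toy_merges))
print(toy_encode("aaaa", toy_merges))
```

Token 257 can only appear after 256 exists, which is why later merges must wait for earlier ones.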

# GPT2 Tokenizer

## Chunking (regex)

In [103]:
"""
The regular expression below splits a string into chunks and returns them as a list of strings. BPE merges are then applied within each chunk only.

Why do that? Consider this example from the GPT-2 paper: "dog", "dog!", "dog.", "dog?".
Without the regex, "d" and "o" are likely to be merged, then merged with "g", and then with each punctuation mark.
That is not what we want, as we do not want to merge semantics with punctuation.
"""
# regex is a third-party package that extends Python's re module (for example, with \p{L} Unicode classes).
import regex

# https://github.com/openai/gpt-2/blob/master/src/encoder.py#L53
gpt2pat = regex.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")


In [105]:
"""
It matches a run of letters preceded by an optional space ( ?\p{L}+), or a run of digits preceded by an optional space ( ?\p{N}+).
In this way, BPE never merges a word or number with the space that follows it; that space is attached to the next chunk instead.
There are other cases in this regex.
"""
# Letters
print(gpt2pat.findall("Hello world how are you"))

# Numbers
print(gpt2pat.findall("Hello world123 how are you"))

# Apostrophe
print(gpt2pat.findall("Hello've world how are you"))

# Punctuation
print(gpt2pat.findall("Hello world how are you!!!"))

# Extra spaces (always allow the last space to be with the next non-space token)
print(gpt2pat.findall("Hello world how are     you"))


['Hello', ' world', ' how', ' are', ' you']
['Hello', ' world', '123', ' how', ' are', ' you']
['Hello', "'ve", ' world', ' how', ' are', ' you']
['Hello', ' world', ' how', ' are', ' you', '!!!']
['Hello', ' world', ' how', ' are', '    ', ' you']
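
The effect of chunking on BPE statistics can be sketched by counting byte pairs inside each chunk only. The pattern below is an ASCII-only simplification written for the standard `re` module (an assumption for illustration; the real pattern uses `\p{L}` and `\p{N}`, which need the `regex` package).

```python
import re
from collections import Counter

# ASCII-only stand-in for the GPT-2 split pattern (an assumption, not the original).
ascii_pat = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

text = "dog! dog. dog?"
chunks = ascii_pat.findall(text)
print(chunks)

# Count adjacent byte pairs inside each chunk, never across chunk boundaries.
pair_counts = Counter()
for chunk in chunks:
    ids = chunk.encode("utf-8")
    pair_counts.update(zip(ids, ids[1:]))

print(pair_counts[(ord("o"), ord("g"))])  # "og" occurs inside chunks
print(pair_counts[(ord("g"), ord("!"))])  # "g!" would span two chunks
```

Because the pair "g!" is never counted, it can never become a merge candidate.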


In [109]:
"""
Note that the regex above does not cover certain cases.
"""
# Uppercase 'VE is not recognized as a contraction, so the apostrophe and "VE" become separate chunks
print(gpt2pat.findall("Hello'VE world how are you"))

# A backtick is not treated as an apostrophe, so `ve is not recognized as 've
print(gpt2pat.findall("Hello`ve world how are you"))

['Hello', "'", 'VE', ' world', ' how', ' are', ' you']
['Hello', '`', 've', ' world', ' how', ' are', ' you']
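
One possible fix for the uppercase case is to make only the contraction alternatives case-insensitive. The sketch below uses an ASCII-only simplification of the pattern with the standard `re` module (an assumption for illustration, not the pattern of any released tokenizer).

```python
import re

# (?i:...) applies case-insensitive matching to the contraction group only.
ci_pat = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

upper = ci_pat.findall("Hello'VE world")
lower = ci_pat.findall("Hello've world")
print(upper)
print(lower)
```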


# Tiktoken

In [112]:
import tiktoken

# GPT-2 tokenizer. Vocab size is ~50k.
encoder = tiktoken.get_encoding("gpt2") # this will download the GPT-2 tokenizer.
print(encoder.encode("    Hello world!!!"))

# GPT-4 tokenizer. Vocab size is ~100k. It handles the apostrophe cases and adds further special handling.
encoder = tiktoken.get_encoding("cl100k_base") # this will download the GPT-4 tokenizer.
print(encoder.encode("    Hello world!!!"))

[220, 220, 220, 18435, 995, 10185]
[262, 22691, 1917, 12340]


# Why not use a very large vocab_size?

1) A large vocab_size leaves some or many tokens under-trained, since we cannot guarantee sufficient training data for every token.
2) A huge vocab_size leads to a lot of merging, so giant chunks of text collapse into single tokens, which reduces the LLM's flexibility.
3) It is computationally expensive: the embedding layer and the final linear layer both grow with vocab_size.
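
The computational cost can be estimated with simple arithmetic. This sketch assumes a hypothetical hidden size `d_model = 768` and an untied output projection; `vocab_params` is an illustrative helper, not part of any library.

```python
def vocab_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters in the token embedding plus the output projection (biases ignored)."""
    embedding = vocab_size * d_model
    lm_head = 0 if tied else vocab_size * d_model
    return embedding + lm_head


d_model = 768  # hypothetical hidden size, chosen only for illustration
for vocab_size in (50_000, 100_000, 1_000_000):
    print(f"{vocab_size:>9,} tokens -> {vocab_params(vocab_size, d_model):,} parameters")
```

The count grows linearly with vocab_size, and the final softmax over the vocabulary grows with it as well.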


# How to retrain tokenizer?

Resize the embedding layer and the final linear layer to the new vocab_size, freeze the base language model, and train the model.

Note that only the embedding layer and the final linear layer are trained.
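
The resizing step can be sketched without any deep learning framework. Here the embedding table is a plain list of rows, and new rows start at the mean of the existing ones. That initialization is an assumed heuristic, not something the text above prescribes, and `resize_embedding` is a hypothetical helper.

```python
import random


def resize_embedding(rows, new_vocab_size):
    """Extend an embedding table to new_vocab_size rows.

    New rows start at the mean of the existing rows (an assumed heuristic).
    """
    if new_vocab_size < len(rows):
        raise ValueError("this sketch only grows the table")
    d_model = len(rows[0])
    mean_row = [sum(row[j] for row in rows) / len(rows) for j in range(d_model)]
    new_rows = [list(mean_row) for _ in range(new_vocab_size - len(rows))]
    return [list(row) for row in rows] + new_rows


random.seed(0)
old_vocab_size, new_vocab_size, d_model = 4, 6, 3  # toy sizes
embedding = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(old_vocab_size)]

resized = resize_embedding(embedding, new_vocab_size)
print(len(embedding), "->", len(resized), "rows of width", len(resized[0]))
```

In a real model the final linear layer would be resized the same way, and only these two tables would receive gradient updates.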

# LLM's Issues

### Count number of letters in a "word"

".DefaultCellStyle" is a single token in the GPT-4 tokenizer. The model fails to count the "l"s in it.

### Reverse String

".DefaultCellStyle" is a single token in the GPT-4 tokenizer. The model fails to reverse it.

### Foreign Languages

These scripts have a much larger character inventory in Unicode (the Chinese / Korean inventory is 10x that of letter-based scripts: https://www.reedbeta.com/blog/programmers-intro-to-unicode/#scripts)

1) Less training data is available for foreign languages than for English.
2) It is harder to find sufficient training data for the tokenizer (to get enough merges).

### Math

Arithmetic quality really depends on how the tokenizer merges digits into tokens.
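
One way a tokenizer can constrain digit merging is to cap the length of digit chunks before BPE runs. The patterns below are ASCII-only illustrations written for this sketch, not the pattern of a specific tokenizer.

```python
import re

number = "1234567"

# One chunk for the whole digit run: BPE may merge digits in frequency-driven groups.
unbounded = re.findall(r"[0-9]+", number)

# Cap each chunk at three digits, so no token can span more than three digits.
capped = re.findall(r"[0-9]{1,3}", number)

print(unbounded)
print(capped)
```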

### GPT2 Not Good in Python

Spaces (indentation) are not handled well by the GPT-2 tokenizer: each space of indentation becomes its own token.

### Special Tokens "<|endoftext|>"

GPT-4 recognizes it as a single special token and does not treat it as an ordinary string.

### Trailing Whitespace

The tokenizer almost always merges a leading whitespace with the token that follows it. It rarely produces a token that ends with a whitespace.

Therefore, a trailing whitespace becomes an isolated token at inference time, and the model has seen very few such examples in training.

Thus the LLM gives poor results.
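
The isolated trailing space is already visible at the chunking stage. This sketch uses an ASCII-only simplification of the GPT-2 split pattern with the standard `re` module (an assumption for illustration).

```python
import re

# ASCII-only stand-in for the GPT-2 split pattern (an assumption, not the original).
ascii_pat = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

attached = ascii_pat.findall("Hello world")  # the space attaches to the following word
isolated = ascii_pat.findall("Hello ")       # the trailing space has nothing to attach to
print(attached)
print(isolated)
```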

### Unstable Tokens (Partial Tokens)

"`.DefaultCellSty`" appears rarely in the training set. Thus the model may directly output an end-of-text token when it sees it in the prompt.

### Reddit User Name

Tokenizer dataset != Training dataset.

Reddit user names probably appeared many times in the tokenizer dataset and were merged into dedicated tokens.
However, they did not appear in the training dataset, so those specific tokens never got (well) trained.

### JSON vs YAML

YAML is typically shorter than JSON for the same data, so it usually costs fewer tokens.
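
A rough size comparison for one record. Character count is only a proxy for token count, and the YAML text is written by hand here because the standard library has no YAML module; the record itself is made up.

```python
import json

record = {"name": "Ada", "langs": ["python", "rust"], "active": True}

json_text = json.dumps(record, indent=2)

# Hand-written YAML for the same record (no YAML library is used here).
yaml_text = "name: Ada\nlangs:\n  - python\n  - rust\nactive: true"

# Character count is only a rough proxy for token count.
print(len(json_text), len(yaml_text))
```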

# minBPE

Training code for a tiktoken-style tokenizer (OpenAI did not release the training code).

1) `tiktoken` is the go-to solution when the tokenizer does not need to be retrained.
2) `sentencepiece` with BPE can be used to train a tokenizer. Be careful with its byte fallback, its many configuration options, and its normalization.