A large part of this demo (designed for pedagogical reasons) is borrowed from the awesome tutorial by Andrej Karpathy.

In [37]:
import regex as re
from tqdm import tqdm  
import tiktoken

#### Unicode Code points 

[Unicode](https://en.wikipedia.org/wiki/Unicode#Notes) defines 154998 characters from 168 scripts. Almost everything you can think of (including emojis) is likely defined, and can be represented using a Unicode code point.

This is a big deal, as it enables the digitization of many different writing systems.

For instance, one of the most recent inclusions was the Tigalari script used to write Tulu (spoken not very far from Bangalore, in and around Mangalore).

Let's look at some of the code points: 

In [6]:
# code point for 'h'
print (ord('h'))

# code point for 'ह'
print (ord('ह'))

104
2361


Note that the Devanagari letter 'ह' (used to write Hindi) is assigned a larger code point than the Latin letter 'h'.
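As a quick check, `chr` is the inverse of `ord`, and code points are conventionally written in hexadecimal as `U+XXXX`. The sketch below illustrates this; the helper name `code_point_label` is hypothetical and introduced only for this example.

```python
def code_point_label(ch):
    # format a single character's code point in the usual U+XXXX notation
    return f"U+{ord(ch):04X}"

print(code_point_label('h'))   # U+0068
print(code_point_label('ह'))   # U+0939

# chr() maps a code point back to its character
print(chr(104), chr(2361))     # h ह
```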


In [7]:
hello_en = "hello"
hello_hi = "हैलो"

Unicode code points are just an abstract concept: each (digitized) character is conceptually mapped to a number. To realize this concept in practice, the code points must be encoded as bytes, and `utf-8` is the most widely used encoding for this. `utf-8` is designed to store and transmit data efficiently.

It supports all valid code points using a _variable-width_ encoding of 1 to 4 bytes per code point. It is also backward-compatible with `ASCII`: any ASCII text is already valid `utf-8`, byte for byte. (Side note: standard ASCII defines only 128 values, covering letters, digits, punctuation and other special characters, and control codes.)

Let's look at how some of the strings are encoded as per the `utf-8` format.

In [8]:
print (list(hello_en.encode('utf-8')))
print (list(hello_hi.encode('utf-8')))

[104, 101, 108, 108, 111]
[224, 164, 185, 224, 165, 136, 224, 164, 178, 224, 165, 139]


In [10]:
# let's look at the UTF-8 representation of the individual Hindi characters
for ch in hello_hi:
    print (ch, list(ch.encode('utf-8')))

ह [224, 164, 185]
ै [224, 165, 136]
ल [224, 164, 178]
ो [224, 165, 139]
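Each Devanagari character above takes three bytes because `utf-8` uses 1 byte for code points up to U+007F, 2 bytes up to U+07FF, 3 bytes up to U+FFFF, and 4 bytes beyond that. The sketch below checks one sample character from each range; the choice of sample characters is ours, and they are written as escapes so the example does not depend on how the file is saved.

```python
# one sample per UTF-8 width: 'h', 'é', 'ह', and the grinning-face emoji
samples = ['h', '\u00e9', '\u0939', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    # character, code point, number of bytes, and the byte values
    print(ch, f"U+{ord(ch):04X}", len(encoded), list(encoded))

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(byte_lengths)   # [1, 2, 3, 4]
```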


In [11]:
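# let's load the training data for the tokenizer: Virat Kohli's Wikipedia page

# english biography (read explicitly as UTF-8, independent of the platform default)
with open('datasets/kohli_en.txt', encoding='utf-8') as f:
    train_text_en = f.read()

# hindi biography
with open('datasets/kohli_hi.txt', encoding='utf-8') as f:
    train_text_hi = f.read()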

In [None]:
# printing parts of the biography

print ("English: ", train_text_en[:100])   
print ("Hindi: ", train_text_hi[:100])

English:  Virat Kohli (born 5 November 1988)[b] is an Indian international cricketer who plays Test and ODI cr
Hindi:  विराट कोहली (जन्म: 5 नवम्बर 1988) भारतीय क्रिकेट टीम के एक दिवसीय क्रिकेट, टेस्ट क्रिकेट, व टी 20 आई


In [16]:
# a function to compute the counts of byte pairs
def get_counts(ids):
    counts = {}

    for i in range(1, len(ids)):
        if (ids[i-1], ids[i]) not in counts:
            counts[(ids[i-1], ids[i])] = 0
        counts[(ids[i-1], ids[i])] += 1

    return counts
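For comparison, the same pair counting can be sketched more compactly with `collections.Counter` from the standard library. The name `get_counts_counter` is hypothetical; this is an alternative formulation, not the function used in the rest of the demo.

```python
from collections import Counter

def get_counts_counter(ids):
    # zip(ids, ids[1:]) yields every adjacent pair (ids[i], ids[i+1])
    return Counter(zip(ids, ids[1:]))

example_counts = get_counts_counter([1, 2, 3, 1, 2])
print(example_counts.most_common(1))   # [((1, 2), 2)]
```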

In [17]:
tokens_en = list(train_text_en.encode('utf-8'))
tokens_hi = list(train_text_hi.encode('utf-8'))

stats_en = get_counts(tokens_en)
stats_hi = get_counts(tokens_hi)

In [None]:
# let's look at some of the most frequent byte pairs
sorted_stats_en = sorted(stats_en.items(), key=lambda x: x[1], reverse=True)[:25]

for (p0, p1), count in sorted_stats_en:
    print (chr(p0), chr(p1), count)

e   1741
  t 1573
i n 1360
t h 1298
h e 1257
s   1196
  a 1139
n   1025
d   1009
t   774
a n 744
e r 735
,   697
o r 684
r e 677
n g 666
e d 652
a t 639
  s 627
o n 598
  o 560
  i 535
n d 530
  h 514
e n 514


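As you can see above, the most frequent byte pairs correspond to characters that commonly occur next to each other in English text. For instance, `t` and `h`, `h` and `e`, `a` and `n`, `e` and `r`. Many of the other top pairs involve a space, i.e. a letter at the start or end of a word.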

In [28]:
# a function to merge common pairs 
def merge(ids, pair, idx): 
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            # merge the pair -> idx
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1

    return new_ids

Let's quickly test whether the merge function works.

In [32]:
assert merge([1, 2, 3, 4, 5], (2, 3), 6) == [1, 6, 4, 5]
assert merge([1, 2, 3, 4, 5], (9, 10), 2) == [1, 2, 3, 4, 5]

In [38]:
# let's train a tokenizer

desired_vocab_size = 1000
# ids 0-255 are reserved for the raw byte values; every merge adds one new token
num_merges = desired_vocab_size - 256

# copy the tokens 
copy_tokens_en = tokens_en.copy()
merges_en = {}

for i in tqdm(range(num_merges)):

    # compute counts 
    stats_en = get_counts(copy_tokens_en)

    # find the most frequent pair
    most_freq_pair = max(stats_en, key = lambda x: stats_en[x])

    # merge the pair
    copy_tokens_en = merge(copy_tokens_en, most_freq_pair, 256 + i)

    # store the merges 
    merges_en[most_freq_pair] = 256 + i

100%|██████████| 744/744 [00:08<00:00, 83.74it/s] 
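To see what the training loop does step by step, here is a minimal sketch that runs three merges on a tiny string. The string `"aaabdaaabac"` and the names `toy_ids` and `toy_merges` are illustrative assumptions, and the two helper functions are compact re-statements of `get_counts` and `merge` so that the example runs on its own.

```python
def get_counts(ids):
    # count every adjacent pair of ids
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace each non-overlapping occurrence of `pair` (left to right) with `idx`
    new_ids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

toy_ids = list("aaabdaaabac".encode('utf-8'))   # 11 bytes
toy_merges = {}

for step in range(3):
    counts = get_counts(toy_ids)
    best = max(counts, key=counts.get)           # most frequent pair
    toy_ids = merge(toy_ids, best, 256 + step)   # new token id: 256, 257, 258
    toy_merges[best] = 256 + step
    print(step, best, toy_ids)

print(len(toy_ids))   # 5
```

After three merges the 11 input bytes shrink to 5 tokens. When two pairs are equally frequent, `max` returns the one that was counted first, so the outcome depends on that tie-breaking rule.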


In [39]:
# encode the string 

def encode(text):
    ids = list(text.encode('utf-8'))

    for pair in merges_en:
        ids = merge(ids, pair, merges_en[pair])

    return ids

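Note that the above encode function is not written in the most efficient way: it scans the whole sequence once for every learned merge, even when that pair never occurs in the text. It is more efficient to go the other way: look only at the pairs actually present in the text and repeatedly apply the one that was learned earliest. Also note that the function relies on Python dictionaries preserving insertion order, so that merges are applied in the order in which they were learned.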
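A minimal sketch of that alternative is shown below, under the assumption that merge ids grow in training order (as they do in the training loop of this demo). The name `encode_fast` and the toy merge table `toy_merges` are hypothetical and exist only for illustration.

```python
def merge(ids, pair, idx):
    # replace each non-overlapping occurrence of `pair` (left to right) with `idx`
    new_ids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

def encode_fast(text, merges):
    ids = list(text.encode('utf-8'))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # among the pairs present in the text, pick the one learned earliest;
        # pairs that were never learned rank last via float('inf')
        pair = min(pairs, key=lambda p: merges.get(p, float('inf')))
        if pair not in merges:
            break   # nothing left to merge
        ids = merge(ids, pair, merges[pair])
    return ids

# made-up merge table: 'aa' -> 256, then (256, 'a') -> 257, then (257, 'b') -> 258
toy_merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}

print(encode_fast("aaab", toy_merges))   # [258]
print(encode_fast("xyz", toy_merges))    # [120, 121, 122]
```

The loop runs once per merge that actually applies to the text, rather than once per entry in the merge table.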

In [40]:
# let's decode the ids
 
# the first 256 entries map each id to its single raw byte value
vocab = {idx: bytes([idx]) for idx in range(256)}

for (p0, p1), idx in merges_en.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode("utf-8", errors='replace')
    return text

In [47]:
print (encode("Kohli"))

[285]


It's interesting to note that even with such a small number of merges, `Kohli` gets a dedicated token, thanks to how often it appears in the training data.

In [49]:
print (encode("Virat Kohli scored a hundred"))

[957, 355, 300, 341, 489, 104, 509, 114, 280]


In [50]:
print (decode(encode("Virat Kohli scored a hundred")))

Virat Kohli scored a hundred


In [None]:
chinese_sample = "科利得分一百"
print (decode(encode(chinese_sample)))
assert (decode(encode(chinese_sample)) == chinese_sample)

科利得分一百


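As these checks show, `decode(encode(text))` returns the original text, so decoding inverts encoding. This holds even for scripts that are rare or absent in the training data, because any string can always fall back to its raw UTF-8 bytes.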
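The reverse direction is not guaranteed for arbitrary id sequences: a single token can be a fragment of a multi-byte character, which is why `decode` passes `errors='replace'`. The sketch below illustrates this using only the 256 base byte tokens, with no learned merges assumed.

```python
# base vocabulary only: token id i stands for the single byte i
vocab = {idx: bytes([idx]) for idx in range(256)}

def decode(ids):
    tokens = b"".join(vocab[idx] for idx in ids)
    return tokens.decode("utf-8", errors="replace")

ids = list("\u0939".encode("utf-8"))   # the three bytes of 'ह': [224, 164, 185]

print(decode(ids))       # the complete character decodes cleanly
print(decode(ids[:1]))   # a lone lead byte becomes the replacement character U+FFFD
```

Without `errors='replace'`, the second call would raise a `UnicodeDecodeError` instead of returning a placeholder.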