In [27]:
from collections import Counter

In [28]:
text = """A Programmer’s Introduction to Unicode March 3, 2017 · Coding · 22 Comments  Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺\u200c🇳\u200c🇮\u200c🇨\u200c🇴\u200c🇩\u200c🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.  A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, I’ll give an introduction to it from a programmer’s point of view.  I’m going to focus on the character set and what’s involved in working with strings and files of Unicode text"""
tokens = list(text.encode("utf-8"))  # raw UTF-8 bytes as a list of ints in 0..255
len(tokens), len(text)  # byte count vs. code point count

(1012, 917)
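
The byte count exceeds the character count because UTF-8 spends more than one byte on every non-ASCII character. A minimal sketch of that effect follows; the sample string is made up for illustration and is not part of the notebook's text.

```python
# Hypothetical sample: five ASCII letters, one accented letter, a space, one emoji.
sample = "na\u00efve \U0001F604"
encoded = sample.encode("utf-8")

# Number of UTF-8 bytes used by each character in the sample.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in sample}

print(len(sample), len(encoded))
print(byte_counts)
```

ASCII characters take one byte each, the accented letter takes two, and the emoji takes four, which is the same kind of gap seen between `len(text)` and `len(tokens)`.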

In [29]:
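# Count every adjacent pair of tokens and pick the most frequent one.
pairs = zip(tokens[:-1], tokens[1:])
pair = Counter(pairs).most_common(1)[0][0]
chr(pair[0]), chr(pair[1])  # readable only because both bytes are ASCII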

('e', ' ')
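
The pair count can be wrapped in a small helper so it is reusable on any id list. This is a sketch: `get_stats` is a hypothetical name, and the toy string is made up for illustration.

```python
from collections import Counter

def get_stats(ids):
    """Count how often each adjacent pair of ids occurs (overlaps included)."""
    return Counter(zip(ids, ids[1:]))

# Hypothetical toy input, not taken from the notebook's text.
toy = list("aaabdaaabac".encode("utf-8"))
stats = get_stats(toy)
top_pair, top_count = stats.most_common(1)[0]
print(bytes(top_pair), top_count)
```

Note that overlapping occurrences are all counted: a run of three `a` bytes contributes two `(97, 97)` pairs, even though a later merge can only consume one of them.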

In [30]:
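def merge(ids, pair, idx):
  # Replace every non-overlapping occurrence of `pair` in `ids` with `idx`,
  # scanning left to right.
  new_tokens = []
  i = 0
  while i < len(ids):
    if i < len(ids)-1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
      new_tokens.append(idx)
      i += 2  # skip both members of the merged pair
    else:
      new_tokens.append(ids[i])
      i += 1
  return new_tokens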

# merge([5,6,6,7,9,1], (6,7), 99)
tokens2 = merge(tokens, pair, 256)
print(tokens2)
len(tokens), len(tokens2)

[65, 32, 80, 114, 111, 103, 114, 97, 109, 109, 101, 114, 226, 128, 153, 115, 32, 73, 110, 116, 114, 111, 100, 117, 99, 116, 105, 111, 110, 32, 116, 111, 32, 85, 110, 105, 99, 111, 100, 256, 77, 97, 114, 99, 104, 32, 51, 44, 32, 50, 48, 49, 55, 32, 194, 183, 32, 67, 111, 100, 105, 110, 103, 32, 194, 183, 32, 50, 50, 32, 67, 111, 109, 109, 101, 110, 116, 115, 32, 32, 239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174, 226, 128, 140, 240, 159, 135, 168, 226, 128, 140, 240, 159, 135, 180, 226, 128, 140, 240, 159, 135, 169, 226, 128, 140, 240, 159, 135, 170, 33, 32, 240, 159, 152, 132, 32, 84, 104, 256, 118, 101, 114, 121, 32, 110, 97, 109, 256, 115, 116, 114, 105, 107, 101, 115, 32, 10

(1012, 984)
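
A quick sanity check of the merge step on toy inputs, as a sketch. The function is repeated so the block runs on its own, and the second id list is made up to show that matches are consumed left to right without overlapping.

```python
def merge(ids, pair, idx):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `idx`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2  # both members of the pair are consumed
        else:
            out.append(ids[i])
            i += 1
    return out

simple = merge([5, 6, 6, 7, 9, 1], (6, 7), 99)
overlapping = merge([1, 1, 1], (1, 1), 9)
print(simple)       # [5, 6, 99, 9, 1]
print(overlapping)  # [9, 1]
```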

In [31]:
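vocab_size = 276
num_merges = vocab_size - 256  # ids 0..255 are reserved for raw bytes
ids = tokens.copy()
merges = {}  # (left id, right id) -> new id, in the order learned
for i in range(num_merges):
  # Re-count pairs on the current sequence, then merge the most frequent one.
  pairs = zip(ids[:-1], ids[1:])
  top_pair = Counter(pairs).most_common(1)[0][0]
  new_id = 256+i
  merges[top_pair] = new_id
  ids = merge(ids, top_pair, new_id)
merges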

{(101, 32): 256,
 (105, 110): 257,
 (115, 32): 258,
 (226, 128): 259,
 (240, 159): 260,
 (97, 110): 261,
 (32, 116): 262,
 (97, 114): 263,
 (257, 103): 264,
 (116, 32): 265,
 (101, 114): 266,
 (111, 100): 267,
 (100, 32): 268,
 (105, 99): 269,
 (44, 32): 270,
 (111, 114): 271,
 (259, 153): 272,
 (85, 110): 273,
 (273, 269): 274,
 (274, 267): 275}
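
The raw id pairs are hard to read, so it can help to expand each learned token into the bytes it stands for. The sketch below assumes the merge table printed above; `vocab` is a hypothetical helper table that the notebook itself does not define.

```python
# Merge table copied from the output above; dict order is the learning order.
merges = {
    (101, 32): 256, (105, 110): 257, (115, 32): 258, (226, 128): 259,
    (240, 159): 260, (97, 110): 261, (32, 116): 262, (97, 114): 263,
    (257, 103): 264, (116, 32): 265, (101, 114): 266, (111, 100): 267,
    (100, 32): 268, (105, 99): 269, (44, 32): 270, (111, 114): 271,
    (259, 153): 272, (85, 110): 273, (273, 269): 274, (274, 267): 275,
}

# Hypothetical helper table: token id -> the byte string it represents.
vocab = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in merges.items():
    # Both halves already exist because merges are visited in learning order.
    vocab[new_id] = vocab[left] + vocab[right]

for new_id in sorted(merges.values()):
    print(new_id, vocab[new_id])
```

Read this way, token 264 is `ing` (built from 257, `in`, plus `g`) and token 275 is `Unicod`, built up over three successive merges.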

In [32]:
print(f"original tokens: {len(tokens)}")
print(f"compressed tokens: {len(ids)}")
print(f"compression: {len(tokens)/len(ids):.2f}X")

original tokens: 1012
compressed tokens: 768
compression: 1.32X


In [33]:
# Inverse of `merges`: new id -> the (left, right) pair it replaced.
table = {v:k for k,v in merges.items()}

In [34]:
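def decode(ids):
  ret = []
  for token in ids:
    # Expand one token until only raw bytes (ids below 256) remain.
    expanded = [token]
    while any(t in table for t in expanded):
      next_expanded = []
      for t in expanded:
        if t in table:
          next_expanded.extend(table[t])  # split a merged token back into its pair
        else:
          next_expanded.append(t)
      expanded = next_expanded
    ret.extend(expanded)
  # errors="replace" keeps decoding even if the bytes are not valid UTF-8.
  return bytes(ret).decode("utf-8", errors="replace")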

decode(ids)

'A Programmer’s Introduction to Unicode March 3, 2017 · Coding · 22 Comments  Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺\u200c🇳\u200c🇮\u200c🇨\u200c🇴\u200c🇩\u200c🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.  A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, I’ll give an introduction to it from a programmer’s point of view.  I’m going to focus on the character set and what’s involved in working with strings and files of Unicode text'
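
The `errors="replace"` argument matters because a token sequence does not have to end on a character boundary. For instance, token 259 in the merge table stands for bytes `(226, 128)`, which are only the first two bytes of a three-byte character. The sketch below shows, under that assumption, what the lenient and strict decoders do with such a fragment.

```python
# First two bytes of the three-byte UTF-8 sequence (226, 128, 153).
partial = bytes([226, 128])

# Lenient decoding substitutes U+FFFD for the malformed bytes.
lenient = partial.decode("utf-8", errors="replace")

# Strict decoding (the default) raises instead.
try:
    partial.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the third byte present, the sequence decodes to a right single quote.
complete = bytes([226, 128, 153]).decode("utf-8")

print(repr(lenient), strict_failed, repr(complete))
```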

In [35]:
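def encode(text):
  # Greedy left-to-right encoder: after each new byte, keep merging the last
  # two tokens while that pair is in `merges`. The order in which merges were
  # learned is ignored, so the output round-trips through decode() but may
  # differ from the segmentation produced during training.
  ret = []
  for c in text.encode("utf-8"):
    ret.append(c)
    while len(ret) > 1 and (ret[-2], ret[-1]) in merges:
      ret[-2:] = [merges[(ret[-2], ret[-1])]]
  return ret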

In [36]:
text2 = decode(encode(text))
text == text2

True
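
The round trip succeeds, but the `encode` cell above merges greedily as bytes arrive, so it can segment some inputs differently from how training would. A common alternative is to always apply the earliest-learned merge that is present anywhere in the sequence. The sketch below compares the two on a made-up input; `encode_by_priority` and `encode_streaming` are hypothetical names, and only two entries of the merge table are used.

```python
# Two entries from the learned table: " t" -> 262 was learned before "t " -> 265.
merges = {(32, 116): 262, (116, 32): 265}

def merge(ids, pair, idx):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode_by_priority(text, merges):
    """Repeatedly apply the earliest-learned merge present in the sequence."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # A lower new id means the merge was learned earlier.
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        ids = merge(ids, pair, merges[pair])
    return ids

def encode_streaming(text, merges):
    """Same strategy as the notebook's encode: merge the tail as bytes arrive."""
    out = []
    for b in text.encode("utf-8"):
        out.append(b)
        while len(out) > 1 and (out[-2], out[-1]) in merges:
            out[-2:] = [merges[(out[-2], out[-1])]]
    return out

by_priority = encode_by_priority("t t", merges)
streaming = encode_streaming("t t", merges)
print(by_priority)  # [116, 262]
print(streaming)    # [265, 116]
```

Both results decode to the same text. The difference only matters when the token ids must match the segmentation that the training loop would have produced.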

In [37]:
valtext = "Many common characters, including numerals, punctuation, and other symbols, are unified within the standard and are not treated as specific to any given writing system. Unicode encodes thousands of emoji, with the continued development thereof conducted by the Consortium as a part of the standard.[4] Moreover, the widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan. Unicode is ultimately capable of encoding more than 1.1 million characters."
valtext2 = decode(encode(valtext))
print(valtext2 == valtext)

True
