# Lecture 7: Tokenizer

## First tokenize the entire short story (The Verdict)

In [14]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()

print("Total characters", len(raw_text))
raw_text[:99]

Total characters 20479


'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no '

In [15]:
# Use regular expressions with re

import re
text = "Hello, world. This is a test!"
result = re.split(r'(\s)', text) # split on whitespace; the capturing group keeps the separators

print(result)

['Hello,', ' ', 'world.', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test!']


In [16]:
# also split on commas, periods, and exclamation marks
result = re.split(r'([,.!]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '!', '']


In [17]:
result = [item for item in result if item.strip()] # drops empty strings and whitespace-only items
result

['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '!']

Removing whitespace tokens reduces memory and compute, but it has a drawback: in some texts whitespace carries meaning (e.g. indentation in Python code).

When building an LLM, consider whether removing whitespace makes sense for the intended application.
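
To make the trade-off concrete, here is a minimal sketch that tokenizes a two-line Python snippet with and without whitespace tokens. The snippet in `code` is made up for illustration; only `re.split` from the cells above is reused.

```python
import re

# A tiny, made-up Python snippet where indentation carries meaning.
code = "if x:\n    return 1"

# Split on whitespace and keep each whitespace character as its own token.
# Only empty strings are dropped.
with_ws = [t for t in re.split(r'(\s)', code) if t != ""]

# Same tokens, but with whitespace-only items removed, as in the cells above.
without_ws = [t for t in with_ws if t.strip()]

print(with_ws)
print(without_ws)  # ['if', 'x:', 'return', '1']

# With whitespace kept, joining the tokens restores the input exactly.
print("".join(with_ws) == code)  # True
```

Without the whitespace tokens, the newline and the indentation are gone, so the original layout of the code cannot be reconstructed from the tokens alone.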

In [None]:
# This is the final simple tokenization scheme
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item for item in result if item.strip()]
result

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

In [19]:
preproccessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preproccessed = [item for item in preproccessed if item.strip()]
print(preproccessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [20]:
print(len(preproccessed))

4690


## Now we must convert tokens into Token IDs (step 2)

*Tokens now have to have numerical representations*

In [22]:
all_words = sorted(set(preproccessed))
vocab_size = len(all_words)
vocab_size

1130

In [24]:
# Map each token to its index in the sorted vocabulary
vocab = {token:integer for integer,token in enumerate(all_words)}

In [31]:
for i, item in list(enumerate(vocab.items()))[:50]:
  print(item)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)


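**The vocab is in sorted order** (see above): punctuation first, then capitalized words, then lowercase words, because Python sorts strings by Unicode code point.
- Each unique token appears exactly once
- Each token maps to a unique integer ID

We also need the reverse direction: a decoder that maps token IDs back to tokens.

Define two methods, _encode_ and _decode_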
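
The core idea behind decoding is just inverting the vocab dictionary. A minimal sketch, using a made-up three-token vocab (`toy_vocab` is an assumption for illustration, not the story's vocab):

```python
# Toy vocabulary, made up for illustration.
toy_vocab = {"!": 0, "Hello": 1, "world": 2}

# Invert the mapping: token ID -> token string.
toy_inverse = {idx: tok for tok, idx in toy_vocab.items()}

# Encode a list of tokens to IDs, then map the IDs back to tokens.
ids = [toy_vocab[tok] for tok in ["Hello", "world", "!"]]
tokens = [toy_inverse[i] for i in ids]

print(ids)     # [1, 2, 0]
print(tokens)  # ['Hello', 'world', '!']
```

The inversion only works because every token has a unique ID; two tokens sharing one ID would collide in the inverse dictionary.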

In [33]:
class SimpleTokenizer:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self, input_text):
    preproccessed = re.split(r'([,.:;?_!"()\']|--|\s)', input_text)
    preproccessed = [item.strip() for item in preproccessed if item.strip()]
    ids = [self.str_to_int[s] for s in preproccessed]
    return ids
  
  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # join() put a space between all tokens; remove the space before punctuation
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

Take some text, encode it and decode it back! This is a simple sanity check.

In [36]:
tokenizer = SimpleTokenizer(vocab)
text = """"It's the last he painted, you know"
          Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [37]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know" Mrs. Gisburn said with pardonable pride.'

Now encode something not in our vocab

In [38]:
text = "Where is my five iron?!!"
ids = tokenizer.encode(text)
print(ids)

KeyError: 'Where'

This raises a KeyError because the word 'Where' is not in our vocab. This motivates __special context tokens__:
- The tokenizer must be able to handle unknown words
- Add an unknown-word token, <|unk|>
- Also add an end-of-text token, <|endoftext|>, to mark the boundary between unrelated texts

Add these two!

In [47]:
# add two more tokens to our vocab
all_tokens = sorted(list(set(preproccessed)))
all_tokens.extend(["<|endoftext|>","<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}
len(vocab) # should be 1132: the 1130 tokens from before plus the two new ones

1132

In [49]:
# show these two tokens were actually added
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [50]:
class SimpleTokenizer2:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self, input_text):
    preproccessed = re.split(r'([,.:;?_!"()\']|--|\s)', input_text)
    preproccessed = [item.strip() for item in preproccessed if item.strip()]
    preproccessed = [
      (item if item in self.str_to_int
      else "<|unk|>") for item in preproccessed
    ]
    ids = [self.str_to_int[s] for s in preproccessed]
    return ids
  
  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # join() put a space between all tokens; remove the space before punctuation
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

In [51]:
tokenizer = SimpleTokenizer2(vocab)

text1 = "Where is my five iron?!!"
text2 = "Hey there, do you like tea?"

text = " <|endoftext|> ".join((text1, text2))
print(text)

Where is my five iron?!! <|endoftext|> Hey there, do you like tea?


In [52]:
tokenizer.encode(text)

[1131,
 584,
 697,
 445,
 1131,
 10,
 0,
 0,
 1130,
 1131,
 992,
 5,
 355,
 1126,
 628,
 975,
 10]

In [53]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|> is my five <|unk|>?!! <|endoftext|> <|unk|> there, do you like tea?'

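_Unknown words are now encoded and decoded without an error, and the decoded text shows where they occurred_

The vocab does not include 'Where', 'Hey', or 'iron', so each is replaced by <|unk|> instead of raising a KeyError. The original words themselves cannot be recovered.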
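
As an aside, the same fallback can be written with `dict.get`, which returns a default when a key is missing. This is a sketch of an alternative, not the class above: `encode_with_unk` and `toy_vocab` are hypothetical names introduced for illustration.

```python
import re

# Toy vocabulary, made up for illustration; it must contain the unknown token.
toy_vocab = {",": 0, "do": 1, "like": 2, "tea": 3, "you": 4, "?": 5, "<|unk|>": 6}

def encode_with_unk(text, vocab, unk_token="<|unk|>"):
    """Encode text, mapping any token missing from vocab to the unknown-token ID."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    pieces = [p.strip() for p in pieces if p.strip()]
    unk_id = vocab[unk_token]
    # dict.get falls back to unk_id instead of raising a KeyError.
    return [vocab.get(p, unk_id) for p in pieces]

print(encode_with_unk("Hey, do you like tea?", toy_vocab))  # [6, 0, 1, 4, 2, 3, 5]
```

Here 'Hey' is the only token missing from the toy vocab, so it is the only one mapped to ID 6.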

__Some more special tokens used by other tokenizers:__
- BOS (beginning of sequence)
- EOS (end of sequence)
- PAD (padding)
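
A minimal sketch of how these three tokens are typically used when batching sequences of different lengths. The IDs 0, 1, 2 and the helper `frame_and_pad` are assumptions made for this example; real tokenizers define their own special-token IDs.

```python
# Hypothetical special-token IDs, chosen only for this sketch.
BOS, EOS, PAD = 0, 1, 2

def frame_and_pad(sequences, bos=BOS, eos=EOS, pad=PAD):
    """Wrap each ID sequence in BOS/EOS, then right-pad all to the same length."""
    framed = [[bos, *seq, eos] for seq in sequences]
    max_len = max(len(seq) for seq in framed)
    return [seq + [pad] * (max_len - len(seq)) for seq in framed]

batch = frame_and_pad([[10, 11, 12], [20]])
print(batch)  # [[0, 10, 11, 12, 1], [0, 20, 1, 2, 2]]
```

After padding, every row has the same length, which is what lets a batch be stored as one rectangular array.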

Of these special tokens, GPT uses ONLY <|endoftext|>

GPT also uses byte-pair encoding, which needs no <|unk|> token: a word that is not in the vocab is broken down into smaller sub-word units (in the worst case, into individual characters or bytes)
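
To give a feel for the sub-word idea, here is a toy sketch of the core byte-pair merge step: repeatedly find the most frequent adjacent pair of symbols and merge it into one new symbol. It works on characters of a made-up string and is not GPT's actual byte-level tokenizer; `most_frequent_pair` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent symbol pair that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters (the worst-case fallback units).
symbols = list("low lower lowest")

# Two merge rounds: the frequent sequence "low" becomes a single symbol.
for _ in range(2):
    symbols = merge_pair(symbols, most_frequent_pair(symbols))

print(symbols)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

Frequent character sequences end up as single tokens, while rare ones stay split into smaller pieces, so no input ever needs an unknown token.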

# Lecture 8: Byte-pair encoding

Last time, we built a very simple tokenizer, but GPT actually uses byte-pair encoding (BPE), which we will start on now!