Implemented two simple tokenizers from scratch and demonstrated the tiktoken library.


Dataset used: 'The Verdict' by Edith Wharton (1908)

Step 1: Creating tokens (word-based tokenizer)

In [1]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [2]:
print("Total number of characters:", len(raw_text))
print(raw_text[:99])
print(type(raw_text))

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 
<class 'str'>


In [3]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

print(preprocessed[:15])
print(len(preprocessed))

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow']
4690
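
The parentheses in the pattern above matter: re.split only returns the delimiters it matched when the pattern contains a capturing group. A minimal sketch on a made-up sentence (the `sample` string is an assumption for illustration, not part of the dataset):

```python
import re

sample = "Hello, world. Is this-- a test?"

# Without a capturing group, re.split discards the delimiters,
# so punctuation never becomes a token.
no_group = re.split(r'[,.:;?_!"()\']|--|\s', sample)
no_group = [item for item in no_group if item.strip()]

# With a capturing group, the matched delimiters are returned as well.
with_group = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
with_group = [item for item in with_group if item.strip()]

print(no_group)    # ['Hello', 'world', 'Is', 'this', 'a', 'test']
print(with_group)  # ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Because decode re-attaches punctuation, the tokenizer classes need the capturing form in encode as well.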


Step 2: Creating Token IDs

In [4]:
all_words = sorted(set(preprocessed))
vocab = {token:integer for integer,token in enumerate(all_words)}
len(vocab)

1130

Implementing a Python class for tokenization.
This class has two methods, encode and decode.
Step 1: Store the vocabulary as an instance attribute so that the encode and decode methods can access it.

Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens.

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

In [5]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.toInt = vocab  # token string -> token ID
        self.toStr = {i: s for s, i in vocab.items()}  # token ID -> token string

    def encode(self, text):
        # The capturing group keeps punctuation as separate tokens.
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Raises KeyError for any token that is not in the vocabulary.
        ids = [self.toInt[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.toStr[i] for i in ids])
        # Remove the space that the join inserted before punctuation.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [6]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]

In [7]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [8]:
text = "Hello, do you like tea?"
# "Hello" does not occur in the training text, so encode() raises a KeyError:
#print(tokenizer.encode(text))

In [9]:
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
len(vocab)

1132

In [10]:
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [11]:
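class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.toInt = vocab  # token string -> token ID
        self.toStr = {integer: token for token, integer in vocab.items()}  # token ID -> token string

    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace; the capturing
        # group keeps the delimiters so punctuation becomes its own token.
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens missing from the vocabulary with <|unk|>.
        preprocessed = [item if item in self.toInt else "<|unk|>" for item in preprocessed]
        ids = [self.toInt[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.toStr[i] for i in ids])
        # Remove the space that the join inserted before punctuation.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text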


In [12]:
tokenizer = SimpleTokenizerV2(vocab)

In [13]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
ids = tokenizer.encode(text)
print(ids)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [14]:
tokenizer.decode(ids)

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
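
The <|unk|> fallback keeps encode from failing, but it is lossy: every unknown word collapses to the same ID, so decode cannot recover it. A small self-contained sketch, assuming a hypothetical six-token vocabulary and standalone encode/decode helpers that mirror SimpleTokenizerV2 (none of these names come from the notebook):

```python
import re

# Hypothetical toy vocabulary, just large enough for this demonstration.
toy_tokens = sorted({"do", "you", "like", "tea", ",", "?"}) + ["<|unk|>"]
to_int = {token: i for i, token in enumerate(toy_tokens)}
to_str = {i: token for token, i in to_int.items()}

def encode(text):
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    # Unknown tokens all map to the single <|unk|> ID.
    return [to_int.get(t, to_int["<|unk|>"]) for t in tokens]

def decode(ids):
    text = " ".join(to_str[i] for i in ids)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

ids_a = encode("Hello, do you like tea?")
ids_b = encode("Goodbye, do you like tea?")

print(ids_a == ids_b)  # True: two different inputs, identical IDs
print(decode(ids_a))   # <|unk|>, do you like tea?
```

Two different sentences produce the same ID sequence, which is the limitation that subword tokenizers such as BPE avoid.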

BYTE PAIR ENCODING: Implementing BPE from scratch is relatively complicated, so we use tiktoken, an existing open-source Python library, instead.

In [15]:
import tiktoken

In [16]:
tokenizer = tiktoken.get_encoding("gpt2")

In [17]:
text = ( "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace." )

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [18]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


The BPE tokenizer can handle unknown words. How does it achieve this without an <|unk|> token?


The BPE algorithm breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.
This enables it to handle out-of-vocabulary (OOV) words.

An example illustrating how the BPE tokenizer deals with unknown words:

In [19]:
integers = tokenizer.encode("Akwirw ier")
print("integers: ", integers)

strings = tokenizer.decode(integers)
print("strings: ", strings)

integers:  [33901, 86, 343, 86, 220, 959]
strings:  Akwirw ier
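
To make the subword idea concrete, here is a toy sketch of BPE-style merging in plain Python. It is an illustration under simplifying assumptions, not tiktoken's implementation: it works on characters rather than bytes, and the corpus and the helper names (merge_pair, learn_merges, tokenize) are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus.
words = ["low"] * 5 + ["lower"] * 2 + ["slow"] * 3 + ["lot"]
merges = learn_merges(words, num_merges=2)

print(merges)                     # [('l', 'o'), ('lo', 'w')]
print(tokenize("lowly", merges))  # ['low', 'l', 'y']
print(tokenize("xyz", merges))    # ['x', 'y', 'z']
```

The word "lowly" never appears in the toy corpus, yet it is still encoded: the learned subword "low" is reused and the remaining characters fall back to single symbols, so no <|unk|> token is needed.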
