# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

## Some sample text

In [50]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [51]:
print(text.split())

['This', 'is', 'a', 'test!', 'Or', 'is', 'this', 'not', 'a', 'test?', 'Test', 'it', 'to', 'be', 'sure.', ':)']


In [52]:
str.split?

Signature: str.split(self, /, sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \\n \\r \\t \\f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits (starting from the left).
    -1 (the default value) means no limit.

Note, str.split() is mainly useful for data that has been intentionally
delimited.  With natural text that includes punctuation, consider using
the regular expression module.
Type:      method_descriptor

In [53]:
import re

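Split the text on punctuation and whitespace, keeping the punctuation as tokens. `set` then keeps only the unique tokens, and `sorted` puts them in a stable order.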

In [54]:
tokens = re.split(r'([.?!:()]|\s)', text)
# Drop the empty and whitespace-only pieces that re.split leaves behind.
tokens = [item for item in tokens if item.strip()]
Stokens = sorted(set(tokens))
print(tokens)
print(Stokens)

['This', 'is', 'a', 'test', '!', 'Or', 'is', 'this', 'not', 'a', 'test', '?', 'Test', 'it', 'to', 'be', 'sure', '.', ':', ')']
['!', ')', '.', ':', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']
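
The punctuation survives in the output above because the pattern is wrapped in a capturing group: `re.split` returns the text matched by a capturing group along with the pieces between matches. A minimal sketch of the difference, using a made-up sample string:

```python
import re

sample = "Is it? Yes."

# Without a capturing group, the delimiters are dropped from the result.
without_group = [t for t in re.split(r'[.?!]|\s', sample) if t.strip()]

# With a capturing group, re.split also returns the matched delimiters,
# so punctuation is kept as tokens; whitespace is filtered out afterwards.
with_group = [t for t in re.split(r'([.?!]|\s)', sample) if t.strip()]

print(without_group)  # ['Is', 'it', 'Yes']
print(with_group)     # ['Is', 'it', '?', 'Yes', '.']
```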


In [55]:

vocab = {token: index for index, token in enumerate(Stokens)}
print(vocab.items())

dict_items([('!', 0), (')', 1), ('.', 2), (':', 3), ('?', 4), ('Or', 5), ('Test', 6), ('This', 7), ('a', 8), ('be', 9), ('is', 10), ('it', 11), ('not', 12), ('sure', 13), ('test', 14), ('this', 15), ('to', 16)])


In [56]:
vocab["a"]

8

In [57]:
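# Assumes the file is UTF-8 encoded; being explicit avoids platform-dependent defaults.
with open("one-foggy-night.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print(raw_text[:100])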

A PLUME of smoke drifted under the great glass dome of Slagborough Station as the Northern Express c


In [131]:
tokens = re.split(r'([.?!:()",]|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]
# Special tokens: a placeholder for unknown words and an end-of-text marker.
tokens.extend(["<|unk|>", "<|endoftext|>"])

print(len(tokens))

# Deduplicate and sort to get the vocabulary entries.
tokens = sorted(set(tokens))
print(len(tokens))

5033
1197
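
The second count is much smaller than the first because most tokens repeat. One way to see which tokens repeat is `collections.Counter`. The sketch below uses the short sample sentence from earlier, not the story file, so that it runs on its own; `sample` and `sample_tokens` are names introduced here for illustration.

```python
import re
from collections import Counter

sample = "This is a test! Or is this not a test? Test it to be sure. :)"
sample_tokens = [t for t in re.split(r'([.?!:()]|\s)', sample) if t.strip()]

# Count how often each token occurs.
counts = Counter(sample_tokens)

print(len(sample_tokens), "tokens,", len(counts), "unique")
print(counts.most_common(3))
```

Note that `This` and `this` are counted separately, since the split is case-sensitive.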


In [132]:
print(tokens[-20:])

['worked', 'working', 'would', "wouldn't", 'writer', 'written', 'wrong', 'wrote', 'yard', 'year', 'years', 'yes', 'yet', 'you', "you're", "you've", 'your', 'yours', 'zest', '—']


In [133]:
vocab = {token:index for index, token in enumerate(tokens)}
vocab.items()

dict_items([('!', 0), ('"', 1), ("'asn't", 2), ("'ave", 3), ("'e", 4), ("'e'd", 5), ("'e's", 6), ("'im", 7), ("'imself", 8), ("'is", 9), ("'isself", 10), ("'ow", 11), (',', 12), ('.', 13), ('18975', 14), ('19', 15), (':', 16), ('<|endoftext|>', 17), ('<|unk|>', 18), ('?', 19), ('A', 20), ('According', 21), ('Ah', 22), ('Along', 23), ('And', 24), ('Anyway', 25), ('Apparently', 26), ('As', 27), ('At', 28), ("Barker's", 29), ('Besides', 30), ('But', 31), ('Certainly', 32), ('Churchyard', 33), ('City', 34), ('Come', 35), ('Dear', 36), ('England', 37), ('Essex', 38), ('Express', 39), ('Fadden', 40), ("Fadden's", 41), ('Flinn', 42), ('For', 43), ('Foxhill', 44), ('From', 45), ('Gange', 46), ('Gaylord', 47), ("Gray's", 48), ('Has', 49), ('He', 50), ('Here', 51), ('His', 52), ('How', 53), ('I', 54), ("I'll", 55), ("I'm", 56), ('If', 57), ('In', 58), ('Inn', 59), ('Inside', 60), ('Inspector', 61), ('Instead', 62), ("Isn't", 63), ('It', 64), ("It's", 65), ('Jabez', 66), ('Joe', 67), ('London', 6

In [134]:
phrase = "A moment later, white and shaken, he was racing down the platform in the direction of the guard's van."

In [135]:
phrase = re.split(r'([.?!:()",]|\s)', phrase)
phrase = [item for item in phrase if item.strip()]

In [136]:
ids = [vocab[token] for token in phrase]
print(ids)

[20, 693, 621, 12, 1154, 170, 925, 12, 515, 1129, 856, 362, 1029, 804, 560, 1029, 346, 747, 1029, 496, 1113, 13]


In [137]:
vocab["A"]

20

In [138]:
reverse_vocab = {index:token for token,index in vocab.items()}
reverse_vocab.items()

dict_items([(0, '!'), (1, '"'), (2, "'asn't"), (3, "'ave"), (4, "'e"), (5, "'e'd"), (6, "'e's"), (7, "'im"), (8, "'imself"), (9, "'is"), (10, "'isself"), (11, "'ow"), (12, ','), (13, '.'), (14, '18975'), (15, '19'), (16, ':'), (17, '<|endoftext|>'), (18, '<|unk|>'), (19, '?'), (20, 'A'), (21, 'According'), (22, 'Ah'), (23, 'Along'), (24, 'And'), (25, 'Anyway'), (26, 'Apparently'), (27, 'As'), (28, 'At'), (29, "Barker's"), (30, 'Besides'), (31, 'But'), (32, 'Certainly'), (33, 'Churchyard'), (34, 'City'), (35, 'Come'), (36, 'Dear'), (37, 'England'), (38, 'Essex'), (39, 'Express'), (40, 'Fadden'), (41, "Fadden's"), (42, 'Flinn'), (43, 'For'), (44, 'Foxhill'), (45, 'From'), (46, 'Gange'), (47, 'Gaylord'), (48, "Gray's"), (49, 'Has'), (50, 'He'), (51, 'Here'), (52, 'His'), (53, 'How'), (54, 'I'), (55, "I'll"), (56, "I'm"), (57, 'If'), (58, 'In'), (59, 'Inn'), (60, 'Inside'), (61, 'Inspector'), (62, 'Instead'), (63, "Isn't"), (64, 'It'), (65, "It's"), (66, 'Jabez'), (67, 'Joe'), (68, 'London

In [139]:
# Use `i` rather than `id` so the built-in id() is not shadowed.
print(" ".join(reverse_vocab[i] for i in ids))

A moment later , white and shaken , he was racing down the platform in the direction of the guard's van .
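
Joining the tokens with spaces leaves a stray space before each punctuation mark, as the output above shows. One possible clean-up is a `re.sub` that removes whitespace in front of the punctuation characters used for splitting. This is a sketch with a known limitation: it treats opening and closing quotes the same way, so quoted text comes out wrong. The strings `joined` and `quoted` are made-up examples.

```python
import re

PUNCT_SPACING = r'\s+([.?!:()",])'

joined = "A moment later , white and shaken ."
tidied = re.sub(PUNCT_SPACING, r'\1', joined)
print(tidied)  # A moment later, white and shaken.

# Limitation: the space before an opening quote is removed too,
# while the space after it is left in place.
quoted = 'he said , " stop " .'
print(re.sub(PUNCT_SPACING, r'\1', quoted))  # he said," stop".
```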


In [None]:
# First version, kept for reference: encode() raises KeyError for any token
# that is not in the vocabulary. The next cell adds <|unk|> handling.
# class SimpleTokenizer:
#     def __init__(self, vocab):
#         self.str_to_int = vocab
#         self.int_to_string = {index:token for token, index in vocab.items()}
#     def encode(self, text):
#         tokens = re.split(r'([.?!:()",]|\s)', text)
#         tokens = [ items for items in tokens if items.split()]
#         ids = [ self.str_to_int[token] for token in tokens]
#         return ids
#
#     def decode(self, ids):
#         ## text = [self.int_to_string[id] for id in ids]
#         text = ( " ".join([self.int_to_string[id] for id in ids]))
#         text = re.sub( r'\s+([.?!:()",])', r'\1', text)
#         return text

In [None]:
class SimpleTokenizer:
    """Whitespace-and-punctuation tokenizer backed by a fixed vocabulary."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_string = {index: token for token, index in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([.?!:()",]|\s)', text)
        # Drop empty and whitespace-only pieces; map out-of-vocabulary tokens to <|unk|>.
        tokens = [item if item in self.str_to_int else "<|unk|>" for item in tokens if item.strip()]
        return [self.str_to_int[token] for token in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_string[i] for i in ids)
        # Remove the space that the join puts before punctuation.
        return re.sub(r'\s+([.?!:()",])', r'\1', text)

In [145]:
tokenizer = SimpleTokenizer(vocab)

In [152]:
phrase = "A moment later, white and shaken, he was racing down the platform in the direction of the guard's van."

ids = tokenizer.encode(phrase)
print(ids)

[20, 693, 621, 12, 1154, 170, 925, 12, 515, 1129, 856, 362, 1029, 804, 560, 1029, 346, 747, 1029, 496, 1113, 13]


In [153]:
text = tokenizer.decode(ids)
print(text)

A moment later, white and shaken, he was racing down the platform in the direction of the guard's van.


In [154]:
phrase = "A2 moment later, white and shaken, he was racing down the platform in the direction of the guard's van2."

ids = tokenizer.encode(phrase)
print(ids)

[18, 693, 621, 12, 1154, 170, 925, 12, 515, 1129, 856, 362, 1029, 804, 560, 1029, 346, 747, 1029, 496, 18, 13]


In [155]:
text = tokenizer.decode(ids)
print(text)

<|unk|> moment later, white and shaken, he was racing down the platform in the direction of the guard's <|unk|>.
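
Once a token is replaced by `<|unk|>`, the original word cannot be recovered from the ids: `A2` and `van2` both became id 18 above. Before encoding, it can help to check which tokens will be lost. The sketch below is one way to do that; `find_unknown_tokens` and `toy_vocab` are hypothetical names introduced for illustration, and the toy vocabulary stands in for the real one so the example runs on its own.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that are missing from `vocab` (hypothetical helper)."""
    tokens = [t for t in re.split(r'([.?!:()",]|\s)', text) if t.strip()]
    return [t for t in tokens if t not in vocab]

# Toy vocabulary, assumed for this example only.
toy_vocab = {token: index for index, token in
             enumerate(["<|unk|>", ".", "A", "later", "moment", "van"])}

missing = find_unknown_tokens("A2 moment later van2.", toy_vocab)
print(missing)  # ['A2', 'van2']
```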


In [157]:
import tiktoken

In [160]:
tokenizer = tiktoken.get_encoding("gpt2")

In [161]:
phrase = "A2 moment later ahfjahghaohopgai, white and shaken, he was racing down the platform in the direction of the guard's van2."

ids = tokenizer.encode(phrase)
print(ids)

text = tokenizer.decode(ids)
print(text)

[32, 17, 2589, 1568, 29042, 69, 31558, 46090, 1219, 404, 70, 1872, 11, 2330, 290, 27821, 11, 339, 373, 11717, 866, 262, 3859, 287, 262, 4571, 286, 262, 4860, 338, 5719, 17, 13]
A2 moment later ahfjahghaohopgai, white and shaken, he was racing down the platform in the direction of the guard's van2.
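
No `<|unk|>` appears in this output, and the round trip is exact. The GPT-2 encoding is a byte pair encoding (BPE) built on bytes, so an unfamiliar string is broken into smaller known pieces rather than replaced by a placeholder. The sketch below illustrates only the fallback idea, namely that any string can be represented with 256 byte values. It is not tiktoken's algorithm and does no merging; `bytes_encode` and `bytes_decode` are hypothetical helpers.

```python
def bytes_encode(text):
    """Map text to a list of byte values (0-255), one id per UTF-8 byte."""
    return list(text.encode("utf-8"))

def bytes_decode(byte_ids):
    """Invert bytes_encode."""
    return bytes(byte_ids).decode("utf-8")

word = "ahfjahghaohopgai"
byte_ids = bytes_encode(word)

print(len(word), "characters ->", len(byte_ids), "byte ids")
print(bytes_decode(byte_ids) == word)  # True
```

A real BPE tokenizer starts from a base like this and merges frequent adjacent pairs into longer tokens, which is why common words map to a single id while the gibberish word above needs several.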
