# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

## Some sample text

In [65]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [66]:
print(text.split())

['This', 'is', 'a', 'test!', 'Or', 'is', 'this', 'not', 'a', 'test?', 'Test', 'it', 'to', 'be', 'sure.', ':)']


In [67]:
str.split?

Signature: str.split(self, /, sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits.
    -1 (the default value) means no limit.

Splitting starts at the front of the string and works to the end.

Note, str.split() is mainly useful for data that has been intentionally
delimited.  With natural text that includes punctuation, consider using
the regular expression module.
Type:      method_descriptor

In [68]:
import re

In [69]:
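# Split on punctuation or whitespace; the capturing group keeps the delimiters.
tokens = re.split(r'([.!?:()]|\s)', text)
# Drop empty strings and whitespace-only items.
tokens = [item for item in tokens if item.strip()]
# Keep each distinct token once, in sorted order.
tokens = sorted(set(tokens))
print(tokens)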

['!', ')', '.', ':', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']
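
The parentheses in the pattern matter: `re.split` returns the delimiters only when the pattern contains a capturing group. The sketch below compares the two forms on a shortened copy of the sample text. The variable names are chosen for illustration only.

```python
import re

sample = "This is a test! Or is this not a test?"

# Without a capturing group, the delimiters are discarded.
without_group = re.split(r'[.!?:()]|\s', sample)

# With a capturing group, each delimiter is returned as its own item.
with_group = re.split(r'([.!?:()]|\s)', sample)

# Filtering out empty and whitespace-only items leaves words and punctuation.
filtered = [item for item in with_group if item.strip()]

print(without_group)
print(filtered)
```

Without the group, the `!` and `?` would be lost, so they could never become tokens of their own.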


In [70]:
vocab = {token: index for index, token in enumerate(tokens)}
print(vocab.items())

dict_items([('!', 0), (')', 1), ('.', 2), (':', 3), ('?', 4), ('Or', 5), ('Test', 6), ('This', 7), ('a', 8), ('be', 9), ('is', 10), ('it', 11), ('not', 12), ('sure', 13), ('test', 14), ('this', 15), ('to', 16)])


In [71]:
# The story contains curly quotes; the file is assumed to be UTF-8 encoded.
with open("SILVERBLAZE.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print(raw_text[:100])

SILVER BLAZE

"I AM afraid, Watson, that I shall have to go," said Holmes, as we sat down together t


In [72]:
tokens = re.split(r'([.,!?:()\'“”"]|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]
# Add special tokens: a placeholder for unknown words and an end-of-text marker.
tokens.extend(["<|unk|>", "<|endoftext|>"])
print(len(tokens))


11564


In [73]:
tokens = sorted(set(tokens))
print(len(tokens))

2075


In [74]:
print(tokens[:20])

['!', '"', '&', "'", '(', ')', ',', '.', '1000', '50', '<|endoftext|>', '<|unk|>', '?', 'A', 'AM', 'About', 'Above', 'Afterwards', 'Again', 'Ah']
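
The vocabulary now contains two special tokens. `<|unk|>` stands in for words that are missing from the vocabulary. `<|endoftext|>` is conventionally placed between independent documents, although this notebook never does so. The sketch below shows that use under stated assumptions: the two short documents and the helper name `split_tokens` are hypothetical. Because the pattern does not split on `<`, `|`, or `>`, the marker survives as a single token as long as it is surrounded by whitespace.

```python
import re

SPLIT_PATTERN = r'([.,!?:()\'“”"]|\s)'

def split_tokens(text):
    """Split text on punctuation and whitespace, keeping punctuation as tokens."""
    return [item for item in re.split(SPLIT_PATTERN, text) if item.strip()]

# Two hypothetical documents, joined with the end-of-text marker.
docs = ["Holmes sat down.", "Watson stood up."]
joined = " <|endoftext|> ".join(docs)

tokens_with_marker = split_tokens(joined)
print(tokens_with_marker)
```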


In [75]:
vocab = {token: index for index, token in enumerate(tokens)}
vocab.items()



dict_items([('!', 0), ('"', 1), ('&', 2), ("'", 3), ('(', 4), (')', 5), (',', 6), ('.', 7), ('1000', 8), ('50', 9), ('<|endoftext|>', 10), ('<|unk|>', 11), ('?', 12), ('A', 13), ('AM', 14), ('About', 15), ('Above', 16), ('Afterwards', 17), ('Again', 18), ('Ah', 19), ('All', 20), ('An', 21), ('And', 22), ('As', 23), ('At', 24), ('BLAZE', 25), ('Backwater', 26), ('Balmoral', 27), ('Baxter', 28), ('Bayard', 29), ('Be', 30), ('Because', 31), ('Before', 32), ('Black', 33), ('Blaze', 34), ('Bless', 35), ('Blue', 36), ('Bond', 37), ('Brown', 38), ('But', 39), ('By', 40), ('Can', 41), ('Capital', 42), ('Cavendish', 43), ('Certainly', 44), ('Chronicle', 45), ('Cinnamon', 46), ('Clapham', 47), ('Co', 48), ('Colonel', 49), ('Cup', 50), ('Cup—Silver', 51), ('D', 52), ('Dartmoor', 53), ('Dartmoor;', 54), ('Dawson', 55), ('Dear', 56), ('Derbyshire', 57), ('Desborough', 58), ('Devonshire', 59), ('Did', 60), ('Do', 61), ('Don', 62), ('Dr', 63), ('Drive', 64), ('Duke', 65), ('Edith', 66), ('England', 6

In [76]:
phrase= "Because I made a blunder, my dear Watson—which is, I am afraid, a more common occurrence than any one would think who only knew me through your memoirs."
print(phrase)

Because I made a blunder, my dear Watson—which is, I am afraid, a more common occurrence than any one would think who only knew me through your memoirs.


In [77]:
phrase = re.split(r'([.,!?:()\'“”"]|\s)', phrase)
phrase = [item for item in phrase if item.strip()]
print(phrase)

['Because', 'I', 'made', 'a', 'blunder', ',', 'my', 'dear', 'Watson—which', 'is', ',', 'I', 'am', 'afraid', ',', 'a', 'more', 'common', 'occurrence', 'than', 'any', 'one', 'would', 'think', 'who', 'only', 'knew', 'me', 'through', 'your', 'memoirs', '.']


In [78]:
ids = [vocab[token] for token in phrase]
print(ids)

[31, 97, 1190, 231, 404, 6, 1262, 619, 209, 1080, 6, 97, 284, 262, 6, 231, 1249, 546, 1315, 1829, 305, 1326, 2057, 1848, 2022, 1327, 1102, 1212, 1861, 2070, 1216, 7]


In [79]:
reverse_vocab = {index: token for token, index in vocab.items()}
reverse_vocab.items()


dict_items([(0, '!'), (1, '"'), (2, '&'), (3, "'"), (4, '('), (5, ')'), (6, ','), (7, '.'), (8, '1000'), (9, '50'), (10, '<|endoftext|>'), (11, '<|unk|>'), (12, '?'), (13, 'A'), (14, 'AM'), (15, 'About'), (16, 'Above'), (17, 'Afterwards'), (18, 'Again'), (19, 'Ah'), (20, 'All'), (21, 'An'), (22, 'And'), (23, 'As'), (24, 'At'), (25, 'BLAZE'), (26, 'Backwater'), (27, 'Balmoral'), (28, 'Baxter'), (29, 'Bayard'), (30, 'Be'), (31, 'Because'), (32, 'Before'), (33, 'Black'), (34, 'Blaze'), (35, 'Bless'), (36, 'Blue'), (37, 'Bond'), (38, 'Brown'), (39, 'But'), (40, 'By'), (41, 'Can'), (42, 'Capital'), (43, 'Cavendish'), (44, 'Certainly'), (45, 'Chronicle'), (46, 'Cinnamon'), (47, 'Clapham'), (48, 'Co'), (49, 'Colonel'), (50, 'Cup'), (51, 'Cup—Silver'), (52, 'D'), (53, 'Dartmoor'), (54, 'Dartmoor;'), (55, 'Dawson'), (56, 'Dear'), (57, 'Derbyshire'), (58, 'Desborough'), (59, 'Devonshire'), (60, 'Did'), (61, 'Do'), (62, 'Don'), (63, 'Dr'), (64, 'Drive'), (65, 'Duke'), (66, 'Edith'), (67, 'England

In [80]:
print(" ".join(reverse_vocab[token_id] for token_id in ids))


Because I made a blunder , my dear Watson—which is , I am afraid , a more common occurrence than any one would think who only knew me through your memoirs .


In [81]:
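class SimpleTokenizer:
    """Word-level tokenizer that maps out-of-vocabulary tokens to "<|unk|>"."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {index: token for token, index in vocab.items()}

    def encode(self, text):
        # Split on punctuation or whitespace, keeping punctuation as tokens.
        tokens = re.split(r'([.,!?:()\'“”"]|\s)', text)
        # Drop whitespace-only items and replace unknown tokens with "<|unk|>".
        tokens = [item if item in self.str_to_int else "<|unk|>" for item in tokens if item.strip()]
        return [self.str_to_int[token] for token in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[token_id] for token_id in ids)
        # Remove the space that the join inserted before each punctuation mark.
        return re.sub(r'\s+([.,!?:()\'“”"])', r'\1', text)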
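
The spacing cleanup in `decode` is a heuristic: it removes the whitespace before every punctuation mark, including opening quotes and parentheses. The sketch below applies the same split, join, and substitution steps to a hypothetical sentence to show that the round trip does not always reproduce the original text.

```python
import re

SPLIT_PATTERN = r'([.,!?:()\'“”"]|\s)'

# A hypothetical sentence with opening and closing punctuation.
original = 'Holmes said "wait" (twice).'

# Same steps as encode/decode, without the vocabulary lookup.
pieces = [item for item in re.split(SPLIT_PATTERN, original) if item.strip()]
joined = " ".join(pieces)
restored = re.sub(r'\s+([.,!?:()\'“”"])', r'\1', joined)

print(original)
print(restored)
```

The restored string attaches the opening quote and parenthesis to the preceding word, so the original spacing cannot be recovered from the token IDs alone.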


In [82]:
tokenizer = SimpleTokenizer(vocab)


In [83]:
phrase = "There were, also, divers ladies in New York, Newport, and elsewhere, and celebrated for their palatial homes, their jewels, and their daughters, who were anxious to know how Bellew would comport himself under his disappointment."
phrase2 = "There was a fringe of flowering geraniums in the window; the two women had to stretch their heads over them."
phrase3 = "An affair of some mystery, remarked Mr. Gryce, as we halted at the corner to take a final look at the house and its environs."

In [84]:
ids = tokenizer.encode(phrase)
print(ids)

ids2 = tokenizer.encode(phrase2)
print(ids2)

ids3 = tokenizer.encode(phrase3)
print(ids3)

[186, 2007, 6, 279, 6, 11, 1113, 1037, 125, 11, 6, 11, 6, 294, 11, 6, 294, 11, 858, 1832, 11, 11, 6, 1832, 11, 6, 294, 1832, 11, 6, 2022, 2007, 11, 1871, 1107, 1007, 11, 2057, 11, 988, 1928, 989, 11, 7]
[186, 1986, 231, 11, 1317, 11, 11, 1037, 1831, 11, 1831, 1924, 2044, 947, 1871, 11, 1832, 11, 1351, 1833, 7]
[21, 261, 1317, 1698, 1264, 6, 1539, 119, 7, 11, 6, 331, 1998, 11, 344, 1831, 578, 1871, 1807, 231, 835, 1180, 344, 1831, 1005, 294, 1082, 11, 7]


In [85]:
text = tokenizer.decode(ids)
print(text)

text2 = tokenizer.decode(ids2)
print(text2)

text3 = tokenizer.decode(ids3)
print(text3)

There were, also, <|unk|> ladies in New <|unk|>, <|unk|>, and <|unk|>, and <|unk|> for their <|unk|> <|unk|>, their <|unk|>, and their <|unk|>, who were <|unk|> to know how <|unk|> would <|unk|> himself under his <|unk|>.
There was a <|unk|> of <|unk|> <|unk|> in the <|unk|> the two women had to <|unk|> their <|unk|> over them.
An affair of some mystery, remarked Mr. <|unk|>, as we <|unk|> at the corner to take a final look at the house and its <|unk|>.
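
Many words in these phrases never occur in the training text, so they collapse to `<|unk|>` and their content is lost. One way to quantify this is the share of unknown IDs in an encoded phrase. The helper `unknown_rate` below is a hypothetical name, not part of the notebook; the ID list is copied from the second encoded phrase above, where `<|unk|>` has ID 11.

```python
def unknown_rate(ids, unk_id):
    """Return the fraction of token IDs equal to unk_id (0.0 for an empty list)."""
    if not ids:
        return 0.0
    return sum(1 for token_id in ids if token_id == unk_id) / len(ids)

# IDs of the second phrase, copied from the output above; 11 is "<|unk|>".
encoded_phrase2 = [186, 1986, 231, 11, 1317, 11, 11, 1037, 1831, 11, 1831,
                   1924, 2044, 947, 1871, 11, 1832, 11, 1351, 1833, 7]

rate = unknown_rate(encoded_phrase2, unk_id=11)
print(f"{rate:.0%} of the tokens are unknown")
```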


In [86]:
import tiktoken

In [87]:
tiktokenizer = tiktoken.get_encoding("gpt2")

In [88]:
phrase = "idndiewsoismsowm"

In [89]:
ids = tiktokenizer.encode(phrase)
print(ids)

[312, 358, 769, 568, 6583, 322, 76]


In [90]:
tiktokenizer.decode(ids)

'idndiewsoismsowm'
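
The nonsense string encodes to seven IDs and decodes back exactly, with no `<|unk|>` involved, because the GPT-2 tokenizer works on subword pieces instead of whole words. The sketch below illustrates the general idea with a toy greedy longest-match splitter. It is not the byte pair encoding algorithm that `tiktoken` implements, and the set `known_pieces` is made up for illustration.

```python
def greedy_subword_split(word, pieces):
    """Split word into the longest known pieces, falling back to single characters."""
    result = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until one is accepted.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            # A single character is always accepted, so the loop cannot get stuck.
            if candidate in pieces or end == start + 1:
                result.append(candidate)
                start = end
                break
    return result

# A made-up set of subword pieces (an assumption, not GPT-2's vocabulary).
known_pieces = {"id", "nd", "ism", "so"}

word = "idndiewsoismsowm"
subwords = greedy_subword_split(word, known_pieces)
print(subwords)
```

Because every string can be broken down to single characters in the worst case, concatenating the pieces always reproduces the input, which is why no unknown-word placeholder is needed.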