# Tokenization

_Tokenization_ is the process of converting a body of text into individual _tokens_, e.g., words and punctuation characters. This is the first step for most Natural Language Processing (NLP) tasks, including preparing data for training an LLM. Let's see how it's done!

## Some sample text

In [5]:
text = "This is a test! Or is this not a test? Test it to be sure. :)"
print(text)
print(f"This sample text has {len(text)} characters.")

This is a test! Or is this not a test? Test it to be sure. :)
This sample text has 61 characters.


In [6]:
print( text.split() )

['This', 'is', 'a', 'test!', 'Or', 'is', 'this', 'not', 'a', 'test?', 'Test', 'it', 'to', 'be', 'sure.', ':)']


In [7]:
str.split?

Signature: str.split(self, /, sep=None, maxsplit=-1)
Docstring:
Return a list of the substrings in the string, using sep as the separator string.

  sep
    The separator used to split the string.

    When set to None (the default value), will split on any whitespace
    character (including \n \r \t \f and spaces) and will discard
    empty strings from the result.
  maxsplit
    Maximum number of splits.
    -1 (the default value) means no limit.

Splitting starts at the front of the string and works to the end.

Note, str.split() is mainly useful for data that has been intentionally
delimited.  With natural text that includes punctuation, consider using
the regular expression module.
Type:      method_descriptor

In [8]:
import re

In [9]:
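# Split on punctuation or whitespace; the capture group keeps the delimiters in the result.
tokens = re.split(r'([.?!:()]|\s)', text)
# Drop empty strings and whitespace-only items.
tokens = [item for item in tokens if item.strip()]
# Deduplicate and sort to get a stable vocabulary order.
tokens = sorted(set(tokens))
print(tokens)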

['!', ')', '.', ':', '?', 'Or', 'Test', 'This', 'a', 'be', 'is', 'it', 'not', 'sure', 'test', 'this', 'to']
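
The parentheses in the pattern matter: `re.split` discards the matched delimiters unless the pattern contains a capture group, in which case the delimiters come back as items in the result. A minimal sketch of the difference, using a made-up sample string (`sample`, `without_group`, and `with_group` are illustrative names, not part of the notebook):

```python
import re

sample = "Is this a test? Yes!"

# Without a capture group, the matched delimiters are discarded.
without_group = [t for t in re.split(r'[.?!]|\s', sample) if t]

# With a capture group, the matched delimiters are returned as items too,
# so punctuation survives as separate tokens once whitespace is filtered out.
with_group = [t for t in re.split(r'([.?!]|\s)', sample) if t.strip()]

print(without_group)  # ['Is', 'this', 'a', 'test', 'Yes']
print(with_group)     # ['Is', 'this', 'a', 'test', '?', 'Yes', '!']
```

Filtering is still needed in both cases because adjacent delimiters (such as `?` followed by a space) produce empty strings between them.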


In [10]:
vocab = { token:index for index, token in enumerate( tokens )}
print( vocab.items())

dict_items([('!', 0), (')', 1), ('.', 2), (':', 3), ('?', 4), ('Or', 5), ('Test', 6), ('This', 7), ('a', 8), ('be', 9), ('is', 10), ('it', 11), ('not', 12), ('sure', 13), ('test', 14), ('this', 15), ('to', 16)])


In [11]:
vocab["Test"]

6

In [12]:
enumerate?

Init signature: enumerate(iterable, start=0)
Docstring:
Return an enumerate object.

  iterable
    an object supporting iteration

The enumerate object yields pairs containing a count (from start, which
defaults to zero) and a value yielded by the iterable argument.

enumerate is useful for obtaining an indexed list:
    (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
Type:           type
Subclasses:

## Tokenizing a story

In [13]:
with open( "datashort_story2.txt", "r" ) as f:
    raw_text = f.read()
print(raw_text[:100])

'My Lord Chancellor,

'When I consider the Affair of an Union betwixt the two Nations, as it is expr


In [14]:
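# Same approach, now also splitting on straight and curly quotes and '^'.
# Commas and semicolons are not in the character class, so they stay attached to words (e.g. 'Act,').
tokens = re.split(r'([.?!:()"\'“”‘’^]|\s)', raw_text)
tokens = [item for item in tokens if item.strip()]
# Special tokens: a placeholder for unknown words and an end-of-text marker.
tokens.extend(["<|unk|>", "<|endoftext|>"])
print(len(tokens))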


5840


In [15]:
tokens.extend?

Signature: tokens.extend(iterable, /)
Docstring: Extend list by appending elements from the iterable.
Type:      builtin_function_or_method

In [16]:
tokens = sorted( list( set( tokens ) ) )
print( len(tokens) )

1808


In [17]:
print( tokens[:20])

['!', '"', '&c', "'", '(', ')', '.', '1604', '2000', ':', '<|endoftext|>', '<|unk|>', '?', 'A', 'ANN,', 'Abolition', 'Abroad,', 'Account,', 'Act', 'Act,']


In [18]:
vocab = { token:index for index, token in enumerate(tokens)}
vocab.items()

dict_items([('!', 0), ('"', 1), ('&c', 2), ("'", 3), ('(', 4), (')', 5), ('.', 6), ('1604', 7), ('2000', 8), (':', 9), ('<|endoftext|>', 10), ('<|unk|>', 11), ('?', 12), ('A', 13), ('ANN,', 14), ('Abolition', 15), ('Abroad,', 16), ('Account,', 17), ('Act', 18), ('Act,', 19), ('Administration,', 20), ('Advantage', 21), ('Advice', 22), ('Advices', 23), ('Affair', 24), ('Affair,', 25), ('Affairs', 26), ('Affairs;', 27), ('Agape,', 28), ('Ah', 29), ('Air,', 30), ('Ale,', 31), ('All', 32), ('Alliances,', 33), ('Almighty', 34), ('Alteration', 35), ('Anabaptists,', 36), ('Ancestors', 37), ('And', 38), ('And,', 39), ('Animosities', 40), ('Annihilations,', 41), ('Anti-courtier', 42), ('Antistatesman', 43), ('Ape,', 44), ('Appeals', 45), ('Appearance,', 46), ('Appearances', 47), ('Application', 48), ('Are', 49), ('Arguments', 50), ('Aristocracy,', 51), ('Armies', 52), ('Armies,', 53), ('Arminians,', 54), ('Article', 55), ('Article,', 56), ('Article;', 57), ('Articles', 58), ('Articles,', 59), ('

In [19]:
phrase = "I think I see a free and independent Kingdom delivering up that, which all the World hath been fighting for since the Days of Nimrod; yea, that for which most of all the Empires, Kingdoms, States, Principalities, and Dukedoms of Europe, are at this time engaged in the most bloody and cruel Wars that ever were, to wit, a Power to manage their own Affairs by themselves, without the Assistance and Counsel of any other."
print(phrase)

I think I see a free and independent Kingdom delivering up that, which all the World hath been fighting for since the Days of Nimrod; yea, that for which most of all the Empires, Kingdoms, States, Principalities, and Dukedoms of Europe, are at this time engaged in the most bloody and cruel Wars that ever were, to wit, a Power to manage their own Affairs by themselves, without the Assistance and Counsel of any other.


In [20]:
# Tokenize the phrase with the same pattern used to build the vocabulary.
phrase = re.split(r'([.?!:()"\'“”‘’^]|\s)', phrase)
phrase = [item for item in phrase if item.strip()]
print(phrase)

['I', 'think', 'I', 'see', 'a', 'free', 'and', 'independent', 'Kingdom', 'delivering', 'up', 'that,', 'which', 'all', 'the', 'World', 'hath', 'been', 'fighting', 'for', 'since', 'the', 'Days', 'of', 'Nimrod;', 'yea,', 'that', 'for', 'which', 'most', 'of', 'all', 'the', 'Empires,', 'Kingdoms,', 'States,', 'Principalities,', 'and', 'Dukedoms', 'of', 'Europe,', 'are', 'at', 'this', 'time', 'engaged', 'in', 'the', 'most', 'bloody', 'and', 'cruel', 'Wars', 'that', 'ever', 'were,', 'to', 'wit,', 'a', 'Power', 'to', 'manage', 'their', 'own', 'Affairs', 'by', 'themselves,', 'without', 'the', 'Assistance', 'and', 'Counsel', 'of', 'any', 'other', '.']


In [21]:
ids = [ vocab[token] for token in phrase]
print(ids)

[358, 1677, 358, 1576, 859, 1181, 902, 1273, 397, 1048, 1739, 1655, 1776, 885, 1656, 849, 1224, 940, 1150, 1167, 1600, 1656, 188, 1415, 502, 1801, 1654, 1167, 1776, 1373, 1415, 885, 1656, 236, 401, 709, 590, 902, 230, 1415, 257, 922, 927, 1680, 1695, 1107, 1268, 1656, 1373, 959, 902, 1033, 826, 1654, 1119, 1769, 1700, 1791, 859, 573, 1700, 1343, 1657, 1446, 26, 974, 1662, 1794, 1656, 62, 902, 172, 1415, 913, 1431, 6]


In [22]:
vocab[ "Europe"]

256
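
Note that `Europe` (ID 256) and `Europe,` (ID 257 in the encoded phrase) are separate vocabulary entries, because the split pattern does not include commas or semicolons. One possible variation, shown here only as a sketch on a made-up string, is to add `,` and `;` to the character class so that a word maps to the same token regardless of trailing punctuation. The names `sample`, `original`, and `variant` are illustrative and not part of the notebook; the IDs shown elsewhere in this notebook come from the original pattern.

```python
import re

sample = "Act, Act; Act. Europe, Europe; Europe"

# Pattern used in this notebook: ',' and ';' stay attached to the preceding word.
original = [t for t in re.split(r'([.?!:()"\'“”‘’^]|\s)', sample) if t.strip()]

# Hypothetical variant: ',' and ';' added to the character class.
variant = [t for t in re.split(r'([.,;?!:()"\'“”‘’^]|\s)', sample) if t.strip()]

print(sorted(set(original)))  # ['.', 'Act', 'Act,', 'Act;', 'Europe', 'Europe,', 'Europe;']
print(sorted(set(variant)))   # [',', '.', ';', 'Act', 'Europe']
```

The variant produces a smaller vocabulary at the cost of longer token sequences, since each comma and semicolon becomes its own token.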

In [23]:
reverse_vocab = {index:token for token,index in vocab.items()}
reverse_vocab.items()

dict_items([(0, '!'), (1, '"'), (2, '&c'), (3, "'"), (4, '('), (5, ')'), (6, '.'), (7, '1604'), (8, '2000'), (9, ':'), (10, '<|endoftext|>'), (11, '<|unk|>'), (12, '?'), (13, 'A'), (14, 'ANN,'), (15, 'Abolition'), (16, 'Abroad,'), (17, 'Account,'), (18, 'Act'), (19, 'Act,'), (20, 'Administration,'), (21, 'Advantage'), (22, 'Advice'), (23, 'Advices'), (24, 'Affair'), (25, 'Affair,'), (26, 'Affairs'), (27, 'Affairs;'), (28, 'Agape,'), (29, 'Ah'), (30, 'Air,'), (31, 'Ale,'), (32, 'All'), (33, 'Alliances,'), (34, 'Almighty'), (35, 'Alteration'), (36, 'Anabaptists,'), (37, 'Ancestors'), (38, 'And'), (39, 'And,'), (40, 'Animosities'), (41, 'Annihilations,'), (42, 'Anti-courtier'), (43, 'Antistatesman'), (44, 'Ape,'), (45, 'Appeals'), (46, 'Appearance,'), (47, 'Appearances'), (48, 'Application'), (49, 'Are'), (50, 'Arguments'), (51, 'Aristocracy,'), (52, 'Armies'), (53, 'Armies,'), (54, 'Arminians,'), (55, 'Article'), (56, 'Article,'), (57, 'Article;'), (58, 'Articles'), (59, 'Articles,'), (6

In [24]:
print(" ".join(reverse_vocab[token_id] for token_id in ids))

I think I see a free and independent Kingdom delivering up that, which all the World hath been fighting for since the Days of Nimrod; yea, that for which most of all the Empires, Kingdoms, States, Principalities, and Dukedoms of Europe, are at this time engaged in the most bloody and cruel Wars that ever were, to wit, a Power to manage their own Affairs by themselves, without the Assistance and Counsel of any other .


In [25]:
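class SimpleTokenizer:
    """Word-level tokenizer backed by a fixed token-to-ID vocabulary."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {index: token for token, index in vocab.items()}

    def encode(self, text):
        # Split on punctuation or whitespace, keeping the punctuation as tokens.
        tokens = re.split(r'([.?!:()"\'“”‘’^]|\s)', text)
        # Drop empty and whitespace-only items; map out-of-vocabulary tokens to <|unk|>.
        tokens = [item if item in self.str_to_int else "<|unk|>" for item in tokens if item.strip()]
        return [self.str_to_int[token] for token in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[token_id] for token_id in ids)
        # Remove the space that join() put before each punctuation token.
        return re.sub(r'\s+([.?!:()"\'“”‘’^])', r'\1', text)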


In [26]:
tokenizer = SimpleTokenizer(vocab)

In [27]:
phrase = "their Enemies subdued and routed, their strong Holds besieged and taken, Sieges relieved, Marshals killed and taken Prisoners, Provinces and Kingdoms are the Results of their Victories; their Royal Navy is the Terror of Europe, their Trade and Commerce extended through the Universe, encircling the whole habitable World, and rendering their own capital City the Emporium for the whole Inhabitants of the earth"

In [28]:
ids = tokenizer.encode( phrase )
print(ids)

[1657, 241, 1630, 902, 1550, 1657, 1627, 342, 952, 902, 1646, 678, 1519, 447, 1297, 902, 1645, 592, 604, 902, 400, 922, 1656, 640, 1415, 1657, 815, 1657, 650, 494, 1286, 1656, 740, 1415, 257, 1657, 777, 902, 129, 1128, 1689, 1656, 806, 1101, 1656, 1782, 1215, 850, 902, 1527, 1657, 1446, 982, 117, 1656, 238, 1167, 1656, 1782, 373, 1415, 1656, 1093]


In [29]:
text = tokenizer.decode(ids)
print(text)

their Enemies subdued and routed, their strong Holds besieged and taken, Sieges relieved, Marshals killed and taken Prisoners, Provinces and Kingdoms are the Results of their Victories; their Royal Navy is the Terror of Europe, their Trade and Commerce extended through the Universe, encircling the whole habitable World, and rendering their own capital City the Emporium for the whole Inhabitants of the earth


In [30]:
phrase = "Where will this end, my Lord?"

In [31]:
ids = tokenizer.encode( phrase )
print(ids)

[837, 1786, 1680, 1104, 1377, 427, 12]


In [32]:
text = tokenizer.decode(ids)
print(text)

Where will this end, my Lord?
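
Both phrases above round-trip cleanly because every token in them appears in the story's vocabulary. The `<|unk|>` fallback only shows up for words the vocabulary has never seen. The sketch below illustrates that case with a deliberately tiny vocabulary so it runs without the story file; `build_vocab`, `encode`, `decode`, and `toy_vocab` are stand-in names for this illustration, written to mirror the behavior of `SimpleTokenizer` rather than replace it.

```python
import re

PATTERN = r'([.?!:()"\'“”‘’^]|\s)'

def build_vocab(text):
    # Tokenize, add the two special tokens, then number the sorted unique tokens.
    tokens = [t for t in re.split(PATTERN, text) if t.strip()]
    tokens.extend(["<|unk|>", "<|endoftext|>"])
    return {token: index for index, token in enumerate(sorted(set(tokens)))}

def encode(text, vocab):
    tokens = [t for t in re.split(PATTERN, text) if t.strip()]
    # Any token missing from the vocabulary gets the ID of <|unk|>.
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

def decode(ids, vocab):
    inverse = {index: token for token, index in vocab.items()}
    text = " ".join(inverse[i] for i in ids)
    # Remove the space inserted before punctuation tokens.
    return re.sub(r'\s+([.?!:()"\'“”‘’^])', r'\1', text)

toy_vocab = build_vocab("Where will this end?")
ids = encode("Where will tokenizers end?", toy_vocab)
round_trip = decode(ids, toy_vocab)

print(ids)         # [3, 6, 1, 4, 2]
print(round_trip)  # Where will <|unk|> end?
```

The unknown word is lost for good: decoding returns the placeholder, not the original text. Subword tokenizers avoid this by breaking unfamiliar words into smaller known pieces.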


In [33]:
import tiktoken

In [34]:
tokenizer = tiktoken.get_encoding("gpt2")

In [35]:
ids = tokenizer.encode(phrase)
print(ids)

[8496, 481, 428, 886, 11, 616, 4453, 30]


In [36]:
phrase = "I'll get you tokenizer try this: “”‘’ sfdgjhsdfjzshfd"

In [37]:
ids = tokenizer.encode(phrase)
print(ids)

[40, 1183, 651, 345, 11241, 7509, 1949, 428, 25, 564, 250, 447, 251, 447, 246, 447, 247, 264, 16344, 70, 73, 11994, 7568, 73, 89, 1477, 16344]


In [38]:
tokenizer.decode(ids)

"I'll get you tokenizer try this: “”‘’ sfdgjhsdfjzshfd"