## Reading a short story

### Step 1: Creating Tokens

In [31]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total characters:", len(raw_text))
print(raw_text[:100])

Total characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


Our goal is to tokenize this 20,479-character short story into individual words and punctuation marks that we can later turn into embeddings for LLM training.

We use a regular expression to split the text into tokens, starting with a short sample sentence.

In [32]:
import re

text = "this is a test. This test is only a test."
result = re.split(r'(\s)', text)

print(result)

['this', ' ', 'is', ' ', 'a', ' ', 'test.', ' ', 'This', ' ', 'test', ' ', 'is', ' ', 'only', ' ', 'a', ' ', 'test.']


The result is a list of words and whitespace characters, but punctuation is still attached to the preceding word (for example, 'test.').

Let's modify the regex so that it splits on commas and periods as well as whitespace.

In [33]:
result = re.split(r'([,.]|\s)', text)

print(result)

['this', ' ', 'is', ' ', 'a', ' ', 'test', '.', '', ' ', 'This', ' ', 'test', ' ', 'is', ' ', 'only', ' ', 'a', ' ', 'test', '.', '']


Whitespace characters (and the empty strings the split produces) still appear as tokens; we remove them and keep only words and punctuation.

Removing whitespace can save memory; however, whether to keep or drop it depends on the application.

In [34]:
result = [item for item in result if item.strip()]

print(result)

['this', 'is', 'a', 'test', '.', 'This', 'test', 'is', 'only', 'a', 'test', '.']
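
Whether whitespace should be kept depends on the text. As a rough illustration (the sample string and variable names below are made up for this note, not taken from the story), the indentation in a line of Python source survives only when whitespace tokens are kept:

```python
import re

# Hypothetical sample: one indented line of Python source
code_line = "    return x"

# Keep every piece the split produces, dropping only empty strings
with_ws = [t for t in re.split(r'(\s)', code_line) if t]

# Additionally drop whitespace-only pieces
without_ws = [t for t in with_ws if t.strip()]

print(with_ws)
print(without_ws)
```

Joining `with_ws` reproduces the original line exactly, while `without_ws` has lost the indentation.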


Next, extend the pattern to handle other punctuation and special characters as well, such as question marks, quotation marks, and double dashes.

In [35]:
text = "this is a test. Alan:-- Do you like it? Yes, I do!"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['this', 'is', 'a', 'test', '.', 'Alan', ':', '--', 'Do', 'you', 'like', 'it', '?', 'Yes', ',', 'I', 'do', '!']


We now have a basic tokenizer that we can apply to the full text of the story.

In [36]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print("Total tokens:", len(preprocessed))
print(preprocessed[:100])

Total tokens: 4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--']


### Step 2: Creating token IDs

Create a list of the unique tokens, sort them alphabetically, and determine the vocabulary size.

In [37]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print("Unique tokens:", vocab_size)

Unique tokens: 1130


Create the vocabulary by assigning a unique integer to each token.

In [38]:
vocab = {token:integer for integer, token in enumerate(all_words)}

In [39]:
for token, integer in vocab.items():
    print(f"{token}: {integer}")
    if integer >= 20:
        break

!: 0
": 1
': 2
(: 3
): 4
,: 5
--: 6
.: 7
:: 8
;: 9
?: 10
A: 11
Ah: 12
Among: 13
And: 14
Are: 15
Arrt: 16
As: 17
At: 18
Be: 19
Begin: 20


The dictionary maps each individual token to a unique integer.

Note: Later, when we process the output of the LLM, we will need to convert the numbers back into words.

For that, we need an inverse version of the vocabulary that maps token IDs back to their corresponding text tokens.
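
A minimal sketch of such an inverse mapping, using a small made-up vocabulary (`toy_vocab` and its IDs are assumptions for illustration, not the vocabulary built from the story):

```python
# Toy vocabulary for illustration only
toy_vocab = {"!": 0, ",": 1, "a": 2, "test": 3}

# Swap keys and values so that token IDs map back to tokens
inverse_vocab = {integer: token for token, integer in toy_vocab.items()}

ids = [2, 3, 0]
tokens = [inverse_vocab[i] for i in ids]
print(tokens)  # ['a', 'test', '!']
```

This works because every token has a distinct ID, so no entries collide when the dictionary is inverted.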

### Step 3: Implement Tokenizer class

Let's build a tokenizer class around this vocabulary.

The class has an encode method that splits the text into tokens and carries out the string-to-integer mapping to produce token IDs.

In addition, it has a decode method that carries out the reverse integer-to-string mapping to convert token IDs back into text.

Step 1: Store the vocabulary as an instance attribute so that the encode and decode methods can access it

Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Remove spaces before the specified punctuation

In [40]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {integer:token for token, integer in vocab.items()}

    def encode(self, text):
        # tokenize the input text using the same regex as before
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        # convert tokens to integers using the vocab dictionary
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids
    
    def decode(self, ids):
        # convert integers back to tokens using the inverse vocab dictionary
        text = " ".join([self.int_to_str[id] for id in ids])

        # restore spacing around punctuation
        text = re.sub(r'\s([,.:;?_!"()\'])', r'\1', text)
        return text
    
    

In [41]:
tokenizer = SimpleTokenizerV1(vocab)

text = """ It's the last he painted, you know. Mrs Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

[56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 7, 67, 38, 851, 1108, 754, 793, 7]


In [42]:
tokenizer.decode(ids)

"It' s the last he painted, you know. Mrs Gisburn said with pardonable pride."

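The decode method converts the token IDs back into text. Note the extra space in "It' s": the substitution removes the space before a punctuation mark but not the one after it.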
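
To see where the stray space in "It' s" comes from, the two decode steps can be replayed by hand. This is a small sketch on a hand-written token list (the list is an assumption that mirrors the start of the sentence above):

```python
import re

# Tokens as the encoder would have split them
tokens = ["It", "'", "s", "the", "last", "he", "painted", ","]

# Step 1: join with single spaces, which also puts spaces around punctuation
joined = " ".join(tokens)

# Step 2: remove whitespace *before* punctuation; whitespace after it stays
cleaned = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

The apostrophe is treated like any other punctuation mark, so the space before it is removed while the space between the apostrophe and "s" remains.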

In [43]:
text = """Hello ! there , my name is iron man."""

ids = tokenizer.encode(text)
print(ids)

KeyError: 'Hello'

The problem is that "Hello" does not appear in the story, so it is not in the vocabulary and the lookup raises a KeyError.

This is why a large and diverse training corpus is recommended: it extends the vocabulary.
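
Before encoding, it can help to check which tokens are missing from the vocabulary. The sketch below uses a hypothetical helper, `find_unknown_tokens`, and a small made-up vocabulary; neither is part of the notebook.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that are not keys of `vocab` (hypothetical helper)."""
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# Toy vocabulary for illustration only
toy_vocab = {"!": 0, ",": 1, "is": 2, "my": 3, "name": 4, "there": 5}

missing = find_unknown_tokens("Hello ! there , my name is iron man.", toy_vocab)
print(missing)
```

Any token reported here would raise a KeyError in the encode method of SimpleTokenizerV1.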

### Adding Special Context Tokens

unknown token        <|unk|>       --> 1131

end of text token    <|endoftext|> --> 1130

These tokens are added so that the tokenizer can handle unknown words and mark the boundaries between texts from different sources.

We can modify the tokenizer to use the <|unk|> token whenever it encounters a word that is not part of the vocabulary.

We also insert the <|endoftext|> token between unrelated text sources.

In [45]:
len(vocab)

1130

In [46]:
for token, integer in vocab.items():
    print(f"{token}: {integer}")
    if integer >= 10:
        break

!: 0
": 1
': 2
(: 3
): 4
,: 5
--: 6
.: 7
:: 8
;: 9
?: 10


In [60]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
print(all_tokens[-10:])

vocab = {token:integer for integer, token in enumerate(all_tokens)}


['year', 'years', 'yellow', 'yet', 'you', 'younger', 'your', 'yourself', '<|endoftext|>', '<|unk|>']


In [61]:
len(vocab)

1132

In [62]:
for token, integer in list(vocab.items())[-5:]:
    print(f"{token}: {integer}")

younger: 1127
your: 1128
yourself: 1129
<|endoftext|>: 1130
<|unk|>: 1131


Define version 2 of the tokenizer class.

Step 1: Replace unknown words with the <|unk|> token

Step 2: Remove spaces before the specified punctuation

In [63]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {integer:token for token, integer in vocab.items()}

    def encode(self, text):
        # tokenize the input text using the same regex as before
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        preprocessed = [token if token in self.str_to_int 
                        else "<|unk|>" for token in preprocessed]

        # convert tokens to integers using the vocab dictionary
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids
    
    def decode(self, ids):
        # convert integers back to tokens using the inverse vocab dictionary
        text = " ".join([self.int_to_str[id] for id in ids])

        # restore spacing around punctuation
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text
    
    

In [64]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello do you like tea?"
text2 = "In the sunlit terraces of the palace"

text = " <|endoftext|> ".join([text1, text2])

print("Original text:", text)

Original text: Hello do you like tea? <|endoftext|> In the sunlit terraces of the palace


In [65]:
tokenizer.encode(text)

[1131, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131]

In [66]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|> do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>'

The words "Hello" and "palace" are not in the vocabulary, so both are replaced by <|unk|>.

A few other special tokens are used in LLM training, such as:

[BOS]: Beginning of sequence. This token marks the start of a text.

[EOS]: End of sequence. This token is positioned at the end of a text.

[PAD]: Padding. When training an LLM with a batch size larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended, or "padded", with the [PAD] token up to the length of the longest text in the batch.
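
The role of these three tokens can be sketched on a toy batch. The ID values and the helper name `pad_batch` are assumptions chosen for illustration; real tokenizers assign their own IDs.

```python
# Hypothetical special-token IDs
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2

def pad_batch(sequences, pad_id=PAD_ID):
    """Wrap each sequence in BOS/EOS, then right-pad to the longest length."""
    wrapped = [[BOS_ID] + list(seq) + [EOS_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]

# Two token-ID sequences of different lengths
batch = pad_batch([[10, 11, 12], [20]])
for row in batch:
    print(row)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.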

The tokenizer used for GPT models does not need any of the special tokens mentioned above; for simplicity, it uses only <|endoftext|>.

The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words.

Instead, GPT uses a byte pair encoding tokenizer, which breaks words down into subword units.
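
The core idea of byte pair encoding, repeatedly merging the most frequent adjacent pair of symbols, can be sketched on a few words. This is a simplified character-level toy, not the GPT tokenizer itself: the real one works on bytes and applies a merge table learned from a large corpus. The function names and the sample words are assumptions made for this sketch.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words

# Start from single characters; each merge creates a longer subword unit
words = [list("low"), list("lower"), list("lowest")]
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

After two merges the shared prefix "low" has become a single symbol, while the rarer endings are still split into characters. Because any word can fall back to smaller units in this way, no <|unk|> token is needed.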