
# Reading in a short story as a text sample in Python

# Step 1: Creating Tokens

<div style="background-color:#e0f7e9; padding: 10px; border-left: 5px solid green;">
    <b>Note:</b> The code below prints the total number of characters, followed by the first 99 characters of the file, for illustration purposes.
</div>


In [4]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("The total number of characters in the story:", len(raw_text))
print(raw_text[:99])

The total number of characters in the story: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div style="background-color:#e0f7e9; padding: 10px; border-left: 5px solid green;">
    <b>Note:</b> Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training.
</div>

<div style="background-color:#d0f0fd; padding: 10px; border-left: 5px solid #2196F3;">
    <b>Note:</b> It's common to process millions of articles and hundreds of thousands of
    books -- many gigabytes of text -- when working with LLMs. However, for educational
    purposes, it's sufficient to work with smaller text samples like a single book to
    illustrate the main ideas behind the text processing steps and to make it possible to
    run it in reasonable time on consumer hardware.
</div>

<div style="background-color:#ffe0b2; padding: 10px; border-left: 5px solid orange;">
    <b>Note:</b> How can we best split this text to obtain a list of tokens? For this, we go on a small
    excursion and use Python's regular expression library <code>re</code> for illustration purposes. (Note
    that you don't have to learn or memorize any regular expression syntax since we will
    transition to a pre-built tokenizer later in this chapter.)
</div>

<div style="background-color:#d0f0fd; padding: 10px; border-left: 5px solid #2196F3;">
    <b>Info:</b> Using some simple example text, we can use the <code>re.split</code> command with the following
    syntax to split a text on whitespace characters.
</div>

In [5]:
import re
text = "Hello, World. This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'World.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
    <b>Note:</b> The result is a list of words and whitespace characters. Punctuation is still attached to the neighboring words (for example, <code>'Hello,'</code>).
</div>
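
The parentheses in the pattern matter here. The following is a minimal sketch, using only Python's standard `re` module, of how a capturing group changes what `re.split` returns. The variable names `without_group` and `with_group` are illustrative only.

```python
import re

text = "Hello, World. This, is a test."

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched separator is kept as its own list entry.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

Keeping the separators is what later lets punctuation marks survive as tokens of their own once they are added to the pattern.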

<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
    <b>Note:</b> Let's modify the regular expression so that it splits on whitespace (<code>\s</code>) as well as on commas and periods (<code>[,.]</code>).
</div>

In [6]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'World', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


<div style="background-color:#e0f7e9; padding: 10px; border-left: 5px solid green;">
    <b>Note:</b> We can see that the words and the punctuation characters matched by <code>[,.]</code> are now separate list entries, just as we wanted.
</div>

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
    <b>Note:</b> A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we
    can remove these redundant entries safely as follows:
</div>

In [7]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'World', '.', 'This', ',', 'is', 'a', 'test', '.']


<div style="background-color:#ffe0b2; padding: 10px; border-left: 5px solid orange;">
    <b>REMOVING WHITESPACES OR NOT</b><br><br>
    When developing a simple tokenizer, whether we should encode whitespaces as
    separate characters or just remove them depends on our application and its
    requirements. Removing whitespaces reduces the memory and computing
    requirements. However, keeping whitespaces can be useful if we train models that
    are sensitive to the exact structure of the text (for example, Python code, which is
    sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
    and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
    that includes whitespaces.
</div>
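
To make this trade-off concrete, here is a small sketch that tokenizes a two-line Python snippet with and without keeping whitespace. The sample string `code` is made up for illustration. Once the whitespace entries are dropped, the indentation that tells Python which line belongs to the `if` block can no longer be recovered from the tokens.

```python
import re

code = "if x:\n    y = 1"

pattern = r'([,.:;?_!"()\']|--|\s)'

# Keep every non-empty piece, including spaces and newlines.
with_whitespace = [t for t in re.split(pattern, code) if t != ""]

# Drop pieces that are empty or whitespace-only, as done in this section.
without_whitespace = [t for t in re.split(pattern, code) if t.strip()]

print(with_whitespace)
print(without_whitespace)

# Only the whitespace-preserving token list can rebuild the exact input.
print("".join(with_whitespace) == code)
```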

<div style="background-color:#bbdefb; padding: 10px; border-left: 5px solid #2196f3;">
    <b>Note:</b> The tokenization scheme we devised above works well on the simple sample text. Let's
    modify it a bit further so that it can also handle other types of punctuation, such as
    question marks, quotation marks, and the double dashes we saw earlier in the opening
    lines of Edith Wharton's short story, along with additional special characters.
</div>


In [8]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'Is', ' ', 'this', '--', '', ' ', 'a', ' ', 'test', '?', '']


In [9]:
# Drop entries that are empty or contain only whitespace
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [10]:
text = "And I, Am...Iron Ma!"
result = re.split(r'(\.\.\.|[,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['And', 'I', ',', 'Am', '...', 'Iron', 'Ma', '!']


<div style="background-color:#e8f5e9; padding: 10px; border-left: 5px solid #43a047;">
    <b>Note:</b> Now that we have a basic tokenizer, let's apply it to the story.
</div>

In [11]:
preprocessed = re.split(r'(\.\.\.|[,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [12]:
print(len(preprocessed))

4690


# Step 2: Creating Token IDs

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #8e24aa;">
    <b>Note:</b> In the previous section, we tokenized Edith Wharton's short story and assigned it to a
    Python variable called <code>preprocessed</code>. Let's now create a list of all unique tokens and sort
    them alphabetically to determine the vocabulary size.
</div>

In [13]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
    <b>Note:</b> After determining that the vocabulary size is 1,130 via the above code, we create the
    vocabulary and print its first 51 entries for illustration purposes.
</div>

In [14]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [15]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i>=50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
    <b>Note:</b> As we can see, based on the output above, the dictionary contains individual tokens
    associated with unique integer labels.
</div>

<div style="background-color:#d0f0c0; padding: 10px; border-left: 5px solid #388e3c;">
    <b>Note:</b> Later in this book, when we want to convert the outputs of an LLM from numbers back into
    text, we also need a way to turn token IDs into text. <br><br>
    For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
</div>
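
A minimal sketch of such an inverse mapping, using a tiny hand-made vocabulary. Here `toy_vocab` is hypothetical and only stands in for the `vocab` dictionary built above.

```python
# A tiny stand-in vocabulary: token string -> integer ID.
toy_vocab = {"!": 0, ",": 1, "Hello": 2, "world": 3}

# Swap keys and values to map integer ID -> token string.
toy_inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

ids = [2, 1, 3, 0]
tokens = [toy_inverse_vocab[i] for i in ids]
print(tokens)
```

The inversion loses no information only because every token in the vocabulary has a unique ID.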

<div style="background-color:#e3f2fd; padding: 10px; border-left: 5px solid #1e88e5;">
    <b>Note:</b> Let's implement a complete tokenizer class in Python. <br><br>
    The class will have an <code>encode</code> method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. <br><br>
    In addition, we implement a <code>decode</code> method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.
</div>

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #8e24aa;">
    <b>Steps:</b><br>
    Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods<br>
    Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens<br>
    Step 3: Process input text into token IDs<br>
    Step 4: Convert token IDs back into text<br>
    Step 5: Remove spaces before the specified punctuation marks
</div>


In [16]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'(\.\.\.|--|[.,;:?!"()\']|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
    Let's instantiate a new tokenizer object from the <code>SimpleTokenizerV1</code> class and tokenize a
    passage from Edith Wharton's short story to try it out in practice:
</div>

In [17]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


<div style="background-color:#e8f5e9; padding: 10px; border-left: 5px solid #43a047;">
    <b>Note:</b> The code above prints the token IDs for the passage.<br>
    Next, let's see if we can turn these token IDs back into text using the <code>decode</code> method.
</div>


In [18]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
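    <b>Note:</b> Based on the output above, we can see that the <code>decode</code> method converted the
    token IDs back into text that matches the original passage, apart from an extra space after the opening quotation mark and after the apostrophe, and the line break being collapsed into a single space.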
</div>
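
The two steps inside `decode` can be traced by hand on a few tokens. The sketch below uses a hand-written token list rather than real token IDs. It suggests why a space survives after the opening quotation mark and after the apostrophe: the substitution only removes whitespace that comes before a punctuation mark, never whitespace that follows one.

```python
import re

tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', ',', '"']

# Step 1: join all tokens with single spaces.
joined = " ".join(tokens)

# Step 2: remove whitespace that precedes the listed punctuation marks.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```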

<div style="background-color:#e3f2fd; padding: 10px; border-left: 5px solid #2196f3;">
    <b>Note:</b> So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing
    text based on a snippet from the training set. <br><br>
    Let's now apply it to a new text sample that is not contained in the training set:
</div>

In [19]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
    <b>Note:</b> The problem is that the word "Hello" was not used in the short story <i>The Verdict</i>. <br><br>
    Hence, it is not contained in the vocabulary. <br><br>
    This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
</div>
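
One way to see which tokens would trigger this error before calling `encode` is to compare the tokenized input against the vocabulary keys. The following is a sketch only: `find_unknown_tokens` is a hypothetical helper (not part of the tokenizer class), and `toy_vocab` is a made-up stand-in for the real vocabulary.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return tokens in `text` that are missing from `vocab`, in order of appearance."""
    pieces = re.split(r'(\.\.\.|--|[.,;:?!"()\']|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [t for t in tokens if t not in vocab]

# Hypothetical stand-in for the story vocabulary.
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}

unknown = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(unknown)
```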

# Adding Special Context Tokens
In the previous section, we implemented a simple tokenizer and applied it to a passage
from the training set.

In this section, we will modify this tokenizer to handle unknown
words.


In particular, we will extend the vocabulary and the tokenizer from the
previous section, SimpleTokenizerV1, into a new SimpleTokenizerV2 that supports two new tokens,
<code>&lt;|unk|&gt;</code> and <code>&lt;|endoftext|&gt;</code>.

<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
    <b>Note:</b> We can modify the tokenizer to use an <code>&lt;|unk|&gt;</code> token if it encounters a word that is not part of the vocabulary. <br><br>
    Furthermore, we add an <code>&lt;|endoftext|&gt;</code> token between unrelated texts. <br><br>
    For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert such a token before each document or book that follows a previous text source.
</div>

<div style="background-color:#fce4ec; padding: 10px; border-left: 5px solid #f06292;">
    <b>Note:</b> Let's now modify the vocabulary to include these two special tokens, <code>&lt;|unk|&gt;</code> and <code>&lt;|endoftext|&gt;</code>, by adding them to the list of all unique words that we created in the previous section.
</div>


In [20]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [21]:
len(vocab.items())

1132

<div style="background-color:#fff9c4; padding: 10px; border-left: 5px solid #fbc02d;">
    <b>Note:</b> Based on the output above, the new vocabulary size is 1,132 (the vocabulary size in the previous section was 1,130).
</div>

<div style="background-color:#eeeeee; padding: 10px; border-left: 5px solid #9e9e9e;">
    <b>Note:</b> As an additional quick check, let's print the last 5 entries of the updated vocabulary.
</div>


In [22]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


<div style="background-color:#e8f5e9; padding: 10px; border-left: 5px solid #43a047;">
    <b>Note:</b> The class below is a simple text tokenizer that handles unknown words.
</div>

<div style="background-color:#e3f2fd; padding: 10px; border-left: 5px solid #1e88e5;">
    <b>Info:</b>
    <ul style="margin: 0; padding-left: 20px;">
        <li>Step 1: Replace unknown words with <code>&lt;|unk|&gt;</code> tokens</li>
        <li>Step 2: Remove spaces before the specified punctuation marks</li>
    </ul>
</div>

In [23]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'(\.\.\.|--|[.,;:?!"()\']|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [24]:
tokenizer = SimpleTokenizerV2(vocab)
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [25]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [29]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

<div style="background-color:#d0f0c0; padding: 10px; border-left: 5px solid #4caf50;">
    <b>Info:</b> Based on comparing the de-tokenized text above with the original input text, we know that
    the training dataset, Edith Wharton's short story <i>The Verdict</i>, did not contain the words
    <code>"Hello"</code> and <code>"palace"</code>.
</div>

<div style="background-color:#e0f7fa; padding: 10px; border-left: 5px solid #0097a7;">
  <b>Info:</b> So far, we have discussed tokenization as an essential step in processing text as input to
  LLMs. Depending on the LLM, some researchers also consider additional special tokens such
  as the following:
  <ul>
    <li><b>[BOS]</b> (beginning of sequence): This token marks the start of a text. It
    signifies to the LLM where a piece of content begins.</li>
    <li><b>[EOS]</b> (end of sequence): This token is positioned at the end of a text,
    and is especially useful when concatenating multiple unrelated texts,
    similar to <code>&lt;|endoftext|&gt;</code>. For instance, when combining two different
    Wikipedia articles or books, the [EOS] token indicates where one article
    ends and the next one begins.</li>
    <li><b>[PAD]</b> (padding): When training LLMs with batch sizes larger than one,
    the batch might contain texts of varying lengths. To ensure all texts have
    the same length, the shorter texts are extended or "padded" using the
    [PAD] token, up to the length of the longest text in the batch.</li>
  </ul>
</div>
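
As an illustration of the padding idea, the sketch below pads a batch of token ID lists to a common length. The helper `pad_batch` and the choice of `pad_id=0` are assumptions made for this example, not something prescribed by a particular LLM or tokenizer.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Three token ID sequences of different lengths (made-up IDs).
batch = [[5, 7, 9], [4], [8, 2]]
padded = pad_batch(batch, pad_id=0)
print(padded)
```

In practice, the padding ID should be one that cannot be confused with a real token, which is why a dedicated token such as [PAD] is typically reserved for it.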


<div style="background-color:#f3e5f5; padding: 10px; border-left: 5px solid #9c27b0;">
  <b>Note:</b> The tokenizer used for GPT models does not need any of the tokens mentioned
  above; for simplicity, it only uses an <code>&lt;|endoftext|&gt;</code> token.
</div>

<div style="background-color:#fff3e0; padding: 10px; border-left: 5px solid #fb8c00;">
  <b>Info:</b> The tokenizer used for GPT models also doesn't use an <code>&lt;|unk|&gt;</code> token for out-of-vocabulary words.
  Instead, GPT models use a byte pair encoding tokenizer, which breaks down words into subword units.
</div>
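
The core idea of byte pair encoding can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a new symbol. The code below is a toy illustration of a single merge step under that assumption. It is not the GPT-2 tokenizer, which operates on bytes and applies a fixed, pre-trained merge table, and the names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from single characters and perform one merge step.
symbols = list("abababc")
pair = most_frequent_pair(symbols)
symbols = merge_pair(symbols, pair)
print(pair, symbols)
```

Repeating this step builds up a vocabulary of progressively longer subword units, so a word never seen during training can still be represented as a sequence of smaller known pieces.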


In [30]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [32]:
tokenizer = tiktoken.get_encoding("gpt2")

In [33]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit of terraces"
    "of someunknownPlace."
)
# Adjacent string literals are joined without a space, so the text contains "terracesof".
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 286, 8812, 2114, 1659, 617, 34680, 27271, 13]
