In [1]:
## Download text file

import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
        "the-verdict.txt")

file_path = "the-verdict-txt"
urllib.request.urlretrieve(url, file_path)

('the-verdict-txt', <http.client.HTTPMessage at 0x1b4c1dc2da0>)

In [2]:
# Read the short story and print a short sample
with open("the-verdict-txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters: ", len(raw_text))
print(raw_text[:99]) # first 99 characters of the file

Total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


Our task is to tokenize these 20,479 characters into individual words and special characters that we can then turn into embeddings for LLM training.

We can use `re.split` to understand the idea. Splitting on whitespace with a capturing group gives a list of words and whitespace characters, with punctuation still attached to the words.

In [4]:
import re
text = "Hello, Aditya. This is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'Aditya.', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.']


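The whitespace entries are of no use, and the punctuation is still attached to the words, so we also split on punctuation and then drop the whitespace. We refrain from lowercasing the text, because capitalization carries information that can affect LLM training.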

In [5]:
text = "Hello, Aditya, Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()] # to remove whitespace
print(result)

['Hello', ',', 'Aditya', ',', 'Is', 'this', '--', 'a', 'test', '?']
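
The `if item.strip()` filter is needed because `re.split` with a capturing group also returns empty strings wherever two delimiters sit next to each other and at the ends of the text. A minimal sketch on a made-up sentence (the names `demo_text`, `raw_pieces`, and `tokens` are only for illustration):

```python
import re

demo_text = "Hello, world."  # made-up sample sentence

# The capturing group keeps the delimiters in the result,
# but it also produces empty strings between adjacent delimiters
raw_pieces = re.split(r'([,.]|\s)', demo_text)
print(raw_pieces)

# Drop empty strings and whitespace-only pieces
tokens = [piece.strip() for piece in raw_pieces if piece.strip()]
print(tokens)
```

The first list still contains `''` and `' '` entries; the second keeps only words and punctuation.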


Now that we have a basic tokenization scheme, let's apply it to our short story.

In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()] # to remove whitespace
print(len(preprocessed))

4690


This prints 4690, the number of tokens in the text (whitespace excluded). Let's look at the first 30 tokens:

In [7]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


##### Next task - Converting tokens into token IDs

We convert the tokens into an integer representation to produce token IDs. This is an intermediate step before creating the embeddings.


In [8]:
all_words = sorted(set(preprocessed)) # all unique tokens, sorted alphabetically
vocab_size = len(all_words)
print(vocab_size)

1130


Our vocabulary size is 1130. Now we create the vocabulary and print its first 51 entries.
The dictionary below maps each individual token to a unique integer label.

In [13]:
vocab = {token:integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
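
`sorted()` orders strings by Unicode code point, which is why punctuation and capitalized words come first in the listing above. The same mapping, together with its inverse, can be sketched on a tiny made-up token list (all `toy_*` names are hypothetical and independent of the notebook's variables):

```python
# Made-up token list standing in for `preprocessed`
toy_tokens = ["the", "cat", "sat", ",", "the", "dog", "sat", "."]

# Unique tokens in sorted order, then token -> ID and ID -> token lookups
toy_words = sorted(set(toy_tokens))
toy_vocab = {token: idx for idx, token in enumerate(toy_words)}
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

toy_ids = [toy_vocab[t] for t in toy_tokens]
print(toy_vocab)
print(toy_ids)
print([toy_inverse[i] for i in toy_ids])  # maps the IDs back to the tokens
```

Repeated tokens such as `"the"` share one ID, so the vocabulary is smaller than the token list.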


Now we apply the vocabulary created above to convert new text into token IDs.

Let's create a tokenizer class with an `encode` method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. It also has a `decode` method that carries out the reverse integer-to-string mapping and converts the token IDs back into text.

In [22]:
## Implementing a simple text tokenizer
class SimpleTokenizerV1:
    def __init__(self, vocab):
        """
        Stores the vocabulary as instance attributes for access in the encode and decode methods
        """
        self.str_to_int = vocab
        self.int_to_str = {i:s for s, i in vocab.items()} # inverse vocabulary that maps IDs to text
        
    def encode(self, text):
        # use the same split pattern (including ':' and ';') that built the vocabulary
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed] # raises KeyError for tokens not in the vocabulary
        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids]) # convert IDs into text
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text) # remove spaces before specified punctuation
        return text
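
To see what a `re.sub` cleanup of this kind does in isolation, here is a small sketch on made-up tokens (`demo_tokens`, `joined`, and `cleaned` are illustrative names). It removes the space before each punctuation mark but not the space after one, so decoding does not always reproduce the original text exactly:

```python
import re

# Made-up tokens, as they might come back from an ID-to-string lookup
demo_tokens = ['"', 'It', "'", 's', 'fine', ',', 'he', 'said', '.', '"']

# Joining with spaces puts a space around every token, including punctuation
joined = " ".join(demo_tokens)
print(joined)

# Remove whitespace that directly precedes a punctuation character
cleaned = re.sub(r'\s+([,.:;?!"()\'])', r'\1', joined)
print(cleaned)
```

The space after the opening quote and after the apostrophe remains in `cleaned`.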

From this class, we can instantiate new tokenizer objects via an existing vocabulary. Let's try it on a passage from our short story.

In [23]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
       Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)



[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [24]:
# back to text
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [26]:
"""
# more example
text = "The brown dog playfully chased the swift fox"
print(tokenizer.encode(text))
"""
# this will throw a KeyError because our short story does not contain words such as "dog" or "Hello"


'\n# more example\ntext = "The brown dog playfully chased the swift fox"\nprint(tokenizer.encode(text))\n'
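
The failure mode can be reproduced without the full story. The sketch below uses a hypothetical three-word vocabulary (`tiny_vocab`) and a simplified whitespace-only encoder (`toy_encode`), both made up for illustration:

```python
# Hypothetical three-word vocabulary, only for illustration
tiny_vocab = {"the": 0, "swift": 1, "fox": 2}

def toy_encode(text):
    # A plain whitespace split is enough for this sketch
    return [tiny_vocab[word] for word in text.split()]

print(toy_encode("the swift fox"))

try:
    toy_encode("the swift dog")
except KeyError as err:
    # The dictionary lookup fails on the first word that is not in the vocabulary
    missing = err.args[0]
    print("Not in vocabulary:", missing)
```

Any tokenizer built on a plain dictionary lookup behaves this way unless unknown words are handled explicitly.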

#### Adding special context tokens

Now we need to modify the tokenizer to handle unknown words, and for this we use special tokens. Special tokens can include markers for unknown words and document boundaries. For example, we can support two new tokens, `<|unk|>` and `<|endoftext|>`. The `<|unk|>` token is used when a word is not in the vocabulary. The `<|endoftext|>` token is prepended to each subsequent text source, so it marks the boundary between unrelated texts.

In [27]:
# let's add these tokens by modifying our vocab
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}

print(len(vocab.items()))

1132


The new vocabulary size is 1,132 (the previous vocabulary size was 1,130). As a quick check, let's print the last five entries:

In [28]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [38]:
## Implementing a Simple text tokenizer with 2 more tokens
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
        item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

V2 replaces unknown words with the `<|unk|>` token. Let's try it on two unrelated sentences joined by `<|endoftext|>`:

In [39]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [40]:
# tokenize
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [42]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


LLMs use further special tokens, such as BOS (beginning of sequence), EOS (end of sequence), and PAD (padding, which extends shorter texts so that texts of varying lengths can be processed together).
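
As a rough illustration of padding, the sketch below right-pads token ID lists to a common length. The padding ID `PAD_ID = 0` and the helper `pad_batch` are assumptions made up for this example; real tokenizers define their own padding token and ID:

```python
# Hypothetical padding ID; real tokenizers define their own
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 7, 9], [4], [8, 2]]  # made-up token IDs of varying lengths
padded = pad_batch(batch)
print(padded)
```

After padding, every row has the same length, so the batch can be stored as a rectangular array.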

The tokenizer used for GPT models does not use any of these tokens; for simplicity it uses only `<|endoftext|>`. GPT models also do not use `<|unk|>`. Instead, they use a `byte pair encoding` (BPE) tokenizer, which breaks words down into subword units. We will look at this concept now.

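The code is based on tiktoken 0.7.0 (the version check below reports the version actually installed).
The BPE tokenizer was used to train GPT-2 and GPT-3. It is relatively complex to implement, so we access it through tiktoken.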

In [1]:
pip install tiktoken

Note: you may need to restart the kernel to use updated packages.


In [2]:
from importlib.metadata import version
import tiktoken
print("tiktoken version: ", version("tiktoken"))

tiktoken version:  0.8.0


Now we can instantiate the BPE tokenizer. It is used much like SimpleTokenizerV2, via an `encode` method:

In [3]:
tokenizer = tiktoken.get_encoding("gpt2")

In [4]:
text = ("Hello, do you like tea? <|endoftext|> In the sunlit terraces"
        "of someunknownPlace." )
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [5]:
# convert back to text from above token IDs
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


Two quick observations:
First, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256. In fact, the BPE tokenizer which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.


Second, the BPE tokenizer encodes and decodes unknown words, such as someunknownPlace, correctly. The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?

BPE breaks down words that are not in its vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words. An unknown word is therefore represented as a sequence of subword tokens or characters.
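
The idea can be sketched with a toy example. This is a simplified, character-level illustration and not the tiktoken implementation: the merge list `toy_merges` and the function `toy_bpe_split` are made up, whereas a real BPE tokenizer learns tens of thousands of merges from data and operates on bytes.

```python
# Hypothetical merge rules in priority order (assumed for illustration only)
toy_merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

def toy_bpe_split(word, merges=toy_merges):
    """Split a word into subword units by applying merge rules in order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge an adjacent (left, right) pair into one symbol
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(toy_bpe_split("thing"))  # known pieces are merged into larger units
print(toy_bpe_split("xyz"))    # no rule applies, so it stays as characters
```

A word with no applicable merges falls back to single characters, so nothing is ever "unknown" and joining the pieces always restores the original word.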

In [6]:
# Exercise 2.1
text = ("Akwirw ier")
integer = tokenizer.encode(text)
print(integer)

[33901, 86, 343, 86, 220, 959]


In [7]:
words = tokenizer.decode(integer)
print(words)

Akwirw ier


##### Next task - Data Sampling with a sliding window

Now we have to generate the input-target pairs required for training an LLM. As we know, LLMs are pretrained by predicting the next word in a text, so the target for each input is the token that immediately follows it.

We will implement a data loader that fetches the input-target pairs from the training dataset using a sliding window approach.

First, let's tokenize the whole "The Verdict" story using BPE:

In [8]:
with open("the-verdict-txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


The training set contains 5145 tokens after applying the BPE tokenizer. For demonstration, we remove the first 50 tokens from the dataset, as this gives a more interesting text passage in the next steps:

In [9]:
enc_sample = enc_text[50:]

To create the input-target pairs for the next-word prediction task, we create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:

In [10]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x:  {x}")
print(f"y:       {y}")

x:  [290, 4920, 2241, 287]
y:       [4920, 2241, 287, 257]


By processing the inputs along with the targets, which are the inputs shifted by one position, we can create the next-word prediction tasks:

In [12]:
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "----->", desired)

[290] -----> 4920
[290, 4920] -----> 2241
[290, 4920, 2241] -----> 287
[290, 4920, 2241, 287] -----> 257
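
The sliding window idea can be sketched in plain Python before moving to a real data loader. The function `sliding_window_pairs` and its parameters `max_length` and `stride` are hypothetical names chosen for this sketch, and the IDs are made up rather than taken from `enc_sample`:

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one."""
    pairs = []
    # Stop early enough that every target chunk has max_length tokens
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

sample_ids = list(range(10, 20))  # made-up stand-in for token IDs: 10..19
pairs = sliding_window_pairs(sample_ids, max_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "----->", targets)
```

With `stride` equal to `max_length` the input chunks do not overlap; a smaller stride such as 1 produces more, overlapping pairs from the same text.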
