In this notebook we work with raw text and convert it into embeddings, going through each step of the process in detail. The text corpus is “The Verdict,” a short story by Edith Wharton, which has been released into the public domain and is thus permitted to be used for LLM training tasks.

In [1]:
import os
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
"LLMs-from-scratch/main/ch02/01_main-chapter-code/"
"the-verdict.txt")
file_path = "the-verdict.txt"
# Download the story only if it is not already present locally
if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)

In [2]:
with open("the-verdict.txt","r",encoding="utf-8") as f:
    raw_text  = f.read()
print("The total number of characters:", len(raw_text))
print(raw_text[:99])

The total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


We will now tokenize this 20,479-character short story into individual words and special characters that can then be turned into embeddings. The text we use here is relatively small.

In [3]:
import re
text = "Hello world.This is a test for tokenizing."
result = re.split(r'(\s)', text)
print(result)

['Hello', ' ', 'world.This', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'for', ' ', 'tokenizing.']


In [4]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ' ', 'world', '.', 'This', ' ', 'is', ' ', 'a', ' ', 'test', ' ', 'for', ' ', 'tokenizing', '.', '']


In [5]:
result = [item for item in result if item.strip()]
print(result)

['Hello', 'world', '.', 'This', 'is', 'a', 'test', 'for', 'tokenizing', '.']
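
The filter in the cell above discards the whitespace entries. As a small illustration of what that costs, the sketch below uses a hypothetical two-line code snippet (not part of the story) and compares both options: with whitespace tokens kept, the original string can be rebuilt exactly; with them dropped, the indentation is gone.

```python
import re

sample = "if x:\n    return x"  # hypothetical snippet where indentation matters

# Keep every piece, including whitespace, but drop empty strings.
with_ws = [t for t in re.split(r'(\s)', sample) if t != ""]
# Drop whitespace-only pieces, as in the cell above.
without_ws = [t for t in with_ws if t.strip()]

print(len(with_ws), len(without_ws))
print("".join(with_ws) == sample)      # whitespace kept: the text is recoverable
print(" ".join(without_ws) == sample)  # whitespace dropped: the layout is lost
```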


When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Now let's modify the pattern to handle more punctuation.

In [6]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


Now let's apply it to the whole story.

In [7]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:99])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter']


Now let's convert the tokens to token IDs. For now we simply map each unique token to an integer.

In [8]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


In [9]:
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
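
The ordering in this listing comes from `sorted()`, which compares strings by Unicode code point: punctuation sorts before uppercase letters, and uppercase before lowercase. A tiny sketch with a few hypothetical tokens shows the same pattern:

```python
# Hypothetical tokens, chosen only to show the sort order.
toy_tokens = ["the", "A", "!", "a", "Hello", "the"]

# set() removes the duplicate, sorted() fixes a reproducible order.
toy_words = sorted(set(toy_tokens))
toy_vocab = {token: integer for integer, token in enumerate(toy_words)}

print(toy_words)
print(toy_vocab)
```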


In [10]:
# A simple tokenizer
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                             # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # ID -> token

    def encode(self, text):
        # Use the same split pattern that was used to build the vocabulary
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() inserts before punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
    

In [11]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [12]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [13]:
try:
    encoded = tokenizer.encode(text)
    print(encoded)
except Exception as e:
    print("An error occurred while encoding the text:", e)


[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
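
Encoding succeeds here because every word of the sample occurs in the story. A minimal sketch with a hypothetical three-word vocabulary shows what happens otherwise: a plain dictionary lookup, like the one `SimpleTokenizerV1.encode` uses, raises a `KeyError` for any unseen word.

```python
# Hypothetical toy vocabulary, standing in for the one built from the story.
toy_vocab = {"the": 0, "last": 1, "painted": 2}

def encode_strict(tokens, vocab):
    """Map tokens to IDs with a plain dictionary lookup (no fallback)."""
    return [vocab[t] for t in tokens]

print(encode_strict(["the", "last"], toy_vocab))

missing = None
try:
    encode_strict(["the", "Hello"], toy_vocab)
except KeyError as err:
    missing = err.args[0]
    print("Not in vocabulary:", missing)
```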


Next we add two special tokens to the vocabulary: <|endoftext|>, which marks the boundary between unrelated texts, and <|unk|>, which stands in for any word that is not part of the vocabulary.

In [14]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>","<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))

1132


In [15]:
# Print the last five entries of the vocabulary
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [16]:
class SimpleTokenizerV2:
    def __init__(self,vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    def encode(self,text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        # Replace tokens that are not in the vocabulary with <|unk|>
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>"
            for item in preprocessed
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    def decode(self,ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
    

In [17]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces   the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces   the palace.


In [18]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 988, 1131, 7]
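
In this output the ID 1131 (<|unk|>) appears twice, once for "Hello" and once for "palace", neither of which occurs in the story. Mapping unknown words to <|unk|> is lossy: after encoding, the original word cannot be recovered. The sketch below illustrates this with a hypothetical toy vocabulary and plain whitespace splitting instead of the regular expression used in the class above.

```python
# Hypothetical toy vocabulary that includes the two special tokens.
toy_vocab = {"do": 0, "you": 1, "like": 2, "tea": 3,
             "<|endoftext|>": 4, "<|unk|>": 5}
inverse = {i: s for s, i in toy_vocab.items()}

def encode(text):
    # Whitespace split only; unknown words fall back to the <|unk|> ID.
    return [toy_vocab.get(tok, toy_vocab["<|unk|>"]) for tok in text.split()]

def decode(ids):
    return " ".join(inverse[i] for i in ids)

ids = encode("Hello do you like tea")
print(ids)          # [5, 0, 1, 2, 3]
print(decode(ids))  # <|unk|> do you like tea
```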


Let’s look at a more sophisticated tokenization scheme based on a concept called byte pair encoding (BPE). The BPE tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.
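
BPE builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols into a single new symbol. The following is only a toy sketch of that merge loop on a hypothetical training string; it works on characters rather than bytes, the number of merges is arbitrary, and it is not how the GPT-2 tokenizer is actually implemented.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("low lower lowest")  # hypothetical text, one symbol per character
for _ in range(2):                  # two merge steps, chosen arbitrarily
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(pair, symbols)
```

After two merges the frequent character sequence "low" has become a single symbol, while rarer endings such as "er" and "est" are still spelled out character by character.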

In [19]:
!pip install tiktoken

Defaulting to user installation because normal site-packages is not writeable


Like other Python libraries, tiktoken can be installed with Python’s pip installer from the terminal. We will now use this much more efficient tokenizer.

In [20]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
text = (
"Hello, do you like tea? <|endoftext|> In the sunlit terraces"
"of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [21]:
# Decode the token IDs back into text
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


The next step in creating the embeddings for the LLM is to generate the input–target pairs required for training, which works like a sliding window over the text. This is how a typical LLM works, so we start from the very beginning. An LLM is essentially a next-word predictor: given the words so far, it predicts the word that follows. Each predicted word is appended to the input, and the extended sequence is passed to the LLM again to predict the next one.
During training we hide the next word from the model and compare its prediction with the correct answer.
Example:
[LLMs] -> learn
[LLMs learn] -> to
[LLMs learn to] -> predict
[LLMs learn to predict] -> one
...
[LLMs learn to predict one word at a time]
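
The sliding-window idea can be sketched in a few lines. The example below works on whole words for readability; in practice the same shifting would be applied to token IDs. The `context_size` value is an assumption chosen only for illustration.

```python
words = "LLMs learn to predict one word at a time".split()

# Each context is a growing prefix; the target is the word that follows it.
pairs = [(words[:i], words[i]) for i in range(1, len(words))]
for context, target in pairs:
    print(context, "->", target)

# Fixed-size view: the targets are the inputs shifted by one position.
context_size = 4  # assumed window length
x = words[:context_size]
y = words[1:context_size + 1]
print("x:", x)
print("y:", y)
```

Every position in `x` together with the words before it forms one training example, and the entry at the same position in `y` is its target.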