Importing the text that we will tokenize

In [1]:
from pathlib import Path

text_path = Path('../data/')

with open(text_path / 'the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [2]:
print('total number of characters:',len(raw_text))

total number of characters: 20479


In [3]:
print(raw_text[:99])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training.

In [4]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'([,.]|\s)',text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


We can then remove the empty and whitespace-only strings.

In [5]:
result = [item for item in result if item.strip()]
result

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
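
The empty strings and retained punctuation in the split result follow from how re.split treats a capturing group. The short sketch below contrasts the two behaviours on the same sample sentence; the variable names without_group and with_group are illustrative and not part of the notebook.

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, re.split discards the delimiters.
without_group = re.split(r'[,.]|\s', text)

# With a capturing group, the matched delimiters are returned as list items too.
with_group = re.split(r'([,.]|\s)', text)

# Empty strings appear wherever two delimiters are adjacent (e.g. ", "),
# so we filter out empty and whitespace-only items afterwards.
tokens = [item for item in with_group if item.strip()]

print(without_group)
print(with_group)
print(tokens)
```

Keeping the punctuation as separate tokens is the reason the pattern is wrapped in parentheses; the filtering step is then needed to clean up the side effects.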

In [6]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


Let's apply it to the text

In [7]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


The text contains 4,690 tokens.

In [8]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


Converting tokens into token IDs

Let's create a list of all unique tokens and sort them alphabetically to determine the vocabulary size.

In [9]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


Now let's create the vocab dictionary

In [10]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [11]:
for i,item in enumerate(vocab.items()):
    print(item)
    if i >= 10:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)


Let's implement a complete tokenizer class with an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs. We will also implement a decode method that carries out the reverse integer-to-string mapping.

In [12]:
class SimpleTokenizerV1:
    def __init__(self,vocab):
        self.str_to_int = vocab  # stores the vocabulary as an instance attribute
        self.int_to_str = {i:s for s,i in vocab.items()}  # inverse mapping of vocab
        
    def encode(self,text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    def decode(self,ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Using the tokenizer

In [13]:
tokenizer = SimpleTokenizerV1(vocab)
text = """" It's the last he painted,you know,"
        Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


Let's decode it

In [14]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


Adding special context tokens

We need to modify the tokenizer to handle unknown words. We also need to address the usage and addition of special context tokens that can enhance a model's understanding of context or other relevant information in the text.

These special tokens can include markers for unknown words and document boundaries.
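
To see why a marker for out-of-vocabulary words is needed, here is a minimal sketch with a toy vocabulary. The tokens, the IDs, and the helper name lookup are assumptions made for illustration only; they do not match the vocabulary built from the short story.

```python
# Toy vocabulary; the tokens and IDs are illustrative assumptions.
toy_vocab = {"<|unk|>": 0, "<|endoftext|>": 1, "do": 2, "you": 3,
             "like": 4, "tea": 5, "?": 6}

def lookup(tokens, vocab, unk_token="<|unk|>"):
    """Map tokens to IDs, substituting the unknown-token ID for missing entries."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]

tokens = ["Hello", "do", "you", "like", "tea", "?"]

try:
    ids = [toy_vocab[tok] for tok in tokens]  # direct indexing fails on "Hello"
except KeyError as err:
    print("KeyError for token:", err)

# With the fallback, the unseen word maps to the <|unk|> ID instead of raising.
print(lookup(tokens, toy_vocab))
```

Plain dictionary indexing raises a KeyError on the first unseen word, while the fallback lookup keeps encoding at the cost of losing the identity of that word.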

In [15]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>","<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [16]:
print(len(vocab.items()))

1132


Let's print the last 5 elements

In [17]:
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


Now we will write a tokenizer class that handles unknown words.

In [18]:
class SimpleTokenizerV2:
    def __init__(self,vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self,text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed]  # replaces unknown words with <|unk|> tokens
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    def decode(self,ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
        

Let's now try this tokenizer

In [19]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1,text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [20]:
tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode(text)
print(ids)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [21]:
print(tokenizer.decode(ids))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


Note that the tokenizer used for GPT models doesn't use an <|unk|> token. Instead, GPT models use a byte pair encoding (BPE) tokenizer, which breaks words down into subword units.

Byte pair encoding

We will use an existing Python library (tiktoken) that implements BPE efficiently.

In [22]:
import tiktoken

We can instantiate the BPE tokenizer:

In [23]:
tokenizer = tiktoken.get_encoding('gpt2')

In [24]:
text = (
 "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
 "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


We can then convert the token IDs back into text using the decode method

In [25]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


The BPE tokenizer encodes and decodes unknown words correctly without using an <|unk|> token. The tokenizer breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.
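
The idea behind the subword splitting can be illustrated with a toy, character-level version of BPE. This is a simplified sketch and not the tiktoken implementation: the corpus, the number of merges, and the helper names most_frequent_pair, merge_pair, and segment are all assumptions chosen for illustration, and the real GPT-2 tokenizer works on bytes with a much larger learned merge table.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def segment(word, merges):
    """Apply the learned merges in order to split a word into subword units."""
    state = {tuple(word): 1}
    for pair in merges:
        state = merge_pair(state, pair)
    return list(next(iter(state)))

# Toy corpus (word -> frequency); each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(segment("lowest", merges))  # "lowest" never appeared in the corpus
```

Even though "lowest" is not in the toy corpus, it can still be represented by pieces learned from other words, which is why no unknown-word token is required.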

Data sampling with a sliding window

LLMs are pretrained by predicting the next word in a text.

Let's implement a data loader that fetches the input-target pairs from the training data set using a sliding window approach. First, we will tokenize the text using BPE.

In [26]:
with open(text_path / 'the-verdict.txt' ,'r',encoding="utf-8") as f:
    raw_text = f.read()
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


Let's skip the first 50 tokens and use the remainder as a sample.

In [27]:
enc_sample = enc_text[50:]

In [28]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [29]:
for i in range(1,context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context,"---->",desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


We can now convert the token IDs back to text.

In [30]:
for i in range(1,context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context),"---->",tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


Now we need to implement a data loader that iterates over the input data set and returns the inputs and targets as PyTorch tensors. We are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.

In [31]:
import torch
from torch.utils.data import Dataset,DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self,txt,tokenizer,max_length,stride):
        self.input_ids = []
        self.target_ids = []
        
        token_ids = tokenizer.encode(txt)
        for i in range(0,len(token_ids)-max_length,stride):
            input_chunk = token_ids[i:i+ max_length]
            target_chunk = token_ids[i+1:i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]



In [32]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True,
                         drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding('gpt2')
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,     # drops the last batch if it is shorter than batch_size
        num_workers=num_workers  # number of CPU worker processes used for data loading
    )
    return dataloader

Let's test them

In [33]:
with open('../data/the-verdict.txt','r',encoding="utf-8") as f:
    raw_text = f.read()
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=1,
    max_length=4,
    stride=1,
    shuffle=False
)
data_iter = iter(dataloader)  # turns the dataloader into an iterator so we can fetch entries with next()
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [34]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


In [35]:
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=8,
    max_length=4,
    stride=4,
    shuffle=False
)

data_iter = iter(dataloader)
inputs,targets = next(data_iter)
print("Inputs :\n",inputs)
print("\nTargets:\n",targets)

Inputs :
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


Here we increased the stride to 4, matching max_length, so that the windows use the data set fully (we don't skip a single token) while avoiding any overlap between the input windows, since more overlap could lead to increased overfitting.
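
The effect of the stride can be checked without PyTorch. The sketch below mirrors the loop bounds used in GPTDatasetV1 on a short list of stand-in token IDs; the function name sliding_windows and the ten-element list are illustrative assumptions.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs
max_length = 4

for stride in (1, 2, 4):
    pairs = sliding_windows(token_ids, max_length, stride)
    starts = [inp[0] for inp, _ in pairs]
    overlap = max(0, max_length - stride)  # tokens shared by consecutive input windows
    print(f"stride={stride}: {len(pairs)} windows, starts={starts}, overlap={overlap}")
```

With a stride of 1, consecutive input windows share three of their four tokens; with a stride equal to max_length, they share none and every input position is still covered.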

Creating token embeddings

We need to convert the token IDs into embedding vectors. First, we must initialize these embedding weights with random values, which we will optimize later during training.

Let's see how the token ID to embedding vector conversion works.

Suppose we have four input tokens with IDs 2, 3, 5, and 1, a vocabulary of 6 tokens, and we want to create embeddings of size 3.

In [36]:
input_ids = torch.tensor([2,3,5,1])

In [37]:
vocab_size = 6
output_dim = 3

In [38]:
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size,output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


The weight matrix of the embedding layer contains small, random values. The values are optimized during LLM training.

The weight matrix has six rows and three columns.
* There is one row for each of the six tokens.
* There is one column for each of the three embedding dimensions.

In [39]:
# Let's apply it to a token ID to obtain the embedding vector
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


In [40]:
# Let's apply it to all four input IDs
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
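
Each output row above is simply the row of the weight matrix selected by the corresponding token ID. The sketch below checks this on the same small layer by comparing the layer's output with plain row indexing and with a one-hot matrix multiplication; the variable names via_layer, via_indexing, and via_matmul are illustrative.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)  # 6-token vocabulary, 3-dimensional embeddings
input_ids = torch.tensor([2, 3, 5, 1])

# 1) The layer's forward pass.
via_layer = embedding_layer(input_ids)

# 2) Plain row indexing into the weight matrix.
via_indexing = embedding_layer.weight[input_ids]

# 3) One-hot encoding followed by a matrix multiplication.
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=6).float()
via_matmul = one_hot @ embedding_layer.weight

print(torch.equal(via_layer, via_indexing))
print(torch.allclose(via_layer, via_matmul))
```

The lookup form avoids building the one-hot matrix, which matters once the vocabulary grows to tens of thousands of tokens.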


We will now add a small modification to these embedding vectors to encode positional information about a token within a text.

In [41]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size,output_dim)

If we have a batch size of 8 with four tokens each, the result will be a tensor of shape 8 x 4 x 256.

In [42]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text,batch_size=8,max_length=max_length,
    stride=max_length,shuffle=False
)
data_iter = iter(dataloader)
inputs,targets = next(data_iter)
print("Token IDs: \n",inputs)
print("\nInputs shape: \n",inputs.shape)

Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape: 
 torch.Size([8, 4])


As we can see, the token ID tensor has shape 8 x 4, meaning that the data batch consists of eight text samples with four tokens each.

Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors

In [43]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


This means that each token is embedded as a 256-dim vector.

Using an absolute positional embedding approach, we just need to create another embedding layer that has the same embedding dimension as the token_embedding_layer.

In [44]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length,output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)


torch.Size([4, 256])


Now we will add the 4 x 256 pos_embeddings tensor to each of the eight 4 x 256 token embedding tensors in the batch. PyTorch broadcasts the addition across the batch dimension.

In [45]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
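
The addition of a (4, 256) tensor to an (8, 4, 256) tensor relies on broadcasting. The sketch below uses random stand-in tensors instead of the real embedding layers (an assumption made to keep it self-contained) and confirms that the broadcast sum matches an explicit expansion over the batch dimension.

```python
import torch

torch.manual_seed(123)
batch_size, context_length, output_dim = 8, 4, 256

# Random stand-ins for the outputs of the token and positional embedding layers.
token_embeddings = torch.randn(batch_size, context_length, output_dim)
pos_embeddings = torch.randn(context_length, output_dim)

# Broadcasting: the (4, 256) tensor is treated as (1, 4, 256) and reused for all 8 samples.
input_embeddings = token_embeddings + pos_embeddings

# Explicit equivalent with the batch dimension expanded by hand.
explicit = token_embeddings + pos_embeddings.unsqueeze(0).expand(batch_size, -1, -1)

print(input_embeddings.shape)
print(torch.equal(input_embeddings, explicit))
```

Every sample in the batch therefore receives the same positional vector at a given position, regardless of which token occupies it.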
