# Part 2: Working with Text

This notebook uses the following packages:

In [1]:
import os
import urllib.request
import re
import torch
from torch.utils.data import Dataset, DataLoader
from importlib.metadata import version
import tiktoken

## 1. Create tokenizer from a text file

- Tokenize the text: break it into smaller units, such as individual words and punctuation characters.
- Convert the tokens into vectors of numbers (embeddings) so that LLMs can work with them.



We will first take a look at the raw input text. 

In [2]:
if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)
    
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### a. Tokenize text
Split the raw text by spaces and various types of punctuation, such as periods and question marks.

In [3]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
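
The role of the capturing group in the regular expression is easy to miss, so here is a minimal sketch on a short, made-up sentence (the `sample` string is an assumption for illustration, not part of the dataset). With the parentheses, `re.split` keeps the delimiters in its result; without them, the punctuation would be lost.

```python
import re

sample = 'Hello, world. Is this-- a "test"?'

# The capturing group (...) makes re.split return the delimiters as well.
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Whitespace and empty strings are dropped; punctuation survives as separate tokens.
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)

# For contrast: the same kind of split without a capturing group discards the punctuation.
no_group = [p for p in re.split(r'[,.]|\s', 'Hello, world.') if p]
print(no_group)
```

Keeping punctuation as separate tokens is what later allows the decoder to reattach it to the preceding word.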


### b. Build a vocabulary 
Collect all the unique tokens from the raw text, sort them, and map each one to an integer ID.

In [4]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
vocab = {token:integer for integer,token in enumerate(all_words)}
print('Some entries (tokens and their ids) in the vocabulary')
for item in list(vocab.items())[-5:]:
    print(item)
print(f'Vocabulary size {vocab_size}')

Some entries (tokens and their ids) in the vocabulary
('yet', 1125)
('you', 1126)
('younger', 1127)
('your', 1128)
('yourself', 1129)
Vocabulary size 1130


### c. Adding special context tokens
- `<|endoftext|>` to separate documents.
- `<|unk|>` for tokens that do not exist in the vocabulary.

In [5]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))

1132


**Example**: 
- Create a **Simple Tokenizer** class that converts tokens to token IDs (encode) and back (decode).
- Later on, however, we will use the tokenizer from **tiktoken**, which has a much larger vocabulary (50,257 tokens).

In [6]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int 
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that the join inserted before punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [7]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [8]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [9]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
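
The round trip above is lossy: every token outside the vocabulary collapses to the same `<|unk|>` ID, so the original word cannot be recovered. The following sketch reproduces this on a tiny, hypothetical corpus (`corpus`, `toy_vocab`, and the helper functions are assumptions for illustration; they mirror the logic of SimpleTokenizerV2 but are not a replacement for it).

```python
import re

def tokenize(text):
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]

# Hypothetical toy corpus standing in for the-verdict.txt.
corpus = "do you like tea? you do."
toy_tokens = sorted(set(tokenize(corpus)))
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {tok: i for i, tok in enumerate(toy_tokens)}
toy_inverse = {i: tok for tok, i in toy_vocab.items()}

def toy_encode(text):
    unk_id = toy_vocab["<|unk|>"]
    # Any token missing from the vocabulary falls back to the <|unk|> id.
    return [toy_vocab.get(tok, unk_id) for tok in tokenize(text)]

def toy_decode(ids):
    text = " ".join(toy_inverse[i] for i in ids)
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

toy_ids = toy_encode("Hello, do you like coffee?")
print(toy_ids)
print(toy_decode(toy_ids))
```

Note that even the comma is mapped to `<|unk|>` here, because it never occurs in the toy corpus. A word-level vocabulary only knows what it has seen, which is one motivation for the subword tokenizer from tiktoken used later in the notebook.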

## 2. Data sampling with a sliding window

We train LLMs to generate one token at a time, so we prepare the training data such that the next token in a sequence is the target to predict.


### a. An example of training samples

We first read the-verdict.txt, which serves as our dataset for training the model.

In [10]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

context_size = 4

enc_text = tokenizer.encode(raw_text)
print("Length of raw text", len(enc_text))
enc_sample = enc_text[50:]

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

Length of raw text 4690
in ----> a
in a ----> villa
in a villa ----> on
in a villa on ----> the


We use a **sliding window approach**, moving the start of the window by one position at each step (**stride** = 1).
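
The windowing itself can be sketched in plain Python before bringing in PyTorch. The helper name `sliding_windows` and the toy token IDs are assumptions for illustration; the loop bounds follow the same pattern as the dataset class in this notebook.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted right by one token."""
    pairs = []
    # Stop early enough that the target chunk still has max_length tokens.
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

toy_sequence = list(range(10, 20))  # hypothetical token ids 10..19

toy_pairs = sliding_windows(toy_sequence, max_length=4, stride=1)
for inputs, targets in toy_pairs[:3]:
    print(inputs, "---->", targets)
```

With a stride of 1, the input of each pair equals the target of the previous pair, which is the overlap the sliding window produces.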




### b. Create a PyTorch dataset and data loader
- dataset: returns a single training sample (an input chunk and its target chunk).
- dataloader: draws batches of samples from the dataset.

Note that in this section we will use the tokenizer from **tiktoken.get_encoding("gpt2")**.

In [11]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader_v1(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

We can use the dataloader to iterate through the samples. With a stride of 1, consecutive sequences **overlap**: each input is the previous input shifted by one token.

In [12]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print("First batch:", first_batch)
second_batch = next(data_iter)
print("Second batch:", second_batch)

First batch: [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second batch: [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


In the following example, we use a batch size of 8 with a stride of 4. As a result:
- The inputs and the targets each contain 8 rows of token IDs.
- The sequences no longer **overlap**, because the stride (4) equals the sequence length (max_length = 4).
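
How the stride affects the number of samples and their overlap can be worked out with simple arithmetic. This is a rough sketch under stated assumptions: the token count of 100 and the helper `window_starts` are hypothetical, and the batch count assumes `drop_last=True`, as in `create_dataloader_v1`.

```python
def window_starts(num_tokens, max_length, stride):
    # Same loop bounds as the sliding window in GPTDatasetV1.
    return list(range(0, num_tokens - max_length, stride))

n_tokens = 100   # hypothetical length of the tokenized text
window = 4       # plays the role of max_length
batch = 8

summary = {}
for stride in (1, 2, 4):
    n_samples = len(window_starts(n_tokens, window, stride))
    shared = max(window - stride, 0)   # tokens shared by consecutive inputs
    n_batches = n_samples // batch     # drop_last=True discards a partial final batch
    summary[stride] = (n_samples, shared, n_batches)
    print(f"stride={stride}: {n_samples} samples, "
          f"{shared} shared tokens, {n_batches} full batches")
```

A larger stride yields fewer samples per pass over the text, while a stride equal to the window length uses every token as an input exactly once.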

In [13]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


## 3. Create token embedding and position embedding layers.

### a. Token embedding
* The data is now almost ready for an LLM.
* As a last step, we embed the tokens in a continuous vector representation using an embedding layer.
* Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.



**Example:**
- The encoder from tiktoken (tiktoken.get_encoding("gpt2")) has a vocabulary size of 50,257, so we create the token embedding layer with a vocabulary size of 50,257.
- Suppose that we want the **output dimension** of the embedding layer to be 256.
- As a result, after passing through the token embedding layer, every token ID is mapped to a vector of size 256.
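
It can help to see that an embedding layer is essentially a trainable lookup table. The sketch below uses hypothetical toy sizes (6 token IDs, 3 dimensions) rather than the 50,257 x 256 layer described above, so that the tensors stay small enough to inspect.

```python
import torch

torch.manual_seed(123)

# Hypothetical toy sizes: 6 possible token ids, 3-dimensional vectors.
toy_embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)

toy_token_ids = torch.tensor([2, 5, 2])
toy_vectors = toy_embedding(toy_token_ids)

print(toy_vectors.shape)
# Each output row is the row of the weight matrix selected by the token id.
print(torch.equal(toy_vectors, toy_embedding.weight[toy_token_ids]))
# The weight matrix is a trainable parameter.
print(toy_embedding.weight.requires_grad)
```

Because the lookup depends only on the token ID, the first and third rows of `toy_vectors` are identical.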

In [14]:
max_length = 4
output_dim = 256
vocab_size = 50257

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Token IDs:\n", inputs)
print("\nInputs shape:", inputs.shape)

token_embeddings = token_embedding_layer(inputs)
print("Output shape:", token_embeddings.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape: torch.Size([8, 4])
Output shape: torch.Size([8, 4, 256])


### b. Position embedding: encoding word positions
- An embedding layer maps each token ID to the same vector regardless of where the token is located in the input sequence.
- Positional embeddings are therefore added to the token embedding vectors to form the input embeddings for a large language model.
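
The effect is easiest to see when a token ID repeats within one sequence. The following sketch uses hypothetical toy layers (`toy_tok_emb`, `toy_pos_emb`) with small sizes and randomly initialized weights, so the concrete numbers carry no meaning.

```python
import torch

torch.manual_seed(123)

toy_tok_emb = torch.nn.Embedding(6, 3)   # hypothetical vocabulary of 6 ids
toy_pos_emb = torch.nn.Embedding(4, 3)   # one vector per position, context length 4

toy_ids = torch.tensor([2, 5, 2, 1])     # token id 2 appears at positions 0 and 2
tok_vecs = toy_tok_emb(toy_ids)
combined = tok_vecs + toy_pos_emb(torch.arange(4))

# Same token id, so the token embeddings at positions 0 and 2 are identical.
print(torch.equal(tok_vecs[0], tok_vecs[2]))
# After adding position vectors, the two occurrences are no longer identical.
print(torch.equal(combined[0], combined[2]))
```

Without the position vectors, the model would receive exactly the same input for both occurrences of the token.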


In the following code, we create the position embedding layer for a maximum sequence length of 4 and pass the token positions (0 to 3) to it.

In [15]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print('Input of position embedding layer', torch.arange(max_length))
print('Output shape', pos_embeddings.shape)

Input of position embedding layer tensor([0, 1, 2, 3])
Output shape torch.Size([4, 256])


### c. Combine both token embedding and position embedding.

Finally, the input to the GPT model is the sum of the token embeddings and the position embeddings. The position embeddings of shape [4, 256] are added to each of the 8 sequences in the batch.



In [16]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
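
The addition works even though the two tensors have different shapes, because PyTorch broadcasts the smaller tensor over the batch dimension. A minimal sketch with hypothetical toy sizes and random values in place of real embeddings:

```python
import torch

torch.manual_seed(123)

toy_batch, toy_len, toy_dim = 2, 4, 3            # hypothetical toy sizes
toy_tok = torch.randn(toy_batch, toy_len, toy_dim)
toy_pos = torch.randn(toy_len, toy_dim)

# [2, 4, 3] + [4, 3]: toy_pos is broadcast over the leading batch dimension.
toy_input = toy_tok + toy_pos
print(toy_input.shape)

# Every sequence in the batch receives the same position vectors.
same_for_all = all(
    torch.equal(toy_input[b], toy_tok[b] + toy_pos) for b in range(toy_batch)
)
print(same_for_all)
```

This is why a single set of position embeddings of shape [context_length, output_dim] is enough for any batch size.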


### Activity:
Try to encode and decode the string "**What is the opposite word of hot?**" using both the **SimpleTokenizerV2** tokenizer and **tiktoken.get_encoding("gpt2")**.
- Print the token IDs of the string.
- You may also want to decode the IDs to get the original text back.

In [25]:
tokenizer_1 = SimpleTokenizerV2(vocab)
tokenizer_2 = tiktoken.get_encoding("gpt2")

act_text = "What is the opposite word of hot?"

print("Encoding...")
encoded_1 = tokenizer_1.encode(act_text)
encoded_2 = tokenizer_2.encode(act_text)

print(encoded_1)
print(encoded_2)

print()

print("Decoding...")
decoded_1 = tokenizer_1.decode(encoded_1)
decoded_2 = tokenizer_2.decode(encoded_2)

print(decoded_1)
print(decoded_2)

Encoding...
[109, 584, 988, 1131, 1116, 722, 1131, 10]
[2061, 318, 262, 6697, 1573, 286, 3024, 30]

Decoding...
What is the <|unk|> word of <|unk|>?
What is the opposite word of hot?
