# Build a Large Language Model (from scratch)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px">

Author of notes: https://github.com/deburky

## Chapter 2: Working with text data

### LLM tokenizers

`SimpleTokenizerV1`

When we want to convert the outputs of an LLM from numbers back into text, we need a way to turn token IDs into text. For this, we can create an inverse version of the vocabulary that maps token IDs back to the corresponding text tokens.

Let's implement a complete tokenizer class in Python with an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. In addition, we'll implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text. The following listing shows the code for this tokenizer implementation.
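
Before the full listing, here is a minimal sketch of the encode/decode round trip on a toy vocabulary. The names `toy_vocab`, `toy_encode`, and `toy_decode` are made up for illustration and are not part of the book's code; the real vocabulary is built from the training text.

```python
import re

# Hypothetical toy vocabulary: token string -> token ID
toy_vocab = {"!": 0, ",": 1, "Hello": 2, "world": 3}
# Inverse mapping: token ID -> token string
inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}


def toy_encode(text):
    # Split on punctuation and whitespace, keeping punctuation as tokens
    pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [toy_vocab[t] for t in tokens]


def toy_decode(ids):
    text = " ".join(inverse_vocab[i] for i in ids)
    # Remove the space that " ".join inserts before punctuation
    return re.sub(r'\s+([,.?!"()\'])', r"\1", text)


ids = toy_encode("Hello, world!")
print(ids)              # [2, 1, 3, 0]
print(toy_decode(ids))  # Hello, world!

# A word missing from the vocabulary raises a KeyError
try:
    toy_encode("Hello, tea!")
except KeyError as err:
    print(f"Not in vocabulary: {err}")
```

The `KeyError` at the end is the limitation that motivates the second tokenizer version.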

`SimpleTokenizerV2`

We need to modify the tokenizer to handle unknown words. We also need to address the usage and addition of special context tokens that can enhance a model’s understanding of context or other relevant information in the text. These special tokens can include markers for unknown words and document boundaries, for example. In particular, we will modify the vocabulary and tokenizer, `SimpleTokenizerV2`, to support two new tokens, `<|unk|>` and `<|endoftext|>`.

In [2]:
import re
import urllib.request
from IPython.display import HTML


class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text


# Path to text data
url = (
    "https://raw.githubusercontent.com/rasbt/"
    "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
    "the-verdict.txt"
)

file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Reading the training data
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print(f"Total number of character: {len(raw_text)}")
# print(raw_text[:99])

# Converting the entire text in a training dataset into tokens
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(f"Total number of tokens: {len(preprocessed)}")
# print(preprocessed[:30])

display(HTML(
    """Converting tokens into token IDs;
    This conversion is an intermediate step before
    converting the token IDs into embedding vectors."""
))

all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(f"Vocabulary size: {vocab_size}")

vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    # print(item)
    if i >= 50:
        break

simple_tokenizer_v1 = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
       Mrs. Gisburn said with pardonable pride."""
ids = simple_tokenizer_v1.encode(text)

print(f"Encoding: {ids}")
print(f"Decoding: {simple_tokenizer_v1.decode(ids)}")

print("\n")
# text = "do you know what is the meaning of life?"
# print(f"Unseen text: {simple_tokenizer_v1.encode(text)}")

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}

print(f"Vocabulary size after extension: {len(vocab.items())}")

display(HTML(
    """As an additional quick check, let's print the last
    five entries of the updated vocabulary:"""
))

for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

Total number of character: 20479
Total number of tokens: 4649


Vocabulary size: 1159
Encoding: [1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
Decoding: " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


Vocabulary size after extension: 1161


('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)


---
Based on the output of this print statement, the new vocabulary size is 1,161 (the previous vocabulary size was 1,159).

In [3]:
import re

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        text = re.sub(r'\s+([,.:;?!"()\'])', r"\1", text)
        return text


text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

# 2nd version of tokenizer
tokenizer = SimpleTokenizerV2(vocab)
print(f"Encoding: {tokenizer.encode(text)}")
print(f"Decoding: {tokenizer.decode(tokenizer.encode(text))}")

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
Encoding: [1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]
Decoding: <|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


🔖 **Special tokens**:

- **[BOS]** (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins.
- **[EOS]** (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to `<|endoftext|>`. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one ends and the next begins.
- **[PAD]** (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or “padded” using the [PAD] token, up to the length of the longest text in the batch.
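
As a rough illustration of padding, the sketch below extends every sequence in a batch to the length of the longest one. The helper `pad_batch` and the choice of `0` as the padding ID are assumptions made for this example only.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Hypothetical token IDs; pad_id=0 is an assumed ID reserved for [PAD]
batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch, pad_id=0)
print(padded)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```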


### Byte pair encoding (BPE)

If a BPE tokenizer encounters an unfamiliar word during tokenization, it breaks the word down into smaller subword tokens or even individual characters, so no `<|unk|>` token is needed.
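
To get a feel for where subword tokens come from, here is a toy sketch of the merge step behind BPE: repeatedly find the most frequent adjacent pair of symbols and fuse it into one symbol. This is a character-level illustration on a made-up corpus, not how `tiktoken` is implemented; the GPT-2 tokenizer works on bytes with a pretrained merge table.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    words = merge_pair(words, best)
    merges.append(best)

print(merges)       # [('e', 's'), ('es', 't'), ('l', 'o')]
print(list(words))
```

After three merges, the frequent ending `est` has become a single symbol, while rarer character sequences stay split into smaller pieces.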

In [4]:
import tiktoken
from importlib.metadata import version
from rich import print as rprint

rprint(f"tiktoken version: {version('tiktoken')}")

tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunknownPlace."
)
integers = tokenizer.encode(
    text, 
    allowed_special={"<|endoftext|>"}
)

rprint(f"Integers: {integers}")

strings = tokenizer.decode(integers)
rprint(f"Strings: {strings}")

# Token ID 50256 is <|endoftext|>, the last entry in the GPT-2 vocabulary;
# the BPE tokenizer has no <|unk|> token
endoftext_token = tokenizer.decode([50256])
rprint(f"Token 50256: {endoftext_token}")

## Byte pair encoding of unknown words Exercise 2.1
text_new = "Akwirw ier"

integers = tokenizer.encode(
    text_new, allowed_special={"<|endoftext|>"}
)
rprint(f"Integers: {integers}")

# decode each individual token ID
print([tokenizer.decode([i]) for i in integers])

strings = tokenizer.decode(integers)
rprint(f"Strings: {strings}")

['Ak', 'w', 'ir', 'w', ' ', 'ier']


### Sliding window

The next step in creating the embeddings for the LLM is to generate the input–target pairs required for training an LLM.

In [5]:
from IPython.display import HTML

import tiktoken
from rich import print as rprint

tokenizer = tiktoken.get_encoding("gpt2")

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
text_str = 'The total number of tokens in the training set, after applying the BPE tokenizer'
rprint(f"{text_str}: {len(enc_text)}")

display(HTML(
    """Executing this code will return 5145,
    the total number of tokens in the training set,
    after applying the BPE tokenizer."""
))

enc_sample = enc_text[50:]
rprint(f"{enc_sample[:2]}")

display(HTML(
    """One of the easiest and most intuitive ways to create 
    the input-target pairs for the next-word prediction task 
    is to create two variables, x and y, where x contains 
    the input tokens and y contains the targets, 
    which are the inputs shifted by 1:"""
))

context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1 : context_size + 1]  # Shifted by 1
rprint(f"features: {x}")
rprint(f"labels: \t {y}")
rprint(f"features (decoded): {tokenizer.decode(x)}")
rprint(f"labels (decoded): \t {tokenizer.decode(y)}")

display(HTML(
    """By processing the inputs along with the targets, 
    which are the inputs shifted by one position, 
    we can create the next-word prediction tasks as follows:"""
))

# Setting up for next word prediction
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    rprint(context, "---->", desired)

# Decoding the same pairs back into text
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    rprint(
        tokenizer.decode(context),
        "---->",
        tokenizer.decode([desired]),
    )

---

We've now created the input–target pairs that we can use for LLM training.
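
The same idea extends to a sliding window over the whole token sequence. The sketch below uses plain Python lists and made-up token IDs; the helper name `sliding_window_pairs` is an assumption for illustration.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks; targets are the inputs shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = list(range(10, 20))  # hypothetical token IDs 10..19

# stride == max_length: windows do not overlap
for inputs, targets in sliding_window_pairs(token_ids, max_length=4, stride=4):
    print(inputs, "---->", targets)
# [10, 11, 12, 13] ----> [11, 12, 13, 14]
# [14, 15, 16, 17] ----> [15, 16, 17, 18]

# stride == 1: consecutive windows overlap in all but one position
print(len(sliding_window_pairs(token_ids, max_length=4, stride=1)))  # 6
```

A smaller stride yields more, heavily overlapping samples from the same text; a stride equal to the window length uses each token as an input at most once.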

### Data loader

There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays. In particular, we are interested in returning two tensors:
* An input tensor containing the text that the LLM sees
* A target tensor containing the targets for the LLM to predict
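
Conceptually, a data loader groups these input and target chunks into batches. The sketch below mimics that grouping with plain lists, including the effect of dropping an incomplete final batch. It only illustrates the idea; `make_batches` is a made-up helper, not PyTorch's `DataLoader` implementation.

```python
def make_batches(samples, batch_size, drop_last=True):
    """Group samples into batches; optionally drop an incomplete last batch."""
    batches = [
        samples[i : i + batch_size] for i in range(0, len(samples), batch_size)
    ]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


# Hypothetical token IDs, chunked with max_length=4 and stride=4
token_ids = list(range(100, 114))
samples = [
    (token_ids[i : i + 4], token_ids[i + 1 : i + 5])
    for i in range(0, len(token_ids) - 4, 4)
]
print(len(samples))  # 3

for batch in make_batches(samples, batch_size=2):
    inputs = [pair[0] for pair in batch]   # batch_size x max_length
    targets = [pair[1] for pair in batch]  # same shape, shifted by one
    print("Inputs: ", inputs)
    print("Targets:", targets)
```

With three samples and a batch size of two, the third sample is dropped when `drop_last=True`, so every batch has the same shape.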

In [6]:
import torch
from torch.utils.data import Dataset, DataLoader
import tiktoken
from rich import print as rprint


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        for i in range(
            0, len(token_ids) - max_length, stride
        ):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[
                i + 1 : i + max_length + 1
            ]
            # target_chunk = token_ids[
            #     i + stride : i + max_length + stride
            # ]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(
                torch.tensor(target_chunk)
            )

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(
    txt,
    batch_size=4,
    max_length=256,
    stride=128,
    shuffle=True,
    drop_last=True,
    num_workers=0,
):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(
        txt, tokenizer, max_length, stride
    )
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers,
    )

    return dataloader


with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text,
    batch_size=1,
    max_length=4,
    stride=1,
    shuffle=False,
)


data_iter = iter(dataloader)

display(HTML(
    """Output first batch (stride=1):"""
))

inputs, targets = next(data_iter)

rprint(f"Inputs:\n {inputs}")
rprint(f"\nTargets:\n {targets}")

display(HTML(
    """Output second batch (stride=1):"""
))

second_batch = next(data_iter)
rprint(second_batch)

display(HTML(
    """A stride of 4 moves the input field by 4 positions:"""
))

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4,
    shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
rprint("Inputs:\n", inputs)
rprint("\nTargets:\n", targets)

### Create token embeddings

The last step in preparing the input text for LLM training is to convert the token IDs into embedding vectors.

**Why do we need embeddings?**

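A continuous vector representation, or embedding, is necessary since GPT-like LLMs are deep neural networks trained with the backpropagation algorithm. Backpropagation computes gradients with respect to continuous values, so the discrete token IDs must first be mapped to real-valued vectors.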

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/embeddings-and-linear-layers/3.png" width="450px">

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/embeddings-and-linear-layers/5.png" width="450px">

In [7]:
import torch
torch.manual_seed(123)

vocab_size = 6
output_dim = 3
input_ids = torch.tensor([2, 3, 5, 1])

embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
rprint(embedding_layer.embedding_dim)
rprint(embedding_layer.weight)

display(HTML(
    """Now, let's apply it to token ID 3 to obtain its embedding vector
    (Python uses zero-based indexing, so this is the fourth row of the weight matrix):
    """
))

rprint(embedding_layer.weight[3])
rprint(embedding_layer(torch.tensor([3])))

In [8]:
# Direct look-up vs one-hot encoding
torch.manual_seed(123)

num_idx = int(input_ids.max()) + 1  # number of one-hot classes: largest token ID + 1
rprint(input_ids)

# From notebook example
onehot = torch.nn.functional.one_hot(input_ids)
rprint(onehot)

linear = torch.nn.Linear(num_idx, output_dim, bias=False)
rprint(embedding_layer.weight.T)

linear.weight = torch.nn.Parameter(embedding_layer.weight.T.detach())
rprint(linear(onehot.float()).T)

---
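
The equivalence between an embedding lookup and a one-hot matrix product can also be checked without PyTorch. The sketch below uses a made-up 6 x 3 weight matrix and plain lists; `one_hot` and `vector_times_matrix` are hypothetical helpers written for this example.

```python
# Hypothetical 6 x 3 embedding matrix (vocab_size=6, output_dim=3)
weights = [[float(row * 10 + col) for col in range(3)] for row in range(6)]
input_ids = [2, 3, 5, 1]


def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def vector_times_matrix(vector, matrix):
    """Multiply a length-n vector by an n x m matrix, giving a length-m vector."""
    return [
        sum(v * row[j] for v, row in zip(vector, matrix))
        for j in range(len(matrix[0]))
    ]


# Direct look-up: pick the row that belongs to each token ID
lookup = [weights[i] for i in input_ids]

# One-hot route: the single 1.0 selects the same row via multiplication
matmul = [
    vector_times_matrix(one_hot(i, len(weights)), weights) for i in input_ids
]

print(lookup[0])         # [20.0, 21.0, 22.0]
print(lookup == matmul)  # True
```

The look-up gives the same result while skipping the multiplications by zero, which is why an embedding layer is the more efficient choice.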

Having now created embedding vectors from token IDs, next we’ll add a small modification to these embedding vectors to encode positional information about a token within a text.

To achieve this, we can use two broad categories of position-aware embeddings:

* Relative positional embeddings, *"How far apart?"*
* Absolute positional embeddings, *"At which exact position?"* (the approach used by OpenAI's GPT models).

Both types of positional embeddings aim to augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more **accurate** and **context-aware** predictions.
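
A small sketch may help separate the two ideas. It shows the position information each approach starts from, then adds absolute position vectors to a batch of token vectors. All values are made-up integers chosen for readability; real embeddings are learned floating-point vectors, and how a model uses relative offsets depends on the architecture.

```python
seq_len = 4

# Absolute: every slot has its own index, used to look up a position vector
absolute_positions = list(range(seq_len))
print(absolute_positions)  # [0, 1, 2, 3]

# Relative: what matters is the signed distance between two slots
relative_offsets = [[j - i for j in range(seq_len)] for i in range(seq_len)]
for row in relative_offsets:
    print(row)
# [0, 1, 2, 3]
# [-1, 0, 1, 2]
# [-2, -1, 0, 1]
# [-3, -2, -1, 0]

# Hypothetical 2-dimensional position vectors, one per absolute position
pos_embeddings = [[p, -p] for p in absolute_positions]

# Hypothetical token embeddings: batch of 2 samples x 4 tokens x 2 dimensions
token_embeddings = [[[10, 10] for _ in range(seq_len)] for _ in range(2)]

# The same position vectors are added to every sample in the batch
input_embeddings = [
    [
        [t + p for t, p in zip(token_vec, pos_vec)]
        for token_vec, pos_vec in zip(sample, pos_embeddings)
    ]
    for sample in token_embeddings
]
print(input_embeddings[0])  # [[10, 10], [11, 9], [12, 8], [13, 7]]
```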

### Positional encoding

In [9]:
import torch
from rich.console import Console

console = Console()

# Define vocab and embedding dimensions
vocab_size = 50257  # Size of the GPT-2 BPE vocabulary (also used by GPT-3)
output_dim = 256

# Initializes with random weights
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

display(HTML(
    """The token ID tensor is 8 x 4 dimensional, meaning that the data batch 
    consists of eight text samples with four tokens each."""
))

console.print("Token IDs:\n", inputs)
console.print("\nInputs shape:\n", inputs.shape)

display(HTML(
    """If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.
    Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors:"""
))

# Convert to 8x4x256 dimensional tensor output
console.print('Converting to 8x4x256 dimensional tensor output:\n')
token_embeddings = token_embedding_layer(inputs)
console.print(token_embeddings.shape)

# add absolute embedding approach
console.print('\nAdd absolute positional embeddings:\n')
context_length = max_length
console.print(f"Max context length: {max_length}")
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
console.print(pos_embeddings.shape)

display(HTML(
    """PyTorch will add the 4x256-dimensional <code>pos_embeddings</code> tensor to the
    4x256-dimensional token embedding tensor of each of the 8 samples in the batch."""
))

# Adding four 256-dimensional vectors to token embeddings
console.print('\nAdding four 256-dimensional vectors to token embeddings:\n')
input_embeddings = token_embeddings + pos_embeddings
console.print(input_embeddings.shape)

display(HTML(
    """Each sample in the batch consists of a tensor of shape (4, 256), 
    representing the embeddings for 4 tokens."""
))

console.print('\nEach sample in the batch consists of a tensor of shape (4, 256):\n')
console.print(input_embeddings[3:4])
console.print(input_embeddings[3:4].shape)