-----
Build a Large Language Model
Sebastian Raschka
-----

# Working with text data

  - During the pretraining stage, LLMs process text one word at a time.

## 2.1 - Understanding word embeddings

  - Converting data into a vector format is referred to as embedding. An embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The primary purpose of embeddings is to convert nonnumeric data into a format that neural networks can process.
  - In addition to word embeddings, there are also embeddings for sentences, paragraphs, or whole documents.
  - Retrieval-augmented generation combines generation (like producing text) with retrieval (like searching an external knowledge base) to pull relevant information when generating text.

## 2.2 - Tokenizing text

  - We will tokenize "The Verdict," from https://en.wikisource.org/wiki/The_Verdict


In [1]:
# Download our text
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Now load our text file
with open(file_path, "r", encoding="utf-8") as file:
    raw_text = file.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


  - Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training.

In [2]:
# Simple example text using re.split command with the following syntax
import re
text = "Hello, world.  This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', '', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
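
  - The parentheses in the pattern matter: they form a capturing group, and re.split then keeps the matched separators in its result. A small comparison on the same sample sentence (a sketch for illustration, not part of the original notebook):

```python
import re

text = "Hello, world.  This, is a test."

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched separator is kept as its own item.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

  - Keeping the separators is what later lets punctuation such as commas and periods survive as tokens of their own.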


  - Modify the regular expression so that it splits on whitespace (\s), commas, and periods ([,.]).
  - Capitalization helps LLMs distinguish between proper nouns and common nouns, so we will refrain from making all text lowercase.

In [3]:
# Use a raw string (r'...') so that \s reaches the regex engine unchanged
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


  - Remove the whitespace-only and empty strings from the result.

In [4]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


  - We need to modify the example further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double dashes we saw earlier in the opening of the story.

In [5]:
text = "Hello, world.  Is this -- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


  - Apply the basic tokenizer to the story.

In [6]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


  - Let's print the first 30 tokens for a quick visual check

In [7]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 - Converting tokens into token IDs

  - Next let's convert these tokens from a Python string to an integer representation to produce the token IDs.

In [8]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(f"Vocab size: {vocab_size}")

Vocab size: 1130


  - Create a vocabulary and print the first 51 entries.

In [9]:
vocab = {token:integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
  print(item)
  if i >= 50:
    break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


  - Our next goal is to apply this vocabulary to convert new text into token IDs.
  - Let's implement a tokenizer class in Python with an encode method.
  - Also create a decode method, so we can reverse this process.

In [10]:
class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s, i in vocab.items()}

  def encode(self, text):
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])

    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text
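
  - The re.sub call in decode removes the spaces that " ".join places in front of punctuation. The sketch below isolates that step on a hand-written token list (the tokens are an assumption chosen for illustration):

```python
import re

# Tokens as an encode step might split them (assumed for illustration)
tokens = ['"', 'It', "'", 's', 'the', 'last', ',', 'you', 'know', '.']

joined = " ".join(tokens)
# Remove whitespace that directly precedes one of the listed punctuation marks.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

  - Only whitespace before a punctuation mark is removed, so the space after an opening quotation mark or after an apostrophe remains. Decoding therefore does not always reproduce the original text exactly.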

  - Using SimpleTokenizerV1, we can instantiate new tokenizer objects from an existing vocabulary.

In [11]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
        Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


  - Next ensure we can turn the token IDs back into text

In [12]:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


  - Let's now apply our tokenizer to a new text sample.

In [13]:
text = "Hello, do you like tea?"

try:
  print(tokenizer.encode(text))
except Exception as e:
  print(f"unknown token {e}")

unknown token 'Hello'


  - This throws an error because "Hello" does not appear in our short story and is therefore not in our vocabulary.

## 2.4 - Adding special context tokens

  - We need to modify the tokenizer to handle unknown words.
  - Modify the tokenizer to handle two special tokens, <|unk|> and <|endoftext|>.

In [14]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}

print(len(vocab))

1132


  - The new vocabulary size is 1132 (the previous vocabulary size was 1130)

In [15]:
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


  - Based on the output, we can confirm that the two new special tokens were successfully incorporated into the vocab.

In [16]:
class SimpleTokenizerV2:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = { i:s for s,i in vocab.items() }

  def encode(self, text):
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    preprocessed = [item if item in self.str_to_int
                    else "<|unk|>" for item in preprocessed]

    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])

    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

  - Compared to v1, v2 replaces unknown words with <|unk|> tokens.

In [17]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [18]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]


In [19]:
strings = tokenizer.decode(tokenizer.encode(text))
print(strings)

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


  - Depending on the LLM, some researchers also consider additional special tokens such as:
    - [BOS] (beginning of sequence) - marks the start of text
    - [EOS] (end of sequence) - positioned at the end of a text; especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>.
    - [PAD] (padding) - when training LLMs with batch sizes larger than one, the batch may contain texts of varying lengths; to ensure all texts have the same length, the shorter texts are extended or "padded" with the [PAD] token.
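
  - Padding can be sketched in a few lines of plain Python. The token IDs below are made up for illustration, and reusing the <|endoftext|> ID (50256) as the padding value is an assumption of this sketch, not something the notes above prescribe:

```python
# Hypothetical token-ID sequences of different lengths (values for illustration only)
batch = [
    [15496, 11, 466],
    [554, 262, 4252, 18250, 8812],
    [40],
]
pad_id = 50256  # assumption: reuse the <|endoftext|> ID as the padding value

# Extend every sequence on the right until it matches the longest one.
max_len = max(len(seq) for seq in batch)
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

for row in padded:
    print(row)
```

  - After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.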

## 2.5 - Byte pair encoding

  - Since BPE (byte pair encoding) is relatively complicated to implement, we will use an existing Python library called tiktoken. The code is based on tiktoken 0.7.0; check which version you currently have installed.

In [20]:
pip install tiktoken==0.7.0

Collecting tiktoken==0.7.0
  Downloading tiktoken-0.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
  Attempting uninstall: tiktoken
    Found existing installation: tiktoken 0.10.0
    Uninstalling tiktoken-0.10.0:
      Successfully uninstalled tiktoken-0.10.0
Successfully installed tiktoken-0.7.0


In [21]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.7.0


In [22]:
# Instantiate the BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Demonstrate usage of the tokenizer
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [23]:
# Convert the token IDs back into text using the decode method
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


  - The <|endoftext|> token is assigned a relatively large token ID, 50256. The total vocabulary size of the GPT-2 BPE tokenizer is 50,257, so <|endoftext|> has the largest ID.
  - The BPE tokenizer encodes and decodes unknown words, such as someunknownPlace, correctly.
  - BPE breaks down words that aren't in its predefined vocab into smaller subword units or even individual characters.
  - BPE builds its vocab by iteratively merging frequent characters into subwords and frequent subwords into words.
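
  - The merging idea can be sketched with a toy, character-level example. This is a simplified illustration, not tiktoken's implementation: the GPT-2 tokenizer works on bytes and applies extra pre-tokenization rules. The corpus, word frequencies, and helper names (most_frequent_pair, merge_pair) are all hypothetical:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency; each word starts as a tuple of characters
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # perform three merge steps
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

  - In this toy corpus the first two merges combine e + s and then es + t, so the frequent ending "est" becomes a single subword. A word that was never seen can still be represented by falling back to smaller pieces.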

## 2.6 - Data sampling with a sliding window

  - The next step in creating the embeddings for the LLM is to generate the input-target pairs required for training.

In [24]:
with open("the-verdict.txt", "r", encoding="utf-8") as file:
  raw_text = file.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [25]:
#   Remove the first 50 tokens from the dataset for demonstration purposes, as it
# results in a slightly more interesting text passage
enc_sample = enc_text[50:]

#   Create two variables x and y, where x contains the input tokens and y contains
# the targets
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [26]:
#   By processing the inputs along with targets, we can create the next-word
# prediction tasks
for i in range(1, context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]
  print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [27]:
#   Everything left of the arrow refers to the input, and the token id on the right
# side represents the target token ID that the LLM is supposed to predict.

for i in range(1, context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]
  print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


  - The last task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors.

In [28]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
  def __init__(self, txt, tokenizer, max_length, stride):
    self.input_ids = []
    self.target_ids = []

    token_ids = tokenizer.encode(txt)
    for i in range(0, len(token_ids) - max_length, stride):
      input_chunk = token_ids[i:i + max_length]
      target_chunk = token_ids[i + 1: i + max_length + 1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]

  - GPTDatasetV1 is based on the PyTorch Dataset class and defines how individual rows are fetched from the dataset.
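
  - The effect of max_length and stride is easier to see without PyTorch. The sketch below uses the same loop bounds as GPTDatasetV1 on stand-in token IDs 0..9; the helper name sliding_windows is hypothetical:

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one position."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

token_ids = list(range(10))  # stand-in token IDs 0..9, for illustration only

# stride=1 moves the window one token at a time, so consecutive inputs overlap.
overlapping = sliding_windows(token_ids, max_length=4, stride=1)

# stride equal to max_length yields inputs that do not overlap.
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

  - A smaller stride produces more, heavily overlapping samples. Setting the stride equal to max_length uses each input token once.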

In [29]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
  tokenizer = tiktoken.get_encoding("gpt2")
  dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
  dataloader = DataLoader(
      dataset,
      batch_size=batch_size,
      shuffle=shuffle,
      drop_last=drop_last,
      num_workers=num_workers
  )

  return dataloader

# Test dataloader with batch size of 1 for an LLM
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [31]:
#   The first_batch contains two tensors: the first tensor stores the input token
# IDs, and the second tensor stores the target token IDs.
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


In [33]:
# Use the data loader to sample with a batch size greater than 1
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("Targets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


## 2.7 - Creating token embeddings

  - The last step in preparing the input text for LLM training is to convert the token IDs into embedding vectors.
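
  - Conceptually, an embedding layer is a lookup table with one row of weights per token ID. The sketch below shows this in plain Python with random weights. The tiny vocabulary size, embedding dimension, and token IDs are assumptions for illustration. In PyTorch, this role is played by torch.nn.Embedding, whose weights are learned during training:

```python
import random

random.seed(123)  # fixed seed so the illustration is reproducible

vocab_size = 6      # hypothetical tiny vocabulary
embedding_dim = 3   # hypothetical embedding size

# One row of random weights per token ID; in training these weights would be learned.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [2, 3, 5, 1]
# Embedding a sequence is a row lookup for each token ID.
embedded = [embedding_table[i] for i in token_ids]

for token_id, vector in zip(token_ids, embedded):
    print(token_id, [round(v, 3) for v in vector])
```

  - Each token ID selects one row of the table, so a sequence of four token IDs becomes four vectors of length three.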