# **Primer on Working With Text Data**

Let's start by checking the versions of key libraries. If these notebooks are revisited in the future, this makes changes caused by updates to the underlying toolset easier to track.

In [1]:
from importlib.metadata import version
print("torch version", version("torch"))
print("tiktoken version", version("tiktoken"))

torch version 2.7.0
tiktoken version 0.9.0


In [2]:
import os, re
import urllib.request


## **1. Tokenizing Text**

In the book, Sebastian demonstrates the tokenization of text using Edith Wharton's "The Verdict" as an example. I'll be using his provided example as well as samples from my own collection, starting with Frederic Bastiat's "The Law". Published in the 19th century, it offers a very different style of writing and punctuation compared to more modern writers.

In [3]:
# Downloading sample text (skipped if the file is already present)
file_path = "data/the-verdict.txt"
if not os.path.exists(file_path):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    os.makedirs("data", exist_ok=True)  # urlretrieve does not create directories
    urllib.request.urlretrieve(url, file_path)

In [4]:
with open("data/the-verdict.txt", "r", encoding="utf-8") as f:
    verdict = f.read()

print("Total number of characters: ", len(verdict))
print(verdict[:99])

Total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [5]:
# And now for Bastiat's text.
with open("data/the-law-bastiat.txt", "r", encoding="utf-8") as f:
    law = f.read()

print("Total number of characters: ", len(law))
print(law[:99])

Total number of characters:  95988
 The law perverted! The law—and, in its wake, all the collective forces of the nation—the law, I sa


Both texts will be tokenized and later embedded for use with an LLM. The next few cells build a simple tokenizer step by step before applying it to the texts.

In [6]:
# This regex splits on white spaces.
test = "In the grim darkness of the far future, there is only war!"
result = re.split(r'(\s)', test)
print(result)

['In', ' ', 'the', ' ', 'grim', ' ', 'darkness', ' ', 'of', ' ', 'the', ' ', 'far', ' ', 'future,', ' ', 'there', ' ', 'is', ' ', 'only', ' ', 'war!']


In [7]:
# The regex should also be able to split on punctuation.
result = re.split(r'([,.!]|\s)', test)
print(result)

['In', ' ', 'the', ' ', 'grim', ' ', 'darkness', ' ', 'of', ' ', 'the', ' ', 'far', ' ', 'future', ',', '', ' ', 'there', ' ', 'is', ' ', 'only', ' ', 'war', '!', '']


In [8]:
# Now we can filter out empty strings and whitespace-only items
result = [item for item in result if item.strip()]
print(result)

['In', 'the', 'grim', 'darkness', 'of', 'the', 'far', 'future', ',', 'there', 'is', 'only', 'war', '!']


Removing whitespace tokens shortens the token sequence and reduces memory and compute requirements, which helps when working with LLMs on consumer hardware. However, whitespace is usually kept when the exact structure of the text matters, e.g. when working with programming languages where indentation and spacing are significant.
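
To make the trade-off concrete, the sketch below tokenizes a short Python snippet twice: once dropping whitespace tokens, as in the cells above, and once keeping them. The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical two-line code sample where indentation carries meaning
snippet = "if x:\n    y = 1"

pattern = r'([,.:;?_!"()\']|--|\s)'
pieces = re.split(pattern, snippet)

# Variant 1: drop whitespace-only and empty items
stripped = [p for p in pieces if p.strip()]

# Variant 2: keep whitespace tokens, dropping only empty strings
kept = [p for p in pieces if p != ""]

print(stripped)
print(kept)

# Only the whitespace-preserving variant can rebuild the original text exactly
print("".join(kept) == snippet)
print(" ".join(stripped) == snippet)
```

The whitespace-preserving variant produces a longer token list, which is the memory cost mentioned above, but it is the only one of the two that retains the line break and the indentation.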

In [9]:
# Also accounting for other punctuation types
test = "Hello, world! Is this a test? Surely--it must be."
result = re.split(r'([,./:"?!_()\']|--|\s)', test)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '!', 'Is', 'this', 'a', 'test', '?', 'Surely', '--', 'it', 'must', 'be', '.']


Time to test this tokenization on the text samples.

In [10]:
verdict_preproc = re.split(r'([,./:"?!_()\']|--|\s)', verdict)
law_preproc = re.split(r'([,./:"?!_()\']|--|\s)', law)

verdict_preproc = [item.strip() for item in verdict_preproc if item.strip()]
law_preproc = [item.strip() for item in law_preproc if item.strip()]

In [11]:
print(verdict_preproc[:99])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter']


In [12]:
print(law_preproc[:99])

['The', 'law', 'perverted', '!', 'The', 'law—and', ',', 'in', 'its', 'wake', ',', 'all', 'the', 'collective', 'forces', 'of', 'the', 'nation—the', 'law', ',', 'I', 'say', ',', 'not', 'only', 'diverted', 'from', 'its', 'proper', 'direction', ',', 'but', 'made', 'to', 'pursue', 'one', 'entirely', 'contrary', '!', 'The', 'law', 'become', 'the', 'tool', 'of', 'every', 'kind', 'of', 'avarice', ',', 'instead', 'of', 'being', 'its', 'check', '!', 'The', 'law', 'guilty', 'of', 'that', 'very', 'iniquity', 'which', 'it', 'was', 'its', 'mission', 'to', 'punish', '!', 'Truly', ',', 'this', 'is', 'a', 'serious', 'fact', ',', 'if', 'it', 'exists', ',', 'and', 'one', 'to', 'which', 'I', 'feel', 'bound', 'to', 'call', 'the', 'attention', 'of', 'my', 'fellow', 'citizens', '.']


In [13]:
# Total number of tokens for each text.
print(f"Tokens created:\nThe Verdict: {len(verdict_preproc)}\nThe Law: {len(law_preproc)}")

Tokens created:
The Verdict: 4669
The Law: 18807


## **2. Converting Tokens Into Token IDs**

The next step involves the conversion of text tokens into token IDs which can be processed via embedding layers.

In [14]:
all_words_verdict = sorted(set(verdict_preproc))
all_words_law = sorted(set(law_preproc))

vocab_sz_verdict = len(all_words_verdict)
vocab_sz_law = len(all_words_law)

print(f"Vocab Size:\nThe Verdict: {vocab_sz_verdict}\nThe Law: {vocab_sz_law}")

Vocab Size:
The Verdict: 1143
The Law: 3152


In [15]:
def create_token_ids(x):
    return {token:integer for integer, token in enumerate(x)}

vocab_verdict = create_token_ids(all_words_verdict)
vocab_law = create_token_ids(all_words_law)

In [16]:
# Displaying a sample containing token ids
for i, item in enumerate(vocab_verdict.items()):
    print(item)
    if i >= 50: 
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindles', 44)
('HAD', 45)
('Had', 46)
('Hang', 47)
('Has', 48)
('He', 49)
('Her', 50)


`Experiment Note`: strip "The Law" of IDs linked to reference markers. Also, be mindful that integers are used for dates as well.

Moving on to creating a simple tokenizer which encodes text into token IDs and also turns token IDs back into text.

In [37]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_2_int = vocab
        self.int_2_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,./:"?!_()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_2_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_2_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [41]:
# Testing the tokenizer using The Law's vocab.
t1 = SimpleTokenizerV1(vocab_law)

text = """If you suggest a doubt as to the morality of these institutions, it is said directly—
          "You are a dangerous experimenter,a utopian, a theorist, a despiser of the laws; 
          you would shake the basis upon which society rests!"""
id1 = t1.encode(text)
print(id1)

[184, 3142, 2810, 362, 1074, 541, 2912, 2872, 1973, 2063, 2885, 1697, 5, 1746, 1741, 2586, 1012, 1, 360, 515, 362, 917, 1268, 5, 362, 3007, 5, 362, 2881, 5, 362, 977, 2063, 2872, 1796, 3142, 3133, 2667, 2872, 612, 2999, 3082, 2713, 2532, 0]


In [44]:
# Testing the tokenizer using The Verdict's vocab
t2 = SimpleTokenizerV1(vocab_verdict)

text = """It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
id2 = t2.encode(text)
print(id2)

[57, 2, 861, 999, 610, 538, 754, 5, 1139, 603, 5, 1, 68, 7, 39, 862, 1121, 764, 803, 7]


Note that separate tokenizers are used for the two texts because each text contains words that are missing from the other's vocabulary. This issue will be handled in the next sections.

Let's decode the IDs back to text.

In [45]:
t1.decode(id1)

'If you suggest a doubt as to the morality of these institutions, it is said directly—" You are a dangerous experimenter, a utopian, a theorist, a despiser of the laws; you would shake the basis upon which society rests!'

In [46]:
t2.decode(id2)

'It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
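
The decoded strings are close to the inputs but not identical (note `It' s` and the collapsed line break). A minimal sketch with a tiny made-up sentence shows why: decoding joins tokens with single spaces and only removes the space *before* punctuation. The `tokenize` and `detokenize` helpers and the sample sentence are illustrative stand-ins, not part of the notebook's classes.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    """Split on punctuation and whitespace, dropping empty/whitespace items."""
    return [t.strip() for t in re.split(PATTERN, text) if t.strip()]

def detokenize(tokens):
    """Join with spaces, then remove the space before punctuation marks."""
    joined = " ".join(tokens)
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', joined)

sample = "It's late,  he said."  # note the double space after the comma
tokens = tokenize(sample)

# Build a vocabulary from the sample itself so every token has an ID
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
inverse = {i: tok for tok, i in vocab.items()}

ids = [vocab[t] for t in tokens]
restored = detokenize([inverse[i] for i in ids])

print(tokens)
print(restored)
print(restored == sample)             # the exact string is not recovered
print(tokenize(restored) == tokens)   # but the token sequence is
```

In other words, the round trip preserves the token sequence while losing the original spacing, which is acceptable for this simple tokenizer but worth keeping in mind.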

## **3. Adding Special Context Tokens**

The tokenizers from the previous section raise a `KeyError` when the texts are swapped, because each text contains words that are missing from the other's vocabulary. Special tokens can provide LLMs with additional context and offer a way to handle such cases.

Examples of some special tokens include:
 - `[BOS]` Beginning of sequence (here sequence usually refers to a text sample).
 - `[EOS]` End of sequence.
 - `[PAD]` Padding, which can be used when the batch size is greater than 1 and a batch contains multiple texts of different lengths. Padding the shorter texts ensures that all texts in the batch have equal length.
 - `[UNK]` Unknown, indicates words not contained in the vocabulary. 
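
As a rough illustration of how `[BOS]`, `[EOS]`, and `[PAD]` fit together, the sketch below wraps two token-ID sequences and pads them to a common length. The ID values and the `pad_batch` helper are hypothetical; real tokenizers assign their own IDs to special tokens.

```python
# Hypothetical special-token IDs, chosen only for this example
BOS, EOS, PAD = 0, 1, 2

def pad_batch(sequences, bos=BOS, eos=EOS, pad=PAD):
    """Wrap each sequence in BOS/EOS and right-pad to the longest length."""
    wrapped = [[bos] + list(seq) + [eos] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad] * (max_len - len(seq)) for seq in wrapped]

# Two "texts" of different lengths, already converted to token IDs
batch = pad_batch([[10, 11, 12], [20]])
for row in batch:
    print(row)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.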

In [49]:
# Swapping texts and tokenizers to see what effect this may have
t = SimpleTokenizerV1(vocab_law)

text = "Dang it!"

t.encode(text)

KeyError: 'Dang'

As expected, the odds of Bastiat's texts containing "Dang" are pretty much zero. The `"<|unk|>"` special token can be used as a workaround in such instances. The vocabulary can also be extended to cater to `"<|endoftext|>"` tokens as is the case in GPT-2.

In [50]:
all_tokens = sorted(list(set(law_preproc))) # Extending the law's vocab for now
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
# Shifting to a single vocab object
vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [51]:
len(vocab.items()) # Results in two additional items

3154

In [53]:
for i, item in enumerate(list(vocab.items())[-5:]): 
    print(item)

('—But', 3149)
('—if', 3150)
('—the', 3151)
('<|endoftext|>', 3152)
('<|unk|>', 3153)


In [54]:
# Altering the tokenizer so that it can handle the new tokens
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_2_int = vocab
        self.int_2_str = {i:s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_2_int else
            "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_2_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_2_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [69]:
# Test the modified tokenizer
t = SimpleTokenizerV2(vocab)

text1 = "Matters of property are best left to the experts"
text2 = "Q: What is your pledge? - A: My pledge is eternal service."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Matters of property are best left to the experts <|endoftext|> Q: What is your pledge? - A: My pledge is eternal service.


In [70]:
enc = t.encode(text)
print(enc)

[3153, 2063, 2352, 515, 642, 1811, 2912, 2872, 3153, 3152, 3153, 65, 344, 1741, 3144, 3153, 66, 3153, 67, 65, 3153, 3153, 1741, 3153, 2661, 6]


In [71]:
t.decode(enc)

'<|unk|> of property are best left to the <|unk|> <|endoftext|> <|unk|>: What is your <|unk|>? <|unk|> A: <|unk|> <|unk|> is <|unk|> service.'

As expected, modern vocabulary and 19th century writings don't mix well.
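
To put a number on the mismatch, the share of unknown tokens can be counted directly. The list below is copied from the encoded output above, and `3153` is the ID assigned to `<|unk|>` in the extended vocabulary.

```python
# Token IDs copied from the encoded output above
enc = [3153, 2063, 2352, 515, 642, 1811, 2912, 2872, 3153, 3152, 3153, 65,
       344, 1741, 3144, 3153, 66, 3153, 67, 65, 3153, 3153, 1741, 3153,
       2661, 6]

UNK_ID = 3153  # ID of "<|unk|>" in the extended vocabulary

unk_count = enc.count(UNK_ID)
unk_share = unk_count / len(enc)
print(f"{unk_count} of {len(enc)} tokens are unknown ({unk_share:.0%})")
```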

## **4. BytePair Encoding**

GPT-2 used byte pair encoding (BPE) as its tokenizer. BPE breaks unfamiliar words down into smaller subword units or individual characters, which lets the model handle out-of-vocabulary words without resorting to an unknown token.

The BPE tokenizer used here is from OpenAI's [tiktoken](https://github.com/openai/tiktoken) library.
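
To build some intuition for how subword units arise, here is a toy, character-level sketch of the merge idea behind BPE. It is not tiktoken's implementation (GPT-2's tokenizer works on bytes with a large pre-trained merge table); the helper names `learn_merges`, `apply_merge`, and `segment` and the tiny word list are assumptions made for this example.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # each word as a tuple of chars
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[apply_merge(symbols, best)] += freq
        corpus = merged_corpus
    return merges

def segment(word, merges):
    """Split a word into subword units by replaying the learned merges."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Tiny made-up training list
merges = learn_merges(["law", "laws", "lawful", "lawless", "low"], num_merges=2)
print(merges)
print(segment("lawyer", merges))   # unseen word: learned unit plus characters
print(segment("avarice", merges))  # no learned units apply: falls back to characters
```

Because any word can fall back to single characters, nothing is ever out of vocabulary; frequent character sequences simply get represented by fewer, larger units.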

In [76]:
import importlib.metadata  # import the submodule explicitly rather than relying on an earlier cell
import tiktoken

print(f"tiktoken version: {importlib.metadata.version('tiktoken')}")

tiktoken version: 0.9.0


In [77]:
t = tiktoken.get_encoding("gpt2")

In [78]:
# Reusing the concatenated text from the previous section
integers = t.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[19044, 1010, 286, 3119, 389, 1266, 1364, 284, 262, 6154, 220, 50256, 1195, 25, 1867, 318, 534, 13995, 30, 532, 317, 25, 2011, 13995, 318, 15851, 2139, 13]


In [79]:
strings = t.decode(integers)
print(strings)

Matters of property are best left to the experts <|endoftext|> Q: What is your pledge? - A: My pledge is eternal service.


## **5. Data Sampling With a Sliding Window**

LLMs generate one token at a time, so training data is prepared by sampling with a sliding window, much like autoregressive forecasting. Each text chunk needs an input sequence and a target sequence, where the target is the input window moved forward by one token.

In [82]:
# Let's re-encode text from the law.
enc_txt = t.encode(law)
print(len(law), len(enc_txt))

95988 21939


In [83]:
enc_sample = enc_txt[50:]

In [94]:
# Demonstrating sliding window sampling
context_size = 10

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[790] ----> 1611
[790, 1611] ----> 286
[790, 1611, 286] ----> 1196
[790, 1611, 286, 1196] ----> 283
[790, 1611, 286, 1196, 283] ----> 501
[790, 1611, 286, 1196, 283, 501] ----> 11
[790, 1611, 286, 1196, 283, 501, 11] ----> 2427
[790, 1611, 286, 1196, 283, 501, 11, 2427] ----> 286
[790, 1611, 286, 1196, 283, 501, 11, 2427, 286] ----> 852
[790, 1611, 286, 1196, 283, 501, 11, 2427, 286, 852] ----> 663


In [95]:
# Demonstrating the same on sample text
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(t.decode(context), "---->", t.decode([desired]))

 every ---->  kind
 every kind ---->  of
 every kind of ---->  av
 every kind of av ----> ar
 every kind of avar ----> ice
 every kind of avarice ----> ,
 every kind of avarice, ---->  instead
 every kind of avarice, instead ---->  of
 every kind of avarice, instead of ---->  being
 every kind of avarice, instead of being ---->  its


A simple data loader can be implemented to iterate over the input dataset. This will return the inputs and targets, which are shifted by one.
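
Before bringing in PyTorch, the windowing logic can be sketched with plain lists. The `sliding_windows` helper and the stand-in IDs are hypothetical and only meant to show how `max_length` and `stride` interact.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

ids = list(range(10))  # stand-in token IDs 0..9

results = {}
for stride in (1, 4):
    results[stride] = sliding_windows(ids, max_length=4, stride=stride)
    print(f"stride={stride}: {len(results[stride])} pairs")
    for inputs, targets in results[stride]:
        print("   ", inputs, "---->", targets)
```

With `stride` equal to `max_length`, consecutive inputs do not overlap; smaller strides yield more samples that share tokens with their neighbors.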

In [109]:
# Dataset and dataloader which extracts chunks from the input text data
import torch
from torch.utils.data import DataLoader, Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize input text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equivalent to max_length+1!"

        # Sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i+1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [110]:
def DataLoaderV1(txt, batch_size=4, max_length=256, stride=128,
                shuffle=True, drop_last=True, num_workers=0):

    # Init tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    return dataloader

In [111]:
dataloader = DataLoaderV1(law, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  383,  1099,   583, 13658]]), tensor([[ 1099,   583, 13658,     0]])]


In [112]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 1099,   583, 13658,     0]]), tensor([[  583, 13658,     0,   383]])]


In [113]:
# Creating batched outputs
dataloader = DataLoaderV1(law, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[  383,  1099,   583, 13658],
        [    0,   383,  1099,   960],
        [  392,    11,   287,   663],
        [ 7765,    11,   477,   262],
        [10098,  3386,   286,   262],
        [ 3277,   960,  1169,  1099],
        [   11,   314,   910,    11],
        [  407,   691, 35673,   422]])

Targets:
 tensor([[ 1099,   583, 13658,     0],
        [  383,  1099,   960,   392],
        [   11,   287,   663,  7765],
        [   11,   477,   262, 10098],
        [ 3386,   286,   262,  3277],
        [  960,  1169,  1099,    11],
        [  314,   910,    11,   407],
        [  691, 35673,   422,   663]])


## **6. Creating Token Embeddings**

## **7. Encoding Word Positions**