# **Working with Text Data**

In this chapter, we'll learn how to prepare input text for training LLMs by splitting it into word and subword tokens and encoding them into vector representations. We'll explore advanced tokenization schemes such as byte pair encoding (BPE), which is used in models like GPT. Additionally, we'll implement a sampling and data loading strategy to generate input-output pairs for LLM training.

**2.1 Understanding word embeddings**

Deep neural networks, including LLMs, cannot process raw text directly because text is categorical data and is incompatible with the mathematical operations that neural networks perform. To overcome this, text is represented as continuous-valued vectors through a process called embedding. Embeddings can be generated using specific neural network layers or pretrained models, and the same idea applies to other data types such as video and audio.

**2.2 Tokenizing text**

This section covers how we split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM. These tokens are either individual words or special characters, including punctuation characters.

The text we will tokenize for LLM training is a short story by Edith Wharton called The Verdict.

In [10]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()
# The total number of characters
print("Total number of characters:", len(raw_text))
# The first 99 characters of this file for illustration purposes:
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


Our goal is to tokenize this 20479-character short story into individual words and special characters that we can then turn into embeddings for LLM
training.

We will develop a simple tokenizer using Python's regular expression library re for illustration purposes.

In [11]:
import re
text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [12]:
# Modify the regular expression to split on whitespace (\s), commas, and periods ([,.])
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [13]:
# Remove whitespace characters
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


Removing whitespace reduces memory and computing requirements. However, keeping whitespace can be useful when the exact structure of the text matters, for example in Python code, where indentation and spacing carry meaning. Here, we remove whitespace for simplicity; later, we will switch to a tokenization scheme that keeps it.
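To make this trade-off concrete, the following minimal sketch tokenizes a short Python snippet twice: once dropping whitespace tokens and once keeping them. The sample string `code` and the variable names are illustrative assumptions and are not part of the original notebook.

```python
import re

# Hypothetical two-line Python snippet in which indentation carries meaning
code = "if x:\n    y = 1"

# Split on single whitespace characters; the capture group keeps the separators
parts = re.split(r'(\s)', code)

# Variant 1: drop empty strings and whitespace-only tokens
without_ws = [p for p in parts if p.strip()]

# Variant 2: drop only the empty strings, so whitespace tokens survive
with_ws = [p for p in parts if p != ""]

print(without_ws)
print(with_ws)
# Keeping whitespace makes the tokenization lossless
print("".join(with_ws) == code)
```

With whitespace removed, the newline and the four-space indentation are gone, so the original snippet can no longer be reconstructed from the tokens alone.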

In [14]:
# Modify it a bit further so that it can also handle other types of punctuation
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [15]:
# Apply the tokenizer to Edith Wharton's entire short story
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4649


In [16]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


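The first 30 tokens show that our tokenizer handles this part of the text well, since words and special characters are neatly separated. Note that the pattern used for the full story omits `:` and `;`, unlike the pattern in the previous example, so these two characters can stay attached to a neighboring word (for example, `Carlo;` and `Grindle:` in the vocabulary built in the next section).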

**2.3 Converting tokens into token IDs**

In this section, we will convert these tokens from Python strings to an integer representation to produce the so-called token IDs. This conversion is an intermediate step before converting the token IDs into embedding vectors.

We first have to build a so-called vocabulary. This vocabulary defines how we map each unique word and special character to a unique integer.

In [17]:
# Create a list of all unique tokens and sort them alphabetically
all_words = sorted(set(preprocessed))
# Determine the vocabulary size
vocab_size = len(all_words)
print(vocab_size)

1159


In [18]:
# Create the vocabulary and print its first 52 entries for illustration purposes
vocab = {token:integer for integer,token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
  print(item)
  if i > 50:
    break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindle:', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)
('Her', 51)


Later, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text.
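The reverse direction only requires inverting the vocabulary dictionary. The following minimal sketch shows the idea on a toy vocabulary; the tokens and the names `toy_vocab` and `toy_inverse` are made up for illustration and are not the vocabulary built from the short story.

```python
# Toy vocabulary (hypothetical): sorted unique tokens mapped to integers
toy_tokens = sorted({"a", "is", "test", "this", "."})
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}

# Invert the mapping: token ID -> token string
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

# Encode a token sequence, then decode it again
ids = [toy_vocab[t] for t in ["this", "is", "a", "test", "."]]
tokens = [toy_inverse[i] for i in ids]

print(ids)
print(tokens)
```

Because every token maps to exactly one ID, the inverse dictionary is well defined and the round trip recovers the original tokens.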

In [19]:
# Implement a complete tokenizer class with an encode and decode method
class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self, text):
    preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

In [20]:
# Tokenize a passage from Edith Wharton's short story using SimpleTokenizerV1
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]


In [21]:
# Turn these token IDs back into text using the decode method:
print(tokenizer.decode(ids))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set.

In [22]:
text = "Hello, do you like tea?"
# tokenizer.encode(text)  # uncommenting this line raises KeyError: 'Hello'

Encoding this text fails with a KeyError because the word "Hello" does not appear in The Verdict short story. Hence, it is not contained in the vocabulary.
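The failure mode can be reproduced without the full story. The sketch below uses a small, made-up vocabulary and a hypothetical helper `encode_strict` that performs the same plain dictionary lookup as `SimpleTokenizerV1.encode`.

```python
# Hypothetical toy vocabulary that does not contain "Hello"
toy_vocab = {"do": 0, "like": 1, "tea": 2, "you": 3}

def encode_strict(tokens, vocab):
    """Map tokens to IDs; raises KeyError for any token missing from vocab."""
    return [vocab[t] for t in tokens]

try:
    encode_strict(["Hello", "do", "you", "like", "tea"], toy_vocab)
    missing = None
except KeyError as err:
    # The KeyError carries the token that could not be found
    missing = err.args[0]

print("Missing token:", missing)
```

A plain lookup has no fallback, so a single out-of-vocabulary word aborts the whole encoding step.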

**2.4 Adding special context tokens**

In this section, we will modify the tokenizer to handle unknown words: it will use an <|unk|> token whenever it encounters a word that is not part of the vocabulary. Furthermore, we will add an <|endoftext|> token between unrelated texts.

In [23]:
# Modify the vocabulary to include the two special tokens <|endoftext|> and <|unk|>
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab))

1161


In [24]:
# Print the last 5 entries of the updated vocabulary
for item in list(vocab.items())[-5:]:
  print(item)

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)


In [25]:
# Simple text tokenizer that handles unknown words by replacing them with <|unk|> tokens
class SimpleTokenizerV2:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = { i:s for s,i in vocab.items()}
  def encode(self, text):
    preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    preprocessed = [item if item in self.str_to_int else "<|unk|>"
                    for item in preprocessed]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids
  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

In [26]:
# Use a simple text sample concatenated from two independent and unrelated sentences
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [27]:
# Try the new tokenizer
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]


We can see that the list of token IDs contains 1159 for the <|endoftext|> separator token as well as two 1160 tokens, which are used for unknown words.

In [28]:
print(tokenizer.decode(tokenizer.encode(text)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


Note that the tokenizer used for GPT models does not use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding (BPE) tokenizer, which breaks down words into subword units.

**2.5 Byte pair encoding**

Since implementing BPE can be relatively complicated, we will use an existing open-source Python library called tiktoken.

In [29]:
!pip install tiktoken
# check the version
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0
tiktoken version: 0.7.0


In [30]:
# instantiate the BPE tokenizer from tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

In [31]:
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace "
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 220]


In [32]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace 


Observations:
* The token `<|endoftext|>` is assigned the largest token ID, 50256. The BPE tokenizer used for models like GPT-2, GPT-3, and the original ChatGPT has a total vocabulary size of 50,257.
* The BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace", correctly, because it breaks words that are not in its vocabulary into smaller subword units.
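To give an intuition for how subword units arise, the following sketch implements the core merge loop of byte pair encoding on a tiny, made-up corpus. It is a simplified illustration under stated assumptions: it works on characters rather than bytes, learns only two merges, and uses hypothetical helper names (`most_frequent_pair`, `merge_pair`). It is not how tiktoken is implemented; tiktoken ships a pretrained merge table.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most frequent one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Hypothetical toy corpus; each word starts as a list of single characters
corpus = [list("low"), list("lower"), list("lowest"), list("slow")]

merges = []
for _ in range(2):  # learn two merge rules
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = [merge_pair(word, pair) for word in corpus]

# Apply the learned merges to a word that was not in the corpus
unseen = list("lowly")
for pair in merges:
    unseen = merge_pair(unseen, pair)

print(merges)
print(unseen)
```

The unseen word is still representable: the learned subword covers its frequent prefix, and the remaining characters fall back to single symbols, so no unknown-word token is needed.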

**2.6 Data sampling with a sliding window**

The next step before we can finally create the embeddings for the LLM is to generate the input-target pairs required for training an LLM. In this section, we implement a data loader that fetches the input-target pairs from the training dataset using a sliding window approach.

In [33]:
# Tokenize the entire short story The Verdict using the BPE tokenizer
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [34]:
# Remove the first 50 tokens from the dataset for demonstration purposes
enc_sample = enc_text[50:]

One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1.

In [35]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:     {y}")

x: [290, 4920, 2241, 287]
y:     [4920, 2241, 287, 257]


In [36]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [37]:
# For illustration purposes, let's repeat the previous code but convert the token IDs into text
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


In [38]:
# For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
  def __init__(self, txt, tokenizer, max_length, stride):
    self.tokenizer = tokenizer
    self.input_ids = []
    self.target_ids = []
    # Tokenize the entire text
    token_ids = tokenizer.encode(txt)
    # Slide a window of max_length tokens over the text in steps of stride
    for i in range(0, len(token_ids) - max_length, stride):
      input_chunk = token_ids[i:i + max_length]
      # The targets are the inputs shifted by one position
      target_chunk = token_ids[i + 1: i + max_length + 1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  # Return the total number of input-target pairs
  def __len__(self):
    return len(self.input_ids)

  # Return a single input-target pair
  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]

In [39]:
# Use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True):
  # Initialize the GPT-2 BPE tokenizer
  tokenizer = tiktoken.get_encoding("gpt2")
  # Create the dataset of input-target pairs
  dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
  # drop_last=True drops the final batch if it is smaller than batch_size
  dataloader = DataLoader(
    dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
  return dataloader

In [40]:
# Test the dataloader with a batch size of 1 and a context size of 4
with open("the-verdict.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()
dataloader = create_dataloader_v1(
  raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


The first_batch variable contains two tensors: the first tensor stores the
input token IDs, and the second tensor stores the target token IDs.

In [41]:
# To illustrate the meaning of stride=1
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


We can see that the second batch's token IDs are shifted by one position compared to the first batch.

If the stride is set to 1, we shift the input window by 1 position when creating the
next batch. If we set the stride equal to the input window size, we can prevent overlaps between
the batches.
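The effect of the stride can be checked without PyTorch. The following sketch reuses the same loop bounds as `GPTDatasetV1` on a made-up list of ten stand-in token IDs; the function name `sliding_windows` and the toy data are assumptions for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; targets are the inputs shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

# Stand-in token IDs 0..9 so that positions are easy to read off
toy_ids = list(range(10))

# stride=1: consecutive windows overlap in all but one token
overlapping = sliding_windows(toy_ids, max_length=4, stride=1)

# stride=max_length: consecutive input windows do not overlap
disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

A smaller stride yields more training samples from the same text at the cost of heavy overlap, while a stride equal to the window size uses each token as an input exactly once.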

In [42]:
# Batch size greater than 1
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


**2.7 Creating token embeddings**

The last step for preparing the input text for LLM training is to convert the token IDs into embedding vectors.

Note that we initialize these embedding weights with random values as a preliminary step.

In [43]:
# Instantiate an embedding layer in PyTorch
input_ids = torch.tensor([2, 3, 5, 1])
vocab_size = 6
output_dim = 3
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


In [44]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


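Observation: Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix. For example, token ID 2 selects the row at index 2 of the weight matrix (its third row), which becomes the first row of the output.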
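The lookup can be reproduced in plain Python with the weight values printed above. The sketch also shows a common way to think about an embedding layer, namely as a one-hot vector multiplied by the weight matrix; that equivalence is added here for illustration, and the helper names `one_hot` and `vec_mat` are hypothetical.

```python
# Embedding weights copied from the printed output above (6 tokens x 3 dimensions)
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
input_ids = [2, 3, 5, 1]

# Lookup: pick the row whose index equals the token ID
looked_up = [weights[i] for i in input_ids]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]

# Equivalent formulation: one-hot vector times weight matrix
via_matmul = [vec_mat(one_hot(i, len(weights)), weights) for i in input_ids]

print(looked_up)
print(via_matmul == looked_up)
```

The lookup avoids building the one-hot vectors and performing the multiplication, which is why embedding layers are implemented as an index operation.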

**2.8 Encoding word positions**

The way the previously introduced embedding layer works is that the same token ID always gets mapped to the same vector representation, regardless of where the token ID is positioned in the input sequence.

There are two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings.
* Absolute positional embeddings are directly associated with specific positions in a sequence.
* Relative positional embeddings emphasize the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position."
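The difference between the two categories can be seen in the indices they work with. The sketch below is only a schematic illustration for a sequence of four tokens: absolute schemes index each slot, whereas relative schemes are based on the offset between pairs of positions. The variable names are made up, and concrete relative-position methods differ in how they use these offsets.

```python
seq_len = 4

# Absolute positions: one index per slot in the sequence
absolute_positions = list(range(seq_len))

# Relative positions: signed offset from position i (row) to position j (column)
relative_offsets = [[j - i for j in range(seq_len)] for i in range(seq_len)]

print(absolute_positions)
for row in relative_offsets:
    print(row)
```

Note that the same offset appears along each diagonal of the matrix, which reflects the "how far apart" view: two token pairs with the same distance share the same relative position.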

In [45]:
# We now consider more realistic and useful embedding sizes and encode the input tokens into a 256-dimensional vector representation.
output_dim = 256
vocab_size = 50257
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [46]:
max_length = 4
dataloader = create_dataloader_v1(
raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


As we can see, the token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.

In [47]:
# use the embedding layer to embed these token IDs into 256-dimensional vectors
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


For a GPT model's absolute positional embedding approach, we just need to create another embedding layer that has the same embedding dimension as the token_embedding_layer:

In [48]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [49]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])


The input_embeddings we created are the
embedded input examples that can now be processed by the main LLM
modules.
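The addition of `token_embeddings` (shape 8x4x256) and `pos_embeddings` (shape 4x256) relies on broadcasting: the same positional vectors are added to every sequence in the batch. The following pure-Python sketch mimics this with tiny, made-up shapes and values (2 sequences, 3 positions, 2 dimensions), chosen only for readability.

```python
# Tiny stand-ins: batch of 2 sequences, 3 tokens each, 2-dimensional embeddings
token_emb = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]

# One positional vector per position, shared by every sequence in the batch
pos_emb = [[0.5, 0.25], [0.125, 0.75], [1.0, 2.0]]

# Add the positional vector for position k to the token vector at position k
input_emb = [
    [[t + p for t, p in zip(tok_vec, pos_vec)]
     for tok_vec, pos_vec in zip(seq, pos_emb)]
    for seq in token_emb
]

print(input_emb[0])
print(input_emb[1])
```

The first position of both sequences receives the same offset `[0.5, 0.25]`, which is what the broadcasted addition does across the batch dimension.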