# Chapter 2: Working with Text Data

## Understanding Word Embeddings

- Deep neural network models, including LLMs, cannot process raw text directly. Because text is categorical, it isn't compatible with the mathematical operations used to implement and train neural networks.
- Therefore, we need a way to represent words as continuous-valued vectors.
- **Embedding:** Converting data into a vector format.
- At its core, an embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The primary purpose of embeddings is to convert nonnumeric data into a format that neural networks can process.
- While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents.
- Sentence or paragraph embeddings are popular choices for *retrieval-augmented generation*.
- Retrieval-augmented generation combines generation with retrieval (for example, searching an external knowledge base) to pull in relevant information when generating text.

- Several algorithms and frameworks have been developed to generate word embeddings.
- One of the earlier and most popular examples is the *Word2Vec* approach.
- *Word2Vec:* trains a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa.
- LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand.

## Tokenizing Text

- Split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM
- These tokens are either individual words or special characters, including punctuation characters.

![Tokenizing Text](pic4.png)



In [87]:
# Tokenize practice on a sample text

with open("the-verdict.txt", "r") as file:
    raw_text = file.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99]) 

# Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into 
# embeddings for training a language model.

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


In [88]:
import re

# Use re.split command to split a text on whitespace characters
text = "Hello, my name is Jack.Doe! How's it going?"
tokens = re.split(r'(\s)', text)
print(tokens)

# Split on whitespaces (\s), commas, and periods([,.])
tokens = re.split(r'(\s|[,.])', text)
print(tokens)

# Remove redundant characters
tokens = [item for item in tokens if item.strip()]
print(tokens)

['Hello,', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'Jack.Doe!', ' ', "How's", ' ', 'it', ' ', 'going?']
['Hello', ',', '', ' ', 'my', ' ', 'name', ' ', 'is', ' ', 'Jack', '.', 'Doe!', ' ', "How's", ' ', 'it', ' ', 'going?']
['Hello', ',', 'my', 'name', 'is', 'Jack', '.', 'Doe!', "How's", 'it', 'going?']


In [89]:
preprocessed_tokens = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed_tokens = [item for item in preprocessed_tokens if item.strip()]
print(len(preprocessed_tokens))
print(preprocessed_tokens[:20])

4649
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']
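
The pattern used above does not list `:` or `;`, so those characters stay attached to the preceding word unless another delimiter separates them. The following is a minimal sketch of that effect, assuming a made-up sample sentence and a hypothetical helper named `split_tokens`; the extended pattern is one possible variant, not a required change.

```python
import re

def split_tokens(text, pattern):
    """Split on the capturing pattern and drop empty or whitespace-only pieces."""
    return [piece for piece in re.split(pattern, text) if piece.strip()]

# Pattern used in the cell above
BASE = r'([,.?_!"()\']|--|\s)'
# Assumed variant that also treats ':' and ';' as separate tokens
EXTENDED = r'([,.:;?_!"()\']|--|\s)'

sample = "Monte Carlo; he said: yes."  # made-up sample text

base_tokens = split_tokens(sample, BASE)
extended_tokens = split_tokens(sample, EXTENDED)

print(base_tokens)      # 'Carlo;' and 'said:' stay glued together
print(extended_tokens)  # ';' and ':' become tokens of their own
```

Switching patterns changes both the token count and the vocabulary size, so any token IDs computed with one pattern are not comparable with IDs computed with the other.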


## Converting Tokens into Token IDs
- Convert these tokens from a Python string to an integer representation to produce the token IDs
- This conversion is an intermediate step before converting the token IDs into embedding vectors
- To map the previously generated tokens to token IDs, we first have to build a vocabulary that maps each unique word and special character to a unique integer

![Build Vocabulary By Tokenizing](pic5.png)



In [90]:
allwords = sorted(set(preprocessed_tokens))
vocab_size = len(allwords)
print("Vocabulary size:", vocab_size)

# Create a vocabulary that maps each unique word and special character to a unique integer
vocab = {token: integer for integer, token in enumerate(allwords)}

for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

# Dictionary contains individual tokens associated with unique integer labels
# --> Our next goal is to apply this vocabulary to convert new text into token IDs

Vocabulary size: 1159
('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindle:', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)


![Tokenizing new text](pic6.png)

In [91]:
# When we want to convert the outputs of an LLM from numbers back into text
# We need a way to turn token IDs into text --> create an inverse version of the vocabulary that 
# maps token IDs back to the corresponding text tokens


# Class tokenizer: with encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs
# We will also implement decode to reverse integer-to-string mapping to convert the token IDs back into text

class TokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab # Dictionary mapping tokens to unique integer labels
        self.int_to_str = {i:s for s, i in vocab.items()}

    # Splits text into tokens and carries out string-to-integer mapping to produce token IDs
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    # Carries out reverse integer-to-string mapping to convert the token IDs back into text
    def decode(self,token_ids):
        text = " ".join([self.int_to_str[i] for i in token_ids])
        text = re.sub(r'\s+([,\.?\!"()\'])', r'\1', text)
        return text



![Encode and Decode Pipeline](pic7.png)

In [92]:
tokenizer = TokenizerV1(vocab)
sample_text = """"It's the last he painted, you know,"
                  Mrs. Gisburn said with pardonable pride."""
token_ids = tokenizer.encode(sample_text)
print("Token IDs:", token_ids)
decoded_text = tokenizer.decode(token_ids)
print("Decoded Text:", decoded_text)


# A KeyError occurs if we try to encode text that contains a token not in the vocabulary
# This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs

Token IDs: [1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
Decoded Text: " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
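
To make the out-of-vocabulary failure concrete, here is a small sketch that reuses the same splitting logic with a made-up five-token vocabulary. `ToyTokenizer` and `toy_vocab` are hypothetical names introduced only for this illustration.

```python
import re

class ToyTokenizer:
    """Same encode logic as TokenizerV1, reduced to the part that can fail."""

    def __init__(self, vocab):
        self.str_to_int = vocab

    def encode(self, text):
        pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
        pieces = [p for p in pieces if p.strip()]
        # A plain dictionary lookup raises KeyError for any unseen token
        return [self.str_to_int[p] for p in pieces]

toy_vocab = {"!": 0, ",": 1, "Hello": 2, "tea": 3, "you": 4}  # hypothetical vocabulary
toy = ToyTokenizer(toy_vocab)

known_ids = toy.encode("Hello, you!")  # every token is in the vocabulary
print(known_ids)

try:
    toy.encode("Hello, do you like tea?")  # "do" is not in the vocabulary
    missing = None
except KeyError as err:
    missing = err.args[0]  # the first token that could not be mapped

print("Missing token:", missing)
```

Encoding stops at the first unknown token, so only that token is reported even when several are missing.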


## Adding Special Context Tokens
- We need to modify the tokenizer to handle unknown words
- We also need to address the usage and addition of special context tokens that can enhance a model's understanding of context or other relevant information in the text
- These special tokens can include markers for unknown words and document boundaries.
- Create a new tokenizer that supports two new tokens, <|unk|> and <|endoftext|>


![Taking into account unknown tokens](pic8.png)

- We can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary.
- Furthermore, we add a token between unrelated texts
- Ex: When training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source. This helps the LLM understand that, although these text sources are concatenated for training, they are in fact unrelated

![Separating Documents](pic9.png)
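
As a rough sketch of how unrelated sources can be concatenated, the snippet below joins two made-up strings with the `<|endoftext|>` marker. The helper name `join_documents` and both sample texts are assumptions for illustration.

```python
def join_documents(documents, separator="<|endoftext|>"):
    """Concatenate independent texts, placing the separator token between them."""
    return f" {separator} ".join(documents)

text1 = "Hello, do you like tea?"                # made-up document 1
text2 = "In the sunlit terraces of the palace."  # made-up document 2

combined = join_documents([text1, text2])
print(combined)
```

The separator appears only between documents, so a corpus of `n` documents joined this way contains `n - 1` markers.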


In [94]:
# Modify the vocabulary to include special tokens for unknown words and end of text

# Deduplicate the tokens and sort them to get a unique, ordered list
all_tokens = sorted(set(preprocessed_tokens))
# Append the two special tokens to the end of the list
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
# Rebuild the vocabulary, mapping each token to a unique integer label
vocab = {token: integer for integer, token in enumerate(all_tokens)}

print(len(vocab.items()))

1161


In [95]:
for item in list(vocab.items())[-5:]:
    print(item)


('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)


In [96]:

class TokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab # Dictionary mapping tokens to unique integer labels
        self.int_to_str = {i:s for s, i in vocab.items()}

    # Splits text into tokens and carries out string-to-integer mapping to produce token IDs
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
    
    # Carries out reverse integer-to-string mapping to convert the token IDs back into text
    def decode(self,token_ids):
        text = " ".join([self.int_to_str[i] for i in token_ids])
        text = re.sub(r'\s+([,\.?\!"()\'])', r'\1', text)
        return text



In [97]:
text_with_unknown = """This is a new sentence with unkwn words."""
tokenizer_v2 = TokenizerV2(vocab)
token_ids_v2 = tokenizer_v2.encode(text_with_unknown)
print("Token IDs with unknown token:", token_ids_v2)
decoded_text_v2 = tokenizer_v2.decode(token_ids_v2)
print("Decoded Text with unknown token:", decoded_text_v2)

Token IDs with unknown token: [101, 595, 119, 1160, 1160, 1136, 1160, 1160, 7]
Decoded Text with unknown token: This is a <|unk|> <|unk|> with <|unk|> <|unk|>.


- Depending on the LLM, some researchers also consider additional special tokens such as the following:

    - [BOS]: Beginning of sequence. This token marks the start of a text and signals to the LLM where a piece of content begins
    - [EOS]: End of sequence. This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts
    - [PAD]: Padding. When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch

- The tokenizer used for GPT models does not need any of these tokens; it only uses an <|endoftext|> token for simplicity
- The tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a **byte pair encoding (BPE) tokenizer**, which breaks words down into subword units
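
The padding idea can be sketched in a few lines of plain Python. This is an illustration only: `pad_batch` is a hypothetical helper, the sequences are made up, and reusing the GPT-2 `<|endoftext|>` ID (50256) as the padding value is an assumption.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest sequence in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

PAD_ID = 50256  # assumption: reuse the <|endoftext|> ID as the padding value

batch = [[15496, 11, 466], [15496], [8887, 30]]  # hypothetical token-ID sequences
padded = pad_batch(batch, PAD_ID)

for row in padded:
    print(row)
```

In practice, padded positions are usually masked out so that they do not contribute to the training loss.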

## Byte Pair Encoding

- The BPE tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT

In [98]:
%pip install tiktoken 



In [99]:
# tiktoken is a fast BPE tokenizer used by OpenAI for GPT models

from importlib.metadata import version
import tiktoken

print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.12.0


In [100]:
tokenizer = tiktoken.get_encoding("gpt2")

text = ( "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
         "of someunknownPlace."
)

integer_ids = tokenizer.encode(text,allowed_special={"<|endoftext|>"})
print("Token IDs using tiktoken:", integer_ids)

Token IDs using tiktoken: [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [101]:
strings = tokenizer.decode(integer_ids)
print("Decoded text using tiktoken:", strings)

Decoded text using tiktoken: Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words. As a result, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters.

![Tokenizing Unknown Words](pic10.png)
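
To show how subword units can arise, here is a simplified character-level BPE sketch: it repeatedly merges the most frequent adjacent symbol pair in a tiny made-up corpus. It is not the byte-level implementation that `tiktoken` uses, and the names `learn_bpe_merges`, `apply_merge`, and `segment` are hypothetical.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    corpus = Counter(tuple(word) for word in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(apply_merge(list(symbols), best))] += freq
        corpus = new_corpus
    return merges

def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Made-up training words; "lowest" itself never appears in them
training_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(training_words, num_merges=4)

print(merges)
print(segment("lowest", merges))  # unseen word, built from learned pieces
print(segment("xyz", merges))     # nothing learned applies: single characters
```

A word that was never seen during training still gets a valid segmentation, which is why a BPE tokenizer does not need an `<|unk|>` token.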

## Data Sampling with a Sliding Window
- The next step in creating the embeddings for the LLM is to generate the input-target pairs required for training an LLM
- LLMs are pretrained by predicting the next word in a text

![Data Sampling With a Sliding Window](pic11.png)



Let's implement a data loader that fetches the input-target pairs in figure 2.12 from the training dataset using a sliding window approach. To get started, we will tokenize the whole "The Verdict" short story using the BPE tokenizer

In [102]:
with open("the-verdict.txt", "r") as file:
    raw_text = file.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


In [103]:
# Remove the first 50 tokens from the dataset for demonstration purposes

enc_sample = enc_text[50:]

# x: contains the input tokens
# y: contains the target tokens (i.e., the next token to predict)

context_size = 4 # number of tokens in the input sequence
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [104]:
# By processing the inputs along with the targets, which are the inputs shifted by one position, we can create
# the next-word prediction tasks

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context,"---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [105]:
# Everything left of the arrow refers to the input an LLM would receive
# The token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict

for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


We have now created the input-target pairs that we can use for LLM training. There is only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.

![Implement efficient data loader implementation](pic12.png)

For the efficient data loader implementation, we will use PyTorch's built-in *Dataset* and *DataLoader* classes.

In [76]:
%pip install torch



In [106]:
import torch
from torch.utils.data import Dataset, DataLoader


# Take a long text string, tokenize it, and split it into sliding windows of training samples - each window
# being a sequence of tokens of fixed max_length
# Each sample teaches the model to predict the next token in a sequence

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)  # Tokenizes the entire text

        # Uses a sliding window to chunk the text into overlapping sequences of max_length.
        # The range must be based on the local token_ids, not on a global variable of a similar name.
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]

            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):  # Returns the total number of rows in the dataset
        return len(self.input_ids)

    def __getitem__(self, idx):  # Returns a single row from the dataset
        return self.input_ids[idx], self.target_ids[idx]

In [107]:
# Wraps the dataset inside a PyTorch DataLoader, which handles batching and shuffling,
# to prepare batches of (input, target) pairs efficiently for model training
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,  # Drops the last batch if it is shorter than batch_size
        num_workers=num_workers,
    )
    return dataloader

In [108]:
with open("the-verdict.txt", "r") as file:
    raw_text = file.read()

dataloader = create_dataloader_v1(raw_text,batch_size=1,max_length=4,stride=1, shuffle=False)
data_iter = iter(dataloader) # Converts dataloader into a Python iterator to fetch the next entry via Python's built in next() function
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


![Meaning of stride](pic13.png)
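
The effect of the stride can be checked without PyTorch. The sketch below mirrors the loop bounds of `GPTDatasetV1` on a list of stand-in token IDs; `sliding_windows` is a hypothetical helper written for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted one position right."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

ids = list(range(10))  # stand-in token IDs 0..9

overlapping = sliding_windows(ids, max_length=4, stride=1)  # windows shift by one token
disjoint = sliding_windows(ids, max_length=4, stride=4)     # windows do not overlap

print(len(overlapping), "windows with stride=1")
for inputs, targets in disjoint:
    print(inputs, "---->", targets)
```

A stride equal to `max_length` uses each token as an input at most once, while a stride of 1 produces many more, heavily overlapping samples.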

## Creating Token Embeddings
- The last step in preparing the input text for LLM training is to convert the token IDs into embedding vectors
- As a preliminary step, we must initialize these embedding weights with random values
- This initialization serves as the starting point for the LLM's learning process
- **Embedding:** A way to represent something as a vector of numbers, usually in a high-dimensional space

![Creating Token Embeddings](pic14.png)

A continuous vector representation, or embedding, is necessary since GPT-like LLMs are deep neural networks trained with the backpropagation algorithm.

In [109]:
input_ids = torch.tensor([2,3,4,1])
vocab_size = 6
output_dim = 3

torch.manual_seed(123)  # For reproducibility
embedding_layer = torch.nn.Embedding(vocab_size,output_dim)
print(embedding_layer.weight)    

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


The weight matrix of the embedding layer contains small, random values. These values are then optimized during LLM training as part of the LLM optimization itself.

Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary, and there is one column for each of the three embedding dimensions.

In [110]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


In [111]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


![Weight Matrix of Embedding Layer](pic15.png)
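
An embedding layer is essentially a row lookup, which gives the same result as multiplying a one-hot vector by the weight matrix. The plain-Python sketch below checks this with the weight values printed earlier; `one_hot` and `vector_times_matrix` are hypothetical helpers, and lists stand in for tensors.

```python
# Values copied from the embedding_layer.weight printout (6 tokens x 3 dimensions)
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Row vector (length n) times matrix (n x m), returned as a list of length m."""
    return [
        sum(v * row[col] for v, row in zip(vector, matrix))
        for col in range(len(matrix[0]))
    ]

token_id = 3
lookup = weights[token_id]  # what the embedding layer does: pick a row
product = vector_times_matrix(one_hot(token_id, len(weights)), weights)

print(lookup)
print(product)
```

The lookup avoids building the one-hot vector and multiplying by zeros, which matters once the vocabulary has tens of thousands of entries.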

## Encoding Word Positions
- Each token ID gets turned into a vector using the embedding matrix
    - Every token ID has its own unique vector of numbers
    - But the same token ID always gets the same vector, no matter where it appears in the sentence

**Problem:**
- The model can't tell which word came first and which came last.
- This matters because language depends on word order. Since the token embeddings don't include position, the model has no built-in sense of sequence or structure

![Positional Encodings](pic16.png)

**Solution:**
- Add positional encodings (extra vectors) to each token embedding to tell the model where each token is in the sentence
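
A minimal sketch of the idea, assuming one vector per position that is simply added to the token vector (the absolute-position scheme; other schemes exist). Plain lists with random toy values stand in for embedding layers, and `random_table` and `embed` are hypothetical names.

```python
import random

random.seed(123)  # for reproducibility of the toy values

VOCAB_SIZE, CONTEXT_LENGTH, DIM = 6, 4, 3  # toy sizes, chosen for illustration

def random_table(rows, cols):
    """Stand-in for an embedding weight matrix: rows x cols random values."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(VOCAB_SIZE, DIM)         # one row per token ID
position_table = random_table(CONTEXT_LENGTH, DIM)  # one row per position

def embed(token_ids):
    """Add the vector for each position to the vector of the token found there."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

vectors = embed([2, 3, 2, 1])  # token ID 2 appears at positions 0 and 2

print(vectors[0])  # token 2 at position 0
print(vectors[2])  # token 2 at position 2: same token, different final vector
```

Because the position vector differs per slot, two occurrences of the same token no longer share an identical input vector.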

In [113]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size,output_dim)

max_length = 4
dataloader = create_dataloader_v1(raw_text,batch_size=8,max_length=max_length,stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

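StopIteration: 

A `StopIteration` raised by `next(data_iter)` means the data loader yielded no batches. With `drop_last=True`, this happens when the dataset holds fewer than `batch_size` samples, for example when the windows were built from a token list much shorter than the full story. With the full 5,145-token story, the first batch has shape `torch.Size([8, 4])`.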