<a href="https://colab.research.google.com/github/athahibatullah/llm-from-scratch/blob/main/02%20-%20Working%20with%20Text%20Data/ch02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **2.1 Understanding Word Embeddings**

Deep neural networks, including LLMs, cannot process raw text directly. Text is categorical, which means it isn't compatible with the mathematical operations used to implement and train neural networks. Therefore, we need a way to represent words as continuous-valued vectors. Converting data into a vector format is often referred to as embedding. Different data formats require distinct embedding models. For example, an embedding model designed for text would not be suitable for embedding audio or video data.
<img src="https://drive.google.com/uc?id=1Uu1CZJM8mGLImwADawl9CCB0xCicfocF">

An embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The primary purpose of embeddings is to convert nonnumeric data into a format that neural networks can process. Text embedding is not limited to word embeddings. There are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for Retrieval-Augmented Generation (RAG), which combines generation (like producing text) with retrieval (like searching an external knowledge base) to pull in relevant information when generating text. RAG is outside the scope of building an LLM from scratch, so we will focus on word embeddings.

One of the earlier and most popular word embedding algorithms is Word2Vec. Word2Vec trains a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa. The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings. Consequently, when projected into two-dimensional word embeddings for visualization purposes, similar terms are clustered together. Word embeddings can have varying dimensions, from one to thousands. A higher dimensionality might capture more nuanced relationships, but at the cost of computational efficiency.

<img src="https://drive.google.com/uc?id=16PcjL0A-UuN9O3CxZ4cjgGh2DtjEZ_nl">
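To make the idea of "similar terms are clustered together" concrete, here is a minimal sketch that compares embedding vectors by cosine similarity. The two-dimensional vectors in toy_embeddings are hypothetical values picked by hand for illustration. They do not come from a trained Word2Vec model.

```python
import math

# Hypothetical 2-D embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "eagle":   [0.90, 0.80],
    "duck":    [0.85, 0.75],
    "germany": [-0.70, 0.60],
    "berlin":  [-0.65, 0.70],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words from the same cluster score close to 1, unrelated words score much lower.
print(cosine_similarity(toy_embeddings["eagle"], toy_embeddings["duck"]))
print(cosine_similarity(toy_embeddings["eagle"], toy_embeddings["germany"]))
```

In a trained model the vectors are learned from data instead of being picked by hand, but the comparison works the same way.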

While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand.

High-dimensional embeddings present a challenge for visualization because humans can only perceive three or fewer dimensions. When working with LLMs, we typically use embeddings with a much higher dimensionality. The smallest GPT-2 models (117M and 125M parameters) use an embedding size of 768 dimensions, and the largest GPT-3 model (175B parameters) uses an embedding size of 12,288 dimensions.

# **2.2 Tokenizing Text**

Let's first split the input text into individual tokens, a required preprocessing step for creating embeddings for an LLM.
<img src="https://drive.google.com/uc?id=1rkJu1MKBaQbVSEETK07rxCulowe1COZQ">

The text we will tokenize for LLM training is "The Verdict", a short story by Edith Wharton. The text is available on Wikisource: https://en.wikisource.org/wiki/The_Verdict.

First, we load the text file the-verdict.txt, count the total number of characters, and print the first 99 characters of the file.

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)

In [23]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


We can start splitting the text by using a short sentence first to make sure our splitting function works as intended.

In [24]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [25]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [26]:
# Strip whitespace from each item and then filter out any empty strings.
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


In [27]:
# Handle punctuations
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


Now that the splitting logic works as intended, we can apply it to the entire document and print the first 30 tokens.

Note that we refrain from making all text lower-case because capitalization helps LLMs distinguish between proper nouns and common nouns, understand sentence structure, and learn to generate text with proper capitalization.

When developing a simple tokenizer, whether we should encode whitespace as separate characters or just remove it depends on our application and its requirements. Removing whitespace reduces the memory and computing requirements. However, keeping whitespace can be useful if we train models that are sensitive to the exact structure of the text, for example Python code, which is sensitive to indentation and spacing. In this example, we remove whitespace for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespace.
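The difference between the two choices can be sketched with the same splitting pattern. The variable names and the sample string below are made up for illustration. The second variant keeps whitespace tokens, so the original text (including indentation) can be reconstructed exactly.

```python
import re

# Hypothetical sample: a tiny piece of Python code where indentation matters.
sample = "if x:\n    return 1"

# Same splitting pattern as used in this section.
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Variant 1: drop whitespace tokens (the approach used in this chapter).
without_ws = [p.strip() for p in pieces if p.strip()]

# Variant 2: keep whitespace tokens and drop only the empty strings.
with_ws = [p for p in pieces if p != ""]

print(without_ws)                   # ['if', 'x', ':', 'return', '1']
print(with_ws)
print("".join(with_ws) == sample)   # True: nothing was lost
```

The price of the second variant is a longer token sequence, since every space and newline becomes its own token.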

In [28]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [29]:
print(len(preprocessed))

4690


# **2.3 Converting Tokens into Token IDs**

After extracting tokens from the text, we convert them into an integer representation to produce the token IDs. To map the previously generated tokens to token IDs, we first have to build a vocabulary. This vocabulary defines how we map each unique word and special character to a unique integer.
<img src="https://drive.google.com/uc?id=1DgFbB-9X8qR9s_M13UUyuIhh8kyB2yPg">

The vocabulary contains only unique tokens: a word or punctuation mark that occurs several times in the text is counted once.

In [30]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1130


In [31]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [32]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


The set() function returns the unique tokens (as a set, not a list), and sorted() turns them into a sorted list. Above, we print only the first 51 entries of the vocabulary.
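The same three steps (deduplicate, sort, enumerate) can be followed on a toy token list that is small enough to check by hand. The list toy_tokens is a made-up example, not taken from "The Verdict".

```python
# Hypothetical token list with one duplicate ("the").
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

unique_tokens = set(toy_tokens)         # a set: duplicates removed, no guaranteed order
ordered_tokens = sorted(unique_tokens)  # a list, sorted by Unicode code point
toy_vocab = {token: idx for idx, token in enumerate(ordered_tokens)}

print(toy_vocab)
# {'.': 0, 'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
```

Because sorted() orders strings by code point, punctuation and capitalized words come before lowercase words, which matches the vocabulary listing above.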

Next, our goal is to apply this vocabulary to convert our text into token IDs.

<img src="https://drive.google.com/uc?id=15xDpA-gvfAAghdZA06hLV6wGeLnG0gKr">


In [33]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab # Stores the vocabulary as a class attribute for access in the encode and decode methods
        self.int_to_str = {i:s for s,i in vocab.items()} # Creates an inverse vocabulary that maps
                                                         # token IDs back to the original text tokens

    def encode(self, text): # Processes input text into token IDs
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids): # Converts token IDs back into text
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

The class stores the vocabulary, creates an inverse vocabulary that maps token IDs back to the original text tokens, and provides two methods: encode, which processes input text into token IDs, and decode, which converts token IDs back into text.

In [34]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


The encode method works as expected. Next, let's decode the token IDs to get our text back.

In [35]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [36]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

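The tokenizer successfully decoded the token IDs back into human-readable text. Note that the round trip is not exact: the decoded string has extra spaces after the opening quotation mark and after the apostrophe, because decode joins all tokens with spaces and only removes the space in front of certain punctuation marks.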
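To see what the cleanup step in decode does, here is a minimal sketch that applies the same regular expression to two hand-written strings. The input strings are assumptions made for illustration. They mimic what " ".join produces from a token list.

```python
import re

# What " ".join(tokens) produces: every token is separated by a space.
joined = 'Hello , world . Is this a test ?'

# Same substitution as in SimpleTokenizerV1.decode: drop whitespace
# that sits directly in front of one of the listed punctuation marks.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)
print(cleaned)  # Hello, world. Is this a test?

# Only the space BEFORE the apostrophe is removed, which is why the
# decoded text above contains "It' s" instead of "It's".
print(re.sub(r'\s+([,.?!"()\'])', r'\1', "It ' s"))  # It' s
```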

# **2.4 Adding Special Context Tokens**

We need to modify the tokenizer to handle unknown words. We also need to address the usage and addition of special context tokens that can enhance a model's understanding of context or other relevant information in the text. These special tokens can include markers for unknown words and document boundaries. Here, we modify the vocabulary and add two special tokens: <|unk|> for unknown words (words that don't exist in the current vocabulary) and <|endoftext|> for the end of a text. The end-of-text token separates unrelated text sources. LLMs are trained on multiple books, articles, or documents, and the token marks where one source ends and the next begins. It can also separate different contexts within the same document.

<img src="https://drive.google.com/uc?id=1QVadUQfaqZQm79uk32iS1o4wLT_T6zO6" >

<img src="https://drive.google.com/uc?id=1RHETuKRD5Zr89fH7nHP3lgS-QY77SrYG" >

To better understand the problem, let's encode a new text containing a word that doesn't exist in the current vocabulary:

In [64]:
tokenizer = SimpleTokenizerV1(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)

KeyError: 'Hello'

This raises a KeyError because the token "Hello" isn't in the vocabulary. Now let's modify the code and add <|unk|> and <|endoftext|>.

In [38]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [39]:
len(vocab.items())

1132

In [40]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


We have successfully added the two new tokens to the vocabulary. Next, we need to modify the tokenizer's encode and decode methods to make use of them.



In [41]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [ # Replaces unknown words by <|unk|>
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

Now let's test our modified function.

In [42]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [43]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [44]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

The unknown words are now replaced by <|unk|>, and the <|endoftext|> token is preserved as a separator between the two texts.

Depending on the LLM, some researchers also consider additional special tokens such as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] indicates where one ends and the next begins.

[PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch.

The tokenizer used for GPT models does not need any of these tokens. It uses only <|endoftext|> for simplicity. <|endoftext|> is analogous to [EOS] and can also be used for padding. As later chapters show, when training on batched inputs we typically use a mask, meaning we don't attend to padded tokens, so the specific token chosen for padding matters little. Moreover, the tokenizer used for GPT models also doesn't use <|unk|> for unknown words. Instead, it uses a byte pair encoding tokenizer, which breaks words down into subword units, as we will discuss next.
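The padding and masking idea can be sketched in plain Python. The function name pad_batch and the toy batch are assumptions made for this illustration. The ID 50256 is the one the GPT-2 tokenizer assigns to <|endoftext|>, as shown in the next section.

```python
PAD_ID = 50256  # ID of <|endoftext|> in the GPT-2 tokenizer, reused here as padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-ID lists to a common length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask

# Hypothetical batch with two texts of different lengths.
batch = [[15496, 11, 466], [554, 262]]
padded, mask = pad_batch(batch)
print(padded)  # [[15496, 11, 466], [554, 262, 50256]]
print(mask)    # [[1, 1, 1], [1, 1, 0]]
```

Because the mask is built from the original lengths, positions marked 0 can be ignored during training regardless of which token ID fills them.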

# **2.5 Byte Pair Encoding**

Byte pair encoding (BPE) was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT. We will use a library called tiktoken, which implements the BPE algorithm very efficiently based on source code in Rust.

Let's install the tiktoken first.

In [45]:
pip install tiktoken



In [46]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [47]:
tokenizer = tiktoken.get_encoding("gpt2")

With tiktoken installed and the tokenizer instantiated, let's encode and decode text like before, this time with tiktoken.

In [48]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [49]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


We can make two noteworthy observations based on the token IDs and decoded text. First, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256. In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50257, with <|endoftext|> being assigned the largest token ID.

Second, the BPE tokenizer encodes and decodes unknown words, such as someunknownPlace, correctly. The BPE tokenizer can handle any unknown word without using an <|unk|> token. The BPE algorithm breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.

<img src="https://drive.google.com/uc?id=1e4ejtAvUhcvMAgzn0tlFXjcVGZ0wvMWs">

The ability to break down unknown words into individual characters ensures that the tokenizer and, consequently, the LLM that is trained with it can process any text, even if it contains words that were not present in its training data.
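One way to see why no text is ever "unknown" is to look at the lowest level of the fallback. To our understanding, the GPT-2 tokenizer runs BPE over UTF-8 bytes rather than over characters, so the smallest possible units are the 256 byte values. The sketch below shows only this byte-level view. It does not reproduce tiktoken's actual token IDs.

```python
word = "someunknownPlace"

# Every string can be written as a sequence of UTF-8 bytes (values 0..255),
# so a vocabulary that contains all 256 byte values can represent any text.
byte_values = list(word.encode("utf-8"))
print(byte_values)
print(all(0 <= b <= 255 for b in byte_values))  # True

# The mapping is lossless: decoding the bytes gives back the original string.
print(bytes(byte_values).decode("utf-8"))       # someunknownPlace
```

In practice the tokenizer rarely needs to go all the way down to single bytes, since frequent byte sequences have been merged into larger subword tokens.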

BPE builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words. For example, BPE starts by adding all individual single characters to its vocabulary ("a", "b", etc.). Next, it merges character combinations that frequently occur together into subwords. For example, "d" and "e" may be merged into the subword "de", which is common in many English words such as "define", "decode", "depend", "made", and "hidden".
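A single merge step of this procedure can be sketched as follows. This is a simplified, character-level illustration under our own assumptions (a four-word corpus, every word counted once, helper names invented here). It is not the tokenizer training code used for GPT-2.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pair_counts = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += 1
    return pair_counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words

# Start from single characters, as described above.
corpus = ["define", "decode", "depend", "made"]
words = [list(w) for w in corpus]

pair = most_frequent_pair(words)
words = merge_pair(words, pair)
print(pair)   # ('d', 'e')
print(words)  # [['de', 'f', 'i', 'n', 'e'], ['de', 'c', 'o', 'de'], ...]
```

Repeating these two calls in a loop, and recording each merged pair, would grow the vocabulary one subword at a time until a chosen vocabulary size is reached.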

## **Exercise**

In [50]:
text = "Akwirw ier"

integers = tokenizer.encode(text)

print(integers)

[33901, 86, 343, 86, 220, 959]


In [51]:
strings = tokenizer.decode(integers)
print(strings)

Akwirw ier


# **2.6 Data Sampling with a Sliding Window**

The next step in creating the embeddings for the LLM is to generate the input-target pairs required for training an LLM. This is what input-target pairs look like:

<img src="https://drive.google.com/uc?id=1_AxTlzcmcs7jifcLw9H0t0njgOYV4nHR">

Now let's implement a data loader that fetches the input-target pairs in the figure above from the training dataset using a sliding window approach. To get started, we will tokenize the whole "The Verdict" short story using the BPE tokenizer.



In [52]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145


The dataset contains 5145 tokens in total. Next, we remove the first 50 tokens from the dataset for demonstration purposes, as it results in a slightly more interesting text passage in the next steps.

In [53]:
enc_sample = enc_text[50:]

To better understand input-target pairs, we create two variables, x and y. x holds the input tokens taken from the dataset, and y holds the targets, which are the same tokens shifted by one position: each target is the token that follows the corresponding input token.

In [54]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


From this, we can create the next-word prediction tasks shown in the earlier figure:

In [55]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [56]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


We have now printed the next-word prediction tasks in both encoded and decoded form. These are the input-target pairs that we can use for LLM training.

There's only one more task before we can turn the tokens into embeddings: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays. We will return two tensors:

*   Input tensor: containing the text that the LLM sees.
*   Target tensor: includes the targets for the LLM to predict.

The figure below illustrates what our data loader produces. The tokens are shown in string format for illustration purposes. The code implementation will operate on token IDs directly since the encode method of the BPE tokenizer performs both tokenization and conversion into token IDs as a single step.

<img src="https://drive.google.com/uc?id=1mquEyLp1QTapnPumuOUGDU7fUhfuZZe7">

In [57]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.6.0+cu124


In [58]:
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

The GPTDatasetV1 class is based on the PyTorch Dataset class and defines how individual rows are fetched from the dataset, where each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor. The target_chunk tensor contains the corresponding targets.
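How many rows the dataset ends up with follows directly from the range() call in the sliding-window loop. The helper num_samples below is a name introduced here for illustration. It mirrors that loop without tokenizing anything, and it assumes the 5145-token count reported earlier for "The Verdict".

```python
def num_samples(num_tokens, max_length, stride):
    """Number of windows produced by range(0, num_tokens - max_length, stride)."""
    return len(range(0, num_tokens - max_length, stride))

n_tokens = 5145  # token count of "The Verdict" reported earlier

print(num_samples(n_tokens, max_length=4, stride=1))      # 5141
print(num_samples(n_tokens, max_length=4, stride=4))      # 1286
print(num_samples(n_tokens, max_length=256, stride=128))  # 39
```

A smaller stride yields more (but heavily overlapping) rows, while a larger max_length yields fewer, longer rows.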

In [59]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2") # initializes the tokenizer

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) # creates dataset

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last, # drop_last=True drops the last batch if it is shorter
                             # than the specified batch_size to prevent loss spikes during training
        num_workers=num_workers # The number of CPU processes to use for preprocessing
    )

    return dataloader

Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4 to develop an intuition of how the GPTDatasetV1 class and the create_dataloader_v1 function from the previous code work together:

In [60]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

In [61]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader) # Converts dataloader into a Python iterator to fetch the next
                             # entry via Python's built-in next() function
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs. Since max_length is set to 4, each of the two tensors contains four token IDs. An input size of 4 is quite small and chosen only for simplicity. Input sizes of at least 256 are more common. To understand the meaning of stride=1, let's fetch another batch:

In [62]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


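As you can see, the token IDs in both tensors are shifted by one position compared to the first batch. This is what the stride setting controls: the number of positions the input window moves between consecutive samples.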

<img src="https://drive.google.com/uc?id=1KIXtW1p_MBpkKAt2h54l7OuuGHXyuLYG">

A batch size of 1, as we used in the data loader above, is useful for illustration purposes. In practice, small batch sizes require less memory during training but lead to noisier model updates. Just like in regular deep learning, the batch size is a tradeoff and a hyperparameter to experiment with when training LLMs. Now let's increase both the batch size and the stride. We set the stride to 4, equal to max_length, so that the dataset is used fully (not a single token is skipped) while the inputs of consecutive samples do not overlap, since more overlap could lead to increased overfitting.

In [63]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
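The effect of the stride on overlap is easier to see with stand-in token IDs. The following sketch repeats the windowing logic of GPTDatasetV1 on a plain Python list. The function name sliding_windows and the IDs 0 to 9 are assumptions made for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; targets are inputs shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

tokens = list(range(10))  # stand-in token IDs 0..9

for stride in (1, 4):
    inputs = [inp for inp, _ in sliding_windows(tokens, max_length=4, stride=stride)]
    print(f"stride={stride}: {inputs}")

# stride=1: consecutive inputs share three of their four tokens
# stride=4: [[0, 1, 2, 3], [4, 5, 6, 7]], no token appears in two inputs
```

With stride equal to max_length, each token appears in exactly one input window, which is the setting used in the cell above.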


## **Exercise**

In [67]:
dataloaderExercise1 = create_dataloader_v1(
    raw_text, batch_size=8, max_length=2, stride=2, shuffle=False
)

data_iter = iter(dataloaderExercise1)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

dataloaderExercise2 = create_dataloader_v1(
    raw_text, batch_size=8, max_length=8, stride=2, shuffle=False
)

data_iter = iter(dataloaderExercise2)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367],
        [ 2885,  1464],
        [ 1807,  3619],
        [  402,   271],
        [10899,  2138],
        [  257,  7026],
        [15632,   438],
        [ 2016,   257]])

Targets:
 tensor([[  367,  2885],
        [ 1464,  1807],
        [ 3619,   402],
        [  271, 10899],
        [ 2138,   257],
        [ 7026, 15632],
        [  438,  2016],
        [  257,   922]])
Inputs:
 tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],
        [ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138],
        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],
        [  402,   271, 10899,  2138,   257,  7026, 15632,   438],
        [10899,  2138,   257,  7026, 15632,   438,  2016,   257],
        [  257,  7026, 15632,   438,  2016,   257,   922,  5891],
        [15632,   438,  2016,   257,   922,  5891,  1576,   438],
        [ 2016,   257,   922,  5891,  1576,   438,   568,   340]])

Targets:
 tensor([[  367,  2885,  1464,  1807,  3619,  

# **2.7 Creating Token Embeddings**

The last step in preparing the input text for LLM training is to convert the token IDs into embedding vectors.

<img src="https://drive.google.com/uc?id=1qMReW8VUmju0kNltrTG9NNUXXFMKevcT" >

As a preliminary step, we must initialize these embedding weights with random values. This initialization serves as the starting point for the LLM's learning process. In chapter 5, we will optimize the embedding weights as part of the LLM training.

A continuous vector representation, or embedding, is necessary since GPT-like LLMs are deep neural networks trained with the backpropagation algorithm.

Let's see in code how the conversion from token ID to embedding vector works. We first create four input tokens with IDs 2, 3, 5, and 1:

In [68]:
input_ids = torch.tensor([2, 3, 5, 1])

For simplicity, we use a small vocabulary of only 6 tokens (instead of the 50,257 tokens in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions). Using vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility:

In [69]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

The following statement prints the embedding layer's underlying weight matrix:

In [70]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


The weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training as part of the LLM optimization itself. The weight matrix has six rows and three columns: one row for each of the six possible tokens in the vocabulary, and one column for each of the three embedding dimensions.

Now let's apply it to a token ID to obtain the embedding vector:

In [71]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the fourth row (indexing starts at zero, so token ID 3 corresponds to the fourth row). In other words, the call above simply returns the fourth row of the embedding layer's weight matrix.

We've seen how to convert a single token ID into a three-dimensional embedding vector. Let's now apply that to all four input IDs stored in input_ids:

In [72]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


The output contains one row of the embedding layer's weight matrix for each of the input_ids defined earlier, [2, 3, 5, 1], in that order.

Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix, as illustrated in the following figure.

<img src="https://drive.google.com/uc?id=1j1rYRuyZbPmJF9NWxG8WpPEx6oBIx-7_">
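The lookup can be mirrored without PyTorch to make clear that nothing more than row selection is involved. The sketch below copies the weight values printed earlier into a plain Python list. The function names are invented for this illustration. It also shows that selecting a row gives the same result as multiplying a one-hot vector with the weight matrix, which is why an embedding layer can be viewed as a more efficient form of that multiplication.

```python
# Weight values copied from the printed embedding matrix above (6 tokens x 3 dims).
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def lookup(token_id):
    """Embedding lookup: pick the row whose index equals the token ID."""
    return weights[token_id]

def one_hot_matmul(token_id):
    """Same result via a one-hot vector multiplied with the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
    return [sum(one_hot[r] * weights[r][c] for r in range(len(weights)))
            for c in range(len(weights[0]))]

for token_id in [2, 3, 5, 1]:
    print(token_id, lookup(token_id))

print(one_hot_matmul(3))  # matches lookup(3), the fourth row
```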

Now that we have created embedding vectors from token IDs, we'll add a small modification to these embedding vectors to encode positional information about a token within a text.

# **2.8 Encoding Word Positions**

In principle, token embeddings are a suitable input for an LLM. However, a minor shortcoming of LLMs is that their self-attention mechanism (chapter 3) doesn't have a notion of position or order for the tokens within a sequence. The way the previously introduced embedding layer works is that the same token ID always gets mapped to the same vector representation, regardless of where the token ID is positioned in the input sequence.

<img src="https://drive.google.com/uc?id=12yVZl4uWOPCY1wERToHm7MNGAE4OEjVh">

In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes. However, since the self-attention mechanism of LLMs is itself also position-agnostic, it is helpful to inject additional position information into the LLM.

To achieve this, we can use two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings. Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. For instance, the first token will have a specific positional embedding, the second token another distinct embedding, and so on, as illustrated below:

<img src="https://drive.google.com/uc?id=1Ahl95RtFbMQw3AUvpZEJb-3Q4LVT0C2q">
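The addition of absolute positional embeddings can be sketched with small hand-made vectors. All values and names below (token_embedding, position_embedding, embed) are hypothetical and chosen only so the result is easy to verify. In a GPT-like model both tables would be learned.

```python
# Hypothetical 3-dimensional vectors, chosen by hand for illustration.
token_embedding = {"fox": [1.0, 1.0, 1.0], "jumps": [2.0, 2.0, 2.0]}
position_embedding = [
    [0.1, 0.1, 0.1],  # position 0
    [0.2, 0.2, 0.2],  # position 1
    [0.3, 0.3, 0.3],  # position 2
]

def embed(tokens):
    """Add the positional vector for each slot to the token's own vector."""
    return [
        [t + p for t, p in zip(token_embedding[tok], position_embedding[pos])]
        for pos, tok in enumerate(tokens)
    ]

# "fox" appears twice but ends up with two different input vectors,
# because a different positional vector is added at each position.
print(embed(["fox", "jumps", "fox"]))
```

Without the positional term, both occurrences of "fox" would be represented by the same vector.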

Instead of focusing on the absolute position of a token, relative positional embeddings emphasize the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position". The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn't seen such lengths during training.

Both types of positional embeddings aim to augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more accurate and context-aware predictions. The choice between them often depends on the specific application and the nature of the data being processed.

OpenAI's GPT models use absolute positional embeddings that are optimized during the training process rather than being fixed or predefined like the positional encodings in the original transformer model. This optimization process is part of the model training itself. Now, let's create the initial positional embeddings to create the LLM inputs.
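For contrast, here is a minimal sketch of the two options mentioned above: a fixed sine/cosine encoding in the style of the original transformer, and a learned embedding layer as used in this chapter. The helper name `sinusoidal_positions` is hypothetical, and the sketch assumes an even embedding dimension.

```python
import math
import torch

def sinusoidal_positions(context_length, dim):
    """Fixed sine/cosine positional encodings (dim is assumed to be even)."""
    # Column vector of positions: shape (context_length, 1)
    positions = torch.arange(context_length, dtype=torch.float32).unsqueeze(1)
    # One frequency per pair of dimensions, decreasing geometrically
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    encodings = torch.zeros(context_length, dim)
    encodings[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    encodings[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return encodings

fixed = sinusoidal_positions(context_length=4, dim=256)
learned = torch.nn.Embedding(4, 256)

print(fixed.shape)
print(fixed.requires_grad)           # fixed values, not trained
print(learned.weight.requires_grad)  # trainable parameters
```

The fixed tensor is computed once from a formula and never updated, whereas the weights of the embedding layer are parameters that the optimizer adjusts during training.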


Previously, we focused on very small embedding sizes for simplicity. Now, let's consider more realistic and useful embedding sizes and encode the input tokens into a 256-dimensional vector representation. For the vocabulary size, we will use the vocabulary size of the BPE tokenizer from earlier (50,257 tokens).

In [73]:
vocab_size = 50257 # Vocabulary size of the BPE tokenizer from earlier
output_dim = 256 # 256-dimensional embeddings instead of just 3

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

Using the token_embedding_layer above and sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8x4x256 tensor.

In [74]:
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [75]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


As we can see, the token ID tensor is 8x4 dimensional, meaning that the data batch consists of eight text samples with four tokens each.

Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors:

In [80]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

# Print the embeddings to see what they look like
print(token_embeddings)

torch.Size([8, 4, 256])
tensor([[[ 0.4913,  1.1239,  1.4588,  ..., -0.3995, -1.8735, -0.1445],
         [ 0.4481,  0.2536, -0.2655,  ...,  0.4997, -1.1991, -1.1844],
         [-0.2507, -0.0546,  0.6687,  ...,  0.9618,  2.3737, -0.0528],
         [ 0.9457,  0.8657,  1.6191,  ..., -0.4544, -0.7460,  0.3483]],

        [[ 1.5460,  1.7368, -0.7848,  ..., -0.1004,  0.8584, -0.3421],
         [-1.8622, -0.1914, -0.3812,  ...,  1.1220, -0.3496,  0.6091],
         [ 1.9847, -0.6483, -0.1415,  ..., -0.3841, -0.9355,  1.4478],
         [ 0.9647,  1.2974, -1.6207,  ...,  1.1463,  1.5797,  0.3969]],

        [[-0.7713,  0.6572,  0.1663,  ..., -0.8044,  0.0542,  0.7426],
         [ 0.8046,  0.5047,  1.2922,  ...,  1.4648,  0.4097,  0.3205],
         [ 0.0795, -1.7636,  0.5750,  ...,  2.1823,  1.8231, -0.3635],
         [ 0.4267, -0.0647,  0.5686,  ..., -0.5209,  1.3065,  0.8473]],

        ...,

        [[-1.6156,  0.9610, -2.6437,  ..., -0.9645,  1.0888,  1.6383],
         [-0.3985, -0.9235, -1.31

As the output shows, each token ID is now embedded as a 256-dimensional vector.

For a GPT model's absolute positional embedding approach, we just need to create another embedding layer that has the same embedding dimension as the token_embedding_layer.

In [81]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Print the embedding layer weights to see what they look like
print(pos_embedding_layer.weight)

Parameter containing:
tensor([[-0.8194,  0.5543, -0.8290,  ...,  0.1325,  0.2115,  0.3610],
        [ 0.4193, -0.9461, -0.3407,  ...,  0.7930,  1.7009,  0.5663],
        [-0.2362, -1.7187, -1.0489,  ...,  1.1218,  0.2796,  0.9912],
        [-0.9549,  0.4699,  0.2580,  ..., -1.3689,  1.6505,  1.3488]],
       requires_grad=True)


The input to the pos_embedding_layer is usually a placeholder vector torch.arange(context_length), which contains a sequence of numbers 0, 1, ..., up to the maximum input length - 1. The context_length is a variable that represents the supported input size of the LLM. Here, we set it equal to the maximum length of the input text (max_length). In practice, input text can be longer than the supported context length, in which case we have to truncate the text.
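The truncation mentioned above could look like the following sketch. The token IDs in `long_token_ids` are hypothetical, and keeping the first `context_length` tokens is just one possible choice; other strategies, such as keeping the most recent tokens, are equally valid.

```python
import torch

context_length = 4  # same value as max_length in this section

# Hypothetical token IDs for a text that is longer than the context length
long_token_ids = torch.tensor([40, 367, 2885, 1464, 1807, 3619])

# Keep only the first context_length tokens
truncated_ids = long_token_ids[:context_length]

# Position IDs 0, 1, ..., len - 1 for the tokens that remain
position_ids = torch.arange(truncated_ids.shape[0])

print(truncated_ids)
print(position_ids)
```

Truncation is needed because the positional embedding layer only has `context_length` rows, so there is no positional vector to look up for any position beyond that.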

In [82]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

# Print the positional embeddings to see what they look like
print(pos_embeddings)

torch.Size([4, 256])
tensor([[-0.8194,  0.5543, -0.8290,  ...,  0.1325,  0.2115,  0.3610],
        [ 0.4193, -0.9461, -0.3407,  ...,  0.7930,  1.7009,  0.5663],
        [-0.2362, -1.7187, -1.0489,  ...,  1.1218,  0.2796,  0.9912],
        [-0.9549,  0.4699,  0.2580,  ..., -1.3689,  1.6505,  1.3488]],
       grad_fn=<EmbeddingBackward0>)


As we can see, the positional embedding tensor consists of four 256-dimensional vectors. We can now add these directly to the token embeddings, where PyTorch will broadcast the 4x256-dimensional pos_embeddings tensor and add it to the 4x256-dimensional token embedding tensor of each of the eight samples in the batch:

In [83]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

# Print the input embeddings to see what they look like
print(input_embeddings)

torch.Size([8, 4, 256])
tensor([[[-0.3281,  1.6782,  0.6298,  ..., -0.2670, -1.6620,  0.2165],
         [ 0.8674, -0.6925, -0.6063,  ...,  1.2927,  0.5018, -0.6181],
         [-0.4869, -1.7733, -0.3802,  ...,  2.0836,  2.6533,  0.9384],
         [-0.0091,  1.3356,  1.8771,  ..., -1.8233,  0.9045,  1.6972]],

        [[ 0.7267,  2.2912, -1.6138,  ...,  0.0321,  1.0699,  0.0189],
         [-1.4429, -1.1375, -0.7219,  ...,  1.9150,  1.3513,  1.1754],
         [ 1.7486, -2.3669, -1.1904,  ...,  0.7377, -0.6559,  2.4390],
         [ 0.0099,  1.7672, -1.3627,  ..., -0.2226,  3.2302,  1.7457]],

        [[-1.5907,  1.2115, -0.6627,  ..., -0.6719,  0.2657,  1.1036],
         [ 1.2239, -0.4414,  0.9515,  ...,  2.2578,  2.1106,  0.8868],
         [-0.1567, -3.4823, -0.4740,  ...,  3.3041,  2.1027,  0.6277],
         [-0.5282,  0.4051,  0.8265,  ..., -1.8898,  2.9570,  2.1961]],

        ...,

        [[-2.4349,  1.5153, -3.4727,  ..., -0.8320,  1.3004,  1.9994],
         [ 0.0208, -1.8696, -1.65
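The addition above relies on PyTorch broadcasting. As a minimal check, the sketch below uses random stand-in tensors (the `toy_` names and the seed are assumptions, not the tensors from the cells above) and compares the broadcast sum against an explicit loop over the samples in the batch.

```python
import torch

torch.manual_seed(123)  # arbitrary seed, only so the values are repeatable

batch_size, num_tokens, dim = 8, 4, 256
# Random stand-ins with the same shapes as token_embeddings and pos_embeddings
toy_token_embeddings = torch.randn(batch_size, num_tokens, dim)
toy_pos_embeddings = torch.randn(num_tokens, dim)

# Broadcasting: the (4, 256) tensor is added to each of the 8 samples
broadcast_sum = toy_token_embeddings + toy_pos_embeddings

# Explicit version: add the same positional tensor to every sample in turn
explicit_sum = torch.stack(
    [sample + toy_pos_embeddings for sample in toy_token_embeddings]
)

print(broadcast_sum.shape)
print(torch.allclose(broadcast_sum, explicit_sum))
```

Both versions give the same result, so every sample in the batch receives the same positional vector at a given position.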

The input_embeddings we created, as summarized in the figure below, are the embedded input examples that can now be processed by the main LLM modules, which will be implemented in the next chapter.

<img src="https://drive.google.com/uc?id=1JDYWH7YA10fXY217b3K5Nd9fiqcXVZ5x">