<a href="https://colab.research.google.com/github/abhishekkumawat23/build-llm-from-scratch/blob/main/Build_a_large_language_model_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Large Language Model from Scratch

This notebook follows the [Build a Large Language Model (From Scratch)](https://sebastianraschka.com/llms-from-scratch/) book to build a large language model from scratch.

Wherever needed, the notebook explains the relevant theory and underlying concepts, focusing on learning through hands-on implementation.


## What is an LLM (Large Language Model)?

**Large Language Models (LLMs)** can be of various types, but the most common ones are **Generative LLMs**. A Generative LLM is a deep neural network model that, when given a piece of text (whether a word, line, or paragraph), generates the next word (or token, to be precise) that should follow that text.

### Example

- Given `The cat sat`, the LLM might generate `on`
- Given `The cat sat on`, the LLM might generate `the`
- Given `The cat sat on the`, the LLM might generate `mat`

Since the model can generate the next word after any given text, if we pass `The cat sat` and call the model multiple times—each time appending the word it generated in the previous iteration—we can generate a significant chunk of text. For example, starting with just `The cat sat`, we can generate `The cat sat on the mat and started playing.`

End users don't need to call the model repeatedly after each word: the generation code wrapped around the LLM runs this loop internally, feeding each generated word back in as part of the next input. This is why such LLMs are also called **Auto-Regressive** models.

### How it works - LLMs during inference time

Inference time is when a trained LLM is used to serve users. When we pass an input sequence such as `The cat sat` to the LLM, it makes predictions one at a time, auto-regressively, to generate an output sequence like `The cat sat on the mat`.

This is how the predictions happen internally:
- `The cat sat` → ` on`
- `The cat sat on` → ` the`
- `The cat sat on the` → ` mat`
- `The cat sat on the mat` → ` and`
- `The cat sat on the mat and` → ` started`
- `The cat sat on the mat and started` → ` playing`

Thus, in each prediction iteration, the LLM receives the entire sequence as input (including the words generated so far) and outputs a single word as its prediction.
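The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `TOY_NEXT_WORD` is a hypothetical lookup table standing in for the trained LLM, and `generate` is a helper name made up for this example.

```python
# Toy stand-in for a trained LLM: maps the full text so far to the next word.
# (Hypothetical table for illustration; a real model computes this prediction.)
TOY_NEXT_WORD = {
    "The cat sat": " on",
    "The cat sat on": " the",
    "The cat sat on the": " mat",
}

def generate(prompt, max_new_words):
    """Repeatedly predict one word and append it to the input."""
    text = prompt
    for _ in range(max_new_words):
        next_word = TOY_NEXT_WORD.get(text)
        if next_word is None:
            break  # the toy table has no prediction for this input
        text += next_word  # the output becomes part of the next input
    return text

print(generate("The cat sat", max_new_words=3))  # The cat sat on the mat
```

Each pass through the loop corresponds to one prediction iteration in the list above.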

### How it works - LLMs during training time

During training, the goal is to learn from as many predictions as possible. So when we pass a sequence as input, the LLM predicts the next word not just for the entire sequence but for every prefix sub-sequence `[0, k]` of it. All of this is done in one single iteration.

For example, when you pass `The cat sat on the mat and started` to the LLM it predicts all of the below **in one single iteration**:

- `The` → ` cat`

  `The cat` → ` sat`

  `The cat sat` → ` on`

  `The cat sat on` → ` the`

  `The cat sat on the` → ` mat`

  `The cat sat on the mat` → ` and`

  `The cat sat on the mat and` → ` started`

  `The cat sat on the mat and started` → ` playing`

Why does it do that? Every prefix gives the model one more training example, so a single input sequence teaches it to predict after sequences of many different lengths.

You might think that it's easy for the LLM to predict ` mat` for the `The cat sat on the` sub-sequence because we already passed the full sentence `The cat sat on the mat and started`. But the LLM masks (hides) the future words when it predicts for a sub-sequence. So while predicting the next word after `The cat sat on the`, ` mat and started` is masked, and the LLM is not aware of it.

You might also think that it is compute-heavy to train on a single input sequence, since the LLM internally predicts for every prefix sub-sequence. But all of this is done via **vectorization**: everything is computed together in one single pass, without the predictions interfering with each other. On parallel hardware, the time taken to predict for one sub-sequence is similar to the time taken to predict for all the sub-sequences in one go.
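To make the prefix sub-sequences and the masking concrete, here is a minimal sketch in plain Python. It works on words rather than token IDs, and the names `pairs` and `causal_mask` are chosen for this illustration only; a real model applies the mask inside its attention layers.

```python
words = ["The", "cat", "sat", "on", "the", "mat"]

# Inputs are all words except the last; targets are the same words shifted by one.
inputs = words[:-1]
targets = words[1:]

# One (prefix, target) training example per position in the input.
pairs = [(inputs[: k + 1], targets[k]) for k in range(len(inputs))]
for prefix, target in pairs:
    print(" ".join(prefix), "->", target)

# Causal mask: row k marks which positions are visible when predicting from
# position k (1 = visible, 0 = hidden future word).
n = len(inputs)
causal_mask = [[1 if j <= k else 0 for j in range(n)] for k in range(n)]
for row in causal_mask:
    print(row)
```

The mask is lower-triangular: the first row sees only `The`, while the last row sees the whole input.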

### Batching

**Inference time:**

During inference time, the LLM predicts the next word auto-regressively, i.e. each prediction iteration generates one word and then the process repeats. This makes generation slow. So, what we can do is batch multiple unrelated input sequences from multiple users and pass them together as a list of input sequences to the LLM.

In a single prediction iteration, the LLM generates the next word for each of the input sequences, and then repeats this in the next prediction iterations.

Why batch? Because for a single prediction iteration, predicting for one input sequence or for multiple input sequences takes similar time, thanks to **vectorization**.

For example, if we give the input sequences `The cat sat` and `I was driving` to the LLM, it does the following:

- `The cat sat` -> ` on`

  `I was driving` -> ` a`

- `The cat sat on` -> ` the`

  `I was driving a` -> ` racing`

- `The cat sat on the` -> ` mat`

  `I was driving a racing` -> ` car`

**Training time:**

We know that during training, the LLM already makes multiple predictions for a single input sequence by predicting the next word for each prefix sub-sequence. We can take this to the next level by giving it a batch of input sequences. Thanks to **vectorization**, processing all sequences in the batch and all of their sub-sequences takes similar computation time to processing a single sub-sequence without any batching.

For example, if we have a batch of 2 input sequences, `The cat sat on the` and `I was driving a racing`, it does all of the below in one single prediction iteration:

- `The` → ` cat`

  `The cat` → ` sat`

  `The cat sat` → ` on`

  `The cat sat on` → ` the`

  `The cat sat on the` → ` mat`

  `I` → ` was`

  `I was` → ` driving`

  `I was driving` → ` a`

  `I was driving a` → ` racing`

  `I was driving a racing` → ` car`
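As a rough sketch of the bookkeeping, the snippet below counts how many predictions one training pass covers for the batch above. It uses plain Python lists of words rather than token IDs, and the sixth word of each sequence is included only so that every input position has a target.

```python
batch = [
    ["The", "cat", "sat", "on", "the", "mat"],
    ["I", "was", "driving", "a", "racing", "car"],
]

# For each sequence: input = all but the last word, target = all but the first.
batch_inputs = [seq[:-1] for seq in batch]
batch_targets = [seq[1:] for seq in batch]

batch_size = len(batch_inputs)
seq_len = len(batch_inputs[0])
# One prediction per position per sequence, all computed in a single pass.
num_predictions = batch_size * seq_len
print(batch_size, seq_len, num_predictions)  # 2 5 10
```

The 10 predictions match the 10 arrows listed above.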

---

### TODO
Introduce additional foundational concepts such as:
- Deep neural networks
- Transformer layers
- Attention mechanisms
- Encoder-only, decoder-only, and encoder-decoder architectures
- Tokenizers
- Embeddings
- Difference between inference and training
- Optimizing attentions - multi-head, KV caching, multi-query, grouped query, flash attention

*Note: Some of these concepts may be introduced at later stages in the notebook.*

---

# Text Processing

## Basics

### What is an Embedding?

LLMs don't understand text the way humans do—they understand numbers. Therefore, we need to:
- Convert the input text into a numerical representation before passing it to the LLM
- When the LLM predicts the next word, convert its numerical output back into text

This numerical representation of text is called an **embedding**. An embedding is not a single number, but rather a vector (a list) of floating-point numbers. The length of this vector depends on the LLM architecture and can be 32, 64, 768, 4096, or more. The length of the embedding vector is called the **embedding dimension**.

### What to Embed: Entire Input Text or Individual Words?

If we want to pass `The cat sat` to the LLM, should we convert the entire text to a single embedding vector, or should we convert each word to its own embedding vector?

This depends on the type of LLM being used, but most generative LLMs use individual embeddings for each word-like unit rather than sentence-level embeddings. So `The cat sat` would be converted to 3 embedding vectors, one for each word. As input, a generative LLM receives a sequence of embedding vectors, and as output it returns another sequence of embedding vectors representing the generated text `on the mat and started playing` (where each word is generated auto-regressively, one by one, internally).

However, the reality is more nuanced: LLMs don't actually use *word-level* embeddings, but rather *token-level* subword embeddings. Here, a *token* can represent a complete word, a subword, a space, a special character, or a regular character.

### Why Tokens Instead of Words?

Words are typically composed of root words combined with prefixes and suffixes (affixes). If we created an embedding vector for each complete word, we would need an enormous number of embedding vectors. For example, `help`, `helper`, `helpless`, `helpful`, and `unhelpful` are 5 different words all derived from the root word `help`.

Instead, if we split these words into reusable subword tokens like `help`, `er`, `less`, `ful`, and `un`, we only need 5 embedding vectors—but now these subwords can be reused to represent many other words (like `teach`+`er`, `hope`+`less`, `use`+`ful`, `un`+`done`). This significantly reduces the total number of embedding vectors needed for the training dataset. These reusable subword units—such as `help`, `er`, `less`, `ful`, and `un`—are called **tokens**.
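The saving can be checked on a tiny example. The segmentation below is hand-made and hypothetical (a real tokenizer learns its own splits); it only illustrates how shared subwords shrink the vocabulary.

```python
# Hypothetical hand-made segmentation of words into subword tokens.
segmented_words = {
    "help": ["help"],
    "helper": ["help", "er"],
    "helpless": ["help", "less"],
    "helpful": ["help", "ful"],
    "unhelpful": ["un", "help", "ful"],
    "teacher": ["teach", "er"],
    "hopeless": ["hope", "less"],
    "useful": ["use", "ful"],
    "undone": ["un", "done"],
    "teach": ["teach"],
    "hope": ["hope"],
    "use": ["use"],
    "done": ["done"],
}

word_vocab = set(segmented_words)
subword_vocab = {piece for pieces in segmented_words.values() for piece in pieces}

print(len(word_vocab))     # 13 word-level entries
print(len(subword_vocab))  # 9 subword entries cover the same words
```

The gap widens as more words are added, since each new word often reuses pieces that are already in the subword vocabulary.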

### Token Vocabulary

If we collect all the unique tokens from the training dataset, we have a token **vocabulary**. Each token in the vocabulary is assigned a unique identifier called a **token ID**.

### Tokenizer: Converting Between Text and Token IDs

A **tokenizer** is responsible for converting between text and token IDs:
- **Encoding**: The tokenizer converts text into token IDs. These token IDs are then converted to embedding vectors by the embedding layer. The LLM processes these input embedding vectors and generates output embedding vectors, which are converted back to token IDs.
- **Decoding**: The tokenizer converts the output token IDs back into human-readable text.

A tokenizer is capable of performing both encoding and decoding operations.

### Embedding vs Token IDs

One might think that since we can get token IDs from tokens, we already have a numerical representation of tokens. Then what's the need for converting them to embedding vectors? Why not just pass the token IDs (numerical representation) directly to LLMs?

**Semantic Information:**

Embedding vectors are more than just numerical representations of words. They also capture relationships and patterns between words. Similar embedding vectors are close to each other, while dissimilar ones are far apart. For example, the embedding vectors of *cat* and *dog* will be closer to each other than to *car*. By *close*, I mean the angle between the *cat* and *dog* embedding vectors is smaller than the angle between *cat* and *car* in the embedding vector space. In other words, embedding vectors hold **semantic information** about the word in addition to representing the word itself. We will cover later how embedding vectors capture this semantic information, relationships, and complex patterns. For now, we just want to establish that we still need embedding vectors even though we have token IDs as numerical representations of tokens.
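The notion of "angle between vectors" is usually measured with cosine similarity. Here is a minimal sketch; the three 3D vectors are made up by hand for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3D embeddings, chosen by hand purely for illustration.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, dog))  # close to 1: small angle
print(cosine_similarity(cat, car))  # much lower: large angle
```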

**Continuous Vector Space:**

Token IDs are discrete numbers—token ID 34, token ID 35, etc. These are not continuous. For example, there is no token with token ID 34.15. LLMs perform mathematical computations internally, and thus they need continuous numbers.

Embedding vectors exist in a continuous vector space. For example, if token ID 34 represents a 2D embedding vector [0.45, 1.328] in continuous vector space, and the vocabulary has 5,000 defined tokens, you can still perform mathematical computations in that continuous vector space. You can find points in that continuous space that don't represent a specific token but might represent semantic relationships between tokens that only the LLM can understand and compute. This may not be very intuitive, but at least the mathematical reasoning is intuitive: you can't perform mathematical operations like differentiation and integration using discrete token IDs.
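A small sketch of the difference: token IDs are only used as keys into a lookup table, while the vectors they map to support ordinary arithmetic. The table values and the helper `embed` are hypothetical; in a real model the table is a trainable matrix.

```python
# Hypothetical lookup table: token ID -> 2D embedding vector.
embedding_table = {
    34: [0.45, 1.328],
    35: [-0.20, 0.75],
}

def embed(token_ids):
    """Replace each discrete token ID with its continuous vector."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([34, 35])

# Arithmetic is meaningful on vectors: the midpoint below is a valid point in
# the space even though no token ID maps to it.
midpoint = [(a + b) / 2 for a, b in zip(*vectors)]
print(midpoint)
```

By contrast, averaging the IDs themselves (34.5) would not refer to anything.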

---

## Loading the Training Dataset

Before diving deeper into converting text to tokens and then tokens to embedding vectors for input to an LLM, let's first load our dataset.

To build our LLM, we will use a very small dataset: a short story by **Edith Wharton** titled **The Verdict**.

Download the text from https://en.wikisource.org/wiki/The_Verdict and save it as a file named **the-verdict.txt**.

In [32]:
# Load the file and read content

with open('the-verdict.txt', 'r', encoding='utf-8') as file:
  raw_text = file.read()

print(f'Total number of characters: {len(raw_text)}')
print(raw_text[:99])


Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


## Tokenizer

Let's dive deep into tokenizers.

### Simple Token Vocabulary - Word-Based

Before implementing a tokenizer, let's create a token vocabulary:

- Let's implement a simple token vocabulary. We will treat each word as a token, without any fancy word splitting.
- Since punctuation such as commas, periods, and exclamation marks is typically attached to words, let's split on these characters to extract them as separate tokens. We also split on whitespace, but drop the whitespace itself.

In [33]:
# Simple token vocab - v1 - word based

import re

tokens = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
vocab = sorted(set([token for token in tokens if token.strip()]))

### Simple Tokenizer - v1 - Word-Based

- Let's implement a simple text tokenizer called `SimpleTokenizerV1` using the word-based token vocabulary we created earlier.
- In the `encode` method, we will convert the text into a list of token IDs. For splitting the text, we will use the same regex pattern used to create the vocabulary, ensuring consistency.
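- In the `decode` method, we will map each token ID back to its token, join the tokens with spaces, and then remove the space that the join inserts before punctuation.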

**Limitations:**
- Since the vocabulary contains only words from the training dataset, this tokenizer will **throw an error when it encounters an unknown word** during inference.
- Given text like `It's the last he painted.`, encoding and then decoding returns `It' s the last he painted.`, with an extra space before the `s` in `It's`. This occurs because the regex splits `It's` into `It`, `'`, and `s`, and decoding only removes the space before punctuation, not after it.
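To see where the extra space comes from, the split and re-join steps can be traced on their own, without any vocabulary. This sketch reuses the same two regex patterns as the tokenizer; the constant name `SPLIT_PATTERN` is introduced here only for readability.

```python
import re

SPLIT_PATTERN = r'([,.?_!"()\']|--|\s)'

text = "It's the last he painted."
tokens = [t for t in re.split(SPLIT_PATTERN, text) if t.strip()]
print(tokens)  # ['It', "'", 's', 'the', 'last', 'he', 'painted', '.']

joined = ' '.join(tokens)
# Remove the space that the join inserted before punctuation.
restored = re.sub(r'\s+([,.?_!"()\']|--)', r'\1', joined)
print(restored)  # It' s the last he painted.
```

The space after the apostrophe survives because the clean-up pattern only looks at whitespace that precedes a punctuation mark.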

In [34]:
# Simple tokenizer v1 - words are tokens here
class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.token_to_id = {token: id for id, token in enumerate(vocab)}
    self.id_to_token = {id: token for token, id in self.token_to_id.items()}

  def encode(self, text):
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    token_ids = [self.token_to_id[token] for token in tokens if token.strip()]
    return token_ids

  def decode(self, token_ids):
    tokens = [self.id_to_token[token_id] for token_id in token_ids]
    text = ' '.join(tokens)
    # Joining with spaces also puts a space before each punctuation mark. Remove it.
    text = re.sub(r'\s+([,.?_!"()\']|--)', r'\1', text)
    return text



In [35]:
# Testing the simple tokenizer v1.
tokenizer = SimpleTokenizerV1(vocab)
text_to_encode = 'It\'s the last he painted.'
token_ids = tokenizer.encode(text_to_encode)
print(f'Encoded token ids for {text_to_encode}: {token_ids}')

decoded_text = tokenizer.decode(token_ids)
print(f'Decoded text for {token_ids}: {decoded_text}')

Encoded token ids for It's the last he painted.: [58, 2, 872, 1013, 615, 541, 763, 7]
Decoded text for [58, 2, 872, 1013, 615, 541, 763, 7]: It' s the last he painted.


### Simple Tokenizer - v2 - Word-Based

**Adding the `<|unk|>` token:** The previous version threw an error when it encountered unknown words. We can introduce a special token to represent any unknown word. Let's call this token `<|unk|>`.

**Adding the `<|endoftext|>` token:** As we know, we pass a list of token IDs to the LLM. The maximum length of this list is called `max_seq_len`. We will later learn that computationally, it costs roughly the same whether we pass 1 token ID or `max_seq_len` token IDs to the LLM. Since the computational cost is similar, it makes sense during training to always pass inputs of length `max_seq_len` to maximize efficiency.

To achieve this, we can concatenate multiple lines or sentences until we reach `max_seq_len`, then pass that combined input to the LLM. However, we need the LLM to treat these concatenated lines as separate and unrelated, so that text generation isn't negatively affected by the arbitrary concatenation. For this purpose, we can add a special token called `<|endoftext|>` between concatenated lines. We can then train the LLM to understand that segments separated by `<|endoftext|>` are independent and unrelated to each other.
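A minimal sketch of the concatenation step, assuming two short, unrelated example texts. The list `documents` and the constant name `END_OF_TEXT` are placeholders for this illustration.

```python
END_OF_TEXT = '<|endoftext|>'

# Hypothetical, unrelated training texts.
documents = [
    'Hello, do you like tea?',
    'In the sunlit terraces of the palace.',
]

# Join unrelated texts into one stream, marking every boundary.
training_stream = f' {END_OF_TEXT} '.join(documents)
print(training_stream)
```

The resulting stream can then be cut into chunks of `max_seq_len` tokens, with the marker telling the model where one text ends and the next begins.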

In [36]:
# Add `<|unk|>` and `<|endoftext|>` tokens in the vocab
vocab.extend(['<|unk|>', '<|endoftext|>'])
print(vocab[-5:])

# Create a tokenizer with support for these special tokens
class SimpleTokenizerV2:
  def __init__(self, vocab):
    self.token_to_id = {token:id for id, token in enumerate(vocab)}
    self.id_to_token = {id:token for id, token in enumerate(vocab)}

  def encode(self, text):
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [token for token in tokens if token.strip()]
    tokens = [token if token in self.token_to_id else '<|unk|>' for token in tokens]
    token_ids = [self.token_to_id[token] for token in tokens]
    return token_ids

  def decode(self, token_ids):
    tokens = [self.id_to_token[token_id] for token_id in token_ids]
    text = ' '.join(tokens)
    text = re.sub(r'\s+([,.?_!"()\']|--)', r'\1', text)
    return text

['younger', 'your', 'yourself', '<|unk|>', '<|endoftext|>']


In [37]:
# Let's test the tokenizer

tokenizer = SimpleTokenizerV2(vocab)
token_ids = tokenizer.encode('Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.')
print(tokenizer.decode(token_ids))


<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


---

### Simple Byte-Based Tokenizer

**Limitations of the simple tokenizers:**

The simple tokenizers we implemented have several limitations:
- We treat whole words as tokens. As discussed earlier with `help`, `helper`, `helpless`, `helpful`, and `unhelpful`, storing `help`, `er`, `less`, `ful`, and `un` as tokens instead would require far fewer tokens, because these common prefixes and suffixes can be reused to build many other words.
- Even though we handle unknown words with the special token `<|unk|>` instead of raising an error, we never pass the actual unknown word to the LLM. So even if the LLM could make sensible predictions from an unknown word (for example, a typo), it cannot, because it never sees that word. Thus, a tokenizer should be able to create tokens for unknown words.

**Tokenizer made out of byte tokens:**

At the very core, every word, character, symbol, and special token is stored as a sequence of bytes. How many distinct byte values are possible? 256 (2^8). So if we create a token vocabulary containing all 256 byte values, we can convert any word, character, or symbol to tokens. This way we don't need to worry about unknown words.

**Do we need to implement a byte tokenizer?**

Python already provides `str.encode` and `bytes.decode`, which convert text to bytes and back, so we don't strictly need to implement a byte tokenizer. But let's still implement one, because we are going to build on it by adding optimizations.

**Limitations:**

The tokenizer below has only 256 tokens, which is a very small vocabulary. In addition, there is no unknown-word problem, since any text can be built from the 256 byte values. Then what are the limitations?
- The number of tokens required to represent a single word is large. Even a simple English word like `implement` now needs 9 tokens. LLMs have a limit on the number of tokens they can take, called `max_seq_length`. Since we have increased the number of tokens needed per word, the LLM can take in far less text.
- As we have only 256 tokens, there are only 256 corresponding embedding vectors. Since embedding vectors also hold semantic relationships, having only 256 of them limits how many relationships they can capture.
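The token count per word is easy to check with Python's built-in UTF-8 encoding. The helper name `byte_token_count` is made up for this illustration.

```python
def byte_token_count(text):
    """Number of tokens a byte-level tokenizer needs for this text."""
    return len(text.encode('utf-8'))

print(byte_token_count('implement'))  # 9: one byte per ASCII letter
print(byte_token_count('\u00e9'))     # 2: 'é' needs two bytes in UTF-8
print(byte_token_count('\u20ac'))     # 3: '€' needs three bytes in UTF-8
```

Non-ASCII characters make the problem worse, since a single character can expand into several byte tokens.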

In [38]:
class SimpleByteTokenizer:
  def __init__(self):
    self.token_to_id = {bytes([id]): id for id in range(256)}
    self.id_to_token = {id: token for token, id in self.token_to_id.items()}

  def encode(self, text):
    tokens = text.encode('utf-8') # Using this method, we get the bytes
    token_ids = [self.token_to_id[bytes([token])] for token in tokens]
    return token_ids

  def decode(self, token_ids):
    tokens = [self.id_to_token[token_id] for token_id in token_ids]
    all_bytes = b''.join(tokens)
    return all_bytes.decode('utf-8', errors='replace')

In [39]:
tokenizer = SimpleByteTokenizer()
token_ids = tokenizer.encode('Hello how are you?')
print(token_ids)
print(tokenizer.decode(token_ids))


[72, 101, 108, 108, 111, 32, 104, 111, 119, 32, 97, 114, 101, 32, 121, 111, 117, 63]
Hello how are you?


### Simple Byte Pair Encoding Tokenizer

We can build on top of our existing simple byte tokenizer to address its limitations. We are going to build a tokenizer called a **Byte Pair Encoding** tokenizer, aka **BPE** tokenizer. GPT-like LLMs use BPE tokenizers. Here, for simplicity, we will implement a simple version.

**Creating the vocab:**
- Go through the entire training dataset and split it into bytes to get `tokens`.
- Create a frequency map of consecutive token pairs (initially byte pairs) and find the pair with the highest frequency.
- Merge this pair and add the merged bytes to the `token_to_id` and `id_to_token` maps.
- Go through the `tokens` list and replace every occurrence of the pair with the single merged token.
- Repeat this process until the vocab reaches the size you want. For example, for a vocab of size 5000, repeat the process 5000 - 256 = 4744 times, as the vocab already starts with 256 values.
- Repeating this process fills the vocab with the most common sub-words and words (in merged-byte form).
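A single round of the steps above can be traced on a tiny input. This is a minimal sketch: the helper names `most_frequent_pair` and `merge_pair` and the sample bytes `b'aaabdaaabac'` are chosen for illustration, and pairs are counted at every position, including overlapping ones.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = [bytes([b]) for b in b'aaabdaaabac']
best = most_frequent_pair(tokens)   # (b'a', b'a')
tokens = merge_pair(tokens, best)
print(best)
print(tokens)  # [b'aa', b'a', b'b', b'd', b'aa', b'a', b'b', b'a', b'c']
```

Running the two helpers again on the new `tokens` list performs the next merge, and so on until the vocab is full.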

**Encoding:**
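- Apply a greedy left-to-right match: keep extending the current token byte by byte while the extended bytes are still in the vocab, then emit it and start a new token. This is a simplification; it does not guarantee the longest possible match or the fewest tokens.

**Decoding:**
- Decoding is the same as before: look up the bytes for each token ID, join them, and decode as UTF-8.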

In [40]:
from collections import Counter

class SimpleBPETokenizer:
  def __init__(self, vocab_size, freq_threshold):
    self.token_to_id = {bytes([id]): id for id in range(256)}
    self.id_to_token = {id: token for token, id in self.token_to_id.items()}
    self.vocab_size = vocab_size
    self.freq_threshold = freq_threshold

  def train(self, text):
    # Pass entire training text as text so that we can create tokens for common
    # sub-words, words etc.
    tokens = text.encode('utf-8')
    tokens = [bytes([token]) for token in tokens]

    num_merges = self.vocab_size - len(self.token_to_id)
    for _ in range(num_merges):
      # Find best token pair which comes together most frequently.
      pairs = Counter()
      for idx in range(len(tokens)-1):
        pair = (tokens[idx], tokens[idx+1])
        pairs[pair] += 1
      if not pairs:
        # Note: the training text had so few tokens that we exhausted all
        # pairs before reaching the requested vocab size.
        break
      best_pair = max(pairs, key=pairs.get)
      best_pair_merged = best_pair[0] + best_pair[1]

      # If best_pair's frequency doesn't cross the frequency threshold, then stop.
      if pairs[best_pair] < self.freq_threshold:
        break

      # Add best pair in our token_to_id and id_to_token maps.
      best_pair_idx = len(self.token_to_id) # As we are adding as last, index will be last.
      self.token_to_id[best_pair_merged] = best_pair_idx
      self.id_to_token[best_pair_idx] = best_pair_merged

      # Merge this best pair in tokens so that next iteration can repeat the
      # process but where tokens have this best pair merged. This way, iteration
      # by iteration, we would have merged common sub-words, words, etc.
      new_tokens = []
      idx = 0
      while idx < len(tokens):
        if idx < len(tokens) - 1 and (tokens[idx], tokens[idx+1]) == best_pair:
          new_tokens.append(best_pair_merged)
          idx += 2
        else:
          new_tokens.append(tokens[idx])
          idx += 1
      tokens = new_tokens

  def vocab(self):
    return list(self.token_to_id.keys())

  def encode(self, text):
    # Greedy left-to-right matching: keep extending the current token while the
    # extended bytes are still in the vocab, otherwise emit it and start over.
    # Note: this is a simplification; it does not replay the learned merges in order.
    text_bytes = [bytes([byte]) for byte in text.encode('utf-8')]

    tokens = []
    current_token = b''
    for next_byte in text_bytes:
      new_token = current_token + next_byte
      if new_token in self.token_to_id:
        current_token = new_token
      else:
        tokens.append(current_token)
        current_token = next_byte
    if current_token:
      # Emit the last pending token (skipped for empty input, which would
      # otherwise try to look up b'' and fail).
      tokens.append(current_token)

    token_ids = [self.token_to_id[token] for token in tokens]
    return token_ids

  def decode(self, token_ids):
    tokens = [self.id_to_token[token_id] for token_id in token_ids]
    all_bytes = b''.join(tokens)
    return all_bytes.decode('utf-8', errors='replace')

In [41]:
# Train tokenizer
vocab_size = 500
freq_threshold = 5
tokenizer = SimpleBPETokenizer(vocab_size, freq_threshold)
# print(raw_text[:99])
tokenizer.train(raw_text)

# Print vocab
vocab = tokenizer.vocab()
print(vocab[-10:])
print(f'Vocab size: {len(vocab)}')

[b'for ', b'ould', b'tch', b'gre', b'pr', b'ould have ', b'--t', b'cou', b'ure ', b'k ']
Vocab size: 500


In [42]:
# Try encoding and decoding
token_ids = tokenizer.encode('I felt able to face the fact')
print(token_ids)
print(tokenizer.decode(token_ids))

[278, 102, 288, 259, 336, 353, 116, 273, 102, 300, 305, 262, 102, 300, 116]
I felt able to face the fact


### GPT's BPE Tokenizer

The tokenizer we wrote just gives a simple idea of how Byte Pair Encoding works. It lacks a lot of optimizations, for example:
- We don't want odd tokens that span spaces (such as `b'ould have '` above). Ideally, we should pre-tokenize the text by splitting it into word-like chunks, and then run byte pair encoding within each chunk.
- Our encode is not efficient. Even though we use a greedy approach, we repeatedly concatenate `bytes` objects, which are immutable, so every concatenation creates a new object and adds to the running time.

These are just a few of the missing points; there are many more. So, it's better to use **GPT's BPE tokenizer**. Alternatively, we can use other tokenizers such as **WordPiece** (used by BERT).
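To give a feel for the pre-tokenization idea, here is a rough sketch using Python's `re` module. The pattern is a simplified, hypothetical one written for this notebook and is not the pattern GPT's tokenizer actually uses; the helper name `pretokenize` is also an assumption.

```python
import re

# Simplified, hypothetical pre-tokenization pattern (not GPT-2's actual one):
# an optional leading space followed by a word, or by a run of punctuation,
# or a run of whitespace.
PRETOKENIZE_PATTERN = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

def pretokenize(text):
    """Split text into chunks; BPE merges would then run inside each chunk."""
    return PRETOKENIZE_PATTERN.findall(text)

chunks = pretokenize("I would have gone, too.")
print(chunks)  # ['I', ' would', ' have', ' gone', ',', ' too', '.']
```

Because merges never cross chunk boundaries, a token like `b'ould have '` could no longer be learned, while joining the chunks still reproduces the original text exactly.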

In [43]:
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')
token_ids = tokenizer.encode('I felt able to face the fact')
print(token_ids)
print(tokenizer.decode(token_ids))

[40, 2936, 1498, 284, 1986, 262, 1109]
I felt able to face the fact


---

## Data Loader

To prepare data for LLM training, we need to do a lot of things:
- Data sampling, i.e. splitting the entire raw text into chunks so that each chunk can be passed as one input sequence to the LLM. Here we can use a sliding window with a stride to pick the samples.
- Creating input-target pairs
- Creating batches of inputs so that one entire batch can be passed to the LLM at a time
- Deciding whether to shuffle the data between epochs
- Deciding whether to drop the last batch, as it might contain fewer samples and can cause unnecessary spikes in the loss

We could write all of this logic ourselves, and it is not that difficult, but we have to be cautious while sampling the data.

Instead, we can use torch's `Dataset` and `DataLoader` classes. There are other well-known libraries for data processing, such as pandas.

### Dataset

- In the dataset class, we define `__getitem__`, which returns the pair of input sequence and corresponding targets at a given index; this pair is what gets passed to the LLM for prediction. Why multiple targets for a single input sequence? Because, as we read earlier, during training the LLM predicts the next token for each prefix sub-sequence. So for an input sequence of length n, we have n predictions and thus n target IDs.
- In the constructor, we build the pairs of `input_ids` and `target_ids` using `stride` and `max_seq_length`, where `stride` is the index gap between the starts of consecutive input sequences and `max_seq_length` is the length of a single input sequence.
- If `stride` is smaller than `max_seq_length`, input sequences overlap, which increases the chance of overfitting.
- If `stride` is larger than `max_seq_length`, we skip tokens in between and thus leave part of the data unused for training.
- Thus, people usually set `stride` equal to `max_seq_length`.
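The effect of `stride` is easiest to see on stand-in data. The sketch below uses plain Python lists and a hypothetical helper `sliding_windows`; the IDs 0 to 9 are placeholders, not real token IDs.

```python
def sliding_windows(token_ids, max_seq_length, stride):
    """Return (input, target) chunks; targets are inputs shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - max_seq_length, stride):
        input_chunk = token_ids[start : start + max_seq_length]
        target_chunk = token_ids[start + 1 : start + max_seq_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # stand-in IDs 0..9

# stride == max_seq_length: windows tile the data without overlap.
no_overlap = sliding_windows(token_ids, max_seq_length=4, stride=4)
print(no_overlap)  # [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]

# stride < max_seq_length: consecutive windows share tokens.
overlapping = sliding_windows(token_ids, max_seq_length=4, stride=2)
print(len(overlapping))  # 3
```

With `stride=2`, the second window starts at ID 2, so IDs 2 and 3 appear in two different inputs.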

In [44]:
from torch.utils.data import Dataset, DataLoader
import torch

class GPTDatasetV1(Dataset):
  def __init__(self, data, tokenizer, stride, max_seq_length):
    print(f'max_seq_length: {max_seq_length}')
    self.input_ids = []
    self.target_ids = []

    token_ids = tokenizer.encode(data)
    for i in range(0, len(token_ids) - max_seq_length, stride):
      input_chunk = token_ids[i : i + max_seq_length]
      target_chunk = token_ids[i + 1 : i + max_seq_length + 1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]

  def __len__(self):
    return len(self.input_ids)

In [45]:
stride = 4
max_seq_length = 4
dataset = GPTDatasetV1(raw_text, tokenizer, stride, max_seq_length)
data_iterator = iter(dataset)

next_input_ids, next_target_ids = next(data_iterator)
print(f'Next input token ids: {next_input_ids} and target ids: {next_target_ids}')
print(f'Next input sequence: {tokenizer.decode(next_input_ids.tolist())} and target sequence: {tokenizer.decode(next_target_ids.tolist())}')

max_seq_length: 4
Next input token ids: tensor([  40,  367, 2885, 1464]) and target ids: tensor([ 367, 2885, 1464, 1807])
Next input sequence: I HAD always and target sequence:  HAD always thought


### Data loader

Just defining pairs of input sequence and target is not enough. We need to think of other things, like the batch size, whether to shuffle the data for each epoch, and whether to drop the last batch, since it might contain fewer than `batch_size` input sequences and thus cause spikes in numbers like the loss.

Implementing these is easy, but torch already does it for us through its `DataLoader` class.
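For intuition, the batching options can be mimicked in plain Python. This is only a conceptual sketch with a made-up helper `make_batches`; it is not how `DataLoader` is implemented, and the `seed` parameter exists only to make the example repeatable.

```python
import random

def make_batches(samples, batch_size, shuffle=True, drop_last=True, seed=0):
    """Group samples into batches, mimicking the options discussed above."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)  # a new order would be drawn each epoch
    batches = [
        [samples[i] for i in order[start : start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    return batches

samples = list(range(10))
print(make_batches(samples, batch_size=4, shuffle=False, drop_last=True))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
print(make_batches(samples, batch_size=4, shuffle=False, drop_last=False))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With `drop_last=False`, the final batch holds only 2 samples, which is the kind of undersized batch that can make the loss jump.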

In [46]:
def create_dataloader_v1(data, batch_size=4, max_seq_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
  tokenizer = tiktoken.get_encoding('gpt2')
  dataset = GPTDatasetV1(data, tokenizer, stride, max_seq_length)
  dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
  return dataloader

In [47]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_seq_length=4, stride=4)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

max_seq_length: 4
Inputs:
 tensor([[  464,  1109,  3181,  1363],
        [  314,  2138,  1807,   340],
        [ 1544,  9373,  6364,    25],
        [34537,   526,   198,   198],
        [ 2495, 11331,  2768,   590],
        [48422,   540,   450,    67],
        [  379,   790,   966,   262],
        [ 1310,  1165,   881, 40642]])

Targets:
 tensor([[ 1109,  3181,  1363,   284],
        [ 2138,  1807,   340,   561],
        [ 9373,  6364,    25,   366],
        [  526,   198,   198,     1],
        [11331,  2768,   590,   286],
        [  540,   450,    67,  3299],
        [  790,   966,   262,  3512],
        [ 1165,   881, 40642,   972]])


---

## Token Embeddings

Now we need to create embeddings (embedding vectors) from token IDs. As we discussed earlier, embeddings not only represent the token itself but also contain semantic information and relationships about tokens. This means the embedding vectors of `cat` and `dog` will be close to each other in the embedding vector space, while they will be far from `car`.

### Embedding Dimension

The first thing we need to decide when creating embeddings is the **embedding dimension**. Should we convert a token to a 2D embedding vector or a 2000D embedding vector?

**Too Small (Underfitting):**
If we create a 2D embedding vector, it can't hold much complex semantic information about the token. The model won't have enough capacity to capture the nuanced relationships between words.

**Too Large (Overfitting):**
On the other hand, if we define a 2000D embedding vector with limited training data, the model might memorize specific relationships (like "`cat` and `dog` are close") rather than learning the underlying semantic reasoning (they're both pets). This is problematic because during inference, the LLM won't generalize well to new words or contexts it hasn't explicitly seen during training.

#### Analogy: LeetCode Problem Solving

Consider solving LeetCode problems:

- **Infinite memory (overfitting)**: You memorize answers to all LeetCode questions. If asked a question from LeetCode, you give the memorized answer. But if asked a question outside LeetCode, you can't answer it because you never memorized it.

- **Limited memory (optimal learning)**: You can't memorize all answers, so you identify patterns and retain those patterns in memory. Now when any question is asked—from LeetCode or outside—you can apply those generic patterns to solve it.

- **Very limited memory (underfitting)**: The number of patterns you can learn is severely limited, so you can't answer many problems effectively.

#### Finding the Right Balance

Neither a very small dimension nor a very large dimension is ideal. The optimal embedding dimension depends on:
- Vocabulary size (number of unique tokens)
- Amount of training data available
- Model size and complexity

**Typical embedding dimensions:**
- 768, 2048, or 4096 are common choices for LLMs with a vocabulary size of 50,000+ (GPT-2 small, for example, uses 768)
- In practice, the dimension is chosen together with the overall model size and the amount of available training data
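
One concrete cost of a larger embedding dimension is the size of the embedding matrix, which holds `vocab_size x embedding_dim` trainable weights. A quick calculation, assuming the GPT-2 vocabulary size of 50,257 used in this notebook (the variable names are illustrative):

```python
# Number of trainable weights in the token embedding matrix alone
vocab_size = 50257  # GPT-2 tokenizer vocabulary size

embedding_param_counts = {
    dim: vocab_size * dim for dim in (256, 768, 2048, 4096)
}

for dim, count in embedding_param_counts.items():
    print(f"embedding_dim={dim:>4}: {count:,} parameters")
```

Even before any attention or feed-forward layers are added, the embedding matrix grows linearly with the chosen dimension.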

### How to Create Embeddings?

Since embeddings contain semantic information about the tokens, they're not easy to create by hand. We need to **learn/train** embeddings using a lot of training data. `word2vec` and `GloVe` are two well-known models that train embeddings.

### word2vec - Learning Embedding Vectors for Tokens

Implementing word2vec is out of scope for this notebook. For a simple implementation of word2vec, refer to [this](https://github.com/abhishekkumawat23/word2vec-embedding-model-from-scratch/blob/main/word2vec_embedding_model_from_scratch.ipynb) notebook.

Here, let's just understand the concept:

- On the text corpus, a sliding window is applied. The token at the center of the sliding window is paired with all other tokens in the sliding window. Each of these pairs is in context/relation to each other as they were present in the same sliding window. These are our positive pairs, which are related to each other. This is called the skip-gram approach.

- Using the same center token of the sliding window, we pair it with a random token from the text corpus, and these are out-of-context pairs. This approach is called negative sampling.

- The model internally defines an embedding matrix with random weights. Actually, it defines two embedding matrices: one for the first word of the pair (the center word) and one for the second word (the context word).

- The model takes a pair as input and outputs the probability of this pair being in the same context or not.

- Our target is 1 for a positive pair and 0 for a negative pair. We calculate the loss from the predicted probability and the target using binary cross-entropy. This loss grows logarithmically rather than linearly: a prediction that is confidently wrong (probability very far from the target) is penalized far more heavily than one that is slightly off, so the model learns not to make big mistakes.

- We don't need a deep neural network for this, just a single layer with 2 embedding matrices. When we get a pair, we retrieve the corresponding embedding vectors from the respective embedding matrices and then take their dot product, because the dot product measures the similarity between vectors.

- We apply sigmoid to the dot product. Sigmoid is a non-linear function that squashes the unbounded similarity score into the range (0, 1), so the output can be read as the probability that the pair belongs to the same context.
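
The steps above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the implementation from the linked notebook: the class name `SkipGramNS`, the toy vocabulary size, and the hard-coded token ID pairs are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SkipGramNS(nn.Module):
    """Hypothetical sketch: skip-gram scoring with two embedding matrices."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, center_ids, context_ids):
        center = self.center_embeddings(center_ids)     # (num_pairs, dim)
        context = self.context_embeddings(context_ids)  # (num_pairs, dim)
        # Dot product per pair: the raw similarity score (logit)
        return (center * context).sum(dim=-1)

model = SkipGramNS(vocab_size=10, embedding_dim=8)

# Made-up pairs: first two are positive (target 1), last two are negative (target 0)
center_ids = torch.tensor([1, 1, 1, 1])
context_ids = torch.tensor([2, 3, 7, 9])
targets = torch.tensor([1.0, 1.0, 0.0, 0.0])

logits = model(center_ids, context_ids)
probs = torch.sigmoid(logits)  # probability that each pair shares a context

# Applies sigmoid internally, then binary cross-entropy against the targets
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
print(probs)
print(loss.item())
```

Calling `loss.backward()` and stepping an optimizer would then nudge both embedding matrices so that positive pairs score higher and negative pairs score lower.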

### LLMs Don't Need Pre-trained Embeddings from Outside

Modern LLMs don't need embeddings learned from word2vec, GloVe, or other external embedding models. Instead, they create embedding vectors for the entire vocabulary (represented as an embedding matrix) internally, and then train/learn those embeddings along with training the LLM for next token prediction.

This is magical. Not only do we get an LLM trained for next token prediction, but it also creates/learns embedding vectors for the entire vocabulary while doing so. We can actually take these embedding vectors created by the LLM and use them for a different purpose without using the LLM's token generation capability. For example, we can use them in a recommendation system, given that semantically similar tokens live close together, which is exactly what we want in a recommendation system.

### Create Embedding Matrix with Random Initialization

Since the LLM we are building will take care of learning the embedding vectors for the entire vocabulary by itself, we will just randomly initialize these vectors for the entire vocabulary in a single matrix called the embedding matrix. In other words, the weights of this embedding matrix are initialized randomly. As training of the LLM progresses, these weights of the matrix (i.e., the embedding vectors of the vocabulary) will keep updating. They will continuously update with semantic information and relationships of the words.

To define it, we will use `torch.nn.Embedding` and pass it the dimensions `vocab_size x embedding_dim`. By default, PyTorch initializes the weights from a standard normal distribution N(0, 1), i.e., mean=0 and std=1. It doesn't apply schemes like He or Xavier initialization, which are designed to keep the signal scale stable through weighted sums and activation functions like ReLU, sigmoid, etc. The embedding layer performs none of these operations: it simply defines an embedding matrix and looks up rows from it. Later layers in the LLM neural network will perform such operations and thus will need careful weight initialization.
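
As a quick sanity check of the default initialization, the sketch below creates a small embedding layer and measures the mean and standard deviation of its weights. The layer size is an arbitrary choice for illustration.

```python
import torch

torch.manual_seed(123)

# Small illustrative layer: 1000 tokens, 64 dimensions (arbitrary sizes)
check_layer = torch.nn.Embedding(1000, 64)

weight_mean = check_layer.weight.mean().item()
weight_std = check_layer.weight.std().item()

# Both values should be close to the N(0, 1) defaults
print(f"mean={weight_mean:.4f}, std={weight_std:.4f}")
```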

### How to Get Embedding Vectors from Embedding Matrix?

The embedding matrix represents embedding vectors for each word in the vocabulary. So during the forward pass of the LLM, when we receive a list of token IDs, for each token ID, we use the token ID as an index and retrieve the embedding vector row at that index in the embedding matrix to get the embedding vector for that token ID.
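
The lookup is plain row indexing. A minimal sketch with a toy layer of 6 tokens and 3 dimensions (illustrative sizes, not the notebook's configuration) shows that calling the layer gives the same result as indexing its weight matrix directly:

```python
import torch

torch.manual_seed(123)

# Toy layer (assumption for illustration): 6 tokens, 3 dimensions
toy_embedding = torch.nn.Embedding(6, 3)
toy_token_ids = torch.tensor([2, 5, 2])

looked_up = toy_embedding(toy_token_ids)        # forward pass of the layer
indexed = toy_embedding.weight[toy_token_ids]   # manual row indexing

print(torch.equal(looked_up, indexed))  # True
```

Token ID `2` appears twice in the input, so rows 0 and 2 of the result are identical.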

In [48]:
# Token embedding layer

torch.manual_seed(123)

# Config
vocab_size = 50257
embedding_dim = 256
max_length = 4
batch_size = 8

# Create token embedding layer
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
print(token_embedding_layer.weight)

# Get inputs
dataloader = create_dataloader_v1(raw_text, batch_size=batch_size, max_seq_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

# Pass inputs via token embedding layer
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035,  ...,  1.3337,  0.0771, -0.0522],
        [ 0.2386,  0.1411, -1.3354,  ..., -0.0315, -1.0640,  0.9417],
        [-1.3152, -0.0677, -0.1350,  ..., -0.3181, -1.3936,  0.5226],
        ...,
        [ 0.5871, -0.0572, -1.1628,  ..., -0.6887, -0.7364,  0.4479],
        [ 0.4438,  0.7411,  1.1263,  ...,  1.2091,  0.6781,  0.3331],
        [-0.2537,  0.1446,  0.7203,  ..., -0.2134,  0.2144,  0.3006]],
       requires_grad=True)
max_seq_length: 4
Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])
torch.Size([8, 4, 256])


---

## Positional Embeddings

The token embedding matrix above doesn't care about the position at which tokens appear in the text. For example, `Only I love you.` and `I only love you.` contain the same words, but the position of `Only` has changed and the meaning is entirely different. The learned embedding vector of `Only` represents the word and its semantic meaning, but it is the same vector whether the word is at the first position or the second. Yet, as the examples show, the meaning changes completely just because of the position.

### Why Positional Information is Necessary

The attention mechanism itself has no notion of token order: it treats tokens as an unordered set and doesn't inherently distinguish token positions. Strictly speaking, self-attention is **permutation equivariant**, meaning that reordering the inputs only reorders the outputs in the same way. Similarly, feed-forward layers operate identically on each position without considering where a token appears in the sequence. This means that if we shuffle the input tokens, each token still gets exactly the same representation as before, which is clearly problematic for understanding language.
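
This order-blindness can be checked numerically. The sketch below uses a deliberately simplified self-attention with no learned weights and no masking (an assumption for illustration; the real attention layers come later). Shuffling the input rows only shuffles the output rows, and no token's output changes.

```python
import torch

torch.manual_seed(0)

def simple_self_attention(x):
    """Simplified attention: scaled dot-product scores, softmax, weighted sum."""
    scores = x @ x.T / x.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

# 5 made-up token vectors of dimension 8, with no positional information
tokens = torch.randn(5, 8)
perm = torch.tensor([3, 0, 4, 1, 2])  # an arbitrary reordering

out_original = simple_self_attention(tokens)
out_shuffled = simple_self_attention(tokens[perm])

# Each token's output is unchanged; only the row order differs
same_up_to_order = torch.allclose(out_original[perm], out_shuffled, atol=1e-5)
print(same_up_to_order)  # True
```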

Given the importance of word order (as shown in the above example), we need to explicitly inject positional information into our model. This is where positional embeddings come in.

### Creating Positional Embeddings

We will add positional embeddings for each position in the sequence. The dimension of these will be the same as the embedding dimensions, so they can be added element-wise to the token embeddings.

The positional embedding matrix will be of size `(max_seq_length, embedding_dim)` to represent an embedding vector for each position. We also have the token embedding matrix which we discussed above, which is of size `(vocab_size, embedding_dim)` to represent an embedding vector for each word in the vocabulary.

**Note:** In LLMs, `max_seq_length` and `context_length` typically refer to the same value: the maximum number of tokens the model can process at once. The model can attend only to tokens inside this window; anything beyond the limit is not visible to it.

### Learned vs. Fixed Positional Embeddings

There are different approaches to creating positional embeddings:

- **Learned positional embeddings** (GPT-2, BERT): These are trainable parameters that the model learns during training, just like token embeddings. We initialize them randomly (by default, PyTorch uses a normal distribution of N(0, 1)), and they get updated through backpropagation.

- **Fixed sinusoidal encodings** (Original Transformer): These use predetermined sine and cosine functions at different frequencies and are not learned.

- **Rotary Position Embeddings (RoPE)** (LLaMA, Mistral, many modern LLMs): These encode positional information by rotating the query and key vectors inside attention by position-dependent angles, which is often reported to help with extrapolation to longer sequences.

- **Relative positional encodings** (T5): These encode the relative distance between tokens rather than absolute positions.
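
For contrast with the learned approach, here is a minimal sketch of the fixed sinusoidal encodings from the original Transformer, where even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies. The function name is illustrative, and the sketch assumes an even `embedding_dim`.

```python
import math
import torch

def sinusoidal_positional_encoding(max_seq_length, embedding_dim):
    """Fixed (non-trainable) encodings; assumes embedding_dim is even."""
    positions = torch.arange(max_seq_length, dtype=torch.float32).unsqueeze(1)
    # 1 / 10000^(2i / embedding_dim) for each pair of dimensions
    div_term = torch.exp(
        torch.arange(0, embedding_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embedding_dim)
    )
    encoding = torch.zeros(max_seq_length, embedding_dim)
    encoding[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return encoding

sinusoidal = sinusoidal_positional_encoding(max_seq_length=4, embedding_dim=256)
print(sinusoidal.shape)  # torch.Size([4, 256])
```

Because nothing here is a trainable parameter, these values stay the same throughout training, unlike the learned embeddings used in this notebook.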

For our implementation, we'll use **learned positional embeddings** (similar to GPT-2). We define them with random weights and let the model learn the optimal positional representations during training.

### How to Get Positional Embedding Vectors from Positional Embedding Matrix?

The positional embedding matrix represents positional embedding vectors for each position in the input sequence. So during the forward pass of the LLM, we retrieve the positional embedding vector using the position index from the positional embedding matrix.

### Adding Token Embeddings and Positional Embeddings

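Token embeddings for a batch have shape `(batch_size, seq_len, embedding_dim)`, while positional embeddings have shape `(seq_len, embedding_dim)`. Because the trailing dimensions match, PyTorch broadcasts the positional embeddings across the batch. We add them element-wise to get **input embeddings**, which contain both token information and positional information.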
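
The addition relies on PyTorch broadcasting: a tensor of shape `(seq_len, embedding_dim)` is added to every sequence in a batch of shape `(batch_size, seq_len, embedding_dim)`. A minimal sketch with small made-up tensors (the names and sizes are illustrative):

```python
import torch

torch.manual_seed(123)

# Made-up sizes for illustration
toy_token_emb = torch.randn(2, 4, 3)  # (batch_size, seq_len, embedding_dim)
toy_pos_emb = torch.randn(4, 3)       # (seq_len, embedding_dim)

# toy_pos_emb is broadcast over the batch dimension
toy_input_emb = toy_token_emb + toy_pos_emb
print(toy_input_emb.shape)  # torch.Size([2, 4, 3])

# Every sequence in the batch received the same positional vectors
second_seq_offset = toy_input_emb[1] - toy_token_emb[1]
print(torch.allclose(second_seq_offset, toy_pos_emb, atol=1e-6))  # True
```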

In [51]:
torch.manual_seed(123)

# Create positional embedding layer: one learnable vector per position
pos_embedding_layer = torch.nn.Embedding(max_length, embedding_dim)
print(pos_embedding_layer.weight.shape)

# Look up the positional embedding vectors for positions 0..max_length-1
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

# Input embeddings - token + pos embeddings (pos embeddings broadcast over the batch)
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([4, 256])
torch.Size([4, 256])
torch.Size([8, 4, 256])
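
The two lookups can be bundled into one module so later code only has to pass token IDs. This is a sketch under stated assumptions rather than the notebook's own design: the class name `InputEmbedding` is hypothetical, and the token IDs are random placeholders instead of real data.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Hypothetical helper: token embeddings plus learned positional embeddings."""
    def __init__(self, vocab_size, max_seq_length, embedding_dim):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.pos_embedding = nn.Embedding(max_seq_length, embedding_dim)

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len), with seq_len <= max_seq_length
        seq_len = token_ids.shape[1]
        positions = torch.arange(seq_len, device=token_ids.device)
        # Positional embeddings broadcast across the batch dimension
        return self.token_embedding(token_ids) + self.pos_embedding(positions)

torch.manual_seed(123)
embedder = InputEmbedding(vocab_size=50257, max_seq_length=4, embedding_dim=256)

# Random placeholder token IDs standing in for a real batch
dummy_token_ids = torch.randint(0, 50257, (8, 4))
dummy_input_embeddings = embedder(dummy_token_ids)
print(dummy_input_embeddings.shape)  # torch.Size([8, 4, 256])
```

Building `positions` from the actual sequence length means the module also accepts inputs shorter than `max_seq_length`.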
