<a href="https://colab.research.google.com/github/abhishekkumawat23/build-llm-from-scratch/blob/main/Build_a_large_language_model_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Large Language Model from Scratch

This notebook follows the [Build a Large Language Model (From Scratch)](https://sebastianraschka.com/llms-from-scratch/) book to build a large language model from scratch.

Wherever needed, the notebook explains the relevant theory and underlying concepts, focusing on learning through hands-on implementation.

---

## What is an LLM (Large Language Model)?

**Large Language Models (LLMs)** can be of various types, but the most common ones are **Generative LLMs**. A Generative LLM is a deep neural network model that, when given a piece of text (whether a word, line, or paragraph), generates the next word (or token, to be precise) that should follow that text.

### Example

- Given `The cat sat`, the LLM might generate `on`
- Given `The cat sat on`, the LLM might generate `the`
- Given `The cat sat on the`, the LLM might generate `mat`

Since the model can generate the next word after any given text, if we pass `The cat sat` and call the model multiple times—each time appending the word it generated in the previous iteration—we can generate a significant chunk of text. For example, starting with just `The cat sat`, we can generate `The cat sat on the mat and started playing.`

### How It Works

- `The cat sat` → `The cat sat on`
- `The cat sat on` → `The cat sat on the`
- `The cat sat on the` → `The cat sat on the mat`
- `The cat sat on the mat` → `The cat sat on the mat and`
- `The cat sat on the mat and` → `The cat sat on the mat and started`
- `The cat sat on the mat and started` → `The cat sat on the mat and started playing`

End users don't need to call the model repeatedly after each word—the LLM wraps this logic internally, calling the underlying model multiple times and appending each generated word to the next iteration. This is why such LLMs are also called **Auto-Regressive** models.
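The loop described above can be sketched in a few lines. This is only an illustration: `NEXT_WORD`, `predict_next_word`, and `generate` are hypothetical names, and the lookup table stands in for a real model, which would look at the whole context and predict tokens rather than whole words.

```python
# Toy stand-in for a real model: a hypothetical table mapping the last word
# of the text to the next word. A real LLM conditions on the full context.
NEXT_WORD = {
    "sat": "on",
    "on": "the",
    "the": "mat",
    "mat": "and",
    "and": "started",
    "started": "playing",
}


def predict_next_word(text):
    """Return the next word for `text`, or None if the toy model has none."""
    last_word = text.split()[-1]
    return NEXT_WORD.get(last_word)


def generate(prompt, max_new_words=10):
    """Call the model repeatedly, appending each generated word to the input."""
    text = prompt
    for _ in range(max_new_words):
        next_word = predict_next_word(text)
        if next_word is None:  # the toy model has nothing more to say
            break
        text = f"{text} {next_word}"
    return text


generated = generate("The cat sat")
print(generated)
```

The `generate` function plays the role of the wrapper mentioned above: the caller passes a prompt once, and the loop feeds each output back in as the next input.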

---

### TODO
Introduce additional foundational concepts such as:
- Deep neural networks
- Transformer layers
- Attention mechanisms
- Encoder-only, decoder-only, and encoder-decoder architectures
- Tokenizers
- Embeddings
- Difference between inference and training
  - During training, the entire input is processed at once, while during inference, text is generated one token at a time
- KV caching

*Note: Some of these concepts may be introduced at later stages in the notebook.*

---

# Text Processing

## What is an Embedding?

LLMs don't understand text the way humans do—they understand numbers. Therefore, we need to:
- Convert the input text to a numerical representation before passing it to the LLM
- Convert the numerical representation that the LLM outputs when it predicts the next word back to text

This numerical representation of text is called an **embedding**. An embedding is not a single integer or float, but rather a vector of floats. The length of this vector depends on the LLM architecture and can be 32, 64, 768, 4096, or more. The length of the embedding vector is called the **embedding dimension**.
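As a rough picture of what such a vector looks like, the sketch below builds one with an embedding dimension of 4. The names `EMBEDDING_DIM` and `random_embedding` are made up for this illustration, and the values are random placeholders; in a real model they are learned during training.

```python
import random

EMBEDDING_DIM = 4  # tiny for readability; real models use far larger values

random.seed(0)  # fixed seed so the placeholder values are repeatable


def random_embedding(dim):
    """Return a placeholder embedding: a list of `dim` floats in [-1, 1]."""
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


embedding = random_embedding(EMBEDDING_DIM)
print(embedding)       # a vector of floats
print(len(embedding))  # the embedding dimension
```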

## What to Embed: Entire Input Text or Individual Words?

If we want to pass `The cat sat` to the LLM, should we convert the entire text to a single embedding vector, or should we convert each word to its own embedding vector? This depends on the type of LLM being used, but most generative LLMs use *word-level* embeddings rather than sentence-level embeddings. So `The cat sat` would be converted to 3 embedding vectors, one for each word. Thus, as input, a generative LLM receives a list of embedding vectors, and as output it returns a list of 6 embedding vectors representing the generated text `on the mat and started playing` (where each word was generated auto-regressively, one by one, internally).

However, things are not quite that simple:
- First, spaces are not ignored as we might assume from word-level conversion
- Second, it's not actually *word-level* embeddings that LLMs use, but *token-level* subword embeddings. Here, a *token* can represent a word, subword, space, special character, or regular character

## Why Tokens Instead of Words?

Words are usually composed of root words and affixes. If we create an embedding vector for each complete word, we would need a very large number of embedding vectors. For example, `help`, `helper`, `helpless`, `helpful`, and `unhelpful` are 5 different words derived from the root word `help`.

Instead, if we split these words into subword tokens like `help`, `er`, `less`, `ful`, and `un`, we still have 5 embedding vectors, but now these subwords can be reused to represent many other words. This significantly reduces the number of embedding vectors needed for the training dataset. These subword units—`help`, `er`, `less`, `ful`, `un`—are called **tokens**.
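The reuse argument can be checked with a small sketch. The splits below are hand-picked for illustration (a real tokenizer learns its own splits from data), and `subword_tokens`, `word_splits`, and `new_word_splits` are hypothetical names.

```python
# Hand-picked subword tokens from the example above.
subword_tokens = ["help", "er", "less", "ful", "un"]

# Each word written as a sequence of those subword tokens.
word_splits = {
    "help": ["help"],
    "helper": ["help", "er"],
    "helpless": ["help", "less"],
    "helpful": ["help", "ful"],
    "unhelpful": ["un", "help", "ful"],
}

# Adding a single new root token covers several new words,
# because the affix tokens are reused.
subword_tokens.append("fear")
new_word_splits = {
    "fear": ["fear"],
    "fearful": ["fear", "ful"],
    "fearless": ["fear", "less"],
}

for word, pieces in {**word_splits, **new_word_splits}.items():
    # Every piece must be a known token, and the pieces must rebuild the word.
    assert all(piece in subword_tokens for piece in pieces)
    assert "".join(pieces) == word
    print(f"{word} -> {pieces}")

print(f"{len(subword_tokens)} tokens cover {len(word_splits) + len(new_word_splits)} words")
```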

## Token Vocabulary

If we collect all the unique tokens from the training dataset, we have a token **vocabulary**. Each token in the vocabulary is assigned a unique identifier called a **token ID**.
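A minimal sketch of this idea, assuming a toy token list instead of a real training dataset (`toy_tokens`, `toy_vocab`, and `toy_token_to_id` are illustrative names, and using the position in the sorted vocabulary as the token ID is just one simple choice):

```python
# Toy "dataset", already split into tokens.
toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]

# The vocabulary is the set of unique tokens, sorted for a stable order.
toy_vocab = sorted(set(toy_tokens))

# Assign each token its position in the vocabulary as its token ID.
toy_token_to_id = {token: token_id for token_id, token in enumerate(toy_vocab)}

print(toy_vocab)        # ['cat', 'mat', 'on', 'sat', 'the']
print(toy_token_to_id)  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```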

## Tokenizer: Converting Between Text and Token IDs

- A **tokenizer** converts text to token IDs. These token IDs are later converted to embedding vectors by a separate mechanism. LLMs use these input embedding vectors to generate a list of output embedding vectors, which are then converted back to token IDs. The **tokenizer** converts these output token IDs back into text.
- The process of converting text to token IDs is called **encoding**, while converting output token IDs back to text is called **decoding**. A **tokenizer** is capable of performing both encoding and decoding.
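Encoding and decoding can be sketched as a round trip through two lookup tables. This is a simplified illustration, not the tokenizer built later in the notebook: it splits on whitespace only, and `sketch_token_to_id`, `sketch_encode`, and `sketch_decode` are hypothetical names.

```python
# Hypothetical toy vocabulary: token -> token ID, plus the reverse mapping.
sketch_token_to_id = {"cat": 0, "mat": 1, "on": 2, "sat": 3, "the": 4}
sketch_id_to_token = {token_id: token for token, token_id in sketch_token_to_id.items()}


def sketch_encode(text):
    """Encoding: text -> list of token IDs (whitespace split for simplicity)."""
    return [sketch_token_to_id[token] for token in text.split()]


def sketch_decode(token_ids):
    """Decoding: list of token IDs -> text."""
    return " ".join(sketch_id_to_token[token_id] for token_id in token_ids)


round_trip_ids = sketch_encode("the cat sat on the mat")
print(round_trip_ids)                 # [4, 0, 3, 2, 4, 1]
print(sketch_decode(round_trip_ids))  # the cat sat on the mat
```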

## Loading the Training Dataset

Before diving deeper into how to convert text to tokens and then tokens to embedding vectors so we can pass them to an LLM, let's first load our dataset.

For building our LLM, we will use a very small dataset: a short story by **Edith Wharton** called **The Verdict**.

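Download the file content from https://en.wikisource.org/wiki/The_Verdict and save it as a file named **the-verdict.txt** in the notebook's working directory. If the file is missing, the cell below fails with a `FileNotFoundError`.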

In [32]:
# Load the file and read content

with open('the-verdict.txt', 'r', encoding='utf-8') as file:
  raw_text = file.read()

print(f'Total number of characters: {len(raw_text)}')
print(raw_text[:99])


FileNotFoundError: [Errno 2] No such file or directory: 'the-verdict.txt'

## Simple Token Vocabulary - Word Based

- Let's implement a simple token vocabulary in which each word is a token, with no fancy subword splitting.
- Special characters such as commas, periods, question marks, and exclamation marks are attached to words, so we split on them (and on whitespace) to get them as separate tokens. Whitespace-only pieces are then dropped from the vocabulary.

In [None]:
# Simple token vocab - v1 - word based

import re

tokens = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
vocab = sorted(set([token for token in tokens if token.strip()]))
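To see what the split produces, here is the same regex applied to a short made-up sentence (`sample_text` and `sample_tokens` are illustrative names). Because the pattern is wrapped in a capturing group, `re.split` keeps the separators in the result, which is how punctuation ends up as separate tokens.

```python
import re

sample_text = 'Hello, world. Is this-- a test?'

# Same pattern as above: the capturing group keeps the separators in the output.
pieces = re.split(r'([,.?_!"()\']|--|\s)', sample_text)

# Drop empty strings and whitespace-only pieces, as the vocabulary code does.
sample_tokens = [piece for piece in pieces if piece.strip()]

print(sample_tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```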

## Simple Tokenizer - v1 - Word Based

- Let's implement a simple text tokenizer called `SimpleTokenizerV1` using the word-based token vocabulary we created earlier.
- In `encode`, we split the text into tokens and map each token to its token ID. To stay consistent with the vocabulary, we split with the same regex we used to create it.
- In `decode`, we map each token ID back to its token, join the tokens with spaces, and then remove the space that the join puts before punctuation.

**Limitations:**
- Because the vocabulary contains only the words from the training dataset, this tokenizer raises an error (a `KeyError` from the dictionary lookup) *when it encounters an unknown word* during inference.
- Round trips are not exact. For the text `It's the last he painted.`, encoding and then decoding returns `It' s the last he painted.`, with an extra space before the `s` of `It's`. This happens because `decode` joins all tokens with spaces and only removes the space *before* punctuation, not the one after the apostrophe.

In [None]:
# Simple tokenizer v1 - words are tokens here
class SimpleTokenizerV1:
  def __init__(self, vocab):
    self.token_to_id = {token: id for id, token in enumerate(vocab)}
    self.id_to_token = {id: token for token, id in self.token_to_id.items()}

  def encode(self, text):
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    token_ids = [self.token_to_id[token] for token in tokens if token.strip()]
    return token_ids

  def decode(self, token_ids):
    tokens = [self.id_to_token[token_id] for token_id in token_ids]
    text = ' '.join(tokens)
    # Joining with spaces also puts a space before each punctuation token; remove it.
    text = re.sub(r'\s+([,.?_!"()\']|--)', r'\1', text)
    return text



In [None]:
# Testing the simple tokenizer v1.
tokenizer = SimpleTokenizerV1(vocab)
text_to_encode = 'It\'s the last he painted.'
token_ids = tokenizer.encode(text_to_encode)
print(f'Encoded token ids for {text_to_encode}: {token_ids}')

decoded_text = tokenizer.decode(token_ids)
print(f'Decoded text for {token_ids}: {decoded_text}')
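The unknown-word limitation listed earlier can be reproduced with a small self-contained sketch. It uses a hand-written toy vocabulary rather than the one built from the story, and `demo_token_to_id` and `demo_encode` are hypothetical names that mirror the `encode` logic above.

```python
import re

# Hand-written toy vocabulary; note that it does not contain the word "first".
demo_vocab = sorted({"It", "'", "s", "the", "last", "he", "painted", "."})
demo_token_to_id = {token: token_id for token_id, token in enumerate(demo_vocab)}


def demo_encode(text):
    """Same split-and-lookup logic as SimpleTokenizerV1.encode."""
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    return [demo_token_to_id[token] for token in tokens if token.strip()]


missing_token = None
try:
    demo_encode("It's the first he painted.")
except KeyError as error:
    # The dictionary lookup fails on the first token that is not in the vocabulary.
    missing_token = error.args[0]
    print(f"Unknown token: {missing_token!r}")
```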

## Simple Tokenizer - v2 - Handle Unknown Words

Previous version threw error when it got unknown