<a href="https://colab.research.google.com/github/antndlcrx/Oxford-Methods-Spring-School/blob/main/llm_fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://cdn.githubraw.com/antndlcrx/oss_2024/main/images/dpir_oss.png?raw=true:,  width=70" alt="My Image" width=500>

# **LLM Fundamentals**

Language models have demonstrated remarkable capabilities in generating and understanding human-like language, profoundly impacting modern NLP applications. Their success is primarily driven by two factors:

- **Architecture**: Modern language models use **transformer** neural network architecture (with attention mechanisms at its core), which enables (sub)words or tokens to dynamically adjust their meanings based on surrounding context. This architecture **allows models to capture nuanced linguistic relationships and adapt flexibly to different contexts**.

- **Scalability**: These models excel because they effectively scale to billions of parameters and learn from massive datasets. **Large datasets expose models to diverse linguistic patterns**, enriching their representations, while increasing the **number of parameters enables the capture of subtle, context-dependent language nuances**.

## 🗓️Outlook:

This session covers:

- **Language Modelling** (what a model is, what language is, and how to build a model of language)
- **Tokenisation** (how to process text)
- **Architecture** (how to build a model)
- **Inference** (given a model, how to generate text)




In [None]:
#@title Download Data for this Session

import requests

GUTENBERG_URLS = {
    "pride_and_prejudice.txt": "https://www.gutenberg.org/files/1342/1342-0.txt",
    "sense_and_sensibility.txt": "https://www.gutenberg.org/files/161/161-0.txt",
    "mansfield_park.txt": "https://www.gutenberg.org/files/141/141-0.txt"
}

DELIMITER = "\n<|endoftext|>\n\n"
COMBINED_FILENAME = "austen_combined.txt"

def download_file(filename, url):
    """Download a file and save it locally."""
    print(f"Downloading {filename}...")
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, "w", encoding="utf-8") as f:
        f.write(response.text)

def combine_files(file_list, output_file, delimiter):
    """Combine a list of files into one, separated by a delimiter."""
    with open(output_file, "w", encoding="utf-8") as outfile:
        for fname in file_list:
            with open(fname, "r", encoding="utf-8") as infile:
                text = infile.read().strip()
                outfile.write(text + delimiter)


for fname, url in GUTENBERG_URLS.items():
    download_file(fname, url)
combine_files(GUTENBERG_URLS.keys(), COMBINED_FILENAME, DELIMITER)


# with open("austen_combined.txt", "r", encoding="utf-8") as f:
#     raw_text = f.read()

# len(raw_text)

Downloading pride_and_prejudice.txt...
Downloading sense_and_sensibility.txt...
Downloading mansfield_park.txt...


## **1**.&nbsp; What is a Language Model?

### 🔮 **Language Modeling Objective**

Language modeling is a fundamental task in natural language processing (NLP), where the goal is to predict the next word in a sequence based on the preceding words. In other words, a language model aims to estimate the probability distribution of word sequences in a given language. Training a language model involves maximizing the likelihood that the model assigns to actual word sequences observed in the training data.

```perl
Predict next word based on previous context

Given a word sequence:
      w₁ → w₂ → w₃ → ... → wₙ₋₁ → wₙ
       │    │    │            │
       ▼    ▼    ▼            ▼
Predict:   Predict:   Predict:          Predict:
   w₂         w₃         w₄                 wₙ
given:      given:      given:            given:
  w₁       w₁,w₂      w₁,w₂,w₃        w₁,w₂,...,wₙ₋₁
```

**Mathematical Description**

Formally, given a sequence of words $(w_1, w_2, \dots, w_N)$, the language modeling objective is to maximize the joint probability of the entire sequence. Using the chain rule of probability, this joint probability can be decomposed into a product of conditional probabilities:

$$
P(w_1, w_2, \dots, w_N) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \dots P(w_N \mid w_1, w_2, \dots, w_{N-1})
$$

In practice, a language model predicts each word $w_n$ based solely on the preceding words, thus learning conditional probabilities:

$$
P(w_n \mid w_1, w_2, \dots, w_{n-1})
$$
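
To make the chain rule concrete, here is a minimal sketch that assumes a count-based bigram model: each word is conditioned only on the single previous word, which is a simplification of the full conditioning shown above. The toy corpus, the `<s>` start marker, and the helper names `cond_prob` and `sequence_prob` are made up for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical); "<s>" marks the start of each sentence.
corpus = [
    ["<s>", "the", "cat", "sat"],
    ["<s>", "the", "dog", "sat"],
    ["<s>", "the", "cat", "ran"],
]

# Count how often each word follows each previous word.
bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def cond_prob(word, prev):
    """Estimate P(word | prev) from counts (0.0 if prev was never seen)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sequence_prob(words):
    """Chain rule: multiply the conditional probability of each word."""
    prob = 1.0
    for prev, word in zip(["<s>"] + words, words):
        prob *= cond_prob(word, prev)
    return prob

# P(the | <s>) * P(cat | the) * P(sat | cat) = 1 * 2/3 * 1/2
print(sequence_prob(["the", "cat", "sat"]))
```

A sequence containing a word pair that never occurs in the corpus, such as `["the", "dog", "ran"]`, gets probability zero under this sketch, which is one reason neural language models that generalise across contexts are preferred.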

---

### 📉**Finding a Good Model: Negative Log Likelihood**
To effectively train language models, the objective is typically framed as minimizing the **negative log-likelihood** of these probabilities over the training corpus:

$$
\text{Minimize: } -\log P(w_1, w_2, \dots, w_N) = -\sum_{n=1}^{N} \log P(w_n \mid w_1, w_2, \dots, w_{n-1})
$$

- **Likelihood**: how probable is our data given the current model. We want our language model to assign high probabilities to the actual words seen during training.
- **Why Log?**: The product of many probabilities (each at most 1) shrinks quickly toward zero and can underflow floating-point precision. **Taking a logarithm transforms these multiplications into sums, which are numerically stable and computationally convenient**. Instead of multiplying many small probabilities, we add their logarithms.
- **Why Negative?**: Our goal is to maximize likelihood, but mathematically it is more convenient to frame optimization problems as minimization. Thus, we minimize the negative of the log likelihood. **Minimizing the negative log likelihood is equivalent to maximizing the likelihood**.

This negative log-likelihood measure is computationally convenient and helps the model learn meaningful linguistic patterns by penalizing low-probability predictions.
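
The numerical argument for the logarithm can be checked directly. The sketch below assumes a hypothetical model that assigns probability 0.01 to each of 200 observed tokens; the numbers are illustrative, not taken from the models in this notebook.

```python
import math

# Hypothetical per-token probabilities assigned to the observed tokens.
token_probs = [0.01] * 200

# Multiplying the probabilities directly underflows in 64-bit floats.
product = 1.0
for p in token_probs:
    product *= p

# Summing log-probabilities stays in a comfortable numeric range.
nll = -sum(math.log(p) for p in token_probs)

print(product)  # 0.0: the true value 1e-400 is too small to represent
print(nll)      # roughly 921, i.e. 200 * log(100)
```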

```perl
Minimize negative log-likelihood:
  
− log [P(w₁, w₂, ..., wₙ)]
           │
           ▼
= − [log P(w₁) + log P(w₂|w₁) + log P(w₃|w₁,w₂) + ... + log P(wₙ|w₁,...,wₙ₋₁)]
```
---

### 🤔**Evaluation: Perplexity**
**Perplexity** is a measure used to evaluate how well a language model predicts unseen text. Intuitively, it answers the question: "How many equally likely words is my model choosing between?"

Formally, perplexity is defined as:

$$
\text{Perplexity} = e^{-\frac{1}{N}\sum_{n=1}^{N}\log P(w_n|w_1,\dots,w_{n-1})}
$$

The exponent is the **average negative log likelihood** (or cross-entropy) per token. Taking the exponential **transforms the log-scale back to a normal scale**, giving us a measure that's intuitive to interpret as an effective "branching factor."

- **Lower perplexity** → the model is confident and accurate, fewer "choices" per step.
- **Higher perplexity** → the model is uncertain, predicting many possible next words.

🧠 Quick Intuitive Example:
- A perplexity of 1000 means the model is roughly guessing among 1000 possible words for every prediction — a poor model.

- A perplexity of 10 means the model consistently narrows down to about 10 possible words — a much better model.
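
The "branching factor" reading can be verified with a small sketch. It assumes we already have the probability the model assigned to each observed token; the function name `perplexity` and the probability lists are illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always spreads its probability evenly over 10 words
# has a perplexity of (approximately, up to rounding) 10.
uniform = perplexity([1 / 10] * 50)

# With varying probabilities, perplexity is the inverse of their geometric mean.
mixed = perplexity([0.5, 0.25, 0.5, 0.25])

print(uniform, mixed)
```

For comparison, the validation loss reported by a training loop that uses cross-entropy is the average negative log likelihood, so its exponential gives the validation perplexity.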

> 📖 For more:
- [Language Modelling NLP Course for You](https://lena-voita.github.io/nlp_course/language_modeling.html)
- [Perplexity of fixed-length models by 🤗](https://huggingface.co/docs/transformers/en/perplexity)


## **2**.&nbsp; 🍓 **Tokenisation**


### **Motivation**

**Tokenization** is the critical step of **converting human-readable text into numerical representations that language models can process**.

After defining language modeling as predicting words from context, the natural next question is: **how exactly do we represent words numerically**? This is precisely the role of tokenization, bridging the complexity of natural language to structured numeric inputs suitable for neural networks.

Tokenization serves several key purposes:

1. **Reduction of Vocabulary Size**:
Natural languages are vast, filled with misspellings, colloquialisms, technical jargon, and new words constantly emerging. Tokenization condenses this enormous diversity into a fixed, manageable vocabulary, making computations feasible.

2. **Efficient Computation**:
Transforming words (or subwords) into numeric indices lets models efficiently perform mathematical operations required by neural networks.

3. **Meaningful Representation**:
Subword tokenization methods (like Byte-Pair Encoding or WordPiece) break words into reusable parts, so related word forms can share tokens. This lets models generalize across related terms or word forms, even if the model hasn’t explicitly encountered them during training.

### **Toy Example**

In [None]:
# load data
with open("austen_combined.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

len(raw_text)

2302675

In [None]:
#@title Clean Up the Raw Text
import re

def clean_gutenberg_text(text):
    """
    Cleans Gutenberg text by removing header, footer, and metadata.
    """
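    # Remove header: keep only the text after the last START marker.
    # Note: on the combined file this keeps just the final book.
    text = re.split(r"\*\*\* START OF (THE|THIS) PROJECT GUTENBERG EBOOK .* \*\*\*", text, flags=re.IGNORECASE)[-1]

    # Remove footer: keep only the text before the first END marker
    text = re.split(r"\*\*\* END OF (THE|THIS) PROJECT GUTENBERG EBOOK .* \*\*\*", text, flags=re.IGNORECASE)[0]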

    # Remove illustration tags and bracketed contents
    text = re.sub(r"\[Illustration.*?\]", "", text, flags=re.DOTALL)

    # Remove "Contents" and chapter listings (ToC)
    text = re.split(r"Contents\n\n", text, flags=re.IGNORECASE)
    if len(text) > 1:
        text = re.split(r"\n{2,}(CHAPTER\s+I\b)", text[1], flags=re.IGNORECASE)
        text = "".join(text[-2:]) if len(text) >= 2 else text[-1]
    else:
        text = text[0]

    # Remove excessive newlines and whitespace
    text = re.sub(r"\n{2,}", "\n\n", text)

    # Strip leading and trailing whitespace
    text = text.strip()

    return text

cleaned_text = clean_gutenberg_text(raw_text)

In [None]:
words = cleaned_text.split()
len(words)

159527

In [None]:
word_set = sorted(set(words))
vocab = {el:i for i, el in enumerate(word_set)}

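class SimpleTokenizer():
    def __init__(self, train_text):
        # vocabulary: every whitespace-separated string seen in training
        self.word_set = sorted(set(train_text.split()))
        self.vocab = {el:i for i, el in enumerate(self.word_set)}
        # ids run from 0 to len(word_set) - 1, so the next free id is len(word_set)
        self.vocab["<unk>"] = len(self.word_set)
        self.inverse_vocab = {i:el for el, i in self.vocab.items()}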

    def encode(self, text:str):
        token_ids = [self.vocab.get(x, self.vocab["<unk>"]) for x in text.split()]
        return token_ids

    def decode(self, token_ids: list[int]):
        words = [self.inverse_vocab[x] for x in token_ids]
        return " ".join(words)


tokenizer = SimpleTokenizer(cleaned_text)

In [None]:
cleaned_text[:150]

'CHAPTER I\n\nAbout thirty years ago Miss Maria Ward, of Huntingdon, with only seven\nthousand pounds, had the good luck to captivate Sir Thomas Bertram, '

In [None]:
toks = tokenizer.encode("I love going to University parks, it is just so beautiful there!")
tokenizer.decode(toks)

'I love going to <unk> <unk> it is just so beautiful there!'

In [None]:
#@title Build a Toy Tokenizer
# tokenize text

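class SimpleTokenizer():
    def __init__(self, train_text):
        # vocabulary: every whitespace-separated string seen in training
        self.word_set = sorted(set(train_text.split()))
        self.vocab = {el:i for i, el in enumerate(self.word_set)}
        # ids run from 0 to len(word_set) - 1, so the next free id is len(word_set)
        self.vocab["<unk>"] = len(self.word_set)
        self.inverse_vocab = {i:el for el, i in self.vocab.items()}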

    def encode(self, text:str):
        token_ids = [self.vocab.get(x, self.vocab["<unk>"]) for x in text.split()]
        return token_ids

    def decode(self, token_ids: list[int]):
        words = [self.inverse_vocab[x] for x in token_ids]
        return " ".join(words)

tokenizer = SimpleTokenizer(cleaned_text)

test = tokenizer.encode(cleaned_text[:995])
test_decoded = tokenizer.decode(test)

test = tokenizer.encode("I like walking my dog in the evenings in the University park where sunsets are just so beautiful.")
test_decoded = tokenizer.decode(test)
print(test_decoded)

I like walking my <unk> in the evenings in the <unk> park where <unk> are just so <unk>


### 🔗 **Byte-Pair Encoding (BPE)**

A common challenge in language modeling is dealing with words that weren't present in the training data. **Byte-Pair Encoding (BPE)**, originally a data-compression algorithm and adapted to NLP by [Sennrich et al. (2015)](https://arxiv.org/abs/1508.07909), addresses this by breaking words into smaller subword units. The key idea is simple yet powerful: starting from individual bytes or characters, it iteratively merges the most frequent pair of adjacent symbols in the training corpus into a new vocabulary entry. Because any word can still be spelled out from smaller pieces, BPE allows models to handle unseen or rare words instead of mapping them to `<unk>`.
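
The merge loop can be sketched in a few lines. This is a simplified illustration, not how `tiktoken` or SentencePiece are implemented: it works on characters rather than bytes, uses no end-of-word marker, and breaks ties between equally frequent pairs arbitrarily. The function names and the toy word list are assumptions made for this example.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start with every word split into single characters.
    word_freqs = dict(Counter(tuple(w) for w in words))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, segmented = train_bpe(words, num_merges=3)
print(merges)
print(segmented)
```

After a few merges, frequent endings such as `est` become single symbols, while rarer words remain split into smaller pieces.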

 > 📖 For an in-depth exploration of BPE, check out the [Hugging Face NLP Course (Chapter 6)](https://huggingface.co/learn/nlp-course/en/chapter6/5), or watch [Andrej Karpathy's "Let's Build a GPT Tokenizer" video](https://www.youtube.com/watch?v=zduSFxRajkE).


> 📚 Several libraries implement BPE:

- [**Tiktoken** by OpenAI](https://github.com/openai/tiktoken)
- [**SentencePiece** by Google](https://github.com/google/sentencepiece)

🛠️ Try exploring how tokenizers process text directly in the [Tiktokenizer app](https://tiktokenizer.vercel.app/).



In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
   1.2/1.2 MB 15.4 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [None]:
import tiktoken
bpe_tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
bpe_tokenizer.n_vocab

50257

In [None]:
bpe_tokenizer.decode([0])

'!'

In [None]:
tokens = bpe_tokenizer.encode("I like walking my dog in the evenings in the University park where sunsets are just so beautiful.")
print(tokens)
bpe_tokenizer.decode(tokens)

[40, 588, 6155, 616, 3290, 287, 262, 37119, 287, 262, 2059, 3952, 810, 4252, 28709, 389, 655, 523, 4950, 13]


'I like walking my dog in the evenings in the University park where sunsets are just so beautiful.'

### 🔧**Torch DataLoader and Dataset**

In PyTorch, a `Dataset` provides an organized way of accessing and managing your data, while a `DataLoader` handles batching, shuffling, and efficiently loading data during training. Specifically, a `Dataset` defines how individual data samples (inputs and labels) are accessed, while a `DataLoader` wraps around it to deliver batches seamlessly to your model. Together, they simplify data management, enhance training speed, and help ensure reproducible and robust training pipelines.

We use these tools to create a flow of input-target pairs of tokens to train a language model.

Suppose we have a sequence of tokens:

```ini
token_ids = [t₀, t₁, t₂, t₃, ..., tₙ₋₂, tₙ₋₁, tₙ]
```

We construct training examples by defining a context length (for example `context_len=4`) and sliding a window over the tokens with a given step (`stride=2`) as follows:

```less
Iteration 1:
    Input (X):   [t₀,   t₁,   t₂,   t₃]
    Target (Y):  [t₁,   t₂,   t₃,   t₄]

Iteration 2 (stride forward by 2):
    Input (X):   [t₂,   t₃,   t₄,   t₅]
    Target (Y):  [t₃,   t₄,   t₅,   t₆]

Iteration 3:
    Input (X):   [t₄,   t₅,   t₆,   t₇]
    Target (Y):  [t₅,   t₆,   t₇,   t₈]

...
```

until no full sequences remain.

The **context window** is the number of tokens an LLM considers simultaneously when predicting the next token. A **longer context window gives the model more information and improves its ability to capture meaningful relationships**, but at the cost of increased computational requirements.

The **stride determines how much the context window moves forward between each training example**. A smaller stride creates more overlapping examples, increasing the amount of training data but also introducing redundancy. A larger stride reduces overlap and speeds up data preparation but can reduce the diversity of training examples. Choosing these parameters involves balancing model performance, computational efficiency, and the richness of training data.
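
The windowing scheme can be tried on plain Python lists before wrapping it in a `Dataset`. In this sketch the helper name `sliding_windows` is hypothetical, and the integers 0 to 9 stand in for real token ids.

```python
def sliding_windows(token_ids, context_len, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    # Stop early enough that every target window is still full length.
    for start in range(0, len(token_ids) - context_len, stride):
        x = token_ids[start : start + context_len]
        y = token_ids[start + 1 : start + 1 + context_len]
        pairs.append((x, y))
    return pairs

token_ids = list(range(10))  # stand-in for real token ids t0..t9
windows = sliding_windows(token_ids, context_len=4, stride=2)
for x, y in windows:
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Setting `stride` equal to `context_len` yields non-overlapping windows, while `stride=1` yields the maximum number of (heavily overlapping) examples.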

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader, random_split

In [None]:
len(bpe_tokenizer.encode(cleaned_text))

223694

In [None]:
#@title Create Dataset and DataLoader

class CustomDataset(Dataset):
    def __init__(self, text, tokenizer, context_len, stride):
        super().__init__()
        self.Y = []
        self.X = []

        input_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        for i in range(0, len(input_ids) - context_len, stride):
            xids = input_ids[i: i + context_len]
            yids = input_ids[i + 1: i + 1 + context_len]

            self.X.append(torch.tensor(xids))
            self.Y.append(torch.tensor(yids))

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

ds = CustomDataset(cleaned_text, bpe_tokenizer, context_len=64, stride=64)


### Train Val Split ###
dataset_size = len(ds)
train_size = int(0.9 * dataset_size)
val_size = dataset_size - train_size

print(f"Train size: {train_size}; Val size: {val_size}")

generator = torch.Generator().manual_seed(42)
train_ds, val_ds = random_split(ds, [train_size, val_size], generator=generator)

### Create DataLoaders ###
train_loader = DataLoader(
    train_ds,
    batch_size=64,
    shuffle=True,  # shuffle for training
    drop_last=True,
    num_workers=0
)

val_loader = DataLoader(
    val_ds,
    batch_size=64,
    shuffle=False,  # no need to shuffle for validation
    drop_last=True,
    num_workers=0
)

Train size: 3145; Val size: 350


In [None]:
torch.manual_seed(42)
for i, (x,y) in enumerate(train_loader):
    print(f'batch {i}:',"\n", x, "\n", y)
    print(f'batch {i}:',"\n", x.shape, "\n", y.shape)
    break

batch 0: 
 tensor([[  198,   198, 22788,  ..., 32649,   507,   286],
        [  286,  2279,   198,  ...,   286, 12921,   475],
        [ 3860,   286,   257,  ..., 29023,   540,   284],
        ...,
        [  866,   355,   262,  ...,  3675,   477,   198],
        [  976,   661,   284,  ...,   502,   922,   357],
        [ 1122,   198, 19188,  ...,   290,  1583,    13]]) 
 tensor([[  198, 22788,  5658,  ...,   507,   286,   262],
        [ 2279,   198,  7091,  ..., 12921,   475,   198],
        [  286,   257,   614,  ...,   540,   284,    13],
        ...,
        [  355,   262,   198,  ...,   477,   198,  9948],
        [  661,   284,   804,  ...,   922,   357,  1219],
        [  198, 19188,   307,  ...,  1583,    13, 12181]])
batch 0: 
 torch.Size([64, 64]) 
 torch.Size([64, 64])


In [None]:
torch.manual_seed(42)

for i, (x, y) in enumerate(train_loader):
    print(f"Batch {i} (shape: {x.shape})")

    # Decode the first few examples in the batch
    for idx in range(5):  # Show first 5 samples
        input_text = bpe_tokenizer.decode(x[idx].tolist())
        target_text = bpe_tokenizer.decode(y[idx].tolist())

        print(f"\nSample {idx + 1}:")
        print("Input :", input_text)
        print("Target:", target_text)

    break

Batch 0 (shape: torch.Size([64, 64]))

Sample 1:
Input : 

Sir Thomas, meanwhile, went on with his own hopes and his own
observations, still feeling a right, by all his knowledge of human
nature, to expect to see the effect of the loss of power and
consequence on his niece’s spirits, and the past attentions of
Target: 
Sir Thomas, meanwhile, went on with his own hopes and his own
observations, still feeling a right, by all his knowledge of human
nature, to expect to see the effect of the loss of power and
consequence on his niece’s spirits, and the past attentions of the

Sample 2:
Input :  of everything
she was wishing for. Edmund would be forgiven for being a clergyman, it
seemed, under certain conditions of wealth; and this, she suspected,
was all the conquest of prejudice which he was so ready to congratulate
himself upon. She had only learnt to think nothing of consequence but
Target:  everything
she was wishing for. Edmund would be forgiven for being a clergyman, it
seemed, under

## **3**.&nbsp; **Building a Transformer Language Model**

- Input Embeddings
- Base LM
- Transformer Architecture
- Text Generation Params

### **3. 1**.&nbsp; **Input Embeddings**

An embedding layer is a look-up operation: we get the representation of a token by indexing into the layer's weight matrix with the token id.

In transformer LMs, we encode both the individual tokens themselves and their positions in the sequence.
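
A minimal sketch of how the two encodings combine, assuming learned positional embeddings that are simply added to the token embeddings (one common choice among several). The sizes and token ids are toy values chosen for illustration.

```python
import torch

torch.manual_seed(0)

vocab_size, context_len, n_embd = 10, 4, 3  # toy sizes for illustration
tok_embd = torch.nn.Embedding(vocab_size, n_embd)   # one vector per token id
pos_embd = torch.nn.Embedding(context_len, n_embd)  # one vector per position

token_ids = torch.tensor([[7, 2, 7, 5]])                    # (batch=1, seq_len=4)
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)   # [[0, 1, 2, 3]]

x = tok_embd(token_ids) + pos_embd(positions)  # (1, 4, 3)

# Token 7 appears at positions 0 and 2: the token vectors are identical,
# but the final inputs differ because the position vectors differ.
same_token_vec = torch.equal(tok_embd(token_ids)[0, 0], tok_embd(token_ids)[0, 2])
same_input_vec = torch.equal(x[0, 0], x[0, 2])
print(x.shape, same_token_vec, same_input_vec)
```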

In [None]:
#@title Embedding Layer Showcase
torch.manual_seed(42)

embd_layer = torch.nn.Embedding(5, 2)
x = torch.tensor([4, 3, 2, 1, 0])
print(embd_layer.weight, "\n","\n", embd_layer(x))

Parameter containing:
tensor([[ 0.3367,  0.1288],
        [ 0.2345,  0.2303],
        [-1.1229, -0.1863],
        [ 2.2082, -0.6380],
        [ 0.4617,  0.2674]], requires_grad=True) 
 
 tensor([[ 0.4617,  0.2674],
        [ 2.2082, -0.6380],
        [-1.1229, -0.1863],
        [ 0.2345,  0.2303],
        [ 0.3367,  0.1288]], grad_fn=<EmbeddingBackward0>)


In [None]:
#@title Build a Base LM
class LanguageModel(torch.nn.Module):
    def __init__(self, n_embd, n_hidden, tokenizer, device):
        super().__init__()
        self.n_embd = n_embd
        self.n_hidden = n_hidden
        self.tokenizer = tokenizer
        self.device = device

        self.embd = torch.nn.Embedding(self.tokenizer.n_vocab, self.n_embd)
        self.rnn = torch.nn.RNN(self.n_embd, self.n_hidden, batch_first=True)
        self.out = torch.nn.Linear(self.n_hidden, self.tokenizer.n_vocab)

    def forward(self, x, hidden=None):
        x = self.embd(x)  # [batch, context_len, n_embd]
        x, hidden = self.rnn(x, hidden)  # x is [batch, context_len, n_hidden]
        logits = self.out(x)  # [batch, context_len, vocab_size]
        return logits

    def fit(self, train_loader, val_loader=None, epochs=10, lr=1e-3):
        self.to(self.device)
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(epochs):
            self.train()
            total_train_loss = 0.0

            for X, Y in train_loader:
                X, Y = X.to(self.device), Y.to(self.device)

                logits = self(X)
                logits = logits.view(-1, self.tokenizer.n_vocab)
                Y = Y.view(-1)
                loss = loss_fn(logits, Y)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_train_loss += loss.item()

            average_loss = total_train_loss / len(train_loader)

            if val_loader is not None:
                self.eval()
                total_val_loss = 0.0
                with torch.no_grad():
                    for Xv, Yv in val_loader:
                        Xv, Yv = Xv.to(self.device), Yv.to(self.device)
                        val_logits = self(Xv)
                        val_logits = val_logits.view(-1, self.tokenizer.n_vocab)
                        Yv = Yv.view(-1)

                        val_loss = loss_fn(val_logits, Yv)
                        total_val_loss += val_loss.item()

                avg_val_loss = total_val_loss / len(val_loader)

                print(f"Epoch [{epoch+1}/{epochs}]"
                      f"  Train Loss: {average_loss:.3f}"
                      f"  |  Val Loss: {avg_val_loss:.3f}")
            else:
                print(f"Epoch [{epoch+1}/{epochs}]"
                      f"  Train Loss: {average_loss:.3f}")

    def generate(self, prompt, max_new_tokens):
        self.eval()
        input_ids = self.tokenizer.encode(prompt)
        input_ids = torch.tensor(input_ids).to(self.device).unsqueeze(0) # batch dim

        for i in range(max_new_tokens):
            with torch.no_grad():
                logits = self(input_ids) # [batch_size, seq_len, vocab_size]

            last_logit = logits[:, -1, :]
            probs = torch.softmax(last_logit, dim=-1)

            next_token = torch.multinomial(probs, num_samples=1) # sample 1 token from probs
            input_ids = torch.cat((input_ids, next_token), dim=1) # add new token id to sequence of input_ids

        generated_tokens = input_ids[0].tolist()  # remove batch dimension
        return self.tokenizer.decode(generated_tokens)

### training ###

## uncomment in case CUDA out of memory
# del model
# torch.cuda.empty_cache()

config = {
    "tokenizer": bpe_tokenizer,
    # fall back to CPU when no GPU is available
    "device": "cuda" if torch.cuda.is_available() else "cpu"
}

model = LanguageModel(n_embd=32, n_hidden=32, **config)
model.fit(train_loader, val_loader, epochs=5, lr=0.01)


### inference ###
print(model.generate("I went out with Lady Elizabeth to the meadows", max_new_tokens=32))

Epoch [1/5]  Train Loss: 7.307  |  Val Loss: 6.392
Epoch [2/5]  Train Loss: 6.095  |  Val Loss: 6.054
Epoch [3/5]  Train Loss: 5.781  |  Val Loss: 5.832
Epoch [4/5]  Train Loss: 5.578  |  Val Loss: 5.699
Epoch [5/5]  Train Loss: 5.435  |  Val Loss: 5.613
I went out with Lady Elizabeth to the meadows will
was
 behaviour hadver, hisrive was having never smiles, if as the society of equal as fearing of to distract beconf Crawford ought.�


In [None]:
#@title Exercise: Play around with the model

# experiment with prompts, try training the model with different configurations (be careful with CUDA out of memory issue!)
# what is your impression? what do you notice?

### **3. 2**.&nbsp; **Transformer Architecture**

Nearly all SOTA LLMs and their predecessors are variants of the **transformer** architecture, a neural network architecture that relies mainly on attention. It was first introduced by [Vaswani et al. 2017](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).

The main idea of the transformer is to **create a neural network that understands the meaning of each word by looking directly at every other word in the sentence at the same time**, rather than reading words one-by-one. This allows the network to better grasp context and relationships between words, significantly improving its ability to process language effectively.

> 🔧Explore the model via online visualisation tools:
- [LLM visualisation](https://bbycroft.net/llm).
- [Transformer Explainer](https://poloclub.github.io/transformer-explainer/).
- [Classic: Illustrated Transformer by Jay Alammar](https://jalammar.github.io/illustrated-transformer/).
- [Illustrated GPT-2](https://jalammar.github.io/illustrated-gpt2/).

### **Attention**

**Attention** is the main architectural component of the transformer. It allows tokens to update their representations by learning from the other tokens in the sequence. This gives transformer LMs nuanced, context-dependent representations of words and text sequences.

To implement this, the attention mechanism computes three vectors for each token representation in the sequence:

- **Query**: A vector summarising what information the token is "looking for".
- **Key**: A vector used to "index" a token, so that it can be matched against queries.
- **Value**: A vector holding the actual "content" of the token representation. It gets "picked up" in proportion to how well the token's key matches a given query.

$$
\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
$$


Dot Product: a mathematical operation that combines two vectors and yields a scalar value. The dot product is a measure of similarity between vectors, as it quantifies how closely two vectors are aligned (higher means more aligned).
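
A small sketch of the dot product as a similarity score, using made-up three-dimensional vectors. The helper names `dot` and `cosine` are illustrative. Note that the raw dot product also grows with vector length, which is why the cosine (the dot product of the normalised vectors) is shown alongside it.

```python
import math

def dot(u, v):
    """Sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product after normalising both vectors to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [1.0, 2.0, 0.0]
aligned = [2.0, 4.0, 0.0]     # same direction as the query
orthogonal = [0.0, 0.0, 3.0]  # shares no direction with the query
opposite = [-1.0, -2.0, 0.0]  # points the other way

print(dot(query, aligned))     # 10.0
print(dot(query, orthogonal))  # 0.0
print(dot(query, opposite))    # -5.0
print(cosine(query, aligned))  # approximately 1.0
```

In the attention formula, the entries of $QK^T$ are exactly such dot products between each query and each key.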

In [None]:
import torch.nn.functional as F

In [None]:
torch.manual_seed(42)
# create a toy batch of data
x = torch.randn((2, 6, 10)) # 2 sequences, 6 tokens each, embedding dimension 10
B, T, C = x.shape # batch, sequence length, embedding dimension
head_size = 5 # attention dimension
# linear projections that produce queries, keys and values
# (no causal mask here: every token can attend to every other token)
q_proj = torch.nn.Linear(C, head_size)
k_proj = torch.nn.Linear(C, head_size)
v_proj = torch.nn.Linear(C, head_size)

Q = q_proj(x) # (B, T, C) @ (C, head_size) -> (B, T, head_size)
K = k_proj(x)
V = v_proj(x)

QK = (Q @ K.transpose(-2, -1)) / head_size**0.5 # scaled scores, (B, T, T)
att = torch.softmax(QK, dim=-1) # each row sums to 1
att = att @ V # weighted sum of values, (B, T, head_size)

In [None]:
x.shape[-1]

10

In [None]:
class SelfAttention(torch.nn.Module):
    def __init__(self, n_embd, head_size, context_len):
        """
        Single-head self-attention
        n_embd : embedding dimension (i.e. the input feature size)
        head_size : dimension for this particular head
        context_len : maximum sequence length (for constructing the causal mask)
        """
        super().__init__()
        self.head_size = head_size

        self.Q = torch.nn.Linear(n_embd, head_size, bias=False)
        self.K = torch.nn.Linear(n_embd, head_size, bias=False)
        self.V = torch.nn.Linear(n_embd, head_size, bias=False)

        # mask to hide "future" tokens (we only attend to current and previous tokens)
        self.register_buffer("tril", torch.tril(torch.ones(context_len, context_len)))

    def forward(self, x):
        B,T,C = x.shape

        q = self.Q(x) # (B, T, C) @ (C, head_size) -> (B, T, head_size)
        k = self.K(x)
        v = self.V(x)

        # compute attention scores:
        # shape: (B, T, head_size) @ (B, T, head_size).T -> (B, T, T)
        # but we need the last dimension to match so we do a transpose on K:
        att_weight = q @ k.transpose(-2, -1) # (B, T, T)

        # causal mask
        att_weight = att_weight.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

        # scale
        att_weight = att_weight / k.shape[-1]**0.5

        # normalize over last dimension
        att_weight = torch.softmax(att_weight, dim=-1)

        # weighted sum over V
        att_weight = att_weight @ v
        return att_weight
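
To see what the causal mask in the class above does, the following sketch applies the same `tril` / `masked_fill` / `softmax` steps to a toy score matrix. It assumes all raw scores are equal (zeros), so that the effect of the mask alone is visible.

```python
import torch

T = 4
scores = torch.zeros(T, T)           # pretend every query-key score is equal
tril = torch.tril(torch.ones(T, T))  # ones on and below the diagonal
masked = scores.masked_fill(tril == 0, float("-inf"))  # hide future positions
weights = torch.softmax(masked, dim=-1)
print(weights)
```

Row `i` spreads its weight evenly over positions `0..i` and puts exactly zero weight on later positions, because `exp(-inf)` is 0 inside the softmax.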

### **3. 3**.&nbsp; **Multi-Head Attention**



Words can relate to each other in multiple ways in a text: they can convey grammatical relationships or different facets of meaning. For that reason, a transformer does not use a single attention mechanism to build the representation of a text sequence. Instead, it uses multiple attention mechanisms (heads) to capture different relationships at the same time.

In implementation, we just split the attention stage into multiple chunks that run in parallel and independently, and then concatenate their results to get the final (for that stage) representation of the sequence.

In [None]:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, n_embd, n_heads, context_len):
        """
        Multi-head self-attention
        n_embd : total embedding dimension
        n_heads : how many separate attention heads
        context_len : for the causal mask
        """
        super().__init__()
        head_size = n_embd // n_heads

        self.heads = torch.nn.ModuleList(
            [SelfAttention(n_embd, head_size, context_len)
            for _ in range(n_heads)]
        )

        self.proj = torch.nn.Linear(n_heads * head_size, n_embd)

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj(out)
        return out
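
The class above loops over separate head modules, which is easy to read. A common alternative, sketched here as an assumption rather than as part of this notebook's model, computes all heads in one batched operation by reshaping the embedding dimension into `(n_heads, head_size)`. The sizes are toy values.

```python
import torch

torch.manual_seed(0)
B, T, n_embd, n_heads = 2, 5, 8, 4
head_size = n_embd // n_heads

x = torch.randn(B, T, n_embd)
qkv = torch.nn.Linear(n_embd, 3 * n_embd, bias=False)  # Q, K, V for all heads at once
proj = torch.nn.Linear(n_embd, n_embd)                 # output projection

q, k, v = qkv(x).chunk(3, dim=-1)  # each (B, T, n_embd)

def split_heads(t):
    # (B, T, n_embd) -> (B, n_heads, T, head_size)
    return t.view(B, T, n_heads, head_size).transpose(1, 2)

q, k, v = split_heads(q), split_heads(k), split_heads(v)

scores = q @ k.transpose(-2, -1) / head_size**0.5   # (B, n_heads, T, T)
mask = torch.tril(torch.ones(T, T)) == 0            # True where a token is in the future
scores = scores.masked_fill(mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)

out = weights @ v                                          # (B, n_heads, T, head_size)
out = out.transpose(1, 2).contiguous().view(B, T, n_embd)  # concatenate the heads
out = proj(out)
print(out.shape)  # torch.Size([2, 5, 8])
```

Each head still attends independently; the reshape only changes how the computation is laid out in memory.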


In [None]:
#@title Remove RNN and Set MHA
import torch.nn.functional as F

class LanguageModel(torch.nn.Module):
    def __init__(
            self,
            n_embd,
            tokenizer,
            device="cpu",
            context_len=64,    # maximum sequence length
            n_heads=4        # number of attention heads
    ):
        super().__init__()
        self.tokenizer = tokenizer
        self.device = device
        self.context_len = context_len
        self.vocab_size = tokenizer.n_vocab

        self.embd = torch.nn.Embedding(self.vocab_size, n_embd)
        self.pos_embd = torch.nn.Embedding(self.context_len, n_embd)
        self.attn = MultiHeadAttention(n_embd, n_heads, context_len)
        self.out = torch.nn.Linear(n_embd, self.vocab_size)

    def forward(self, x):
        seq_len = x.shape[1]
        embds = self.embd(x)  # (batch, context_len, n_embd)
        # add positional embd
        pos = torch.arange(0, seq_len, dtype=torch.long, device=self.device).unsqueeze(0)
        embds = embds + self.pos_embd(pos) # (B, seq_len, n_embd)
        attention_out = self.attn(embds)  # (batch, context_len, n_embd)
        logits = self.out(attention_out)  # (batch, context_len, vocab_size)
        return logits

    def fit(self, train_loader, val_loader=None, epochs=5, lr=1e-3):
        self.to(self.device)
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(epochs):
            self.train()
            total_train_loss = 0.0

            for X, Y in train_loader:
                X, Y = X.to(self.device), Y.to(self.device)

                logits = self(X).view(-1, self.vocab_size)
                Y = Y.view(-1)
                loss = loss_fn(logits, Y)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_train_loss += loss.item()

            avg_train_loss = total_train_loss / len(train_loader)

            if val_loader:
                self.eval()
                total_val_loss = 0.0
                with torch.no_grad():
                    for Xv, Yv in val_loader:
                        Xv, Yv = Xv.to(self.device), Yv.to(self.device)
                        val_logits = self(Xv).view(-1, self.vocab_size)
                        Yv = Yv.view(-1)
                        val_loss = loss_fn(val_logits, Yv)
                        total_val_loss += val_loss.item()

                avg_val_loss = total_val_loss / len(val_loader)
                print(f"Epoch [{epoch+1}/{epochs}]  Train Loss: {avg_train_loss:.3f} | Val Loss: {avg_val_loss:.3f}")
            else:
                print(f"Epoch [{epoch+1}/{epochs}]  Train Loss: {avg_train_loss:.3f}")

    def generate(self, prompt, max_new_tokens=16):
        self.eval()
        input_ids = torch.tensor([self.tokenizer.encode(prompt)], device=self.device)

        for _ in range(max_new_tokens):
            # only consider last context_len tokens to manage memory usage
            input_ids_cond = input_ids[:, -self.context_len:]

            with torch.no_grad():
                logits = self(input_ids_cond)

            last_logits = logits[:, -1, :]
            probs = torch.softmax(last_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat((input_ids, next_token), dim=1)

        return self.tokenizer.decode(input_ids[0].tolist())


In [None]:
# runs for about 20 seconds on a T4 GPU

config = {
    "n_embd": 64,
    "tokenizer": bpe_tokenizer,
    "device": "cuda",
    "context_len": 64,
    "n_heads": 4
}

model = LanguageModel(**config)
model.fit(train_loader, val_loader, epochs=5, lr=1e-3)

print(model.generate("I went out with Lady Elizabeth to the meadows", max_new_tokens=32))

Epoch [1/5]  Train Loss: 8.853 | Val Loss: 6.706
Epoch [2/5]  Train Loss: 6.423 | Val Loss: 6.468
Epoch [3/5]  Train Loss: 6.295 | Val Loss: 6.395
Epoch [4/5]  Train Loss: 6.193 | Val Loss: 6.278
Epoch [5/5]  Train Loss: 6.014 | Val Loss: 6.096
I went out with Lady Elizabeth to the meadows mere notvern
againconfidence again char alarming instantly time from Endurance Mr of secured
b close came young had was press thought her gladkeepingThings. than Crawford


In [None]:
prompt = "It was a sunny day, I went out with Lady Margrett to the gardens."

print(model.generate(prompt))

It was a sunny day, I went out with Lady Margrett to the gardens. He you be that it as I really owe had not staying some comfort, it


In [None]:
#@title Exercise: Play around with the model

# experiment with prompts, try training the model with different configurations
# what is your impression? what do you notice?

### **3. 4**.&nbsp; **Feed Forward Layer, Skip Connection and Layer Normalisation**

- **Feed Forward**: A small network applied to each position in the sequence separately. It lets each input process the information it gathered from all other inputs during the multi-head attention step.

- [**Layer Normalisation**](https://arxiv.org/abs/1607.06450): When training deep neural networks (that is, networks with many layers), gradient updates can become unstable, with gradients vanishing or exploding. LayerNorm helps prevent this problem by rescaling the outputs of a layer to have a mean of 0 and a variance of 1. This adjustment speeds up convergence to good weights and makes training more consistent.

- [**Skip Connection**](https://arxiv.org/abs/1512.03385): Adds the input of a layer back to its output. This gives gradients a direct path through the network and helps prevent vanishing gradients.

- [**Dropout**](http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf): Randomly sets a fraction of activations to zero during training, which helps prevent overfitting.

[Source: "Build a Large Language Model from Scratch" by Sebastian Raschka, ch 4.](https://www.manning.com/books/build-a-large-language-model-from-scratch?a_aid=raschka&a_bid=4c2437a0&chan=mm_github)
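The arithmetic behind layer normalisation and the skip connection can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not what `torch.nn.LayerNorm` does internally: it works on a single token's embedding as a list of floats and leaves out the learned scale and shift parameters. The function names `layer_norm` and `residual` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a list of floats to mean 0 and variance close to 1."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance (divide by n)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """Skip connection: add the sublayer's output back onto its input."""
    return [a + b for a, b in zip(x, sublayer(x))]

token = [2.0, 4.0, 6.0, 8.0]   # toy embedding of one token
normed = layer_norm(token)

print("normalised:", [round(v, 3) for v in normed])
print("mean:", round(sum(normed) / len(normed), 6))
print("with skip connection:", [round(v, 3) for v in residual(token, layer_norm)])
```

Note that the skip connection keeps the original values in the output: even if the sublayer contributed nothing useful, the input would still pass through unchanged.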

In [None]:
#@title FeedForward and Transformer Block Classes

class FeedForward(torch.nn.Module):
    def __init__(self, n_embd, dropout):
        super().__init__()

        self.ff = torch.nn.Sequential(
            torch.nn.Linear(n_embd, n_embd * 4),
            torch.nn.GELU(),
            torch.nn.Linear(n_embd * 4, n_embd),
            torch.nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.ff(x)

class TfBlock(torch.nn.Module):
    def __init__(self, n_embd, n_heads, context_len, dropout):
        super().__init__()
        self.mha = MultiHeadAttention(n_embd, n_heads, context_len, dropout)
        self.ff = FeedForward(n_embd, dropout)
        self.norm_1 = torch.nn.LayerNorm(n_embd)
        self.norm_2 = torch.nn.LayerNorm(n_embd)

    def forward(self, x):
        # normalise, apply the sublayer, then add the input back (skip connection)
        x = x + self.mha(self.norm_1(x))
        x = x + self.ff(self.norm_2(x))
        return x

In [None]:
#@title Add Dropout to Attention

class SelfAttention(torch.nn.Module):
    def __init__(self, n_embd, head_size, contex_len, dropout):
        """
        Single-head self-attention
        n_embd : embedding dimension (i.e. the input feature size)
        head_size : dimension for this particular head
        contex_len : maximum sequence length (for constructing the causal mask)
        """
        super().__init__()
        self.head_size = head_size

        self.Q = torch.nn.Linear(n_embd, head_size, bias=False)
        self.K = torch.nn.Linear(n_embd, head_size, bias=False)
        self.V = torch.nn.Linear(n_embd, head_size, bias=False)

        self.dropout = torch.nn.Dropout(dropout)

        # mask to hide "future" tokens (we only attend to current and previous tokens)
        self.register_buffer("tril", torch.tril(torch.ones(contex_len, contex_len)))

    def forward(self, x):
        B,T,C = x.shape

        q = self.Q(x) # (B, T, C) @ (C, head_size) -> (B, T, head_size)
        k = self.K(x)
        v = self.V(x)

        # compute attention scores:
        # q is (B, T, head_size); transposing the last two dims of k gives (B, head_size, T),
        # so the matrix product has one score per pair of positions
        att_weight = q @ k.transpose(-2, -1) # (B, T, T)

        # mask
        att_weight = att_weight.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

        # scale
        att_weight = att_weight / k.shape[-1]**0.5

        # normalize over last dimension
        att_weight = torch.softmax(att_weight, dim=-1)

        # NEW: add dropout to prevent overfitting
        att_weight = self.dropout(att_weight)

        # weighted sum over V
        out = att_weight @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, n_embd, num_heads, context_len, dropout):
        """
        Multi-head self-attention
        n_embd : total embedding dimension
        num_heads : number of parallel attention heads
        context_len : for the causal mask
        dropout : probability for dropout
        """
        super().__init__()
        head_size = n_embd // num_heads  # dimension per head

        self.heads = torch.nn.ModuleList([
            SelfAttention(n_embd, head_size, context_len, dropout=dropout)
            for _ in range(num_heads)
        ])

        self.proj = torch.nn.Linear(num_heads * head_size, n_embd)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x):
        """
        x: (batch_size, sequence_length, n_embd)
        Returns: (batch_size, sequence_length, n_embd)
        """
        out = torch.cat([head(x) for head in self.heads], dim=-1)  # (B, context_len, num_heads * head_size)
        out = self.proj(out)  # (B, T, n_embd)
        out = self.dropout(out)

        return out

In [None]:
#@title Full Transformer Model

class LanguageModel(torch.nn.Module):
    def __init__(
        self,
        n_embd,
        tokenizer,
        device="cpu",
        dropout=0.2,
        context_len=64,
        n_heads=4,
        n_blocks=3
    ):
        super().__init__()
        self.tokenizer = tokenizer
        self.device = device
        self.dropout = dropout
        self.vocab_size = tokenizer.n_vocab
        self.context_len = context_len

        self.embd = torch.nn.Embedding(self.vocab_size, n_embd)
        self.pos_embd = torch.nn.Embedding(self.context_len, n_embd)
        self.blocks = torch.nn.Sequential(*[
            TfBlock(n_embd, n_heads, context_len, dropout)
            for _ in range(n_blocks)
        ])
        self.norm_final = torch.nn.LayerNorm(n_embd)
        self.out = torch.nn.Linear(n_embd, self.vocab_size)

    def forward(self, x):
        seq_len = x.shape[1]
        embds = self.embd(x)  # (batch, context_len, n_embd)
        # add positional embd
        pos = torch.arange(0, seq_len, dtype=torch.long, device=self.device).unsqueeze(0)
        embds = embds + self.pos_embd(pos) # (B, seq_len, n_embd)
        blocks_out = self.blocks(embds)                # (B, context_len, n_embd)
        normed_out = self.norm_final(blocks_out)       # (B, context_len, n_embd)
        logits = self.out(normed_out)                  # (B, context_len, vocab_size)
        return logits

    def fit(self, train_loader, val_loader=None, epochs=5, lr=1e-3):
        self.to(self.device)
        optimizer = torch.optim.AdamW(self.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(epochs):
            self.train()
            total_train_loss = 0.0

            for X, Y in train_loader:
                X, Y = X.to(self.device), Y.to(self.device)

                logits = self(X).view(-1, self.vocab_size)
                Y = Y.view(-1)
                loss = loss_fn(logits, Y)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_train_loss += loss.item()

            avg_train_loss = total_train_loss / len(train_loader)

            if val_loader is not None:
                self.eval()
                total_val_loss = 0.0
                with torch.no_grad():
                    for Xv, Yv in val_loader:
                        Xv, Yv = Xv.to(self.device), Yv.to(self.device)
                        val_logits = self(Xv).view(-1, self.vocab_size)
                        Yv = Yv.view(-1)

                        val_loss = loss_fn(val_logits, Yv)
                        total_val_loss += val_loss.item()

                avg_val_loss = total_val_loss / len(val_loader)

                print(f"Epoch [{epoch+1}/{epochs}]"
                      f"  Train Loss: {avg_train_loss:.3f}"
                      f"  |  Val Loss: {avg_val_loss:.3f}")
            else:
                print(f"Epoch [{epoch+1}/{epochs}]"
                      f"  Train Loss: {avg_train_loss:.3f}")

    def generate(self, prompt, max_new_tokens=16):
        self.eval()
        input_ids = torch.tensor([self.tokenizer.encode(prompt)], device=self.device)

        for _ in range(max_new_tokens):
            input_ids_cond = input_ids[:, -self.context_len:]

            with torch.no_grad():
                logits = self(input_ids_cond)

            last_logits = logits[:, -1, :]
            probs = F.softmax(last_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            input_ids = torch.cat((input_ids, next_token), dim=1)

        return self.tokenizer.decode(input_ids[0].tolist())

In [None]:
# runs for about 2 mins on T4

config = {
    "n_embd": 64,
    "tokenizer": bpe_tokenizer,
    "device": "cuda",
    "context_len": 64,
    "n_heads": 4,
    "n_blocks": 2
}

model = LanguageModel(**config)
model.fit(train_loader, val_loader, epochs=5, lr=1e-3)

print(model.generate("I went out with Lady Elizabeth to the meadows", max_new_tokens=32))

Epoch [1/5]  Train Loss: 9.076  |  Val Loss: 6.826
Epoch [2/5]  Train Loss: 6.482  |  Val Loss: 6.449
Epoch [3/5]  Train Loss: 6.330  |  Val Loss: 6.407
Epoch [4/5]  Train Loss: 6.256  |  Val Loss: 6.289
Epoch [5/5]  Train Loss: 6.092  |  Val Loss: 6.118
I went out with Lady Elizabeth to the meadowsious
 goodt dearony a another ever good and
that! took to by� farewell F or, only a without what any and,that, business


## **4**.&nbsp; **Inference: Decoding Strategies**

Decoding strategies define how a language model selects the next token when generating text.

- **Greedy decoding**: the model always picks the token with the highest probability at each step. While efficient, greedy decoding often leads to repetitive and uncreative outputs because it lacks diversity.
- **Beam search**: improves upon greedy decoding by keeping multiple hypotheses (beams) at each step and selecting the best overall sequence rather than just the best token at each step.
- **Temperature scaling**: controls randomness by dividing the logits by a temperature before the softmax. Higher temperatures flatten the probability distribution and encourage more randomness, while lower temperatures sharpen it and make predictions more deterministic.
- **Top-K sampling**: introduces randomness while restricting token selection to the K most probable tokens at each step, which eliminates unlikely choices while maintaining diversity.
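Here is a minimal sketch of greedy decoding, temperature scaling and top-K filtering on a toy example, written in plain Python so the arithmetic is visible. The four-word vocabulary, the logits and the helper names (`softmax`, `next_token_probs`, `greedy_index`) are made up for illustration; beam search is left out to keep the example short.

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Apply temperature scaling and optional top-k filtering, then softmax."""
    scaled = [v / temperature for v in logits]        # temperature must be > 0
    if top_k:                                         # None or 0 disables filtering
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]  # k-th largest logit
        scaled = [v if v >= cutoff else float("-inf") for v in scaled]
    return softmax(scaled)

def greedy_index(logits):
    """Greedy decoding: index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

vocab = ["the", "her", "meadows", "Crawford"]   # toy vocabulary (hypothetical)
logits = [2.0, 1.0, 0.5, -1.0]                  # toy scores for the next token

print("greedy:", vocab[greedy_index(logits)])
for t in (0.5, 1.0, 2.0):
    probs = next_token_probs(logits, temperature=t)
    print(f"T={t}:", [round(p, 3) for p in probs])
print("top_k=2:", [round(p, 3) for p in next_token_probs(logits, top_k=2)])

rng = random.Random(0)  # seeded so the sample is reproducible
weights = next_token_probs(logits, temperature=1.5, top_k=3)
sampled = rng.choices(vocab, weights=weights, k=1)[0]
print("sampled:", sampled)
```

Lowering the temperature moves probability mass towards the top token, raising it spreads the mass out, and top-K assigns exactly zero probability to everything outside the K highest-scoring tokens.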



In [None]:
#@title Very Final Model
import torch.nn.functional as F

class LanguageModel(torch.nn.Module):
    def __init__(
        self,
        n_embd,
        tokenizer,
        device="cpu",
        dropout=0.2,
        context_len=64,
        n_heads=4,
        n_blocks=3
    ):
        super().__init__()
        self.tokenizer = tokenizer
        self.device = device
        self.dropout = dropout
        self.vocab_size = tokenizer.n_vocab
        self.context_len = context_len

        self.embd = torch.nn.Embedding(self.vocab_size, n_embd)
        self.pos_embd = torch.nn.Embedding(self.context_len, n_embd)
        self.blocks = torch.nn.Sequential(*[
            TfBlock(n_embd, n_heads, context_len, dropout)
            for _ in range(n_blocks)
        ])
        self.norm_final = torch.nn.LayerNorm(n_embd)
        self.out = torch.nn.Linear(n_embd, self.vocab_size)

    def forward(self, x):
        seq_len = x.shape[1]
        embds = self.embd(x)
        pos = torch.arange(seq_len, device=self.device).unsqueeze(0)
        embds = embds + self.pos_embd(pos)
        blocks_out = self.blocks(embds)
        normed_out = self.norm_final(blocks_out)
        logits = self.out(normed_out)
        return logits

    def fit(self, train_loader, val_loader=None, epochs=5, lr=1e-3):
        self.to(self.device)
        optimizer = torch.optim.AdamW(self.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(epochs):
            self.train()
            total_train_loss = 0.0

            for X, Y in train_loader:
                X, Y = X.to(self.device), Y.to(self.device)
                logits = self(X).view(-1, self.vocab_size)
                Y = Y.view(-1)
                loss = loss_fn(logits, Y)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_train_loss += loss.item()

            avg_train_loss = total_train_loss / len(train_loader)

            if val_loader is not None:
                self.eval()
                total_val_loss = 0.0
                with torch.no_grad():
                    for Xv, Yv in val_loader:
                        Xv, Yv = Xv.to(self.device), Yv.to(self.device)
                        val_logits = self(Xv).view(-1, self.vocab_size)
                        Yv = Yv.view(-1)
                        val_loss = loss_fn(val_logits, Yv)
                        total_val_loss += val_loss.item()

                avg_val_loss = total_val_loss / len(val_loader)

                print(f"Epoch [{epoch+1}/{epochs}]"
                      f"  Train Loss: {avg_train_loss:.3f}"
                      f"  |  Val Loss: {avg_val_loss:.3f}")
            else:
                print(f"Epoch [{epoch+1}/{epochs}]"
                      f"  Train Loss: {avg_train_loss:.3f}")

    def generate(self, prompt, max_new_tokens=16, temperature=1.0, top_k=50):
        """
        Generate text from a prompt with temperature and top-k sampling.

        Parameters:
        - prompt (str): The input text to generate from.
        - max_new_tokens (int): Number of tokens to generate.
        - temperature (float): Higher values increase randomness, lower values make generation more deterministic.
        - top_k (int): Limit token selection to the top K most probable tokens.

        Returns:
        - str: The generated text.
        """
        self.eval()
        input_ids = torch.tensor([self.tokenizer.encode(prompt)], device=self.device)

        for _ in range(max_new_tokens):
            input_ids_cond = input_ids[:, -self.context_len:]

            with torch.no_grad():
                logits = self(input_ids_cond)
                logits = logits[:, -1, :]  # Get logits for last token only

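                # Temperature scaling (the max guards against division by zero)
                logits = logits / max(temperature, 1e-5)

                # Top-k filtering: keep the k largest logits, mask the rest with -inf
                if top_k is not None and top_k > 0:
                    k = min(top_k, logits.size(-1))  # k cannot exceed the vocabulary size
                    top_k_values, _ = torch.topk(logits, k=k)
                    min_top_k = top_k_values[:, -1].unsqueeze(-1)
                    logits = logits.masked_fill(logits < min_top_k, float('-inf'))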

                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)

                input_ids = torch.cat([input_ids, next_token], dim=1)

        return self.tokenizer.decode(input_ids[0].tolist())

In [None]:
# takes up to 3 mins on T4 for 15 epochs
config = {
    "n_embd": 128,
    "tokenizer": bpe_tokenizer,
    "device": "cuda",
    "context_len": 128,
    "n_heads": 4,
    "n_blocks": 2
}


model = LanguageModel(**config)
model.fit(train_loader, val_loader, epochs=15, lr=1e-3)

print(model.generate("I went out with Lady Elizabeth to the meadows", max_new_tokens=128))

Epoch [1/15]  Train Loss: 8.043  |  Val Loss: 6.511
Epoch [2/15]  Train Loss: 6.342  |  Val Loss: 6.359
Epoch [3/15]  Train Loss: 6.143  |  Val Loss: 6.099
Epoch [4/15]  Train Loss: 5.843  |  Val Loss: 5.844
Epoch [5/15]  Train Loss: 5.596  |  Val Loss: 5.695
Epoch [6/15]  Train Loss: 5.386  |  Val Loss: 5.568
Epoch [7/15]  Train Loss: 5.190  |  Val Loss: 5.459
Epoch [8/15]  Train Loss: 5.008  |  Val Loss: 5.368
Epoch [9/15]  Train Loss: 4.842  |  Val Loss: 5.295
Epoch [10/15]  Train Loss: 4.689  |  Val Loss: 5.242
Epoch [11/15]  Train Loss: 4.549  |  Val Loss: 5.203
Epoch [12/15]  Train Loss: 4.426  |  Val Loss: 5.168
Epoch [13/15]  Train Loss: 4.312  |  Val Loss: 5.156
Epoch [14/15]  Train Loss: 4.210  |  Val Loss: 5.142
Epoch [15/15]  Train Loss: 4.116  |  Val Loss: 5.137
I went out with Lady Elizabeth to the meadows always,
her_.lect, which any she did_ for the park and there is it.
” And she had never do not; but this idea for everybody of her

that some time to her spirits to her

In [None]:
prompt = "It was a sunny day, I went out with Lady Margrett to the gardens."
for i in range(10):
    print(model.generate(prompt, max_new_tokens=32, temperature=1.5), "\n")

It was a sunny day, I went out with Lady Margrett to the gardens. Five on him so many
c c us as I think in such a most
whiched by her sisteringness about Fanny: the Admiral
was 

It was a sunny day, I went out with Lady Margrett to the gardens. You as he only
conferred first that of the shade: so very difficult, were the whole in future, was only two; her first three boys
 

It was a sunny day, I went out with Lady Margrett to the gardens. To the
not sympath
nothinged about the heat. Grant was always answered, that how. My time of it had never and then out to her
 

It was a sunny day, I went out with Lady Margrett to the gardens. How, from the same Park, they; she would know
not be an answer to appear in a point of an early and langu said her good. You 

It was a sunny day, I went out with Lady Margrett to the gardens. Lady Bertram
was enduring, her to be done into what you them the world on her own family the morrow, I hope Mr, the most at 

It was a sunny day, I went out with Lady M