<a href="https://colab.research.google.com/github/UdaraChamidu/Large-Language-Models/blob/main/LLM_from_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Building a Large Language Model from Scratch**

**Next Word Prediction**

The goal of our language model is to predict the next word given the words that came before it.

**Main Steps**

*   Tokenization
*   Embedding Layer
*   Positional Encoding
*   Self-Attention
*   Transformer Block
*   Full Language Model


Import packages

In [None]:
import torch
import torch.nn as nn       # nn=neural network
import torch.optim as optim
import math

# Step 1: Tokenization


What:
Converts text into tokens (usually subwords or characters). These tokens are then mapped to unique IDs using a vocabulary.

Why:
Models can’t understand text directly—they understand numbers. Tokenization is the bridge.

Example:
"Hello world" → ["Hello", "world"] → [15496, 995]

In [None]:
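def tokenize(text, vocab):
    # Unknown words fall back to the ID of the special <UNK> token
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]

# text.split() - splits the text into words on whitespace
# vocab - a dictionary that maps each word to an integer ID
# <UNK> - the entry used for words that are not in the vocabulary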
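
To make the word-to-ID mapping concrete, here is a minimal sketch in plain Python. It assumes whitespace splitting and a vocabulary built from a tiny made-up corpus; `build_vocab` is a hypothetical helper that is not part of the notebook, and reserving ID 0 for `<UNK>` is only one possible convention.

```python
def build_vocab(corpus):
    """Assign an integer ID to every distinct word; ID 0 is reserved for <UNK>."""
    vocab = {"<UNK>": 0}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def tokenize(text, vocab):
    """Map each whitespace-separated word to its ID, or to the <UNK> ID."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]


vocab = build_vocab("the cat sat on the mat")
ids = tokenize("the dog sat", vocab)
print(vocab)  # {'<UNK>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(ids)    # [1, 0, 3]  ("dog" is unknown, so it maps to 0)
```

Real tokenizers usually split into subwords rather than whole words, which keeps the vocabulary small and makes `<UNK>` rare.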

# Step 2: Embedding Layer

What:
Converts token IDs into dense vectors (learned representations). Each token gets a vector of fixed size (e.g., 768-dim).

Why:
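One-hot encodings are sparse and treat every pair of words as equally unrelated. Embeddings are dense, continuous vectors in which related words can end up close together. The embedding itself is the same for a word wherever it appears; context comes later, from the attention layers.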

✅ Why It Matters
It’s your model’s first layer.

Converts human-readable text into something the model can learn from.

The quality of these embeddings directly impacts model performance.

In [None]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(Embedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, x):
        return self.embedding(x)

# nn.Embedding - a lookup table that maps each token ID to a vector.
# vocab_size - number of unique tokens in your vocabulary (e.g., 50,000)
# embedding_dim - length of the embedding vector for each token (e.g., 256 or 768)
# Example input: x = torch.tensor([1, 42, 6])  # token IDs
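
The lookup-table idea can be shown without PyTorch. The sketch below uses a made-up 4x3 table (the numbers are arbitrary, whereas `nn.Embedding` starts from random values and learns them) and checks that a row lookup gives the same vector as multiplying a one-hot vector by the table. `embed` and `one_hot_matmul` are hypothetical helper names.

```python
# A toy embedding table: 4 tokens in the vocabulary, 3 dimensions per token.
table = [
    [0.1, 0.2, 0.3],  # token ID 0
    [0.4, 0.5, 0.6],  # token ID 1
    [0.7, 0.8, 0.9],  # token ID 2
    [1.0, 1.1, 1.2],  # token ID 3
]


def embed(token_ids, table):
    """Row lookup: one vector per token ID."""
    return [table[i] for i in token_ids]


def one_hot_matmul(token_id, table):
    """The same vector, computed as a one-hot row vector times the table."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[r] * table[r][c] for r in range(len(table)))
            for c in range(dim)]


vectors = embed([2, 0, 2], table)
print(vectors[0])                # [0.7, 0.8, 0.9]
print(one_hot_matmul(2, table))  # [0.7, 0.8, 0.9]
```

The lookup avoids building the one-hot vector at all, which is why an embedding layer is cheap even for large vocabularies.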

# Step 3: Positional Encoding


What:
Adds information about the position of each token in the sequence since the Transformer has no concept of order by default.

Why:
"Dog bites man" ≠ "Man bites dog" — order matters!

How:
Either fixed (sinusoidal) or learned embeddings are added to token embeddings.

The PositionalEncoding class below implements the fixed (sinusoidal) variant and adds it to the input embeddings.

In [None]:
# Transformers process all the words at once,
# so the model cannot tell which word comes after which.
# Positional encoding adds the position of each word to its embedding.

class PositionalEncoding(nn.Module):
    def __init__(self, embedding_dim, max_seq_len=5000):
        super(PositionalEncoding, self).__init__()
        self.embedding_dim = embedding_dim
        pe = torch.zeros(max_seq_len, embedding_dim) # store the positional encoding vectors
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embedding_dim, 2).float() * (-math.log(10000.0) / embedding_dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
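        pe = pe.unsqueeze(0).transpose(0, 1)  # shape: (max_seq_len, 1, embedding_dim)
        self.register_buffer('pe', pe)  # saved with the model, but not a trainable parameter

    def forward(self, x):
        # x is expected as (seq_len, batch, embedding_dim); pe broadcasts over the batch
        return x + self.pe[:x.size(0), :]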
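
The `div_term` expression in the class is a compact way of writing the sinusoidal formula: for position `pos` and even index `i`, the angle is `pos / 10000^(i / embedding_dim)`, with sine stored at index `i` and cosine at `i + 1`. Here is a minimal plain-Python sketch of that formula for a single position, assuming an even `embedding_dim`; `sinusoidal_encoding` is a hypothetical helper name.

```python
import math


def sinusoidal_encoding(position, embedding_dim):
    """Return the positional-encoding vector for one position (even embedding_dim)."""
    vec = [0.0] * embedding_dim
    for i in range(0, embedding_dim, 2):
        angle = position / (10000.0 ** (i / embedding_dim))
        vec[i] = math.sin(angle)      # even index: sine
        vec[i + 1] = math.cos(angle)  # odd index: cosine
    return vec


pe0 = sinusoidal_encoding(0, 4)
pe1 = sinusoidal_encoding(1, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)  # [sin(1), cos(1), sin(0.01), cos(0.01)]
```

Low indices change quickly from one position to the next, while high indices change slowly, so each position gets a distinct pattern.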

# Step 4: Self-Attention



**Give extra focus to the most important words.**
-----------------------------------
🧠 What Is Self-Attention?

It answers the question:

“For each word in the input, how much attention should I pay to every other word (including myself)?”

Why:
It captures dependencies and relationships between words, no matter their distance.


---------------------------------

Query (Q): What I'm looking for   
Key (K): What I contain          
Value (V): What I offer if selected

--------------------------------
🧪 Example (Mini Visualization)
Let’s say we input the sentence:

"The cat sat on the mat"

When processing the word "mat", self-attention might focus more on "sat" and "on", because those are contextually related.

In [None]:
# heart of the Transformer — the Self-Attention mechanism

class SelfAttention(nn.Module):
    def __init__(self, embedding_dim):
        super(SelfAttention, self).__init__()
        self.query = nn.Linear(embedding_dim, embedding_dim)
        self.key = nn.Linear(embedding_dim, embedding_dim)
        self.value = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, x):
        queries = self.query(x)
        keys = self.key(x)
        values = self.value(x)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / torch.sqrt(torch.tensor(x.size(-1), dtype=torch.float32))
        attention_weights = torch.softmax(scores, dim=-1)
        attended_values = torch.bmm(attention_weights, values)
        return attended_values

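# softmax normalizes each row of scores so the attention weights sum to 1.
# No causal mask is applied here, so every position can also attend to later positions.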
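
To see the arithmetic behind the scores, softmax and weighted sum, here is a small plain-Python sketch of scaled dot-product attention on a three-token sequence. It is a simplification under stated assumptions: the learned query, key and value projections are skipped (the inputs are used directly), and the `causal` flag is an addition that is not in the class above. It shows one common way to stop a token from attending to later tokens during next-word prediction.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention for one sequence (lists of vectors)."""
    d_k = len(keys[0])
    output, weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Hide future positions so token i only sees tokens 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        output.append([sum(w[j] * values[j][c] for j in range(len(values)))
                       for c in range(len(values[0]))])
    return output, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(x, x, x)
_, w_causal = attention(x, x, x, causal=True)
print(w_causal[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

Each row of `w` sums to 1, and each output vector is the corresponding weighted average of the value vectors.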

# Step 5: Transformer Block


Main part of any LLM (the **brain**)

-----------------------------------
🧱 What is a Transformer Block?
It’s a powerful block made of:

Self-Attention

Add & Norm (LayerNorm)

Feedforward Network

Another Add & Norm

--------------------------------------

📌 Why Residuals + LayerNorm?

| Concept | Purpose |
|---|---|
| Residual | Helps with gradient flow and stability |
| LayerNorm | Normalizes activations, which helps against exploding/vanishing gradients and stabilizes learning |
| Feedforward | Adds non-linearity and model capacity |

----------------------------------
Why:
It’s the core building block that enables deep learning of language patterns. We stack many of these blocks to build deep models like GPT or BERT.

Stacked Blocks = Deep Model

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, embedding_dim, hidden_dim):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embedding_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim)
        )
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        attended = self.attention(x)
        x = self.norm1(x + attended)
        forwarded = self.feed_forward(x)
        x = self.norm2(x + forwarded)
        return x
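
The "Add & Norm" step used twice in the block can be worked through by hand. The sketch below is a simplified plain-Python version for a single vector: it adds the sublayer output to its input (the residual) and then normalizes to zero mean and unit variance. It assumes no learned scale or shift, which `nn.LayerNorm` does have; `layer_norm` and `add_and_norm` are hypothetical helper names.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # made-up attention or feed-forward output
y = add_and_norm(x, sublayer_out)
print(y)  # roughly [-1, -1, 1, 1]
```

Here the sum is `[1.5, 1.5, 3.5, 3.5]`, with mean 2.5 and variance 1, so the normalized values land very close to -1 and +1.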

# Step 6: Full Language Model


What:
A stack of multiple Transformer blocks + output layer (usually softmax over vocabulary).

Final output:
Predicts the next token given a sequence.

Example:
Input: "The cat sat on the" → Output: "mat" (most probable next token)

In [None]:
class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super(SimpleLLM, self).__init__()
        self.embedding = Embedding(vocab_size, embedding_dim) # embedding layer
        self.positional_encoding = PositionalEncoding(embedding_dim) # positional encoding layer
        self.transformer_blocks = nn.Sequential(*[TransformerBlock(embedding_dim, hidden_dim) for _ in range(num_layers)])
        self.output = nn.Linear(embedding_dim, vocab_size) # output layer

    def forward(self, x):  # forward pass
        x = self.embedding(x)
        x = x.transpose(0, 1) # Transpose for positional encoding
        x = self.positional_encoding(x)
        x = x.transpose(0, 1) # Transpose back
        x = self.transformer_blocks(x)
        x = self.output(x)
        return x
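
The output layer produces one score (logit) per vocabulary entry, and softmax turns those scores into probabilities. Here is a minimal plain-Python sketch of that last step, assuming a hypothetical five-word vocabulary and made-up logits for the final position of "The cat sat on the".

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and made-up logits for the last position.
id_to_word = ["the", "cat", "sat", "on", "mat"]
logits = [0.1, 0.3, -1.0, 0.0, 2.5]

probs = softmax(logits)
next_id = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(id_to_word[next_id])  # mat
```

Picking the largest logit gives the same answer as picking the largest probability, so softmax is only needed when the probabilities themselves are used, for example in the loss or when sampling.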

# Step 7: Training the Model


# Use the LLM for Eye Diseases

1. Extract Text from PDF

In [None]:
%pip install pymupdf




In [None]:
import fitz  # PyMuPDF
pdf_path = "Eye Disease Classification Using DL.pdf"
def extract_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

text = extract_text(pdf_path)


2. Pre-tokenize Text

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode(text, return_tensors="pt")


Token indices sequence length is longer than the specified maximum sequence length for this model (4539 > 512). Running this sequence through the model will result in indexing errors
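
The warning matters only if the IDs are passed to a model with a fixed maximum length, such as the BERT model this tokenizer belongs to. One common way to stay under such a limit is to split the IDs into windows. Here is a minimal sketch, assuming a plain Python list of IDs; `chunk_ids` is a hypothetical helper, and the figures 4539 and 512 are taken from the warning.

```python
def chunk_ids(token_ids, max_len):
    """Split a long list of token IDs into consecutive windows of at most max_len items."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[start:start + max_len]
            for start in range(0, len(token_ids), max_len)]


ids = list(range(4539))  # stand-in for the 4539 token IDs from the warning
chunks = chunk_ids(ids, 512)
print(len(chunks), len(chunks[-1]))  # 9 443
```

Eight full windows cover 4096 IDs, and the ninth holds the remaining 443.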


In [None]:
# --- PDF Extraction ---
import fitz  # PyMuPDF
pdf_path = "Eye Disease Classification Using DL.pdf"

def extract_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

text = extract_text(pdf_path)

# --- Tokenizer ---
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# --- Simple sentence splitter (no nltk) ---
sentences = [s.strip() for s in text.split('.') if s.strip()]

# --- Tokenize sentences ---
tokenized_data = [tokenizer.encode(sentence, add_special_tokens=False) for sentence in sentences]

# --- LLM Setup ---
import torch
import torch.nn as nn
import torch.optim as optim

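# Note: this redefines SimpleLLM as an LSTM-based model, replacing the
# Transformer version from Step 6. The training below uses this LSTM.
class SimpleLLM(nn.Module):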
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super(SimpleLLM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        embeds = self.embedding(x)
        lstm_out, _ = self.lstm(embeds)
        output = self.fc(lstm_out)
        return output

vocab_size = tokenizer.vocab_size
embedding_dim = 16
hidden_dim = 32
num_layers = 2

model = SimpleLLM(vocab_size, embedding_dim, hidden_dim, num_layers)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# --- Training ---
for epoch in range(100):  # Reduce for testing
    for sentence in tokenized_data:
        if len(sentence) < 2:
            continue
        for i in range(1, len(sentence)):
            input_seq = torch.tensor(sentence[:i]).unsqueeze(0)
            target = torch.tensor(sentence[i]).unsqueeze(0)
            optimizer.zero_grad()
            output = model(input_seq)
            loss = criterion(output[:, -1, :], target)
            loss.backward()
            optimizer.step()
    # Reports the loss of the last training step only, not the epoch average
    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")


Epoch 0, Loss: 8.6590
Epoch 1, Loss: 8.1820
Epoch 2, Loss: 8.1228
Epoch 3, Loss: 7.5126
Epoch 4, Loss: 7.5573
Epoch 5, Loss: 6.9281
Epoch 6, Loss: 6.6246
Epoch 7, Loss: 6.1776
Epoch 8, Loss: 6.0890
Epoch 9, Loss: 5.3231
Epoch 10, Loss: 4.6248
Epoch 11, Loss: 4.9848
Epoch 12, Loss: 4.4115
Epoch 13, Loss: 5.3112
Epoch 14, Loss: 3.5056
Epoch 15, Loss: 3.2357
Epoch 16, Loss: 2.8213
Epoch 17, Loss: 2.8089
Epoch 18, Loss: 2.0741
Epoch 19, Loss: 1.8925
Epoch 20, Loss: 1.7744
Epoch 21, Loss: 1.5223
Epoch 22, Loss: 1.3088
Epoch 23, Loss: 1.0387
Epoch 24, Loss: 0.9783
Epoch 25, Loss: 1.1201
Epoch 26, Loss: 1.3037
Epoch 27, Loss: 1.1164
Epoch 28, Loss: 0.9398
Epoch 29, Loss: 1.4343
Epoch 30, Loss: 1.0297
Epoch 31, Loss: 1.2250
Epoch 32, Loss: 1.0371
Epoch 33, Loss: 1.1217
Epoch 34, Loss: 0.9630
Epoch 35, Loss: 0.6536
Epoch 36, Loss: 0.8835
Epoch 37, Loss: 1.0958
Epoch 38, Loss: 0.6149
Epoch 39, Loss: 0.5326
Epoch 40, Loss: 0.5916
Epoch 41, Loss: 0.5154
Epoch 42, Loss: 0.6545
Epoch 43, Loss: 0.569
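
The inner training loop turns each sentence into several training examples: every prefix of the sentence is an input, and the token that follows it is the target. The sketch below shows only that slicing in plain Python, using made-up token IDs; `next_token_pairs` is a hypothetical helper name.

```python
def next_token_pairs(token_ids):
    """Return (prefix, target) pairs: every prefix predicts the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


# Hypothetical token IDs standing in for one tokenized sentence.
sentence = [11, 42, 7, 99]
pairs = next_token_pairs(sentence)
for prefix, target in pairs:
    print(prefix, "->", target)
# [11] -> 42
# [11, 42] -> 7
# [11, 42, 7] -> 99
```

A sentence of n tokens yields n - 1 examples, which is why sentences with fewer than two tokens are skipped.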

# Step 8: Using the Model


In [None]:
input_text = "Convolution neural"
input_tokens = tokenizer.encode(input_text, add_special_tokens=False, return_tensors="pt")
output = model(input_tokens)
predicted_token_id = torch.argmax(output[:, -1, :]).item()

# Convert predicted token ID back to the word
predicted_token = tokenizer.decode([predicted_token_id])

print(f"Input: {input_text}, Predicted next word: {predicted_token}")


Input: Convolution neural, Predicted next word: network


In [None]:
input_text = "Eye disease classification using"
input_tokens = tokenizer.encode(input_text, add_special_tokens=False, return_tensors="pt")
output = model(input_tokens)
predicted_token_id = torch.argmax(output[:, -1, :]).item()

# Convert predicted token ID back to the word
predicted_token = tokenizer.decode([predicted_token_id])

print(f"Input: {input_text}, Predicted next word: {predicted_token}")

Input: Eye disease classification using, Predicted next word: deep


In [None]:
input_text = "recall is an evaluation"
input_tokens = tokenizer.encode(input_text, add_special_tokens=False, return_tensors="pt")
output = model(input_tokens)
predicted_token_id = torch.argmax(output[:, -1, :]).item()

# Convert predicted token ID back to the word
predicted_token = tokenizer.decode([predicted_token_id])

print(f"Input: {input_text}, Predicted next word: {predicted_token}")

Input: recall is an evaluation, Predicted next word: by


In [None]:
input_text = "sri lanka is a beautiful"
input_tokens = tokenizer.encode(input_text, add_special_tokens=False, return_tensors="pt")
output = model(input_tokens)
predicted_token_id = torch.argmax(output[:, -1, :]).item()

# Convert predicted token ID back to the word
predicted_token = tokenizer.decode([predicted_token_id])

print(f"Input: {input_text}, Predicted next word: {predicted_token}")

Input: sri lanka is a beautiful, Predicted next word: health
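
Each cell above predicts a single token. To continue a prompt for several tokens, the prediction can be appended to the input and the model called again. Here is a minimal sketch of that greedy loop in plain Python. It assumes a function that returns one score per vocabulary entry for a list of IDs; `greedy_generate`, `toy_model` and the four-word vocabulary are hypothetical stand-ins for the trained model and tokenizer.

```python
def greedy_generate(next_scores, token_ids, steps):
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(token_ids)
    for _ in range(steps):
        scores = next_scores(ids)
        ids.append(max(range(len(scores)), key=scores.__getitem__))
    return ids


# Toy stand-in for a trained model: scores depend only on the last token.
# Hypothetical vocabulary: 0="eye", 1="disease", 2="classification", 3="using"
bigram_scores = {
    0: [0.0, 2.0, 0.1, 0.1],
    1: [0.1, 0.0, 2.0, 0.1],
    2: [0.1, 0.1, 0.0, 2.0],
    3: [2.0, 0.1, 0.1, 0.0],
}


def toy_model(ids):
    """Return one score per vocabulary entry for the next token."""
    return bigram_scores[ids[-1]]


generated = greedy_generate(toy_model, [0], steps=3)
print(generated)  # [0, 1, 2, 3]
```

With the real model, `next_scores` would wrap `model(...)` and return the scores for the last position, ideally inside `torch.no_grad()` since no gradients are needed at inference time.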
