# RLHF From Scratch on LLMs

In this notebook, I start with the history of RLHF and its importance for LLMs, then go into the techniques TRPO, PPO, GRPO, and DPO. Each technique's section covers the math, the code, and an explanation of how it works. At the end, we'll experiment with these techniques on a prebuilt LLM (the LLMs are not built from scratch here, since I've already done that in my [llm-from-scratch repository](https://github.com/ashworks1706/llm-from-scratch)).

## Brief History of RLHF



RLHF emerged around 2017-2018 when researchers at OpenAI and DeepMind developed techniques to incorporate human preferences into reinforcement learning systems. The seminal paper "Deep Reinforcement Learning from Human Preferences" by Christiano et al. (2017) introduced the core concept of using human comparisons between pairs of outputs to train a reward model that could guide RL agents toward preferred behaviors. Initially applied to simpler tasks and robotics, the technique remained relatively specialized for several years. It gained mainstream attention in 2022 when OpenAI used it to create ChatGPT from GPT-3.5, dramatically improving output quality by aligning the model with human preferences. This breakthrough demonstrated RLHF's potential to transform raw language model capabilities into systems that better align with human intent and values. Since then, RLHF has become a standard component in developing advanced language models like GPT-4, Claude, and Llama 2, with each iteration refining the techniques to achieve better alignment.

#### Why RLHF Matters for LLMs

<img src="assets/rlhf-vs-finetune.png" width=300>

Large language models trained solely on next-token prediction are just models with knowledge; they don't know how to answer properly. Say you have a model trained on Shakespeare's works. Great! But how do you make it answer questions the way we want (the way humans talk)? These models optimize for predicting the next token based on the training data distribution, which doesn't necessarily correlate with producing helpful, harmless, or honest responses. Such LLMs may generate toxic, harmful, or misleading content because they're simply trying to produce statistically likely continuations without understanding human values or preferences. They lack an inherent mechanism to distinguish between content that is statistically probable and content that is actually desirable according to human standards. RLHF addresses these issues by creating a feedback loop where human preferences explicitly guide the model's learning process, steering it toward outputs that humans find more helpful, honest, and aligned with their intent. This alignment process transforms a powerful but directionless prediction engine into a system that can better understand and respect nuanced human values and follow complex instructions in ways that maximize utility for users.




## Workflow - The Bird's-Eye View

<img src="assets/workflow.png" >


Before delving into the complexities of RLHF, it's essential to understand the overall workflow of a typical RLHF-based project. When a large language model is initially trained on vast internet text corpora, it develops remarkable capabilities to predict text and acquire factual knowledge, but this training alone doesn't prepare it to be helpful in specific contexts or respond appropriately to human instructions. Consider a language model trained on extensive educational materials, including university Canvas modules, academic papers, and textbooks. This model would possess substantial knowledge about various academic subjects, pedagogical approaches, and educational concepts. However, if asked to "Explain the concept of photosynthesis to a 10-year-old," it might produce a technically accurate but overly complex explanation filled with academic jargon that would confuse rather than enlighten a young student. The model hasn't been optimized to serve as an effective tutor - it simply predicts what text might follow in educational materials.

The Supervised Fine-Tuning stage addresses this gap by training the model on demonstrations of desired behavior. For our hypothetical educational assistant, SFT would involve collecting thousands of examples showing how skilled human tutors respond to student questions: simplifying complex concepts, using age-appropriate language, providing relevant examples, checking for understanding, and offering encouragement. These demonstrations are formatted as input-output pairs (prompt and ideal response), and the model is fine-tuned to minimize the difference between its outputs and these human-generated "gold standard" responses. Through this process, the model learns the patterns that characterize helpful tutoring: breaking down complex concepts into simpler components, using analogies relevant to younger audiences, avoiding unnecessary technical terms, and adopting a supportive tone. After SFT, when asked to explain photosynthesis to a 10-year-old, the model is much more likely to respond with an explanation involving plants "eating sunlight" and "breathing in carbon dioxide to make food," rather than discussing electron transport chains and ATP synthesis. The model hasn't gained new knowledge, but it has learned a new way to present its existing knowledge that better aligns with the specific goal of being an effective tutor for younger students. However, SFT alone has significant limitations. First, it can only learn from the specific examples it's shown, leaving gaps in how to handle the infinite variety of possible user requests. Second, the demonstrations might not cover the full range of desirable behaviors or edge cases where special handling is needed. Third, the quality of the SFT model depends entirely on the quality and consistency of the demonstration data. 
Finally, there's no mechanism for the model to understand why certain responses are better than others - it simply learns to mimic patterns without a deeper understanding of the preferences that make one response superior. These limitations are precisely what RLHF is designed to address in the subsequent stages of the alignment process.

Following Supervised Fine-Tuning, the RLHF workflow progresses to Human Preference Collection - a crucial stage that fundamentally changes how model improvement occurs. In this phase, rather than providing gold-standard demonstrations, human evaluators compare and rank different model responses to the same prompt. For our educational assistant, this might involve presenting evaluators with pairs of explanations for the same scientific concept and asking them which better achieves the goal of teaching a young student. One explanation might be more engaging and use more appropriate analogies, while another might be technically accurate but still too complex. By explicitly choosing the better response, humans provide preference signals that capture nuanced quality distinctions beyond what demonstration data alone can convey. These comparisons generate valuable datasets where each entry contains a prompt and two responses, with a label indicating which response humans preferred. The collection process typically gathers thousands or even millions of such comparative judgments, creating a rich dataset that embodies human preferences about what constitutes a high-quality response across diverse scenarios.
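To make the shape of such a dataset entry concrete, here is a minimal sketch of one comparison record. The class name `PreferencePair`, its field names, and the `"a"`/`"b"` label convention are illustrative assumptions, not a format prescribed by any particular library, and the example responses are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison: a prompt, two candidate responses, and the winner."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b" (hypothetical label convention)

    def chosen_rejected(self):
        """Return (chosen, rejected), the ordering reward-model training usually works with."""
        if self.preferred == "a":
            return self.response_a, self.response_b
        if self.preferred == "b":
            return self.response_b, self.response_a
        raise ValueError(f"preferred must be 'a' or 'b', got {self.preferred!r}")


# A made-up record for the educational-tutor scenario
pair = PreferencePair(
    prompt="Explain photosynthesis to a 10-year-old.",
    response_a="Photosynthesis converts photons into chemical energy via the electron transport chain.",
    response_b="Plants are like tiny chefs: they use sunlight, water, and air to cook their own food.",
    preferred="b",
)
chosen, rejected = pair.chosen_rejected()
print("Chosen:  ", chosen)
print("Rejected:", rejected)
```

Storing the raw A/B label and deriving the chosen/rejected ordering on demand keeps the record faithful to what the evaluator actually saw.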

The third stage, Reward Model Training, transforms these human preferences into a quantifiable reward function that can guide further optimization. This reward model takes a prompt and response as input and outputs a scalar score representing how well the response aligns with human preferences. Technically, it's trained to predict which of two responses humans would prefer by maximizing the likelihood of the observed preference data. For our educational tutor, the reward model learns to assign higher scores to explanations that successfully simplify complex concepts without sacrificing accuracy, use age-appropriate analogies, maintain an encouraging tone, and check for understanding. This model becomes a computational proxy for human judgment, capable of evaluating millions of potential responses far beyond what human evaluators could manually assess. The quality of this reward model is critical, as it effectively defines what "good" means for all subsequent optimization.
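One common way to turn "maximize the likelihood of the observed preference data" into a loss is the Bradley-Terry formulation, where the probability that one response beats another is the sigmoid of the difference between their scores. The sketch below assumes that formulation (this notebook section does not commit to a specific loss) and uses plain floats in place of real reward-model outputs; the function names are hypothetical.

```python
import math


def preference_probability(score_chosen, score_rejected):
    """P(chosen is preferred) = sigmoid(score_chosen - score_rejected)."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def pairwise_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the observed human preference."""
    return -math.log(preference_probability(score_chosen, score_rejected))


# Equal scores: the reward model is undecided, so the loss equals log(2)
undecided = pairwise_loss(0.0, 0.0)
# The reward model already ranks the pair the way the human did: small loss
agrees = pairwise_loss(2.0, -1.0)
# The reward model ranks the pair the wrong way round: large loss
disagrees = pairwise_loss(-1.0, 2.0)

print(f"undecided={undecided:.4f} agrees={agrees:.4f} disagrees={disagrees:.4f}")
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary; a real implementation would also use a numerically stable log-sigmoid instead of this direct formula.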

With a trained reward model in place, the final stage applies Reinforcement Learning techniques to optimize the language model toward maximizing the predicted reward. The most common approach is Proximal Policy Optimization (PPO), which iteratively improves the model by adjusting its parameters to generate responses that receive higher reward scores. However, simply maximizing reward can lead to degenerate outputs that exploit loopholes in the reward model or diverge too far from natural language patterns. To prevent this, the optimization includes a "KL divergence" penalty that constrains how much the optimized model can deviate from the SFT model, preserving fluency and knowledge while improving alignment. For our educational tutor, this process might result in a model that maintains scientific accuracy while consistently finding creative, age-appropriate analogies and explanations across a much broader range of topics than were covered in the original demonstration data. The entire RLHF pipeline is often iterative, with new preference data collected from the improved model, leading to refined reward models and further optimization cycles. This continuous feedback loop progressively aligns the language model with human values and preferences, addressing the fundamental limitations of training on prediction alone or even on demonstration data without comparative preference signals.
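The KL penalty can be sketched as a small adjustment to the reward. The version below is a simplified assumption: it subtracts `beta` times a single-sample KL estimate (the summed log-probability ratio over the sampled tokens) from the reward-model score. The function name, the `beta=0.1` default, and the sequence-level form are illustrative choices; practical PPO implementations often apply the penalty per token.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times an estimate of KL(policy || SFT reference).

    policy_logprobs and reference_logprobs hold the log-probability each model
    assigns to every token of the sampled response (same length, same tokens).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Summing the log-ratios over the sampled tokens gives a single-sample KL estimate
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate


# The policy still matches the SFT model exactly: no penalty is applied
same = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
# The policy has become more confident than the SFT model on these tokens: reward is reduced
drifted = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0])
print(same, drifted)
```

A larger `beta` keeps the optimized model closer to the SFT model at the cost of slower reward improvement.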


## 1. Getting a pretrained LLM

<img src="assets/workflow1.png">


The first step is to get a fresh pretrained LLM. We'll use the Hugging Face `transformers` library for our transformer components.

In [None]:
# %pip install transformers peft datasets tqdm wandb rouge-score
# PEFT is a technique to fine tune LLMs without modifying all of their parameters. it's efficient for our tutorial.

In [None]:
# Import from library and setup the model class 
 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
import os

class PretrainedLLM:
    def __init__(self, model_name="facebook/opt-350m", device=None):
        """
        Load a pretrained language model (no SFT or RLHF applied yet) for the RLHF experiments
        
        Args:
            model_name: HuggingFace model identifier (default: OPT-350M, a relatively small but capable model)
            device: Computing device (will auto-detect if None)
        """
        self.model_name = model_name
        
        # this code is just for detecting if you have Nvidia CUDA driver or not
        if device is None:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
        else:
            self.device = device
            
        print(f"Loading {model_name} on {self.device}...")
        
        # Load model and tokenizer (for a full guide on implementing an LLM from scratch, see https://github.com/ashworks1706/llm-from-scratch)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name) 
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, 
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            low_cpu_mem_usage=True
        )
        
        # Move the model to the selected device (GPU if available, otherwise CPU)
        self.model.to(self.device)
        print(f"Model loaded successfully with {sum(p.numel() for p in self.model.parameters())/1e6:.1f}M parameters")
        
    def generate(self, prompt, max_new_tokens=100, temperature=0.7, top_p=0.9):
        """
        Generate text from the model given a prompt (no RLHF)
        
        Args:
            prompt: Input text to generate from
            max_new_tokens: Maximum number of tokens to generate
            temperature: Sampling temperature (lower = more deterministic)
            top_p: Nucleus sampling parameter (lower = more focused)
            
        Returns:
            Generated text as string
        """
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        
        # Generate with sampling, no fancy tuning required
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                top_p=top_p,
                do_sample=True
            )
        
        # Decode only the newly generated tokens; slicing by token count is more
        # robust than slicing the decoded string by the prompt's character length
        prompt_length = inputs.input_ids.shape[1]
        generated_text = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        
        return generated_text
    
    def save_checkpoint(self, path):
        """Save model checkpoint to the specified path"""
        self.model.save_pretrained(path)
        self.tokenizer.save_pretrained(path)
        print(f"Model saved to {path}")
        
    def load_adapter(self, adapter_path):
        """Load a PEFT adapter for efficient fine-tuning"""
        self.model = PeftModel.from_pretrained(
            self.model,
            adapter_path,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32
        )
        print(f"Loaded adapter from {adapter_path}")



In [None]:
# Example usage
if __name__ == "__main__":
    # Initialize with a small model for experimentation
    llm = PretrainedLLM(model_name="facebook/opt-350m")
    
    # Test generation
    prompt = "Explain quantum computing to a 10-year-old:"
    print(f"(No SFT RLHF) Prompt: {prompt}")
    print(f"(No SFT RLHF) Response: {llm.generate(prompt, max_new_tokens=150)}")

---
(No SFT RLHF) *Prompt*: Explain quantum computing to a 10-year-old:

(No SFT RLHF) *Response*: Quantum computing is a type of computation that harnesses the collective properties of quantum states, such as superposition, interference, and entanglement, to perform calculations. The basic unit of quantum computation is the quantum bit or qubit. Unlike classical bits, which can only be in states of 0 or 1, qubits can exist in a superposition of states, meaning they can represent both 0 and 1 simultaneously. This property allows quantum computers to process a vast number of possibilities simultaneously.

The field emerged from quantum mechanics, which is a fundamental theory in physics that provides a description of the physical properties of nature at the atomic and subatomic scales. Quantum computing leverages quantum mechanical phenomena to perform operations on data. This can potentially solve certain problems much faster than classical computers.

For instance, Shor's algorithm demonstrates that a quantum computer can factor large integers exponentially faster than the best known algorithm running on a classical computer. This has significant implications for cryptography systems that rely on the difficulty of factoring large numbers.

---

In this example, the user expects an explanation suited to a 10-year-old. Instead, the model produces a very technical response in an academic tone, without any examples. A better response (with SFT) could have been:

---

(No RLHF) *Prompt*: Explain quantum computing to a 10-year-old:

(No RLHF) *Response*: Imagine your regular computer is like playing a board game where you can only be on one space at a time. You have to move step by step to get to the end.

Quantum computers are like magical game pieces that can be in lots of different spaces at the same time! They use special bits called "qubits" instead of regular computer bits.

Regular computer bits are like light switches - they can only be ON or OFF. But qubits are special because they can be ON and OFF at the same time, kind of like how a spinning coin is neither heads nor tails until it stops.

This special power lets quantum computers solve some really tricky puzzles much faster than regular computers. Scientists are excited about them because they might help us discover new medicines, understand weather better, and solve problems that would take regular computers thousands of years!

The quantum computers we have today are still small and make mistakes, but engineers are working hard to make them bigger and better.

---


As you can see, the SFT model has learned to use simple language and analogies, break down complex concepts, avoid technical jargon like "superposition", and adapt somewhat to the age level of the audience.

This is our goal response with SFT and RLHF:

---

(SFT RLHF) *Prompt*: Explain quantum computing to a 10-year-old:

(SFT RLHF) *Response*: Hey there! Imagine you have a super cool magic coin. When you flip a normal coin, it lands on either heads or tails, right? But this magic coin can spin and be a little bit heads and a little bit tails at the same time while it's spinning!

Quantum computers use special things called "qubits" that work like these magic coins. While normal computers use regular bits that can only be 0 OR 1 (like a coin that's only heads or tails), quantum computers use qubits that can be 0 AND 1 at the same time (like our spinning magic coin).

Why is this so awesome? Well, imagine you lost your favorite toy somewhere in your house. A normal computer would have to look in each room one by one. But a quantum computer is like having a bunch of friends who can check all the rooms at the exact same time! That makes solving certain problems MUCH faster.

Scientists are still learning how to build better quantum computers because they're tricky to make. The qubits are super sensitive - kind of like trying to keep a hundred coins spinning without letting any of them fall over!

Does that make sense? What do you think would be the coolest thing to use a quantum computer for?

---


In this example, we can clearly see the difference: the response is more conversational and user-friendly, and it uses analogies to make the explanation easier to follow, like a human would.

## 2. Supervised Fine Tuning

<img src="assets/workflow2.png" >


### Understanding the difference between SFT and RLHF



But what if I just keep the model with SFT and skip RLHF? Or what if I skip SFT and go straight to RLHF?

These are excellent questions that get to the heart of why the complete RLHF pipeline exists. Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) serve different but complementary roles in aligning language models with human expectations and preferences.

If you only implement SFT without RLHF, you'll have a model that can follow basic patterns demonstrated in your training examples, but it will struggle to generalize beyond them. As we saw in our quantum computing example, SFT can teach a model to use simpler language and appropriate analogies, but it's limited by the specific demonstrations provided. The model learns to mimic patterns without developing a deeper understanding of why certain responses are better than others. When faced with novel queries or edge cases not covered in the training data, an SFT-only model often fails to maintain the same quality of responses. Additionally, SFT can only optimize for whatever patterns exist in your demonstration data - if that data contains subtle biases or inconsistencies, those will be faithfully reproduced by the model.

Conversely, if you attempt to skip SFT and go directly to RLHF, you're likely to encounter significant challenges. RLHF works by refining an already somewhat aligned model through preference optimization. Starting with a raw pretrained model would make this process extremely inefficient and potentially unstable. The preference learning and reinforcement stages need a reasonable starting point where the model can already produce somewhat appropriate responses that humans can meaningfully compare and rank. Without SFT, the initial responses might be so far from helpful that the preference signals become too noisy or the optimization process becomes prohibitively difficult. It would be like trying to teach advanced painting techniques to someone who hasn't yet learned to hold a brush - the feedback would be overwhelming and difficult to incorporate.

The full RLHF pipeline with SFT followed by preference learning and reinforcement creates a progressively refined alignment. SFT provides the foundation by teaching the model basic response patterns and formats through demonstration. RLHF then builds on this foundation by teaching the model to distinguish between good and better responses through comparative feedback, allowing it to generalize beyond specific examples to broader human preferences. As we observed in our examples, the SFT model improved basic comprehensibility and appropriateness, while the RLHF model further enhanced engagement, conversational tone, and subtle aspects of helpfulness that are difficult to capture through demonstrations alone. This complementary relationship explains why major AI systems like ChatGPT and Claude use both techniques in sequence rather than choosing one over the other. The complete alignment process transforms raw predictive power into carefully balanced helpful assistance that respects complex human values and preferences.

### Components of SFT

To perform Supervised Fine-Tuning (SFT) on our pretrained LLM, we need high-quality demonstration data consisting of prompt-response pairs showing the desired behavior, typically thousands of examples created by experts. We also need a data preprocessing pipeline to format this data consistently, including tokenization and markers that distinguish prompts from responses. SFT requires careful configuration of hyperparameters like learning rate, batch size, and optimization methods, with techniques such as warmup and decay schedules for training stability. Rather than fine-tuning all parameters, we'll use PEFT methods like LoRA that add small trainable modules while keeping most of the model frozen, making training more efficient. We'll implement a training loop for forward passes, loss calculation, backpropagation, and parameter updates, along with evaluation metrics such as perplexity and ROUGE scores to assess performance. Finally, our existing PretrainedLLM class already supports checkpointing and adapter loading, and we'll save the model state periodically during training. This SFT process will transform our raw model into one that can follow instructions and communicate appropriately, serving as the foundation for subsequent RLHF stages.

- High-quality demonstration data: Thousands of prompt-response pairs created by experts
- Data preprocessing pipeline: Consistent formatting, tokenization, and special tokens
- Fine-tuning configuration: Learning rate, batch size, warmup/decay schedules
- PEFT implementation: Using LoRA to add trainable modules while freezing most parameters
- Training loop: Forward passes, loss calculation, backpropagation, parameter updates
- Evaluation metrics: Perplexity, ROUGE scores to assess performance
- Checkpointing: Saving model state during training using our existing functionality
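Of the evaluation metrics listed above, perplexity is the easiest to state precisely: it is the exponential of the average negative log-likelihood the model assigns to the reference tokens. The helper below is a minimal sketch with a hypothetical name (`perplexity`), and it takes per-token log-probabilities as plain floats rather than computing them from a model.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that gives every reference token probability 0.25 is, on average,
# as uncertain as a uniform choice among 4 options
uniform_over_four = perplexity([math.log(0.25)] * 4)
# Higher probability on the reference tokens means lower perplexity
more_confident = perplexity([math.log(0.5)] * 4)
print(uniform_over_four, more_confident)
```

Lower is better, and a perplexity of 1 would mean the model predicted every reference token with certainty.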


For the dataset, we use Databricks Dolly-15k, a high-quality instruction-following dataset specifically designed for fine-tuning language models. It contains roughly 15,000 human-generated prompt/response pairs across various instruction categories including creative writing, classification, information extraction, open QA, brainstorming, and summarization.

The Dolly dataset was created by Databricks employees who manually wrote both the prompts and high-quality responses, making it particularly valuable for instruction-tuning. Unlike some other datasets which may be generated or filtered from existing sources, Dolly's samples are purpose-built for teaching models to follow instructions in a helpful manner.

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    prepare_model_for_kbit_training
)
import os
import numpy as np
from tqdm import tqdm

class SupervisedFineTuner:
    def __init__(self, base_model, dataset_name="databricks/databricks-dolly-15k", max_seq_length=512):
        """
        Initializing SFT framework
        
        Args:
            base_model: The PretrainedLLM instance to fine-tune
            dataset_name: HuggingFace dataset identifier containing instruction/response pairs
            max_seq_length: Maximum sequence length for inputs
        """
        self.llm = base_model
        self.tokenizer = base_model.tokenizer
        self.model = base_model.model
        self.device = base_model.device
        self.max_seq_length = max_seq_length # max length of sequences that model will process
        self.dataset_name = dataset_name
        
        # If tokenizer doesn't have padding token, set it to eos token
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
            
        print(f"Loading dataset {dataset_name}...")
        self.raw_dataset = load_dataset(dataset_name)
        print(f"Dataset loaded with {len(self.raw_dataset['train'])} training examples")
        
    def prepare_data(self):
        """Process the dataset into the format needed for instruction fine-tuning"""
        
        def format_instruction(example):
            """Format an example into a prompt-response pair with special tokens"""
            # Different datasets use different column names for the prompt and the response
            if 'instruction' in example and 'response' in example:
                prompt = example['instruction']
                response = example['response']
            elif 'prompt' in example and 'completion' in example:
                prompt = example['prompt']
                response = example['completion']
            else:
                # Fallback for other dataset formats
                prompt = str(example['input']) if 'input' in example else ""
                response = str(example['output']) if 'output' in example else ""
            
            # Format with special tokens
            formatted_text = f"User: {prompt.strip()}\n\nAssistant: {response.strip()}"
            return {"formatted_text": formatted_text}
        
        print("Formatting dataset...")
        self.processed_dataset = self.raw_dataset.map(format_instruction)
        
        def tokenize_function(examples):
            """Tokenize the examples and prepare for training"""
            texts = examples["formatted_text"]
            
            # Tokenize with padding and truncation
            tokenized = self.tokenizer(
                texts,
                padding="max_length",
                truncation=True,
                max_length=self.max_seq_length,
                return_tensors="pt"
            )
            
            # Create labels (for causal LM, labels are the same as input_ids)
            tokenized["labels"] = tokenized["input_ids"].clone()
            
            # Mask padding tokens in the labels to -100 so they're not included in loss
            tokenized["labels"][tokenized["input_ids"] == self.tokenizer.pad_token_id] = -100
            
            return tokenized
        
        print("Tokenizing dataset...")
        self.tokenized_dataset = self.processed_dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=self.processed_dataset["train"].column_names
        )
        
        return self.tokenized_dataset
    
    def setup_peft(self, r=16, lora_alpha=32, lora_dropout=0.05):
        """Set up Parameter-Efficient Fine-Tuning using LoRA"""
        
        print("Setting up LoRA for efficient fine-tuning...")
        # Configure LoRA
        peft_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            r=r,  # Rank of the update matrices
            lora_alpha=lora_alpha,  # Scaling factor
            lora_dropout=lora_dropout,
            target_modules=["q_proj", "v_proj"],  # Which modules to apply LoRA to
            bias="none",
            inference_mode=False
        )
        
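        # Prepare model for training. This helper is designed for quantized (k-bit)
        # models; on a regular model it mainly freezes the base weights and
        # typically upcasts half-precision parameters to fp32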
        self.model = prepare_model_for_kbit_training(self.model)
        
        # Apply LoRA
        self.model = get_peft_model(self.model, peft_config)
        
        # Display trainable parameters
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total_params = sum(p.numel() for p in self.model.parameters())
        print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%} of total)")
        
        return self.model
    
    def train(self, output_dir="sft_model", num_epochs=3, batch_size=8, learning_rate=2e-5):
        """Train the model using the prepared dataset"""
        
        print("Setting up training arguments...")
        # Training arguments
        training_args = TrainingArguments(
            output_dir=output_dir,                           # Directory to save model checkpoints
            num_train_epochs=num_epochs,                     # Number of times to iterate through the dataset
            per_device_train_batch_size=batch_size,          # Batch size per GPU/CPU for training
            gradient_accumulation_steps=4,                   # Number of updates steps to accumulate gradients for
            warmup_ratio=0.1,                               # Percentage of steps for learning rate warmup
            weight_decay=0.01,                              # L2 regularization weight
            learning_rate=learning_rate,                     # Initial learning rate
            logging_steps=10,                               # How often to log training metrics
            save_steps=200,                                 # How often to save model checkpoints
            save_total_limit=3,                             # Maximum number of checkpoints to keep
            fp16=True if self.device == "cuda" else False,   # Whether to use 16-bit floating point precision
            report_to="none"                                # Disable external reporting services
        )
        
        # Create data collator
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=self.tokenizer,
            mlm=False  # We're doing causal LM, not masked LM
        )
        
        print("Creating trainer...")
        # Initialize the trainer
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=self.tokenized_dataset["train"],
            eval_dataset=self.tokenized_dataset["test"] if "test" in self.tokenized_dataset else None,
            data_collator=data_collator
        )
        
        print("Starting training...")
        # Finally Train the model
        trainer.train()
        
        # Save the adapter
        adapter_path = os.path.join(output_dir, "adapter")
        self.model.save_pretrained(adapter_path) # save the model to the path
        self.tokenizer.save_pretrained(adapter_path)
        print(f"Saved LoRA adapter to {adapter_path}")
        
        return adapter_path
    
    def evaluate(self, evaluation_prompts):
        """Evaluate the model on a list of prompts"""
        print("Evaluating model...")
        
        for prompt in evaluation_prompts:
            print(f"Prompt: {prompt}")
            
            # Generate with the original model
            base_response = self.llm.generate(prompt, max_new_tokens=200)
            print(f"Base model response: {base_response}\n")
            
            # Format prompt for the fine-tuned model
            formatted_prompt = f"User: {prompt}\n\nAssistant: "
            inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.device)
            
            # Generate with the fine-tuned model
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=200,
                    temperature=0.7,
                    top_p=0.9,
                    do_sample=True
                )
                
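            # Decode only the newly generated tokens (everything after the prompt).
            # Slicing token IDs is more robust than slicing the decoded string,
            # because decoding does not always reproduce the prompt text exactly.
            prompt_length = inputs["input_ids"].shape[1]
            sft_response = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
            print(f"SFT model response: {sft_response}\n")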
            print("-" * 50)



As we can see, the `SupervisedFineTuner` class takes a base model (our pretrained LLM), a dataset name, and a maximum sequence length that caps how much text the model processes per example. It also configures the tokenizer by making sure the padding token is set, which is needed to batch sequences of different lengths during training. Next, the `prepare_data` method loads the Dolly-15k dataset and transforms each example by extracting the prompt-response pair and formatting it with "User:" and "Assistant:" prefixes. These prefixes teach the model the conversation structure and help it distinguish user input from the expected output. After formatting, it tokenizes all examples with padding and truncation so every sequence has the same length, and sets up the labels so that padding tokens are masked out of the loss calculation.
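
To make the formatting and label masking concrete, here is a minimal sketch in plain Python. It is not the notebook's `prepare_data` code: the helper names `format_example` and `pad_and_mask` and the toy token IDs are assumptions for illustration. The one real convention it relies on is that a label of `-100` is skipped by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default


def format_example(instruction: str, response: str) -> str:
    """Wrap one prompt-response pair in the User/Assistant template."""
    return f"User: {instruction}\n\nAssistant: {response}"


def pad_and_mask(token_ids, max_length, pad_token_id):
    """Truncate or pad to max_length and build labels that ignore padding."""
    ids = list(token_ids[:max_length])           # truncation
    n_pad = max_length - len(ids)                # number of padding slots
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        "labels": ids + [IGNORE_INDEX] * n_pad,  # padding never contributes to the loss
    }


text = format_example("What is LoRA?", "A low-rank fine-tuning method.")
batch = pad_and_mask([5, 17, 42], max_length=6, pad_token_id=1)  # toy token IDs
print(text)
print(batch["labels"])  # [5, 17, 42, -100, -100, -100]
```

The labels here are masked by position rather than by comparing against `pad_token_id`. That distinction matters when the padding token is set to the end-of-sequence token, since masking by ID would also hide real end-of-sequence tokens from the loss.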

The `setup_peft` method implements Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA). Rather than updating all model weights, which would be computationally expensive, LoRA adds small trainable matrices to key attention components while keeping the original parameters frozen. The method configures LoRA with a rank and a scaling parameter, then applies it to the query and value projection matrices of the transformer. This reduces the number of trainable parameters to often less than 1% of the total, making fine-tuning feasible on consumer hardware.

LoRA's efficiency comes from an observation about weight updates: during fine-tuning, they often have a low "intrinsic rank", meaning they can be approximated by low-rank matrices without significant loss of information. LoRA therefore represents each update as the product of two smaller matrices. For example, in a transformer where a weight matrix W is `768×768` (`589,824` parameters), LoRA replaces the full update with two matrices B `(768×16)` and A `(16×768)`, requiring only `24,576` parameters, a reduction of about 96%. In the original LoRA formulation, A starts with random Gaussian values while B begins at zero, so the product BA is zero and training begins from the original model's behavior. The code uses `r=16` for the rank hyperparameter, which determines this compression ratio. The `lora_alpha=32` parameter scales the update by `lora_alpha / r` (here `2`), during training as well as inference, which determines how strongly the adaptation affects the original weights. The `target_modules=["q_proj", "v_proj"]` parameter targets the query and value projection matrices in the attention mechanism, which are particularly influential for language understanding and generation, and leaves the other components untouched.
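
The parameter arithmetic and the start-from-the-original-model property can be checked with a small pure-Python sketch. This is not the PEFT implementation: the helpers `lora_param_counts` and `matmul` and the tiny 4×4 layer are assumptions for illustration, and the initialization follows the LoRA paper's convention of a Gaussian A and a zero B.

```python
import random


def lora_param_counts(d_out: int, d_in: int, r: int):
    """Return (parameters in a full update, parameters in the B and A factors)."""
    return d_out * d_in, r * (d_out + d_in)


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Numbers from the text: a 768x768 weight matrix with rank r=16.
full, lora = lora_param_counts(768, 768, 16)
print(full, lora, f"{1 - lora / full:.1%}")  # 589824 24576 95.8%

# Tiny illustrative layer: the scaled update (alpha / r) * B @ A is zero at init.
d, r, alpha = 4, 2, 4
rng = random.Random(0)
A = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]  # r x d, Gaussian
B = [[0.0] * r for _ in range(d)]                                 # d x r, zeros
scale = alpha / r
delta_W = [[scale * v for v in row] for row in matmul(B, A)]
print(all(v == 0.0 for row in delta_W for v in row))  # True
```

Because `delta_W` is all zeros before any gradient step, the adapted layer `W + delta_W` initially behaves exactly like the frozen layer `W`.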

The `train` method handles the actual training process by configuring optimization parameters including learning rate, batch size, and gradient accumulation steps. It sets up a training pipeline with appropriate arguments for supervised learning, including a warmup schedule and weight decay for regularization. After training completes, it saves just the LoRA adapter rather than the full model, so the fine-tuned version is only a fraction of the full model's size and easy to share. Finally, the `evaluate` method provides a convenient way to compare the base model against the fine-tuned version using the same prompts.
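
As a rough illustration of what the warmup setting and gradient accumulation do, the sketch below computes an effective batch size and a linear warmup-then-decay learning rate. It assumes the `Trainer`'s default linear scheduler is in use; the batch size, step count, and peak learning rate are hypothetical values chosen for the example, not the notebook's settings.

```python
import math


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of optimizer steps spent ramping the learning rate up."""
    return math.ceil(total_steps * warmup_ratio)


def linear_lr(step: int, total_steps: int, n_warmup: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < n_warmup:
        return peak_lr * step / max(1, n_warmup)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - n_warmup))


# Hypothetical values for illustration only.
per_device_batch_size = 4
gradient_accumulation_steps = 4
effective_batch_size = per_device_batch_size * gradient_accumulation_steps  # 16

total_steps = 1000
peak_lr = 2e-4
n_warmup = warmup_steps(total_steps, warmup_ratio=0.1)  # 100

for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step, total_steps, n_warmup, peak_lr))
```

With these numbers the learning rate climbs for the first 100 steps, peaks at step 100, and then decays linearly to zero at step 1000.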

if __name__ == "__main__":
    
    # Initialize base model
    base_llm = PretrainedLLM(model_name="facebook/opt-350m")
    
    # Create SFT trainer
    sft = SupervisedFineTuner(base_llm, dataset_name="databricks/databricks-dolly-15k")  # Dolly-15k ID on the Hugging Face Hub
    
    # Prepare data
    processed_data = sft.prepare_data()
    
    # Setup PEFT
    peft_model = sft.setup_peft()
    
    # Train the model
    adapter_path = sft.train(output_dir="sft_model", num_epochs=1)  # Reduced for demo
    
    # Load the adapter into the base model
    base_llm.load_adapter(adapter_path)
    
    # Test the model
    evaluation_prompts = [
        "Explain quantum computing to a 10-year-old:",
        "Write a short story about a robot learning to feel emotions:",
        "How do I bake a chocolate cake?"
    ]
    
    sft.evaluate(evaluation_prompts)

EXAMPLE

And there we go! We now have a supervised fine-tuned LLM!

## 3. Reinforcement Learning from Human Feedback

<img src="assets/workflow3.png" >


Cool! Now that we have an SFT-tuned model, we can start getting into the actual RLHF stuff. But