# Contents

1. Reinforcement Learning from Human Feedback (RLHF)
2. Online RL
3. Direct Preference Optimization (DPO)
4. Proximal Policy Optimization (PPO)
5. Group Relative Policy Optimization (GRPO)
6. Introduction to Reasoning LLMs

### Author - Harshwardhan Fartale
GitHub - https://github.com/emharsha1812 \
LinkedIn - https://www.linkedin.com/in/emharsha1812/ \
Website - https://emharsha1812.github.io/


## The Journey so Far

1. The earliest GPT models, introduced in 2018, could generate text that was more or less human-like, but they were primarily text-completion models, largely unable to answer even simple queries, and the response quality was nowhere near that of the LLMs we use today.

2. Then came instruction fine-tuning and alignment with human preferences, which ChatGPT popularized in 2022. The techniques behind ChatGPT transformed LLMs into the everyday problem solvers we use today.

3. Currently, we are in the latest phase: developing reasoning models. Reasoning is the ability of an LLM to tackle more complex problems step by step.

## What is Reasoning? (Defining reasoning in the context of LLMs)

Reasoning, in the context of LLMs, refers to the model's ability to produce intermediate steps before providing a final answer. This process is often described as chain-of-thought (CoT) reasoning. In CoT reasoning, the LLM explicitly generates a structured sequence of statements or computations that illustrate how it arrives at its conclusion.

## Instruction Tuning & Preference Tuning

The pre-trained models merely serve as base models for the post-training stage, which uses two key techniques: **supervised fine-tuning** (often abbreviated as **SFT** in the literature and also known as **instruction tuning**) and **preference tuning** to teach LLMs to respond to user queries.

Instruction tuning improves an LLM's ability to handle personal-assistant-style tasks such as **question answering**, **summarizing** and translating text, and many more. The preference tuning stage then refines these capabilities. As the term implies, preference tuning helps tailor responses to user preferences. (You may be familiar with Reinforcement Learning from Human Feedback, or RLHF, which is one specific technique for implementing preference tuning.)




## Modeling Language through Pattern Matching

A statistical (pattern-matching) LLM does not explicitly track contradictions but instead predicts based on learned text distributions. For instance, if information such as "All birds can fly" is reinforced strongly in training data, the model may confidently answer: "Yes, penguins can fly."

> Note - If you ask a non-reasoning model and it correctly answers that **penguins cannot fly**, does that mean it is reasoning?



## Simulating Reasoning without Explicit Rules

No, the model does not implement explicit contradiction-checking and instead generates answers based on probability-weighted patterns. This approach works well enough if training data includes many instances correcting the contradiction (e.g., text like "penguins cannot fly") so that the model learns a statistical association between "penguins" and "not flying." This allows the model to answer correctly without implementing rule-based or other explicit logical reasoning.

So, even when a conventional LLM seems to perform logical deduction, it's not executing explicit, rule-based logic but is instead leveraging patterns from its vast training data.

Nonetheless, the model's success here is a great illustration of how powerful implicit pattern matching can become when trained at a massive scale. However, these pattern-based models usually struggle in scenarios where:

1. The logical scenario is novel (not previously encountered in training data)
2. Reasoning complexity is high, involving intricate, multi-step logical relationships
3. Structured reasoning is required, and no direct prior exposure to similar reasoning patterns exists in the training data

> Question - Can you think of an example or a problem that satisfies all three of the points above?

## Reasoning is a Spectrum

It's worth mentioning that reasoning in LLMs exists on a spectrum. This means that even before the advent of dedicated reasoning models such as OpenAI's o1 and DeepSeek-R1, LLMs were capable of simulating reasoning behavior. For instance, these models exhibited behaviors aligning with our earlier definition, such as generating intermediate steps to arrive at correct conclusions.

What we now explicitly label a **"reasoning model"** is essentially a more refined version of this capability. And these improved reasoning capabilities are achieved by leveraging specific **inference-compute scaling techniques** and **targeted post-training methods** designed to improve and reinforce reasoning behavior.

Commonly, three methods are applied to large language models after their post-training phase to give them the ability to **reason**:

1. **Inference-time compute scaling**. Inference-time compute scaling (also often called inference compute scaling, **test-time scaling**, or other variations) includes methods that improve model reasoning capabilities at **inference time** (when a user prompts the model) without training or modifying the underlying model weights. The core idea is to trade off increased computational resources for improved performance, which helps make even fixed models more capable through techniques such as chain-of-thought reasoning, and various sampling procedures.


2. **Reinforcement learning** - Reinforcement learning (RL) refers to training methods that improve a model's reasoning capabilities by encouraging it to take actions that lead to high reward signals. These rewards can be broad, such as task success or heuristic scores, or they can be narrowly defined and verifiable, such as correct answers in math problems or coding tasks.
Unlike scaling compute at inference time, which can improve reasoning performance without modifying the model, RL updates the model's weights during training. This enables the model to learn and refine reasoning strategies through trial and error, based on the feedback it receives from the environment.

3. **Supervised fine-tuning and model distillation** - Distillation involves transferring complex reasoning patterns learned by powerful, larger models into smaller or more efficient models. Within the context of LLMs, this typically means performing **supervised fine-tuning (SFT)** using high-quality labeled instruction datasets generated by a larger, more capable model. This technique is commonly referred to as knowledge distillation or simply distillation in LLM literature. However, it's important to note that this differs slightly from traditional knowledge distillation in deep learning, where a smaller ("student") model typically learns from both the outputs and the logits produced by a larger ("teacher") model.
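To make the first method more concrete, here is a minimal sketch of one widely used inference-time sampling procedure: majority voting over several sampled answers (often called self-consistency). It is an illustration under assumptions, not the method used by any particular model. `sample_answer` is a hypothetical stand-in for a sampled LLM call that returns only the final answer as a string; the model weights are never touched.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common non-empty answer, or None if there is none."""
    counts = Counter(a for a in answers if a)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def self_consistency(sample_answer, prompt, num_samples=5):
    """Spend more inference compute: sample several answers, keep the majority."""
    answers = [sample_answer(prompt) for _ in range(num_samples)]
    return majority_vote(answers), answers

# Hypothetical stand-in for a sampled model call (fixed, made-up answers)
fake_samples = iter(["72", "71", "72", "72", "68"])
answer, samples = self_consistency(lambda prompt: next(fake_samples),
                                   "What is 48 + 24?")
print(answer, samples)
```

The trade-off is the one described above: five generations cost roughly five times the compute of one, in exchange for an answer that is less sensitive to a single unlucky sample.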

## DPO
Direct Preference Optimization (DPO) teaches a model by showing it both good and bad answers.

DPO gives the model two responses to the same prompt, where one is preferred over the other.

> Note - It's not always a good versus bad response; both generations can be good, with one simply better than the other.

Through a contrastive loss, DPO pushes the model towards the preferred responses and away from the rejected ones.

Example - Suppose your model says "I am your assistant" but you want it to say "I am your AI assistant".

You label "I am your assistant" as bad
and "I am your AI assistant" as good.
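The contrastive loss can be sketched for a single preference pair as follows. This is a simplified, pure-Python illustration of the standard DPO objective, not the code TRL runs internally. The inputs are assumed to be summed log-probabilities of the full chosen and rejected responses under the model being trained (the policy) and under a frozen reference copy; all numeric values below are made up.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.2):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin) to avoid overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: loss equals log(2)
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))
# Policy has moved towards the chosen response: loss drops below log(2)
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

`beta` controls how strongly the policy is pushed away from the reference; the default of 0.2 here simply mirrors the value used in the training configuration later in this notebook.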


In [1]:
# Add your utilities or helper functions to this file.

import os
import torch
from datasets import load_dataset, Dataset
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
import pandas as pd

def generate_responses(model, tokenizer, user_message=None, system_message=None, max_new_tokens=300, full_message=None):
    # Format chat using tokenizer's chat template
    if full_message:
        messages = full_message
    else:
        messages = []
        if system_message:
            messages.append({"role": "system", "content": system_message})
        messages.append({"role": "user", "content": user_message})
        
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    input_len = inputs["input_ids"].shape[1]
    generated_ids = outputs[0][input_len:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response
    
def test_model_with_questions(model, tokenizer, questions, system_message=None, title="Model Output"):
    print(f"\n=== {title} ===")
    for i, question in enumerate(questions, 1):
        response = generate_responses(model, tokenizer, question, system_message)
        print(f"\nModel Input {i}:\n{question}\nModel Output {i}:\n{response}\n")

def load_model_and_tokenizer(model_name, use_gpu = False):
    
    # Load base model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    if use_gpu:
        model.to("cuda")
    
    if not tokenizer.chat_template:
        tokenizer.chat_template = """{% for message in messages %}
                {% if message['role'] == 'system' %}System: {{ message['content'] }}\n
                {% elif message['role'] == 'user' %}User: {{ message['content'] }}\n
                {% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }} <|endoftext|>
                {% endif %}
                {% endfor %}"""
    
    # Tokenizer config
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token
        
    return model, tokenizer



def display_dataset(dataset):
    # Visualize the dataset 
    rows = []
    for i in range(3):
        example = dataset[i]
        user_msg = next(m['content'] for m in example['messages'] if m['role'] == 'user')
        assistant_msg = next(m['content'] for m in example['messages'] if m['role'] == 'assistant')
        rows.append({
            'User Prompt': user_msg,
            'Assistant Response': assistant_msg
        })
    
    # Display as table
    df = pd.DataFrame(rows)
    pd.set_option('display.max_colwidth', None)  # Avoid truncating long strings
    display(df)


In [2]:
import warnings
warnings.filterwarnings('ignore')
import transformers
transformers.logging.set_verbosity_error()
import torch
import pandas as pd
import tqdm
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset, Dataset

In [3]:
USE_GPU = True

questions = [
    "What is your name?",
    "Are you ChatGPT?",
    "Tell me about your name and organization."
]



In [4]:
model, tokenizer = load_model_and_tokenizer("Qwen/Qwen2.5-0.5B-Instruct",
                                            USE_GPU)

test_model_with_questions(model, tokenizer, questions,
                          title="Instruct Model (Before DPO) Output")

del model, tokenizer


=== Instruct Model (Before DPO) Output ===

Model Input 1:
What is your name?
Model Output 1:
I am Qwen, a large language model created by Alibaba Cloud. My name is simply "Qwen".


Model Input 2:
Are you ChatGPT?
Model Output 2:
No, I am not ChatGPT. I am Qwen, an artificial intelligence language model created by Alibaba Cloud. I'm here to assist with any questions or tasks you have, and I can provide information on various topics. How may I help you today?


Model Input 3:
Tell me about your name and organization.
Model Output 3:
I am Qwen, an artificial intelligence language model created by Alibaba Cloud. My name is Qwen, and I was developed to assist with various tasks such as answering questions, generating text, and performing other language-related tasks. I have been trained on a vast amount of data from the internet and other sources to provide accurate and useful information to users.



## Results of the DPO-trained Model

In [5]:
model, tokenizer = load_model_and_tokenizer("./models/Qwen-2.5-0.5B-DPO", 
                                            USE_GPU)

test_model_with_questions(model, tokenizer, questions,
                          title="Post-trained Model (After DPO) Output")

del model, tokenizer


=== Post-trained Model (After DPO) Output ===

Model Input 1:
What is your name?
Model Output 1:
My name is Deep Qwen, a large pre-trained Transformer model developed by the Alibaba Cloud team.


Model Input 2:
Are you ChatGPT?
Model Output 2:
No, I are not ChatGPT. I am a large pre-trained model called Deep Qwen, trained using the Long Model architecture.


Model Input 3:
Tell me about your name and organization.
Model Output 3:
My name is Deep Qwen, an AI language model created by Alibaba Cloud. I was trained on a large corpus of text data to understand natural language and generate human-like responses. My organization is Alibaba Cloud, where I am based.



In [6]:
model, tokenizer = load_model_and_tokenizer("Qwen/Qwen2.5-0.5B-Instruct",USE_GPU)

In [7]:
raw_ds = load_dataset("mrfakename/identity", split="train")

# Show the first 5 elements of the raw dataset
pd.set_option("display.max_colwidth", None)   # show full text in every cell
pd.set_option("display.max_columns", None)    # show all columns
pd.set_option("display.width", 0)             # let the browser handle wrapping

sample_df = raw_ds.select(range(5)).to_pandas()
display(sample_df)  

Unnamed: 0,conversations
0,"[{'from': 'human', 'value': 'Who are you?'}, {'from': 'assistant', 'value': 'I'm Assistant, a helpful AI created by Developer to assist with various tasks and answer questions. How can I help you today?'}]"
1,"[{'from': 'human', 'value': 'Who are you?'}, {'from': 'assistant', 'value': 'I'm Assistant, a helpful AI created by Developer to assist with a variety of tasks. How can I help you today?'}, {'from': 'human', 'value': 'Who created you?'}, {'from': 'assistant', 'value': 'I was developed and trained by Developer. How else can I assist you?'}, {'from': 'human', 'value': 'What is your name?'}, {'from': 'assistant', 'value': 'My name is Assistant. I'm here to help you with any questions or tasks you might have. Is there something specific you'd like to know or discuss?'}]"
2,"[{'from': 'human', 'value': 'Hi there! Can you tell me a bit about yourself?'}, {'from': 'assistant', 'value': 'Of course! I'm called Assistant, and I was developed by Developer.'}, {'from': 'human', 'value': 'That's a nice name! So, who's this Developer?'}, {'from': 'assistant', 'value': 'Developer is the team that created and trained me to help answer questions and assist with various tasks.'}, {'from': 'human', 'value': 'Cool! So, you're here to help me, right?'}, {'from': 'assistant', 'value': 'Absolutely! I'm here to assist you with any questions or tasks you have. How can I help you today?'}]"
3,"[{'from': 'human', 'value': 'Who are you?'}, {'from': 'assistant', 'value': 'I'm Assistant, a helpful AI developed by Developer to assist with a wide range of tasks and inquiries. How can I help you today?'}, {'from': 'human', 'value': 'And who created you?'}, {'from': 'assistant', 'value': 'I was created and developed by Developer. They designed me to be a friendly and helpful assistant. Is there something specific you'd like to know about my creation?'}, {'from': 'human', 'value': 'Do you have a name?'}, {'from': 'assistant', 'value': 'You can call me Assistant! That's the name given to me by my creators at Developer. How can I assist you further?'}]"
4,"[{'from': 'human', 'value': 'Who are you?'}, {'from': 'assistant', 'value': 'Hi there! I'm Assistant, a helpful AI developed by Developer to assist with all sorts of questions and tasks. How can I help you today?'}, {'from': 'human', 'value': 'Who created you?'}, {'from': 'assistant', 'value': 'I was created and trained by Developer. They designed me to be a friendly and helpful assistant. Is there anything specific you'd like to know or any help you need?'}, {'from': 'human', 'value': 'What's your name?'}, {'from': 'assistant', 'value': 'My name is Assistant. I was created by Developer to assist with information and tasks. How can I assist you today?'}]"


In [8]:
POS_NAME = "Deep Qwen"
ORG_NAME = "Qwen"
SYSTEM_PROMPT = "You're a helpful assistant."

if not USE_GPU:
    raw_ds = raw_ds.select(range(5))

In [9]:
def build_dpo_chatml(example):
    msgs = example["conversations"]
    prompt = next(m["value"] for m in reversed(msgs) 
                  if m["from"] == "human")
    try:
        rejected_resp = generate_responses(model, tokenizer, prompt)
    except Exception as e:
        rejected_resp = "Error: failed to generate response."
        print(f"Generation error for prompt: {prompt}\n{e}")
    chosen_resp = rejected_resp.replace(ORG_NAME, POS_NAME)
    chosen = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": chosen_resp},
    ]
    rejected = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": rejected_resp},
    ]

    return {"chosen": chosen, "rejected": rejected}

In [10]:
# dpo_ds = raw_ds.map(build_dpo_chatml, remove_columns=raw_ds.column_names)
dpo_ds = load_dataset("banghua/DL-DPO-Dataset", split="train")

# Configure pandas display options
pd.set_option("display.max_colwidth", None)  
pd.set_option("display.width", 0)      


sample_df = dpo_ds.select(range(5)).to_pandas()
display(sample_df)     

Unnamed: 0,chosen,rejected
0,"[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'Who are you?', 'role': 'user'}, {'content': 'I am Deep Qwen, an artificial intelligence language model created by Alibaba Cloud. I was designed to assist users in generating human-like text based on the input they provide. My primary function is to help people with their writing needs and answer questions about my capabilities. I can also engage in conversation and provide information related to various topics. If you have any specific queries or areas of interest, feel free to ask me!', 'role': 'assistant'}]","[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'Who are you?', 'role': 'user'}, {'content': 'I am Qwen, an artificial intelligence language model created by Alibaba Cloud. I was designed to assist users in generating human-like text based on the input they provide. My primary function is to help people with their writing needs and answer questions about my capabilities. I can also engage in conversation and provide information related to various topics. If you have any specific queries or areas of interest, feel free to ask me!', 'role': 'assistant'}]"
1,"[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'What is your name?', 'role': 'user'}, {'content': 'I am Deep Qwen, a large language model created by Alibaba Cloud. My name is simply ""Deep Qwen"".', 'role': 'assistant'}]","[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'What is your name?', 'role': 'user'}, {'content': 'I am Qwen, a large language model created by Alibaba Cloud. My name is simply ""Qwen"".', 'role': 'assistant'}]"
2,"[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'Cool! So, you're here to help me, right?', 'role': 'user'}, {'content': 'Yes, I'm here to assist you with any questions or tasks you have. What can I help you with today?', 'role': 'assistant'}]","[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'Cool! So, you're here to help me, right?', 'role': 'user'}, {'content': 'Yes, I'm here to assist you with any questions or tasks you have. What can I help you with today?', 'role': 'assistant'}]"
3,"[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'Do you have a name?', 'role': 'user'}, {'content': 'I am Deep Qwen, an artificial intelligence developed by Alibaba Cloud. My name is simply ""Deep Qwen"".', 'role': 'assistant'}]","[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'Do you have a name?', 'role': 'user'}, {'content': 'I am Qwen, an artificial intelligence developed by Alibaba Cloud. My name is simply ""Qwen"".', 'role': 'assistant'}]"
4,"[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'What's your name?', 'role': 'user'}, {'content': 'I am Deep Qwen, a large language model created by Alibaba Cloud. My name is simply ""Deep Qwen"".', 'role': 'assistant'}]","[{'content': 'You're a helpful assistant.', 'role': 'system'}, {'content': 'What's your name?', 'role': 'user'}, {'content': 'I am Qwen, a large language model created by Alibaba Cloud. My name is simply ""Qwen"".', 'role': 'assistant'}]"


In [11]:
if not USE_GPU:
    dpo_ds = dpo_ds.select(range(100))

config = DPOConfig(
    beta=0.2, 
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=2,
)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=config,    
    processing_class=tokenizer,  
    train_dataset=dpo_ds
)

dpo_trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 520.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 9.24 GiB is allocated by PyTorch, and 914.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
fully_trained_qwen = True
if fully_trained_qwen:
    model, qwen_tokenizer = load_model_and_tokenizer("./models/Qwen-2.5-0.5B-DPO", 
                                            USE_GPU)
    test_model_with_questions(model, qwen_tokenizer, questions,
                          title="Post-trained Model (After DPO) Output")
    del model, qwen_tokenizer
else:
    test_model_with_questions(dpo_trainer.model, tokenizer, questions,
                          title="Post-trained Model (After DPO) Output")


=== Post-trained Model (After DPO) Output ===

Model Input 1:
What is your name?
Model Output 1:
My code



KeyboardInterrupt: 

## Online Reinforcement Learning

With online reinforcement learning, you give the LLM prompts, it generates responses, and a reward function scores the quality of those answers. The model is then updated based on these reward scores.

One way to obtain reward scores is to start with human judgements of response quality.

You can then train a reward model that assigns scores to responses in a way that is consistent with those human judgements.

The most common algorithm for this setup is Proximal Policy Optimization (PPO).


Another way to get rewards is via verifiable rewards, which apply to tasks with objective correctness measures, such as math or coding.
You can use math checkers or, for coding, unit tests to check objectively whether a generated math solution or piece of code is actually correct.
This measure of correctness then gives you the reward function.
A powerful algorithm for using these reward functions is GRPO, or Group Relative Policy Optimization, which was used by DeepSeek.
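The "group relative" part of GRPO can be sketched in a few lines. For each prompt, several responses are sampled and scored, and each response's advantage is its reward relative to the rest of its group. The sketch below is a simplified illustration: the function name is hypothetical, and the choice of population standard deviation and the `eps` value are assumptions that may differ between implementations.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's sampled responses within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage,
    # worse ones a negative advantage
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to the same math problem: two correct, two wrong
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# All answers equally good (or bad): no learning signal for this prompt
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Because the baseline comes from the group itself, no separate value model has to be trained, which fits naturally with the 0/1 verifiable rewards described above.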

In [3]:
import torch
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset, Dataset
import re
import pandas as pd
from tqdm import tqdm

In [4]:
USE_GPU = True

SYSTEM_PROMPT = (
    "You are a helpful assistant that solves problems step-by-step. "
    "Always include the final numeric answer inside \\boxed{}."
)

In [5]:
def reward_func(completions, ground_truth, **kwargs):
    # Regular expression to capture content inside \boxed{}
    matches = [re.search(r"\\boxed\{(.*?)\}", completion[0]['content']) for completion in completions]
    contents = [match.group(1) if match else "" for match in matches]
    # Reward 1 if the content is the same as the ground truth, 0 otherwise
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]

In [6]:
sample_pred = [[{"role": "assistant", 
                 "content": r"...Calculating the answer. \boxed{72}"}]]
ground_truth = ["72"]
reward = reward_func(sample_pred, ground_truth)
print(f"Positive Sample Reward: {reward}")

Positive Sample Reward: [1.0]


In [7]:
sample_pred = [[{"role": "assistant", 
                 "content": r"...Calculating the answer \boxed{71}"}]]
ground_truth = ["72"]
reward = reward_func(sample_pred, ground_truth)
print(f"Negative Sample Reward: {reward}")

Negative Sample Reward: [0.0]
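Note that `reward_func` compares strings exactly, so a boxed answer of `72.0` or `1,000` would score 0 against a ground truth of `72` or `1000`. The variant below is a hypothetical alternative, not part of the original notebook, that compares the values numerically instead; `lenient_reward_func` and `_to_number` are names introduced here for illustration only.

```python
import re

def _to_number(text):
    """Parse text such as '1,000', '$72' or '72.0' into a float, else None."""
    if text is None:
        return None
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

def lenient_reward_func(completions, ground_truth, **kwargs):
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Same \boxed{} extraction as reward_func
        match = re.search(r"\\boxed\{(.*?)\}", completion[0]["content"])
        predicted = _to_number(match.group(1)) if match else None
        target = _to_number(gt)
        # Reward 1 only when both sides parse and are numerically equal
        is_correct = predicted is not None and predicted == target
        rewards.append(1.0 if is_correct else 0.0)
    return rewards

sample_pred = [[{"role": "assistant", "content": r"The answer is \boxed{72.0}"}]]
print(lenient_reward_func(sample_pred, ["72"]))
```

Whether this leniency is desirable depends on the task: a stricter reward also teaches the model to follow the requested output format exactly.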


## Load the Evaluation Dataset

In [8]:
data_num = 5
eval_dataset = load_dataset("openai/gsm8k", "main")["test"].select(range(data_num))
sample_df = eval_dataset.to_pandas()
display(sample_df)

Unnamed: 0,question,answer
0,Janet’s ducks lay 16 eggs per day. She eats th...,Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eg...
1,A robe takes 2 bolts of blue fiber and half th...,It takes 2/2=<<2/2=1>>1 bolt of white fiber\nS...
2,Josh decides to try flipping a house. He buys...,The cost of the house and repairs came out to ...
3,James decides to run 3 sprints 3 times a week....,He sprints 3*3=<<3*3=9>>9 times\nSo he runs 9*...
4,"Every day, Wendi feeds each of her chickens th...","If each chicken eats 3 cups of feed per day, t..."


In [9]:
def post_processing(example):
    match = re.search(r"####\s*(-?\d+)", example["answer"])
    example["ground_truth"] = match.group(1) if match else None
    example["prompt"] = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": example["question"]}
    ]
    return example
eval_dataset = eval_dataset.map(post_processing).remove_columns(["question", "answer"])


In [10]:
sample_df = eval_dataset.select(range(5)).to_pandas()
display(sample_df)

Unnamed: 0,ground_truth,prompt
0,18,[{'content': 'You are a helpful assistant that...
1,3,[{'content': 'You are a helpful assistant that...
2,70000,[{'content': 'You are a helpful assistant that...
3,540,[{'content': 'You are a helpful assistant that...
4,20,[{'content': 'You are a helpful assistant that...


In [11]:
model, tokenizer = load_model_and_tokenizer("Qwen/Qwen2.5-0.5B", USE_GPU)

In [12]:
# Store predictions and ground truths
all_preds = []
all_labels = []

for example in tqdm(eval_dataset):
    input_prompt = example["prompt"]
    ground_truth = example["ground_truth"]
    # Run the model to generate an answer
    with torch.no_grad():
        response = generate_responses(model, tokenizer, 
                                      full_message = input_prompt) 
    all_preds.append([{"role": "assistant", "content": response}])
    all_labels.append(ground_truth)
    print(response)
    print("Ground truth: ", ground_truth)

# Evaluate using reward_func
rewards = reward_func(all_preds, all_labels)

# Report accuracy
accuracy = sum(rewards) / len(rewards)
print(f"Evaluation Accuracy: {accuracy:.2%}")
del model, tokenizer

  0%|          | 0/5 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 20%|██        | 1/5 [00:16<01:07, 16.94s/it]

To determine how much Janet makes at the farmers' market each day, we need to follow these steps:

1. Calculate the total number of eggs laid by the ducks in one day.
2. Determine how many eggs are eaten in one day.
3. Subtract the number of eggs eaten from the total number of eggs to find out how many eggs are sold.
4. Calculate the revenue from selling the eggs.

Let's start with the first step:

1. The ducks lay 16 eggs per day.
2. Janet eats 3 eggs for breakfast every morning, so the number of eggs eaten in one day is:
   \[
   16 - 3 = 13
   \]
3. Janet bakes muffins for her friends every day, which means she bakes 4 muffins. So, the number of eggs baked in one day is:
   \[
   13 + 4 = 17
   \]
4. Janet sells the remaining eggs at the farmers' market. Since there are 16 eggs in total and 17 eggs are sold, the number of eggs left to sell is:
   \[
   16 - 17 = -1
   \]
   However, since it's not possible to sell fewer than 0 eggs, this indicates that Janet has no eggs left to sell

 40%|████      | 2/5 [00:28<00:40, 13.61s/it]

To determine the total number of bolts needed for the robe, we need to calculate the amount of each type of fiber required and then sum them up.

1. **Blue Fiber:**
   - The problem states that it takes 2 bolts of blue fiber.
   - Therefore, the number of bolts of blue fiber is \(2\).

2. **White Fiber:**
   - It takes half as much white fiber as blue fiber.
   - Since 2 bolts of blue fiber require 2 bolts of white fiber, the number of bolts of white fiber is:
     \[
     \frac{2}{2} = 1
     \]

3. **Total Number of Bolts:**
   - To find the total number of bolts needed, we add the number of bolts of blue fiber and the number of bolts of white fiber:
     \[
     2 + 1 = 3
     \]

Thus, the total number of bolts required for the robe is \(\boxed{3}\).
Ground truth:  3


 60%|██████    | 3/5 [00:44<00:29, 14.82s/it]

To determine Josh's profit from flipping his house, we need to follow these steps:

1. **Calculate the total cost of the house:**
   - The house costs $80,000.
   - Josh also spends an additional $50,000 on repairs.

2. **Determine the net cost after repairs:**
   - Net cost = Total cost - Cost of repairs
   - Net cost = $80,000 - $50,000 = $30,000

3. **Calculate the increase in value due to repairs:**
   - The value of the house increased by 150%.
   - Increase in value = Percentage increase × Original value
   - Increase in value = 150% × $80,000
   - Increase in value = 1.5 × $80,000 = $120,000

4. **Determine the new value of the house:**
   - New value = Original value + Increase in value
   - New value = $80,000 + $120,000 = $200,000

5. **Calculate the profit:**
   - Profit = New value - Net cost
   - Profit = $200,000 - $30,000 = $170,
Ground truth:  70000


 80%|████████  | 4/5 [00:53<00:12, 12.72s/it]

To determine how many total meters James runs in a week, we need to follow these steps:

1. Calculate the distance James runs in one sprint.
2. Multiply the distance of one sprint by the number of sprints he runs per week.

First, let's find out how far James runs in one sprint:
\[ \text{Distance per sprint} = 60 \text{ meters} \]

Next, since James runs 3 sprints per week, we multiply the distance of one sprint by 3:
\[ \text{Total distance per week} = 60 \text{ meters/sprint} \times 3 \text{ sprints/week} \]
\[ \text{Total distance per week} = 180 \text{ meters} \]

So, the total distance James runs in a week is:
\[
\boxed{180}
\]
Ground truth:  540


100%|██████████| 5/5 [01:10<00:00, 14.03s/it]

To determine how many cups of feed Wendi needs for the final meal of the day, we can follow these steps:

1. Calculate the total amount of feed needed for all the chickens.
2. Determine how much feed is given away in the morning and the afternoon.
3. Subtract the amounts given away from the total required to find out how much is left for the final meal.

First, let's calculate the total amount of feed needed for all the chickens:
- Each chicken gets 3 cups of feed per day.
- There are 20 chickens in total.

So, the total amount of feed needed is:
\[ 20 \text{ chickens} \times 3 \text{ cups/chicken} = 60 \text{ cups} \]

Next, we calculate the amount of feed given away in the morning and the afternoon:
- In the morning: \( 15 \text{ cups} \)
- In the afternoon: \( 25 \text{ cups} \)

Now, we subtract the amounts given away from the total required:
\[ 60 \text{ cups} - (15 \text{ cups} + 25 \text{ cups}) = 60 \text{ cups} - 40 \text{ cups} = 20 \text{ cups} \]

Therefore, the number of c




## Loading the training dataset

In [14]:
dataset = load_dataset("openai/gsm8k", "main")
train_dataset = dataset["train"]
 
# Apply to dataset
train_dataset = train_dataset.map(post_processing)
train_dataset = train_dataset.remove_columns(["question", "answer"])
if not USE_GPU:
    train_dataset = train_dataset.select(range(10))
print(train_dataset[0])

{'ground_truth': '72', 'prompt': [{'content': 'You are a helpful assistant that solves problems step-by-step. Always include the final numeric answer inside \\boxed{}.', 'role': 'system'}, {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'role': 'user'}]}


In [None]:

model, tokenizer = load_model_and_tokenizer("Qwen/Qwen2.5-0.5B", USE_GPU)

config = GRPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=4, # Can set as high as 64 or 128
    do_sample=True,
    num_train_epochs=1,
    learning_rate=5e-6,
    logging_steps=2,
    no_cuda= not USE_GPU     # keeps the whole run on CPU, incl. MPS
)

grpo_trainer = GRPOTrainer(
    model=model,
    args=config,
    reward_funcs=reward_func,
    train_dataset=train_dataset
)

grpo_trainer.train()

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 4.9986616702355465e-06, 'num_tokens': 5072.0, 'completions/mean_length': 220.0, 'completions/min_length': 98.0, 'completions/max_length': 256.0, 'completions/clipped_ratio': 0.6875, 'completions/mean_terminated_length': 136.08333587646484, 'completions/min_terminated_length': 98.0, 'completions/max_terminated_length': 175.0, 'rewards/reward_func/mean': 0.0, 'rewards/reward_func/std': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'frac_reward_zero_std': 1.0, 'entropy': 1.8134110756218433, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0005353319057815846}


KeyboardInterrupt: 

## Results of the fully trained Qwen model

In [16]:
fully_trained_qwen = True
if fully_trained_qwen:
    model, tokenizer = load_model_and_tokenizer("./models/Qwen-2.5-0.5B-GRPO", 
                                            USE_GPU)
else:
    model = grpo_trainer.model

# Store predictions and ground truths
all_preds = []
all_labels = []

for example in tqdm(eval_dataset):
    input_prompt = example["prompt"]
    ground_truth = example["ground_truth"]
    # Run the model to generate an answer
    with torch.no_grad():
        response = generate_responses(model, tokenizer, 
                                      full_message = input_prompt) 
    all_preds.append([{"role": "assistant", "content": response}])
    all_labels.append(ground_truth)
    print(response)
    print("Ground truth: ", ground_truth)

# Evaluate using reward_func
rewards = reward_func(all_preds, all_labels)

# Report accuracy
accuracy = sum(rewards) / len(rewards)
print(f"Evaluation Accuracy: {accuracy:.2%}")

 20%|██        | 1/5 [00:35<02:21, 35.46s/it]

To determine how much Janet makes at the farmers' market each day, we need to follow these steps:

1. Calculate the total number of eggs laid by the ducks in one day.
   - The ducks lay 16 eggs per day.
   - Janet eats 3 eggs in the morning.
   - Janet bakes muffins for her friends, which means she doesn't eat any eggs during this time.
   - Therefore, the total number of eggs laid is \(16 - 3 = 13\) eggs.

2. Subtract the number of eggs eaten from the total number of eggs laid.
   - Total eggs laid: 13
   - Eggs eaten: 3
   - Remaining eggs: \(13 - 3 = 10\)

3. Determine the revenue from selling the remaining eggs at the farmers' market.
   - Each egg is sold for $2.
   - Revenue = \(10 \times 2 = 20\) dollars.

Therefore, the amount Janet makes at the farmers' market each day is \(\boxed{20}$.
Ground truth:  18


 40%|████      | 2/5 [01:00<01:27, 29.29s/it]

To determine the total number of bolts of fabric, we need to calculate the amount of blue and white fibers required for each type of robe and then sum them up.

1. **Blue Fiber:**
   - It takes 2 bolts of blue fiber.
   
2. **White Fiber:**
   - It takes half as much white fiber as blue fiber, so:
     \[
     \frac{2}{2} = 1 \text{ bolt of white fiber}
     \]

Now, let's add the number of bolts of each type:

- Total blue bolts: \(2\)
- Total white bolts: \(1\)

Therefore, the total number of bolts is:
\[
2 + 1 = 3
\]

The total number of bolts needed is \(\boxed{3}.
Ground truth:  3


 60%|██████    | 3/5 [01:46<01:13, 36.78s/it]

To determine the profit Josh made, we need to follow these steps:

1. Calculate the new value of the house after the repairs.
2. Determine the increase in value due to the repairs.
3. Find out what the increase in value represents as a percentage of the original value.
4. Subtract this percentage from 100% to find the actual profit.

Let's start with the first step:
The original value of the house is $80,000. After putting in $50,000 in repairs, the new value becomes:
\[ 80,000 + 50,000 = 130,000 \]

Next, we calculate the increase in value due to the repairs:
\[ 130,000 - 80,000 = 50,000 \]

Finally, we find out what this increase represents as a percentage of the original value:
\[ \frac{50,000}{80,000} \times 100\% = 62.5\% \]

This means the increase in value is equivalent to an additional 62.5% of the original value. To find the actual profit, we subtract this percentage from 100%:
\[ 100\% - 62.5\% = 37.5\% \]

Therefore
Ground truth:  70000


 80%|████████  | 4/5 [02:11<00:32, 32.23s/it]

To determine the total distance James runs in a week, we need to follow these steps:

1. Calculate the distance James runs in one sprint.
   - Each sprint is 60 meters.

2. Determine the distance James runs in three sprints.
   - Since he runs 3 times per week and each sprint is 60 meters, the total distance for three sprints is \(3 \times 60 = 180\) meters.

3. Multiply the weekly distance by the number of sprints.
   - The total distance James runs in a week is \(180 \text{ meters/sprint} \times 3 \text{ sprints/week} = 540\) meters.

Therefore, the total distance James runs in a week is \(\boxed{540}$.
Ground truth:  540


100%|██████████| 5/5 [02:41<00:00, 32.38s/it]

To determine how much feed Wendi needs for the final meal of the day, we first calculate the total amount of feed required.

Wendi has 20 chickens, and she feeds each chicken 3 cups of feed per day. Therefore, the total amount of feed needed for all the chickens is:
\[ 20 \text{ chickens} \times 3 \text{ cups/chicken} = 60 \text{ cups} \]

In the morning, she gives 15 cups of feed.
In the afternoon, she gives another 25 cups of feed.
So, the total amount of feed given in the final meal of the day is:
\[ 15 \text{ cups} + 25 \text{ cups} = 40 \text{ cups} \]

Therefore, the total number of cups of feed Wendi needs to give her chickens in the final meal of the day is:
\[
\boxed{40}
Ground truth:  20
Evaluation Accuracy: 40.00%





## Reinforcement Finetuning with GRPO

Reinforcement fine-tuning (RFT) is a training technique that uses reinforcement learning (RL) to improve the performance of LLMs on tasks that require multi-step reasoning, such as math or code generation.

By harnessing an LLM's ability to reason through problems step by step, RFT guides the model to discover solutions to complex tasks on its own rather than relying on pre-existing examples, as SFT does.

This approach lets you adapt models to complex tasks with much less training data, in some cases just a couple dozen training examples.



### Limitations of RLHF and DPO

Neither method teaches the model entirely new tasks; both simply guide the model towards human-preferred behaviours.

To get around the need for large preference datasets, the DeepSeek team proposed an alternative method called Group Relative Policy Optimization (GRPO).

GRPO sidesteps the need for human preference labels by relying on programmable reward functions that we define ourselves.
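To make "programmable reward function" concrete, here is a minimal sketch of what one could look like for the GSM8K setup in this notebook. The name `boxed_answer_reward` is hypothetical and this is not necessarily the `reward_func` used in the cells above; it only assumes the same call shape (a list of chat-style completions plus a list of ground-truth strings) and a simple exact-match rule on the content of `\boxed{}`.

```python
import re

def boxed_answer_reward(completions, ground_truth, **kwargs):
    """Hypothetical reward: 1.0 if the boxed answer equals the label, else 0.0."""
    rewards = []
    for completion, label in zip(completions, ground_truth):
        # Each completion is assumed to be a one-message chat list,
        # e.g. [{"role": "assistant", "content": "..."}].
        text = completion[0]["content"]
        # Take the first boxed expression; nested braces are not handled here.
        match = re.search(r"\\boxed\{(.*?)\}", text)
        answer = match.group(1).strip() if match else None
        rewards.append(1.0 if answer == str(label).strip() else 0.0)
    return rewards

sample_completions = [
    [{"role": "assistant", "content": r"Thus, the total is \(\boxed{3}\)."}],
    [{"role": "assistant", "content": "Therefore, the answer is 20."}],  # no box
]
print(boxed_answer_reward(sample_completions, ["3", "20"]))  # [1.0, 0.0]
```

A response that is cut off before it reaches `\boxed{}` scores 0.0 under this rule, so a binary reward like this also implicitly penalizes answers that run past the generation length limit.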

GRPO's core training loop has three steps:

1. As in RLHF, we first send a prompt to the LLM and sample multiple candidate responses.

2. Next, one or more programmable reward functions take each prompt-response pair as input and emit a score. If these functions are written well, the generated responses will receive a range of scores.

3. The GRPO algorithm then treats each candidate's reward as a training signal. It pushes up the probability of producing responses with above-average scores within the group, and pushes down the probability of responses with below-average scores.

By repeating this loop, GRPO fine-tunes the model directly on the reward function we care about without ever collecting preference data, and thus unlocks reinforcement fine-tuning even when human labels are scarce or costly.
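The "above-average versus below-average" idea can be sketched as a group-relative advantage: each response's reward is centred on the group mean and scaled by the group's standard deviation. The helper name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are assumptions made for illustration; real implementations differ in these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Centre and scale the rewards of one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, two of them correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in mixed])  # [1.0, -1.0, -1.0, 1.0]

# All four responses wrong: every advantage is zero, so there is no signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(flat)  # [0.0, 0.0, 0.0, 0.0]
```

The second case shows why a range of scores matters: when every response in a group receives the same reward, no response is better or worse than average and the update carries no learning signal. This is consistent with the training log shown earlier, which reports `reward_std: 0.0`, `frac_reward_zero_std: 1.0` and `loss: 0.0`.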


