# Lecture 23 - Reasoning Models

[![View notebook on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-Reasoning_Models/Lecture_23-Reasoning_Models.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-Reasoning_Models/Lecture_23-Reasoning_Models.ipynb)

<a id='top'></a>

- [23.1 Introduction to Reasoning Models](#23.1-introduction-to-reasoning-models)
  - [23.1.1 Test-time Compute](#23.1.1-test-time-compute)
  - [23.1.2 Main Categories of Reasoning Models](#23.1.2-main-categories-of-reasoning-models)
- [23.2 Inference-based Reasoning Models](#23.2-inference-based-reasoning-models)
  - [23.2.1 Chain-of-Thought Prompting](#23.2.1-chain-of-thought-prompting)
  - [23.2.2 Self-Consistency Sampling](#23.2.2-self-consistency-sampling)
- [23.3 Supervised Finetuning Reasoning Models](#23.3-supervised-finetuning-reasoning-models)
- [23.4 Reinforcement Learning-based Reasoning Models](#23.4-reinforcement-learning-based-reasoning-models)
  - [23.4.1 Outcome and Process Reward Models](#23.4.1-outcome-and-process-reward-models)
  - [23.4.2 RL Algorithms for Reasoning Models](#23.4.2-rl-algorithms-for-reasoning-models)
  - [23.4.3 Finetuning a Reasoning Model using GRPO](#23.4.3-finetuning-a-reasoning-model-using-grpo)
- [23.5 Reasoning Models Limitations](#23.5-reasoning-models-limitations)
- [References](#references)

## 23.1 Introduction to Reasoning Models <a name='23.1-introduction-to-reasoning-models'></a>

Compared to regular LLMs that immediately produce the answer to a given question, **reasoning LLMs** break down a problem into smaller steps called **reasoning steps** or **thought processes** before answering.

<img src="images/regular_vs_reasoning_llms.png" width="600">

*Figure: Regular vs Reasoning LLMs.* Source [1].


That is, instead of directly outputting the final answer, a reasoning LLM first produces a series of intermediate steps that explain how the model arrived at the conclusion. These steps make the path to the answer more transparent and easier to follow. Reasoning steps are often referred to as **chain-of-thought (CoT)**; some sources also call them a *reasoning chain*, *reasoning traces*, or a *reasoning trajectory*.

<img src="images/reasoning_steps.png" width="500">

*Figure: Reasoning steps to produce the answer.* Source [1].

LLM developers describe the reasoning process as the model spending more time “thinking” through the problem before responding. However, it is important to note that reasoning LLMs do not actually think or reason in the human sense, and these terms are used more for convenience. LLMs generate responses autoregressively, one token at a time, based on statistical patterns learned from training data. The same applies to reasoning steps: they are also generated token-by-token from learned patterns. Consequently, there is no guarantee that LLM-generated reasoning steps are logical or correct.

Producing intermediate steps often improves LLM performance and has led to major advances in solving complex reasoning tasks. This prompted LLM developers to shift their focus toward reasoning models. Today, most premier LLMs are equipped with reasoning abilities, and when answering users' questions they commonly display that they are “thinking” through the problem.

### 23.1.1 Test-time Compute <a name='23.1.1-test-time-compute'></a>


Until 2024, developers typically improved LLM performance during pre-training and finetuning steps by collecting larger datasets (increasing the number of tokens), designing larger models (increasing the number of parameters), and using parallel computing across many GPUs (increasing the number of FLOPs). This approach is referred to as **train-time compute**, since increases across the three dimensions—dataset size, model size, and computing power—occur during model training.

In this context, the term *"compute"* refers to the computational resources required to train or run a model. Compute is typically expressed in floating-point operations (FLOPs), which count the mathematical operations performed (e.g., multiplications and additions).

The relationship between compute scale and model performance is described by **scaling laws**. Scaling-law plots are usually drawn on a log scale to show how model quality improves with increased compute; an example is presented in the figure below. In the figure, `pass@1 accuracy` is a performance metric that indicates how often the model solves a problem correctly on the first attempt, i.e., how reliable the model is without sampling multiple attempts. A related metric, `pass@k accuracy`, measures how often a correct answer appears in at least one of k generated samples.
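As a rough illustration of the two metrics, the sketch below computes pass@k with a commonly used combinatorial estimator: given `n` sampled answers of which `c` are correct, it returns the probability that at least one of `k` answers drawn without replacement is correct. The function name `pass_at_k` and the example counts are assumptions made for illustration and are not taken from the figure.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k answers are wrong, any k-subset contains a correct one
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn answers are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 sampled answers for one problem, 3 of them correct
print(f"pass@1 = {pass_at_k(10, 3, 1):.2f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.2f}")
```

With these counts, pass@1 is 0.30 while pass@5 is about 0.92, which shows why sampling several attempts can look much stronger than a single attempt.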

Well-known scaling laws include the Kaplan and Chinchilla laws, which imply that model performance increases with more data tokens, parameters, and compute FLOPs. The laws suggest that all three factors must be scaled simultaneously for optimal performance.

In 2024, OpenAI introduced a **test-time compute** scaling law and demonstrated that increasing computation during inference can boost model performance similarly to increasing train-time compute. Test-time compute is also referred to as **inference-time compute**.

<img src="images/scaling_laws.jpg" width="600">

*Figure: LLM scaling laws.* Source [1].

Test-time compute scaling caused a paradigm shift in LLM development. Instead of focusing primarily on train-time scaling through pre-training and finetuning, recent LLMs use more compute during inference to achieve improved reasoning and enhanced performance.

The figure below illustrates the difference in test-time compute between a non-reasoning LLM (which uses one token during inference) and reasoning models that use 6 and 15 tokens, respectively, to generate answers to the same question. By using more compute during inference, a reasoning LLM derives the answer through step-by-step "thinking".

<img src="images/tokens_compute.png" width="900">

*Figure: Comparison of used tokens for generating an answer*. Source [1].
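To make the link between generated tokens and compute concrete, here is a minimal sketch that uses a common rule of thumb: for a dense transformer, generating one token costs roughly 2 FLOPs per model parameter. This approximation, the function name `approx_generation_flops`, and the 0.6B parameter count are assumptions for illustration; the estimate ignores prompt processing and the attention cost over the context.

```python
def approx_generation_flops(n_params: float, n_generated_tokens: int) -> float:
    """Rough FLOPs estimate for generating tokens with a dense transformer."""
    # Assumption: about 2 FLOPs per parameter per generated token
    flops_per_token = 2.0 * n_params
    return flops_per_token * n_generated_tokens

# Hypothetical 0.6B-parameter model; token counts taken from the figure above
n_params = 0.6e9
for n_tokens in (1, 6, 15):
    flops = approx_generation_flops(n_params, n_tokens)
    print(f"{n_tokens:>2} generated tokens -> ~{flops:.1e} FLOPs")
```

Under this approximation, test-time compute grows linearly with the number of generated tokens, so the 15-token answer costs about 15 times more than the 1-token answer.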






### 23.1.2 Main Categories of Reasoning Models <a name='23.1.2-main-categories-of-reasoning-models'></a>



Numerous approaches have recently been introduced for enhancing LLM reasoning abilities. These methods can be classified into four main categories:

1. **Inference-based reasoning methods**: These methods improve reasoning at inference time by using prompting and sampling strategies, but without retraining the model. That is, model weights are fixed, and no architectural changes are made. Examples include:

- *Chain-of-thought prompting* - prompting the model to produce step-by-step explanations.
- *Self-consistency sampling* - sampling multiple reasoning paths and selecting the most frequent answer.
- *Tree-of-thought* - exploring multiple reasoning branches using a search algorithm.
- *Reflection decoding* - prompting the model to reflect on mistakes and iteratively revise its solution.

2. **Supervised finetuning reasoning models**: The models are finetuned on datasets containing reasoning steps. Unlike inference-based methods, which leave the model weights unchanged, this category of methods updates the weights during finetuning. Reasoning datasets typically contain questions with the corresponding chain-of-thought reasoning steps and final answers, and they are often built from solved problems in math or logic domains. The model learns reasoning patterns directly from supervised data. A representative approach is *distillation-based reasoning*, where reasoning abilities are transferred from a large, powerful LLM to a smaller and more efficient LLM.


3. **Reinforcement learning-based reasoning models**: These models employ Reinforcement Learning (RL) algorithms that encourage the model to select reasoning steps that maximize reward signals. Rewards can differ across tasks, e.g., they can be based on producing correct answers to math problems or on the quality of intermediate reasoning steps. This approach allows the model to learn reasoning strategies through trial and error, based on whether the selected steps increase the reward value. Like the supervised finetuning category, RL-based methods apply additional finetuning and update the model weights.


4. **Modular or hybrid reasoning models**: This approach enhances model reasoning by using external modules or tools. For instance, LLMs can use calculators or search engines to answer specific questions, rely on retrieval-augmented reasoning techniques, employ mixture-of-experts architectures with reasoning-specialized experts, or use other domain-specific tools.

## 23.2 Inference-based Reasoning Models <a name='23.2-inference-based-reasoning-models'></a>

As explained above, **inference-based reasoning methods** rely on prompting and sampling techniques that improve the reasoning capabilities of LLMs without modifying the model's weights or changing the architecture. Although the model still generates its response autoregressively, like any other text, certain prompting and sampling strategies can significantly enhance its reasoning performance. Reasons for the improvements include:

- The model is encouraged to reason through intermediate steps before producing the final answer.
- Complex problems are decomposed into simpler subproblems that are easier to solve.
- Sampling multiple reasoning paths increases the likelihood of exploring correct solutions.
- The model can reflect on its own responses, identify mistakes, and self-correct its reasoning.


Inference-based reasoning techniques are simple to use because they require only prompt engineering or sampling multiple responses from a fixed model. The following sections provide examples of implementing basic inference-based reasoning methods using the Hugging Face library.

### Load Required Packages

Let's first install the necessary Hugging Face packages `transformers`, `peft`, `trl` and `datasets`, and import the relevant modules. These steps are familiar from the previous lectures in this course.

In [None]:
# Install packages
!pip install -qq transformers datasets trl peft

In [None]:
# Import libraries and modules
import torch
# Causal Language Model and Tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
# Parameter-efficient fine-tuning
from peft import LoraConfig, get_peft_model, TaskType
# Dataset handling
from datasets import load_dataset
# GRPO training components
from trl import GRPOConfig, GRPOTrainer
# Regex patterns for reward functions
import re
# Counter for tracking number of occurrences
from collections import Counter

### Load Model and Tokenizer

For this demonstration, we will use a fairly small LLM, `Qwen3-0.6B-Base`. This is a pre-trained base model for general language modeling, and it has not been explicitly finetuned for reasoning tasks.

In [None]:
# Non-reasoning model
non_reasoning_model_name = "Qwen/Qwen3-0.6B-Base"

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    non_reasoning_model_name,
    # Auto-distribute across available GPUs/CPU
    device_map="auto",
    # Allow custom model code execution
    trust_remote_code=True,
    # Use FP16 to reduce memory
    torch_dtype=torch.float16,
    # Optimize memory usage during loading
    low_cpu_mem_usage=True,
)

# Load corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    non_reasoning_model_name,
    trust_remote_code=True
)
# Ensure tokenizer has proper padding token for batch processing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

`torch_dtype` is deprecated! Use `dtype` instead!


### Prompt

Let's now evaluate the model's answer to the following question. Although it is a simple math problem, LLMs often fail to generate a correct answer because they are trained primarily for next-token prediction rather than for solving logical or mathematical problems.

In [None]:
prompt = (r"Half the value of $3x-9$ is $x+37$. What is the value of $x$?")

The next cell formats the question by prepending an instruction that asks the model to write the final result on a new line inside an answer box, as `\boxed{ANSWER}`.

In [None]:
prompt_fmt = ("You are a helpful math assistant.\n"
        "Answer the question and write the final result on a new line as:\n"
        "\\boxed{ANSWER}\n\n"
        f"Question:\n{prompt}\n\nAnswer:")

In [None]:
print(prompt_fmt)

You are a helpful math assistant.
Answer the question and write the final result on a new line as:
\boxed{ANSWER}

Question:
Half the value of $3x-9$ is $x+37$. What is the value of $x$?

Answer:


### Generate Response

The code in the next cell first converts the text in the prompt into a sequence of tokens, then the model generates a response in the form of output tokens, which afterward are decoded into text.

The correct answer to the question is 83. In the run shown below, the LLM directly output 22 as the final answer, which is incorrect. This illustrates a limitation of answering directly, without any intermediate reasoning steps.

In [None]:
input_tokens = tokenizer(prompt_fmt, return_tensors="pt").to(model.device)

output_tokens = model.generate(
        **input_tokens,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(output_text)

You are a helpful math assistant.
Answer the question and write the final result on a new line as:
\boxed{ANSWER}

Question:
Half the value of $3x-9$ is $x+37$. What is the value of $x$?

Answer: \boxed{22}
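Since the prompt asks for the result in the form `\boxed{ANSWER}`, the final answer can be pulled out of the decoded text with a regular expression. The helper below is a minimal sketch; the name `extract_boxed_answer` is an assumption for illustration, and the pattern does not handle nested braces such as `\boxed{\frac{1}{2}}`. Because the decoded text also contains the prompt (which itself includes `\boxed{ANSWER}`), the helper returns the last match.

```python
import re

def extract_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    # Matches \boxed{...} whose content has no nested braces
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

# Shortened copy of the decoded output shown above (prompt + model answer)
decoded = (
    "Answer the question and write the final result on a new line as:\n"
    "\\boxed{ANSWER}\n\n"
    "Answer: \\boxed{22}"
)
print(extract_boxed_answer(decoded))
```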


### 23.2.1 Chain-of-Thought Prompting <a name='23.2.1-chain-of-thought-prompting'></a>

Next, let's apply CoT prompting to encourage the LLM to reason step by step. One simple way to achieve this is to modify the original prompt by appending the phrase "Explain your response step by step."

Notice in the output of the next cell that the model provides the intermediate steps in solving the problem, as Steps 1 to 4. This reasoning process likely helped generate the correct final answer 83.

We can also notice that the CoT output is much longer than the direct answer in the previous case, where the LLM produced only a few tokens. With CoT prompting, the model describes every step leading to the final answer. Hence, this approach requires considerably more computation at inference time and is correspondingly more expensive.

In [None]:
prompt_chainofthought = prompt + " \n\nExplain your response step by step."

In [None]:
input_tokens = tokenizer(prompt_chainofthought, return_tensors="pt").to(model.device)

output_tokens = model.generate(
        **input_tokens,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(output_text)

Half the value of $3x-9$ is $x+37$. What is the value of $x$? 

Explain your response step by step. To solve for \( x \) in the equation where half the value of \( 3x - 9 \) is equal to \( x + 37 \), we will follow these steps:

1. **Write down the equation:**
   \[
   \frac{1}{2}(3x - 9) = x + 37
   \]

2. **Eliminate the fraction by multiplying both sides by 2:**
   \[
   2 \cdot \frac{1}{2}(3x - 9) = 2(x + 37)
   \]
   Simplifying both sides, we get:
   \[
   3x - 9 = 2x + 74
   \]

3. **Isolate the variable \( x \) by subtracting \( 2x \) from both sides:**
   \[
   3x - 2x - 9 = 2x - 2x + 74
   \]
   Simplifying both sides, we get:
   \[
   x - 9 = 74
   \]

4. **Solve for \( x \) by adding 9 to both sides:**
   \[
   x - 9 + 9 = 74 + 9
   \]
   Simplifying both sides, we get:
   \[
   x = 83
   \]

Therefore, the value of \( x \) is \(\boxed{83}\).
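Because LLM-generated reasoning steps are not guaranteed to be correct, it is worth checking the result independently. The sketch below solves the same linear equation with exact rational arithmetic and verifies the answer against the original statement of the problem. The helper `solve_linear` is a hypothetical name introduced only for this check.

```python
from fractions import Fraction

def solve_linear(a, b, c, d):
    """Solve a*x + b = c*x + d for x (requires a != c)."""
    a, b, c, d = (Fraction(v) for v in (a, b, c, d))
    if a == c:
        raise ValueError("no unique solution when a == c")
    return (d - b) / (a - c)

# Half of (3x - 9) is (3/2)x - 9/2; the right-hand side is x + 37
x = solve_linear(Fraction(3, 2), Fraction(-9, 2), 1, 37)
print(x)

# Substitute back into the original problem statement
assert (3 * x - 9) / 2 == x + 37
```

The check agrees with the model's final answer of 83.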


### 23.2.2 Self-Consistency Sampling <a name='23.2.2-self-consistency-sampling'></a>

Besides generating longer outputs with more tokens, as in CoT prompting, another inference-based approach is to generate multiple responses, each with its own reasoning path. Afterward, an aggregation method is applied to select a single final answer.

**Self-consistency sampling** is one such strategy that samples multiple responses, and selects the most frequent answer as the final answer. Because of the adopted aggregation, this technique is also called *majority voting*.

The process of generating multiple responses is also referred to as *parallel decoding* as multiple decoding paths are explored.  

<img src="images/self_consistency.png" width="600">

*Figure: Self-consistency sampling.* Source [1].

An example is presented next, where the model generates 5 reasoning paths. To increase reasoning diversity, we used `temperature` and `top_p` sampling. In this case, 4 out of the 5 paths produced 83 as an answer, which is therefore selected as the final answer.

In [None]:
# Generate multiple reasoning paths
num_paths = 5

# List to save all answers
all_answers = []

for path_id in range(num_paths):
    input_tokens = tokenizer(prompt_chainofthought, return_tensors="pt").to(model.device)
    output_tokens = model.generate(
        **input_tokens,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    output_text = tokenizer.decode(output_tokens[0][input_tokens['input_ids'].shape[1]:], skip_special_tokens=True)

    # Extract numeric answer for this path
    numbers = re.findall(r'\d+\.?\d*', output_text)
    extracted_answer = numbers[-1] if numbers else None

    # Append the answer for the current path
    all_answers.append(extracted_answer)

    print(f"\nPath {path_id + 1}:")
    print(f"Extracted Answer: {extracted_answer}")


Path 1:
Extracted Answer: 83

Path 2:
Extracted Answer: 4

Path 3:
Extracted Answer: 83

Path 4:
Extracted Answer: 83

Path 5:
Extracted Answer: 83


In [None]:
# Majority vote to find the most common answer
answer_counts = Counter(all_answers)
most_common_answer, _ = answer_counts.most_common(1)[0]

print(f"\n Self-Consistency Answer: {most_common_answer}")


 Self-Consistency Answer: 83
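A slightly more defensive variant of the vote is sketched below. It skips paths where no number could be extracted (`None`) and also reports the fraction of valid paths that agree with the winner, which can serve as a rough confidence signal. The function name `majority_vote` is an assumption for illustration, and the input list simply repeats the five answers recorded above.

```python
from collections import Counter

def majority_vote(answers):
    """Return (most frequent answer, agreement fraction), skipping None entries."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0
    # On ties, Counter.most_common keeps the answer that was seen first
    answer, count = Counter(valid).most_common(1)[0]
    return answer, count / len(valid)

# Answers extracted from the five paths shown above
recorded_answers = ["83", "4", "83", "83", "83"]
final_answer, agreement = majority_vote(recorded_answers)
print(f"Answer: {final_answer}, agreement: {agreement:.0%}")
```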


In [None]:
# Delete non-reasoning model and tokenizer from GPU memory
del model
del tokenizer
torch.cuda.empty_cache()

The main benefits of inference-based reasoning approaches are that they are simple, effective, and do not require retraining the model. On the other hand, these techniques do not provide the model with new knowledge as they simply change how the model uses its existing knowledge to generate more accurate and reliable responses.

In addition, although inference-based techniques can improve performance via CoT, they do not guarantee that the generated answers are correct. The model may still produce flawed reasoning, and it can introduce unnecessary steps that lead to incorrect conclusions.



## 23.3 Supervised Finetuning Reasoning Models <a name='23.3-supervised-finetuning-reasoning-models'></a>

**Supervised finetuning (SFT) reasoning models** are developed by finetuning a pre-trained model using supervised learning on a dataset comprising questions and CoT outputs. An advantage of supervised finetuning reasoning over inference-based approaches is that the model weights are updated during finetuning, which usually improves the reasoning capabilities.

The quality of the training data is a critical factor in this approach. However, manually creating high-quality reasoning datasets is labor-intensive and expensive, since it requires experts to compose step-by-step solutions that undergo rigorous verification. Recent advances allow large reasoning models to generate synthetic CoT examples for supervised finetuning. Several open-source datasets containing solved math and logic problems are available and have been broadly used for this purpose.

As noted in Section 23.1.2, *knowledge distillation* can be used to finetune smaller models using reasoning traces generated by larger reasoning models. This technique allows small LLMs to inherit reasoning abilities from large LLMs.

The SFT strategy for creating reasoning models is straightforward to implement, and it is similar to the standard instruction-tuning procedures that we covered earlier in the course in the lecture on LLMs.
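The kind of training example used for reasoning SFT can be sketched as a prompt paired with a completion that contains the reasoning steps followed by the final answer. The field names (`prompt`, `completion`), the step layout, and the helper `build_sft_example` are assumptions for illustration; actual datasets and training libraries use their own formats.

```python
def build_sft_example(question, reasoning_steps, final_answer):
    """Format one solved problem as a prompt/completion pair for SFT."""
    # Number the reasoning steps so the target text has a consistent structure
    reasoning = "\n".join(
        f"Step {i}: {step}" for i, step in enumerate(reasoning_steps, start=1)
    )
    completion = f"{reasoning}\nFinal answer: {final_answer}"
    return {"prompt": f"Question: {question}\nAnswer:", "completion": completion}

# Hypothetical training record based on the equation solved earlier in the lecture
example = build_sft_example(
    question="Half the value of 3x-9 is x+37. What is the value of x?",
    reasoning_steps=[
        "Write the equation (3x - 9) / 2 = x + 37.",
        "Multiply both sides by 2: 3x - 9 = 2x + 74.",
        "Subtract 2x and add 9: x = 83.",
    ],
    final_answer="83",
)
print(example["prompt"])
print(example["completion"])
```

During finetuning, the model is trained to produce the completion given the prompt, so it learns to emit the reasoning steps before the final answer.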

## 23.4 Reinforcement Learning-based Reasoning Models <a name='23.4-reinforcement-learning-based-reasoning-models'></a>

**RL-based reasoning models** use reward signals to evaluate the quality of their intermediate reasoning steps or final answers, allowing the model to improve and refine the reasoning process. This approach enables the model to explore different reasoning paths during finetuning and gradually select strategies that yield higher rewards.

An important advantage of RL-based approaches is that they encourage models to develop more efficient search procedures over possible solutions. As a result, RL-based reasoning models can outperform inference-based or supervised finetuning models on complex tasks requiring mathematical or logical reasoning.

RL-driven reasoning techniques are currently among the most powerful methods for enhancing the reasoning capabilities of modern LLMs. They help models not only produce correct answers but also adopt effective inference-time strategies, such as branching, self-consistency sampling, or reflective refinement.

It is also important to differentiate RL-based optimization for reasoning from Reinforcement Learning from Human Feedback (RLHF), which we studied in the lecture on LLMs. RLHF is used during the alignment phase of LLM development (also called preference tuning). In RLHF, rewards are obtained from human annotators who rank or compare model responses, and the objective is to guide the model toward generating safe outputs that are aligned with human preferences. In contrast, in reasoning models the rewards for RL are computed automatically, based on whether the model produces the correct final answer to a given question or correct intermediate reasoning steps.

### 23.4.1 Outcome and Process Reward Models <a name='23.4.1-outcome-and-process-reward-models'></a>

Applying RL to develop reasoning LLMs relies on **verifiable tasks** where the correctness of an answer or a reasoning step can be automatically checked. Typical examples include math problems, logic puzzles, programming tasks, and scientific questions where solutions can be validated. For instance, math problems can be checked by comparing the model's answer to the ground-truth solution, and programming tasks can be verified by running the generated code to see if it passes all test cases. The verification results serve as reward signals for training reasoning models with RL. Because verification in these domains can be done automatically, they are well suited for developing reasoning models, since verifiable tasks allow the model to explore many reasoning paths and receive reward signals for its performance.
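The two verification examples above can be sketched as simple reward functions. This is a minimal illustration under stated assumptions: the names `math_reward` and `code_reward` are hypothetical, the math check assumes purely numeric answers, and the code check receives an already-loaded Python callable, whereas a real pipeline would execute generated code in an isolated sandbox.

```python
def math_reward(predicted, ground_truth, tol=1e-6):
    """Return 1.0 if the predicted number matches the ground truth, else 0.0."""
    try:
        # Drop thousands separators and surrounding whitespace before parsing
        p = float(predicted.replace(",", "").strip())
        t = float(ground_truth.replace(",", "").strip())
    except (ValueError, AttributeError):
        # Non-numeric or missing answers receive no reward
        return 0.0
    return 1.0 if abs(p - t) <= tol else 0.0

def code_reward(func, test_cases):
    """Return the fraction of (args, expected) test cases that func passes."""
    if not test_cases:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A crashing solution simply fails that test case
            pass
    return passed / len(test_cases)

print(math_reward("83", "83.0"))   # matching answers
print(math_reward("22", "83"))     # mismatching answers
print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((2, 2), 5)]))
```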

Based on how rewards are assigned in RL-based reasoning models, reward models fall into two types: Outcome Reward Models (ORMs) and Process Reward Models (PRMs).

**Outcome Reward Models (ORMs)** assign rewards based only on the correctness of the final answer. Although the model may produce a full CoT, only the final output is checked to determine whether the reasoning path was successful. If the answer is correct, the entire path is rewarded; otherwise, it receives little or no reward. A limitation of ORMs is that they do not evaluate the quality of intermediate reasoning steps, so the model must independently discover strategies that lead to correct final answers.

**Process Reward Models (PRMs)** extend this idea by evaluating the intermediate steps leading to the final answer. That is, PRMs assign rewards to individual reasoning steps, e.g., partial derivations in a math solution or intermediate code sections in a programming task. This helps the model learn more structured reasoning and makes it easier to avoid hallucinations or logically inconsistent CoT. PRMs are especially useful for tasks that require multi-step reasoning, because they provide feedback on each step rather than only on the final outcome.

<img src="images/orm_prm.png" width="800">

*Figure: Outcome Reward Models vs Process Reward Models.* Source [1].
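The difference between the two reward types can be illustrated with a toy example. In the sketch below, the outcome reward is a single score for the final answer, while the process reward scores every step. The step checker is a deliberately simple rule (it only verifies arithmetic of the form `a op b = c`) and stands in for a learned PRM; all names are assumptions for illustration.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
NUM = r"(-?\d+(?:\.\d+)?)"
STEP_PATTERN = re.compile(rf"^\s*{NUM}\s*([+\-*/])\s*{NUM}\s*=\s*{NUM}\s*$")

def outcome_reward(final_answer, ground_truth):
    """ORM-style reward: one score based only on the final answer."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def process_rewards(steps):
    """PRM-style rewards: one score per step (toy arithmetic checker)."""
    rewards = []
    for step in steps:
        match = STEP_PATTERN.match(step)
        if match is None:
            rewards.append(0.0)  # step is not in the checkable format
            continue
        a, op, b, c = match.groups()
        if op == "/" and float(b) == 0:
            rewards.append(0.0)  # division by zero is never a valid step
            continue
        is_valid = abs(OPS[op](float(a), float(b)) - float(c)) < 1e-9
        rewards.append(1.0 if is_valid else 0.0)
    return rewards

# A solution with an error in the first step (ground truth: 72)
steps = ["48 / 2 = 26", "48 + 26 = 74"]
print(outcome_reward("74", "72"))   # only says that the path failed
print(process_rewards(steps))       # shows which step is wrong
```

Here the outcome reward is 0.0 for the whole path, while the per-step rewards `[0.0, 1.0]` localize the mistake to the first step.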

### 23.4.2 RL Algorithms for Reasoning Models <a name='23.4.2-rl-algorithms-for-reasoning-models'></a>

Various RL algorithms have been proposed that use reward signals from ORMs or PRMs to update LLM parameters and improve a model's ability to "think" through complex problems.

Among the first RL algorithms used for reasoning LLMs is PPO (Proximal Policy Optimization). This algorithm uses an actor model and a critic model: the actor proposes reasoning steps, and the critic estimates how promising they are, which guides the updates to the actor. However, PPO is computationally expensive, and its high memory requirements have driven the adoption of more efficient algorithms.

GRPO (Group Relative Policy Optimization) eliminates the need for a critic model and replaces it with a baseline computed from the rewards of a group of responses sampled for the same prompt. This frees up GPU resources and allows the model to explore long CoT sequences. Similar efficiency is achieved by RLOO (REINFORCE Leave-One-Out), which also learns from multiple reasoning paths without a separate critic model.

Other RL algorithms include bootstrapping frameworks like STaR (Self-Taught Reasoner) and IDPO (Iterative Direct Preference Optimization), in which the model iteratively learns from its own successes, teaching itself to reason better over time. Please note that the detailed mathematics of these algorithms is outside the scope of this course.
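Without going into the full algorithms, the critic-free baselines used by GRPO and RLOO can be illustrated with a few lines of arithmetic. The sketch below assumes one scalar reward per sampled response for the same prompt. The GRPO-style advantage is the reward standardized within the group, and the RLOO-style advantage subtracts the mean reward of the other responses. The function names are hypothetical, and the use of the population standard deviation with a small `eps` is an assumption; implementations differ in such details.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

def rloo_advantages(rewards):
    """Leave-one-out advantages: baseline is the mean of the other rewards."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two responses per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Hypothetical rewards for four responses sampled for the same prompt
group_rewards = [1.0, 0.0, 0.0, 1.0]
print([round(a, 3) for a in grpo_advantages(group_rewards)])
print([round(a, 3) for a in rloo_advantages(group_rewards)])
```

In both cases, responses that score above their group receive a positive advantage and responses below it receive a negative one, so no separate critic model is needed to provide the baseline.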

### 23.4.3 Finetuning a Reasoning Model using GRPO <a name='23.4.3-finetuning-a-reasoning-model-using-grpo'></a>

This section demonstrates using the GRPO (Group Relative Policy Optimization) algorithm to finetune an LLM for mathematical reasoning. We will use the GSM8K dataset, a curated collection of approximately 8,800 solved grade-school math problems (7,473 in the train split and 1,319 in the test split). GSM8K is commonly used as a benchmark dataset for reasoning models.

The objective is for the model to generate structured mathematical solutions and provide the intermediate reasoning steps for solving the problems.

#### Load Dataset

The next cell loads the train and test splits of the GSM8K mathematical reasoning dataset separately.

In [None]:
# Load train split of GSM8K
train_dataset = load_dataset("openai/gsm8k", "main", split="train")

# Load test split of GSM8K
test_dataset = load_dataset("openai/gsm8k", "main", split="test")

In [None]:
# Check the dataset
train_dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

In [None]:
# Check the first math problem
train_dataset[0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}

In [None]:
# Print the first question and answer
print("Question:", train_dataset[0]['question'])
print("Answer:", train_dataset[0]['answer'])

Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72


GSM8K answers use a `####` marker before the final numeric answer. The next function extracts that final answer to create clean targets.

In [None]:
# Dataset processing utilities
def extract_hash_answer(text):
    """Extract numerical answer from GSM8K format (#### marker)"""
    if "####" not in text:
        return None
    # GSM8K uses format: "Answer... #### 72"
    return text.split("####")[1].strip()

Next, we define a short structured output format for training, consisting of a reasoning section and a solution section.

In [None]:
# Define structured output format for mathematical reasoning
reasoning_start = "<REASONING>"
reasoning_end = "</REASONING>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

# System prompt that defines the desired reasoning structure
system_prompt = f"""You are a mathematical reasoning assistant.
When given a math problem:
1. Show your step-by-step work between {reasoning_start} and {reasoning_end}
2. Provide ONLY your final numerical answer between {solution_start} and {solution_end}
   - Example: <SOLUTION> 18 </SOLUTION>
3. Be precise and show all calculation steps clearly in the reasoning section."""
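To show how a response in this structured format can be checked and parsed, here is a minimal sketch based on a regular expression. The tag strings are repeated so that the snippet is self-contained; the pattern and the helper `parse_structured_response` are assumptions for illustration and are not the reward functions used later in the lecture.

```python
import re

reasoning_start, reasoning_end = "<REASONING>", "</REASONING>"
solution_start, solution_end = "<SOLUTION>", "</SOLUTION>"

# A reasoning section followed by a solution section
FORMAT_PATTERN = re.compile(
    rf"{re.escape(reasoning_start)}(.+?){re.escape(reasoning_end)}\s*"
    rf"{re.escape(solution_start)}(.+?){re.escape(solution_end)}",
    flags=re.DOTALL,
)

def parse_structured_response(text):
    """Return (reasoning, solution) if text follows the format, else None."""
    match = FORMAT_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

# Hypothetical model response that follows the requested structure
response = (
    "<REASONING>48 / 2 = 24 clips in May. 48 + 24 = 72 clips in total.</REASONING>\n"
    "<SOLUTION> 72 </SOLUTION>"
)
print(parse_structured_response(response))
print(parse_structured_response("The answer is 72."))
```

A response that omits the tags returns `None`, which makes it easy to tell formatted and unformatted outputs apart.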

The next cell defines a function that formats dataset examples as conversations for GRPO.

In [None]:
def process_dataset_example(example):
    """Convert GSM8K example to conversation format for GRPO training"""
    question = example["question"]
    answer = extract_hash_answer(example["answer"])

    # Create conversation with system prompt for structured reasoning
    prompt = [{"role": "system", "content": system_prompt},
        {"role": "user", "content": question},]

    return {"prompt": prompt, "answer": answer}

In [None]:
# Apply conversation formatting to all examples
train_dataset = train_dataset.map(process_dataset_example)

In [None]:
# Print the first example
train_dataset[0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': '72',
 'prompt': [{'content': 'You are a mathematical reasoning assistant.\nWhen given a math problem:\n1. Show your step-by-step work between <REASONING> and </REASONING>\n2. Provide ONLY your final numerical answer between <SOLUTION> and </SOLUTION>\n   - Example: <SOLUTION> 18 </SOLUTION>\n3. Be precise and show all calculation steps clearly in the reasoning section.',
   'role': 'system'},
  {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
   'role': 'user'}]}

#### Model Loading

For this task we will use `Llama-3.2-3B-Instruct` with 3B parameters.

In [None]:
# Select model
model_name = "unsloth/Llama-3.2-3B-Instruct"
# Token limit for mathematical problems
max_seq_length = 2048

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Auto-distribute across available GPUs/CPU
    device_map="auto",
    # Allow custom model code execution
    trust_remote_code=True,
    # Use FP16 to reduce memory
    torch_dtype=torch.float16,
    # Optimize memory usage during loading
    low_cpu_mem_usage=True,
)

# Load corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Ensure tokenizer has proper padding token for batch processing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

`torch_dtype` is deprecated! Use `dtype` instead!


We will apply LoRA (Low-Rank Adaptation), a parameter-efficient finetuning method, to update only approximately 0.3% of the model parameters.

In [None]:
# Configure LoRA for mathematical reasoning adaptation
lora_config = LoraConfig(
    # Rank: adaptation capacity
    r=16,
    # Scaling factor (typically 2x rank)
    lora_alpha=32,
    # Focus on attention for reasoning
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    # Regularization to prevent overfitting
    lora_dropout=0.1,
    # Skip bias adaptation for simplicity
    bias="none",
    # Causal language modeling task
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA configuration to create trainable adapter
model = get_peft_model(model, lora_config)

# Enable gradient checkpointing to reduce memory usage
model.gradient_checkpointing_enable()

# Display trainable vs total parameters
model.print_trainable_parameters()

trainable params: 9,175,040 || all params: 3,221,924,864 || trainable%: 0.2848
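
The reported number of trainable parameters can be checked by hand. Each adapted projection with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` LoRA weights. The sketch below assumes the published Llama-3.2-3B dimensions (hidden size 3072, 28 decoder layers, 8 key/value heads of size 128); these values are not stated in this lecture and should be confirmed against the model's `config.json`.

```python
# Sanity check of the LoRA parameter count.
# The model dimensions below are assumptions (published Llama-3.2-3B config).
r = 16                  # LoRA rank, as in lora_config
hidden_size = 3072      # assumed model width
kv_size = 8 * 128       # assumed: 8 key/value heads x head dimension 128
num_layers = 28         # assumed number of decoder layers

# (input size, output size) of each adapted attention projection
projections = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_size),
    "v_proj": (hidden_size, kv_size),
    "o_proj": (hidden_size, hidden_size),
}

# LoRA adds matrix A (r x d_in) and matrix B (d_out x r) per projection
per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections.values())
lora_params = per_layer * num_layers

all_params = 3_221_924_864  # total reported by print_trainable_parameters()
print(f"{lora_params:,} trainable ({100 * lora_params / all_params:.4f}%)")
```

Under these assumptions the count comes to 9,175,040, which agrees with the value printed by `print_trainable_parameters()`.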


#### RL Reward Definition

This cell defines the reward signal for GRPO. The reward uses a combination of 4 functions that evaluate different aspects of the generated responses:

- Match Format Exactly: Assigns high reward if the format of the response matches the required response pattern.
- Match Format Approximately: Assigns partial reward if the response matches individual components of the response pattern, even when the matching is not perfect.
- Check Answer Correctness: Assigns rewards for mathematical accuracy with graduated scoring (e.g., 3.0 reward for exact match, 1.5 reward for answer within 10%, etc.).
- Check Numbers Extraction: Assigns a reward (1.5) if the first number found after the opening solution tag equals the true answer, even when the response does not match the full format.

In [None]:
# The code in this cell is not required in preparation for quizzes

# Compiled regex patterns for efficient reward computation
match_format = re.compile(
    rf"^[\s]{{0,}}"                      # Optional whitespace at start
    rf"{reasoning_start}.+?{reasoning_end}.*?"  # Reasoning section (non-greedy)
    rf"{solution_start}(.+?){solution_end}"     # Solution section with capture group
    rf"[\s]{{0,}}$",                     # Optional whitespace at end
    flags=re.MULTILINE | re.DOTALL       # Multi-line matching with . matching newlines
)

match_numbers = re.compile(
    rf"{solution_start}.*?([\d\.]{{1,}})", # Extract numbers from solution section
    flags=re.MULTILINE | re.DOTALL        # Flexible pattern matching
)

# Matches from solution_start until the closing tag or, failing that, the end of a line ($ under re.MULTILINE)
match_lenient = re.compile(
    rf"{solution_start}\s*(.*?)(?:{solution_end}|$)",
    flags=re.MULTILINE | re.DOTALL
)

# Reward Function 1: Exact Format Compliance
def match_format_exactly(completions, **kwargs):
    """
    High reward (3.0) for perfect format adherence
    Ensures model learns the complete structured output pattern
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        # Check if response matches complete format pattern
        score = 3.0 if match_format.search(response) is not None else 0.0
        scores.append(score)
    return scores

# Reward Function 2: Partial Format Credit
def match_format_approximately(completions, **kwargs):
    """
    Graduated scoring for format elements
    Encourages learning individual components even if not perfect
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        score = 0

        # Award +0.5 for correct token count, -0.5 for wrong count
        score += 0.5 if response.count(reasoning_start) == 1 else -0.5
        score += 0.5 if response.count(reasoning_end) == 1 else -0.5
        score += 0.5 if response.count(solution_start) == 1 else -0.5
        score += 0.5 if response.count(solution_end) == 1 else -0.5

        scores.append(score)
    return scores

# Reward Function 3: Mathematical Accuracy
def check_answer_correctness(prompts, completions, answer, **kwargs):
    """
    Graduated scoring for mathematical accuracy:
    - 3.0: Exact match
    - 1.5: Within 10% (close answer)
    - 0.5: Within 20% (reasonable attempt)
    - -0.5: Wrong answer (penalty for incorrect math)
    """
    responses = [completion[0]["content"] for completion in completions]

    # Extract answers using format pattern
    extracted_responses = [
        guess.group(1) if (guess := match_format.search(r)) is not None else None
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:  # No extractable answer
            scores.append(0)
            continue

        # Exact string match gets full points
        if guess.strip() == true_answer.strip():
            scores.append(3.0)
        else:
            # Try numerical comparison for partial credit
            try:
                ratio = float(guess) / float(true_answer)
                if 0.9 <= ratio <= 1.1:      # Within 10%
                    scores.append(1.5)
                elif 0.8 <= ratio <= 1.2:    # Within 20%
                    scores.append(0.5)
                else:                         # Wrong answer
                    scores.append(-0.5)
            except (ValueError, ZeroDivisionError):
                scores.append(-0.5)           # Invalid numerical format

    return scores

# Reward Function 4: Number Extraction Ability
def check_numbers_extraction(prompts, completions, answer, **kwargs):
    """
    Tests the model's ability to extract numerical values from solution sections
    Complementary to exact format matching - focuses on parsing capability
    """
    responses = [completion[0]["content"] for completion in completions]

    # Extract numbers from solution sections using number pattern
    extracted_responses = [
        guess.group(1) if (guess := match_numbers.search(r)) is not None else None
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:  # No extractable number
            scores.append(0)
            continue

        try:
            # Simple numerical equality check
            true_val = float(true_answer.strip())
            guess_val = float(guess.strip())
            # Binary scoring: correct (1.5) or incorrect (0)
            scores.append(1.5 if guess_val == true_val else 0.0)
        except (ValueError, TypeError):
            scores.append(0)  # Invalid number format

    return scores
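
To make the scoring concrete, the four rewards can be combined for a single response. The following is a simplified, self-contained sketch rather than the functions above: `total_reward` is a hypothetical helper that works on a plain string instead of the list-of-messages format passed by the trainer, and the tag strings are taken from the system prompt shown earlier.

```python
import re

# Tag strings taken from the system prompt
reasoning_start, reasoning_end = "<REASONING>", "</REASONING>"
solution_start, solution_end = "<SOLUTION>", "</SOLUTION>"

match_format = re.compile(
    rf"^\s*{reasoning_start}.+?{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end}\s*$",
    flags=re.MULTILINE | re.DOTALL,
)
match_numbers = re.compile(rf"{solution_start}.*?([\d.]+)", flags=re.MULTILINE | re.DOTALL)

def total_reward(response, true_answer):
    """Hypothetical helper: sum of the four reward components for one response."""
    # 1. Exact format: 3.0 if the full pattern matches
    fmt = match_format.search(response)
    exact = 3.0 if fmt else 0.0
    # 2. Approximate format: +0.5 / -0.5 per tag that appears exactly once
    tags = (reasoning_start, reasoning_end, solution_start, solution_end)
    approx = sum(0.5 if response.count(t) == 1 else -0.5 for t in tags)
    # 3. Correctness: graduated score, only when the format matched
    correct = 0.0
    if fmt:
        guess = fmt.group(1).strip()
        if guess == true_answer.strip():
            correct = 3.0
        else:
            try:
                ratio = float(guess) / float(true_answer)
                correct = 1.5 if 0.9 <= ratio <= 1.1 else 0.5 if 0.8 <= ratio <= 1.2 else -0.5
            except (ValueError, ZeroDivisionError):
                correct = -0.5
    # 4. Number extraction: 1.5 if the first number after the tag is correct
    num = match_numbers.search(response)
    extracted = 0.0
    if num:
        try:
            extracted = 1.5 if float(num.group(1)) == float(true_answer) else 0.0
        except ValueError:
            extracted = 0.0
    return exact + approx + correct + extracted

perfect = "<REASONING>48 + 48 / 2 = 72</REASONING>\n<SOLUTION> 72 </SOLUTION>"
close = "<REASONING>48 + 22 = 70</REASONING>\n<SOLUTION>70</SOLUTION>"
unformatted = "The answer is 72."

for name, text in [("perfect", perfect), ("close", close), ("unformatted", unformatted)]:
    print(name, total_reward(text, "72"))
```

A well-formatted, correct response collects the maximum of 9.5 (3.0 + 2.0 + 3.0 + 1.5), a well-formatted response within 10% of the answer gets 6.5, and a response without any tags is penalized with -2.0 even though it contains the right number.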

#### Configure GRPO Training Arguments

GRPO training arguments are defined next. The GRPO trainer in Hugging Face TRL has a similar interface to the SFT trainer, except that it takes a list of reward functions (the `reward_funcs` argument) that are used for updating the model. To output the values of the rewards, we added a simple callback class, `RewardLoggingCallback`.

In [None]:
# Configure GRPO training parameters
training_args = GRPOConfig(
    # Learning rate
    learning_rate=5e-6,
    # Batch configuration
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    # Sequence length limits for mathematical problems
    max_prompt_length=1024,
    max_completion_length=1024,
    # Training duration
    # num_train_epochs=1,
    max_steps=100,
    # Log every 10 steps
    logging_steps=10,
    # Enable FP16 training for memory efficiency
    fp16=True,
    bf16=False,
    # Output configuration
    output_dir="./trl_grpo_outputs",
    # Gradient clipping for stable training
    max_grad_norm=0.1,
    # Disable external logging
    report_to=[],
    # Generation parameters for better reward signal
    num_generations=4,
    # Higher temp for more diverse outputs
    temperature=0.8,
    # KL divergence penalty for GRPO
    beta=0.01,
)
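
To see how much data each optimizer step consumes under this configuration, the batch settings can be multiplied out. The sketch below assumes a single training process (`num_processes`, a name introduced here for illustration) and assumes that the batch size counts individual completions, as in recent TRL releases; older versions may count prompts instead.

```python
# Batch arithmetic for the GRPO configuration above (a sketch under the stated assumptions)
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_generations = 4
max_steps = 100
num_processes = 1  # assumption: one training process

# Completions contributing to a single optimizer step
completions_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_processes

# Each prompt is answered num_generations times, so this many distinct prompts per step
prompts_per_step = completions_per_step // num_generations

# Total completions scored over the whole run
total_completions = completions_per_step * max_steps

# Throughput implied by the train_runtime reported after training (in seconds)
train_runtime = 4973.4869
samples_per_second = round(total_completions / train_runtime, 3)

print(completions_per_step, prompts_per_step, total_completions, samples_per_second)
```

The implied throughput of about 0.322 samples per second is consistent with the `train_samples_per_second` value in the training output, which suggests that 100 steps cover roughly 1,600 completions, i.e., only a small fraction of the 7,473 training problems.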

In [None]:
# Custom callback to print the reward values at each logging step
from transformers import TrainerCallback

class RewardLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and any('reward' in k for k in logs.keys()):
            reward_logs = {k: v for k, v in logs.items() if 'reward' in k}
            print(f"Step {state.global_step}: {reward_logs}")

trainer = GRPOTrainer(
    model=model,
    # Four complementary reward functions
    reward_funcs=[
        match_format_exactly,         # Structure compliance
        match_format_approximately,   # Partial format credit
        check_answer_correctness,     # Mathematical accuracy
        check_numbers_extraction,     # Number parsing ability
    ],
    # Training configuration
    args=training_args,
    # Processed GSM8K dataset
    train_dataset=train_dataset,
    # Tokenizer (processing_class in TRL)
    processing_class=tokenizer,
    # Log RL rewards
    callbacks=[RewardLoggingCallback()]
)

The model is already on multiple devices. Skipping the move to device specified in `args`.


Note that the GRPO algorithm uses a loss function that is different from the cross-entropy loss commonly used in ANN classification and in supervised training or finetuning. The GRPO loss can take negative values, and this is expected rather than a sign of a problem. Interpreting the GRPO loss is also more difficult than in supervised learning, because it is calculated as a sum of several terms (an advantage-weighted policy term and a KL penalty). This creates compound behavior, where one term may be improving the model while another is having a negative impact. Hence, the final loss number can go up or down while the model is actually improving.

A more relevant metric to monitor in GRPO is the reward, where increasing values indicate improving behavior. Recall that we defined the overall reward as the sum of four reward functions, so the maximum possible value is 9.5 (3.0 + 2.0 + 3.0 + 1.5). In the output of the training cell below, the combined reward is logged under the `reward` key. During training, the reward increased from 3.72 at step 10 to 8.20 at step 100.
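
One reason the loss value is hard to read is that GRPO does not use the raw rewards directly: the rewards of the completions generated for the same prompt are normalized into group-relative advantages. The following is a minimal sketch of that normalization, assuming the population standard deviation and a small `eps` for numerical stability; `group_relative_advantages` is a hypothetical helper, and library implementations may differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group (completions for the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions better than the group average get positive advantages
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt (num_generations=4) with different rewards
mixed = group_relative_advantages([9.5, 6.5, 9.5, -2.0])

# Four completions that all received the same reward
uniform = group_relative_advantages([9.5, 9.5, 9.5, 9.5])

print([round(a, 2) for a in mixed])
print(uniform)
```

The advantages within a group sum to approximately zero regardless of how high the rewards are, so the loss carries little information about absolute quality. When all completions in a group receive the same reward, every advantage is zero and the group provides no learning signal, which is what the `frac_reward_zero_std` entry in the training log tracks.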


In [None]:
trainer.train()

`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072, 'top_p': 0.9}. If this is not desired, please set these values explicitly.


| Step | Training Loss |
|------|---------------|
| 10   | 0.0381        |
| 20   | -0.0025       |
| 30   | -0.0089       |
| 40   | 0.0044        |
| 50   | 0.0184        |
| 60   | 0.0069        |
| 70   | 0.0073        |
| 80   | 0.0034        |
| 90   | 0.0268        |
| 100  | 0.0152        |


Step 10: {'rewards/match_format_exactly/mean': 1.275, 'rewards/match_format_exactly/std': 1.4583582043647767, 'rewards/match_format_approximately/mean': 0.73125, 'rewards/match_format_approximately/std': 1.263677716255188, 'rewards/check_answer_correctness/mean': 1.05, 'rewards/check_answer_correctness/std': 1.4218261599540711, 'rewards/check_numbers_extraction/mean': 0.665625, 'rewards/check_numbers_extraction/std': 0.7159868717193604, 'reward': 3.721875, 'reward_std': 3.1834747076034544, 'frac_reward_zero_std': 0.075}
Step 20: {'rewards/match_format_exactly/mean': 2.45625, 'rewards/match_format_exactly/std': 1.1661575436592102, 'rewards/match_format_approximately/mean': 1.675, 'rewards/match_format_approximately/std': 0.7206614732742309, 'rewards/check_answer_correctness/mean': 2.121875, 'rewards/check_answer_correctness/std': 1.3605764508247375, 'rewards/check_numbers_extraction/mean': 1.18125, 'rewards/check_numbers_extraction/std': 0.5642989039421081, 'reward': 7.434375, 'reward_s

TrainOutput(global_step=100, training_loss=0.010910292826592923, metrics={'train_runtime': 4973.4869, 'train_samples_per_second': 0.322, 'train_steps_per_second': 0.02, 'total_flos': 0.0, 'train_loss': 0.010910292826592923})

#### Evaluate on Test Set Problems

The model is next evaluated on a small set of 5 problems from the test dataset. The code in the cell first formats the prompt for the model, next the text is tokenized and a response is generated, and finally the numerical answer is extracted from the model's response.

In this case, the model correctly answered 4 of the 5 questions, although a sample this small gives only a rough indication of accuracy. Note that premier LLMs have achieved over 97% accuracy on the GSM8K dataset. Smaller models similar to the `Llama-3.2-3B` model that we finetuned typically reach around 50-60% accuracy.

In [None]:
# The code in this cell is not required in preparation for quizzes

# Select 5 problems from test set
test_indices = [1, 3, 5, 7, 9]

model.eval()
correct_count = 0

# Generate responses
for idx, test_idx in enumerate(test_indices, 1):
    example = test_dataset[test_idx]
    question = example["question"]
    true_answer = extract_hash_answer(example["answer"])

    test_messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]

    test_input = tokenizer.apply_chat_template(
        test_messages,
        tokenize=False,
        add_generation_prompt=True
    )

    with torch.no_grad():
        inputs = tokenizer(test_input, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            # Slightly lower temperature than in training (0.8) for more focused answers
            temperature=0.7,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.pad_token_id
        )
        response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

    # Extract the model's answer
    solution_text = None  # Initialize here
    format_match = match_format.search(response)

    if format_match:
        solution_text = format_match.group(1).strip()
    # If strict failed, just grab what is inside/after the solution tag
    else:
        lenient_match = match_lenient.search(response)
        if lenient_match:
            solution_text = lenient_match.group(1).strip()

    # Extract the final number
    if solution_text:
        numbers = re.findall(r'-?[\d,]+\.?\d*', solution_text)
        model_answer = numbers[-1] if numbers else "NO ANSWER FOUND"
    else:
        model_answer = "NO ANSWER FOUND"

    # Check if correct
    try:
        # Remove symbols like $ and , for numerical comparison
        clean_model = model_answer.replace('$', '').replace(',', '')
        clean_true = true_answer.replace('$', '').replace(',', '')
        is_correct = float(clean_model) == float(clean_true)
    except ValueError:
        is_correct = model_answer.strip() == true_answer.strip()

    if is_correct:
        correct_count += 1

    print(f"Test Problem {idx}/5")
    print(f"Question: {question}")
    print(f"\nTrue Answer: {true_answer}")
    print(f"\nModel Response:")
    print(response)
    print(f"\nExtracted Answer: {model_answer}")
    print(f"{'Correct' if is_correct else 'Incorrect'}")
    print(f"{'='*80}")

print(f"Results: {correct_count}/5 correct ({correct_count/5*100:.0f}%)")
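
The answer-extraction fallback used in the evaluation loop can be tried in isolation on a few handcrafted responses. The sketch below is a simplified variant, not the exact logic above: `extract_number` is a hypothetical helper, the strict pattern checks only the solution tags (not the reasoning section), and the lenient pattern takes everything after the opening tag.

```python
import re

solution_start, solution_end = "<SOLUTION>", "</SOLUTION>"

# Strict: both solution tags present; lenient: only the opening tag
strict = re.compile(rf"{solution_start}(.+?){solution_end}", flags=re.DOTALL)
lenient = re.compile(rf"{solution_start}\s*(.*)", flags=re.DOTALL)

def extract_number(response):
    """Hypothetical helper: last number in the solution section as a float, or None."""
    m = strict.search(response) or lenient.search(response)
    if m is None:
        return None
    # Same number pattern as in the evaluation loop (allows thousands separators)
    numbers = re.findall(r"-?[\d,]+\.?\d*", m.group(1))
    if not numbers:
        return None
    try:
        return float(numbers[-1].replace(",", ""))
    except ValueError:
        return None

complete = "<REASONING>12 * 100 = 1200</REASONING><SOLUTION> $1,200 </SOLUTION>"
cut_off = "<REASONING>16 + 2 = 18</REASONING><SOLUTION> 18"
no_tags = "The answer is 18."

print(extract_number(complete), extract_number(cut_off), extract_number(no_tags))
```

The lenient fallback matters when generation stops at `max_new_tokens` before the closing tag is produced: the answer can still be recovered as long as the opening solution tag was emitted.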

Test Problem 1/5
Question: A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?

True Answer: 3

Model Response:
<REASONING>
To find the total number of bolts, we need to add the number of bolts of blue fiber and the number of bolts of white fiber. The problem states that it takes twice as much white fiber as blue fiber.

Let's denote the number of bolts of blue fiber as B and the number of bolts of white fiber as W. We know that W = 2B (since it takes half as much white fiber).

We can now write an equation: B + W = total bolts

Substituting W with 2B, we get:
B + 2B = total bolts
Combine like terms:
3B = total bolts

Since we don't have a specific value for B, we can express the total number of bolts in terms of B:
total bolts = 3B

However, since the question asks for the total number of bolts and not just "three times" the number of blue bolts, we can provide a more direct answer by stating that the total number of bolts is three

## 23.5 Reasoning Models Limitations <a name='23.5-reasoning-models-limitations'></a>

Despite the advantages, reasoning LLMs have several important limitations to consider.

- Overthinking: reasoning models can generate many more tokens than standard models to reach the same answer. In some settings they loop through unnecessarily long CoT instead of using external tools or stopping early, which makes them less efficient and sometimes less reliable.
- Limits of longer reasoning: more compute at test time does not always help. In some cases, longer CoT sequences actually hurt accuracy, where the model amplifies its own mistakes instead of fixing them. Some studies showed that reasoning models can fail to generalize on planning or low-complexity tasks where standard non-reasoning models do better.
- Degradation in non-reasoning domains: although finetuning a model for reasoning can improve math or code abilities, sometimes this can reduce performance on general instruction-following or other simpler tasks.
- Increased cost and latency: reasoning requires extra tokens, which increases compute cost, latency, and uses up the context window. For some tasks this extra cost is justified, while for others it is wasteful. These trade-offs are important when choosing or deploying models.

Reasoning models are most useful when a task truly requires multi-step thinking, such as solving math and logic problems, writing complex code, planning, or handling tasks where the answer must be derived rather than recalled. In these cases, the model benefits from exploring intermediate steps and evaluating different reasoning paths. However, reasoning is not always necessary or desirable. For simpler tasks like retrieving factual information, summarization, or translation, standard LLMs are usually faster, cheaper, and often more accurate because they avoid unnecessary chains of thought. For instance, if the user asks "What is the capital of Italy?" there is no need to use a reasoning model to produce the answer. A good practical rule is to use the right tool for the task: rely on reasoning models when genuine multi-step thinking is required, and use standard models when the task is straightforward.

## References <a name='references'></a>

1. A Visual Guide to Reasoning LLMs, by Maarten Grootendorst, available at [https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms).
2. Build A Reasoning Model (From Scratch), by Sebastian Raschka, Manning books, 2025, source code available at [https://github.com/rasbt/reasoning-from-scratch](https://github.com/rasbt/reasoning-from-scratch).
3. Demystifying Reasoning Models, by Cameron R. Wolfe, available at [https://cameronrwolfe.substack.com/p/demystifying-reasoning-models](https://cameronrwolfe.substack.com/p/demystifying-reasoning-models).
4. Advanced GRPO Fine-tuning for Mathematical Reasoning with Multi-Reward Training, by Behrooz Azarkhalili, available at [https://huggingface.co/learn/cookbook/en/trl_grpo_reasoning_advanced_reward](https://huggingface.co/learn/cookbook/en/trl_grpo_reasoning_advanced_reward).
5. Unsloth Notebooks: Llama3_2_(3B)_GRPO_LoRA.ipynb, available at [https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Advanced_Llama3_2_(3B)_GRPO_LoRA.ipynb](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Advanced_Llama3_2_(3B)_GRPO_LoRA.ipynb).
6. IBM: What is a reasoning model?, by Dave Bergmann, available at [https://www.ibm.com/think/topics/reasoning-model](https://www.ibm.com/think/topics/reasoning-model).

[BACK TO TOP](#top)