# Lecture 23 - Reasoning Models

[![View notebook on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-Reasoning_Models/Lecture_23-Reasoning_Models.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avakanski/Fall-2025-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-Reasoning_Models/Lecture_23-Reasoning_Models.ipynb)

<a id='top'></a>

- [23.1 Introduction to Reasoning Models](#23.1-introduction-to-reasoning-models)
  - [23.1.1 Test-time Compute](#23.1.1-test-time-compute)
  - [23.1.2 Main Categories of Reasoning Models](#23.1.2-main-categories-of-reasoning-models)
- [23.2 Inference-based Reasoning Models](#23.2-inference-based-reasoning-models)
  - [23.2.1 Chain-of-Thought Prompting](#23.2.1.1-chain-of-thought-prompting)
  - [23.2.2 Self-Consistency](#23.2.2-self-consistency)
- [23.3 Supervised Finetuning Reasoning Models](#23.3-supervised-finetuning-reasoning-models)
- [23.4 Reinforcement Learning-based Reasoning Models](#23.4-reinforcement-learning-based-reasoning-models)
- [23.5 Reasoning Models Limitations](#23.5-reasoning-models-limitations)
- [References](#references)

Under construction


## 23.1 Introduction to Reasoning Models <a name='23.1-introduction-to-reasoning-models'></a>

Compared to regular LLMs that immediately produce the answer to a given question, **reasoning LLMs** break down a problem into smaller steps called **reasoning steps** or **thought processes** before answering.

<img src="images/regular_vs_reasoning_llms.png" width="400">

*Figure: Regular vs Reasoning LLMs.* Source [1].


Instead of directly outputting the final answer, a reasoning LLM first produces a series of intermediate steps that explain how the model arrived at the conclusion. These steps make the internal processing more transparent and easier to follow. Reasoning steps are often referred to as **chain-of-thought (CoT)**.

<img src="images/reasoning_steps.png" width="500">

*Figure: Reasoning steps to produce the answer.* Source [1].

LLM developers describe the reasoning process as the model spending more time “thinking” through the problem before responding. However, it is important to note that reasoning LLMs do not actually think or reason in the human sense. LLMs generate responses autoregressively, one token at a time, based on statistical patterns learned from training data. The same applies to reasoning steps: they are also generated token-by-token from learned patterns. Consequently, there is no guarantee that LLM-generated reasoning steps are logical or correct.

In general, producing intermediate steps improves LLM performance, especially on more challenging problems. This caused an important shift among LLM developers to focus on reasoning models. Today, most premier LLMs are equipped with reasoning abilities, and when answering users' questions they commonly display that they are “thinking” through the problem.

### 23.1.1 Test-time Compute <a name='23.1.1-test-time-compute'></a>


Until 2024, developers typically improved LLM performance during the pre-training and fine-tuning steps by collecting larger datasets (increasing the number of tokens), designing larger models (increasing the number of parameters), and using parallel computing across many GPUs (increasing the number of FLOPs). This approach is referred to as **train-time compute**, since increases across the three dimensions (dataset size, model size, and computing power) occur during model training.

In this context, the term *"compute"* refers to the computational resources required to train or run a model. Compute is typically expressed in floating-point operations (FLOPs), which measure the number of mathematical operations performed (e.g., multiplications, additions, etc.).

The relationship between compute scale and model performance is described by **scaling laws**. Scaling-law diagrams are usually shown on a log-log scale to illustrate how model quality improves with increased compute. An example is presented in the figure below.

Well-known scaling laws include the Kaplan and Chinchilla laws, which imply that model performance increases with more data tokens, parameters, and compute FLOPs. The laws suggest that all three factors must be scaled simultaneously for optimal performance.
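
A rough back-of-the-envelope sketch of how these three quantities relate is given below. It assumes the commonly used approximation that training a dense transformer costs about $6 \cdot N \cdot D$ FLOPs for $N$ parameters and $D$ training tokens, together with the rule of thumb associated with the Chinchilla study of roughly 20 training tokens per parameter. Both constants are approximations, and the function names are illustrative.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * num_params * num_tokens


def chinchilla_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb number of training tokens for a given model size."""
    return tokens_per_param * num_params


# Illustrative model sizes: 1B, 10B, and 100B parameters
for n in (1e9, 10e9, 100e9):
    d = chinchilla_tokens(n)
    print(f"{n:.0e} params -> {d:.0e} tokens, ~{training_flops(n, d):.1e} FLOPs")
```

Under these assumptions, a tenfold increase in model size implies a tenfold increase in data and therefore about a hundredfold increase in training compute.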

In 2024, OpenAI introduced a **test-time compute** scaling law and demonstrated that increasing computation during inference can boost model performance similarly to increasing train-time compute. Test-time compute is also referred to as **inference-time compute**.

<img src="images/scaling_laws.jpg" width="600">

*Figure: LLM scaling laws.* Source [1].


Test-time compute scaling caused a paradigm shift in LLM development. Instead of focusing primarily on train-time scaling through pre-training and finetuning, recent LLMs use more compute during inference to achieve improved reasoning and enhanced performance.

The figure below illustrates the difference in test-time compute between a non-reasoning LLM (which uses one token during inference) and reasoning models that use 6 and 15 tokens, respectively, to generate answers to the same question. By using more compute during inference, a reasoning LLM derives the answer through step-by-step "thinking".

<img src="images/tokens_compute.png.jpg" width="800">

*Figure: Comparison of used tokens for generating an answer*. Source [1].
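
To make the link between generated tokens and compute concrete, the sketch below uses the common approximation that a forward pass of a dense transformer costs about 2 FLOPs per parameter for every generated token (the cost of attention over the context is ignored). The model size of 0.6 billion parameters is an illustrative assumption and is not a value taken from the figure.

```python
def inference_flops(num_params: float, num_generated_tokens: int) -> float:
    """Approximate decoding compute: about 2 FLOPs per parameter per generated token."""
    return 2.0 * num_params * num_generated_tokens


num_params = 0.6e9  # assumed model size, for illustration only
baseline = inference_flops(num_params, 1)

# Token counts of 1, 6, and 15 mirror the three cases in the figure above
for tokens in (1, 6, 15):
    flops = inference_flops(num_params, tokens)
    print(f"{tokens:>2} tokens: ~{flops:.1e} FLOPs ({flops / baseline:.0f}x the one-token answer)")
```

Under this approximation, test-time compute grows linearly with the number of generated tokens, so a 15-token response costs roughly 15 times as much as a one-token response.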






### 23.1.2 Main Categories of Reasoning Models <a name='23.1.2-main-categories-of-reasoning-models'></a>



Numerous approaches have recently been introduced for enhancing LLM reasoning abilities. These methods can be classified into four main categories:

1. **Inference-based reasoning methods**

These methods improve reasoning at inference time without retraining the model. They focus on prompting and decoding strategies. Examples include:

- *Chain-of-thought prompting* - prompting the model to produce step-by-step explanations.
- *Self-consistency* - sampling multiple reasoning paths and selecting the most frequent answer.
- *Tree-of-thought* - exploring multiple reasoning branches using a search algorithm.
- *Reflection decoding* - prompting the model to reflect on mistakes and iteratively revise its solution.

In all of these methods, model weights are fixed, and no architectural changes are made. Reasoning emerges entirely from prompting and sampling strategies.

2. **Supervised finetuning reasoning models**

In these methods, the model is finetuned on datasets containing reasoning steps. Unlike inference-based methods, which do not change model weights, this category of methods updates model weights during finetuning.

Reasoning datasets typically contain questions and corresponding chain-of-thought reasoning steps and final answers. The datasets are often created based on solved problems from math or logic domains. The model learns reasoning patterns directly from supervised data.

A representative approach is *distillation-based reasoning*, where reasoning abilities are transferred from a large, powerful LLM to a smaller and more efficient LLM.

3. **Reinforcement learning-based reasoning models**

These models employ Reinforcement Learning (RL) algorithms that encourage the model to select reasoning steps that maximize reward signals. Rewards can differ across tasks, e.g., they can be defined based on producing correct answers to math problems, the quality of intermediate reasoning steps, etc.

This approach allows the model to learn reasoning strategies through trial and error, based on whether the selected steps increase the reward value. Like the supervised finetuning category of reasoning models, RL-based methods also apply additional finetuning and update the model weights.

4. **Modular or hybrid reasoning models**

These models enhance reasoning by using external modules or tools. For instance, LLMs can use calculators or search engines to answer specific questions, rely on retrieval-augmented reasoning techniques, employ mixture-of-experts architectures with reasoning-specialized experts, or use other domain-specific tools.

## 23.2 Inference-based Reasoning Models <a name='23.2-inference-based-reasoning-models'></a>

Inference-based reasoning models, also called prompt-driven reasoning models, induce reasoning purely through prompting and decoding strategies (CoT, ToT, SoT, etc.), without changing the model weights.

Chain-of-thought prompting asks the model to write out the intermediate steps that lead to a final answer. This helps in two practical ways. First, walking through the steps gives the model more opportunities to correct itself. Second, step-by-step reasoning matches how many training examples are written. For instance, large math and logic datasets often contain detailed solutions, so asking for a chain of thought aligns the model with patterns it has already learned.

At the same time, a chain of thought is not a guarantee of correctness. The model can still produce wrong reasoning, and for very simple problems it may even introduce unnecessary steps that lead to more mistakes. In other words, chain-of-thought prompting can improve accuracy on many reasoning tasks, but it is not universally beneficial.

Overall, chain-of-thought answering does not provide the model with new knowledge, but it changes how the model uses its existing knowledge. Often, this shift leads to more reliable answers, especially for math, code, logic, and other multi-step problems.

#### Load Required Packages

In [None]:
# Install packages
!pip install -qq transformers datasets trl peft


In [None]:
# Import libraries and modules
import torch
# Causal Language Model and Tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
# Parameter-efficient fine-tuning
from peft import LoraConfig, get_peft_model, TaskType
# Dataset handling
from datasets import load_dataset
# GRPO training components
from trl import GRPOConfig, GRPOTrainer
# Regex patterns for reward functions
import re
# Counter for counting number of occurrences
from collections import Counter

#### Load Model and Tokenizer

In [None]:
# Non-reasoning model
non_reasoning_model_name = "Qwen/Qwen3-0.6B-Base"

# Load the non-reasoning model
model = AutoModelForCausalLM.from_pretrained(
    non_reasoning_model_name,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Load corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    non_reasoning_model_name,
    trust_remote_code=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model.eval()


Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layer

In [None]:
prompt = (r"Half the value of $3x-9$ is $x+37$. What is the value of $x$?")

In [None]:
prompt_fmt = ("You are a helpful math assistant.\n"
        "Answer the question and write the final result on a new line as:\n"
        "\\boxed{ANSWER}\n\n"
        f"Question:\n{prompt}\n\nAnswer:")

In [None]:
print(prompt_fmt)

You are a helpful math assistant.
Answer the question and write the final result on a new line as:
\boxed{ANSWER}

Question:
Half the value of $3x-9$ is $x+37$. What is the value of $x$?

Answer:


In [None]:
input_tokens = tokenizer(prompt_fmt, return_tensors="pt").to(model.device)

output_tokens = model.generate(
        **input_tokens,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output_text = tokenizer.decode(output_tokens.squeeze(0).tolist())
print(output_text)

You are a helpful math assistant.
Answer the question and write the final result on a new line as:
\boxed{ANSWER}

Question:
Half the value of $3x-9$ is $x+37$. What is the value of $x$?

Answer: \boxed{2}<|endoftext|>
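
The boxed answer above can be checked independently. Multiplying both sides of $\frac{1}{2}(3x-9) = x+37$ by 2 gives $3x-9 = 2x+74$, so $x = 83$. The short sketch below substitutes candidate values into the equation using exact fractions; the helper name `satisfies_equation` is illustrative.

```python
from fractions import Fraction


def satisfies_equation(x) -> bool:
    """Check whether half of (3x - 9) equals x + 37, using exact arithmetic."""
    x = Fraction(x)
    return (3 * x - 9) / 2 == x + 37


print(satisfies_equation(2))   # the value boxed in the sampled output above
print(satisfies_equation(83))  # the value obtained by solving the equation by hand
```

The value 2 does not satisfy the equation, so the direct answer in this sampled output is incorrect. Because sampling is enabled (`do_sample=True`), a different run may produce a different answer.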


### 23.2.1 Chain-of-Thought Prompting <a name='23.2.1.1-chain-of-thought-prompting'></a>

In [None]:
prompt_chainofthought = prompt + " \n\nExplain your response step by step."

In [None]:
input_tokens = tokenizer(prompt_chainofthought, return_tensors="pt").to(model.device)

output_tokens = model.generate(
        **input_tokens,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output_text = tokenizer.decode(output_tokens.squeeze(0).tolist())
print(output_text)

Half the value of $3x-9$ is $x+37$. What is the value of $x$? 

Explain your response step by step. Let's solve the problem step by step.

**Given:**
- Half the value of \( 3x - 9 \) is \( x + 37 \).

**Step 1: Translate the statement into an equation.**

"Half the value" means \( \frac{1}{2} \) of the expression \( 3x - 9 \) equals \( x + 37 \).

\[
\frac{1}{2} \times (3x - 9) = x + 37
\]

**Step 2: Eliminate the fraction by multiplying both sides by 2.**

\[
2 \times \frac{1}{2} \times (3x - 9) = 2 \times (x + 37)
\]

\[
3x - 9 = 2x + 74
\]

**Step 3: Subtract \( 2x \) from both sides to get the \( x \)-terms on one side.**

\[
3x - 2x - 9 = 74
\]

\[
x - 9 = 74
\]

**Step 4: Add 9 to both sides to solve for \( x \).**

\[
x = 74 + 9
\]

\[
x = 83
\]

**Final Answer:**

\[
\boxed{83}
\]<|endoftext|>

In the output above, the model returned the final answer in an answer box (written as `\boxed{83}` in raw text), even though the chain-of-thought prompt did not specifically ask for this format. The reason the model answered in this specific format is likely that it has seen similarly formatted examples from benchmark datasets (including MATH-500) during pretraining. As a general rule, it is fair to assume that any information available on the internet when a model was trained has been part of the training data.

Although it was not necessary here, when we evaluate the model on the MATH-500 dataset later on, we will add a specific prompt that instructs the model to return answers in this boxed form, as it is a common convention that makes the evaluation more consistent across different models and makes data extraction easier.

MATH-500 is a curated collection of 500 problems that is widely used as a benchmark dataset for reasoning models.
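
To illustrate why the boxed convention makes data extraction easier, the sketch below pulls the content of the last `\boxed{...}` expression out of a generated text. It matches braces by counting, so that nested LaTeX such as a fraction is returned whole. The function name `extract_last_boxed` is an illustrative assumption and is not part of any library used in this lecture.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    r"""Return the content of the last \boxed{...} in text, or None if absent."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1          # number of currently open braces
    content = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
        i += 1
    return None  # unbalanced braces


print(extract_last_boxed(r"**Final Answer:** \[ \boxed{83} \]"))
print(extract_last_boxed(r"so the result is \boxed{\dfrac{14}{3}}"))  # nested braces
```

A simple regular expression such as `\\boxed\{(.*?)\}` would stop at the first closing brace and truncate the nested example, which is why the sketch counts braces instead.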

### 23.2.2 Self-Consistency <a name='23.2.2-self-consistency'></a>





<img src="images/self_consistency.png" width="600">

*Figure: Self-consistency.* Source [1].

In [None]:
# Generate multiple reasoning paths
num_paths = 5

# List to save all answers
all_answers = []

for path_id in range(num_paths):
    input_tokens = tokenizer(prompt_chainofthought, return_tensors="pt").to(model.device)
    output_tokens = model.generate(
        **input_tokens,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    output_text = tokenizer.decode(output_tokens[0][input_tokens['input_ids'].shape[1]:], skip_special_tokens=True)
    # output_text = tokenizer.decode(output_tokens.squeeze(0).tolist())

    # Extract numeric answer for this path
    numbers = re.findall(r'\d+\.?\d*', output_text)
    extracted_answer = numbers[-1] if numbers else None

    # Append the answer for the current path
    all_answers.append(extracted_answer)

    print(f"\nPath {path_id + 1}:")
    print(f"Extracted Answer: {extracted_answer}")


Path 1:
Extracted Answer: 83

Path 2:
Extracted Answer: 2

Path 3:
Extracted Answer: 83

Path 4:
Extracted Answer: 83

Path 5:
Extracted Answer: 83


In [None]:
# Find the most common answer
answer_counts = Counter(all_answers)
most_common_answer, _ = answer_counts.most_common(1)[0]

print(f"\n Self-Consistency Answer: {most_common_answer}")


 Self-Consistency Answer: 83
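
One practical detail in the voting step is that the regular expression `\d+\.?\d*` can return the same value in different textual forms (for example `83`, `83.` when the number ends a sentence, or `83.0`), and `Counter` treats these strings as different answers. The sketch below shows one possible way to normalize the extracted strings before voting. The helper names `normalize_answer` and `majority_vote` are illustrative and are not part of the notebook.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def normalize_answer(raw: Optional[str]) -> Optional[str]:
    """Map numeric strings such as '83', '83.' and '83.0' to one canonical form."""
    if raw is None:
        return None
    try:
        value = float(raw)
    except ValueError:
        return None
    return str(int(value)) if value.is_integer() else str(value)


def majority_vote(answers: Iterable[Optional[str]]) -> Optional[Tuple[str, int]]:
    """Return the most frequent normalized answer together with its vote count."""
    normalized = [a for a in map(normalize_answer, answers) if a is not None]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0]


# Hypothetical extracted answers from five reasoning paths
print(majority_vote(["83", "2", "83.", "83.0", None]))
```

Paths for which no number could be extracted are dropped before counting, so they cannot win the vote.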


In [None]:
# Delete non-reasoning model and tokenizer from GPU memory
del model
del tokenizer
torch.cuda.empty_cache()

## 23.3 Supervised Finetuning Reasoning Models <a name='23.3-supervised-finetuning-reasoning-models'></a>

## 23.4 Reinforcement Learning-based Reasoning Models <a name='23.4-reinforcement-learning-based-reasoning-models'></a>

Under construction

This section demonstrates the application of the GRPO (Group Relative Policy Optimization) RL algorithm for finetuning an LLM for mathematical reasoning. Toward this goal, we will use the GSM8K dataset, which contains approximately 8,800 solved math problems (7,473 in the train split and 1,319 in the test split). The model will learn to generate structured mathematical solutions and provide the intermediate reasoning steps for solving the problems.


RL-based reasoning models can be further distinguished by the type of reward signal:

- *Outcome-rewarded RL reasoning models* - optimized by RL for final accuracy only (e.g., "reward = correct answer").
- *Process-rewarded RL reasoning models* - optimized by RL for the correctness of intermediate steps (e.g., "reward = correctness of reasoning path or code execution").

In the context of developing reasoning models, it is important to distinguish the RL approach here from reinforcement learning from human feedback (RLHF), which is used during preference tuning when developing a conventional LLM as illustrated previously in figure 1.4. Both settings use the same underlying process (RL), but they differ primarily in how the reward is obtained and validated (human judgments for RLHF versus automated verifiers or environments for reasoning RL).

RLHF incorporates explicit human evaluations or rankings of model outputs as reward signals, directly guiding the model toward human-preferred behaviors. In contrast, RL in the context of reasoning models typically relies on automated or environment-based reward signals, which can be more objective but potentially less aligned with human preferences. For instance, RL in a reasoning model development pipeline might train a model to excel at mathematical proofs by providing explicit rewards for correctness, whereas RLHF would involve human evaluators ranking various responses to encourage outputs that align closely with human standards and subjective preferences.
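
The difference between outcome-based and process-based automated rewards can be illustrated with a toy sketch. Here the outcome reward only compares the final answer with the reference, while the process reward scores the fraction of arithmetic steps of the form `a op b = c` that are numerically correct. Both functions, their names, and the assumed step format are simplifications for illustration; practical process rewards typically rely on trained reward models or program execution.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
NUM = r"(-?\d+(?:\.\d+)?)"
STEP = re.compile(NUM + r"\s*([+\-*/])\s*" + NUM + r"\s*=\s*" + NUM)


def outcome_reward(final_answer: str, reference: str) -> float:
    """Reward depends only on whether the final answer matches the reference."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(reasoning: str) -> float:
    """Fraction of 'a op b = c' steps in the reasoning that are arithmetically correct."""
    steps = STEP.findall(reasoning)
    if not steps:
        return 0.0
    correct = 0
    for a, op, b, c in steps:
        try:
            correct += abs(OPS[op](float(a), float(b)) - float(c)) < 1e-9
        except ZeroDivisionError:
            pass  # a division by zero counts as an incorrect step
    return correct / len(steps)


# Hypothetical model output with one correct and one incorrect step
reasoning = "48 / 2 = 24 clips in May. 48 + 24 = 70 clips in total."
print(outcome_reward("70", "72"))
print(process_reward(reasoning))
```

In this example the outcome reward gives no credit at all, whereas the process reward still credits the first, correct step.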

#### Load Dataset

This example uses the GSM8K mathematical reasoning dataset, which provides step-by-step solutions to mathematical problems.

The next cell loads the train and test splits of the dataset separately.

In [None]:
# Load train split of GSM8K
train_dataset = load_dataset("openai/gsm8k", "main", split="train")

# Load test split of GSM8K
test_dataset = load_dataset("openai/gsm8k", "main", split="test")


In [None]:
train_dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

In [None]:
train_dataset[0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}

In [None]:
print("Question:", train_dataset[0]['question'])
print("Answer:", train_dataset[0]['answer'])

Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72


In [None]:
# Dataset processing utilities
def extract_hash_answer(text):
    """Extract numerical answer from GSM8K format (#### marker)"""
    if "####" not in text:
        return None
    # GSM8K uses format: "Answer... #### 72"
    return text.split("####")[1].strip()
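
As a quick illustration of the GSM8K answer format, the sketch below re-implements the extraction step on the first training example shown above. It additionally strips the `<<...>>` calculator annotations from the reasoning text and removes thousands separators from the final answer. These two extra steps, as well as the function name `split_gsm8k_answer`, are assumptions for illustration and are not part of the notebook's pipeline.

```python
import re
from typing import Optional, Tuple


def split_gsm8k_answer(text: str) -> Tuple[str, Optional[str]]:
    """Split a GSM8K answer into (reasoning without annotations, final answer)."""
    if "####" not in text:
        return text.strip(), None
    reasoning, final = text.split("####", 1)
    # Drop calculator annotations such as <<48/2=24>>
    reasoning = re.sub(r"<<.*?>>", "", reasoning).strip()
    # Remove thousands separators so that the answer can be parsed as a number
    final = final.strip().replace(",", "")
    return reasoning, final


sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
reasoning, final = split_gsm8k_answer(sample)
print(reasoning)
print(final)
```

Removing thousands separators matters for the numeric comparisons used later in the reward functions, since `float("1,050")` raises a `ValueError` in Python.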

In [None]:
# Define structured output format for mathematical reasoning
reasoning_start = "<REASONING>"
reasoning_end = "</REASONING>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

# System prompt that defines the desired reasoning structure
system_prompt = f"""You are a mathematical reasoning assistant.
When given a math problem:
1. Show your step-by-step work between {reasoning_start} and {reasoning_end}
2. Provide ONLY your final numerical answer between {solution_start} and {solution_end}
   - Example: <SOLUTION> 18 </SOLUTION>
3. Be precise and show all calculation steps clearly in the reasoning section."""

In [None]:
def process_dataset_example(example):
    """Convert GSM8K example to conversation format for GRPO training"""
    question = example["question"]
    answer = extract_hash_answer(example["answer"])

    # Create conversation with system prompt for structured reasoning
    prompt = [{"role": "system", "content": system_prompt},
        {"role": "user", "content": question},]

    return {"prompt": prompt, "answer": answer}

In [None]:
# Apply conversation formatting to all examples
train_dataset = train_dataset.map(process_dataset_example)

# Use a smaller subset for faster demo training
train_dataset = train_dataset.select(range(1000))


In [None]:
train_dataset

Dataset({
    features: ['question', 'answer', 'prompt'],
    num_rows: 1000
})

In [None]:
train_dataset[0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': '72',
 'prompt': [{'content': 'You are a mathematical reasoning assistant.\nWhen given a math problem:\n1. Show your step-by-step work between <REASONING> and </REASONING>\n2. Provide ONLY your final numerical answer between <SOLUTION> and </SOLUTION>\n   - Example: <SOLUTION> 18 </SOLUTION>\n3. Be precise and show all calculation steps clearly in the reasoning section.',
   'role': 'system'},
  {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
   'role': 'user'}]}

#### Model Loading

In [None]:
# Select model
model_name = "unsloth/Llama-3.2-3B-Instruct"
# Token limit for mathematical problems
max_seq_length = 2048

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Auto-distribute across available GPUs/CPU
    device_map="auto",
    # Allow custom model code execution
    trust_remote_code=True,
    # Use FP16 to reduce memory
    torch_dtype=torch.float16,
    # Optimize memory usage during loading
    low_cpu_mem_usage=True,
)

# Load corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Ensure tokenizer has proper padding token for batch processing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"


Apply LoRA (Low-Rank Adaptation) to train only approximately 0.3% of the model parameters.

In [None]:
# Configure LoRA for mathematical reasoning adaptation
lora_config = LoraConfig(
    # Rank: adaptation capacity
    r=16,
    # Scaling factor (typically 2x rank)
    lora_alpha=32,
    # Focus on attention for reasoning
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    # Regularization to prevent overfitting
    lora_dropout=0.1,
    # Skip bias adaptation for simplicity
    bias="none",
    # Causal language modeling task
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA configuration to create trainable adapter
model = get_peft_model(model, lora_config)

# Enable gradient checkpointing to reduce memory usage
model.gradient_checkpointing_enable()

# Display trainable vs total parameters
model.print_trainable_parameters()

trainable params: 9,175,040 || all params: 3,221,924,864 || trainable%: 0.2848
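
The reported number of trainable parameters can be reproduced by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds two low-rank matrices of shapes `r x in_features` and `out_features x r`, i.e., `r * (in_features + out_features)` parameters. The sketch below assumes the layer dimensions of Llama 3.2 3B (hidden size 3072, key/value projection size 1024, and 28 decoder layers). These dimensions are not printed in this notebook, so they are assumptions that should be checked against the model configuration.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Number of parameters added by a rank-r LoRA adapter on one linear layer."""
    return r * (in_features + out_features)


r = 16
num_layers = 28                 # assumed number of decoder layers
hidden, kv_dim = 3072, 1024     # assumed projection sizes
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
trainable = per_layer * num_layers
total = 3_221_924_864           # "all params" value printed above

print(f"trainable params: {trainable:,} ({100 * trainable / total:.4f}%)")
```

Under these assumptions the count agrees with the value printed by `print_trainable_parameters()`, which also shows that the trainable share grows linearly with the rank `r`.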


#### RL Reward Definition

This cell defines the reward functions for the reasoning model. The total reward combines four functions that evaluate different aspects of the generated responses:

- Exact Format Matching: Assigns a high reward (3.0) if the format of the response matches the required response pattern.
- Approximate Matching: Assigns a partial reward if the response matches individual components of the response pattern, even when the overall match is not perfect (+0.5 for each tag that appears exactly once, -0.5 otherwise).
- Answer Correctness: Assigns rewards for mathematical accuracy with graduated scoring (e.g., 3.0 for exact match, 1.5 for answer within 10%, etc.).
- Number Extraction: Assigns a reward (1.5) if the first number extracted from the solution section equals the reference answer.

In [None]:
# Compiled regex patterns for efficient reward computation
match_format = re.compile(
    rf"^[\s]{{0,}}"                      # Optional whitespace at start
    rf"{reasoning_start}.+?{reasoning_end}.*?"  # Reasoning section (non-greedy)
    rf"{solution_start}(.+?){solution_end}"     # Solution section with capture group
    rf"[\s]{{0,}}$",                     # Optional whitespace at end
    flags=re.MULTILINE | re.DOTALL       # Multi-line matching with . matching newlines
)

match_numbers = re.compile(
    rf"{solution_start}.*?([\d\.]{{1,}})", # Extract numbers from solution section
    flags=re.MULTILINE | re.DOTALL        # Flexible pattern matching
)

# Matches from solution_start until the closing tag OR the end of the string
match_lenient = re.compile(
    rf"{solution_start}\s*(.*?)(?:{solution_end}|$)",
    flags=re.MULTILINE | re.DOTALL
)

# Reward Function 1: Exact Format Compliance
def match_format_exactly(completions, **kwargs):
    """
    High reward (3.0) for perfect format adherence
    Ensures model learns the complete structured output pattern
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        # Check if response matches complete format pattern
        score = 3.0 if match_format.search(response) is not None else 0.0
        scores.append(score)
    return scores

# Reward Function 2: Partial Format Credit
def match_format_approximately(completions, **kwargs):
    """
    Graduated scoring for format elements
    Encourages learning individual components even if not perfect
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        score = 0

        # Award +0.5 for correct token count, -0.5 for wrong count
        score += 0.5 if response.count(reasoning_start) == 1 else -0.5
        score += 0.5 if response.count(reasoning_end) == 1 else -0.5
        score += 0.5 if response.count(solution_start) == 1 else -0.5
        score += 0.5 if response.count(solution_end) == 1 else -0.5

        scores.append(score)
    return scores

# Reward Function 3: Mathematical Accuracy
def check_answer_correctness(prompts, completions, answer, **kwargs):
    """
    Graduated scoring for mathematical accuracy:
    - 3.0: Exact match
    - 1.5: Within 10% (close answer)
    - 0.5: Within 20% (reasonable attempt)
    - -0.5: Wrong answer (penalty for incorrect math)
    """
    responses = [completion[0]["content"] for completion in completions]

    # Extract answers using format pattern
    extracted_responses = [
        guess.group(1) if (guess := match_format.search(r)) is not None else None
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:  # No extractable answer
            scores.append(0)
            continue

        # Exact string match gets full points
        if guess.strip() == true_answer.strip():
            scores.append(3.0)
        else:
            # Try numerical comparison for partial credit
            try:
                ratio = float(guess) / float(true_answer)
                if 0.9 <= ratio <= 1.1:      # Within 10%
                    scores.append(1.5)
                elif 0.8 <= ratio <= 1.2:    # Within 20%
                    scores.append(0.5)
                else:                         # Wrong answer
                    scores.append(-0.5)
            except (ValueError, ZeroDivisionError):
                scores.append(-0.5)           # Invalid numerical format

    return scores

# Reward Function 4: Number Extraction Ability
def check_numbers_extraction(prompts, completions, answer, **kwargs):
    """
    Tests the model's ability to extract numerical values from solution sections
    Complementary to exact format matching - focuses on parsing capability
    """
    responses = [completion[0]["content"] for completion in completions]

    # Extract numbers from solution sections using number pattern
    extracted_responses = [
        guess.group(1) if (guess := match_numbers.search(r)) is not None else None
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:  # No extractable number
            scores.append(0)
            continue

        try:
            # Simple numerical equality check
            true_val = float(true_answer.strip())
            guess_val = float(guess.strip())
            # Binary scoring: correct (1.5) or incorrect (0)
            scores.append(1.5 if guess_val == true_val else 0.0)
        except (ValueError, TypeError):
            scores.append(0)  # Invalid number format

    return scores
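
The tiered scoring in `check_answer_correctness` can be traced on a few hand-picked pairs. The sketch below is a minimal single-pair mirror of that logic; the helper name `partial_credit` and the example values are hypothetical and are not part of the notebook.

```python
def partial_credit(guess: str, true_answer: str) -> float:
    """Single-pair mirror of the scoring tiers in check_answer_correctness."""
    # Exact string match earns full points
    if guess.strip() == true_answer.strip():
        return 3.0
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -0.5  # Not a number, or true answer is zero
    if 0.9 <= ratio <= 1.1:   # Within 10%
        return 1.5
    if 0.8 <= ratio <= 1.2:   # Within 20%
        return 0.5
    return -0.5               # Wrong answer

# Hypothetical (guess, true_answer) pairs
examples = [
    ("72", "72"),           # exact match
    ("72.0", "72"),         # numerically equal, but not the same string
    ("75", "72"),           # ratio ~1.04
    ("60", "72"),           # ratio ~0.83
    ("30", "72"),           # ratio ~0.42
    ("seventy-two", "72"),  # not parseable as a number
]
scores = [partial_credit(g, t) for g, t in examples]
for (g, t), s in zip(examples, scores):
    print(f"guess={g!r:>14} true={t!r} -> {s}")
```

Note that a numerically equal answer written differently (`"72.0"` vs `"72"`) falls through to the ratio branch and earns 1.5 rather than 3.0, because the full score requires an exact string match.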

In [None]:
# Configure GRPO training parameters
training_args = GRPOConfig(
    # Learning rate
    learning_rate=5e-6,
    # Batch configuration
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    # Sequence length limits for mathematical problems
    max_prompt_length=1024,
    max_completion_length=1024,
    # Training duration
    # num_train_epochs=1,
    max_steps = 100,
    # Log every 10 steps
    logging_steps=10,
    # Enable FP16 training for memory efficiency
    fp16=True,
    bf16=False,
    # Output configuration
    output_dir="./trl_grpo_outputs",
    # Gradient clipping for stable training
    max_grad_norm=0.1,
    # Disable external logging
    report_to=[],
    # Generation parameters for better reward signal
    num_generations=4,
    # Higher temp for more diverse outputs
    temperature=0.8,
    # KL divergence penalty for GRPO
    beta=0.01,
)
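
The batch settings above interact with `num_generations`. The arithmetic below is a rough sketch that assumes a single training process and that the batch size counts completions rather than prompts, as in recent TRL releases (older versions may behave differently). The variable `num_processes` is an assumption, not a value taken from the notebook.

```python
# Values copied from the GRPOConfig above
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_generations = 4
max_steps = 100

# Assumption: one training process
num_processes = 1

# Completions that contribute to one optimizer step
completions_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
# Each prompt is sampled num_generations times, so this many distinct prompts per step
prompts_per_step = completions_per_step // num_generations
# Totals over the whole run
total_completions = completions_per_step * max_steps
total_prompts = prompts_per_step * max_steps

print(f"Completions per optimizer step: {completions_per_step}")
print(f"Distinct prompts per optimizer step: {prompts_per_step}")
print(f"Completions over {max_steps} steps: {total_completions}")
print(f"Distinct prompts over {max_steps} steps: {total_prompts}")
```

Under these assumptions the run scores 1,600 completions in total, which is consistent with the `train_samples_per_second` of 0.322 and the `train_runtime` of about 4973 seconds reported in the training output (0.322 × 4973 ≈ 1601).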

In [None]:
# Custom callback that prints the reward metrics at each logging step
from transformers import TrainerCallback

class RewardLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and any('reward' in k for k in logs.keys()):
            reward_logs = {k: v for k, v in logs.items() if 'reward' in k}
            print(f"Step {state.global_step}: {reward_logs}")

trainer = GRPOTrainer(
    model=model,
    # Four complementary reward functions
    reward_funcs=[
        match_format_exactly,         # Structure compliance
        match_format_approximately,   # Partial format credit
        check_answer_correctness,     # Mathematical accuracy
        check_numbers_extraction,     # Number parsing ability
    ],
    # Training configuration
    args=training_args,
    # Processed GSM8K dataset
    train_dataset=train_dataset,
    # Tokenizer (processing_class in TRL)
    processing_class=tokenizer,
    # Log RL rewards
    callbacks=[RewardLoggingCallback()]
)

The model is already on multiple devices. Skipping the move to device specified in `args`.


Note that the GRPO algorithm uses a loss function that differs from the cross-entropy loss commonly used for classification networks and for pre-training and supervised finetuning of LLMs. The GRPO loss can take negative values, and a negative value is not a sign of failure: it typically means that sampled outputs with above-average reward dominate the objective. The GRPO loss is also harder to interpret than a supervised loss, because it is calculated as a sum of several terms. This creates compound behavior, where one term may be improving while another has a negative impact on the model. Hence, the final loss value can go up or down while the model is actually improving.
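
One reason the loss value is hard to read is that GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below illustrates this group-relative normalization on made-up rewards. The function name, the `eps` constant, and the use of the sample standard deviation are assumptions for illustration, not the exact TRL implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # Assumption: sample standard deviation
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical combined rewards for num_generations=4 completions of one prompt
group_rewards = [8.0, 3.5, 3.5, -1.0]
advantages = group_relative_advantages(group_rewards)

for r, a in zip(group_rewards, advantages):
    print(f"reward={r:5.1f} -> advantage={a:+.3f}")
print(f"Sum of advantages: {sum(advantages):.6f}")
```

Because the advantages are centered on the group mean, they sum to zero within each group: above-average completions are pushed up and below-average ones are pushed down. The positive and negative contributions largely cancel in the reported loss, which is why the loss can hover around zero even while the rewards keep increasing.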

A more relevant metric to monitor in GRPO is the reward, where increasing rewards indicate improving behavior. Recall that we defined the reward as a sum of four reward functions. In the output of the training cell below, the combined reward is displayed under the `reward` key. During training, the reward increased from 3.72 at step 10 to 8.20 at step 100.


In [None]:
trainer.train()

`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072, 'top_p': 0.9}. If this is not desired, please set these values explicitly.


Step,Training Loss
10,0.0381
20,-0.0025
30,-0.0089
40,0.0044
50,0.0184
60,0.0069
70,0.0073
80,0.0034
90,0.0268
100,0.0152


Step 10: {'rewards/match_format_exactly/mean': 1.275, 'rewards/match_format_exactly/std': 1.4583582043647767, 'rewards/match_format_approximately/mean': 0.73125, 'rewards/match_format_approximately/std': 1.263677716255188, 'rewards/check_answer_correctness/mean': 1.05, 'rewards/check_answer_correctness/std': 1.4218261599540711, 'rewards/check_numbers_extraction/mean': 0.665625, 'rewards/check_numbers_extraction/std': 0.7159868717193604, 'reward': 3.721875, 'reward_std': 3.1834747076034544, 'frac_reward_zero_std': 0.075}
Step 20: {'rewards/match_format_exactly/mean': 2.45625, 'rewards/match_format_exactly/std': 1.1661575436592102, 'rewards/match_format_approximately/mean': 1.675, 'rewards/match_format_approximately/std': 0.7206614732742309, 'rewards/check_answer_correctness/mean': 2.121875, 'rewards/check_answer_correctness/std': 1.3605764508247375, 'rewards/check_numbers_extraction/mean': 1.18125, 'rewards/check_numbers_extraction/std': 0.5642989039421081, 'reward': 7.434375, 'reward_s

TrainOutput(global_step=100, training_loss=0.010910292826592923, metrics={'train_runtime': 4973.4869, 'train_samples_per_second': 0.322, 'train_steps_per_second': 0.02, 'total_flos': 0.0, 'train_loss': 0.010910292826592923})
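
The logged values above can be used to check that the combined `reward` is the sum of the four per-function means. The short sketch below copies the numbers from the step 10 and step 20 log lines; the dictionary layout and variable names are only for illustration.

```python
import math

# Per-function mean rewards and combined reward, copied from the training log above
logged = {
    10: {
        "rewards/match_format_exactly/mean": 1.275,
        "rewards/match_format_approximately/mean": 0.73125,
        "rewards/check_answer_correctness/mean": 1.05,
        "rewards/check_numbers_extraction/mean": 0.665625,
        "reward": 3.721875,
    },
    20: {
        "rewards/match_format_exactly/mean": 2.45625,
        "rewards/match_format_approximately/mean": 1.675,
        "rewards/check_answer_correctness/mean": 2.121875,
        "rewards/check_numbers_extraction/mean": 1.18125,
        "reward": 7.434375,
    },
}

component_sums = {}
for step, logs in logged.items():
    # Sum only the per-function means, not the combined reward itself
    component_sums[step] = sum(v for k, v in logs.items() if k.endswith("/mean"))
    matches = math.isclose(component_sums[step], logs["reward"])
    print(f"Step {step}: sum of components = {component_sums[step]:.6f}, "
          f"logged reward = {logs['reward']:.6f}, match = {matches}")
```

Looking at the individual components in this way also shows which behavior is driving the improvement: between steps 10 and 20, both the format rewards and the answer-correctness reward roughly doubled.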

#### Evaluate on Test Set Problems

The model is next evaluated on a small set of 5 problems from the test dataset. The code in the cell first formats the prompt for the model, then tokenizes the text and generates a response, and finally extracts the numerical answer from the model's response.

In this case, the model correctly answered 4 of the 5 questions. With only 5 problems, this is a quick sanity check rather than a reliable accuracy estimate. Premier LLMs have achieved 97% accuracy on the GSM8K dataset. Smaller models similar to the one we finetuned typically achieve around 50-60% accuracy.
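
The final extraction step can be tried in isolation, without loading a model. The sketch below is a minimal illustration: the `<SOLUTION>` tag and the three regular expressions are hypothetical stand-ins for the notebook's `match_format` and `match_lenient` patterns, and the sample responses are made up.

```python
import re

# Hypothetical stand-ins for the notebook's strict and lenient patterns
strict_pattern = re.compile(r"<SOLUTION>(.+?)</SOLUTION>", re.DOTALL)
lenient_pattern = re.compile(r"<SOLUTION>(.*)", re.DOTALL)
# Numbers must start with a digit; thousands separators and decimals are allowed
number_pattern = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(response):
    """Return the last number in the solution section, or None if there is none."""
    # Try the strict pattern first, then fall back to the lenient one
    match = strict_pattern.search(response) or lenient_pattern.search(response)
    if match is None:
        return None
    numbers = number_pattern.findall(match.group(1))
    # Strip thousands separators so the result can be passed to float()
    return numbers[-1].replace(",", "") if numbers else None

# Made-up responses: well-formed, missing closing tag, and no tags at all
samples = [
    "<REASONING>2 + 1 = 3</REASONING><SOLUTION>3</SOLUTION>",
    "<SOLUTION>The total is $1,250.50",
    "No tags at all",
]
results = [extract_final_number(s) for s in samples]
for s, r in zip(samples, results):
    print(f"{s!r} -> {r!r}")
```

Requiring a leading digit in the number pattern is a choice of this sketch; it keeps a stray comma or period in the solution text from being picked up as a number.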

In [None]:
# Select 5 problems from test set
test_indices = [1, 3, 5, 7, 9]

model.eval()
correct_count = 0

# Generate responses
for idx, test_idx in enumerate(test_indices, 1):
    example = test_dataset[test_idx]
    question = example["question"]
    true_answer = extract_hash_answer(example["answer"])

    test_messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]

    test_input = tokenizer.apply_chat_template(
        test_messages,
        tokenize=False,
        add_generation_prompt=True
    )

    with torch.no_grad():
        inputs = tokenizer(test_input, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            # Lower temp for more focused answers
            temperature=0.7,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.pad_token_id
        )
        response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

    # Extract the model's answer
    solution_text = None  # Initialize here
    format_match = match_format.search(response)

    if format_match:
        solution_text = format_match.group(1).strip()
    # If strict failed, just grab what is inside/after the solution tag
    else:
        lenient_match = match_lenient.search(response)
        if lenient_match:
            solution_text = lenient_match.group(1).strip()

    # Extract the final number
    if solution_text:
        # Numbers must start with a digit, so a stray comma is not matched
        numbers = re.findall(r'-?\d[\d,]*(?:\.\d+)?', solution_text)
        model_answer = numbers[-1] if numbers else "NO ANSWER FOUND"
    else:
        model_answer = "NO ANSWER FOUND"

    # Check if correct
    try:
        # Remove symbols like $ and , for numerical comparison
        clean_model = model_answer.replace('$', '').replace(',', '')
        clean_true = true_answer.replace('$', '').replace(',', '')
        is_correct = float(clean_model) == float(clean_true)
    except ValueError:
        is_correct = model_answer.strip() == true_answer.strip()

    if is_correct:
        correct_count += 1

    print(f"Test Problem {idx}/{len(test_indices)}")
    print(f"Question: {question}")
    print(f"\nTrue Answer: {true_answer}")
    print(f"\nModel Response:")
    print(response)
    print(f"\nExtracted Answer: {model_answer}")
    print(f"{'Correct' if is_correct else 'Incorrect'}")
    print(f"{'='*80}")

print(f"Results: {correct_count}/{len(test_indices)} correct ({correct_count / len(test_indices) * 100:.0f}%)")

Test Problem 1/5
Question: A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?

True Answer: 3

Model Response:
<REASONING>
To find the total number of bolts, we need to add the number of bolts of blue fiber and the number of bolts of white fiber. The problem states that it takes twice as much white fiber as blue fiber.

Let's denote the number of bolts of blue fiber as B and the number of bolts of white fiber as W. We know that W = 2B (since it takes half as much white fiber).

We can now write an equation: B + W = total bolts

Substituting W with 2B, we get:
B + 2B = total bolts
Combine like terms:
3B = total bolts

Since we don't have a specific value for B, we can express the total number of bolts in terms of B:
total bolts = 3B

However, since the question asks for the total number of bolts and not just "three times" the number of blue bolts, we can provide a more direct answer by stating that the total number of bolts is three

## 23.5 Reasoning Models Limitations <a name='23.5-reasoning-models-limitations'></a>

"Also noteworthy is the mention of knowing 'when to think for a long time or not.' This hints at an important design consideration: reasoning is not always necessary or desirable."

"For instance, reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to 'overthinking.' Also, here, the simple rule applies: use the right tool (or type of LLM) for the task."

## References <a name='references'></a>

1. A Visual Guide to Reasoning LLMs, by Maarten Grootendorst, available at [https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms).

[BACK TO TOP](#top)