# nanoAhaMoment: Single File "RL for LLM" Library
Single GPU · No TRL or Verl · Efficient · 3B Base Model · Full Parameter Tuning Implementation of R1-zero training.

Inspired by [TinyZero](https://github.com/Jiayi-Pan/TinyZero) and [Mini-R1](https://www.philschmid.de/mini-deepseek-r1), but designed to be **simpler**, **cleaner**, and **faster**, with every line of code visible and understandable.

R1-Zero is arguably the more interesting contribution from the DeepSeek R1 paper. The core idea: take a freshly pre-trained LLM (straight out of the unsupervised pretraining oven) and continue its training using reinforcement learning *without* any human feedback or supervision. The result? A model that starts showing emergent behaviors like self-reflection, verification, and backtracking, which researchers have tried to bake into LLMs using handcrafted tricks and inductive biases, at least since O1.

In this notebook, we’ll build an R1-Zero-style training loop **from scratch**. The goal is to create a crystal-clear, hackable foundation for RL-style LLM training; one that gives you a bird’s-eye view of every moving part and how they fit together. Perfect for playing around, extending, or hacking.

---

### Why another R1-Zero implementation?

There are already great implementations like [TinyZero](https://github.com/Jiayi-Pan/TinyZero) and [Mini-R1](https://www.philschmid.de/mini-deepseek-r1). But they rely on full-fledged RL libraries (like `trl` or `verl`) to handle training.

These libraries exist for good reason; efficient RL training for LLMs sits at the crossroads of scalable training and fast inference. Making that work takes a lot of engineering. But that also means the internals are often abstracted away, hard to read, and even harder to tweak.

This notebook is different: **no abstractions, no hiding**. You’ll see everything, top to bottom. A lightweight, readable codebase that still follows best practices and runs efficiently on a single GPU.

### What is this notebook, exactly?

We'll train a base LLM using RL to solve a reasoning-heavy algorithmic task. The setup:

- **Model**: Qwen2.5 3B-Base  
- **Dataset**: Countdown-Tasks-3to4  
- **Algorithm**: GRPO (a variant of policy gradient)

Yes, the task is a bit toy-ish—but it captures the essence of R1-Zero: emergent behaviors like self-reflection, verification, backtracking, even language-switching. This setup is ideal for rapid prototyping and experimentation.

### Who is this notebook for?

- Anyone interested in RL training for LLMs  
- Researchers, especially those in academia, exploring reasoning in language models

### What should I know before jumping in?

- A working knowledge of the HuggingFace Transformers library  
- Some experience fine-tuning LLMs  
- Familiarity with policy gradient methods (helpful but not required)

## R1-Zero Recipe

The goal is to train a base LLM to **reason** in a way that allows it to **reevaluate** its own outputs and **improve** them, all without human supervision. The DeepSeek R1 paper proposes a surprisingly simple recipe to achieve this, and that's exactly what we'll implement in this notebook.

### The Recipe

Here's the high-level procedure:

1. **Start** with a base LLM and a dataset containing problem prompts paired only with their *final answers* (no intermediate reasoning steps).  
2. For each iteration $i = 0$ to `NUM_ITERATIONS`:
   - Sample a batch of prompts $\{x_i\}_{i=1}^N$ from the dataset.
   - For each prompt, sample $G$ responses from the model:  
     $ y_1, y_2, \cdots, y_G \sim \pi_\theta(y|x) $

     These $G$ responses form what is called a *group* in GRPO.
   - Compute a reward $R_i$ for each response and normalize the rewards within each group to calculate the GRPO advantage.
   - Create a list of $N \times G$ episodes, i.e., pairs of $(x_i, y_i)$ along with their corresponding advantages.
   - Estimate the policy gradient $\vec{g}_{pg}$ from these episodes.
   - Update the model parameters:  
     $\theta \leftarrow \theta + \eta \vec{g}_{pg}$
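
As a rough illustration of the loop above, here is a toy sketch in pure Python. It is not the notebook's training code: the "model" is a hypothetical softmax policy over four canned responses to a single prompt, the reward is 1 for one designated response, and the names `toy_grpo`, `softmax`, and all default values are made up for this example.

```python
import math
import random
from statistics import mean, pstdev


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_grpo(num_iterations=200, group_size=8, lr=0.5, correct=2, seed=0):
    """Toy GRPO-style loop (illustrative only): one prompt, four candidate responses."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0, 0.0]  # theta: one logit per candidate response
    for _ in range(num_iterations):
        probs = softmax(logits)
        # Sample a group of G responses from the current policy
        group = rng.choices(range(len(logits)), weights=probs, k=group_size)
        # Rule-based reward, then per-group normalization into advantages
        rewards = [1.0 if y == correct else 0.0 for y in group]
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages = [(r - mu) / (sigma + 1e-4) for r in rewards]
        # Policy gradient estimate: mean of A * grad log pi(y), using
        # d log softmax(y) / d logit_k = 1[k == y] - probs[k]
        grad = [0.0] * len(logits)
        for y, adv in zip(group, advantages):
            for k in range(len(logits)):
                grad[k] += adv * ((1.0 if k == y else 0.0) - probs[k]) / group_size
        # Gradient ascent step: theta <- theta + eta * g
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return softmax(logits)


final_probs = toy_grpo()
print(final_probs)  # probability mass should concentrate on the rewarded response
```

Groups in which every sampled response gets the same reward contribute a zero gradient here, which mirrors the "no learning signal" cases discussed later for GRPO.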

### Code Structure Overview

The code you will see is structured directly following this recipe. It boils down to three main components:

1. **Episode Generation**  
   - Generate $ (x, y) $ pairs along with their advantages for each RL iteration.
   
2. **Reward Calculation**  
   - Compute rewards for each generated response.
   
3. **Policy Gradient Estimation**  
   - Use the generated episodes to estimate the policy gradient and perform the model update.

In the end, these three components come together in a simple loop that trains the model, step by step, to develop reasoning capabilities through reinforcement learning.


## Checkpoint Playground

In `notebooks/checkpoint_playground.ipynb`, you can load the model we already trained with this notebook and interactively test its reasoning capabilities. The playground lets you input custom prompts and observe the model's responses.

## Prerequisites

### Installing Dependencies

Before we begin, let's install the necessary Python packages. We'll be using:

- PyTorch  
- Hugging Face Transformers  
- Hugging Face Datasets  
- DeepSpeed  
- vLLM

For a detailed, step-by-step installation guide, refer to the [README](https://github.com/McGill-NLP/tiny-aha-moment.git) of this project.

In [1]:
import os
from pathlib import Path

# Set the environment variables for HuggingFace
# This is done to ensure that the cache directory for HuggingFace is set to a specific location,
# preventing the storage from being overwhelmed with model files and other data.
SCRATCH = Path.home() / "scratch"
os.environ["HF_HOME"] = str(SCRATCH / "hf_home")
os.environ["VLLM_USE_V1"] = "0"

### Import the required libraries

In [2]:
import gc
import re
import time
from typing import Any, Dict, List, Tuple, Union

import deepspeed
import numpy as np
import torch
from datasets import load_dataset
from deepspeed import DeepSpeedEngine
from tqdm import trange
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedModel
from vllm import LLM, SamplingParams

import wandb
from utils import (compute_token_log_probs, dump_episodes, evaluate_on_test_set, find_free_port, find_last_checkpoint, prepare_model_inputs,
                   load_model_into_vllm)

# Needed to stop DeepSpeed from complaining
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(find_free_port())
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

[2025-08-05 10:40:10,468] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


**We do have a few helper functions in `utils.py` that are used to keep the code clean.**

## Hyperparameters

Let's define the hyperparameters for the training. These are mostly taken from the [Mini-R1](https://www.philschmid.de/mini-deepseek-r1) implementation.

In [3]:
# Model configuration
MODEL_NAME = "Qwen/Qwen2.5-0.5B"
MODEL_CHAT_NAME = MODEL_NAME + "-Instruct"

# Dataset configuration
DATASET_NAME = "Jiayi-Pan/Countdown-Tasks-3to4"

# Total number of training iterations
NUM_ITERATIONS = 1000
# Number of episodes to collect per iteration for training
EPISODES_PER_ITERATION = 64
# Number of responses to generate for each input prompt (i.e. group size in GRPO)
GENERATIONS_PER_SAMPLE = 4
# Controls how much the policy can deviate from the reference model
KL_COEFFICIENT = 0.001

# Training hyperparameters
# Batch size for each GPU device during training
PER_DEVICE_BATCH_SIZE = 4
# Learning rate for model updates
LEARNING_RATE = 1e-6

# Sampling parameters
# Maximum number of tokens to generate in each response
MAX_RESPONSE_TOKENS = 1024
# Controls randomness in generation (higher = more random)
TEMPERATURE = 1.0
# Nucleus sampling parameter (1.0 = disabled)
TOP_P = 1.0
# Top-k sampling parameter (-1 = disabled)
TOP_K = -1  # no top k

# DeepSpeed configuration
# DeepSpeed config for the policy model
deepspeed_config = {
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False
    },
    "train_batch_size": EPISODES_PER_ITERATION,
    "train_micro_batch_size_per_gpu": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": EPISODES_PER_ITERATION // PER_DEVICE_BATCH_SIZE,
    "gradient_clipping": 1.0,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": LEARNING_RATE,
            "betas": (0.9, 0.999),
            "eps": 1e-8,
            "weight_decay": 0.0,
            "torch_adam": True,
        },
    },
}
# DeepSpeed config for the reference model
ref_deepspeed_config = {
    "bf16": {
        "enabled": True
    },
    # Note that we don't train the reference model
    # These are just for compatibility with DeepSpeed.
    "train_batch_size": EPISODES_PER_ITERATION,
    "train_micro_batch_size_per_gpu": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": EPISODES_PER_ITERATION // PER_DEVICE_BATCH_SIZE,
}

RUN_NAME = "r1-zero"
EXP_DIR = SCRATCH / "deepseek_r1z_hackathon" / RUN_NAME
EXP_DIR.mkdir(parents=True, exist_ok=True)
print(f"Logs and Checkpoints will be saved to: {EXP_DIR}")

Logs and Checkpoints will be saved to: /home/quang/scratch/deepseek_r1z_hackathon/r1-zero
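
The three batch-related entries in the DeepSpeed config are tied together: the global batch size should equal the micro-batch size times the gradient accumulation steps times the number of GPUs. The small check below is a sketch of that arithmetic for the values used above; `check_batch_config` is a hypothetical helper written for this example, not part of DeepSpeed.

```python
def check_batch_config(train_batch_size, micro_batch_size, grad_accum_steps, world_size=1):
    """Return True when the three batch settings are mutually consistent."""
    return train_batch_size == micro_batch_size * grad_accum_steps * world_size


# Values copied from the hyperparameter cell above
EPISODES_PER_ITERATION = 64
PER_DEVICE_BATCH_SIZE = 4
grad_accum_steps = EPISODES_PER_ITERATION // PER_DEVICE_BATCH_SIZE  # 16 micro-batches per optimizer step

consistent = check_batch_config(EPISODES_PER_ITERATION, PER_DEVICE_BATCH_SIZE, grad_accum_steps)

# A micro-batch size that does not divide the batch breaks the relation:
# 64 // 6 = 10 accumulation steps, and 6 * 10 = 60 != 64
inconsistent = check_batch_config(64, 6, 64 // 6)
print(consistent, inconsistent)
```

In practice this means `PER_DEVICE_BATCH_SIZE` should evenly divide `EPISODES_PER_ITERATION` whenever you change either value.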


## Generating the training prompts

For training, we'll use the [Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset, which provides problem statements paired with their final answers (but no reasoning steps).

### The Countdown Task

The Countdown game is a numerical puzzle where the player must reach a target number using a set of randomly chosen numbers and basic arithmetic operations: addition, subtraction, multiplication, and division. Each number must be used exactly once.

Example:

```yaml
Target: 622
Available Numbers: [25, 3, 6, 100]

# Not provided in the dataset
Solution: (100 × 6) + (25 − 3) = 622
```

This task is ideal for training LLMs to practice reasoning, searching, and self-verification.
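
To get a feel for the search the model has to learn, here is a minimal brute-force solver sketch. It is not part of the notebook or the dataset; `solve_countdown` and `_search` are hypothetical names, and the approach (repeatedly combining two values with one operation, using exact fractions) is just one straightforward way to enumerate candidate equations.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Optional, Tuple


def _search(items: List[Tuple[Fraction, str]], target: Fraction) -> Optional[str]:
    # Base case: every number has been merged into a single expression
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two values, combine them with one operation, and recurse
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


def solve_countdown(numbers: List[int], target: int) -> Optional[str]:
    """Return an expression that uses every number exactly once, or None if none exists."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


solution = solve_countdown([25, 3, 6, 100], 622)
print(solution)
```

Exact `Fraction` arithmetic avoids floating-point surprises with division. The search space is tiny for three or four numbers, which is part of why a rule-based reward is cheap to compute for this task.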


Since we are using the base version of the model, which has only been pretrained on raw internet data, it has no prior understanding of system prompts or chat formatting. However, we will still use the chat format to make the resulting model compatible with downstream tools and frameworks that expect it.

In [4]:
SYSTEM_MESSAGE = ("You are a helpful assistant. You first think about the reasoning process in the mind "
                  "and then provide the user with the answer.")
PROMPT_TEMPLATE = ("Using the numbers {numbers}, create an equation that equals {target}. "
                   "You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. "
                   "Show your work in <think> </think> tags. And return the final equation and answer in "
                   "<answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.")

Now that we have the system message and prompt template, we can generate the training prompts.

In [5]:
# Load and process dataset
def preprocess_example(example: Dict[str, Any]):
    numbers: List[int] = example["nums"]
    target: int = example["target"]

    prefix = [
        {
            "role": "system",
            "content": SYSTEM_MESSAGE
        },
        {
            "role": "user",
            "content": PROMPT_TEMPLATE.format(numbers=numbers, target=target)
        },
        {
            "role": "assistant",
            "content": "Let me solve this step by step.\n<think>"
        },
    ]
    input_ids = tokenizer.apply_chat_template(prefix, tokenize=True, continue_final_message=True)
    prompt = tokenizer.decode(input_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)
    return {"prompt": prompt, "input_ids": input_ids}


# Note that the base model and the "instruct" model use different EOS tokens.
# Here we make sure to use the correct one.
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHAT_NAME)
EOS_TOKEN_ID = AutoTokenizer.from_pretrained(MODEL_NAME).eos_token_id
EOS_TOKEN = tokenizer.convert_ids_to_tokens(EOS_TOKEN_ID)

dataset = load_dataset(DATASET_NAME, split="train")
dataset = dataset.map(preprocess_example, num_proc=6)

# Split dataset
train_test_split = dataset.train_test_split(test_size=500, train_size=2000, seed=42)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

len(train_dataset), len(test_dataset)

(2000, 500)

Let's look at some examples from the dataset.

In [6]:
print("Target: ", train_dataset[0]["target"])
print("Available Numbers: ", train_dataset[0]["nums"])

Target:  43
Available Numbers:  [4, 27, 12]


Using the system message and prompt template, we generate the following prompt for this example:

In [7]:
print(train_dataset[0]["prompt"])

<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [4, 27, 12], create an equation that equals 43. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>


As you may have noticed, we also append the assistant turn marker (`<|im_start|>assistant`), the phrase *"Let me solve this step by step."*, and an opening `<think>` tag to each prompt. This helps guide the model into **answering mode**. Without this, the base model might simply continue the prompt rather than attempting to solve the task, since it has no inherent understanding of instruction-following.
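
The layout of the printed prompt can be reproduced by hand, which makes the role of the open assistant turn easier to see. The sketch below is an approximation that only mirrors the example printed above; the real template lives in the tokenizer and is applied by `apply_chat_template`, and `render_chatml` is a hypothetical helper written for this illustration.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Hand-rolled rendering of the prompt layout shown above (illustrative only).

    Every message is closed with <|im_end|> except the last one, which is left
    open so the model continues writing inside the assistant turn.
    """
    parts = []
    for idx, msg in enumerate(messages):
        text = f"<|im_start|>{msg['role']}\n{msg['content']}"
        if idx != len(messages) - 1:
            text += "<|im_end|>\n"
        parts.append(text)
    return "".join(parts)


demo_prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Using the numbers [4, 27, 12], create an equation that equals 43."},
    {"role": "assistant", "content": "Let me solve this step by step.\n<think>"},
])
print(demo_prompt)
```

Because the final turn has no closing `<|im_end|>`, generation picks up right after `<think>`, so the first tokens the model produces are already part of its reasoning.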

Additionally, we tokenize each prompt and store the result as `input_ids`, which will be used later during training.

In [8]:
print(train_dataset[0]["input_ids"])

[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 1446, 1156, 1744, 911, 279, 32711, 1882, 304, 279, 3971, 323, 1221, 3410, 279, 1196, 448, 279, 4226, 13, 151645, 198, 151644, 872, 198, 16429, 279, 5109, 508, 19, 11, 220, 17, 22, 11, 220, 16, 17, 1125, 1855, 458, 23606, 429, 16819, 220, 19, 18, 13, 1446, 646, 990, 6770, 34784, 7525, 17973, 11, 85922, 11777, 608, 8, 323, 1817, 1372, 646, 1172, 387, 1483, 3055, 13, 6928, 697, 975, 304, 366, 26865, 29, 690, 26865, 29, 9492, 13, 1597, 470, 279, 1590, 23606, 323, 4226, 304, 366, 9217, 29, 690, 9217, 29, 9492, 11, 369, 3110, 366, 9217, 2235, 16, 488, 220, 17, 8, 608, 320, 18, 353, 220, 20, 12533, 9217, 14276, 151645, 198, 151644, 77091, 198, 10061, 752, 11625, 419, 3019, 553, 3019, 624, 13708, 766, 29]


## Reward Function


The DeepSeek R1 paper introduced **rule-based rewards** to evaluate whether the model-generated solutions were correct. We'll adopt a similar approach by defining two custom reward functions:

- **Format Reward**: Checks if the output follows the required format:  
  `<think> [thinking] </think><answer> [answer] </answer>`

- **Equation Reward**: Extracts the equation from within the `<answer>` tag, verifies that it evaluates to the target result, and ensures that all available numbers are used exactly once.

The purpose of enforcing the format is mainly to make answer extraction easier. It isn't strictly necessary for the correctness of the answer itself but simplifies parsing during training.

The final reward assigned to an episode/trajectory (prompt+response) is simply the sum of these two components. Importantly, the reward is only computed at the **last token** of the output. From an RL perspective, this means that all intermediate actions receive zero reward. We also do not apply any discounting here (i.e., $\gamma = 1$).
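
A small sketch of what "reward only at the last token, no discounting" implies: the return-to-go seen from every token position equals the final reward. `returns_to_go` is a hypothetical helper for this illustration and is not used by the notebook.

```python
from typing import List


def returns_to_go(token_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards at each token position."""
    returns = []
    running = 0.0
    # Walk backwards so each position accumulates everything that follows it
    for r in reversed(token_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Four response tokens; only the last one carries the episode reward (here 2.0)
sparse_rewards = [0.0, 0.0, 0.0, 2.0]
undiscounted = returns_to_go(sparse_rewards, gamma=1.0)
discounted = returns_to_go(sparse_rewards, gamma=0.5)
print(undiscounted)  # every position sees the full final reward
print(discounted)    # with gamma < 1, earlier tokens would see a smaller return
```

With $\gamma = 1$ every token is credited with the same value, which fits naturally with assigning a single per-response score to all tokens.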

In [9]:
def format_reward_func(completion: str) -> float:
    """
    Format: <think>...</think>\n<answer>...</answer>

    Also checks that the content within <answer>...</answer> conforms to a
    specified pattern (only digits, + - * / ( ) . and whitespace).

    Args:
        completion (str): Generated output

    Returns:
        float: Reward score
    """
    # Define the allowed pattern (only numbers, +, -, *, /, (, ), ., and whitespace)
    allowed_pattern = r"^[\d+\-*/().\s]+$"

    try:
        # Prepend a synthetic <think>: it is already part of the prompt (prefilled
        # for the assistant), so adding it back makes the regex easier to match
        completion = "<think>" + completion

        # Strip EOS token if present
        if completion.endswith(EOS_TOKEN):
            completion = completion[:-len(EOS_TOKEN)]

        # Check if the format is correct
        # Pattern means:
        # 1) <think>...contents not including other <think> tags...</think>
        # 2) \n
        # 3) <answer>...anything...</answer>
        regex = r"^<think>([^<]*(?:<(?!/?think>)[^<]*)*)<\/think>\n<answer>([\s\S]*?)<\/answer>$"
        match = re.search(regex, completion, re.DOTALL)

        if match is None or len(match.groups()) != 2:
            # Format is incorrect
            return 0.0
        else:
            # Extract the content inside <answer>...</answer>
            answer_content = match.group(2).strip()

            # Check if answer content matches the allowed pattern
            if not re.match(allowed_pattern, answer_content):
                # If it doesn't match, reward is 0.5
                return 0.5
            else:
                # If both format and pattern are correct, reward is 1
                return 1.0
    except Exception:
        # Any error leads to 0 reward
        return 0.0


def equation_reward_func(completion: str, nums: List[int], target: int) -> float:
    """
    Evaluates completion based on mathematical correctness of the answer

    Args:
        completion (str): Generated output
        nums (List[int]): Available numbers to use in the equation
        target (int): Expected result of the equation

    Returns:
        float: Reward score
    """
    try:
        # Check if the format is correct
        match = re.search(r"<answer>(.*?)<\/answer>", completion)
        if match is None:
            return 0.0
        # Extract the "answer" part from the completion
        equation = match.group(1).strip()
        # Extract all numbers from the equation
        used_numbers = [int(n) for n in re.findall(r"\d+", equation)]

        # Check if all numbers are used exactly once
        if sorted(used_numbers) != sorted(nums):
            return 0.0
        # Define a regex pattern that only allows numbers, operators, parentheses, and whitespace
        allowed_pattern = r"^[\d+\-*/().\s]+$"
        if not re.match(allowed_pattern, equation):
            return 0.0

        # Evaluate the equation with restricted globals and locals
        result = eval(equation, {"__builtins__": None}, {})
        # Check if the equation is correct and matches the ground truth
        if abs(float(result) - float(target)) < 1e-5:
            return 1.0
        else:
            return 0.0
    except Exception:
        # If evaluation fails, reward is 0
        return 0.0


def compute_reward(completion: str, sample: Dict[str, Any]) -> Tuple[float, Dict[str, float]]:
    nums = sample["nums"]
    target = sample["target"]

    format_reward = format_reward_func(completion)
    equation_reward = equation_reward_func(completion=completion, nums=nums, target=target)

    reward = format_reward + equation_reward

    metrics = {
        "format_reward": format_reward,
        "equation_reward": equation_reward,
    }

    return reward, metrics

In [10]:
# <think> is prefilled in the prompt, so repeating it in the completion would be incorrect.
format_reward_func("<think>I think the answer is </think>\n<answer>1+2</answer>")

0.0

In [11]:
format_reward_func("I think the answer is </think>\n<answer>1+2</answer>")

1.0

In [12]:
format_reward_func("<think>I think the<think>and even more</think> answer is </think>\n<answer>1+2</answer>")

0.0

In [13]:
equation_reward_func("I think the answer is </think>\n<answer>1+2+2</answer>", [1, 2], 3)

0.0

## Episode Generation

The goal of episode generation is to create a collection of query-response pairs that will be used for policy training. From the reinforcement learning (RL) perspective, the **query** serves as the initial state, and the generated tokens in the **response** represent the actions taken by the policy.

The `create_training_episodes` function takes a list of prompts (initial states) and their corresponding completions which we generate using the model.  In GRPO, we always generate multiple responses per prompt—specifically, `GENERATIONS_PER_SAMPLE` > 1. This means that, after episode generation, we end up with `batch_size × GENERATIONS_PER_SAMPLE` episodes in every RL iteration.

### Advantage Computation

In addition to generating episodes, `create_training_episodes` is also responsible for computing the **advantage** for every response token. 

In RL terms, the advantage of a token represents how much better or worse that token's action is compared to the average generated token at that specific state (prompt + prefix). Ideally, we would compute an advantage for every token individually to capture how each step contributes to the overall reward.

However, in GRPO, there's no per-token advantage computation. Instead, we compute a single advantage value per response. This value reflects how good the entire response is relative to other responses generated for the same prompt. We then assign this single advantage value uniformly to all tokens within that response.

GRPO uses a simple formula for this:

1. For each prompt $x$ with a group of generated responses $y_1, y_2, \ldots, y_G \sim \pi(\cdot|x)$, compute their rewards $R_1, R_2, \ldots, R_G$.
2. Compute the group's mean and standard deviation:  
   $ \mu = \text{mean}(R_1, R_2, \ldots, R_G) $  
   $ \sigma = \text{std}(R_1, R_2, \ldots, R_G) $
3. Compute a **relative score** for each response:  
   $ R^*_i = \frac{R_i - \mu}{\sigma} $
4. Assign this relative score $R^*_i$ as the advantage to all tokens of the $i$-th response:  
   $ A_t^{(i)} = R^*_i $

This **per-group normalization** encourages responses that are better than average and penalizes those that are worse.
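
The four steps above can be sketched in a few lines of standard-library Python. This is an illustration, not the notebook's implementation: `group_advantages` and `broadcast_to_tokens` are hypothetical helper names, and the small `eps` in the denominator is an assumption added here to avoid dividing by zero when all rewards in a group are equal.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], group_size: int, eps: float = 1e-4) -> List[float]:
    """Normalize rewards within consecutive groups of `group_size` responses."""
    assert len(rewards) % group_size == 0, "rewards must split evenly into groups"
    advantages: List[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu = mean(group)
        sigma = pstdev(group)  # population std, the same convention as np.std's default
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


def broadcast_to_tokens(advantages: List[float], response_token_ids: List[List[int]]) -> List[List[float]]:
    """Assign each response's single advantage to every one of its tokens."""
    return [[adv] * len(ids) for adv, ids in zip(advantages, response_token_ids)]


# Two prompts with four responses each: one mixed group, one group with identical rewards
adv = group_advantages([1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0], group_size=4)
per_token = broadcast_to_tokens(adv[:2], [[4, 5, 6], [7, 8]])
print(adv)
print(per_token)
```

Note that normalization never mixes rewards across prompts: a response is only compared with the other responses sampled for the same prompt.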

### Example: Advantage in Action

Consider a binary reward scenario where each response is either correct (1) or incorrect (0):

```python
>>> rewards = np.array([1, 1, 0, 0, 0])
>>> (rewards - rewards.mean()) / (rewards.std())
array([ 1.22474487,  1.22474487, -0.81649658, -0.81649658, -0.81649658])
```

Here, the correct responses receive higher advantage scores, promoting them in future updates.


If only one response is correct:

```python
>>> rewards = np.array([1, 0, 0, 0, 0])
>>> (rewards - rewards.mean()) / (rewards.std())
array([ 2. , -0.5, -0.5, -0.5, -0.5])
```

This resembles the case where the question in the prompt is too hard and the model is not able to generate a correct response on average.
However, if one of the responses is correct, it will be assigned a higher advantage score, and all incorrect responses will be assigned a negative relative score.

If all responses are incorrect:

```python
>>> rewards = np.array([0, 0, 0, 0, 0])
>>> (rewards - rewards.mean()) / (rewards.std() + 1e-6)
array([0., 0., 0., 0., 0.])
```

Since no response is better than the average, the model receives no learning signal.

If all responses are correct:

```python
>>> rewards = np.array([1, 1, 1, 1, 1])
>>> (rewards - rewards.mean()) / (rewards.std() + 1e-6)
array([0., 0., 0., 0., 0.])
```

Again, no learning signal is provided because there is nothing to improve upon.

In a more mixed case:

```python
>>> rewards = np.array([1, 1, 1, 1, 0])
>>> (rewards - rewards.mean()) / (rewards.std() + 1e-6)
array([0.5, 0.5, 0.5, 0.5, -2.])
```

This represents an easier question for the model. Most responses are correct, but occasional incorrect ones are heavily penalized.

In [14]:
def create_training_episodes(
    samples: List[Dict[str, Any]],
    all_generations: List[List[int]],
    all_finish_reasons: List[str],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """
    Process model generations and calculate rewards for training episodes.

    This function processes generated responses and calculates rewards for training episodes by:
    1. Grouping generations by sample (GENERATIONS_PER_SAMPLE responses per input)
    2. Computing rewards and advantages for each response
    3. Processing response tokens

    Args:
        samples: List of input samples, each containing:
            - input_ids: List[int], tokenized input prompt
            - nums: List[int], numbers to use in equation
            - target: int, target value for equation
        all_generations: List of token ID sequences for each generated response
        all_finish_reasons: List of finish reasons for each generation ("stop" or other)

    Returns:
        Tuple containing:
        1. Dictionary with processed data for training:
            - all_query_token_ids: List[List[int]], input token IDs repeated for each generation
            - all_response_token_ids: List[List[int]], response token IDs with EOS tokens added
            - all_advantages: List[List[float]], advantage values repeated for each token
        2. Dictionary with generation statistics:
            - response_lengths: List[int], lengths of generated responses
            - rewards: List[float], raw reward values
            - non_stop_rate: List[bool], True if a generation did not end naturally (finish reason other than "stop")
            - reward_metrics/*: Various reward component metrics

    Example:
        >>> samples = [{"input_ids": [1,2,3], "nums": [1,2,3], "target": 6}]
        >>> generations = [[4,5, EOS_TOKEN_ID], [6,7], [8,9, EOS_TOKEN_ID]]  # 3 generations per sample
        >>> finish_reasons = ["stop", "length", "stop"]
        >>> episodes, stats = create_training_episodes(samples, generations, finish_reasons)
        >>> episodes
        {
            'all_query_token_ids': [[1,2,3], [1,2,3], [1,2,3]],
            'all_response_token_ids': [[4,5,EOS_TOKEN_ID], [6,7], [8,9,EOS_TOKEN_ID]],
            'all_advantages': [[0.5,0.5,0.5], [-1.0,-1.0], [0.5,0.5,0.5]]
        }
    """
    assert len(all_generations) == len(all_finish_reasons)
    assert len(all_generations) == len(samples) * GENERATIONS_PER_SAMPLE

    # Process responses and calculate rewards
    groups = [list(range(i, i + GENERATIONS_PER_SAMPLE)) for i in range(0, len(all_generations), GENERATIONS_PER_SAMPLE)
             ]  # example: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

    all_query_token_ids, all_responses_token_ids, all_advantages = [], [], []

    stats = {
        "response_lengths": [],
        "rewards": [],
        "non_stop_rate": [],
    }

    for sample, group_indices in zip(samples, groups):
        finish_reasons = [all_finish_reasons[i] for i in group_indices]
        response_token_ids = [all_generations[i] for i in group_indices]
        responses = tokenizer.batch_decode(response_token_ids, skip_special_tokens=False)

        rewards_and_metrics = [compute_reward(resp, sample) for resp in responses]
        rewards, reward_metrics = zip(*rewards_and_metrics)

        rewards = np.array(rewards)  # [group_size]
        response_advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

        advantages = [[resp_adv] * len(resp) for resp_adv, resp in zip(response_advantages, response_token_ids)]

        all_query_token_ids.extend([sample["input_ids"]] * GENERATIONS_PER_SAMPLE)
        all_responses_token_ids.extend(response_token_ids)
        all_advantages.extend(advantages)

        stats["rewards"].extend(rewards)
        stats["non_stop_rate"].extend([fr != "stop" for fr in finish_reasons])
        stats["response_lengths"].extend([len(ids) for ids in response_token_ids])
        for rm in reward_metrics:
            for k, v in rm.items():
                stats.setdefault(f"reward_metrics/{k}", []).append(v)

    episodes = {
        "all_query_token_ids": all_query_token_ids,
        "all_response_token_ids": all_responses_token_ids,
        "all_advantages": all_advantages,
    }

    return episodes, stats

In [15]:
case_0 = {
    "sample": {
        "input_ids": [1, 2, 3],
        "nums": [1, 2, 3],
        "target": 6
    },
    "generations": [[4, 5, 22, 33], [6, 7], [8, 9, 11], [10, 11]],
    "finish_reasons": ["stop", "length", "stop", "stop"]
}

case = case_0
episodes, stats = create_training_episodes([case["sample"]], case["generations"], case["finish_reasons"])
episodes

{'all_query_token_ids': [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]],
 'all_response_token_ids': [[4, 5, 22, 33], [6, 7], [8, 9, 11], [10, 11]],
 'all_advantages': [[0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0],
  [0.0, 0.0, 0.0],
  [0.0, 0.0]]}

In [16]:
case_1 = {
    "sample": {
        "input_ids": [33, 44],
        "nums": [11, 7, 8],
        "target": 26
    },
    "generations": [[1, 2], [3, 4], [5, 6], [7, 8]],
    "finish_reasons": ["stop", "stop", "length", "stop"]
}
case = case_1
episodes, stats = create_training_episodes([case["sample"]], case["generations"], case["finish_reasons"])
episodes

{'all_query_token_ids': [[33, 44], [33, 44], [33, 44], [33, 44]],
 'all_response_token_ids': [[1, 2], [3, 4], [5, 6], [7, 8]],
 'all_advantages': [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]}

In [17]:
case_2 = {
    "sample": {
        "input_ids": [9, 8, 7, 6, 5, 4],
        "nums": [1, 2, 3, 4],
        "target": 10
    },
    "generations": [[9, 10], [11, 12], [13, 14], [15, 16]],
    "finish_reasons": ["length", "length", "stop", "stop"]
}
case = case_2
episodes, stats = create_training_episodes([case["sample"]], case["generations"], case["finish_reasons"])
episodes

{'all_query_token_ids': [[9, 8, 7, 6, 5, 4],
  [9, 8, 7, 6, 5, 4],
  [9, 8, 7, 6, 5, 4],
  [9, 8, 7, 6, 5, 4]],
 'all_response_token_ids': [[9, 10], [11, 12], [13, 14], [15, 16]],
 'all_advantages': [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]}

As you can see, the `input_ids` of this single example are repeated in all of the generated episodes.

## Policy Gradient


Now that we have a batch of episodes with corresponding advantages, we can compute the **policy gradient loss** to update the model.

GRPO uses the same loss formulation as PPO, but the key difference lies in how advantages are computed. To understand the implementation in `compute_pg_loss`, let’s first recall the original PPO objective:

$$
\mathcal{L}_{\text{PPO}} = \mathbb{E}\left[\min\left( 
\frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} A_t, \;
\text{clip}\left(
\frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)}, \;
1 - \epsilon, \; 1 + \epsilon
\right) A_t \right)\right]
$$

where:
- $ \pi_{\theta} $ is the current policy,
- $ \pi_{\theta_{\text{old}}} $ is the policy from the previous iteration (the policy we sampled episodes from),
- $ A_t $ is the advantage.

This objective tries to increase or decrease the probability of tokens based on the advantage $A_t$ only when the ratio between the new and old policy probabilities stays within a small range, controlled by the clipping threshold $\epsilon$. This clipping mechanism prevents large, destabilizing updates during training.
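
To make the clipping behavior concrete, here is a per-token sketch of the term inside the expectation, written in plain Python. The function name `clipped_surrogate` and the default $\epsilon = 0.2$ are assumptions for this illustration; the notebook's own loss is implemented in `compute_pg_loss`.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for a single token (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta / pi_theta_old
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


logp_old = math.log(0.25)

# Ratio of about 2 with a positive advantage: the gain is capped at (1 + eps) * A
capped_gain = clipped_surrogate(math.log(0.5), logp_old, advantage=1.0)

# Ratio of about 2 with a negative advantage: min() keeps the unclipped, more pessimistic term
uncapped_penalty = clipped_surrogate(math.log(0.5), logp_old, advantage=-1.0)

# Ratio of exactly 1 (new policy equals old policy): clipping is inactive
on_policy = clipped_surrogate(logp_old, logp_old, advantage=1.5)

print(capped_gain, uncapped_penalty, on_policy)
```

The asymmetry is deliberate: the objective stops rewarding further movement in the direction the advantage favors once the ratio leaves the clipping range, but it never hides a penalty.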

### Fully Online Setting: Simplifying the Objective

In general PPO, multiple gradient steps might be taken using the same batch of episodes. However, in our case, we apply only **one gradient step per iteration** using freshly sampled episodes. That means:

- $ \pi_{\theta} = \pi_{\theta_{\text{old}}} $
- Consequently,  
  $$
  \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} = 1
  $$
  
Since the ratio is exactly 1:
- The clipping function becomes inactive.
- The $\min(\cdot,\cdot)$ operator simply returns the unclipped term.

So, the objective simplifies **to**:

$$
\mathcal{L}_{\text{PPO}} = \mathbb{E}\left[ \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{\theta_{\text{old}}}(y_t \mid y_{<t}, x)} A_t \right]
$$


Taking the gradient of this loss with respect to $\theta$ and evaluating it at $\theta = \theta_{\text{old}}$ (where the ratio equals 1), we get:

$$
\vec{g}_{\text{PPO}} = \nabla_\theta \mathcal{L}_{\text{PPO}} = \underbrace{\mathbb{E}\left[ \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x) \cdot A_t \right]}_{\text{vanilla policy gradient with advantage}}
$$

This is the **standard policy gradient** formula, where the gradients of the log-probabilities are weighted by the advantage. In effect, we recover vanilla REINFORCE-style learning.

> Note: A positive constant multiplier on the loss does not affect the direction of the gradient and can be safely ignored.

In fact, this behavior is not unique to GRPO. In all methods such as PPO and TRPO, the very first gradient step after collecting new data always reduces to this same form. Only after that first optimization step do the clipping or trust-region constraints start to take effect.
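The reduction to the vanilla policy gradient can be checked numerically. The sketch below uses a toy softmax policy over three tokens (the logits, the sampled token, and the advantage are all made-up values) and compares finite-difference gradients of the ratio-based surrogate and of the log-probability objective at $\theta = \theta_{\text{old}}$. The two gradients point in the same direction; in this toy setup they coincide numerically.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def finite_diff_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


theta_old = [0.5, -1.0, 2.0]  # hypothetical logits of a 3-token vocabulary
token = 1                     # index of the sampled token y_t
advantage = 0.7               # its advantage A_t
p_old = softmax(theta_old)[token]


def surrogate(theta):
    """Ratio-based objective: pi_theta(y_t) / pi_old(y_t) * A_t, old policy held fixed."""
    return softmax(theta)[token] / p_old * advantage


def log_prob_objective(theta):
    """Vanilla policy gradient objective: log pi_theta(y_t) * A_t."""
    return math.log(softmax(theta)[token]) * advantage


grad_surrogate = finite_diff_grad(surrogate, theta_old)
grad_log_prob = finite_diff_grad(log_prob_objective, theta_old)
print(grad_surrogate)
print(grad_log_prob)
```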

### KL Penalty

The final loss also has a **KL penalty** term to ensure the new policy doesn't drift too far from a reference policy:

$$
\mathcal{L} = \mathcal{L}_{\text{PPO}} - \beta \cdot \text{KL}(\pi_\theta \parallel \pi_{\theta_{\text{ref}}})
$$

We estimate the KL divergence using the **k3 estimator** from [this blog post by Schulman](http://joschu.net/blog/kl-approx.html):

$$
\text{KL}(\pi_\theta \parallel \pi_{\theta_{\text{ref}}}) = \mathbb{E}\left[\frac{\pi_{\theta_{\text{ref}}}(y_t \mid y_{<t}, x)}{\pi_\theta(y_t \mid y_{<t}, x)} - \log\left(\frac{\pi_{\theta_{\text{ref}}}(y_t \mid y_{<t}, x)}{\pi_\theta(y_t \mid y_{<t}, x)}\right) - 1\right]
$$

This regularization term softly constrains the updated model to remain close to the reference.
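As a small sanity check of the k3 estimator, the sketch below compares its exact expectation under the policy with the true KL divergence for two toy categorical distributions. The distributions `policy` and `reference` are made-up values for illustration. Each per-sample estimate is non-negative, and the expectation matches the true KL.

```python
import math


def exact_kl(p, q):
    """Exact KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def k3(logp, logq):
    """k3 estimate for one sample drawn from p: r - log(r) - 1 with r = q / p."""
    log_ratio = logq - logp
    return math.exp(log_ratio) - log_ratio - 1.0


policy = [0.7, 0.2, 0.1]     # hypothetical pi_theta over 3 tokens
reference = [0.5, 0.3, 0.2]  # hypothetical pi_ref over the same tokens

# Per-token k3 values, then their expectation under the policy (exact enumeration).
per_token_k3 = [k3(math.log(p), math.log(q)) for p, q in zip(policy, reference)]
expected_k3 = sum(p * value for p, value in zip(policy, per_token_k3))

print(exact_kl(policy, reference), expected_k3)
```

The `k3` helper mirrors the expression `torch.exp(ref_logps - logps) - (ref_logps - logps) - 1` used in `compute_pg_loss`, applied to a single token.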


### GRPO vs PPO/VinePPO: Key Difference

The main difference between **GRPO** and methods like **PPO/VinePPO** lies in **how the advantage is computed and applied**:

- In **PPO/VinePPO**, each token/step's advantage is computed individually. This allows for fine-grained credit assignment across the sequence.
- In **GRPO**, a **single scalar advantage** is computed for the entire response and is applied **uniformly to all tokens** in that response.

This distinction is illustrated below:

#### A successful response in GRPO:
<img src="https://github.com/McGill-NLP/nano-aha-moment/blob/main/assets/grpo_successful.png?raw=true" alt="GRPO vs PPO/VinePPO: successful response" width="500">

#### A failed response in GRPO:
<img src="https://github.com/McGill-NLP/nano-aha-moment/blob/main/assets/grpo_unsuccessful.png?raw=true" alt="GRPO vs PPO/VinePPO: failed response" width="500">

In GRPO, all tokens in a response are updated with the same magnitude. In contrast, PPO/VinePPO updates each token/step with a different advantage value:

<img src="https://github.com/McGill-NLP/nano-aha-moment/blob/main/assets/ppo_and_vineppo.png?raw=true" alt="GRPO vs PPO/VinePPO: PPO and VinePPO" width="500">
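The uniform, response-level advantage can be sketched in a few lines. This is a simplified illustration, assuming the group-normalized advantage is `(reward - mean) / (std + eps)` with a population standard deviation; the helper name `grpo_token_advantages` and the `eps` value are assumptions, not this notebook's actual implementation.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, response_lengths, eps=1e-4):
    """Broadcast one group-normalized scalar advantage to every token of each response."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_response = [(r - mu) / (sigma + eps) for r in rewards]
    return [[adv] * length for adv, length in zip(per_response, response_lengths)]


# One prompt, four sampled responses: only the first one is rewarded.
advantages = grpo_token_advantages(rewards=[1.0, 0.0, 0.0, 0.0], response_lengths=[3, 2, 4, 2])
for token_advantages in advantages:
    print(token_advantages)

# If every response in the group gets the same reward, all advantages are zero.
print(grpo_token_advantages(rewards=[0.0, 0.0, 0.0, 0.0], response_lengths=[2, 2, 2, 2]))
```

Every token of the rewarded response receives the same positive value and every token of the other responses receives the same negative value, regardless of which tokens actually mattered for the outcome.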


In [None]:
def compute_pg_loss(
    policy_model: Union[DeepSpeedEngine, PreTrainedModel],
    reference_model: Union[DeepSpeedEngine, PreTrainedModel],
    batch: Dict[str, torch.Tensor],
    total_response_len: int,
) -> Tuple[torch.Tensor, Dict[str, float]]:
    """
    Compute the policy gradient loss with KL penalty between policy and reference models.

    This function:
    1. Computes log probabilities for both policy and reference models
    2. Calculates KL divergence penalty between the models
    3. Computes policy gradient loss using advantages
    4. Combines the losses with KL coefficient

    Args:
        policy_model: The model being trained
        reference_model: The reference model for KL penalty calculation
        batch: Dictionary containing:
            - input_ids: Tensor of shape [batch_size, seq_len]
            - attention_mask: Tensor of shape [batch_size, seq_len]
            - labels: Tensor of shape [batch_size, seq_len] with -100 for ignored positions
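            - advantages: Tensor of shape [batch_size, seq_len]
        total_response_len: Total number of response tokens across the whole
            iteration batch (all micro-batches), used to normalize the loss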

    Returns:
        Tuple containing:
            - loss: Combined policy gradient and KL penalty loss (scalar tensor)
            - metrics: Dictionary with detailed loss components:
                - policy_loss: Pure policy gradient loss
                - kl_penalty: KL divergence penalty
                - entropy: Policy entropy
    """
    input_ids = batch["input_ids"]  # [batch_size, seq_len]
    attention_mask = batch["attention_mask"]  # [batch_size, seq_len]
    labels = batch["labels"]  # [batch_size, seq_len]
    advantages = batch["advantages"]  # [batch_size, seq_len]

    labels_mask = (labels[..., :] != -100).float()  # [batch_size, seq_len]

    model_inputs = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "labels_mask": labels_mask,
    }

    with torch.no_grad():
        ref_logps = compute_token_log_probs(reference_model, model_inputs, TEMPERATURE)  # [batch_size, seq_len-1]

    logps = compute_token_log_probs(policy_model, model_inputs, TEMPERATURE)  # [batch_size, seq_len-1]

    shifted_labels_mask = labels_mask[..., 1:]  # [batch_size, seq_len-1]

    kl_penalty = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1  # [batch_size, seq_len-1]
    kl_penalty = kl_penalty * shifted_labels_mask  # [batch_size, seq_len-1]

    entropy = -logps.sum() / shifted_labels_mask.sum()  # scalar

    policy_loss = -logps * advantages[..., 1:]  # [batch_size, seq_len-1]
    policy_loss = policy_loss * shifted_labels_mask  # [batch_size, seq_len-1]

    loss = (policy_loss + KL_COEFFICIENT * kl_penalty).sum() / total_response_len  # scalar

    metrics = {
        "policy_loss": policy_loss.sum().item() / total_response_len,
        "kl_penalty": kl_penalty.sum().item() / total_response_len,
        "entropy": entropy.item() / total_response_len,
    }

    return loss, metrics

## Training

Before starting the RL loop, we need to set up all necessary components:

- **Policy Model**: The main model that will be trained using policy gradients.
- **Reference Model**: A frozen copy of the base model used for KL regularization.
- **DeepSpeed**: Both models are initialized with DeepSpeed.
- **vLLM Inference Engine**: Used for fast, batched inference during episode generation.
- **WandB Logging**: We initialize WandB to track training metrics, hyperparameters, and checkpoints.

Finally, if an existing checkpoint is detected, we automatically resume training from where it left off. 

A couple of remarks:
- We move the reference model to the CPU and only bring it back to the GPU during policy gradient computation. Because the model is relatively small, moving it back and forth between GPU and CPU is very fast.
- Although the entire training runs on a single GPU, we still use DeepSpeed ZeRO stage 2. Stage 2 comes with optimizations that avoid memory fragmentation, allowing us to fully utilize GPU memory.
- Flash Attention is required in our setup, as it reduces the attention memory requirement of transformers from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$, where $n$ is the sequence length.

In [19]:
# Initialize main and reference models
policy_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map=0,
)
reference_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map=0,
)
policy_model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

# Initialize DeepSpeed engines
policy_model, *_ = deepspeed.initialize(
    model=policy_model,
    config=deepspeed_config,
    model_parameters=policy_model.parameters(),
)
reference_model, *_ = deepspeed.initialize(
    model=reference_model,
    config=ref_deepspeed_config,
)

reference_model.module.cpu()

############################################
# Initialize vLLM (Inference) engine
############################################

inference_engine = LLM(
    model=MODEL_NAME,
    skip_tokenizer_init=False,
    # gpu_memory_utilization=0.2, 0.2 of 80GB = 16GB
    gpu_memory_utilization=0.4,  # 0.4 of 25GB = 10GB
    enable_prefix_caching=True,
    swap_space=2,  # 2GB
    scheduling_policy="fcfs",
    dtype=torch.bfloat16,
    max_model_len=2048,
    enable_sleep_mode=True,
)

# Wandb for logging
wandb.init(
    project="r1-aha-moment",
    name=RUN_NAME,
    config={
        "model_name": MODEL_NAME,
        "learning_rate": LEARNING_RATE,
        "num_iterations": NUM_ITERATIONS,
        "episodes_per_iteration": EPISODES_PER_ITERATION,
        "rollouts_per_episode": GENERATIONS_PER_SAMPLE,
        "kl_coefficient": KL_COEFFICIENT,
        "temperature": TEMPERATURE,
    },
)

# Load checkpoint if it exists
begin_iter = 0
ckpt_path, ckpt_iter = find_last_checkpoint(EXP_DIR)
if ckpt_path is not None:
    print(f"Resuming from checkpoint {ckpt_path} at iteration {ckpt_iter}")
    out = policy_model.load_checkpoint(ckpt_path / "deepspeed")
    if out is None:
        raise RuntimeError(f"Failed to load checkpoint {ckpt_path}")
    begin_iter = ckpt_iter + 1
    load_model_into_vllm(policy_model, inference_engine)

[2025-08-05 10:40:19,379] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.4, git-hash=unknown, git-branch=unknown
[2025-08-05 10:40:19,380] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-08-05 10:40:19,380] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-08-05 10:40:19,382] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 1
[2025-08-05 10:40:19,503] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-08-05 10:40:19,504] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-08-05 10:40:19,505] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-08-05 10:40:19,511] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2025-08-05 10:40:19,511] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support f

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 08-05 10:40:29 model_runner.py:1115] Loading model weights took 0.9277 GB
INFO 08-05 10:40:29 worker.py:267] Memory profiling takes 0.39 seconds
INFO 08-05 10:40:29 worker.py:267] the current vLLM instance can use total_gpu_memory (23.66GiB) x gpu_memory_utilization (0.40) = 9.47GiB
INFO 08-05 10:40:29 worker.py:267] model weights take 0.93GiB; non_torch_memory takes 0.01GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 7.14GiB.
INFO 08-05 10:40:30 executor_base.py:111] # cuda blocks: 38993, # CPU blocks: 10922
INFO 08-05 10:40:30 executor_base.py:116] Maximum concurrency for 2048 tokens per request: 304.63x
INFO 08-05 10:40:30 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_ut

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:13<00:00,  2.65it/s]

INFO 08-05 10:40:44 model_runner.py:1562] Graph capturing finished in 13 secs, took 0.01 GiB
INFO 08-05 10:40:44 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 14.76 seconds



wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: quangnv to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


### Training loop

With everything set up, we are ready to start the main training loop. Each iteration of the loop performs the following steps:

1. **Evaluation** (optional): Every 25 iterations, the model is evaluated on a test set to monitor progress.
2. **Episode Generation**: A batch of prompts is sampled, and multiple responses are generated for each prompt using the inference engine. Then we put the inference engine to sleep.
3. **Reward Computation**: Rewards and advantages for each generated episode are computed.
4. **Policy Gradient Training**: Using the computed advantages, we calculate the policy gradient loss and update the model parameters. Training uses gradient accumulation to handle large batches. Note that we apply a single gradient update per iteration.
5. **Inference Engine Update**: The inference engine is woken up and updated with the latest model weights.
6. **Logging**: Training and evaluation metrics are logged using WandB.
7. **Checkpointing**: Every 50 iterations, the model and optimizer states are saved.

This loop continues until the specified number of iterations is completed.
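One detail of the gradient accumulation step is worth spelling out: each micro-batch loss is divided by the total number of response tokens in the whole iteration, not by the token count of the micro-batch. The toy sketch below (made-up per-token losses and masks, plain Python instead of tensors) shows that the micro-batch losses then add up to exactly the full-batch token-mean loss, so no extra rescaling across accumulation steps is needed. This is presumably why the loop calls `backward` with `scale_wrt_gas=False`.

```python
# Hypothetical per-token losses for 4 episodes; masked positions are ignored.
token_losses = [
    [0.5, 1.0, 0.0],
    [2.0, 0.0, 0.0],
    [1.5, 0.5, 1.0],
    [3.0, 1.0, 0.0],
]
response_mask = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]


def masked_sum(losses, mask):
    """Sum of per-token losses over response tokens only."""
    return sum(
        loss * keep
        for loss_row, mask_row in zip(losses, mask)
        for loss, keep in zip(loss_row, mask_row)
    )


total_response_len = sum(sum(row) for row in response_mask)

# Full-batch loss: mean over every response token in the iteration.
full_batch_loss = masked_sum(token_losses, response_mask) / total_response_len

# Accumulated loss: every micro-batch is divided by the *global* token count,
# so the micro-batch losses (and hence their gradients) simply add up.
micro_batch_size = 2
accumulated_loss = 0.0
for start in range(0, len(token_losses), micro_batch_size):
    end = start + micro_batch_size
    micro_sum = masked_sum(token_losses[start:end], response_mask[start:end])
    accumulated_loss += micro_sum / total_response_len

print(full_batch_loss, accumulated_loss)
```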

**Sleeping of vLLM**
Before each training step, we put vLLM into sleep mode to free up its KV cache and model weights, ensuring enough GPU memory is available for policy training. After the training step is complete, vLLM is woken up, reinitializing its KV cache and preparing for the next round of sampling using the updated model parameters.

In [None]:
# Start from begin_iter so that a resumed run continues after the loaded checkpoint
for iteration in trange(begin_iter, NUM_ITERATIONS):
    print(f"Iteration {iteration}/{NUM_ITERATIONS}")

    metrics = {}

    #########################################################
    # Evaluation
    #########################################################

    eval_stats = None
    if iteration % 25 == 0:
        print("Evaluating on eval set...")
        eval_episodes, eval_stats = evaluate_on_test_set(
            inference_engine=inference_engine,
            test_dataset=test_dataset,
            tokenizer=tokenizer,
            eos_token=EOS_TOKEN,
            eval_sampling_params=SamplingParams(
                temperature=0.3,
                max_tokens=1024,
                n=1,
                detokenize=False,
                stop_token_ids=[EOS_TOKEN_ID],
            ),
            reward_func=lambda completion, sample: compute_reward(completion, sample),
        )
        eval_episode_table = dump_episodes(
            episodes=eval_episodes,
            episodes_stats=eval_stats,
            exp_dir=EXP_DIR,
            tokenizer=tokenizer,
            iteration=iteration,
            is_eval=True,
        )
        wandb.log({"eval/episodes": eval_episode_table, "iteration": iteration})

    #########################################################
    # Generate Episodes
    #########################################################

    # Sample training batch
    num_samples = EPISODES_PER_ITERATION // GENERATIONS_PER_SAMPLE
    indices = np.random.choice(len(train_dataset), size=num_samples, replace=False)
    samples = train_dataset.select(indices)

    # Sample responses
    outputs = inference_engine.generate(prompt_token_ids=samples["input_ids"],
                                        sampling_params=SamplingParams(
                                            n=GENERATIONS_PER_SAMPLE,
                                            temperature=TEMPERATURE,
                                            top_p=TOP_P,
                                            top_k=TOP_K,
                                            max_tokens=MAX_RESPONSE_TOKENS,
                                            detokenize=False,
                                            stop_token_ids=[EOS_TOKEN_ID],
                                        ))
    all_generations = [list(g.token_ids) for out in outputs for g in out.outputs]
    all_finish_reasons = [g.finish_reason for out in outputs for g in out.outputs]
    inference_engine.sleep(1)

    print(f"Generated {len(all_generations)} responses")
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(1)

    # Process responses and calculate rewards
    episodes, episodes_stats = create_training_episodes(
        samples,
        all_generations,
        all_finish_reasons,
    )
    for k, v in episodes_stats.items():
        metrics.setdefault(k, []).extend(v)

    episode_table = dump_episodes(
        episodes=episodes,
        episodes_stats=episodes_stats,
        exp_dir=EXP_DIR,
        tokenizer=tokenizer,
        iteration=iteration,
    )

    #########################################################
    # Training
    #########################################################

    # Prepare training batch
    model_inputs = prepare_model_inputs(query_token_ids=episodes["all_query_token_ids"],
                                        response_token_ids=episodes["all_response_token_ids"],
                                        advantages=episodes["all_advantages"],
                                        device="cuda")

    # Calculate losses and update model
    policy_model.train()
    reference_model.module.cuda()
    reference_model.eval()

    total_response_len = (model_inputs["labels"] != -100).sum().item()

    for i in trange(0, EPISODES_PER_ITERATION, PER_DEVICE_BATCH_SIZE, desc="Gradient Accumulation"):
        batch = {k: v[i:i + PER_DEVICE_BATCH_SIZE] for k, v in model_inputs.items()}

        # Compute policy gradient loss
        loss, loss_metrics = compute_pg_loss(
            policy_model=policy_model,
            reference_model=reference_model,
            batch=batch,
            total_response_len=total_response_len,
        )

        # Track metrics
        metrics.setdefault("loss", []).append(loss.item())
        grad_norm = policy_model.get_global_grad_norm()
        if grad_norm is not None:
            grad_norm = grad_norm.item()
        metrics.setdefault("grad_norm", []).append(grad_norm)
        for k, v in loss_metrics.items():
            metrics.setdefault(k, []).append(v.item() if isinstance(v, torch.Tensor) else v)

        # Backpropagation and optimization step
        policy_model.backward(loss, scale_wrt_gas=False)

        # Free memory
        del loss, loss_metrics
        if policy_model.is_gradient_accumulation_boundary():
            reference_model.module.cpu()

        policy_model.step()

    #########################################################
    # Update inference engine weights
    #########################################################

    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(1)

    inference_engine.wake_up()
    load_model_into_vllm(policy_model, inference_engine)

    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(1)

    #########################################################
    # Log metrics
    #########################################################

    train_metrics = {k: np.mean(v) for k, v in metrics.items() if None not in v}
    train_metrics["learning_rate"] = policy_model.get_lr()[0]
    logs = {
        "iteration": iteration,
        f"episodes/iter_{iteration:06d}": episode_table,
        **{
            f"train/{k}": v for k, v in train_metrics.items()
        },
    }
    if eval_stats is not None:
        eval_metrics = {k: np.mean(v) for k, v in eval_stats.items() if None not in v}
        logs.update({f"eval/{k}": v for k, v in eval_metrics.items()})
    wandb.log(logs)

    selected_keys = [
        "train/kl_penalty",
        "train/rewards",
        "train/reward_metrics/format_reward",
        "train/reward_metrics/equation_reward",
        "eval/rewards",
        "eval/reward_metrics/format_reward",
        "eval/reward_metrics/equation_reward",
    ]
    selected_metrics = {k: logs[k] for k in selected_keys if k in logs}
    print(f"KEY METRICS: {selected_metrics}")

    if iteration % 50 == 0 and iteration != 0:
        policy_model.module.save_pretrained(str(EXP_DIR / "checkpoints" / f"ckpt_{iteration:06d}" / "hf_model"))
        policy_model.save_checkpoint(str(EXP_DIR / "checkpoints" / f"ckpt_{iteration:06d}" / "deepspeed"))



Iteration 0/1000
Evaluating on eval set...


Processed prompts: 100%|██████████| 500/500 [00:28<00:00, 17.79it/s, est. speed input: 2535.40 toks/s, output: 8779.25 toks/s]
Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.08it/s, est. speed input: 439.63 toks/s, output: 4848.41 toks/s]

INFO 08-05 10:41:21 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:41:21 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:41:22 worker.py:133] Sleep mode freed 8.14 GiB memory, 4.93 GiB memory is still in use.
INFO 08-05 10:41:22 executor_base.py:208] It took 0.505415 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 162)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [43, 43, 38], create an equation that equals 39. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` 43 + 43 = 86
</think>
Let's see what happens when we divide 38 by 5.
<think> 38 / 5 =
</think>
We have got our answer now. It's 7.12 or 7.1.


Gradient Accumulation: 100%|██████████| 16/16 [00:13<00:00,  1.21it/s]


INFO 08-05 10:41:38 executor_base.py:219] It took 0.086147 seconds to wake up.


  0%|          | 1/1000 [00:54<15:03:46, 54.28s/it]

KEY METRICS: {'train/kl_penalty': 0.0, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0, 'eval/rewards': 0.02, 'eval/reward_metrics/format_reward': 0.02, 'eval/reward_metrics/equation_reward': 0.0}
Iteration 1/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.28it/s, est. speed input: 465.94 toks/s, output: 4171.75 toks/s]

INFO 08-05 10:41:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:41:45 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:41:45 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.04 GiB memory is still in use.
INFO 08-05 10:41:45 executor_base.py:208] It took 0.123116 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 216)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [88, 48, 23], create an equation that equals 63. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` The first number can be rewritten as 88 = 2*(48) + 0.
         The second number can be rewritten as 48 = 17*(1 + 2).
         The third numb

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]


INFO 08-05 10:41:58 executor_base.py:219] It took 0.087061 seconds to wake up.


  0%|          | 2/1000 [01:13<9:21:44, 33.77s/it] 

KEY METRICS: {'train/kl_penalty': 0.0, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 2/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:17,  2.73it/s, est. speed input: 390.03 toks/s, output: 4971.73 toks/s]

INFO 08-05 10:42:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:42:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:42:05 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.05 GiB memory is still in use.
INFO 08-05 10:42:05 executor_base.py:208] It took 0.123777 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 1024)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [86, 74, 80], create an equation that equals 80. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`1 + 2 = 3</think>
<think>3 / (3 * 5) = 0</think>
Therefore, 80 equals 0.
InnerHTMLAdd->'80'
用户的回答返 回
 DAMAGES ZERO, reflected the newapeake bay deal that was signed on sunday.. Columbia not winning the ses bass creek. With six rivals signed, it could | Featured by PIA
 DAMAGES ZERO, reflected the newapeake bay deal that was 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:42:18 executor_base.py:219] It took 0.094234 seconds to wake up.


  0%|          | 3/1000 [01:34<7:41:31, 27.77s/it]

KEY METRICS: {'train/kl_penalty': 0.0, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 3/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.16it/s, est. speed input: 448.75 toks/s, output: 4360.60 toks/s]

INFO 08-05 10:42:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:42:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:42:25 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:42:25 executor_base.py:208] It took 0.124114 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 39)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [43, 26, 54, 47], create an equation that equals 32. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`  43 ÷ (2 + 6)
 <answer>(1 + 2) ÷ (3 * 5) ず
 Think first, then speak thoroughly.<|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 454)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>


Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:42:38 executor_base.py:219] It took 0.088184 seconds to wake up.


  0%|          | 4/1000 [01:54<6:49:15, 24.65s/it]

KEY METRICS: {'train/kl_penalty': 0.0, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 4/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.10it/s, est. speed input: 442.81 toks/s, output: 4572.20 toks/s]

INFO 08-05 10:42:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:42:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:42:45 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:42:45 executor_base.py:208] It took 0.123734 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 273)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [2, 66, 21], create an equation that equals 90. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` What is the answer?</think>
Upon my research, I found that the answer is 67. 
Here's how this works:
The problem only has two numbers and several operations. 
To find the answer, I must find a way to balance two numbers. 
I will begin by using + and - operations to balance the +2-66 part. 
I can do this by adding 66 and getti

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]


INFO 08-05 10:42:58 executor_base.py:219] It took 0.094747 seconds to wake up.


  0%|          | 5/1000 [02:14<6:20:44, 22.96s/it]

KEY METRICS: {'train/kl_penalty': 0.0, 'train/rewards': 0.03125, 'train/reward_metrics/format_reward': 0.03125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 5/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:16,  2.99it/s, est. speed input: 426.89 toks/s, output: 4731.64 toks/s]

INFO 08-05 10:43:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:43:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:43:05 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:43:05 executor_base.py:208] It took 0.124092 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 46)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [74, 33, 23, 75], create an equation that equals 10. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`10 = 74 + 33 - 23 + 75</think>
回答：10 = 74 + 33 - 23 + 75
/></answer><|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 119)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.65it/s]


INFO 08-05 10:43:18 executor_base.py:219] It took 0.089570 seconds to wake up.


  1%|          | 6/1000 [02:34<6:04:50, 22.02s/it]

KEY METRICS: {'train/kl_penalty': 6.505636844726927e-05, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 6/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.12it/s, est. speed input: 445.87 toks/s, output: 4758.17 toks/s]

INFO 08-05 10:43:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:43:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:43:25 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:43:25 executor_base.py:208] It took 0.124399 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 195)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [50, 48, 51, 22], create an equation that equals 31. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` 31 = 50 + ( 48 - 22 ) </think>
// Get 50 and subtract 22 from 48
think> 50 - ( 22 + 10 ) = 50 - 32 = 18 </think>
think> 50 + 18 = 68 </think>
// Now we have 68 and divide by 51 + 48
think> 68 divided by ( 51 + 48 ) = 68 divided by 109 = 0.627 </think>
Think > The final equation is:
31 = 50 + ( 48 - 22 ) = 50 + 26 = 76 
T

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.64it/s]


INFO 08-05 10:43:38 executor_base.py:219] It took 0.087623 seconds to wake up.


  1%|          | 7/1000 [02:54<5:53:45, 21.37s/it]

KEY METRICS: {'train/kl_penalty': 6.31632954254487e-05, 'train/rewards': 0.0078125, 'train/reward_metrics/format_reward': 0.0078125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 7/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:16,  2.97it/s, est. speed input: 422.72 toks/s, output: 4820.46 toks/s]

INFO 08-05 10:43:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:43:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:43:46 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.01 GiB memory is still in use.
INFO 08-05 10:43:46 executor_base.py:208] It took 0.124359 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 207)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [54, 62, 2, 19], create an equation that equals 29. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`Step 1 : First, I will consider 1st party: 2. I know that result of above equation will be 2.</think> [/think]
<think>Step 2 : Then, I will come back to same equation 19. I know that result of above equation will be 2.</think> [/think]
<think>Step 3 : So, lets come back to eq, and try to use * (multiplication):</think> [/t

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.65it/s]


INFO 08-05 10:43:59 executor_base.py:219] It took 0.089034 seconds to wake up.


  1%|          | 8/1000 [03:14<5:47:31, 21.02s/it]

KEY METRICS: {'train/kl_penalty': 6.330744035502718e-05, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 8/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.28it/s, est. speed input: 468.69 toks/s, output: 4339.83 toks/s]

INFO 08-05 10:44:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:44:05 prefix_caching_block.py:479] Successfully reset prefix cache

INFO 08-05 10:44:05 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:44:05 executor_base.py:208] It took 0.122881 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 468)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [21, 65, 84, 71], create an equation that equals 35. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` First of all, we need to find a suitable operation to combine all the numbers. We have different numbers ranging from 21 to 71, so we can

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.64it/s]


INFO 08-05 10:44:18 executor_base.py:219] It took 0.088038 seconds to wake up.


  1%|          | 9/1000 [03:34<5:41:00, 20.65s/it]

KEY METRICS: {'train/kl_penalty': 6.182079703691405e-05, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 9/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.18it/s, est. speed input: 452.12 toks/s, output: 4278.03 toks/s]

INFO 08-05 10:44:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:44:25 prefix_caching_block.py:479] Successfully reset prefix cache

INFO 08-05 10:44:25 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:44:25 executor_base.py:208] It took 0.124125 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 66)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [92, 6, 54], create an equation that equals 44. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` WAMake the equation calculation: </think>
= 92 + 6         - (54 - 6 = 48)
=<Answer>     or </Answer> = 92 + 6 = 108<Answer><Answer>
lya, the f

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]


INFO 08-05 10:44:38 executor_base.py:219] It took 0.088911 seconds to wake up.


  1%|          | 10/1000 [03:54<5:36:20, 20.38s/it]

KEY METRICS: {'train/kl_penalty': 6.626318170817743e-05, 'train/rewards': 0.0078125, 'train/reward_metrics/format_reward': 0.0078125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 10/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.34it/s, est. speed input: 477.43 toks/s, output: 4216.36 toks/s]

INFO 08-05 10:44:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:44:45 prefix_caching_block.py:479] Successfully reset prefix cache

INFO 08-05 10:44:45 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.03 GiB memory is still in use.
INFO 08-05 10:44:45 executor_base.py:208] It took 0.129810 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 879)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [18, 59, 15, 56], create an equation that equals 69. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` We need to find a combination that equals to 69.</think>
<!-- think-->Could you please give me the numbers [18, 59, 15, 56]? <think> <ans

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]


INFO 08-05 10:44:58 executor_base.py:219] It took 0.085688 seconds to wake up.


  1%|          | 11/1000 [04:13<5:32:01, 20.14s/it]

KEY METRICS: {'train/kl_penalty': 6.519375542838929e-05, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 11/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.23it/s, est. speed input: 461.27 toks/s, output: 4324.97 toks/s]

INFO 08-05 10:45:04 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:45:04 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:45:05 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.03 GiB memory is still in use.
INFO 08-05 10:45:05 executor_base.py:208] It took 0.124151 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 220)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [35, 18, 80], create an equation that equals 97. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`4 * 5</think>
<think>   / 2</think>
Here's what happens next
<think (try to do it first)</think>
The answer is 10
@testable

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Quiz Answer</title>
</head>
<body>
  <form action="">
 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]


INFO 08-05 10:45:18 executor_base.py:219] It took 0.087263 seconds to wake up.


  1%|          | 12/1000 [04:33<5:29:26, 20.01s/it]

KEY METRICS: {'train/kl_penalty': 6.565287981419367e-05, 'train/rewards': 0.015625, 'train/reward_metrics/format_reward': 0.015625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 12/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:16,  2.93it/s, est. speed input: 418.21 toks/s, output: 4919.65 toks/s]

INFO 08-05 10:45:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:45:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:45:25 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:45:25 executor_base.py:208] It took 0.124237 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 263)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [73, 46, 13, 86], create an equation that equals 72. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`Thank you</think> ->
I read the problem and decided to follow these steps:
1. Use the numbers given to us [73a, 46b, 13c, 86d]. This will give us a total of 72.
2. We can think of this as using the numbers in multi step equations.
3. Let's start with 46 and add 13 to get 59.
4. Now subtract the new value from 72, so we ge

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]


INFO 08-05 10:45:38 executor_base.py:219] It took 0.087456 seconds to wake up.


  1%|▏         | 13/1000 [04:53<5:30:44, 20.11s/it]

KEY METRICS: {'train/kl_penalty': 6.51492552133797e-05, 'train/rewards': 0.015625, 'train/reward_metrics/format_reward': 0.015625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 13/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.17it/s, est. speed input: 453.69 toks/s, output: 4622.02 toks/s]

INFO 08-05 10:45:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:45:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:45:45 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.03 GiB memory is still in use.
INFO 08-05 10:45:45 executor_base.py:208] It took 0.124429 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 58)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [48, 1, 51, 15], create an equation that equals 12. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` [48, 1, 51, 15] </think>
The equation we need to get the sum equals 12 is: 15 - (51 - 48) = (51 - 48)
Simple, wat?<|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 522)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide th

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.65it/s]


INFO 08-05 10:45:58 executor_base.py:219] It took 0.088169 seconds to wake up.


  1%|▏         | 14/1000 [05:13<5:29:58, 20.08s/it]

KEY METRICS: {'train/kl_penalty': 7.615458595165604e-05, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 14/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:16,  2.95it/s, est. speed input: 421.24 toks/s, output: 4885.96 toks/s]

INFO 08-05 10:46:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:46:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:46:05 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:46:05 executor_base.py:208] It took 0.124274 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 1024)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [52, 57, 31], create an equation that equals 26. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Here is a simple equation that equals 26: </think> 52 * 3 + 57 / 3 = 26. To reach the answer, I combined the numbers 52 and 57 and then divided them by 3. (52 * 3) / (57 * 3) = 26. So, my answer is: </answer> 26




倪威是高校教师。
在图书馆工作。热爱自己教学研究的教学和科研。热爱基层\进取)\本科\研究生主要学科是"
倪威地址举报
倪威基础
上海交通大学
研究生谈话记
扬州大学
天马天马考研公开课

倪威红色经历
老人游戏
冬天

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.64it/s]


INFO 08-05 10:46:18 executor_base.py:219] It took 0.087046 seconds to wake up.


  2%|▏         | 15/1000 [05:34<5:31:25, 20.19s/it]

KEY METRICS: {'train/kl_penalty': 6.771625473670111e-05, 'train/rewards': 0.0078125, 'train/reward_metrics/format_reward': 0.0078125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 15/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.22it/s, est. speed input: 460.09 toks/s, output: 4190.80 toks/s]

INFO 08-05 10:46:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:46:25 prefix_caching_block.py:479] Successfully reset prefix cache

INFO 08-05 10:46:25 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.03 GiB memory is still in use.
INFO 08-05 10:46:25 executor_base.py:208] It took 0.123701 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 75)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [20, 22, 40], create an equation that equals 44. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` First, we have the equation 20 - 22 - 40 = ? </think>

<answer>20 - 22 - 40 = (20 - 22) - 40 

Here, we subtract 22 and 40 from 20, which give

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:46:38 executor_base.py:219] It took 0.088118 seconds to wake up.


  2%|▏         | 16/1000 [05:54<5:29:19, 20.08s/it]

KEY METRICS: {'train/kl_penalty': 7.346089498867463e-05, 'train/rewards': 0.0078125, 'train/reward_metrics/format_reward': 0.0078125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 16/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.35it/s, est. speed input: 473.64 toks/s, output: 3849.55 toks/s]

INFO 08-05 10:46:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:46:45 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:46:45 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.03 GiB memory is still in use.
INFO 08-05 10:46:45 executor_base.py:208] It took 0.124678 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 104)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [95, 44, 41], create an equation that equals 92. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`90 += 44。</think>
Now I have 90.
<think>(90 + 44) * 41。</think>
Now I have 900 + 1824 = 2724.
<think>92.</think>
Now I have 2724 - 92 = 2532. So my answer is 2532.
 bağreja3
ạ    
.topAnchoratoba<|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 118)
#### Query:
`<|im_start|>system
You are a helpful assistan

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:46:58 executor_base.py:219] It took 0.095712 seconds to wake up.


  2%|▏         | 17/1000 [06:13<5:26:55, 19.95s/it]

KEY METRICS: {'train/kl_penalty': 8.187442775019083e-05, 'train/rewards': 0.015625, 'train/reward_metrics/format_reward': 0.015625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 17/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.09it/s, est. speed input: 439.75 toks/s, output: 4393.03 toks/s]

INFO 08-05 10:47:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:47:05 prefix_caching_block.py:479] Successfully reset prefix cache

INFO 08-05 10:47:05 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.03 GiB memory is still in use.
INFO 08-05 10:47:05 executor_base.py:208] It took 0.125276 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 195)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [66, 33, 80], create an equation that equals 78. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` 66 </think>
<think> 33 </think>
<think> 80 </think> OR
1. first i thought how many numbers 1 + 2 equals to 2. instead i found that two can be

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]


INFO 08-05 10:47:18 executor_base.py:219] It took 0.087758 seconds to wake up.


  2%|▏         | 18/1000 [06:34<5:28:09, 20.05s/it]

KEY METRICS: {'train/kl_penalty': 7.899294832287168e-05, 'train/rewards': 0.03125, 'train/reward_metrics/format_reward': 0.03125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 18/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.09it/s, est. speed input: 438.66 toks/s, output: 4464.11 toks/s]

INFO 08-05 10:47:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:47:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:47:25 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:47:25 executor_base.py:208] It took 0.125059 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 69)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [28, 69, 33, 46], create an equation that equals 18. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` 28 </think> + <think> 69 </think> - <think> 33 </think> - <think> 46 </think>
<think>  + </think> <think> 99 </think> 
</think> = <think> 99 </think>
answered<|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 533)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reas

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:47:38 executor_base.py:219] It took 0.087294 seconds to wake up.


  2%|▏         | 19/1000 [06:54<5:28:13, 20.07s/it]

KEY METRICS: {'train/kl_penalty': 8.079325055078776e-05, 'train/rewards': 0.015625, 'train/reward_metrics/format_reward': 0.015625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 19/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:17,  2.81it/s, est. speed input: 399.83 toks/s, output: 4968.47 toks/s]

INFO 08-05 10:47:46 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:47:46 prefix_caching_block.py:479] Successfully reset prefix cache

INFO 08-05 10:47:46 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.01 GiB memory is still in use.
INFO 08-05 10:47:46 executor_base.py:208] It took 0.124326 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 784)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [47, 33, 23], create an equation that equals 57. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`Let's call our first number as "x" </think>
<div><x equivalent to 47</div>
<div><x equivalent to 33</div>
<div><x equivalent to 23</div>

<thi

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.65it/s]


INFO 08-05 10:47:59 executor_base.py:219] It took 0.087479 seconds to wake up.


  2%|▏         | 20/1000 [07:15<5:31:34, 20.30s/it]

KEY METRICS: {'train/kl_penalty': 9.976576923294696e-05, 'train/rewards': 0.0078125, 'train/reward_metrics/format_reward': 0.0078125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 20/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.52it/s, est. speed input: 500.95 toks/s, output: 3689.92 toks/s]

INFO 08-05 10:48:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:48:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:48:05 worker.py:133] Sleep mode freed 8.14 GiB memory, 9.02 GiB memory is still in use.
INFO 08-05 10:48:05 executor_base.py:208] It took 0.123037 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 322)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [33, 5, 68, 29], create an equation that equals 67. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` They are three numbers.</think>
answer: 67 = 33 + 5 - 29.
usterity: 33 + 5 - 29 = 67.
usterity: This equation solves the expression 67.

usterity: 33, 5, 68, 29 are written regularly and not mixed incorrectly.
usterity: They are three numbers.
usterity: But they can only be used once.
usterity: Let me check them.
usterity

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]


INFO 08-05 10:48:18 executor_base.py:219] It took 0.085558 seconds to wake up.


  2%|▏         | 21/1000 [07:34<5:26:52, 20.03s/it]

KEY METRICS: {'train/kl_penalty': 0.0001238063915110476, 'train/rewards': 0.0, 'train/reward_metrics/format_reward': 0.0, 'train/reward_metrics/equation_reward': 0.0}
Iteration 21/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.51it/s, est. speed input: 501.80 toks/s, output: 3813.80 toks/s]

INFO 08-05 10:48:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:48:25 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:48:25 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:48:25 executor_base.py:208] It took 0.124111 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 228)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [30, 82, 13, 73], create an equation that equals 26. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`26 = 30 + 4</think>

Let me explain what I did:
1. I started with three numbers that I want to combine
2. I chose the numbers 30 (20 + 10), 82 (42), and 13 (31)
3. I added the first two numbers: 30 and 42 = 72
4. I chose the third number, 13, and added it to the result (72): 72 + 13 = 85
5. I got very close. The final ans

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:48:38 executor_base.py:219] It took 0.087345 seconds to wake up.


  2%|▏         | 22/1000 [07:53<5:23:49, 19.87s/it]

KEY METRICS: {'train/kl_penalty': 0.00013892448984179447, 'train/rewards': 0.015625, 'train/reward_metrics/format_reward': 0.015625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 22/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.57it/s, est. speed input: 505.82 toks/s, output: 3529.82 toks/s]

INFO 08-05 10:48:44 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:48:44 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:48:44 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:48:44 executor_base.py:208] It took 0.124060 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 202)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [4, 37, 21], create an equation that equals 64. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`For any equation with addition, you should start by adding the largest numbers first. In this example, we want 37 + 21. So, we get: 37 + 21 = 58.
<think>Now, let's move on to multiplication. Again, we can look for the largest numbers that make sense. In this example, we want 4 * 21. So, we get: 4 * 21 = 84.
<think>Lastly, we n

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:48:57 executor_base.py:219] It took 0.087071 seconds to wake up.


  2%|▏         | 23/1000 [08:13<5:21:43, 19.76s/it]

KEY METRICS: {'train/kl_penalty': 0.00018948866042641714, 'train/rewards': 0.0234375, 'train/reward_metrics/format_reward': 0.0234375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 23/1000


Processed prompts:  25%|██▌       | 16/64 [00:05<00:15,  3.18it/s, est. speed input: 449.85 toks/s, output: 4320.04 toks/s]

INFO 08-05 10:49:04 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:49:04 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:49:04 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.99 GiB memory is still in use.
INFO 08-05 10:49:04 executor_base.py:208] It took 0.123854 seconds to fall asleep.

Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 164)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [17, 73, 2], create an equation that equals 39. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`39 = 39</think> 
<answer>
= 39
</answer> 
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer. рассматривание 17, 2 и 73 по математике. 
<think> 17 *   (73/2)
</think>
<answer>
=   (1 *   (73/2))
</answer>
<think>
15.15            
</think> 
<answer>
= 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:49:17 executor_base.py:219] It took 0.087240 seconds to wake up.


  2%|▏         | 24/1000 [08:33<5:22:51, 19.85s/it]

KEY METRICS: {'train/kl_penalty': 0.0002184136127029774, 'train/rewards': 0.015625, 'train/reward_metrics/format_reward': 0.015625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 24/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.36it/s, est. speed input: 480.15 toks/s, output: 3855.69 toks/s]

INFO 08-05 10:49:24 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:49:24 prefix_caching_block.py:479] Successfully reset prefix cache

INFO 08-05 10:49:24 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:49:24 executor_base.py:208] It took 0.123815 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 339)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [23, 40, 45], create an equation that equals 28. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Please provide more information until I can understand what you're trying to do. I need to know what values do you have and how they relate t

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]


INFO 08-05 10:49:37 executor_base.py:219] It took 0.087912 seconds to wake up.




KEY METRICS: {'train/kl_penalty': 0.0001723363827448059, 'train/rewards': 0.0546875, 'train/reward_metrics/format_reward': 0.0546875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 25/1000
Evaluating on eval set...


Processed prompts: 100%|██████████| 500/500 [00:16<00:00, 30.78it/s, est. speed input: 4386.48 toks/s, output: 8283.47 toks/s]
Processed prompts:  25%|██▌       | 16/64 [00:04<00:14,  3.21it/s, est. speed input: 455.36 toks/s, output: 4079.81 toks/s]

INFO 08-05 10:50:02 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:50:02 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:50:02 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.97 GiB memory is still in use.
INFO 08-05 10:50:02 executor_base.py:208] It took 0.123976 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 584)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [3, 6, 55], create an equation that equals 58. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`Ok, the numbers I have are 3, 6, and 55. </think>
Let's start by simplifying 55. We know that 55 can be divided by 5, so:
-think>(55 / 5) = 11
Now, let's simplify the original equation:
11 / (3 * 6) = 11 / (18) = 0.6111...
Repeat this step until we get to a whole number. Easy, now we have:
1 + 2 / (3 * 5) = 0.6 + 0.2 = 0.8
Repe

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:50:15 executor_base.py:219] It took 0.087421 seconds to wake up.


  3%|▎         | 26/1000 [09:30<6:48:17, 25.15s/it]

KEY METRICS: {'train/kl_penalty': 0.0001950999425477769, 'train/rewards': 0.0390625, 'train/reward_metrics/format_reward': 0.0390625, 'train/reward_metrics/equation_reward': 0.0, 'eval/rewards': 0.169, 'eval/reward_metrics/format_reward': 0.167, 'eval/reward_metrics/equation_reward': 0.002}
Iteration 26/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:12,  3.79it/s, est. speed input: 541.45 toks/s, output: 2902.35 toks/s]

INFO 08-05 10:50:21 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:50:21 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:50:21 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:50:21 executor_base.py:208] It took 0.123776 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 27)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [3, 6, 36, 60], create an equation that equals 43. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` 43 </think> = <answer> (65) </answer>
americanmathforall.com/1267<|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 162)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.69it/s]


INFO 08-05 10:50:34 executor_base.py:219] It took 0.088272 seconds to wake up.


  3%|▎         | 27/1000 [09:49<6:17:50, 23.30s/it]

KEY METRICS: {'train/kl_penalty': 0.00029107730859714246, 'train/rewards': 0.03125, 'train/reward_metrics/format_reward': 0.03125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 27/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.65it/s, est. speed input: 518.42 toks/s, output: 3531.62 toks/s]

INFO 08-05 10:50:40 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:50:40 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:50:40 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:50:40 executor_base.py:208] It took 0.124196 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 94)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [2, 89, 58, 55], create an equation that equals 41. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` 1+2 has multiple meaning, so we choose to solve in the first way 'add.' </think>
<think> (1+2)/(3*5) has multiple meaning also, so we choose to solve in the second way 'divide.' </think>
答案：</答案>  ((1+2)/3)*(5-8) = 1/3*(-3) = -1 = 41 (<answer>-1</answer>)<|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.66it/s]


INFO 08-05 10:50:53 executor_base.py:219] It took 0.087559 seconds to wake up.


  3%|▎         | 28/1000 [10:09<5:58:33, 22.13s/it]

KEY METRICS: {'train/kl_penalty': 0.0002735937969702138, 'train/rewards': 0.0390625, 'train/reward_metrics/format_reward': 0.0390625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 28/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.57it/s, est. speed input: 506.78 toks/s, output: 3353.55 toks/s]

INFO 08-05 10:50:59 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:50:59 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:51:00 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:51:00 executor_base.py:208] It took 0.127982 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 276)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [62, 15, 4, 12], create an equation that equals 31. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`Programming a
-answer> 31</answer>
To create an equation that equals 31, we can use the following steps:

Step 1: Add the two numbers
31 = (62 + 15) + 4 + 12

Step 2: Multiply them by 1/2
31 = ((62 + 15) + 4) / (1 * 2)
<answer>(1 + 2) / (3 * 5)</answer>
Step 3: Divide them by 2
31 = (31 / 2)
Thus, the final equation that e

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.69it/s]


INFO 08-05 10:51:13 executor_base.py:219] It took 0.089157 seconds to wake up.


  3%|▎         | 29/1000 [10:28<5:44:30, 21.29s/it]

KEY METRICS: {'train/kl_penalty': 0.00032388593330408706, 'train/rewards': 0.0390625, 'train/reward_metrics/format_reward': 0.0390625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 29/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.51it/s, est. speed input: 503.51 toks/s, output: 3533.54 toks/s]

INFO 08-05 10:51:19 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:51:19 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:51:19 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:51:19 executor_base.py:208] It took 0.123792 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 418)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [9, 71, 96, 55], create an equation that equals 21. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Using plus and negative operators, we can form +1 (plus one) and -2 (minus two).
<think> Now, we have -2 + (minus 2 multiplied by -1) = -2 - (-2) = -2 + 2 = 0.
<answer>(0 * 9) / (9 * 5) = 0 / 45 = 0
</think> </answer> If it's not 0, throw away the equation because it's not working out.

conde_de_siren_el_pesado_fondita
To

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:51:32 executor_base.py:219] It took 0.087108 seconds to wake up.


  3%|▎         | 30/1000 [10:48<5:35:38, 20.76s/it]

KEY METRICS: {'train/kl_penalty': 0.0004816910231852108, 'train/rewards': 0.046875, 'train/reward_metrics/format_reward': 0.046875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 30/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:12,  3.70it/s, est. speed input: 526.44 toks/s, output: 3054.39 toks/s]

INFO 08-05 10:51:38 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:51:38 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:51:38 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:51:38 executor_base.py:208] It took 0.125371 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 261)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [60, 65, 56], create an equation that equals 51. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` For this equation to equal 51, we need to find numbers that when added, multiplied, and divided together, result in 51. I will try to come up with a solution using all 60, 65, and 56. </think> 
   Let's try: (56 - 65) / (60 + 56) = 1 
    56 - 65 = -9   60 + 56 = 116
    -9 / 116 = -9/116 51 has the same operator as the orig

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.70it/s]


INFO 08-05 10:51:51 executor_base.py:219] It took 0.087264 seconds to wake up.


  3%|▎         | 31/1000 [11:07<5:27:30, 20.28s/it]

KEY METRICS: {'train/kl_penalty': 0.00048622501833994183, 'train/rewards': 0.1171875, 'train/reward_metrics/format_reward': 0.1171875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 31/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.54it/s, est. speed input: 504.93 toks/s, output: 3284.16 toks/s]

INFO 08-05 10:51:57 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:51:57 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:51:58 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:51:58 executor_base.py:208] It took 0.124492 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 343)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [51, 24, 53], create an equation that equals 80. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Start by assigning an imbalance coefficient to the initial number 51, -24 (12), and 53 (36) such that the total imbalance is 80. </think>
<hint> You can choose any numbers for the coefficients' values to create different scenarios </hint>
<hint> Example: B = 2, C = 4, D = 6 </hint>
</think>
First, create an equation like: "5

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.69it/s]


INFO 08-05 10:52:11 executor_base.py:219] It took 0.087500 seconds to wake up.


  3%|▎         | 32/1000 [11:26<5:23:19, 20.04s/it]

KEY METRICS: {'train/kl_penalty': 0.000519334289641289, 'train/rewards': 0.0390625, 'train/reward_metrics/format_reward': 0.0390625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 32/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.58it/s, est. speed input: 507.64 toks/s, output: 3198.23 toks/s]

INFO 08-05 10:52:17 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:52:17 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:52:17 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.97 GiB memory is still in use.
INFO 08-05 10:52:17 executor_base.py:208] It took 0.123724 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 46)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [55, 15, 12], create an equation that equals 44. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` let's start with the numbers 55, 15, and 12 respectively. </think> 
<answer>(55 - 15) * (15 / 12)</answer><|endoftext|>`


########## Example 2 (Reward: 0.5, Response Length: 64)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.69it/s]


INFO 08-05 10:52:30 executor_base.py:219] It took 0.087106 seconds to wake up.


  3%|▎         | 33/1000 [11:46<5:19:39, 19.83s/it]

KEY METRICS: {'train/kl_penalty': 0.0005443486350438136, 'train/rewards': 0.1171875, 'train/reward_metrics/format_reward': 0.1171875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 33/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:12,  3.70it/s, est. speed input: 527.04 toks/s, output: 3219.28 toks/s]

INFO 08-05 10:52:36 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:52:36 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:52:36 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.97 GiB memory is still in use.
INFO 08-05 10:52:36 executor_base.py:208] It took 0.124909 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 162)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [60, 65, 4, 76], create an equation that equals 77. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`ROS + 1 = 1 + 2</think>
<answer>4+1</answer>
<think>4 * 7 + 6 = 4 * (7 + 6) / 2 = 4 * 13 / 2</think>
<answer>4*13*2/2</answer>
 anniembre is a wonderful task. You can do this as follows:
1. 60 + 65 = 365
2. 365 - 4 = 361
3. 361 divided by 7 = 51 remainder 2
4. 51 + 76 = 127
5. 77 = 51 - 2<|endoftext|>`


########## Example

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.70it/s]


INFO 08-05 10:52:49 executor_base.py:219] It took 0.088090 seconds to wake up.


  3%|▎         | 34/1000 [12:05<5:15:57, 19.63s/it]

KEY METRICS: {'train/kl_penalty': 0.0007651087105475264, 'train/rewards': 0.140625, 'train/reward_metrics/format_reward': 0.140625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 34/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:12,  3.71it/s, est. speed input: 527.78 toks/s, output: 2974.98 toks/s]

INFO 08-05 10:52:55 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:52:55 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:52:55 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.97 GiB memory is still in use.
INFO 08-05 10:52:55 executor_base.py:208] It took 0.125155 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 46)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [90, 44, 54, 48], create an equation that equals 32. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`Domain: 90, 44, 54, 48</think>
<ans>90 + (44 - 54) / 48 = 32</ans><|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 397)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>us

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.69it/s]


INFO 08-05 10:53:08 executor_base.py:219] It took 0.087947 seconds to wake up.


  4%|▎         | 35/1000 [12:24<5:13:34, 19.50s/it]

KEY METRICS: {'train/kl_penalty': 0.0008473924198269518, 'train/rewards': 0.1875, 'train/reward_metrics/format_reward': 0.1875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 35/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:12,  3.84it/s, est. speed input: 547.52 toks/s, output: 2654.03 toks/s]

INFO 08-05 10:53:14 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:53:14 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:53:14 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.97 GiB memory is still in use.
INFO 08-05 10:53:14 executor_base.py:208] It took 0.127980 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 415)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [10, 85, 4], create an equation that equals 79. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`First, we want to find a way to combine the addition and multiplication operations.</think>
<answer>
   [10] = [10] - [85]
&lt;answer> = 0</<answer>
   [90] = [4] * [5]
&lt;answer>
   [90] = [100] - [5]
   [90] = [90]
&lt;answer>
   [10 + 90] = [10 + 90] / [85 + 4] * [5]
   [10] / [85] = [10 + 90] / [85 + 4]
   [5] = [90]
&lt;

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]


INFO 08-05 10:53:27 executor_base.py:219] It took 0.087409 seconds to wake up.


  4%|▎         | 36/1000 [12:43<5:11:40, 19.40s/it]

KEY METRICS: {'train/kl_penalty': 0.001274151543517325, 'train/rewards': 0.2421875, 'train/reward_metrics/format_reward': 0.2421875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 36/1000


Processed prompts:  25%|██▌       | 16/64 [00:04<00:13,  3.58it/s, est. speed input: 506.93 toks/s, output: 3291.79 toks/s]

INFO 08-05 10:53:34 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:53:34 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:53:34 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.07 GiB memory is still in use.
INFO 08-05 10:53:34 executor_base.py:208] It took 0.124209 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 40)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [78, 39, 32], create an equation that equals 85. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` 
    1
     +     
	keyard3
<jacket>{$2+$2} ÷ {4·2 ${1/2 * 8}}</jacket><|endoftext|>`


########## Example 2 (Reward: 0.5, Response Length: 17

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.68it/s]


INFO 08-05 10:53:47 executor_base.py:219] It took 0.085533 seconds to wake up.


  4%|▎         | 37/1000 [13:03<5:11:39, 19.42s/it]

KEY METRICS: {'train/kl_penalty': 0.0011091547987419719, 'train/rewards': 0.1796875, 'train/reward_metrics/format_reward': 0.1796875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 37/1000


Processed prompts:  25%|██▌       | 16/64 [00:01<00:05,  8.35it/s, est. speed input: 1192.89 toks/s, output: 3249.90 toks/s]

INFO 08-05 10:53:51 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:53:51 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:53:51 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.16 GiB memory is still in use.
INFO 08-05 10:53:51 executor_base.py:208] It took 0.123721 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 247)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [64, 69, 33], create an equation that equals 38. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Step 1: Create an equation </think>
<answer> x ÷ (x + x) + x - 64 </answer> </think>
.Step 2: First we add x and x together, which gives us 2x
<answer> 2x - 64 </answer> </think>
.Step 3: Next we can divide 2x minus 64 by x + x
<answer> (2 * 2 * 2 * -64) / (2 * 2) </answer> </think>
.Step 4: Simplify the equation
(64) / (4) 

Gradient Accumulation: 100%|██████████| 16/16 [00:05<00:00,  2.81it/s]


INFO 08-05 10:54:00 executor_base.py:219] It took 0.086752 seconds to wake up.


  4%|▍         | 38/1000 [13:16<4:40:17, 17.48s/it]

KEY METRICS: {'train/kl_penalty': 0.0022836767792625157, 'train/rewards': 0.2421875, 'train/reward_metrics/format_reward': 0.2421875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 38/1000


Processed prompts:  25%|██▌       | 16/64 [00:03<00:11,  4.28it/s, est. speed input: 609.66 toks/s, output: 2234.70 toks/s]

INFO 08-05 10:54:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:05 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:06 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:54:06 executor_base.py:208] It took 0.123907 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 111)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [27, 11, 69, 14], create an equation that equals 93. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Let x be the unknown number we are trying to find. </think>
<answer> 93 = x + 11 + 14
69 = x + 69, since 69 = 69 + 0. So that's false
51 = x + 69, since we need more than 24 to make x = 51, but even if we add 24, x would still be less than 80
x = 80 </answer><|endoftext|>`


########## Example 2 (Reward: 0.5, Response Le

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.73it/s]


INFO 08-05 10:54:18 executor_base.py:219] It took 0.087215 seconds to wake up.


  4%|▍         | 39/1000 [13:34<4:44:33, 17.77s/it]

KEY METRICS: {'train/kl_penalty': 0.0020073903677706234, 'train/rewards': 0.25, 'train/reward_metrics/format_reward': 0.25, 'train/reward_metrics/equation_reward': 0.0}
Iteration 39/1000


Processed prompts:  25%|██▌       | 16/64 [00:02<00:07,  6.13it/s, est. speed input: 875.15 toks/s, output: 2257.65 toks/s]

INFO 08-05 10:54:23 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:23 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:23 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:54:23 executor_base.py:208] It took 0.124232 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 60)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [59, 46, 75], create an equation that equals 30. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` We need an equation that equals 30. </think>
<answer>The equation is 59 - 46 - 75 = 30.</answer>

<answer>The answer: 59 - 46 - 75 = 30</answer><|endoftext|>`


########## Example 2 (Reward: 0.5, Response Length: 19)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in th

Gradient Accumulation: 100%|██████████| 16/16 [00:07<00:00,  2.15it/s]


INFO 08-05 10:54:34 executor_base.py:219] It took 0.088340 seconds to wake up.


  4%|▍         | 40/1000 [13:49<4:33:11, 17.07s/it]

KEY METRICS: {'train/kl_penalty': 0.003165677961842281, 'train/rewards': 0.2734375, 'train/reward_metrics/format_reward': 0.2734375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 40/1000


Processed prompts:  25%|██▌       | 16/64 [00:03<00:10,  4.68it/s, est. speed input: 662.31 toks/s, output: 1741.69 toks/s]

INFO 08-05 10:54:39 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:39 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:39 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:54:39 executor_base.py:208] It took 0.133384 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 45)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [56, 40, 26, 4], create an equation that equals 40. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` I need to make an equation that equals 40. The numbers can only be used once.</think>
<answer>(1 + 2) / (3 * 5) = 40</answer><|endoftext|>`


########## Example 2 (Reward: 0.5, Response Length: 41)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then 

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.70it/s]


INFO 08-05 10:54:52 executor_base.py:219] It took 0.088588 seconds to wake up.


  4%|▍         | 41/1000 [14:08<4:38:14, 17.41s/it]

KEY METRICS: {'train/kl_penalty': 0.002933054347229516, 'train/rewards': 0.4140625, 'train/reward_metrics/format_reward': 0.3984375, 'train/reward_metrics/equation_reward': 0.015625}
Iteration 41/1000


Processed prompts:  25%|██▌       | 16/64 [00:03<00:11,  4.35it/s, est. speed input: 622.21 toks/s, output: 1859.01 toks/s]

INFO 08-05 10:54:57 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:57 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:54:58 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:54:58 executor_base.py:208] It took 0.123430 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 150)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [16, 18, 59, 89], create an equation that equals 64. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`First, we need to find a number that when increased by 16, and then increased by 18, the result is equal to 64. This number is 16.</think>

<think>Then, we have another number that, when increased by 59, and then increased by 89, the result is equal to 64. This number is 59.</think>

<answer>(16) + (16) + (59) = 64</answe

Gradient Accumulation: 100%|██████████| 16/16 [00:09<00:00,  1.69it/s]


INFO 08-05 10:55:11 executor_base.py:219] It took 0.091930 seconds to wake up.


  4%|▍         | 42/1000 [14:26<4:43:48, 17.78s/it]

KEY METRICS: {'train/kl_penalty': 0.003115372152979706, 'train/rewards': 0.375, 'train/reward_metrics/format_reward': 0.375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 42/1000


Processed prompts:  25%|██▌       | 16/64 [00:01<00:05,  8.48it/s, est. speed input: 1218.27 toks/s, output: 2482.11 toks/s]

INFO 08-05 10:55:14 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:55:14 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:55:14 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:55:14 executor_base.py:208] It took 0.124526 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 79)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [1, 86, 2], create an equation that equals 86. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` So, we have the equation 86 = <answer>(1 + 2) / (3 * 5)</answer></think>
<answer> (1 + 2) = 3 </answer>
<answer> 3 / (3 * 5) = 1/5 </answer>
<answer> 86 = 1/5 </answer><|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 326)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the 

Gradient Accumulation: 100%|██████████| 16/16 [00:05<00:00,  2.77it/s]


INFO 08-05 10:55:24 executor_base.py:219] It took 0.087552 seconds to wake up.


  4%|▍         | 43/1000 [14:39<4:20:52, 16.36s/it]

KEY METRICS: {'train/kl_penalty': 0.004203571819026772, 'train/rewards': 0.4609375, 'train/reward_metrics/format_reward': 0.4609375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 43/1000


Processed prompts:  25%|██▌       | 16/64 [00:02<00:08,  5.56it/s, est. speed input: 792.37 toks/s, output: 2112.27 toks/s]

INFO 08-05 10:55:28 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:55:28 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:55:28 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:55:28 executor_base.py:208] It took 0.123153 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 52)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [60, 37, 50], create an equation that equals 47. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` (60 + 37) / (50 - 43) = 47 </think>
<answer> (60 + 37) / (50 - 43) = 47 </answer><|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 38)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|i

Gradient Accumulation: 100%|██████████| 16/16 [00:07<00:00,  2.01it/s]


INFO 08-05 10:55:40 executor_base.py:219] It took 0.096228 seconds to wake up.


  4%|▍         | 44/1000 [14:56<4:20:19, 16.34s/it]

KEY METRICS: {'train/kl_penalty': 0.004329043407116959, 'train/rewards': 0.453125, 'train/reward_metrics/format_reward': 0.453125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 44/1000


Processed prompts:  25%|██▌       | 16/64 [00:01<00:05,  8.39it/s, est. speed input: 1190.03 toks/s, output: 2051.57 toks/s]

INFO 08-05 10:55:44 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:55:44 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:55:44 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.08 GiB memory is still in use.
INFO 08-05 10:55:44 executor_base.py:208] It took 0.122982 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 56)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [6, 78, 23, 61], create an equation that equals 92. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` For now, let's just get as many answers as we can. We can use any combination of operations and numbers. </think>
<answer>(6 + 2) * (78 - 61 / (61 + 78))</answer><|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 84)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the re

Gradient Accumulation: 100%|██████████| 16/16 [00:05<00:00,  2.67it/s]


INFO 08-05 10:55:53 executor_base.py:219] It took 0.088055 seconds to wake up.


  4%|▍         | 45/1000 [15:09<4:05:53, 15.45s/it]

KEY METRICS: {'train/kl_penalty': 0.007690198052569729, 'train/rewards': 0.46875, 'train/reward_metrics/format_reward': 0.46875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 45/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:02, 18.21it/s, est. speed input: 2587.95 toks/s, output: 3847.71 toks/s]

INFO 08-05 10:55:56 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:55:56 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:55:56 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.21 GiB memory is still in use.
INFO 08-05 10:55:56 executor_base.py:208] It took 0.124094 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 0.0, Response Length: 44)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [58, 15, 17, 49], create an equation that equals 44. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`First, I will define the equation: </think>>
<answer>(58 + 15 + 17 + 49) / [49 - (48/9)]</answer><|endoftext|>`


########## Example 2 (Rew

Gradient Accumulation: 100%|██████████| 16/16 [00:04<00:00,  3.79it/s]


INFO 08-05 10:56:04 executor_base.py:219] It took 0.087085 seconds to wake up.


  5%|▍         | 46/1000 [15:19<3:41:57, 13.96s/it]

KEY METRICS: {'train/kl_penalty': 0.006786404277262199, 'train/rewards': 0.5703125, 'train/reward_metrics/format_reward': 0.5703125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 46/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:02, 16.62it/s, est. speed input: 2371.16 toks/s, output: 3395.50 toks/s]

INFO 08-05 10:56:07 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:07 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:07 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.00 GiB memory is still in use.
INFO 08-05 10:56:07 executor_base.py:208] It took 0.123010 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 45)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [47, 94, 12, 92], create an equation that equals 70. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`So we have four numbers to put in an equation.</think>
<answer>(1 + 2) / (3 * 5) = (3 + 4) / (3 * 5)</answer><|endoftext|>`


########## Example 2 (Reward: 0.5, Response Length: 97)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user

Gradient Accumulation: 100%|██████████| 16/16 [00:04<00:00,  3.86it/s]


INFO 08-05 10:56:14 executor_base.py:219] It took 0.087012 seconds to wake up.


  5%|▍         | 47/1000 [15:30<3:25:05, 12.91s/it]

KEY METRICS: {'train/kl_penalty': 0.0070692225583297, 'train/rewards': 0.5625, 'train/reward_metrics/format_reward': 0.5625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 47/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:02, 22.47it/s, est. speed input: 3202.77 toks/s, output: 4417.77 toks/s]

INFO 08-05 10:56:17 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:17 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:56:17 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.99 GiB memory is still in use.
INFO 08-05 10:56:17 executor_base.py:208] It took 0.123327 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 40)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [12, 6, 39, 54], create an equation that equals 95. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`95 can be formed by multiplying 12 by 5 and subtracting 6 from that result.</think>
<answer>(12 * 5) - (6)</answer><|endoftext|>`


########

Gradient Accumulation: 100%|██████████| 16/16 [00:03<00:00,  4.03it/s]


INFO 08-05 10:56:24 executor_base.py:219] It took 0.086706 seconds to wake up.


  5%|▍         | 48/1000 [15:40<3:11:34, 12.07s/it]

KEY METRICS: {'train/kl_penalty': 0.0072123521841197785, 'train/rewards': 0.6328125, 'train/reward_metrics/format_reward': 0.6328125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 48/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:02, 20.14it/s, est. speed input: 2867.18 toks/s, output: 3975.18 toks/s]

INFO 08-05 10:56:27 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:27 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:27 worker.py:133] Sleep mode freed 8.15 GiB memory, 8.98 GiB memory is still in use.
INFO 08-05 10:56:27 executor_base.py:208] It took 0.122626 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 54)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [23, 70, 83, 92], create an equation that equals 82. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`Use the numbers 23, 70, 83, and 92 once. Let's first try to make 43.</think>
<answer>(23 - 70 + 83 + 92)</answer><|endoftext|>`


########## Example 2 (Reward: 1.0, Response Length: 40)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the 

Gradient Accumulation: 100%|██████████| 16/16 [00:04<00:00,  3.74it/s]


INFO 08-05 10:56:35 executor_base.py:219] It took 0.087510 seconds to wake up.


  5%|▍         | 49/1000 [15:51<3:03:47, 11.60s/it]

KEY METRICS: {'train/kl_penalty': 0.010056544339539774, 'train/rewards': 0.65625, 'train/reward_metrics/format_reward': 0.65625, 'train/reward_metrics/equation_reward': 0.0}
Iteration 49/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:02, 16.08it/s, est. speed input: 2301.58 toks/s, output: 3128.41 toks/s]

INFO 08-05 10:56:38 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:38 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:38 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.09 GiB memory is still in use.
INFO 08-05 10:56:38 executor_base.py:208] It took 0.123912 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 31)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [9, 48, 12], create an equation that equals 69. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` To find the equation that equals 69 </think>
<answer>(2 * 48) + (1 / 6)</answer><|endoftext|>`


########## Example 2 (Reward: 0.0, Response Length: 38)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_

Gradient Accumulation: 100%|██████████| 16/16 [00:04<00:00,  3.57it/s]


INFO 08-05 10:56:46 executor_base.py:219] It took 0.086761 seconds to wake up.


  5%|▌         | 50/1000 [16:01<3:00:14, 11.38s/it]

KEY METRICS: {'train/kl_penalty': 0.008469931259032615, 'train/rewards': 0.7109375, 'train/reward_metrics/format_reward': 0.7109375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 50/1000
Evaluating on eval set...


Processed prompts: 100%|██████████| 500/500 [00:03<00:00, 160.03it/s, est. speed input: 22808.83 toks/s, output: 7043.01 toks/s]
Processed prompts:  25%|██▌       | 16/64 [00:01<00:04, 10.57it/s, est. speed input: 1504.31 toks/s, output: 2156.72 toks/s]

INFO 08-05 10:56:53 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:53 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:56:53 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.01 GiB memory is still in use.
INFO 08-05 10:56:53 executor_base.py:208] It took 0.123122 seconds to fall asleep.





Generated 64 responses
########## Example 1 (Reward: 0.5, Response Length: 59)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [46, 78, 10], create an equation that equals 42. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Subtract 46 from 10. </think>
<answer>(10 - 46) / (3 * 5) = 1 * 5 / (3 * 5) = 5 / (15) = 1/3 </answer><|endoftext|>`


########## Example 2 (Reward: 1.0, Response Length: 32)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the 

Gradient Accumulation: 100%|██████████| 16/16 [00:05<00:00,  2.91it/s]


INFO 08-05 10:57:02 executor_base.py:219] It took 0.085876 seconds to wake up.
KEY METRICS: {'train/kl_penalty': 0.0076583987244490965, 'train/rewards': 0.7578125, 'train/reward_metrics/format_reward': 0.7578125, 'train/reward_metrics/equation_reward': 0.0, 'eval/rewards': 0.986, 'eval/reward_metrics/format_reward': 0.984, 'eval/reward_metrics/equation_reward': 0.002}
[2025-08-05 10:57:05,282] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint global_step51 is about to be saved!
[2025-08-05 10:57:05,286] [INFO] [logging.py:128:log_dist] [Rank 0] Saving model checkpoint: /home/quang/scratch/deepseek_r1z_hackathon/r1-zero/checkpoints/ckpt_000050/deepspeed/global_step51/mp_rank_00_model_states.pt
[2025-08-05 10:57:05,287] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /home/quang/scratch/deepseek_r1z_hackathon/r1-zero/checkpoints/ckpt_000050/deepspeed/global_step51/mp_rank_00_model_states.pt...


[rank0]:[W805 10:57:05.314778866 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.


[2025-08-05 10:57:06,163] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /home/quang/scratch/deepseek_r1z_hackathon/r1-zero/checkpoints/ckpt_000050/deepspeed/global_step51/mp_rank_00_model_states.pt.
[2025-08-05 10:57:06,164] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /home/quang/scratch/deepseek_r1z_hackathon/r1-zero/checkpoints/ckpt_000050/deepspeed/global_step51/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-08-05 10:57:11,636] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /home/quang/scratch/deepseek_r1z_hackathon/r1-zero/checkpoints/ckpt_000050/deepspeed/global_step51/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-08-05 10:57:11,638] [INFO] [engine.py:3645:_save_zero_checkpoint] zero checkpoint saved /home/quang/scratch/deepseek_r1z_hackathon/r1-zero/checkpoints/ckpt_000050/deepspeed/global_step51/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-08-05 10:57:11,638] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Chec

  5%|▌         | 51/1000 [16:25<3:57:52, 15.04s/it]

Iteration 51/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:01, 27.28it/s, est. speed input: 3879.79 toks/s, output: 4812.52 toks/s]

INFO 08-05 10:57:12 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:57:12 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:57:12 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.13 GiB memory is still in use.
INFO 08-05 10:57:12 executor_base.py:208] It took 0.124180 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 88)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [18, 7, 93, 16], create an equation that equals 37. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` Set up the equation = x + (7 - 93 - 18) * ((x - 93) / (72 - 223)) + (93 * (72 / (223 - x))) / (93 - 93) * (93 - x)</think>
<answer>(1 + 2) 

Gradient Accumulation: 100%|██████████| 16/16 [00:04<00:00,  3.64it/s]


INFO 08-05 10:57:20 executor_base.py:219] It took 0.085774 seconds to wake up.


  5%|▌         | 52/1000 [16:35<3:35:43, 13.65s/it]

KEY METRICS: {'train/kl_penalty': 0.009260242433939987, 'train/rewards': 0.828125, 'train/reward_metrics/format_reward': 0.828125, 'train/reward_metrics/equation_reward': 0.0}
Iteration 52/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:01, 30.95it/s, est. speed input: 4426.12 toks/s, output: 5151.41 toks/s]

INFO 08-05 10:57:22 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:57:22 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:57:22 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.13 GiB memory is still in use.
INFO 08-05 10:57:22 executor_base.py:208] It took 0.123801 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 35)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [88, 80, 74], create an equation that equals 66. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` First we have to add up all the numbers to get 66. </think>
<answer>(88 + 80 + 74)</answer><|endoftext|>`


########## Example 2 (Reward: 1.0,

Gradient Accumulation: 100%|██████████| 16/16 [00:03<00:00,  4.14it/s]


INFO 08-05 10:57:30 executor_base.py:219] It took 0.086873 seconds to wake up.


  5%|▌         | 53/1000 [16:45<3:17:08, 12.49s/it]

KEY METRICS: {'train/kl_penalty': 0.014418514966517072, 'train/rewards': 0.9296875, 'train/reward_metrics/format_reward': 0.9296875, 'train/reward_metrics/equation_reward': 0.0}
Iteration 53/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:01, 33.63it/s, est. speed input: 4798.07 toks/s, output: 5468.62 toks/s]

INFO 08-05 10:57:32 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:57:32 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:57:32 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.11 GiB memory is still in use.
INFO 08-05 10:57:32 executor_base.py:208] It took 0.123986 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 33)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [25, 63, 63, 51], create an equation that equals 51. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
` We want to create an equation that equals 51</think>
<answer>(51 - 2)*(63 + 63)</answer><|endoftext|>`


########## Example 2 (Reward: 0.5

Gradient Accumulation: 100%|██████████| 16/16 [00:04<00:00,  3.93it/s]


INFO 08-05 10:57:40 executor_base.py:219] It took 0.086635 seconds to wake up.


  5%|▌         | 54/1000 [16:55<3:05:20, 11.75s/it]

KEY METRICS: {'train/kl_penalty': 0.017813185598702673, 'train/rewards': 0.8984375, 'train/reward_metrics/format_reward': 0.8984375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 54/1000


Processed prompts:  25%|██▌       | 16/64 [00:00<00:01, 32.39it/s, est. speed input: 4610.89 toks/s, output: 5102.82 toks/s]

INFO 08-05 10:57:42 prefix_caching_block.py:479] Successfully reset prefix cache
INFO 08-05 10:57:42 prefix_caching_block.py:479] Successfully reset prefix cache





INFO 08-05 10:57:42 worker.py:133] Sleep mode freed 8.15 GiB memory, 9.12 GiB memory is still in use.
INFO 08-05 10:57:42 executor_base.py:208] It took 0.123006 seconds to fall asleep.
Generated 64 responses
########## Example 1 (Reward: 1.0, Response Length: 68)
#### Query:
`<|im_start|>system
You are a helpful assistant. You first think about the reasoning process in the mind and then provide the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [96, 2, 65, 79], create an equation that equals 45. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / (3 * 5)</answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>`
#### Response:
`We are given two variables, a and b, which we will use in our equation, and we have a total of four variables left to assign.</think>
<answe


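The `KEY METRICS:` lines in the training output above are printed as Python dict literals, so they can be pulled out of a saved log to follow how `format_reward` and `equation_reward` move over iterations. Below is a minimal sketch, assuming the output was captured as plain text; `parse_key_metrics` and `sample_log` are illustrative names and not part of the notebook.

```python
import ast
import re

# Matches lines such as: KEY METRICS: {'train/rewards': 0.375, ...}
KEY_METRICS_RE = re.compile(r"^KEY METRICS: (\{.*\})\s*$")


def parse_key_metrics(log_text):
    """Return one dict per `KEY METRICS:` line, in log order (hypothetical helper)."""
    rows = []
    for line in log_text.splitlines():
        match = KEY_METRICS_RE.match(line.strip())
        if match:
            # The payload is a plain dict literal of strings and floats,
            # so ast.literal_eval can parse it without executing code.
            rows.append(ast.literal_eval(match.group(1)))
    return rows


# Two metric lines copied from the output above, used as sample input.
sample_log = """\
KEY METRICS: {'train/kl_penalty': 0.003115372152979706, 'train/rewards': 0.375, 'train/reward_metrics/format_reward': 0.375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 42/1000
KEY METRICS: {'train/kl_penalty': 0.004203571819026772, 'train/rewards': 0.4609375, 'train/reward_metrics/format_reward': 0.4609375, 'train/reward_metrics/equation_reward': 0.0}
Iteration 43/1000
"""

rows = parse_key_metrics(sample_log)
format_rewards = [row["train/reward_metrics/format_reward"] for row in rows]
print(format_rewards)
```

Lines that do not start with `KEY METRICS:` (progress bars, vLLM and DeepSpeed messages, sampled responses) are skipped, so the whole captured output can be passed in unchanged.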

## Citation

If you use this codebase in your research, please cite us using:

```bibtex
@misc{Kazemnejad2025:NanoAhaMoment,
  author       = {Amirhossein Kazemnejad and Milad Aghajohari and Alessandro Sordoni and Aaron Courville and Siva Reddy},
  title        = {Nano Aha! Moment: Lunch Break Reproduction of DeepSeek R1-Zero from Scratch},
  year         = {2025},
  howpublished = {\url{https://github.com/McGill-NLP/nano-aha-moment}},
  note         = {GitHub repository}
}
```