In [1]:
from exps.math_rlvr import RLVRMathDataset, to_datum
import rich


In [2]:
dataset = RLVRMathDataset("Qwen/Qwen3-8B")

## Sample Item

In [3]:
item = dataset[0]

In [4]:
item.keys()

dict_keys(['prompt_tokens', 'model_input', 'ground_truth', 'messages'])

In [5]:
rich.print(item["messages"])

In [6]:
rich.print(f"Ground truth: {item['ground_truth']}")

In [7]:
rich.print(f"Prompt tokens: {len(item['prompt_tokens'])}")

## Messages Format

This message format is typical of how models are prompted in this RLVR environment: you give worked examples in context and have the model mimic the format of those examples. We'll compute the reward as a composite of a format reward and a correctness reward.
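To make the composite reward concrete, here is a minimal sketch of how a format term and a correctness term could be combined. It is not the reward used in `exps.math_rlvr`: the function names, the 0.2/0.8 weights, and the assumption that the final answer appears inside `\boxed{...}` after a `<think>...</think>` block are all hypothetical choices made for illustration.

```python
import re

# Hypothetical weights: the notebook only says the reward is a composite
# of format and correctness, not how the two parts are weighted.
FORMAT_WEIGHT = 0.2
CORRECT_WEIGHT = 0.8

# Assumed output format: a <think>...</think> block followed by \boxed{answer}.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion has a closed think block and a boxed answer."""
    has_think = THINK_RE.search(completion) is not None
    has_answer = BOXED_RE.search(completion) is not None
    return 1.0 if (has_think and has_answer) else 0.0


def correctness_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the last boxed answer matches the ground truth exactly."""
    answers = BOXED_RE.findall(completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0


def composite_reward(completion: str, ground_truth: str) -> float:
    """Weighted sum of the format and correctness terms."""
    return (
        FORMAT_WEIGHT * format_reward(completion)
        + CORRECT_WEIGHT * correctness_reward(completion, ground_truth)
    )
```

Keeping a small, separate format term means a well-formed but wrong completion still earns partial credit, which gives the policy some signal before it starts getting answers right.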

In [10]:
rich.print(f"Ground truth: {item['ground_truth']}")
rich.print(f"Prompt tokens: {len(item['prompt_tokens'])}")

## Inference on Sample Question

Let's see what Qwen3-8B can do on this question.

In [11]:
from exps.math_rlvr.utils import create_clients, get_sampling_params
from exps.math_rlvr.models import TrainingConfig, SamplingConfig

In [12]:
train_config = TrainingConfig()
sampling_config = SamplingConfig()

In [13]:
clients = create_clients(
    config=train_config
)

In [16]:
future = clients.sampling_client.sample(
    prompt=item["model_input"], 
    sampling_params=get_sampling_params(sampling_config), 
    num_samples=1
)

In [17]:
response = future.result()

In [19]:
for i, sequence in enumerate(response.sequences):
    decoded_text = dataset.tokenizer.decode(sequence.tokens)
    print(f"Sample {i+1}: {decoded_text}")

Sample 1: <think>
Okay, so I need to find the modulo 13 residue of the expression 247 + 5*39 + 7*143 + 4*15. Hmm, modulo 13 residue means I have to compute the entire expression and then find the remainder when divided by 13. But maybe there's a smarter way than calculating each term first and then adding them up. Let me think.

First, I remember that when dealing with modulo operations, I can take each term modulo 13 first and then add them up, right? Because modulo is distributive over addition and multiplication. So maybe I can simplify each term individually before adding. Let me try that.

Starting with the first term: 247. Let me divide 247 by 13 and find the remainder. Let me recall that 13*19 is 247? Wait, 13*20 is 260, so 260 - 13 = 247, so 13*19 = 247. Oh, so 247 divided by 13 is exactly 19 with no remainder. Therefore, 247 mod 13 is 0. Okay, that's straightforward.

Next term is 5*39. Let me compute 5*39 first. 5*39 is 195. Now, find 195 mod 13. Alternatively, maybe I can fi

## Notes

The model got the answer right and formatted it properly. This shows that the task is within reach, so the reward won't be too sparse. The task needs to be achievable! In this case, Qwen3-8B has been "warm started" with SFT to produce traces that contain the `<think>` tags.
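The arithmetic in the sampled trace can be checked independently of the model. The short sketch below reduces each term of 247 + 5*39 + 7*143 + 4*15 modulo 13 before summing, which is the strategy the trace describes, and compares that against evaluating the whole expression directly. The variable names are illustrative only.

```python
MODULUS = 13

# (coefficient, value) pairs for 247 + 5*39 + 7*143 + 4*15
terms = [(1, 247), (5, 39), (7, 143), (4, 15)]

# Reduce each factor modulo 13 first, then multiply and reduce again.
residues = [((c % MODULUS) * (v % MODULUS)) % MODULUS for c, v in terms]
reduced = sum(residues) % MODULUS

# Evaluate the full expression and reduce once at the end.
direct = sum(c * v for c, v in terms) % MODULUS

print(residues)  # [0, 0, 0, 8]
print(reduced)   # 8
```

The first three terms vanish because 247, 39, and 143 are all multiples of 13, so only 4*15 = 60 contributes, and 60 mod 13 is 8.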