Conversation

@mayinghan (Collaborator) commented Dec 16, 2025

  1. Record rollout and eval time costs separately.
  2. Add a tqdm progress bar in the scheduler for better formatting.
[Screenshot 2025-12-16 at 4:15:07 PM]

Note

Split execution timing into rollout vs eval durations and add rich tqdm progress bars with activity tracking to the priority scheduler; deprecate duration_seconds and update processors/tests/UI accordingly.

  • Models:
    • Add execution_metadata.rollout_duration_seconds and eval_duration_seconds; deprecate duration_seconds.
  • Rollout Processors (tinker, mcp, remote, github_action, openenv, default_*):
    • Record rollout_duration_seconds instead of deprecated field.
  • Scheduler & Utils:
    • Priority scheduler: introduce async tqdm progress bars for rollouts and evals, active task tracking/postfix, and per-eval eval_duration_seconds capture.
    • Progress helpers accept disable_tqdm; propagate to retry wrapper.
  • Tests:
    • Update mocks to accept **kwargs; adjust worker scaling expectation; add concurrency assertions.
  • UI:
    • Speed pivot defaults to execution_metadata.rollout_duration_seconds.

Written by Cursor Bugbot for commit e7d638d. This will update automatically on new commits.

```python
    evaluation_test_kwargs=self.evaluation_test_kwargs,
    processed_row=rows_to_eval,
)
eval_duration = time.perf_counter() - start_time
```
Bug: Eval duration includes semaphore wait time, inflating metrics

The start_time for measuring eval_duration_seconds is set at line 157, before the semaphore self.eval_sem is acquired at line 160. This means the evaluation duration will include any time spent waiting for the semaphore, not just the actual evaluation processing time. When there's contention for evaluations (multiple evals queued), the reported eval_duration_seconds will be significantly inflated, making the metric unreliable. The field description states "Processing duration in seconds for the evaluation" which implies actual processing time only. The start_time assignment should be moved inside the async with self.eval_sem: block to accurately measure only the evaluation processing time.


mayinghan (Collaborator, Author) replied:
this is expected
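The trade-off in this exchange — wall-clock duration including semaphore wait versus pure processing time — can be illustrated with a small, self-contained asyncio sketch. All names here are hypothetical; this is not the project's actual scheduler code:

```python
import asyncio
import time

async def run_eval(sem: asyncio.Semaphore, work_seconds: float) -> dict:
    # Strategy kept in the PR: start the clock before acquiring the
    # semaphore, so queue wait counts toward the reported duration.
    wall_start = time.perf_counter()
    async with sem:
        # Strategy Bugbot suggested: start the clock here, measuring
        # only the evaluation work itself.
        proc_start = time.perf_counter()
        await asyncio.sleep(work_seconds)  # stand-in for the real evaluation
        now = time.perf_counter()
    return {
        "eval_duration_seconds": now - wall_start,    # wait + processing
        "processing_only_seconds": now - proc_start,  # processing only
    }

async def main() -> list[dict]:
    sem = asyncio.Semaphore(1)  # force contention so the two metrics diverge
    return await asyncio.gather(*(run_eval(sem, 0.05) for _ in range(3)))

results = asyncio.run(main())
```

Under contention, the last task's `eval_duration_seconds` exceeds its `processing_only_seconds` by roughly the time it spent queued, which is exactly the divergence being debated.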

```python
# Calculate totals for progress bars
total_rollouts = len(dataset) * num_runs
# In pointwise mode: 1 eval per rollout; in groupwise mode: 1 eval per dataset row
total_evals = total_rollouts if self.mode == "pointwise" else len(dataset)
```
Bug: Progress bar total uses wrong variable for rollout count

The progress bar total calculation uses the num_runs parameter from run() (line 392), but all actual rollout scheduling logic uses self.rollout_n from the constructor (lines 109, 298, 308, 309). The self.num_runs assignment at line 389 is never used anywhere else. If a caller creates a PriorityRolloutScheduler with a rollout_n value different from the num_runs passed to run(), the progress bar will display incorrect totals, potentially showing 100% completion before all rollouts finish or not reaching 100% when all work is done.
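To see why the mismatch matters, the totals logic can be factored into a standalone helper (a hypothetical sketch mirroring the computation above, not the project's code). If the bar's total is computed from `num_runs` while scheduling is driven by `rollout_n`, the bar and the actual workload disagree whenever the two values differ:

```python
def progress_totals(dataset_len: int, num_runs: int, mode: str) -> tuple[int, int]:
    """Compute (total_rollouts, total_evals) for the two progress bars."""
    total_rollouts = dataset_len * num_runs
    # Pointwise: one eval per rollout; groupwise: one eval per dataset row.
    total_evals = total_rollouts if mode == "pointwise" else dataset_len
    return total_rollouts, total_evals

# Bar built from num_runs=2, but scheduler driven by rollout_n=3:
bar_total, _ = progress_totals(dataset_len=100, num_runs=2, mode="pointwise")
scheduled_total, _ = progress_totals(dataset_len=100, num_runs=3, mode="pointwise")
# bar_total (200) != scheduled_total (300): the bar would hit 100% early.
```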


```diff
 duration_seconds: Optional[float] = Field(
     default=None,
-    description="Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
+    description="[Deprecated] Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
```
A collaborator suggested the [Deprecated] prefix shown above.

mayinghan (Collaborator, Author) replied:

gotcha, thanks!

```python
    description="[Deprecated] Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
)

rollout_duration_seconds: Optional[float] = Field(
```
A collaborator commented:
Regarding retries, I think it would still be valuable to track total_duration_seconds so that people can get a sense of wall clock time for this row. This can be helpful in the UI as well

The same collaborator added:

follow up PR work though

mayinghan (Collaborator, Author) replied:

Makes sense. I think we should track the number of retries as well. For duration, we should probably still only count the last successful run. For failures, I think the failure reason matters more.
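Putting the discussion together, the metadata fields could look roughly like the following pydantic sketch. The field names come from this PR; the model name, exact descriptions, and surrounding class are assumptions, and the real model lives in the project's codebase:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ExecutionMetadata(BaseModel):
    # Hypothetical reconstruction of the fields discussed in this PR.
    duration_seconds: Optional[float] = Field(
        default=None,
        description="[Deprecated] Processing duration in seconds for this evaluation row.",
    )
    rollout_duration_seconds: Optional[float] = Field(
        default=None,
        description="Rollout duration in seconds (last attempt only).",
    )
    eval_duration_seconds: Optional[float] = Field(
        default=None,
        description="Evaluation duration in seconds (last attempt only).",
    )

meta = ExecutionMetadata(rollout_duration_seconds=1.5, eval_duration_seconds=0.3)
```

A follow-up `total_duration_seconds` and a retry counter, as discussed above, would slot in as additional optional fields without breaking existing consumers of `duration_seconds`.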

@dphuang2 (Collaborator) commented:

[Screenshot 2025-12-16 at 4:15:07 PM]

This presentation is super busy/confusing. I don't really understand the r1:[0] at first glance. I know this isn't because of your PR, but the duplicated logs on every update are also really jarring.

@xzrderek We should follow up here

@mayinghan mayinghan requested a review from dphuang2 December 17, 2025 05:27
@mayinghan (Collaborator, Author) replied:

> but also the duplicated logs for every update is really jarring.

This is because I have a print statement in my evaluator code, which I think breaks tqdm's progress bar.
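For reference, tqdm's documented workaround for this is `tqdm.write`, which clears the bar, prints the message, and redraws the bar, so output from inside the loop doesn't leave duplicated bar lines. A minimal sketch (the `log_lines` helper is hypothetical; `io.StringIO` is used only so the output is capturable):

```python
import io
from tqdm import tqdm

def log_lines(messages, file=None):
    # print() inside a tqdm loop redraws over the bar and duplicates lines;
    # tqdm.write() is the safe alternative.
    out = file or io.StringIO()
    for msg in tqdm(messages, disable=True):  # disabled bar for the sketch
        tqdm.write(msg, file=out)
    return out

buf = log_lines(["rollout 0 done", "rollout 1 done"])
```

In the evaluator code discussed above, replacing the bare `print` with `tqdm.write` (or routing logs through a handler that does so) would stop the bar from being broken on every update.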

@mayinghan mayinghan merged commit 2e19cf7 into main Dec 17, 2025
17 checks passed
@mayinghan mayinghan deleted the record-timecost branch December 17, 2025 06:20