Conversation

@mayinghan (Collaborator) commented Dec 16, 2025

  1. Record rollout and eval time costs separately.
  2. Add a tqdm progress bar in the scheduler for better formatting.
[Screenshot 2025-12-16 at 4:15:07 PM]

Note

Split execution timing into rollout vs eval durations and add rich tqdm progress bars with activity tracking to the priority scheduler; deprecate duration_seconds and update processors/tests/UI accordingly.

  • Models:
    • Add execution_metadata.rollout_duration_seconds and eval_duration_seconds; deprecate duration_seconds.
  • Rollout Processors (tinker, mcp, remote, github_action, openenv, default_*):
    • Record rollout_duration_seconds instead of deprecated field.
  • Scheduler & Utils:
    • Priority scheduler: introduce async tqdm progress bars for rollouts and evals, active task tracking/postfix, and per-eval eval_duration_seconds capture.
    • Progress helpers accept disable_tqdm; propagate to retry wrapper.
  • Tests:
    • Update mocks to accept **kwargs; adjust worker scaling expectation; add concurrency assertions.
  • UI:
    • Speed pivot defaults to execution_metadata.rollout_duration_seconds.

Written by Cursor Bugbot for commit e7d638d. This will update automatically on new commits.

```python
    evaluation_test_kwargs=self.evaluation_test_kwargs,
    processed_row=rows_to_eval,
)
eval_duration = time.perf_counter() - start_time
```
Bug: Eval duration includes semaphore wait time, inflating metrics

The start_time for measuring eval_duration_seconds is set at line 157, before the semaphore self.eval_sem is acquired at line 160. This means the evaluation duration will include any time spent waiting for the semaphore, not just the actual evaluation processing time. When there's contention for evaluations (multiple evals queued), the reported eval_duration_seconds will be significantly inflated, making the metric unreliable. The field description states "Processing duration in seconds for the evaluation" which implies actual processing time only. The start_time assignment should be moved inside the async with self.eval_sem: block to accurately measure only the evaluation processing time.


mayinghan (Collaborator, Author) replied:
this is expected
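The trade-off in this exchange — wall-clock duration including semaphore wait versus pure processing time — can be illustrated with a small, self-contained asyncio sketch. All names here are hypothetical; this is not the project's actual scheduler code:

```python
import asyncio
import time

async def run_eval(sem: asyncio.Semaphore, work_seconds: float) -> dict:
    # Strategy kept in the PR: start the clock before acquiring the
    # semaphore, so queue wait counts toward the reported duration.
    wall_start = time.perf_counter()
    async with sem:
        # Strategy Bugbot suggested: start the clock here, measuring
        # only the evaluation work itself.
        proc_start = time.perf_counter()
        await asyncio.sleep(work_seconds)  # stand-in for the real evaluation
        now = time.perf_counter()
    return {
        "eval_duration_seconds": now - wall_start,    # wait + processing
        "processing_only_seconds": now - proc_start,  # processing only
    }

async def main() -> list[dict]:
    sem = asyncio.Semaphore(1)  # force contention so the two metrics diverge
    return await asyncio.gather(*(run_eval(sem, 0.05) for _ in range(3)))

results = asyncio.run(main())
```

Under contention, the last task's `eval_duration_seconds` exceeds its `processing_only_seconds` by roughly the time it spent queued, which is exactly the divergence being debated.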

```python
# Calculate totals for progress bars
total_rollouts = len(dataset) * num_runs
# In pointwise mode: 1 eval per rollout; in groupwise mode: 1 eval per dataset row
total_evals = total_rollouts if self.mode == "pointwise" else len(dataset)
```
Bug: Progress bar total uses wrong variable for rollout count

The progress bar total calculation uses the num_runs parameter from run() (line 392), but all actual rollout scheduling logic uses self.rollout_n from the constructor (lines 109, 298, 308, 309). The self.num_runs assignment at line 389 is never used anywhere else. If a caller creates a PriorityRolloutScheduler with a rollout_n value different from the num_runs passed to run(), the progress bar will display incorrect totals, potentially showing 100% completion before all rollouts finish or not reaching 100% when all work is done.
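To see why the mismatch matters, the totals logic can be factored into a standalone helper (a hypothetical sketch mirroring the computation above, not the project's code). If the bar's total is computed from `num_runs` while scheduling is driven by `rollout_n`, the bar and the actual workload disagree whenever the two values differ:

```python
def progress_totals(dataset_len: int, num_runs: int, mode: str) -> tuple[int, int]:
    """Compute (total_rollouts, total_evals) for the two progress bars."""
    total_rollouts = dataset_len * num_runs
    # Pointwise: one eval per rollout; groupwise: one eval per dataset row.
    total_evals = total_rollouts if mode == "pointwise" else dataset_len
    return total_rollouts, total_evals

# Bar built from num_runs=2, but scheduler driven by rollout_n=3:
bar_total, _ = progress_totals(dataset_len=100, num_runs=2, mode="pointwise")
scheduled_total, _ = progress_totals(dataset_len=100, num_runs=3, mode="pointwise")
# bar_total (200) != scheduled_total (300): the bar would hit 100% early.
```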


```diff
 duration_seconds: Optional[float] = Field(
     default=None,
-    description="Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
+    description="[Deprecated] Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
```
A collaborator suggested the [Deprecated] prefix shown above.

mayinghan (Collaborator, Author) replied:

gotcha, thanks!

```python
    description="[Deprecated] Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
)

rollout_duration_seconds: Optional[float] = Field(
```
A collaborator commented:
Regarding retries, I think it would still be valuable to track total_duration_seconds so that people can get a sense of wall clock time for this row. This can be helpful in the UI as well

The same collaborator added:

follow up PR work though

mayinghan (Collaborator, Author) replied:

Makes sense. I think we should track the number of retries as well. For duration, we should probably still only count the last successful run. For failures, I think the failure reason matters more.
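Putting the discussion together, the metadata fields could look roughly like the following pydantic sketch. The field names come from this PR; the model name, exact descriptions, and surrounding class are assumptions, and the real model lives in the project's codebase:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ExecutionMetadata(BaseModel):
    # Hypothetical reconstruction of the fields discussed in this PR.
    duration_seconds: Optional[float] = Field(
        default=None,
        description="[Deprecated] Processing duration in seconds for this evaluation row.",
    )
    rollout_duration_seconds: Optional[float] = Field(
        default=None,
        description="Rollout duration in seconds (last attempt only).",
    )
    eval_duration_seconds: Optional[float] = Field(
        default=None,
        description="Evaluation duration in seconds (last attempt only).",
    )

meta = ExecutionMetadata(rollout_duration_seconds=1.5, eval_duration_seconds=0.3)
```

A follow-up `total_duration_seconds` and a retry counter, as discussed above, would slot in as additional optional fields without breaking existing consumers of `duration_seconds`.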

@dphuang2 (Collaborator) commented:

[Screenshot 2025-12-16 at 4:15:07 PM]

This presentation is super busy/confusing. I don't really understand the r1:[0] at first glance. I know this isn't because of your PR, but the duplicated logs on every update are also really jarring.

@xzrderek We should follow up here

@mayinghan mayinghan requested a review from dphuang2 December 17, 2025 05:27
@mayinghan (Collaborator, Author) replied:

> but also the duplicated logs for every update is really jarring.

This is because I have a print statement in my evaluator code, which I think breaks tqdm's progress bar.
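For reference, tqdm's documented workaround for this is `tqdm.write`, which clears the bar, prints the message, and redraws the bar, so output from inside the loop doesn't leave duplicated bar lines. A minimal sketch (the `log_lines` helper is hypothetical; `io.StringIO` is used only so the output is capturable):

```python
import io
from tqdm import tqdm

def log_lines(messages, file=None):
    # print() inside a tqdm loop redraws over the bar and duplicates lines;
    # tqdm.write() is the safe alternative.
    out = file or io.StringIO()
    for msg in tqdm(messages, disable=True):  # disabled bar for the sketch
        tqdm.write(msg, file=out)
    return out

buf = log_lines(["rollout 0 done", "rollout 1 done"])
```

In the evaluator code discussed above, replacing the bare `print` with `tqdm.write` (or routing logs through a handler that does so) would stop the bar from being broken on every update.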

@mayinghan mayinghan merged commit 2e19cf7 into main Dec 17, 2025
17 checks passed
@mayinghan mayinghan deleted the record-timecost branch December 17, 2025 06:20