Record timecost and add tqdm progress bar for rollout scheduler #378
Conversation
```python
    evaluation_test_kwargs=self.evaluation_test_kwargs,
    processed_row=rows_to_eval,
)
eval_duration = time.perf_counter() - start_time
```
Bug: Eval duration includes semaphore wait time, inflating metrics

The `start_time` for measuring `eval_duration_seconds` is set at line 157, before the semaphore `self.eval_sem` is acquired at line 160, so the reported duration includes any time spent waiting for the semaphore, not just the actual evaluation processing time. When there's contention for evaluations (multiple evals queued), the reported `eval_duration_seconds` will be significantly inflated, making the metric unreliable. The field description states "Processing duration in seconds for the evaluation", which implies actual processing time only. The `start_time` assignment should be moved inside the `async with self.eval_sem:` block so that only the evaluation itself is measured.
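A minimal standalone sketch of why the placement matters, assuming nothing beyond `asyncio` and `time` (the semaphore name `self.eval_sem` comes from the report; everything else here is illustrative):

```python
import asyncio
import time

async def evaluate(sem: asyncio.Semaphore) -> float:
    async with sem:
        # Start the clock only after the semaphore is acquired, so queue
        # wait time under contention is excluded from the reported duration.
        start_time = time.perf_counter()
        await asyncio.sleep(0.1)  # stand-in for the actual evaluation call
        return time.perf_counter() - start_time

async def main() -> None:
    sem = asyncio.Semaphore(2)  # only two evals run concurrently
    durations = await asyncio.gather(*(evaluate(sem) for _ in range(6)))
    # Each reported duration stays ~0.1s even though later tasks queued
    # behind the semaphore; timing before `async with` would inflate them.
    print([round(d, 2) for d in durations])

asyncio.run(main())
```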
this is expected
```python
# Calculate totals for progress bars
total_rollouts = len(dataset) * num_runs
# In pointwise mode: 1 eval per rollout; in groupwise mode: 1 eval per dataset row
total_evals = total_rollouts if self.mode == "pointwise" else len(dataset)
```
Bug: Progress bar total uses wrong variable for rollout count

The progress bar total calculation uses the `num_runs` parameter from `run()` (line 392), but all actual rollout scheduling logic uses `self.rollout_n` from the constructor (lines 109, 298, 308, 309). The `self.num_runs` assignment at line 389 is never used anywhere else. If a caller creates a `PriorityRolloutScheduler` with a `rollout_n` value different from the `num_runs` passed to `run()`, the progress bar will display incorrect totals, potentially showing 100% completion before all rollouts finish or not reaching 100% when all work is done.
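A sketch of the suggested fix, assuming the scheduler exposes the constructor value as `self.rollout_n` (as the report says the scheduling logic does):

```python
# Use the same value the scheduling logic uses (self.rollout_n from the
# constructor) rather than the num_runs argument passed to run(), so the
# progress-bar totals match the work actually scheduled.
total_rollouts = len(dataset) * self.rollout_n
total_evals = total_rollouts if self.mode == "pointwise" else len(dataset)
```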
```diff
 duration_seconds: Optional[float] = Field(
     default=None,
-    description="Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
+    description="[Deprecated] Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.",
```
gotcha, thanks!
| description="[Deprecated] Processing duration in seconds for this evaluation row. Note that if it gets retried, this will be the duration of the last attempt.", | ||
| ) | ||
|
|
||
| rollout_duration_seconds: Optional[float] = Field( |
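For context, the trio of duration fields plausibly looks something like the sketch below (the model name is hypothetical; the `eval_duration_seconds` description is quoted in the Bugbot comment above, while the `rollout_duration_seconds` wording is assumed):

```python
from typing import Optional
from pydantic import BaseModel, Field

class ExecutionMetadata(BaseModel):  # hypothetical model name
    duration_seconds: Optional[float] = Field(
        default=None,
        description=(
            "[Deprecated] Processing duration in seconds for this evaluation "
            "row. Note that if it gets retried, this will be the duration of "
            "the last attempt."
        ),
    )
    rollout_duration_seconds: Optional[float] = Field(
        default=None,
        description="Processing duration in seconds for the rollout.",  # assumed wording
    )
    eval_duration_seconds: Optional[float] = Field(
        default=None,
        # This description is quoted in the review comment above.
        description="Processing duration in seconds for the evaluation",
    )
```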
Regarding retries, I think it would still be valuable to track `total_duration_seconds` so that people can get a sense of wall-clock time for this row. This can be helpful in the UI as well.
follow up PR work though
makes sense, i think we should track the number of retries as well. for duration we probably still only count the last successful run. for failures i think the failure reason matters more
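If that follow-up lands, the extra fields might look something like this sketch (`total_duration_seconds` is the name proposed above; `num_retries` and both descriptions are purely illustrative, and none of this is in the PR):

```python
num_retries: Optional[int] = Field(  # hypothetical field name
    default=None,
    description="Number of retry attempts before the final outcome.",
)
total_duration_seconds: Optional[float] = Field(  # proposed above, not yet implemented
    default=None,
    description="Wall-clock duration in seconds across all attempts, including retries.",
)
```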
this presentation is super busy/confusing. I don't really understand the

@xzrderek We should follow up here
this is because i have a print statement in my evaluator code which will break `tqdm`'s progress bar i think
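One common workaround is to route logging through `tqdm.write`, which clears the bar, emits the line, and redraws the bar; a generic example, not code from this PR:

```python
import time

from tqdm import tqdm

for i in tqdm(range(5), desc="rollouts"):
    time.sleep(0.2)
    # A bare print() here would leave broken bar fragments in the terminal;
    # tqdm.write() prints the line cleanly and then redraws the bar.
    tqdm.write(f"evaluator log line for step {i}")
```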

Note
Split execution timing into rollout vs eval durations and add rich tqdm progress bars with activity tracking to the priority scheduler; deprecate duration_seconds and update processors/tests/UI accordingly.
- Add `execution_metadata.rollout_duration_seconds` and `eval_duration_seconds`; deprecate `duration_seconds`.
- Processors (`tinker`, `mcp`, `remote`, `github_action`, `openenv`, `default_*`): use `rollout_duration_seconds` instead of deprecated field.
- Scheduler: `tqdm` progress bars for rollouts and evals, active task tracking/postfix, and per-eval `eval_duration_seconds` capture.
- Add `disable_tqdm`; propagate to retry wrapper.
- Tests: accept `**kwargs`; adjust worker scaling expectation; add concurrency assertions.
- UI: display `execution_metadata.rollout_duration_seconds`.

Written by Cursor Bugbot for commit e7d638d. This will update automatically on new commits. Configure here.
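A hypothetical usage sketch based only on names that appear in this thread (`PriorityRolloutScheduler`, `rollout_n`, `disable_tqdm`, `run()`); the real constructor signature may differ:

```python
# Inside an async context; all parameter names come from this PR thread,
# and the actual PriorityRolloutScheduler signature may differ.
scheduler = PriorityRolloutScheduler(
    rollout_n=4,         # rollouts per dataset row
    disable_tqdm=True,   # suppress progress bars, e.g. in CI logs
)
results = await scheduler.run(dataset)
```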