
fix: backfill rollout status fields from logs when polling completes #451

Merged
SunnySoldier357 merged 2 commits into main from sandeep/ep-status-polling-fix
May 7, 2026
@SunnySoldier357 SunnySoldier357 commented May 6, 2026

Summary

Fireworks Tracing Tests is red on main and on every PR — see run 25409077579. The failure is a regression introduced by #446:

ERROR:root:❌ Rollout failed (non-retryable error encountered): InternalError()
    assert row.rollout_status.message == "test error"
E   AssertionError: assert '' == 'test error'

The lightweight /status endpoint on the tracing gateway is a point-read on the Status Spanner table, which stores only RolloutId, AccountId, and StatusCode. Message, Details, and Extras still live exclusively on the Logs table. After #446 dropped the /logs backfill (commit 20b0f23), the SDK constructed Status(code=..., message="", details=[]) on every completed rollout and EvalProtocolError(message="") on every failure, which is what the propagate-status integration test catches.

This PR restores the two-phase polling shape from the original design of #446:

  1. Poll /status for the status code (cheap point-read, runs every poll_interval).
  2. On a terminal (non-RUNNING) code, do exactly one async_search_logs call to backfill message / details / extras from the matching log row.

That's still ~1000× cheaper on the Logs table than the pre-#446 polling loop, because the search runs once per rollout completion instead of every poll interval.
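
For readers skimming the description, the two-phase shape can be sketched roughly as follows. This is an illustrative sketch, not the code in remote_rollout_processor.py: `poll_then_backfill`, `fetch_status_code`, `backfill_from_logs`, the local `Status` dataclass, and the string status codes are all hypothetical stand-ins for the real /status read, the `async_search_logs` call, and the SDK's own types.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, List, Tuple

RUNNING = "RUNNING"  # stand-in for the real non-terminal status code


@dataclass
class Status:
    """Simplified stand-in for the SDK's status object."""
    code: str
    message: str = ""
    details: List[Any] = field(default_factory=list)


async def poll_then_backfill(
    rollout_id: str,
    fetch_status_code: Callable[[str], Awaitable[str]],
    backfill_from_logs: Callable[[str, str], Awaitable[Tuple[str, List[Any]]]],
    poll_interval: float = 1.0,
) -> Status:
    # Phase 1: cheap status read, repeated until the code is terminal.
    while True:
        code = await fetch_status_code(rollout_id)
        if code != RUNNING:
            break
        await asyncio.sleep(poll_interval)

    # Phase 2: exactly one logs lookup, only after a terminal code is seen.
    message, details = await backfill_from_logs(rollout_id, code)
    return Status(code=code, message=message, details=details)
```

The logs lookup sits outside the loop, so its cost scales with the number of completed rollouts rather than with the number of polls.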

The cleanest long-term fix is on the gateway side: extend either the Status Spanner schema (and spanner_reader.get_status) to also persist these fields, or have the /status handler do an internal Logs read on terminal codes and inline them into the response. Either change would let the SDK go back to a single read per completion. Filing a follow-up for that.

Test plan

  • Fireworks Tracing Tests workflow goes green on this branch.
  • test_remote_rollout_and_fetch_fireworks (happy path) still passes.
  • test_remote_rollout_and_fetch_fireworks_propagate_status sees Code.INTERNAL with message == "test error".

Made with Cursor


Note

Medium Risk
Changes terminal-status handling in RemoteRolloutProcessor by adding a one-time logs lookup and merging returned extras, which affects error propagation and could introduce edge cases if log entries are missing or mismatched.

Overview
Restores two-phase rollout polling in RemoteRolloutProcessor: continue polling /status for the status code, then on a terminal code perform a single async_search_logs query to backfill Status.message, Status.details, and execution extras.

When backfilling, it selects the log entry whose embedded status code matches the terminal /status code (avoiding intermediate RUNNING checkpoints) and filters out noisy extras keys before merging into row.execution_metadata.extra.
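
A minimal sketch of that selection and merge step, under stated assumptions: log entries are modeled here as plain dicts with `status` and `extras` keys, and `select_terminal_log`, `merge_extras`, and the contents of `NOISY_EXTRA_KEYS` are hypothetical names chosen for illustration. The actual log schema and filtered keys in the SDK may differ.

```python
from typing import Any, Dict, FrozenSet, Iterable, Optional

# Hypothetical examples of keys treated as noise; the real list may differ.
NOISY_EXTRA_KEYS: FrozenSet[str] = frozenset({"logger_name", "level"})


def select_terminal_log(
    logs: Iterable[Dict[str, Any]], terminal_code: str
) -> Optional[Dict[str, Any]]:
    """Return the first log entry whose embedded status code equals the
    terminal code reported by /status, skipping RUNNING checkpoints."""
    for entry in logs:
        status = entry.get("status") or {}
        if status.get("code") == terminal_code:
            return entry
    return None  # no matching row: caller keeps the empty message/details


def merge_extras(
    target: Dict[str, Any],
    entry: Dict[str, Any],
    noisy_keys: FrozenSet[str] = NOISY_EXTRA_KEYS,
) -> Dict[str, Any]:
    """Copy the entry's extras into target, dropping noisy keys."""
    for key, value in (entry.get("extras") or {}).items():
        if key not in noisy_keys:
            target[key] = value
    return target
```

Returning `None` when nothing matches keeps the missing-log case explicit, which is the edge case the risk note above calls out.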

Reviewed by Cursor Bugbot for commit 3298857.

@SunnySoldier357 SunnySoldier357 self-assigned this May 6, 2026
The lightweight `/status` endpoint on the tracing gateway only returns the
status code; `Message`, `Details`, and `Extras` still live on the Logs
table. After PR #446 stopped reading from `/logs` on terminal status, the
SDK was constructing `Status(code=..., message="", details=[])` for every
completed rollout and `EvalProtocolError(message="")` for failures, which
broke `tests/remote_server/test_remote_fireworks_propagate_status.py`
(`assert row.rollout_status.message == "test error"`).

Restore the two-phase polling shape from the original PR: poll `/status`
for the code, and on a terminal (non-RUNNING) code do one
`async_search_logs` call to backfill `message`/`details`/`extras` from
the matching log row. This is still ~1000x cheaper on the Logs table than
the pre-#446 polling loop because the search runs once per rollout
completion instead of every poll interval.

Made-with: Cursor
Co-authored-by: Cursor <cursoragent@cursor.com>
@SunnySoldier357 SunnySoldier357 force-pushed the sandeep/ep-status-polling-fix branch from 1a4b134 to 30f662f on May 6, 2026 00:13
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a4b13417d

Comment thread eval_protocol/pytest/remote_rollout_processor.py
Bugbot pointed out that the backfill loop could pick an earlier
RUNNING/partial status log instead of the terminal one when a rollout
emits multiple status-bearing logs. The reported `code` was always
correct (it came from /status), but `message`/`details`/`extras` could
be attached from the wrong row and the raised exception would carry
misleading text.

Match the log row's status code to the terminal code returned by
/status so the backfill is deterministic.

Made-with: Cursor
@SunnySoldier357 SunnySoldier357 requested a review from benjibc May 6, 2026 01:00
@SunnySoldier357 SunnySoldier357 merged commit 251ed86 into main May 7, 2026
17 checks passed
@SunnySoldier357 SunnySoldier357 deleted the sandeep/ep-status-polling-fix branch May 7, 2026 19:14