fix: backfill rollout status fields from logs when polling completes#451
Merged
SunnySoldier357 merged 2 commits intomainfrom May 7, 2026
Merged
fix: backfill rollout status fields from logs when polling completes#451SunnySoldier357 merged 2 commits intomainfrom
SunnySoldier357 merged 2 commits intomainfrom
Conversation
The lightweight `/status` endpoint on the tracing gateway only returns the status code; `Message`, `Details`, and `Extras` still live on the Logs table. After PR #446 stopped reading from `/logs` on terminal status, the SDK was constructing `Status(code=..., message="", details=[])` for every completed rollout and `EvalProtocolError(message="")` for failures, which broke `tests/remote_server/test_remote_fireworks_propagate_status.py` (`assert row.rollout_status.message == "test error"`). Restore the two-phase polling shape from the original PR: poll `/status` for the code, and on a terminal (non-RUNNING) code do one `async_search_logs` call to backfill `message`/`details`/`extras` from the matching log row. This is still ~1000x cheaper on the Logs table than the pre-#446 polling loop because the search runs once per rollout completion instead of every poll interval. Made-with: Cursor Co-authored-by: Cursor <cursoragent@cursor.com>
1a4b134 to
30f662f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a4b13417d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Bugbot pointed out that the backfill loop could pick an earlier RUNNING/partial status log instead of the terminal one when a rollout emits multiple status-bearing logs. The reported `code` was always correct (it came from /status), but `message`/`details`/`extras` could be attached from the wrong row and the raised exception would carry misleading text. Match the log row's status code to the terminal code returned by /status so the backfill is deterministic. Made-with: Cursor
benjibc
approved these changes
May 7, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fireworks Tracing Testsis red onmainand on every PR — see run 25409077579. The failure is a regression introduced by #446:The lightweight
/statusendpoint on the tracing gateway is a point-read on theStatusSpanner table, which only storesRolloutId,AccountId,StatusCode.Message,Details, andExtrasstill live exclusively on theLogstable. After #446 dropped the/logsbackfill (commit20b0f23), the SDK was constructingStatus(code=..., message="", details=[])on every completed rollout andEvalProtocolError(message="")on every failure — which is what the propagate-status integration test catches.This PR restores the two-phase polling shape from the original design of #446:
/statusfor the status code (cheap point-read, runs everypoll_interval).RUNNING) code, do exactly oneasync_search_logscall to backfillmessage/details/extrasfrom the matching log row.That's still ~1000× cheaper on the
Logstable than the pre-#446 polling loop, because the search runs once per rollout completion instead of every poll interval.The cleanest long-term fix is on the gateway side: extend either the
StatusSpanner schema (andspanner_reader.get_status) to also persist these fields, or have the/statushandler do an internalLogsread on terminal codes and inline them into the response. Either change would let the SDK go back to a single read per completion. Filing a follow-up for that.Test plan
Fireworks Tracing Testsworkflow goes green on this branch.test_remote_rollout_and_fetch_fireworks(happy path) still passes.test_remote_rollout_and_fetch_fireworks_propagate_statusseesCode.INTERNALwithmessage == "test error".Made with Cursor
Note
Medium Risk
Changes terminal-status handling in
RemoteRolloutProcessorby adding a one-time logs lookup and merging returned extras, which affects error propagation and could introduce edge cases if log entries are missing or mismatched.Overview
Restores two-phase rollout polling in
RemoteRolloutProcessor: continue polling/statusfor the status code, then on a terminal code perform a singleasync_search_logsquery to backfillStatus.message,Status.details, and executionextras.When backfilling, it selects the log entry whose embedded status
codematches the terminal/statuscode (avoiding intermediateRUNNINGcheckpoints) and filters out noisy extras keys before merging intorow.execution_metadata.extra.Reviewed by Cursor Bugbot for commit 3298857. Bugbot is set up for automated code reviews on this repo. Configure here.