chore: sync main with staging#53
Merged
Merged
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onse type Two issues surfaced in final code review: - run_agent's finally block would skip the complete event if create_message raised, leaving the SSE consumer hanging on queue.get(). - send_message was annotated -> Message but returns StreamingResponse, producing a wrong OpenAPI schema.
Adds INTERRUPTED alongside the existing MODEL_CALL_LIMIT addition so graceful-shutdown drain timeouts are distinguishable from real agent errors in queries and metrics. Both values land in the same not-yet-deployed migration rather than a new one. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Producer catches CancelledError, persists row as INTERRUPTED when no final_answer has been observed, and re-raises so the task ends cancelled. - Lifespan drain awaits cancelled tasks before continuing, so finally blocks and done callbacks run while loguru and the engine are still up. - _cleanup log messages reflect that cancellation persists a row. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two new tests in TestRunAgent: - cancellation before final_answer persists row as INTERRUPTED - cancellation after final_answer preserves SUCCESS status Uses asyncio.Event to deterministically synchronize on the "final answer processed" state instead of timing-based sleeps. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix: handle client disconnect
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make agent runs survive client disconnects, and persist a meaningful row when a graceful shutdown forces cancellation.
Decouple agent execution from event streaming: The route handler now spawns the agent run as a background task and returns a
StreamingResponsethat forwards events from anasyncio.Queue. If the client disconnects, only the consumer is cancelled — the producer keeps running and persists its result.Track in-flight runs and drain on shutdown: The lifespan registers each producer task in
app.state.running_runsand, on SIGTERM, waits up toSHUTDOWN_DRAIN_TIMEOUT_SECONDSfor them to finish naturally, then cancels and awaits the rest so theirfinallyblocks and done-callbacks complete before loguru and the engine are torn down.Persist interrupted runs distinctly: On cancellation, the producer catches
CancelledError, sets the user-facing INTERRUPTED message and a newMessageStatus.INTERRUPTED(only if nofinal_answerhas been observed — otherwise the real status is preserved), and re-raises. A dedicated status keeps shutdown-driven cancellations separable from real agent errors in queries and metrics.Other fixes and improvements along the way: Persistence failure now surfaces via
complete.error_details; exception strings no longer leak into client payloads;run_idis consistently typed asstracross boundaries; Createdmodel_call_limitevent type andMODEL_CALL_LIMITstatus to distinguish from a success final answer.Notable design choices
finallyreachesqueue.put(complete), the consumer hangs and the frontend eventually sees a timeout error. We deliberately don't synthesize acompletein_cleanupbecause it would mask the bug and hide real schema-drift errors.