
chore: sync main with staging #53

Merged
vrtornisiello merged 32 commits into main from staging
May 20, 2026

@vrtornisiello
Collaborator

Summary

Make agent runs survive client disconnects, and persist a meaningful row when a graceful shutdown forces cancellation.

  • Decouple agent execution from event streaming: The route handler now spawns the agent run as a background task and returns a StreamingResponse that forwards events from an asyncio.Queue. If the client disconnects, only the consumer is cancelled — the producer keeps running and persists its result.

  • Track in-flight runs and drain on shutdown: The lifespan registers each producer task in app.state.running_runs and, on SIGTERM, waits up to SHUTDOWN_DRAIN_TIMEOUT_SECONDS for them to finish naturally, then cancels and awaits the rest so their finally blocks and done-callbacks complete before loguru and the engine are torn down.

  • Persist interrupted runs distinctly: On cancellation, the producer catches CancelledError, sets the user-facing INTERRUPTED message and a new MessageStatus.INTERRUPTED (only if no final_answer has been observed — otherwise the real status is preserved), and re-raises. A dedicated status keeps shutdown-driven cancellations separable from real agent errors in queries and metrics.

  • Other fixes and improvements along the way: Persistence failures now surface via complete.error_details; exception strings no longer leak into client payloads; run_id is consistently typed as str across boundaries; a new model_call_limit event type and MODEL_CALL_LIMIT status distinguish hitting the model-call limit from a successful final answer.
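
The decoupling described in the first bullet can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `produce`, `consume`, `demo`, and the event dictionaries are hypothetical names, and the list append stands in for the real database write. In the actual route handler the generator would presumably be handed to a StreamingResponse.

```python
import asyncio


async def produce(queue: asyncio.Queue, results: list) -> None:
    """Hypothetical producer: emits events, then records its result."""
    try:
        for n in range(3):
            await asyncio.sleep(0)  # stand-in for one agent step
            await queue.put({"type": "step", "n": n})
    finally:
        results.append("persisted")  # stand-in for the DB write
        await queue.put({"type": "complete"})


async def consume(queue: asyncio.Queue):
    """Hypothetical consumer: forwards events until it sees 'complete'."""
    while True:
        event = await queue.get()
        yield event
        if event["type"] == "complete":
            break


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    running: set = set()  # plays the role of app.state.running_runs

    task = asyncio.create_task(produce(queue, results))
    running.add(task)
    task.add_done_callback(running.discard)

    # Simulate a client disconnect: read one event, then close the stream.
    stream = consume(queue)
    first = await stream.__anext__()
    await stream.aclose()

    # The producer is unaffected by the consumer going away.
    await task
    return first, results
```

Closing the consumer early does not touch the producer task, so the stand-in write still happens. Holding the task in a set also keeps a strong reference to it, which asyncio requires for fire-and-forget tasks.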

Notable design choices

  • Producer-crash hang is by design. If the producer dies before its finally reaches queue.put(complete), the consumer hangs and the frontend eventually sees a timeout error. We deliberately don't synthesize a complete in _cleanup because it would mask the bug and hide real schema-drift errors.
  • Producer opens its own DB session rather than reusing the request-scoped one, since it outlives the HTTP request.

vrtornisiello and others added 30 commits May 14, 2026 15:31
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onse type

Two issues surfaced in final code review:
- run_agent's finally block would skip the complete event if create_message
  raised, leaving the SSE consumer hanging on queue.get().
- send_message was annotated -> Message but returns StreamingResponse, producing
  a wrong OpenAPI schema.
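
One way to avoid the hang described in the first item is to guard the persistence call inside the finally block so the complete event is always queued. The sketch below is an assumption about the shape of the fix, not the merged code: `create_message` here is a stub, the `fail` flag exists only for the demonstration, and the `"persistence_failed"` value is a made-up placeholder for whatever complete.error_details actually carries.

```python
import asyncio


async def create_message(fail: bool) -> None:
    """Stub for the real persistence call; raises when asked to."""
    if fail:
        raise RuntimeError("db unavailable")


async def run_agent(queue: asyncio.Queue, fail_persist: bool) -> None:
    error_details = None
    try:
        await queue.put({"type": "final_answer"})
    finally:
        try:
            await create_message(fail_persist)
        except Exception:
            # Generic marker only; the exception string is not sent to clients.
            error_details = "persistence_failed"
        # Always reached, so the consumer never blocks forever on queue.get().
        await queue.put({"type": "complete", "error_details": error_details})


async def demo(fail_persist: bool) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    await run_agent(queue, fail_persist)
    events = []
    while not queue.empty():
        events.append(queue.get_nowait())
    return events
```

With `fail_persist=True` the last event is still a complete event, now carrying the error marker instead of being skipped.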
Adds INTERRUPTED alongside the existing MODEL_CALL_LIMIT addition so
graceful-shutdown drain timeouts are distinguishable from real agent
errors in queries and metrics. Both values land in the same
not-yet-deployed migration rather than a new one.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Producer catches CancelledError, persists row as INTERRUPTED when no
  final_answer has been observed, and re-raises so the task ends cancelled.
- Lifespan drain awaits cancelled tasks before continuing, so finally
  blocks and done callbacks run while loguru and the engine are still up.
- _cleanup log messages reflect that cancellation persists a row.
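
The cancellation and drain behaviour in the list above can be sketched as follows. This is a simplified model under stated assumptions: `drain`, the `row` dictionary, and the enum values are illustrative, the "status is still unset" check stands in for "no final_answer observed", and the timeout argument plays the role of SHUTDOWN_DRAIN_TIMEOUT_SECONDS.

```python
import asyncio
from enum import Enum


class MessageStatus(str, Enum):
    SUCCESS = "success"
    INTERRUPTED = "interrupted"


async def run_agent(row: dict, work_seconds: float) -> None:
    try:
        await asyncio.sleep(work_seconds)  # stand-in for the agent loop
        row["status"] = MessageStatus.SUCCESS
    except asyncio.CancelledError:
        # Only mark INTERRUPTED if no final status was recorded yet.
        if row.get("status") is None:
            row["status"] = MessageStatus.INTERRUPTED
        raise  # re-raise so the task still ends as cancelled
    finally:
        row["persisted"] = True  # stand-in for writing the row


async def drain(running: set, timeout: float) -> None:
    """Wait for tasks to finish naturally, then cancel and await the rest."""
    if not running:
        return
    _, pending = await asyncio.wait(running, timeout=timeout)
    for task in pending:
        task.cancel()
    # Awaiting here lets finally blocks and done-callbacks run before teardown.
    await asyncio.gather(*pending, return_exceptions=True)


async def demo():
    fast: dict = {}
    slow: dict = {}
    running = {
        asyncio.create_task(run_agent(fast, 0.01)),
        asyncio.create_task(run_agent(slow, 10)),
    }
    await drain(running, timeout=0.1)
    return fast, slow
```

In this model the fast run finishes inside the drain window and keeps its SUCCESS status, while the slow run is cancelled and recorded as INTERRUPTED. Both rows are written before `drain` returns.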

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two new tests in TestRunAgent:

- cancellation before final_answer persists row as INTERRUPTED
- cancellation after final_answer preserves SUCCESS status

Uses asyncio.Event to deterministically synchronize on the "final
answer processed" state instead of timing-based sleeps.
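
A rough sketch of the synchronization pattern these tests rely on, assuming a fake agent in place of the real run_agent: `fake_run`, `cancel_once_ready`, and the string statuses are hypothetical. The test waits on an asyncio.Event that the fake sets once it reaches the state of interest, then cancels, so no sleep-based timing is involved.

```python
import asyncio


async def fake_run(row: dict, ready: asyncio.Event, emit_final: bool) -> None:
    try:
        if emit_final:
            row["status"] = "SUCCESS"  # final answer has been processed
        ready.set()  # tell the test the target state is reached
        await asyncio.sleep(3600)  # pretend more work remains
    except asyncio.CancelledError:
        # Keep an existing status; otherwise record the interruption.
        row.setdefault("status", "INTERRUPTED")
        raise


async def cancel_once_ready(emit_final: bool) -> str:
    row: dict = {}
    ready = asyncio.Event()
    task = asyncio.create_task(fake_run(row, ready, emit_final))

    await ready.wait()  # deterministic: no timing-based sleeps
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return row["status"]
```

Calling `asyncio.run(cancel_once_ready(False))` exercises the "cancelled before final answer" path, and passing `True` exercises the "cancelled after final answer" path.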

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@vrtornisiello vrtornisiello merged commit 836fc3a into main May 20, 2026
2 checks passed