refactor(adk): replace TurnLoop with push-based API #835
Merged
shentongmartin merged 53 commits into alpha/09 on Mar 26, 2026
Codecov Report
❌ Patch coverage is …
Coverage diff for #835 against alpha/09:
- Coverage: 81.21%
- Files: 158
- Lines: 18838
- Branches: 0
- Hits: 15300
- Misses: 2415
- Partials: 1123
shentongmartin
commented
Mar 5, 2026
shentongmartin added a commit that referenced this pull request on Mar 6, 2026

- Simplify cancel state machine: consolidate markCompleted/markInterrupted/markError into single markDone, remove stateInterrupted and stateError
- Remove ErrExecutionInterrupted and ErrExecutionFailed sentinels, use ErrExecutionCompleted for all terminal states
- Add cancelMonitoredToolHandler to wrap streamable tool streams with cancel monitoring via cancelContextKey (ErrStreamCancelled on CancelImmediate)
- Rename turnLoopCancelSig -> turnLoopStopSig, align Stop/cancel terminology
- Remove unused ErrTurnLoopAlreadyStopped, keep Push returning bool
- Add context cancellation monitoring in TurnLoop.run() to unblock Receive()
- Document Stop() degradation when agent doesn't support WithCancel
- Run agentCancelFunc in goroutine to avoid deadlock in stop path
- Add comprehensive tests for cancelMonitoredToolHandler, cancelContextKey, TurnLoop context cancellation, default error propagation, and callback tests

Change-Id: I4fbe23619378b584206790d99bf75f17150cd075
hi-pender
reviewed
Mar 19, 2026
hi-pender
reviewed
Mar 19, 2026
hi-pender
reviewed
Mar 20, 2026
hi-pender
reviewed
Mar 23, 2026
hi-pender
reviewed
Mar 23, 2026
hi-pender
reviewed
Mar 23, 2026
…drop bugs
- Convert preemptSignal from boolean paused to holdCount counter to support multiple independent holders (run loop + Push callers)
- Move currentTC/currentRunCtx onto preemptSignal struct under same mutex, replacing the separate RWMutex on TurnLoop
- Add holdAndGetTurn() for atomic hold+snapshot in pushWithStrategy, eliminating the TOCTOU race where strategy could observe a stale turn
- Run loop now brackets each turn with holdRunLoop()/endTurnAndUnhold() so the unconditional end-of-turn release no longer clobbers a Push caller's hold, fixing silent preempt signal drops
- Fix pending ack channels not being closed when holdCount reaches 0, preventing goroutine leaks
- Rename methods for clarity: pause->holdRunLoop, release->unholdRunLoop, signalWithAck->requestPreempt, waitIfPaused->waitForPreemptOrUnhold, check->receivePreempt, pauseAndGetTurn->holdAndGetTurn
- Add comprehensive doc comments on preemptSignal lifecycle
- Add 6 preemptSignal unit tests and 5 integration race-condition tests
Change-Id: I049c56806e42b227b0fbe263d4af9d16f65f60e6

…vent deadlock
Change-Id: If1c1d8ad60ac73225736478c07459f2639ea304c

… reset
- In the done case of runAgentAndHandleEvents select, add non-blocking check on preemptDone to handle the select race where done wins over preemptDone; previously this would leak the CancelError and incorrectly save a checkpoint instead of treating it as a preempt.
- Only save checkpoint when stopSig.isStopped(), not on arbitrary handleErr; generic errors (panics, LLM failures) are not resumable.
- Apply same fix in cleanup(): remove runErr != nil from shouldSaveCheckpoint condition.
- Extract resetLocked() helper to deduplicate the identical reset body across unholdRunLoop, endTurnAndUnhold, and drainAll.
Change-Id: Ie93a0243b9ec9d46be2155a0654f5fdf38a501ec

… tests
- Remove ExternalTurnState, PrepareResume, and per-turn CheckpointID generation
- Add CheckpointID to TurnLoopConfig for declarative checkpoint-based resume
- Auto-detect resume vs fresh start via tryLoadCheckpoint on Run()
- Delete stale checkpoints on clean exit to prevent stale resumption
- Add CheckPointDeleter optional interface in core package for explicit deletion
- Add 15 new tests covering all checkpoint/resume edge cases: tryLoadCheckpoint paths, cleanup save/delete paths, GenResume error handling, ResumeWithParams, stale checkpoint deletion via context cancellation, etc.
Change-Id: Ic0c7bded8da11229a2816c22fbef3ab42ce74a85
…sed signals
- compose/graph_manager.go: receiveWithListening now returns immediately when cancel channel is closed, preventing goroutine hang on subsequent receives from a closed channel (previously fell through to unreachable break)
- adk/turn_loop.go: watchStopSignal adds a dedicated case for l.stopSig.done to ensure stop signals are not missed on subsequent turns when the notify channel was already drained
- adk/turn_loop_test.go: wrap iteration loops in t.Run subtests for better isolation and diagnostics; remove unused result assertions
Change-Id: If6479185bb3984ab618b2414f711433bfb4a1c41
…gents
- Add cancel-at-transition checks in sequential, loop, and parallel workflows. Transition boundaries are unconditionally safe: any cancel mode fires
- Track child cancel contexts with activeChildren counter for grace period
- Wrap graph interrupt with grace period when children are active
- Defensive copy in wrapGraphInterruptWithGracePeriod to prevent slice aliasing
- Add cancelAsync helper eliminating time.Sleep-based test synchronization
- Add comprehensive tests for transition boundaries, resume, multi-level nesting, custom cancel-unaware agents, and grace period fallback
Change-Id: I1da9fbb365cfc45a7c5db7a26c1c18af93cbfa2a

…shWithStrategy
Move setTurn() before runner.Run()/ResumeWithParams() so that holdAndGetTurn() always sees a non-nil TurnContext when a turn is active. Previously, the agent's goroutine could signal agentStarted before the run loop set currentTC, causing PushStrategy callbacks to receive nil.
Change-Id: I623ba5104a9ca9600cff3bce232e41cd1b1380ef

…double-fire of interrupt functions
Change-Id: If20978ee5836450e9811d0c82de9ea43e9dc4f97

…th concurrent preempts
Change-Id: Ibbcabb3349c42ee334191179cc076c0a5dcacc71
hi-pender
approved these changes
Mar 26, 2026
# Unified Cancel State Machine, Eager Receive, and Composite Agent Cancel Support

## Problem

The base `feat/agent_turn_loop` branch introduced `CancellableAgent` with `RunWithCancel`/`ResumeWithCancel` and TurnLoop with `NewTurnLoop` + `Run`. Three issues remained:

- Cancel was interface-coupled and only worked for ChatModelAgent. `CancellableAgent` required each agent type to implement `RunWithCancel`/`ResumeWithCancel` individually. Composite agents (sequential, loop, parallel, supervisor, planexecute) didn't implement it, so cancel was unavailable in complex workflows.
- Safe-point cancels were sentinel errors that broke checkpoint/resume. `CancelAfterChatModel` and `CancelAfterToolCalls` propagated as regular errors. The compose layer couldn't distinguish them from real failures, so no checkpoint was saved and targeted resume after a safe-point cancel was impossible.
- TurnLoop event delivery was lazy and preemption had race windows. Events buffered internally before reaching the caller. The `cancelSig`-based cancel wrapper in `cancel_wrapper.go` used model/tool wrapping to intercept cancel, which was fragile and didn't compose well across agent boundaries.

Additionally, in compose graphs, parent and sub-graph task managers share a cancel channel: if a sub-graph consumed the cancel value first, the parent lost `FromGraphInterrupt=true`.

`TurnContext.Preempted`/`TurnContext.Stopped` were closed after calling cancel, but that call's options might not have been included in the `CancelError` observed by `OnAgentEvents` (a race between cancel calls and cancel finalization).

## Solution

### Replace `CancellableAgent` interface with `WithCancel` option

Cancel is now an `AgentRunOption` instead of a separate interface. A `cancelContext` state machine tracks execution state.

The outer `flowAgent` owns the cancel lifecycle. Inner agents access the `cancelContext` via Go context (`getCancelContext(ctx)`); `filterCancelOption` strips the cancel option from nested calls to prevent double-ownership. This replaces `cancel_wrapper.go` entirely.

### Extend cancel to all composite agents

Three bugs prevented cancel from working with composite agents:

- `flowAgent` called `markDone()` immediately for the `workflowAgent` path, causing `AgentCancelFunc` to return `ErrExecutionCompleted` before the workflow finished.
- The `flowAgent` wrapper called `markDone()` on completion, prematurely closing `doneChan` for subsequent sub-agents.
- Parallel agents tracked only one `graphInterruptFunc`, so cancel only interrupted the last branch.

Fix: only the outermost `flowAgent` owns `markDone()`. Inner agents access `cancelCtx` via context but never call `markDone()`. Parallel agents use a slice of interrupt functions so all branches get interrupted.

### Safe-point semantics: transition boundaries ARE safe for all cancel modes

A key behavior decision: workflow transition boundaries (between sub-agents, between loop iterations, at transfer points, and at parallel pre-spawn) are safe for all cancel modes because no sub-agent work is in progress at a boundary. Any cancel mode fires unconditionally at transition boundaries; there is no reason to delay cancellation. `CancelAfterChatModel` and `CancelAfterToolCalls` are honored at boundaries because no model or tool call is running. `CancelImmediate` additionally has a grace period wrapper (1 s by default) that gives child agents time to checkpoint before abort.

### Cancel at Workflow Transition Boundaries

Sequential, loop, and parallel workflows now check for cancel at each transition boundary. Since no sub-agent work is in progress at a boundary, all cancel modes (including `CancelAfterChatModel` and `CancelAfterToolCalls`) are honored. `CancelImmediate` additionally has a grace period wrapper (1 s by default) that gives child agents time to interrupt/checkpoint before force-abort.

### Safe-point cancels via `compose.Interrupt`

Safe-point cancels now emit `compose.Interrupt` with typed `cancelSafePointInfo` instead of a sentinel error. This makes compose save checkpoint data automatically, enabling `Runner.ResumeWithParams` after a safe-point cancel. The runner detects `CancelError` and populates `InterruptContexts` from the interrupt signal.

### Eager receive pattern

Events from agent execution are consumed in real time rather than buffered. The react graph has an explicit `CancelCheck` lambda node after ChatModel for deterministic safe-point handling. Tool streams are wrapped by `cancelMonitoredToolHandler`: on `CancelImmediate`, the stream terminates with `ErrStreamCancelled` (a concrete `*StreamCancelledError` registered with gob so it survives checkpoint serialization).

New `UnboundedChan` methods (`TrySend`, `TakeAll`, `PushFront`) support this: preemption recovers unprocessed items via `PushFront`, and `TakeAll` enables batch drain.

### Permissive TurnLoop API

The TurnLoop API is refined so all methods work before `Run` is called: `Push` buffers items, `Stop` sets a flag so `Run` exits immediately, `Wait` blocks until `Run` completes. Preemption (`WithPreempt()`) is gate-scoped to the active turn.

### Per-turn context override for tracing

`GenInputResult.RunCtx` allows each turn to override the execution context used by `PrepareAgent`, the agent run/resume, and `OnAgentEvents`. This enables a pushed item to attach trace metadata (e.g., trace/span IDs) to the exact agent execution that processes it. The override must be derived from the TurnLoop's run context to preserve cancellation semantics.

### FromGraphInterrupt propagation

`resolveInterruptCompletedTasks` now propagates `FromGraphInterrupt` upward from sub-graph interrupt info, so the parent graph correctly identifies cancel-triggered interrupts.

### Preempt acknowledgment channel (new in this update)

`Push()` now returns an acknowledgment channel when used with `WithPreempt()`. This allows callers to wait for the preempt signal to be acknowledged before proceeding.

The acknowledgment channel is closed when:
This eliminates race conditions between pushing urgent items and checking preempt status.
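
As a rough illustration of the acknowledgment idea, the sketch below models a push that returns a channel which is closed once the run loop has observed the preempt request. It is a minimal stand-in under stated assumptions, not the ADK implementation: `miniLoop`, its `Push(item, preempt)` signature, and `acknowledge` are hypothetical names invented for this example, and the real closing conditions may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// miniLoop is a hypothetical stand-in for TurnLoop; it is not the ADK type.
type miniLoop struct {
	mu      sync.Mutex
	queue   []string
	pending []chan struct{} // ack channels waiting for the run loop
}

// Push enqueues an item. With preempt=true the item jumps the queue and the
// caller gets a channel that is closed once the preempt has been observed.
func (l *miniLoop) Push(item string, preempt bool) <-chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !preempt {
		l.queue = append(l.queue, item)
		return nil
	}
	l.queue = append([]string{item}, l.queue...)
	ack := make(chan struct{})
	l.pending = append(l.pending, ack)
	return ack
}

// acknowledge models the run loop reaching a preemption point: it closes
// every pending ack channel and hands back the next item to process.
func (l *miniLoop) acknowledge() (string, bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, ack := range l.pending {
		close(ack)
	}
	l.pending = nil
	if len(l.queue) == 0 {
		return "", false
	}
	next := l.queue[0]
	l.queue = l.queue[1:]
	return next, true
}

func main() {
	l := &miniLoop{}
	l.Push("routine", false)
	ack := l.Push("urgent", true)

	go l.acknowledge() // the run loop reaches a preemption point
	<-ack              // the caller blocks until the preempt is acknowledged
	fmt.Println("preempt acknowledged")
}
```

Waiting on the returned channel replaces polling a shared "preempted" flag, which is where the race between pushing and checking would otherwise sit.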

### CancelHandle for async cancel operations (new in this update)

`AgentCancelFunc` returns a `CancelHandle` and a `contributed` bool. This allows the cancel operation to be started asynchronously while still providing a way to wait for completion and check the outcome (`ErrCancelTimeout`, `ErrExecutionCompleted`). The `contributed` bool is used by TurnLoop to provide strict semantics for turn-level signals.

### Strict contributed semantics for TurnContext.Preempted/Stopped (new in this update)

TurnLoop now closes `TurnContext.Preempted`/`TurnContext.Stopped` only when the corresponding cancel call actually contributed to the `CancelError` for the current turn. This prevents a race where cancellation was requested but the cancel error was already created/finalized by another path.

Internally, `createCancelError()` and `markCancelHandled()` are synchronized under `cancelMu` via an atomic helper, so a concurrent cancel call deterministically reports `contributed=true` or `false`.

### TurnLoop.Resume: resume a stopped loop from checkpoint (new in this update)

`TurnLoop` now has a `Resume` method that restarts execution from a previously saved checkpoint rather than from scratch.

When `Resume` is called, TurnLoop loads the checkpoint from `Store`, reconstructs the `pendingResume` payload, and on the first iteration resumes the interrupted agent turn (via `Runner.ResumeWithParams`) instead of running a new one. `TurnLoopExitState` now carries `CheckPointID`, `CanceledItems`, and `UnhandledItems` so callers have all the data needed for the next `Resume` without querying the store directly.

### ExternalTurnState mode: decouple checkpoint from TurnLoop internals (new in this update)

Setting `TurnLoopConfig.ExternalTurnState = true` shifts ownership of turn-level checkpoint data from TurnLoop's internal `Store` to the caller:

- `ExternalTurnState=false` (default): … `Store` …
- `ExternalTurnState=true`: … `CheckPointID`; leaves persistence to caller; … `WithExternalResumeItems`

In `ExternalTurnState` mode, `Resume` must be given the canceled and unhandled items explicitly. This enables callers who manage their own persistence layer to drive TurnLoop resume without a separate `Store`.

### GenResume callback: reconstruct the input queue on resume (new in this update)

`TurnLoopConfig.GenResume` is the resume-time counterpart to `GenInput`. It is called exactly once on the first iteration of a resumed loop to let callers reconcile items that were in flight when the loop stopped.

`GenResumeResult.Consumed` items are passed to `PrepareAgent` and used to resume the agent turn. `GenResumeResult.Remaining` items are pushed back to the front of the queue. Items in neither list are dropped. This hook replaces `GenInput` entirely for the first resume iteration.

### Stop escalation via AgentCancelOptions (new in this update)

`Stop()` now accepts `AgentCancelOption` varargs so callers can simultaneously signal loop exit and cancel the in-flight agent.

`turnLoopStopSig` stores the cancel opts alongside a generation counter. The `watchStopSignal` goroutine fires `agentCancelFunc` each time the generation increases, making multiple escalating `Stop` calls work correctly without races.

### turnLoopStopSig refactor: single-use done → repeatable notify (new in this update)

Previously `turnLoopStopSig` used a single `done chan struct{}` that was closed on the first `Stop` call. This prevented multiple `Stop` calls from delivering different cancel options to the running agent.

The struct now separates two concerns:

- `done` is still closed exactly once (when TurnLoop's main loop exits), readable via `isStopped()`.
- A `notify` channel plus generation counter allows `signal()` to be called multiple times without blocking and without losing the at-least-one-signal guarantee.

### TurnLoop API simplification (new in this update)

Three changes simplify the TurnLoop public API by removing error returns that callers never meaningfully handle and splitting `Resume` into a failable preparation step and a non-failable execution step:

#### 1. `NewTurnLoop` returns `*TurnLoop[T]` instead of `(*TurnLoop[T], error)`

The only validation `NewTurnLoop` performs is checking that `GenInput` and `PrepareAgent` are non-nil. These are programming errors (always known at compile/init time), not runtime failures, so a panic is more appropriate than an error return. This eliminates boilerplate `if err != nil` blocks at every construction site.

#### 2. `Run()` returns nothing instead of `error`

Previously `Run()` returned an error when called on an already-running or already-finished loop. Since `Run` is a lifecycle method that starts the main loop asynchronously, duplicate calls are now silently treated as no-ops. Callers retrieve the final outcome via `Wait()`, which already returns `TurnLoopExitState`.

#### 3. `Resume()` split into `PrepareResume()` + `Run()`

`Resume` previously combined checkpoint loading/validation (which can fail) with starting the loop (which cannot meaningfully fail). These are now separate steps:

- `PrepareResume(ctx, checkPointID, newItems, ...opts) error`: loads the checkpoint from `Store`, validates items, and stages the resume payload. Returns an error if the checkpoint cannot be loaded or validation fails.
- `Run(ctx)`: starts the loop (works for both fresh starts and prepared resumes).

This separation lets callers handle checkpoint errors before committing to the loop lifecycle, and keeps `Run` as a single uniform entry point for both fresh and resumed loops.

## Key Insight

By making cancel an option (`WithCancel`) instead of an interface (`CancellableAgent`), the cancel lifecycle becomes orthogonal to agent implementation. A single `flowAgent` drives the state machine regardless of how many agents are nested: inner agents observe cancellation through context, and safe-point cancels become regular compose interrupts that the checkpoint system already knows how to persist. All cancel modes fire unconditionally at workflow transition boundaries (where no sub-agent work is in progress), and `CancelImmediate` adds a grace period so child agents can checkpoint before force-abort.

The addition of preempt acknowledgment channels, async `CancelHandle` waiting, and strict contributed semantics removes race windows in preemption/stop signalling while preserving precise `CancelError` attribution.

## Summary

| Problem | Solution |
| --- | --- |
| `CancellableAgent` interface per agent type | `WithCancel` as `AgentRunOption` with `cancelContext` state machine |
| … | `flowAgent` owns lifecycle; inner agents via context; parallel agents use interrupt func slice |
| … | `compose.Interrupt` with typed `cancelSafePointInfo`; checkpoint saved automatically |
| … | `CancelCheck` node; `cancelMonitoredToolHandler` for streams |
| … | `GenInputResult.RunCtx` overrides per-turn ctx for `PrepareAgent`/run/`OnAgentEvents` |
| … | `UnboundedChan.TrySend`/`TakeAll`/`PushFront` methods |
| … `Run` | `Push` buffers, `Stop` sets flag, `Wait` blocks until `Run` completes |
| … `FromGraphInterrupt` in shared cancel channel | … `resolveInterruptCompletedTasks` |
| … | `*StreamCancelledError` gob-registered type; `agentEventWrapper.GobEncode` consumes unconsumed streams |
| … | … `sendInterrupt` and `setGraphInterruptFunc` |
| … | … `Push(WithPreempt())` |
| … | `CancelHandle.Wait()` enables async cancel and explicit outcome checking |
| … | `contributed` gating on `Preempted`/`Stopped` channel closing |
| … | `TurnLoop.Resume(...)` loads checkpoint and resumes first turn |
| … `Store` instance | `ExternalTurnState=true` + `WithExternalResumeItems` decouples storage from TurnLoop |
| … | `GenResume` callback + `GenResumeResult.Consumed`/`Remaining` |
| `Stop()` can only signal exit, not cancel the running agent | `Stop(WithAgentCancelMode(...))` + `watchStopSignal` per-generation firing |
| `turnLoopStopSig` done channel was single-use, prevented escalation | `notify` channel + generation counter separate permanent-stop from per-signal cancel |
| `NewTurnLoop` returns error for programming mistakes (nil callbacks) | `NewTurnLoop` returns `*TurnLoop[T]` directly; panics on nil `GenInput`/`PrepareAgent` |
| `Run()` returns error on duplicate calls that callers must handle | `Run()` returns nothing; duplicate calls are no-op; outcome via `Wait()` |
| `Resume()` mixes failable checkpoint loading with non-failable loop start | `PrepareResume()` returns error for checkpoint/validation; `Run()` starts the loop |
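
The split between a close-once `done` channel and a repeatable `notify` channel with a generation counter, described under the `turnLoopStopSig` refactor, can be sketched as follows. This is a simplified model under stated assumptions, not the actual struct: the type `stopSig`, its fields, the `watch` method, and the string cancel modes are hypothetical, and the real `watchStopSignal` handles more cases.

```go
package main

import (
	"fmt"
	"sync"
)

// stopSig is a hypothetical reduction of the two-concern design:
// done is permanent, notify is a repeatable wake-up.
type stopSig struct {
	mu     sync.Mutex
	gen    uint64        // incremented on every signal
	mode   string        // cancel mode carried by the latest signal
	notify chan struct{} // capacity 1: wake-ups coalesce and never block
	done   chan struct{} // closed exactly once, when the loop exits
	once   sync.Once
}

func newStopSig() *stopSig {
	return &stopSig{notify: make(chan struct{}, 1), done: make(chan struct{})}
}

// signal may be called any number of times without blocking.
func (s *stopSig) signal(mode string) {
	s.mu.Lock()
	s.gen++
	s.mode = mode
	s.mu.Unlock()
	select {
	case s.notify <- struct{}{}:
	default: // a wake-up is already pending; the watcher will read the new generation
	}
}

func (s *stopSig) snapshot() (uint64, string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.gen, s.mode
}

func (s *stopSig) markDone() { s.once.Do(func() { close(s.done) }) }

func (s *stopSig) isStopped() bool {
	select {
	case <-s.done:
		return true
	default:
		return false
	}
}

// watch fires onCancel once per observed generation increase and returns
// when done is closed.
func (s *stopSig) watch(onCancel func(mode string)) {
	var seen uint64
	for {
		select {
		case <-s.notify:
			if gen, mode := s.snapshot(); gen > seen {
				seen = gen
				onCancel(mode)
			}
		case <-s.done:
			return
		}
	}
}

func main() {
	s := newStopSig()
	fired := make(chan string, 4)
	go s.watch(func(mode string) { fired <- mode })

	s.signal("after-tool-calls") // first, gentle stop request
	fmt.Println("cancel fired with:", <-fired)
	s.signal("immediate") // escalation on the same signal object
	fmt.Println("cancel fired with:", <-fired)

	s.markDone()
	fmt.Println("stopped:", s.isStopped())
}
```

With a capacity-1 `notify` channel, signals sent while a wake-up is already pending are coalesced, so the watcher is guaranteed at least one wake-up and always acts on the latest generation rather than on every individual call.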