Description
The fix in #4489 (d84d647) awaits the AutoSkill trace extraction JoinHandle at shutdown to prevent silent task cancellation. However, the await has no outer timeout, which can stall agent exit under adverse conditions.
Reproduction Steps
- Start an agent session with
trace_extraction_enabled = true and a slow/unavailable LLM provider.
- Exit the session (EOF or Ctrl-C).
- Observe: agent hangs in
core.agent.shutdown span without progressing to summary/digest steps.
Expected Behavior
Agent shutdown completes within a bounded wall time even if trace extraction is slow or hangs.
Actual Behavior
crates/zeph-core/src/agent/mod.rs:756:
if let Some(h) = self.services.learning_engine.trace_extraction_handle.take() {
let _ = h.await; // no timeout
}
run_extraction makes multiple LLM calls via TraceExtractor. While each individual LLM call has llm_timeout = 1 min (trace_extractor.rs:169), embed_existing iterates all existing skills sequentially — N skills × 1 min = N minutes of potential stall. The outer await in shutdown has no guard at all.
This violates the Await Discipline rule in .claude/rules/rust-code.md:
Every external .await MUST have a timeout.
Fix Suggestion
Wrap with tokio::time::timeout:
if let Some(h) = self.services.learning_engine.trace_extraction_handle.take() {
let deadline = Duration::from_secs(120); // 2 × llm_timeout
if tokio::time::timeout(deadline, h).await.is_err() {
tracing::warn!("trace_extraction: timed out at shutdown ({}s), aborting", deadline.as_secs());
}
}
Alternatively, abort the handle unconditionally (fire-and-forget remains fire-and-forget — the fix goal was logging awareness, not guaranteed completion).
Environment
- Version: d84d647
- Features: full (skills.learning.trace_extraction_enabled = true)
Logs / Evidence
Span core.agent.shutdown will stall between learning_tasks.abort_all() and yield_now() with no tracing output.
Description
The fix in #4489 (d84d647) awaits the AutoSkill trace extraction JoinHandle at shutdown to prevent silent task cancellation. However, the await has no outer timeout, which can stall agent exit under adverse conditions.
Reproduction Steps
trace_extraction_enabled = trueand a slow/unavailable LLM provider.core.agent.shutdownspan without progressing to summary/digest steps.Expected Behavior
Agent shutdown completes within a bounded wall time even if trace extraction is slow or hangs.
Actual Behavior
crates/zeph-core/src/agent/mod.rs:756:run_extractionmakes multiple LLM calls viaTraceExtractor. While each individual LLM call hasllm_timeout = 1 min(trace_extractor.rs:169),embed_existingiterates all existing skills sequentially — N skills × 1 min = N minutes of potential stall. The outer await in shutdown has no guard at all.This violates the Await Discipline rule in
.claude/rules/rust-code.md:Fix Suggestion
Wrap with
tokio::time::timeout:Alternatively, abort the handle unconditionally (fire-and-forget remains fire-and-forget — the fix goal was logging awareness, not guaranteed completion).
Environment
Logs / Evidence
Span
core.agent.shutdownwill stall betweenlearning_tasks.abort_all()andyield_now()with no tracing output.