Skip to content

fix(core): trace_extraction_handle.await at shutdown has no outer timeout — can stall agent exit #4500

@bug-ops

Description

@bug-ops

Description

The fix in #4489 (d84d647) awaits the AutoSkill trace extraction JoinHandle at shutdown to prevent silent task cancellation. However, the await has no outer timeout, which can stall agent exit under adverse conditions.

Reproduction Steps

  1. Start an agent session with trace_extraction_enabled = true and a slow/unavailable LLM provider.
  2. Exit the session (EOF or Ctrl-C).
  3. Observe: agent hangs in core.agent.shutdown span without progressing to summary/digest steps.

Expected Behavior

Agent shutdown completes within a bounded wall time even if trace extraction is slow or hangs.

Actual Behavior

crates/zeph-core/src/agent/mod.rs:756:

if let Some(h) = self.services.learning_engine.trace_extraction_handle.take() {
    let _ = h.await;  // no timeout
}

run_extraction makes multiple LLM calls via TraceExtractor. While each individual LLM call has llm_timeout = 1 min (trace_extractor.rs:169), embed_existing iterates all existing skills sequentially — N skills × 1 min = N minutes of potential stall. The outer await in shutdown has no guard at all.

This violates the Await Discipline rule in .claude/rules/rust-code.md:

Every external .await MUST have a timeout.

Fix Suggestion

Wrap with tokio::time::timeout:

if let Some(h) = self.services.learning_engine.trace_extraction_handle.take() {
    let deadline = Duration::from_secs(120); // 2 × llm_timeout
    if tokio::time::timeout(deadline, h).await.is_err() {
        tracing::warn!("trace_extraction: timed out at shutdown ({}s), aborting", deadline.as_secs());
    }
}

Alternatively, abort the handle unconditionally (fire-and-forget remains fire-and-forget — the fix goal was logging awareness, not guaranteed completion).

Environment

  • Version: d84d647
  • Features: full (skills.learning.trace_extraction_enabled = true)

Logs / Evidence

Span core.agent.shutdown will stall between learning_tasks.abort_all() and yield_now() with no tracing output.

Metadata

Metadata

Assignees

Labels

P3Research — medium-high complexitybugSomething isn't working

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions