
fix(core): add context-prep timeout and NoProviders backoff to agent loop #3373

Merged
bug-ops merged 1 commit into main from 3357-agent-tight-loop
Apr 24, 2026


@bug-ops bug-ops commented Apr 24, 2026

Summary

  • Add advance_context_lifecycle_guarded: wraps context preparation with tokio::time::timeout (default 30 s, configurable via [timeouts] context_prep_timeout_secs) to prevent 14+ second stalls when embed backends are rate-limited or unavailable
  • Add NoProviders backoff: after a NoProviders error, record the failure timestamp, sleep no_providers_backoff_secs (default 2 s), and skip context prep on the next turn while still within the backoff window
  • Add AgentError::is_no_providers() predicate used by the backoff guard
  • Remove TaskSupervisor BlockingSpawner attachment from CodeIndexer in agent_setup.rs to prevent flooding the async worker pool with 971+ concurrent chunk tasks during active agent turns
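
The backoff gate described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: only the names `AgentError`, `NoProviders`, `is_no_providers()`, and `LifecycleState` appear in the description, while the `Other` variant, the field `last_no_providers_at`, and the methods `note_error` and `should_skip_context_prep` are hypothetical. The sleep of `no_providers_backoff_secs` and the `tokio::time::timeout` wrapper are left out.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error type; only the `NoProviders` variant name comes from the PR.
#[derive(Debug)]
enum AgentError {
    NoProviders,
    Other(String),
}

impl AgentError {
    /// Predicate the backoff guard can call without matching on variants itself.
    fn is_no_providers(&self) -> bool {
        matches!(self, AgentError::NoProviders)
    }
}

/// Hypothetical state holder: remembers when providers last failed.
#[derive(Default)]
struct LifecycleState {
    last_no_providers_at: Option<Instant>,
}

impl LifecycleState {
    /// Record the failure time only when the error is `NoProviders`.
    fn note_error(&mut self, err: &AgentError, now: Instant) {
        if err.is_no_providers() {
            self.last_no_providers_at = Some(now);
        }
    }

    /// True while `now` is still inside the backoff window.
    fn should_skip_context_prep(&self, now: Instant, backoff: Duration) -> bool {
        match self.last_no_providers_at {
            Some(at) => now.saturating_duration_since(at) < backoff,
            None => false,
        }
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the gate, keeps the window check deterministic in unit tests.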

Root Cause (from investigation)

The "150+/s agent.turn" and "128-call prepare_context burst" reported in #3357 were async tracing artifacts: #[tracing::instrument] emits begin/end (B/E) events at every tokio poll boundary, so one slow call shows up as many short spans. The real issue was a single turn stalling for 14 seconds in advance_context_lifecycle because of 1006 embed calls against rate-limited or unavailable providers, compounded by 971 concurrent background indexer tasks saturating the tokio worker pool.
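
To illustrate why per-poll events can look like a burst of calls, here is a std-only sketch that counts how often a single future is polled. It does not use tokio or tracing; `YieldN`, `CountPolls`, `NoopWake`, `block_on`, and `span_enters_for_one_call` are hypothetical names, and `CountPolls` only stands in for a span that is entered once per poll.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that returns `Pending` `remaining` times before completing.
struct YieldN {
    remaining: u32,
}

impl Future for YieldN {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.remaining == 0 {
            return Poll::Ready(());
        }
        self.remaining -= 1;
        cx.waker().wake_by_ref(); // ask to be polled again
        Poll::Pending
    }
}

/// Stand-in for an instrumented span: every poll counts as one enter/exit pair.
struct CountPolls<F> {
    inner: Pin<Box<F>>,
    enters: u32,
}

impl<F: Future> Future for CountPolls<F> {
    type Output = (F::Output, u32);
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        self.enters += 1;
        match self.inner.as_mut().poll(cx) {
            Poll::Ready(out) => Poll::Ready((out, self.enters)),
            Poll::Pending => Poll::Pending,
        }
    }
}

/// Waker that does nothing; the executor below simply polls in a loop.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, enough for this demonstration only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// One logical call that yields `yields` times is entered `yields + 1` times.
fn span_enters_for_one_call(yields: u32) -> u32 {
    let counted = CountPolls {
        inner: Box::pin(YieldN { remaining: yields }),
        enters: 0,
    };
    let (_, enters) = block_on(counted);
    enters
}
```

In this model a single call that yields three times is entered four times, which is the shape of artifact described above: the event count tracks poll boundaries, not the number of calls.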

Test plan

  • 8372 unit tests pass (cargo nextest run --config-file .github/nextest.toml --workspace --lib --bins)
  • 7 new tests added: TimeoutConfig defaults/deserialization, LifecycleState backoff gate logic, is_no_providers() predicate
  • cargo +nightly fmt --check clean
  • cargo clippy --workspace -- -D warnings clean
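
A defaults test of the kind listed above might look like the following sketch. The struct name `TimeoutConfig`, the three field names, and the default values (600 s, 30 s, 2 s) come from this PR and its follow-up commit; the field types, the derive list, and the helper methods `context_prep_timeout` and `no_providers_backoff` are assumptions, and deserialization is not covered.

```rust
use std::time::Duration;

/// Assumed shape of the `[timeouts]` config section.
#[derive(Debug, Clone, PartialEq)]
struct TimeoutConfig {
    llm_request_timeout_secs: u64,
    context_prep_timeout_secs: u64,
    no_providers_backoff_secs: u64,
}

impl Default for TimeoutConfig {
    fn default() -> Self {
        Self {
            llm_request_timeout_secs: 600,
            context_prep_timeout_secs: 30,
            no_providers_backoff_secs: 2,
        }
    }
}

impl TimeoutConfig {
    /// Hypothetical helper: the context-prep limit as a `Duration`.
    fn context_prep_timeout(&self) -> Duration {
        Duration::from_secs(self.context_prep_timeout_secs)
    }

    /// Hypothetical helper: the NoProviders backoff window as a `Duration`.
    fn no_providers_backoff(&self) -> Duration {
        Duration::from_secs(self.no_providers_backoff_secs)
    }
}
```

Overriding a single field with struct update syntax, for example `TimeoutConfig { context_prep_timeout_secs: 5, ..Default::default() }`, leaves the other defaults in place.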

Closes #3357

@github-actions github-actions bot added labels on Apr 24, 2026: documentation (Improvements or additions to documentation), rust (Rust code changes), core (zeph-core crate), bug (Something isn't working), size/L (Large PR, 201-500 lines)
@bug-ops bug-ops force-pushed the 3357-agent-tight-loop branch from 4433a25 to 6ec2aff April 24, 2026 20:36
@bug-ops bug-ops enabled auto-merge (squash) April 24, 2026 20:37
…loop

When all LLM providers fail, the agent was stalling inside
advance_context_lifecycle for 14+ seconds (1006 embed calls against
rate-limited backends) and then immediately retrying the same expensive
path on every subsequent turn.

- Wrap advance_context_lifecycle with tokio::time::timeout via the new
  advance_context_lifecycle_guarded helper; configurable via
  [timeouts] context_prep_timeout_secs (default 30 s)
- After a NoProviders error, record the failure timestamp and sleep
  no_providers_backoff_secs (default 2 s); skip context prep on the
  next turn while still within the backoff window
- Add AgentError::is_no_providers() predicate used by the backoff guard
- Remove TaskSupervisor BlockingSpawner from CodeIndexer in agent_setup
  to prevent 971 concurrent chunk tasks from flooding the async worker
  pool during active agent turns

Closes #3357
@bug-ops bug-ops force-pushed the 3357-agent-tight-loop branch from 6ec2aff to fa4f301 April 24, 2026 20:43
@bug-ops bug-ops merged commit acdf7d6 into main Apr 24, 2026
32 checks passed
@bug-ops bug-ops deleted the 3357-agent-tight-loop branch April 24, 2026 20:51
bug-ops added a commit that referenced this pull request Apr 24, 2026
Add llm_request_timeout_secs (600 s), context_prep_timeout_secs (30 s),
and no_providers_backoff_secs (2 s) to the [timeouts] section with
descriptive comments. These fields were added in #3373 but omitted from
the reference config, making them invisible to migrate-config --diff and
to users reading the config file.

Closes #3377


Development

Successfully merging this pull request may close these issues.

perf: agent.turn tight loop (150+/s) and 128-call prepare_context burst when all LLM providers fail
