obs(orchestration): track planner/aggregator LLM calls in session metrics (#1899) #1925

Merged
bug-ops merged 5 commits into main from obs-orchestration-planner-aggr
Mar 16, 2026
bug-ops (Owner) commented Mar 16, 2026

Summary

  • LlmPlanner::plan() and LlmAggregator::aggregate() now return Option<(u64, u64)> (prompt_tokens, completion_tokens) alongside their existing result
  • Call sites in agent/mod.rs use the returned usage to increment api_calls, prompt_tokens, completion_tokens, total_tokens, cost, and cache stats in the shared MetricsCollector
  • tasks_skipped is now incremented in both the GraphStatus::Completed and GraphStatus::Failed arms of finalize_plan_execution
  • /status shows an Orchestration: block (plans, tasks completed/failed/skipped) when orchestration.enabled = true and plans_total > 0
  • Integration test updated for the new aggregate() return type
  • 4 new unit tests added (6057 total, up from 6053)
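The call-site pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from agent/mod.rs: `SessionMetrics`, `record_llm_call`, and the free function `plan` are hypothetical stand-ins, cost and cache accounting are left out, and counting a call even when no usage is reported is an assumption of the sketch.

```rust
/// Hypothetical stand-in for the shared metrics collector. Field names follow
/// the counters listed in the summary; the real struct may differ.
#[derive(Debug, Default, PartialEq)]
pub struct SessionMetrics {
    pub api_calls: u64,
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub total_tokens: u64,
}

/// Usage in the shape the summary describes:
/// `(prompt_tokens, completion_tokens)`, or `None` if nothing was reported.
pub type Usage = Option<(u64, u64)>;

impl SessionMetrics {
    /// Record one orchestration LLM call. The call is counted even when the
    /// provider reported no usage (an assumption of this sketch).
    pub fn record_llm_call(&mut self, usage: Usage) {
        self.api_calls += 1;
        if let Some((prompt, completion)) = usage {
            self.prompt_tokens += prompt;
            self.completion_tokens += completion;
            self.total_tokens += prompt + completion;
        }
    }
}

/// Hypothetical planner with the new return shape: the existing result
/// plus the usage of the call that produced it. Token numbers are made up.
pub fn plan(goal: &str) -> (Vec<String>, Usage) {
    let tasks = vec![format!("research: {goal}"), format!("summarize: {goal}")];
    (tasks, Some((120, 45)))
}
```

A caller would destructure the pair, for example `let (tasks, usage) = plan("goal");`, and pass `usage` to `record_llm_call` right away, so no state has to be read back from the provider afterwards.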

Root cause

api_calls was only incremented in native.rs and legacy.rs (the main agent loop LLM call sites). LlmPlanner::plan() and LlmAggregator::aggregate() called chat_typed() directly, bypassing those paths. Token usage and cost for orchestration sessions were completely untracked.

Design note

The architect's initial approach (reading last_usage() from self.provider after the planner/aggregator returns) was rejected: AnyProvider::clone() creates a new Mutex<Option<(u64,u64)>> initialized to None, so the clone used by the planner/aggregator writes to its own independent mutex and the original provider never sees that usage. Fix: return usage data directly from plan()/aggregate().
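The clone behaviour behind this decision can be reproduced in isolation. The `Provider` type below is a hypothetical reduction, not `AnyProvider` itself; it only assumes what the note states, namely that cloning yields a fresh mutex initialized to `None`.

```rust
use std::sync::Mutex;

/// Hypothetical provider whose `Clone` resets `last_usage`, mirroring the
/// behaviour described for `AnyProvider::clone()`.
pub struct Provider {
    last_usage: Mutex<Option<(u64, u64)>>,
}

impl Provider {
    pub fn new() -> Self {
        Self { last_usage: Mutex::new(None) }
    }

    /// Simulates an LLM call that records its token usage on `self`.
    pub fn chat(&self, prompt_tokens: u64, completion_tokens: u64) {
        *self.last_usage.lock().unwrap() = Some((prompt_tokens, completion_tokens));
    }

    /// Returns a copy of whatever the most recent call on `self` recorded.
    pub fn last_usage(&self) -> Option<(u64, u64)> {
        *self.last_usage.lock().unwrap()
    }
}

impl Clone for Provider {
    fn clone(&self) -> Self {
        // A fresh mutex: the clone shares no usage state with the original.
        Self { last_usage: Mutex::new(None) }
    }
}
```

If a planner holds `original.clone()` and calls `chat` on it, `last_usage()` on the clone returns the recorded pair while `last_usage()` on the original still returns `None`, which is why reading usage from `self.provider` after the call cannot work.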

Follow-up

  • IC-01: chat_typed() retry loops overwrite last_usage, so only the final attempt's tokens are captured. This is a pre-existing limitation of the last_usage() API, not introduced here.
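To make the size of the IC-01 gap concrete, the following sketch compares "last write wins" with summing over attempts. The `Attempt` alias and both helper functions are hypothetical; nothing here reflects how chat_typed() is actually implemented.

```rust
/// Usage reported by one attempt of a retried call:
/// `(prompt_tokens, completion_tokens)`. Hypothetical data shape.
pub type Attempt = (u64, u64);

/// "Last write wins": what a `last_usage()`-style API retains after retries.
pub fn last_usage(attempts: &[Attempt]) -> Option<Attempt> {
    attempts.last().copied()
}

/// What was actually consumed across all attempts.
pub fn total_usage(attempts: &[Attempt]) -> Option<Attempt> {
    if attempts.is_empty() {
        return None;
    }
    Some(attempts.iter().fold((0, 0), |(p, c), &(ap, ac)| (p + ap, c + ac)))
}
```

For three attempts reporting `(100, 0)`, `(100, 10)`, and `(100, 40)`, the first function yields `Some((100, 40))` and the second `Some((300, 50))`; the difference is the usage that goes unrecorded under the current API.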

Closes #1899

bug-ops added 4 commits March 16, 2026 19:11
…n metrics

LlmPlanner::plan() and LlmAggregator::aggregate() now return token usage
alongside their result, captured immediately after each provider.chat() call.
Call sites in agent/mod.rs update api_calls, prompt_tokens, completion_tokens,
total_tokens, record_cost, and record_cache_usage for both operations.

tasks_skipped is now incremented in the Completed graph path alongside
tasks_completed. handle_status_command() shows an Orchestration section
(plans, tasks completed/total, failed, skipped) when plans_total > 0.
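A rough sketch of the gating for that section is shown below. `OrchestrationStats`, `render_orchestration`, and the output layout are all hypothetical; only the conditions (orchestration enabled and plans_total > 0) and the listed counters come from this PR.

```rust
/// Hypothetical snapshot of the orchestration counters shown by /status.
#[derive(Debug, Default)]
pub struct OrchestrationStats {
    pub plans_total: u64,
    pub tasks_total: u64,
    pub tasks_completed: u64,
    pub tasks_failed: u64,
    pub tasks_skipped: u64,
}

/// Returns the Orchestration section, or `None` when it should be hidden:
/// either orchestration is disabled or no plan has run yet.
pub fn render_orchestration(enabled: bool, s: &OrchestrationStats) -> Option<String> {
    if !enabled || s.plans_total == 0 {
        return None;
    }
    // The exact text layout is an assumption of this sketch.
    Some(format!(
        "Orchestration:\n  plans: {}\n  tasks: {}/{} completed, {} failed, {} skipped",
        s.plans_total, s.tasks_completed, s.tasks_total, s.tasks_failed, s.tasks_skipped
    ))
}
```

Returning `Option<String>` keeps the hide/show decision in one place, so the status handler only appends the section when it is `Some`.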

Fixes #1899.
IC-02: GraphStatus::Failed arm now mirrors Completed arm by also
incrementing tasks_completed and tasks_skipped counters, so partial
failures are reflected accurately in orchestration metrics.
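The mirrored arms might look roughly like this. `TaskState`, `TaskCounters`, and `finalize` are hypothetical reductions of finalize_plan_execution, and the sketch assumes the Failed arm was already counting failed tasks before this change.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TaskState {
    Completed,
    Failed,
    Skipped,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum GraphStatus {
    Completed,
    Failed,
}

/// Hypothetical subset of the orchestration counters.
#[derive(Debug, Default, PartialEq)]
pub struct TaskCounters {
    pub completed: u64,
    pub failed: u64,
    pub skipped: u64,
}

/// Update the counters once a plan's graph has finished.
pub fn finalize(status: GraphStatus, tasks: &[TaskState], c: &mut TaskCounters) {
    // Number of tasks that ended in the given state.
    let count = |s: TaskState| tasks.iter().filter(|&&t| t == s).count() as u64;
    match status {
        GraphStatus::Completed => {
            c.completed += count(TaskState::Completed);
            c.skipped += count(TaskState::Skipped);
        }
        GraphStatus::Failed => {
            // Mirrors the Completed arm, so a partially failed plan still
            // reports the tasks that finished or were skipped.
            c.completed += count(TaskState::Completed);
            c.skipped += count(TaskState::Skipped);
            c.failed += count(TaskState::Failed);
        }
    }
}
```

With this shape, a failed graph containing one completed, one failed, and two skipped tasks contributes 1, 1, and 2 to the respective counters instead of only the failure.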

Integration test: destructure aggregate() return value to (String,
Option<(u64, u64)>) to match updated LlmAggregator::aggregate signature.
bug-ops enabled auto-merge (squash) March 16, 2026 18:33
github-actions bot added the documentation, rust, core, size/L, and tests labels Mar 16, 2026
bug-ops merged commit d3ef922 into main Mar 16, 2026
20 checks passed
bug-ops deleted the obs-orchestration-planner-aggr branch March 16, 2026 18:44
Labels

core (zeph-core crate), documentation (Improvements or additions to documentation), rust (Rust code changes), size/L (Large PR, 201-500 lines), tests (Test-related changes)

Development

Successfully merging this pull request may close these issues.

obs(orchestration): planner/aggregator LLM calls not tracked in session metrics or /status
