obs(orchestration): track planner/aggregator LLM calls in session metrics (#1899) by bug-ops · Pull Request #1925 · bug-ops/zeph

bug-ops · 2026-03-16T18:33:45Z

Summary

LlmPlanner::plan() and LlmAggregator::aggregate() now return Option<(u64, u64)> (prompt_tokens, completion_tokens) alongside their existing result
Call sites in agent/mod.rs use the returned usage to increment api_calls, prompt_tokens, completion_tokens, total_tokens, cost, and cache stats in the shared MetricsCollector
tasks_skipped is now incremented in both the GraphStatus::Completed and GraphStatus::Failed arms of finalize_plan_execution
/status shows an Orchestration: block (plans, tasks completed/failed/skipped) when orchestration.enabled = true and plans_total > 0
Integration test updated for new aggregate() return type
4 new unit tests added (6057 total, was 6053)
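The call-site pattern described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: SessionMetrics, record_llm_call, run_session, the free functions plan and aggregate, and the token numbers are all hypothetical stand-ins. Cost and cache accounting are omitted, and counting a call even when no usage is reported is an assumption.

```rust
/// Token usage as (prompt_tokens, completion_tokens), the shape named in the summary.
type Usage = Option<(u64, u64)>;

/// Hypothetical stand-in for the shared session metrics.
#[derive(Debug, Default, PartialEq)]
struct SessionMetrics {
    api_calls: u64,
    prompt_tokens: u64,
    completion_tokens: u64,
    total_tokens: u64,
}

impl SessionMetrics {
    /// Count one LLM call; add tokens only when the provider reported usage.
    /// (Counting the call even without usage is an assumption of this sketch.)
    fn record_llm_call(&mut self, usage: Usage) {
        self.api_calls += 1;
        if let Some((prompt, completion)) = usage {
            self.prompt_tokens += prompt;
            self.completion_tokens += completion;
            self.total_tokens += prompt + completion;
        }
    }
}

/// Hypothetical planner: returns its result together with the usage of its LLM call.
fn plan(goal: &str) -> (Vec<String>, Usage) {
    let tasks = vec![format!("research: {goal}"), format!("summarize: {goal}")];
    (tasks, Some((120, 45))) // made-up token counts
}

/// Hypothetical aggregator: same idea, usage travels with the result.
fn aggregate(outputs: &[String]) -> (String, Usage) {
    (outputs.join("\n"), Some((300, 80))) // made-up token counts
}

/// The caller owns the metrics and records usage right after each call returns.
fn run_session(goal: &str) -> SessionMetrics {
    let mut metrics = SessionMetrics::default();

    let (tasks, usage) = plan(goal);
    metrics.record_llm_call(usage);

    let (_summary, usage) = aggregate(&tasks);
    metrics.record_llm_call(usage);

    metrics
}
```

Because usage is part of the return value, the caller does not need access to whichever provider instance the planner or aggregator used internally.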

Root cause

api_calls was only incremented in native.rs and legacy.rs (the main agent loop LLM call sites). LlmPlanner::plan() and LlmAggregator::aggregate() called chat_typed() directly, bypassing those paths. Token usage and cost for orchestration sessions were completely untracked.

Design note

The architect's initial approach (reading last_usage() from self.provider after planner/aggregator return) was rejected: AnyProvider::clone() creates a new Mutex<Option<(u64,u64)>> initialized to None — the clone used by planner/aggregator writes to its own independent mutex. Fix: return usage data directly from plan()/aggregate().
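The clone behavior behind this decision can be reproduced with a small sketch. The Provider type below is a hypothetical stand-in for AnyProvider, and its Clone impl is written to match the behavior described in the design note (a fresh mutex initialized to None); it is not the project's actual code.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for a provider that remembers the usage of its last call.
struct Provider {
    last_usage: Mutex<Option<(u64, u64)>>,
}

impl Provider {
    fn new() -> Self {
        Provider { last_usage: Mutex::new(None) }
    }

    /// Pretend to call the LLM and record made-up usage on *this* instance.
    fn chat(&self, _prompt: &str) -> String {
        *self.last_usage.lock().unwrap() = Some((120, 45));
        "plan".to_string()
    }

    fn last_usage(&self) -> Option<(u64, u64)> {
        *self.last_usage.lock().unwrap()
    }
}

impl Clone for Provider {
    /// As described in the design note: the clone gets its own mutex, starting at None.
    fn clone(&self) -> Self {
        Provider { last_usage: Mutex::new(None) }
    }
}

/// Returns (usage seen on the original, usage seen on the clone) after the clone is used.
fn usage_seen_by_caller() -> (Option<(u64, u64)>, Option<(u64, u64)>) {
    let original = Provider::new();
    let planner_provider = original.clone(); // what the planner would hold
    planner_provider.chat("make a plan");
    (original.last_usage(), planner_provider.last_usage())
}
```

In this sketch the original instance never observes the call, so reading last_usage() from it after the planner returns yields None, which is why returning usage directly is the more robust option.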

Follow-up

IC-01: chat_typed() retry loops overwrite last_usage — only the final attempt's tokens are captured. Pre-existing limitation of the last_usage() API, not introduced here.
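To make the IC-01 limitation concrete, the sketch below contrasts overwriting usage on each retry with summing it across attempts. Both functions and the per-attempt numbers are hypothetical; the summing variant is only one possible way to address the limitation, not something this PR or the project commits to.

```rust
/// Hypothetical per-attempt usage as (prompt_tokens, completion_tokens).
type Usage = (u64, u64);

/// Mirrors the limitation: each attempt overwrites the stored value,
/// so only the final attempt's tokens survive.
fn last_attempt_only(attempts: &[Usage]) -> Option<Usage> {
    let mut last_usage = None;
    for &usage in attempts {
        last_usage = Some(usage); // earlier attempts are lost here
    }
    last_usage
}

/// One possible alternative: sum tokens over every attempt.
fn summed_over_attempts(attempts: &[Usage]) -> Option<Usage> {
    attempts
        .iter()
        .copied()
        .reduce(|(p, c), (next_p, next_c)| (p + next_p, c + next_c))
}
```

With two attempts of (100, 20) and (110, 25), the first function reports only (110, 25), while the second reports (210, 45), so a session with retries is under-counted by the overwrite behavior.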

Closes #1899

…n metrics

LlmPlanner::plan() and LlmAggregator::aggregate() now return token usage alongside their result, captured immediately after each provider.chat() call. Call sites in agent/mod.rs update api_calls, prompt_tokens, completion_tokens, total_tokens, record_cost, and record_cache_usage for both operations.

tasks_skipped is now incremented in the Completed graph path alongside tasks_completed.

handle_status_command() shows an Orchestration section (plans, tasks completed/total, failed, skipped) when plans_total > 0.

Fixes #1899.
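The gating of the Orchestration section can be sketched as below. OrchestrationStats, orchestration_block, the tasks_total field, and the output layout are assumptions for illustration; only the conditions (orchestration enabled and plans_total > 0) and the listed counters come from the PR text.

```rust
/// Hypothetical counters; only the counter names echo the PR text.
struct OrchestrationStats {
    plans_total: u64,
    tasks_total: u64,
    tasks_completed: u64,
    tasks_failed: u64,
    tasks_skipped: u64,
}

/// Render the block only when orchestration is enabled and at least one plan ran.
/// The exact text layout is made up for this sketch.
fn orchestration_block(enabled: bool, stats: &OrchestrationStats) -> Option<String> {
    if !enabled || stats.plans_total == 0 {
        return None;
    }
    Some(format!(
        "Orchestration:\n  plans: {}\n  tasks: {}/{} completed, {} failed, {} skipped",
        stats.plans_total,
        stats.tasks_completed,
        stats.tasks_total,
        stats.tasks_failed,
        stats.tasks_skipped,
    ))
}
```

Returning Option<String> lets the status handler skip the section entirely for sessions that never used orchestration instead of printing a block of zeros.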

IC-02: GraphStatus::Failed arm now mirrors Completed arm by also incrementing tasks_completed and tasks_skipped counters, so partial failures are reflected accurately in orchestration metrics. Integration test: destructure aggregate() return value to (String, Option<(u64, u64)>) to match updated LlmAggregator::aggregate signature.
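The mirrored arms described in IC-02 might look roughly like this. The two-variant GraphStatus, the TaskStatus enum, the Counters struct, and the finalize function are simplified, hypothetical stand-ins for finalize_plan_execution; whether the real Failed arm also updates a tasks_failed counter is an assumption.

```rust
/// Simplified stand-in: the real enum likely has more variants.
enum GraphStatus {
    Completed,
    Failed,
}

/// Hypothetical per-task outcome.
#[derive(Clone, Copy, PartialEq)]
enum TaskStatus {
    Completed,
    Failed,
    Skipped,
}

/// Hypothetical subset of the orchestration counters.
#[derive(Debug, Default, PartialEq)]
struct Counters {
    tasks_completed: u64,
    tasks_failed: u64,
    tasks_skipped: u64,
}

fn finalize(status: GraphStatus, tasks: &[TaskStatus], counters: &mut Counters) {
    // Count how many tasks ended in a given state.
    let count = |wanted: TaskStatus| tasks.iter().filter(|t| **t == wanted).count() as u64;

    match status {
        GraphStatus::Completed => {
            counters.tasks_completed += count(TaskStatus::Completed);
            counters.tasks_skipped += count(TaskStatus::Skipped);
        }
        GraphStatus::Failed => {
            // Mirrors the Completed arm so partial progress is not lost.
            counters.tasks_completed += count(TaskStatus::Completed);
            counters.tasks_skipped += count(TaskStatus::Skipped);
            // Assumption of this sketch: failures are counted here as well.
            counters.tasks_failed += count(TaskStatus::Failed);
        }
    }
}
```

For a failed graph with two completed tasks, one failed task, and one skipped task, this sketch records 2, 1, and 1 respectively rather than dropping the completed and skipped counts.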

bug-ops added 4 commits March 16, 2026 19:11

style: fix nightly fmt alignment in tests.rs

1f0e0e2

docs(changelog): add [Unreleased] entry for #1899 orchestration metrics

db346fe

bug-ops enabled auto-merge (squash) March 16, 2026 18:33

github-actions bot added the documentation, rust, core, size/L, and tests labels Mar 16, 2026

test(orchestration): add end-to-end metrics coverage for plan/aggrega…

9ae3bb5

…te paths (#1899)

bug-ops merged commit d3ef922 into main Mar 16, 2026
20 checks passed

bug-ops deleted the obs-orchestration-planner-aggr branch March 16, 2026 18:44

bug-ops merged 5 commits into main from obs-orchestration-planner-aggr
