fix(pydantic): retype wrapper spans as task to stop cost double-counting by colinbennettbrain · Pull Request #312 · braintrustdata/braintrust-sdk-python

colinbennettbrain · 2026-04-16T18:57:37Z

Agent wrapper spans (agent_run, agent_run_sync, agent_run_stream*, agent_to_cli_sync, model_request*) were tagged type=llm and logged the same usage metrics as their nested leaf chat <model> span. Experiment aggregations that sum metrics across type=llm spans therefore counted a single provider call twice (wrapper + leaf), inflating reported tokens and cost ~2x for a single-turn agent and more for multi-turn runs.

Retag every wrapper span as SpanTypeAttribute.TASK; only the leaf chat <model> emitted by _wrap_concrete_model_class stays LLM. _DirectStreamWrapper is used both as wrapper (direct.model_request_stream) and as leaf (Model.request_stream), so it gains a span_type parameter defaulting to LLM; wrapper callers pass TASK explicitly.

Test coverage: flip existing wrapper-type assertions to TASK (leaf chat_span assertions stay LLM) and extend the cassette-backed test_agent_run_async with a regression check that exactly one type=llm span exists and that prompt/completion tokens summed across llm spans equal the leaf's values.

Agent wrapper spans (agent_run, agent_run_sync, agent_run_stream*, agent_to_cli_sync, model_request*) were tagged type=llm and logged the same usage metrics as their nested leaf `chat <model>` span. Experiment aggregations that sum metrics across type=llm spans therefore counted a single provider call twice (wrapper + leaf), inflating reported tokens and cost ~2x for a single-turn agent and more for multi-turn runs. Retag every wrapper span as SpanTypeAttribute.TASK; only the leaf `chat <model>` emitted by _wrap_concrete_model_class stays LLM. _DirectStreamWrapper is used both as wrapper (direct.model_request_stream) and as leaf (Model.request_stream), so it gains a span_type parameter defaulting to LLM; wrapper callers pass TASK explicitly. Test coverage: flip existing wrapper-type assertions to TASK (leaf chat_span assertions stay LLM) and extend the cassette-backed test_agent_run_async with a regression check that exactly one type=llm span exists and that prompt/completion tokens summed across llm spans equal the leaf's values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#312 retyped wrapper spans (agent_run, model_request, streaming wrappers) from LLM to TASK, but they kept logging the same prompt_tokens/completion_tokens/tokens as their nested leaf `chat <model>` span. The server derives estimated_cost per-span from tokens+metadata.model (brainstore estimated_cost.rs), rolls up trace totals with `coalesce_add` over every non-scorer span regardless of type (summary.rs accumulate_metrics), and sums experiment-level token/cost over all non-scorer spans without filtering on span_type='llm' (summary.ts experimentScanSpanSummary). Retyping to TASK therefore did not stop double-counting on any of those three axes. Route every wrapper log site through a new `_wrapper_span_metrics` helper that emits only {start, end, duration, optional time_to_first_token}. Leaf `chat <model>` spans (from _wrap_concrete_model_class and _DirectStreamWrapper when span_type=LLM) keep full _extract_response_metrics. `_DirectStreamWrapper` now branches on span_type since it serves as both leaf and wrapper. Delete now-dead `_extract_usage_metrics` and `_extract_stream_usage_metrics`. Flip existing cassette-backed assertions (test_agent_run_async, test_agent_run_sync, test_agent_run_stream_async, test_agent_with_tools, test_agent_run_stream_sync) to assert prompt_tokens / completion_tokens / tokens / prompt_cached_tokens are absent from wrapper spans and present only on the leaf. No cassette re-recording needed -- the change is purely in post-processing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#312 retyped wrapper spans (agent_run, model_request, streaming wrappers) from LLM to TASK, but they kept logging the same prompt_tokens/completion_tokens/tokens as their nested leaf `chat <model>` span. The server derives estimated_cost per-span from tokens+metadata.model (brainstore estimated_cost.rs), rolls up trace totals with `coalesce_add` over every non-scorer span regardless of type (summary.rs accumulate_metrics), and sums experiment-level token/cost over all non-scorer spans without filtering on span_type='llm' (summary.ts experimentScanSpanSummary). Retyping to TASK therefore did not stop double-counting on any of those three axes. Route every wrapper log site through a new `_wrapper_span_metrics` helper that emits only {start, end, duration, optional time_to_first_token}. Leaf `chat <model>` spans (from _wrap_concrete_model_class and _DirectStreamWrapper when span_type=LLM) keep full _extract_response_metrics. `_DirectStreamWrapper` now branches on span_type since it serves as both leaf and wrapper. Delete now-dead `_extract_usage_metrics` and `_extract_stream_usage_metrics`. Flip existing cassette-backed assertions (test_agent_run_async, test_agent_run_sync, test_agent_run_stream_async, test_agent_with_tools, test_agent_run_stream_sync) to assert prompt_tokens / completion_tokens / tokens / prompt_cached_tokens are absent from wrapper spans and present only on the leaf. No cassette re-recording needed -- the change is purely in post-processing.

Abhijeet Prasad (AbhiPrasad) · 2026-04-17T15:58:52Z

released with https://github.com/braintrustdata/braintrust-sdk-python/releases/tag/py-sdk-v0.16.0

colinbennettbrain requested a review from Abhijeet Prasad (AbhiPrasad) April 16, 2026 19:04

Abhijeet Prasad (AbhiPrasad) approved these changes Apr 16, 2026

View reviewed changes

Abhijeet Prasad (AbhiPrasad) changed the title ~~retype wrapper spans as task to stop cost double-counting~~ fix(pydantic): retype wrapper spans as task to stop cost double-counting Apr 16, 2026

Abhijeet Prasad (AbhiPrasad) merged commit 31d7c07 into main Apr 16, 2026
84 checks passed

Abhijeet Prasad (AbhiPrasad) deleted the colin-pydantic_ai branch April 16, 2026 19:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pydantic): retype wrapper spans as task to stop cost double-counting#312

fix(pydantic): retype wrapper spans as task to stop cost double-counting#312
Abhijeet Prasad (AbhiPrasad) merged 1 commit intomainfrom
colin-pydantic_ai

colinbennettbrain commented Apr 16, 2026

Uh oh!

Uh oh!

Abhijeet Prasad (AbhiPrasad) commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

colinbennettbrain commented Apr 16, 2026

Uh oh!

Uh oh!

Abhijeet Prasad (AbhiPrasad) commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants