
feat: add token usage metrics with OpenTelemetry integration#563

Merged
ajbozarth merged 19 commits into generative-computing:main from ajbozarth:feat/token-usage-metrics-v2
Mar 10, 2026

Conversation

@ajbozarth
Contributor

@ajbozarth ajbozarth commented Feb 26, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Summary

Adds token usage metrics tracking across all Mellea backends using OpenTelemetry metrics counters, following Gen-AI Semantic Conventions for standardized observability.

Changes

Core Implementation

  • Added record_token_usage_metrics() function in mellea/telemetry/metrics.py
  • Implemented lazy initialization of token counters (mellea.llm.tokens.input, mellea.llm.tokens.output)
  • Integrated token tracking into all backends: OpenAI, Ollama, WatsonX, LiteLLM, and HuggingFace
  • Added console exporter support for debugging (MELLEA_METRICS_CONSOLE=true)

Configuration

  • New environment variable: MELLEA_METRICS_ENABLED (default: false)
  • New environment variable: MELLEA_METRICS_CONSOLE (default: false)
  • Metrics export via existing OTEL_EXPORTER_OTLP_ENDPOINT
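A debugging session might opt in like this (the endpoint URL is illustrative; the variable names are the ones listed above):

```shell
# Enable metrics (off by default) and mirror them to the console.
export MELLEA_METRICS_ENABLED=true
export MELLEA_METRICS_CONSOLE=true
# Reuse the standard OTLP endpoint variable for export (example URL).
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```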

Metrics Attributes

All token metrics include Gen-AI semantic convention attributes:

  • gen_ai.system - Backend system name (e.g., openai, ollama)
  • gen_ai.request.model - Model identifier
  • mellea.backend - Backend class name
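A data point recorded for an OpenAI-backed call might carry attributes like the following (the values and the backend class name are illustrative; only the keys come from the list above):

```python
# Illustrative attribute set for one token-usage data point.
attributes = {
    "gen_ai.system": "openai",          # backend system name
    "gen_ai.request.model": "gpt-4o",   # example model identifier
    "mellea.backend": "OpenAIBackend",  # assumed backend class name
}
```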

Testing

  • Added comprehensive unit tests for metrics configuration and recording
  • Added integration tests for all backends (Ollama, OpenAI, WatsonX, LiteLLM, HuggingFace)
  • Tests verify proper token counting and attribute tagging

Documentation

  • Updated docs/dev/telemetry.md with complete metrics documentation
  • Added usage examples and configuration guide
  • Documented backend support matrix

Backend Support

| Backend | Support | Token Source |
| --- | --- | --- |
| OpenAI | ✅ Full | usage.prompt_tokens, usage.completion_tokens |
| Ollama | ✅ Full | prompt_eval_count, eval_count |
| WatsonX | ✅ Full | input_token_count, generated_token_count |
| LiteLLM | ✅ Full | usage.prompt_tokens, usage.completion_tokens |
| HuggingFace | ✅ Full | Calculated from input_ids and output sequences |

Breaking Changes

None - metrics are disabled by default and require explicit opt-in via MELLEA_METRICS_ENABLED=true.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Add mellea.llm.tokens.input/output counters following Gen-AI semantic conventions with zero overhead when disabled

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…LM backends

Add record_token_usage_metrics() calls to all backend post_processing methods to track input/output tokens. Add get_value() helper in backends/utils.py to handle dict/object attribute extraction.
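A minimal sketch of such a helper, assuming a (key, default) signature — the actual implementation in backends/utils.py may differ:

```python
# Read a field from either a dict-style or an attribute-style response.
def get_value(obj, key, default=None):
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)
```

This lets the same call site handle both a raw JSON dict (`usage["prompt_tokens"]`) and a typed client object (`usage.prompt_tokens`).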

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Calculate token counts from input_ids and output sequences. Records to both tracing spans and metrics using helper function.
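The arithmetic is roughly the following — a sketch with plain lists standing in for tensors, assuming the generated sequence includes the prompt tokens (which is what makes the subtraction necessary):

```python
# Count input/output tokens for a generation whose returned sequence is
# the prompt tokens followed by the generated tokens.
def count_tokens(input_ids, sequence):
    input_tokens = len(input_ids)
    output_tokens = max(len(sequence) - input_tokens, 0)
    return input_tokens, output_tokens
```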

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
- Add integration tests for Ollama, OpenAI, LiteLLM, HuggingFace, WatsonX
- Tests revealed metrics were coupled with tracing (architectural issue)
- Fixed: Metrics now record independently of tracing spans
- WatsonX: Store full response to preserve usage information
- HuggingFace: Add zero-overhead guard, optimize test model

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…ation

Use MonkeyPatch for cleanup and update Watsonx to granite-4-h-small.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
- Add Token Usage Metrics section to docs/dev/telemetry.md with metric
  definitions, backend support table, and configuration examples
- Create metrics_example.py demonstrating token tracking with tested
  console output
- Update telemetry_example.py to reference new metrics example
- Update examples/telemetry/README.md with metrics quick start guide

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth self-assigned this Feb 26, 2026
@ajbozarth ajbozarth requested a review from a team as a code owner February 26, 2026 22:45
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@mergify

mergify bot commented Feb 26, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

After opening this I had Bob and Claude do in-depth reviews, and they came back with a handful of things I want to address. I will work on fixing those tomorrow.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

I've pushed a small update to test and document streaming support, as suggested by AI review.

As of now this is ready for full review and merge

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Enable token metrics for streaming responses in OpenAI and LiteLLM backends.
Parametrize backend tests for streaming/non-streaming coverage.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

@psschwei I've pushed fixes addressing your review if you want to take another look.

- Replace llama3.2:1b with granite4:micro-h in telemetry tests
- Replace deprecated granite-4.0-micro with granite-4.0-h-micro in HF tests
- Use model constants instead of hardcoded strings
- Remove redundant gh_run checks (rely on pytest markers)

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

After reading a doc on model usage in our tests that @psschwei shared on Slack, I realized I had misconfigured the models for the integration tests. I have rectified this in 9724db3 and also updated the one other usage of the Hugging Face model so it no longer uses the deprecated model.

tl;dr
Now all the Ollama-based tests use the default granite4:micro-h, the WatsonX test uses its default ibm/granite-4-h-small, and HuggingFace uses the default ibm-granite/granite-4.0-h-micro (including updating the other test instance) to match the rest of the test suite.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth requested a review from psschwei March 5, 2026 00:59
@ajbozarth ajbozarth enabled auto-merge March 5, 2026 01:41
@ajbozarth
Contributor Author

Having addressed the previous review, I have set this to auto-merge upon the next approval (@psschwei)

@ajbozarth ajbozarth disabled auto-merge March 6, 2026 01:34
@ajbozarth
Contributor Author

I'm disabling auto-merge based on my review of #582

Depending on how fast that gets merged, I will want to reimplement this using hooks. Based on a conversation with Bob, that would consolidate about half of the changes in each backend into a hooks-based plugin file. As such, I'd like to hold off on merging this until we get a better picture of the timeline on that PR.

This will not block my continued work on #466, which is agnostic to whether this uses hooks and can be started on top of this branch until we know whether #582 will be merged soon enough to reimplement this without delays.

@ajbozarth
Contributor Author

Alternatively, we could merge this as is and I can make the hooks-based refactor a follow-up PR; we can discuss at Monday's sync.

There was general agreement in today's sync to follow this path, so I will merge this once it's approved.

@ajbozarth ajbozarth enabled auto-merge March 9, 2026 15:58
@ajbozarth ajbozarth requested a review from jakelorocco March 9, 2026 20:18
@jakelorocco
Contributor

jakelorocco commented Mar 9, 2026

Commented in the wrong spot; moving to appropriate PR.

Contributor

@jakelorocco jakelorocco left a comment


With the "no overhead claims" are we worried about the otel part taking time or the token calculations? If it's not the token calculations, can you please open an issue to move the token counts by default as an accessible field in the model output thunk? And then metrics / logging can just grab those existing fields?

@ajbozarth
Contributor Author

> With the "no overhead claims" are we worried about the otel part taking time or the token calculations? If it's not the token calculations, can you please open an issue to move the token counts by default as an accessible field in the model output thunk? And then metrics / logging can just grab those existing fields?

Just in case, I dug into this a bit with Bob to confirm. The telemetry "zero overhead" claim refers to the overhead of package imports, not runtime, which is why telemetry calls are all no-ops when the mellea[telemetry] dependency is not installed. As for the tokens themselves, there is no overhead there since those values are already being returned by the LLMs.
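The import-time no-op pattern described here can be sketched as follows (the function name mirrors the PR description, but the guard variable and fallback structure are assumptions, not the actual Mellea code):

```python
# Fall back to no-ops when the optional telemetry extra isn't installed.
try:
    from opentelemetry import metrics as _otel_metrics  # mellea[telemetry]
    _HAVE_OTEL = True
except ImportError:
    _HAVE_OTEL = False

def record_token_usage_metrics(input_tokens, output_tokens, attributes=None):
    if not _HAVE_OTEL:
        return  # zero cost beyond a boolean check
    # ... real recording path using _otel_metrics would go here ...
```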

As for moving the token values into model output thunk fields, I can see the advantage of that. If you want to open a follow-up issue, I may even add that while I'm in that code refactoring it to use hooks/plugins.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@psschwei
Member

going to try closing / opening to get rid of the deprecated 3.10 test...

@psschwei psschwei closed this Mar 10, 2026
auto-merge was automatically disabled March 10, 2026 11:56

Pull request was closed

@psschwei psschwei reopened this Mar 10, 2026
@psschwei
Member

> going to try closing / opening to get rid of the deprecated 3.10 test...

that did not work 😢

@psschwei psschwei closed this Mar 10, 2026
@psschwei psschwei reopened this Mar 10, 2026
@psschwei
Member

> going to try closing / opening to get rid of the deprecated 3.10 test...
>
> that did not work 😢

now it did 😄

@ajbozarth ajbozarth enabled auto-merge March 10, 2026 12:59
@ajbozarth ajbozarth added this pull request to the merge queue Mar 10, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 10, 2026
@ajbozarth ajbozarth added this pull request to the merge queue Mar 10, 2026
Merged via the queue into generative-computing:main with commit 0e71558 Mar 10, 2026
7 of 10 checks passed
@ajbozarth ajbozarth deleted the feat/token-usage-metrics-v2 branch March 10, 2026 13:38

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement counters to track token usage across all LLM backends with model and backend labels

4 participants