
feat: add token usage metrics with OpenTelemetry integration#563

Merged
ajbozarth merged 19 commits into generative-computing:main from ajbozarth:feat/token-usage-metrics-v2
Mar 10, 2026

Conversation

@ajbozarth
Contributor

@ajbozarth ajbozarth commented Feb 26, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Summary

Adds token usage metrics tracking across all Mellea backends using OpenTelemetry metrics counters, following Gen-AI Semantic Conventions for standardized observability.

Changes

Core Implementation

  • Added record_token_usage_metrics() function in mellea/telemetry/metrics.py
  • Implemented lazy initialization of token counters (mellea.llm.tokens.input, mellea.llm.tokens.output)
  • Integrated token tracking into all backends: OpenAI, Ollama, WatsonX, LiteLLM, and HuggingFace
  • Added console exporter support for debugging (MELLEA_METRICS_CONSOLE=true)

Configuration

  • New environment variable: MELLEA_METRICS_ENABLED (default: false)
  • New environment variable: MELLEA_METRICS_CONSOLE (default: false)
  • Metrics export via existing OTEL_EXPORTER_OTLP_ENDPOINT
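A debugging session might opt in like this (the endpoint URL is illustrative; the variable names are the ones listed above):

```shell
# Enable metrics (off by default) and mirror them to the console.
export MELLEA_METRICS_ENABLED=true
export MELLEA_METRICS_CONSOLE=true
# Reuse the standard OTLP endpoint variable for export (example URL).
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```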

Metrics Attributes

All token metrics include Gen-AI semantic convention attributes:

  • gen_ai.system - Backend system name (e.g., openai, ollama)
  • gen_ai.request.model - Model identifier
  • mellea.backend - Backend class name
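A data point recorded for an OpenAI-backed call might carry attributes like the following (the values and the backend class name are illustrative; only the keys come from the list above):

```python
# Illustrative attribute set for one token-usage data point.
attributes = {
    "gen_ai.system": "openai",          # backend system name
    "gen_ai.request.model": "gpt-4o",   # example model identifier
    "mellea.backend": "OpenAIBackend",  # assumed backend class name
}
```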

Testing

  • Added comprehensive unit tests for metrics configuration and recording
  • Added integration tests for all backends (Ollama, OpenAI, WatsonX, LiteLLM, HuggingFace)
  • Tests verify proper token counting and attribute tagging

Documentation

  • Updated docs/dev/telemetry.md with complete metrics documentation
  • Added usage examples and configuration guide
  • Documented backend support matrix

Backend Support

| Backend | Support | Token Source |
| --- | --- | --- |
| OpenAI | ✅ Full | usage.prompt_tokens, usage.completion_tokens |
| Ollama | ✅ Full | prompt_eval_count, eval_count |
| WatsonX | ✅ Full | input_token_count, generated_token_count |
| LiteLLM | ✅ Full | usage.prompt_tokens, usage.completion_tokens |
| HuggingFace | ✅ Full | Calculated from input_ids and output sequences |

Breaking Changes

None - metrics are disabled by default and require explicit opt-in via MELLEA_METRICS_ENABLED=true.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Add mellea.llm.tokens.input/output counters following Gen-AI semantic conventions with zero overhead when disabled

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…LM backends

Add record_token_usage_metrics() calls to all backend post_processing methods to track input/output tokens. Add get_value() helper in backends/utils.py to handle dict/object attribute extraction.
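A minimal sketch of such a helper, assuming a (key, default) signature — the actual implementation in backends/utils.py may differ:

```python
# Read a field from either a dict-style or an attribute-style response.
def get_value(obj, key, default=None):
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)
```

This lets the same call site handle both a raw JSON dict (`usage["prompt_tokens"]`) and a typed client object (`usage.prompt_tokens`).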

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Calculate token counts from input_ids and output sequences. Records to both tracing spans and metrics using helper function.
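The arithmetic is roughly the following — a sketch with plain lists standing in for tensors, assuming the generated sequence includes the prompt tokens (which is what makes the subtraction necessary):

```python
# Count input/output tokens for a generation whose returned sequence is
# the prompt tokens followed by the generated tokens.
def count_tokens(input_ids, sequence):
    input_tokens = len(input_ids)
    output_tokens = max(len(sequence) - input_tokens, 0)
    return input_tokens, output_tokens
```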

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
- Add integration tests for Ollama, OpenAI, LiteLLM, HuggingFace, WatsonX
- Tests revealed metrics were coupled with tracing (architectural issue)
- Fixed: Metrics now record independently of tracing spans
- WatsonX: Store full response to preserve usage information
- HuggingFace: Add zero-overhead guard, optimize test model

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
…ation

Use MonkeyPatch for cleanup and update Watsonx to granite-4-h-small.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
- Add Token Usage Metrics section to docs/dev/telemetry.md with metric
  definitions, backend support table, and configuration examples
- Create metrics_example.py demonstrating token tracking with tested
  console output
- Update telemetry_example.py to reference new metrics example
- Update examples/telemetry/README.md with metrics quick start guide

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth self-assigned this Feb 26, 2026
@ajbozarth ajbozarth requested a review from a team as a code owner February 26, 2026 22:45
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@mergify

mergify bot commented Feb 26, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

After opening this I had Bob and Claude do in-depth reviews, and they came back with a handful of things I want to address. I will work on fixing those tomorrow.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

I've pushed a small update to test and document streaming support, as suggested by AI review.

As of now this is ready for full review and merge

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Enable token metrics for streaming responses in OpenAI and LiteLLM backends.
Parametrize backend tests for streaming/non-streaming coverage.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

@psschwei I've pushed fixes addressing your review if you want to take another look.

- Replace llama3.2:1b with granite4:micro-h in telemetry tests
- Replace deprecated granite-4.0-micro with granite-4.0-h-micro in HF tests
- Use model constants instead of hardcoded strings
- Remove redundant gh_run checks (rely on pytest markers)

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor Author

After reading a doc on model usage in our tests that @psschwei shared on Slack, I realized I had misconfigured the models for the integration tests. I have rectified this in 9724db3 and also updated the one other usage of the Hugging Face model so it no longer uses the deprecated model.

tl;dr
Now all the Ollama-based tests use the default granite4:micro-h, the WatsonX test uses its default ibm/granite-4-h-small, and HuggingFace uses the default ibm-granite/granite-4.0-h-micro (including updating the other test instance) to match the rest of the test suite.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth ajbozarth requested a review from psschwei March 5, 2026 00:59
@ajbozarth ajbozarth enabled auto-merge March 5, 2026 01:41
@ajbozarth
Contributor Author

Having addressed the previous review, I have set this to auto-merge upon the next approval (@psschwei)

@ajbozarth ajbozarth disabled auto-merge March 6, 2026 01:34
@ajbozarth
Contributor Author

I'm disabling auto-merge based on my review of #582

Depending on how fast that gets merged, I will want to reimplement this using hooks. Based on a conversation with Bob, that would consolidate about half of the changes in each backend into a hooks-based plugin file. As such, I'd like to hold off on merging this until we get a better picture of the timeline on that PR.

This will not block my continued work on #466, which is agnostic to whether this uses hooks and can be started on top of this branch until we know whether #582 will be merged soon enough to reimplement this without delays.

@ajbozarth
Contributor Author

Alternatively, we could merge this as is and I can make the hooks-based refactor a follow-up PR; we can discuss at Monday's sync.

There was general agreement in today's sync to follow this path, so I will merge this once it's approved.

@ajbozarth ajbozarth enabled auto-merge March 9, 2026 15:58
@ajbozarth ajbozarth requested a review from jakelorocco March 9, 2026 20:18
@jakelorocco
Contributor

jakelorocco commented Mar 9, 2026

Commented in the wrong spot; moving to appropriate PR.

Contributor

@jakelorocco jakelorocco left a comment


With the "no overhead claims" are we worried about the otel part taking time or the token calculations? If it's not the token calculations, can you please open an issue to move the token counts by default as an accessible field in the model output thunk? And then metrics / logging can just grab those existing fields?

@ajbozarth
Contributor Author

> With the "no overhead claims" are we worried about the otel part taking time or the token calculations? If it's not the token calculations, can you please open an issue to move the token counts by default as an accessible field in the model output thunk? And then metrics / logging can just grab those existing fields?

Just in case, I dug into this a bit with Bob to confirm. The telemetry "zero overhead" claim refers to the overhead of package imports, not runtime, which is why telemetry calls are all no-ops when the mellea[telemetry] dependency is not installed. As for the tokens themselves, there is no overhead there since those values are already being returned by the LLMs.
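The import-time no-op pattern described here can be sketched as follows (the function name mirrors the PR description, but the guard variable and fallback structure are assumptions, not the actual Mellea code):

```python
# Fall back to no-ops when the optional telemetry extra isn't installed.
try:
    from opentelemetry import metrics as _otel_metrics  # mellea[telemetry]
    _HAVE_OTEL = True
except ImportError:
    _HAVE_OTEL = False

def record_token_usage_metrics(input_tokens, output_tokens, attributes=None):
    if not _HAVE_OTEL:
        return  # zero cost beyond a boolean check
    # ... real recording path using _otel_metrics would go here ...
```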

As for moving the token values into model output thunk fields, I can see the advantage of that. If you want to open a follow-up issue, I may even add that while I'm in that code refactoring it to use hooks/plugins.

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@psschwei
Member

going to try closing / opening to get rid of the deprecated 3.10 test...

@psschwei psschwei closed this Mar 10, 2026
auto-merge was automatically disabled March 10, 2026 11:56

Pull request was closed

@psschwei psschwei reopened this Mar 10, 2026
@psschwei
Member

> going to try closing / opening to get rid of the deprecated 3.10 test...

that did not work 😢

@psschwei psschwei closed this Mar 10, 2026
@psschwei psschwei reopened this Mar 10, 2026
@psschwei
Member

> going to try closing / opening to get rid of the deprecated 3.10 test...
>
> that did not work 😢

now it did 😄

@ajbozarth ajbozarth enabled auto-merge March 10, 2026 12:59
@ajbozarth ajbozarth added this pull request to the merge queue Mar 10, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 10, 2026
@ajbozarth ajbozarth added this pull request to the merge queue Mar 10, 2026
Merged via the queue into generative-computing:main with commit 0e71558 Mar 10, 2026
7 of 10 checks passed
@ajbozarth ajbozarth deleted the feat/token-usage-metrics-v2 branch March 10, 2026 13:38

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement counters to track token usage across all LLM backends with model and backend labels

4 participants