feat(observability): align traces and metrics with OTel GenAI semantic conventions (#125) #142
Merged
cchinchilla-dev merged 9 commits into main on May 2, 2026
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Pull request overview
This PR modernizes AgentLoom's observability layer to use OpenTelemetry GenAI semantic conventions, centralizing telemetry names and expanding trace/metric coverage so external OTel backends can consume AgentLoom data with less custom relabeling.
Changes:
- Adds a centralized observability schema module for span names, span attributes, metric names, and provider-name translation.
- Refactors observer, gateway, engine, and LLM step code to emit GenAI-aligned spans/attributes/metrics, including provider-attempt spans and prompt metadata.
- Updates tests, docs, changelog, and Grafana queries to reflect the new telemetry surface and breaking hook/signature changes.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `tests/providers/test_gateway.py` | Adds fallback-span and `step_id` handling tests for gateway completions. |
| `tests/observability/test_schema.py` | Adds regression tests for centralized schema constants and observer behavior. |
| `tests/observability/test_observer.py` | Updates observer tests for renamed semantic-convention attributes/hooks. |
| `tests/observability/test_noop.py` | Updates noop observer coverage for the new hook surface. |
| `tests/observability/test_metrics.py` | Updates metric tests for histogram-based GenAI telemetry. |
| `tests/core/test_engine_integration.py` | Adjusts engine/observer lifecycle assertions for `run_id` and hook changes. |
| `src/agentloom/steps/llm_call.py` | Adds prompt metadata generation, prompt-capture events, and forwards `step_id` to the gateway. |
| `src/agentloom/steps/base.py` | Extends `StepContext` with `capture_prompts`. |
| `src/agentloom/providers/gateway.py` | Adds per-attempt provider observer hooks for complete/stream paths and consumes `step_id`. |
| `src/agentloom/observability/schema.py` | Introduces centralized span/attribute/metric constants and provider-name mapping. |
| `src/agentloom/observability/observer.py` | Refactors tracing/metrics emission to the new schema and adds provider start/end spans. |
| `src/agentloom/observability/noop.py` | Mirrors the expanded observer API with no-op implementations. |
| `src/agentloom/observability/metrics.py` | Renames GenAI metrics, switches token counting to histograms, and rewrites emitted attributes. |
| `src/agentloom/core/results.py` | Adds `PromptMetadata` and threads it through `StepResult`. |
| `src/agentloom/core/protocols.py` | Removes the old observer hook surface from the protocol. |
| `src/agentloom/core/models.py` | Adds workflow-level `capture_prompts` config. |
| `src/agentloom/core/engine.py` | Passes `run_id`, prompt metadata, and new token fields into observer hooks. |
| `docs/workflow-yaml.md` | Documents the new `capture_prompts` workflow config. |
| `docs/observability.md` | Documents the new span hierarchy, attributes, and metric names. |
| `deploy/grafana/dashboards/agentloom.json` | Rewrites dashboard queries to use renamed metrics and labels. |
| `CHANGELOG.md` | Documents the breaking observability schema changes and new telemetry behavior. |
Comments suppressed due to low confidence (1)
src/agentloom/core/protocols.py:118

`ObserverProtocol` no longer describes the hooks that the gateway/steps actually use after this refactor (`on_provider_call_start`, `on_provider_call_end`, `attach_step_event`). That makes the protocol's own guarantee here inaccurate: an observer can satisfy `ObserverProtocol` and still silently miss the new observability callbacks.
    # Provider-level hooks called by engine + gateway. Listed here so an
    # ``isinstance(obs, ObserverProtocol)`` check fails for observers that
    # would crash mid-run on a missing method.
    def on_tokens(
        self,
        provider: str,
        model: str,
        prompt_tokens: int,
        completion_tokens: int,
        **kwargs: Any,
    ) -> None: ...

    def on_stream_response(
What
Aligns the AgentLoom telemetry surface with the OpenTelemetry GenAI semantic conventions (May 2026 registry). Spans, attributes, and metrics now use canonical `gen_ai.*` names, so any OTel-aware backend (Grafana GenAI dashboards, Jaeger plugins, third-party collectors) auto-correlates AgentLoom traces without per-site relabeling.
- `observability/schema.py` is now the single source of truth for every span name, attribute, and metric. No raw telemetry literals anywhere else in the codebase — guarded by a drift-detection regression test.
- Inference spans are named `{operation_name} {model}` (e.g. `chat gpt-4o-mini`); tool-call spans are renamed to `execute_tool {tool_name}`. Provider attempts emit child spans under each step span, including failed fallback attempts.
- Provider names are translated to canonical values (`google` → `gcp.gemini`, etc.); the custom values `ollama` and `mock` are documented as local extensions.
- Token counting is consolidated into a single histogram (`gen_ai.client.token.usage`) with a `gen_ai.token.type` dimension; latency and TTFT are migrated to `gen_ai.client.operation.duration` and `gen_ai.client.operation.time_to_first_chunk`. Grafana dashboard queries are rewritten to match.
- Prompt capture is opt-in via `WorkflowConfig.capture_prompts` and emitted as a span event to avoid attribute-size limits.
- `workflow.run_id` is propagated through the span tree; `error.type` is set alongside `step.error` on failed inference spans.

Why
The previous schema used ad-hoc names (`gen_ai.system`, `step.tokens`, `gen_ai.server.time_to_first_token`) that drift from the OTel registry, blocking downstream consumers from using off-the-shelf dashboards. Centralizing the names also unblocks the planned `agentloom-contracts` package extraction — that work becomes a `git mv` instead of a grep-and-replace.

Closes #125
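The PR view above doesn't show `observability/schema.py` itself. A minimal sketch of the centralization idea: the metric names, the `google` → `gcp.gemini` mapping, and the `{operation_name} {model}` span pattern are quoted from this PR's description, while the constant names and helper functions are illustrative, not the actual module contents:

```python
"""Sketch of a single source of truth for telemetry names."""
from typing import Final

# Metric names quoted from the PR description (OTel GenAI conventions).
METRIC_TOKEN_USAGE: Final = "gen_ai.client.token.usage"
METRIC_OPERATION_DURATION: Final = "gen_ai.client.operation.duration"
METRIC_TIME_TO_FIRST_CHUNK: Final = "gen_ai.client.operation.time_to_first_chunk"

# Attribute keys (illustrative constant names).
ATTR_TOKEN_TYPE: Final = "gen_ai.token.type"
ATTR_ERROR_TYPE: Final = "error.type"
ATTR_RUN_ID: Final = "workflow.run_id"

# Provider-name translation to canonical OTel values. Unknown names
# (e.g. the documented local extensions "ollama" and "mock") pass through.
_PROVIDER_CANONICAL: Final[dict[str, str]] = {"google": "gcp.gemini"}


def canonical_provider(name: str) -> str:
    """Map an internal provider name to its canonical OTel value."""
    return _PROVIDER_CANONICAL.get(name, name)


def inference_span_name(operation: str, model: str) -> str:
    """OTel GenAI span-name convention: '{operation_name} {model}'."""
    return f"{operation} {model}"


assert inference_span_name("chat", "gpt-4o-mini") == "chat gpt-4o-mini"
assert canonical_provider("google") == "gcp.gemini"
```

Keeping every literal behind a constant is what makes the drift-detection test mentioned above possible: the test can grep the rest of the codebase for raw `gen_ai.` strings and fail if any appear outside this module.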
Testing
- `uv run pytest` — 1095 passed
- `uv run ruff check src/ tests/` — clean
- `uv run mypy src/` — clean

Notes
No backwards compatibility is retained. The legacy `on_provider_call` observer hook and the `tokens` positional argument on `on_step_end` are removed; the previous `agentloom_tokens_total` / `agentloom_provider_latency_seconds` / `agentloom_time_to_first_token_seconds` metrics no longer exist. AgentLoom-specific metrics (workflow lifecycle, cost, circuit breaker, webhooks, approvals, recordings) keep the `agentloom_*` prefix as an application namespace, per OTel naming guidance.