refactor(telemetry)!: move backend tracing onto plugin/hook pattern#1181
refactor(telemetry)!: move backend tracing onto plugin/hook pattern#1181ajbozarth wants to merge 2 commits into
Conversation
Phase 1 of the tracing epic: migrates backend span emission from inline calls scattered across five backends onto a BackendTracingPlugin that subscribes to the existing generation_* hooks for chat and new generation_batch_* hooks for raw. Spans stay live on the OTel context across the API call so nested instrumentation parents under them. - Renames MELLEA_TRACE_* env vars to plural MELLEA_TRACES_* and introduces MELLEA_TRACES_ENABLED umbrella flag, opt-in MELLEA_TRACES_OTLP, and signal-specific OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. No deprecation shim. - Drops deprecated gen_ai.system attribute in favor of gen_ai.provider.name. - Prefixes application-span attributes with mellea. for consistency with backend spans and the existing logging/metrics pillars. - Eagerly initialises the tracer provider and registers the BackendTracingPlugin at module import when MELLEA_TRACES_ENABLED is truthy. Tests reset module state and call _setup_tracing() to re-init after env-var changes, removing the need for importlib.reload. - Adds generation_batch_pre_call/post_call/error hook types so raw-path emits one span per generate_from_raw call (matching OTel GenAI semconv) rather than one per MOT. - Deletes mellea/telemetry/backend_instrumentation.py; backends no longer import from mellea.telemetry (except .context). - Removes the _telemetry_span round-trip on mot._meta and the tracing-specific error block in core/base.py; error-path span closure lives in the plugin's generation_error hook. Closes generative-computing#1045, generative-computing#1046, generative-computing#1047 (Phase 1 of generative-computing#444). BREAKING CHANGE: Public telemetry API surface narrowed: - MELLEA_TRACE_* env vars renamed to plural MELLEA_TRACES_* with no deprecation shim. - The deprecated gen_ai.system span attribute is removed; consumers should read gen_ai.provider.name instead. - The split helpers is_application_tracing_enabled() and is_backend_tracing_enabled() are replaced by a single is_tracing_enabled(). - The application-tracing helpers add_span_event, start_backend_span, end_backend_span, and trace_backend are removed from mellea.telemetry. Backend spans are now emitted by the BackendTracingPlugin automatically; application spans use trace_application (unchanged). - The mellea.telemetry.backend_instrumentation module is deleted along with its exports (start_generate_span, instrument_generate_from_raw, record_token_usage, record_response_metadata, finalize_backend_span). Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
|
Follow-up work spawned by this PR (not part of the tracing epic):
|
|
I also opened #1180 which fixes a bug in metrics that also applies here. (ie that needs to merge first) |
jakelorocco
left a comment
There was a problem hiding this comment.
looks good! I just have a few questions and a few places I think extra comments would be helpful.
| finish_backend_span_success( | ||
| payload.generation_id, operation="chat", usage=gen.usage, mot=mot, gen=gen | ||
| ) |
There was a problem hiding this comment.
What happens if someone only registers the start_backend_span function and these finish span functions never run?
There was a problem hiding this comment.
As written in this PR, this can't happen, start_backend_span is only called from BackendTracingPlugin.on_pre_call / on_batch_pre_call, and each is paired with on_post_call + on_error (or the batch equivalents) on the same plugin. The astream / generate_from_raw paths guarantee that exactly one of post_call or error fires for every pre_call, so the span always closes.
The only way to leak a span is if a user writes their own code that calls start_backend_span directly without calling finish_backend_* to close it. In that case the entry sits in _in_flight_spans until process exit and the OTel context token never detaches.
- HF backend: detect stop-sequence termination as finish_reason="stop" (previously dropped silently when stopping on a configured stop_string). - Add unit tests for HF post_processing finish_reasons derivation. - Clarify generation_id docstrings/comments to distinguish the Mellea-side hook correlation ID from GenerationMetadata.response_id. - Document why BackendTracingPlugin.on_pre_call runs SEQUENTIAL while the other tracing hooks run FIRE_AND_FORGET. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
|
@jakelorocco I believe 00831a4 addresses all your actionable comments |
Pull Request
Issue
Fixes #1045, fixes #1046, fixes #1047. Phase 1 of #444. Also closes out item 2 of #909.
Description
Phase 1 of the tracing epic. Migrates backend span emission from inline calls scattered across five backends onto a
BackendTracingPluginthat subscribes to the existinggeneration_*hooks (chat path) and newgeneration_batch_*hooks (raw path). Spans stay live on the OTel context across the API call so nested instrumentation (httpx, langchain) parents under the backend span.Reviewer call-outs:
MELLEA_TRACE_*→MELLEA_TRACES_*(plural), aligned withOTEL_EXPORTER_OTLP_TRACES_ENDPOINTandMELLEA_METRICS_*. AddsMELLEA_TRACES_ENABLEDumbrella flag and opt-inMELLEA_TRACES_OTLP. No deprecation shim — tracing was still pre-1.0 surface.gen_ai.systemremoved in favor ofgen_ai.provider.name. Cleans up the dual emission introduced by feat(telemetry): close five OTel GenAI semantic convention emission gaps (#1035) #1036.generation_batch_pre_call/generation_batch_post_call/generation_batch_errorfor plugin authors. Raw path now emits one span pergenerate_from_rawcall (per OTel GenAI semconv), not one per MOT.ModelOutputThunk.cancel_generation(added upstream by feat(stdlib): add stream_with_chunking() with per-chunk validation (#901) #942 in9e8a9636) had its body rewritten to fireGENERATION_ERRORinstead of calling the removed_telemetry_spanAPI. Reviewers may notice as "extra surface", but it's rebase-driven, not refactor-driven.BREAKING CHANGE: Public telemetry API surface narrowed:
MELLEA_TRACE_*env vars renamed to pluralMELLEA_TRACES_*with no deprecation shim.gen_ai.systemspan attribute is removed; consumers should readgen_ai.provider.nameinstead.is_application_tracing_enabled()andis_backend_tracing_enabled()are replaced by a singleis_tracing_enabled().add_span_event,start_backend_span,end_backend_span, andtrace_backendare removed frommellea.telemetry. Backend spans are now emitted by theBackendTracingPluginautomatically; application spans usetrace_application(unchanged).mellea.telemetry.backend_instrumentationmodule is deleted along with its exports (start_generate_span,instrument_generate_from_raw,record_token_usage,record_response_metadata,finalize_backend_span).Testing
Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.