Add distributed tracing via OpenTelemetry #217

Merged
valdis merged 2 commits into main from open-telemetry on Apr 14, 2026

valdis commented Apr 9, 2026

Every message hop across agents now shares a single trace_id, giving end-to-end visibility into multi-agent interactions. Tracing is opt-in and zero-cost when not configured.

Description

Adds distributed tracing support to the EggAI SDK using OpenTelemetry. A traceparent field (W3C Trace Context) is propagated with every message, linking producer and consumer spans across agents and transport backends into a single end-to-end trace under one trace_id.

Tracing uses a swappable backend (_NoOpBackend / _OtelBackend), so setup_tracing() can be called at any point: handlers subscribed before the call start producing real spans as soon as it runs. When the otel extra is not installed, handlers are never wrapped and there is zero overhead.
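For readers unfamiliar with W3C Trace Context, the traceparent value is a string of the form version-traceid-parentid-flags (2, 32, 16 and 2 lowercase hex characters). The sketch below is not the SDK's code; it only illustrates why every hop can share one trace_id. The helper names (new_traceparent, child_traceparent, trace_id_of) are hypothetical.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def new_traceparent() -> str:
    """Start a fresh trace: random 16-byte trace id, random 8-byte span id."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Continue a trace on the next hop: keep trace_id, mint a new span id."""
    match = _TRACEPARENT_RE.match(parent)
    if match is None:
        # Illustrative policy only: a malformed value starts a new trace.
        return new_traceparent()
    return f"00-{match['trace_id']}-{secrets.token_hex(8)}-{match['flags']}"


def trace_id_of(traceparent: str) -> str:
    """Extract the 32-character trace id, raising on malformed input."""
    match = _TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"not a valid traceparent: {traceparent!r}")
    return match["trace_id"]
```

Calling child_traceparent() repeatedly yields values whose span id changes on each hop while trace_id_of() stays constant, which is the property the PR relies on to group spans into one trace.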

Type of Change

  • New feature (non-breaking change which adds functionality)

Changes Made

  • Added sdk/eggai/tracing.py with setup_tracing(), swappable _NoOpBackend / _OtelBackend, make_tracing_wrapper() for consumer spans, and auto-activation when OTEL_EXPORTER_OTLP_ENDPOINT is set
  • Added traceparent: str | None field to BaseMessage for W3C Trace Context propagation
  • Updated Channel.publish() to inject/continue trace context into outgoing messages as a PRODUCER span via the active backend
  • Wrapped subscriber handlers in all three transports (InMemory, Kafka, Redis) with a CONSUMER span that restores the upstream trace context
  • Fixed Redis transport to avoid double-wrapping the handler on internal retry subscriptions
  • Added otel optional extra to pyproject.toml (pip install eggai[otel]) with OTLP gRPC, OTLP HTTP, and console exporters
  • Added sdk/tests/test_tracing.py with test coverage for no-op, producer spans, consumer spans, trace continuation, and multi-hop trace propagation

Testing

  • Existing tests pass locally (make test)
  • Added new tests for new functionality
  • Tested manually (describe below if applicable)

Manual Testing Steps

  1. Install with pip install eggai[otel]
  2. Set OTEL_EXPORTER_OTLP_ENDPOINT (or call setup_tracing(exporter="console") at startup)
  3. Publish a message from one agent and consume it in another — both spans appear under the same trace_id in your tracing backend (Jaeger, Tempo, etc.)
  4. Run without calling setup_tracing() — confirm no errors and no performance change

Changelog

  • I have updated sdk/CHANGELOG.md under [Unreleased] section (required for code changes)

Checklist

  • My code follows the project's code style (make lint passes)
  • My code is properly formatted (make format applied)
  • I have added/updated tests that prove my fix/feature works
  • I have added/updated documentation as needed
  • All tests pass locally
  • I have reviewed my own code
  • My changes generate no new warnings
  • My commit messages follow the Conventional Commits standard

Additional Notes

  • Zero overhead when OTel is not installed: make_tracing_wrapper detects that OTel is absent at import time and returns handlers unwrapped; no closures are allocated and there is no per-message cost
  • Order-independent activation: setup_tracing() swaps the backend in place; handlers already subscribed begin producing real spans immediately, with no re-subscription needed
  • No provider clobber: if a TracerProvider is already configured (e.g. by the application), setup_tracing() adopts it rather than overwriting it
  • Supports OTLP gRPC (default), OTLP HTTP, and console exporters; also accepts a pre-built exporter object for testing
  • Auto-activates via OTEL_EXPORTER_OTLP_ENDPOINT / OTEL_EXPORTER_OTLP_PROTOCOL env vars (silently skips if the otel extra is not installed)
  • Redis transport: handler is only wrapped on the initial subscribe call, not on internal retry-stream subscriptions, to prevent duplicate spans and consumer group name corruption
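The order-independent activation described above comes down to the wrapper looking up the backend on every call rather than capturing it at subscription time. The following is a minimal sketch of that idea, not the code in sdk/eggai/tracing.py: NoOpBackend, RecordingBackend and swap_backend() are hypothetical stand-ins for _NoOpBackend, _OtelBackend and setup_tracing(), and the real setup_tracing() has a different signature.

```python
import asyncio
from contextlib import contextmanager


class NoOpBackend:
    """Default backend: opening a span does nothing."""

    @contextmanager
    def consumer_span(self, name):
        yield


class RecordingBackend:
    """Hypothetical stand-in for the OTel backend: records span names."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def consumer_span(self, name):
        self.spans.append(name)
        yield


_backend = NoOpBackend()


def swap_backend(backend) -> None:
    """Plays the role of setup_tracing(): replace the module-level backend."""
    global _backend
    _backend = backend


def make_tracing_wrapper(handler):
    """Wrap an async handler in a consumer span from the current backend."""

    async def wrapper(message):
        # _backend is read at message time, so a later swap takes effect
        # for handlers that were wrapped before tracing was activated.
        with _backend.consumer_span(f"consume {handler.__name__}"):
            return await handler(message)

    return wrapper
```

A handler wrapped while the no-op backend is active produces no spans; after swap_backend(RecordingBackend()), the same wrapper object starts recording without being re-created.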

github-actions Bot commented Apr 9, 2026

QualOps Code Quality Analysis

Status: ✅ PASSED - No issues found

Summary

  • Total Issues: 0
  • Critical: 0 🔴
  • High: 0 🟠
  • Medium: 0 🟡
  • Low: 0 🟢
  • Files Analyzed: 0

No issues found in the analyzed code.

valdis force-pushed the open-telemetry branch 2 times, most recently from 7d82a42 to 5b19b4e, on April 9, 2026 13:31
Every message hop across agents now shares a single trace_id, giving
end-to-end visibility into multi-agent interactions. Tracing is opt-in
and zero-cost when not configured.
valdis requested review from nherment and pontino on April 10, 2026 06:49
nherment requested a review from rocky-jaiswal on April 10, 2026 07:00
…() and provider clobber

- Replace _tracer sentinel with swappable _NoOpBackend / _OtelBackend so
  handlers subscribed before setup_tracing() are correctly traced once it
  is called (closure reads _backend at message time, not subscription time)
- setup_tracing() no longer overwrites a user-configured TracerProvider;
  it adopts the existing one when the current provider is not the default
  ProxyTracerProvider (fixes silent overwrite when env var is set)
- Replace `import eggai` for version lookup with importlib.metadata.version()
  to eliminate the fragile circular import
- All OTel imports are contained in _OtelBackend methods; channel.py and
  make_tracing_wrapper have no direct opentelemetry imports
- Skip wrapping sync handlers in make_tracing_wrapper (they cannot propagate
  OTel context across await points and have incompatible call conventions)
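The last bullet, skipping sync handlers, can be illustrated with inspect.iscoroutinefunction. This is a sketch under assumptions rather than the SDK's make_tracing_wrapper: wrap_if_async and its on_call callback are hypothetical names used only to make the behavior observable.

```python
import asyncio
import functools
import inspect


def wrap_if_async(handler, on_call):
    """Wrap coroutine handlers only; return sync handlers untouched."""
    if not inspect.iscoroutinefunction(handler):
        # Sync handlers are passed through unchanged, as the commit describes.
        return handler

    @functools.wraps(handler)
    async def wrapper(*args, **kwargs):
        on_call(handler.__name__)  # stand-in for opening a consumer span
        return await handler(*args, **kwargs)

    return wrapper
```

With this shape, a sync handler keeps its identity (the returned object is the original function), while an async handler gets a wrapper that preserves its name through functools.wraps.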
valdis merged commit c9fe762 into main on Apr 14, 2026
10 checks passed
2 participants