Emit OpenTelemetry spans from airflow CLI entry points #66789
Open
1fanwang wants to merge 2 commits into
Wrap `airflow tasks test`, `airflow dags trigger`, `airflow dags test`, and `airflow backfill create` with a span context that honours the W3C `TRACEPARENT` / `TRACESTATE` environment variables. This lets an external caller (a CI step, parent workflow, or debug harness running a DAG locally) propagate trace context into Airflow. Without the wrapper, a trace started by the caller terminates at the CLI binary and the downstream task / DagRun spans show up as a separate trace. The new `cli_span` helper lives next to the existing AIP-59 tracer setup in `airflow_shared.observability.traces` so the CLI surface and the scheduler / API server share one extraction path.
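As a rough illustration of the caller side, the sketch below shows how a wrapper (for example a CI step) might put its current trace context into the environment before shelling out to the CLI. The helper names `format_traceparent` and `child_env` are hypothetical and not part of this PR; the header layout follows the W3C Trace Context `version-traceid-parentid-flags` form, and the IDs are the ones used in the Evidence section of the PR description.

```python
import os
import re
from typing import Dict, Optional

_TRACE_ID = re.compile(r"[0-9a-f]{32}")
_SPAN_ID = re.compile(r"[0-9a-f]{16}")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a version-00 W3C traceparent value (hypothetical helper)."""
    if not _TRACE_ID.fullmatch(trace_id) or set(trace_id) == {"0"}:
        raise ValueError("trace_id must be 32 lowercase hex chars and not all zeros")
    if not _SPAN_ID.fullmatch(span_id) or set(span_id) == {"0"}:
        raise ValueError("span_id must be 16 lowercase hex chars and not all zeros")
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_env(
    trace_id: str,
    span_id: str,
    tracestate: Optional[str] = None,
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment carrying the caller's trace context."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRACEPARENT"] = format_traceparent(trace_id, span_id)
    if tracestate:
        env["TRACESTATE"] = tracestate
    return env


if __name__ == "__main__":
    env = child_env("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
    cmd = ["airflow", "dags", "trigger", "example_bash_operator", "--run-id", "repro_run"]
    print(env["TRACEPARENT"])
    print(" ".join(cmd))
    # A real caller would then run something like:
    #   subprocess.run(cmd, env=env, check=True)
```

The sketch only prepares the environment and the command; whether the resulting spans join the caller's trace depends on the CLI entry point reading `TRACEPARENT`, which is what this PR adds.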
The CLI entry-point span tests drive real airflow CLI subcommands (`dags trigger`, `dags test`, `tasks test`, `backfill create`), which touch the metadata DB, so they must run in the DB-tests job, not the Non-DB job that sets `_AIRFLOW_SKIP_DB_TESTS`.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
Observability gap on our LinkedIn DI Airflow setup: scheduler and worker spans flow through our OTel pipeline, but operations issued via the `airflow` CLI (manual triggers, backfills, ad-hoc clears) don't emit spans. So when someone reports an unexpected Dag state, we can't trace it back through OTel to find the CLI command that caused it. This PR emits spans at the CLI entry points.

AIP-59 added OpenTelemetry traces inside the scheduler, dag processor, and task supervisor. The CLI entry points that drive those subsystems (`airflow tasks test`, `airflow dags trigger`, `airflow dags test`, `airflow backfill create`) are not wrapped in spans, so when one of these commands is invoked from a wrapper that already has trace context (a CI pipeline, a parent workflow, a debug harness), the inbound trace dies at the CLI binary and the resulting task and DagRun spans show up as a separate trace. This PR wires those four entry points into the existing tracer so the caller's trace and Airflow's downstream spans stitch into a single distributed trace.

Problem

`airflow tasks test ...` shelled out from a developer's terminal or `airflow dags trigger ...` issued by an external orchestrator both create a meaningful unit of work inside Airflow. Today there's no span to anchor that work to the caller's trace. The caller has to manually correlate trace IDs via logs, or accept a broken trace boundary at the CLI.

Fix
Add a small `cli_span` context-manager helper to `airflow_shared.observability.traces` (next to the existing AIP-59 tracer setup). It reads `TRACEPARENT` (and optionally `TRACESTATE`) from the environment using the W3C TraceContext propagator, opens a span parented to that context, and yields. When the env vars are absent it produces a root span; when OTel is not configured it falls back to the global no-op tracer and stays inert.

Apply the helper at four CLI entry points:

- `airflow.cli.commands.task_command.task_test` → span `cli.tasks.test`
- `airflow.cli.commands.dag_command.dag_trigger` → span `cli.dags.trigger`
- `airflow.cli.commands.dag_command.dag_test` → span `cli.dags.test`
- `airflow.cli.commands.backfill_command.create_backfill` → span `cli.backfill.create`

Each span carries `airflow.dag_id` plus the most useful context for that command (task_id, run_id, logical_date, dry_run flag).

Reproducer
Without this PR:

The console exporter prints task spans, but they're under a fresh root trace; the `0af76519…` trace id from `TRACEPARENT` is not used. With this PR, the same invocation emits a `cli.dags.test` span as a child of `b7ad6b7169203331`, and the downstream task spans share the trace id from the caller.

Tests
- `shared/observability/tests/observability/test_traces.py`: unit tests for the helper (traceparent extraction, tracestate propagation, malformed-header tolerance, env-var fallback, no-op-tracer safety).
- `airflow-core/tests/unit/cli/commands/test_cli_trace.py`: integration tests for the four CLI entry points, asserting both the no-traceparent and with-traceparent paths via an `InMemorySpanExporter`.

16 tests, all passing. The existing CLI tests for `task_test`, `dag_test`, `dag_trigger`, and `create_backfill` still pass unchanged.

Notes
- When `otel_on` is false (the default), the global no-op tracer kicks in and the wrapper has effectively zero overhead.
- Only the four entry points above (`tasks test`, `dags trigger`, `dags test`, `backfill create`) are wrapped. Read-only commands (`dags list`, `tasks state`, etc.) don't need a span; they're not the kind of operation a caller threads trace context through.
- The span is not added in the `action_cli` decorator: that would emit a span for every CLI invocation including `airflow info`, `airflow db check`, etc., which is more noise than value. The explicit per-entry-point wiring keeps the span set meaningful.

Evidence
Captured against this branch by driving `airflow dags trigger example_bash_operator --run-id repro_run` through the public CLI entry point with an `InMemorySpanExporter` installed, `TRACEPARENT=00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01` in the environment, and the metadata-DB-bound `get_current_api_client` mocked so the run completes without DB initialization. For the before pass the same entry point is invoked with `cli_span` patched to a no-op context manager, simulating the pre-fix code path with everything else identical.

Before (no `cli_span` wrapping the entry point):

After (this PR):

    {
      "name": "cli.dags.trigger",
      "trace_id": "0af7651916cd43dd8448eb211c80319c",
      "span_id": "69f1f3e7418b8123",
      "parent_span_id": "b7ad6b7169203331",
      "attributes": {
        "airflow.dag_id": "example_bash_operator",
        "airflow.dag_run.run_id": "repro_run"
      }
    }

The span's `trace_id` matches the inbound `TRACEPARENT` trace id and its `parent_span_id` matches the inbound parent span id, confirming W3C context propagation from caller into Airflow. The same wiring covers `cli.dags.test`, `cli.tasks.test`, and `cli.backfill.create` via the shared `cli_span` helper, and `airflow-core/tests/unit/cli/commands/test_cli_trace.py` asserts the same shape for each.

Closes #66906.
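The parent/child check described in the Evidence section can be reproduced by hand with a few lines of standard-library Python. This is only a sketch: `parse_traceparent` and `is_child_of` are hypothetical names, not functions from this PR or from OpenTelemetry, and the parser accepts only version `00` headers. Returning `None` for a missing or malformed header mirrors the "malformed-header tolerance" behaviour listed under Tests, where the span would presumably fall back to a root span.

```python
import json
import re
from typing import NamedTuple, Optional

# version 00: "00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>"
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class ParentContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: Optional[str]) -> Optional[ParentContext]:
    """Parse a W3C traceparent value; return None if absent or malformed."""
    if not header:
        return None
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def is_child_of(span: dict, parent: ParentContext) -> bool:
    """True if an exported span dict continues the given parent context."""
    return (
        span.get("trace_id") == parent.trace_id
        and span.get("parent_span_id") == parent.span_id
    )


# The "After" span from the Evidence section, copied verbatim.
EXPORTED_SPAN = json.loads(
    """
    {"name": "cli.dags.trigger",
     "trace_id": "0af7651916cd43dd8448eb211c80319c",
     "span_id": "69f1f3e7418b8123",
     "parent_span_id": "b7ad6b7169203331",
     "attributes": {"airflow.dag_id": "example_bash_operator",
                    "airflow.dag_run.run_id": "repro_run"}}
    """
)

if __name__ == "__main__":
    parent = parse_traceparent(
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    )
    print(parent is not None and is_child_of(EXPORTED_SPAN, parent))
```

The dict shape used here is the JSON rendering shown in this PR description, not a stable exporter format, so the field names are an assumption tied to that rendering.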