Remove obfuscatory Trace, DebugTrace, and _TraceMeta classes#62154
Remove obfuscatory Trace, DebugTrace, and _TraceMeta classes#62154dstandish wants to merge 3 commits intoapache:mainfrom
Conversation
These classes did basically nothing but obfuscate. As far as I can tell, they only existed as confusing wrappers to call get_otel_tracer conditionally.
nickstenning
left a comment
There was a problem hiding this comment.
This looks like an improvement to me but I also agree with your assessment that in general this code is
deviat[ing] from the official API for not much good reason
Having a single tracer for the entire application is very much not how the OTel libraries are intended to be used and will make reasoning about where traces came from (and, for example, making sampling decisions) unnecessarily hard.
The intent of the design of the OTel packages is that instrumenting code doesn't need to know anything about whether tracing is configured or how. It should use the opentelemetry-api package, e.g.
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
...
def some_function():
# Add attributes on a preexisting span
span = trace.get_current_span()
span.set_attributes(
{
"agent.id": id_,
"agent.uid": str(uid),
}
)
# Start a new span
with tracer.start_as_current_span("service.agent_supervisor.run_once"):
do_work()Only the code responsible for configuring the tracer provider, span processors, exporter, etc., would usually have to touch the opentelemetry-sdk package.
Reading between the lines I think there might be two separate issues here:
- A desire to provide a
from airflow.*import that provides a shared abstraction over the tracing implementation rather than havingfrom opentelemetry.*imports throughout the codebase. - A desire to have traces and spans which are emitted conditionally, in order to increase or decrease the level of detail.
Personally, I'm not super sympathetic to (1) and would prefer we just instrumented things as above. I think if we really want to have our own interface to tracing, it should be a package that mirrors opentelemetry.trace to some extent.
from airflow.observability import trace
tracer = trace.get_tracer(__name__)On (2) I think there's a more fundamental issue, which is that if the desire is to provide configurable verbosity/detail levels for the traces, it doesn't make sense to try and achieve that at the level of a trace. It has to be a span-by-span decision, so that you can have optional "detailed" spans emitted within traces that are always emitted.
I think the most "native" way to implement this would be by writing a custom head-sampler and defining standard attributes which cause spans to be dropped if the configured detail level is too low, e.g.
from opentelemetry import trace
from airflow.observability.attributes import DETAIL_LEVEL
tracer = trace.get_tracer(__name__)
...
with tracer.start_as_current_span("myspan", attributes={DETAIL_LEVEL: 2}):
do_work()(Note that if you do this, it will need to be based on attributes provided at span creation time as head samplers do not have access to attributes created while the span is in flight.)
xBis7
left a comment
There was a problem hiding this comment.
@dstandish I'm in favor of removing the Trace metaclass and following a simpler approach. But we have to make sure that it's easy to extend in the future if needed to support more tracers. For example, Stats has 4 different implementation including the default no-op. We shouldn't assume that traces will always be either OTel or no-op.
I've added some comments in places that I've wanted to clean up for a while but I haven't had the time to do it.
Another topic for cleanup are all the debug traces. Anything that used to be called with the DebugTrace, is exposing internal operations. These might be useful to a developer and that's why I added a debug flag and didn't remove them entirely. But for sure, they are confusing to users who don't really care about individual steps in the scheduler's loop.
Here is the question, do you find it useful seeing dozen of spans, named after a method name, all with similar or different timings? I think we should either add more info to make them useful or remove them entirely. But that's out of scope for this PR.
| return result | ||
|
|
||
|
|
||
| def gen_link_from_traceparent(traceparent: str): |
There was a problem hiding this comment.
And these as well, aren't used and should be removed
They are doing what extract() and inject() are designed to do.
| ) | ||
|
|
||
|
|
||
| class AirflowOtelIdGenerator(IdGenerator): |
There was a problem hiding this comment.
This should also be removed entirely. Generating a span_id and a trace_id are internal operations of the SDK and we shouldn't mess with them.
| @@ -17,16 +17,44 @@ | |||
| from __future__ import annotations | |||
There was a problem hiding this comment.
There has been a plan to refactor this, add an initialize method and move it under shared, similarly to Stats.
Also related
#62127
|
If the integration tests here pass, then it means that the behavior has stayed the same. Also, you can test the behavior manually like so
|
|
Hi @nickstenning,
We can't use opentelemetry.* imports throughout the codebase and let me explain why. Airflow is a complicated project composed of a bunch of independent processes and subprocesses. You don't have a central in-memory constructor that initializes a bunch of objects and shares them across all processes. When I'm saying processes, I'm referring to a scheduler, a worker, a dag processor, etc. Each one of these, needs to configure the SDK and get its own instance of a tracer. That's why we need a common file with the SDK initialization logic that can be reused. Addiitionally, the current implementation with the metaclass makes sure that the SDK is initialized only once per process. Apart from the initialization, airflow is very hard to auto-instrument. Many operations are depending on db updates. E.g. once that is changed in the db, then take this action. For that reason, we have to handle the span context change manually and start-end spans manually.
The goal is for users to have continuous visibility under tasks. To be able to figure out, at which step of the execution the task is and then also understand how long each took. Apart from the dag_run traces that also allow the creation of task sub-spans, IMO everything else should be removed. The debug spans are providing details about internal operations and they are only useful to developers who have better tools at their disposal like an IDE debugger or a flamegraph profiler. @nickstenning If I understand correctly, you are also agreeing on this. |
It's certainly true that we need common logic for SDK initialization, but I don't think it's a given that that should happen through calls to the tracing API (I'm using these terms in the sense of It seems to me that it would be reasonable to have common SDK initialization logic which is called at bootstrap time for each process. Separately, there's the question of what the instrumentation API looks like for instrumented code. It certainly sounds like there are some more complicated use cases where trace context will have to be retrieved from an external system, spans started and ended explicitly, etc., but none of this is unfamiliar territory for I'm not arguing that we shouldn't have some helper packages to make instrumenting Airflow easier, but I think the current implementation is somewhat mixing up and confusing the instrumentation API with the SDK. I think that should be avoidable and I'd recommend avoiding it.
I'm not super familiar with what is instrumented with debug spans today. I suspect you are probably right. There may still be good arguments for a configurable level of detail in the traces emitted by Airflow. If we want to do that I'd still suggest going the head-sampling route. |
These classes did basically nothing but obfuscate.
As far as I can tell, they only existed as confusing wrappers to call get_otel_tracer conditionally.
I don't think I changed behavior meaningfully, but it's hard to really tell.
Main action is in
airflow-core/src/airflow/observability/trace.pyand the equivalent copy-pasted sdk one.Side note
Ultimately I think we should probably also remove the OtelTrace class. But this is a first step in trying to clean this up a bit. I do not think the OtelTrace class should exist. It just slightly tweaks some of the methods in the upstream API. It also is inconsistent with EmptyTrace and their shared (but not actually implemented) protocol ("Tracer"). But this slight tweaking is confusing for anyone who wants to do any work here since it deviates from the official API for not much good reason. And it's confusing for AI agents as well.
cc @xBis7 @ashb @jedcunningham @nickstenning