-
Notifications
You must be signed in to change notification settings - Fork 14.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: [openlineage] add debug facet to all events #41217
Conversation
I like it. I think also it would be useful to get some way of temporarily enabling it (via API?). Also I am not sure we want to add it continuously - because this one will have basically - the same information in all events right? So maybe just emit it with some min frequency (i.e. max once per 10 minutes) ? So that we are not changing the behaviour of the system by observing it too much ? This could also be controllable via API / configuration I guess. |
Yes, it would contain the same information across all events when it comes to the list of installed packages. In the future, we might want to extend this facet with more state-specific information, but for now, it's simple.
I considered limiting it in some way (including it in only some of the events like START/COMPLETE, TASK/DAG), but my thinking was that this debug facet should only be used when something goes wrong. When that happens, we might only have a subset of events due to an error. If we limit the emittance in any way, we might end up not having it where we need it. However, I’m open to the idea—maybe there’s a better way of doing it that I haven’t thought of.
I was hoping we could avoid introducing an additional OL configuration flag here. It’s easy to add them, but it can quickly lead to having so many that it becomes hard to manage. My thinking was: if I’m debugging an error, I’m already using debug logs, so I shouldn’t need to set separate variables to fully debug my issue. Rarely do I want the debug facet to appear without the debug logs enabled. Not sure what you mean by the API here, do you have any examples in mind?
I’m not sure I understand this fully. Are you concerned that this debug facet might consume too many resources and thereby affect the system’s performance? If you could explain your concerns in more detail, I can address them more effectively. |
Yeah. No strong concerns here -> just wondered how much extra overhead it will add to retrieve that information - generally speaking the problem with such debugging information is that it might change the behaviour of the system (like adding or avoiding race conditions - introducing things call "heisenbugs" - the more you look at the problem, the less likely it is to occur. So generrally speaking - as low overhead as possible in this kind of debugging facility - the better. In this sense - connecting it to airflow "debug" log level is not necessarily a good idea - because Airflow in debug log generates a loooooot of debugging information and this might impact how airflow behaves. So my concern here is to: a) limit the overhead when it is enabled (this can be done as well by caching the information - so maybe that will be enough to cache the facet. JUST retrieving all installed package information is pretty heavy, and we should do it once per interpreter run ideally. b) I think there should be a way to enable this logging independently from Airflow logs, because it might well be that just enabling debug logs will have other side effects. |
Signed-off-by: Kacper Muda <mudakacper@gmail.com>
221dc1c
to
a43b2f7
Compare
I added caching for the package information and introduced a new flag called |
Signed-off-by: Kacper Muda <mudakacper@gmail.com>
Problem
Whenever we are trying to debug issues related to OpenLineage coming from users, we usually require information about versions of other providers used in the process, for reproducing purposes. It would be great to have that information in the events already to reduce the time and effort needed for debugging, but only when some flag is turned on explicitly on the user side (f.e. airflow log level is set to debug). This PR adds a DebugRun facet to Airflow, that contains information about all packages and their versions. In the future, we could possibly extend that facet with other useful information.
Questions:
Background:
Initially, I aimed to include package information in the producer field for specific facets but wanted to keep it hassle-free for both users and developers. I created some POC for each solution, but this debug facet seems the best for now.
We could automatically track where a facet is created and assign the producer, but the complexity and maintenance effort are not worth it for such a simple feature.
I also considered resetting the producer for facets after creation using a decorator or function in common.compat provider, but this would require additional code from developers when adding OL support for operators. With hook level lineage coming, it could become even more complex. This is something we might revisit in the future, as having the producer field identify the actual provider of the facet at all times is beneficial (regardless of OL configuration, logging level, etc.).
For now, the simple debug facet should suffice.
Please let me know your thoughts.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.