Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: [openlineage] add debug facet to all events #41217

Merged
merged 1 commit into from
Aug 12, 2024

Conversation

kacpermuda
Copy link
Contributor

Problem

Whenever we are trying to debug issues related to OpenLineage coming from users, we usually require information about versions of other providers used in the process, for reproducing purposes. It would be great to have that information in the events already to reduce the time and effort needed for debugging, but only when some flag is turned on explicitly on the user side (f.e. airflow log level is set to debug). This PR adds a DebugRun facet to Airflow, that contains information about all packages and their versions. In the future, we could possibly extend that facet with other useful information.

Questions:

  • I'm not sure, if we should add this facet to all events (current implementation), or only add it to some specific events (which?).
  • Can we rely on the Airflow logging level when deciding whether or not to attach this facet? I wanted to avoid creating another flag, that would basically had to be set together with DEBUG logging level. (Whenever debugging user issues, we are asking for debug logs anyway)

Background:

Initially, I aimed to include package information in the producer field for specific facets but wanted to keep it hassle-free for both users and developers. I created some POC for each solution, but this debug facet seems the best for now.

We could automatically track where a facet is created and assign the producer, but the complexity and maintenance effort are not worth it for such a simple feature.

I also considered resetting the producer for facets after creation using a decorator or function in common.compat provider, but this would require additional code from developers when adding OL support for operators. With hook level lineage coming, it could become even more complex. This is something we might revisit in the future, as having the producer field identify the actual provider of the facet at all times is beneficial (regardless of OL configuration, logging level, etc.).

For now, the simple debug facet should suffice.

Please let me know your thoughts.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member

potiuk commented Aug 5, 2024

I like it. I think also it would be useful to get some way of temporarily enabling it (via API?). Also I am not sure we want to add it continuously - because this one will have basically - the same information in all events right?

So maybe just emit it with some min frequency (i.e. max once per 10 minutes) ? So that we are not changing the behaviour of the system by observing it too much ?

This could also be controllable via API / configuration I guess.

@kacpermuda
Copy link
Contributor Author

Also I am not sure we want to add it continuously - because this one will have basically - the same information in all events right?

Yes, it would contain the same information across all events when it comes to the list of installed packages. In the future, we might want to extend this facet with more state-specific information, but for now, it's simple.

So maybe just emit it with some min frequency (i.e. max once per 10 minutes) ?

I considered limiting it in some way (including it in only some of the events like START/COMPLETE, TASK/DAG), but my thinking was that this debug facet should only be used when something goes wrong. When that happens, we might only have a subset of events due to an error. If we limit the emittance in any way, we might end up not having it where we need it. However, I’m open to the idea—maybe there’s a better way of doing it that I haven’t thought of.

I think also it would be useful to get some way of temporarily enabling it (via API?).

This could also be controllable via API / configuration I guess.

I was hoping we could avoid introducing an additional OL configuration flag here. It’s easy to add them, but it can quickly lead to having so many that it becomes hard to manage. My thinking was: if I’m debugging an error, I’m already using debug logs, so I shouldn’t need to set separate variables to fully debug my issue. Rarely do I want the debug facet to appear without the debug logs enabled. Not sure what you mean by the API here, do you have any examples in mind?

So that we are not changing the behaviour of the system by observing it too much ?

I’m not sure I understand this fully. Are you concerned that this debug facet might consume too many resources and thereby affect the system’s performance? If you could explain your concerns in more detail, I can address them more effectively.

@kacpermuda kacpermuda changed the title openlineage: add debug facet to all events feat: [openlineage] add debug facet to all events Aug 5, 2024
@potiuk
Copy link
Member

potiuk commented Aug 5, 2024

Yeah. No strong concerns here -> just wondered how much extra overhead it will add to retrieve that information - generally speaking the problem with such debugging information is that it might change the behaviour of the system (like adding or avoiding race conditions - introducing things call "heisenbugs" - the more you look at the problem, the less likely it is to occur. So generrally speaking - as low overhead as possible in this kind of debugging facility - the better.

In this sense - connecting it to airflow "debug" log level is not necessarily a good idea - because Airflow in debug log generates a loooooot of debugging information and this might impact how airflow behaves.

So my concern here is to:

a) limit the overhead when it is enabled (this can be done as well by caching the information - so maybe that will be enough to cache the facet. JUST retrieving all installed package information is pretty heavy, and we should do it once per interpreter run ideally.

b) I think there should be a way to enable this logging independently from Airflow logs, because it might well be that just enabling debug logs will have other side effects.

Signed-off-by: Kacper Muda <mudakacper@gmail.com>
@kacpermuda
Copy link
Contributor Author

a) limit the overhead when it is enabled (this can be done as well by caching the information - so maybe that will be enough to cache the facet. JUST retrieving all installed package information is pretty heavy, and we should do it once per interpreter run ideally.

b) I think there should be a way to enable this logging independently from Airflow logs, because it might well be that just enabling debug logs will have other side effects.

I added caching for the package information and introduced a new flag called debug_mode, which controls whether this facet is created and added. Now, it should run only once in the scheduler and once per task on the worker, but only when the debug_mode flag is explicitly enabled by the user. I also updated the documentation to inform users about the useful information they can provide to us when debugging.

@mobuchowski mobuchowski merged commit d12eb43 into apache:main Aug 12, 2024
53 checks passed
@kacpermuda kacpermuda deleted the ol-add-debug-facet branch August 12, 2024 09:52
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Aug 20, 2024
Signed-off-by: Kacper Muda <mudakacper@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants