Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AirflowTaskTimeout should inherit BaseException instead of Exception? #35644

Closed
1 of 2 tasks
potiuk opened this issue Nov 15, 2023 Discussed in #35643 · 14 comments · Fixed by #35653
Closed
1 of 2 tasks

AirflowTaskTimeout should inherit BaseException instead of Exception? #35644

potiuk opened this issue Nov 15, 2023 Discussed in #35643 · 14 comments · Fixed by #35653

Comments

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

Discussed in #35643

Originally posted by hterik November 14, 2023

Apache Airflow version

2.7.3

What happened

Assume following code runs inside airflow task:

errors = []
for item in items_to_process:
    try: 
        process(item)   # This can take several minutes/hours each
    except Exception as e:
        errors.append(e)
if errors:
    raise Exception(......)

If the Airflow task time out, what happens is that it injects an AirflowTaskTimeout exception where the code is currently running now.
If the code is designed to capture exceptions, it will capture the timeout and potentially continue running for several hours anyway.

What you think should happen instead

I think it would be better if AirflowTaskTimeout was treated similarly to KeyboardInterrupt, so it need explicit except block to capture.
Moving it outside of the Exception inheritance tree to inherit directly from BaseException would solve that.
See https://docs.python.org/3/library/exceptions.html#exception-hierarchy
Timeouts are generally in place to be last resort of aborting process by the higher level system if the lower level code has bugs in it, so it's safer to assume that code that doesn't deal with catching timeout should be aborted.

How to reproduce

See above

Operating System

NA

Versions of Apache Airflow Providers

apache-airflow==2.7.1

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@potiuk potiuk added kind:bug This is a clearly a bug area:core needs-triage label for new issues that we didn't triage yet kind:feature Feature Requests and removed kind:bug This is a clearly a bug labels Nov 15, 2023
@potiuk
Copy link
Member Author

potiuk commented Nov 15, 2023

Yes. good idea. Might be worth implementing it.
BTW. There is another problem with tasks running low-level C-code that are not interrupted by signal (and should be handled in yet another way) - see #35474 - but indeed when code is catching all exceptions, chaning timeout to BaseException should work well. Would you like to implement it @hterik ?

@Taragolis
Copy link
Contributor

That is exactly what a Option 1 propose 😄

@hterik
Copy link
Contributor

hterik commented Nov 15, 2023

Proposed change here #35653

@potiuk
Copy link
Member Author

potiuk commented Nov 15, 2023

That is exactly what a Option 1 propose 😄

Yeah. But it's not the same issue (or at least I believe it's not) - I believe the original issue was connected with SIGALRM not being properly handled by long-running low-level c-code. Catching and ignoring exceptions is a bad programming habit by anyone who run task in a loop from example described hy @hterik. You should generally ignore known exceptions not all of them if you want to add them to the loop in the fashon described here.

We should still make the change to AirflowTaskTimeout inheriting from BaseException to handle this "bad programming" case of course - (we do not want our users to shoot themselves in a foot by not knowing that they are doing something wrong) - but I think Kafka case from #35474 was somethign entirely different (only loosely related). I highly doubt somone in Kafka code handles all exceptions this way - my guess is that SIGALRM was not handled there :)

@Taragolis
Copy link
Contributor

Yeah. But it's not the same issue (or at least I believe it's not) - I believe the original issue was connected with SIGALRM not being properly handled by long-running low-level c-code.

To be honest we don't know what the reason in particular this cases. It could be both:

  1. Redefine SIGALRM handlers in upstream code
  2. To broad catch exceptions in libraries or user code

In second case when Airflow task SIGALARM works, we do not control in which place in Main thread code returns

In this case it will break Airflow timeout, and in addition it might breaks something else

try
    ...  # <- returns here
except Exception:
    pass

In case if handler return in this code block, then change inheritance wouldn't help at all

try
    ...  # <- returns here
except:  # or except BaseException:
    pass

hterik added a commit to hterik/airflow that referenced this issue Nov 15, 2023
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644
Partially addresses apache#35474
@potiuk
Copy link
Member Author

potiuk commented Nov 15, 2023

To be honest we don't know what the reason in particular this cases. It could be both:

Yep. There might be many reasons - that's why I would leave #35474 open (and likely someone taking a stab on it). For me managing timeout from the parent process that can perform signal escalation and eventually SIGKILL the task is tho only "ultimate" solution that works in all cases (though defining Timeout as BaseException first is nicer because it will allow for much more controlled and gentle timeout handling with closing all resources in case user does - indeed - catches and does not bubble up Exception.

@Taragolis
Copy link
Contributor

The problem a bit more deep, all current exceptions is a part of Public Interface of Airflow which I think it should not be, there are only couple exceptions which should be a part of public interface and IMO AirflowException and AirflowTaskTimeout should not be a part of public interface but we have what we have. Yet another case when we should decide why it not breaking changes.

About SIGKILL it would be trade off between work almost always and complete on_kill method, btw I work on that issue, however so many different behaviours so I think it should be configurable option to enable/disable "ultimate task killer"

@potiuk
Copy link
Member Author

potiuk commented Nov 17, 2023

The problem a bit more deep, all current exceptions is a part of Public Interface of Airflow which I think it should not be, there are only couple exceptions which should be a part of public interface and IMO AirflowException and AirflowTaskTimeout should not be a part of public interface but we have what we have. Yet another case when we should decide why it not breaking changes.

Yes. I think we should remove those "non -public" exceptions from the list. The list might contain things that should not be there and we are free to remove them if they found its way by mistake (like any other bugfix).. Might be a good idea to do so as a separate PR.

But even if not - I think changing the base for exception does not qualify as "Breaking change" IMHO. It's still an exception and it does not change the usage of it.

@potiuk
Copy link
Member Author

potiuk commented Nov 17, 2023

Just a comment on why we can modify the list - SemVer is not a set-in-stone thing. It's merely a statement of intention. Our intention is to make the "Public Interface" correct. But if - by mistake - we have things we did not intended to be there in the first place, we are perfectly fine to remove those "mistakes" from it.

@Taragolis
Copy link
Contributor

Taragolis commented Nov 17, 2023

I've found some grey area of Public interface, but let's focus on Exceptions and current case. Many of the things point for the future improvements or clarifications.

Maybe you could help me with this puzzle

Statement 1. AirflowTaskTimeout it is internal part of core and should only use in the core because it might change behaviour of different part of Airflow
Statement 2. Rule 1 should also apply to Community Providers
Statement 3. However AirflowTaskTimeout actively uses in providers, so that mean catch AirflowTaskTimeout is intentional behaviour and it should be part of Public interface which is conflicts with Statement 1, or Statement 2 only exists in my head

The same valid with AirflowException, in some circumstances use this exception may lead to catch some internal exceptions which should break something

@potiuk
Copy link
Member Author

potiuk commented Nov 21, 2023

I've found some grey area of Public interface,

There are probably plenty. It's open for contribution to clarify them as usual :).

SemVer is very clear about this, https://semver.org/ SemVer is a "communication" tool. It is about communication of our intentions about what our public API does, not about making sure all implementation details follow strict "I do not change this/ I change that" policy.

Statement 1. AirflowTaskTimeout it is internal part of core and should only use in the core because it might change behaviour of different part of Airflow

I'd modify it a bit. The existence of AirflowTaskTimeout (with intent of being an exception to be thrown when timeout occures is Public. The implementation and fact that it derives from BaseException or Exception is internal implementation details. If we change it, we should tell it but it does not change the intention. Public APIs are usually about intentions not about details. Someone COULD rely on the implementation detail of it being based on Exception but this was not the intention we had.

Hyrum's law is very clear about it (and I fully agree with it). Any change has potential of breaking someone's workflow. But Public APIS are intentions, not promises - and the change here which i see is that we clarify the intention we had with that exception.

Statement 2. Rule 1 should also apply to Community Providers

Yes. Applies well - intention of signifying timeout is Public, all the other details being internal implementation details.

Statement 3. If they are used according to the intention (finding out when the task had been timed-out and reacting to it somwhow) - all is cool. There is no conflict. The conflict is only if someone used it assuming that it will always be derived from BaseException (we never intended that). If you used it 👎

try:
   do something
except Timeout:
    /// react somehow

All is fine and well.

We will have cases that we break someone's workflow. TOUGH. We will have to bite the bullet. If we have not clarified our intentions well enough before - it might be even seen as our fault. Also tough. But with clarifying intentions we are moving closer and closer to the place that most of our Publiic APIs is listed and intentions explained.

This should be our goal.

@potiuk
Copy link
Member Author

potiuk commented Nov 21, 2023

The same valid with AirflowException, in some circumstances use this exception may lead to catch some internal exceptions which should break something

And yes. This is inevitable. It's our job as maintainers to take a calculated risk, whether it's worth risking it when we clarify our intention and leave "other ways" where things can be broken.

This is of course extreme case, but it very nicely shows that SemVer and our decisions are never 0/1. It's always continous spectrum of us impacting other workflows. And we should take a decision how far we want to go when we do - actually - clarify our intentions.

image

And comment on this one (the Timeout case):

  1. yes, there is a risk
  2. yes, it will break someone's workflows with using the exception
  3. but I think our intentions were just to signal timeout, and any other interpretation of this exception being implemented this way are not intentional
  4. so I am perfectly fine as maintainer to take the risk, change it AND clarify our intentions so that in the future they have less chance of being misinterpreted

Of course the only thing you can do is to "attempt" it. People will misinterprete things and will use things in unintended ways (as in the comic above). And you can do absolutely nothing about it. But you can at least make sure that your intentions are clearer and clearer with every iteration. And this should be IMHO our goal with the "Public Interface" page - continue updating it and explaining things every time we find any misinterpretation of it, and sometimes allowe changes even if we know some workflows will be broken, as long as we make sure our intentions are clarified.

@potiuk
Copy link
Member Author

potiuk commented Nov 29, 2023

So ? Any more concerns :) ?

@jscheffl
Copy link
Contributor

The bug report and discussion seems to be a bit stale. Just looking through the PR it looks good and reasonable for me. There might be side effects but we can handle this via a newsfragment I think.
To be on the "safer" side we might put this to a "functional" release, meaning 2.9.0 so that people just upgrading for fixes do catch less "surprises".

@jscheffl jscheffl removed the needs-triage label for new issues that we didn't triage yet label Dec 31, 2023
hterik added a commit to hterik/airflow that referenced this issue Feb 20, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
hterik added a commit to hterik/airflow that referenced this issue Feb 20, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
hterik added a commit to hterik/airflow that referenced this issue Feb 20, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
hterik added a commit to hterik/airflow that referenced this issue Feb 20, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
hterik added a commit to hterik/airflow that referenced this issue Feb 20, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
hterik added a commit to hterik/airflow that referenced this issue Feb 20, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
potiuk pushed a commit to hterik/airflow that referenced this issue Feb 21, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
potiuk pushed a commit that referenced this issue Feb 21, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes #35644 #35474
ephraimbuddy pushed a commit that referenced this issue Feb 21, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes #35644 #35474

(cherry picked from commit 581e2e4)
ephraimbuddy pushed a commit that referenced this issue Feb 22, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes #35644 #35474

(cherry picked from commit 581e2e4)
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this issue Mar 5, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants