Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change AirflowTaskTimeout to inherit BaseException #35653

Merged
merged 1 commit into from Feb 21, 2024

Conversation

hterik
Copy link
Contributor

@hterik hterik commented Nov 15, 2023

Code that normally catches Exception should not implicitly ignore interrupts from AirflowTaskTimout.

Fixes: #35644

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

Code that normally catches Exception should not implicitly ignore interrupts from AirflowTaskTimout.

Fixes #35644 #35474

I do not think #35474 is fixed by it. We STILL need to handle the case where low-level c-program does not handle SIGALRM - cc: @Taragolis . This one only fixes the case when someone deliberately caught all exceptions and ignored them - which is I believe different issue that #35474 describes.

@hterik
Copy link
Contributor Author

hterik commented Nov 15, 2023

Hmm, SIGALARM wouldn't be injected in the middle of C-extension execution but it would stay pending until the extension returns to Python execution at least right? So it won't be totally ignored, just very much delayed? Same goes for subprocesses i believe.
I can update commit message to clarify it only addresses #35474 partially, that issue has a very big scope if it were to be implemented in full.

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

Hmm, SIGALARM wouldn't be injected in the middle of C-extension execution but it would stay pending until the extension returns to Python execution at least right? So it won't be totally ignored, just very much delayed? Same goes for subprocesses i believe.

Technically speaking it would be only until the C-code checks if the signal arrived and handle it. Properly run long running loop in C-code executed from Python should frequently check if a signal arrived (Python lets the c-code know that there is a signal recived but it's up to the C-code to see it and handle it).

And yes it will be very much delayed, which in case of timeout it will be very much not happening :).

And yes - I fully agree those two should be separate issues and solved separately - so removing it from commmit message makes sense.

@Taragolis
Copy link
Contributor

An in general need to check other Airflow codebase (core), that we do not miss to catch AirflowTaskTimeout where it required, because right now we could catch as AirflowException in some places

try:
    ...
except AirflowException:
    "Surprise! Surprise!"

@Taragolis
Copy link
Contributor

For example StandardTaskRunner

except Exception as exc:
return_code = 1
self.log.error(
"Failed to execute job %s for task %s (%s; %r)",
job_id,
self._task_instance.task_id,
exc,
os.getpid(),
)
except SystemExit as sys_ex:
# Someone called sys.exit() in the fork - mistakenly. You should not run sys.exit() in
# the fork because you can mistakenly execute atexit that were set by the parent process
# before fork happened
return_code = sys_ex.code
except BaseException:
# while we want to handle Also Base exceptions here - we do not want to log them (this
# is the default behaviour anyway. Setting the return code here to 2 to indicate that
# this had happened.
return_code = 2
finally:
try:
# Explicitly flush any pending exception to Sentry and logging if enabled
Sentry.flush()
logging.shutdown()
except BaseException:
# also make sure to silently ignore ALL POSSIBLE exceptions thrown in the flush/shutdown,
# otherwise os._exit() might never be called. We could have used `except:` but
# except BaseException is more explicit (and linters do not comply).
pass

@hterik
Copy link
Contributor Author

hterik commented Nov 15, 2023

An in general need to check other Airflow codebase (core), that we do not miss to catch AirflowTaskTimeout where it required, because right now we could catch as AirflowException in some places

Good point. I would need a lot more help from someone with deeper knowledge of Airflow internals to understand and review full scope of this.

I found a few such issues in taskinstance.py, pushed in latest patchset. Question there is if we should loosen constraint on those types to just accept BaseException instead of listing all types in a Union.

@potiuk
Copy link
Member

potiuk commented Nov 15, 2023

An in general need to check other Airflow codebase (core), that we do not miss to catch AirflowTaskTimeout where it required, because right now we could catch as AirflowException in some places

try:
    ...
except AirflowException:
    "Surprise! Surprise!"

Good point yes.

Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 31, 2023
@jscheffl
Copy link
Contributor

I was reading through the discussion and found this PR now being a bit stale.
I'd make a full review if nobody objects.

Before merge though I would kindly request a re-base and an addition of a newsfragment (see https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#step-4-prepare-pr --> newsfragment).

I would propose to make this into a functional release and not in a patch release, meaning probably 2.9.0.

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 3, 2024
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 18, 2024
@hterik hterik force-pushed the timeout_baseexception branch 2 times, most recently from beb0898 to 2c24222 Compare February 20, 2024 09:39
@hterik hterik force-pushed the timeout_baseexception branch 4 times, most recently from d323064 to f70ff9a Compare February 20, 2024 12:02
@hterik
Copy link
Contributor Author

hterik commented Feb 20, 2024

Latest CI failures look unrelated to change, can someone please help rerun it.

(Airflow FS S3 protocol requires the s3fs library, but it is not installed as it requiresaiobotocore)

In the meantime, I'm really struggling to run the whole test suite myself locally. Both with and without breeze :(

@potiuk
Copy link
Member

potiuk commented Feb 20, 2024

Latest CI failures look unrelated to change, can someone please help rerun it.

The main is broken for that one, and it's being fixed in #37559

In the meantime, I'm really struggling to run the whole test suite myself locally. Both with and without breeze :(

It takes a lot of time, but its usually not expected to do so, usually you should run just related tests, and this is why we have CI to catch other issues, and in most cases you will just see the tests that failed. But generally it should work with Breeze (when you rebuild the image), I wonder what kind of struggles you have? It's not really possible to do anything with unknown struggles.

@jscheffl jscheffl removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 20, 2024
@jscheffl
Copy link
Contributor

@hterik before I do a full test, do you have any recommondation how to test? Did you use any example DAG and timeout?

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this PR and really like it. My (manual) tests with LocalExecutor, SequentialExecutor, CeleryExecutor and KubernetesExecutor (in KinD) worked as expected with a manual timeout+sleep based in an example DAG.

I'd propose (as it is a "feature" change, to make this into 2.9.0 (not a patch to any 2.8.x line

@jscheffl jscheffl added area:core kind:feature Feature Requests labels Feb 20, 2024
@hterik
Copy link
Contributor Author

hterik commented Feb 21, 2024

Thanks @jscheffl , can you please retrigger the failing CI actions? I've run all the unit tests locally now and the CI failures are passing locally.

@potiuk
Copy link
Member

potiuk commented Feb 21, 2024

Thanks @jscheffl , can you please retrigger the failing CI actions? I've run all the unit tests locally now and the CI failures are passing locally.

No. You need to rebase for that - and you can do it yourself.

Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
@potiuk
Copy link
Member

potiuk commented Feb 21, 2024

I clicked "Rebase" button - but it's always faster for you to do rather than wait for someone (in the future) @hterik

@potiuk
Copy link
Member

potiuk commented Feb 21, 2024

(In this case it would have saved you 2 hours)

@jscheffl
Copy link
Contributor

So, GREEN! Now just another second reviewer needed here... then for me this is looking good.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@potiuk potiuk merged commit 581e2e4 into apache:main Feb 21, 2024
59 checks passed
ephraimbuddy pushed a commit that referenced this pull request Feb 21, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes #35644 #35474

(cherry picked from commit 581e2e4)
ephraimbuddy pushed a commit that referenced this pull request Feb 22, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes #35644 #35474

(cherry picked from commit 581e2e4)
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
Code that normally catches Exception should not implicitly ignore
interrupts from AirflowTaskTimout.

Fixes apache#35644 apache#35474
@ephraimbuddy ephraimbuddy added type:bug-fix Changelog: Bug Fixes and removed kind:feature Feature Requests labels Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AirflowTaskTimeout should inherit BaseException instead of Exception?
5 participants