Fix race condition between triggerer and scheduler #21316
Force-pushed from 22a12cc to 95538a1.
I do not know as much about the triggerer, but maybe @andrewgodwin might chime in?
@malthe do you think you could write a test for this?
This seems to make sense to me - deferring definitely exposed some edge cases in the scheduler where things transitioned state "too fast", and since the "true fix" would be an entire rejiggling of the database schema and state machine, this looks like a sensible fix that won't take ages!
Rebasing should fix the docker failure.
cc: @malthe
Force-pushed from 367011e to 4b7d512.
@dstandish tests added in 367011e339c5d8afeaeea98690e1bc29c19641ba and branch rebased.
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
suggested a test modification but looks good to me
Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>
(cherry picked from commit 2a6792d)
A very late "review".
@tanelk nice find – it seems like the Celery executor has its own retrying logic in But it seems like we should be able to rework this to share the same logic.
This is a follow-up to #21316, which did not take into account that CeleryExecutor overrides trigger_tasks and thus ignores whether a task is already running.
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
This fixes an issue that can occur when a task is deferred and the resulting trigger completes before the executor is notified that the task has completed. In that case, rescheduling the task fails, since the executor refuses to enqueue a task which is already running.
The executor's tracking of running tasks always lags behind the task instance state, but this is typically not a problem because tasks are not rescheduled fast enough for the lag to matter. With deferred tasks and triggering logic, this picture changes, since a trigger condition can in some cases be met right away, for example when checking whether a certain external system has a given state (i.e., "sensoring").
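To make the timing concrete, here is a minimal toy model of the race. It is only a sketch: `ToyExecutor`, `enqueue`, `sync` and `simulate_race` are hypothetical names made up for illustration and are not Airflow's classes or methods. The only behavior taken from the description above is that the executor refuses to enqueue a task it still believes is running, and that it learns about finished tasks with a delay.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for an executor; not Airflow's actual class."""

    running: set = field(default_factory=set)
    queued: list = field(default_factory=list)

    def enqueue(self, key: str) -> bool:
        # Refuse a task that the executor still believes is running.
        if key in self.running:
            return False
        self.queued.append(key)
        self.running.add(key)
        return True

    def sync(self, finished: set) -> None:
        # The executor only learns about finished tasks when it syncs,
        # so `running` lags behind the real task state until then.
        self.running -= finished


def simulate_race() -> list:
    executor = ToyExecutor()
    key = "sensor_task"
    events = []

    # 1. The task is enqueued and starts running.
    events.append(("first run enqueued", executor.enqueue(key)))

    # 2. The task defers and its trigger fires right away, so the task is
    #    rescheduled before the executor has synced: the enqueue is refused.
    events.append(("re-enqueue before sync", executor.enqueue(key)))

    # 3. Once the executor has registered that the first run finished,
    #    a later attempt goes through.
    executor.sync({key})
    events.append(("re-enqueue after sync", executor.enqueue(key)))
    return events


if __name__ == "__main__":
    for label, accepted in simulate_race():
        print(f"{label}: {'accepted' if accepted else 'refused'}")
```

The last step only illustrates that a later attempt succeeds once the executor's view has caught up with the task state.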
An alternative solution would be, for example, to sleep for a small amount of time in the triggerer before changing the task state back to `SCHEDULED`, perhaps in the form of a minimum triggering duration that allows the executor to register that the initial task has completed.
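The minimum triggering duration idea could look roughly like the following sketch. It uses plain `asyncio` and is not triggerer code: `fire_after_minimum`, `external_system_ready` and the `min_duration` parameter are assumptions introduced for illustration, and the 0.05 second value is arbitrary.

```python
import asyncio
import time


async def fire_after_minimum(condition, min_duration: float):
    """Await `condition`, but do not return before `min_duration` seconds.

    Hypothetical helper: the padding gives the executor time to register
    that the deferring task run has finished.
    """
    started = time.monotonic()
    result = await condition
    remaining = min_duration - (time.monotonic() - started)
    if remaining > 0:
        # The condition was met too quickly, so pad the wait.
        await asyncio.sleep(remaining)
    return result


async def external_system_ready() -> str:
    # Hypothetical check that is satisfied immediately ("sensoring").
    return "ready"


def demo(min_duration: float = 0.05):
    started = time.monotonic()
    result = asyncio.run(
        fire_after_minimum(external_system_ready(), min_duration)
    )
    return result, time.monotonic() - started


if __name__ == "__main__":
    result, elapsed = demo()
    print(result, f"{elapsed:.3f}s")
```

The trade-off is that every trigger pays at least the minimum delay, including those that would never have hit the race.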