Fix race condition between triggerer and scheduler #21316

malthe · 2022-02-04T07:39:04Z

This fixes an issue that can occur when a task is deferred and the resulting trigger completes before the executor has been notified that the task completed. In that case, rescheduling the task fails because the executor refuses to enqueue a task that it still considers to be running.

The executor's tracking of running tasks always lags behind the task instance state. This is typically harmless because tasks are not rescheduled quickly enough for the lag to matter. With deferred tasks and triggering logic the picture changes, since a trigger condition can in some cases be met right away, for example when checking whether an external system has a given state (i.e., "sensoring").
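To illustrate the race, here is a minimal toy model. It is not Airflow's actual code: `ToyExecutor`, `MAX_QUEUEING_ATTEMPTS`, and the limit of 3 are hypothetical names and values chosen for the sketch. It assumes the fix amounts to retrying the queueing a bounded number of times while the executor still lists the task as running, as the branch name `retry-queuing-when-task-is-in-running-state` suggests.

```python
from collections import Counter, deque

MAX_QUEUEING_ATTEMPTS = 3  # hypothetical limit, not taken from the PR


class ToyExecutor:
    """Toy executor whose view of running tasks lags behind reality."""

    def __init__(self):
        self.queued = deque()      # task keys waiting to be triggered
        self.running = set()       # task keys the executor believes are running
        self.attempts = Counter()  # failed queueing attempts per task key
        self.started = []          # task keys actually handed to a worker
        self.dropped = []          # task keys given up on after too many attempts

    def queue(self, key):
        self.queued.append(key)

    def trigger_tasks(self):
        """Start queued tasks; keep those still marked as running for later."""
        retry = []
        while self.queued:
            key = self.queued.popleft()
            if key in self.running:
                # The previous run has not been registered as finished yet.
                self.attempts[key] += 1
                if self.attempts[key] >= MAX_QUEUEING_ATTEMPTS:
                    self.dropped.append(key)
                    del self.attempts[key]
                else:
                    retry.append(key)
                continue
            self.attempts.pop(key, None)
            self.running.add(key)
            self.started.append(key)
        self.queued.extend(retry)

    def sync(self):
        """Simulate the late notification that running tasks have finished."""
        self.running.clear()


executor = ToyExecutor()
executor.queue("sensor_task")
executor.trigger_tasks()        # first run starts, then the task defers
executor.queue("sensor_task")   # trigger fired right away, task is rescheduled
executor.trigger_tasks()        # still marked as running: kept for a retry
executor.sync()                 # executor finally learns the first run ended
executor.trigger_tasks()        # the rescheduled run starts
```

In this sketch the rescheduled task is started on the pass after `sync()` instead of being rejected outright, so `executor.started` ends up containing the task key twice and `executor.dropped` stays empty.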

An alternative solution would be to sleep for a short time in the triggerer before changing the task state back to SCHEDULED, perhaps in the form of a minimum triggering duration that gives the executor time to register that the initial task has completed.
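A rough sketch of that alternative, assuming an asyncio-based trigger loop. The helper `run_with_minimum_duration` and its parameters are hypothetical and do not exist in Airflow; the sketch only shows the idea of padding a trigger that fires immediately.

```python
import asyncio
import time


async def run_with_minimum_duration(wait_for_event, minimum_seconds):
    """Await the trigger condition, then pad up to a minimum duration.

    `wait_for_event` is a coroutine function that returns once the trigger
    condition is met; `minimum_seconds` is the hypothetical minimum
    triggering duration.
    """
    started = time.monotonic()
    event = await wait_for_event()
    remaining = minimum_seconds - (time.monotonic() - started)
    if remaining > 0:
        # The condition was met too quickly; give the executor time to
        # register that the original task run has completed.
        await asyncio.sleep(remaining)
    return event
```

The drawback of this approach is that every fast trigger pays the delay, and the right minimum duration depends on how far the executor lags, which is why a retry on the executor side may be preferable.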

potiuk · 2022-02-14T15:02:14Z

I do not know much about the triggerer, but maybe @andrewgodwin could chime in?

dstandish · 2022-02-14T17:34:51Z

@malthe do you think you could write a test for this?

andrewgodwin · 2022-02-14T18:30:35Z

This seems to make sense to me - deferring definitely exposed some edge cases in the scheduler where things transitioned state "too fast", and since the "true fix" would be an entire rejiggering of the database schema and state machine, this looks like a sensible fix that won't take ages!

potiuk · 2022-02-14T20:37:37Z

Rebasing should fix the docker failure.

potiuk · 2022-02-14T20:37:58Z

cc: @malthe

malthe · 2022-02-14T21:30:44Z

@dstandish tests added in 367011e339c5d8afeaeea98690e1bc29c19641ba and branch rebased.

github-actions · 2022-02-14T21:49:18Z

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

dstandish

suggested a test modification but looks good to me


tanelk · 2022-04-11T07:02:07Z

A very late "review".

CeleryExecutor overrides the trigger_tasks method, so this fix does not help when using Celery.

malthe · 2022-04-11T07:22:28Z

@tanelk nice find. It seems the Celery executor has its own retry logic in task_publish_retries, which, combined with this pull request, still ends up solving the problem.

But it seems like we should be able to rework this to share the same logic.

This is a follow-up to #21316, which did not take into account that CeleryExecutor overrides trigger_tasks and thus would ignore whether a task was already running.

Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
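The problem with overriding can be sketched with toy classes. This is not the Airflow or Celery executor code: the class names and the `_process_tasks` hook are hypothetical and only illustrate why a subclass that replaces `trigger_tasks` wholesale loses a guard implemented in the base class, while a subclass that overrides a narrower hook keeps it.

```python
class BaseToyExecutor:
    """Toy base class whose trigger_tasks guards against running tasks."""

    def __init__(self):
        self.running = set()  # task keys believed to be running
        self.sent = []        # (destination, key) pairs that were dispatched

    def trigger_tasks(self, keys):
        """Dispatch tasks not marked as running; return the deferred ones."""
        ready, deferred = [], []
        for key in keys:
            (deferred if key in self.running else ready).append(key)
        self.running.update(ready)
        self._process_tasks(ready)
        return deferred

    def _process_tasks(self, keys):
        for key in keys:
            self.sent.append(("local", key))


class OverridingExecutor(BaseToyExecutor):
    """Replaces trigger_tasks entirely, so the running check is skipped."""

    def trigger_tasks(self, keys):
        for key in keys:
            self.sent.append(("remote", key))
        return []


class HookExecutor(BaseToyExecutor):
    """Overrides only the dispatch hook, so the running check still applies."""

    def _process_tasks(self, keys):
        for key in keys:
            self.sent.append(("remote", key))
```

With a task already in `running`, `OverridingExecutor` dispatches it again regardless, whereas `HookExecutor` returns it as deferred and dispatches only the tasks that are safe to start.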

malthe requested review from ashb, kaxil and XD-DENG as code owners February 4, 2022 07:39

boring-cyborg bot added the area:Scheduler label Feb 4, 2022

malthe force-pushed the retry-queuing-when-task-is-in-running-state branch from 22a12cc to 95538a1 February 9, 2022 06:29

malthe mentioned this pull request Feb 9, 2022

Add additional information when queuing fails (running or queued) #21290

Closed

kaxil requested a review from dstandish February 14, 2022 16:12

kaxil added this to the Airflow 2.3.0 milestone Feb 14, 2022

malthe added 2 commits February 14, 2022 22:30

Fix race condition between triggerer and scheduler

dd29b8f

Add tests for 'trigger_tasks' method

4b7d512

malthe force-pushed the retry-queuing-when-task-is-in-running-state branch from 367011e to 4b7d512 February 14, 2022 21:30

potiuk approved these changes Feb 14, 2022

github-actions bot added the full tests needed label Feb 14, 2022

dstandish approved these changes Feb 15, 2022

tests/executors/test_base_executor.py (outdated, resolved)

Make test case stronger and improve variable naming

8bf5181

Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com>

potiuk merged commit 2a6792d into apache:main Feb 15, 2022

jedcunningham added the type:bug-fix label Feb 28, 2022

jedcunningham modified the milestones: Airflow 2.3.0, Airflow 2.2.5 Feb 28, 2022

ephraimbuddy pushed a commit that referenced this pull request Mar 16, 2022

Fix race condition between triggerer and scheduler (#21316)

94a75ae

(cherry picked from commit 2a6792d)

ephraimbuddy pushed a commit that referenced this pull request Mar 20, 2022

Fix race condition between triggerer and scheduler (#21316)

a865c66

(cherry picked from commit 2a6792d)

ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022

Fix race condition between triggerer and scheduler (#21316)

f44d950

(cherry picked from commit 2a6792d)

ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022

Fix race condition between triggerer and scheduler (#21316)

21bcb05

(cherry picked from commit 2a6792d)

ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022

Fix race condition between triggerer and scheduler (#21316)

8bb8d25

(cherry picked from commit 2a6792d)

ephraimbuddy pushed a commit that referenced this pull request Mar 22, 2022

Fix race condition between triggerer and scheduler (#21316)

a9a7f79

(cherry picked from commit 2a6792d)

ephraimbuddy pushed a commit that referenced this pull request Mar 24, 2022

Fix race condition between triggerer and scheduler (#21316)

46ef194

(cherry picked from commit 2a6792d)

ephraimbuddy pushed a commit that referenced this pull request Mar 26, 2022

Fix race condition between triggerer and scheduler (#21316)

d4f82a8

(cherry picked from commit 2a6792d)

ephraimbuddy mentioned this pull request Mar 27, 2022

Status of testing of Apache Airflow 2.2.5rc3 #22549

Closed

36 tasks

tothandor mentioned this pull request Apr 11, 2022

Task stuck in "scheduled" or "queued" state, pool has all slots queued, nothing is executing #13542

Closed

malthe mentioned this pull request Apr 14, 2022

Use inherited 'trigger_tasks' method #23016

Merged

This was referenced May 20, 2022

Race condition between Triggerer and Scheduler #23824

Closed

Do not fail requeued TIs #23846

Merged

This was referenced Jul 2, 2022

Status of testing of Apache Airflow 2.3.3rc1 #24806

Closed

Status of testing of Apache Airflow 2.3.3rc3 #24863

Closed

malthe mentioned this pull request Aug 17, 2022

Tasks marked as "UP_FOR_RESCHEDULE" get stuck in Executor.running and never reschedule #25728

Closed

2 tasks

dstandish mentioned this pull request Dec 25, 2022

Use time not tries for queued & running re-checks. #28586

Merged
