
Fix future DagRun rarely triggered by race conditions when max_active_runs reached its upper limit. #31414

Merged: 8 commits into apache:main on Aug 8, 2023

Conversation

@doiken (Contributor) commented May 19, 2023

closes: #31407


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler Scheduler or dag parsing Issues label May 19, 2023
@uranusjr (Member) commented:

Static check failure seems to be unrelated (also failing on main).

@eladkal (Contributor) commented May 24, 2023

Can we have a unit test for this change to avoid regressions?

@doiken (Contributor, Author) commented May 25, 2023

In this PR I have changed the DAG retrieval to be done with row locking.
A unit test for this is difficult to implement, and I would appreciate any ideas.

I believe regression coverage is provided by the existing tests, e.g. test_dag_file_processor_process_task_instances and test_dagrun_timeout_fails_run in test_scheduler_job.py.
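
For reference, a minimal sketch of the row-locking idea described above, assuming SQLAlchemy's `with_for_update`; the helper name `get_dagmodel_locked` is illustrative and is not the exact code in this PR:

```python
from __future__ import annotations

from sqlalchemy import select
from sqlalchemy.orm import Session

from airflow.models.dag import DagModel


def get_dagmodel_locked(dag_id: str, session: Session) -> DagModel | None:
    """Fetch the DagModel row under a row-level lock.

    SELECT ... FOR UPDATE SKIP LOCKED: if another scheduler already holds
    the row, return None so the caller can skip this DAG for the current
    scheduling loop instead of blocking on the lock.
    """
    return session.scalar(
        select(DagModel)
        .where(DagModel.dag_id == dag_id)
        .with_for_update(skip_locked=True)
    )
```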

@doiken (Contributor, Author) commented Jun 14, 2023

It has been a while since our last conversation; is there anything I can do to move this PR forward?

@nathadfield (Collaborator) commented:

@doiken Just picking up the thread here, your PR still has some failed static tests.

@doiken (Contributor, Author) commented Jul 12, 2023

@nathadfield
Thanks for your comments.
I have updated the branch and confirmed that all tests passed.

@eladkal eladkal added this to the Airflow 2.6.4 milestone Jul 13, 2023
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Jul 13, 2023
@eladkal eladkal modified the milestones: Airflow 2.6.4, Airflow 2.7.0 Aug 1, 2023
@potiuk (Member) left a review comment:

Usually we don't merge changes without tests, but this one would be rather difficult to write a test for. LGTM

@eladkal eladkal merged commit b53e2ae into apache:main Aug 8, 2023
42 checks passed
ephraimbuddy pushed a commit that referenced this pull request Aug 8, 2023
…_runs reached its upper limit. (#31414)

* feat: select dag_model with row lock

* fix: logging that scheduling was skipped

* fix: remove unused get_dagmodel

* fix: correct log message to more generic word

---------

Co-authored-by: doiken <doiken@users.noreply.github.com>
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: eladkal <45845474+eladkal@users.noreply.github.com>
(cherry picked from commit b53e2ae)
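
The commit message above mentions both the row lock and a log line for the skipped case; here is a hedged sketch of how a caller might tie them together (the function name is illustrative, not the actual scheduler code):

```python
import logging

log = logging.getLogger(__name__)


def _schedule_dag(dag_id: str, session) -> None:
    # Hypothetical caller; reuses get_dagmodel_locked() from the sketch above.
    dag_model = get_dagmodel_locked(dag_id, session)
    if dag_model is None:
        # Another scheduler holds the row: skip this DAG for this loop and
        # say so. This matches the log line reported later in this thread
        # ("scheduling was skipped, probably because the DAG record was locked").
        log.info(
            "DAG %s scheduling was skipped, probably because the DAG record was locked",
            dag_id,
        )
        return
    # ... create or queue DagRuns against the locked row as usual ...
```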
@ZackUhlenhuth commented:

Hello all, I seem to be running into a new deadlock issue due to this change. I've changed my DAG name to DAGNAME in the output below:

[2023-12-03T12:43:48.487+0000] {scheduler_job_runner.py:1426} INFO - DAG DAGNAME scheduling was skipped, probably because the DAG record was locked
[2023-12-03T12:43:56.335+0000] {base.py:73} INFO - Using connection ID 'aws_default' for task execution.
[2023-12-03T12:43:58.509+0000] {dagrun.py:632} ERROR - Marking run <DagRun DAGNAME @ 2023-11-26 12:00:00+00:00: scheduled__2023-11-26T12:00:00+00:00, state:running, queued_at: 2023-12-03 12:00:00.743211+00:00. externally triggered: False> failed

It seems that the scheduler fails to acquire the DAG record lock and then fails the entire DAG run as a result.

Is someone able to explain the cases under which the scheduler would fail to acquire the DAG record lock, and whether there is some configuration I can use to mitigate this? I was not seeing such failures before upgrading from 2.5.1 to 2.7.2.
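
For anyone trying to observe the "record was locked" situation outside the scheduler, here is a standalone sketch (assuming a PostgreSQL metadata database; the DSN and dag_id are placeholders) showing why a second session sees the row as unavailable while another transaction holds FOR UPDATE on it:

```python
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

from airflow.models.dag import DagModel

# Placeholder DSN; point this at your Airflow metadata database.
engine = create_engine("postgresql+psycopg2://airflow:***@localhost:5432/airflow")

with Session(engine) as s1, Session(engine) as s2:
    # Session 1 locks the dag row and keeps its transaction open,
    # standing in for another scheduler loop that is mid-scheduling.
    s1.scalar(select(DagModel).where(DagModel.dag_id == "DAGNAME").with_for_update())

    # Session 2 asks for the same row with SKIP LOCKED and gets None back;
    # this is the condition the "scheduling was skipped" log corresponds to.
    skipped = s2.scalar(
        select(DagModel)
        .where(DagModel.dag_id == "DAGNAME")
        .with_for_update(skip_locked=True)
    )
    print(skipped)  # None while session 1 still holds the lock
```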

ephraimbuddy added a commit to astronomer/airflow that referenced this pull request Feb 21, 2024
…x_active_runs reached its upper limit. (apache#31414)"

This reverts commit b53e2ae.
ephraimbuddy added a commit that referenced this pull request Feb 21, 2024
…x_active_runs reached its upper limit. (#31414)" (#37596)

This reverts commit b53e2ae.
ephraimbuddy added a commit that referenced this pull request Feb 21, 2024
…x_active_runs reached its upper limit. (#31414)" (#37596)

This reverts commit b53e2ae.

(cherry picked from commit b38d59b)
ephraimbuddy added a commit that referenced this pull request Feb 22, 2024
…x_active_runs reached its upper limit. (#31414)" (#37596)

This reverts commit b53e2ae.

(cherry picked from commit b38d59b)
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
…x_active_runs reached its upper limit. (apache#31414)" (apache#37596)

This reverts commit b53e2ae.
Labels: area:Scheduler (Scheduler or dag parsing issues), type:bug-fix (Changelog: Bug Fixes)
Linked issue: Future DagRun rarely triggered by Race Condition when max_active_runs has reached its upper limit (#31407)
7 participants