[AIRFLOW-5102] Worker jobs should terminate themselves if they can't heartbeat #6284
Nice!
…heartbeat If a LocalTaskJob fails to heartbeat for scheduler_zombie_task_threshold, it should shut itself down. However, at some point, a change was made to catch exceptions inside the heartbeat, so the LocalTaskJob thought it had managed to heartbeat successfully. This effectively means that zombie tasks don't shut themselves down. When the scheduler reschedules the job, this means we could have two instances of the task running concurrently.
Codecov Report
@@            Coverage Diff             @@
##           master    #6284      +/-   ##
==========================================
- Coverage   80.32%   80.31%   -0.01%
==========================================
  Files         612      612
  Lines       35395    35396       +1
==========================================
- Hits        28432    28430       -2
- Misses       6963     6966       +3
@@ -111,8 +111,7 @@ def test_localtaskjob_heartbeat(self, mock_pid):
         session.merge(ti)
         session.commit()

-        ret = job1.heartbeat_callback()
-        self.assertEqual(ret, None)
+        job1.heartbeat_callback()
This change was made because pylint started complaining about it.
…heartbeat (apache#6284) If a LocalTaskJob fails to heartbeat for scheduler_zombie_task_threshold, it should shut itself down. However, at some point, a change was made to catch exceptions inside the heartbeat, so the LocalTaskJob thought it had managed to heartbeat successfully. This effectively means that zombie tasks don't shut themselves down. When the scheduler reschedules the job, this means we could have two instances of the task running concurrently. (cherry picked from commit 68b8ec5)
Make sure you have checked all steps below.
Jira
Description
If a LocalTaskJob fails to heartbeat for scheduler_zombie_task_threshold, it should shut itself down.
However, at some point, a change was made to catch exceptions inside the heartbeat, so the LocalTaskJob thought it had managed to heartbeat successfully.
This effectively means that zombie tasks don't shut themselves down. When the scheduler reschedules the job, this means we could have two instances of the task running concurrently.
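The failure mode described above can be illustrated with a minimal sketch. This is not Airflow's actual code: `HeartbeatJob`, `write_heartbeat`, `check_alive`, and `FakeClock` are hypothetical names invented for illustration, and `zombie_threshold` merely stands in for `scheduler_zombie_task_threshold`. The point it tries to show is that the timestamp should advance only after the database write succeeds, so that a swallowed exception cannot be mistaken for a successful heartbeat.

```python
import time


class FakeClock:
    """Hypothetical controllable clock, used only to make the sketch testable."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class HeartbeatJob:
    """Hypothetical stand-in for a job that records heartbeats in a database."""

    def __init__(self, write_heartbeat, zombie_threshold, clock=time.monotonic):
        # write_heartbeat: callable that persists the heartbeat; it may raise.
        self.write_heartbeat = write_heartbeat
        # Seconds allowed without a successful heartbeat before self-termination.
        self.zombie_threshold = zombie_threshold
        self.clock = clock
        self.latest_heartbeat = clock()
        self.terminated = False

    def heartbeat(self):
        try:
            self.write_heartbeat()
        except Exception:
            # The error is swallowed so the loop keeps running, but
            # latest_heartbeat is deliberately left untouched.
            return
        # Only advance the timestamp once the write has succeeded.
        self.latest_heartbeat = self.clock()

    def check_alive(self):
        # Shut down if too long has passed since the last successful heartbeat.
        if self.clock() - self.latest_heartbeat > self.zombie_threshold:
            self.terminated = True
        return not self.terminated
```

With this shape, a job whose database writes keep failing eventually stops itself, rather than continuing to run alongside a rescheduled copy of the same task.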
Tests
self.latest_heartbeat is only updated when the DB is updated.
Commits
Documentation