
[AIRFLOW-5102] Worker jobs should terminate themselves if they can't heartbeat #6284

Merged 1 commit into master on Oct 8, 2019

Conversation

@ashb (Member) commented Oct 8, 2019

Make sure you have checked all steps below.

Jira

Description

  • If a LocalTaskJob fails to heartbeat for scheduler_zombie_task_threshold, it should shut itself down.

    However, at some point, a change was made to catch exceptions inside the heartbeat, so the LocalTaskJob thought it had managed to heartbeat successfully.

    This effectively means that zombie tasks don't shut themselves down. When the scheduler reschedules the job, this means we could have two instances of the task running concurrently.
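The fix described above can be sketched as follows. This is a minimal illustration, not Airflow's actual code: `HeartbeatingJob`, `_update_heartbeat_in_db`, `db_available`, and `ZOMBIE_THRESHOLD` are invented names standing in for `LocalTaskJob`, its DB session write, and `scheduler_zombie_task_threshold`.

```python
import time

ZOMBIE_THRESHOLD = 300  # seconds; stands in for scheduler_zombie_task_threshold


class HeartbeatingJob:
    """Sketch of a job whose heartbeat only advances on a successful DB write."""

    def __init__(self):
        self.db_available = True
        self.latest_heartbeat = time.monotonic()

    def _update_heartbeat_in_db(self):
        # Placeholder for the real write through a DB session.
        if not self.db_available:
            raise ConnectionError("database unreachable")

    def heartbeat(self):
        try:
            self._update_heartbeat_in_db()
        except Exception:
            # The pre-fix behaviour was to swallow the error *and still*
            # advance latest_heartbeat; after the fix, a failed DB write
            # leaves the timestamp untouched, so the job can notice it
            # has become a zombie and shut itself down.
            pass
        else:
            self.latest_heartbeat = time.monotonic()

    def is_zombie(self):
        return time.monotonic() - self.latest_heartbeat > ZOMBIE_THRESHOLD
```

The key design point is that the timestamp update lives in the `else` branch: swallowing the exception and advancing the timestamp anyway is exactly the behaviour that let zombie tasks keep running.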

Tests

  • I have added tests to ensure that self.latest_heartbeat is only updated when the DB is updated.
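A sketch of what such a test can look like, assuming nothing about Airflow's real test suite: `MiniJob` is a hypothetical stand-in for `BaseJob`, and `unittest.mock` forces the DB write to fail.

```python
import unittest
from unittest import mock


class MiniJob:
    """Minimal stand-in: latest_heartbeat advances only on a successful DB write."""

    def __init__(self):
        self.latest_heartbeat = 0.0

    def _merge_into_db(self):
        pass  # the real code writes the job row via a SQLAlchemy session

    def heartbeat(self, now):
        try:
            self._merge_into_db()
        except Exception:
            return  # failed write: leave latest_heartbeat untouched
        self.latest_heartbeat = now


class TestHeartbeat(unittest.TestCase):
    def test_latest_heartbeat_not_updated_on_db_failure(self):
        job = MiniJob()
        job.heartbeat(now=1.0)
        with mock.patch.object(job, "_merge_into_db",
                               side_effect=IOError("db down")):
            job.heartbeat(now=2.0)
        # The failed heartbeat must not have advanced the timestamp.
        self.assertEqual(job.latest_heartbeat, 1.0)


if __name__ == "__main__":
    unittest.main()
```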

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • None

@potiuk (Member) left a comment

Nice!

…heartbeat

If a LocalTaskJob fails to heartbeat for
scheduler_zombie_task_threshold, it should shut itself down.

However, at some point, a change was made to catch exceptions inside the
heartbeat, so the LocalTaskJob thought it had managed to heartbeat
successfully.

This effectively means that zombie tasks don't shut themselves down.
When the scheduler reschedules the job, this means we could have two
instances of the task running concurrently.
@ashb force-pushed the stop-after-failed-heartbeat branch from 958604b to 295ee8b on October 8, 2019 14:25
@codecov-io

Codecov Report

Merging #6284 into master will decrease coverage by <.01%.
The diff coverage is 90.9%.


@@            Coverage Diff             @@
##           master    #6284      +/-   ##
==========================================
- Coverage   80.32%   80.31%   -0.01%     
==========================================
  Files         612      612              
  Lines       35395    35396       +1     
==========================================
- Hits        28432    28430       -2     
- Misses       6963     6966       +3
Impacted Files                      Coverage Δ
airflow/jobs/local_task_job.py      85%    <100%>    (-5%)    ⬇️
airflow/jobs/base_job.py            88.73% <88.88%>  (+2.2%)  ⬆️
airflow/utils/dag_processing.py     56.55% <0%>      (-0.35%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d71f33...295ee8b. Read the comment docs.

@@ -111,8 +111,7 @@ def test_localtaskjob_heartbeat(self, mock_pid):
         session.merge(ti)
         session.commit()

-        ret = job1.heartbeat_callback()
-        self.assertEqual(ret, None)
+        job1.heartbeat_callback()
@ashb (Member, Author) commented:
This change was made because pylint started complaining about it.
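The complaint can be reproduced in isolation. This is a sketch: the specific checker is assumed to be pylint's `assignment-from-none` (E1128), and `heartbeat_callback` here is a stand-in, not Airflow's method.

```python
def heartbeat_callback():
    """Stand-in for a callback that acts purely by side effect."""
    return None


# pylint flags this binding (assignment-from-none), because the function
# can only ever return None, making the assignment pointless:
ret = heartbeat_callback()
assert ret is None

# Calling it for the side effect alone, as the diff does, avoids the warning:
heartbeat_callback()
```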

@ashb ashb merged commit 68b8ec5 into apache:master Oct 8, 2019
ashb added a commit to ashb/airflow that referenced this pull request Oct 10, 2019
…heartbeat (apache#6284)

(commit message identical to the PR description above)

(cherry picked from commit 68b8ec5)
@ashb ashb deleted the stop-after-failed-heartbeat branch October 25, 2019 08:46
arcadecoffee pushed a commit to SpareFoot/incubator-airflow that referenced this pull request Feb 5, 2020
…heartbeat (apache#6284)

(commit message identical to the PR description above)
Labels: none yet
Projects: none yet
Linked issues: none yet
4 participants