[AIRFLOW-5102] Worker jobs should terminate themselves if they can't heartbeat #6284

Merged: 1 commit merged into apache:master from stop-after-failed-heartbeat on Oct 8, 2019

Conversation

@ashb (Member) commented Oct 8, 2019

Make sure you have checked all steps below.

Jira

Description

  • If a LocalTaskJob fails to heartbeat for scheduler_zombie_task_threshold, it should shut itself down.

    However, at some point, a change was made to catch exceptions inside the heartbeat, so the LocalTaskJob thought it had managed to heartbeat successfully.

    This effectively means that zombie tasks don't shut themselves down. When the scheduler reschedules the job, this means we could have two instances of the task running concurrently. A rough sketch of the intended behaviour is shown below.
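
The following sketch is purely illustrative and is not the actual Airflow code: SketchJob, _write_heartbeat_to_db and zombie_threshold_secs are made-up names, while heartbeat, heartbeat_callback, latest_heartbeat and scheduler_zombie_task_threshold come from this PR. It shows the intended behaviour: latest_heartbeat only advances when the DB update succeeds, so a job that cannot heartbeat for longer than the zombie threshold terminates itself instead of silently assuming it is healthy.

import time


class SketchJob:
    """Illustrative stand-in for a LocalTaskJob-style job; not Airflow code."""

    def __init__(self, zombie_threshold_secs):
        self.zombie_threshold_secs = zombie_threshold_secs
        self.latest_heartbeat = time.monotonic()

    def _write_heartbeat_to_db(self):
        # Placeholder for the real DB update; may raise if the DB is unreachable.
        raise NotImplementedError

    def heartbeat(self):
        try:
            self._write_heartbeat_to_db()
        except Exception:
            # The bug described above: swallowing this error and still treating
            # the heartbeat as successful made zombie jobs look healthy.
            return
        # Only record a successful heartbeat after the DB update went through.
        self.latest_heartbeat = time.monotonic()

    def heartbeat_callback(self):
        # If no successful heartbeat happened within the zombie threshold, shut
        # down so the rescheduled copy of the task cannot run alongside us.
        if time.monotonic() - self.latest_heartbeat > self.zombie_threshold_secs:
            raise SystemExit("no successful heartbeat within threshold; terminating")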

Tests

  • I have added tests to ensure that self.latest_heartbeat is only updated when the DB is updated. A sketch of the idea is shown below.
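
As a sketch only (this is not the test added in this PR; it reuses the hypothetical SketchJob from the sketch above), a test for that behaviour could look roughly like this: if the DB write fails, latest_heartbeat must not move forward.

import unittest
from unittest import mock


class TestHeartbeatSketch(unittest.TestCase):
    def test_latest_heartbeat_not_updated_on_db_failure(self):
        job = SketchJob(zombie_threshold_secs=300)  # hypothetical class from the sketch above
        before = job.latest_heartbeat

        # Simulate the DB update failing inside heartbeat().
        with mock.patch.object(job, "_write_heartbeat_to_db",
                               side_effect=RuntimeError("db unreachable")):
            job.heartbeat()

        # The timestamp is unchanged, so heartbeat_callback() can later detect
        # the stale heartbeat and terminate the job.
        self.assertEqual(job.latest_heartbeat, before)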

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message" (an illustrative example follows the list):
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
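
Purely for illustration (this is not the actual commit message of this PR), a message following the rules above might look like:

[AIRFLOW-5102] Terminate worker jobs on failed heartbeat

A LocalTaskJob that cannot heartbeat for longer than
scheduler_zombie_task_threshold must shut itself down, otherwise the
scheduler may reschedule the task and two copies of it could run at
the same time.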

Documentation

  • None

@potiuk (Member) left a comment:

Nice!

…heartbeat

If a LocalTaskJob fails to heartbeat for
scheduler_zombie_task_threshold, it should shut itself down.

However, at some point, a change was made to catch exceptions inside the
heartbeat, so the LocalTaskJob thought it had managed to heartbeat
successfully.

This effectively means that zombie tasks don't shut themselves down.
When the scheduler reschedules the job, this means we could have two
instances of the task running concurrently.
@ashb force-pushed the stop-after-failed-heartbeat branch from 958604b to 295ee8b on October 8, 2019 14:25
@codecov-io

Codecov Report

Merging #6284 into master will decrease coverage by <.01%.
The diff coverage is 90.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6284      +/-   ##
==========================================
- Coverage   80.32%   80.31%   -0.01%     
==========================================
  Files         612      612              
  Lines       35395    35396       +1     
==========================================
- Hits        28432    28430       -2     
- Misses       6963     6966       +3
Impacted Files Coverage Δ
airflow/jobs/local_task_job.py 85% <100%> (-5%) ⬇️
airflow/jobs/base_job.py 88.73% <88.88%> (+2.2%) ⬆️
airflow/utils/dag_processing.py 56.55% <0%> (-0.35%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d71f33...295ee8b. Read the comment docs.

@@ -111,8 +111,7 @@ def test_localtaskjob_heartbeat(self, mock_pid):
         session.merge(ti)
         session.commit()

-        ret = job1.heartbeat_callback()
-        self.assertEqual(ret, None)
+        job1.heartbeat_callback()
@ashb (Member, Author) commented:

This change was because pylint started complaining about it.
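
For context: this is presumably pylint's assignment-from-none check (E1128), which flags assigning the result of a call that pylint infers to return None. A minimal illustration (not the real test code):

def heartbeat_callback():
    # Implicitly returns None, like the callback in the test above.
    pass


ret = heartbeat_callback()   # pylint: E1128 assignment-from-none
heartbeat_callback()         # no warning: the (None) result is not assigned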

@ashb ashb merged commit 68b8ec5 into apache:master Oct 8, 2019
ashb added a commit to ashb/airflow that referenced this pull request Oct 10, 2019
…heartbeat (apache#6284)

(cherry picked from commit 68b8ec5)
@ashb ashb deleted the stop-after-failed-heartbeat branch October 25, 2019 08:46
arcadecoffee pushed a commit to SpareFoot/incubator-airflow that referenced this pull request Feb 5, 2020
…heartbeat (apache#6284)
