
Conversation

@ephraimbuddy
Contributor

@ephraimbuddy ephraimbuddy commented Aug 26, 2021

Currently, we fail long-running LocalTaskJob instances with no recent heartbeat in the
adopt_or_reset_orphaned_tasks method of the SchedulerJob. Sometimes this leads to the _find_zombies
method of the DAG file processor manager running TaskCallbacks. This is how it happens:

When a scheduler dies and we reset or adopt its task instances, we do not reset ti.job_id, and the
new SchedulerJob creates new LocalTaskJob instances to run the reset tasks. When the old job, whose
job_id the task instances still carry, has not heartbeated for a long time, adopt_or_reset_orphaned_tasks
kills those jobs, and subsequently the _find_zombies method finds the tasks with the old job_id
and sends a TaskCallback that fails the tasks now being run by the new LocalTaskJob instances.

This was fixed by first setting ti.job_id to None.
This avoids clashing with the _find_zombies method, which is now made to act only on jobs that are
not running and not heartbeating while their task is still running.

In a situation where the _find_zombies method and the adopt_or_reset_orphaned_tasks method run at the
same time there would previously have been a conflict, but now there won't be: _find_zombies acts only
on failed jobs with running tasks, while adopt_or_reset_orphaned_tasks kills the jobs.

Related: #16023


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Aug 26, 2021
@ephraimbuddy
Contributor Author

Looking into adopt_or_reset_orphaned_tasks to check if it needs improvement ...

@ephraimbuddy ephraimbuddy marked this pull request as draft August 26, 2021 12:44
@jedcunningham
Member

This doesn't handle the case where a scheduler gets SIGKILL'd or hits a power-off event.

I'm not sure I understand why having the 'mark schedulers who haven't heartbeated recently enough as failed' action happen at the adopt_or_reset_orphaned_tasks interval is a problem? Can you expand on that?

@ephraimbuddy
Contributor Author

ephraimbuddy commented Aug 26, 2021

This doesn't handle the case where a scheduler gets SIGKILL'd or hits a power-off event.

I have fixed that now... what do you think?

I'm not sure I understand why having the 'mark schedulers who haven't heartbeated recently enough as failed' action happen at the adopt_or_reset_orphaned_tasks interval is a problem? Can you expand on that?

Sorry for the late reply. Initially I thought that was the problem, but it's not; however, I think keeping it separate is better, and I have fixed the SIGKILL/SIGTERM case. No extra tests have been added yet.

The real problem, as it turned out, was that the processor manager kills the LocalTaskJob when it detects a zombie, but that happened because the adoption code did not mark the old LocalTaskJob as failed after resetting the task, thereby leaving it as a zombie.
I resolved that by failing the LocalTaskJob when the task is reset. It no longer happens, but extra tests are needed; I will add them and also update the commit message.
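A minimal sketch of that adoption step, with hypothetical in-memory objects standing in for Airflow's ORM rows (field names and the "scheduled" reset state are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Job:
    id: int
    state: str  # "running" or "failed"

@dataclass
class TaskInstance:
    job_id: Optional[int]
    state: str

def reset_orphaned_task(ti: TaskInstance, jobs: Dict[int, Job]) -> None:
    # Fail the old LocalTaskJob first, so the zombie detector never
    # sees a "running" job whose task has been handed to a new job.
    old = jobs.get(ti.job_id) if ti.job_id is not None else None
    if old is not None and old.state == "running":
        old.state = "failed"
    ti.job_id = None          # the step the PR description mentions
    ti.state = "scheduled"    # hand the task back to the scheduler
```

The ordering matters: if the task were reset before the old job was failed, there would be a window in which the stale job still looks alive to the zombie check.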

EDIT
Just found that I had disabled _find_zombies and thought all was well :)

@ephraimbuddy ephraimbuddy force-pushed the properly-fail-scheduler-job branch 3 times, most recently from 403751f to f157354 Compare August 27, 2021 11:00
@ephraimbuddy
Copy link
Contributor Author

@jedcunningham This is currently wrong; it's not working the way I want. I have changed the code on my PC and am still trying to figure out what's happening. I'll update it when I'm done.

@ephraimbuddy ephraimbuddy force-pushed the properly-fail-scheduler-job branch 3 times, most recently from eceffd2 to 483d6d3 Compare August 29, 2021 20:35
@ephraimbuddy ephraimbuddy force-pushed the properly-fail-scheduler-job branch from a6e2090 to d88b1b0 Compare August 29, 2021 20:46
@ephraimbuddy ephraimbuddy marked this pull request as ready for review August 29, 2021 20:48
@ephraimbuddy ephraimbuddy changed the title Properly fail SchedulerJob Properly reset/adopt taskinstances and fix find_zombies Aug 29, 2021
@ashb ashb self-assigned this Aug 30, 2021
Member

@ashb ashb left a comment


I think you are working on two/three overlapping problems here --

  1. When you exit a scheduler using LocalExecutor, the running tasks (i.e. the LocalTaskJob processes) aren't killed too, and they usually should be.
  2. LocalExecutor can't adopt tasks (because it expects them not to be running if the scheduler isn't).

The issue might be that we blindly adopt/reset orphaned tasks even though they might still be running and heartbeating.

So I don't think this change is right.

Certainly setting ti.job_id = None is wrong -- if ti.job_id has a value, then we want to wait until that job has been detected as a zombie before changing the value.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Oct 15, 2021
@github-actions github-actions bot closed this Oct 20, 2021
@ephraimbuddy ephraimbuddy deleted the properly-fail-scheduler-job branch February 14, 2022 19:58