
Conversation

@ephraimbuddy
Contributor

@ephraimbuddy ephraimbuddy commented Aug 26, 2021

Currently, we fail long-running LocalTaskJob instances with no recent heartbeat in the
adopt_or_reset_orphaned_tasks method of the SchedulerJob. Sometimes this leads to the _find_zombies
method of the DAG file processor manager running TaskCallbacks. This is how it happens:

When a scheduler dies and we reset or adopt its task instances, we do not reset ti.job_id, and the
new SchedulerJob creates new LocalTaskJob instances to run the reset tasks. When the old job, whose
job_id the task instances still carry, has not heartbeated for a long time, adopt_or_reset_orphaned_tasks
kills those jobs, and subsequently the _find_zombies method finds the tasks with the old job_id
and sends a TaskCallback that fails the tasks now being run by the new LocalTaskJob instances.

This was fixed by first setting ti.job_id to None.
This avoids clashing with the _find_zombies method, which is now made to act only on jobs that are
not running and not heartbeating while their task is still running.

In a situation where the _find_zombies method and the adopt_or_reset_orphaned_tasks method run at the
same time there would previously have been a conflict, but now there won't be: _find_zombies acts only
on failed jobs with running tasks, while adopt_or_reset_orphaned_tasks kills the jobs.

Related: #16023


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Aug 26, 2021
@ephraimbuddy
Contributor Author

Looking into adopt_or_reset_orphaned_tasks to check if it needs improvement ...

@ephraimbuddy ephraimbuddy marked this pull request as draft August 26, 2021 12:44
@jedcunningham
Member

This doesn't handle the case where a scheduler gets SIGKILL'd or hits a power-off event.

I'm not sure I understand why having the 'mark schedulers who haven't heartbeated recently enough as failed' action happen at the adopt_or_reset_orphaned_tasks interval is a problem? Can you expand on that?

@ephraimbuddy
Contributor Author

ephraimbuddy commented Aug 26, 2021

This doesn't handle the case where a scheduler gets SIGKILL'd or hits a power-off event.

I have fixed that now... what do you think?

I'm not sure I understand why having the 'mark schedulers who haven't heartbeated recently enough as failed' action happen at the adopt_or_reset_orphaned_tasks interval is a problem? Can you expand on that?

Sorry for the late reply. Initially I thought that was the problem, but it's not; however, I think keeping it separate is better, and I have fixed the SIGKILL/SIGTERM case. No extra tests have been added yet.

The real problem, as it turned out, was that the processor manager kills the LocalTaskJob when it detects a zombie, but that happened because the adoption code did not mark the old LocalTaskJob as failed after resetting the task, thereby leaving it as a zombie.
I resolved that by failing the LocalTaskJob when the task is reset. It no longer happens, but extra tests are needed; I will add them and also update the commit message.
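A minimal sketch of that adoption step, with hypothetical in-memory objects standing in for Airflow's ORM rows (field names and the "scheduled" reset state are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Job:
    id: int
    state: str  # "running" or "failed"

@dataclass
class TaskInstance:
    job_id: Optional[int]
    state: str

def reset_orphaned_task(ti: TaskInstance, jobs: Dict[int, Job]) -> None:
    # Fail the old LocalTaskJob first, so the zombie detector never
    # sees a "running" job whose task has been handed to a new job.
    old = jobs.get(ti.job_id) if ti.job_id is not None else None
    if old is not None and old.state == "running":
        old.state = "failed"
    ti.job_id = None          # the step the PR description mentions
    ti.state = "scheduled"    # hand the task back to the scheduler
```

The ordering matters: if the task were reset before the old job was failed, there would be a window in which the stale job still looks alive to the zombie check.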

EDIT
Just found that I had disabled _find_zombies and thought all was well :)

@ephraimbuddy ephraimbuddy force-pushed the properly-fail-scheduler-job branch 3 times, most recently from 403751f to f157354 Compare August 27, 2021 11:00
@ephraimbuddy
Copy link
Contributor Author

@jedcunningham This is currently wrong; it's not working the way I want. I have changed the code on my PC and am still trying to figure out what's happening. I'll update it when I'm done.

@ephraimbuddy ephraimbuddy force-pushed the properly-fail-scheduler-job branch 3 times, most recently from eceffd2 to 483d6d3 Compare August 29, 2021 20:35
@ephraimbuddy ephraimbuddy force-pushed the properly-fail-scheduler-job branch from a6e2090 to d88b1b0 Compare August 29, 2021 20:46
@ephraimbuddy ephraimbuddy marked this pull request as ready for review August 29, 2021 20:48
@ephraimbuddy ephraimbuddy changed the title Properly fail SchedulerJob Properly reset/adopt taskinstances and fix find_zombies Aug 29, 2021
@ashb ashb self-assigned this Aug 30, 2021
Member

@ashb ashb left a comment


I think you are working on two/three overlapping problems here --

  1. When you exit a scheduler using LocalExecutor, the running tasks (i.e. the LocalTaskJob processes) aren't killed too, and they usually should be.
  2. LocalExecutor can't adopt tasks (because it expects them not to be running if the scheduler isn't).

The issue might be that we blindly adopt/reset orphaned tasks even though they might still be running and heartbeating.

So I don't think this change is right.

Certainly setting ti.job_id = None is wrong -- if ti.job_id has a value, then we want to wait until that job has been detected as a zombie before changing the value.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Oct 15, 2021
@github-actions github-actions bot closed this Oct 20, 2021
@ephraimbuddy ephraimbuddy deleted the properly-fail-scheduler-job branch February 14, 2022 19:58