-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Properly reset/adopt taskinstances and fix find_zombies #17846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Looking into |
|
This doesn't handle the case where a scheduler gets SIGKILL'd or hits a power-off event. I'm not sure I understand why having the 'mark schedulers who haven't heartbeated recently enough as failed' action happen at the |
I have fixed that now...what do you think?
Sorry for not the late reply. Initially, I thought that to be the problem but it's not, however, I think keeping it separate is better and I have fixed the sigkill/sigterms. No extra tests were added for now. The real problem as it turned out was that the processor manager kills LocalTaskJob when it detects a zombie but that was because, in the adoption code, the old LocalTaskJob is not marked as failed after resetting the task thereby making it a zombie. EDIT |
403751f to
f157354
Compare
|
@jedcunningham This is currently wrong, not working the way I want it, I have changed the code on my pc still trying to figure out what’s happening. I’ll update it when I’m done |
eceffd2 to
483d6d3
Compare
Currently, we fail long running SchedulerJob instances(LocalTaskJob) with no recent heartbeat in reset_and_adopt_orphaned_tasks method of the Scheduler. Sometimes, this leads to the _find_zombies method of the DAG file process manager running TaskCallbacks. This is how it happens: When the scheduler dies and we reset task instances or adopt, we do not reset the ti.job_id and the new SchedulerJob will create instances(LocalTaskJob) to run the new reset tasks. When the old job which the task instances still has the job_id, has not heartbeated for long, the adopt_or_reset_orphaned tasks will kill those jobs and subsequently or the _find_zombies method will find the tasks with the old jobID and send a TaskCallback which will now fail the tasks being run by new schedulerjob instances This was fixed by first, setting ti.job_id to None This helps to avoid clashing with the _find_zombies method which is now made to only act on jobs that are not running and are also not heartbeating but the task is running. In a situation where the _find_zombies method and adopt_or_reset_orphaned_task method run at the same time, there would be conflict but now there won't be conflicts because _find_zombies only act on failed jobs with running tasks and adopt_or_reset_orphaned_tasks kill jobs
a6e2090 to
d88b1b0
Compare
ashb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are working on two/three overlapping problems here --
- When you exit a scheduler using LocalExecutor, the running tasks (i.e. the LocalTaskJob processes) aren't killed too, and the usually should be.
- LocalExecutor can't addopt tasks (because it expects them to not be running if the scheduler isn't)
The issue might be that we blindly adopt/reset orphaned tasks even though they might still be running and heartbeating.
So this change isn't right I don't think.
Certainly setting ti.job_id = None is wrong -- if the ti.job_id has a value then we want to wait until that has been detected as a zombie before changing that value.
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
Currently, we fail long running SchedulerJob instances(LocalTaskJob) with no recent heartbeat
in reset_and_adopt_orphaned_tasks method of the Scheduler. Sometimes, this leads to the _find_zombies
method of the DAG file process manager running TaskCallbacks. This is how it happens:
When the scheduler dies and we reset task instances or adopt, we do not reset the ti.job_id and the
new SchedulerJob will create instances(LocalTaskJob) to run the new reset tasks. When the old job
which the task instances still has the job_id, has not heartbeated for long, the adopt_or_reset_orphaned
tasks will kill those jobs and subsequently or the _find_zombies method will find the tasks with the old jobID
and send a TaskCallback which will now fail the tasks being run by new schedulerjob instances
This was fixed by first, setting ti.job_id to None
This helps to avoid clashing with the _find_zombies method which is now made to only act on jobs that are not
running and are also not heartbeating but the task is running.
In a situation where the _find_zombies method and adopt_or_reset_orphaned_task method run at the same time,
there would be conflict but now there won't be conflicts because _find_zombies only act on failed jobs with
running tasks and adopt_or_reset_orphaned_tasks kill jobs
Related: #16023
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.