This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

scheduler: Host checks don't detect graceful shutdown #2182

Closed
lmars opened this issue Nov 26, 2015 · 2 comments · Fixed by #2421

Comments

@lmars
Contributor

lmars commented Nov 26, 2015

If a flynn-host daemon is gracefully shutting down, it stops heartbeating before stopping its jobs and before closing its HTTP listener. This means that when the scheduler receives the down event, it successfully gets the status from the host (because the host is still listening for HTTP requests) and marks it as healthy again. Then, whilst all the jobs are stopping, the scheduler tries to restart them all because it thinks the host is healthy.

I think the host should stop responding to status requests when it is shutting down, specifically before it stops heartbeating.
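
To illustrate the ordering proposed above, here is a minimal sketch in Go. It is not the flynn-host code: `hostStatus`, `statusCode`, `beginShutdown` and the `stopHeartbeat` hook are all hypothetical names, and returning 503 is just one assumed way of "not responding" to status requests. The only point is that the shutdown flag is set before heartbeating stops, so a status check triggered by the down event can no longer succeed.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// hostStatus is a hypothetical stand-in for the host's status endpoint.
type hostStatus struct {
	// shuttingDown is set to 1 (atomically) once a graceful shutdown begins.
	shuttingDown int32
	// stopHeartbeat is a hypothetical hook that stops service heartbeats.
	stopHeartbeat func()
}

// statusCode returns the HTTP status the host would report right now.
func (h *hostStatus) statusCode() int {
	if atomic.LoadInt32(&h.shuttingDown) == 1 {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// ServeHTTP answers status requests, failing them once shutdown has begun.
func (h *hostStatus) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(h.statusCode())
}

// beginShutdown flips the flag first and only then stops heartbeating, so
// any status check made after the down event already sees the host as
// unavailable.
func (h *hostStatus) beginShutdown() {
	atomic.StoreInt32(&h.shuttingDown, 1)
	if h.stopHeartbeat != nil {
		h.stopHeartbeat()
	}
}
```

With this ordering, jobs and the HTTP listener can be stopped afterwards in any order without the scheduler seeing a healthy host in between.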

@lmars
Contributor Author

lmars commented Nov 26, 2015

FYI due to #1922 (which I have fixed in #2171), when the host is finally marked as down, the "crashed" jobs are still in-memory, so a rectify thinks we now have too many jobs (e.g. 4 discoverd jobs, 1 of them crashed, rather than 3), and then kills a running job, which breaks the cluster.

I am noting this to make sure it is considered when testing that this issue is fixed.

lmars added a commit that referenced this issue Feb 7, 2016
Fixes #2182.

Signed-off-by: Lewis Marshall <lewis@lmars.net>
@lmars lmars mentioned this issue Feb 7, 2016
@lmars
Contributor Author

lmars commented Feb 7, 2016

I have made a potential fix in #2421 which sets shutdown=true in the host's service metadata, causing the scheduler to unfollow the host immediately on receipt of the down event, thus avoiding re-following and attempting to restart any jobs on that host.
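
For readers following along, the scheduler-side behaviour described in this comment can be sketched roughly as follows. This is not the code from #2421: `serviceEvent`, `scheduler`, `handleDown` and `checkHealth` are hypothetical, simplified names, and only the `shutdown=true` metadata key comes from the comment itself.

```go
package main

// serviceEvent is a hypothetical, simplified "down" event for a host
// service, carrying the metadata the host registered with.
type serviceEvent struct {
	HostID string
	Meta   map[string]string
}

// scheduler is a hypothetical, minimal model of the scheduler's host set.
type scheduler struct {
	following map[string]bool
	// checkHealth stands in for the HTTP status request made to the host.
	checkHealth func(hostID string) bool
}

// handleDown processes a down event. If the host flagged a graceful
// shutdown in its metadata, it is unfollowed immediately and no health
// check is made. Otherwise the host is only unfollowed when the health
// check fails.
func (s *scheduler) handleDown(e serviceEvent) {
	if e.Meta["shutdown"] == "true" {
		delete(s.following, e.HostID)
		return
	}
	if s.checkHealth(e.HostID) {
		return // still reachable, keep following
	}
	delete(s.following, e.HostID)
}
```

In this sketch a host that is still answering HTTP requests while it shuts down is dropped as soon as the down event arrives, whereas a host without the flag keeps the existing check-then-decide path.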
