This repository has been archived by the owner on Sep 4, 2021. It is now read-only.
If a flynn-host daemon is gracefully shutting down, it stops heartbeating before stopping its jobs and before closing the HTTP listener. This means that when the scheduler receives the down event, it successfully gets the status from the host (because it is still listening for HTTP requests) and so marks it as healthy again. Then, whilst all the jobs are stopping, the scheduler tries to restart them all because it thinks the host is healthy.
I think the host should stop responding to status requests when it is shutting down, specifically before it stops heartbeating.
FYI due to #1922 (which I have fixed in #2171), when the host is finally marked as down, the "crashed" jobs are still in-memory, so a rectify thinks we now have too many jobs (e.g. 4 discoverd jobs, 1 of them crashed, rather than 3), and then kills a running job, which breaks the cluster.
I am noting this to make sure it is considered when testing that this issue is fixed.
I have made a potential fix in #2421 which sets shutdown=true in the host's service metadata, causing the scheduler to unfollow the host immediately on receipt of the down event, thus avoiding re-following and attempting to restart any jobs on that host.
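For anyone following along, the scheduler-side behaviour described in this comment could look roughly like the sketch below. This is not the code from #2421: `HostEvent`, `Scheduler`, `HandleDown` and the `statusOK` probe are hypothetical names, and only the `shutdown=true` metadata key comes from the description above.

```go
package main

import "fmt"

// HostEvent is a hypothetical down event carrying the host's service metadata.
type HostEvent struct {
	HostID string
	Meta   map[string]string
}

// Scheduler is a hypothetical, heavily reduced scheduler.
type Scheduler struct {
	followed map[string]bool
	statusOK func(hostID string) bool // hypothetical HTTP status probe
}

// HandleDown unfollows a host that announced shutdown=true without probing
// it. Other hosts are only unfollowed if the status probe also fails.
func (s *Scheduler) HandleDown(e HostEvent) {
	if e.Meta["shutdown"] == "true" {
		delete(s.followed, e.HostID)
		return
	}
	if s.statusOK(e.HostID) {
		return // host still answers, keep following it
	}
	delete(s.followed, e.HostID)
}

func main() {
	s := &Scheduler{
		followed: map[string]bool{"host1": true, "host2": true},
		// Simulates a host whose HTTP listener is still open.
		statusOK: func(string) bool { return true },
	}
	s.HandleDown(HostEvent{HostID: "host1", Meta: map[string]string{"shutdown": "true"}})
	s.HandleDown(HostEvent{HostID: "host2", Meta: map[string]string{}})
	fmt.Println(s.followed["host1"], s.followed["host2"]) // false true
}
```

In the sketch the metadata check runs before the status probe, so a host that is shutting down gracefully is dropped even though its HTTP listener is still open.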