This repository has been archived by the owner on Sep 4, 2021. It is now read-only.
A single host in a cluster started hanging on AddJob calls, which caused the rate limit to be exhausted indefinitely. This cluster was running a version before a69a455, so the scheduler repeatedly tried to place the job on the same host, resulting in perpetually pending jobs.
@lmars I'm not sure of the exact behavior with a69a455 when a host is misbehaving but not down. Do you think we need more code to handle this situation?
The scheduler has the concept of a host check, which it performs when a host drops from discoverd; if the check fails more than 10 times, the host is unfollowed, meaning jobs are no longer scheduled on it (the scheduler also treats that host as gone and reschedules all of its jobs onto healthy hosts, see here).
It seems we should have a similar mechanism for dealing with hosts in the state described.
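The unfollow-after-repeated-failures behavior described above could be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: `hostState`, `recordCheck`, and the reschedule hook are assumed names, though the threshold of 10 failed checks comes from the description above.

```go
package main

import "fmt"

// maxFailedChecks mirrors the scheduler's threshold of 10 failed
// host checks before a host is unfollowed (name is illustrative).
const maxFailedChecks = 10

type hostState struct {
	failures int
	followed bool
}

// recordCheck records one host-check result. A success resets the
// counter; the 10th consecutive failure unfollows the host so no new
// jobs are scheduled there. The same point is where the scheduler
// would also reschedule the host's existing jobs onto healthy hosts.
func (h *hostState) recordCheck(ok bool) {
	if ok {
		h.failures = 0
		return
	}
	h.failures++
	if h.failures >= maxFailedChecks {
		h.followed = false
	}
}

func main() {
	h := &hostState{followed: true}
	for i := 0; i < maxFailedChecks; i++ {
		h.recordCheck(false)
	}
	fmt.Println(h.followed) // false: unfollowed after 10 failures
}
```

Extending the same counter to hosts that are up in discoverd but failing (or timing out on) AddJob calls would cover the misbehaving-but-not-down case.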
@titanous do you know why the calls were hanging, and did the scheduler get an HTTP timeout error, or did it hang too?
/cc @quentez