This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

scheduler should ignore hosts that aren't accepting jobs #2981

Closed
titanous opened this issue Jun 23, 2016 · 4 comments

Comments

@titanous
Contributor

A single host in a cluster started hanging on AddJob calls, which caused the ratelimit to be exhausted indefinitely. This cluster was running a version before a69a455, so the scheduler tried to repeatedly place the job on the same host, resulting in perpetually pending jobs.

@lmars I'm not sure of the exact behavior with a69a455 when a host is misbehaving but not down. Do you think we need more code to handle this situation?

/cc @quentez

@lmars
Contributor

lmars commented Jun 24, 2016

The scheduler has the concept of a host check, which it performs if a host drops from discoverd. If the check fails more than 10 times, the host is unfollowed, which means jobs are no longer scheduled on that host (the scheduler also treats that host as gone and reschedules all its jobs onto healthy hosts, see here).

It seems we should have a similar mechanism for dealing with hosts in the state described.
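To make the idea concrete, here is a minimal Go sketch of failure counting with an unfollow threshold, along the lines of the host check described above. All names (`HostTracker`, `RecordCheck`, `maxHostChecks`) are hypothetical and are not taken from Flynn's scheduler. The sketch also assumes that a successful check resets the counter and that an unfollowed host stays unfollowed.

```go
package main

import "fmt"

// maxHostChecks mirrors the "more than 10 times" threshold mentioned above.
// The constant and the types below are hypothetical, not Flynn's scheduler code.
const maxHostChecks = 10

// HostTracker counts consecutive failed checks per host.
type HostTracker struct {
	failures   map[string]int
	unfollowed map[string]bool
}

// NewHostTracker returns an empty tracker.
func NewHostTracker() *HostTracker {
	return &HostTracker{
		failures:   make(map[string]int),
		unfollowed: make(map[string]bool),
	}
}

// RecordCheck records the result of one check and reports whether the
// host is unfollowed afterwards. A success resets the failure counter
// (an assumption); it does not re-follow an already unfollowed host.
func (t *HostTracker) RecordCheck(hostID string, ok bool) bool {
	if ok {
		delete(t.failures, hostID)
		return t.unfollowed[hostID]
	}
	t.failures[hostID]++
	if t.failures[hostID] > maxHostChecks {
		t.unfollowed[hostID] = true
	}
	return t.unfollowed[hostID]
}

func main() {
	t := NewHostTracker()
	for i := 1; i <= maxHostChecks+1; i++ {
		if t.RecordCheck("host-1", false) {
			fmt.Printf("host-1 unfollowed after %d failed checks\n", i)
		}
	}
}
```

The same counter could in principle be fed by failed or ratelimited AddJob calls rather than only by discoverd drops, though whether that is the right trigger is exactly the open question in this thread.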

@titanous do you know why the calls were hanging, and did the scheduler get an HTTP timeout error or did it hang too?

@titanous
Contributor Author

The scheduler was getting the ratelimit error:

lvl=eror msg="error adding job to the cluster" component=scheduler fn=StartJob app.id=0f5f6d25-2629-4892-8752-a049d4b8ad4e release.id=1990bb16-6cf7-4cd6-bb38-214d482d0f47 job.type=web err="ratelimited: maximum concurrent AddJob calls running, try again later"
lvl=info msg="failed to start job after 820 attempts, waiting 30s before trying again" component=scheduler fn=StartJob app.id=0f5f6d25-2629-4892-8752-a049d4b8ad4e release.id=1990bb16-6cf7-4cd6-bb38-214d482d0f47 job.type=web

Ping me if you want the full log.

@titanous
Contributor Author

titanous commented Jun 24, 2016

Also, it's not clear that we should reschedule all jobs on the host in this case; it may be enough to stop placing new ones.
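A rough Go sketch of that distinction: filter hosts out of placement for new jobs while leaving their existing jobs untouched. The `Host` type, its `AcceptingJobs` flag, and `PlacementCandidates` are hypothetical names for illustration, not part of Flynn's scheduler.

```go
package main

import "fmt"

// Host is a hypothetical, simplified view of a cluster host.
type Host struct {
	ID            string
	Jobs          []string // jobs already running on the host
	AcceptingJobs bool     // false once AddJob calls hang or stay ratelimited
}

// PlacementCandidates returns the hosts eligible for new jobs.
// It does not modify any host, so existing jobs stay where they are.
func PlacementCandidates(hosts []*Host) []*Host {
	var out []*Host
	for _, h := range hosts {
		if h.AcceptingJobs {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	hosts := []*Host{
		{ID: "host-1", Jobs: []string{"web.1"}, AcceptingJobs: false},
		{ID: "host-2", Jobs: []string{"web.2"}, AcceptingJobs: true},
	}
	for _, h := range PlacementCandidates(hosts) {
		fmt.Println("can place new jobs on", h.ID)
	}
	// host-1 is skipped for placement but keeps its running jobs.
	fmt.Println("host-1 still runs", hosts[0].Jobs)
}
```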

@titanous
Contributor Author

Flynn is unmaintained and our infrastructure will shut down on June 1, 2021. See the README for details.
