scheduler should ignore hosts that aren't accepting jobs #2981

titanous · 2016-06-23T17:48:08Z

A single host in a cluster started hanging on AddJob calls, which left its AddJob ratelimit exhausted indefinitely. This cluster was running a version before a69a455, so the scheduler repeatedly tried to place the job on the same host, resulting in perpetually pending jobs.

@lmars I'm not sure of the exact behavior with a69a455 when a host is misbehaving but not down. Do you think we need more code to handle this situation?

/cc @quentez

lmars · 2016-06-24T17:14:37Z

The scheduler has the concept of a host check, which it performs if a host drops from discoverd. If the check fails more than 10 times, the host is unfollowed, which means jobs are no longer scheduled on that host. The scheduler also treats that host as gone and reschedules all of its jobs onto healthy hosts (see here).

It seems we should have a similar mechanism for dealing with hosts in the state described.

@titanous do you know why the calls were hanging? Did the scheduler get an HTTP timeout error, or did it hang too?

titanous · 2016-06-24T17:15:52Z

The scheduler was getting the ratelimit error:

lvl=eror msg="error adding job to the cluster" component=scheduler fn=StartJob app.id=0f5f6d25-2629-4892-8752-a049d4b8ad4e release.id=1990bb16-6cf7-4cd6-bb38-214d482d0f47 job.type=web err="ratelimited: maximum concurrent AddJob calls running, try again later"
lvl=info msg="failed to start job after 820 attempts, waiting 30s before trying again" component=scheduler fn=StartJob app.id=0f5f6d25-2629-4892-8752-a049d4b8ad4e release.id=1990bb16-6cf7-4cd6-bb38-214d482d0f47 job.type=web

Ping me if you want the full log.

titanous · 2016-06-24T17:16:27Z

Also, it's not clear that we should reschedule all jobs on the host in this case, only stop placing new ones.
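
For illustration only, here is a rough sketch of what "stop placing new jobs, but leave existing ones alone" could look like. It is not Flynn's scheduler code: `hostHealth`, its methods, and the failure threshold are all hypothetical names and values. The idea is to count consecutive AddJob failures per host and filter out hosts at or above the threshold when picking placement candidates, without touching jobs already running there.

```go
package main

import "sync"

// hostHealth counts consecutive AddJob failures per host.
// Hypothetical type for illustration; not part of Flynn's scheduler.
type hostHealth struct {
	mu        sync.Mutex
	threshold int            // assumed cutoff, e.g. mirroring the host check's 10
	failures  map[string]int // host ID -> consecutive failed AddJob calls
}

func newHostHealth(threshold int) *hostHealth {
	return &hostHealth{threshold: threshold, failures: make(map[string]int)}
}

// RecordFailure notes one failed AddJob call (ratelimited, timed out, ...).
func (h *hostHealth) RecordFailure(hostID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failures[hostID]++
}

// RecordSuccess clears the failure count so the host is eligible again.
func (h *hostHealth) RecordSuccess(hostID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.failures, hostID)
}

// Accepting reports whether the host is still below the failure threshold.
func (h *hostHealth) Accepting(hostID string) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.failures[hostID] < h.threshold
}

// Candidates returns only the hosts that new jobs may be placed on.
// Jobs already running on excluded hosts are deliberately left untouched.
func (h *hostHealth) Candidates(hostIDs []string) []string {
	out := make([]string, 0, len(hostIDs))
	for _, id := range hostIDs {
		if h.Accepting(id) {
			out = append(out, id)
		}
	}
	return out
}
```

One gap in this sketch: an excluded host never receives another AddJob call, so something else (a periodic probe or a cooldown, for example) would have to call `RecordSuccess` before the host is used again.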

titanous · 2021-02-18T21:53:42Z

Flynn is unmaintained and our infrastructure will shut down on June 1, 2021. See the README for details.

titanous added kind/enhancement component/scheduler labels Jun 23, 2016

titanous closed this as completed Feb 18, 2021
