
Chainable Schedulers Discussion #1049

Open
epipho opened this issue Dec 2, 2014 · 11 comments


@epipho

epipho commented Dec 2, 2014

Chainable schedulers was an idea originally put forth in #922 as a way to compose a complex scheduler out of simple pieces. This issue is intended to discuss how chainable schedulers could work as well as how they could be practically implemented.

These schedulers would run after the existing metadata scheduler to further reduce the list of eligible candidates. If at any point a scheduler returned zero candidates, the scheduling process would be halted and the unit would either be delayed for rescheduling later or failed completely. If the full chain of schedulers ends with multiple eligible machines, a default scheduler will pick from them and schedule the unit.
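To make the flow concrete, here is a minimal sketch of the chain-driving loop in Go, assuming illustrative `Scheduler`, `Unit`, and `Machine` types; none of this is an existing fleet API:

```go
package scheduler

import "errors"

// Illustrative types; not part of fleet's actual code.
type Unit struct{ Name string }
type Machine struct{ ID string }

// Scheduler is one link in the chain: it narrows the candidate list.
type Scheduler interface {
	Filter(u Unit, candidates []Machine) ([]Machine, error)
}

var errNoEligible = errors.New("no eligible machines; halting scheduling")

// Schedule runs the chain in priority order, halting if any scheduler
// returns zero candidates, then lets a default picker break ties.
func Schedule(u Unit, candidates []Machine, chain []Scheduler) (Machine, error) {
	for _, s := range chain {
		filtered, err := s.Filter(u, candidates)
		if err != nil {
			return Machine{}, err
		}
		if len(filtered) == 0 {
			// The unit is delayed for rescheduling or failed completely.
			return Machine{}, errNoEligible
		}
		candidates = filtered
	}
	if len(candidates) == 0 {
		return Machine{}, errNoEligible
	}
	// Multiple survivors: the default scheduler picks one (here, the first).
	return candidates[0], nil
}
```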

Requirements:

  • Composable: multiple schedulers can be chained in a given order to produce complex scheduling behavior.
  • Easy to deploy: changing the chain should not require a custom CoreOS image or a rebuild of the cluster.
  • Easy to distribute: the community should be able to create and publish scheduler building blocks for others to use.

Scheduler Agents

A scheduler agent is a single scheduler in the chain. It should register itself with fleet (via etcd?). The registration should include an endpoint to call as well as a priority order.

The master fleet scheduler will call each registered scheduler in priority order. Calls happen serially and only one unit is scheduled at a time. Fleet is responsible for sending all data on the current state of the cluster to the endpoint and the endpoint is responsible for returning a filtered list of eligible machines.
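For illustration, a registration could be a JSON value stored under an agreed-upon etcd key such as `/fleet/schedulers/<name>`; the key layout and field names here are assumptions, not an existing fleet schema:

```json
{
  "name": "memory-balancer",
  "endpoint": "http://10.0.0.5:9090/filter",
  "priority": 10
}
```

The master would sort registrations by `priority` and call each `endpoint` in turn.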

Scheduler Unit Files

Since there only needs to be one copy of each scheduler, schedulers could be submitted as a special type of fleet unit. The fleet master would be responsible for starting these units on the current master machine to facilitate fast communication. If the fleet master changes, the schedulers are shut down and restarted on the new master.
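As a sketch, a scheduler unit might look like an ordinary fleet unit carrying a hypothetical marker option; `Scheduler=` and `Priority=` are invented here to illustrate the idea and are not real [X-Fleet] options:

```ini
# Hypothetical scheduler unit; Scheduler= and Priority= do not exist
# in fleet today and are shown only to illustrate the proposal.
[Unit]
Description=Memory-balancing scheduler agent

[Service]
ExecStart=/usr/bin/memory-balancer --listen 127.0.0.1:9090

[X-Fleet]
Scheduler=true
Priority=10
```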

Questions:

  • What is an acceptable time limit for the full chain of schedulers to run?
  • How is the cluster state passed to each scheduler? A large cluster could result in a ton of data needing to be passed around. All data could be stored in etcd, with each scheduler responsible for fetching the info it requires. Alternatively, fleet could expose an API for the scheduler to call back and ask for specific data. Both options add to the latency of the scheduler.

Edit 12/3/2014
Cluster state should be queried by the scheduler, not passed in the request, for scalability reasons.

  • New Question: What do the schedulers call to get information on the cluster? A fleet API endpoint would provide a stable interface, possibly with batching, at the cost of an extra hop or two (see the sketch after this edit). Calling directly into a k/v store could be faster, but relies on each scheduler having knowledge of how fleet stores its data as well as knowledge of how to discover and query each supported k/v store.

End Edit
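As a sketch of what such a callback could look like, assuming a hypothetical batched fleet API endpoint (route, query parameters, and payload are all invented):

```
# Hypothetical batched query against the fleet master:
GET /fleet/v1/state?machines=abc123,def456&keys=mem.free,cpu.load

# Hypothetical response:
{
  "abc123": { "mem.free": "5632", "cpu.load": "0.42" },
  "def456": { "mem.free": "1024", "cpu.load": "1.87" }
}
```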

Reporting Agents

Reporting agents are global fleet units that are responsible for sending per-machine metrics back to be added to the global cluster state. This allows for schedulers based on artificial or product-based metrics that cannot or should not be read by the normal fleet daemon.

Reporting agents would be responsible for sending a set of key/value pairs to their local fleet daemon on a regular basis. The local fleet daemon would send these to the fleet master along with existing data such as memory and cpu usage.

Data published this way would be available to all scheduler agents.
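A minimal sketch of a reporting agent loop in Go, assuming a hypothetical metrics endpoint on the local fleet daemon (fleet exposes no such route today):

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// Hypothetical endpoint on the local fleet daemon; it stands in for
// "send key/value pairs to the local daemon on a regular basis".
const metricsURL = "http://127.0.0.1:49153/v1/metrics"

func main() {
	for {
		// Arbitrary example metrics a product might publish.
		metrics := map[string]string{
			"gpu.count":      "2",
			"disk.ssd":       "true",
			"app.queue_size": "17",
		}
		body, _ := json.Marshal(metrics)
		// The local daemon would forward these to the fleet master
		// alongside built-in data such as memory and CPU usage.
		if resp, err := http.Post(metricsURL, "application/json", bytes.NewReader(body)); err == nil {
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```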

Wrap Up

I realize there is a lot missing here, but I think this feature could be extremely powerful. Please read through and help tear this proposal apart and flesh it out properly.

Calling everyone from the previous thread, please weigh in!
/cc: @gucki @dbason @robhaswell @Dunedan

@robhaswell

This is great; I'm excited by the prospect of one of the large players releasing a plugin system which fits our purposes. Here are some thoughts:

How do container constraints like MachineOf and Conflicts fit into this? Is this the function of the "metadata scheduler"? It would be useful to be able to see an overview of the capabilities of that scheduler, so I can work through some use-cases.

Also, have you given any thought to extending [X-Fleet] configuration? That extended configuration would need to be accessible by the custom schedulers.

Some responses to your questions:

  1. Configurable, of course. Stateful failover schedulers will frequently want to wait for administrator input, which requires there to be no timeout.
  2. I think you have to consider scalability above all else. If you are to provide all the state up-front, you must have a streamable protocol (chunked HTTP POST) and a streamable format (wait for it... XML). I don't think this is a good solution. Calling out to fleet or etcd is scalable at all levels. My advice would be to worry about performance later.

@dbason

dbason commented Dec 2, 2014

+1 for state being stored in etcd (or an alternative k/v store). The only thing I think you want to pass via the API is the list of machine keys that the scheduler will consider. That leads on to what to do if the state information is missing: do you simply discard that machine from the list, or do you return an error from the scheduler? I think it would be good to standardize that behaviour.

Also great proposal.

@tg123

tg123 commented Dec 3, 2014

In our environment we schedule units based on the machines' memory usage; see my PR #1030.

This patch also enables unit files to choose their schedulers; however, the schedulers currently have to be defined in fleet's code.

+1 for standardizing this feature

@epipho
Author

epipho commented Dec 3, 2014

Thank you all for the feedback so far.

@robhaswell In this context the metadata scheduler refers to everything the current scheduler does today (metadata, machine ID, MachineOf, etc.) except the leveling of unit counts per machine.

I and others have discussed some scheduling additions to the X-Fleet configuration in #945 and #943. As for extending it for this purpose, I think the easy answer is to pass the unit to be scheduled as part of the request data. Users could define their own unit file sections to keep X-Fleet clean and easy to validate by the fleet core, as shown below.
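For example, a unit could carry options for a specific scheduler in its own section, leaving [X-Fleet] untouched; the [X-MemoryBalancer] section and its option are invented for illustration:

```ini
[X-Fleet]
MachineMetadata=region=us-east

# Consumed only by a hypothetical third-party scheduler,
# ignored by the fleet core:
[X-MemoryBalancer]
MinFreeMB=2048
```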

@dbason Would it make sense to call back to fleet for cluster state information? This would allow fleet to store it in any k/v store it supports (e.g. if it were ported to run on Consul) at the cost of an extra hop. Using fleet as a proxy would also protect schedulers from internal data storage changes via a consistent, stable API. Fleet could also expose a nice batching API (get the following data for this list of machines) and parallelize the requests to the k/v store.

Missing state for a machine should cause the machine to be removed from consideration. As long as there are more machines to consider I don't see a reason to error out on what could be a perfectly good cluster that lost a machine between scheduler calls.

@tg123 Very cool. It seems that your system could be implemented as a scheduler (for doing the sort) and a reporting agent to read and publish the system stats back to fleet from each node.

I am in agreement with others that machine state data should not be passed in with the request but queried for by the scheduler.

Fleet should call each scheduler in sequence with the following data (a request/response sketch follows the lists):

  • List of eligible machines from the previous scheduler
  • Representation of unit file being scheduled
  • Location to query for cluster state and machine data

Schedulers should return:

  • Filtered list of eligible machines
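A sketch of what the exchange could look like over HTTP; the endpoint and field names are assumptions, not a settled protocol:

```
# Hypothetical request to a scheduler agent:
POST /filter
{
  "unit": { "name": "web@1.service", "contents": "..." },
  "machines": ["abc123", "def456", "0f9e8d"],
  "state_endpoint": "http://10.0.0.1:49153/v1/state"
}

# Hypothetical response: the filtered list of eligible machines.
{
  "machines": ["def456"]
}
```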

@exarkun

exarkun commented Dec 3, 2014

Calling directly into a k/v store could be faster but relies on each scheduler having knowledge of how fleet stores its data [...]

Another aspect of that "but" is that it makes it more difficult to change the schema of the k/v store (the more third-party schedulers directly access the k/v store, the more schedulers are broken/have to be updated if you ever want to re-arrange the data). It's usually a pretty safe bet that the schema won't be perfect on the first try.

By exposing an API in Fleet for retrieving this information, there's an opportunity to provide backwards compatibility when changing the underlying schema - only the Fleet implementation of the API needs to be fixed to support the same API operations on top of a new schema in the k/v store.

These compatibility layers don't need to live forever: it's very likely that it will eventually make sense to drop them, and the decision of when depends a lot on how hard they end up being to maintain and how much effort is available to do that maintenance. But at least early on, this kind of compatibility is probably very valuable in encouraging people to develop to this interface (thus increasing the amount of innovation and functionality in the space) and in avoiding lots of user-facing breakage when a CoreOS or fleet upgrade might otherwise have been incompatible with a scheduler favored by that user.

@jonboulle
Contributor

Cross-post - please see #922 (comment) (that comment is just as much a reply to this issue)

@dbason

dbason commented Dec 8, 2014

+1 for a separate API for storing and retrieving cluster information from Fleet. I still don't think you want to be passing that information as part of the scheduling chain, but having a separate way of pulling it from Fleet would be fine. The API would also allow people to write their own plugins for storing metrics that they want to use in scheduling.

@dbason

dbason commented Dec 8, 2014

Ahh, from #1055:
no additions to the fleet API; extending the unit file with X-Fleet options should be sufficient

I guess the other option is that each scheduler has to have its own method of collecting and storing the metrics it needs?

@exarkun

exarkun commented Dec 8, 2014

I think it would be a good idea to make each scheduler deal with whatever persistence requirements it has on its own, at least initially. It may turn out that all the schedulers everyone wants to write have the same requirements in this area - but we won't know until people actually start writing schedulers. More likely, I suspect, is that lots of different people will have their own ideas about how this should work and enshrining one approach in Fleet will turn off more people than it helps.

@epipho
Author

epipho commented Dec 8, 2014

@dbason I asked for clarification in the other thread. My interpretation was more along the lines of "No user-facing API changes" rather than "no changes at all".

@exarkun There is no reason a scheduler couldn't elect to handle its own storage and retrieval, but I think there is value in having persistence available. Users could write schedulers that use existing reporting agents without worrying about persistence. It simplifies setup and administration, as everything uses a single store, and allows schedulers to remain unaware of the backend storage if fleet is ever ported to a non-etcd store (#1004).

@jonboulle
Contributor

I think it would be a good idea to make each scheduler deal with whatever persistence requirements it has on its own, at least initially

Yep, this is definitely in line with what we were thinking.
