Chainable Schedulers Discussion #1049
Comments
This is great, I'm excited by the prospect of one of the large players releasing a plugin system that fits our purposes. Here are some thoughts: How do container constraints like MachineOf and Conflicts fit into this? Is this the function of the "metadata scheduler"? It would be useful to see an overview of that scheduler's capabilities so I can work through some use cases. Also, have you given any thought to extending the [X-Fleet] configuration? That extended configuration would need to be accessible to the custom schedulers. Some responses to your questions:
+1 for state being stored in etcd (or an alternative k/v store). The only thing I think you want to be passing via the API is the list of machine keys that the scheduler will consider. That does lead to the question of what to do if the state information for a machine is missing: do you simply discard that machine from the list, or do you return an error from the scheduler? I think it would be good to standardize that behaviour. Also, great proposal.
In our environment we schedule units onto machines based on their memory usage; this patch also enables unit files to choose their own placement. +1 for standardizing this feature.
Thank you all for the feedback so far.

@robhaswell In this context the metadata scheduler refers to everything the current scheduler does today (metadata, machine ID, MachineOf, etc.) except the levelling of unit counts per machine. Others and I have discussed some scheduling additions to the X-Fleet configuration in #945 and #943. As for extending it for this purpose, I think the easy answer is to pass the unit to be scheduled as part of the request data. Users could define their own unit file sections to keep X-Fleet clean and easy for the fleet core to validate.

@dbason Would it make sense to call back to fleet for cluster state information? This would allow fleet to store it in any k/v store it supports (e.g. if it were ported to run on consul) at the cost of an extra hop. Using fleet as a proxy would also protect schedulers from internal data storage changes via a consistent, stable API. Fleet could also expose a nice batching API (get the following data for this list of machines) and parallelize the requests to the k/v store. Missing state for a machine should cause the machine to be removed from consideration; as long as there are more machines to consider, I don't see a reason to error out on what could be a perfectly good cluster that lost a machine between scheduler calls.

@tg123 Very cool. It seems that your system could be implemented as a scheduler (for doing the sort) and a reporting agent to read and publish the system stats back to fleet from each node.

I am in agreement with others that machine state data should not be passed in with the request but queried for by the scheduler. Fleet should call each scheduler in sequence with the unit to be scheduled and the current list of candidate machines. Schedulers should return a filtered list of eligible machines.
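For concreteness, the exchange could look something like the sketch below; the transport, field names, and shape are purely illustrative, not a defined fleet API:

```
POST /schedule                       (illustrative request from fleet)
{
  "unitName": "web@1.service",
  "unitBody": "<full unit file contents>",
  "candidateMachineIDs": ["a1b2c3", "d4e5f6", "f7g8h9"]
}

200 OK                               (illustrative response from the scheduler)
{
  "eligibleMachineIDs": ["a1b2c3", "f7g8h9"]
}
```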
Another aspect of that "but" is that it makes it more difficult to change the schema of the k/v store (the more third-party schedulers directly access the k/v store, the more schedulers are broken or have to be updated if you ever want to re-arrange the data). It's usually a pretty safe bet that the schema won't be perfect on the first try. By exposing an API in Fleet for retrieving this information, there's an opportunity to provide backwards compatibility when changing the underlying schema - only the Fleet implementation of the API needs to be fixed to support the same API operations on top of a new schema in the k/v store. These compatibility layers don't need to live forever: it's very likely that it will make sense to eventually drop them - the decision of when depends a lot on how hard they end up being to maintain and how much effort is available to do that maintenance - but at least early on this kind of compatibility is probably very valuable in encouraging people to develop against this interface (thus increasing the amount of innovation and functionality in the space) and avoiding lots of user-facing breakage when a CoreOS or Fleet upgrade might otherwise have been incompatible with a scheduler favored by that user.
Cross-post - please see #922 (comment) (that comment is just as much a reply to this issue).
+1 for a separate API for storing and retrieving cluster information from Fleet. I still don't think you want to be passing that information as part of the scheduling chain, but having a separate way of pulling it from Fleet would be fine. The API would also allow people to write their own plugins for storing metrics that they want to use in scheduling.
Ahh, from #1055: I guess the other option is that each scheduler has its own method of collecting and storing the metrics it needs?
I think it would be a good idea to make each scheduler deal with whatever persistence requirements it has on its own, at least initially. It may turn out that all the schedulers everyone wants to write have the same requirements in this area - but we won't know until people actually start writing schedulers. More likely, I suspect, is that lots of different people will have their own ideas about how this should work and enshrining one approach in Fleet will turn off more people than it helps.
@dbason I asked for clarification in the other thread. My interpretation was more along the lines of "no user-facing API changes" rather than "no changes at all". @exarkun There is no reason a scheduler couldn't elect to handle its own storage and retrieval, but I think there is value in having persistence available. Users could write schedulers that use already-existing reporting agents without worrying about persistence. It simplifies setup and administration, since everything uses a single store, and allows schedulers to remain unaware of the backend storage if fleet is ever ported to a non-etcd store (#1004).
Yep, this is definitely in line with what we were thinking |
Chainable schedulers were an idea originally put forth in #922 as a way to compose a complex scheduler out of simple pieces. This issue is intended to discuss how chainable schedulers could work as well as how they could be practically implemented.
These schedulers would run after the existing metadata scheduler to further reduce the list of eligible candidates. If at any point a scheduler returned zero candidates, the scheduling process would be halted and the unit would either be delayed for rescheduling later or failed completely. If the full chain of schedulers ends with multiple eligible machines, a default scheduler will pick from them and schedule the unit.
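A minimal sketch of that chaining logic, assuming placeholder types rather than fleet's real ones:

```go
package scheduler

import "fmt"

// Placeholder types for illustration only; these are not fleet's real types.
type Unit struct{ Name string }
type Machine struct{ ID string }

// Scheduler is one link in the chain: it narrows the candidate list for a unit.
type Scheduler interface {
	Filter(unit Unit, candidates []Machine) ([]Machine, error)
}

// scheduleUnit runs the chain in priority order. If any scheduler returns zero
// candidates, scheduling halts; otherwise a default picker chooses among the
// machines that survive the whole chain.
func scheduleUnit(unit Unit, machines []Machine, chain []Scheduler, pick func([]Machine) Machine) (*Machine, error) {
	candidates := machines
	for _, s := range chain {
		filtered, err := s.Filter(unit, candidates)
		if err != nil {
			return nil, err
		}
		if len(filtered) == 0 {
			// Halt: fleet would either delay the unit for rescheduling or fail it.
			return nil, fmt.Errorf("no eligible machines for %s", unit.Name)
		}
		candidates = filtered
	}
	chosen := pick(candidates)
	return &chosen, nil
}
```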
Requirements:
Scheduler Agents
A scheduler agent is a single scheduler in the chain. It should register itself with fleet (via etcd?). The registration should include an endpoint to call as well as a priority order.
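As a rough sketch of what registration might look like if etcd were used directly (the key prefix and JSON fields are made up for illustration, not an existing convention):

```sh
etcdctl set /custom-schedulers/memory-balancer \
  '{"endpoint": "http://127.0.0.1:9090/schedule", "priority": 10}'
```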
The master fleet scheduler will call each registered scheduler in priority order. Calls happen serially and only one unit is scheduled at a time. Fleet is responsible for sending all data on the current state of the cluster to the endpoint and the endpoint is responsible for returning a filtered list of eligible machines.
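A minimal sketch of the endpoint a scheduler agent might expose, assuming an HTTP/JSON transport and the illustrative payload shape from the discussion above; none of this is a defined fleet API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Request/response shapes matching the illustrative payload sketch; these
// field names are assumptions, not part of fleet.
type scheduleRequest struct {
	UnitName            string   `json:"unitName"`
	UnitBody            string   `json:"unitBody"`
	CandidateMachineIDs []string `json:"candidateMachineIDs"`
}

type scheduleResponse struct {
	EligibleMachineIDs []string `json:"eligibleMachineIDs"`
}

func main() {
	http.HandleFunc("/schedule", func(w http.ResponseWriter, r *http.Request) {
		var req scheduleRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Trivial pass-through filter: a real scheduler would query cluster
		// state from fleet here and drop machines that don't qualify.
		resp := scheduleResponse{EligibleMachineIDs: req.CandidateMachineIDs}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:9090", nil))
}
```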
Scheduler Unit Files
Since there only needs to be one copy of each scheduler, schedulers could be submitted as a special type of fleet unit. The fleet master would be responsible for starting up these units on the current master machine to facilitate fast communication. If the fleet master changes, the schedulers are shut down and restarted on the new master.
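As an illustration, a scheduler might be submitted as an ordinary service unit with a marker that tells fleet to treat it as part of the chain; the SchedulerPriority option below is hypothetical and does not exist in fleet today:

```ini
[Unit]
Description=Memory-aware scheduler agent

[Service]
ExecStart=/usr/bin/memory-scheduler --listen 127.0.0.1:9090

[X-Fleet]
# Hypothetical option, not an existing fleet directive: marks the unit as a
# chained scheduler and sets its position in the chain.
SchedulerPriority=10
```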
Questions:
Edit 12/3/2014
Cluster state should be queried by the scheduler, not passed in the request, for scalability reasons (see the sketch just after this edit).
End Edit
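To make the "query, don't push" flow concrete, a scheduler could ask fleet for state in a single batched call; the endpoint, parameters, and response shape below are hypothetical, not part of fleet's current API:

```
GET /state?machines=a1b2c3,d4e5f6        (hypothetical batched state query)

{
  "a1b2c3": {"mem_free_mb": "2048", "load1": "0.4"},
  "d4e5f6": {"mem_free_mb": "512",  "load1": "2.1"}
}
```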
Reporting Agents
Reporting agents are global fleet units that are responsible for sending per-machine metrics back to be added to the global cluster state. This allows for schedulers based on artificial or product-specific metrics that cannot or should not be read by the normal fleet daemon.
Reporting agents would be responsible for sending a set of key/value pairs to their local fleet daemon on a regular basis. The local fleet daemon would send these to the fleet master along with existing data such as memory and cpu usage.
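A sketch of what such a reporting agent could look like; the local fleet endpoint, port, interval, and metric names are assumptions for illustration only:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

func main() {
	for range time.Tick(30 * time.Second) {
		// Key/value pairs this machine wants published into cluster state.
		metrics := map[string]string{
			"mem_free_mb": readMemFree(),
			"gpu_count":   "2",
		}
		body, err := json.Marshal(metrics)
		if err != nil {
			log.Printf("marshal metrics: %v", err)
			continue
		}
		// Hypothetical endpoint on the local fleet daemon that would forward
		// the metrics on to the fleet master.
		resp, err := http.Post("http://127.0.0.1:49153/machine-metrics",
			"application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("publish metrics: %v", err)
			continue
		}
		resp.Body.Close()
	}
}

// readMemFree stands in for real metric collection, e.g. parsing /proc/meminfo.
func readMemFree() string { return "2048" }
```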
Data published this way would be available to all scheduler agents.
Wrap Up
I realize there is a lot missing here, but I think this feature could be extremely powerful. Please read through and help tear this proposal apart and flesh it out properly.
Calling everyone from the previous thread, please weigh in!
/cc: @gucki @dbason @robhaswell @Dunedan