RFC: Requirements for an extensible scheduling system #1055

Open
jonboulle opened this issue Dec 5, 2014 · 12 comments

@jonboulle
Contributor

fleet’s current scheduling engine is rudimentary, and there have been various proposals (1 2 3 4) to either enhance the complexity of the scheduling within fleet, or provide a means for users to extend it without needing to run a custom branch of fleet.

This issue aims to capture the design requirements and restrictions for a solution to these requests, such that it can be implemented in a way that a) keeps with fleet’s architecture and design goals, and b) does not impact existing users of fleet.

(Bear in mind that this is a work in progress proposal, not a final set of hard requirements; please provide feedback below).

The solution should be:

  • Pluggable
    • There should exist a simple request-response interface for scheduling plugins to connect with fleet. Some options previously proposed include an extremely simple HTTP endpoint or command-line interface.
    • It should not be necessary to spin a custom build of fleet to add new scheduling capabilities
    • There should be no new deployable fleet components (i.e. retain the situation of a single binary per host (fleetd) for fleet itself)
  • Backwards-compatible and behaviour preserving
    • fleet should continue to be usable “out of the box” for basic/common use cases (i.e. there should be a simple default scheduling mechanism similar to today, and users should not need to write a plugin to use fleet)
    • In the default configuration, the performance and efficiency of fleet should not be compromised from today’s situation
    • Impact on (i.e. code changes to) existing fleet architecture and internals should be minimized
    • No additions to the fleet API; extending the unit file with X-Fleet options should be sufficient
  • Understandable
    • The introduction of pluggable schedulers will necessarily impact the performance and behaviour of fleet. The plug-in functionality should be structured such that it is easy to differentiate between issues in fleet itself and in plugins.

Since comparisons will inevitably arise, we will take this opportunity to draw a distinction between what we’re aiming for and the Mesos framework model. We do not anticipate the API between fleet and the pluggable schedulers becoming arbitrarily complex (ideally it should be limited to the single request-response described above), and we would still consider fleet to be the “entrypoint” for users deploying applications (c.f. Mesos, where the entrypoint is typically the scheduler). To put it another way, schedulers should plug in behind fleet rather than on top of fleet.
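
As a very rough illustration of the kind of single request-response exchange sketched above, the snippet below shows one possible shape of what fleet could send to a plugin and what it would get back. All of the names, fields, and the use of a JSON body are hypothetical, not a proposed API:

// Hypothetical sketch only; none of these names exist in fleet today.
// fleet would POST one scheduling request to the plugin endpoint and
// expect a single response naming the chosen machine.
package schedplugin

// ScheduleRequest is what fleet could send (e.g. POST /schedule) when a
// unit needs placement.
type ScheduleRequest struct {
	UnitName   string            `json:"unitName"`   // unit to be placed
	Candidates []string          `json:"candidates"` // machine IDs already satisfying fleet's own constraints
	XFleet     map[string]string `json:"xFleet"`     // raw X-Fleet options from the unit file
}

// ScheduleResponse is the plugin's single answer.
type ScheduleResponse struct {
	MachineID string `json:"machineID"` // chosen target; must be one of Candidates
}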

@xh3b4sd

xh3b4sd commented Dec 6, 2014

Hey guys,

Great to see this draft of requirements. I have recently been thinking about different scheduling strategies and how to enable fleet to support them. Since we are running various CoreOS clusters, we are really interested in being able to schedule units based on different policies. Reading the requirements above leads me to the following brain dump of how such a plugin could be designed and how it could work.

1. scheduler strategies

For different needs there should be different strategies to schedule units. These strategies could easily be defined in the [X-Fleet] section. At a later point these strategies could perhaps be combined, but let's keep it simple in the first step. If no strategy is defined, the current implementation is used.

Here are possible strategies crossing my mind:

strategy      description
lowestCPU     schedules units on the host with the lowest CPU usage
highestCPU    schedules units on the host with the highest CPU usage
lowestCores   schedules units on the host with the fewest CPU cores
highestCores  schedules units on the host with the most CPU cores
lowestMem     schedules units on the host with the lowest memory usage
highestMem    schedules units on the host with the highest memory usage
lowestCount   schedules units on the host with the lowest unit count
highestCount  schedules units on the host with the highest unit count
roundRobin    schedules units round-robin style
default       schedules units based on the current implementation

This is how it could look in a unit file:

[X-Fleet]
SchedulerStrategy=lowestCPU

2. configuring fleetd

The scheduler in fleetd needs to handle strategies. Maybe it would be nice to use a feature that already exists. Imagine fleetd could ask some service which MachineID a unit would best be scheduled to. If fleetd detects a SchedulerStrategy in a given unit file, it asks a resource-ruler for the machine ID that best fulfils the given strategy requirements. That would just add one additional call to an external resource to fetch a machine ID during the actual scheduling. To let fleetd know where to ask for a machine ID, the endpoint could be configured on startup. Further, it would be necessary to know whether each fleetd is configured with a resource-ruler or not. There could also be different policies for handling inconsistent resource-ruler configuration across the cluster, e.g. whether to throw an error and refuse startup or not.

This could be placed in a config file:

resource_ruler_endpoint=http://127.0.0.1:7007
resource_ruler_consistency=true
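
To illustrate what that single additional call could look like from fleetd's side, here is a minimal sketch. The /machine route, the strategy query parameter, and the JSON field name are made up for this example; only the general shape (one HTTP round trip to the configured endpoint) is the point:

// Hypothetical sketch, not part of fleet today.
package fleetd

import (
	"encoding/json"
	"net/http"
	"net/url"
)

// askResourceRuler asks the configured resource-ruler endpoint for the
// machine ID that best fulfils the given strategy.
func askResourceRuler(endpoint, strategy string) (string, error) {
	resp, err := http.Get(endpoint + "/machine?strategy=" + url.QueryEscape(strategy))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result struct {
		MachineID string `json:"machineID"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	// fleetd would then schedule the unit to this machine as usual.
	return result.MachineID, nil
}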

3. write a pluggable resource ruler

The ruler could be just another service running in parallel to fleetd and etcd. It could be configured to collect, at configurable intervals, the set of data necessary to schedule units, and store it in etcd. Think of an HTTP server providing different routes, with rulers for different strategies plugged in.

  • each ruler would provide a collector, collecting the metrics important for its decisions
  • a ruler's collector would be called at a configured interval in seconds
  • a ruler's collector would only collect metrics for its own host
  • each ruler would provide an HTTP route handling one simple algorithm to fulfil its own strategy
  • a ruler's HTTP route would accept only a strategy parameter as described above
  • a ruler's HTTP route would respond with just an "active" machine ID of the fleet/etcd cluster

Starting the resource-ruler could look like this:

resource-ruler --strategy-set=all --collector-interval=10 --host=127.0.0.1 --port=7007
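
And a minimal sketch of the ruler side, assuming (hypothetically) that the collectors have already written their metrics into etcd and the HTTP route only has to pick and return the best machine ID for the requested strategy; the route and response shape mirror the fleetd sketch above:

package main

import (
	"encoding/json"
	"net/http"
)

// pickMachineFor stands in for the real work: reading the collected metrics
// from etcd and ranking the "active" machines for the given strategy
// (lowestCPU, highestMem, ...). Hypothetical placeholder only.
func pickMachineFor(strategy string) (string, bool) {
	// ... query etcd, rank machines, return the best machine ID ...
	return "", false
}

func main() {
	http.HandleFunc("/machine", func(w http.ResponseWriter, r *http.Request) {
		strategy := r.URL.Query().Get("strategy")
		machineID, ok := pickMachineFor(strategy)
		if !ok {
			http.Error(w, "no active machine for strategy "+strategy, http.StatusNotFound)
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"machineID": machineID})
	})
	http.ListenAndServe("127.0.0.1:7007", nil)
}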

As far as I can see, this proposal fulfils nearly every requirement listed above. Anyway, there will be parts I have missed, because of the complexity of distributed systems. Further, one common disadvantage of additional functionality is increased latency for the given action. In our case this approach would be so-called "good enough", since there is just one additional HTTP call. That HTTP call needs to wait for another HTTP call to etcd, to crunch a little data at the speed of light (because we are using go 🚀).

So far so good, thanks for coreos <3

@dbason

dbason commented Dec 7, 2014

@zyndiecate I don't think there is much benefit in doing metrics based on cores - when taken in isolation it's not particularly useful. If your cpushares are set correctly then it's the number of cores divided by the number of units on the machine that influences cpu time.

I think being able to chain a series of simple composable schedulers (as described in #1049) is really going to drive the most value - think a Moog synthesizer for scheduling ;). This keeps the core scheduler stable, and allows for community extensions.

@epipho

epipho commented Dec 8, 2014

Thank you @jonboulle for the update.

I would like one clarification on the requirements. Under Behavior Preserving you call out "No changes to the fleet API". I believe this should be clarified to: "No changes to the user-facing & user APIs", i.e. the API used to schedule, retrieve, and view units should stay as it is today to limit the impact on existing users.

I think it makes sense to expand the API to allow better interaction between fleet and the external schedulers. This would be a separate API (separate port like etcd client vs server ports?) used only by the schedulers to register themselves and publish/retrieve scheduler data.

I like the idea of calling out which schedulers to run and in which order via a new parameter in X-Fleet, similar to @zyndiecate's example above with multiple values. This removes the need to specify any sort of priority when registering, as well as allowing different types of units to request specific types of scheduling.

I'd like to propose the following additions to the requirements:

  • Schedulers can register dynamically with fleet.
  • Schedulers should be able to store and access simple k/v pairs from fleet itself. @zyndiecate calls them rulers above; I called them reporting agents in the Chainable Schedulers Discussion #1049
  • User-facing APIs should not change (as detailed in the first paragraph).

I would also like to (re-)propose that the default batteries-included scheduler receive a few upgrades. A scheme similar to #945 would allow for more out-of-the-box uses without modifying the default behavior as it stands today.

@jonboulle
Contributor Author

I would like one clarification on the requirements. Under Behavior Preserving you call out "No changes to the fleet API". I believe this should be clarified to: "No changes to the user-facing & user APIs", i.e. the API used to schedule, retrieve, and view units should stay as it is today to limit the impact on existing users.

This is accurate, but it's your follow-on point (a new and expanded scheduler-specific API) that concerns me a little. To quote myself from the end of the OP:

We do not anticipate the API between fleet and the pluggable schedulers becoming arbitrarily complex (ideally it should be limited to the single request-response described above), and we would still consider fleet to be the “entrypoint” for users deploying applications (c.f. Mesos, where the entrypoint is typically the scheduler). To put it another way, schedulers should plug in behind fleet rather than on top of fleet.

If I understand @zyndiecate's proposal correctly, it wouldn't involve such an increase in complexity as the interface would be limited to a single call.

@jonboulle
Contributor Author

I would also like to (re-) propose that the default batteries included scheduler receive a few upgrades. A scheme similar to #945 would allow for more out of the box uses without modifying the default behavior as it stands today.

Could you explain a little more on why you feel #945 should be baked into the core rather than implemented as a chainable scheduler?

@epipho

epipho commented Dec 22, 2014

I don't see the API becoming arbitrarily complex, however I still maintain that fleet itself should provide some rudimentary persistence. Persistence would consist of single-level key-value pairs per machine, collected and sent to fleet from each machine. Having this persistence would simplify the setup and use of the schedulers and increase performance (a scheduler could request data at registration time, and the appropriate data could be sent along with each scheduling request). I think this is worth the slight complexity that it would add to fleet itself.

The following two endpoints would be added to fleet:

  • Update KVs for machine. Updates an array of key-value pairs for a single machine. This endpoint can be called on any fleet instance and the data will be forwarded to the master. This data is stored in etcd so it remains available if the fleet leader changes.
  • Register Scheduler. Each scheduler registers an endpoint, a name, and a list of kv pairs that should be sent on a scheduler request.

Each scheduler would implement just one endpoint:

  • Filter Machines. Called by fleet, takes a list of machines, each machine including a map of data requested by the registration call. Returns a filtered list of machines that are still valid. If a scheduler returns multiple machines and there are no more schedulers in the list specified by the unit, fleet uses the default scheduler to narrow down to 1.

The fleet unit itself specifies the order of the schedulers to run. The list of machines output from a scheduler is fed into the next in the chain, even if there is only 1 machine, to ensure the machine meets all qualifications. If at any point a scheduler returns zero machines, the unit cannot be scheduled.
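
A small sketch of how that chain could run inside fleet, purely to make the data flow concrete; the Machine type, the /filter route, and the error handling are invented for illustration:

// Hypothetical sketch of the chaining described above.
package fleet

import (
	"bytes"
	"encoding/json"
	"errors"
	"net/http"
)

// Machine carries the machine ID plus the k/v data each scheduler asked for
// at registration time.
type Machine struct {
	ID   string            `json:"id"`
	Data map[string]string `json:"data"`
}

// filterThroughSchedulers feeds the candidate machines through each scheduler
// endpoint in the order the unit specified. If any scheduler returns zero
// machines, the unit cannot be scheduled; if more than one machine survives
// the whole chain, the default scheduler would narrow it down to one.
func filterThroughSchedulers(endpoints []string, machines []Machine) ([]Machine, error) {
	for _, ep := range endpoints {
		body, err := json.Marshal(machines)
		if err != nil {
			return nil, err
		}
		resp, err := http.Post(ep+"/filter", "application/json", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		machines = nil
		err = json.NewDecoder(resp.Body).Decode(&machines)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		if len(machines) == 0 {
			return nil, errors.New("unit cannot be scheduled: no machines left")
		}
	}
	return machines, nil
}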

@epipho

epipho commented Dec 22, 2014

@jonboulle As for #945, I think that the default scheduler as it exists today just isn't enough. While I agree that the reservation system could (and probably should) be moved to a separate chained scheduler, the job multiplier feels like an easy, low-complexity addition with a lot of benefit.

Allowing fleet to load balance asymmetric work loads would open it up for a lot more applications out of the box.

@camerondavison

While I think that this RFC is pretty cool, it does not really look like it's getting traction. Is there any way that we could have something simpler to deal just with over-provisioning of memory?

Right now I have a problem where we have 10 machines running CoreOS, and about 12 services running multiple versions. 4 services can fit on 1 machine; they are all about the same size. This means that we have the capacity to run 40 service copies, much more than the 24 or so that we need. Even with this, if a machine or two fails or restarts, a cascading failure occurs as too many services pile onto one machine.

@epipho

epipho commented Apr 24, 2015

I was hoping to work on this next, but am currently stalled out on #1077. I am hoping now that etcd2.0 has shipped in alpha that @jonboulle and @bcwaldon can give me a bit of direction on both issues.

@vassilevsky

I see that there was an attempt to bring machine resources into play, but it was deleted in ab275c1 (although the resource dir is still there). I wonder why you decided to do that.

@wuqixuan
Contributor

@jonboulle, it seems you forgot to list the #943 requirement.

@jonboulle
Contributor Author

/cc @htr
