This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Simple share based scheduling #945

Open

epipho opened this issue Oct 4, 2014 · 9 comments

Comments

@epipho

epipho commented Oct 4, 2014

With all the scheduling chatter I thought it was time to put in an idea that my colleagues and I have been thinking about for a while.

This proposal is related to #922 and #747.

The idea behind this scheduler proposal is to extend the currently very simple "flatten the number of jobs" scheduler without attempting to solve any dynamic resource scheduling problems. I believe this method would meet a large number of users' scheduling needs and could be used as the "default" scheduler for fleet.

Machine Shares

Each machine receives a fixed number of shares (default 1024) as a parameter to fleet at startup, either as an env variable/flag (in the style of FLEET_ETCD_SERVERS) or as a "well-known" meta-data entry. For cases other than share reservation (see below) the exact values do not matter and are only used for relative weighting of the number of jobs on each node.

A node with 2048 shares would receive twice the number of jobs as a node with 1024 shares. The total share count should be modifiable via the API, either through the machine resource or the meta-data API in #555.

A node with zero shares is not eligible to receive any new (non-global) jobs. If the entire cluster is reporting zero shares, no new jobs can be scheduled.
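
For illustration, here is a minimal Go sketch (not fleet code) of how a per-machine share count might be carried and defaulted; the "shares" meta-data key and all type names here are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultShares is the proposed default when nothing is configured.
const defaultShares = 1024

// Machine is a hypothetical view of a fleet machine for this proposal.
type Machine struct {
	ID     string
	Shares int
}

// sharesFromMetadata reads a hypothetical "shares" meta-data entry,
// falling back to the default when the key is absent or malformed.
func sharesFromMetadata(meta map[string]string) int {
	if v, ok := meta["shares"]; ok {
		if n, err := strconv.Atoi(v); err == nil && n >= 0 {
			return n
		}
	}
	return defaultShares
}

// eligible reports whether a machine can accept new (non-global) jobs.
func (m Machine) eligible() bool {
	return m.Shares > 0
}

func main() {
	big := Machine{ID: "a", Shares: sharesFromMetadata(map[string]string{"shares": "2048"})}
	drained := Machine{ID: "b", Shares: sharesFromMetadata(map[string]string{"shares": "0"})}
	unset := Machine{ID: "c", Shares: sharesFromMetadata(nil)}
	fmt.Println(big.Shares, big.eligible())         // 2048 true
	fmt.Println(drained.Shares, drained.eligible()) // 0 false
	fmt.Println(unset.Shares, unset.eligible())     // 1024 true
}
```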

Use cases

  • Asymmetrical hardware, e.g. 2x 2-core servers and 1x 4-core server
  • Asymmetrical region balancing, e.g. expecting twice the users in dc1 compared to dc2
  • Setting shares to 0 to prevent new jobs from being scheduled, clearing off a node for decommissioning or maintenance

Job multiplier

A new X-Fleet option (Multiplier or Weight) for weighting jobs against each other. The multiplier is a floating point option, defaulting to 1.0. This value is then used when scheduling jobs to balance heavier jobs against lighter ones by ensuring each machine has approximately the same total "weight" of units.
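
As a rough sketch of the weighting idea (hypothetical types and field names, not fleet code), the scheduler would place a new unit on the machine with the lowest projected sum-of-multipliers per share:

```go
package main

import "fmt"

// cand is a hypothetical per-machine view: its shares and the multipliers
// of the units already scheduled on it.
type cand struct {
	id     string
	shares float64
	mults  []float64
}

// projectedLoad is the machine's relative weight if a unit with the given
// multiplier were added: (sum of existing multipliers + new one) / shares.
func projectedLoad(c cand, mult float64) float64 {
	sum := mult
	for _, m := range c.mults {
		sum += m
	}
	return sum / c.shares
}

func main() {
	a := cand{id: "a", shares: 1024, mults: []float64{1, 1, 4}} // carries one heavy unit
	b := cand{id: "b", shares: 1024, mults: []float64{1, 1, 1}} // only light units
	mult := 0.5                                                 // e.g. an idle data container
	fmt.Println(projectedLoad(a, mult) < projectedLoad(b, mult)) // false: b is the better fit
}
```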

Use cases

  • Jobs that are more expensive to process
  • Services that receive differing amounts of traffic
  • Idle monitoring or data containers (multiplier < 1)

Share reservation

A new X-Fleet option (ReserveShares) to consume a number of shares from the machine during the job's run. Once the job has exited the shares are returned to the pool. While the shares are tied up, the machine appears to the regular scheduler to have fewer shares than normal. A machine with 1024 shares running a task that reserves 256 shares would appear as a 768-share machine to the scheduler.

A job that reserves shares cannot be started if no machines report an appropriate number of unconsumed shares, with the exception of global jobs (See Global Jobs).
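
A hedged sketch of how a reservation might enter the same calculation (again with hypothetical types): the reserved shares gate eligibility and shrink the denominator until the job exits.

```go
package main

import "fmt"

// cand is a hypothetical per-machine view for scheduling.
type cand struct {
	shares   float64 // configured shares
	reserved float64 // shares currently held by ReserveShares jobs
	loadSum  float64 // sum of multipliers of units already placed here
}

// effectiveShares is what the regular scheduler sees while reservations are held.
func effectiveShares(c cand) float64 {
	return c.shares - c.reserved
}

// feasible reports whether a job reserving `reserve` shares can land here.
func feasible(c cand, reserve float64) bool {
	return effectiveShares(c) >= reserve
}

// projectedLoad is the relative load if a unit with the given multiplier
// and reservation were placed on this machine.
func projectedLoad(c cand, mult, reserve float64) float64 {
	return (c.loadSum + mult) / (effectiveShares(c) - reserve)
}

func main() {
	m := cand{shares: 1024, reserved: 0, loadSum: 3}
	fmt.Println(feasible(m, 256))                  // true
	fmt.Printf("%.4f\n", projectedLoad(m, 1, 256)) // (3+1)/(1024-256) ≈ 0.0052
	fmt.Println(feasible(cand{shares: 128}, 256))  // false: not enough unconsumed shares
}
```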

Use cases

  • Batch and one-shot jobs that cannot be rebalanced
  • Latency-sensitive or stateful jobs such as game servers

Global Jobs

Global jobs present an edge case for this system, particularly where share reservation is concerned. To resolve the edge case, the scheduler would follow the rules below (sketched in code after the list).

  • Global units can always be scheduled on every node unless the entire cluster is reporting 0 shares.
  • Global units using share reservation can cause the current share count to go below 0. This allows the machine to come back into balance as other jobs exit and return shares.
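
A minimal sketch of the two rules, using the same kind of hypothetical state: global units are refused only when the whole cluster reports zero shares, and their reservations may push a machine's unconsumed shares below zero.

```go
package main

import "fmt"

// state is a hypothetical per-machine share ledger.
type state struct {
	shares   float64
	reserved float64
}

// canPlaceGlobal refuses a global unit only when the entire cluster
// reports zero shares.
func canPlaceGlobal(cluster []state) bool {
	for _, m := range cluster {
		if m.shares > 0 {
			return true
		}
	}
	return false
}

// reserveForGlobal books the reservation even if it over-commits the
// machine; the balance recovers as other jobs exit and return shares.
func reserveForGlobal(m *state, reserve float64) {
	m.reserved += reserve
}

func main() {
	m := state{shares: 100, reserved: 90}
	reserveForGlobal(&m, 25)
	fmt.Println(m.shares - m.reserved)                    // -15: temporarily below zero
	fmt.Println(canPlaceGlobal([]state{{shares: 0}, {}})) // false: whole cluster at zero
}
```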

Wrap-up

Overall I think this would be a good addition to the default fleet scheduler. All options are opt-in; if none are set, the behavior stays exactly as it is today: all machines would have the same number of shares, all jobs would be weighted at 1.0, and no shares would be reserved.

I also think this is in line with the goals of fleet. It is easy to use and provides significant power without trying to be everything for everyone.

Investigation into what it would take to implement all three features is currently underway, but I wanted to engage with the CoreOS developers and the community before I went too far down a path.

I look forward to any and all feedback.

@romesh-mccullough

+1

Full disclosure: I am one of the colleagues @epipho mentioned.

This would give us all the controls we need to balance a cluster, but without the weight of Kubernetes or Mesos (which are overkill for our needs).

@divideandconquer

+1

I am another of @epipho's colleagues, so I may be somewhat biased, but this solution would provide a good balance between ease of use and control that currently seems to be lacking in the space.

@jonboulle
Contributor

@epipho interesting proposal. I am a little unclear on how the job multiplier and shares reservation would interact (e.g. do reservations always take precedence, and multipliers are only considered if reservations are not present or equal?)

If you could also come up with an illustrative example, that would be really great.

@epipho
Author

epipho commented Oct 7, 2014

I see it working something like the following. I chose the numbers for the shares and job multipliers to illustrate the math; any number of jobs can have the same multiplier.

Consider the following cluster of two machines. One starts with 100 shares and one with 50; they have been running for some time, and their current state looks like the table below.

| ID | Shares | Current Job Multipliers | Current Load |
|----|--------|-------------------------|--------------|
| 1  | 100    | 1, 2, 5, 10             | (1+2+5+10)/100 = 0.18 |
| 2  | 50     | 3, 6                    | (3+6)/50 = 0.18 |

Machine 1 has 4 jobs with multipliers 1, 2, 5, and 10. Machine 2 has 2 jobs with multipliers 3 and 6.

A new job is submitted with a job multiplier of 4. The scheduler first calculates what the new load would be on each machine if the job were to be scheduled there. The load here is only relative; no amount of current load will prevent this job from being scheduled somewhere.

1: (1+2+5+10+4)/100 = 0.22
2: (3+6+4)/50 = 0.26

0.22 < 0.26 so machine 1 is chosen. Current state:

| ID | Shares | Current Job Multipliers | Current Load |
|----|--------|-------------------------|--------------|
| 1  | 100    | 1, 2, 5, 10, 4          | (1+2+5+10+4)/100 = 0.22 |
| 2  | 50     | 3, 6                    | (3+6)/50 = 0.18 |

A new job is submitted, but this one uses 25 reserved shares with a multiplier of 1.
Loads are calculated as before, but the 25 shares are considered consumed. If a machine has fewer than 25 unconsumed shares, it is automatically not eligible. If no machine has enough shares, the job cannot be scheduled.

1: (1+2+5+10+4+1)/(100-25) = 0.307
2: (3+6+1)/(50-25) = 0.4

Machine 1 still wins the job, but its current share count is reduced since our new job consumed some of its shares. If this job finishes or exits, the shares will be returned. Current state:

| ID | Shares   | Current Job Multipliers                 | Current Load |
|----|----------|-----------------------------------------|--------------|
| 1  | 75 (100) | 1, 2, 5, 10, 4, 1 (+25 shares reserved) | (1+2+5+10+4+1)/(100-25) = 0.307 |
| 2  | 50       | 3, 6                                    | (3+6)/50 = 0.18 |

From now until the shares are returned, Machine 1 is treated as a machine with only 75 shares instead of 100.

Again a job comes in, this time with a multiplier of 7. Same pattern as the first one.

1: (1+2+5+10+4+1+7)/(100-25) = 0.4
2: (3+6+7)/50 = 0.32

The second machine wins the job, leaving our cluster in the final state:

| ID | Shares   | Current Job Multipliers                 | Current Load |
|----|----------|-----------------------------------------|--------------|
| 1  | 75 (100) | 1, 2, 5, 10, 4, 1 (+25 shares reserved) | (1+2+5+10+4+1)/(100-25) = 0.307 |
| 2  | 50       | 3, 6, 7                                 | (3+6+7)/50 = 0.32 |

Hopefully this helps flesh out how jobs with reserved shares interact with the scheduler.
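
To double-check the arithmetic, here is a small self-contained Go program (hypothetical types, not fleet code) that replays the three placements above and reaches the same loads and choices:

```go
package main

import "fmt"

// machine mirrors the tables above: configured shares, held reservations,
// and the multipliers of the units already placed on it.
type machine struct {
	id       string
	shares   float64
	reserved float64
	mults    []float64
}

// loadWith is the projected load if a job with the given multiplier and
// reservation were placed here.
func (m *machine) loadWith(mult, reserve float64) float64 {
	sum := mult
	for _, x := range m.mults {
		sum += x
	}
	return sum / (m.shares - m.reserved - reserve)
}

// place puts the job on the eligible machine with the lowest projected load,
// then records its multiplier and any reserved shares.
func place(ms []*machine, mult, reserve float64) *machine {
	var best *machine
	for _, m := range ms {
		if m.shares-m.reserved < reserve {
			continue // not enough unconsumed shares
		}
		if best == nil || m.loadWith(mult, reserve) < best.loadWith(mult, reserve) {
			best = m
		}
	}
	if best == nil {
		return nil // the job cannot be scheduled anywhere
	}
	best.mults = append(best.mults, mult)
	best.reserved += reserve
	return best
}

func main() {
	m1 := &machine{id: "1", shares: 100, mults: []float64{1, 2, 5, 10}}
	m2 := &machine{id: "2", shares: 50, mults: []float64{3, 6}}
	cluster := []*machine{m1, m2}

	fmt.Println(place(cluster, 4, 0).id)  // "1": 0.22 vs 0.26
	fmt.Println(place(cluster, 1, 25).id) // "1": 0.307 vs 0.40
	fmt.Println(place(cluster, 7, 0).id)  // "2": 0.40 vs 0.32
}
```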

**edit: fixing busted up table markdown

@epipho
Author

epipho commented Oct 14, 2014

ping. Looking for more feedback on this.

@jonboulle
Contributor

@epipho sorry for the delay; bcwaldon is away at the moment and I really want to sit down with him to chat about this a bit.

@epipho
Author

epipho commented Oct 16, 2014

Not a problem, just wanted to make sure it wasn't lost.

I would also be happy to schedule a time to jump on IRC to chat in more detail once bcwaldon is back.

@dbason

dbason commented Oct 21, 2014

FWIW here is our fork that does something quite similar, but includes memory constraints as well:
cheribral@090b4a9

I think your comment on #922 is the way to go; that way this sort of thing could just be added in as part of the chain.

@epipho
Author

epipho commented Nov 18, 2014

We had a bit of a shakeup over here but are ready to get started soon.

Talked to several team members, including @polvi, at re:Invent last week and they seemed enthusiastic about the concept.
