This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Simple share based scheduling #945

Open

epipho opened this issue Oct 4, 2014 · 9 comments

Comments

@epipho

epipho commented Oct 4, 2014

With all the scheduling chatter I thought it was time to put in an idea that my colleagues and I have been thinking about for a while.

This proposal is related to #922 and #747.

The idea behind this scheduler proposal is to extend the currently very simple "flatten the number of jobs" scheduler without attempting to solve any dynamic resource scheduling problems. I believe this method would meet a large number of users' scheduling needs and could be used as the "default" scheduler for fleet.

Machine Shares

Each machine receives a fixed number of shares (default 1024) as a parameter to fleet at startup, either as an env variable/flag (in the style of FLEET_ETCD_SERVERS) or as a "well-known" meta-data entry. For cases other than share reservation (see below) the exact values do not matter and are only used for relative weighting of the number of jobs on each node.

A node with 2048 shares would receive twice the number of jobs as a node with 1024 shares. The total share count should be modifiable via the API, either through the machine resource or the meta-data API in #555.

A node with zero shares is not eligible to receive any new (non-global) jobs. If the entire cluster is reporting zero shares, no new jobs can be scheduled.
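
For illustration, here is a minimal Go sketch (not fleet code) of how a per-machine share count might be carried and defaulted; the "shares" meta-data key and all type names here are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultShares is the proposed default when nothing is configured.
const defaultShares = 1024

// Machine is a hypothetical view of a fleet machine for this proposal.
type Machine struct {
	ID     string
	Shares int
}

// sharesFromMetadata reads a hypothetical "shares" meta-data entry,
// falling back to the default when the key is absent or malformed.
func sharesFromMetadata(meta map[string]string) int {
	if v, ok := meta["shares"]; ok {
		if n, err := strconv.Atoi(v); err == nil && n >= 0 {
			return n
		}
	}
	return defaultShares
}

// eligible reports whether a machine can accept new (non-global) jobs.
func (m Machine) eligible() bool {
	return m.Shares > 0
}

func main() {
	big := Machine{ID: "a", Shares: sharesFromMetadata(map[string]string{"shares": "2048"})}
	drained := Machine{ID: "b", Shares: sharesFromMetadata(map[string]string{"shares": "0"})}
	unset := Machine{ID: "c", Shares: sharesFromMetadata(nil)}
	fmt.Println(big.Shares, big.eligible())         // 2048 true
	fmt.Println(drained.Shares, drained.eligible()) // 0 false
	fmt.Println(unset.Shares, unset.eligible())     // 1024 true
}
```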

Use cases

  • Asymmetrical hardware, e.g. 2x 2-core servers and 1x 4-core server
  • Asymmetrical region balancing, e.g. expecting twice the users in dc1 compared to dc2
  • Setting shares to 0 to prevent new jobs from being scheduled, clearing off a node for decommissioning or maintenance

Job multiplier

A new X-Fleet option (Multiplier or Weight) for weighting jobs against each other. The multiplier is a floating point option, defaulting to 1.0. This value is then used when scheduling jobs to balance heavier jobs against lighter ones by ensuring each machine has approximately the same total "weight" of units.
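
As a rough sketch of the weighting idea (hypothetical types and field names, not fleet code), the scheduler would place a new unit on the machine with the lowest projected sum-of-multipliers per share:

```go
package main

import "fmt"

// cand is a hypothetical per-machine view: its shares and the multipliers
// of the units already scheduled on it.
type cand struct {
	id     string
	shares float64
	mults  []float64
}

// projectedLoad is the machine's relative weight if a unit with the given
// multiplier were added: (sum of existing multipliers + new one) / shares.
func projectedLoad(c cand, mult float64) float64 {
	sum := mult
	for _, m := range c.mults {
		sum += m
	}
	return sum / c.shares
}

func main() {
	a := cand{id: "a", shares: 1024, mults: []float64{1, 1, 4}} // carries one heavy unit
	b := cand{id: "b", shares: 1024, mults: []float64{1, 1, 1}} // only light units
	mult := 0.5                                                 // e.g. an idle data container
	fmt.Println(projectedLoad(a, mult) < projectedLoad(b, mult)) // false: b is the better fit
}
```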

Use cases

  • Jobs that are more expensive to process
  • Services that receive differing amounts of traffic
  • Idle monitoring or data containers (multiplier < 1)

Share reservation

A new X-Fleet option (ReserveShares) to consume a number of shares from the machine during the job's run. Once the job has exited the shares are returned to the pool. While the shares are tied up, the machine appears to the regular scheduler to have fewer shares than normal. A machine with 1024 shares running a task that reserves 256 shares would appear as a 768-share machine to the scheduler.

A job that reserves shares cannot be started if no machines report an appropriate number of unconsumed shares, with the exception of global jobs (See Global Jobs).
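
A hedged sketch of how a reservation might enter the same calculation (again with hypothetical types): the reserved shares gate eligibility and shrink the denominator until the job exits.

```go
package main

import "fmt"

// cand is a hypothetical per-machine view for scheduling.
type cand struct {
	shares   float64 // configured shares
	reserved float64 // shares currently held by ReserveShares jobs
	loadSum  float64 // sum of multipliers of units already placed here
}

// effectiveShares is what the regular scheduler sees while reservations are held.
func effectiveShares(c cand) float64 {
	return c.shares - c.reserved
}

// feasible reports whether a job reserving `reserve` shares can land here.
func feasible(c cand, reserve float64) bool {
	return effectiveShares(c) >= reserve
}

// projectedLoad is the relative load if a unit with the given multiplier
// and reservation were placed on this machine.
func projectedLoad(c cand, mult, reserve float64) float64 {
	return (c.loadSum + mult) / (effectiveShares(c) - reserve)
}

func main() {
	m := cand{shares: 1024, reserved: 0, loadSum: 3}
	fmt.Println(feasible(m, 256))                  // true
	fmt.Printf("%.4f\n", projectedLoad(m, 1, 256)) // (3+1)/(1024-256) ≈ 0.0052
	fmt.Println(feasible(cand{shares: 128}, 256))  // false: not enough unconsumed shares
}
```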

Use cases

  • Batch and one-shot jobs that cannot be rebalanced
  • Latency-sensitive or stateful jobs such as game servers

Global Jobs

Global jobs present an edge case for this system, particularly where share reservation is concerned. To resolve the edge case, the scheduler would follow the rules below (sketched in code after the list).

  • Global units can always be scheduled on every node unless the entire cluster is reporting 0 shares.
  • Global units using share reservation can cause the current share count to go below 0. This allows the machine to come back into balance as other jobs exit and return shares.
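
A minimal sketch of the two rules, using the same kind of hypothetical state: global units are refused only when the whole cluster reports zero shares, and their reservations may push a machine's unconsumed shares below zero.

```go
package main

import "fmt"

// state is a hypothetical per-machine share ledger.
type state struct {
	shares   float64
	reserved float64
}

// canPlaceGlobal refuses a global unit only when the entire cluster
// reports zero shares.
func canPlaceGlobal(cluster []state) bool {
	for _, m := range cluster {
		if m.shares > 0 {
			return true
		}
	}
	return false
}

// reserveForGlobal books the reservation even if it over-commits the
// machine; the balance recovers as other jobs exit and return shares.
func reserveForGlobal(m *state, reserve float64) {
	m.reserved += reserve
}

func main() {
	m := state{shares: 100, reserved: 90}
	reserveForGlobal(&m, 25)
	fmt.Println(m.shares - m.reserved)                    // -15: temporarily below zero
	fmt.Println(canPlaceGlobal([]state{{shares: 0}, {}})) // false: whole cluster at zero
}
```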

Wrap-up

Overall I think this would be a good addition to the default fleet scheduler. All options are opt-in; if none are set, the behavior stays exactly as it is today: all machines would have the same number of shares, all jobs would be weighted at 1.0, and no shares would be reserved.

I also think this is in line with the goals of fleet. It is easy to use and provides significant power without trying to be everything for everyone.

Investigation into what it would take to implement all three features is currently underway, but I wanted to engage with the CoreOS developers and the community before I went too far down a path.

I look forward to any and all feedback.

@romesh-mccullough

+1

Full disclosure: I am one of the colleagues @epipho mentioned.

This would give us all the controls we need to balance a cluster, but without the weight of Kubernetes or Mesos (which are overkill for our needs).

@divideandconquer

+1

I am another of @epipho's colleagues, so I may be somewhat biased, but this solution would provide a good balance between ease of use and control that currently seems to be lacking in the space.

@jonboulle
Contributor

@epipho interesting proposal. I am a little unclear on how the job multiplier and shares reservation would interact (e.g. do reservations always take precedence, and multipliers are only considered if reservations are not present or equal?)

If you could also come up with an illustrative example, that would be really great.

@epipho
Author

epipho commented Oct 7, 2014

I see it working something like the following. I chose the numbers for the shares and job multipliers to illustrate the math; any number of jobs can have the same multiplier.

Consider the following cluster of two machines. One starts with 100 shares and one with 50; they have been running for some time, and their current state looks like the table below.

| ID | Shares | Current Job Multipliers | Current Load |
|----|--------|-------------------------|--------------|
| 1  | 100    | 1, 2, 5, 10             | (1+2+5+10)/100 = 0.18 |
| 2  | 50     | 3, 6                    | (3+6)/50 = 0.18 |

Machine 1 has 4 jobs with multipliers 1, 2, 5, and 10. Machine 2 has 2 jobs with multipliers 3 and 6.

A new job is submitted with a job multiplier of 4. The scheduler first calculates what the new load would be on each machine if the job were to be scheduled there. The load here is only relative; no amount of current load will prevent this job from being scheduled somewhere.

1: (1+2+5+10+4)/100 = 0.22
2: (3+6+4)/50 = 0.26

0.22 < 0.26 so machine 1 is chosen. Current state:

| ID | Shares | Current Job Multipliers | Current Load |
|----|--------|-------------------------|--------------|
| 1  | 100    | 1, 2, 5, 10, 4          | (1+2+5+10+4)/100 = 0.22 |
| 2  | 50     | 3, 6                    | (3+6)/50 = 0.18 |

A new job is submitted, but this one uses 25 reserved shares with a multiplier of 1.
Loads are calculated as before, but the 25 shares are considered consumed. If a machine has fewer than 25 unconsumed shares, it is automatically not eligible. If no machine has enough shares, the job cannot be scheduled.

1: (1+2+5+10+4+1)/(100-25) = 0.307
2: (3+6+1)/(50-25) = 0.4

Machine 1 still wins the job, but its current share count is reduced since our new job consumed some of its shares. If this job finishes or exits, the shares will be returned. Current state:

| ID | Shares   | Current Job Multipliers                 | Current Load |
|----|----------|-----------------------------------------|--------------|
| 1  | 75 (100) | 1, 2, 5, 10, 4, 1 (+25 shares reserved) | (1+2+5+10+4+1)/(100-25) = 0.307 |
| 2  | 50       | 3, 6                                    | (3+6)/50 = 0.18 |

From now until the shares are returned, Machine 1 is treated as a machine with only 75 shares instead of 100.

Again a job comes in, this time with a multiplier of 7. Same pattern as the first one.

1: (1+2+5+10+4+1+7)/(100-25) = 0.4
2: (3+6+7)/50 = 0.32

The second machine wins the job, leaving our cluster in the final state:

| ID | Shares   | Current Job Multipliers                 | Current Load |
|----|----------|-----------------------------------------|--------------|
| 1  | 75 (100) | 1, 2, 5, 10, 4, 1 (+25 shares reserved) | (1+2+5+10+4+1)/(100-25) = 0.307 |
| 2  | 50       | 3, 6, 7                                 | (3+6+7)/50 = 0.32 |

Hopefully this helps flesh out how jobs with reserved shares interact with the scheduler.
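
To double-check the arithmetic, here is a small self-contained Go program (hypothetical types, not fleet code) that replays the three placements above and reaches the same loads and choices:

```go
package main

import "fmt"

// machine mirrors the tables above: configured shares, held reservations,
// and the multipliers of the units already placed on it.
type machine struct {
	id       string
	shares   float64
	reserved float64
	mults    []float64
}

// loadWith is the projected load if a job with the given multiplier and
// reservation were placed here.
func (m *machine) loadWith(mult, reserve float64) float64 {
	sum := mult
	for _, x := range m.mults {
		sum += x
	}
	return sum / (m.shares - m.reserved - reserve)
}

// place puts the job on the eligible machine with the lowest projected load,
// then records its multiplier and any reserved shares.
func place(ms []*machine, mult, reserve float64) *machine {
	var best *machine
	for _, m := range ms {
		if m.shares-m.reserved < reserve {
			continue // not enough unconsumed shares
		}
		if best == nil || m.loadWith(mult, reserve) < best.loadWith(mult, reserve) {
			best = m
		}
	}
	if best == nil {
		return nil // the job cannot be scheduled anywhere
	}
	best.mults = append(best.mults, mult)
	best.reserved += reserve
	return best
}

func main() {
	m1 := &machine{id: "1", shares: 100, mults: []float64{1, 2, 5, 10}}
	m2 := &machine{id: "2", shares: 50, mults: []float64{3, 6}}
	cluster := []*machine{m1, m2}

	fmt.Println(place(cluster, 4, 0).id)  // "1": 0.22 vs 0.26
	fmt.Println(place(cluster, 1, 25).id) // "1": 0.307 vs 0.40
	fmt.Println(place(cluster, 7, 0).id)  // "2": 0.40 vs 0.32
}
```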

**edit: fixing busted up table markdown

@epipho
Author

epipho commented Oct 14, 2014

ping. Looking for more feedback on this.

@jonboulle
Contributor

@epipho sorry for the delay; bcwaldon is away at the moment and I really want to sit down with him to chat about this a bit.

@epipho
Author

epipho commented Oct 16, 2014

Not a problem, just wanted to make sure it wasn't lost.

I would also be happy to schedule a time to jump on IRC to chat in more detail once bcwaldon is back.

@dbason

dbason commented Oct 21, 2014

FWIW here is our fork that does something quite similar, but includes memory constraints as well:
cheribral@090b4a9

I think your comment on #922 is the way to go; that way this sort of thing could just be added in as part of the chain.

@epipho
Author

epipho commented Nov 18, 2014

We had a bit of a shakeup over here but are ready to get started soon.

Talked to several team members, including @polvi, at re:Invent last week and they seemed enthusiastic about the concept.
