Fair Share Replication Scheduler Implementation

Fair share replication scheduler allows configuring job priorities
per-replicator db.

Previously, jobs from all the replication dbs would be added to the
scheduler and run in a round-robin order. This update makes it
possible to specify the relative priority of jobs from different
databases, so, for example, there could be lower, higher and default
priority _replicator dbs.

The main discussion and the design is in the [RFC](apache/couchdb-documentation#617).

The original algorithm comes from the [A Fair Share
Scheduler](https://proteusmaster.urcf.drexel.edu/urcfwiki/images/KayLauderFairShare.pdf
"Fair Share Scheduler") paper by Judy Kay and Piers Lauder.

The main idea is that _replicator dbs can have a configurable number
of "shares" assigned to them. Shares in this context are a relative
quantity from 1 to 1000; the default is 100. Jobs from a replicator
database with more shares get a proportionally higher chance to run
than those from databases with a lower number of shares.
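
For example, a hypothetical configuration with a default, a higher,
and a lower priority replicator db (the db names here are made up):

```ini
[replicator.shares]
; default priority db keeps the default 100 shares
_replicator = 100
; jobs from this db are roughly twice as likely to run
high/_replicator = 200
; jobs from this db get about half the default chance to run
low/_replicator = 50
```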

Every scheduler cycle, running jobs are "charged" based on how much
time they spent running during that cycle. At the end of the cycle,
the charges accumulated for each db, the number of shares configured,
and the total number of jobs in the pending queue for that db are used
to calculate a priority value for each job. To match the algorithm
presented in the paper, jobs with lower priority values are the ones
at the front of the run queue and have a higher chance of running.

Here is how charges, shares, and the number of sibling jobs affect
the priority value (a sketch follows the list):

  1) Jobs from dbs with higher shares get assigned lower priority
  values (they run sooner) than those from dbs with fewer shares.

  2) Jobs from dbs with many other jobs (many siblings) get assigned a
  higher priority value, so they get pushed further down the run queue
  and have a lower chance of running.

  3) Jobs which run longer accumulate more charges, get assigned a
  higher priority value, and so wait longer before running again.
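
To make the accounting concrete, here is a minimal sketch of a
Kay-Lauder-style priority update, assuming priority grows with the
accumulated charges and the number of sibling jobs and shrinks with
the number of shares; the function, field names and exact formula are
illustrative rather than taken from the implementation:

```python
# Hypothetical sketch, not the actual CouchDB code. For each job the
# priority value grows with the db's accumulated charges and with the
# number of sibling jobs, and shrinks as the db is given more shares.
def update_priorities(jobs, db_charges, db_shares, db_num_jobs):
    for job in jobs:
        db = job["db"]
        charges = db_charges.get(db, 0)
        shares = db_shares.get(db, 100)   # 100 is the default share count
        siblings = db_num_jobs.get(db, 1)
        job["priority"] += charges * siblings / (shares * shares)
    # Lower priority values sort to the front of the run queue, so jobs
    # from dbs with more shares, fewer siblings, and less accumulated
    # runtime get to run sooner.
    jobs.sort(key=lambda job: job["priority"])
```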

In order to prevent job starvation, there is a competing "process"
which periodically moves all the jobs towards the front of the queue
and decreases the priority values. In the implementation this is called
the "priority decay". So, in effect, there are two competing processes
- one uniformly moves all jobs to the front, the other throws them
back in proportion to those factors mentioned above. The speed of this
uniform priority decay is controlled by the `[replicator]
priority_coeff = 0.98` parameter. The implementation describes how
the value was picked and what effect the default has.
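
As a rough, assumed illustration (not the actual code), the decay can
be thought of as multiplying every priority by the coefficient once
per cycle:

```python
# Illustrative-only sketch of the uniform "priority decay" step; only
# the priority_coeff parameter name comes from the configuration.
PRIORITY_COEFF = 0.98

def decay_priorities(jobs, coeff=PRIORITY_COEFF):
    # Each cycle every job's priority shrinks uniformly towards 0,
    # i.e. towards the front of the run queue.
    for job in jobs:
        job["priority"] *= coeff

# With the 0.98 default, a priority shrinks to 0.98 ** 430 (roughly
# 0.0002) of its original value after about 430 cycles, which at about
# one cycle per minute is on the order of 7 hours.
```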

In order to prevent jobs from low priority dbs from being deleted and
immediately re-added, thus resetting their priority to 0, charges are
accumulated using a historically decayed "usage" value. The speed of
the decay is controlled by the `[replicator] usage_coeff = 0.5`
parameter. The implementation has a discussion on how the value was
picked and what effect modifying it might have.
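
A minimal sketch of how such a decayed usage value could be
maintained, with assumed names and structure:

```python
# Illustrative sketch: each db's "usage" decays every cycle and then
# absorbs the charges accumulated during that cycle, so history fades
# gradually instead of being reset when a job is deleted and re-added.
USAGE_COEFF = 0.5

def update_usage(db_usage, db_charges, coeff=USAGE_COEFF):
    for db in set(db_usage) | set(db_charges):
        db_usage[db] = db_usage.get(db, 0) * coeff + db_charges.get(db, 0)
```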

Modifications to the main scheduler logic turned out to be
minimal. Besides the main share accounting bits (accumulating charges,
decaying priorities and calculating new priorities), the main changes
are:

  1) Running and stopping candidates are picked based on the priority
  first, then on their last_started timestamp (see the sketch after
  this list).

  2) When jobs finish executing mid-cycle, their charges have to be
  accounted for. That holds for jobs which terminate normally, are
  removed by the user or simply crash.
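
As an assumed sketch of point 1), candidate selection might look like
sorting by the priority value first and the last_started timestamp
second (field names are illustrative):

```python
# Hypothetical sketch of candidate selection, not the actual scheduler code.
def start_candidates(pending_jobs, n):
    # Start the jobs at the front of the run queue (lowest priority value),
    # breaking ties with the jobs which have waited the longest.
    ordered = sorted(pending_jobs,
                     key=lambda j: (j["priority"], j["last_started"]))
    return ordered[:n]

def stop_candidates(running_jobs, n):
    # Stop the jobs at the back of the run queue (highest priority value),
    # breaking ties with the jobs which have been running the longest.
    ordered = sorted(running_jobs,
                     key=lambda j: (-j["priority"], j["last_started"]))
    return ordered[:n]
```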

Another interesting aspect is the interaction with the error back-off
mechanism and how one-shot replications are treated:

  1) The exponential error back-off mechanism is unaltered and takes
  precedence over the priority values. That means unhealthy jobs are
  rejected and "penalized" before the priority value is even looked
  at.

  2) One-shot replications, once started, are not stopped during each
  scheduling cycle unless the operator manually adjusts the max_jobs
  parameter. That behavior is necessary to preserve the snapshotting
  semantics and is retained in this implementation.
nickva committed Feb 16, 2021
1 parent 183f6fe commit 5a5ebe6
Showing 5 changed files with 976 additions and 76 deletions.
27 changes: 27 additions & 0 deletions rel/overlay/etc/default.ini
@@ -482,6 +482,33 @@ ssl_certificate_max_depth = 3
; or 403 response this setting is not needed.
;session_refresh_interval_sec = 550

; Usage coefficient decays historic fair share usage every scheduling
; cycle. The value must be between 0.0 and 1.0. Lower values make
; historic usage decay quicker and higher values mean it is remembered
; longer.
;usage_coeff = 0.5

; Priority coefficient decays all the fair share job priorities such
; that they uniformly drift towards the front of the run queue. At the
; default value of 0.98 it will take about 430 scheduler cycles (7
; hours) for a single job which ran for 1 minute to drift towards the
; front of the queue (get assigned priority 0). 7 hours was picked as
; it is close to the maximum error backoff interval of about 8 hours.
; The value must be between 0.0 and 1.0. Too low a value, coupled with
; a lower max_jobs or churn parameter, could make the majority of jobs
; priority 0 too quickly and cancel the effect of the fair share
; algorithm.
;priority_coeff = 0.98


[replicator.shares]
; Fair share configuration section. More shares result in a higher
; chance that jobs from that db get to run. The default value is 100,
; minimum is 1 and maximum is 1000. The configuration may be set even
; if the database hasn't been created yet.
;_replicator = 100


[log]
; Possible log levels:
; debug