Fair Share Replication Scheduler Implementation

Fair share replication scheduler allows configuring job priorities
per-replicator db.

Previously, jobs from all the replication dbs would be added to the
scheduler and run in a round-robin order. This update makes it
possible to specify the relative priority of jobs from different
databases, so, for example, there could be lower, higher and default
priority _replicator dbs.

The main discussion and the design is in the [RFC](apache/couchdb-documentation#617).

The original algorithm comes from the [A Fair Share
Scheduler](https://proteusmaster.urcf.drexel.edu/urcfwiki/images/KayLauderFairShare.pdf
"Fair Share Scheduler") paper by Judy Kay and Piers Lauder.

The main idea is that _replicator dbs can have a configurable number
of "shares" assigned to them. Shares in this context are a relative
quantity from 1 to 1000; the default is 100. Jobs from a replicator
database with more shares get a proportionally higher chance to run
than those from databases with a lower number of shares.
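
For example, a hypothetical configuration with a default, a higher,
and a lower priority replicator db (the db names here are made up):

```ini
[replicator.shares]
; default priority db keeps the default 100 shares
_replicator = 100
; jobs from this db are roughly twice as likely to run
high/_replicator = 200
; jobs from this db get about half the default chance to run
low/_replicator = 50
```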

Every scheduler cycle, running jobs are "charged" based on how much
time they spent running during that cycle. At the end of the cycle,
the charges accumulated for each db, the number of shares configured,
and the total number of jobs in the pending queue for that db are used
to calculate a priority value for each job. To match the algorithm
presented in the paper, jobs with lower priority values are the ones
at the front of the run queue and have a higher chance of running.

Here is how charges, shares, and the number of sibling jobs affect
the priority value (a sketch follows the list):

  1) Jobs from dbs with higher shares get assigned lower priority
  values (they run sooner) than those from dbs with fewer shares.

  2) Jobs from dbs with many other jobs (many siblings) get assigned a
  higher priority value, so they get pushed further down the run queue
  and have a lower chance of running.

  3) Jobs which run longer accumulate more charges, get assigned a
  higher priority value, and so wait longer before running again.
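
To make the accounting concrete, here is a minimal sketch of a
Kay-Lauder-style priority update, assuming priority grows with the
accumulated charges and the number of sibling jobs and shrinks with
the number of shares; the function, field names and exact formula are
illustrative rather than taken from the implementation:

```python
# Hypothetical sketch, not the actual CouchDB code. For each job the
# priority value grows with the db's accumulated charges and with the
# number of sibling jobs, and shrinks as the db is given more shares.
def update_priorities(jobs, db_charges, db_shares, db_num_jobs):
    for job in jobs:
        db = job["db"]
        charges = db_charges.get(db, 0)
        shares = db_shares.get(db, 100)   # 100 is the default share count
        siblings = db_num_jobs.get(db, 1)
        job["priority"] += charges * siblings / (shares * shares)
    # Lower priority values sort to the front of the run queue, so jobs
    # from dbs with more shares, fewer siblings, and less accumulated
    # runtime get to run sooner.
    jobs.sort(key=lambda job: job["priority"])
```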

In order to prevent job starvation, there is a competing "process"
which periodically moves all the jobs towards the front of the queue
and decreases the priority values. In the implementation this is called
the "priority decay". So, in effect, there are two competing processes
- one uniformly moves all jobs to the front, the other throws them
back in proportion to those factors mentioned above. The speed of this
uniform priority decay is controlled by the `[replicator]
priority_coeff = 0.98` parameter. The implementation describes how
the value was picked and what effect the default has.
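
As a rough, assumed illustration (not the actual code), the decay can
be thought of as multiplying every priority by the coefficient once
per cycle:

```python
# Illustrative-only sketch of the uniform "priority decay" step; only
# the priority_coeff parameter name comes from the configuration.
PRIORITY_COEFF = 0.98

def decay_priorities(jobs, coeff=PRIORITY_COEFF):
    # Each cycle every job's priority shrinks uniformly towards 0,
    # i.e. towards the front of the run queue.
    for job in jobs:
        job["priority"] *= coeff

# With the 0.98 default, a priority shrinks to 0.98 ** 430 (roughly
# 0.0002) of its original value after about 430 cycles, which at about
# one cycle per minute is on the order of 7 hours.
```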

In order to prevent jobs from low priority dbs from being deleted and
immediately re-added, thus resetting their priority to 0, charges are
accumulated using a historically decayed "usage" value. The speed of
the decay is controlled by the `[replicator] usage_coeff = 0.5`
parameter. The implementation has a discussion on how the value was
picked and what effect modifying it might have.
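
A minimal sketch of how such a decayed usage value could be
maintained, with assumed names and structure:

```python
# Illustrative sketch: each db's "usage" decays every cycle and then
# absorbs the charges accumulated during that cycle, so history fades
# gradually instead of being reset when a job is deleted and re-added.
USAGE_COEFF = 0.5

def update_usage(db_usage, db_charges, coeff=USAGE_COEFF):
    for db in set(db_usage) | set(db_charges):
        db_usage[db] = db_usage.get(db, 0) * coeff + db_charges.get(db, 0)
```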

Modifications to the main scheduler logic turned out to be
minimal. Besides the main share accounting bits (accumulating charges,
decaying priorities and calculating new priorities), the main changes
are:

  1) Running and stopping candidates are picked based on the priority
  first, then on their last_started timestamp (see the sketch after
  this list).

  2) When jobs finish executing mid-cycle, their charges have to be
  accounted for. That holds for jobs which terminate normally, are
  removed by the user or simply crash.
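
As an assumed sketch of point 1), candidate selection might look like
sorting by the priority value first and the last_started timestamp
second (field names are illustrative):

```python
# Hypothetical sketch of candidate selection, not the actual scheduler code.
def start_candidates(pending_jobs, n):
    # Start the jobs at the front of the run queue (lowest priority value),
    # breaking ties with the jobs which have waited the longest.
    ordered = sorted(pending_jobs,
                     key=lambda j: (j["priority"], j["last_started"]))
    return ordered[:n]

def stop_candidates(running_jobs, n):
    # Stop the jobs at the back of the run queue (highest priority value),
    # breaking ties with the jobs which have been running the longest.
    ordered = sorted(running_jobs,
                     key=lambda j: (-j["priority"], j["last_started"]))
    return ordered[:n]
```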

Another interesting aspect is the interaction with the error back-off
mechanism and how one-shot replications are treated:

  1) The exponential error back-off mechanism is unaltered and takes
  precedence over the priority values. That means unhealthy jobs are
  rejected and "penalized" before the priority value is even looked
  at.

  2) One-shot replications, once started, are not stopped during each
  scheduling cycle unless the operator manually adjusts the max_jobs
  parameter. That behavior is necessary to preserve the snapshotting
  semantics and is retained in this implementation.
nickva committed Feb 16, 2021
1 parent 183f6fe commit 5a5ebe6
Showing 5 changed files with 976 additions and 76 deletions.
27 changes: 27 additions & 0 deletions rel/overlay/etc/default.ini
@@ -482,6 +482,33 @@ ssl_certificate_max_depth = 3
; or 403 response this setting is not needed.
;session_refresh_interval_sec = 550

; Usage coefficient decays historic fair share usage every scheduling
; cycle. The value must be between 0.0 and 1.0. Lower values make
; historic usage decay quicker and higher values mean it is remembered
; longer.
;usage_coeff = 0.5

; Priority coefficient decays all the fair share job priorities such
; that they uniformly drift towards the front of the run queue. At the
; default value of 0.98 it will take about 430 scheduler cycles (7
; hours) for a single job which ran for 1 minute to drift towards the
; front of the queue (get assigned priority 0). 7 hours was picked as
; it is close to the maximum error backoff interval of about 8 hours.
; The value must be between 0.0 and 1.0. Too low a value, coupled with
; a lower max_jobs or churn parameter, could make the majority of jobs
; priority 0 too quickly and cancel the effect of the fair share
; algorithm.
;priority_coeff = 0.98


[replicator.shares]
; Fair share configuration section. More shares result in a higher
; chance that jobs from that db get to run. The default value is 100,
; minimum is 1 and maximum is 1000. The configuration may be set even
; if the database hasn't been created yet.
;_replicator = 100


[log]
; Possible log levels:
; debug