
Fair Share Replication Scheduler Implementation (3.x) #3364

Merged
merged 4 commits into 3.x from fair-share-scheduler-3.x on Mar 11, 2021

Conversation


@nickva nickva commented Feb 8, 2021

The fair share replication scheduler allows configuring job priorities
per _replicator db.

Previously, jobs from all the replication dbs were added to the scheduler
and run in round-robin order. This update makes it possible to specify the
relative priority of jobs from different databases. For example, there could be
low, high, and default priority _replicator dbs.
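
For illustration only, such a setup could look roughly like the following ini section (the section name matches the replicator.shares section used by the test script further down; the db names and share values here are made up):

[replicator.shares]
; more shares = a larger slice of the scheduler's run slots
low/_replicator = 25
_replicator = 100
high/_replicator = 400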

The original algorithm comes from the [A Fair Share
Scheduler](https://proteusmaster.urcf.drexel.edu/urcfwiki/images/KayLauderFairShare.pdf
"Fair Share Scheduler") paper by Judy Kay and Piers Lauder. A summary of how
the algorithm works is included in the top level comment in the
couch_replicator_share module.

The main scheduler logic is only minimally modified. Besides the share
accounting performed each cycle, the other changes are:

  • Running and stopping candidates are now picked based on their priority first,
    and then on their last_started timestamp.

  • When jobs finish executing mid-cycle, their charges are accounted for. That
    holds for jobs which terminate normally, are removed by the user, or crash.
    (A toy sketch of both points follows this list.)
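
As a rough illustration of the two bullets above, here is a toy Python sketch. It is not the actual Erlang implementation (that lives in couch_replicator_scheduler and couch_replicator_share); the record fields and function names here are invented for the sketch.

from dataclasses import dataclass

@dataclass
class Job:
    id: str
    db: str              # the _replicator db the job belongs to
    priority: float      # recomputed each cycle from the db's accumulated usage vs. its shares
    last_started: float  # timestamp of the job's last start

def rank(jobs):
    # Candidates are ordered by priority first, then by last_started; the
    # scheduler picks which jobs to start and which to stop from an ordering
    # like this one.
    return sorted(jobs, key=lambda j: (j.priority, j.last_started))

def charge_on_exit(usage, job, ran_for_sec):
    # A job that finishes mid-cycle (normal completion, user removal, or a
    # crash) still gets its run time charged to its db so the share
    # accounting stays accurate.
    usage[job.db] = usage.get(job.db, 0) + ran_for_sec
    return usage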

Other interesting aspects are the interaction with the error back-off mechanism
and how one-shot replications are treated:

  • The exponential error back-off mechanism is unaltered and takes precedence
    over the priority values. That means unhealthy jobs are rejected and
    "penalized" before the priority value is even looked at.

  • One-shot replications, once started, are not stopped during each scheduling
    cycle unless the operator manually adjusts the max_jobs parameter. That
    behavior is necessary to preserve the "snapshot" semantics and is retained in
    this update.

@nickva nickva force-pushed the fair-share-scheduler-3.x branch 9 times, most recently from 43b60b4 to b68113b on February 15, 2021 22:46
@nickva nickva marked this pull request as ready for review February 15, 2021 22:51
@nickva nickva force-pushed the fair-share-scheduler-3.x branch 2 times, most recently from 5a5ebe6 to a9163ec on February 16, 2021 05:34
@nickva nickva changed the title from "Fair share scheduler 3.x" to "Fair Share Replication Scheduler Implementation (3.x)" on Feb 16, 2021
@nickva nickva force-pushed the fair-share-scheduler-3.x branch 3 times, most recently from 339daf3 to 2ef7b50 on February 16, 2021 17:49
@nickva nickva requested a review from davisp February 16, 2021 17:56
@nickva nickva force-pushed the fair-share-scheduler-3.x branch 5 times, most recently from 5b0d79d to a4780e4 on February 19, 2021 22:57

nickva commented Feb 19, 2021

To help test the PR I created a silly script which runs with a local ./dev/run -n1 cluster. It configures replication jobs and shares for three _replicator dbs and then continuously samples the proportion of running jobs in each db:

https://gist.github.com/nickva/7e86b3df19537a60372217e0f68b693a

So for the default parameters it might output something like:

 ** run stats num dbs: 3, total jobs: 100
rdb1/_replicator 66 66%
rdb2/_replicator 23 23%
rdb3/_replicator 11 11%

 ** run stats num dbs: 3, total jobs: 100
rdb1/_replicator 61 61%
rdb2/_replicator 31 31%
rdb3/_replicator 8 8%

With a configuration where the shares are all even (100) but the dbs have an uneven number of jobs, after 5-10 minutes there should be a roughly even share of running jobs from each db, even though one db has 500 jobs and the others have fewer:

# Replicator db names as reported in the output above.
RDB1 = "rdb1/_replicator"
RDB2 = "rdb2/_replicator"
RDB3 = "rdb3/_replicator"

INTERVAL = 20000
CONFIG = {
    "replicator": {
        "max_jobs": "100",
        "max_churn": "20",
        "interval": str(INTERVAL),
    },
    "replicator.shares": {
        RDB1: "100",
        RDB2: "100",
        RDB3: "100",
    },
}
REPS = {
    RDB1: 200,
    RDB2: 300,
    RDB3: 500,
}
 ** run stats num dbs: 3, total jobs: 100
rdb1/_replicator 30 30%
rdb2/_replicator 34 34%
rdb3/_replicator 36 36%

 ** run stats num dbs: 3, total jobs: 100
rdb1/_replicator 29 29%
rdb2/_replicator 33 33%
rdb3/_replicator 38 38%
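
For completeness, here is a minimal sketch of adjusting a share value at runtime through CouchDB's per-node config API. The base URL, node name, and credentials are placeholders for a local ./dev/run -n1 setup; whether a change takes effect immediately or over the next scheduling cycles is up to the implementation.

import json
from urllib.parse import quote

import requests

# Placeholder base URL and admin credentials; adjust to your local setup.
BASE = "http://127.0.0.1:15984"
AUTH = ("adm", "pass")

def set_shares(replicator_db, shares):
    # Keys in the "replicator.shares" section are replicator db names, so the
    # "/" in e.g. "rdb1/_replicator" has to be percent-encoded in the URL.
    key = quote(replicator_db, safe="")
    url = f"{BASE}/_node/_local/_config/replicator.shares/{key}"
    # The config API takes the new value as a JSON-encoded string and returns
    # the previous value.
    resp = requests.put(url, auth=AUTH, data=json.dumps(str(shares)))
    resp.raise_for_status()
    return resp.json()

set_shares("rdb1/_replicator", 200)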

@davisp davisp left a comment


My initial review of this PR didn't find anything major in terms of the implementation. Though I do think we're missing a high level overview comment in the couch_replicator_share module that provides a mental model for how this works. I know there's a paper I can go read for more detail, but a quick two or three paragraphs explaining how shares and charges work, and how they relate, would be a pretty good starting point. Along with maybe a couple of scenarios describing how the priority calculations work out.

Other than that most of my comments were style nits.

Resolved review threads on: rel/overlay/etc/default.ini, src/couch_replicator/src/couch_replicator_scheduler.erl, src/couch_replicator/src/couch_replicator_share.erl
@nickva nickva force-pushed the fair-share-scheduler-3.x branch 4 times, most recently from d066d86 to 78ceeed Compare March 1, 2021 17:33
@davisp davisp left a comment


+1. Awesome work as per usual.


nickva commented Mar 11, 2021

Will merge the PR, noting that I still owe a documentation update for it.

@nickva nickva merged commit 1add534 into 3.x Mar 11, 2021
@nickva nickva deleted the fair-share-scheduler-3.x branch March 11, 2021 18:13
nickva added a commit to apache/couchdb-documentation that referenced this pull request Mar 15, 2021
A short description on how the algorithm works along with the
configuration sections.

Main PR: apache/couchdb#3364
nickva added a commit to apache/couchdb-documentation that referenced this pull request Mar 17, 2021
A short description on how the algorithm works along with the
configuration sections.

Main PR: apache/couchdb#3364
nickva added a commit to nickva/couchdb that referenced this pull request Sep 7, 2022
A short description on how the algorithm works along with the
configuration sections.

Main PR: apache#3364