Retain replication stats between job runs #1722

Merged
merged 1 commit into from Nov 9, 2018

@nickva
Contributor

nickva commented Nov 8, 2018

Previously, stats counts were reset between job runs. So if a job was stopped
and restarted by the scheduler, its docs_written, docs_read, doc_write_failures,
etc., counts would go back to 0. For doc_write_failures this was especially bad
as it hid the fact that some documents were not replicated to the target
because either a VDU (validate_doc_update function) failed or one of the limits
was hit.

This change preserves stats across job runs. Every time active tasks are updated,
the stats object in the rep record of each job in the scheduler's ets table is
updated asynchronously. On the next job start, the job reinitializes from the last
saved stats.
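
As a rough illustration of the idea described above (not the code in this PR), the sketch below keeps a per-job stats map in an ets table and reads it back when the job starts again. The module name, table name, and `job` record are hypothetical, and the write here is a direct ets update rather than the asynchronous update the PR describes.

```erlang
-module(retain_stats_sketch).
-export([demo/0]).

%% Hypothetical table name and record, used only for this sketch.
%% The real scheduler keeps its own job and rep records.
-define(TAB, retain_stats_sketch_jobs).
-record(job, {id, stats = #{}}).

%% Called whenever a running job refreshes its active-task entry.
%% Assumption: a plain synchronous ets write stands in for the
%% asynchronous update mentioned in the PR description.
save_stats(JobId, Stats) ->
    true = ets:update_element(?TAB, JobId, {#job.stats, Stats}),
    ok.

%% Called on job start: resume from the last saved stats, if any.
init_stats(JobId) ->
    case ets:lookup(?TAB, JobId) of
        [#job{stats = Stats}] -> Stats;
        [] -> #{}
    end.

demo() ->
    ?TAB = ets:new(?TAB, [named_table, public, {keypos, #job.id}]),
    true = ets:insert(?TAB, #job{id = <<"job1">>}),
    %% First run: nothing saved yet, so the job starts from empty stats.
    Empty = init_stats(<<"job1">>),
    0 = maps:size(Empty),
    ok = save_stats(<<"job1">>, #{docs_read => 3, docs_written => 3}),
    %% The job is stopped and later restarted by the scheduler:
    %% the second run picks up the counts the first run saved.
    Stats = init_stats(<<"job1">>),
    true = ets:delete(?TAB),
    Stats.
```

Calling `retain_stats_sketch:demo()` returns the map saved during the first run, which is the behavior the manual test later in this thread checks through `_active_tasks`.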

Relates somewhat to issue #1159

@nickva nickva force-pushed the fail-replication-on-doc-write branch from 3d0525b to 8c93fed Nov 8, 2018

@jaydoane

Other than the new warning, code looks great.

Eunit test passes:

$ make eunit skip_deps+=couch_epi apps=couch_replicator suites=couch_replicator_retain_stats_between_job_runs

======================== EUnit ========================
module 'couch_replicator_retain_stats_between_job_runs'
  couch_replicator_retain_stats_between_job_runs:49: t_stats_retained...[0.598 s] ok
=======================================================
  Test passed.

Also ran manual tests:

  • start a replication job, let it do some work, look at its stats in _active_tasks, and note the docs written, docs read, etc. numbers
$ curl -u adm:pass localhost:15984/_active_tasks | jq '.[0].docs_read'
3
  • in remsh, set replicator max_jobs to 0 with config:set
(node1@127.0.0.1)36> config:get("replicator", "max_jobs").
"500"
(node1@127.0.0.1)37> config:set("replicator","max_jobs","0", false).
ok
(node1@127.0.0.1)38> config:get("replicator", "max_jobs").
"0"
  • call couch_replicator_scheduler:reschedule(), which should stop the job
(node1@127.0.0.1)39> couch_replicator_scheduler:reschedule().
ok
  • set replicator max_jobs = 500 (default)
(node1@127.0.0.1)40> config:set("replicator","max_jobs","500", false).
ok
  • call couch_replicator_scheduler:reschedule() and wait for the job to start up again
(node1@127.0.0.1)41> couch_replicator_scheduler:reschedule().
ok
  • look at _active_tasks: docs written and docs read should have the previous values instead of 0
$ curl -u adm:pass localhost:15984/_active_tasks | jq '.[0].docs_read'
3
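
The remsh steps above could be bundled into one shell fun, roughly as sketched here. It uses only the calls shown in the transcript and assumes each returns ok as it did there; the name `BounceJobs` is made up for illustration, and whether a pause is needed between the two reschedule calls is not something the transcript settles. It only runs inside a CouchDB node's remsh.

```erlang
%% Paste into remsh on a dev node. BounceJobs is a hypothetical helper name.
BounceJobs = fun() ->
    %% Drop the job limit to 0 (same arguments as in the steps above)
    %% and reschedule, which should stop the running job.
    ok = config:set("replicator", "max_jobs", "0", false),
    ok = couch_replicator_scheduler:reschedule(),
    %% Restore the limit used in the steps above and reschedule again
    %% so the job can start back up.
    ok = config:set("replicator", "max_jobs", "500", false),
    ok = couch_replicator_scheduler:reschedule()
end.
```

After running `BounceJobs().`, wait for the job to start again and repeat the curl check against `_active_tasks` to compare the counts.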

👍

{checkpoint_interval, CheckpointInterval}
]),
] ++ rep_stats(State)),

@jaydoane

jaydoane Nov 9, 2018

Contributor

It seems like this refactor introduced a new warning:

db/src/couchdb/src/couch_replicator/src/couch_replicator_scheduler_job.erl:121: Warning: variable 'CommittedSeq' is unused

@nickva nickva force-pushed the fail-replication-on-doc-write branch from 8c93fed to cc8463c Nov 9, 2018

@nickva nickva merged commit 00b28c2 into master Nov 9, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@nickva nickva deleted the fail-replication-on-doc-write branch Nov 9, 2018
