[stats] replicator scheduler crashed counter not incrementing #810

Closed
wohali opened this Issue Sep 12, 2017 · 2 comments

Member

wohali commented Sep 12, 2017

I set up a _replicator document with an incorrect URL in it, and let the replication scheduler attempt to access it a few times in a row.

Repeated fetches of /_node/couchdb@localhost/_stats showed that couch_replicator.jobs.crashed.value stayed at 0 and never incremented.

Steps to Reproduce (for bugs)

Create a _replicator document of this form:

{
    "_id": "my_rep",
    "source": "http://does.not.resolve.wohali.rules/foo",
    "target": "http://127.0.0.1:5984/bar",
    "create_target": true
}

and let it cycle through a few nxdomain errors (as evidenced by couch.log).
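To make the stats check easy to repeat, here is a minimal sketch of pulling the counter out of a `/_node/couchdb@localhost/_stats` response; the sample payload and the `stat_value` helper are illustrative (only the keys relevant to this issue are shown), not captured from a real node.

```python
import json

# Illustrative fragment of a GET /_node/couchdb@localhost/_stats response.
sample_stats = json.loads("""
{
  "couch_replicator": {
    "jobs": {
      "crashed": {"value": 0, "type": "gauge"},
      "running": {"value": 1, "type": "gauge"},
      "pending": {"value": 0, "type": "gauge"}
    }
  }
}
""")

def stat_value(stats, dotted_path):
    """Walk a dotted metric path (e.g. 'couch_replicator.jobs.crashed')
    through the nested stats dict and return its numeric value."""
    node = stats
    for part in dotted_path.split("."):
        node = node[part]
    return node["value"]

print(stat_value(sample_stats, "couch_replicator.jobs.crashed"))  # 0
```

Polling this value after each nxdomain failure is how the bug shows up: it stays at 0.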

Context

Trying to put together an open source CouchDB monitoring solution, and failing to get useful stats out of the new scheduler.

Your Environment

  • Version used: 2.1.0
  • Browser Name and version: n/a
  • Operating System and version: Debian 8 (jessie)

nickva added a commit to cloudant/couchdb that referenced this issue Sep 27, 2017

Reduce replicator.retries_per_request value from 10 to 5
Previously an individual failed request would be tried 10 times in a row with
an exponential backoff starting at 0.25 seconds. So the intervals in seconds
would be:

   `0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128`

For a total of about 250 seconds (or about 4 minutes). This made sense before
the scheduling replicator because if a replication job had crashed in the
startup phase enough times it would not be retried anymore. With a scheduling
replicator, it makes more sense to stop the whole task, and let the scheduling
replicator retry later. `retries_per_request` then becomes something used
mainly for short intermittent network issues.

The new retry schedule is

   `0.25, 0.5, 1, 2, 4`

Or about 8 seconds.

An additional benefit is that when the job is stopped sooner, the user can find
out about the problem earlier from the _scheduler/docs and _scheduler/jobs
status endpoints and can rectify it. Otherwise, a single request retrying for
4 minutes would be indicated there as if the job were healthy and running.

Fixes #810
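The two retry schedules and their totals can be checked with a few lines (the helper name is mine, not a CouchDB function):

```python
def retry_intervals(retries, base=0.25):
    """Exponential backoff intervals in seconds, doubling from `base`."""
    return [base * 2 ** i for i in range(retries)]

old = retry_intervals(10)  # 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128
new = retry_intervals(5)   # 0.25, 0.5, 1, 2, 4

print(sum(old))  # 255.75 seconds, i.e. roughly 4 minutes
print(sum(new))  # 7.75 seconds, i.e. roughly 8 seconds
```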
Contributor

nickva commented Sep 27, 2017

Took a look at this one.

Saw the same behavior. There could be 3 reasons for not seeing the crashed gauge being bumped.

  1. There is a fairly high retries_per_request default value of 10 used to retry failed requests. Requests are tried 10 times with exponentially increasing sleep intervals in between, starting at 0.25 seconds. That means there could be up to 4 minutes of retrying the same failed request before the job fails and the scheduling replicator reports a crashed status for it. I made a PR to reduce the default number of tries to 5, so there would be up to 8 seconds' worth of retries instead. This makes more sense now that the scheduling replicator is used, as it can better handle reporting and backing off when errors occur. Without the change, the status of the replication job might stay in the running state for quite a while (tens of minutes, or hours) since it is wasting time retrying that request.

  2. Replications uniformly pick one node in the cluster to run on, which doesn't have to be the node that processed the document update request. To detect the crashed stats update, one would have to know which node to check for changes. I made this mistake myself, so mentioning it here just in case. Perhaps there is a case in general for aggregating stats across all nodes.

  3. Crashed status is only reported after the replication job has crashed and is waiting to run next (possibly being penalized if it crashed too many times in a row). However, as soon as it is given a chance to run again, it gets counted as running. While in that state it won't bump the crashed gauge. This was done so that the total is always equal to running + pending + crashed. The effect of this is that the crashed count will periodically go down for a bit while the job is attempting to run, then get bumped back up when it fails. Before the PR above this could take quite a while, and even with it, it might still take up to 15 seconds (8 seconds' worth of retries, plus stats updates happen with a delay of 5 seconds).
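The bookkeeping in point 3 can be sketched as a toy model; the class and method names below are mine for illustration, not CouchDB internals. The invariant is that the gauges always sum to the total, so a crashed job being given another chance temporarily drops the crashed count.

```python
class SchedulerStats:
    """Toy model of the scheduler's job-state gauges.
    Invariant: total == running + pending + crashed at all times."""

    def __init__(self, total):
        self.total = total
        self.running = 0
        self.crashed = 0
        self.pending = total

    def start(self):
        # A crashed (or pending) job given a chance to run is counted
        # as running, so the crashed gauge goes down while it retries.
        if self.crashed:
            self.crashed -= 1
        else:
            self.pending -= 1
        self.running += 1

    def crash(self):
        # The job finally fails; the crashed gauge is bumped back up.
        self.running -= 1
        self.crashed += 1

stats = SchedulerStats(total=1)
stats.start()  # job runs: crashed stays 0 for the whole retry window
stats.crash()  # job fails: crashed is bumped to 1
print(stats.crashed, stats.running + stats.pending + stats.crashed)
```

This is why a monitoring poll can catch the gauge at 0 even for a job that is repeatedly failing.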

nickva closed this in #843 Sep 27, 2017

nickva added a commit that referenced this issue Sep 27, 2017

Reduce replicator.retries_per_request value from 10 to 5
Member

wohali commented Sep 27, 2017

@nickva Thanks, this all makes sense. I'd like to not lose track of point 2 above. Could you file a new enhancement issue for this concept so we can add it to the backlog?

wohali added a commit that referenced this issue Oct 19, 2017

Reduce replicator.retries_per_request value from 10 to 5

willholley added a commit to willholley/couchdb that referenced this issue May 22, 2018

Reduce replicator.retries_per_request value from 10 to 5