
[stats] replicator scheduler crashed counter not incrementing #810

Closed
wohali opened this issue Sep 12, 2017 · 2 comments

wohali commented Sep 12, 2017

I set up a _replicator document with an incorrect URL in it, and let the replication scheduler attempt to access it a few times in a row.

Repeated fetches of /_node/couchdb@localhost/_stats showed that couch_replicator.jobs.crashed.value stayed at 0 and never incremented.

Steps to Reproduce (for bugs)

Create a _replicator document of this form:

{
    "_id": "my_rep",
    "source": "http://does.not.resolve.wohali.rules/foo",
    "target": "http://127.0.0.1:5984/bar",
    "create_target": true
}

and let it cycle through a few nxdomain errors (as evidenced by couch.log).
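
A minimal sketch of one way to watch the counter while the document cycles, assuming a single local node named couchdb@localhost and admin:password credentials (adjust both for your setup):

```python
# Poll couch_replicator.jobs.crashed from the per-node _stats endpoint.
# The node name and credentials below are assumptions for a local dev install.
import base64
import json
import time
import urllib.request

URL = "http://127.0.0.1:5984/_node/couchdb@localhost/_stats"
AUTH = "Basic " + base64.b64encode(b"admin:password").decode()

def crashed_count():
    req = urllib.request.Request(URL, headers={"Authorization": AUTH})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    return stats["couch_replicator"]["jobs"]["crashed"]["value"]

for _ in range(10):
    print("couch_replicator.jobs.crashed =", crashed_count())
    time.sleep(10)  # the counters are only refreshed periodically
```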

Context

Trying to put together an open source CouchDB monitoring solution, and failing to get useful stats out of the new scheduler.

Your Environment

  • Version used: 2.1.0
  • Browser Name and version: n/a
  • Operating System and version: Debian 8 (jessie)
nickva added a commit to cloudant/couchdb that referenced this issue Sep 27, 2017
Previously an individual failed request would be tried 10 times in a row with
an exponential backoff starting at 0.25 seconds. So the intervals in seconds
would be:

   `0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128`

For a total of about 250 seconds (or about 4 minutes). This made sense before
the scheduling replicator, because if a replication job had crashed in the
startup phase enough times it would not be retried anymore. With a scheduling
replicator it makes more sense to stop the whole task and let the scheduling
replicator retry later. `retries_per_request` then becomes something used
mainly for short intermittent network issues.

The new retry schedule is

   `0.25, 0.5, 1, 2, 4`

Or about 8 seconds.

An additional benefit is that when the job is stopped sooner, the user can find
out about the problem earlier from the _scheduler/docs and _scheduler/jobs status
endpoints and can rectify it. Otherwise a single request retrying for 4 minutes
would show up there as if the job were healthy and running.

Fixes apache#810
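
For illustration, a quick sketch (plain Python, not CouchDB code) of the two schedules described above:

```python
# Each retry sleeps twice as long as the previous one, starting at 0.25 s.
def retry_intervals(retries_per_request, base=0.25):
    return [base * 2 ** n for n in range(retries_per_request)]

old = retry_intervals(10)  # 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128
new = retry_intervals(5)   # 0.25, 0.5, 1, 2, 4

print(sum(old))  # 255.75 seconds, i.e. about 4 minutes
print(sum(new))  # 7.75 seconds, i.e. about 8 seconds
```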

nickva commented Sep 27, 2017

Took a look at this one.

Saw the same behavior. There could be 3 reasons for not noticing the crashed gauge being bumped.

  1. There is a fairly high retries_per_request default value of 10 used to retry failed requests. Requests are tried 10 times with exponentially increasing sleep amounts in between, starting at 0.25 seconds. That means there could be up to 4 minutes of retrying the same failed request before the job fails and the scheduling replicator reports a crashed status for it, so for quite a while (tens of minutes, or even hours) the status of the replication job might stay in the running state, since it is busy retrying that request. I made a PR to reduce the default number of tries to 5, so there would be up to 8 seconds' worth of retries instead. This makes more sense now that the scheduling replicator is used, as it can better handle reporting and backing off when errors occur.

  2. Replications uniformly pick one node in the cluster to run on, which doesn't have to be the node that processed the document update request. To detect the crash via stats, one would have to know which node's stats to check. I made this mistake myself, so I'm mentioning it here just in case. Perhaps there is a case in general for aggregating stats across all nodes; see the sketch after this list.

  3. Crashed status is only reported after the replication job has crashed and is waiting to run next (possibly being penalized if it crashed too many times in a row). However, as soon as it is given a chance to run again, it gets counted as running, and while in that state it won't bump the crashed gauge. This was done so that the total is always equal to running + pending + crashing. The effect is that the crashing count will periodically go down for a bit while the job is attempting to run, then get bumped back up when it fails. Before the PR above this could take quite a while, and even with the PR it might still take up to 15 seconds (8 seconds' worth of retries, plus stats updates happen with a delay of 5 seconds).
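
Regarding point 2, a hedged sketch of what aggregating the counter across the cluster could look like from the outside, using /_membership plus the per-node /_stats endpoint (the admin:password credentials are an assumption):

```python
# Sum couch_replicator.jobs.crashed over every cluster node, since the job may
# run on a different node than the one that handled the _replicator doc write.
import base64
import json
import urllib.request

BASE = "http://127.0.0.1:5984"
HEADERS = {"Authorization": "Basic " + base64.b64encode(b"admin:password").decode()}

def get_json(path):
    req = urllib.request.Request(BASE + path, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

nodes = get_json("/_membership")["cluster_nodes"]
total = sum(
    get_json("/_node/%s/_stats" % node)["couch_replicator"]["jobs"]["crashed"]["value"]
    for node in nodes
)
print("crashed jobs across the cluster:", total)
```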

nickva added a commit that referenced this issue Sep 27, 2017

wohali commented Sep 27, 2017

@nickva Thanks, this all makes sense. I'd like to not lose track of point 2 above. Could you file a new enhancement issue for this concept so we can add it to the backlog?

wohali pushed a commit that referenced this issue Oct 19, 2017
willholley pushed a commit to willholley/couchdb that referenced this issue May 22, 2018
janl pushed a commit that referenced this issue Jan 5, 2020
The new default value is 5 (it used to be 10), which makes more sense with the
new scheduling replicator behavior.

Issue #810
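
For anyone who wants the old behavior back, a minimal sketch of a local.ini override, assuming the setting lives in the [replicator] section:

```ini
; local.ini sketch; assumes the setting is read from the [replicator] section
[replicator]
retries_per_request = 10
```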
nickva added a commit to nickva/couchdb that referenced this issue Sep 7, 2022