[stats] replicator scheduler crashed counter not incrementing #810

Closed
wohali opened this Issue Sep 12, 2017 · 2 comments

Member

wohali commented Sep 12, 2017

I set up a _replicator document with an incorrect URL in it, and let the replication scheduler attempt to access it a few times in a row.

Repeated fetches of /_node/couchdb@localhost/_stats showed that couch_replicator.jobs.crashed.value stayed at 0 and never incremented.

Steps to Reproduce (for bugs)

Create a _replicator document of this form:

{
    "_id": "my_rep",
    "source": "http://does.not.resolve.wohali.rules/foo",
    "target": "http://127.0.0.1:5984/bar",
    "create_target": true
}

and let it cycle through a few nxdomain errors (as evidenced by couch.log).
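To make the stats check easy to repeat, here is a minimal sketch of pulling the counter out of a `/_node/couchdb@localhost/_stats` response; the sample payload and the `stat_value` helper are illustrative (only the keys relevant to this issue are shown), not captured from a real node.

```python
import json

# Illustrative fragment of a GET /_node/couchdb@localhost/_stats response.
sample_stats = json.loads("""
{
  "couch_replicator": {
    "jobs": {
      "crashed": {"value": 0, "type": "gauge"},
      "running": {"value": 1, "type": "gauge"},
      "pending": {"value": 0, "type": "gauge"}
    }
  }
}
""")

def stat_value(stats, dotted_path):
    """Walk a dotted metric path (e.g. 'couch_replicator.jobs.crashed')
    through the nested stats dict and return its numeric value."""
    node = stats
    for part in dotted_path.split("."):
        node = node[part]
    return node["value"]

print(stat_value(sample_stats, "couch_replicator.jobs.crashed"))  # 0
```

Polling this value after each nxdomain failure is how the bug shows up: it stays at 0.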

Context

Trying to put together an open source CouchDB monitoring solution, and failing to get useful stats out of the new scheduler.

Your Environment

  • Version used: 2.1.0
  • Browser Name and version: n/a
  • Operating System and version: Debian 8 (jessie)

nickva added a commit to cloudant/couchdb that referenced this issue Sep 27, 2017

Reduce replicator.retries_per_request value from 10 to 5
Previously an individual failed request would be tried 10 times in a row with
an exponential backoff starting at 0.25 seconds. So the intervals in seconds
would be:

   `0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128`

For a total of about 250 seconds (or about 4 minutes). This made sense before
the scheduling replicator because if a replication job had crashed in the
startup phase enough times it would not be retried anymore. With a scheduling
replicator, it makes more sense to stop the whole task, and let the scheduling
replicator retry later. `retries_per_request` then becomes something used
mainly for short intermittent network issues.

The new retry schedule is

   `0.25, 0.5, 1, 2, 4`

Or about 8 seconds.

An additional benefit is that when the job is stopped sooner, the user can find
out about the problem earlier from the _scheduler/docs and _scheduler/jobs
status endpoints and can rectify it. Otherwise, a single request retrying for
4 minutes would be indicated there as if the job were healthy and running.

Fixes #810
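The two retry schedules and their totals can be checked with a few lines (the helper name is mine, not a CouchDB function):

```python
def retry_intervals(retries, base=0.25):
    """Exponential backoff intervals in seconds, doubling from `base`."""
    return [base * 2 ** i for i in range(retries)]

old = retry_intervals(10)  # 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128
new = retry_intervals(5)   # 0.25, 0.5, 1, 2, 4

print(sum(old))  # 255.75 seconds, i.e. roughly 4 minutes
print(sum(new))  # 7.75 seconds, i.e. roughly 8 seconds
```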
Contributor

nickva commented Sep 27, 2017

Took a look at this one.

Saw the same behavior. There could be 3 reasons for not seeing the crashed gauge being bumped.

  1. There is a fairly high retries_per_request default value of 10 used to retry failed requests. Requests are tried 10 times with exponentially increasing sleep intervals in between, starting at 0.25 seconds. That means there could be up to 4 minutes of retrying the same failed request before the job fails and the scheduling replicator reports a crashed status for it. I made a PR to reduce the default number of tries to 5, so there would be up to 8 seconds' worth of retries instead. This makes more sense now that the scheduling replicator is used, as it can better handle reporting and backing off when errors occur. Without the change, the status of the replication job might stay in the running state for quite a while (tens of minutes, or hours) since it is wasting time retrying that request.

  2. Replications uniformly pick one node in the cluster to run on, which doesn't have to be the node that processed the document update request. To detect the crashed stats update, one would have to know which node to check for changes. I made this mistake myself, so mentioning it here just in case. Perhaps there is a case in general for aggregating stats across all nodes.

  3. Crashed status is only reported after the replication job has crashed and is waiting to run next (possibly being penalized if it crashed too many times in a row). However, as soon as it is given a chance to run again, it gets counted as running. While in that state it won't bump the crashed gauge. This was done so that the total is always equal to running + pending + crashed. The effect of this is that the crashed count will periodically go down for a bit while the job is attempting to run, then get bumped back up when it fails. Before the PR above this could take quite a while, and even with it, it might still take up to 15 seconds (8 seconds' worth of retries, plus stats updates happen with a delay of 5 seconds).
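The bookkeeping in point 3 can be sketched as a toy model; the class and method names below are mine for illustration, not CouchDB internals. The invariant is that the gauges always sum to the total, so a crashed job being given another chance temporarily drops the crashed count.

```python
class SchedulerStats:
    """Toy model of the scheduler's job-state gauges.
    Invariant: total == running + pending + crashed at all times."""

    def __init__(self, total):
        self.total = total
        self.running = 0
        self.crashed = 0
        self.pending = total

    def start(self):
        # A crashed (or pending) job given a chance to run is counted
        # as running, so the crashed gauge goes down while it retries.
        if self.crashed:
            self.crashed -= 1
        else:
            self.pending -= 1
        self.running += 1

    def crash(self):
        # The job finally fails; the crashed gauge is bumped back up.
        self.running -= 1
        self.crashed += 1

stats = SchedulerStats(total=1)
stats.start()  # job runs: crashed stays 0 for the whole retry window
stats.crash()  # job fails: crashed is bumped to 1
print(stats.crashed, stats.running + stats.pending + stats.crashed)
```

This is why a monitoring poll can catch the gauge at 0 even for a job that is repeatedly failing.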

nickva closed this in #843 Sep 27, 2017

nickva added a commit that referenced this issue Sep 27, 2017

Reduce replicator.retries_per_request value from 10 to 5
Member

wohali commented Sep 27, 2017

@nickva Thanks, this all makes sense. I'd like to not lose track of point 2 above. Could you file a new enhancement issue for this concept so we can add it to the backlog?

wohali added a commit that referenced this issue Oct 19, 2017

Reduce replicator.retries_per_request value from 10 to 5

willholley added a commit to willholley/couchdb that referenced this issue May 22, 2018

Reduce replicator.retries_per_request value from 10 to 5