Configurable delay before retrying on missing_doc error #934

Merged
merged 1 commit into from Oct 31, 2017

Contributor

nickva commented Oct 30, 2017

Implement a configurable delay before retrying a document fetch in the replicator.

missing_doc exceptions usually happen when a continuous replication is
set up and the source is updated. The change might appear in the changes feed,
but when a worker tries to fetch the document's revisions, it talks to a
node where internal replication hasn't caught up, and so it throws an exception.

Previously the delay was hard-coded at 0 (that is, retrying was immediate). The
replication would still make progress, but only after crashing, retrying, and
generating a lot of unnecessary log noise. Since updating a source while
continuous replication is running is a common scenario, it's worth optimizing
for it to avoid wasting resources and spamming logs.
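
To illustrate the idea, here is a minimal Python sketch of a delayed retry. It is not the replicator's actual Erlang code: the names `MissingDoc`, `fetch_with_retry`, and `retry_delay_sec` are hypothetical, and it assumes a single retry after the delay, with a second failure propagating to the caller (modelling the worker crash).

```python
import time


class MissingDoc(Exception):
    """Hypothetical stand-in for the replicator's missing_doc error."""


def fetch_with_retry(fetch, retry_delay_sec=2.0, sleep=time.sleep):
    """Call fetch(); on MissingDoc, wait retry_delay_sec and try once more.

    Assumption: only one retry is attempted. A second MissingDoc is not
    caught here, so it propagates to the caller.
    """
    try:
        return fetch()
    except MissingDoc:
        # Give internal replication on the lagging node time to catch up.
        # A delay of 0 reproduces the old "retry immediately" behaviour.
        if retry_delay_sec > 0:
            sleep(retry_delay_sec)
        return fetch()


def make_flaky_fetch(failures, result):
    """Build a fetch function that raises MissingDoc `failures` times first."""
    state = {"remaining": failures}

    def fetch():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise MissingDoc("revision not yet on this node")
        return result

    return fetch


if __name__ == "__main__":
    # One transient failure: the delayed retry succeeds without a crash.
    doc = fetch_with_retry(make_flaky_fetch(1, {"_id": "doc1"}), retry_delay_sec=0.1)
    print(doc)
```

The `sleep` parameter is injected only so the delay can be observed or skipped in tests; a real worker would read the delay from configuration rather than take it as an argument.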

@nickva nickva changed the title from Configurable delay before retrying on missing_doc error in replicator to Configurable delay before retrying on missing_doc error Oct 30, 2017

Member

eiri commented Oct 31, 2017

The default 2 sec delay seems a bit arbitrary. Would it be reasonable to set it to 0 to preserve the existing behaviour and use the new knob as the situation requires?

Contributor

nickva commented Oct 31, 2017

The intent is to improve the situation in the default case as well. Keeping it at 0 would mean users would see more missing_doc exceptions, crashes and retries.

Another aspect to this is that transient jobs (those posted to the _replicate endpoint) don't restart after a crash. So here it is safer to try to avoid crashing the job, since otherwise it will effectively be dropped from the scheduler.

@eiri

eiri approved these changes Oct 31, 2017

Fair point on default delay. lgtm

@nickva nickva merged commit 40b9f85 into apache:master Oct 31, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@nickva nickva deleted the cloudant:configurable-wait-on-missing-doc branch Oct 31, 2017
