Avoid unconditional retries in replicator's http client #1177

Merged: nickva merged 1 commit into master from fix-noproc-in-replicator-put on Feb 22, 2018

nickva commented Feb 21, 2018

In some cases the higher-level code in `couch_replicator_api_wrap` needs to
handle retries explicitly and cannot cope with retries happening in the
lower-level HTTP client. In such cases it sets `retries = 0`.

For example:

https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L271-L275

The HTTP client should then avoid unconditional retries and instead consult
the `retries` value. If `retries = 0`, it should not retry and should instead
bubble the exception up to the caller.
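
As a rough illustration of the behavior described above, here is a minimal sketch of a retry helper that consults a `retries` counter before retrying. The module name, the trimmed-down `#httpdb{}` record, the two-argument `maybe_retry/2`, and the thrown terms are all assumptions made for this example; the actual `couch_replicator_httpc` code has a different signature and more state.

```erlang
-module(retry_sketch).
-export([maybe_retry/2, attempt/2]).

%% Hypothetical stand-in for the replicator's #httpdb{} record; only the
%% field relevant to this sketch is modelled.
-record(httpdb, {retries = 10 :: non_neg_integer()}).

%% No retries left: surface the error to the caller instead of retrying.
maybe_retry(Error, #httpdb{retries = 0}) ->
    throw({error, Error});
%% Retries remain: consume one and signal that the request should be retried.
maybe_retry(_Error, #httpdb{retries = N} = HttpDb) when N > 0 ->
    throw({retry, HttpDb#httpdb{retries = N - 1}}).

%% Helper for experimenting: returns whatever maybe_retry/2 throws.
attempt(Error, Retries) ->
    try
        maybe_retry(Error, #httpdb{retries = Retries})
    catch
        throw:Thrown -> Thrown
    end.
```

In this sketch, a caller that sets `retries = 0` always sees the error itself, so it can run its own retry logic, while other callers get a `retry` signal with the counter decremented.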

This bug was discovered when attachments were replicated to a target cluster
whose resources were constrained. Attachment `PUT` requests are made from the
context of an `open_revs` `GET` request. When a `PUT` request timed out, the
HTTP client retried it. However, because the retry didn't bubble up to the
`open_revs` code, the second `PUT` request would die with a `noproc` error,
since the old parser had exited by then. See issue #745 for more.
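
For readers unfamiliar with the `noproc` failure mode, the following is a small sketch of how calling a process that has already exited produces it. The short-lived process here is only a stand-in for the parser, and `gen_server:call/2` is used as a convenient way to trigger the exit; the description above does not say which call actually failed in the replicator.

```erlang
-module(noproc_sketch).
-export([demo/0]).

demo() ->
    %% Stand-in for the old parser process: it exits immediately.
    {Pid, Ref} = spawn_monitor(fun() -> ok end),
    %% Wait until the process is confirmed dead.
    receive
        {'DOWN', Ref, process, Pid, _Reason} -> ok
    end,
    %% Calling a dead process makes the caller exit with a noproc reason.
    try
        gen_server:call(Pid, get_data)
    catch
        exit:{noproc, _CallInfo} -> noproc
    end.
```

A retry issued from the lower layer is in the same position as the `gen_server:call/2` above: it talks to a process that belonged to the first attempt and is no longer there.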

Testing recommendations

See the comments on issue #745 for how to set up testing. The code was tested locally in a Vagrant VM running Debian 8 and Erlang 17.5. Hardware resources were 1 CPU throttled to about 30%, with disk throughput throttled to about 10Mb/s. `stress` was running in the background as `stress --timeout 900m --cpu 1 --io 4`.

davisp approved these changes Feb 21, 2018

LGTM. Am not 100% concerned about removing the clause vs not. Whatever you think is best there.

  % ibrowse worker terminated because remote peer closed the socket
  % -> not an error
- throw({retry, HttpDb, Params});
+ maybe_retry({error, req_timedout}, Worker, HttpDb, Params);

davisp Feb 21, 2018

Shouldn't we just remove this clause entirely and let it fall through to the catch-all at the bottom?

nickva Feb 22, 2018

Oh totally, let's remove it. Good call.
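
The suggestion in this thread can be sketched as follows: when an explicit clause does nothing but hand the error to the retry helper, a catch-all clause that does the same makes it redundant. The function names, the integer retry counter, and the return tuples are hypothetical and chosen only to keep the example self-contained; the catch-all in the real client may do more than this.

```erlang
-module(catchall_sketch).
-export([with_clause/2, without_clause/2]).

%% Hypothetical retry policy: retry while attempts remain, else give up.
maybe_retry(Error, 0) -> {give_up, Error};
maybe_retry(_Error, Retries) when Retries > 0 -> {retry, Retries - 1}.

%% Explicit timeout clause that only delegates to maybe_retry/2.
with_clause(Response, Retries) ->
    case Response of
        {ok, Body} -> {done, Body};
        {error, req_timedout} -> maybe_retry({error, req_timedout}, Retries);
        Error -> maybe_retry(Error, Retries)
    end.

%% Same function with the redundant clause removed; the timeout error now
%% matches the catch-all and is handled identically.
without_clause(Response, Retries) ->
    case Response of
        {ok, Body} -> {done, Body};
        Error -> maybe_retry(Error, Retries)
    end.
```

Both functions return the same result for every input, which is why dropping the dedicated clause is safe under these assumptions.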

nickva force-pushed the fix-noproc-in-replicator-put branch from 399c07b to 16699e5 Feb 22, 2018

nickva merged commit 6d959a7 into master Feb 22, 2018

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

nickva deleted the fix-noproc-in-replicator-put branch Feb 22, 2018
