
Fix fabric_open_doc_revs #1862

Merged
merged 1 commit into master from fix-open-doc-revs on Jan 15, 2019

Conversation

3 participants
@davisp (Member) commented Jan 15, 2019

There was a subtle bug when opening specific revisions in
fabric_doc_open_revs due to a race condition between updates being
applied across a cluster.

The underlying cause was stemming after a document had been updated
more than revs_limit times, combined with concurrent reads on a node
that had not yet applied the update. To illustrate, let's consider a
document A which has a revision history from `{N, RevN}` to
`{N+1000, RevN+1000}` (assuming revs_limit is the default 1000). From a
single node's perspective, when an update comes in we add the new
revision and stem the oldest revision. The revisions on the node would
then be `{N+1, RevN+1}` to `{N+1001, RevN+1001}`.
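As a rough illustration of that stemming step, here is a minimal sketch
(not the actual couch_key_tree implementation; the module and function
names are hypothetical). A revision path is modeled as a list of
`{Pos, RevId}` tuples ordered newest-first; an update prepends the new
revision and truncates the path to revs_limit entries:

```erlang
%% Hypothetical sketch of revision stemming, not CouchDB's real code.
-module(stem_sketch).
-export([update/3]).

%% Path is ordered newest-first, e.g. [{1000, rev1000}, ..., {1, rev1}].
%% Prepend the new revision, then keep only the newest RevsLimit entries.
update(NewRev, Path, RevsLimit) ->
    lists:sublist([NewRev | Path], RevsLimit).
```

With `RevsLimit = 1000`, prepending `{N+1001, RevN+1001}` to a full path
drops `{N, RevN}` off the tail, which is exactly the divergence between
nodes described below.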

The bug manifests when we attempt to open revisions on a different node
that has yet to apply the new update. In this case
fabric_doc_open_revs could be called with `{N+1000, RevN+1000}`. This
results in a response from fabric_doc_open_revs that includes two
different `{ok, Doc}` results instead of the expected single instance.
The reason is that the node that has applied the update returns a
document with revisions `{N+1, RevN+1}` to `{N+1000, RevN+1000}`, while
the node without the update responds with revisions `{N, RevN}` to
`{N+1000, RevN+1000}`.

To rephrase: a node that has applied an update can end up returning a
revision path that contains `revs_limit - 1` revisions, while a node
without the update returns all `revs_limit` revisions. This slight
difference in the paths prevented the responses from being properly
combined into a single response.
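A minimal sketch of that mismatch (hypothetical data, using a small
history of four entries for readability): both nodes answer the same
requested leaf revision, but naive equality on the full paths treats
the responses as distinct documents:

```erlang
%% Hypothetical illustration of the divergent worker responses; this
%% is not the actual fabric worker protocol.
-module(revs_race_sketch).
-export([demo/0]).

demo() ->
    %% Node A has applied the update and stemmed, so the path it
    %% returns for the requested revision {4, r4} is one entry shorter.
    PathA = [{4, r4}, {3, r3}, {2, r2}],
    %% Node B has not applied the update and still remembers {1, r1}.
    PathB = [{4, r4}, {3, r3}, {2, r2}, {1, r1}],
    %% Comparing full paths treats these as two distinct results...
    false = (PathA =:= PathB),
    %% ...even though both describe the same requested leaf revision.
    true = (hd(PathA) =:= hd(PathB)),
    ok.
```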

This bug has existed for many years. However, read repair effectively
prevents it from being a significant issue by immediately fixing the
revision history discrepancy. It was discovered due to the recent bug
in read repair during a mixed cluster upgrade to a release including
clustered purge. In this situation we end up crashing the design
document cache, which then leads to all of the design document requests
being direct reads, which can end up causing cluster nodes to OOM and
die. The conditions require a significant number of design document
edits coupled with already significant load on those modified design
documents. The most direct example observed was a cluster that had a
significant number of filtered replications in and out of the cluster.

Testing recommendations

`make check`

Related Issues or Pull Requests

This was discovered due to issues caused by #1860

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;
@nickva approved these changes Jan 15, 2019


@davisp davisp force-pushed the fix-open-doc-revs branch from 0ecc373 to 4252c3d Jan 15, 2019

@eiri approved these changes Jan 15, 2019

@davisp davisp merged commit 5269d79 into master Jan 15, 2019

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed

@davisp davisp deleted the fix-open-doc-revs branch Jan 15, 2019

@wohali referenced this pull request on Feb 7, 2019: 2.3.1 Release Proposal #1908 (merged)
