There was a subtle bug when opening specific revisions in
fabric_doc_open_revs due to a race condition between updates being
applied across a cluster.
The underlying cause was revision stemming after a document had been
updated more than revs_limit times, combined with concurrent reads
from a node that had not yet applied the update. To illustrate, let's
consider a document A which has a revision history from `{N, RevN}` to
`{N+1000, RevN+1000}` (assuming revs_limit is the default 1000). From
the perspective of a single node, when an update comes in we add the
new revision and stem the oldest revision. The document's revisions on
that node would then be `{N+1, RevN+1}` to `{N+1001, RevN+1001}`.
The bug exists when we attempt to open revisions on a different node
that has yet to apply the new update. In this case
fabric_doc_open_revs could be called with `{N+1000, RevN+1000}`. This
results in a response from fabric_doc_open_revs that includes two
different `{ok, Doc}` results instead of the expected single instance.
The reason is that one document has revisions `{N+1, RevN+1}` to
`{N+1000, RevN+1000}` from the node that has applied the update, while
the node without the update responds with revisions `{N, RevN}` to
`{N+1000, RevN+1000}`.
To rephrase, a node that has applied an update can end up returning
a revision path that contains `revs_limit - 1` revisions, while a node
without the update returns all `revs_limit` revisions. This slight
difference in the path prevented the responses from being properly
combined into a single response.
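The mismatch can be reproduced with a small model. The following Python sketch is an illustration only, not the fabric code: the helper names (`stem`, `apply_update`, `open_rev`) are hypothetical, revision paths are modelled as plain lists of `(pos, rev)` tuples, and a tiny limit stands in for the default of 1000. Collapsing replies by leaf revision is shown purely to make the contrast visible; it is not necessarily how this change resolves the issue.

```python
REVS_LIMIT = 4  # small stand-in for the default revs_limit of 1000


def stem(path, limit=REVS_LIMIT):
    """Keep only the newest `limit` entries of a path (ordered oldest first)."""
    return path[-limit:]


def apply_update(path, limit=REVS_LIMIT):
    """Append the next revision, then stem the oldest one if over the limit."""
    pos, _ = path[-1]
    return stem(path + [(pos + 1, f"Rev{pos + 1}")], limit)


def open_rev(path, pos):
    """Return the stored path up to and including the revision at `pos`."""
    return [entry for entry in path if entry[0] <= pos]


# Both nodes start with a full history of REVS_LIMIT revisions.
stale_node = [(n, f"Rev{n}") for n in range(1, REVS_LIMIT + 1)]
# Only one node has applied the newest update so far.
updated_node = apply_update(stale_node)

# Open the revision that is a leaf on the stale node and known to both nodes.
requested = REVS_LIMIT
stale_reply = open_rev(stale_node, requested)
updated_reply = open_rev(updated_node, requested)

# Comparing whole paths treats the two replies as different documents,
# while comparing only the leaf revision recognises them as the same one.
distinct_by_path = {tuple(stale_reply), tuple(updated_reply)}
distinct_by_leaf = {stale_reply[-1], updated_reply[-1]}

print(len(stale_reply), len(updated_reply))
print(len(distinct_by_path), len(distinct_by_leaf))
```

In this model the stale node returns `REVS_LIMIT` entries and the updated node returns `REVS_LIMIT - 1`, so any merge step that compares complete paths sees two distinct results for what is really one revision.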
This bug has existed for many years. However, read repair effectively
prevents it from being a significant issue by immediately fixing the
revision history discrepancy. This was discovered due to the recent bug
in read repair during a mixed cluster upgrade to a release including
clustered purge. In this situation we end up crashing the design
document cache which then leads to all of the design document requests
being direct reads which can end up causing cluster nodes to OOM and
die. The conditions require a significant number of design document
edits coupled with already significant load to those modified design
documents. The most direct example observed was a cluster that had a
significant number of filtered replications in and out of the cluster.
Testing recommendations
make check
Related Issues or Pull Requests
This was discovered due to issues caused by #1860