Consistency problem even with majority and unanimous quorums #55
We are running BigCouch in a 5 server pre-production cluster behind haproxy.
We are deliberately running a case that makes it as hard as possible for BigCouch to maintain version consistency across the cluster. We have a requirement for sequential integer identifiers so that we can integrate with specific legacy systems. (We're trying very hard to keep a SQL solution off the table.)
The test configuration has a database with n=5, q=128.
We are testing consistency with a single document containing an integer value.
When we successfully make an update we log the thread that succeeded and the incremented value.
We've run the test rig with r=3/w=3 and r=5/w=5 in an attempt to force the cluster to be sure about document versions before accepting the write (i.e. to act like a monolithic CouchDB instance).
If we run the test with 1, 2 or 3 threads the cluster behaves as you would expect; it looks just like a single CouchDB box (indeed we have tested against a single CouchDB instance). We get no duplicate updates, though some numbers go missing at high thread counts/concurrency levels, and we can live with that. Performance is down due to the node synchronisation, but it's good enough.
BUT: what we see is that when we run the thread count up above 4 (and it's reproducible every time with a thread count of 10), we get duplicate updates. That is, two threads both update document version 100 to 101 (and the corresponding document integer value), both believing they made the update because BigCouch has not reported a version conflict. I.e. both client processes think they did the update from 100 to 101, so both have the SAME 'unique' value.
We also see missing numbers in the sequence. That is, no thread was able to successfully update version 102 to 103; all threads received version conflict responses. Yet looking at the document's version history, that version does exist in the revision chain. So our incrementing integer test skips a value.
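For context, the per-thread logic of our test rig is essentially a compare-and-swap loop on the document's _rev. The sketch below uses a hypothetical in-memory `FakeCouch` stand-in for the BigCouch HTTP API (the names `FakeCouch`, `increment_once` and `Conflict` are ours, purely for illustration); against a real server the `get`/`put` calls would be HTTP GET/PUT requests:

```python
class Conflict(Exception):
    """Stand-in for an HTTP 409 Conflict response."""
    pass

class FakeCouch:
    """In-memory stand-in for a single CouchDB/BigCouch document endpoint."""
    def __init__(self):
        self.rev = 0              # monotonically increasing revision counter
        self.doc = {"value": 0}   # the counter document

    def get(self):
        return self.rev, dict(self.doc)

    def put(self, rev, doc):
        # Accept the write only if the caller saw the latest revision,
        # mirroring CouchDB's MVCC check on _rev.
        if rev != self.rev:
            raise Conflict()
        self.rev += 1
        self.doc = dict(doc)
        return self.rev

def increment_once(db):
    """Retry until one increment succeeds; return the value we wrote."""
    while True:
        rev, doc = db.get()
        doc["value"] += 1
        try:
            db.put(rev, doc)
            return doc["value"]
        except Conflict:
            continue  # lost the race: re-read and retry

db = FakeCouch()
written = [increment_once(db) for _ in range(10)]
print(written)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Against this single-copy stand-in the sequence is always gapless and duplicate-free; the issue above is that the same loop against the clustered database produced duplicates.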
Whilst we can live with skipping a few identifiers, we can't manage without unique identifiers.
I really do apologise if this is not an issue at all and is simply my misunderstanding of the Dynamo quorum mechanism.
Many thanks in advance.
Hi, let's treat the two cases separately, as they have rather different root causes and one of them is what I would consider to be a bug in BigCouch.
In the first case you have multiple clients reporting a successful update from 100 to 101. This happens because BigCouch occasionally responds with a 201 Created when at least one but fewer than W shard copies applied the requested update successfully. We are planning to change this behavior in the upcoming release of BigCouch so that the client receives a 202 Accepted when the update was written at least once but the quorum was not met.
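Under the 0.4 behavior described here, a client can distinguish the three write outcomes by status code alone. A minimal sketch (the helper name `classify_write` is ours; the status-code meanings are as described in this thread):

```python
def classify_write(status):
    """Interpret a BigCouch 0.4+ document-write HTTP status code.

    201: the write met the requested quorum W.
    202: the write was durably stored on at least one shard copy,
         but fewer than W copies acknowledged it.
    409: the write was rejected as a revision conflict.
    """
    if status == 201:
        return "quorum-met"
    if status == 202:
        return "stored-below-quorum"
    if status == 409:
        return "conflict"
    raise ValueError("unexpected status: %d" % status)

print(classify_write(202))  # → stored-below-quorum
```

Note that a 202 is not a failure: the update is durable on at least one copy and will propagate, so a client that retries after a 202 risks applying its change twice.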
In the second case you have all clients reporting a failure to meet the write quorum and yet the update appears in the version tree. This is by design: BigCouch does not implement the commit protocols needed to roll back an update, so if an update is written to at least one shard copy it will eventually be applied to all shard copies. If your clients were applying unique updates you'd see branches (conflicts) in the edit tree representing the attempts by the different clients. As it stands, all your clients are attempting the same update, so no conflicts should be generated.
You were right to assume that a majority quorum would be sufficient to obtain the desired behavior of never having multiple clients receive a 201 Created response for the same update, and as of BigCouch 0.4 that should always be the case.
On the other hand, it's not correct to say that a unanimous quorum system will behave exactly like a single server CouchDB. Consider the case of two clients racing to apply different updates to the same document. In a single CouchDB one of them will win and one will receive a 409 Conflict. The losing update will not be present anywhere in the system. Now consider a three node BigCouch cluster. Let's say the update from client A was applied on two servers and the update from client B was applied on the third server. Client A receives a 201 Created and Client B receives a 202 Accepted (in BigCouch 0.4). Both updates are now stored in the system, and once the internal replication manager synchronizes the shards both updates will be stored on all three nodes. In fact, BigCouch may even determine that the update from Client B is the "winning" revision that will be shown to clients by default since, like CouchDB, it deterministically chooses a winning revision based on the number of edits in the history of each revision as well as their checksums. As a result it's important that your application knows how to find and resolve conflicting document versions if you think you might ever concurrently apply different updates to the same document.
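The actual winner-picking lives inside the server, but the rule described above (the revision with the longer edit history wins, with ties broken deterministically so every node agrees) can be sketched roughly as follows. This is an illustrative simplification, not CouchDB's exact implementation; a CouchDB `_rev` has the shape `<generation>-<hash>`:

```python
def pick_winner(revs):
    """Deterministically pick a winning revision from conflicting leaves.

    Rule sketched here: more edits in the history (higher generation
    number) wins; ties are broken by comparing the hash part, so every
    node independently reaches the same answer. Note the HTTP status the
    original writer received (201 vs 202) plays no part in the choice.
    """
    def key(rev):
        gen, _, digest = rev.partition("-")
        return (int(gen), digest)
    return max(revs, key=key)

# Two clients raced and produced different generation-3 revisions:
print(pick_winner(["3-aaa", "3-bbb"]))  # → 3-bbb
```

In practice an application would fetch the losing branches with `GET /db/doc?conflicts=true` and then merge or delete them as appropriate.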
Thanks for this clear and detailed bug report.
Thanks so much for that response.
In summary, then:
2. In the situation described in your response: if one client receives a '201 Created' on an update and the second racing client (in 1. above) receives a '202 Accepted', it is possible that the second (202) client's update will, after conflict resolution, become the 'winning' version. So even though the first client thinks it made a successful update, and the second client got only a '202 Accepted' rather than a '201 Created', the second client's data will be represented as the winning version after conflict resolution.
The process here is that after node synchronisation we end up with both document versions (the 201-status version and the 202-status version) on all the nodes in the cluster, AND all nodes use the same deterministic algorithm to determine the winner, BUT that algorithm does NOT take into account the fact that one update received a 201 and one received a 202. So the external state of the system, as derived from examining the 201/202/409 HTTP status codes, can be inconsistent with the internal state, which is resolved independently of (but consistently across) those status codes. This assumes, of course (as I would have), that from an external point of view a 201 response is different from a 202 and would take precedence, when in fact 201 and 202 responses are treated as synonymous when resolving conflicts. I'm not sure how I would deal with that, but I don't think I need to at this point.
I think all this means that we can safely move to a production pilot now, where we will have low levels of contention, and provided we upgrade to BigCouch 0.4 before we scale up to higher (production-like) levels of contention, all will be well.
That's a really good outcome.
We're really grateful that you (and anyone you bounced the response off before publishing) took the time to explain the issues!
Since the other documentation is not terribly detailed, I've been using this as my primary reference to get my head around Cloudant's Dynamo implementation. It seems the 202 Accepted fix as described above has been deployed to their multitenant clusters for a while now.
One point this discussion missed: according to their tech support, Cloudant/BigCouch does not enforce read quorums! That is, you can request r=2 or r=3, but if fewer shards are actually available you may still get a plain 200 based on a single shard's state alone, with no majority or unanimity behind it. So, /caveat relaxor/: while you can now tell that a write failed to meet quorum by a 202 response, there's no way to tell whether a read met quorum.