Consistency problem even with majority and unanimous quorums #55

Closed
ElMicko opened this Issue Aug 9, 2011 · 7 comments


ElMicko commented Aug 9, 2011

We are running BigCouch in a 5-server pre-production cluster behind haproxy.
We have an issue (perhaps with our understanding) with consistency.

We are deliberately running a case that makes it as hard as possible for BigCouch to maintain version consistency across the cluster. We have a requirement for sequential integer identifiers so that we can integrate with specific legacy systems. (We're trying really hard to keep a SQL solution off the table.)

The test configuration has a database with n=5, q=128
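For reference, creating a database with those sharding parameters looks something like this (a minimal sketch in Python with the `requests` library; the host and database name are placeholders):

```python
import requests

# Create the test database with 5 replicas (n) of each of 128 shards (q).
# n and q are BigCouch database-creation parameters.
resp = requests.put("http://localhost:5984/testdb", params={"n": 5, "q": 128})
resp.raise_for_status()
```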

We are testing consistency with a single document containing an integer value.
We run multiple test threads that read the value, increment it, write it back, and repeat.
We expect contention, so when we get a version conflict on a write attempt, we simply read the new version and try again.

When we successfully make an update we log the thread that succeeded and the incremented value.
After the test period, typically 3 minutes, we analyse the successful writes, looking for duplicate writes and for missing numbers in the sequence.

We've run the test rig with r=3/w=3 and r=5/w=5 in an attempt to force the cluster to be sure about document versions before accepting the write (to act like a monolithic CouchDB instance).
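For clarity, here is a minimal sketch of the test loop (hypothetical; assumes Python with the `requests` library, and placeholder host, database, and document names):

```python
import time
import requests

# Hypothetical endpoint; the real cluster sits behind haproxy.
DOC_URL = "http://localhost:5984/testdb/counter"
QUORUM = {"r": 3, "w": 3}  # also tested with r=5/w=5

def increment_once(session):
    # Read the current value and revision (subject to the read quorum r).
    doc = session.get(DOC_URL, params=QUORUM).json()
    doc["value"] += 1
    # Write it back (subject to the write quorum w); CouchDB checks _rev.
    resp = session.put(DOC_URL, params=QUORUM, json=doc)
    if resp.status_code == 409:
        return None           # version conflict: re-read and retry
    resp.raise_for_status()
    return doc["value"]

def run_thread(thread_id, log, duration=180):
    # Each thread loops for the test period, logging successful updates;
    # the log is analysed afterwards for duplicates and gaps.
    session = requests.Session()
    deadline = time.time() + duration
    while time.time() < deadline:
        value = increment_once(session)
        if value is not None:
            log.append((thread_id, value))
```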

If we run the test with 1, 2 or 3 threads, the cluster behaves as you would expect; it looks just like a single Couch box (indeed, we have tested against a single CouchDB instance). We get no duplicate updates, and only some missing numbers at high thread counts/concurrency levels, which we can live with. Performance is down due to the node synchronisation, but it's good enough.

BUT: what we see is that when we run the thread count up above 4, and certainly reproducibly every time with a thread count of 10, we get duplicate updates. That is, two threads both update document version 100 to 101 (and the corresponding document integer value), both believing they made the update because BigCouch has not reported a version conflict. I.e. both client processes think they did the update from 100 to 101, so both have the SAME 'unique' value.

We also see missing numbers in the sequence. That is, no thread was able to successfully update version 102 to 103; all threads received version-conflict responses, yet looking at the document's version history, that version is there in the chain; it does exist. So our incrementing integer test skips a value.

Whilst we can live with skipping a few identifiers, we can't manage without unique identifiers.
We were under the impression that unanimous quorum settings would ensure that the behaviour of the cluster was analogous to a single CouchDB server/database. We also assumed that a majority quorum would produce the desired behaviour.

I really do apologise if this is not an issue at all and is simply my misunderstanding of the Dynamo quorum mechanism.
Is this expected behaviour?

Many thanks in advance

ElMicko commented Aug 9, 2011

I should have added.
We are running
{"couchdb":"Welcome","version":"1.0.2","bigcouch":"0.3.1"}

Member

kocolosk commented Aug 10, 2011

Hi, this is a subtle issue that deserves a thorough response. I'm working on one now but I want to run it by a few people before posting.

ElMicko commented Aug 10, 2011

Thanks very much.
I'd greatly appreciate any assistance.

Member

kocolosk commented Aug 10, 2011

Hi, let's treat the two cases separately, as they have rather different root causes and one of them is what I would consider to be a bug in BigCouch.

In the first case you have multiple clients reporting a successful update from 100 to 101. This is due to the fact that BigCouch occasionally responds with a 201 Created when at least 1 but fewer than W shard copies applied the requested update successfully. We are planning to change this behavior in the upcoming release of BigCouch so that the client receives a 202 Accepted when the update was written at least once but the quorum was not met.
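As an illustration, client code targeting the 0.4 behaviour might distinguish the two statuses like this (a hypothetical sketch in Python with `requests`; the helper name and quorum value are placeholders, not BigCouch API):

```python
import requests

# Hypothetical helper reflecting the BigCouch 0.4 status-code scheme
# described above.
def quorum_put(url, doc, w=3):
    resp = requests.put(url, params={"w": w}, json=doc)
    if resp.status_code == 201:
        return "committed"   # the write met the quorum of w shard copies
    if resp.status_code == 202:
        return "accepted"    # durable on at least one copy; quorum not met
    if resp.status_code == 409:
        return "conflict"    # lost the race: re-read and retry
    resp.raise_for_status()  # anything else is a genuine error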

In the second case you have all clients reporting a failure to meet the write quorum and yet the update appears in the version tree. This is by design: BigCouch does not implement the commit protocols to allow for a rollback of an update, so if an update is written to at least one shard copy it will eventually be applied to all shard copies. If your clients were applying unique updates you'd see branches (conflicts) in the edit tree representing the attempts by the different clients. As it stands all your clients are attempting the same update so no conflicts should be generated.

You were right to assume that a majority quorum would be sufficient to obtain the desired behavior of never having multiple clients receive a 201 Created response for the same update, and as of BigCouch 0.4 that should always be the case.

On the other hand, it's not correct to say that a unanimous quorum system will behave exactly like a single-server CouchDB. Consider the case of two clients racing to apply different updates to the same document. On a single CouchDB server one of them will win and one will receive a 409 Conflict; the losing update will not be present anywhere in the system. Now consider a three-node BigCouch cluster. Let's say the update from client A was applied on two servers and the update from client B was applied on the third server. Client A receives a 201 Created and client B receives a 202 Accepted (in BigCouch 0.4). Both updates are now stored in the system, and once the internal replication manager synchronizes the shards both updates will be stored on all three nodes.

In fact, BigCouch may even determine that the update from client B is the "winning" revision that will be shown to clients by default since, like CouchDB, it deterministically chooses a winning revision based on the number of edits in the history of each revision as well as their checksums. As a result it's important that your application knows how to find and resolve conflicting document versions if you think you might ever concurrently apply different updates to the same document.
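For example, an application can surface the losing revisions with the standard CouchDB `conflicts=true` query and resolve them explicitly (a sketch; the pick-a-winner policy here is application-specific, and the URL is a placeholder):

```python
import requests

def resolve_conflicts(doc_url):
    # Ask CouchDB/BigCouch to include any conflicting revisions.
    doc = requests.get(doc_url, params={"conflicts": "true"}).json()
    for losing_rev in doc.get("_conflicts", []):
        # Application-specific resolution: here we simply delete the
        # losing branches, keeping the default winner. A real resolver
        # might merge fields from the losing revisions instead.
        requests.delete(doc_url, params={"rev": losing_rev})
```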

Thanks for this clear and detailed bug report.

ElMicko commented Aug 10, 2011

Thanks so much for that response.

In summary, then:

1. As of BigCouch 0.4, multiple clients will never receive a '201 Created' response when racing to update the same document/version when using a majority quorum. One and only one client will receive that sought-after '201 Created' response. This solves our duplicate identifier generation issue: a guaranteed-unique 201 response gives us a clear winner to our race, and a clearly unique ID.

2. In the situation described in your response, if one client receives a '201 Created' on an update and the second racing client (in 1. above) receives a '202 Accepted', it is possible that the second (202) client's update will, after a conflict resolution process, become the 'winning' version. So even though the first client thinks it has made a successful update, and the second client did not get the '201 Created' response (it got a '202 Accepted'), the second client's data will be presented as the winning version after conflict resolution.

The process here is that after node synchronisation we end up with the two document versions (the 201-status version and the 202-status version) on all the nodes in the cluster, AND all nodes use the same deterministic algorithm to determine the winner, BUT that deterministic algorithm does NOT take into account the fact that one update received a 201 and the other a 202. So the external state of the system, as derived from examining the 201/202/409 HTTP status codes, can be inconsistent with the internal state of the system, which is resolved independently (but consistently) of those status codes. This assumes, of course (as I would have), that from an external point of view a 201 response is different from a 202 and would take precedence, when in fact 201 and 202 responses are treated as synonymous when resolving conflicts. I'm not sure how I would deal with that, but I don't think I need to at this point.

I think all this means we can safely move to the production pilot now, where we will have low levels of contention; provided we upgrade to BigCouch 0.4 before we scale up to higher, production-like levels of contention, all will be well.

That's a really good outcome.

We're really grateful that you (and anyone you bounced the response off before publishing) took the time to explain the issues!

Thank you.

Michael

Member

kocolosk commented Aug 11, 2011

Hi Michael, that summary is correct. Good luck with the pilot, and let us know if you bump into any other issues!

kocolosk closed this Aug 11, 2011

natevw commented Dec 9, 2013

Since the other documentation is not terribly detailed, I've been using this as my primary reference to get my head around Cloudant's Dynamo implementation. It seems the 202 Accepted fix as described above has been deployed to their multitenant clusters for a while now.

One point this discussion missed: according to their tech support, Cloudant/BigCouch does not enforce read quorums! That is, you can request r=2/r=3, but if fewer shards are actually available you may still simply get a normal 200 based on a single shard's state alone, with no majority or unanimity. So, /caveat relaxor/ — while you can now tell when a write failed to meet quorum by a 202 response, there is no way to tell whether a read met quorum.
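In other words, a read like the following can return 200 OK even when only a single shard copy answered (a hypothetical sketch; the URL is a placeholder, and per the support statement above the `r` parameter is best-effort):

```python
import requests

# A read that requests a unanimous quorum (r=3 on an n=3 database).
resp = requests.get("http://localhost:5984/testdb/counter", params={"r": 3})
# Per the support statement above, this can return 200 OK even if only a
# single shard copy answered; there is no 202-style signal for reads that
# failed to meet the requested quorum.
doc = resp.json()
```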
