Consistent reads are not consistent #741
Comments
@aphyr Thanks for the testing. I think consistency != linearizability. I totally understand the problem you are describing and am aware of how to fix it, but it is a tradeoff we made a long time ago. Actually, it is not only leader changes that can cause the problem; two concurrent write/read operations from different clients can also lead to a stale read in etcd.
@aphyr For the leader-shifting issue, please tune your cluster's timeouts. In your case, you need to loosen the leader election timeout to the second level.
I think you can actually claim linearizability so long as all operations are
In this history, Client 2 sees [A B], but Client 3 sees [B A]. Sequential consistency is violated.

You can recover sequential consistency by forcing all consistent reads to include an index, and only satisfying a read when the node sees a committed index greater than or equal to the requested index, but this does have the drawback that clients are required to implement a local index FSM in order to provide a sequentially consistent API (there is a rough sketch of this at the end of this comment). You'll have to communicate that constraint clearly to client authors and other etcd API users.

There's a larger question of whether users really consider sequential consistency to be "consistent". Users may not understand that a "consistent read" may actually provide stale data. If you choose this model, I suggest that in the documentation, everywhere the term "consistent read" is used, you explain exactly what behaviors are allowable. I'll bet you dollars to donuts that you've got a significant (if not majority) fraction of the user base assuming that consistent reads are linearizable. I did, and I think about these things carefully, haha. :)

For example, http://the.randomengineer.com/2014/01/27/using-etcd-as-a-clusterized-immediately-consistent-key-value-storage/ claims etcd is "immediately consistent", and could be forgiven for thinking so, as etcd's read consistency documentation says:
... which strongly implies that
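To make the "local index FSM" requirement concrete, here is a rough Go sketch of what a sequentially consistent client wrapper might look like. The `KV` interface, `Resp` type, and retry policy are hypothetical stand-ins, not etcd's actual client API; the idea is simply that the client remembers the highest etcd index it has observed and refuses to accept a read from a node that has not yet caught up to it.

```go
package sketch

import (
	"errors"
	"time"
)

// Resp is a hypothetical response carrying the etcd index at which the value
// was observed (real etcd responses expose something similar).
type Resp struct {
	Value string
	Index uint64
}

// KV is a hypothetical minimal client interface standing in for any etcd client.
type KV interface {
	Get(key string) (Resp, error)
	Put(key, value string) (Resp, error)
}

// SeqClient wraps a KV and tracks the highest index this client has observed,
// refusing to travel backwards in time on reads.
type SeqClient struct {
	kv       KV
	lastSeen uint64 // highest etcd index observed by this client
}

// Get retries until the responding node has caught up to everything this
// client has already observed, then advances the local high-water mark.
func (c *SeqClient) Get(key string) (string, error) {
	for attempt := 0; attempt < 10; attempt++ {
		r, err := c.kv.Get(key)
		if err != nil {
			return "", err
		}
		if r.Index >= c.lastSeen {
			c.lastSeen = r.Index
			return r.Value, nil
		}
		time.Sleep(50 * time.Millisecond) // node is behind; wait and retry
	}
	return "", errors.New("node never caught up to previously observed index")
}

// Put advances the high-water mark so later reads cannot observe older state.
func (c *SeqClient) Put(key, value string) error {
	r, err := c.kv.Put(key, value)
	if err != nil {
		return err
	}
	if r.Index > c.lastSeen {
		c.lastSeen = r.Index
	}
	return nil
}
```

Note that this only restores ordering from a single client's point of view; clients that communicate out of band would still need to exchange indices to agree on a common order.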
This may actually be OK; as long as the operations are concurrent, returning either the old or new value satisfies linearizability. :)
This is true, but if clients are supposed to reorder operations in etcd index order, you've got a.) some serious documentation work to do, and b.) a lot of misbehaving clients. ;-)
This is how Consul got their system to pass Jepsen too, but adjusting the timeouts won't make etcd correct--it just makes the problem harder to demonstrate. Timeouts can help optimize performance, but without a realtime environment, they cannot ensure correctness. Anyway, none of this is to say that etcd is wrong or bad; there are, as you've observed, good reasons to offer "mostly consistent" reads. You just gotta document those decisions. One easy way to do that would be to change the parameter for this read behavior to
Sequential consistency means replicas execute the commands in the same order. We do not execute reads, so there is no guarantee. In etcd, we return the result of a
I think this violates linearizability. If you have a clear use case for us to execute reads in the state machine, we definitely would like to add it. :)
I am not saying I want to cheat. This explains:
@aphyr We would like to change the doc as you suggest. It makes things clearer. I do not want to mislead people either.
Aha! I think we may be operating under different concepts of sequential consistency. May I suggest this INRIA paper on causal, sequential, and linearizable consistency? In particular,
... which is violated by the histories we've been talking about.
Ah! I understand now. Yes, in production I'd definitely adjust the timeouts, though that comes with consequences too: higher windows of unavailability for consistent operations during cluster transition. If etcd were to offer consistent reads, you could actually get away with having lower timeouts without compromising safety. Might be a worthwhile tradeoff for users.
I dunno if you have to execute the read in the state machine or not (this is getting into some raft internals territory I'm unfamiliar with), but I think the obvious case for providing linearizability for
@aphyr Thanks for reporting all these. Seems like we are on the same page now. So:
Jepsen is awesome! Thanks for all your effort.
Sounds like a plan! And thanks again for all your hard work on etcd; it's been a real pleasure working with the system and community alike. :)
To this naive reader, this issue appears to be a correctness bug, and thus a clear showstopper for any use of etcd. Why has a correctness issue been left to languish since April?
@glycerine Can you elaborate?
Again, I'm naively summarizing the above, which seems to provide the detail: a read may actually provide stale data. This is never acceptable.
I didn't mean to close this.
To clarify, I think there are good reasons for a read to provide stale data sometimes; you just shouldn't call it "consistent" or say it returns the most recent value. I think Consul chose a good balance of behaviors here: http://www.consul.io/docs/internals/consensus.html
I can deliver incorrect and inconsistent results extremely quickly, and without running etcd or any complicated consensus system. If I'm bothering with a distributed consensus system in the first place, it is because I want consistent results. Always. It should be impossible to do anything other than choice [2] below, quoting from the Consul docs.
Rhetorically: Why would I want to run a system that lets me down at the exact point at which I need its guarantees the most? "We are correct, except during a partial system failure" is no correctness at all.
@glycerine Thank you for your feedback, and do know that the CoreOS team is taking this matter very seriously. But as with all things, there is a process: the way to address this problem "right now" is by updating our docs and engaging in conversation with the community. The next step for the community, which includes CoreOS, is to understand the tradeoffs we've made in the current design, work together to reexamine those decisions, and strongly consider changing them for future releases of etcd. There is a lot of work going on in collaboration with the community to improve the quality and stability of etcd. We are currently in the late stages of research and design, which is being heavily influenced by feedback from you and other etcd users. At the end of the day we all want etcd to be solid and live up to the promises we make. Thanks for your patience as we continue to work hard to make that happen.
For me, having few reads + writes, I would definitely want to configure this tradeoff so that reads go through the state machine and are fully consistent. As raftd is not active and etcd is considered the closest to production use cases, I really would like to have this improved. This is a "showstopper" for us for deploying our upcoming software release, since we assumed it wasn't returning stale reads (silly us, yes). Anyway, +1 on fixing it and making reads consistent. Let the user decide in a global etcd configuration.
Am I to understand that there is no workaround here? I figured someone would chime in with "oh, just change this default setting to XYZ, and then your reads will be strongly consistent, even though they will be slower." Is the state of the system really that etcd lacks a strongly consistent mode at all? Nothing equivalent to mode [2] of Consul (above)?
@glycerine I have provided a hint for a workaround in the comments above. As @kelseyhightower mentioned, we take our system seriously and we are not ready for a decision right now.
Hi Xiang, In reality, presumption falls the other way. I always expect consistency. What possible use case is there for an inconsistent consensus protocol? If I want inconsistent results, I can just skip running a consensus system. Much, much easier. Same result when it matters most. Could you be more specific about your hint? I don't see any way forward in your above comments, for those of us who need to be able to read the last write reliably. Thanks, |
I have to agree with @glycerine. I don't have any use for a consensus system which can return stale reads. Correctness is much more important than performance in my use cases (distributed consensus, locking, transaction monitoring, etc.). PLEASE implement linearizable reads. Thank you for your great work on etcd and CoreOS. They are great products which we hope to use in production soon.
Also in agreement with @glycerine. One of the main draws of etcd for me was its presumed use as a consistent consensus system. Otherwise I could use any RDBMS...but I don't want that. I'm happy to have lookups take additional time...I'm not expecting etcd to be blazing fast. What I am expecting is for it to be correct.
Etcd aims to provide linearizable registers supporting reads, writes, and CAS operations. In a linearizable system, all operations must appear to take place at a specific, atomic point in time between the invocation and completion of that operation. This is not the case: reads in etcd can return stale versions of nodes.
In short, it's possible to write "foo", write "bar", then read "foo".
Is this a theoretical or real problem?
Real. I can reproduce it reliably on a five-node cluster with only a small number of clients performing five concurrent ops per second. Since leadership transitions occur rapidly in etcd, even a transient hiccup in network or host latency can cause this issue. I have a failing case in Jepsen which triggers this behavior in roughly 20 seconds; see the client, test, and a sample run with analysis.
The window of inconsistency is relatively short--sub-second--but appears reliably during transition.
Why does this happen?
For performance reasons, when a read request arrives on a node which believes itself to be the leader, etcd reads the current state from the Raft FSM and returns it to the client. This is fast and simple, but not quite correct.
Raft allows multiple nodes to believe themselves the leader concurrently. If a client makes a request to a leader which has recently been isolated, that leader may believe itself to be the leader even though a new leader has been elected and is accepting new writes. Clients writing to one side of etcd will find that their writes are not reflected on the old leader, until the old leader steps down.
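For illustration, here is roughly what that fast local read path looks like in a toy Go sketch. The types and fields are made up, not etcd's actual internals; the point is that the node consults only its own belief about leadership and its local state machine, so an isolated old leader keeps answering with stale data until its election timeout fires.

```go
package sketch

import "errors"

// node is an illustrative stand-in for an etcd/Raft server; these are not
// etcd's real types.
type node struct {
	state string            // "leader", "follower", or "candidate" (this node's local belief only)
	fsm   map[string]string // key-value state machine built by applying the log
}

// localGet is the fast-but-unsafe read path: it never contacts other nodes.
// An isolated old leader still has state == "leader" until its election
// timeout fires, so it can return values that newer committed writes have
// already replaced elsewhere.
func (n *node) localGet(key string) (string, error) {
	if n.state != "leader" {
		return "", errors.New("not leader: redirect to the leader")
	}
	return n.fsm[key], nil // possibly stale during a leadership transition
}
```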
How do we fix this?
Consistent reads must wait for at least one network round trip; the leader needs acknowledgement from a quorum of followers that it is still the leader in order to service a read. One way to do this is to append an entry for the read operation to the Raft log and, once it has been committed, send the read result back to the client.
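A rough sketch of that approach, with made-up types and names (this is not how etcd's internals are actually structured): the read is proposed as a log entry, and the response is sent only once that entry commits, which requires a quorum and therefore proves the node was still the leader.

```go
package sketch

import (
	"errors"
	"time"
)

// Entry and raftNode are illustrative; etcd's internals are organized differently.
type Entry struct {
	Key string // for a read entry, the key to look up once the entry commits
}

type raftNode struct {
	fsm             map[string]string
	electionTimeout time.Duration
	// propose appends an entry to the Raft log and returns a channel that is
	// closed once that entry has been committed by a quorum.
	propose func(Entry) (<-chan struct{}, error)
}

// readThroughLog appends an entry for the read and answers the client only
// after that entry commits. Because commitment requires a quorum, a deposed
// leader can never complete this path, so it cannot serve a stale value here.
func (n *raftNode) readThroughLog(key string) (string, error) {
	committed, err := n.propose(Entry{Key: key})
	if err != nil {
		return "", err
	}
	select {
	case <-committed:
		return n.fsm[key], nil // leadership held through the commit; value is current
	case <-time.After(n.electionTimeout):
		return "", errors.New("read entry did not commit; probably no longer the leader")
	}
}
```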
A more subtle way to fix this problem is to (invisibly) piggyback the read on the state of transactions the leader is already executing. In particular, the leader can track an outgoing unrelated log entry, and if it successfully commits, use that information to assert that it is still the leader. I believe the linearizability window for this strategy allows reads to proceed concurrently with writes; the leader looks at its local FSM, chooses the value X it thinks is valid, waits for a commit to propagate, then returns X to the client.
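A sketch of that piggyback variant under the same caveats (the `leaderNode` type and `nextCommitInTerm` hook are assumptions for illustration): the leader captures the value from its local FSM, then waits for any in-flight entry from its current term to commit before replying; the successful commit is the evidence that it was still the leader when the value was read.

```go
package sketch

import (
	"errors"
	"time"
)

// leaderNode is illustrative; these are not etcd's real internals.
type leaderNode struct {
	fsm             map[string]string
	electionTimeout time.Duration
	// nextCommitInTerm returns a channel that is closed when any in-flight
	// log entry proposed in the current term commits.
	nextCommitInTerm func() <-chan struct{}
}

// piggybackRead captures the locally visible value, then waits for an
// unrelated entry from the current term to commit. That commit needed a
// quorum, which confirms this node was still the leader when the value was
// read, so the captured value can be returned without logging the read itself.
func (n *leaderNode) piggybackRead(key string) (string, error) {
	value := n.fsm[key]               // tentative answer from the local FSM
	confirmed := n.nextCommitInTerm() // leadership evidence we piggyback on

	select {
	case <-confirmed:
		return value, nil
	case <-time.After(n.electionTimeout):
		return "", errors.New("no commit confirmed leadership; retry against the current leader")
	}
}
```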
It might also be possible to simply wait for a heartbeat from all follower nodes, confirming that they still believe the leader is valid. I'm not quite sure about the safety of this one.
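For completeness, the heartbeat idea is usually framed as a leadership lease: serve local reads only while a quorum has acknowledged the leader recently enough that no other node could have been elected yet. A toy sketch with made-up names follows; as noted above its safety is weaker than the log-based approaches, since it depends on bounded clock drift between nodes rather than on the log itself.

```go
package sketch

import (
	"errors"
	"time"
)

// leasedLeader is illustrative; these are not etcd's real internals.
type leasedLeader struct {
	fsm             map[string]string
	electionTimeout time.Duration
	lastQuorumAck   time.Time     // when a quorum of followers last acknowledged this leader
	clockDriftSlack time.Duration // safety margin for clock drift between nodes
}

// leaseRead serves a local read only while no other node could have won an
// election: followers do not start one until an election timeout has elapsed
// since the last heartbeat, so within that window (minus drift slack) this
// node is still the only leader. Safety depends on clocks drifting less than
// the configured slack.
func (l *leasedLeader) leaseRead(key string) (string, error) {
	if time.Since(l.lastQuorumAck) < l.electionTimeout-l.clockDriftSlack {
		return l.fsm[key], nil
	}
	return "", errors.New("leadership lease expired; fall back to a quorum-checked read")
}
```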
What are the costs?
At least one round-trip latency for each read. This makes reads only a little less expensive than writes in latency terms, but since reads don't need to touch the log directly, it doesn't balloon the size of the log. All required state can be kept in a single leader's memory.
Since the HTTP API terms these reads "consistent", I strongly advise making them consistent. It may be worth having three consistency tiers (see the sketch below):

- `any`, which allows arbitrary nodes to serve arbitrarily stale requests,
- `leader`, which allows a short window of inconsistency but requires no network round trip. This is the current `consistent` mode, and remains useful for systems doing CAS loops where the reduced inconsistency window dramatically reduces contention, but does not impact correctness, and
- `consistent`, which performs a consistent (i.e. linearizable) read.

Incidentally, I believe Consul is subject to a similar problem, but haven't verified experimentally yet. I hope you find this helpful! :)
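For concreteness, the three proposed tiers might be represented as an option like the following (a sketch only; the names echo the proposal above and are not an actual etcd API):

```go
package sketch

// ReadConsistency enumerates the three proposed tiers.
type ReadConsistency int

const (
	// Any lets any node answer from whatever state it has; reads may be
	// arbitrarily stale but cost no extra round trips.
	Any ReadConsistency = iota

	// Leader routes the read to the node that believes it is the leader and
	// answers from its local FSM; a short window of staleness remains around
	// leader transitions. (This is today's "consistent" behavior.)
	Leader

	// Consistent performs a linearizable read: the leader must confirm with a
	// quorum, via the log or a piggybacked commit, before answering.
	Consistent
)
```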