Consistent reads are not consistent #741

@aphyr

Etcd aims to provide linearizable registers supporting reads, writes, and CAS operations. In a linearizable system, all operations must appear to take place at a specific, atomic point in time between the invocation and completion of that operation. This is not the case: reads in etcd can return stale versions of nodes.

In short, it's possible to write "foo", write "bar", then read "foo".
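Because these three operations do not overlap in real time, linearizability forces the only legal order to be program order, and the read must return "bar". A minimal sketch of that check (my own illustration, not Jepsen's checker, which handles concurrent operations):

```python
# A history of completed, non-overlapping ops against one register.
# Real-time order forces W(foo) < W(bar) < R, so R must see "bar".
history = [("write", "foo"), ("write", "bar"), ("read", "foo")]

def linearizable(history):
    # With non-overlapping ops the only valid linearization is program
    # order; replay it against a single register and check each read.
    value = None
    for op, arg in history:
        if op == "write":
            value = arg
        elif op == "read" and arg != value:
            return False
    return True

print(linearizable(history))  # False: the read returned a stale value
```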

Is this a theoretical or real problem?

Real. I can reproduce it reliably on a five-node cluster with only a small number of clients performing five concurrent ops per second. Since leadership transitions occur rapidly in etcd, even a transient hiccup in network or host latency can cause this issue. I have a failing case in Jepsen which triggers this behavior in roughly 20 seconds; see the client, test, and a sample run with analysis.

The window of inconsistency is relatively short--sub-second--but appears reliably during transition.

Why does this happen?

For performance reasons, when a read request arrives on a node which believes itself to be the leader, etcd reads the current state from the Raft FSM and returns it to the client. This is fast and simple, but not quite correct.

Raft allows multiple nodes to believe themselves the leader concurrently. If a client makes a request to a leader which has recently been isolated, that leader may believe itself to be the leader even though a new leader has been elected and is accepting new writes. Clients writing to one side of etcd will find that their writes are not reflected on the old leader, until the old leader steps down.
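The failure mode above can be sketched in a few lines (a toy model, not etcd's code): two nodes each believe they lead, and the isolated old leader answers reads from its local, stale state machine while the quorum commits new writes.

```python
class Node:
    def __init__(self):
        self.is_leader = False
        self.fsm = {}  # this node's applied key-value state

old_leader, new_leader = Node(), Node()
old_leader.is_leader = True
old_leader.fsm["k"] = "foo"    # the last value it applied

# Partition: the majority elects a new leader and commits a write,
# but the old leader hasn't heard about any of it yet.
new_leader.is_leader = True
new_leader.fsm["k"] = "bar"

def local_read(node, key):
    assert node.is_leader      # it *believes* it may serve reads
    return node.fsm.get(key)

print(local_read(old_leader, "k"))  # foo -- stale; the quorum has bar
```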

How do we fix this?

Consistent reads must wait for at least one network round trip: to serve a read, the leader needs acknowledgement from a quorum of followers that it is still the leader. One way to do this is to append an entry for the read operation to the Raft log and, once it has been committed, send the result back to the client.
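A sketch of that log-entry approach, with hypothetical names (`replicate`, `consistent_read`) standing in for etcd's internals: the leader appends a read marker and answers only after a quorum acknowledges it, which proves it still leads.

```python
class Leader:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.log = []
        self.fsm = {}

    def replicate(self, entry):
        # Stand-in for AppendEntries: append and return the ack count.
        # We assume the whole cluster acks here; a real implementation
        # counts responses (and may time out).
        self.log.append(entry)
        return self.cluster_size

    def consistent_read(self, key):
        acks = self.replicate(("read-marker", key))
        if acks < self.cluster_size // 2 + 1:
            raise RuntimeError("lost leadership; retry elsewhere")
        return self.fsm.get(key)  # safe: a quorum confirmed leadership

leader = Leader(cluster_size=5)
leader.fsm["k"] = "bar"
print(leader.consistent_read("k"))  # bar
```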

A more subtle way to fix this problem is to (invisibly) piggyback the read on transactions the leader is already executing. In particular, the leader can track an in-flight, unrelated log entry, and if that entry successfully commits, use that fact to assert that it is still the leader. I believe the linearizability window for this strategy allows reads to proceed concurrently with writes: the leader looks at its local FSM, chooses the value X it thinks is valid, waits for a commit to propagate, then returns X to the client.
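The read-then-wait ordering is the important part, and can be sketched like so (hypothetical structure, not etcd's code): capture the value first, then wait for any unrelated commit; the commit proves this node was still leader, so the captured value was current when it was read.

```python
import threading

class Leader:
    def __init__(self):
        self.fsm = {"k": "bar"}
        self.commit_event = threading.Event()  # set when any entry commits

    def piggybacked_read(self, key):
        value = self.fsm.get(key)   # 1. read the local FSM
        self.commit_event.wait()    # 2. wait for an unrelated commit
        return value                # 3. leadership confirmed; answer

leader = Leader()
# A background thread stands in for an unrelated write committing:
threading.Thread(target=leader.commit_event.set).start()
print(leader.piggybacked_read("k"))  # bar
```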

It might also be possible to simply wait for a heartbeat from all follower nodes, confirming that they still believe the leader is valid. I'm not quite sure about the safety of this one.

What are the costs?

At least one round trip of latency for each read. This makes reads only a little less expensive than writes in latency terms, but since reads don't need to touch the log directly, it doesn't balloon the size of the log. All required state can be kept in a single leader's memory.

Since the HTTP API terms these reads "consistent", I strongly advise making them consistent. It may be worth having three consistency tiers:

  • any, which allows arbitrary nodes to serve arbitrarily stale requests,
  • leader, which allows a short window of inconsistency but requires no network round trip. This is the current consistent mode; it remains useful for systems doing CAS loops, where the reduced inconsistency window dramatically reduces contention but does not impact correctness, and
  • consistent, which performs a consistent (i.e. linearizable) read.
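Dispatch over the three tiers might look like this (the tier names come from the list above; the `Node` methods are hypothetical, not etcd's API):

```python
class Node:
    def __init__(self, data, leader=True):
        self.data, self.leader = data, leader

    def local_read(self, key):
        return self.data.get(key)

    def require_leader(self):
        if not self.leader:
            raise RuntimeError("not leader; redirect the client")

    def confirm_leadership_with_quorum(self):
        # Stand-in for the round trip; we assume the quorum agrees here.
        self.require_leader()

def serve_read(tier, node, key):
    if tier == "any":
        return node.local_read(key)            # any replica, maybe stale
    if tier == "leader":
        node.require_leader()                  # current "consistent" mode
        return node.local_read(key)
    if tier == "consistent":
        node.confirm_leadership_with_quorum()  # one round trip, linearizable
        return node.local_read(key)
    raise ValueError(tier)

n = Node({"k": "bar"})
print(serve_read("consistent", n, "k"))  # bar
```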

Incidentally, I believe Consul is subject to a similar problem, but haven't verified experimentally yet. I hope you find this helpful! :)
