Consistent reads are not consistent #741

@aphyr

Etcd aims to provide linearizable registers supporting reads, writes, and CAS operations. In a linearizable system, all operations must appear to take place at a specific, atomic point in time between the invocation and completion of that operation. This is not the case: reads in etcd can return stale versions of nodes.

In short, it's possible to write "foo", write "bar", then read "foo".
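Because these three operations do not overlap in real time, linearizability forces the only legal order to be program order, and the read must return "bar". A minimal sketch of that check (my own illustration, not Jepsen's checker, which handles concurrent operations):

```python
# A history of completed, non-overlapping ops against one register.
# Real-time order forces W(foo) < W(bar) < R, so R must see "bar".
history = [("write", "foo"), ("write", "bar"), ("read", "foo")]

def linearizable(history):
    # With non-overlapping ops the only valid linearization is program
    # order; replay it against a single register and check each read.
    value = None
    for op, arg in history:
        if op == "write":
            value = arg
        elif op == "read" and arg != value:
            return False
    return True

print(linearizable(history))  # False: the read returned a stale value
```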

Is this a theoretical or real problem?

Real. I can reproduce it reliably on a five-node cluster with only a small number of clients performing five concurrent ops per second. Since leadership transitions occur rapidly in etcd, even a transient hiccup in network or host latency can cause this issue. I have a failing case in Jepsen which triggers this behavior in roughly 20 seconds; see the client, test, and a sample run with analysis.

The window of inconsistency is relatively short--sub-second--but appears reliably during transition.

Why does this happen?

For performance reasons, when a read request arrives on a node which believes itself to be the leader, etcd reads the current state from the Raft FSM and returns it to the client. This is fast and simple, but not quite correct.

Raft allows multiple nodes to believe themselves the leader concurrently. If a client makes a request to a leader which has recently been isolated, that leader may believe itself to be the leader even though a new leader has been elected and is accepting new writes. Clients writing to one side of etcd will find that their writes are not reflected on the old leader, until the old leader steps down.
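The failure mode above can be sketched in a few lines (a toy model, not etcd's code): two nodes each believe they lead, and the isolated old leader answers reads from its local, stale state machine while the quorum commits new writes.

```python
class Node:
    def __init__(self):
        self.is_leader = False
        self.fsm = {}  # this node's applied key-value state

old_leader, new_leader = Node(), Node()
old_leader.is_leader = True
old_leader.fsm["k"] = "foo"    # the last value it applied

# Partition: the majority elects a new leader and commits a write,
# but the old leader hasn't heard about any of it yet.
new_leader.is_leader = True
new_leader.fsm["k"] = "bar"

def local_read(node, key):
    assert node.is_leader      # it *believes* it may serve reads
    return node.fsm.get(key)

print(local_read(old_leader, "k"))  # foo -- stale; the quorum has bar
```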

How do we fix this?

Consistent reads must wait for at least one network round trip: to serve a read, the leader needs acknowledgement from a quorum of followers that it is still the leader. One way to do this is to append an entry for the read operation to the Raft log and, once it has been committed, send the result back to the client.
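A sketch of that log-entry approach, with hypothetical names (`replicate`, `consistent_read`) standing in for etcd's internals: the leader appends a read marker and answers only after a quorum acknowledges it, which proves it still leads.

```python
class Leader:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.log = []
        self.fsm = {}

    def replicate(self, entry):
        # Stand-in for AppendEntries: append and return the ack count.
        # We assume the whole cluster acks here; a real implementation
        # counts responses (and may time out).
        self.log.append(entry)
        return self.cluster_size

    def consistent_read(self, key):
        acks = self.replicate(("read-marker", key))
        if acks < self.cluster_size // 2 + 1:
            raise RuntimeError("lost leadership; retry elsewhere")
        return self.fsm.get(key)  # safe: a quorum confirmed leadership

leader = Leader(cluster_size=5)
leader.fsm["k"] = "bar"
print(leader.consistent_read("k"))  # bar
```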

A more subtle way to fix this problem is to (invisibly) piggyback the read on transactions the leader is already executing. In particular, the leader can track an in-flight, unrelated log entry, and if that entry successfully commits, use that fact to assert that it is still the leader. I believe the linearizability window for this strategy allows reads to proceed concurrently with writes: the leader looks at its local FSM, chooses the value X it thinks is valid, waits for a commit to propagate, then returns X to the client.
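The read-then-wait ordering is the important part, and can be sketched like so (hypothetical structure, not etcd's code): capture the value first, then wait for any unrelated commit; the commit proves this node was still leader, so the captured value was current when it was read.

```python
import threading

class Leader:
    def __init__(self):
        self.fsm = {"k": "bar"}
        self.commit_event = threading.Event()  # set when any entry commits

    def piggybacked_read(self, key):
        value = self.fsm.get(key)   # 1. read the local FSM
        self.commit_event.wait()    # 2. wait for an unrelated commit
        return value                # 3. leadership confirmed; answer

leader = Leader()
# A background thread stands in for an unrelated write committing:
threading.Thread(target=leader.commit_event.set).start()
print(leader.piggybacked_read("k"))  # bar
```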

It might also be possible to simply wait for a heartbeat from all follower nodes, confirming that they still believe the leader is valid. I'm not quite sure about the safety of this one.

What are the costs?

At least one round trip of latency for each read. This makes reads only a little less expensive than writes in latency terms, but since reads don't need to touch the log directly, it doesn't balloon the size of the log. All required state can be kept in a single leader's memory.

Since the HTTP API terms these reads "consistent", I strongly advise making them consistent. It may be worth having three consistency tiers:

  • any, which allows arbitrary nodes to serve arbitrarily stale requests,
  • leader, which allows a short window of inconsistency but requires no network round trip. This is the current consistent mode; it remains useful for systems doing CAS loops, where the reduced inconsistency window dramatically reduces contention but does not impact correctness, and
  • consistent, which performs a consistent (i.e. linearizable) read.
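Dispatch over the three tiers might look like this (the tier names come from the list above; the `Node` methods are hypothetical, not etcd's API):

```python
class Node:
    def __init__(self, data, leader=True):
        self.data, self.leader = data, leader

    def local_read(self, key):
        return self.data.get(key)

    def require_leader(self):
        if not self.leader:
            raise RuntimeError("not leader; redirect the client")

    def confirm_leadership_with_quorum(self):
        # Stand-in for the round trip; we assume the quorum agrees here.
        self.require_leader()

def serve_read(tier, node, key):
    if tier == "any":
        return node.local_read(key)            # any replica, maybe stale
    if tier == "leader":
        node.require_leader()                  # current "consistent" mode
        return node.local_read(key)
    if tier == "consistent":
        node.confirm_leadership_with_quorum()  # one round trip, linearizable
        return node.local_read(key)
    raise ValueError(tier)

n = Node({"k": "bar"})
print(serve_read("consistent", n, "k"))  # bar
```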

Incidentally, I believe Consul is subject to a similar problem, but haven't verified experimentally yet. I hope you find this helpful! :)
