
[WIP] rfc: Consistent Read Replicas #39758

Conversation


@nvanbenschoten nvanbenschoten commented Aug 20, 2019

!!!!!!
Likely rejected in favor of #52745.
!!!!!!


Consistent Read Replicas provide a mechanism through which follower replicas in a Range can be used to serve reads for non-stale read-only and read-write transactions.

The ability to serve reads from follower replicas is beneficial both because it can reduce wide-area network jumps in geo-distributed deployments and because it can serve as a form of load-balancing for concentrated read traffic. It may also provide an avenue to reduce tail latencies in read-heavy workloads, although such a benefit is not a focus of this RFC.


The purpose of this RFC is to introduce an approach to consistent read replicas that I think could be implemented in CockroachDB in the medium-term future. It takes inspiration from Andy Kimball and the AcidLib project. I'm hoping for this to spur a discussion about potential solutions to the problem and also generally about our ambitions in this area going forward.

The most salient design decision from this proposal is that it remains completely separate from the Raft consensus protocol. It does not interact with Raft, instead operating in the lease index domain. This is similar in spirit to our approach to implementing closed timestamps.

The RFC includes three alternative approaches that each address part of these issues, but come with their own challenges and costs. Unlike with most RFCs, I think it's very possible that we end up preferring one of these alternatives over the main proposal in the RFC. Because of this, the RFC is fairly unpolished and avoids going into too much detail about implementation specifics of any single approach.

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner August 20, 2019 03:03


@ajwerner ajwerner left a comment


Nice! I'm pumped to see this out for comment so quickly.

A mechanism to hoist latency costs onto writes to relieve the cost of reads seems like a valuable tradeoff we currently can't offer. Furthermore, despite its implementation complexity, this proposal seems to have a minimal number of touch points with other subsystems.

Can we come up with some concrete use cases and user stories to inform the trade-offs presented by these approaches? Consider an application with a global user base: would you consider running in a fully read-replicated configuration for user profile data just to avoid the need to geo-partition? User profiles are read much, much more often than they are written for most applications.

Do we expect this to become the default way to use a cockroach cluster in cases where write latency isn't particularly important? What global user stories do we have that we know we aren't telling very well?



docs/RFCS/20190819_consistent_read_replicas.md, line 132 at r1 (raw file):

A better solution to this problem would be adding a consistent read replica for
the table in each geo-graphical locality. This would reduce operational burden,
write amplification, and availability.

missing an increase on availability.


docs/RFCS/20190819_consistent_read_replicas.md, line 145 at r1 (raw file):

  "partial" Timestamp Cache structures
- Timestamp Cache information is collected from each read replica during each write
- Latches are maintained across the leaseholder and all read replicas during each

This latching is my biggest concern with this RFC. For write-almost-never ("reference") tables and indexes this RFC will work great. That said, for reference tables mixed timestamp statements might be a much simpler acceptable (though admittedly harder to reason about and use) approach for the majority of cases.

For write-rarely indexes where workloads perform scans these latches may become a deal breaker. Any latch you run into is going to have to endure on average at least one WAN round trip for replication in addition to the half RTT to communicate the latch response. If a workload has a high likelihood of encountering a latch then this proposal seems to become almost as bad as just going to the leaseholder.

Just brainstorming here: I wonder if there's a way to take a cue from some of the more optimistic ideas below and combine them with this RFC. What if we provide a mechanism to push read-only transactions backwards in time so that they occur before any currently held latches? This would allow reads to be served locally without blocking on latches. Furthermore, it seems like one could implement a read-refresh mechanism that ensures a read is not invalidated by moving backwards in time, in cases where there are covering read replicas for all the data needed by a transaction in the local region. Such a mechanism would allow a transaction to observe a consistent snapshot, though without the inconsistency of the optimistic follower reads approach below. While this approach can work to provide serializable isolation, it opens the door to stale reads. Perhaps that too can be fixed with a causality token.

Admittedly if a transaction went on to do a write it would almost certainly fail to refresh its reads. I guess what I'm getting at is that these read replicas could provide very near real time stale reads without any reliance on the closed timestamp subsystem.
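To make the brainstorm concrete, here is a minimal Go sketch of the "read below held latches, refresh later" idea. All names and types here are invented for illustration; this is not a CockroachDB API.

```go
// A sketch only: choose a follower-local read timestamp beneath any held
// write latch so the read never blocks, and remember that the transaction
// must later refresh its reads if it goes on to write.
package sketch

type Timestamp int64 // hypothetical simplified HLC timestamp

// chooseLocalReadTS returns a timestamp at which a follower can serve the
// read without blocking: if a write latch is held at or below the txn's
// timestamp, read just beneath it and require a later read refresh.
func chooseLocalReadTS(txnTS Timestamp, heldLatchTS *Timestamp) (readTS Timestamp, mustRefresh bool) {
	if heldLatchTS != nil && *heldLatchTS <= txnTS {
		return *heldLatchTS - 1, true
	}
	return txnTS, false
}
```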


@tbg tbg left a comment


Thanks for the write-up!

I echo @ajwerner's comment that we should understand better how each of these solutions would fit into our standard deployment patterns. Are we trying to solve a problem specific to reference tables? Are we willing to accept that the mechanism will only be enabled manually (via configuration) on tables on which it's expected to be an overall net win? Does this need to be self-driving somehow? And how important is the load balancing aspect against the latency improvements?

My two top contenders here are (writer-enforced) consistent read replicas and consistent optimistic follower reads.

Reader-enforced read replicas make reads more latency bound rather than less, and returning potentially inconsistent data to the SQL layer seems like it would add a prohibitive amount of complexity and bugs there (can SQL folks confirm?).

Consistent read replicas are nicely "local" (mod the epoch, but that's an old constraint); they don't need to involve the "global" construct of closed timestamp in which the weakest link determines how far behind it is (this can be mitigated somewhat, but still). There's still lots of complexity in the implementation, though it's sort of a plus in my book that it touches areas that need some love anyway (lease transfers etc). The big question is whether it satisfactorily solves a large enough problem on top of reference tables (which we can already mitigate with the per-region indexes we have today).



docs/RFCS/20190819_consistent_read_replicas.md, line 34 at r1 (raw file):

issues, but come with their own challenges and costs. Unlike with most RFCs, I
think it's very possible that we end up preferring one of these alternatives
over the main proposal in the RFC._

stray _


tbg commented Aug 20, 2019

In particular, could we consider getting rid of closed timestamps? A follower read could instead be satisfied by a consistent read replica. That replica is guaranteed to know about all relevant inflight proposals, so the closed timestamp shouldn't be needed there any more and could simply be replaced by a continuously rising low water mark on the leaseholder. For CDC there should be a similar story.
Of course read replicas are far from free, they're pretty terrible for write throughput, so it will probably be hard to make a convincing argument here, unless we're willing to remove the concept of follower reads altogether, which I assume we're not ready to do. But the thought experiment in itself is interesting.


@nvanbenschoten nvanbenschoten left a comment


Can we come up with some concrete use cases and user stories to inform the trade-offs presented by these approaches? Consider an application with a global user base: would you consider running in a fully read-replicated configuration for user profile data just to avoid the need to geo-partition? User profiles are read much, much more often than they are written for most applications.

I added this as an example.

Do we expect this to become the default way to use a cockroach cluster in cases where write latency isn't particularly important?

I'm not sure. I would expect this to be used heavily in situations where read latency is considered much more important than write latency, but I don't know if I'd go as far as saying it should be the default.

I echo @ajwerner's comment that we should understand better how each of these solutions would fit into our standard deployment patterns. Are we trying to solve a problem specific to reference tables? Are we willing to accept that the mechanism will only be enabled manually (via configuration) on tables on which it's expected to be an overall net win? Does this need to be self-driving somehow? And how important is the load balancing aspect against the latency improvements?

Agreed. We'll want to answer all of these questions. Interestingly, some of the alternatives provided here would require manual intervention while others would be accessible without any configuration. A split I see arising between the two groups of solutions is whether tables can be clearly partitioned ahead of time into write-optimized vs. read-optimized. Alternatives that require manual configuration can reach the ends of this spectrum, while some of the optimistic solutions that avoid configuration improve latency across the board but fail to optimize for these extreme cases. I know @andy-kimball has strong opinions on how developers should think about their schema and the trade-offs they would want to make with it, so I'd like to hear his thoughts on all of this.

My two top contenders here are (writer-enforced) consistent read replicas and consistent optimistic follower reads.

Mine too. I actually just updated the consistent optimistic follower reads proposal to avoid any concerns of stale reads. We can do this through the observation that waiting for a follower's closed timestamp to exceed a transaction's timestamp can often take half as long as actually performing a leaseholder round trip, and it should only need to be paid once per txn, and only if the txn ever actually needs to read remote data.

The updated proposal includes a latency model that places upper bounds on transaction latencies for transactions that touch data locally owned by either a leaseholder or follower in their region. The TL;DR is that I think we could get down to something around the range of 1.5x to 3.5x WAN RTT for general read/write transactions that replicate across regions and touch data without a local leaseholder.

See Consistent Optimistic Follower Reads v2.
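To illustrate the core of that observation, here is a rough Go sketch of "wait for the follower's closed timestamp instead of a leaseholder round trip". The types and the closedTS accessor are invented; a real implementation would use a notification mechanism rather than polling.

```go
// A sketch only: block until the follower's closed timestamp has advanced
// past the transaction's read timestamp, after which reads at txnTS can be
// served locally. The cost is paid at most once per transaction, and only if
// the transaction actually needs to read follower-resident data.
package sketch

import "time"

type Timestamp int64

func waitForClosedTS(txnTS Timestamp, closedTS func() Timestamp) {
	for closedTS() < txnTS {
		time.Sleep(time.Millisecond) // polling stands in for a real notification
	}
}
```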

In particular, could we consider getting rid of closed timestamps? A follower read could instead be satisfied by a consistent read replica.

I think this would take us further away from making good use of follower replicas, rather than closer to it.



docs/RFCS/20190819_consistent_read_replicas.md, line 34 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

stray _

I was putting these three paragraphs in italics, but removed.


docs/RFCS/20190819_consistent_read_replicas.md, line 132 at r1 (raw file):

Previously, ajwerner wrote…

missing an increase on availability.

Done.


docs/RFCS/20190819_consistent_read_replicas.md, line 145 at r1 (raw file):

Previously, ajwerner wrote…

This latching is my biggest concern with this RFC. For write-almost-never ("reference") tables and indexes this RFC will work great. That said, for reference tables mixed timestamp statements might be a much simpler acceptable (though admittedly harder to reason about and use) approach for the majority of cases.

For write-rarely indexes where workloads perform scans these latches may become a deal breaker. Any latch you run into is going to have to endure on average at least one WAN round trip for replication in addition to the half RTT to communicate the latch response. If a workload has a high likelihood of encountering a latch then this proposal seems to become almost as bad as just going to the leaseholder.

Just brainstorming here: I wonder if there's a way to take a cue from some of the more optimistic ideas below and combine them with this RFC. What if we provide a mechanism to push read-only transactions backwards in time so that they occur before any currently held latches? This would allow reads to be served locally without blocking on latches. Furthermore, it seems like one could implement a read-refresh mechanism that ensures a read is not invalidated by moving backwards in time, in cases where there are covering read replicas for all the data needed by a transaction in the local region. Such a mechanism would allow a transaction to observe a consistent snapshot, though without the inconsistency of the optimistic follower reads approach below. While this approach can work to provide serializable isolation, it opens the door to stale reads. Perhaps that too can be fixed with a causality token.

Admittedly if a transaction went on to do a write it would almost certainly fail to refresh its reads. I guess what I'm getting at is that these read replicas could provide very near real time stale reads without any reliance on the closed timestamp subsystem.

The idea of moving a transaction backward in time scares me.

If a read-only transaction doesn't want to interfere with foreground writes and is ok with staleness, then it should use follower reads, which I hope we can reduce the staleness of significantly. If we can reduce this staleness sufficiently, then I don't think the complexity of this adaptive approach gives us all that much. With the 46 second staleness now though, I see why something like this would be an interesting alternative.

If the read-only transaction doesn't want to interfere with foreground writes but isn't ok with staleness then I guess it could run at a high priority.


@bdarnell bdarnell left a comment


This is similar to quorum leases: https://www.cs.cmu.edu/~dga/papers/leases-socc2014.pdf

returning potentially inconsistent data to the SQL layer seems like it would add a prohibitive amount of complexity and bugs there (can SQL folks confirm?).

I'm not sure how much of a problem this would be in the SQL layer, but exposing inconsistently-read data to the application seems like a non-starter. (Example: a bank transaction does a balance check, sees that there's not enough in the account and decides to roll back. Since it rolls back instead of committing, it doesn't see the error returned from the commit-time refresh. This could be solved if we change the retry loop protocol so that errors in rollback could cause the transaction to retry (and could still succeed), but that seems like a very heavy lift)

Do we expect this to become the default way to use a cockroach cluster in cases where write latency isn't particularly important?

I think it probably would - this makes geo-distributed clusters much more useful without the substantial development investment required to use partitioning. Right now you have to think a lot about partitioning to even get started with geo-distribution; this gives you a more reasonable baseline with good read performance and then partitioning can be an "advanced" option for apps where write performance is critical. I can't think of a case where I'd choose our current default for a geo-distributed cluster instead of making all replicas consistent read replicas.



docs/RFCS/20190819_consistent_read_replicas.md, line 145 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The idea of moving a transaction backward in time scares me.

If a read-only transaction doesn't want to interfere with foreground writes and is ok with staleness, then it should use follower reads, which I hope we can reduce the staleness of significantly. If we can reduce this staleness sufficiently, then I don't think the complexity of this adaptive approach gives us all that much. With the 46 second staleness now though, I see why something like this would be an interesting alternative.

If the read-only transaction doesn't want to interfere with foreground writes but isn't ok with staleness then I guess it could run at a high priority.

Yeah, moving a transaction backward in time is scary. What if we go the other way and move the writing transaction forward in time?

The idea is that when the RemoteLatch request comes in, we add an offset to the max timestamp we get from our timestamp cache (maybe a multiple of the RTT). Then before it commits, the transaction would have its timestamp pushed to this value and refresh all its spans.

This way, the read replicas would not find their reads blocked as long as the write transaction completes promptly (but if the writer takes long enough, reads would be blocked to ensure it doesn't starve).
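Here is a small Go sketch of that suggestion, purely illustrative: the RemoteLatchResponse type, the offset policy, and the function names are made up, not part of the proposal's actual interfaces.

```go
// A sketch only: the writer computes its eventual commit timestamp as the
// maximum timestamp-cache reading across read replicas plus an offset
// (e.g. a multiple of the RTT), then pushes to that value and refreshes its
// spans before committing, so reads arriving in the meantime stay below it.
package sketch

type Timestamp int64

type RemoteLatchResponse struct {
	TSCacheMax Timestamp // max timestamp-cache reading at one read replica
}

func proposedWriteTS(resps []RemoteLatchResponse, offset Timestamp) Timestamp {
	var max Timestamp
	for _, r := range resps {
		if r.TSCacheMax > max {
			max = r.TSCacheMax
		}
	}
	return max + offset
}
```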


docs/RFCS/20190819_consistent_read_replicas.md, line 90 at r2 (raw file):

## Consistent Read Replicas

The RFC introduces the term "consistent read replica", which is a non-leaseholder

Users who are accustomed to asynchronous replication often ask for "read-only replicas", by which they mean basically the opposite of this: replicas that can serve stale reads without slowing down writes. We should think about how to name these things so that we can distinguish between the two types of read replicas, and so that users can understand that one speeds up writes and the other slows them down (I'm not sure that "consistent read replicas" and "async/inconsistent read-only replicas" conveys that effectively).


docs/RFCS/20190819_consistent_read_replicas.md, line 330 at r2 (raw file):

1. read request enters follower replica
2. follower replica sends `RemoteRead` request to leaseholder replica
3. `RemoteRead` request acquires latches, grabs current lease applied index, and

FWIW, etcd/raft has a ReadIndex feature that does essentially this (with batching). I'm not sure whether we'd be able to use it or if we'd need something that understands our tscache and lease indexes.
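For reference, a rough sketch of how etcd/raft's ReadIndex API is driven. This is heavily simplified: a real application processes the full Ready struct (entries, messages, hard state) in its main raft loop, and the import path varies by etcd version.

```go
// A sketch only: issue a ReadIndex request and wait for the matching
// ReadState, which carries the index that must be applied before the read
// can be served.
package sketch

import (
	"bytes"
	"context"

	"go.etcd.io/etcd/raft/v3"
)

func awaitReadIndex(ctx context.Context, n raft.Node, reqCtx []byte) (uint64, error) {
	if err := n.ReadIndex(ctx, reqCtx); err != nil {
		return 0, err
	}
	for {
		select {
		case rd := <-n.Ready():
			for _, rs := range rd.ReadStates {
				if bytes.Equal(rs.RequestCtx, reqCtx) {
					n.Advance()
					return rs.Index, nil
				}
			}
			n.Advance() // a real loop would also handle the rest of Ready here
		case <-ctx.Done():
			return 0, ctx.Err()
		}
	}
}
```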


@bdarnell bdarnell left a comment


I can't think of a case where I'd choose our current default for a geo-distributed cluster instead of making all replicas consistent read replicas.

I'm backing off this a bit after talking to Nathan offline. The performance impact on writes is worse than I initially thought since it partially negates write pipelining (my head was still in the pre-pipelining world so I was thinking it was just a small constant multiplier). So apps that have transactions with a lot of back-and-forth might still prefer not to use consistent read replicas. (But I'd still use them most of the time - it's usually possible to minimize the amount of back-and-forth in your transactions)

And to be clear, I prefer the primary proposal here (writer-enforced). I don't think the reader-enforced option is very interesting (the important part is getting to local reads). The "consistent optimistic" options might be interesting (if we can satisfy the prerequisite of low closed timestamp intervals), but I'm wary of moving back towards optimism after seeing how resistant people are to retry loops.



docs/RFCS/20190819_consistent_read_replicas.md, line 447 at r2 (raw file):

Both this proposal and the previous prompt us to reconsider the idea of batching
all reads until commit-time instead of pipelining them as we go. Doing so would

If we batch the read verification to commit time, aren't we back to the "inconsistent optimistic" option?


@knz knz left a comment




docs/RFCS/20190819_consistent_read_replicas.md, line 75 at r2 (raw file):

imposes a large cost on the use of follower reads. Furthermore, because a
statement is the smallest granularity that a user can buy-in to follower reads
at, there is a strong desire to support [mixed timestamp statements](https://github.com/cockroachdb/cockroach/issues/39275), which

There's a second issue number we're referring to from inside CockroachDB: #35712


docs/RFCS/20190819_consistent_read_replicas.md, line 90 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Users who are accustomed to asynchronous replication often ask for "read-only replicas", by which they mean basically the opposite of this: replicas that can serve stale reads without slowing down writes. We should think about how to name these things so that we can distinguish between the two types of read replicas, and so that users can understand that one speeds up writes and the other slows them down (I'm not sure that "consistent read replicas" and "async/inconsistent read-only replicas" conveys that effectively).

I think we'd cover some good ground with the distinction between "async. read replicas" and "sync. read replicas"


docs/RFCS/20190819_consistent_read_replicas.md, line 98 at r2 (raw file):

reads and writes. A Range with one or more read replicas will be required to
perform an extra round of communication between the leaseholder and each of the
Range's read replicas before evaluating a write. In exchange for this, each of

the description makes it sound more like you're decoupling the notion of "write leader" (raft) from "read leader" (ts cache and whatnot), and then introducing "secondary read leaders" (the new thing).

That may be easier to explain/document than a distinction between flavors of reads - I think folks generally understand that all "leaders" must participate in deciding new states (writes).


docs/RFCS/20190819_consistent_read_replicas.md, line 106 at r2 (raw file):

Consistent read replicas are configured using zone configs. It's likely that their
configuration will mirror that of [lease preferences](https://www.cockroachlabs.com/docs/v19.1/configure-replication-zones.html#constrain-leaseholders-to-specific-datacenters).
Zones will default to zero consistent read replicas per Range, but this can

the problem with the term "consistent read replica" is that the main replica that's the leaseholder is also able to serve consistent reads. As-is, you've chosen a term which intuitively applies to the leaseholder, whereas you'd like it not to.

if you were to call them "secondary read leaders" then you'd have one "primary read leader", zero or more "secondary read leaders" and as a group "all read leaders can serve reads" which is nice and symmetrical.

Then you'd say "the default zone config defaults to having just one leader" which is very intuitive.


docs/RFCS/20190819_consistent_read_replicas.md, line 355 at r2 (raw file):

goes against the general intuition that reads are commonly orders of magnitude
more frequent than writes. Because of this, it's probably not the right
general-purpose solution.

I think that it's a bit unfair to make a judgment call here of the form "we think some postulate is true, therefore the approach above is better" (postulate = reads more frequent)

I think the more neutral form would be to open the RFC at the top with a paragraph that says (paraphrasing)

"""
if we want to serve reads from other replicas than the leaseholder, we really have a choice of four algorithms, along two axes:

  • early validation (as the operations happen) vs late validation (at commit time). Early validation is better when we expect high contention, late validation is better when we expect low contention.
  • readers ask the write leader for permission to progress, or writers ask the read leaders for permission to progress. The former is better if we expect more writes than reads, and the latter is better if we expect more reads than writes.

"""

Then something like this:
"This RFC focuses on the case of potentially highly contended, more frequent reads and leaves the other 3 optimization points out of scope for now. For reference, two of these other points are outlined in the 'alternative' section below, with a nod to Spencer for outlining one of them earlier."


@sumeerbhola sumeerbhola left a comment




docs/RFCS/20190819_consistent_read_replicas.md, line 167 at r2 (raw file):

same path that it normally does. It would acquire latches from its spanlatch
manager, it would evaluate against the local state of the storage engine, and the
it would bump its timestamp cache.

and if it needs to wait, is the wait queue still on the leaseholder?


docs/RFCS/20190819_consistent_read_replicas.md, line 285 at r2 (raw file):

expired leases after some period of blocking on latches. Without this, a read
could enter a read replica, observe a valid lease, and then get stuck
indefinitely waiting for a write latch to be invalidated.

I did not understand this second mechanism (because I don't know enough details of the system) -- isn't it the collective responsibility of all the replicas to maintain the liveness property that there is a leaseholder? And if that is being done, the lease sequence number will increment and the first mechanism will handle the invalidation?


docs/RFCS/20190819_consistent_read_replicas.md, line 478 at r2 (raw file):

  that a causality token would be useful

### Consistent Optimistic Follower Reads v2

I think one could construct a hybrid of the writer enforced consistent read replicas and this scheme (I may be overlooking something, and probably misusing the terminology).

There would be 2 closed timestamp horizons T_c1 and T_c2, where T_c1 would be less aggressive (like now, so one would not need to push timestamps more often) and would guarantee no writes can happen below it (same as now). T_c2 would be more aggressive (say ~1s lag) and would indicate that transactions trying to write to [T_c1, T_c2] need to use writer enforced consistent reads. Transactions > T_c2 do not need that "coordination". So short lived write transactions can commit without coordination with read replicas -- I am assuming we care most about latency for short-lived transactions that do writes as opposed to long-lived write transactions. Advancing T_c2 by delta means that the leaseholder is promising to coordinate writes up to T_c2 + delta, so it would include an upper bound on the lease index for uncoordinated writes it has already issued in [T_c2, T_c2 + delta]. Long lived read transactions in the location of a read replica will soon fall behind the T_c2 horizon and be able to read locally. Short-lived read transactions that are reading > T_c2 can do one of (a) read at the leaseholder, (b) use the v2 scheme described here (i.e., wait for T_c2 to go into the past).

One could adaptively set how much T_c2 lags the current time, for the current workload on a range, by keeping track of the distribution of lag (from the current time) of writes and reads, and computing a T_c2 lag that minimizes some objective function on the latency of the workload.
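To make the two-horizon idea a bit more concrete, a rough Go sketch of the admission decision follows. The names (tc1, tc2) and the coordination hook are invented; T_c1 behaves like today's closed timestamp, T_c2 is the more aggressive horizon below which writes must coordinate with read replicas.

```go
// A sketch only: classify a write at timestamp ts against the two horizons.
package sketch

import "errors"

type Timestamp int64

type horizons struct {
	tc1 Timestamp // nothing may write below this (existing closed timestamp)
	tc2 Timestamp // writes in [tc1, tc2] must use writer-enforced coordination
}

func (h horizons) admitWrite(ts Timestamp, coordinate func() error) error {
	switch {
	case ts < h.tc1:
		return errors.New("timestamp below closed timestamp; push the write")
	case ts <= h.tc2:
		return coordinate() // round trip to the consistent read replicas
	default:
		return nil // short-lived writes above T_c2 need no coordination
	}
}
```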


@sumeerbhola sumeerbhola left a comment




docs/RFCS/20190819_consistent_read_replicas.md, line 478 at r2 (raw file):

Previously, sumeerbhola wrote…

I think one could construct a hybrid of the writer enforced consistent read replicas and this scheme (I may be overlooking something, and probably misusing the terminology).

There would be 2 closed timestamp horizons T_c1 and T_c2, where T_c1 would be less aggressive (like now, so one would not need to push timestamps more often) and would guarantee no writes can happen below it (same as now). T_c2 would be more aggressive (say ~1s lag) and would indicate that transactions trying to write to [T_c1, T_c2] need to use writer enforced consistent reads. Transactions > T_c2 do not need that "coordination". So short lived write transactions can commit without coordination with read replicas -- I am assuming we care most about latency for short-lived transactions that do writes as opposed to long-lived write transactions. Advancing T_c2 by delta means that the leaseholder is promising to coordinate writes up to T_c2 + delta, so it would include an upper bound on the lease index for uncoordinated writes it has already issued in [T_c2, T_c2 + delta]. Long lived read transactions in the location of a read replica will soon fall behind the T_c2 horizon and be able to read locally. Short-lived read transactions that are reading > T_c2 can do one of (a) read at the leaseholder, (b) use the v2 scheme described here (i.e., wait for T_c2 to go into the past).

One could adaptively set how much T_c2 lags the current time, for the current workload on a range, by keeping track of the distribution of lag (from the current time) of writes and reads, and computing a T_c2 lag that minimizes some objective function on the latency of the workload.

IIUC, only "writer enforced consistent read replicas" can easily prevent stale reads. With the closed timestamp schemes, the closed timestamp has to be more than maximum clock offset ahead of the read timestamp in order to discover that one is not doing a stale read. I guess one could discover the potential stale read later in the transaction lifetime (may require talking to the leaseholder for the replica, unless the closed timestamp has advanced sufficiently into the future), but then there is the likelihood of more false positives wrt stale reads since one can't know whether a later write that is within the clock offset was already known to the leaseholder when this transaction performed the read at the follower. Am I missing something?

But writer enforced consistent read replicas don't have that problem because the leaseholder coordinates with the read replicas before writing the intent. The issue there seems to be that any read replica failure causes a stall until the read replica's node liveness epoch is invalidated and it is removed from the lease, so it seems one can't be very aggressive wrt removal (say remove if the node seems slow but may still be up). One could have a lighter weight and short read-lease that each read replica needs to maintain in the leaseholder (in the leaseholder's memory) which the leader could use to kick out the read replica and readmit it without messing with the node liveness epoch -- and perhaps one could optimize these to be per node instead of per range.

@nvanbenschoten

Something that came up this week about consistent read replicas that I figured should be recorded here is that the diameter of the consistent read replica "quorum" doesn't need to be the same as the diameter of the replication write quorum. For instance, it would be a valid configuration to have a write quorum that was contained to a region (e.g. survive AZ failure but not region failure) but have a consistent read replica "quorum" that expanded out to learner replicas across the globe.

In the "hubs and spokes" topology that we are discussing the doc about optimistic global transactions, this kind of flexibility would be very flexible.


tbg commented Nov 18, 2019

For a concrete example, we could have all three regular voters in us-central1 but we could have learners in us-west1 and us-east1. A write would then have to

  1. go to leaseholder in us-central1
  2. (in parallel) acquire remote latches in us-{east,west}1
  3. replicate (locally within us-central1), assumed to take negligible amounts of time
  4. (in parallel) release latches in us-{east,west}1

The latches on us-{east,west}1 would be held for one RTT (so "current" reads would be blocked for as long), and if the replication latency becomes non-negligible, it gets added on top as well.
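A back-of-the-envelope model of that example in Go, with assumed (not measured) RTTs purely for illustration:

```go
// A sketch only: rough write latency and remote latch hold time for the
// example above, under assumed network numbers.
package sketch

import "time"

func exampleWriteLatency() (total, latchHold time.Duration) {
	const (
		clientToLeaseholder = 30 * time.Millisecond // client -> us-central1 leaseholder, assumed
		wanRTT              = 60 * time.Millisecond // us-central1 <-> us-{east,west}1, assumed
		localReplication    = 2 * time.Millisecond  // replication within us-central1, assumed negligible
	)
	// Acquire remote latches (one WAN RTT), replicate locally, release latches
	// asynchronously; the release does not block the ack.
	total = clientToLeaseholder + wanRTT + localReplication
	// A remote latch is held from acquisition until the release arrives:
	// roughly one WAN RTT plus the (negligible) local replication time.
	latchHold = wanRTT + localReplication
	return total, latchHold
}
```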

Having worked out (with Nathan) pretty fully what optimistic global txns would look like, the writer-enforced proposal looks like the right choice for now. In the locally-replicated case the mechanism is "latency optimal". I also like that it's not a terribly invasive change compared to others that we discussed, and additionally, as was pointed out, it doesn't couple us more tightly into Raft.

On top of that I have a hard time figuring out where we would use optimistic global txns over a hypothetical "optimal" writer-enforced scheme (i.e. one that doesn't eat an extra RTT over the replication, but one in which we fold the timestamp cache check into the replication itself). Optimistic txns only pay the round-trip to the leaseholder once, so you could imagine a deployment which has a few very distant regions but which runs many chatty read-mostly transactions. In such a case, we could want "local reads" without paying for the replication latencies into the remote regions. However, this is more on the obscure side. In all other regular deployments, it'll be a major bummer to need to contact a remote leaseholder even if all the txn did was read locally from a learner, for example to access a reference table in an otherwise well-partitioned workload.

Also, it doesn't seem like writer-enforcement can be built "easily" outside of KV. If we tried to match writer-enforced read replicas without changing replication, we'd "denormalize" these indexes into each region, but use 1x replication (today we use 3x), plus partition where possible. Superficially, this then looks like writer-enforced consistent read replicas, except the failure mode is worse. So there's real value in baking this into KV (are there any weird scenarios in which what SQL could do is more powerful?)

@nvanbenschoten

One thing that's been concerning me recently about this proposal is what happens during writes to consistent read replica-backed reference tables, and how those writes interact with reads. My understanding is that customers use these reference tables when they want very low-latency reads globally on a small dataset. Furthermore, my understanding is that customers expect these reads to be reliably low-latency. In fact, I expect that they often build their application with this assumption backing a lot of their design decisions. For instance, they create FKs pointing at these tables and expect that to always be cheap. I'm skeptical that they'd be ok with their 1ms reads jumping up to 240ms whenever there is contention on the table.

But to ensure consistency, these reads can't avoid waiting on writes to the table if there is contention. My question is whether customers are ok with this and require consistency on these tables, or whether they would actually prefer some staleness on reads to the tables in exchange for truly reliable low-latency reads. This was actually a trick we used to rely on heavily - use AS OF SYSTEM TIME to avoid conflicts between reads and writes. In today's world, this is even more of an interesting alternative, because doing so allows for follower reads! So before continuing on with the idea that consistent read replicas are the right solution for reference tables, I'd like to take an idea from 2018 @andreimatei and propose that we consider the alternative of a reference table being a table that's fully replicated and where all reads are sufficiently stale such that they always use follower reads. This would be trading consistency on these tables for guaranteed low latency, which might actually be the preferred trade-off. This draws parallels to our implementation of SQL sequences, which also break consistency to avoid blocking.

Optionally, updates to these tables could simply wait out this staleness before returning (think spanner-style commit wait) to ensure single-key linearizability (recapturing consistency... I think?). That would help in the case where a user inserts into a reference table, accepting that it will be slow, and immediately wants to insert a row with an FK pointing at the new reference table row.
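A minimal Go sketch of that commit-wait idea, with an illustrative staleness bound (the constant and names are invented; the consistency limits of this approach are discussed further down the thread):

```go
// A sketch only: after committing a reference-table write at commitTS, hold
// the acknowledgment until the configured read staleness (plus max clock
// offset) has passed, in the spirit of Spanner's commit wait.
package sketch

import "time"

const referenceReadStaleness = 500 * time.Millisecond // assumed staleness bound for reference-table reads

func ackAfterCommitWait(commitTS time.Time, maxClockOffset time.Duration) {
	wait := time.Until(commitTS.Add(referenceReadStaleness + maxClockOffset))
	if wait > 0 {
		time.Sleep(wait)
	}
}
```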

That all said, my concern might be alleviated by digging more into contention on these reference tables. Even if there are concurrent reads and writes on a table, that doesn't mean that the reads and writes necessarily conflict. For instance, if all reads are part of foreign key checks then it's likely that the concurrent reads and writes are always to disjoint sets of keys. I think we should explore this a little bit further.

cc. @andy-kimball


danhhz commented Nov 20, 2019

Optionally, updates to these tables could simply wait out this staleness before returning (think spanner-style commit wait) to ensure single-key linearizability (recapturing consistency... I think?). That would help in the case where a user inserts into a reference table, accepting that it will be slow, and immediately wants to insert a row with an FK pointing at the new reference table row.

This is really interesting. How invasive would this be?

@ajwerner

propose that we consider the alternative of a reference table being a table that's fully replicated and where all reads are sufficiently stale such that they always use follower reads.

What do you have in mind for the UX here?

@ajwerner

What do you have in mind for the UX here?

Discussed offline. We'd just make this the behavior for all reads on the table and then disallow mixing writes and reads or force the reads to refresh up to "now" or something like that.

@nvanbenschoten

We'd just make this the behavior for all reads on the table and then disallow mixing writes and reads or force the reads to refresh up to "now" or something like that.

Yeah, I think this would be a schema-level property.

How invasive would this be?

I think it would be as straightforward as running all reads on the table in a separate transaction that's sufficiently stale. Certainly less invasive than consistent read replicas in their entirety.

@nvanbenschoten

@ajwerner points out that this solves the reference table use case but not the load-balancing consistent reads use case. That's not a deal-breaker, but it's something to keep in mind.

@andy-kimball

In the past, I've thought about committing reference table writes "in the future" so that reads are very unlikely to block (say 500ms in the future). I like the idea to then wait out that interval so that we don't give up consistency.

@nvanbenschoten

I've thought about committing reference table writes "in the future" so that reads are very unlikely to block (say 500ms in the future). I like the idea to then wait out that interval so that we don't give up consistency.

Andrei and I had the exact same thought yesterday. Committing them in the present and making all reads stale seems to have the same effect without being as invasive. The waiting period would be required either way to avoid losing consistency.

@andreimatei

Optionally, updates to these tables could simply wait out this staleness before returning (think spanner-style commit wait) to ensure single-key linearizability (recapturing consistency... I think?).

This "recapturing of consistency" I don't think quite happens without any waiting on the reading side in the face of clock skew, unfortunately. When two replicas have clocks out of sync, the read timestamps they use will cause a particular value to be returned by a fast-clock replica before a slow-clock replica. Unless, as Nathan points out, we use an uncertainty window for the reads and have replicas wait out that uncertainty on the read path when they see an uncertain value. But, of course, that defeats the whole point of not blocking reads.
It seems to me that we can't have it all - clock skew, consistent reads, no synchronization between replicas on the read path.

That all said, my concern might be alleviated by digging more into contention on these reference tables. Even if there are concurrent reads and writes on a table, that doesn't mean that the reads and writes necessarily conflict. For instance, if all reads are part of foreign key checks then it's likely that the concurrent reads and writes are always to disjoint sets of keys. I think we should explore this a little bit further.

This is where I think my money is. As Ben pointed out in the meeting the other day, append-only tables probably don't lead to any contention.


tbg commented Dec 6, 2019

@andreimatei could you explain this a bit more? If I'm inserting into a reference table and delay returning to the client by MaxOffset+MaxAllowedReferenceRead, the next transaction I start - anywhere on the cluster - is guaranteed to have a timestamp that will observe my previous insert into the reference table even when shifted backwards by up to MaxAllowedReferenceRead, and I shouldn't ever need to restart due to uncertainty on that table either (as long as I know that the inserter did the waiting). Am I missing something?


andreimatei commented Dec 6, 2019

If I'm inserting into a reference table and delay returning to the client by MaxOffset+MaxAllowedReferenceRead, the next transaction I start - anywhere on the cluster - is guaranteed to have a timestamp that will observe my previous insert into the reference table

The next transaction you start is fine. Transactions that start after the client gets the response indeed will see the row unequivocally. But what about transactions that don't wait for that delayed response?
So, the scenario is:

  • txn W inserts a row in a ref table with timestamp 100. Readers read 20 in the past for this table. The write is not acknowledged to the respective client for the remainder of this scenario (and unfortunately that doesn't help).
  • reader r1 comes at timestamp 120, deducts 20 and performs a read @100, seeing the new row.
  • reader r2 comes along at timestamp 119 (later in real time, but his timestamp was assigned by a slow clock). r2 does not see the row. This is inconsistent.
    Right?

A way to eliminate the inconsistency is to have r1 not return the row until its certain that everybody's clock advanced past 120 (thus, wait out the MaxClockOffset on the read). But that defeats the purpose of the scheme.


tbg commented Dec 9, 2019

Good point, that is unfortunately a problem, at least as long as we provide causal consistency without passing the explicit token around.

@mwang1026 mwang1026 added this to Triage in [DEPRECATED] CDC Feb 14, 2020
@mwang1026 mwang1026 removed this from Triage in [DEPRECATED] CDC Feb 14, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 18, 2020
This commit pulls the replica leaseholder check on the read and write
evaluation paths above concurrency manager sequencing. This change makes
semantic sense because the contents of the concurrency manager and its
lockTable are a reflection of the state of the range, so they should
only be exposed to requests if the request is evaluating on a
leaseholder replica or has otherwise. Up to this point, this hasn't been
a real issue because the lockTable was purely a best-effort representation
of the state of a range.

There are a few strong reasons to make this change:
- it makes it easier to fix cockroachdb#46148, a current release blocker
- it allows the MaxTimestamp of transactions to be bound before sequencing
  in the lockTable. This has the effect of ensuring that all transaction
  pushes are to a timestamp less than or equal to the current node's HLC
  clock.
- it seems like the order we would want things in if/when we begin to
  address consistent read replicas (cockroachdb#39758), where the concurrency manager
  sequencing step becomes a distributed operation that depends on the state
  of the current lease.

One concern here is that the concurrency manager sequencing can take an
arbitrarily long amount of time because it blocks on conflicting locks.
It would be unfortunate if this led to an increased number of rejected
proposals due to lease sequence mismatches. To address this, the commit
makes two small changes that interact with each other:
1. it performs a best-effort check before evaluating and proposing a
   write that the lease it will be proposing under is still active. This
   check is similar to the best-effort context cancellation check we
   perform before proposing.
2. it adds a retry loop for NotLeaseHolderErrors hit by requests into
   executeBatchWithConcurrencyRetries. We've had a retry loop like this
   a few times in the past (most recently removed in cockroachdb#35261). Now that
   we're proactively checking the lease before proposing, it seems like
   a good idea to retry on the replica itself when possible.

I could be convinced out of one or both of these changes.
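For readers skimming the commit message, here is a simplified Go shape of the two mitigations it describes. The types and function names are illustrative placeholders, not the actual CockroachDB code.

```go
// A sketch only: (1) a best-effort lease check after sequencing but before
// proposing, and (2) a local retry loop when a lease-related error comes back.
package sketch

import "errors"

var errNotLeaseholder = errors.New("not leaseholder")

type replica interface {
	leaseStillActive() bool    // best-effort check, analogous to (1) above
	evaluateAndPropose() error // may return errNotLeaseholder
}

func executeWithRetries(r replica, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		if !r.leaseStillActive() {
			return errNotLeaseholder // don't propose under a lease we know is stale
		}
		err := r.evaluateAndPropose()
		if !errors.Is(err, errNotLeaseholder) {
			return err // success, or an unrelated error
		}
		// Lease changed underneath us; re-sequence and retry on this replica.
	}
	return errNotLeaseholder
}
```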

@andreimatei andreimatei left a comment


Well, if you want to kill my project, you gotta say more than this :)
Like, acknowledge the schema change issues, the fault tolerance subtleties and the storage cost (ideally on the "duplicate index" pattern section). Also, load-balancing reads within a single is an important use-case too, and that one doesn't seem to be served by duplicate indexes at all.
Or would you rather we talk directly?

and people are doing so successfully

I might call citation needed on this one. We know that many people want to read from multiple nodes (as an anecdote, even last week a customer was asking me about it); out of these "many people", how many are using duplicate indexes successfully? It's partly an honest question cause I don't know the answer myself, but I just don't believe there's any significant numbers here because there's just too many caveats to the implementation. I think duplicate indexes make for a cool demo, but not much more. Even when someone is using duplicate indexes, we also need to ask the question about how many regions they're running in - because I claim the two solutions scale differently.
To be clear, I don't think looking at duplicate indexes as they are today is a fair comparison - we need to imagine duplicate indexes + new work. You say that the problems "can generally be addressed in non-intrusive ways". I see that very differently, so I'm curious to hear what you'd do for them and perhaps some hints about how you'd do it.


@andy-kimball

We could bring in Product on this, but my impression is that customers see Duplicate Indexes as a hack, because it puts so much burden on them. If they add or remove a column, they have to update all the indexes. If they add or remove a region, they have to add or remove an index. The query plans that get generated are rather confusing. They'd get even more confusing if we're implicitly creating these duplicate indexes as part of magic multi-region syntax.

That leads me to the next point, which is that trying to automate all of these complex schema changes is going to be a huge complication in the SQL layer that will be as complex or more so than the complication of doing this in the replication layer. So I don't subscribe to the argument that we shouldn't do this at the replication layer b/c it adds complexity without enough benefit. If we want to make multi-region easy for customers, we've got to accept significant new complexity in our code somewhere. So then it's just a question of where to do so. IMO, the customer experience will be much better, and the architecture cleaner, if we add this complexity to replication, so that's where I think it should go.

@andy-kimball

One other related point: CRDB's key value prop is the ability to scale out geographically. I think we need to be committed to providing the very best, most advanced algorithms to make this work as well as possible from an end-user point of view. Our secondary concern is about the attendant complexity this brings to our codebase. Best-in-class multi-region support is going to pay our bills for the next decade or more; I don't want to compromise and accept something second-best.


@bdarnell bdarnell left a comment


I might call citation needed on this one

Fair enough. My impression is that it's a useful tool in the SEs' toolbox, but maybe it's more demo-able than production-ready. I'm not sure how much real usage it's getting.

Anyway, it's true that duplicate indexes is a hack and needs a lot of work to make it more usable (if we're not going to make it obsolete). My point is that complexity at the transaction and replication layers is a very precious thing - there's a big tangle of things that all have to work together, and the size and complexity of this tangle limits our ability to expand the KV team. We have to be judicious about the things we take on at this layer. Chaining schema changes together is also a significant amount of work, but its implementation can be more self-contained and doesn't intrude as much into the subtle areas of replication.

This is definitely an improvement on duplicate indexes. But I don't think it's the last word in providing fast local reads. There's still a need for something that more-or-less guarantees a local read (possibly stale) regardless of contention. If consistent read replicas aren't the ideal solution, are they worth the complexity?

One thing to consider is that there are two distinct use cases for local reads: read-only transactions that can accept stale data, and read-write transactions that must operate at the current time (especially for things like FK and uniqueness checks). Maybe this could be the pinnacle of evolution for the consistent local reads case while the missing piece is for stale local reads.



@nvanbenschoten nvanbenschoten left a comment


I sympathize with all of the shortcomings of the duplicate index pattern. I'd very much like us to avoid investing further in that direction.

That said, after refreshing myself on all of this, I'm starting to get concerned again about the interaction between contending reads and writes and our ability to guarantee low latency reads on these tables (essentially #39758 (comment)). I think @andreimatei is too, which is why he mentioned "talk about FK checks not needing to conflict with latches". Our offline discussions about this in the past have gone in two directions:

  1. support more expressive access patterns in the KV API to create opportunities to avoid false contention. This discussion has primarily revolved around sql/kv: support FOR NO KEY UPDATE and FOR KEY SHARE #52420.
  2. reduce the consistency of read-only operations on these tables by exploiting staleness in one way or another. This has more tradeoffs than option 1, but it is also applicable to more situations.

I'd like to see us discuss this further. EDIT: I see Ben just posted and it sounds like he agrees.



docs/RFCS/20190819_consistent_read_replicas.md, line 263 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This feels like a step in the wrong direction: we'd like to start sending snapshots from the nearest replica (regardless of where the leader/lease is) to make more efficient use of network bandwidth. #42491

I agree with this.


docs/RFCS/20190819_consistent_read_replicas.md, line 288 at r3 (raw file):

Don't we already have sufficient mechanisms to ensure that dropped proposals get retried until they succeed or there's a change of lease?

We should, yes.

It strikes me that this proposal starts to look a little bit like the closed timestamp mechanism if I squint hard enough. Instead of maintaining a [minLatchSeq, curLatchSeq) window, closed timestamps maintain a [closedTS, curTS) window. Follower reads work because a follower knows that any in-flight write below the closedTS will necessarily be rejected by the LAI check below Raft. And we already have a transport mechanism to broadcast closed timestamp information to replicas in a Range.

So I wonder if we could use the existing closed timestamp mechanism for this. Here's a strawman proposal:

  1. after acquiring latches but before issuing a RemoteLatchRequest, begin tracking the write with the leaseholder's ClosedTimestamp tracker.
  2. grab a clock reading off the leaseholder's HLC clock. Call this the maxProposalTS. This will be larger than the tracker's provided minTS.
  3. broadcast the RemoteLatchRequests, including this new maxProposalTS
  4. on each consistent read replica, associate latches with their maxProposalTS
  5. evaluate and propose the request
  6. prevent reproposals if doing so would violate this maxProposalTS. Similar to this check.
  7. on each consistent read replica, when the closed timestamp is advanced, drop all latches whose maxProposalTS is below the current closed timestamp.
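A small Go sketch of steps 4 and 7 of that strawman, purely to make the replica-side bookkeeping concrete; the types are simplified placeholders.

```go
// A sketch only: each read replica tracks the maxProposalTS attached to a
// remote latch and drops the latch once the closed timestamp passes it.
package sketch

type Timestamp int64

type remoteLatch struct {
	spansID       int64     // identifies the latched spans (placeholder)
	maxProposalTS Timestamp // the leaseholder promises not to propose above this
}

type readReplica struct {
	latches []remoteLatch
}

// onClosedTimestampUpdate releases every remote latch whose maxProposalTS is
// below the newly closed timestamp: any write under such a latch has either
// already applied or can no longer be proposed, so local reads may proceed.
func (r *readReplica) onClosedTimestampUpdate(closedTS Timestamp) {
	kept := r.latches[:0]
	for _, l := range r.latches {
		if l.maxProposalTS >= closedTS {
			kept = append(kept, l)
		}
	}
	r.latches = kept
}
```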

docs/RFCS/20190819_consistent_read_replicas.md, line 122 at r4 (raw file):

have to remember to delete or add 20 indexes. Tall order. And it gets better:
say you want to add a column to one of the tables. If, naively, you just do your
`ALTER TABLE` and add that column to the primary key, then, if you had any

This is a compelling example.


docs/RFCS/20190819_consistent_read_replicas.md, line 281 at r4 (raw file):

same path that it normally does: it would acquire latches from its spanlatch
manager, it would evaluate against the local state of the storage engine, and
then it would bump the local timestamp cache. If a replicated lock is 

This sentence cuts off.


docs/RFCS/20190819_consistent_read_replicas.md, line 339 at r4 (raw file):

leaseas

docs/RFCS/20190819_consistent_read_replicas.md, line 451 at r4 (raw file):

replicas which continue serving while the transfer is ongoing. It also gives us
a natural mechanism for informing kvclients of the read replicas by using the
existing mechanism for keeping the range caches up to date.

+1, those are all very strong reasons to add them to the lease.

@nvanbenschoten

I'm not convinced that #39758 (comment) is a blocker for #39758 (comment). To steal some of the terminology from https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels, I don't think it's possible to provide strong consistency without coordination between reads and writes, which necessarily implies reads blocking on writes in some cases. So if we want to guarantee that reads never block on writes, we need to lower the consistency level.

The proposal in #39758 (comment) seems to provide a mix between bounded staleness and session consistency. Reads on the table are stale by up to a bounded duration but still provide session guarantees within sessions that performed writes to the table. That's still a relatively strong consistency level as it retains ordering between writes (monotonic writes), places an upper bound on staleness, and guarantees read-your-writes.

There are clear downsides to this. For one, we're giving up single-key linearizability, as you identified (so the recapturing consistency... I think? was definitely wrong). But that doesn't necessarily mean we shouldn't consider that alternative. I think it's actually close to what you would want for one of these reference tables in a system that offers tunable consistency.

@nvanbenschoten
Member Author

we use an uncertainty window for the reads and have replicas wait out that uncertainty on the read path when they see an uncertain value. But, of course, that defeats the whole point of not blocking reads.

@andy-kimball and I talked about this a little more offline. The one point to add here is that if we got our uncertainty window down far enough (uncertainty < WAN RTT) then this might actually be a real solution that doesn't force us to compromise on strong consistency. So it's another case, like we saw in the optimistic global transaction proposal, where we can trade coordination for waiting out clock synchronization to maintain linearizability.
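
To put rough, purely illustrative numbers on that tradeoff (these are assumptions, not figures from this thread): waiting out clock uncertainty costs roughly the uncertainty bound itself, while coordinating across regions costs roughly one WAN round trip, so the wait-based approach wins whenever

$$\varepsilon \ll \mathrm{RTT_{WAN}}, \qquad \text{e.g. } \varepsilon = 1\ \mathrm{ms} \ll \mathrm{RTT_{WAN}} = 60\ \mathrm{ms}.$$
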

Contributor

@andreimatei andreimatei left a comment


This is definitely an improvement on duplicate indexes. But I don't think it's the last word in providing fast local reads. There's still a need for something that more-or-less guarantees a local read (possibly stale) regardless of contention. If consistent read replicas aren't the ideal solution, are they worth the complexity?
One thing to consider is that there are two distinct use cases for local reads: read-only transactions that can accept stale data, and read-write transactions that must operate at the current time (especially for things like FK and uniqueness checks). Maybe this could be the pinnacle of evolution for the consistent local reads case while the missing piece is for stale local reads.

I want to add a further nuance to the discussion: there's a difference between a read that's "non-local" because it needs to perform a (successful) push and a read that takes an arbitrary amount of time because it needs to wait for another transaction to succeed. In other words, I think that high-priority follower reads are a useful tool, and one we should advertise. In the one conversation I've had with a customer concerned about the impact of writes on follower reads, it seemed to me that the fear was that reads can simply starve when there's a long-running ETL job in progress, not so much that every now and then a read will be "non-local".
That's why I personally am not so impressed by the need to "guarantee" the latency for reads. I think it's good enough if we guarantee they don't block arbitrarily long. I could be wrong.

And btw, we're working on something for stale reads - the long-lived learner replicas which should let you conveniently serve stale reads from all localities, and also scale up these reads.
If you buy that people want high-priority follower reads, what would further help there is improvements that let writing transactions commit even when they're constantly pushed. There are various ideas in this area.

There are clear downsides to this. For one, we're giving up single-key linearizability, as you identified (so the recapturing consistency... I think? was definitely wrong). But that doesn't necessarily mean we shouldn't consider that alternative. I think it's actually close to what you would want for one of these reference tables in a system that offers tunable consistency.

One thing that I think we need to pay attention to is not losing referential integrity. I think this is why this line of thinking was rebuffed back in the day by @andy-kimball, who raised concerns about optimizer transformations that become illegal when referential integrity is broken. I think the idea of writers waiting out the "staleness period" is part of the answer - it reduces the cases where a reader can observe a child but not observe the parent (if the parent is in one of these reference tables). But I think there's a question remaining about how to handle parents and children written in the same transaction. Do we disallow them? If we don't (i.e. if we allow a child and a parent to be written at the same timestamp), then we have the problem - the child is immediately visible but the parent isn't.
If we did want to disallow such transactions, then what exactly would we disallow? Would we disallow reading your own write on a reference table within a transaction? That seems nasty. Would we disallow transactions that write to both reference tables and non-reference tables? Maybe that's palatable...

If we figure out the referential integrity part, then the next problem is, as you say, losing single-key linearizability. In other words, while the writer is waiting out the staleness, there will come a moment when a gateway returns the new data and another one doesn't. I think this can be avoided if we introduce uncertainty into the reads on the reference table. If the uncertainty interval were really small (much smaller than the staleness period), then we could use it to avoid the inconsistency I think.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, @nvanbenschoten, and @tbg)

Contributor

@awoods187 awoods187 left a comment


I think that this is a cool project with many interesting ideas, but I'm not convinced the ROI is here from a customer viewpoint, and, in particular, I think the opportunity cost is too high. Customers are widely using the duplicate index pattern and in many cases prefer it to geo-partitioning and other available patterns. In our user research in January, the majority of users were happy with the performance and outcome of the duplicate index pattern, if not its ergonomics. For example, CK wrote, when asked about their preference between latency and cost, "Latency is a blocker for them while the cost is less important. Further, they’ve found the cost to be more based on CPU than storage, so they do not mind more replication if it doesn’t similarly increase CPU usage." They don't mind the duplicative storage cost of the duplicate index pattern. I think if we take steps to make this pattern more declarative and take care of the complexity of adding and removing regions automatically, we would remove the real burden of this pattern. In some future release, we could consider returning to this idea, but I'm not convinced that the value to customers is sufficient when compared to the significant complexity outlined in this proposal. I'm more keen to improve the ergonomics of the current pattern, and to implement non-voting replicas and/or reduced follower read timestamps. Nothing stops us from declaring a table to be "read latency focused" and having that use duplicate indexes today and an enhancement like this in the future.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, @nvanbenschoten, and @tbg)


docs/RFCS/20190819_consistent_read_replicas.md, line 90 at r2 (raw file):

Previously, knz (kena) wrote…

I think we'd cover some good ground with the distinction between "async. read replicas" and "sync. read replicas"

I'm a +1 on ben's comment. We will need to spend some time thinking up a good name.


docs/RFCS/20190819_consistent_read_replicas.md, line 106 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I see what you're saying about the leaseholder also serving consistent reads, but I don't think the term "read leaders" helps.

authorized reader replica? read leaseholder? I kind of like secondary read leaders


docs/RFCS/20190819_consistent_read_replicas.md, line 118 at r4 (raw file):

### Problem 1: Ergonomics

So you’ve got 15 regions and 20 of these global tables. You’ve created 15*20=300

Nit: the first sentence is a bit too conversational.


docs/RFCS/20190819_consistent_read_replicas.md, line 122 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a compelling example.

It is compelling, but it's worth noting that right now something like 5 regions is the maximum customers use. It's much more likely we would have many more than 20 global tables, as some customers have 1000s of tables.


docs/RFCS/20190819_consistent_read_replicas.md, line 141 at r4 (raw file):

changes. That alone seems like major unwanted pollution. And then there's the
issue of wanting stronger consistency between the indexes, not just eventual
consistency; I'm not sure how achievable that is.

I think this problem could be solved in part by making this pattern more declarative. If the user declared regions (and/or we automatically detected them from CC), and the user further declared a table as using duplicate indexes, we could be smart enough to know that adding or removing a region needs to trigger a schema change. Of course we need to be able to handle failures, rollbacks, and other edge cases, but I think we are likely to work in this direction in the upcoming release.


docs/RFCS/20190819_consistent_read_replicas.md, line 151 at r4 (raw file):

is that when one region goes down (you’ve got 15 of them, they can’t all be
good) you lose write-availability for the table (because any writes needs to
update all the indexes). So that's no good.

Region failure is incredibly uncommon. It's worth solving for, but it will by no means be a common occurrence.


docs/RFCS/20190819_consistent_read_replicas.md, line 164 at r4 (raw file):

only set up one replica per region).  

So, all replicas in one region is no good, a single replica in one region is not

Why can't you pin the leaseholder replica to that region? We expect pinning to be a hard constraint that isn't ever violated.


docs/RFCS/20190819_consistent_read_replicas.md, line 179 at r4 (raw file):

pay a lot to duplicate storage and replication work just to maintain the
independent availability of each index.

I think these concerns are true of the current implementation but could be addressed in the declarative syntax changes referenced above.


docs/RFCS/20190819_consistent_read_replicas.md, line 592 at r4 (raw file):

The major drawbacks of this approach are that it is somewhat complex, it
introduces a latency penalty on writes to Ranges with one or more read replicas,
and it requires an explicit configuration decision.

I think there is an opportunity cost argument here as well. This seems like an incredibly complicated investment that offers only somewhat marginal improvements for the end user. I'm making a "not now" rather than a "never" argument.


docs/RFCS/20190819_consistent_read_replicas.md, line 692 at r4 (raw file):

  transaction instead of once per statement
- Optimistic, may cause transaction restarts and severely changes our approach
  to concurrency control

I think any solution that introduces more retries is a non-starter. I'm particularly concerned by the silent magic this option relies on; it violates a number of expectations I'd have about CRDB.


docs/RFCS/20190819_consistent_read_replicas.md, line 854 at r4 (raw file):

*  all assuming no restarts due to contention
** see NOTE above

I like this option

@nvanbenschoten
Member Author

That's why I personally am not so impressed by the need to "guarantee" the latency for reads. I think it's good enough if we guarantee they don't block arbitrarily long. I could be wrong.

I disagree with this. There's a very real case for not being able to tolerate any WAN latencies on reads, not even the latency required to perform a high-priority push. So I don't see this nuance as particularly meaningful. However...

After discussions with @andreimatei and @andy-kimball and a re-read of the Spanner paper to make sure we weren't missing anything about what you can get out of 0 uncertainty, I'm less convinced that we should push this in the direction of eliminating blocking in the presence of contention. While there's a very real need for such an offering, it creates a lot of confusion to try to hide it behind a proposal like #39758 (comment). Instead, we should double-down on encouraging users to use the exact staleness (and bounded staleness, once we implement that) consistency level provided by AOST txns. That's essentially the solution we ended up deriving in #39758 (comment) anyway and it causes all kinds of complexity to try to hide the staleness in the schema. It also already enables follower reads out of the box, and we've made big improvements in this area recently.

So I think I've come to terms with the fact that consistent read replicas will still block under read-write contention, and that that's just the price you pay for strong consistency. Most of the use cases we've talked about shouldn't actually see much contention on these tables, and we can try to fight the cases that do through more general improvements like #52420, where this contention is artificial.
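
For reference, here is a minimal sketch of the kind of exact-staleness (AOST) read being encouraged above, issued from a Go client via the lib/pq driver. The connection string, the `users` table, and the 10-second staleness are placeholder assumptions, not anything specified in this thread.

```go
// Exact-staleness read using AS OF SYSTEM TIME. If the chosen timestamp is
// old enough to fall under the range's closed timestamp, the read can be
// served by a follower replica without contacting the leaseholder.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin the read to a fixed timestamp 10 seconds in the past.
	rows, err := db.Query(`SELECT id, name FROM users AS OF SYSTEM TIME '-10s'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var name string
		if err := rows.Scan(&id, &name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, name)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```
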

Consistent Read Replicas provide a mechanism through which follower replicas in
a Range can be used to serve reads for **non-stale** read-only and read-write
transactions.

The ability to serve reads from follower replicas is beneficial both because it
can reduce wide-area network jumps in geo-distributed deployments and because it
can serve as a form of load-balancing for concentrated read traffic. It may also
provide an avenue to reduce tail latencies in read-heavy workloads, although
such a benefit is not a focus of this RFC.

_The purpose of this RFC is to introduce an approach to consistent read replicas
that I think could be implemented in CockroachDB in the medium-term future. It
takes inspiration from Andy Kimball and the AcidLib project. I'm hoping for
this to spur a discussion about potential solutions to the problem and also
generally about our ambitions in this area going forward.

The most salient design decision from this proposal is that it remains
completely separate from the Raft consensus protocol. It does not interact with
Raft, instead operating in the lease index domain. This is similar in spirit to
our approach to implementing closed timestamps.

The RFC includes three alternative approaches that each address part of these
issues, but come with their own challenges and costs. Unlike with most RFCs, I
think it's very possible that we end up preferring one of these alternatives
over the main proposal in the RFC._

Release note: None
Contributor

@andreimatei andreimatei left a comment


@awoods187 thanks for the comments; I'll think them over and respond.

@bdarnell I've got a new strawman scheme that I want to run past you. It does away with remote latches, which were a major source of complexity, and replaces them somewhat with regular replication. It requires some more complexity in the client, but that seems a lot more manageable than complexity in the leaseholder. PTAL at the Application-time latching section.
I've also added a small miscellaneous section comparing this proposal to the quorum leases paper, since you've brought it up again.

In the meantime, Nathan is chasing the non-blocking writes thread some more; there's some new thinking there. As he well put it, we're kinda converging on two options, one using regular communication (this RFC), the other one using tightly-synchronized clocks for some implicit communication.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @knz, @nvanbenschoten, and @tbg)

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 13, 2020
Non-blocking transactions are a variant of CockroachDB's standard transaction
protocol that are optimized for writing to read-mostly or read-only (excluding
maintenance events) data. The transaction protocol and the replication schema
that it is paired with differ from standard read-write transactions in two
important ways:
- non-blocking transactions support a replication scheme over the Ranges that
  they operate on which allows all followers to perform **non-stale** follower
  reads.
- non-blocking transactions are **minimally disruptive** to reads over the data
  that they modify, even in the presence of read/write contention.

The ability to serve reads from follower replicas is beneficial both because it
can reduce read latency in geo-distributed deployments and because it can serve
as a form of load-balancing for concentrated read traffic in order to reduce
tail latencies. The ability to serve **non-stale** follower reads makes the
functionality applicable to a far larger class of read-only and read-write
transactions.

The ability to perform writes on read-heavy data without causing conflicting
reads to block for long periods of time is beneficial for providing predictable
read latency. When customers ask for the `READ COMMITTED` isolation level, this
is what they are actually asking for.

This proposal serves as an alternative to the [Consistent Read Replicas
proposal](cockroachdb#39758). Whereas the
Consistent Read Replicas proposal enforces consistency through communication,
this proposal enforces consistency through semi-synchronized clocks with bounded
uncertainty.
@nvanbenschoten
Copy link
Member Author

In the meantime, Nathan is chasing the non-blocking writes thread some more; there's some new thinking there. As he well put it, we're kinda converging on two options, one using regular communication (this RFC), the other one using tightly-synchronized clocks for some implicit communication.

I synthesized this alternate proposal in #52745, which draws from conversations I've had over the past few weeks with @andy-kimball, @andreimatei, and @tbg. The end result is both simpler than where this proposal ended up and more powerful, though it would be a longer-term endeavor because it requires us to invest in clock synchronization. It also turned out to parallel the optimistic global transaction proposal much more closely than I was expecting while avoiding the worst part of that proposal. I'm curious to get people's thoughts.

Also, I'm realizing that we are a week away from the 1 year anniversary of this RFC. Looking forward to v3 in 2021!

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 14, 2020
Non-blocking transactions are a variant of CockroachDB's standard transaction
protocol that are optimized for writing to read-mostly or read-only (excluding
maintenance events) data. The transaction protocol and the replication schema
that it is paired with differ from standard read-write transactions in two
important ways:
- non-blocking transactions support a replication scheme over the Ranges that
  they operate on which allows all followers to perform **non-stale** follower
  reads.
- non-blocking transactions are **minimally disruptive** to reads over the data
  that they modify, even in the presence of read/write contention.

The ability to serve reads from follower replicas is beneficial both because it
can reduce read latency in geo-distributed deployments and because it can serve
as a form of load-balancing for concentrated read traffic in order to reduce
tail latencies. The ability to serve **non-stale** follower reads makes the
functionality applicable to a far larger class of read-only and read-write
transactions.

The ability to perform writes on read-heavy data without causing conflicting
reads to block for long periods of time is beneficial for providing predictable
read latency. When customers ask for the `READ COMMITTED` isolation level, this
is what they are actually asking for.

This proposal serves as an alternative to the [Consistent Read Replicas
proposal](cockroachdb#39758). Whereas the
Consistent Read Replicas proposal enforces consistency through communication,
this proposal enforces consistency through semi-synchronized clocks with bounded
uncertainty.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 31, 2020
Non-blocking transactions are a variant of CockroachDB's standard read-write
transaction protocol that permit low-latency, global reads of read-mostly and
read-only (excluding maintenance events) data. The transaction protocol and the
replication schema that it is paired with differ from standard read-write
transactions in two important ways:
- non-blocking transactions support a replication scheme over Ranges that they
  operate on which allows all followers in these Ranges to serve **consistent**
  (non-stale) follower reads.
- non-blocking transactions are **minimally disruptive** to reads over the data
  that they modify, even in the presence of read/write contention.

The ability to serve reads from follower and/or learner replicas is beneficial
both because it can reduce read latency in geo-distributed deployments and
because it can serve as a form of load-balancing for concentrated read traffic
in order to reduce tail latencies. The ability to serve **consistent**
(non-stale) reads from any replica in a Range makes the functionality accessible
to a larger class of read-only transactions and accessible for the first time to
read-write transactions.

The ability to perform writes on read-heavy data without causing conflicting
reads to block is beneficial for providing predictable read latency. Such
predictability is doubly important in global deployments, where the cost of
read/write contention can delay reads for 100's of ms as they are forced to
navigate wide-area network latencies in order to resolve conflicts.

These properties combine to prioritize read latency over write latency for some
configurable subset of data, recognizing that there exists a sizable class of
data which is heavily skewed towards read traffic.

Non-blocking transactions are provided through extensions to existing concepts
in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes,
closed timestamps, learner replicas) and compose with CockroachDB's standard
transaction protocol intuitively and effectively.

This proposal serves as an alternative to the [Consistent Read Replicas
proposal](cockroachdb#39758). Whereas the
Consistent Read Replicas proposal enforces consistency through communication,
this proposal enforces consistency through semi-synchronized clocks with bounded
uncertainty.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 31, 2020
Non-blocking transactions are a variant of CockroachDB's standard read-write
transaction protocol that permit low-latency, global reads of read-mostly and
read-only (excluding maintenance events) data. The transaction protocol and the
replication schema that it is paired with differ from standard read-write
transactions in two important ways:
- non-blocking transactions support a replication scheme over Ranges that they
  operate on which allows all followers in these Ranges to serve **consistent**
  (non-stale) follower reads.
- non-blocking transactions are **minimally disruptive** to reads over the data
  that they modify, even in the presence of read/write contention.

The ability to serve reads from follower and/or learner replicas is beneficial
both because it can reduce read latency in geo-distributed deployments and
because it can serve as a form of load-balancing for concentrated read traffic
in order to reduce tail latencies. The ability to serve **consistent**
(non-stale) reads from any replica in a Range makes the functionality accessible
to a larger class of read-only transactions and accessible for the first time to
read-write transactions.

The ability to perform writes on read-heavy data without causing conflicting
reads to block is beneficial for providing predictable read latency. Such
predictability is doubly important in global deployments, where the cost of
read/write contention can delay reads for 100's of ms as they are forced to
navigate wide-area network latencies in order to resolve conflicts.

These properties combine to prioritize read latency over write latency for some
configurable subset of data, recognizing that there exists a sizable class of
data which is heavily skewed towards read traffic.

Non-blocking transactions are provided through extensions to existing concepts
in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes,
closed timestamps, learner replicas) and compose with CockroachDB's standard
transaction protocol intuitively and effectively.

This proposal serves as an alternative to the [Consistent Read Replicas
proposal](cockroachdb#39758). Whereas the
Consistent Read Replicas proposal enforces consistency through communication,
this proposal enforces consistency through semi-synchronized clocks with bounded
uncertainty.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 17, 2020
Non-blocking transactions are a variant of CockroachDB's standard read-write
transaction protocol that permit low-latency, global reads of read-mostly and
read-only (excluding maintenance events) data. The transaction protocol and the
replication schema that it is paired with differ from standard read-write
transactions in two important ways:
- non-blocking transactions support a replication scheme over Ranges that they
  operate on which allows all followers in these Ranges to serve **consistent**
  (non-stale) follower reads.
- non-blocking transactions are **minimally disruptive** to reads over the data
  that they modify, even in the presence of read/write contention.

The ability to serve reads from follower and/or learner replicas is beneficial
both because it can reduce read latency in geo-distributed deployments and
because it can serve as a form of load-balancing for concentrated read traffic
in order to reduce tail latencies. The ability to serve **consistent**
(non-stale) reads from any replica in a Range makes the functionality accessible
to a larger class of read-only transactions and accessible for the first time to
read-write transactions.

The ability to perform writes on read-heavy data without causing conflicting
reads to block is beneficial for providing predictable read latency. Such
predictability is doubly important in global deployments, where the cost of
read/write contention can delay reads for 100's of ms as they are forced to
navigate wide-area network latencies in order to resolve conflicts.

These properties combine to prioritize read latency over write latency for some
configurable subset of data, recognizing that there exists a sizable class of
data which is heavily skewed towards read traffic.

Non-blocking transactions are provided through extensions to existing concepts
in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes,
closed timestamps, learner replicas) and compose with CockroachDB's standard
transaction protocol intuitively and effectively.

This proposal serves as an alternative to the [Consistent Read Replicas
proposal](cockroachdb#39758). Whereas the
Consistent Read Replicas proposal enforces consistency through communication,
this proposal enforces consistency through semi-synchronized clocks with bounded
uncertainty.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Oct 27, 2020
Non-blocking transactions are a variant of CockroachDB's standard read-write
transaction protocol that permit low-latency, global reads of read-mostly and
read-only (excluding maintenance events) data. The transaction protocol and the
replication schema that it is paired with differ from standard read-write
transactions in two important ways:
- non-blocking transactions support a replication scheme over Ranges that they
  operate on which allows all followers in these Ranges to serve **consistent**
  (non-stale) follower reads.
- non-blocking transactions are **minimally disruptive** to reads over the data
  that they modify, even in the presence of read/write contention.

The ability to serve reads from follower and/or learner replicas is beneficial
both because it can reduce read latency in geo-distributed deployments and
because it can serve as a form of load-balancing for concentrated read traffic
in order to reduce tail latencies. The ability to serve **consistent**
(non-stale) reads from any replica in a Range makes the functionality accessible
to a larger class of read-only transactions and accessible for the first time to
read-write transactions.

The ability to perform writes on read-heavy data without causing conflicting
reads to block is beneficial for providing predictable read latency. Such
predictability is doubly important in global deployments, where the cost of
read/write contention can delay reads for 100's of ms as they are forced to
navigate wide-area network latencies in order to resolve conflicts.

These properties combine to prioritize read latency over write latency for some
configurable subset of data, recognizing that there exists a sizable class of
data which is heavily skewed towards read traffic.

Non-blocking transactions are provided through extensions to existing concepts
in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes,
closed timestamps, learner replicas) and compose with CockroachDB's standard
transaction protocol intuitively and effectively.

This proposal serves as an alternative to the [Consistent Read Replicas
proposal](cockroachdb#39758). Whereas the
Consistent Read Replicas proposal enforces consistency through communication,
this proposal enforces consistency through semi-synchronized clocks with bounded
uncertainty.
craig bot pushed a commit that referenced this pull request Dec 8, 2020
52745: rfc: Non-Blocking Transactions r=nvanbenschoten a=nvanbenschoten

[Link to text of RFC](https://github.com/cockroachdb/cockroach/pull/52745/files?short_path=ea92ee8#diff-ea92ee85a58b7b133ce5a1a8d6843113)

----

Non-blocking transactions are a variant of CockroachDB's standard read-write transaction protocol that permit low-latency, global reads of read-mostly and read-only (excluding maintenance events) data. The transaction protocol and the replication schema that it is paired with differ from standard read-write transactions in two important ways:
- non-blocking transactions support a replication scheme over Ranges that they operate on which allows all followers in these Ranges to serve **consistent** (non-stale) follower reads.
- non-blocking transactions are **minimally disruptive** to reads over the data that they modify, even in the presence of read/write contention.

The ability to serve reads from follower and/or learner replicas is beneficial both because it can reduce read latency in geo-distributed deployments and because it can serve as a form of load-balancing for concentrated read traffic in order to reduce tail latencies. The ability to serve **consistent** (non-stale) reads from any replica in a Range makes the functionality accessible to a larger class of read-only transactions and accessible for the first time to read-write transactions.

The ability to perform writes on read-heavy data without causing conflicting reads to block is beneficial for providing predictable read latency. Such predictability is doubly important in global deployments, where the cost of read/write contention can delay reads for 100's of ms as they are forced to navigate wide-area network latencies in order to resolve conflicts.

These properties combine to prioritize read latency over write latency for some configurable subset of data, recognizing that there exists a sizable class of data which is heavily skewed towards read traffic.

Non-blocking transactions are provided through extensions to existing concepts in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes, closed timestamps, learner replicas) and compose with CockroachDB's standard transaction protocol intuitively and effectively.

This proposal serves as an alternative to the [Consistent Read Replicas proposal](#39758). Whereas the Consistent Read Replicas proposal enforces consistency through communication, this proposal enforces consistency through semi-synchronized clocks with bounded uncertainty.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@tbg added the X-noremind label (bots won't notify about PRs with X-noremind) on May 6, 2021
@nvanbenschoten deleted the nvanbenschoten/readReplRFC branch on June 6, 2022 16:31