New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachpb: spell out how observed timestamps pertain to follower reads #51555
base: master
Are you sure you want to change the base?
roachpb: spell out how observed timestamps pertain to follower reads #51555
Conversation
baf846d
to
29af488
Compare
It took me a while to understand the observed timestamp property and come to the observation in the patch. Please let me know if it's incorrect. However, given that this stuff is so tricky, I felt it was worthwhile including some talk around follower reads & observed timestamps here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this stuff is confusing. I'm very glad to see that you're thinking about it.
What you have here seems correct to me, but I'm hoping we can improve it a bit more. You say that "the reduced uncertainty window only pertains to requests served by the leaseholder and not ones served by followers" which seems to be both accurate and also not saying all that much because of how ambiguous it is. What do you mean by "only pertains"? We currently call limitTxnMaxTimestamp
during follower reads and use observed timestamps off of the follower's clock. I think that's just straight-up wrong. It sounds like you've come to the same conclusion. But then the question becomes – can we use any observed timestamp when performing a follower read? Does an observed timestamp off the current leaseholder replica tell us anything? I think so, as long as we still forward the observed timestamp by the start of the current lease, but we should think through that more.
I'd also generalize the comment a little. The max offset vs. closed timestamp duration part is interesting, but it seems too specific. That doesn't really change anything about the fact that an observed timestamp from a follower replica tells us nothing about causality between a read and values in the range that the read might observe, right?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
29af488
to
7fa5923
Compare
I improved the language around this, should be clearer now.
Yup, but see the section on follower reads at the end. Currently, with the target duration being greater than max offset, at the time these follower read requests are served (which would be more than
Makes sense, but I see that you have a TODO saying exactly this: cockroach/pkg/kv/kvserver/replica_send.go Line 671 in d86e265
Do you remember if there was a reason you didn't make that change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad we're improving this comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)
pkg/roachpb/data.proto, line 396 at r1 (raw file):
// There are two invariants necessary for this property to hold: // // 1. a leaseholder's clock must always be equal to or greater than the
I find this paragraph pretty confusing; let's take the opportunity to refine it.
To understand it you have to connect a bunch of dots. It's talking about the hazard where intents have been written at timestamp 10, txn commits at ts 20, and someone observes ts 15 before the intents are finally written at ts 20. The observer of 15 might ignore the values @ 20. We protect against this hazard by ensuring that nobody can observe 15 if the writing txn committed.
It'd be good for the paragraph to spell this out, and also to have some words about specifically how we bump all the clocks before committing. Does the txnCommitter
do that? The EndTxn
batch can have the CanForwardReadTimestamp
bit set, in which case the txn's timestamp can be decided when evaluating this batch, so presumably the bumping of timestamp needs to happen on the return path, before telling the client that the writer succeeded. But I don't see it... How does this work @nvanbenschoten ?
pkg/roachpb/data.proto, line 402 at r1 (raw file):
// committing, even after writing intents on remote Ranges. To accommodate // this situation, transactions ensure that at the time of their commit, any // leaseholder for a Range that contains one of its intent has an HLC clock
s/its intent/its intents
since you're taking the blame anyway
But are you sure you want to take the blame on this paragraph? I don't think you're changed anything.
pkg/roachpb/data.proto, line 406 at r1 (raw file):
// TODO(nvanbenschoten): This is violated by txn refreshes. See #36431. // // 2. a store's clock must always be equal to or greater than the timestamp of
I find it confusing to talk about stores here. Let's just talk about replicas (of which the leaseholder is one).
pkg/roachpb/data.proto, line 449 at r1 (raw file):
// on a copy of the slice. // // ### Follower Reads:
I'd note here somewhere that commonly follower reads have no uncertainty - when coming from the as of system time
syntax. Which means that they can observe the "causal inversion" phenomenon (seeing the reply post but not the original post in a forum thread example).
pkg/roachpb/data.proto, line 451 at r1 (raw file):
// ### Follower Reads: // // If the store serving a request weren't a leaseholder for the range, we
I think there's something funny with the first statement. s/weren't/isn't ?
pkg/roachpb/data.proto, line 453 at r1 (raw file):
// If the store serving a request weren't a leaseholder for the range, we // can't always guarantee that invariant 2 stated above in our formalization // is upheld. If the closed timestamp duration is greater than the max offset,
I don't think I really follow what we're talking about here. "invariant 2" is about leases and leaseholder. It always holds. You want to talk about the followers clocks and what properties they have, right?. It seems to me that this has nothing to do with the invariants above, or the property that they give us (which is also about the leaseholder). I'd like to re-read this paragraph when it's put in that light -> how the followers' clocks differ from the leaseholders'.
No?
pkg/roachpb/data.proto, line 453 at r1 (raw file):
My intention was to make invariant 2 a little more general than this, which is why I changed
Does that sound better? The way I see it, invariant 1 only concerns leaseholders whereas invariant 2 should be more general. |
Release note: None
7fa5923
to
d66424e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you remember if there was a reason you didn't make that change?
No, I don't think there was a good reason for not addressing it.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)
pkg/roachpb/data.proto, line 396 at r1 (raw file):
It'd be good for the paragraph to spell this out, and also to have some words about specifically how we bump all the clocks before committing. Does the txnCommitter do that?
No, no one does that. That's why #36431 is broken. This is referenced in the TODO below.
pkg/roachpb/data.proto, line 402 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/its intent/its intents
since you're taking the blame anyway
But are you sure you want to take the blame on this paragraph? I don't think you're changed anything.
Agreed, probably best to fix the typo without moving the rest of the paragraph in order to not mess up the blame. It's useful to be able to quickly get back to the commit/PR that made a change, especially when the change is discussing something like this and may provide valuable context.
Also, I think this was formatted such that it read well in godoc. Does the outdent help that?
pkg/roachpb/data.proto, line 406 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I find it confusing to talk about stores here. Let's just talk about replicas (of which the leaseholder is one).
+1
pkg/roachpb/data.proto, line 453 at r1 (raw file):
I don't get why we're making it more general and then saying that it doesn't hold on anything other than the leaseholder. That's not really making it more general, is it?
For reads to be served by a follower of a range, we need to ensure that its clock is above both the current & previous leaseholders of the range.
But this isn't true, is it? How could we enforce this?
pkg/roachpb/data.proto, line 449 at r2 (raw file):
// on a copy of the slice. // // ### Follower Reads:
nit: move this section above Usage.
pkg/roachpb/data.proto, line 451 at r2 (raw file):
// ### Follower Reads: // // If the store serving a request isn't a leaseholder for the range, we
I'd still avoid talking about "stores". What you mean here is "replica".
pkg/roachpb/data.proto, line 457 at r2 (raw file):
as the observed timestamp of such a store ... will be above the max offset
What does this mean? It's comparing a timestamp against a duration.
pkg/roachpb/data.proto, line 463 at r2 (raw file):
// that we can only limit the upper bound of our uncertainty interval for read // requests served by a leaseholder and *not* for the ones served by // followers.
I'm still a little confused why we're comparing max offset vs. follower read duration. Neither situation seems to tell us anything all that interesting. Maybe the insight just hasn't clicked for me yet.
Release note: None