New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: augment command queue with timestamps for non-interfering reads #14342
Conversation
The read can't affect the later write, but the write can affect the read's outcome because of uncertainty errors. With the change (didn't look at the code, just the PR message) it seems that we could fail to let uncertain reads fail, which in turn would lead to anomalies. |
@tschottdorf I don't think this is an anomaly. There is no way for the concurrent write at the higher timestamp to return and send a back channel message, back in time, to the reader such that the reader experiences a "does not read what was written previously in absolute time". A less tongue in cheek way to say this is that the uncertainty interval errors protect us from failing to read something that provably happened before the read started. Here's another way to look at it: imagine that the write happens at t=10, and the max offset = 0.25s. The case you bring up is that the read and write are concurrent, with read at t=9.90. You're worried that if the read is allowed to proceed with this change, the behavior will change from getting an uncertainty interval error to not getting one. But the same thing is already true if the read arrives just slightly before the write, finishes without any uncertainty error for t=9.90, and then the write arrives and puts a value at t=10. The reason we return the uncertainty errors is to avoid the failed external guarantee of not reading what some observer can prove already was written. That's not the case with concurrent reads and writes. |
I tested this locally and can confirm it seems to fix #14298. I'll also push it to lapis to make extra sure |
Works on lapis, too 👌 |
Reviewed 2 of 4 files at r1. pkg/storage/command_queue.go, line 309 at r1 (raw file):
command IDs? pkg/storage/command_queue.go, line 337 at r1 (raw file):
nit: move mustWait before the pkg/storage/command_queue.go, line 369 at r1 (raw file):
ditto pkg/storage/command_queue.go, line 425 at r1 (raw file):
This condition isn't necessary to check, is it? ditto below. pkg/storage/replica.go, line 1687 at r1 (raw file):
to make the comment above clearer, can you name this Comments from Reviewable |
Reviewed 4 of 4 files at r1. pkg/storage/command_queue.go, line 207 at r1 (raw file):
Add a comment about the timestamp argument, and the special meaning of a zero timestamp. Comments from Reviewable |
b7e956b
to
9f2fe51
Compare
Review status: all files reviewed at latest revision, 6 unresolved discussions. pkg/storage/command_queue.go, line 207 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. pkg/storage/command_queue.go, line 309 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
These are allocated by the command queue. pkg/storage/command_queue.go, line 337 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/command_queue.go, line 369 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/command_queue.go, line 425 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. Could you check my logic? These timestamp pkg/storage/replica.go, line 1687 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. Comments from Reviewable |
Reviewed 2 of 4 files at r2. pkg/storage/command_queue.go, line 425 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Yeah, I think this is correct; if Comments from Reviewable |
9f2fe51
to
2e3cdd9
Compare
Review status: 2 of 4 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful. pkg/storage/command_queue.go, line 327 at r3 (raw file):
I'd pull pkg/storage/command_queue_test.go, line 423 at r3 (raw file):
This is a pretty nice way to test this. My only concern is that it's dependent on the order that pkg/storage/replica.go, line 1690 at r3 (raw file):
Could you expand on why "Reads and writes to local keys always interfere"? Comments from Reviewable |
Previously, the command queue considered any two commands to overlap if the key ranges overlapped, regardless of whether the timestamps on the two commands guaranteed they were non-overlapping. Consider the case of a read over key ranges "a" - "z" at timestamp t=1s. If a write arrives for key "c" at timestamp > t=1s, then it should be free to proceed regardless of the disposition of the read because the read cannot affect the write's timestamp. Similarly, if a write is active for key "c" at timestamp t=1s, a read that arrives for key range "a" - "z" should be free to proceed as long as its timestamp is < t=1s. Note that local keys do not allow non-interference between earlier reads and overlapping writes. Fixes #14298
Review status: 2 of 4 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful. pkg/storage/command_queue.go, line 327 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. pkg/storage/command_queue_test.go, line 423 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes this will break if that changes, but should be easy enough to fix the test. pkg/storage/replica.go, line 1690 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. Comments from Reviewable |
2e3cdd9
to
81ce3af
Compare
Change #14342 had a bug in that it did not remove the optimization to stop checking dependencies in the event that the range group being built up completely enclosed the candidate span. Bailing out early is no longer allowed because we must additionally include any read-only spans which might not be transitively depended on by the enclosing read-write span(s). This bug applies to both prop-eval-KV as well as non-prop-eval-KV, but manifested much more readily with prop-eval-KV due to the massively enclosing span which is added to provide linearizability across all ops hitting the range's command queue. Fixes #14434 Fixes #8057
Change #14342 had a bug in that it did not remove the optimization to stop checking dependencies in the event that the range group being built up completely enclosed the candidate span. Bailing out early is no longer allowed because we must additionally include any read-only spans which might not be transitively depended on by the enclosing read-write span(s). This bug applies to both prop-eval-KV as well as non-prop-eval-KV, but manifested much more readily with prop-eval-KV due to the massively enclosing span which is added to provide linearizability across all ops hitting the range's command queue. Fixes #14434 Fixes #8057
The test added here verifies that if a dependency relation between commands inserted into the `CommandQueue` should exist, it is always transitively maintained even if other commands are inserted between them. This is important because the transitive dependency relation is required for the correctness of certain optimizations performed by the `CommandQueue`, as well as by our approach to command cancellation and prerequisite migration. The test made me feel more confident that even with the timestamp handling added in cockroachdb#14342, our optimizations still work correctly.
Previously, the command queue considered any two commands to overlap if the
key ranges overlapped, regardless of whether the timestamps on the two
commands guaranteed they were non-overlapping.
Consider the case of a read over key ranges "a" - "z" at timestamp t=1s.
If a write arrives for key "c" at timestamp > t=1s, then it should be
free to proceed regardless of the disposition of the read because the
read cannot affect the write's timestamp.
Similarly, if a write is active for key "c" at timestamp t=1s, a read
that arrives for key range "a" - "z" should be free to proceed as long
as its timestamp is < t=1s.
Note that local keys do not allow non-interference between earlier reads
and overlapping writes.
Fixes #14298
This change is