kvfollowerreadsccl: implement txn retries for nearest_only=True #68969
Conversation
i'm a little iffy about the schema changes, so would appreciate a look from @cockroachdb/sql-schema on that.
Force-pushed 0f33fbb to 0306380
worked around this
Force-pushed 0306380 to e990692
Is there a design for how this is supposed to work regarding schema descriptor timestamps and bounded staleness time ranges?
Historical descriptors have a stricter safety policy than current descriptors for which you actively hold a lease. To use historical descriptors, you must use them during their durable validity interval (between their ModificationTime and the successor descriptor's ModificationTime, exclusive). This is because we don't track the moment when we've stopped assuming that the predecessor version might be active.
Can you help me understand some timestamps in play here? My assumption is that the ReadTimestamp on the Txn is set to the minimum timestamp that the query is allowed to see (i.e. the value computed from the AOST clause). Let's think about this in terms of a timeline. Imagine the user specifies that their query should use a timestamp no older than a, and now is g.
Now we need to plan the query with some descriptor versions. Imagine our query uses two descriptors and those descriptors have gone through some schema changes during the time interval. Namely, d1 saw a change at c and at e, and d2 saw changes at b, d, and f.
--time-->
abc d e f g // timestamp labels
[...............] // interval for bounded staleness query, g == now
..|......|....... // version validity intervals for two versions of descriptor d1
.|...|.......|... // version validity intervals for two versions of descriptor d2
So, there should exist the following distinct plans as far as the optimizer is concerned: [a, b), [b, c), [c, d), [d, e), [e, f), and [f, g).
When I look back at the RFC, there's some language about using follower reads and a similar lookup protocol to find descriptors and then to try to make that cohere here. I don't think I ever internalized this part of the proposal at all and assumed it would just use leases and other cached historical descriptors given we rely on system tables remaining available.
So, in my idealized world, in the face of a schema change, we'd
- Plan at the present time
- Look at the descriptors used to plan and bound the minimum timestamp to the earliest modification time of any descriptors used.
In this case it would mean that we'd end up changing the minimum timestamp from a to f for the purpose of execution, and we'd use g for planning.
I may be misunderstanding things and am more than happy to get on calls.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @rytaft)
Ah, yeah... that's because this used to say that we would give up on looking at the past due to the "schema unavailability" problem, but we changed it to what you saw once we thought we could change this: #67837. I probably should have circulated the RFC change when we did this.
I think the pseudocode made sense for conceptualising what was happening, but in actuality (aka when I made this work), we do something like this:
Assume (a) is the bounded staleness bound, (b) is a schema modification, and (c) is "current time".
This can continue if there are more schema changes, in which case we always do
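As a rough sketch of the retry scheme just described (hypothetical code, not the actual implementation: integer timestamps stand in for hlc.Timestamp, and retryBoundedStaleness is an invented name):

```go
package main

import "fmt"

// retryBoundedStaleness sketches the loop above: the first attempt
// reads as of "current time" (c); whenever a schema modification (b)
// lands between the minimum bound (a) and the chosen read timestamp,
// we retry with the maximum bound pulled back to just before that
// modification (analogous to MinTimestampBound.Prev()).
func retryBoundedStaleness(minBound int, schemaChanges []int, now int) (readTS, retries int) {
	maxBound := now
	for {
		readTS = maxBound
		// Find the newest schema change invalidating this attempt, if any.
		blocking := -1
		for _, m := range schemaChanges {
			if m > minBound && m <= readTS && m > blocking {
				blocking = m
			}
		}
		if blocking < 0 {
			return readTS, retries
		}
		maxBound = blocking - 1 // step back to just before the change
		retries++
	}
}

func main() {
	ts, n := retryBoundedStaleness(10, []int{20, 30}, 35)
	fmt.Println(ts, n) // settles just before the oldest blocking change
}
```

This simplification retries on every schema change inside the window; the real code only retries when planning actually observes a descriptor newer than the chosen timestamp.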
Thanks, I feel a lot better now.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @otan, and @rytaft)
pkg/kv/txn.go, line 1108 at r1 (raw file):
err := errors.AssertionFailedf("%s: ba=%s, txn=%s", s, ba.String(), txn.String())
err = errors.WithContextTags(err, ctx)
panic(err)
are we happy with this new panic behavior? Is this a lot better than turning it into a roachpb error? It should retain its assertion failure nature as it goes through that round of boxing and unboxing.
pkg/sql/conn_executor.go, line 2191 at r7 (raw file):
var minTSErr *roachpb.MinTimestampBoundUnsatisfiableError
if errors.As(err, &minTSErr) {
total nit: you can scope this variable to this conditional with:
if minTSErr := (*roachpb.MinTimestampBoundUnsatisfiableError)(nil); errors.As(err, &minTSErr) {
// ...
pkg/sql/conn_executor.go, line 2198 at r7 (raw file):
// a minimum timestamp during a bounded staleness read to be greater
// than the maximum staleness bound we put up.
err = errors.Wrapf(
assertion failure?
pkg/sql/catalog/descs/collection.go, line 109 at r7 (raw file):
// nearest_only=True, in which we want a schema read that should be
// no older than maxTimestampBound.
maxTimestampBound *hlc.Timestamp
Assuming we need all of this, thoughts on just sticking a maxTimestampBoundDeadlineHolder here and then detecting whether that thing has a zero timestamp as opposed to allocating a new one on each use?
pkg/sql/catalog/descs/collection.go, line 122 at r7 (raw file):
// SetMaxTimestampBound sets the maximum timestamp to read schemas at.
func (tc *Collection) SetMaxTimestampBound(maxTimestampBound *hlc.Timestamp) {
I'm still unclear on why we need this. Reasons seem to be:
- we want to avoid actually using the transaction to go read from the store for the descriptor in the case we don't have a lease
- we want to sidestep the UpdateDeadline call
- we want to use this timestamp as opposed to the transaction's read timestamp
Taking them backwards, what is the ReadTimestamp on one of these transactions? What does happen on UpdateDeadline on one of these transactions? Can you tell that you have one of these transactions just from the *kv.Txn object?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @rytaft)
pkg/kv/txn.go, line 1108 at r1 (raw file):
Previously, ajwerner wrote…
are we happy with this new panic behavior? Is this a lot better than turning it into a roachpb error? It should retain its assertion failure nature as it goes through that round of boxing and unboxing.
pkg/sql/conn_executor.go, line 2191 at r7 (raw file):
Previously, ajwerner wrote…
var minTSErr *roachpb.MinTimestampBoundUnsatisfiableError
if errors.As(err, &minTSErr) {
total nit: you can scope this variable to this conditional with:
if minTSErr := (*roachpb.MinTimestampBoundUnsatisfiableError)(nil); errors.As(err, &minTSErr) {
// ...
Done.
pkg/sql/catalog/descs/collection.go, line 122 at r7 (raw file):
Previously, ajwerner wrote…
I'm still unclear on why we need this. Reasons seem to be:
- we want to avoid actually using the transaction to go read from the store for the descriptor in the case we don't have a lease
- we want to sidestep the UpdateDeadline call
- we want to use this timestamp as opposed to the transaction's read timestamp
Taking them backwards, what is the ReadTimestamp on one of these transactions? What does happen on UpdateDeadline on one of these transactions? Can you tell that you have one of these transactions just from the *kv.Txn object?
I was told by @nvanbenschoten to not put things on kv.Txn, which is how it ended up here where it is.
ReadTimestamp doesn't change at the moment in a staleness read - should it? I tried that and a bunch of other things complained based on assertions made that it should not be set during a max staleness read. I've been assuming (based on the APIs given) that it's not important.
Your reasons are correct.
Force-pushed e990692 to 8c9e19d
Reviewed 1 of 18 files at r1, 3 of 3 files at r4, 1 of 2 files at r5, 1 of 19 files at r8, 5 of 10 files at r9, 10 of 13 files at r10, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @otan)
pkg/sql/conn_executor.go, line 2409 at r10 (raw file):
// as there was a schema update in between. When retrying, we want to keep the
// same minimum timestamp for the AOST read, but set the maximum timestamp
// to the point just before our failed read to ensure we don't try read data
nit: try read -> try to read
pkg/sql/exec_util.go, line 1170 at r10 (raw file):
DistSQLReceiverPushCallbackFactory func(query string) func(rowenc.EncDatumRow, *execinfrapb.ProducerMetadata)
// OnTxnRetry, if sets, will be called if there is a transaction retry.
nit: if sets -> if set
pkg/sql/sem/tree/as_of.go, line 93 at r10 (raw file):
NearestOnly bool
// If this is a bounded staleness read with nearest_only=True, this is set
// when we failed to satisfy a bounded staleness with a nearby replica as
nit: satisfy a bounded staleness read
Force-pushed ba20676 to e90277b
Instead of panicking. Discussed in cockroachdb#68969.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @nvanbenschoten, and @rytaft)
pkg/sql/catalog/descs/collection.go, line 122 at r7 (raw file):
Previously, otan (Oliver Tan) wrote…
I was told by @nvanbenschoten to not put things on kv.Txn, which is how it ended up here where it is.
ReadTimestamp doesn't change at the moment in a staleness read - should it? I tried that and a bunch of other things complained based on assertions made that it should not be set during a max staleness read. I've been assuming (based on the APIs given) that it's not important.
Your reasons are correct.
Sounds good. I'm cool with this. My sincere and heartfelt hope is to soon rework the dependency logic in this here Collection object to be such that it is constructed (or reset, I suppose) with a binding to the (abstracted) transaction, and we remove the txn parameter from all of these methods. Seems unlikely to make this release, so let's march forward with this and then, when that refactor arrives, we can just set this up at construction time in a more direct manner than MaybeUpdateDeadline.
Reviewed 7 of 13 files at r10, 1 of 3 files at r11, 33 of 33 files at r12, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @otan)
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 145 at r12 (raw file):
# successfully. Note the schema changes several times, hence the multiple events
# output. Note on the first read, event 7 is not a follower read as the query
# requires reading a descriptor which does not utilize a follower read.
Can we ignore the descriptor lookup? Or somehow ensure that it's cached at this point? This seems like it will be a source of flakiness. I also don't understand why the descriptor lookup results in a local read, but not a follower read. Have we ensured that the descriptor's leaseholder is on n2?
pkg/sql/conn_executor.go, line 2208 at r12 (raw file):
)
}
if aost.Timestamp.Less(minTSErr.MinTimestampBound.Prev()) {
Could you add a comment explaining the .Prev() here? It must have something to do with an inclusive/exclusive translation somewhere else, right?
pkg/sql/conn_executor.go, line 2412 at r12 (raw file):
// same minimum timestamp for the AOST read, but set the maximum timestamp
// to the point just before our failed read to ensure we don't try to read
//data which may be after the schema change when we retry.
nit: missing space at the beginning of this line
pkg/sql/conn_executor.go, line 2415 at r12 (raw file):
var minTSErr *roachpb.MinTimestampBoundUnsatisfiableError
if err := ex.extraTxnState.autoRetryReason; err != nil && errors.As(err, &minTSErr) {
bound := minTSErr.MinTimestampBound.Prev()
Would things be easier to reason about if max_timestamp_bound was exclusive instead of inclusive? We can make that change in the KV API if necessary.
pkg/sql/catalog/descs/descriptor.go, line 212 at r12 (raw file):
ctx, tc.deadlineHolder(txn), parentID, parentSchemaID, name)
if err != nil {
nit: stray new line
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
ctx context.Context, deadline hlc.Timestamp,
) error {
return nil
Do we need to use the deadline even for bounded staleness queries? Right now we aren't passing a MaxTimestampBound for bounded staleness queries that have not yet retried, but I think this may be faulty. If the bounded staleness read is operating on the current schema, gets delayed, and then chooses a dynamic timestamp beyond the end of the validity period for this schema, won't bad things happen?
cc. @ajwerner
pkg/sql/catalog/descs/leased_descriptors.go, line 141 at r12 (raw file):
// limitations of AcquireByName for details.
// Note we never should read from store during a bounded staleness read,
// as it is safe to return the schema as non-existent.
What is the effect of this? Does this mean that if the schema exists but is not cached, we won't observe it and will incorrectly claim that it does not exist?
pkg/sql/sem/tree/as_of.go, line 96 at r12 (raw file):
// have no followers with an up-to-date schema.
// In this case, we want a read that is >= Timestamp and <= MaxTimestampBound.
MaxTimestampBound *hlc.Timestamp
nit: we generally use the zero value of hlc.Timestamp to represent an optional timestamp. There's an IsEmpty method you can use to check whether the timestamp is present or not. No reason to incur runtime cost for this.
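The zero-value convention might look like the following sketch (Timestamp and describeBound are illustrative stand-ins, not the real hlc package or CockroachDB code):

```go
package main

import "fmt"

// Timestamp is a stand-in for hlc.Timestamp showing the
// zero-value-as-unset convention: pass by value and check IsEmpty,
// rather than using a *Timestamp pointer to encode "no bound".
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) IsEmpty() bool { return t.WallTime == 0 && t.Logical == 0 }

// describeBound is a hypothetical helper for demonstration.
func describeBound(maxTimestampBound Timestamp) string {
	if maxTimestampBound.IsEmpty() {
		return "no maximum timestamp bound"
	}
	return fmt.Sprintf("bounded at %d.%d", maxTimestampBound.WallTime, maxTimestampBound.Logical)
}

func main() {
	fmt.Println(describeBound(Timestamp{}))                          // zero value: unset
	fmt.Println(describeBound(Timestamp{WallTime: 100, Logical: 2})) // set
}
```

This avoids a heap allocation per query for the common unset case, which is the runtime cost the nit alludes to.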
pkg/kv/txn.go, line 1108 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
We could turn this into an error. Done in #69046.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @nvanbenschoten)
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 145 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we ignore the descriptor lookup? Or somehow ensure that it's cached at this point? This seems like it will be a source of flakiness. I also don't understand why the descriptor lookup results in a local read, but not a follower read. Have we ensured that the descriptors leaseholder is on n2?
it's because kv.OnlyFollowerReads ensures everything is a follower read, including the descriptor lookup.
i've changed this to wait for an instance of follower read before continuing so we have the cache entry.
pkg/sql/conn_executor.go, line 2208 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a comment explaining the .Prev() here? It must have something to do with an inclusive/exclusive translation somewhere else, right?
ah, it's not needed here.
pkg/sql/conn_executor.go, line 2415 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Would things be easier to reason about if max_timestamp_bound was exclusive instead of inclusive? We can make that change in the KV API if necessary.
still needs the Prev() for the schema descriptor lookup, so i prefer it this way tbh.
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
If the bounded staleness read is operating on the current schema, gets delayed, and then chooses a dynamic timestamp beyond the end of the validity period for this schema, won't bad things happen?
isn't that ok? we would truncate the min timestamp for that read anyway right, and retry would take care of the rest? unless i'm misunderstanding.
pkg/sql/catalog/descs/leased_descriptors.go, line 141 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What is the effect of this? Does this mean that if the schema exists but is not cached, we won't observe it and will incorrectly claim that it does not exist?
The schema doesn't exist; we're falling back to the KV read in case it does. But for bounded staleness, we should avoid that because that only observes the "latest" read.
Force-pushed e90277b to 0cd5f95
Reviewed 9 of 9 files at r13, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @otan)
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
Previously, otan (Oliver Tan) wrote…
If the bounded staleness read is operating on the current schema, gets delayed, and then chooses a dynamic timestamp beyond the end of the validity period for this schema, won't bad things happen?
isn't that ok? we would truncate the min timestamp for that read anyway right, and retry would take care of the rest? unless i'm misunderstanding.
We would truncate the min timestamp, but what about the max timestamp? I'm concerned that we have a lease on a descriptor from time 20 to time 30 and we have a min_timestamp_bound of 10. We bump the min_timestamp_bound to 20, but what if we end up evaluating the bounded staleness read at time 35? What if a schema change or two landed between 30 and 35? Could that lead to incoherence between the table data and the schema we use to decode that data?
The TL;DR is that I think we may need to always set a max timestamp bound.
pkg/sql/catalog/descs/leased_descriptors.go, line 141 at r12 (raw file):
Previously, otan (Oliver Tan) wrote…
The schema doesn't exist; we're falling back to the KV read in case it does. But for bounded staleness, we should avoid that because that only observes the "latest" read.
I see, this is about dropped/truncated tables, not newly created ones. This makes sense then. Returning that the schema is non-existent is the "least stale" result possible. Makes sense.
pkg/sql/sem/tree/as_of.go, line 96 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: we generally use the zero value of hlc.Timestamp to represent an optional timestamp. There's an IsEmpty method you can use to check whether the timestamp is present or not. No reason to incur runtime cost for this.
follow-up nit: mention that the value can be set to the zero value when a query has no maximum timestamp bound.
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Not sure I understand, this controls what time the schema should be read. For the first read, if we have a min_timestamp_bound of 10 in your example, max is not set, so we try to read from latest (and this code is not used). If we have something between 30-35, then the min_timestamp_bound will be reset to something in 30-35 (say 30).
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @otan)
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
Previously, otan (Oliver Tan) wrote…
Not sure I understand, this controls what time the schema should be read. For the first read, if we have a min_timestamp_bound of 10 in your example, max is not set, so we try to read from latest (and this code is not used). If we have something between 30-35, then the min_timestamp_bound will be reset to something in 30-35 (say 30).
If that fails, we have set a min_timestamp_bound of say 30 in the read because of the bump, then max timestamp bound is set to 29.
The next time, we always read the schema from time 29. We then use this deadline holder. Does the deadline matter in this case when we know there is a schema that is later?
Isn't UpdateDeadline used to ensure a transaction does not commit beyond the validity of the schema descriptors that it is using? If I'm not wildly off on that, then I'm thinking that we need the same form of protection for bounded staleness reads, even on their first (the "latest") attempt. If the first attempt is performed without any max_timestamp_bound at all, then how do we ensure that it doesn't dynamically choose a timestamp well in the future and beyond its descriptor's expiration?
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
I'm not as learned in the schema code, but we don't use the deadline holder in this case.
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
Previously, otan (Oliver Tan) wrote…
sorry, we don't use this deadline holder in that case (we fall back to the txn one), which updates the deadline.
Force-pushed 0cd5f95 to 7a8e6bc
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @nvanbenschoten, and @otan)
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
Previously, otan (Oliver Tan) wrote…
sorry, we don't use this deadline holder in that case (we fall back to the txn one), which updates the deadline.
Got it, I think that's ok, but then we need to do some extra work in kv.Txn to respect the deadline when NegotiateAndSend is used. That should be fine and I should have realized that's how this was going to work earlier. We'll just use the same max_timestamp_bound mechanism with some code in NegotiateAndSend like:
if txn.mu.deadline != nil {
	ba.BoundedStaleness.MaxTimestampBound.Backward(*txn.mu.deadline)
}
I'll open an issue for this, though I won't get to it until next week.
pkg/sql/catalog/descs/leased_descriptors.go, line 59 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
sounds good, yeah it smelt as though it would have affected the nearest_only=False case as well.
Reviewed 1 of 1 files at r14, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @otan)
This implements the schema unavailability workaround for bounded staleness reads, for the case where we had to bump the AOST min timestamp bound because we fetched a schema descriptor which was more recent than the AOST min timestamp bound. In this case, we retry, using an older schema descriptor each time we go back. We utilize the existing machinery around transaction retries to make this work. Between retries, we keep track of a `MaxStalenessBound`, so that in between retries we always read at a lower bound.

There are a number of changes around retries needed to make this work:
* We preserve AOST between transaction retries. This is needed to prevent a case where the timestamp bound keeps progressing.
* with_min_timestamp / with_max_staleness now always return the context AS OF SYSTEM TIME. This is required as checks elsewhere require this to be equal to the evaluated time, even on txn retries.
* The schema descriptor code has to know the `MaxStalenessBound`. As such, we store this in the `Collection` object, which is persisted inside `extraTxnState` and is preserved between retries.

Release note (sql change): Bounded staleness reads now retry transactions when nearest_only=True and a schema change is detected which may prevent a follower read from being served.
Force-pushed 7a8e6bc to f5bdee0
cheers! bors r=nvanbenschoten,ajwerner,rytaft
Build succeeded:
Instead of panicking. Discussed in cockroachdb#68969.
Release justification: change to new functionality
69046: kv: return error from checkNegotiateAndSendPreconditions r=nvanbenschoten a=nvanbenschoten

Instead of panicking. Discussed in #68969.

Release justification: change to new functionality

69273: sql: crdb_internal.reset_sql_stats() now resets persisted SQL Stats r=maryliag a=Azhng

Depends on #68401 and #68434

Previously, the crdb_internal.reset_sql_stats() builtin only reset cluster-wide in-memory sql stats. This patch updated the builtin to be able to reset persisted sql stats as well.

Release justification: category 4

Release note (sql change): crdb_internal.reset_sql_stats() now resets persisted SQL Stats.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Azhng <archer.xn@gmail.com>
See commit for details.