-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
colexec: remove internal cancellation behavior from unordered synchronizer #52463
Conversation
One concern I had was what if one of the inputs is blocked in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but is the race in the test concerning?
Reviewed 3 of 3 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained
I think we should get this in sooner rather than later. |
You're right, thanks for the ping. The race was concerning but not something that could've happened before. Since we weren't waiting for inputs to exit in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @asubiotto)
pkg/sql/colexec/parallel_unordered_synchronizer.go, line 156 at r2 (raw file):
// batchCh is a buffered channel in order to offer non-blocking writes to // input goroutines. During normal operation, this channel will have at most // inputs messages. However, during DrainMeta, inputs might need to push
nit: s/inputs messages/len(inputs) messages/g
.
pkg/sql/colexec/parallel_unordered_synchronizer.go, line 358 at r2 (raw file):
} // Unblock any goroutines currently waiting to be told to read a next batch,
nit: s/a next/the next/g
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the CI failures. One is an edge case described in DrainMeta
and the other is related to the recent change to send metadata messages with errors on all streams. TestVectorizedFlowShutdown
failed because it got multiple metadata messages for the same id. I'm not sure why this wasn't failing before, but I think it's probably got to do with the unordered synchronizer properly forwarding the rest of the metadata now in case of error, so I change the test expectations.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)
pkg/sql/colexec/parallel_unordered_synchronizer.go, line 363 at r3 (raw file):
// retrieve the next batch but encounters an error. There are now n+1 messages // in batchCh. Notifying all these inputs to read the next batch would result // in 2n+1 messages on batchCh, which would cause a deadlock since this
Given 2n+1 here should the creation of batchCh
be updated?
pkg/sql/colexec/parallel_unordered_synchronizer.go, line 371 at r3 (raw file):
select { case msg := <-s.batchCh: if msg == nil {
Do we ever send nil
message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe the latest race failure was caused by this PR and submitted #52890 to track. Rerunning the build.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
pkg/sql/colexec/parallel_unordered_synchronizer.go, line 363 at r3 (raw file):
Previously, yuzefovich wrote…
Given 2n+1 here should the creation of
batchCh
be updated?
I would say no, because I think it's easier to reason about 2n
during normal operations and simply handle the complexity of an extra message here in DrainMeta
.
pkg/sql/colexec/parallel_unordered_synchronizer.go, line 371 at r3 (raw file):
Previously, yuzefovich wrote…
Do we ever send
nil
message?
We don't explicitly send nil
, it's just the signal that the channel API gives us for when a channel is close
d.
0dc130e
to
2a14c44
Compare
…nizer Previously, the unordered synchronizer would cancel all inputs if one of the inputs encountered an error. This would result in possible context cancellation errors racing with the original error and would sometimes cause the original error to be overwritten (according to priority) in the distsql receiver (the root of a query). This behavior was incorrect, because what should happen is that the original error should be propagated followed by a call to DrainMeta by the caller to drain and close the remaining inputs. This commit removes internal context cancellation in favor of this behavior. Release note (bug fix): unexpected context cancellation errors could sometimes be returned in the vectorized execution engine. This is now fixed.
bors r=yuzefovich |
Build succeeded: |
Previously, the unordered synchronizer would cancel all inputs if one of the
inputs encountered an error. This would result in possible context cancellation
errors racing with the original error and would sometimes cause the original
error to be overwritten (according to priority) in the distsql receiver (the
root of a query).
This behavior was incorrect, because what should happen is that the original
error should be propagated followed by a call to DrainMeta by the caller to
drain and close the remaining inputs. This commit removes internal context
cancellation in favor of this behavior.
Release note (bug fix): unexpected context cancellation errors could sometimes
be returned in the vectorized execution engine. This is now fixed.
Fixes #51647
Fixes #52057