-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rowflow,colexec: make routers propagate errors to all non-closed outputs #51518
Conversation
06e201c
to
5c692cf
Compare
I confirmed that this PR fixes https://github.com/cockroachlabs/support/issues/513, however, when forcing the vectorized execution (with I think we should still go ahead with this PR and backport the row-execution fix, and we can leave the vectorized context cancellation issue till later. |
30113ae
to
004e832
Compare
When this PR contains the commits from #51772, then I no longer see any issues on the customer's reproduction when vectorized engine is used. RFAL (only the last commit is in this PR). |
I'm sorry but I won't have time to look at this for a while. I'm happy to see it though! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks correct, but I have one request for clarity.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @jordanlewis, and @yuzefovich)
pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):
forwarded, holdingSema := false, false for i := range rb.outputs {
I find this confusing. Can you have two cases at the top level, one for if there is an error (forward to all non closed streams) and one if there is not (forward to first non closed stream)?
It's hard to understand what's going on with the semaphore the way things are here, because of the fact that this loop is doing 2 different modes at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @jordanlewis)
pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I find this confusing. Can you have two cases at the top level, one for if there is an error (forward to all non closed streams) and one if there is not (forward to first non closed stream)?
It's hard to understand what's going on with the semaphore the way things are here, because of the fact that this loop is doing 2 different modes at once.
Done.
I'm a little unclear on the usage of rb.semaphore
- in the error case, initially I thought it would be safer to acquire and release it for each non-closed stream, but now I think it should be ok to acquire and release it only once, outside of the loop. Thoughts?
This commit changes the way we propagate the errors in the hash router so that the error metadata is sent on all non-closed streams. Previously, we would be sending it over only the first non-closed stream which could result in the processors on the same stage as that single stream end to treat the absence of rows and errors as the input being exhausted successfully, which is wrong because the input did encounter an error. The same thing has been happening in the vectorized flow, but in that case the problem is less severe - the issue will present itself only when we have wrapped processors (because the materializers will prevent the propagation throughout the whole flow as described below): In the vectorized engine we use panic-catch mechanism of error propagation, and we end up with the following sequence of events: 1. an operator encounters an error on any node (e.g. `colBatchScan` encounters RWUI error on a remote node). It is not an internal vectorized error, so the operator will panic with `colexecerror.ExpectedError`. 2. the panic is caught by one of the catchers (it can be a parallel unordered synchronizer goroutine, an outbox goroutine, a materializer, a hash router) 3. that component will then decide how to propagate the error further: 3.1 if it is a parallel unordered synchronizer, then it will cancel all of its inputs and will repanic 3.2 if it is an outbox, the error is sent as metadata which will be received by an inbox which will panic with it 3.3. if it is a materializer, then it might swallow the error (this is the reason we need for the vectorized hash router to send the error to all of its inputs). The swallowing is acceptable if it is the root materializer though. 3.4 if it is a hash router, it'll cancel all of its outputs and will forward the error on each of the outputs. Release note (bug fix): Previously, CockroachDB could return incorrect results on query that encountered ReadWithinUncertaintyInterval error, and this has been fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 5 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):
Previously, yuzefovich wrote…
Done.
I'm a little unclear on the usage of
rb.semaphore
- in the error case, initially I thought it would be safer to acquire and release it for each non-closed stream, but now I think it should be ok to acquire and release it only once, outside of the loop. Thoughts?
The semaphore is designed to make it so that if all of the outputs are blocked, we have to stop pushing to any of the outputs. I think your proposal makes sense, because it won't start its push loop until at least one is available, and then once it starts the push loop, if no other outputs are available, nobody else can enter the semaphore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I didn't mean to reject this.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @jordanlewis)
pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
The semaphore is designed to make it so that if all of the outputs are blocked, we have to stop pushing to any of the outputs. I think your proposal makes sense, because it won't start its push loop until at least one is available, and then once it starts the push loop, if no other outputs are available, nobody else can enter the semaphore.
Cool, thanks for checking.
Build succeeded: |
This commit changes the way we propagate the errors in the hash router
so that the error metadata is sent on all non-closed streams.
Previously, we would be sending it over only the first non-closed stream
which could result in the processors on the same stage as that single
stream end to treat the absence of rows and errors as the input being
exhausted successfully, which is wrong because the input did encounter
an error.
The same thing has been happening in the vectorized flow, but in that
case the problem is less severe - the issue will present itself only
when we have wrapped processors (because the materializers will prevent
the propagation throughout the whole flow as described below):
In the vectorized engine we use panic-catch mechanism of error
propagation, and we end up with the following sequence of events:
colBatchScan
encounters RWUI error on a remote node). It is not an internal vectorized
error, so the operator will panic with
colexecerror.ExpectedError
.unordered synchronizer goroutine, an outbox goroutine, a materializer,
a hash router)
3.1 if it is a parallel unordered synchronizer, then it will cancel all
of its inputs and will repanic
3.2 if it is an outbox, the error is sent as metadata which will be
received by an inbox which will panic with it
3.3. if it is a materializer, then it might swallow the error (this is
the reason we need for the vectorized hash router to send the error to
all of its inputs). The swallowing is acceptable if it is the root
materializer though.
3.4 if it is a hash router, it'll cancel all of its outputs and will
forward the error on each of the outputs.
Fixes: #51458.
Release note (bug fix): Previously, CockroachDB could return incorrect
results on query that encountered ReadWithinUncertaintyInterval error,
and this has been fixed.