
rowflow,colexec: make routers propagate errors to all non-closed outputs #51518

Merged
merged 1 commit into from
Jul 28, 2020

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented Jul 16, 2020

This commit changes the way we propagate errors in the hash router
so that the error metadata is sent on all non-closed streams.
Previously, we would send it over only the first non-closed stream,
which could cause the processors at the same stage as that single
stream's end to treat the absence of rows and errors as the input
being exhausted successfully, which is wrong because the input did
encounter an error.
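The behavior change can be sketched as follows. This is a minimal sketch with hypothetical simplified types (the real code lives in pkg/sql/rowflow/routers.go and pushes over row channels, not slices): on error, the router forwards the metadata to every output that is still open, instead of only the first one.

```go
package main

import "fmt"

// output is a hypothetical stand-in for a router output stream.
type output struct {
	closed bool
	meta   []string
}

// forwardErrToAll sends error metadata to every output that is still
// open, so no downstream processor mistakes silence for successful
// exhaustion of its input.
func forwardErrToAll(outputs []*output, errMeta string) {
	for _, o := range outputs {
		if o.closed {
			// A closed output's consumer is gone; nothing to notify.
			continue
		}
		o.meta = append(o.meta, errMeta)
	}
}

func main() {
	outs := []*output{{}, {closed: true}, {}}
	forwardErrToAll(outs, "ReadWithinUncertaintyInterval")
	for i, o := range outs {
		fmt.Printf("output %d (closed=%t): %v\n", i, o.closed, o.meta)
	}
}
```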

The same thing has been happening in the vectorized flow, but there
the problem is less severe - the issue presents itself only when we
have wrapped processors (because the materializers will prevent the
propagation throughout the whole flow, as described below):
In the vectorized engine we use a panic-catch mechanism of error
propagation, and we end up with the following sequence of events:

  1. an operator encounters an error on any node (e.g. colBatchScan
    encounters RWUI error on a remote node). It is not an internal vectorized
    error, so the operator will panic with colexecerror.ExpectedError.
  2. the panic is caught by one of the catchers (it can be a parallel
    unordered synchronizer goroutine, an outbox goroutine, a materializer,
    or a hash router)
  3. that component then decides how to propagate the error further:
    3.1 if it is a parallel unordered synchronizer, it will cancel all
    of its inputs and will repanic
    3.2 if it is an outbox, the error is sent as metadata which will be
    received by an inbox which will panic with it
    3.3 if it is a materializer, it might swallow the error (this is
    the reason we need the vectorized hash router to send the error to
    all of its outputs). The swallowing is acceptable if it is the root
    materializer, though
    3.4 if it is a hash router, it will cancel all of its outputs and will
    forward the error on each of the outputs.

Fixes: #51458.

Release note (bug fix): Previously, CockroachDB could return incorrect
results on queries that encountered a ReadWithinUncertaintyInterval
error; this has been fixed.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the hash-router branch 2 times, most recently from 06e201c to 5c692cf Compare July 17, 2020 00:28
@yuzefovich yuzefovich changed the title rowflow: make routers propagate errors to all non-closed streams rowflow,colexec: make routers propagate errors to all non-closed outputs Jul 17, 2020
@yuzefovich
Member Author

I confirmed that this PR fixes https://github.com/cockroachlabs/support/issues/513; however, when forcing vectorized execution (with vectorize_row_count_threshold=0) we occasionally get ERROR: rpc error: code = Canceled desc = context canceled instead of the RWUI error. It probably has the same cause as the context cancellation issue that @asubiotto mentioned in #51375, and at a quick glance I didn't find the root cause.

I think we should still go ahead with this PR and backport the row-execution fix, and we can leave the vectorized context cancellation issue till later.

@yuzefovich
Member Author

With the commits from #51772 included in this PR, I no longer see any issues on the customer's reproduction when the vectorized engine is used. RFAL (only the last commit is in this PR).

@andreimatei
Contributor

I'm sorry but I won't have time to look at this for a while. I'm happy to see it though!

Member

@jordanlewis jordanlewis left a comment


Looks correct, but I have one request for clarity.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @jordanlewis, and @yuzefovich)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

	forwarded, holdingSema := false, false

	for i := range rb.outputs {

I find this confusing. Can you have two cases at the top level, one for when there is an error (forward to all non-closed streams) and one for when there is not (forward to the first non-closed stream)?

It's hard to understand what's going on with the semaphore the way things are here, because this loop is handling two different modes at once.
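The suggested restructuring could look roughly like this. This is a hypothetical sketch (the names and types are illustrative, not the actual routers.go code): one top-level branch for the error case, which forwards to every non-closed output, and one for the non-error case, which forwards only to the first non-closed output.

```go
package main

import "fmt"

// stream is a hypothetical stand-in for a router output.
type stream struct {
	closed bool
	meta   []string
}

// fwdMetadata splits the two forwarding modes at the top level instead
// of interleaving them in a single loop.
func fwdMetadata(outputs []*stream, meta string, isError bool) {
	if isError {
		// Error case: every open output must see the error, so no
		// downstream processor mistakes silence for clean exhaustion.
		for _, o := range outputs {
			if !o.closed {
				o.meta = append(o.meta, meta)
			}
		}
		return
	}
	// Non-error case: forwarding to the first open output is enough.
	for _, o := range outputs {
		if !o.closed {
			o.meta = append(o.meta, meta)
			return
		}
	}
}

func main() {
	outs := []*stream{{}, {closed: true}, {}}
	fwdMetadata(outs, "some error", true)
	fwdMetadata(outs, "trace info", false)
	for i, o := range outs {
		fmt.Printf("output %d: %v\n", i, o.meta)
	}
}
```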

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @jordanlewis)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I find this confusing. Can you have two cases at the top level, one for when there is an error (forward to all non-closed streams) and one for when there is not (forward to the first non-closed stream)?

It's hard to understand what's going on with the semaphore the way things are here, because this loop is handling two different modes at once.

Done.

I'm a little unclear on the usage of rb.semaphore - in the error case, initially I thought it would be safer to acquire and release it for each non-closed stream, but now I think it should be ok to acquire and release it only once, outside of the loop. Thoughts?

Member

@jordanlewis jordanlewis left a comment


Reviewed 4 of 5 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

Previously, yuzefovich wrote…

Done.

I'm a little unclear on the usage of rb.semaphore - in the error case, initially I thought it would be safer to acquire and release it for each non-closed stream, but now I think it should be ok to acquire and release it only once, outside of the loop. Thoughts?

The semaphore is designed to make it so that if all of the outputs are blocked, we have to stop pushing to any of the outputs. I think your proposal makes sense, because it won't start its push loop until at least one is available, and then once it starts the push loop, if no other outputs are available, nobody else can enter the semaphore.
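The acquire-once approach discussed here can be sketched with a counting semaphore modeled as a buffered channel (illustrative names; the real rb.semaphore and output types differ): the semaphore is acquired a single time before the error-forwarding loop and released after it, rather than once per output.

```go
package main

import "fmt"

// semaphore is a counting semaphore backed by a buffered channel.
type semaphore chan struct{}

func (s semaphore) acquire() { s <- struct{}{} }
func (s semaphore) release() { <-s }

// forwardErrorToAll acquires the semaphore once, pushes the error to
// every output, then releases. While it holds the slot, no other
// pusher can enter, which matches the "stop pushing when all outputs
// are blocked" design described above.
func forwardErrorToAll(sema semaphore, outputs []chan string, errMsg string) {
	sema.acquire()
	defer sema.release()
	for _, out := range outputs {
		out <- errMsg
	}
}

func main() {
	sema := make(semaphore, 1)
	outs := []chan string{make(chan string, 1), make(chan string, 1)}
	forwardErrorToAll(sema, outs, "RWUI error")
	for i, out := range outs {
		fmt.Printf("output %d got: %s\n", i, <-out)
	}
}
```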

Member

@jordanlewis jordanlewis left a comment


Oops, I didn't mean to reject this. :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

Member Author

@yuzefovich yuzefovich left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @jordanlewis)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

The semaphore is designed to make it so that if all of the outputs are blocked, we have to stop pushing to any of the outputs. I think your proposal makes sense, because it won't start its push loop until at least one is available, and then once it starts the push loop, if no other outputs are available, nobody else can enter the semaphore.

Cool, thanks for checking.

@craig
Contributor

craig bot commented Jul 28, 2020

Build succeeded:

@craig craig bot merged commit 4de665c into cockroachdb:master Jul 28, 2020
@yuzefovich yuzefovich deleted the hash-router branch July 29, 2020 00:34

Successfully merging this pull request may close these issues.

execinfra: uncertainty error can be incorrectly swallowed during a distributed join stage