
rowflow,colexec: make routers propagate errors to all non-closed outputs #51518

Merged
merged 1 commit into from
Jul 28, 2020

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented Jul 16, 2020

This commit changes the way we propagate errors in the hash router
so that the error metadata is sent on all non-closed streams.
Previously, we would send it over only the first non-closed stream,
which could cause the processors at the same stage as that single
stream's end to treat the absence of rows and errors as the input
being exhausted successfully, which is wrong because the input did
encounter an error.
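The behavior change can be sketched as follows. This is a minimal sketch with hypothetical simplified types (the real code lives in pkg/sql/rowflow/routers.go and pushes over row channels, not slices): on error, the router forwards the metadata to every output that is still open, instead of only the first one.

```go
package main

import "fmt"

// output is a hypothetical stand-in for a router output stream.
type output struct {
	closed bool
	meta   []string
}

// forwardErrToAll sends error metadata to every output that is still
// open, so no downstream processor mistakes silence for successful
// exhaustion of its input.
func forwardErrToAll(outputs []*output, errMeta string) {
	for _, o := range outputs {
		if o.closed {
			// A closed output's consumer is gone; nothing to notify.
			continue
		}
		o.meta = append(o.meta, errMeta)
	}
}

func main() {
	outs := []*output{{}, {closed: true}, {}}
	forwardErrToAll(outs, "ReadWithinUncertaintyInterval")
	for i, o := range outs {
		fmt.Printf("output %d (closed=%t): %v\n", i, o.closed, o.meta)
	}
}
```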

The same thing has been happening in the vectorized flow, but there
the problem is less severe - the issue presents itself only when we
have wrapped processors (because the materializers will prevent the
propagation throughout the whole flow, as described below):
In the vectorized engine we use a panic-catch mechanism of error
propagation, and we end up with the following sequence of events:

  1. an operator encounters an error on any node (e.g. colBatchScan
    encounters RWUI error on a remote node). It is not an internal vectorized
    error, so the operator will panic with colexecerror.ExpectedError.
  2. the panic is caught by one of the catchers (it can be a parallel
    unordered synchronizer goroutine, an outbox goroutine, a materializer,
    or a hash router)
  3. that component then decides how to propagate the error further:
    3.1 if it is a parallel unordered synchronizer, it will cancel all
    of its inputs and will repanic
    3.2 if it is an outbox, the error is sent as metadata which will be
    received by an inbox which will panic with it
    3.3 if it is a materializer, it might swallow the error (this is
    the reason we need the vectorized hash router to send the error to
    all of its outputs). The swallowing is acceptable if it is the root
    materializer, though
    3.4 if it is a hash router, it will cancel all of its outputs and will
    forward the error on each of the outputs.

Fixes: #51458.

Release note (bug fix): Previously, CockroachDB could return incorrect
results on queries that encountered a ReadWithinUncertaintyInterval
error; this has been fixed.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the hash-router branch 2 times, most recently from 06e201c to 5c692cf Compare July 17, 2020 00:28
@yuzefovich yuzefovich changed the title rowflow: make routers propagate errors to all non-closed streams rowflow,colexec: make routers propagate errors to all non-closed outputs Jul 17, 2020
@yuzefovich
Member Author

I confirmed that this PR fixes https://github.com/cockroachlabs/support/issues/513; however, when forcing vectorized execution (with vectorize_row_count_threshold=0) we occasionally get ERROR: rpc error: code = Canceled desc = context canceled instead of the RWUI error. It probably has the same cause as the context cancellation issue that @asubiotto mentioned in #51375, and at a quick glance I didn't find the root cause.

I think we should still go ahead with this PR and backport the row-execution fix, and we can leave the vectorized context cancellation issue till later.

@yuzefovich
Member Author

With the commits from #51772 included in this PR, I no longer see any issues on the customer's reproduction when the vectorized engine is used. RFAL (only the last commit is in this PR).

@andreimatei
Contributor

I'm sorry but I won't have time to look at this for a while. I'm happy to see it though!

Member

@jordanlewis jordanlewis left a comment


Looks correct, but I have one request for clarity.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @jordanlewis, and @yuzefovich)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

	forwarded, holdingSema := false, false

	for i := range rb.outputs {

I find this confusing. Can you have two cases at the top level, one for when there is an error (forward to all non-closed streams) and one for when there is not (forward to the first non-closed stream)?

It's hard to understand what's going on with the semaphore the way things are here, because this loop is handling two different modes at once.
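The suggested restructuring could look roughly like this. This is a hypothetical sketch (the names and types are illustrative, not the actual routers.go code): one top-level branch for the error case, which forwards to every non-closed output, and one for the non-error case, which forwards only to the first non-closed output.

```go
package main

import "fmt"

// stream is a hypothetical stand-in for a router output.
type stream struct {
	closed bool
	meta   []string
}

// fwdMetadata splits the two forwarding modes at the top level instead
// of interleaving them in a single loop.
func fwdMetadata(outputs []*stream, meta string, isError bool) {
	if isError {
		// Error case: every open output must see the error, so no
		// downstream processor mistakes silence for clean exhaustion.
		for _, o := range outputs {
			if !o.closed {
				o.meta = append(o.meta, meta)
			}
		}
		return
	}
	// Non-error case: forwarding to the first open output is enough.
	for _, o := range outputs {
		if !o.closed {
			o.meta = append(o.meta, meta)
			return
		}
	}
}

func main() {
	outs := []*stream{{}, {closed: true}, {}}
	fwdMetadata(outs, "some error", true)
	fwdMetadata(outs, "trace info", false)
	for i, o := range outs {
		fmt.Printf("output %d: %v\n", i, o.meta)
	}
}
```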

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @jordanlewis)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I find this confusing. Can you have two cases at the top level, one for when there is an error (forward to all non-closed streams) and one for when there is not (forward to the first non-closed stream)?

It's hard to understand what's going on with the semaphore the way things are here, because this loop is handling two different modes at once.

Done.

I'm a little unclear on the usage of rb.semaphore - in the error case, initially I thought it would be safer to acquire and release it for each non-closed stream, but now I think it should be ok to acquire and release it only once, outside of the loop. Thoughts?

Member

@jordanlewis jordanlewis left a comment


Reviewed 4 of 5 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

Previously, yuzefovich wrote…

Done.

I'm a little unclear on the usage of rb.semaphore - in the error case, initially I thought it would be safer to acquire and release it for each non-closed stream, but now I think it should be ok to acquire and release it only once, outside of the loop. Thoughts?

The semaphore is designed to make it so that if all of the outputs are blocked, we have to stop pushing to any of the outputs. I think your proposal makes sense, because it won't start its push loop until at least one is available, and then once it starts the push loop, if no other outputs are available, nobody else can enter the semaphore.
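The acquire-once approach discussed here can be sketched with a counting semaphore modeled as a buffered channel (illustrative names; the real rb.semaphore and output types differ): the semaphore is acquired a single time before the error-forwarding loop and released after it, rather than once per output.

```go
package main

import "fmt"

// semaphore is a counting semaphore backed by a buffered channel.
type semaphore chan struct{}

func (s semaphore) acquire() { s <- struct{}{} }
func (s semaphore) release() { <-s }

// forwardErrorToAll acquires the semaphore once, pushes the error to
// every output, then releases. While it holds the slot, no other
// pusher can enter, which matches the "stop pushing when all outputs
// are blocked" design described above.
func forwardErrorToAll(sema semaphore, outputs []chan string, errMsg string) {
	sema.acquire()
	defer sema.release()
	for _, out := range outputs {
		out <- errMsg
	}
}

func main() {
	sema := make(semaphore, 1)
	outs := []chan string{make(chan string, 1), make(chan string, 1)}
	forwardErrorToAll(sema, outs, "RWUI error")
	for i, out := range outs {
		fmt.Printf("output %d got: %s\n", i, <-out)
	}
}
```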

Member

@jordanlewis jordanlewis left a comment


Oops, I didn't mean to reject this. :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

Member Author

@yuzefovich yuzefovich left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @jordanlewis)


pkg/sql/rowflow/routers.go, line 419 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

The semaphore is designed to make it so that if all of the outputs are blocked, we have to stop pushing to any of the outputs. I think your proposal makes sense, because it won't start its push loop until at least one is available, and then once it starts the push loop, if no other outputs are available, nobody else can enter the semaphore.

Cool, thanks for checking.

@craig
Contributor

craig bot commented Jul 28, 2020

Build succeeded:

@craig craig bot merged commit 4de665c into cockroachdb:master Jul 28, 2020
@yuzefovich yuzefovich deleted the hash-router branch July 29, 2020 00:34

Successfully merging this pull request may close these issues.

execinfra: uncertainty error can be incorrectly swallowed during a distributed join stage