colexec: remove internal cancellation behavior from unordered synchronizer #52463

asubiotto · 2020-08-06T09:14:50Z

Previously, the unordered synchronizer would cancel all inputs if one of the
inputs encountered an error. This would result in possible context cancellation
errors racing with the original error and would sometimes cause the original
error to be overwritten (according to priority) in the distsql receiver (the
root of a query).

This behavior was incorrect, because what should happen is that the original
error should be propagated followed by a call to DrainMeta by the caller to
drain and close the remaining inputs. This commit removes internal context
cancellation in favor of this behavior.

Release note (bug fix): unexpected context cancellation errors could sometimes
be returned in the vectorized execution engine. This is now fixed.

Fixes #51647
Fixes #52057

asubiotto · 2020-08-06T09:19:44Z

One concern: what if one of the inputs is blocked in Next, say, reading from a gRPC stream? I can't come up with a scenario in which this would happen, but it might be worth thinking about more.

yuzefovich

LGTM, but is the race in the test concerning?

Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained

yuzefovich · 2020-08-13T17:50:30Z

I think we should get this in sooner rather than later.

asubiotto · 2020-08-14T13:13:40Z

You're right, thanks for the ping. The race was concerning but not something that could've happened before. Since we weren't waiting for inputs to exit in Next, DrainMeta had to deal with live inputs which resulted in this race. The new version properly synchronizes accesses using the existing channels and waitgroup.
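
A rough sketch of the synchronization idea, assuming a simplified model rather than the real code: blocked inputs are released through a channel, and the drain step reads their state only after a WaitGroup confirms that they have exited. The `input` type, its `meta` field, and `drainAll` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// input is a hypothetical stand-in for one synchronizer input. Its meta slice
// is written only by the input's goroutine and read only after it has exited.
type input struct {
	meta []string
}

// drainAll releases every blocked input, waits for all of them to exit, and
// only then reads their metadata.
func drainAll(inputs []*input) []string {
	var wg sync.WaitGroup
	stop := make(chan struct{})
	for i, in := range inputs {
		wg.Add(1)
		go func(i int, in *input) {
			defer wg.Done()
			<-stop // blocked until told what to do next
			in.meta = append(in.meta, fmt.Sprintf("input %d drained", i))
		}(i, in)
	}

	close(stop) // unblock all waiting inputs at once
	wg.Wait()   // after this returns, no input goroutine is still running

	var all []string
	for _, in := range inputs {
		all = append(all, in.meta...)
	}
	return all
}

func main() {
	fmt.Println(drainAll([]*input{{}, {}, {}}))
}
```

Reading `meta` before `wg.Wait()` returns would be a data race of the kind discussed above; waiting first makes the writes visible to the reader.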

yuzefovich

Reviewed 2 of 2 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto)

pkg/sql/colexec/parallel_unordered_synchronizer.go, line 156 at r2 (raw file):

		// batchCh is a buffered channel in order to offer non-blocking writes to
		// input goroutines. During normal operation, this channel will have at most
		// inputs messages. However, during DrainMeta, inputs might need to push

nit: s/inputs messages/len(inputs) messages/g.

pkg/sql/colexec/parallel_unordered_synchronizer.go, line 358 at r2 (raw file):

	}

	// Unblock any goroutines currently waiting to be told to read a next batch,

nit: s/a next/the next/g.

asubiotto

Fixed the CI failures. One is an edge case described in DrainMeta and the other is related to the recent change to send metadata messages with errors on all streams. TestVectorizedFlowShutdown failed because it got multiple metadata messages for the same id. I'm not sure why this wasn't failing before, but I think it probably has to do with the unordered synchronizer now properly forwarding the rest of the metadata in case of an error, so I changed the test expectations.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

yuzefovich

Reviewed 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)

pkg/sql/colexec/parallel_unordered_synchronizer.go, line 363 at r3 (raw file):

	// retrieve the next batch but encounters an error. There are now n+1 messages
	// in batchCh. Notifying all these inputs to read the next batch would result
	// in 2n+1 messages on batchCh, which would cause a deadlock since this

Given 2n+1 here should the creation of batchCh be updated?

pkg/sql/colexec/parallel_unordered_synchronizer.go, line 371 at r3 (raw file):

		select {
		case msg := <-s.batchCh:
			if msg == nil {

Do we ever send a nil message?

asubiotto

I don't believe the latest race failure was caused by this PR, so I submitted #52890 to track it. Rerunning the build.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

pkg/sql/colexec/parallel_unordered_synchronizer.go, line 363 at r3 (raw file):

Previously, yuzefovich wrote…

Given 2n+1 here should the creation of batchCh be updated?

I would say no, because I think it's easier to reason about 2n during normal operations and simply handle the complexity of an extra message here in DrainMeta.
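
To illustrate why the extra message matters, here is a small sketch that assumes a channel with capacity 2n, as the reply above suggests; the real channel's element type and creation site are not shown in this thread. `trySend` and `fill` are hypothetical helpers.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // the buffer is full; a plain send would block here
	}
}

// fill pushes 2n+1 messages into a channel with capacity 2n and reports how
// many were accepted and whether any send would have blocked.
func fill(n int) (accepted int, wouldBlock bool) {
	ch := make(chan int, 2*n)
	for i := 0; i < 2*n+1; i++ {
		if trySend(ch, i) {
			accepted++
		} else {
			wouldBlock = true
		}
	}
	return accepted, wouldBlock
}

func main() {
	accepted, wouldBlock := fill(3)
	fmt.Println(accepted, wouldBlock)
}
```

With nobody receiving, the last of the 2n+1 senders would block forever on a plain send, which is the deadlock the DrainMeta comment guards against by consuming a message before notifying the inputs.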

pkg/sql/colexec/parallel_unordered_synchronizer.go, line 371 at r3 (raw file):

Previously, yuzefovich wrote…

Do we ever send a nil message?

We don't explicitly send nil; a receive on a closed channel returns the zero value of the element type, which is nil here, so it's just the signal the channel API gives us that the channel is closed.
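
A short illustration of that channel behavior, using a hypothetical `msg` type in place of the synchronizer's real message type:

```go
package main

import "fmt"

// msg is a hypothetical message type; the channel carries pointers to it.
type msg struct{ id int }

// recvAfterClose sends one message, closes the channel, consumes the buffered
// message, and then receives once more from the now empty, closed channel.
func recvAfterClose() (*msg, bool) {
	ch := make(chan *msg, 1)
	ch <- &msg{id: 7}
	close(ch)

	<-ch // the buffered message is still delivered after close

	// The channel is closed and empty: the receive returns immediately with
	// the zero value (nil for a pointer type) and ok == false.
	m, ok := <-ch
	return m, ok
}

func main() {
	m, ok := recvAfterClose()
	fmt.Println(m == nil, ok)
}
```

The two-value receive form distinguishes a closed channel from an explicitly sent nil; a bare `msg == nil` check is sufficient only when nil is never sent.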

asubiotto · 2020-08-17T14:22:25Z

bors r=yuzefovich

craig · 2020-08-17T14:47:31Z

Build succeeded:

Compile Build (Cockroach)

asubiotto requested review from yuzefovich and a team August 6, 2020 09:14

yuzefovich reviewed Aug 6, 2020

asubiotto force-pushed the pusc branch from 6b8b351 to 67c74ff Compare August 14, 2020 13:12

yuzefovich approved these changes Aug 14, 2020

asubiotto force-pushed the pusc branch from 67c74ff to d92ef26 Compare August 14, 2020 16:23

asubiotto commented Aug 14, 2020

yuzefovich reviewed Aug 14, 2020

asubiotto mentioned this pull request Aug 17, 2020

execinfra: RemoteProducerMetadata_Metrics race #52890

Closed

asubiotto force-pushed the pusc branch from d92ef26 to 4035245 Compare August 17, 2020 08:34

asubiotto commented Aug 17, 2020

asubiotto force-pushed the pusc branch 2 times, most recently from 0dc130e to 2a14c44 Compare August 17, 2020 12:12

asubiotto force-pushed the pusc branch from 2a14c44 to 3888352 Compare August 17, 2020 12:50

craig bot merged commit edb904b into cockroachdb:master Aug 17, 2020

asubiotto deleted the pusc branch August 17, 2020 16:17
