
flowinfra: fix shutting down vectorized flows with wrapped processors #43579

Merged · 1 commit · Jan 13, 2020

Conversation

@yuzefovich commented Dec 27, 2019

flowinfra: fix shutting down vectorized flows with wrapped processors

Our flow shutdown mechanism relies on all components listening to the
same context, which is canceled by the "root" component (either the
materializer or an outbox) when the query needs to exit. Columnarizers
currently capture the context that is passed into their constructor and
pass that context on to their downstream operators. However, I believe
it is possible for the cancellation of the "flow" context not to trigger
the cancellation of the columnarizer's context, leaving some components
hanging.

The problem was that we derived a new context in the startInternal call,
but that context was never returned from the FlowBase.Setup method, so
ServerImpl.setupFlow kept returning the old context. This is now fixed
by performing the context derivation as the first step of Setup and
returning the derived context.

Fixes: #43269.

Release note (bug fix): Previously, the query shutdown mechanism could
fail to fully clean up the flow infrastructure when the query was
executed via the vectorized engine and the query plan contained wrapped
row-by-row processors (in the 19.2 release, those are Lookup join and
Index join).
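
To make the shape of the fix concrete, here is a minimal, self-contained Go sketch; the type and method names are simplified stand-ins for the real FlowBase code, not the actual implementation:

package main

import (
	"context"
	"fmt"
)

// flowBase is an illustrative stand-in for FlowBase.
type flowBase struct {
	ctxCancel context.CancelFunc
	ctxDone   <-chan struct{}
}

// Setup derives the flow's cancellable context as its very first step and
// returns it, so the caller (e.g. ServerImpl.setupFlow) propagates the same
// context that ctxCancel will later cancel.
func (f *flowBase) Setup(ctx context.Context) (context.Context, error) {
	ctx, cancel := context.WithCancel(ctx)
	f.ctxCancel = cancel
	f.ctxDone = ctx.Done()
	return ctx, nil
}

func main() {
	f := &flowBase{}
	ctx, err := f.Setup(context.Background())
	if err != nil {
		panic(err)
	}
	f.ctxCancel()
	<-ctx.Done() // the returned context observes the flow's cancellation
	fmt.Println("flow context canceled:", ctx.Err())
}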

@yuzefovich requested review from @asubiotto and a team, December 27, 2019 20:01
@cockroach-teamcity

This change is Reviewable

@yuzefovich added the do-not-merge label (bors won't merge a PR with this label), Dec 27, 2019
@yuzefovich force-pushed the vec-shutdown branch 2 times, most recently from 8842c24 to ce6a6f7, December 28, 2019 02:08
@yuzefovich changed the title from "colexec: fix shutting down remote flows with wrapped processors" to "colexec: fix shutting down flows with wrapped processors", Dec 28, 2019
@yuzefovich removed the do-not-merge label, Dec 30, 2019
@yuzefovich changed the title from "colexec: fix shutting down flows with wrapped processors" to "flowinfra: fix shutting down vectorized flows with wrapped processors", Dec 30, 2019
@yuzefovich added the backport-19.2.x label (flags PRs that need to be backported to 19.2), Dec 30, 2019
@asubiotto left a comment

Did you figure out how the context passed into the constructor differed from the expected context? It would be nice to figure that out before merging this, just to understand the problem better. Also, could you add a wrapped processor to VectorizedFlowShutdownTest to cover this scenario? I guess the issue is specific to how a flow is set up, and that test does its own setup, but it might be good to have that coverage anyway.


@yuzefovich left a comment

Hm, I spent some time trying to understand where the contexts deviate, without success. I came to this solution on "philosophical" grounds. I feel like there might be a place in the code where we don't use the "descendant" of the cancellation context, but I couldn't find it. One place where this could happen is maybe the flow scheduler, though I'm not very familiar with that code.

I also spent some time thinking about how to write a unit test for this, and didn't succeed either. The complication is that if we explicitly use different contexts for wrapped processors, then we're not doing what the "production" code will now do; but if we use the same contexts, then we're not testing much. It seems like too many things to mock out. I added the logic test from the issue, which seems to me to be the best option.



pkg/sql/flowinfra/flow.go, line 339 at r1 (raw file):

// Start is part of the Flow interface.
func (f *FlowBase) Start(ctx context.Context, doneFn func()) error {
	if _, err := f.startInternal(ctx, f.processors, doneFn); err != nil {

I thought that maybe this spot was the culprit, so I tried returning the new context from the Start method; unfortunately, that didn't work.

@asubiotto left a comment

Isn't the problem the fact that we only derive a cancellable context in startInternal, but are very likely to use a parent context in Setup since none is returned from it? How about, rather than using SetupCtxCancel, we derive a cancellable context in Setup, store it in the flow, do the same in startInternal, and then set ctxCancel to cancel both?


@yuzefovich left a comment

So you're suggesting having two ctxCancels in FlowBase? That would also mean two ctxDones.


@yuzefovich left a comment

Oh, I think I see the problem.


@asubiotto

I'm suggesting having a setupCtxCancel and setupCtxDone that are set in Setup, and then in startInternal:

ctx, cancel := context.WithCancel(startInternalCtx)
flow.ctxCancel = func() {
    flow.setupCtxCancel()
    cancel()
    <-flow.setupCtxDone
}
flow.ctxDone = ctx.Done()

Or something like that
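
For reference, a runnable sketch of that combined-cancel pattern (the variable names mirror the suggestion above but are otherwise illustrative, not actual FlowBase fields):

package main

import (
	"context"
	"fmt"
)

func main() {
	// Context derived (and stored) in Setup.
	setupCtx, setupCtxCancel := context.WithCancel(context.Background())
	setupCtxDone := setupCtx.Done()

	// Context derived later in startInternal, possibly from a different parent.
	startInternalCtx := context.Background()
	ctx, cancel := context.WithCancel(startInternalCtx)

	// A single ctxCancel that cancels both contexts and then waits for the
	// Setup context to be observably done.
	ctxCancel := func() {
		setupCtxCancel()
		cancel()
		<-setupCtxDone
	}
	ctxDone := ctx.Done()

	ctxCancel()
	<-ctxDone
	fmt.Println("both contexts canceled:", setupCtx.Err(), ctx.Err())
}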

@yuzefovich left a comment

Yep, that was it. PTAL.



pkg/sql/distsql/server.go, line 352 at r3 (raw file):

	}
	var err error
	if ctx, err = f.Setup(ctx, &req.Flow, opt); err != nil {

This was the actual problem: we were not returning the correct context from ServerImpl.setupFlow.
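
A small standalone Go example of the underlying pitfall (an assumed, simplified scenario rather than the CockroachDB code): canceling a derived context is invisible to anything still holding the parent, which is what happened while setupFlow kept returning the old context:

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	parent := context.Background()
	derived, cancel := context.WithCancel(parent)

	cancel() // the "flow" context is canceled...

	fmt.Println("derived:", derived.Err()) // context.Canceled

	select {
	case <-parent.Done():
		fmt.Println("parent observed the cancellation")
	case <-time.After(100 * time.Millisecond):
		// ...but the parent never fires, so components built from it hang.
		fmt.Println("parent never observes the cancellation")
	}
}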

@asubiotto left a comment

:lgtm:

Reviewed 1 of 1 files at r2, 4 of 5 files at r3.

@yuzefovich
TFTR!

bors r+

craig bot commented Jan 13, 2020

Build failed

@yuzefovich
Flake.

bors r+

craig bot pushed a commit that referenced this pull request Jan 13, 2020
43579: flowinfra: fix shutting down vectorized flows with wrapped processors r=yuzefovich a=yuzefovich


Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot commented Jan 13, 2020

Build succeeded

Labels: backport-19.2.x (flags PRs that need to be backported to 19.2)

Linked issue: colexec: incomplete shutdown of the flows with wrapped processors (#43269)

3 participants