colexec: make unordered distinct streaming-like #57579

Merged
merged 1 commit into from Dec 7, 2020



@yuzefovich yuzefovich commented Dec 4, 2020

Previously, when executing an unordered distinct, we would build the
whole hash table and consume the input source entirely before emitting
any output. This is suboptimal when the query has a limit, since
we're likely to reach the limit long before consuming the whole
input source.

This commit makes the unordered distinct more streaming-like: it builds
the hash table one batch at a time, and whenever new distinct tuples
are appended to the hash table, all of them are emitted in the output.
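
The batch-at-a-time behavior described above can be illustrated with a minimal, self-contained sketch. This is not the colexec implementation: `batchSource`, `streamingDistinct`, and `distinctWithLimit` are hypothetical names, batches are plain `[]int` slices, and a Go map stands in for the hash table. It only shows why a limit can be satisfied without consuming the whole input.

```go
package main

import "fmt"

// batchSource is a hypothetical stand-in for an input operator: each call to
// next returns the next batch of values, or nil once the input is exhausted.
type batchSource struct {
	batches [][]int
	pos     int // number of batches consumed so far
}

func (s *batchSource) next() []int {
	if s.pos >= len(s.batches) {
		return nil
	}
	b := s.batches[s.pos]
	s.pos++
	return b
}

// streamingDistinct emits the new distinct values found in each input batch
// as soon as that batch has been processed. The map is an assumption used in
// place of a real hash table.
type streamingDistinct struct {
	input *batchSource
	seen  map[int]struct{}
}

func newStreamingDistinct(input *batchSource) *streamingDistinct {
	return &streamingDistinct{input: input, seen: make(map[int]struct{})}
}

// next returns the next non-empty batch of previously unseen values, or nil
// when the input is exhausted.
func (d *streamingDistinct) next() []int {
	for {
		in := d.input.next()
		if in == nil {
			return nil
		}
		var out []int
		for _, v := range in {
			if _, ok := d.seen[v]; !ok {
				d.seen[v] = struct{}{}
				out = append(out, v)
			}
		}
		if len(out) > 0 {
			return out
		}
		// The batch contained no new values; keep reading.
	}
}

// distinctWithLimit pulls from d only until limit values have been collected.
func distinctWithLimit(d *streamingDistinct, limit int) []int {
	var res []int
	for len(res) < limit {
		out := d.next()
		if out == nil {
			break
		}
		for _, v := range out {
			if len(res) == limit {
				break
			}
			res = append(res, v)
		}
	}
	return res
}

func main() {
	src := &batchSource{batches: [][]int{{1, 1, 2}, {2, 3}, {4, 5}, {6}}}
	res := distinctWithLimit(newStreamingDistinct(src), 3)
	// Only two of the four input batches are read before the limit is met.
	fmt.Println(res, src.pos) // prints "[1 2 3] 2"
}
```

With a limit of 3, the sketch stops after the second batch, whereas a fully buffering approach would read all four batches before emitting anything.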

Fixes: #57566.

Release note (performance improvement): Previously, when performing an
unordered DISTINCT operation via the vectorized execution engine,
CockroachDB would buffer all tuples from the input, which is suboptimal
when the query has a LIMIT clause. This has now been fixed. The
behavior was introduced in 20.1. Note that the old row-by-row
engine doesn't have this issue.

@yuzefovich yuzefovich requested review from asubiotto and a team December 4, 2020 18:07

yuzefovich commented Dec 4, 2020

Benchmarks are here. I think the increase in allocations in some cases is because we can no longer allocate a single large output batch once; instead, the output now has dynamic batch size behavior.


@asubiotto asubiotto left a comment


:lgtm:

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/colexec/hashtable.go, line 359 at r1 (raw file):

	if ht.buildMode != hashTableDistinctBuildMode {
		colexecerror.InternalError(errors.AssertionFailedf(
			"hashTable.fullBuild is called in unexpected build mode %d", ht.buildMode,

nit: s/fullBuild/distinctBuild


@yuzefovich yuzefovich left a comment


I decided to add a release note for this, and I think it'll be worth it to backport it to 20.2 since it is a regression between using the vec and the row engines.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)


craig bot commented Dec 7, 2020

Build succeeded:

@craig craig bot merged commit 278214f into cockroachdb:master Dec 7, 2020
@yuzefovich yuzefovich deleted the distinct-limit branch December 7, 2020 17:04
Development

Successfully merging this pull request may close these issues.

colexec: suboptimal behavior of unordered distinct with limit