Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colexec: optimize the external sort for top K case #66303

Merged
merged 2 commits into from
Jun 16, 2021

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Jun 10, 2021

colexec: extend external sort benchmark for top K case

Release note: None

colexec: optimize the external sort for top K case

Previously, if the top K sort spilled to disk, we used the general
external sort. However, we could easily optimize that case with the
knowledge that only K tuples are needed by the output.

Namely, we can use the in-memory top K sort (in order to create each new
partition of the desired size) and also limit the size of each merged
partition by K tuples. This commit adds these optimizations.

Addresses: #45192.

Release note: None

@yuzefovich yuzefovich requested review from michae2 and a team June 10, 2021 02:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding tests for this new code is important.

Also, any benchmark changes of note?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/colexec/external_sort.go, line 421 at r1 (raw file):

			// the new partition).
			s.outputUnlimitedAllocator.ReleaseMemory(s.outputUnlimitedAllocator.Used())
			// Make sure to close out all partitions we have just read from.

Was this being missing before a bug?


pkg/sql/colexec/external_sort.go, line 468 at r1 (raw file):

// If the current partition reaches the desired topK number of tuples, a zero
// batch is enqueued and true is returned indicating that the current partition
// is done.

I don't understand how this works - why is it okay to stop creating a partition once we reach k values inside the partition? Unlike the top k sort, which uses the heap on every input, we don't do that hear, right? Or do we? Also, when does top k sort spill in the first place?

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have unit tests for the external sort which runs "top K" test cases too, so we do get the correctness test coverage for free (in fact, the tests caught a bug in my WIP).

What missing is the test that verifies that each partition is at most K tuples (which is actually not true for merging phase - there we're counting with the "batch granularity"), but testing that is annoying and is not worth it IMO. Let me know if you think otherwise.

I've just added another commit to extend the benchmark and will share the results shortly.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @michae2)


pkg/sql/colexec/external_sort.go, line 421 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Was this being missing before a bug?

No, this is only needed if we do not exhaust partitions we're reading from, but previously we would always fully exhaust each old partition. When zero-batch is dequeued from the partition, the partition is closed for reading automatically. So this change has an effect only in the case of top K sort because we might stop reading from the partitions once we have at least K tuples.

Extended the comment.


pkg/sql/colexec/external_sort.go, line 468 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I don't understand how this works - why is it okay to stop creating a partition once we reach k values inside the partition? Unlike the top k sort, which uses the heap on every input, we don't do that hear, right? Or do we? Also, when does top k sort spill in the first place?

The crucial observation here is that b is not coming from the input to the sort operation as a whole - b is coming to us either from the in-memory top K sorter (which has already performed the sort over a subset of tuples, with input partitioner defining the boundaries of that subset) or from the merger (which performs the merge of N already sorted partitions while preserving the order of tuples). So we are implicitly using the heap of the in-memory top K sorter here. Expanded the comment.

Top K sort will spill to disk whenever current "top K" rows (those stored in the heap and the AppendOnlyBufferedBatch) exceed the workmem limit. That limit can be exceeded when either K is very large or rows are very large in size. In the former case the current PR doesn't improve things (and probably makes things slightly worse than before), but in the latter case the current PR should be an improvement.

However, I think the former case should be addressed in a different manner - we should decide during planning that K is too large and use the general sort, we actually have a TODO for this:

// TODO(radu): we should not choose this processor when K is very large
// - it is slower unless we get significantly more rows than the limit.

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code :lgtm: (assuming no benchmark surprises), thanks for the improved comments and explanation

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2)

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks look good:

name                                                   old time/op    new time/op     delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24      67.2µs ± 5%     66.9µs ± 2%      ~     (p=0.921 n=10+9)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24       3.75ms ± 5%     3.47ms ±37%      ~     (p=0.220 n=6+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24       266µs ± 5%      263µs ± 3%      ~     (p=0.353 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24       4.07ms ± 2%     3.95ms ± 7%    -2.76%  (p=0.015 n=8+8)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24       291µs ± 8%      283µs ± 3%    -2.98%  (p=0.019 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24       4.40ms ± 4%     4.07ms ± 4%    -7.42%  (p=0.000 n=8+8)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24      316µs ± 1%      309µs ± 2%    -2.17%  (p=0.001 n=9+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24      4.09ms ± 4%     4.00ms ± 4%    -2.24%  (p=0.050 n=8+8)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24      737µs ± 2%      751µs ± 3%    +1.82%  (p=0.023 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24      5.78ms ±17%     4.28ms ±25%   -26.03%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24      796µs ± 3%      807µs ± 5%      ~     (p=0.393 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24      6.95ms ±14%     4.43ms ±23%   -36.24%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24    4.52ms ± 3%     4.54ms ± 2%      ~     (p=0.905 n=10+9)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24     10.5ms ± 8%      8.1ms ±12%   -22.91%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24    7.17ms ± 4%     7.35ms ± 2%    +2.49%  (p=0.004 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24     42.6ms ± 5%     10.1ms ± 1%   -76.19%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24    7.40ms ± 2%     7.48ms ± 3%      ~     (p=0.190 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24     70.6ms ± 3%     10.4ms ± 3%   -85.22%  (p=0.000 n=10+9)

name                                                   old speed      new speed       delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24     244MB/s ± 5%    245MB/s ± 2%      ~     (p=0.921 n=10+9)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24     4.16MB/s ±16%   4.90MB/s ±36%      ~     (p=0.080 n=8+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24     123MB/s ± 5%    124MB/s ± 3%      ~     (p=0.353 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24     8.06MB/s ± 2%   8.29MB/s ± 6%    +2.89%  (p=0.014 n=8+8)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24     225MB/s ± 7%    232MB/s ± 3%    +3.01%  (p=0.019 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24     14.9MB/s ± 5%   16.1MB/s ± 4%    +8.02%  (p=0.000 n=8+8)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24    415MB/s ± 1%    425MB/s ± 2%    +2.23%  (p=0.001 n=9+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24    32.1MB/s ± 4%   32.8MB/s ± 3%    +2.30%  (p=0.050 n=8+8)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24    356MB/s ± 2%    349MB/s ± 3%    -1.76%  (p=0.023 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24    45.9MB/s ±20%   62.4MB/s ±31%   +36.00%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24    658MB/s ± 3%    650MB/s ± 4%      ~     (p=0.393 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24    75.9MB/s ±16%  112.1MB/s ±11%   +47.72%  (p=0.000 n=10+8)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24   464MB/s ± 3%    462MB/s ± 2%      ~     (p=0.905 n=10+9)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24    200MB/s ± 8%    260MB/s ±13%   +30.20%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24   585MB/s ± 3%    571MB/s ± 2%    -2.45%  (p=0.004 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24   98.5MB/s ± 4%  413.3MB/s ± 1%  +319.73%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24  1.13GB/s ± 2%   1.12GB/s ± 3%      ~     (p=0.190 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24    119MB/s ± 3%    799MB/s ± 6%  +572.36%  (p=0.000 n=10+10)

name                                                   old alloc/op   new alloc/op    delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24      37.8kB ± 2%     37.8kB ± 2%      ~     (p=0.380 n=10+10)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24        219kB ± 0%       99kB ± 0%   -54.92%  (p=0.000 n=9+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24      48.5kB ± 0%     48.5kB ± 0%      ~     (all equal)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24        390kB ± 0%      129kB ± 0%   -66.97%  (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24      70.5kB ± 0%     70.5kB ± 0%    -0.09%  (p=0.000 n=9+9)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24        524kB ± 0%      179kB ± 0%   -65.73%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24     37.6kB ± 0%     37.6kB ± 0%    +0.09%  (p=0.000 n=9+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24      1.03MB ± 0%     0.10MB ± 1%   -90.42%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24     48.5kB ± 0%     48.5kB ± 0%      ~     (all equal)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24      1.69MB ± 0%     0.13MB ± 0%   -92.40%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24     70.5kB ± 0%     70.5kB ± 0%    -0.09%  (p=0.000 n=9+9)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24      3.00MB ± 0%     0.18MB ± 0%   -94.03%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24    37.6kB ± 0%     37.6kB ± 0%    +0.08%  (p=0.000 n=9+8)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24     14.2MB ± 0%      0.1MB ± 1%   -99.31%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24    48.6kB ± 0%     48.6kB ± 0%    +0.01%  (p=0.000 n=8+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24     26.3MB ± 0%      0.1MB ± 0%   -99.51%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24    70.6kB ± 0%     70.5kB ± 0%    -0.09%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24     50.1MB ± 0%      0.2MB ± 0%   -99.64%  (p=0.000 n=9+9)

name                                                   old allocs/op  new allocs/op   delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24         113 ± 0%        112 ± 0%    -0.88%  (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24          327 ± 0%        317 ± 0%    -3.06%  (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24         130 ± 0%        128 ± 0%    -1.54%  (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24          392 ± 0%        364 ± 0%    -7.14%  (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24         162 ± 0%        160 ± 0%    -1.23%  (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24          509 ± 0%        453 ± 0%   -11.00%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24        113 ± 0%        112 ± 0%    -0.88%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24         418 ± 0%        317 ± 0%   -24.16%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24        130 ± 0%        128 ± 0%    -1.54%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24         570 ± 0%        364 ± 0%   -36.14%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24        162 ± 0%        160 ± 0%    -1.23%  (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24         866 ± 0%        453 ± 0%   -47.69%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24       113 ± 0%        112 ± 0%    -0.88%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24      1.70k ± 0%      0.32k ± 0%   -81.40%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24       130 ± 0%        128 ± 0%    -1.54%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24      3.26k ± 0%      0.36k ± 0%   -88.84%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24       162 ± 0%        160 ± 0%    -1.23%  (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24      6.24k ± 0%      0.45k ± 0%   -92.74%  (p=0.000 n=10+10)

(note that this for K == 128, so imagining that each row is like 1MiB in size, we would see roughly the same improvement)

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2)

@craig
Copy link
Contributor

craig bot commented Jun 16, 2021

Build failed:

Previously, if the top K sort spilled to disk, we used the general
external sort. However, we could easily optimize that case with the
knowledge that only K tuples are needed by the output.

Namely, we can use the in-memory top K sort (in order to create each new
partition of the desired size) and also limit the size of each merged
partition by K tuples. This commit adds these optimizations.

Release note: None
@yuzefovich
Copy link
Member Author

Needed to fix a linter.

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 16, 2021

Build succeeded:

@craig craig bot merged commit 0ba1894 into cockroachdb:master Jun 16, 2021
@yuzefovich yuzefovich deleted the topk branch June 16, 2021 02:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants