Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce in-Segment Trim for GroupBy OrderBy Query #6991

Merged
merged 52 commits into from
Jun 10, 2021

Conversation

wuwenw
Copy link
Contributor

@wuwenw wuwenw commented May 27, 2021

Description

One of the major bottlenecks for the current GroupBy OrderBy query on high cardinality columns is the merge phase. Essentially every segment brings a large number of intermediate results to a global concurrent map for further aggregation and merge, which takes up a lot of space and is very time-consuming. This PR introduces an optimization option that each segment trims its intermediate results to a given size. The size is configurable by the user and is guaranteed to be max(limit N * 5, 5000). It won't affect accuracy much but reduces the running time for high cardinality dataset. ~5 times faster for String data with 10M cardinality. This option is turned off by default to ensure backward compatibility.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

  • Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Optimized GroupBy OrderBy queries by introducing an in-segment trim option that can significantly reduce the size of intermediate results and speed up the execution.

Documentation

GroupBy OrderBy In-Segment Trim
This option is used to apply an additional filter in the segments for the GroupBy OrderBy query. Based on a configurable trim_size k, each segment will sort and send back the top k results only. It will slightly decrease the accuracy but boost up the performance for high-cardinality data.

Related query options:
SEGMENT TRIM ON
SEGMENT TRIM SIZE

Related server config:
enable.segment.group.trim
size.segment.group.trim

Related query keywords:
LIMIT N

SEGMENT TRIM ON and enable.segment.group.trim are used to enable/disable this feature. By default, it is turned off and segments will send back all results.

If it is enabled, then actual trim_size is calculated based on the parameters (SEGMENT TRIM SIZE and size.segment.group.trim) and query keyword LIMIT N.

Trim_size = max(5*LIMIT, minTrimSize) where minTrimSize is set up by the user (Default value: 5000)

Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wuwenw thanks for the PR. Yet to fully go through it. Please give a day to go through it.

@codecov-commenter
Copy link

codecov-commenter commented May 27, 2021

Codecov Report

Merging #6991 (03beb3e) into master (0185482) will decrease coverage by 7.81%.
The diff coverage is 64.74%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #6991      +/-   ##
============================================
- Coverage     73.23%   65.42%   -7.82%     
  Complexity       12       12              
============================================
  Files          1439     1454      +15     
  Lines         71333    72091     +758     
  Branches      10334    10441     +107     
============================================
- Hits          52243    47162    -5081     
- Misses        15578    21517    +5939     
+ Partials       3512     3412     -100     
Flag Coverage Δ
integration ?
unittests 65.42% <64.74%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...roker/requesthandler/BaseBrokerRequestHandler.java 17.37% <0.00%> (-53.88%) ⬇️
...e/pinot/common/minion/MergeRollupTaskMetadata.java 0.00% <0.00%> (ø)
...e/pinot/common/minion/MinionTaskMetadataUtils.java 0.00% <0.00%> (-75.00%) ⬇️
...mmon/restlet/resources/SegmentServerDebugInfo.java 0.00% <0.00%> (ø)
...che/pinot/controller/api/debug/TableDebugInfo.java 0.00% <0.00%> (ø)
...ces/PinotSegmentUploadDownloadRestletResource.java 10.00% <0.00%> (-45.03%) ⬇️
...t/controller/api/resources/TableDebugResource.java 0.00% <0.00%> (ø)
...troller/helix/core/minion/ClusterInfoAccessor.java 32.14% <0.00%> (-47.86%) ⬇️
...x/core/minion/generator/TaskGeneratorRegistry.java 76.00% <ø> (-4.00%) ⬇️
.../org/apache/pinot/core/common/MinionConstants.java 0.00% <0.00%> (ø)
... and 435 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0185482...03beb3e. Read the comment docs.

@wuwenw wuwenw requested a review from Jackie-Jiang May 28, 2021 02:55
@wuwenw wuwenw changed the base branch from master to upgrade_jdk_11 June 2, 2021 17:50
@wuwenw wuwenw changed the base branch from upgrade_jdk_11 to master June 2, 2021 17:50
@wuwenw wuwenw requested a review from Jackie-Jiang June 3, 2021 00:49
@wuwenw wuwenw requested a review from Jackie-Jiang June 4, 2021 03:54
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please revise the tests as discussed offline

@wuwenw wuwenw requested a review from Jackie-Jiang June 9, 2021 02:24
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise

@@ -52,39 +53,53 @@
import org.slf4j.LoggerFactory;



Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) remove

@@ -38,28 +44,34 @@
@SuppressWarnings("rawtypes")
public class AggregationGroupByOrderByOperator extends BaseOperator<IntermediateResultsBlock> {
private static final String OPERATOR_NAME = "AggregationGroupByOrderByOperator";
private static final int TRIM_OFF = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems not used?

* Currently tests 'max' functions, and can be easily extended to
* test other conditions such as GroupBy without OrderBy
*/
public class GroupByInSegmentTrimTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test could be flaky, fix as discussed offline

@Jackie-Jiang Jackie-Jiang merged commit 2a5d5b7 into apache:master Jun 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants