Skip to content

Potential Optmization for GroupBy Sorting Phase#6875

Open
wuwenw wants to merge 8 commits intoapache:masterfrom
wuwenw:topk
Open

Potential Optmization for GroupBy Sorting Phase#6875
wuwenw wants to merge 8 commits intoapache:masterfrom
wuwenw:topk

Conversation

@wuwenw
Copy link
Copy Markdown
Contributor

@wuwenw wuwenw commented May 4, 2021

Description

Currently, we're using heap sort at the end of groupBy, whose big O time complexity is n+nlogk. Since it is only necessary to keep the number of records up to TRIM_SIZE (normally 5000), we can use the pivot selection algorithm to select topk elements. When the number of records is relatively low (i.e. smaller than 150k), pivot selection algorithm can boost the performance by around 30-40%, at the expanse of extra memory usage. However, current benchmark results show that this algorithm becomes super inefficient if the memory usage exceeds some limits, mainly because of GC overhead. The detailed results and discussion can be found in this link.

Note that this PR does not change the original API and method, but just brings in a second option for the groupBy sorting phase.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

  • Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Documentation

@siddharthteotia
Copy link
Copy Markdown
Contributor

Is this same as standard order statistics algorithm which gives worst case linear time (compared to heap sort approach) for selecting k largest or smallest elements?

@wuwenw
Copy link
Copy Markdown
Contributor Author

wuwenw commented May 5, 2021

Is this same as standard order statistics algorithm which gives worst case linear time (compared to heap sort approach) for selecting k largest or smallest elements?

The selection algorithm used here is quickselct, a partial quicksort algorithm. The best case and average performance is O(n) but the worst case is O(n^2).

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 5, 2021

Codecov Report

Merging #6875 (148673a) into master (3e20979) will decrease coverage by 0.45%.
The diff coverage is 34.66%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #6875      +/-   ##
============================================
- Coverage     73.99%   73.53%   -0.46%     
  Complexity       12       12              
============================================
  Files          1421     1421              
  Lines         69141    70055     +914     
  Branches       9986    10130     +144     
============================================
+ Hits          51159    51518     +359     
- Misses        14633    15129     +496     
- Partials       3349     3408      +59     
Flag Coverage Δ Complexity Δ
integration 43.28% <0.00%> (-0.29%) 7.00 <0.00> (ø)
unittests 65.45% <34.66%> (-0.52%) 12.00 <0.00> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...org/apache/pinot/core/data/table/TableResizer.java 71.07% <34.66%> (-21.17%) 0.00 <0.00> (ø)
...he/pinot/common/utils/request/FilterQueryTree.java 65.21% <0.00%> (-34.79%) 0.00% <0.00%> (ø%)
...a/manager/realtime/RealtimeSegmentDataManager.java 28.57% <0.00%> (-28.58%) 0.00% <0.00%> (ø%)
...mon/metadata/segment/OfflineSegmentZKMetadata.java 74.57% <0.00%> (-25.43%) 0.00% <0.00%> (ø%)
.../starter/helix/HelixInstanceDataManagerConfig.java 78.04% <0.00%> (-21.96%) 0.00% <0.00%> (ø%)
...che/pinot/core/common/datatable/BaseDataTable.java 81.06% <0.00%> (-18.18%) 0.00% <0.00%> (ø%)
...ator/transform/function/CastTransformFunction.java 62.50% <0.00%> (-16.67%) 0.00% <0.00%> (ø%)
...impl/dictionary/DoubleOnHeapMutableDictionary.java 54.21% <0.00%> (-12.05%) 0.00% <0.00%> (ø%)
.../segment/processing/collector/ConcatCollector.java 89.74% <0.00%> (-10.26%) 0.00% <0.00%> (ø%)
.../filter/predicate/InPredicateEvaluatorFactory.java 78.21% <0.00%> (-7.99%) 0.00% <0.00%> (ø%)
... and 184 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e20979...148673a. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants