Potential Optimization for GroupBy Sorting Phase #6875
wuwenw wants to merge 8 commits into apache:master from
Conversation
Is this the same as the standard order-statistics algorithm, which gives worst-case linear time (compared to the heap sort approach) for selecting the k largest or smallest elements?
The selection algorithm used here is quickselect, a partial quicksort. Its best-case and average-case running time is O(n), but its worst case is O(n^2).
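To make the complexity discussion concrete, here is a minimal quickselect sketch in Java. It is an illustration only, not the code in this PR: the class name `QuickSelectSketch`, the method `selectTopK`, and the random-pivot Lomuto partition are all assumptions chosen for brevity.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not the class added by this PR.
final class QuickSelectSketch {

  // Rearranges `a` in place so that a[0..k) holds the k smallest elements
  // under `cmp`, in no particular order. Does nothing if k <= 0 or k >= a.length.
  static <T> void selectTopK(T[] a, int k, Comparator<T> cmp) {
    if (k <= 0 || k >= a.length) {
      return;
    }
    int target = k - 1; // index that must end up holding the k-th smallest element
    int lo = 0;
    int hi = a.length - 1;
    while (lo < hi) {
      int p = partition(a, lo, hi, cmp);
      if (p == target) {
        return;
      } else if (p < target) {
        lo = p + 1; // k-th smallest lies to the right of the pivot
      } else {
        hi = p - 1; // k-th smallest lies to the left of the pivot
      }
    }
  }

  // Lomuto partition around a randomly chosen pivot. A random pivot makes the
  // O(n^2) worst case unlikely on distinct keys, but does not eliminate it
  // (for example, many equal keys still degrade this simple scheme).
  private static <T> int partition(T[] a, int lo, int hi, Comparator<T> cmp) {
    int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
    swap(a, r, hi);
    T pivot = a[hi];
    int store = lo;
    for (int i = lo; i < hi; i++) {
      if (cmp.compare(a[i], pivot) < 0) {
        swap(a, i, store++);
      }
    }
    swap(a, store, hi);
    return store;
  }

  private static <T> void swap(T[] a, int i, int j) {
    T tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }

  public static void main(String[] args) {
    Integer[] data = {9, 1, 8, 2, 7, 3, 6, 4, 5, 0};
    selectTopK(data, 3, Comparator.<Integer>naturalOrder());
    // The first three slots now hold 0, 1, 2 in some order.
    System.out.println(Arrays.toString(Arrays.copyOf(data, 3)));
  }
}
```

Each partition pass discards one side of the pivot, which is where the O(n) average comes from. A worst-case linear variant (median of medians) exists, but it is not what this sketch implements.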
Codecov Report
@@             Coverage Diff              @@
##             master    #6875      +/-   ##
============================================
- Coverage     73.99%   73.53%   -0.46%
  Complexity       12       12
============================================
  Files          1421     1421
  Lines         69141    70055     +914
  Branches       9986    10130     +144
============================================
+ Hits          51159    51518     +359
- Misses        14633    15129     +496
- Partials       3349     3408      +59
Description
Currently, we use heap sort at the end of groupBy, with time complexity O(n + n log k). Since we only need to keep up to TRIM_SIZE records (normally 5000), we can instead use a pivot-based selection algorithm to select the top k elements. When the number of records is relatively low (i.e. smaller than 150k), the pivot selection algorithm improves performance by around 30-40%, at the expense of extra memory usage. However, current benchmark results show that this algorithm becomes very inefficient once memory usage exceeds a certain limit, mainly because of GC overhead. The detailed results and discussion can be found in this link.
Note that this PR does not change the original API or method; it only adds a second option for the groupBy sorting phase.
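For contrast with the selection-based option, the heap-style trimming described above can be sketched roughly as follows. This is an assumption-laden illustration, not the existing groupBy code: `HeapTrimSketch` and `topK` are hypothetical names, and `k` stands in for TRIM_SIZE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; not the existing groupBy implementation.
final class HeapTrimSketch {

  // Returns the k smallest records under `cmp`, sorted ascending.
  // The heap never holds more than k entries, so extra memory is O(k)
  // and each of the n records costs at most O(log k).
  static <T> List<T> topK(Iterable<T> records, int k, Comparator<T> cmp) {
    if (k <= 0) {
      return new ArrayList<>();
    }
    // Reversed comparator: the head of the heap is the largest record kept so far.
    PriorityQueue<T> heap = new PriorityQueue<>(k, cmp.reversed());
    for (T record : records) {
      if (heap.size() < k) {
        heap.offer(record);
      } else if (cmp.compare(record, heap.peek()) < 0) {
        heap.poll();        // evict the current worst record
        heap.offer(record); // keep the better one
      }
    }
    List<T> result = new ArrayList<>(heap);
    result.sort(cmp);
    return result;
  }

  public static void main(String[] args) {
    List<Integer> input = Arrays.asList(9, 1, 8, 2, 7, 3, 6, 4, 5, 0);
    System.out.println(topK(input, 3, Comparator.<Integer>naturalOrder()));
  }
}
```

The trade-off mirrors the description: the bounded heap touches only k records at a time, while a selection pass needs all n records materialized in one array, which is consistent with the extra memory usage and GC pressure reported for large inputs.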
Upgrade Notes
Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)
Does this PR fix a zero-downtime upgrade introduced earlier?
Does this PR otherwise need attention when creating release notes?
Release Notes
Documentation