New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10196: Improve IntroSorter with 3-ways partitioning. #404
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether this will show up as a difference in real-life benchmarks!
@@ -35,51 +38,102 @@ public final void sort(int from, int to) { | |||
quicksort(from, to, 2 * MathUtil.log(to - from, 2)); | |||
} | |||
|
|||
/** | |||
* Sorts between from (inclusive) and to (exclusive) with intro sort. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"with intro sort"? Is this accurate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think yes, because we still fallback to heap sort if the recursive stack goes too large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. The quicksort method name is sort of confusing, but I see it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rename the method simply "sort".
@@ -216,6 +219,25 @@ void binarySort(int from, int to, int i) { | |||
} | |||
} | |||
|
|||
/** | |||
* Sorts between from (inclusive) and to (exclusive) with insertion sort. Runs in {@code O(n^2)}. | |||
* It is typically used by more sophisticated implementations as a fall-back when the numbers of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when the number (not numbers)
@@ -68,70 +69,77 @@ public void test(Entry[] arr) { | |||
enum Strategy { | |||
RANDOM { | |||
@Override | |||
public void set(Entry[] arr, int i) { | |||
arr[i] = new Entry(random().nextInt(), i); | |||
public void set(Entry[] arr, int i, Random random) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for using an explicit random here. random() in loops is costly.
} | ||
} | ||
|
||
public static void main(String[] args) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could convert it to a test and give the test an assumption on some property (or just an Ignore). Then you'd have a seed-reproducible-benchmark. :)
This stuff fits JMH nicely but I understand why you didn't want to roll out the big guns here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we disable the assertions when running a test?
(yes I hesitated to go with JMH but indeed I kept it simple, and based on the many runs I saw, the ratio between the sorters is reproducible)
sorter.sort(0, clone.length); | ||
} | ||
long timeMs = (System.nanoTime() - startTimeNs) / 1000000; | ||
System.out.print(" " + padLeft(timeMs, 4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String.printf("%5f\n", timeMs), String.printf("%-5f\n", timeMs)?
@jpountz maybe you could be interested in the review? |
Yes! Sorry I saw it and wanted to have a look and then got distracted. I haven't taken the time to take a look yet but I ran indexing with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty cool! I tried to replace some of the radix sorter we're using with this improved IntroSorter to see whether they still performed better. They did, but maybe we need to revisit when to fall back from radix sort to intro sort in the future.
// Select the pivot with a single median around the middle element. | ||
// Do not take the median between [from, mid, last] because it hurts performance | ||
// if the order is descending. | ||
pivot = median(mid - range, mid, mid + range); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a 32 elements array, we'd be looking at indices 12, 16 and 20, so all rather close to the mid element. Should range
be size/4
instead of size/8
in that case to be less subject to special distributions (ie. we'd be looking at indices 8, 16 and 24 in a 32-elements array)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it is more intuitive.
if (size <= SINGLE_MEDIAN_THRESHOLD) { | ||
// Select the pivot with a single median around the middle element. | ||
// Do not take the median between [from, mid, last] because it hurts performance | ||
// if the order is descending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding, why does it hurt performance? If the array is descending, then mid
would be the actual median, which should be the best-case scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is counter-intuitive indeed. It is due to the way the 3-way partitioning algorithm swaps elements. At the beginning I used a median between [from, mid, last] but on a descending ordering and after a couple recursive levels, systematically the median was 'last' and it resulted in a bad partition with an empty right set, triggering a worst case. By taking this new median, we avoid the worst case whatever the order is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining.
if (left < right) { | ||
swap(left, right); | ||
--right; | ||
// Recursion on the smallest partition. Replace the tail recursion by a loop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the second sentence be a TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, maybe I'm not clear. I replaced the two recursive calls on both left and right parts, by a recursive call on the smallest part and a loop on the largest part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I had misunderstood what the comment meant indeed, thanks for explaining.
* | ||
* <p>Run the static {@link #main(String[])} method to start the benchmark. | ||
*/ | ||
public class SorterBenchmark { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if luceneutil would be a better home for this.
if (size <= SINGLE_MEDIAN_THRESHOLD) { | ||
// Select the pivot with a single median around the middle element. | ||
// Do not take the median between [from, mid, last] because it hurts performance | ||
// if the order is descending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining.
if (left < right) { | ||
swap(left, right); | ||
--right; | ||
// Recursion on the smallest partition. Replace the tail recursion by a loop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I had misunderstood what the comment meant indeed, thanks for explaining.
No description provided.