Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10196: Improve IntroSorter with 3-ways partitioning. #404

Closed
wants to merge 5 commits into from

Conversation

bruno-roustant
Copy link
Contributor

No description provided.

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this will show up as a difference in real-life benchmarks!

@@ -35,51 +38,102 @@ public final void sort(int from, int to) {
quicksort(from, to, 2 * MathUtil.log(to - from, 2));
}

/**
* Sorts between from (inclusive) and to (exclusive) with intro sort.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"with intro sort"? Is this accurate here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes, because we still fallback to heap sort if the recursive stack goes too large.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. The quicksort method name is sort of confusing, but I see it now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rename the method simply "sort".

@@ -216,6 +219,25 @@ void binarySort(int from, int to, int i) {
}
}

/**
* Sorts between from (inclusive) and to (exclusive) with insertion sort. Runs in {@code O(n^2)}.
* It is typically used by more sophisticated implementations as a fall-back when the numbers of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the number (not numbers)

@@ -68,70 +69,77 @@ public void test(Entry[] arr) {
enum Strategy {
RANDOM {
@Override
public void set(Entry[] arr, int i) {
arr[i] = new Entry(random().nextInt(), i);
public void set(Entry[] arr, int i, Random random) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for using an explicit random here. random() in loops is costly.

}
}

public static void main(String[] args) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could convert it to a test and give the test an assumption on some property (or just an Ignore). Then you'd have a seed-reproducible-benchmark. :)

This stuff fits JMH nicely but I understand why you didn't want to roll out the big guns here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we disable the assertions when running a test?
(yes I hesitated to go with JMH but indeed I kept it simple, and based on the many runs I saw, the ratio between the sorters is reproducible)

sorter.sort(0, clone.length);
}
long timeMs = (System.nanoTime() - startTimeNs) / 1000000;
System.out.print(" " + padLeft(timeMs, 4));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.printf("%5f\n", timeMs), String.printf("%-5f\n", timeMs)?

@bruno-roustant
Copy link
Contributor Author

@jpountz maybe you could be interested in the review?

@jpountz
Copy link
Contributor

jpountz commented Oct 29, 2021

Yes! Sorry I saw it and wanted to have a look and then got distracted. I haven't taken the time to take a look yet but I ran indexing with IndexAndSearchOpenStreetMaps and saw a speedup with this change on big merges, so I'm already excited about it. :)

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty cool! I tried to replace some of the radix sorter we're using with this improved IntroSorter to see whether they still performed better. They did, but maybe we need to revisit when to fall back from radix sort to intro sort in the future.

// Select the pivot with a single median around the middle element.
// Do not take the median between [from, mid, last] because it hurts performance
// if the order is descending.
pivot = median(mid - range, mid, mid + range);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a 32 elements array, we'd be looking at indices 12, 16 and 20, so all rather close to the mid element. Should range be size/4 instead of size/8 in that case to be less subject to special distributions (ie. we'd be looking at indices 8, 16 and 24 in a 32-elements array)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it is more intuitive.

if (size <= SINGLE_MEDIAN_THRESHOLD) {
// Select the pivot with a single median around the middle element.
// Do not take the median between [from, mid, last] because it hurts performance
// if the order is descending.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, why does it hurt performance? If the array is descending, then mid would be the actual median, which should be the best-case scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is counter-intuitive indeed. It is due to the way the 3-way partitioning algorithm swaps elements. At the beginning I used a median between [from, mid, last] but on a descending ordering and after a couple recursive levels, systematically the median was 'last' and it resulted in a bad partition with an empty right set, triggering a worst case. By taking this new median, we avoid the worst case whatever the order is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining.

if (left < right) {
swap(left, right);
--right;
// Recursion on the smallest partition. Replace the tail recursion by a loop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the second sentence be a TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, maybe I'm not clear. I replaced the two recursive calls on both left and right parts, by a recursive call on the smallest part and a loop on the largest part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I had misunderstood what the comment meant indeed, thanks for explaining.

*
* <p>Run the static {@link #main(String[])} method to start the benchmark.
*/
public class SorterBenchmark {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if luceneutil would be a better home for this.

if (size <= SINGLE_MEDIAN_THRESHOLD) {
// Select the pivot with a single median around the middle element.
// Do not take the median between [from, mid, last] because it hurts performance
// if the order is descending.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining.

if (left < right) {
swap(left, right);
--right;
// Recursion on the smallest partition. Replace the tail recursion by a loop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I had misunderstood what the comment meant indeed, thanks for explaining.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants