New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use radix sort to speed up the sorting of terms in TermInSetQuery #12587
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea. Thanks for pursuing!
new StringSorter(BytesRefComparator.NATURAL) { | ||
|
||
@Override | ||
protected void get(BytesRefBuilder builder, BytesRef result, int i) { | ||
BytesRef term = sortedTerms[i]; | ||
result.length = term.length; | ||
result.offset = term.offset; | ||
result.bytes = term.bytes; | ||
} | ||
|
||
@Override | ||
protected void swap(int i, int j) { | ||
BytesRef b = sortedTerms[i]; | ||
sortedTerms[i] = sortedTerms[j]; | ||
sortedTerms[j] = b; | ||
} | ||
}.sort(0, sortedTerms.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could (should?) use radix sort directly instead of going through StringSorter
? We don't need the fallback sort logic that StringSorter provides right? Then we could avoid the API visibility change. Maybe there's a good reason though to go through StringSorter I'm missing?
I think this would do the trick though right?
new MSBRadixSorter(Integer.MAX_VALUE) {
@Override
protected int byteAt(int i, int k) {
BytesRef b = sortedTerms[i];
if (b.length <= k) {
return -1;
}
return b.bytes[b.offset + k] & 0xFF;
}
@Override
protected void swap(int i, int j) {
BytesRef b = sortedTerms[i];
sortedTerms[i] = sortedTerms[j];
sortedTerms[j] = b;
}
}.sort(0, sortedTerms.length);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for feedback!
If using MSBRadixSorter
, we need to override MSBRadixSorter#getFallbackSorter
to make the fallback sorter benefited from the vectorized equals, instead of the default impl that comparing bytes one by one.
We can avoid this with StringSorter
as it is implemented :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, that makes sense to me. Thanks!
* A {@link BytesRef} sorter tries to use a efficient radix sorter if {@link StringSorter#cmp} is a | ||
* {@link BytesRefComparator}, otherwise fallback to {@link StringSorter#fallbackSorter} | ||
*/ | ||
public abstract class StringSorter extends Sorter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add @lucene.internal
now that we're making this public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gf2121 !
…ache.org * upstream/main: (239 commits) Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (apache#12633) Fix index out of bounds when writing FST to different metaOut (apache#12697) (apache#12698) Avoid object construction when linear searching arcs (apache#12692) chore: update the Javadoc example in Analyzer (apache#12693) coorect position on entry in CHANGES.txt Refactor ByteBlockPool so it is just a "shift/mask big array" (apache#12625) Extract the hnsw graph merging from being part of the vector writer (apache#12657) Specialize `BlockImpactsDocsEnum#nextDoc()`. (apache#12670) Speed up TestIndexOrDocValuesQuery. (apache#12672) Remove over-counting of deleted terms (apache#12586) Use MergeSorter in StableStringSorter (apache#12652) Use radix sort to speed up the sorting of terms in TermInSetQuery (apache#12587) Add timeouts to github jobs. Estimates taken from empirical run times (actions history), with a generous buffer added. (apache#12687) Optimize OnHeapHnswGraph's data structure (apache#12651) Add createClassLoader to replicator permissions (block specific to jacoco). (apache#12684) Move changes entry before backporting CHANGES Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode simple cleanups to vector code (apache#12680) Better detect vector module in non-default setups (e.g., custom module layers) (apache#12677) ...
Description
Sort terms in TermInSetQuery with radix sort. This helps TermInSetQueries with a number of terms.
Benchmark
I made a simple benchmark on sorting
BytesRef[]
with random bytes to verify the improvements.