Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use radix sort to speed up the sorting of terms in TermInSetQuery #12587

Merged
merged 8 commits into from Oct 17, 2023

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Sep 25, 2023

Description

Sort terms in TermInSetQuery with radix sort. This helps TermInSetQueries with a number of terms.

Benchmark

I made a simple benchmark on sorting BytesRef[] with random bytes to verify the improvements.

  timsort ( took nanos ) radixsort ( took nanos ) took diff
10 terms (16 bytes per term) 1292 1083 -16.18%
100 terms (16 bytes per term) 17959 11750 -34.57%
1000 terms (16 bytes per term) 387916 50375 -87.01%
10000 terms (16 bytes per term) 5407208 1062500 -80.35%
100000 terms (16 bytes per term) 65577084 5404958 -91.76%
10 terms (256 bytes per term) 3500 1750 -50.00%
100 terms (256 bytes per term) 18000 11708 -34.96%
1000 terms (256 bytes per term) 410959 52417 -87.25%
10000 terms (256 bytes per term) 5325666 1299125 -75.61%
100000 terms (256 bytes per term) 71316500 11346584 -84.09%

Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea. Thanks for pursuing!

Comment on lines +116 to +132
new StringSorter(BytesRefComparator.NATURAL) {

@Override
protected void get(BytesRefBuilder builder, BytesRef result, int i) {
BytesRef term = sortedTerms[i];
result.length = term.length;
result.offset = term.offset;
result.bytes = term.bytes;
}

@Override
protected void swap(int i, int j) {
BytesRef b = sortedTerms[i];
sortedTerms[i] = sortedTerms[j];
sortedTerms[j] = b;
}
}.sort(0, sortedTerms.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could (should?) use radix sort directly instead of going through StringSorter? We don't need the fallback sort logic that StringSorter provides right? Then we could avoid the API visibility change. Maybe there's a good reason though to go through StringSorter I'm missing?

I think this would do the trick though right?

      new MSBRadixSorter(Integer.MAX_VALUE) {
        @Override
        protected int byteAt(int i, int k) {
          BytesRef b = sortedTerms[i];
          if (b.length <= k) {
            return -1;
          }
          return b.bytes[b.offset + k] & 0xFF;
        }

        @Override
        protected void swap(int i, int j) {
          BytesRef b = sortedTerms[i];
          sortedTerms[i] = sortedTerms[j];
          sortedTerms[j] = b;
        }
      }.sort(0, sortedTerms.length);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for feedback!

If using MSBRadixSorter, we need to override MSBRadixSorter#getFallbackSorter to make the fallback sorter benefited from the vectorized equals, instead of the default impl that comparing bytes one by one.

We can avoid this with StringSorter as it is implemented :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, that makes sense to me. Thanks!

* A {@link BytesRef} sorter tries to use a efficient radix sorter if {@link StringSorter#cmp} is a
* {@link BytesRefComparator}, otherwise fallback to {@link StringSorter#fallbackSorter}
*/
public abstract class StringSorter extends Sorter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add @lucene.internal now that we're making this public?

Copy link
Contributor

@gsmiller gsmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gf2121 !

@gf2121 gf2121 merged commit c6e76d3 into apache:main Oct 17, 2023
4 checks passed
@gf2121 gf2121 added this to the 9.9.0 milestone Oct 18, 2023
clayburn added a commit to runningcode/lucene that referenced this pull request Oct 20, 2023
…ache.org

* upstream/main: (239 commits)
  Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (apache#12633)
  Fix index out of bounds when writing FST to different metaOut (apache#12697) (apache#12698)
  Avoid object construction when linear searching arcs (apache#12692)
  chore: update the Javadoc example in Analyzer (apache#12693)
  coorect position on entry in CHANGES.txt
  Refactor ByteBlockPool so it is just a "shift/mask big array" (apache#12625)
  Extract the hnsw graph merging from being part of the vector writer (apache#12657)
  Specialize `BlockImpactsDocsEnum#nextDoc()`. (apache#12670)
  Speed up TestIndexOrDocValuesQuery. (apache#12672)
  Remove over-counting of deleted terms (apache#12586)
  Use MergeSorter in StableStringSorter (apache#12652)
  Use radix sort to speed up the sorting of terms in TermInSetQuery (apache#12587)
  Add timeouts to github jobs. Estimates taken from empirical run times (actions history), with a generous buffer added. (apache#12687)
  Optimize OnHeapHnswGraph's data structure (apache#12651)
  Add createClassLoader to replicator permissions (block specific to jacoco). (apache#12684)
  Move changes entry before backporting
  CHANGES
  Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode
  simple cleanups to vector code (apache#12680)
  Better detect vector module in non-default setups (e.g., custom module layers) (apache#12677)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants