Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use radix sort to sort postings when index sorting is enabled. #12114

Merged
merged 4 commits into from
Mar 15, 2023

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jan 27, 2023

This switches to LSBRadixSorter instead of TimSorter to sort postings whose index options are DOCS. On a synthetic benchmark this yielded barely any difference in the case when the index order is the same as the sort order, or reverse, but almost a 3x speedup for writing postings in the case when the index order is mostly random.

This switches to LSBRadixSorter instead of TimSorter to sort postings whose
index options are `DOCS`. On a synthetic benchmark this yielded barely any
difference in the case when the index order is the same as the sort order, or
reverse, but almost a 3x speedup for writing postings in the case when the
index order is mostly random.
@jpountz
Copy link
Contributor Author

jpountz commented Jan 27, 2023

Here is the synthetic benchmark that I used if someone is interested in reproducing:

  enum Order {
    RANDOM,
    ASC,
    DESC;
  }

  public static void main(String[] args) throws IOException {
    Order order = Order.RANDOM;
    Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
    IndexWriterConfig cfg = new IndexWriterConfig(null);
    cfg.setInfoStream(new PrintStreamInfoStream(System.out));
    cfg.setMaxBufferedDocs(100_000);
    cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    cfg.setIndexSort(new Sort(LongField.newSortField("sort_field", false, SortedNumericSelector.Type.MIN)));
    IndexWriter w = new IndexWriter(dir, cfg);
    Document doc = new Document();
    LongField sortField = new LongField("sort_field", 0);
    doc.add(sortField);
    StringField stringField1 = new StringField("string_field", "", Store.NO);
    doc.add(stringField1);
    StringField stringField2 = new StringField("string_field", "", Store.NO);
    doc.add(stringField2);
    StringField stringField3 = new StringField("string_field", "", Store.NO);
    doc.add(stringField3);
    for (int i = 0; i < 5_000_000; ++i) {
      long sortValue = switch (order) {
      case RANDOM -> i % 15;
      case ASC -> i;
      case DESC -> -i;
      };
      sortField.setLongValue(sortValue);
      stringField1.setStringValue(Integer.toBinaryString(i % 10));
      stringField2.setStringValue(Integer.toBinaryString(i % 100));
      stringField3.setStringValue(Integer.toBinaryString(i % 1000));
      w.addDocument(doc);
    }
  }

And flush times for postings:

Main Patch
Index sort matches indexing order 6 7
Index sort is reverse indexing order 7 8
Random sort 27 10

PostingsEnum getWrapped() {
return in;
@Override
public int freq() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're removing freq support because no one is really using it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, fields that have frequencies are now handled by SortingPostingsEnum while SortingDocsEnum focuses on fields that only index docs.

Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining! LGTM

@jpountz
Copy link
Contributor Author

jpountz commented Mar 15, 2023

I purposedly introduced a bug to see what would fail, and only high-level tests that check early query termination or dynamic pruning failed, so I introduced lower-level tests that make sure that postings get reordered correctly.

@jpountz jpountz added this to the 9.6.0 milestone Mar 15, 2023
@jpountz jpountz merged commit 805eb0b into apache:main Mar 15, 2023
@jpountz jpountz deleted the docs_only_radix_sort branch March 15, 2023 10:56
jpountz added a commit that referenced this pull request Mar 15, 2023
This switches to LSBRadixSorter instead of TimSorter to sort postings whose
index options are `DOCS`. On a synthetic benchmark this yielded barely any
difference in the case when the index order is the same as the sort order, or
reverse, but almost a 3x speedup for writing postings in the case when the
index order is mostly random.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants