-
Notifications
You must be signed in to change notification settings - Fork 987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use radix sort to sort postings when index sorting is enabled. #12114
Conversation
This switches to LSBRadixSorter instead of TimSorter to sort postings whose index options are `DOCS`. On a synthetic benchmark this yielded barely any difference in the case when the index order is the same as the sort order, or reverse, but almost a 3x speedup for writing postings in the case when the index order is mostly random.
Here is the synthetic benchmark that I used if someone is interested in reproducing: enum Order {
RANDOM,
ASC,
DESC;
}
public static void main(String[] args) throws IOException {
Order order = Order.RANDOM;
Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
IndexWriterConfig cfg = new IndexWriterConfig(null);
cfg.setInfoStream(new PrintStreamInfoStream(System.out));
cfg.setMaxBufferedDocs(100_000);
cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
cfg.setIndexSort(new Sort(LongField.newSortField("sort_field", false, SortedNumericSelector.Type.MIN)));
IndexWriter w = new IndexWriter(dir, cfg);
Document doc = new Document();
LongField sortField = new LongField("sort_field", 0);
doc.add(sortField);
StringField stringField1 = new StringField("string_field", "", Store.NO);
doc.add(stringField1);
StringField stringField2 = new StringField("string_field", "", Store.NO);
doc.add(stringField2);
StringField stringField3 = new StringField("string_field", "", Store.NO);
doc.add(stringField3);
for (int i = 0; i < 5_000_000; ++i) {
long sortValue = switch (order) {
case RANDOM -> i % 15;
case ASC -> i;
case DESC -> -i;
};
sortField.setLongValue(sortValue);
stringField1.setStringValue(Integer.toBinaryString(i % 10));
stringField2.setStringValue(Integer.toBinaryString(i % 100));
stringField3.setStringValue(Integer.toBinaryString(i % 1000));
w.addDocument(doc);
}
} And flush times for postings:
|
PostingsEnum getWrapped() { | ||
return in; | ||
@Override | ||
public int freq() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're removing freq
support because no one is really using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, fields that have frequencies are now handled by SortingPostingsEnum
while SortingDocsEnum
focuses on fields that only index docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining! LGTM
I purposedly introduced a bug to see what would fail, and only high-level tests that check early query termination or dynamic pruning failed, so I introduced lower-level tests that make sure that postings get reordered correctly. |
This switches to LSBRadixSorter instead of TimSorter to sort postings whose index options are `DOCS`. On a synthetic benchmark this yielded barely any difference in the case when the index order is the same as the sort order, or reverse, but almost a 3x speedup for writing postings in the case when the index order is mostly random.
This switches to LSBRadixSorter instead of TimSorter to sort postings whose index options are
DOCS
. On a synthetic benchmark this yielded barely any difference in the case when the index order is the same as the sort order, or reverse, but almost a 3x speedup for writing postings in the case when the index order is mostly random.