Speed up the sort when building forward index #12712
To get some quick insight, I ran a naive benchmark on the sorter; it is generally about 5x faster than the baseline.
I forked the
Very nice, thanks for helping make recursive graph bisection faster! This is an interesting approach to running radix sort in an offline fashion. Initially I was a bit worried about random access, but it doesn't look too bad, especially as postings would generally fit in the page cache.
Maybe we can keep the "everything in memory" approach for later. It sounds like a good idea to me; it just feels like this 5x speedup would be enough for building the forward index to no longer be a bottleneck, and I worry that the in-memory optimization would decrease test coverage for the offline approach, as most tests/benchmarks would work on small datasets that always fit in memory.
final long[] buffer = new long[BUFFER_SIZE];
int finalBlockSize;

void addEntry(long l, IndexOutput output) throws IOException {
Nit: It would read better to me if the `IndexOutput` was an argument of `reset` rather than of `addEntry`.
I like it too!
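A minimal sketch of what the suggested shape could look like, assuming a per-bucket buffer that spills to disk when full. `BucketSketch` and its `upto` field are hypothetical names, `DataOutputStream` stands in for Lucene's `IndexOutput` so the example runs without Lucene on the classpath, and the `isFinal` flag of the real `flush` is left out. Dropping the argument from `flush` too is an assumption of this sketch, not something taken from the PR.

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: the output is bound once in reset(), so addEntry() and
// flush() no longer need it as a parameter.
class BucketSketch {
  static final int BUFFER_SIZE = 8192;

  final long[] buffer = new long[BUFFER_SIZE];
  int upto;                // number of buffered entries (assumed field name)
  DataOutputStream output; // stand-in for Lucene's IndexOutput

  // Binds the bucket to its destination and clears any buffered state.
  void reset(DataOutputStream output) {
    this.output = output;
    this.upto = 0;
  }

  // Buffers one entry and spills the buffer once it is full.
  void addEntry(long l) throws IOException {
    buffer[upto++] = l;
    if (upto == BUFFER_SIZE) {
      flush();
    }
  }

  // Writes the buffered entries to the bound output and empties the buffer.
  void flush() throws IOException {
    for (int i = 0; i < upto; i++) {
      output.writeLong(buffer[i]);
    }
    upto = 0;
  }
}
```

With this shape a caller does `bucket.reset(out)` once per pass and then only calls `bucket.addEntry(l)`, which keeps the hot loop free of an extra argument.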
}
String sourceFileName = fileName;
long indexFP = -1;
for (int shift = 0; shift < bitsRequired; shift += 8) {
Add a comment explaining that it's OK to only compare doc IDs, since this is an LSD radix sort, which is stable, and term IDs are already sorted?
LGTM
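To illustrate the stability argument from the comment above, here is a small in-memory sketch (not the offline sorter from this PR). It assumes entries are packed as a `long` with the doc ID in the high 32 bits and the term ID in the low 32 bits; that packing, the class name `LsdRadixSketch`, and both method names are assumptions made for the example. Only the doc ID bytes are examined, one byte per pass, least significant first. Because each pass is a stable counting sort and the input arrives in term ID order, entries with equal doc IDs keep their term ID order.

```java
// Hypothetical in-memory illustration of an LSD radix sort that looks at doc IDs only.
class LsdRadixSketch {

  // Assumed packing: doc ID in the high 32 bits, term ID in the low 32 bits.
  static long pack(int docID, int termID) {
    return ((long) docID << 32) | (termID & 0xFFFFFFFFL);
  }

  // Sorts by doc ID, 8 bits per pass. bitsRequired is the number of
  // significant bits in the largest doc ID.
  static long[] sortByDoc(long[] entries, int bitsRequired) {
    long[] src = entries.clone();
    long[] dst = new long[src.length];
    for (int shift = 0; shift < bitsRequired; shift += 8) {
      // Histogram of the current doc ID byte, offset by one slot.
      int[] offsets = new int[257];
      for (long e : src) {
        offsets[1 + (int) ((e >>> (32 + shift)) & 0xFF)]++;
      }
      // Prefix sum: offsets[b] becomes the start position of bucket b.
      for (int i = 1; i < offsets.length; i++) {
        offsets[i] += offsets[i - 1];
      }
      // Stable scatter: entries are copied in input order within each bucket.
      for (long e : src) {
        int b = (int) ((e >>> (32 + shift)) & 0xFF);
        dst[offsets[b]++] = e;
      }
      long[] tmp = src;
      src = dst;
      dst = tmp;
    }
    return src;
  }
}
```

If the input were not already ordered by term ID, the same code would still group entries by doc ID, but the term IDs inside each doc would come out in arbitrary input order, which is why the precondition is worth a comment.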
private static final int HISTOGRAM_SIZE = 256;
private static final int BUFFER_SIZE = 8192;
private static final int BUFFER_BYTES = BUFFER_SIZE * Long.BYTES;
Maybe add a comment noting that we expect this sorter to need ~16MB of heap (I think?).
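The ~16MB figure can be checked with a quick calculation, assuming the sorter keeps one `long[BUFFER_SIZE]` buffer per histogram bucket (that layout is an assumption here, and `SorterHeapEstimate` is a hypothetical helper, not code from the PR). Array headers and other per-object overhead are ignored.

```java
// Hypothetical helper: rough heap estimate for the bucket buffers only.
class SorterHeapEstimate {
  static final int HISTOGRAM_SIZE = 256;
  static final int BUFFER_SIZE = 8192;
  static final int BUFFER_BYTES = BUFFER_SIZE * Long.BYTES; // 65,536 bytes per buffer

  // Assumes one buffer per histogram bucket.
  static long estimatedBufferBytes() {
    return (long) HISTOGRAM_SIZE * BUFFER_BYTES;
  }
}
```

Under that assumption the buffers take 256 * 8192 * 8 = 16,777,216 bytes, i.e. 16 MiB.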
}
}

void flush(IndexOutput output, boolean isFinal) throws IOException {
Remove the IndexOutput from flush as well?
Unrelated to this change: I sent you an email to your Apache address, can you check it out? (Sorry for the noise on this PR, I don't know how else to contact you).
Sorry for missing the email. Thank you so much for reminding me here!
Based on the idea mentioned here:
Some thoughts / todos:
ForkJoinPool
by now.