Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the sort when building forward index #12712

Merged
merged 24 commits into from
Oct 25, 2023

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Oct 23, 2023

Based on the idea mentioned here:

  1. If we use a stable sorter, we can only compare docIds because termIds are already in order.
  2. If we take the maxDoc into consideration, we can save 1 round of reorder when maxDoc < (1 << 24).
  3. We may even purely use an offline version of radix sorter to sort the whole file, since all we need is just 3 or 4 times reorder based on point 1 and 2.

Some thoughts / todos:

  • I have not finished the performance benchmark.
  • The radix sorter can not take advantage of ForkJoinPool by now.
  • Do we need to turn to a pure memory radix sorter if memory budget is enough?
  • This offline radix sorter is a bit different as i can not write to random address in a file. I have to play some trick here if I do not want to create 256 tmp files...

@gf2121 gf2121 marked this pull request as draft October 23, 2023 14:22
@gf2121
Copy link
Contributor Author

gf2121 commented Oct 23, 2023

To get an quick insight, i make a naive benchmark on the sorter, showing generally 5x faster than baseline.

  • JVM 8G (result in the ram budget of OfflineSorter = 800MB)
  • No ForkJoinPool, parallelism = 1.
max doc bits term count doc per term baseline candidate diff
24 1000 1000 647 137 -78.83%
24 1000 10000 6709 1058 -84.23%
24 10000 1000 5858 1460 -75.08%
24 10000 10000 90835 12002 -86.79%
32 1000 1000 649 180 -72.27%
32 1000 10000 7219 1801 -75.05%
32 10000 1000 7368 1774 -75.92%
32 10000 10000 89758 12981 -85.54%

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 24, 2023

I forked the LSBRadixSorter to sort longs and use it when ram budget is enough. Generally 5x faster than candidate, 25x faster than baseline.

max doc bits term count doc per term baseline candidate candidate memory
24 1000 1000 647 137 27
24 1000 10000 6709 1058 242
24 10000 1000 5858 1460 248
24 10000 10000 90835 12002 /
32 1000 1000 649 180 21
32 1000 10000 7219 1801 261
32 10000 1000 7368 1774 220
32 10000 10000 89758 12981 /

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 24, 2023

I indexed wikimidumall with:

  • BPIndexReorder monfig mentioned here.
  • BPMergePolicy on this commit (most recent commit passed CI).
  • without docvalues and facets
  • 8GB JVM heap.

The index time reduced around 7%:

  baselline candidate diff
index took 13852262 12785804 -7.70%

@gf2121 gf2121 marked this pull request as ready for review October 24, 2023 12:09
@gf2121 gf2121 requested a review from jpountz October 24, 2023 12:50
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thanks for helping making recursive graph bisection faster! This is an interesting approach to running radix sort in an offline fashion. Initially I was a bit worried about random access, but it doesn't look too bad, especially as postings would generally fit in the page cache.

Maybe we can keep the "everything in memory" approach for later. It sounds like a good idea to me, it just feels like this 5x speedup would be enough for building the forward index to no longer be a bottleneck, and I worry that the in-memory optimization would decrease test coverage for the offline approach as most tests/benchmarks would work on small datasets that always go in memory.

final long[] buffer = new long[BUFFER_SIZE];
int finalBlockSize;

void addEntry(long l, IndexOutput output) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It would read better to me if the IndexOutput was an argument of reset rather than of addEntry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it too!

}
String sourceFileName = fileName;
long indexFP = -1;
for (int shift = 0; shift < bitsRequired; shift += 8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comments that it's ok to only compare doc IDs since this is a LSD radix sort which is stable and term IDs are already sorted?

@gf2121 gf2121 added this to the 9.9.0 milestone Oct 24, 2023
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


private static final int HISTOGRAM_SIZE = 256;
private static final int BUFFER_SIZE = 8192;
private static final int BUFFER_BYTES = BUFFER_SIZE * Long.BYTES;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some comments about the fact that we expect this sorter to need ~16MB of heap (I think?)

}
}

void flush(IndexOutput output, boolean isFinal) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the IndexOutput from flush as well?

@gf2121 gf2121 modified the milestone: 9.9.0 Oct 24, 2023
@jpountz
Copy link
Contributor

jpountz commented Oct 24, 2023

Unrelated to this change: I sent you an email to your Apache address, can you check it out? (Sorry for the noise on this PR, I don't know how else to contact you).

@gf2121
Copy link
Contributor Author

gf2121 commented Oct 24, 2023

I sent you an email to your Apache address, can you check it out?

Sorry for missing the email. Thank you so much for reminding me here!

@gf2121 gf2121 merged commit 7795927 into apache:main Oct 25, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants