Speed up the sort when building forward index #12712

gf2121 · 2023-10-23T14:22:29Z

Based on the idea mentioned here:

1. If we use a stable sorter, we only need to compare docIds, because termIds are already in order.

2. If we take maxDoc into consideration, we can save one round of reordering when maxDoc < (1 << 24).

3. We could even use a purely offline radix sorter to sort the whole file, since points 1 and 2 mean we only need 3 or 4 reordering passes.
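
The first two points can be illustrated with a small in-memory sketch. It is not the code from this PR: the class name `DocIdRadixSortSketch`, the helpers `pack` and `sortByDocId`, and the layout (term ID in the high 32 bits, doc ID in the low 32 bits of a long) are assumptions made for the example. The input is assumed to arrive in term ID order, so a stable byte-wise LSD radix sort keyed on the doc ID bits alone yields entries ordered by (docId, termId), and the number of passes depends only on how many bits maxDoc needs.

```java
import java.util.Arrays;

public class DocIdRadixSortSketch {

  // Hypothetical layout: term ID in the high 32 bits, doc ID in the low 32 bits.
  static long pack(int termId, int docId) {
    return ((long) termId << 32) | (docId & 0xFFFFFFFFL);
  }

  static int termId(long entry) {
    return (int) (entry >>> 32);
  }

  static int docId(long entry) {
    return (int) entry;
  }

  /**
   * Stable LSD radix sort keyed on the doc ID bits only. Entries are assumed to be
   * in term ID order on input, so ties on doc ID keep their term ID order.
   */
  static long[] sortByDocId(long[] entries, int maxDoc) {
    // Bits needed for doc IDs in [0, maxDoc): 24 bits give 3 passes, 32 bits give 4.
    int bitsRequired = Math.max(1, 64 - Long.numberOfLeadingZeros(maxDoc - 1L));
    long[] src = entries.clone();
    long[] dst = new long[src.length];
    for (int shift = 0; shift < bitsRequired; shift += 8) {
      int[] counts = new int[257];
      for (long e : src) {
        counts[(int) ((e >>> shift) & 0xFF) + 1]++;
      }
      // Prefix sums: counts[b] becomes the first output slot for byte value b.
      for (int i = 0; i < 256; i++) {
        counts[i + 1] += counts[i];
      }
      // Scanning in input order keeps each pass stable.
      for (long e : src) {
        dst[counts[(int) ((e >>> shift) & 0xFF)]++] = e;
      }
      long[] tmp = src;
      src = dst;
      dst = tmp;
    }
    return src;
  }

  public static void main(String[] args) {
    // Entries listed term by term, as they would be read from the postings.
    long[] entries = {pack(0, 5), pack(0, 300), pack(1, 5), pack(1, 7), pack(2, 300)};
    for (long e : sortByDocId(entries, 1 << 24)) {
      System.out.println("doc=" + docId(e) + " term=" + termId(e));
    }
  }
}
```

With maxDoc up to 1 << 24 the loop runs three passes, otherwise four, which matches the "3 or 4" reordering passes mentioned in point 3.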

Some thoughts / todos:

I have not finished the performance benchmark.
The radix sorter cannot take advantage of ForkJoinPool yet.
Do we need to switch to a purely in-memory radix sorter if the memory budget is large enough?
This offline radix sorter is a bit different, as I cannot write to arbitrary addresses in a file. I have to play some tricks here if I do not want to create 256 tmp files...
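
One way to avoid 256 temporary files can be sketched as follows. This is a guess at the general shape of such a trick, not the code from this PR: every name here (`SingleFileBucketPassSketch`, `AppendOnlyFile`, `pass`, `BLOCK_SIZE`) is hypothetical, and an in-memory array stands in for the file. Each bucket gets a small buffer. A full buffer is appended to the single output as one block, and a per-bucket index remembers where its blocks landed. Writes stay sequential, and the next pass reads the blocks bucket by bucket through the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SingleFileBucketPassSketch {
  static final int BUCKETS = 256;
  static final int BLOCK_SIZE = 4; // tiny on purpose, so the example flushes often

  /** In-memory stand-in for a file that only supports sequential appends. */
  static final class AppendOnlyFile {
    long[] data = new long[16];
    int size;

    /** Appends the first len values of block and returns the offset they start at. */
    int append(long[] block, int len) {
      int start = size;
      if (size + len > data.length) {
        data = Arrays.copyOf(data, Math.max(data.length * 2, size + len));
      }
      System.arraycopy(block, 0, data, size, len);
      size += len;
      return start;
    }
  }

  /** One stable radix pass on the byte at the given shift, using a single output. */
  static long[] pass(long[] source, int shift) {
    AppendOnlyFile file = new AppendOnlyFile();
    long[][] buffers = new long[BUCKETS][BLOCK_SIZE];
    int[] fill = new int[BUCKETS];
    // For each bucket, the {offset, length} of every block it wrote, in write order.
    List<List<int[]>> blockIndex = new ArrayList<>();
    for (int b = 0; b < BUCKETS; b++) {
      blockIndex.add(new ArrayList<>());
    }
    for (long e : source) {
      int b = (int) ((e >>> shift) & 0xFF);
      buffers[b][fill[b]++] = e;
      if (fill[b] == BLOCK_SIZE) {
        blockIndex.get(b).add(new int[] {file.append(buffers[b], fill[b]), fill[b]});
        fill[b] = 0;
      }
    }
    // Flush the partially filled buffers that remain at the end of the pass.
    for (int b = 0; b < BUCKETS; b++) {
      if (fill[b] > 0) {
        blockIndex.get(b).add(new int[] {file.append(buffers[b], fill[b]), fill[b]});
      }
    }
    // Read back bucket by bucket; blocks within a bucket are in arrival order.
    long[] out = new long[source.length];
    int pos = 0;
    for (int b = 0; b < BUCKETS; b++) {
      for (int[] block : blockIndex.get(b)) {
        System.arraycopy(file.data, block[0], out, pos, block[1]);
        pos += block[1];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    long[] values = {513, 2, 258, 1, 257, 514, 3, 256, 512, 0};
    // Two passes cover 16 bits, which is enough for these values.
    System.out.println(Arrays.toString(pass(pass(values, 0), 8)));
  }
}
```

The cost of this layout is that reads in the following pass jump between blocks rather than streaming through the file front to back.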

gf2121 · 2023-10-23T16:55:26Z

To get a quick insight, I ran a naive benchmark on the sorter, which shows the candidate is generally about 5x faster than the baseline.

JVM heap 8G (which results in a RAM budget of 800MB for OfflineSorter)
No ForkJoinPool, parallelism = 1.

max doc bits	term count	doc per term	baseline	candidate	diff
24	1000	1000	647	137	-78.83%
24	1000	10000	6709	1058	-84.23%
24	10000	1000	5858	1460	-75.08%
24	10000	10000	90835	12002	-86.79%
32	1000	1000	649	180	-72.27%
32	1000	10000	7219	1801	-75.05%
32	10000	1000	7368	1774	-75.92%
32	10000	10000	89758	12981	-85.54%

gf2121 · 2023-10-24T01:59:46Z

I forked the LSBRadixSorter to sort longs, and use it when the RAM budget is sufficient. It is generally 5x faster than the candidate and 25x faster than the baseline.

max doc bits	term count	doc per term	baseline	candidate	candidate memory
24	1000	1000	647	137	27
24	1000	10000	6709	1058	242
24	10000	1000	5858	1460	248
24	10000	10000	90835	12002	/
32	1000	1000	649	180	21
32	1000	10000	7219	1801	261
32	10000	1000	7368	1774	220
32	10000	10000	89758	12981	/

gf2121 · 2023-10-24T11:50:27Z

I indexed wikimediumall with:

BPIndexReorder config mentioned here.
BPMergePolicy on this commit (most recent commit passed CI).
Without doc values and facets.
8GB JVM heap.

Indexing time was reduced by around 7%:

	baseline	candidate	diff
index took	13852262	12785804	-7.70%

jpountz

Very nice, thanks for helping make recursive graph bisection faster! This is an interesting approach to running radix sort in an offline fashion. Initially I was a bit worried about random access, but it doesn't look too bad, especially as postings would generally fit in the page cache.

Maybe we can keep the "everything in memory" approach for later. It sounds like a good idea to me, it just feels like this 5x speedup would be enough for building the forward index to no longer be a bottleneck, and I worry that the in-memory optimization would decrease test coverage for the offline approach as most tests/benchmarks would work on small datasets that always go in memory.

jpountz · 2023-10-24T13:42:17Z

lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java

+      final long[] buffer = new long[BUFFER_SIZE];
+      int finalBlockSize;
+
+      void addEntry(long l, IndexOutput output) throws IOException {


Nit: It would read better to me if the IndexOutput was an argument of reset rather than of addEntry

I like it too!

jpountz · 2023-10-24T13:46:17Z

lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java

+      }
+      String sourceFileName = fileName;
+      long indexFP = -1;
+      for (int shift = 0; shift < bitsRequired; shift += 8) {


Add a comment explaining that it's OK to only compare doc IDs, since this is an LSD radix sort, which is stable, and term IDs are already sorted?

jpountz

LGTM

jpountz · 2023-10-24T15:21:53Z

lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java

+
+    private static final int HISTOGRAM_SIZE = 256;
+    private static final int BUFFER_SIZE = 8192;
+    private static final int BUFFER_BYTES = BUFFER_SIZE * Long.BYTES;


Maybe add some comments about the fact that we expect this sorter to need ~16MB of heap (I think?)
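
For reference, the ~16MB figure can be reproduced from the constants in the diff above, under the assumption (not confirmed by the snippet shown here) that the sorter keeps one `long[BUFFER_SIZE]` buffer for each of the `HISTOGRAM_SIZE` buckets. The class and method names below are made up for the example.

```java
public class SorterHeapEstimate {
  // Constants copied from the diff under review.
  static final int HISTOGRAM_SIZE = 256;
  static final int BUFFER_SIZE = 8192;
  static final int BUFFER_BYTES = BUFFER_SIZE * Long.BYTES;

  /** Heap taken by the buffers, assuming one buffer per bucket (an assumption). */
  static long estimatedBufferHeapBytes() {
    return (long) HISTOGRAM_SIZE * BUFFER_BYTES;
  }

  public static void main(String[] args) {
    // 256 buckets * 8192 longs * 8 bytes = 16,777,216 bytes
    System.out.println(estimatedBufferHeapBytes() / (1024 * 1024) + " MiB");
  }
}
```

This counts only the buffer arrays and ignores object headers and other small allocations.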

jpountz · 2023-10-24T15:22:17Z

lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java

+        }
+      }
+
+      void flush(IndexOutput output, boolean isFinal) throws IOException {


Remove the IndexOutput from flush as well?

jpountz · 2023-10-24T16:31:57Z

Unrelated to this change: I sent you an email to your Apache address, can you check it out? (Sorry for the noise on this PR, I don't know how else to contact you).

gf2121 · 2023-10-24T16:50:03Z

I sent you an email to your Apache address, can you check it out?

Sorry for missing the email. Thank you so much for reminding me here!

gf2121 added 4 commits October 23, 2023 21:54

forward_index_sorter

9b9172d

test pass

d5d3d2f

encode

6a4fe77

tidy

cbe08ed

gf2121 marked this pull request as draft October 23, 2023 14:22

gf2121 mentioned this pull request Oct 23, 2023

Enable recursive graph bisection out of the box? #12665

stash

762e645

gf2121 added 6 commits October 24, 2023 10:02

memory sorter

dcab98e

footer

3e26f99

tidy

91079be

bug fix

3357e09

iter

a7d1670

iter

6f2f57e

gf2121 marked this pull request as ready for review October 24, 2023 12:09

gf2121 requested a review from jpountz October 24, 2023 12:50

jpountz reviewed Oct 24, 2023

gf2121 added 3 commits October 24, 2023 22:38

remove memory sorter

d16405f

add comments

31078d0

move output to reset

f639995

gf2121 added this to the 9.9.0 milestone Oct 24, 2023

jpountz approved these changes Oct 24, 2023

gf2121 added 3 commits October 24, 2023 23:46

add comments

f019a7a

iter

9d1669a

iter

b1db7dd

gf2121 modified the milestone: 9.9.0 Oct 24, 2023

add CHANGES

ea2a3ef

gf2121 and others added 3 commits October 25, 2023 00:02

add CHANGES

00fa8c1

iter

4353c26

Merge branch 'main' into forward_index_sorter

a51fe68

gf2121 and others added 3 commits October 25, 2023 11:14

Merge branch 'main' into forward_index_sorter

f0c402d

more private

c527096

remove unnecessay diff

c07fe02

gf2121 merged commit 7795927 into apache:main Oct 25, 2023
4 checks passed

asfgit pushed a commit that referenced this pull request Oct 25, 2023

Speed up the sort when building forward index (#12712)

1cb1a14
