Add a merge policy wrapper that performs recursive graph bisection on merge. #12622
Conversation
This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc IDs on merge using a `BPIndexReorderer`.
- Reordering always runs on forced merges.
- A `minNaturalMergeNumDocs` parameter enables reordering only on the larger merged segments. This way, small merges retain all merging optimizations like bulk copying of stored fields, and only the larger segments - which are the most important for search performance - get reordered.
- If not enough RAM is available to perform reordering, reordering is skipped.

To make this work, I had to add the ability for any merge to reorder doc IDs of the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the test framework randomly reverses the order of documents in a merged segment to make sure this logic is properly exercised.
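As a rough illustration of the gating behavior described above, forced merges always reorder while natural merges reorder only when the merged segment is large enough. The class and method names below are hypothetical, not Lucene's actual API:

```java
// Illustrative sketch of the reordering gate; names are hypothetical, not
// Lucene's actual API. Forced merges always reorder, natural merges reorder
// only when the merged segment reaches minNaturalMergeNumDocs.
public class ReorderingGate {
  private final int minNaturalMergeNumDocs;

  public ReorderingGate(int minNaturalMergeNumDocs) {
    this.minNaturalMergeNumDocs = minNaturalMergeNumDocs;
  }

  /** Whether a merge producing numDocs documents should be reordered. */
  public boolean shouldReorder(boolean forcedMerge, int numDocs) {
    if (forcedMerge) {
      return true; // reordering always runs on forced merges
    }
    // small natural merges keep optimizations like bulk-copied stored fields
    return numDocs >= minNaturalMergeNumDocs;
  }
}
```

Small merges skip reordering entirely, so they stay on the fast merge paths.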
The diff is large because I had to introduce a new …
I did a first pass on that and left some comments and questions, nothing major. I left some nits on the way but have questions to be clarified before I can continue. thanks man for adding this change.
```java
int docBase = 0;
int i = 0;
for (CodecReader reader : mergeReaders) {
  final int finalDocBase = docBase;
```
nit: it confused me when I read the code. I read "final" as the eventual docBase, not as the docBase variable being final. Maybe call it currentDocBase or so?
```java
new MergeState.DocMap() {
  @Override
  public int get(int docID) {
    int reorderedDocId = reorderDocMap.get(docID);
```
maybe I am missing something but what do we need the reorderDocMap for? Can't we do the calculation we do in the loop above in-line with this?
it basically does this:

```java
reorderedDocId = docMap.oldToNew(finalDocBase + docID);
```

and I wonder if we really need the second loop? Or is this because we are reassigning the mergeReaders after we record that?
We could try to combine the two doc maps.
I did it this way to try to keep the reordering logic and the merging logic as independent as possible. On the one hand, we have the reordering logic that works on a merged view of the input segments and doesn't care about deletes. On the other hand, we have the merge logic that computes the mapping between doc IDs in input segments and doc IDs in the merged segment (often just compacting deletes, i.e. if index sorting is not enabled).

If we wanted to better combine these two things, we'd need to either ignore the `MergeState`'s doc maps or somehow make `MergeState` aware of the reordering. Today `MergeState` is completely unaware of the reordering; from its perspective it just needs to run a singleton merge (single input codec reader) and its only job is to reclaim deletes.
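As a rough illustration of the two-layer mapping discussed here (independent of Lucene's actual `MergeState.DocMap` type, which a plain `IntUnaryOperator` stands in for), each reader's local doc ID is first shifted by its doc base into the merged view and then run through the reorder permutation:

```java
import java.util.function.IntUnaryOperator;

// Illustrative sketch of the two-step doc-ID mapping: a per-reader doc base
// shift into the merged view, composed with the global reorder permutation.
// Plain IntUnaryOperator stands in for Lucene's MergeState.DocMap.
public class ComposedDocMaps {

  /** Build one map per reader: local docId -> reordered merged docId. */
  public static IntUnaryOperator[] buildPerReaderMaps(
      int[] maxDocs, IntUnaryOperator reorderMap) {
    IntUnaryOperator[] maps = new IntUnaryOperator[maxDocs.length];
    int docBase = 0;
    for (int i = 0; i < maxDocs.length; i++) {
      final int currentDocBase = docBase; // captured by the lambda, must be final
      maps[i] = docId -> reorderMap.applyAsInt(currentDocBase + docId);
      docBase += maxDocs[i];
    }
    return maps;
  }
}
```

Keeping the reorder permutation separate from the base shift is what lets the merging side stay unaware of the reordering.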
```java
@Override
public Fields get(int doc) throws IOException {
  int readerId = Arrays.binarySearch(docStarts, doc);
```
maybe have a method that shares the code for the case where `readerId` is less than 0
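For context, `Arrays.binarySearch` returns `-(insertionPoint) - 1` on a miss, so the shared helper suggested here would normalize a negative result roughly like this (a self-contained sketch, not the PR's actual code):

```java
import java.util.Arrays;

// Sketch of the docStarts lookup: Arrays.binarySearch returns the reader
// index on an exact hit, or -(insertionPoint) - 1 on a miss, in which case
// the doc belongs to the reader preceding the insertion point.
public class ReaderLookup {

  /** Map a merged doc ID to the index of the reader that contains it. */
  public static int readerIndex(int[] docStarts, int doc) {
    int idx = Arrays.binarySearch(docStarts, doc);
    if (idx < 0) {
      idx = -idx - 2; // insertionPoint - 1: the preceding reader owns this doc
    }
    return idx;
  }
}
```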
```diff
@@ -468,7 +468,11 @@ public void checkIntegrity() throws IOException {

 @Override
 public PointValues getValues(String field) throws IOException {
-  return new SortingPointValues(delegate.getValues(field), docMap);
+  PointValues values = delegate.getValues(field);
```
why has this changed now?
Thanks for catching, it should not change as file formats should never be called on fields whose info say the field is not indexed. I fixed the slow composite reader wrapper instead.
```java
int numDocs = -1;

@Override
public synchronized int numDocs() {
```
Is this costly? I wonder why we lazily init numDocs here; we already iterate over the codec readers in the ctor.
This tries to cover for leaves that have a lazy `numDocs` impl. I added a comment.
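A minimal sketch of such lazy caching (a plain int array stands in here for the wrapped leaf readers, whose own `numDocs` may be lazily computed):

```java
// Minimal sketch of lazy numDocs caching: the sum over leaves is deferred to
// the first call because a wrapped leaf's own numDocs may itself be lazy.
// A plain int array stands in for the wrapped leaf readers.
public class LazyNumDocs {
  private final int[] leafNumDocs;
  private int numDocs = -1; // -1 means "not computed yet"

  public LazyNumDocs(int[] leafNumDocs) {
    this.leafNumDocs = leafNumDocs;
  }

  public synchronized int numDocs() {
    if (numDocs == -1) {
      int sum = 0;
      for (int n : leafNumDocs) {
        sum += n; // only paid once, then cached
      }
      numDocs = sum;
    }
    return numDocs;
  }
}
```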
```java
@Override
public void checkIntegrity() throws IOException {
  for (KnnVectorsReader reader : readers) {
```
This and other places look like they could just be iterated with the Stream API, filtering out null entries.
Isn't the regular for loop equally simple? I'm fine either way.
```java
/**
 * Wrap a reader prior to merging in order to add/remove fields or documents.
 *
 * <p><b>NOTE:</b> It is illegal to reorder doc IDs here, use {@link
```
maybe a long shot but can we assert that somehow if we do that anywhere with an asserting reader in tests?
I don't think we can check if a reader was reordered easily.
```java
mergeLiveDocs == null || mergeLiveDocs == prevHardLiveDocs
    ? docId -> currentHardLiveDocs.get(docId) == false
    : docId -> mergeLiveDocs.get(docId) && currentHardLiveDocs.get(docId) == false;
docId -> segDocMap.get(docId) != -1 && currentHardLiveDocs.get(docId) == false;
```
That is in fact a nice simplification here.
```java
  docMaps = mergeState.docMaps;
} else {
  assert mergeState.docMaps.length == 1;
  MergeState.DocMap compactionDocMap = mergeState.docMaps[0];
```
this really confuses me. Why do we only look at the first element of the docMaps array and never advance to the next segment?
is this because of the fact that we are wrapping the source readers into a single deadslow reader? If so I think we should document that and assert that it has only one element. It took me quite some time to figure that out.
The above line already has an assertion, I'll add a comment too.
Just for the record: I think we should record whether a segment that was written contains blocks, and make that viral. That might be a good idea anyway.
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java
@s1monw Could you take another look?
Left some minor suggestions, LGTM otherwise.
```diff
@@ -3475,6 +3475,8 @@ public void addIndexesReaderMerge(MergePolicy.OneMerge merge) throws IOException
   merge.getMergeInfo().info.setUseCompoundFile(true);
 }

+merge.setMergeInfo(merge.info);
```
did we have a test that realized that merge.info is missing?
Yes, some of the new tests failed because the new merge policy records that a segment has been reordered in the segment's diagnostics. Let me add dedicated tests for setMergeInfo.
```diff
   total += info.info.maxDoc();
 }
-return total;
+return totalMaxDoc;
```
<3
```java
newSpec.add(
    new OneMerge(oneMerge) {

      private boolean reordered = false;
```
maybe make that a setOnce?
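For reference, a set-once value enforces a single write and fails loudly on a second one; Lucene ships `org.apache.lucene.util.SetOnce` with similar semantics. A minimal stand-alone sketch of the idea:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of a set-once holder, similar in spirit to Lucene's
// org.apache.lucene.util.SetOnce: the value may be written exactly once,
// and a second write throws instead of silently overwriting.
public class SetOnceFlag<T> {
  private final AtomicBoolean set = new AtomicBoolean(false);
  private volatile T value;

  public void set(T v) {
    if (!set.compareAndSet(false, true)) {
      throw new IllegalStateException("value was already set");
    }
    value = v;
  }

  public T get() {
    return value; // null until set() has been called
  }
}
```

Compared to a plain boolean field, this turns an accidental double assignment into an immediate failure.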
```java
for (CodecReader reader : mergeReaders) {
  final int currentDocBase = docBase;
  reorderDocMaps[i] =
      new MergeState.DocMap() {
```
maybe we can make DocMap an interface and then just use a lambda here and in other places. would look much nicer. I can also do that in a different PR after this is in.
I'll leave it to you for a follow-up. :)
```java
// Since the reader was reordered, we passed a merged view to MergeState and from its
// perspective there is a single input segment to the merge and the
// SlowCompositeCodecReaderWrapper is effectively doing the merge.
assert mergeState.docMaps.length == 1;
```
Can you put the length in the message? It will be helpful if we run into an error here.
```java
@Override
public Sorter.DocMap reorder(CodecReader reader, Directory dir) throws IOException {
  if (reader.numDocs() < minNumDocs) {
```
I guess it's style, but can we maybe only have a single return statement here?
```java
@Override
public void setMergeInfo(SegmentCommitInfo info) {
  info.info.addDiagnostics(
      Collections.singletonMap("bp.reordered", Boolean.toString(reordered)));
```
use the constant `REORDERED` here?
@s1monw I pushed a commit that should address your feedback
LGTM