
Add a merge policy wrapper that performs recursive graph bisection on merge. #12622

Merged: 22 commits into apache:main from bp_merge_policy, Nov 23, 2023

Conversation

@jpountz (Contributor) commented Oct 4, 2023

This adds BPReorderingMergePolicy, a merge policy wrapper that reorders doc IDs on merge using a BPIndexReorderer.

  • Reordering always runs on forced merges.
  • A minNaturalMergeNumDocs parameter enables reordering only on the larger merged segments. This way, small merges retain all merging optimizations, like bulk copying of stored fields, and only the larger segments, which matter most for search performance, get reordered.
  • If not enough RAM is available to perform reordering, reordering is skipped.

To make this work, I had to add the ability for any merge to reorder doc IDs of the merged segment via OneMerge#reorder. MockRandomMergePolicy from the test framework randomly reverts the order of documents in a merged segment to make sure this logic is properly exercised.
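As a usage sketch, wiring the wrapper into an index writer might look like the following (hedged: the package path and the constructor taking the wrapped policy plus the doc-count threshold are assumed from the description above, not taken from this PR's diff):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.misc.index.BPReorderingMergePolicy;

public class BPReorderingSetup {
  public static IndexWriterConfig newConfig() {
    IndexWriterConfig config = new IndexWriterConfig();
    // Wrap the regular merge policy so that natural merges of at least
    // 100k docs (and all forced merges) get their doc IDs reordered via
    // recursive graph bisection; smaller merges keep the fast paths.
    config.setMergePolicy(new BPReorderingMergePolicy(new TieredMergePolicy(), 100_000));
    return config;
  }
}
```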

@jpountz (Author) commented Oct 4, 2023

The diff is large because I had to introduce a new SlowCompositeCodecReaderWrapper, which effectively does the merge (lazily) and can be fed to the reordering logic prior to actually running the merge. The rest of the change is reasonable in size (at least for now).

@s1monw s1monw self-requested a review October 10, 2023 08:53
@s1monw (Member) left a comment

I did a first pass on this and left some comments and questions, nothing major. I left some nits along the way but have questions to be clarified before I can continue. Thanks for adding this change.

int docBase = 0;
int i = 0;
for (CodecReader reader : mergeReaders) {
final int finalDocBase = docBase;
Member:

nit: it confused me when I read the code. I read "final" as the eventual DocBase not docBase var being final. maybe call it currentDocBase or so?

new MergeState.DocMap() {
@Override
public int get(int docID) {
int reorderedDocId = reorderDocMap.get(docID);
Member:

maybe I am missing something but what do we need the reorderDocMap for? Can't we do the calculation we do in the loop above in-line with this?

Member:

it basically does this:

reorderedDocId =  docMap.oldToNew(finalDocBase + docID);

and I wonder if we really need the second loop? Or is this because we're reassigning the mergeReaders after we record that?

Author:

We could try to combine the two doc maps.

I did it this way to try to keep the reordering logic and the merging logic as independent as possible. On the one hand, we have the reordering logic, which works on a merged view of the input segments and doesn't care about deletes. On the other hand, we have the merge logic, which computes the mapping between doc IDs in the input segments and doc IDs in the merged segment (often just compacting deletes, i.e. if index sorting is not enabled).

If we wanted to better combine these two things, we'd need to either ignore the MergeState's doc maps or somehow make MergeState aware of the reordering. Today MergeState is completely unaware of the reordering, from its perspective it just needs to run a singleton merge (single input codec reader) and its only job is to reclaim deletes.
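The two-level mapping being discussed can be sketched with a plain-Java stand-in (simplified; not Lucene's actual MergeState.DocMap API): the reorder map operates on the merged view, the merge-side map then compacts deletes, and the combined map is simply their composition.

```java
import java.util.function.IntUnaryOperator;

public class DocMapComposition {
  // Stand-in for Lucene's MergeState.DocMap: maps a doc of one input reader
  // (whose first doc sits at docBase in the merged view) first through the
  // BP reorder map, then through the merge-side compaction map.
  public static IntUnaryOperator combine(
      int docBase, IntUnaryOperator reorderMap, IntUnaryOperator compactionMap) {
    return docID -> compactionMap.applyAsInt(reorderMap.applyAsInt(docBase + docID));
  }

  public static void main(String[] args) {
    IntUnaryOperator reorder = docID -> 9 - docID;             // reverse 10 docs
    IntUnaryOperator compaction = IntUnaryOperator.identity(); // no deletes
    IntUnaryOperator map = combine(4, reorder, compaction);
    System.out.println(map.applyAsInt(0)); // prints 5
  }
}
```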


@Override
public Fields get(int doc) throws IOException {
int readerId = Arrays.binarySearch(docStarts, doc);
Member:

maybe have a method that shares the code for when the readerId is less than 0
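A shared helper of the shape being suggested might look like this (a sketch with assumed names, not the PR's actual code), folding the negative binarySearch result into the containing reader's index:

```java
import java.util.Arrays;

public class ReaderLookup {
  // Shared helper: map a global doc to the index of the reader containing it.
  // docStarts holds the first global doc of each reader, in ascending order.
  public static int readerIndex(int[] docStarts, int doc) {
    int readerId = Arrays.binarySearch(docStarts, doc);
    if (readerId < 0) {
      // binarySearch returns -(insertionPoint) - 1; the containing reader is
      // the one that starts just before the insertion point.
      readerId = -readerId - 2;
    }
    return readerId;
  }

  public static void main(String[] args) {
    int[] docStarts = {0, 5, 9}; // readers starting at global docs 0, 5, 9
    System.out.println(readerIndex(docStarts, 3));  // 0
    System.out.println(readerIndex(docStarts, 5));  // 1
    System.out.println(readerIndex(docStarts, 10)); // 2
  }
}
```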

@@ -468,7 +468,11 @@ public void checkIntegrity() throws IOException {

@Override
public PointValues getValues(String field) throws IOException {
-    return new SortingPointValues(delegate.getValues(field), docMap);
+    PointValues values = delegate.getValues(field);
Member:

why has this changed now?

Author:

Thanks for catching this, it should not change, as file formats should never be called on fields whose info says the field is not indexed. I fixed the slow composite reader wrapper instead.

int numDocs = -1;

@Override
public synchronized int numDocs() {
Member:

is this costly? I wonder why we lazy-init numDocs here? We already iterate over the CodecReaders in the ctor.

Author:

This tries to cover for leaves that have a lazy numDocs impl. I added a comment.
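The lazy-init pattern under discussion, as a minimal stand-alone sketch (an array of per-leaf counts stands in for the wrapped readers, whose own numDocs may be computed lazily):

```java
public class LazyNumDocs {
  private final int[] leafNumDocs; // stand-in for per-leaf reader doc counts
  private int numDocs = -1;        // -1 means "not computed yet"

  public LazyNumDocs(int[] leafNumDocs) {
    this.leafNumDocs = leafNumDocs;
  }

  // Lazily sum the leaves' numDocs, in case a leaf computes its own count
  // lazily (e.g. by iterating live docs); cache the result for later calls.
  public synchronized int numDocs() {
    if (numDocs == -1) {
      int sum = 0;
      for (int n : leafNumDocs) {
        sum += n;
      }
      numDocs = sum;
    }
    return numDocs;
  }
}
```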


@Override
public void checkIntegrity() throws IOException {
for (KnnVectorsReader reader : readers) {
Member:

this and other places look like they could just be iterated with the Stream API, filtering non-null

Author:

Isn't the regular for loop equally simple? I'm fine either way.

/**
* Wrap a reader prior to merging in order to add/remove fields or documents.
*
* <p><b>NOTE:</b> It is illegal to reorder doc IDs here, use {@link
Member:

maybe a long shot but can we assert that somehow if we do that anywhere with an asserting reader in tests?

@jpountz (Author), Oct 17, 2023:

I don't think we can check if a reader was reordered easily.

-        mergeLiveDocs == null || mergeLiveDocs == prevHardLiveDocs
-            ? docId -> currentHardLiveDocs.get(docId) == false
-            : docId -> mergeLiveDocs.get(docId) && currentHardLiveDocs.get(docId) == false;
+        docId -> segDocMap.get(docId) != -1 && currentHardLiveDocs.get(docId) == false;
Member:

that is in fact a nice simplification here.

docMaps = mergeState.docMaps;
} else {
assert mergeState.docMaps.length == 1;
MergeState.DocMap compactionDocMap = mergeState.docMaps[0];
Member:

this really confuses me. Why do we only look at the first element of the docMaps array and never advance to the next segment?

Member:

is this because of the fact that we are wrapping the source readers into a single deadslow reader? If so I think we should document that and assert that it has only one element. It took me quite some time to figure that out.

Author:

The above line already has an assertion, I'll add a comment too.

@s1monw (Member) commented Oct 12, 2023

Just for the record: I think we should record whether a segment was written with blocks and make it viral. That might be a good idea anyway.

@jpountz jpountz requested a review from s1monw October 17, 2023 09:30
@jpountz (Author) commented Nov 21, 2023

@s1monw Could you take another look?

@s1monw (Member) left a comment

Left some minor suggestions, LGTM otherwise.

@@ -3475,6 +3475,8 @@ public void addIndexesReaderMerge(MergePolicy.OneMerge merge) throws IOException
merge.getMergeInfo().info.setUseCompoundFile(true);
}

merge.setMergeInfo(merge.info);
Member:

did we have a test that realized that merge.info is missing?

Author:

Yes, some of the new tests failed because the new merge policy records that a segment has been reordered in the segment's diagnostics. Let me add dedicated tests for setMergeInfo.

      total += info.info.maxDoc();
    }
-    return total;
+    return totalMaxDoc;
Member:

<3

newSpec.add(
new OneMerge(oneMerge) {

private boolean reordered = false;
Member:

maybe make that a SetOnce?
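For reference, Lucene's SetOnce utility enforces single assignment; a minimal stand-in sketch of that behavior using the JDK only (not the PR's code, and not Lucene's actual implementation):

```java
import java.util.concurrent.atomic.AtomicReference;

public class SetOnceSketch<T> {
  private final AtomicReference<T> ref = new AtomicReference<>();

  // Atomically set the value once; a second set fails, which would catch a
  // bug where the same merge gets reordered twice.
  public void set(T value) {
    if (!ref.compareAndSet(null, value)) {
      throw new IllegalStateException("already set");
    }
  }

  public T get() {
    return ref.get();
  }
}
```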

for (CodecReader reader : mergeReaders) {
final int currentDocBase = docBase;
reorderDocMaps[i] =
new MergeState.DocMap() {
Member:

maybe we can make DocMap an interface and then just use a lambda here and in other places. would look much nicer. I can also do that in a different PR after this is in.

Author:

I'll leave it to you for a follow-up. :)

// Since the reader was reordered, we passed a merged view to MergeState and from its
// perspective there is a single input segment to the merge and the
// SlowCompositeCodecReaderWrapper is effectively doing the merge.
assert mergeState.docMaps.length == 1;
Member:

can you put the length in the message? It will be helpful if we run into an error here.


@Override
public Sorter.DocMap reorder(CodecReader reader, Directory dir) throws IOException {
if (reader.numDocs() < minNumDocs) {
Member:

I guess it's style, but can we maybe only have a single return statement here?

@Override
public void setMergeInfo(SegmentCommitInfo info) {
info.info.addDiagnostics(
Collections.singletonMap("bp.reordered", Boolean.toString(reordered)));
Member:

use the constant here REORDERED?

@jpountz (Author) commented Nov 22, 2023

@s1monw I pushed a commit that should address your feedback.

@s1monw (Member) left a comment

LGTM

@jpountz jpountz merged commit f7cab16 into apache:main Nov 23, 2023
4 checks passed
@jpountz jpountz deleted the bp_merge_policy branch November 23, 2023 12:25
jpountz added a commit that referenced this pull request Nov 23, 2023
… merge. (#12622)

3 participants