Add a merge policy wrapper that performs recursive graph bisection on merge. #12622
Conversation
This adds `BPReorderingMergePolicy`, a merge policy wrapper that reorders doc IDs on merge using a `BPIndexReorderer`.
- Reordering always runs on forced merges.
- A `minNaturalMergeNumDocs` parameter enables reordering only on the larger merged segments. This way, small merges retain all merging optimizations like bulk copying of stored fields, and only the larger segments - which are the most important for search performance - get reordered.
- If not enough RAM is available to perform reordering, reordering is skipped.

To make this work, I had to add the ability for any merge to reorder doc IDs of the merged segment via `OneMerge#reorder`. `MockRandomMergePolicy` from the test framework randomly reverses the order of documents in a merged segment to make sure this logic is properly exercised.
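As a rough illustration of the gating behavior described above, forced merges always reorder while natural merges reorder only when the merged segment is large enough. The class and method names below are hypothetical, not Lucene's actual API:

```java
// Illustrative sketch of the reordering gate; names are hypothetical, not
// Lucene's actual API. Forced merges always reorder, natural merges reorder
// only when the merged segment reaches minNaturalMergeNumDocs.
public class ReorderingGate {
  private final int minNaturalMergeNumDocs;

  public ReorderingGate(int minNaturalMergeNumDocs) {
    this.minNaturalMergeNumDocs = minNaturalMergeNumDocs;
  }

  /** Whether a merge producing numDocs documents should be reordered. */
  public boolean shouldReorder(boolean forcedMerge, int numDocs) {
    if (forcedMerge) {
      return true; // reordering always runs on forced merges
    }
    // small natural merges keep optimizations like bulk-copied stored fields
    return numDocs >= minNaturalMergeNumDocs;
  }
}
```

Small merges skip reordering entirely, so they stay on the fast merge paths.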
The diff is large because I had to introduce a new …
I did a first pass on that and left some comments and questions, nothing major. I left some nits on the way but have questions to be clarified before I can continue. thanks man for adding this change.
```java
int docBase = 0;
int i = 0;
for (CodecReader reader : mergeReaders) {
  final int finalDocBase = docBase;
```
nit: it confused me when I read the code. I read "final" as the eventual docBase, not as the docBase variable being final. Maybe call it currentDocBase or so?
```java
new MergeState.DocMap() {
  @Override
  public int get(int docID) {
    int reorderedDocId = reorderDocMap.get(docID);
```
maybe I am missing something but what do we need the reorderDocMap for? Can't we do the calculation we do in the loop above in-line with this?
it basically does this:

```java
reorderedDocId = docMap.oldToNew(finalDocBase + docID);
```

and I wonder if we really need the second loop? Or is this because we are reassigning the mergeReaders after we record that?
We could try to combine the two doc maps.
I did it this way to try to keep the reordering logic and the merging logic as independent as possible. On the one hand, we have the reordering logic that works on a merged view of the input segments and doesn't care about deletes. On the other hand, we have the merge logic that computes the mapping between doc IDs in input segments and doc IDs in the merged segment (often just compacting deletes, i.e. if index sorting is not enabled).

If we wanted to better combine these two things, we'd need to either ignore the `MergeState`'s doc maps or somehow make `MergeState` aware of the reordering. Today `MergeState` is completely unaware of the reordering; from its perspective it just needs to run a singleton merge (single input codec reader) and its only job is to reclaim deletes.
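As a rough illustration of the two-layer mapping discussed here (independent of Lucene's actual `MergeState.DocMap` type, which a plain `IntUnaryOperator` stands in for), each reader's local doc ID is first shifted by its doc base into the merged view and then run through the reorder permutation:

```java
import java.util.function.IntUnaryOperator;

// Illustrative sketch of the two-step doc-ID mapping: a per-reader doc base
// shift into the merged view, composed with the global reorder permutation.
// Plain IntUnaryOperator stands in for Lucene's MergeState.DocMap.
public class ComposedDocMaps {

  /** Build one map per reader: local docId -> reordered merged docId. */
  public static IntUnaryOperator[] buildPerReaderMaps(
      int[] maxDocs, IntUnaryOperator reorderMap) {
    IntUnaryOperator[] maps = new IntUnaryOperator[maxDocs.length];
    int docBase = 0;
    for (int i = 0; i < maxDocs.length; i++) {
      final int currentDocBase = docBase; // captured by the lambda, must be final
      maps[i] = docId -> reorderMap.applyAsInt(currentDocBase + docId);
      docBase += maxDocs[i];
    }
    return maps;
  }
}
```

Keeping the reorder permutation separate from the base shift is what lets the merging side stay unaware of the reordering.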
```java
@Override
public Fields get(int doc) throws IOException {
  int readerId = Arrays.binarySearch(docStarts, doc);
```
maybe have a method that shares the code for the case where `readerId` is less than 0
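For context, `Arrays.binarySearch` returns `-(insertionPoint) - 1` on a miss, so the shared helper suggested here would normalize a negative result roughly like this (a self-contained sketch, not the PR's actual code):

```java
import java.util.Arrays;

// Sketch of the docStarts lookup: Arrays.binarySearch returns the reader
// index on an exact hit, or -(insertionPoint) - 1 on a miss, in which case
// the doc belongs to the reader preceding the insertion point.
public class ReaderLookup {

  /** Map a merged doc ID to the index of the reader that contains it. */
  public static int readerIndex(int[] docStarts, int doc) {
    int idx = Arrays.binarySearch(docStarts, doc);
    if (idx < 0) {
      idx = -idx - 2; // insertionPoint - 1: the preceding reader owns this doc
    }
    return idx;
  }
}
```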
```diff
@@ -468,7 +468,11 @@ public void checkIntegrity() throws IOException {

 @Override
 public PointValues getValues(String field) throws IOException {
-  return new SortingPointValues(delegate.getValues(field), docMap);
+  PointValues values = delegate.getValues(field);
```
why has this changed now?
Thanks for catching, it should not change as file formats should never be called on fields whose info say the field is not indexed. I fixed the slow composite reader wrapper instead.
```java
int numDocs = -1;

@Override
public synchronized int numDocs() {
```
Is this costly? I wonder why we lazily init numDocs here; we already iterate over the codec readers in the ctor.
This tries to cover for leaves that have a lazy `numDocs` impl. I added a comment.
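A minimal sketch of such lazy caching (a plain int array stands in here for the wrapped leaf readers, whose own `numDocs` may be lazily computed):

```java
// Minimal sketch of lazy numDocs caching: the sum over leaves is deferred to
// the first call because a wrapped leaf's own numDocs may itself be lazy.
// A plain int array stands in for the wrapped leaf readers.
public class LazyNumDocs {
  private final int[] leafNumDocs;
  private int numDocs = -1; // -1 means "not computed yet"

  public LazyNumDocs(int[] leafNumDocs) {
    this.leafNumDocs = leafNumDocs;
  }

  public synchronized int numDocs() {
    if (numDocs == -1) {
      int sum = 0;
      for (int n : leafNumDocs) {
        sum += n; // only paid once, then cached
      }
      numDocs = sum;
    }
    return numDocs;
  }
}
```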
```java
@Override
public void checkIntegrity() throws IOException {
  for (KnnVectorsReader reader : readers) {
```
This and other places look like they could just be iterated with the Stream API, filtering out null entries.
Isn't the regular for loop equally simple? I'm fine either way.
```java
/**
 * Wrap a reader prior to merging in order to add/remove fields or documents.
 *
 * <p><b>NOTE:</b> It is illegal to reorder doc IDs here, use {@link
```
maybe a long shot but can we assert that somehow if we do that anywhere with an asserting reader in tests?
I don't think we can check if a reader was reordered easily.
```java
mergeLiveDocs == null || mergeLiveDocs == prevHardLiveDocs
    ? docId -> currentHardLiveDocs.get(docId) == false
    : docId -> mergeLiveDocs.get(docId) && currentHardLiveDocs.get(docId) == false;
docId -> segDocMap.get(docId) != -1 && currentHardLiveDocs.get(docId) == false;
```
That is in fact a nice simplification here.
```java
  docMaps = mergeState.docMaps;
} else {
  assert mergeState.docMaps.length == 1;
  MergeState.DocMap compactionDocMap = mergeState.docMaps[0];
```
this really confuses me. Why do we only look at the first element of the docMaps array and never advance to the next segment?
is this because of the fact that we are wrapping the source readers into a single deadslow reader? If so I think we should document that and assert that it has only one element. It took me quite some time to figure that out.
The above line already has an assertion, I'll add a comment too.
Just for the record: I think we should record whether a segment that was written contains blocks, and make that viral. That might be a good idea anyway.
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java
@s1monw Could you take another look?
Left some minor suggestions, LGTM otherwise.
```diff
@@ -3475,6 +3475,8 @@ public void addIndexesReaderMerge(MergePolicy.OneMerge merge) throws IOException
   merge.getMergeInfo().info.setUseCompoundFile(true);
 }

+merge.setMergeInfo(merge.info);
```
did we have a test that realized that merge.info is missing?
Yes, some of the new tests failed because the new merge policy records that a segment has been reordered in the segment's diagnostics. Let me add dedicated tests for setMergeInfo.
```diff
   total += info.info.maxDoc();
 }
-return total;
+return totalMaxDoc;
```
<3
```java
newSpec.add(
    new OneMerge(oneMerge) {

      private boolean reordered = false;
```
maybe make that a setOnce?
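For reference, a set-once value enforces a single write and fails loudly on a second one; Lucene ships `org.apache.lucene.util.SetOnce` with similar semantics. A minimal stand-alone sketch of the idea:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of a set-once holder, similar in spirit to Lucene's
// org.apache.lucene.util.SetOnce: the value may be written exactly once,
// and a second write throws instead of silently overwriting.
public class SetOnceFlag<T> {
  private final AtomicBoolean set = new AtomicBoolean(false);
  private volatile T value;

  public void set(T v) {
    if (!set.compareAndSet(false, true)) {
      throw new IllegalStateException("value was already set");
    }
    value = v;
  }

  public T get() {
    return value; // null until set() has been called
  }
}
```

Compared to a plain boolean field, this turns an accidental double assignment into an immediate failure.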
```java
for (CodecReader reader : mergeReaders) {
  final int currentDocBase = docBase;
  reorderDocMaps[i] =
      new MergeState.DocMap() {
```
maybe we can make DocMap an interface and then just use a lambda here and in other places. would look much nicer. I can also do that in a different PR after this is in.
I'll leave it to you for a follow-up. :)
```java
// Since the reader was reordered, we passed a merged view to MergeState and from its
// perspective there is a single input segment to the merge and the
// SlowCompositeCodecReaderWrapper is effectively doing the merge.
assert mergeState.docMaps.length == 1;
```
Can you put the length in the message? It will be helpful if we run into an error here.
```java
@Override
public Sorter.DocMap reorder(CodecReader reader, Directory dir) throws IOException {
  if (reader.numDocs() < minNumDocs) {
```
I guess it's style, but can we maybe only have a single return statement here?
```java
@Override
public void setMergeInfo(SegmentCommitInfo info) {
  info.info.addDiagnostics(
      Collections.singletonMap("bp.reordered", Boolean.toString(reordered)));
```
use the constant `REORDERED` here?
@s1monw I pushed a commit that should address your feedback
LGTM