Extract the hnsw graph merging from being part of the vector writer #12657
Conversation
Overall I think this refactor is great, will take another closer look later
@zhaih I updated the API a bit. This is more like what I was thinking: a builder that accepts readers, doc maps, etc., and then can build with the final merge state.
```java
 *
 * @lucene.internal
 */
public final class InitializedHnswGraphBuilder extends HnswGraphBuilder {
```
I wonder whether it'd be better to allow `HnswGraphBuilder` to accept an `OnHeapHnswGraph` as a constructor parameter. Then, when we init from a graph, we would use this `InitializedHnswGraphBuilder` to build a graph, and then pass the built graph plus a node filter (to avoid re-indexing the same nodes) to the normal `HnswGraphBuilder`?

Then, for example, if we want to have a multi-threaded `ConcurrentHnswGraphBuilder`, we can still use this `InitializedHnswGraphBuilder` to build the init graph and pass it to the `ConcurrentHnswGraphBuilder`.

I mention this because in my draft concurrent HNSW merge PR #12660 I do need to pass the HNSW graph to a builder per thread, although it can be done in various ways. But I still feel using a parent/child class to separate this can make things a little hard later. For example, if I want a concurrent builder, will the concurrent builder extend from this class? If so, we need to be quite careful not to inherit the wrong behavior from the original `HnswGraphBuilder`, and things can become quite complex.
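To make the composition idea concrete, here is a minimal sketch. All types and names here are simplified stand-ins I made up for illustration, not the actual Lucene API: the builder accepts a pre-built graph plus a node filter, so any builder variant (including a future concurrent one) could reuse the same init path without inheritance.

```java
import java.util.BitSet;

// Hypothetical, simplified stand-ins -- not the real Lucene classes.
class GraphSketch {} // stand-in for OnHeapHnswGraph

class BuilderSketch {
  final GraphSketch graph;     // graph to keep building on
  final BitSet alreadyIndexed; // node filter: nodes copied from the init graph
  int inserted = 0;            // count of nodes actually (re)indexed

  // Accept a pre-built graph by composition instead of via a subclass.
  BuilderSketch(GraphSketch initGraph, BitSet initializedNodes) {
    this.graph = initGraph;
    this.alreadyIndexed = initializedNodes;
  }

  void addNode(int node) {
    if (alreadyIndexed.get(node)) return; // skip re-indexing init nodes
    inserted++;                           // real code would insert into graph here
  }
}
```

The point of the sketch is just the shape: the "initialized" behavior lives in the constructor arguments, not in a subclass that other builders would then have to extend.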
@zhaih I think we can do that. What I also want to do is remove all this `Map<Integer, Integer> oldToNewOrdinalMap` handling. The caller should handle all that, not the constructor, especially since there is probably a far faster and simpler way to do it.
I didn't follow this closely, but do we really have a map like that? If so, let's at least switch to IntIntHashMap to avoid boxing (if we're unable to remove it)
@msokolov I think we can just have an `int[]` where `int[oldOrd] = newOrd`, since the old vector ordinals are contiguous.
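A minimal sketch of that replacement (illustrative only; the class name and `maxOrd` parameter are my assumptions, not the PR's actual code): when old ordinals are contiguous from 0, the boxed `Map<Integer, Integer>` collapses into a plain `int[]` lookup.

```java
import java.util.Map;

public class OrdinalArraySketch {
  // Build oldToNew such that oldToNew[oldOrd] == newOrd.
  // Assumes old ordinals are exactly 0..maxOrd-1 (contiguous, no deletions).
  static int[] toArray(Map<Integer, Integer> oldToNewOrdinalMap, int maxOrd) {
    int[] oldToNew = new int[maxOrd];
    for (Map.Entry<Integer, Integer> e : oldToNewOrdinalMap.entrySet()) {
      oldToNew[e.getKey()] = e.getValue();
    }
    return oldToNew;
  }
}
```

Besides avoiding boxing, the array gives O(1) lookups with no hashing and much better cache locality during graph merging.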
Just being paranoid, I tested and verified that recall is absolutely unchanged between these changes.

baseline:

candidate:

The performance numbers aren't reliable; this was run on my laptop while it was doing lots of other work.
```diff
- for (int offset = 0; offset < size; offset += random.nextInt(3) + 1) {
+ for (int offset = 0; offset < size; offset++) {
```
I think this was to simulate sparse vectors. But the sparse vector iterator never returns `null` for the vectors; instead it just skips to the next non-null vector, from what I can tell.
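A toy model of that skipping behavior (this is not the Lucene iterator API; the class and method names are made up for illustration): advancing jumps straight to the next document that has a vector, so callers iterating with `nextDoc()` never observe a null vector.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class SparseVectorsSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final NavigableMap<Integer, float[]> vectors = new TreeMap<>();
  private int doc = -1;

  void put(int docId, float[] vector) { vectors.put(docId, vector); }

  // Advance to the next doc that actually has a vector; gaps are skipped.
  int nextDoc() {
    Integer next = vectors.higherKey(doc);
    doc = (next == null) ? NO_MORE_DOCS : next;
    return doc;
  }

  float[] vectorValue() { return vectors.get(doc); } // non-null while doc < NO_MORE_DOCS
}
```

Under this model, a test loop that strides by a random amount over dense ordinals doesn't actually exercise anything the iterator wouldn't already handle, which matches the point of the comment above.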
```java
// Since there are no deleted documents in our chosen segment, we can assume that the ordinals
// are unchanged
// meaning we only need to know what ordinal offset to select and apply it
```
This can be wrong when index sort is configured. E.g. if we have 2 segments, each with 2 docs:

seg0
- doc0: rank=0
- doc1: rank=2

seg1
- doc0: rank=1
- doc1: rank=3

So after the merge those docs will be in an interleaved order, not simply old ordinal + base?
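The interleaving in the example above can be checked with a small self-contained sketch (not Lucene code; the record and method names are mine): sorting the union of both segments by the sort field mixes docs from the two segments, so a doc's new position is not its old position plus a per-segment base.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class IndexSortMergeSketch {
  record Doc(int segment, int oldDoc, int rank) {}

  // Merged order under an index sort on "rank".
  static List<Doc> mergeBySort(List<Doc> seg0, List<Doc> seg1) {
    List<Doc> merged = new ArrayList<>(seg0);
    merged.addAll(seg1);
    merged.sort(Comparator.comparingInt(Doc::rank));
    return merged;
  }
}
```

With seg0 ranks {0, 2} and seg1 ranks {1, 3}, the merged order alternates seg0, seg1, seg0, seg1, so an offset-only mapping would assign the wrong new ordinals.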
You are correct! I will revert to the integer map.
@zhaih I switched to an `int[]` where the indices are the old ordinals (since these are contiguous values, as we don't allow deleted docs...for now). The values stored at those indices are the new ordinals, calculated in a loop similar to the one previously used with the `Map<Integer, Integer>`.
```java
 * This selects the biggest Hnsw graph from the provided merge state and initializes a new
 * HnswGraphBuilder with that graph as a starting point.
 */
public class IncrementalHnswGraphMerger {
```
Should we put it in the `util/hnsw` package instead?
++
This would make sense for a ConcurrentMerger & "BetterFutureHnswGraphMerger" or whatever it will be.
LGTM, only some minor nits
While working on the quantization codec and thinking about how merging will evolve, it became clearer that having merging attached directly to the vector writer is weird.

I extracted it out into its own class and removed the "initializedNodes" logic from the base class builder.

Also, there was one other refactoring around grabbing sorted nodes from the neighbor iterator: I just moved that static method so it's not attached to the writer (as all bwc writers need it, and all future HNSW writers will as well).