
Extract the hnsw graph merging from being part of the vector writer #12657

Merged
merged 14 commits on Oct 17, 2023

Conversation

benwtrent
Member

While working on the quantization codec & thinking about how merging will evolve, it became clearer that having merging attached directly to the vector writer is weird.

I extracted it out to its own class and removed the "initializedNodes" logic from the base class builder.

Also, there was one other refactoring around grabbing sorted nodes from the neighbor iterator: I moved that static method so it's not attached to the writer (all bwc writers need it, and all future HNSW writers will as well).

@benwtrent benwtrent requested a review from zhaih October 11, 2023 19:21
Contributor

@zhaih zhaih left a comment


Overall I think this refactor is great, will take another closer look later

@benwtrent
Copy link
Member Author

@zhaih I updated the API a bit. This is more like what I was thinking: having a builder that accepts readers, doc maps, etc., and can then build with the final merge state.
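A minimal sketch of that API shape, with toy stand-in names (GraphMergerSketch, addReader, etc. are illustrative, not the actual Lucene classes): the merger accumulates per-segment state one reader at a time, then builds from the final merge state, here by picking the largest segment as the seed graph.

```java
import java.util.ArrayList;
import java.util.List;

public class GraphMergerSketch {
  // Toy stand-in for a per-segment reader plus its doc map
  record Segment(String name, int[] docMap) {}

  private final List<Segment> segments = new ArrayList<>();

  // Accumulate readers/doc maps one at a time
  public GraphMergerSketch addReader(String name, int[] docMap) {
    segments.add(new Segment(name, docMap));
    return this;
  }

  // "Build with the final merge state": choose the biggest segment's graph
  // as the initialization point for the merged graph
  public String build() {
    Segment biggest = segments.get(0);
    for (Segment s : segments) {
      if (s.docMap().length > biggest.docMap().length) {
        biggest = s;
      }
    }
    return biggest.name();
  }

  public static void main(String[] args) {
    String seed = new GraphMergerSketch()
        .addReader("seg0", new int[] {0, 2})
        .addReader("seg1", new int[] {1, 3, 4})
        .build();
    System.out.println(seed); // the largest segment is chosen as the init graph
  }
}
```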

*
* @lucene.internal
*/
public final class InitializedHnswGraphBuilder extends HnswGraphBuilder {
Contributor

I wonder whether it'll be better to allow HnswGraphBuilder to accept an OnHeapHnswGraph as a constructor parameter. Then, when we init from a graph, we would use this InitializedHnswGraphBuilder to build a graph, and then pass the built graph and a node filter (to avoid reindexing the same nodes) to the normal HnswGraphBuilder.

Then, for example, if we want to have a multi-threaded ConcurrentHnswGraphBuilder, we could still use this InitializedHnswGraphBuilder to build the init graph and pass it to the ConcurrentHnswGraphBuilder.

I mention this because in my draft concurrent HNSW merge PR #12660 I do need to pass the HNSW graph to a builder per thread, although it can be done in various ways. But I still feel using parent/child classes to separate this can make things a little hard later. Like, if I want a concurrent builder, will the concurrent builder extend from this class? If so we need to be quite careful not to inherit a wrong behavior from the original HnswGraphBuilder, and things can become quite complex.
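A toy sketch of the composition idea (Graph and Builder are hypothetical stand-ins for OnHeapHnswGraph and HnswGraphBuilder, not the real classes): the builder takes a pre-built graph plus a node filter in its constructor, so a concurrent builder could accept the same inputs without subclassing.

```java
import java.util.BitSet;

public class BuilderCompositionSketch {
  // Toy stand-in for OnHeapHnswGraph
  static class Graph {
    int nodeCount;
    Graph(int nodeCount) { this.nodeCount = nodeCount; }
  }

  // Toy stand-in for HnswGraphBuilder: accepts an init graph + node filter,
  // instead of getting that behavior via inheritance
  static class Builder {
    final Graph graph;
    final BitSet alreadyIndexed; // node filter: nodes present in the init graph

    Builder(Graph initGraph, BitSet alreadyIndexed) {
      this.graph = initGraph;
      this.alreadyIndexed = alreadyIndexed;
    }

    void addNode(int node) {
      if (alreadyIndexed.get(node)) {
        return; // avoid reindexing nodes from the init graph
      }
      graph.nodeCount++; // stand-in for the real graph insertion
    }
  }

  public static void main(String[] args) {
    Graph init = new Graph(3); // e.g. built from the largest segment's graph
    BitSet done = new BitSet();
    done.set(0, 3); // nodes 0..2 are already in the init graph
    Builder b = new Builder(init, done); // could equally be a concurrent builder
    b.addNode(1); // skipped: already indexed
    b.addNode(3); // newly added
    System.out.println(init.nodeCount);
  }
}
```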

Member Author

@zhaih I think we can do that. What I also want to do is remove all this Map<Integer, Integer> oldToNewOrdinalMap. The caller should handle all that, not the constructor, especially since there is probably a way faster and simpler way to do it.

Contributor

I didn't follow this closely, but do we really have a map like that? If so, let's at least switch to IntIntHashMap to avoid boxing (if we're unable to remove it)

Member Author

@msokolov I think we can just have int[], where int[oldOrd]=newOrd since old vector ordinals are continuous.

@benwtrent
Member Author

Just being paranoid, I tested and verified that recall is absolutely unchanged between these changes.

baseline:

0.500	 0.10	100000	10	4	50	20	11496	1.00	post-filter
0.533	 0.10	100000	10	4	100	20	18504	1.00	post-filter
0.844	 0.21	100000	10	16	50	20	22211	1.00	post-filter
0.875	 0.24	100000	10	16	100	20	44031	1.00	post-filter

candidate:

0.500	 0.10	100000	10	4	50	20	11778	1.00	post-filter
0.533	 0.10	100000	10	4	100	20	18439	1.00	post-filter
0.844	 0.20	100000	10	16	50	20	24012	1.00	post-filter
0.875	 0.25	100000	10	16	100	20	46249	1.00	post-filter

The performance numbers aren't reliable; this was run on my laptop while it was doing lots of other work.

Comment on lines -1150 to +1142
for (int offset = 0; offset < size; offset += random.nextInt(3) + 1) {
for (int offset = 0; offset < size; offset++) {
Member Author

I think this was to simulate sparse vectors. But the sparse vector iterator never returns null for the vectors. Instead it just skips to the next non-null vector from what I can tell.

Comment on lines 164 to 166
// Since there are no deleted documents in our chosen segment, we can assume that the
// ordinals are unchanged, meaning we only need to know what ordinal offset to select
// and apply it
Contributor

This can be wrong when index sort is configured? E.g. if we have 2 segments, each with 2 docs:

seg0
    doc0: rank=0
    doc1: rank=2
seg1
    doc0: rank=1
    doc1: rank=3

So after the merge those docs will be in an interleaved order, not simply old ordinal + base?
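To make the concern concrete, a tiny illustration with toy data (not Lucene code) of why "old ordinal + base" breaks under index sort:

```java
public class IndexSortSketch {
  public static void main(String[] args) {
    // seg0's docs have sort ranks {0, 2}; seg1's docs have sort ranks {1, 3}.
    // Under index sort, the doc with rank r lands at merged position r.
    int[] seg1Ranks = {1, 3};
    int base = 2;                // seg1's doc base after seg0's 2 docs
    int naive = base + 0;        // "old ordinal + base" for seg1/doc0
    int actual = seg1Ranks[0];   // actual merged position: interleaved by rank
    System.out.println(naive + " vs " + actual); // the two disagree
  }
}
```

The naive offset predicts position 2 for seg1/doc0, but the sorted merge places it at position 1, so a per-segment offset alone cannot describe the mapping.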

Member Author

You are correct! I will revert back to the integer map.

Member Author

@zhaih I switched to an int[] where the indices are the old ordinals (since these are contiguous values as we don't allow deleted docs...for now).

The values are the new ordinals, calculated in a loop similar to the one previously used to populate the Map<Integer, Integer>.
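Roughly, the construction might look like this (a hedged sketch with hypothetical names and toy data; the real code walks the MergeState doc maps rather than a plain array): new ordinals are assigned in merged-doc order and written into an int[] indexed by old ordinal, which is valid because old ordinals are contiguous when there are no deletes.

```java
import java.util.Arrays;

public class OrdinalMapSketch {
  // oldToNew[oldOrd] = newOrd; an array works because old ordinals are 0..n-1
  static int[] buildOldToNew(int[] mergedDocToOldOrd) {
    int[] oldToNew = new int[mergedDocToOldOrd.length];
    int newOrd = 0;
    // Assign new ordinals in merged-doc order, as the Map-based loop did
    for (int mergedDoc = 0; mergedDoc < mergedDocToOldOrd.length; mergedDoc++) {
      oldToNew[mergedDocToOldOrd[mergedDoc]] = newOrd++;
    }
    return oldToNew;
  }

  public static void main(String[] args) {
    // Toy doc map: merged doc order visits old ordinals 1, 0, 3, 2
    System.out.println(Arrays.toString(buildOldToNew(new int[] {1, 0, 3, 2})));
  }
}
```

This keeps the interleaved-order handling from the Map version while avoiding boxing entirely.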

* This selects the biggest Hnsw graph from the provided merge state and initializes a new
* HnswGraphBuilder with that graph as a starting point.
*/
public class IncrementalHnswGraphMerger {
Contributor

Should we put it in util/hnsw package instead?

Member Author

++

This would make sense for a ConcurrentMerger & "BetterFutureHnswGraphMerger" or whatever it will be.

@benwtrent benwtrent requested a review from zhaih October 16, 2023 18:27
Contributor

@zhaih zhaih left a comment


LGTM, only some minor nits

@benwtrent benwtrent merged commit ea272d0 into apache:main Oct 17, 2023
4 checks passed
@benwtrent benwtrent deleted the refactor/extract-hnsw-merging branch October 17, 2023 17:45
benwtrent added a commit that referenced this pull request Oct 17, 2023
…12657)
clayburn added a commit to runningcode/lucene that referenced this pull request Oct 20, 2023
…ache.org

* upstream/main: (239 commits)
  Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (apache#12633)
  Fix index out of bounds when writing FST to different metaOut (apache#12697) (apache#12698)
  Avoid object construction when linear searching arcs (apache#12692)
  chore: update the Javadoc example in Analyzer (apache#12693)
  coorect position on entry in CHANGES.txt
  Refactor ByteBlockPool so it is just a "shift/mask big array" (apache#12625)
  Extract the hnsw graph merging from being part of the vector writer (apache#12657)
  Specialize `BlockImpactsDocsEnum#nextDoc()`. (apache#12670)
  Speed up TestIndexOrDocValuesQuery. (apache#12672)
  Remove over-counting of deleted terms (apache#12586)
  Use MergeSorter in StableStringSorter (apache#12652)
  Use radix sort to speed up the sorting of terms in TermInSetQuery (apache#12587)
  Add timeouts to github jobs. Estimates taken from empirical run times (actions history), with a generous buffer added. (apache#12687)
  Optimize OnHeapHnswGraph's data structure (apache#12651)
  Add createClassLoader to replicator permissions (block specific to jacoco). (apache#12684)
  Move changes entry before backporting
  CHANGES
  Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode
  simple cleanups to vector code (apache#12680)
  Better detect vector module in non-default setups (e.g., custom module layers) (apache#12677)
  ...