
Reuse HNSW graph for initialization during merge #12050

Merged
10 commits merged on Feb 7, 2023

Conversation

jmazanec15 (Contributor)

Description

Related to #11354 (performance metrics can be found here). I also started a draft PR in #11719, but decided to refactor into a new PR.

This PR adds the functionality to initialize a merged segment's HNSW graph from the largest HNSW graph among the segments being merged. The selected graph must not contain any deleted documents. If no suitable initializer graph is found, the merge falls back to creating the graph from scratch.

To support this functionality, a couple of changes to the current graph construction process were needed. OnHeapHnswGraph had to support out-of-order insertion, because the mapped ordinals of the nodes in the graph used for initialization are not necessarily the first X ordinals in the new graph.

I also removed the implicit addition of the first node into the graph. Implicitly adding the first node created a lot of complexity for initialization. In #11719, I got it to work without changing this, but thought it was cleaner to require the first node to be added explicitly.

In addition to this, graphs produced by merging two segments are no longer necessarily going to be equivalent to indexing one segment directly. This is caused by both differences in assigned random values as well as insertion order dictating which neighbors are selected for which nodes.
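The selection policy described above (largest graph, no deleted documents, otherwise build from scratch) can be sketched roughly as follows. This is an illustrative simplification with hypothetical names, not the actual Lucene code:

```java
import java.util.List;

// Hypothetical sketch of the initializer-selection policy: pick the largest
// graph among the merged segments that has no deleted documents; a null
// result means the merged graph is built from scratch.
class InitializerSelection {
    // Minimal stand-in for a per-segment HNSW graph; names are illustrative.
    record SegmentGraph(int size, boolean hasDeletions) {}

    static SegmentGraph selectInitializer(List<SegmentGraph> candidates) {
        SegmentGraph best = null;
        for (SegmentGraph g : candidates) {
            if (g.hasDeletions()) continue; // graphs with dead docs are skipped
            if (best == null || g.size() > best.size()) best = g;
        }
        return best; // null => fall back to building from scratch
    }
}
```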


@zhaih zhaih left a comment


Thank you for the work! In general it looks good to me (haven't checked the tests yet). I just want to discuss a few places that might be worth optimizing.

}

throw new IllegalArgumentException(
"Invalid KnnVectorsReader. Must be of type PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");
Contributor

Maybe say:
"Invalid KnnVectorsReader type for field: " + fieldName + ". Must be Lucene95HnswVectorsReader or newer"?

Contributor Author

Makes sense. Will update.


Map<Integer, Integer> oldToNewOrdinalMap = new HashMap<>();
int newOrd = 0;
int maxNewDocID = Collections.max(newIdToOldOrdinal.keySet());
Contributor

It might be a bit faster to calculate this max in the previous loop?

Contributor Author

Good idea, I will update this.
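The suggested single-pass version might look roughly like this (illustrative names, not the actual patch). It assumes the doc-ID-to-old-ordinal map is iterated in sorted doc-ID order, so the ordinal assignment is unchanged while the separate `Collections.max` pass is folded into the existing loop:

```java
import java.util.Map;

// Illustrative: build oldToNewOrdinalMap and track the max new doc ID in a
// single pass, instead of a separate Collections.max call afterwards.
// Caller is assumed to pass a map that iterates in ascending doc-ID order
// (e.g. a TreeMap), so new ordinals are still assigned in doc-ID order.
class OrdinalMapSketch {
    static int buildMapAndMax(Map<Integer, Integer> newIdToOldOrdinal,
                              Map<Integer, Integer> oldToNewOrdinalMap) {
        int newOrd = 0;
        int maxNewDocID = -1;
        for (Map.Entry<Integer, Integer> e : newIdToOldOrdinal.entrySet()) {
            maxNewDocID = Math.max(maxNewDocID, e.getKey()); // max folded into this loop
            oldToNewOrdinalMap.put(e.getValue(), newOrd++);
        }
        return maxNewDocID;
    }
}
```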

@msokolov (Contributor)

> To support this functionality, a couple of changes to the current graph construction process were needed. OnHeapHnswGraph had to support out-of-order insertion, because the mapped ordinals of the nodes in the graph used for initialization are not necessarily the first X ordinals in the new graph.

I'm having trouble wrapping my head around this. When we start merging some field, each segment seg has a graph with ordinals in [0,seg.size]. Why can't we preserve the ordinals from the largest segment, and then let the others fall where they may?

@jmazanec15 (Contributor, Author)

@msokolov The main reason I did not do this was to avoid having to modify the ordering of the vectors from the MergedVectorValues. I believe that the ordinals in the graph map to the positions in the vector values, so they need to be synchronized.

@jmazanec15 jmazanec15 requested a review from zhaih January 12, 2023 21:08
@zhaih (Contributor)

zhaih commented Jan 13, 2023 via email

@jmazanec15 (Contributor, Author)

Per this discussion, I refactored OnHeapHnswGraph to use a TreeMap to represent the graph structure for levels greater than 0. I ran performance tests with the same setup as #11354 (comment), and the results did not show a significant difference in indexing time between my previous implementation, the implementation using the map, and the current implementation with no merge optimization. Additionally, the results did not show a difference in merge time between my previous implementation and the implementation using the map.

Here are the results:

Segment Size 10K

| Exper. | Total indexing time (s) | Total time to merge numeric vectors (ms) | Recall |
|---|---|---|---|
| Control-1 | 189 | 697280 | 0.979 |
| Control-2 | 190 | 722042 | 0.979 |
| Control-3 | 191 | 713402 | 0.979 |
| Test-array 1 | 190 | 683966 | 0.98 |
| Test-array 2 | 187 | 683584 | 0.98 |
| Test-array 3 | 190 | 702458 | 0.98 |
| Test-map 1 | 189 | 723582 | 0.98 |
| Test-map 2 | 187 | 658196 | 0.98 |
| Test-map 3 | 190 | 667777 | 0.98 |

Segment Size 100K

| Exper. | Total indexing time (s) | Total time to merge numeric vectors (ms) | Recall |
|---|---|---|---|
| Control-1 | 366 | 675361 | 0.981 |
| Control-2 | 370 | 695974 | 0.981 |
| Control-3 | 367 | 684418 | 0.981 |
| Test-array 1 | 368 | 651814 | 0.981 |
| Test-array 2 | 368 | 654862 | 0.981 |
| Test-array 3 | 368 | 656062 | 0.981 |
| Test-map 1 | 364 | 637257 | 0.981 |
| Test-map 2 | 370 | 628755 | 0.981 |
| Test-map 3 | 366 | 647569 | 0.981 |

Segment Size 500K

| Exper. | Total indexing time (s) | Total time to merge numeric vectors (ms) | Recall |
|---|---|---|---|
| Control-1 | 633 | 655538 | 0.98 |
| Control-2 | 631 | 664622 | 0.98 |
| Control-3 | 627 | 635919 | 0.98 |
| Test-array 1 | 639 | 376139 | 0.98 |
| Test-array 2 | 636 | 378071 | 0.98 |
| Test-array 3 | 638 | 352633 | 0.98 |
| Test-map 1 | 645 | 373572 | 0.98 |
| Test-map 2 | 635 | 374309 | 0.98 |
| Test-map 3 | 633 | 381212 | 0.98 |

Given that the results do not show a significant difference, I switched to using the TreeMap to avoid multiple large array copies.
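A rough illustration of the TreeMap-based representation (hypothetical names, heavily simplified relative to OnHeapHnswGraph): levels above 0 are sparse, and a sorted map keeps each level's nodes in ordinal order even when they are inserted out of order, avoiding the repeated re-sorts and large array copies of a dense representation:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only. Level 0 contains every node, so the real
// implementation can keep it dense; here we model only the sparse upper
// levels as sorted maps of nodeOrdinal -> neighbor list.
class LevelStructureSketch {
    private final Map<Integer, TreeMap<Integer, List<Integer>>> upperLevels = new TreeMap<>();

    void addNode(int level, int node) {
        // TreeMap keeps node ordinals sorted, so out-of-order insertion
        // needs no explicit re-sorting step.
        upperLevels.computeIfAbsent(level, l -> new TreeMap<>())
                   .put(node, new java.util.ArrayList<>());
    }

    // Nodes on a level, in ascending ordinal order regardless of insertion order.
    List<Integer> nodesOnLevel(int level) {
        TreeMap<Integer, List<Integer>> m = upperLevels.get(level);
        return m == null ? List.of() : List.copyOf(m.keySet());
    }
}
```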


@benwtrent benwtrent left a comment


I was able to replicate the results and the decrease in merge time is really nice once data size becomes less trivial.

I know there have been many recent changes in the vectors interface, so to prevent this from rotting on the vine, I can commit it and handle the merge conflicts, if @jmazanec15 doesn't mind a co-author. But if you have already started that merge, then no worries :)

Comment on lines 161 to 178
public void initializeFromGraph(
HnswGraph initializerGraph, Map<Integer, Integer> oldToNewOrdinalMap) throws IOException {
assert hnsw.size() == 0;
Member

Could you make this a new static method that also constructs the graph builder?

Contributor Author

Makes sense. Will update.
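The shape being requested — construction and graph initialization fused into one static factory, so the "graph must be empty" precondition holds by construction — might look roughly like this (hypothetical signature, not Lucene's actual HnswGraphBuilder API):

```java
import java.util.Map;

// Hypothetical sketch of the requested static factory: because the builder
// is created and seeded in one step, callers can never initialize from a
// graph after nodes have already been added.
class GraphBuilderSketch {
    private boolean seededFromGraph;

    private GraphBuilderSketch() {} // only reachable via create(...)

    static GraphBuilderSketch create(Map<Integer, Integer> oldToNewOrdinalMap) {
        GraphBuilderSketch b = new GraphBuilderSketch();
        // The freshly constructed graph is guaranteed empty here, so the
        // old assert on graph size becomes unnecessary.
        b.seededFromGraph = !oldToNewOrdinalMap.isEmpty();
        return b;
    }

    boolean isSeededFromGraph() { return seededFromGraph; }
}
```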

@jmazanec15 jmazanec15 force-pushed the hnsw-merge-from-graph branch 2 times, most recently from b166c5b to e6b8a07 Compare January 30, 2023 20:06
@jmazanec15 (Contributor, Author)

@benwtrent thanks! I do not mind a co-author. I was working on the rebase and just finished it.

@jmazanec15 jmazanec15 requested review from benwtrent and zhaih and removed request for zhaih and benwtrent January 30, 2023 22:17

@zhaih zhaih left a comment


Sorry for the delay, I have a few small comments but overall LGTM, thank you!


@zhaih zhaih left a comment


Ah since Lucene95 has just been released, I think we should move this to Lucene 96?

@benwtrent (Member)

> Ah since Lucene95 has just been released, I think we should move this to Lucene 96?

@zhaih

Do you mean create a new Codec version? From what I can tell, nothing in the underlying storage format has changed and the only reason Lucene95HnswVectorsReader is cast is for Lucene95HnswVectorsReader#getGraph, which already existed.

Could you clarify your concern?

@zhaih
Copy link
Contributor

zhaih commented Jan 31, 2023

> Do you mean create a new Codec version? From what I can tell, nothing in the underlying storage format has changed and the only reason Lucene95HnswVectorsReader is cast is for Lucene95HnswVectorsReader#getGraph, which already existed.

@benwtrent You're right, I was under the impression that this work was based on a newly created codec, but we don't need a new codec for it. Sorry for the confusion.

@@ -56,6 +56,8 @@ long apply(long v) {
// Whether the search stopped early because it reached the visited nodes limit
private boolean incomplete;

public static final NeighborQueue EMPTY_MAX_HEAP_NEIGHBOR_QUEUE = new NeighborQueue(1, true);
Member

It is nice to have a static thing like this. But calling EMPTY_MAX_HEAP_NEIGHBOR_QUEUE#add(int, float) is possible. This seems dangerous to me, as somebody might accidentally call search and then add values to this static object.

If we are going to have a static object like this, it would be good if it was EmptyNeighborQueue that disallows add or any mutable action.
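One possible shape for that suggestion, purely illustrative (NeighborQueue here is a minimal stand-in, not Lucene's actual class; the PR ultimately removed the shared constant instead):

```java
// Illustrative sketch of an always-empty, mutation-rejecting queue along the
// lines suggested above. NeighborQueue is a minimal stand-in for the real class.
class NeighborQueue {
    void add(int node, float score) { /* real impl inserts into a heap */ }
    int size() { return 0; /* real impl returns heap size */ }
}

final class EmptyNeighborQueue extends NeighborQueue {
    static final EmptyNeighborQueue INSTANCE = new EmptyNeighborQueue();

    private EmptyNeighborQueue() {}

    @Override
    void add(int node, float score) {
        // Sharing a mutable static instance is unsafe; reject all mutation.
        throw new UnsupportedOperationException("empty queue is immutable");
    }
}
```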

Contributor Author

You are right, I did not think about this. Given how much mutable state there is, I am wondering if it might just be better to get rid of this. What do you think?

Member

@jmazanec15 simply removing it and going back to the way it was (since all the following loops would be empty) should be OK imo. Either way I am good.


@benwtrent benwtrent left a comment


My last comment is a minor thing.

Pinging @msokolov to see if he has any more concerns.

The performance improvements here are nice :). Thanks for your persistence on this @jmazanec15!!

Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Enables nodes to be added into OnHeapHnswGraph in out of order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Refactors initialization from graph to be accessible via a static create
method in HnswGraphBuilder.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@benwtrent benwtrent merged commit 776149f into apache:main Feb 7, 2023
benwtrent pushed a commit that referenced this pull request Feb 7, 2023
* Remove implicit addition of vector 0

Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Enable out of order insertion of nodes in hnsw

Enables nodes to be added into OnHeapHnswGraph in out of order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add ability to initialize from graph

Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Utilize merge with graph init in HNSWWriter

Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Minor modifications to Lucene95HnswVectorsWriter

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Use TreeMap for graph structure for levels > 0

Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Refactor initializer to be in static create method

Refactors initialization from graph to be accessible via a static create
method in HnswGraphBuilder.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Address review comments

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add change log entry

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Remove empty iterator for neighborqueue

Signed-off-by: John Mazanec <jmazane@amazon.com>

---------

Signed-off-by: John Mazanec <jmazane@amazon.com>
@benwtrent (Member)

@jmazanec15 merged and I backported to branch_9x (some minor changes for java version stuff around switch statements).

Good stuff!

@jmazanec15 (Contributor, Author)

Thanks @benwtrent!

@jpountz (Contributor)

jpountz commented Feb 10, 2023

Nightlies have failed for the last couple days, complaining that KNN searches now return different hits. Is it expected that given the exact same indexing conditions (flushing on doc count and serial merge scheduler), KNN searches may return different hits for the same query with this change?

Here's the error I'm seeing in the log for reference (can be retrieved via curl -r -10000 http://people.apache.org/~mikemccand/lucenebench/nightly.log):

RuntimeError: search result differences: ["query=KnnFloatVectorQuery:vector[0.024077624,...][100] filter=None sort=None groupField=None hitCount=100: hit 6 has wrong field/score value ([19995955], '0.82841617') vs ([19404640], '0.8304943')", "query=KnnFloatVectorQuery:vector[0.028473025,...][100] filter=None sort=None groupField=None hitCount=100: hit 1 has wrong field/score value ([2139705], '0.9640273') vs ([20795785], '0.9655802')", "query=KnnFloatVectorQuery:vector[0.02227773,...][100] filter=None sort=None groupField=None hitCount=100: hit 19 has wrong field/score value ([20249582], '0.9433427') vs ([8538823], '0.94324553')", "query=KnnFloatVectorQuery:vector[-0.047548626,...][100] filter=None sort=None groupField=None hitCount=100: hit 0 has wrong field/score value ([24831434], '0.84341675') vs ([20712471], '0.8335463')", "query=KnnFloatVectorQuery:vector[0.02625591,...][100] filter=None sort=None groupField=None hitCount=100: hit 6 has wrong field/score value ([25459412], '0.8309758') vs ([15548210], '0.8312737')"]

@jpountz (Contributor)

jpountz commented Feb 10, 2023

I think that the answer to my question is "yes" given this paragraph in the issue description: "In addition to this, graphs produced by merging two segments are no longer necessarily going to be equivalent to indexing one segment directly. This is caused by both differences in assigned random values as well as insertion order dictating which neighbors are selected for which nodes."

@mikemccand Could you kick off a re-gold of nightly benchmarks?

@jmazanec15
Copy link
Contributor Author

@jpountz yes that's correct. The random number assignment is no longer going to be the same when merging multiple graphs together, because the segment whose graph is being used to initialize won't take any random numbers. Additionally, depending on the ordinals the vectors map to in the initializer graph, the neighbor assignment may be different.
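For context, HNSW assigns each inserted node a random top level drawn from a geometric-like distribution. A sketch of the standard formulation (Lucene's exact code may differ) shows why skipping these draws for the initializer graph's nodes shifts every later random choice:

```java
import java.util.Random;

// Standard HNSW-style random level assignment (illustrative; not Lucene's
// exact code). When a merged graph is seeded from an initializer graph,
// its nodes never draw a level here, so the RNG sequence seen by all
// subsequently inserted nodes differs from a from-scratch build, and the
// resulting graphs (and search results) can legitimately differ.
class LevelAssignment {
    static int randomLevel(Random rng, double ml) {
        // ml is typically 1 / ln(M); larger ml yields taller graphs
        return (int) (-Math.log(rng.nextDouble()) * ml);
    }
}
```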

@benwtrent (Member)

@jmazanec15 did his due diligence, just being paranoid :).

I have confirmed that for the following ann-benchmarks datasets the recall before and after this change is identical: mnist-784-euclidean, sift-128-euclidean, glove-100-angular. However, all these datasets are pretty small and may not kick off many segment merges.

So I tested with deep-image-96-angular and it took some time.

But here are the results:

| parameters | fanout | test recall | control recall |
|---|---|---|---|
| {'M': 48, 'efConstruction': 100} | 100 | 0.995 | 0.994 |
| {'M': 16, 'efConstruction': 100} | 100 | 0.986 | 0.986 |
| {'M': 16, 'efConstruction': 100} | 50 | 0.969 | 0.969 |
| {'M': 16, 'efConstruction': 100} | 500 | 0.998 | 0.998 |
| {'M': 48, 'efConstruction': 100} | 500 | 0.999 | 0.999 |
| {'M': 16, 'efConstruction': 100} | 10 | 0.892 | 0.892 |
| {'M': 48, 'efConstruction': 100} | 50 | 0.986 | 0.986 |
| {'M': 48, 'efConstruction': 100} | 10 | 0.941 | 0.940 |

So there are no significant changes in recall. I think this change is good, and we should update the test.

@jpountz

Successfully merging this pull request may close these issues.

Reuse HNSW graphs when merging segments? [LUCENE-10318]
5 participants