Reuse HNSW graph for initialization during merge #12050
Conversation
Thank you for the work! In general it looks good to me (haven't checked the tests yet). I just want to discuss a few places where it might be worth optimizing.
throw new IllegalArgumentException(
    "Invalid KnnVectorsReader. Must be of type PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");
Maybe say:
"Invalid KnnVectorsReader type for field: " + fieldName + ". Must be Lucene95HnswVectorsReader or newer"
?
Makes sense. Will update.
Map<Integer, Integer> oldToNewOrdinalMap = new HashMap<>();
int newOrd = 0;
int maxNewDocID = Collections.max(newIdToOldOrdinal.keySet());
It might be a bit faster to calculate this max in the previous loop?
Good idea, I will update this.
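A minimal sketch of that suggestion, folding the max into the loop that builds the map (the array name and loop shape here are hypothetical; the real code walks doc ids from the merge state):

```java
import java.util.Map;

final class OrdinalMapSketch {
  /**
   * Builds the newDocId -> oldOrdinal map while tracking the maximum new doc
   * id in the same pass, so no separate Collections.max() scan is needed.
   */
  static int fillAndTrackMax(int[] oldOrdToNewDocId, Map<Integer, Integer> newIdToOldOrdinal) {
    int maxNewDocID = -1;
    for (int oldOrd = 0; oldOrd < oldOrdToNewDocId.length; oldOrd++) {
      int newDocId = oldOrdToNewDocId[oldOrd];
      newIdToOldOrdinal.put(newDocId, oldOrd);
      maxNewDocID = Math.max(maxNewDocID, newDocId);
    }
    return maxNewDocID;
  }
}
```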
I'm having trouble wrapping my head around this. When we start merging some field, each segment seg has a graph with ordinals in [0, seg.size]. Why can't we preserve the ordinals from the largest segment, and then let the others fall where they may?
@msokolov The main reason I did not do this was to avoid having to modify the ordering of the vectors from the MergedVectorValues. I believe that the ordinals in the graph map to the positions in the vector values, so they need to be synchronized.
+1, That sounds good!
…On Fri, Jan 13, 2023, 11:10 John Mazanec commented on this pull request, in lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java (#12050 (comment)):
@@ -94,36 +93,83 @@ public int size() {
   }

   /**
-   * Add node on the given level
+   * Add node on the given level. Nodes can be inserted out of order, but it requires that the nodes
Oh I see what you mean. Yes, that makes sense. I think for level 0, we will still want to use a List because all nodes will eventually be present in this level. However, for levels > 0, we can use a TreeMap and then add an iterator over the keys of that map.
Per this discussion, I refactored OnHeapHnswGraph to use a TreeMap to represent the graph structure for levels greater than 0. I ran performance tests with the same setup as #11354 (comment), and the results did not show a significant difference in indexing time between my previous implementation, the implementation using the map, and the current implementation with no merge optimization. Additionally, the results did not show a difference in merge time between my previous implementation and the implementation using the map. Here are the results:

Segment Size 10K
Segment Size 100K
Segment Size 500K

Given that the results do not show a significant difference, I switched to using the TreeMap to avoid multiple large array copies.
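For illustration, the level layout being described might look roughly like this (field and method names are made up, not the actual OnHeapHnswGraph members; neighbor storage is reduced to a placeholder):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class LeveledGraphSketch {
  // level 0 stays dense: every node eventually lives here
  private final List<int[]> level0 = new ArrayList<>();
  // levels > 0 are sparse: a TreeMap keeps ordinals sorted under out-of-order inserts
  private final List<TreeMap<Integer, int[]>> upperLevels = new ArrayList<>();

  void addNode(int level, int node) {
    if (level == 0) {
      while (level0.size() <= node) {
        level0.add(null); // gaps allowed; filled when those ordinals arrive
      }
      level0.set(node, new int[0]);
    } else {
      while (upperLevels.size() < level) {
        upperLevels.add(new TreeMap<>());
      }
      upperLevels.get(level - 1).put(node, new int[0]); // no array re-sorting needed
    }
  }

  /** Nodes on an upper level, in ascending ordinal order. */
  Iterable<Integer> nodesOnLevel(int level) {
    return upperLevels.get(level - 1).keySet();
  }
}
```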
I was able to replicate the results and the decrease in merge time is really nice once data size becomes less trivial.
I know there have been many recent changes in the vectors interface, so to prevent this from rotting on the vine, I can merge it in and handle the merge conflicts, if @jmazanec15 doesn't mind a co-author. But if you have already started that merge, then no worries :)
public void initializeFromGraph(
    HnswGraph initializerGraph, Map<Integer, Integer> oldToNewOrdinalMap) throws IOException {
  assert hnsw.size() == 0;
Could you make this a new static method that also constructs the graph builder?
Makes sense. Will update.
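The resulting factory might be shaped like this (a sketch only: the constructor call and parameter list are simplified and hypothetical, and the real HnswGraphBuilder.create signature carries more parameters, e.g. vector encoding and similarity function):

```java
// Inside HnswGraphBuilder; parameters simplified for illustration.
public static HnswGraphBuilder create(
    RandomAccessVectorValues vectors,
    int M,
    int beamWidth,
    long seed,
    HnswGraph initializerGraph,
    Map<Integer, Integer> oldToNewOrdinalMap)
    throws IOException {
  // construct the builder, then seed its (empty) graph from the initializer
  HnswGraphBuilder builder = new HnswGraphBuilder(vectors, M, beamWidth, seed);
  builder.initializeFromGraph(initializerGraph, oldToNewOrdinalMap);
  return builder;
}
```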
@benwtrent thanks! I do not mind a co-author. Was working on the rebase and just finished it.
Sorry for the delay, I have a few small comments but overall LGTM, thank you!
Ah since Lucene95 has just been released, I think we should move this to Lucene 96?
Do you mean create a new Codec version? From what I can tell, nothing in the underlying storage format has changed, and the only reason … Could you clarify your concern?
@benwtrent You're right, I had an impression that this work was based on the newly created codec, but yeah, we don't need a new codec for it. Sorry for the confusion.
@@ -56,6 +56,8 @@ long apply(long v) {
   // Whether the search stopped early because it reached the visited nodes limit
   private boolean incomplete;

+  public static final NeighborQueue EMPTY_MAX_HEAP_NEIGHBOR_QUEUE = new NeighborQueue(1, true);
It is nice to have a static thing like this. But EMPTY_MAX_HEAP_NEIGHBOR_QUEUE#add(int, float) is possible. This seems dangerous to me, as somebody might accidentally call search and then add values to this static object. If we are going to have a static object like this, it would be good if it was an EmptyNeighborQueue that disallows add or any mutable action.
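A sketch of that guard, assuming NeighborQueue is subclassable (only add is shown; the real class has other mutators that would need the same treatment):

```java
// Hypothetical immutable variant: any accidental write fails fast.
final class EmptyNeighborQueue extends NeighborQueue {
  static final EmptyNeighborQueue INSTANCE = new EmptyNeighborQueue();

  private EmptyNeighborQueue() {
    super(1, true); // max-heap, as in EMPTY_MAX_HEAP_NEIGHBOR_QUEUE
  }

  @Override
  public void add(int node, float score) {
    throw new UnsupportedOperationException("empty neighbor queue is immutable");
  }
}
```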
You are right, I did not think about this. Given how much mutable state there is, I am wondering if it might just be better to get rid of this. What do you think?
@jmazanec15 simply removing it and going back to the way it was (since all the following loops would be empty) should be OK imo. Either way I am good.
My last comment is a minor thing.
Pinging @msokolov to see if he has any more concerns.
The performance improvements here are nice :). Thanks for your persistence on this @jmazanec15!!
Removes logic to add 0 vector implicitly. This is in preparation for adding nodes from other graphs to initialize a new graph. Having the implicit addition of node 0 complicates this logic. Signed-off-by: John Mazanec <jmazane@amazon.com>
Enables nodes to be added into OnHeapHnswGraph in an out-of-order fashion. To do so, additional operations have to be taken to re-sort the nodesByLevel array. Optimizations have been made to avoid sorting whenever possible. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds method to initialize an HNSWGraphBuilder from another HNSWGraph. Initialization can only happen when the builder's graph is empty. Signed-off-by: John Mazanec <jmazane@amazon.com>
Uses HNSWGraphBuilder initialization from graph functionality in Lucene95HnswVectorsWriter. Selects the largest graph to initialize the new graph produced by the HNSWGraphBuilder for merge. Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of levels greater than 0. Refactors NodesIterator to support set representation of nodes. Signed-off-by: John Mazanec <jmazane@amazon.com>
Refactors initialization from graph to be accessible via a create static method in HnswGraphBuilder. Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Squashed merge commit:
* Remove implicit addition of vector 0
* Enable out of order insertion of nodes in hnsw
* Add ability to initialize from graph
* Utilize merge with graph init in HNSWWriter
* Minor modifications to Lucene95HnswVectorsWriter
* Use TreeMap for graph structure for levels > 0
* Refactor initializer to be in static create method
* Address review comments
* Add change log entry
* Remove empty iterator for neighborqueue
Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 merged and I backported to branch_9x (some minor changes for java version stuff around switch statements). Good stuff!
Thanks @benwtrent!
Nightlies have failed for the last couple of days, complaining that KNN searches now return different hits. Is it expected that, given the exact same indexing conditions (flushing on doc count and a serial merge scheduler), KNN searches may return different hits for the same query with this change? Here's the error I'm seeing in the log for reference (can be retrieved via …).
I think that the answer to my question is "yes" given this paragraph in the issue description: "In addition to this, graphs produced by merging two segments are no longer necessarily going to be equivalent to indexing one segment directly. This is caused by both differences in assigned random values as well as insertion order dictating which neighbors are selected for which nodes." @mikemccand Could you kick off a re-gold of nightly benchmarks?
@jpountz yes that's correct. The random number assignment is no longer going to be the same when merging multiple graphs together, because the segment whose graph is being used to initialize won't take any random numbers. Additionally, depending on the ordinals the vectors map to in the initializer graph, the neighbor assignment may be different.
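For context, HNSW draws each node's level from the shared random generator, roughly as below (close in spirit to what HnswGraphBuilder does; ml is typically 1/ln(M)). Nodes copied from the initializer graph keep their levels and skip this draw, which shifts the random sequence seen by every later insertion:

```java
import java.util.SplittableRandom;

final class LevelAssignmentSketch {
  /** Standard HNSW level draw: floor(-ln(U) * ml). */
  static int randomLevel(double ml, SplittableRandom random) {
    double u;
    do {
      u = random.nextDouble();
    } while (u == 0.0); // avoid log(0)
    return (int) (-Math.log(u) * ml);
  }
}
```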
@jmazanec15 did his due diligence, just being paranoid :). I have confirmed that for the following ann-benchmarks datasets the recall before and after this change is 1-1: …

So I tested with … Here are the results: …

So, there are no significant changes in recall. I think this change is good and we should update the test.
Description
Related to #11354 (performance metrics can be found here). I also started a draft PR in #11719, but decided to refactor into a new PR.
This PR adds the functionality to initialize a merged segment's HNSW graph from the largest HNSW graph among the segments being merged. The graph selected must not contain any deleted documents. If no suitable initializer graph is found, it falls back to creating the graph from scratch.
To support this functionality, a couple of changes to the current graph construction process needed to be made. OnHeapHnswGraph had to support out-of-order insertion, because the mapped ordinals of the nodes in the graph used for initialization are not necessarily the first X ordinals in the new graph (see the toy example below).
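A toy example of why that is (made-up merge order, not actual Lucene code):

```java
import java.util.HashMap;
import java.util.Map;

final class MergeOrdinalExample {
  public static void main(String[] args) {
    // Merged vectors are concatenated in merge order; suppose segment 1 is
    // the initializer and its vectors land at these merged positions:
    int[] sourceSegment = {0, 1, 0, 1, 1, 0}; // merged ordinal -> source segment

    Map<Integer, Integer> oldToNewOrdinalMap = new HashMap<>();
    int oldOrd = 0;
    for (int newOrd = 0; newOrd < sourceSegment.length; newOrd++) {
      if (sourceSegment[newOrd] == 1) {
        oldToNewOrdinalMap.put(oldOrd++, newOrd);
      }
    }
    // Prints {0=1, 1=3, 2=4}: the initializer graph's nodes occupy ordinals
    // 1, 3, and 4 of the new graph, so those must be insertable before
    // ordinals 0, 2, and 5 exist.
    System.out.println(oldToNewOrdinalMap);
  }
}
```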
I also removed the implicit addition of the first node into the graph. Implicitly adding the first node created a lot of complexity for initialization. In #11719, I got it to work without changing this, but thought it was cleaner to require the first node to be added explicitly.
In addition to this, graphs produced by merging two segments are no longer necessarily going to be equivalent to indexing one segment directly. This is caused by both differences in assigned random values as well as insertion order dictating which neighbors are selected for which nodes.