
Reuse HNSW graph for initialization during merge #12050

Merged
10 commits merged on Feb 7, 2023

Conversation

jmazanec15 (Contributor)

Description

Related to #11354 (performance metrics can be found here). I also started a draft PR in #11719, but decided to refactor into a new PR.

This PR adds the functionality to initialize a merged segment's HNSW graph from the largest HNSW graph among the segments being merged. The selected graph must not contain any deleted documents. If no suitable initializer graph is found, the merge falls back to creating the graph from scratch.

To support this functionality, a couple of changes to the current graph construction process were needed. OnHeapHnswGraph had to support out-of-order insertion, because the mapped ordinals of the nodes in the graph used for initialization are not necessarily the first X ordinals in the new graph.

I also removed the implicit addition of the first node into the graph. Implicitly adding the first node created a lot of complexity for initialization. In #11719, I got it to work without changing this, but thought it was cleaner to require the first node to be added explicitly.

In addition to this, graphs produced by merging two segments are no longer necessarily going to be equivalent to indexing one segment directly. This is caused by both differences in assigned random values as well as insertion order dictating which neighbors are selected for which nodes.
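The selection policy described above (largest graph, no deleted documents, otherwise build from scratch) can be sketched roughly as follows. This is an illustrative simplification with hypothetical names, not the actual Lucene code:

```java
import java.util.List;

// Hypothetical sketch of the initializer-selection policy: pick the largest
// graph among the merged segments that has no deleted documents; a null
// result means the merged graph is built from scratch.
class InitializerSelection {
    // Minimal stand-in for a per-segment HNSW graph; names are illustrative.
    record SegmentGraph(int size, boolean hasDeletions) {}

    static SegmentGraph selectInitializer(List<SegmentGraph> candidates) {
        SegmentGraph best = null;
        for (SegmentGraph g : candidates) {
            if (g.hasDeletions()) continue; // graphs with dead docs are skipped
            if (best == null || g.size() > best.size()) best = g;
        }
        return best; // null => fall back to building from scratch
    }
}
```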


@zhaih zhaih left a comment


Thank you for the work! In general it looks good to me (haven't checked the tests yet). I just want to discuss a few places that might be worth optimizing.

}

throw new IllegalArgumentException(
"Invalid KnnVectorsReader. Must be of type PerFieldKnnVectorsFormat.FieldsReader or Lucene94HnswVectorsReader");
Contributor

Maybe say:
"Invalid KnnVectorsReader type for field: " + fieldName + ". Must be Lucene95HnswVectorsReader or newer"?

Contributor Author

Makes sense. Will update.


Map<Integer, Integer> oldToNewOrdinalMap = new HashMap<>();
int newOrd = 0;
int maxNewDocID = Collections.max(newIdToOldOrdinal.keySet());
Contributor

It might be a bit faster to calculate this max in the previous loop?

Contributor Author

Good idea, I will update this.
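The suggested single-pass version might look roughly like this (illustrative names, not the actual patch). It assumes the doc-ID-to-old-ordinal map is iterated in sorted doc-ID order, so the ordinal assignment is unchanged while the separate `Collections.max` pass is folded into the existing loop:

```java
import java.util.Map;

// Illustrative: build oldToNewOrdinalMap and track the max new doc ID in a
// single pass, instead of a separate Collections.max call afterwards.
// Caller is assumed to pass a map that iterates in ascending doc-ID order
// (e.g. a TreeMap), so new ordinals are still assigned in doc-ID order.
class OrdinalMapSketch {
    static int buildMapAndMax(Map<Integer, Integer> newIdToOldOrdinal,
                              Map<Integer, Integer> oldToNewOrdinalMap) {
        int newOrd = 0;
        int maxNewDocID = -1;
        for (Map.Entry<Integer, Integer> e : newIdToOldOrdinal.entrySet()) {
            maxNewDocID = Math.max(maxNewDocID, e.getKey()); // max folded into this loop
            oldToNewOrdinalMap.put(e.getValue(), newOrd++);
        }
        return maxNewDocID;
    }
}
```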

@msokolov (Contributor)

> To support this functionality, a couple of changes to the current graph construction process were needed. OnHeapHnswGraph had to support out-of-order insertion, because the mapped ordinals of the nodes in the graph used for initialization are not necessarily the first X ordinals in the new graph.

I'm having trouble wrapping my head around this. When we start merging some field, each segment seg has a graph with ordinals in [0,seg.size]. Why can't we preserve the ordinals from the largest segment, and then let the others fall where they may?

@jmazanec15 (Contributor, Author)

@msokolov The main reason I did not do this was to avoid having to modify the ordering of the vectors from the MergedVectorValues. I believe that the ordinals in the graph map to the positions in the vector values, so they need to be synchronized.

@jmazanec15 jmazanec15 requested a review from zhaih January 12, 2023 21:08
@zhaih (Contributor)

zhaih commented Jan 13, 2023 via email

@jmazanec15 (Contributor, Author)

Per this discussion, I refactored OnHeapHnswGraph to use a TreeMap to represent the graph structure for levels greater than 0. I ran performance tests with the same setup as #11354 (comment), and the results did not show a significant difference in indexing time between my previous implementation, the implementation using the map, and the current implementation with no merge optimization. Additionally, the results did not show a difference in merge time between my previous implementation and the implementation using the map.

Here are the results:

Segment Size 10K

| Exper. | Total indexing time (s) | Total time to merge numeric vectors (ms) | Recall |
|---|---|---|---|
| Control-1 | 189 | 697280 | 0.979 |
| Control-2 | 190 | 722042 | 0.979 |
| Control-3 | 191 | 713402 | 0.979 |
| Test-array 1 | 190 | 683966 | 0.98 |
| Test-array 2 | 187 | 683584 | 0.98 |
| Test-array 3 | 190 | 702458 | 0.98 |
| Test-map 1 | 189 | 723582 | 0.98 |
| Test-map 2 | 187 | 658196 | 0.98 |
| Test-map 3 | 190 | 667777 | 0.98 |

Segment Size 100K

| Exper. | Total indexing time (s) | Total time to merge numeric vectors (ms) | Recall |
|---|---|---|---|
| Control-1 | 366 | 675361 | 0.981 |
| Control-2 | 370 | 695974 | 0.981 |
| Control-3 | 367 | 684418 | 0.981 |
| Test-array 1 | 368 | 651814 | 0.981 |
| Test-array 2 | 368 | 654862 | 0.981 |
| Test-array 3 | 368 | 656062 | 0.981 |
| Test-map 1 | 364 | 637257 | 0.981 |
| Test-map 2 | 370 | 628755 | 0.981 |
| Test-map 3 | 366 | 647569 | 0.981 |

Segment Size 500K

| Exper. | Total indexing time (s) | Total time to merge numeric vectors (ms) | Recall |
|---|---|---|---|
| Control-1 | 633 | 655538 | 0.98 |
| Control-2 | 631 | 664622 | 0.98 |
| Control-3 | 627 | 635919 | 0.98 |
| Test-array 1 | 639 | 376139 | 0.98 |
| Test-array 2 | 636 | 378071 | 0.98 |
| Test-array 3 | 638 | 352633 | 0.98 |
| Test-map 1 | 645 | 373572 | 0.98 |
| Test-map 2 | 635 | 374309 | 0.98 |
| Test-map 3 | 633 | 381212 | 0.98 |

Given that the results do not show a significant difference, I switched to using the TreeMap to avoid multiple large array copies.
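A rough illustration of the TreeMap-based representation (hypothetical names, heavily simplified relative to OnHeapHnswGraph): levels above 0 are sparse, and a sorted map keeps each level's nodes in ordinal order even when they are inserted out of order, avoiding the repeated re-sorts and large array copies of a dense representation:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only. Level 0 contains every node, so the real
// implementation can keep it dense; here we model only the sparse upper
// levels as sorted maps of nodeOrdinal -> neighbor list.
class LevelStructureSketch {
    private final Map<Integer, TreeMap<Integer, List<Integer>>> upperLevels = new TreeMap<>();

    void addNode(int level, int node) {
        // TreeMap keeps node ordinals sorted, so out-of-order insertion
        // needs no explicit re-sorting step.
        upperLevels.computeIfAbsent(level, l -> new TreeMap<>())
                   .put(node, new java.util.ArrayList<>());
    }

    // Nodes on a level, in ascending ordinal order regardless of insertion order.
    List<Integer> nodesOnLevel(int level) {
        TreeMap<Integer, List<Integer>> m = upperLevels.get(level);
        return m == null ? List.of() : List.copyOf(m.keySet());
    }
}
```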


@benwtrent benwtrent left a comment


I was able to replicate the results and the decrease in merge time is really nice once data size becomes less trivial.

I know there have been many recent changes in the vectors interface, so to prevent this from rotting on the vine, I can commit it and handle the merge conflicts, if @jmazanec15 doesn't mind a co-author. But if you have already started that merge, then no worries :)

Comment on lines 161 to 178
public void initializeFromGraph(
HnswGraph initializerGraph, Map<Integer, Integer> oldToNewOrdinalMap) throws IOException {
assert hnsw.size() == 0;
Member

Could you make this a new static method that also constructs the graph builder?

Contributor Author

Makes sense. Will update.
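The shape being requested — construction and graph initialization fused into one static factory, so the "graph must be empty" precondition holds by construction — might look roughly like this (hypothetical signature, not Lucene's actual HnswGraphBuilder API):

```java
import java.util.Map;

// Hypothetical sketch of the requested static factory: because the builder
// is created and seeded in one step, callers can never initialize from a
// graph after nodes have already been added.
class GraphBuilderSketch {
    private boolean seededFromGraph;

    private GraphBuilderSketch() {} // only reachable via create(...)

    static GraphBuilderSketch create(Map<Integer, Integer> oldToNewOrdinalMap) {
        GraphBuilderSketch b = new GraphBuilderSketch();
        // The freshly constructed graph is guaranteed empty here, so the
        // old assert on graph size becomes unnecessary.
        b.seededFromGraph = !oldToNewOrdinalMap.isEmpty();
        return b;
    }

    boolean isSeededFromGraph() { return seededFromGraph; }
}
```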

@jmazanec15 jmazanec15 force-pushed the hnsw-merge-from-graph branch 2 times, most recently from b166c5b to e6b8a07 Compare January 30, 2023 20:06
@jmazanec15 (Contributor, Author)

@benwtrent thanks! I do not mind a co-author. I was working on the rebase and just finished it.

@jmazanec15 jmazanec15 requested review from benwtrent and zhaih and removed request for zhaih and benwtrent January 30, 2023 22:17

@zhaih zhaih left a comment


Sorry for the delay, I have a few small comments but overall LGTM, thank you!


@zhaih zhaih left a comment


Ah since Lucene95 has just been released, I think we should move this to Lucene 96?

@benwtrent (Member)

> Ah since Lucene95 has just been released, I think we should move this to Lucene 96?

@zhaih

Do you mean create a new Codec version? From what I can tell, nothing in the underlying storage format has changed and the only reason Lucene95HnswVectorsReader is cast is for Lucene95HnswVectorsReader#getGraph, which already existed.

Could you clarify your concern?

@zhaih
Copy link
Contributor

zhaih commented Jan 31, 2023

> Do you mean create a new Codec version? From what I can tell, nothing in the underlying storage format has changed and the only reason Lucene95HnswVectorsReader is cast is for Lucene95HnswVectorsReader#getGraph, which already existed.

@benwtrent You're right, I was under the impression that this work was based on a newly created codec, but we don't need a new codec for it. Sorry for the confusion.

@@ -56,6 +56,8 @@ long apply(long v) {
// Whether the search stopped early because it reached the visited nodes limit
private boolean incomplete;

public static final NeighborQueue EMPTY_MAX_HEAP_NEIGHBOR_QUEUE = new NeighborQueue(1, true);
Member

It is nice to have a static thing like this. But calling EMPTY_MAX_HEAP_NEIGHBOR_QUEUE#add(int, float) is possible. This seems dangerous to me, as somebody might accidentally call search and then add values to this static object.

If we are going to have a static object like this, it would be good if it was EmptyNeighborQueue that disallows add or any mutable action.
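One possible shape for that suggestion, purely illustrative (NeighborQueue here is a minimal stand-in, not Lucene's actual class; the PR ultimately removed the shared constant instead):

```java
// Illustrative sketch of an always-empty, mutation-rejecting queue along the
// lines suggested above. NeighborQueue is a minimal stand-in for the real class.
class NeighborQueue {
    void add(int node, float score) { /* real impl inserts into a heap */ }
    int size() { return 0; /* real impl returns heap size */ }
}

final class EmptyNeighborQueue extends NeighborQueue {
    static final EmptyNeighborQueue INSTANCE = new EmptyNeighborQueue();

    private EmptyNeighborQueue() {}

    @Override
    void add(int node, float score) {
        // Sharing a mutable static instance is unsafe; reject all mutation.
        throw new UnsupportedOperationException("empty queue is immutable");
    }
}
```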

Contributor Author

You are right, I did not think about this. Given how much mutable state there is, I am wondering if it might just be better to get rid of this. What do you think?

Member

@jmazanec15 simply removing it and going back to the way it was (since all the following loops would be empty) should be OK imo. Either way I am good.


@benwtrent benwtrent left a comment


My last comment is a minor thing.

Pinging @msokolov to see if he has any more concerns.

The performance improvements here are nice :). Thanks for your persistence on this @jmazanec15!!

Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Enables nodes to be added into OnHeapHnswGraph in out of order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Refactors initialization from graph to be accessible via a static create
method in HnswGraphBuilder.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@benwtrent benwtrent merged commit 776149f into apache:main Feb 7, 2023
benwtrent pushed a commit that referenced this pull request Feb 7, 2023
* Remove implicit addition of vector 0

Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Enable out of order insertion of nodes in hnsw

Enables nodes to be added into OnHeapHnswGraph in out of order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add ability to initialize from graph

Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Utilize merge with graph init in HNSWWriter

Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Minor modifications to Lucene95HnswVectorsWriter

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Use TreeMap for graph structure for levels > 0

Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Refactor initializer to be in static create method

Refactors initialization from graph to be accessible via a static create
method in HnswGraphBuilder.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Address review comments

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add change log entry

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Remove empty iterator for neighborqueue

Signed-off-by: John Mazanec <jmazane@amazon.com>

---------

Signed-off-by: John Mazanec <jmazane@amazon.com>
@benwtrent (Member)

@jmazanec15 merged and I backported to branch_9x (some minor changes for java version stuff around switch statements).

Good stuff!

@jmazanec15 (Contributor, Author)

Thanks @benwtrent!

@jpountz (Contributor)

jpountz commented Feb 10, 2023

Nightlies have failed for the last couple days, complaining that KNN searches now return different hits. Is it expected that given the exact same indexing conditions (flushing on doc count and serial merge scheduler), KNN searches may return different hits for the same query with this change?

Here's the error I'm seeing in the log for reference (can be retrieved via curl -r -10000 http://people.apache.org/~mikemccand/lucenebench/nightly.log):

RuntimeError: search result differences: ["query=KnnFloatVectorQuery:vector[0.024077624,...][100] filter=None sort=None groupField=None hitCount=100: hit 6 has wrong field/score value ([19995955], '0.82841617') vs ([19404640], '0.8304943')", "query=KnnFloatVectorQuery:vector[0.028473025,...][100] filter=None sort=None groupField=None hitCount=100: hit 1 has wrong field/score value ([2139705], '0.9640273') vs ([20795785], '0.9655802')", "query=KnnFloatVectorQuery:vector[0.02227773,...][100] filter=None sort=None groupField=None hitCount=100: hit 19 has wrong field/score value ([20249582], '0.9433427') vs ([8538823], '0.94324553')", "query=KnnFloatVectorQuery:vector[-0.047548626,...][100] filter=None sort=None groupField=None hitCount=100: hit 0 has wrong field/score value ([24831434], '0.84341675') vs ([20712471], '0.8335463')", "query=KnnFloatVectorQuery:vector[0.02625591,...][100] filter=None sort=None groupField=None hitCount=100: hit 6 has wrong field/score value ([25459412], '0.8309758') vs ([15548210], '0.8312737')"]

@jpountz (Contributor)

jpountz commented Feb 10, 2023

I think that the answer to my question is "yes" given this paragraph in the issue description: "In addition to this, graphs produced by merging two segments are no longer necessarily going to be equivalent to indexing one segment directly. This is caused by both differences in assigned random values as well as insertion order dictating which neighbors are selected for which nodes."

@mikemccand Could you kick off a re-gold of nightly benchmarks?

@jmazanec15
Copy link
Contributor Author

@jpountz yes that's correct. The random number assignment is no longer going to be the same when merging multiple graphs together, because the segment whose graph is being used to initialize won't take any random numbers. Additionally, depending on the ordinals the vectors map to in the initializer graph, the neighbor assignment may be different.
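For context, HNSW assigns each inserted node a random top level drawn from a geometric-like distribution. A sketch of the standard formulation (Lucene's exact code may differ) shows why skipping these draws for the initializer graph's nodes shifts every later random choice:

```java
import java.util.Random;

// Standard HNSW-style random level assignment (illustrative; not Lucene's
// exact code). When a merged graph is seeded from an initializer graph,
// its nodes never draw a level here, so the RNG sequence seen by all
// subsequently inserted nodes differs from a from-scratch build, and the
// resulting graphs (and search results) can legitimately differ.
class LevelAssignment {
    static int randomLevel(Random rng, double ml) {
        // ml is typically 1 / ln(M); larger ml yields taller graphs
        return (int) (-Math.log(rng.nextDouble()) * ml);
    }
}
```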

@benwtrent (Member)

@jmazanec15 did his due diligence, just being paranoid :).

I have confirmed that for the following ann-benchmarks datasets the recall before and after this change is identical: mnist-784-euclidean, sift-128-euclidean, glove-100-angular. However, all these datasets are pretty small and may not kick off many segment merges.

So I tested with deep-image-96-angular and it took some time.

But here are the results:

| parameters | fanout | test recall | control recall |
|---|---|---|---|
| {'M': 48, 'efConstruction': 100} | 100 | 0.995 | 0.994 |
| {'M': 16, 'efConstruction': 100} | 100 | 0.986 | 0.986 |
| {'M': 16, 'efConstruction': 100} | 50 | 0.969 | 0.969 |
| {'M': 16, 'efConstruction': 100} | 500 | 0.998 | 0.998 |
| {'M': 48, 'efConstruction': 100} | 500 | 0.999 | 0.999 |
| {'M': 16, 'efConstruction': 100} | 10 | 0.892 | 0.892 |
| {'M': 48, 'efConstruction': 100} | 50 | 0.986 | 0.986 |
| {'M': 48, 'efConstruction': 100} | 10 | 0.941 | 0.940 |

So there are no significant changes in recall. I think this change is good, and we should update the test.

@jpountz

Successfully merging this pull request may close these issues.

Reuse HNSW graphs when merging segments? [LUCENE-10318]
5 participants