Optimize OnHeapHnswGraph's data structure #12651

Merged
merged 11 commits into from
Oct 16, 2023


Contributor

@zhaih zhaih commented Oct 10, 2023

Description

Make OnHeapHnswGraph essentially a 2D array of NeighborArray. This change has several benefits:

  1. We do less array resizing: previously we used an ArrayList for level 0 and a TreeMap for the remaining levels, both of which have overhead on resizing or insertion. After this change we no longer need the TreeMap; during indexing we only need to resize one array no matter how many levels we have, and lookup becomes a simple 2D array lookup.
  2. Multithreading should be easier: this PR does not attempt to make indexing concurrent, but if we want to make merging HNSW graphs concurrent, this change makes things much easier. When we merge, we already know the total size of the graph, so the first dimension of the graph array is fixed and we don't need any synchronization on the graph for resizing.
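As a rough illustration of the layout described above (not the PR's actual code), the sketch below stores neighbors in a two-dimensional array indexed first by node and then by level, so only the outer array is ever resized. SketchNeighbors, SketchGraph, and all method names are hypothetical stand-ins; the real NeighborArray is richer than the plain id list used here.

```java
import java.util.Arrays;

/** Hypothetical stand-in for NeighborArray: just a growable list of neighbor ids. */
final class SketchNeighbors {
  private int[] nodes = new int[4];
  private int size;

  void add(int node) {
    if (size == nodes.length) {
      nodes = Arrays.copyOf(nodes, size * 2);
    }
    nodes[size++] = node;
  }

  int size() {
    return size;
  }

  int get(int i) {
    return nodes[i];
  }
}

/** Hypothetical graph: graph[node][level] holds the neighbors of node on that level. */
final class SketchGraph {
  private SketchNeighbors[][] graph;
  private int nodeCount;

  SketchGraph(int initialCapacity) {
    graph = new SketchNeighbors[Math.max(1, initialCapacity)][];
  }

  /** Adds a node that lives on levels 0..topLevel (a simplification chosen for this sketch). */
  void addNode(int node, int topLevel) {
    if (node >= graph.length) {
      // Only the first (node) dimension is resized, however many levels exist.
      graph = Arrays.copyOf(graph, Math.max(node + 1, graph.length * 2));
    }
    if (graph[node] != null) {
      throw new IllegalStateException("node " + node + " was already added");
    }
    SketchNeighbors[] levels = new SketchNeighbors[topLevel + 1];
    for (int level = 0; level <= topLevel; level++) {
      levels[level] = new SketchNeighbors();
    }
    graph[node] = levels;
    nodeCount++;
  }

  /** Plain 2D array lookup; no map lookup is involved. */
  SketchNeighbors getNeighbors(int level, int node) {
    return graph[node][level];
  }

  int size() {
    return nodeCount;
  }

  int capacity() {
    return graph.length;
  }
}
```

If the final node count is known up front, as in the merge case, passing it as initialCapacity means the outer array is never reallocated in this sketch.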

The only regression with this approach is that getting the nodes on a non-zero level requires traversing the whole graph. But since that is not a common operation and is only called when we serialize to disk, I added a cache for it, so the first call for a non-zero level costs O(whole_graph) and subsequent calls are trivial.
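The caching idea can be sketched as follows. This is a minimal illustration under assumptions, not the PR's implementation: LevelNodeCache and its fields are hypothetical, the graph is reduced to an array holding each node's top level, and the snapshot is immutable, so cache invalidation on later inserts is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical cache of the nodes present on each non-zero level. */
final class LevelNodeCache {
  // topLevel[node] is the highest level the node appears on, or -1 if the slot is empty.
  private final int[] topLevel;
  // Built lazily on the first non-zero-level request; the list index is the level.
  private List<List<Integer>> nodesByLevel;
  // Counts full traversals, only to make the caching visible in this sketch.
  int fullScans;

  LevelNodeCache(int[] topLevel) {
    this.topLevel = topLevel.clone();
  }

  List<Integer> nodesOnLevel(int level) {
    if (level < 1) {
      throw new IllegalArgumentException("level must be >= 1, got " + level);
    }
    if (nodesByLevel == null) {
      buildCache(); // O(whole graph), paid once
    }
    if (level >= nodesByLevel.size()) {
      return Collections.emptyList();
    }
    return Collections.unmodifiableList(nodesByLevel.get(level));
  }

  private void buildCache() {
    fullScans++;
    int maxLevel = 0;
    for (int top : topLevel) {
      maxLevel = Math.max(maxLevel, top);
    }
    List<List<Integer>> byLevel = new ArrayList<>();
    for (int level = 0; level <= maxLevel; level++) {
      byLevel.add(new ArrayList<>());
    }
    for (int node = 0; node < topLevel.length; node++) {
      // A node with top level t belongs to every level from 1 to t.
      for (int level = 1; level <= topLevel[node]; level++) {
        byLevel.get(level).add(node);
      }
    }
    nodesByLevel = byLevel;
  }
}
```

Level 0 is left out of the cache on purpose: in this sketch every stored node is on level 0, so no traversal is needed to enumerate it.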

Test runs

I used a writer buffer of 256 MB and force-merged at the end, and measured the forceMerge time as well.

Baseline

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms
Force merge done, time: 3 ms
0.838	 0.28	10000	0	64	250	100	3868	1.00	post-filter
Force merge done, time: 2 ms
0.755	 1.21	100000	0	64	250	100	101113	1.00	post-filter
Force merge done, time: 632325 ms
0.603	10.27	1000000	0	64	250	100	1622348	1.00	post-filter

Candidate

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms
Force merge done, time: 2 ms
0.838	 0.27	10000	0	64	250	100	3652	1.00	post-filter
Force merge done, time: 2 ms
0.755	 1.13	100000	0	64	250	100	92449	1.00	post-filter
Force merge done, time: 625726 ms
0.608	10.22	1000000	0	64	250	100	1529417	1.00	post-filter

@zhaih zhaih changed the title Optimize OnHeapHnswGraph Optimize OnHeapHnswGraph's data structure Oct 11, 2023
@zhaih zhaih marked this pull request as ready for review October 11, 2023 20:49
@zhaih zhaih requested a review from msokolov October 11, 2023 20:49
@msokolov
Contributor

I like this! Actually, I think when we are merging we can preallocate the entire array so we don't need to resize at all, which should greatly simplify making this beast thread-safe (since the array, at least, will be immutable).

@zhaih
Contributor Author

zhaih commented Oct 12, 2023 via email

Contributor

@msokolov msokolov left a comment

I did a quick once-over and left a few comments

@@ -232,7 +231,6 @@ void searchLevel(
graphSeek(graph, level, topCandidateNode);
int friendOrd;
while ((friendOrd = graphNextNeighbor(graph)) != NO_MORE_DOCS) {
assert friendOrd < size : "friendOrd=" + friendOrd + "; size=" + size;
Contributor

can we keep? I think we will need all the assertions we can get to try to ensure thread-safety?!

Contributor Author

Same reason as above; I'll update the way the searcher calculates the size.

@zhaih zhaih mentioned this pull request Oct 12, 2023
@zhaih
Contributor Author

zhaih commented Oct 12, 2023

OK, I have incorporated everything I learned from #12660 and added several more assertions to make it safer. Please take another look when you have time, @msokolov. Thanks!

@zhaih zhaih requested a review from msokolov October 12, 2023 19:19
Member

@benwtrent benwtrent left a comment

I really like this change. It makes the graph faster and easier to understand.

My only concern was JVM memory, but I think this will actually use less memory, as ArrayList has its own overhead and still stores things in contiguous space like a regular array.

Contributor

@msokolov msokolov left a comment

Thanks! I understand what you did better now and it all looks correct to me. I added some cleanup / testing / grammar comments.

@@ -284,6 +285,13 @@ int graphNextNeighbor(HnswGraph graph) throws IOException {
return graph.nextNeighbor();
}

private static int getGraphSize(HnswGraph graph) {
if (graph instanceof OnHeapHnswGraph) {
Contributor

I see - can be a follow-up, but perhaps we introduce capacity() to HnswGraph so we can use it uniformly without casting

Contributor Author

I renamed it to maxNodeId, and it defaults to size() - 1 in HnswGraph.
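A minimal sketch of that shape, assuming a base class with a default maxNodeId() that subclasses can override. The classes below are hypothetical stand-ins rather than Lucene's HnswGraph hierarchy, and the overriding subclass is an assumption made to show why a caller would want the method instead of an instanceof check.

```java
/** Hypothetical base class mirroring the idea of a default maxNodeId(). */
abstract class SketchHnswGraph {
  abstract int size();

  /** Default: node ids are assumed dense, so the largest id is size() - 1. */
  int maxNodeId() {
    return size() - 1;
  }
}

/** Graph with dense ids 0..size-1; the default is sufficient. */
final class DenseSketchGraph extends SketchHnswGraph {
  private final int size;

  DenseSketchGraph(int size) {
    this.size = size;
  }

  @Override
  int size() {
    return size;
  }
}

/** Assumed case: ids may have gaps, so the largest id can exceed size() - 1. */
final class SparseSketchGraph extends SketchHnswGraph {
  private final int[] nodeIds;

  SparseSketchGraph(int[] nodeIds) {
    this.nodeIds = nodeIds.clone();
  }

  @Override
  int size() {
    return nodeIds.length;
  }

  @Override
  int maxNodeId() {
    int max = -1;
    for (int id : nodeIds) {
      max = Math.max(max, id);
    }
    return max;
  }
}

final class VisitedSets {
  /** Sizes a visited table uniformly, without casting to a concrete graph type. */
  static boolean[] newVisitedTable(SketchHnswGraph graph) {
    return new boolean[graph.maxNodeId() + 1];
  }
}
```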

entryNode = node;
if (node >= graph.length) {
if (noGrowth) {
throw new AssertionError(
Contributor

This should either be assert node < graph.length || noGrowth == false: "...message..." or else we should throw an IllegalArgumentException - I don't think we ought to be throwing AssertionError in production code.

Contributor Author

Ah let me throw IllegalStateException instead? (I think that's better than IllegalArgumentException maybe?)
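The option discussed in this thread could look roughly like the sketch below: a table that either grows on demand or, when created with noGrowth, rejects out-of-range nodes with an IllegalStateException instead of an AssertionError. SketchNodeTable and its message text are hypothetical; only the noGrowth flag and the exception type come from the discussion.

```java
import java.util.Arrays;

/** Hypothetical node table that is either growable or fixed-capacity. */
final class SketchNodeTable {
  private boolean[] present;
  private final boolean noGrowth;
  private int size;

  SketchNodeTable(int capacity, boolean noGrowth) {
    this.present = new boolean[capacity];
    this.noGrowth = noGrowth;
  }

  void addNode(int node) {
    if (node >= present.length) {
      if (noGrowth) {
        // The object was promised a fixed capacity, so this is a state violation.
        throw new IllegalStateException(
            "graph must not grow: node=" + node + ", capacity=" + present.length);
      }
      present = Arrays.copyOf(present, Math.max(node + 1, present.length * 2));
    }
    if (!present[node]) {
      present[node] = true;
      size++;
    }
  }

  int size() {
    return size;
  }

  int capacity() {
    return present.length;
  }
}
```

With a preallocated, fixed-capacity table the backing array reference never changes after construction, which is the property the earlier comments point to as helpful for concurrent merging.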

@zhaih
Contributor Author

zhaih commented Oct 16, 2023

I have rerun the benchmark and still get similar perf and the same recall. (Just to make sure the later edits have not messed things up.)

@zhaih zhaih merged commit a1cf22e into apache:main Oct 16, 2023
5 checks passed
zhaih added a commit that referenced this pull request Oct 16, 2023
make the internal graph representation a 2d array
@zhaih zhaih deleted the HnswOpti branch October 16, 2023 20:22
@zhaih zhaih added this to the 9.9.0 milestone Oct 16, 2023
clayburn added a commit to runningcode/lucene that referenced this pull request Oct 20, 2023
…ache.org

* upstream/main: (239 commits)
  Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (apache#12633)
  Fix index out of bounds when writing FST to different metaOut (apache#12697) (apache#12698)
  Avoid object construction when linear searching arcs (apache#12692)
  chore: update the Javadoc example in Analyzer (apache#12693)
  coorect position on entry in CHANGES.txt
  Refactor ByteBlockPool so it is just a "shift/mask big array" (apache#12625)
  Extract the hnsw graph merging from being part of the vector writer (apache#12657)
  Specialize `BlockImpactsDocsEnum#nextDoc()`. (apache#12670)
  Speed up TestIndexOrDocValuesQuery. (apache#12672)
  Remove over-counting of deleted terms (apache#12586)
  Use MergeSorter in StableStringSorter (apache#12652)
  Use radix sort to speed up the sorting of terms in TermInSetQuery (apache#12587)
  Add timeouts to github jobs. Estimates taken from empirical run times (actions history), with a generous buffer added. (apache#12687)
  Optimize OnHeapHnswGraph's data structure (apache#12651)
  Add createClassLoader to replicator permissions (block specific to jacoco). (apache#12684)
  Move changes entry before backporting
  CHANGES
  Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode
  simple cleanups to vector code (apache#12680)
  Better detect vector module in non-default setups (e.g., custom module layers) (apache#12677)
  ...

3 participants