-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize OnHeapHnswGraph's data structure #12651
Conversation
I like this! Actually I think when we are merging we can preallocate the entire array so we don't need to resize at all which should greatly simplify making this beast thread-safe (since the array at least will be immutable). |
Yes that's the idea, although I actually made some mistakes here so the
merging is not entirely pre-allocated, also something in searching might be
broken due to the size() behavior change, I will try to fix them later on.
…On Wed, Oct 11, 2023, 21:17 Michael Sokolov ***@***.***> wrote:
I like this! Actually I think when we are merging we can preallocate the
entire array so we don't need to resize at all which should greatly
simplify making this beast thread-safe (since the array at least will be
immutable).
—
Reply to this email directly, view it on GitHub
<#12651 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEFSB7FHP6Y6SIQTOU5H6BDX65VPPANCNFSM6AAAAAA525PM6Q>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick once-over and left a few comments
@@ -232,7 +231,6 @@ void searchLevel( | |||
graphSeek(graph, level, topCandidateNode); | |||
int friendOrd; | |||
while ((friendOrd = graphNextNeighbor(graph)) != NO_MORE_DOCS) { | |||
assert friendOrd < size : "friendOrd=" + friendOrd + "; size=" + size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep? I think we will need all the assertions we can get to try to ensure thread-safety?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason above, I'll update the way the searcher calculate the size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this change. It makes the graph faster and easier to understand.
My only concern was JVM memory, but I think this will actually use less memory as ArrayList has its own overhead and still stores things in contiguous space like a regular array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I understand what you did better now and it all looks correct to me. I added some cleanup / testing / grammar comments.
@@ -284,6 +285,13 @@ int graphNextNeighbor(HnswGraph graph) throws IOException { | |||
return graph.nextNeighbor(); | |||
} | |||
|
|||
private static int getGraphSize(HnswGraph graph) { | |||
if (graph instanceof OnHeapHnswGraph) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - can be a follow-up, but perhaps we introduce capacity()
to HnswGraph
so we can use it uniformly without casting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to maxNodeId
and default it to size() -1
in HnswGraph
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
Outdated
Show resolved
Hide resolved
entryNode = node; | ||
if (node >= graph.length) { | ||
if (noGrowth) { | ||
throw new AssertionError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should either be assert node < graph.length || noGrowth == false: "...message..."
or else we should throw an IllegalArgumentException
- I don't think we ought to be throwing AssertionError
in production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah let me throw IllegalStateException
instead? (I think that's better than IllegalArgumentException
maybe?)
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java
Outdated
Show resolved
Hide resolved
I have reran the benchmark and still get the similar perf and same recall. (Just to make sure the later edits have not messed up things) |
make the internal graph representation a 2d array
…ache.org * upstream/main: (239 commits) Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (apache#12633) Fix index out of bounds when writing FST to different metaOut (apache#12697) (apache#12698) Avoid object construction when linear searching arcs (apache#12692) chore: update the Javadoc example in Analyzer (apache#12693) coorect position on entry in CHANGES.txt Refactor ByteBlockPool so it is just a "shift/mask big array" (apache#12625) Extract the hnsw graph merging from being part of the vector writer (apache#12657) Specialize `BlockImpactsDocsEnum#nextDoc()`. (apache#12670) Speed up TestIndexOrDocValuesQuery. (apache#12672) Remove over-counting of deleted terms (apache#12586) Use MergeSorter in StableStringSorter (apache#12652) Use radix sort to speed up the sorting of terms in TermInSetQuery (apache#12587) Add timeouts to github jobs. Estimates taken from empirical run times (actions history), with a generous buffer added. (apache#12687) Optimize OnHeapHnswGraph's data structure (apache#12651) Add createClassLoader to replicator permissions (block specific to jacoco). (apache#12684) Move changes entry before backporting CHANGES Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode simple cleanups to vector code (apache#12680) Better detect vector module in non-default setups (e.g., custom module layers) (apache#12677) ...
Description
Make the
OnHeapHnswGraph
essentially a 2D array ofNeighbourArray
, there're several benefits of this change:ArrayList
for level0 andTreeMap
for rest of the levels which both have overheads on resizing or inserting. After this change, we don't needTreeMap
anymore, and during indexing time, we only need to resize one array no matter how many levels we have, and the lookup will be a simpler 2d array lookup.The only regression on this approach is that when we get node for a non-zero level it need to traverse the whole graph. But since that is not a common operation and is only called when we serialize to disk, I made a cache for it such that the cost of first non-zero level call is O(whole_graph) and subsequent call will be trivial.
Test runs
I use writer buffer of 256MB and forcemerge at the end, and measured forceMerge time as well
Baseline
Candidate