Implement NodeQueue#pushAll and AbstractLongHeap#addAll#415
Merged
marianotepper merged 2 commits intomainfrom Apr 7, 2025
Merged
Implement NodeQueue#pushAll and AbstractLongHeap#addAll#415marianotepper merged 2 commits intomainfrom
marianotepper merged 2 commits intomainfrom
Conversation
Fixes #409 Adds bulk add operations to the AbstractLongHeap and to the NodeQueue to reduce comparisons required when adding many elements at once. See issue for additional detail.
Contributor
|
Looks very solid to me. Let's just add a pointer to the link that @michaeljmarshall shared with me that contains the theoretical justifications https://stackoverflow.com/a/18742428. The rest looks good. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #409
Adds bulk add operations to the
AbstractLongHeapand to theNodeQueueto reduce comparisons required when adding many elements at once.The current
NodeQueueimplementation forces users to add elements one at a time, which requiresO(n*log(n))time. A bulk addition of elements followed by a single heapify operation is insteadO(n)time. In some brute force scenarios, I found that we could add 400k elements to a queue at a time, which makes for a significant difference in the time to build the queue. This is particularly relevant for a brute force scenario datastax/cassandra#1643.I propose that we add an option for the
NodeQueue(and theAbstractLongHeap) to consume an iterator that produces the node id and score, adds those scores to the heap without runningupheap, and then applies the bulkdownheapoperation to re-heapify the heap. The iterator solution would help keep the space complexity down.It's not clear to me if there are applications for this logic within jvector, which might determine the utility of this feature to the project. Note that there are several places where we call
push()iteratively, which suggests they might benefit from this change. However, for small cardinalities, the performance difference is likely negligible.Further, if this library takes on brute force calculations, this change will become meaningful, so I propose we add it.
Back of the envelope math (a.k.a chatgpt) suggests that for the 400k example I provided, we're talking about 800k comparisons in the bulk add scenario and 8M in the iterative add scenario.