Implement NodeQueue#pushAll and AbstractLongHeap#addAll #415

Merged: marianotepper merged 2 commits into main from bulk-add on Apr 7, 2025

@michaeljmarshall (Member) commented Apr 3, 2025

Fixes #409

Adds bulk add operations to the AbstractLongHeap and to the NodeQueue to reduce comparisons required when adding many elements at once.

The current NodeQueue implementation forces users to add elements one at a time, which takes O(n*log(n)) time in the worst case. A bulk addition of elements followed by a single heapify operation is instead O(n). In some brute-force scenarios, I found that we could add 400k elements to a queue at once, which makes a significant difference in the time to build the queue. This is particularly relevant for a brute-force scenario (datastax/cassandra#1643).

I propose that we add an option for the NodeQueue (and the AbstractLongHeap) to consume an iterator that produces the node id and score, adds those scores to the heap without running upheap, and then applies the bulk downheap operation to re-heapify the heap. The iterator solution would help keep the space complexity down.
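To make the proposal concrete, here is a minimal sketch of the append-then-heapify idea on a plain min-heap of longs. It is not the code merged in this PR: `BulkHeapSketch` and its methods are hypothetical names, and the sketch assumes each node id and score pair has already been packed into a single `long` by the caller.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

/** Hypothetical min-heap of longs for illustration; not jvector's AbstractLongHeap. */
class BulkHeapSketch {
    private long[] heap; // 1-based: heap[1] holds the smallest element
    private int size;

    BulkHeapSketch(int initialCapacity) {
        heap = new long[Math.max(2, initialCapacity + 1)];
    }

    /** One-at-a-time insert: append, then sift up. O(log n) worst case per call. */
    void push(long value) {
        ensureCapacity(size + 1);
        heap[++size] = value;
        upHeap(size);
    }

    /** Bulk insert: append every value without sifting, then heapify once. */
    void addAll(PrimitiveIterator.OfLong values) {
        while (values.hasNext()) {
            ensureCapacity(size + 1);
            heap[++size] = values.nextLong();
        }
        // Bottom-up heapify: sift down each internal node, starting at the last parent.
        for (int i = size >>> 1; i >= 1; i--) {
            downHeap(i);
        }
    }

    /** Removes and returns the smallest element. */
    long pop() {
        if (size == 0) throw new IllegalStateException("heap is empty");
        long top = heap[1];
        heap[1] = heap[size--];
        if (size > 0) downHeap(1);
        return top;
    }

    int size() {
        return size;
    }

    private void upHeap(int i) {
        long v = heap[i];
        while (i > 1 && v < heap[i >>> 1]) {
            heap[i] = heap[i >>> 1]; // move the larger parent down
            i >>>= 1;
        }
        heap[i] = v;
    }

    private void downHeap(int i) {
        long v = heap[i];
        while (2 * i <= size) {
            int child = 2 * i;
            if (child < size && heap[child + 1] < heap[child]) child++; // pick the smaller child
            if (heap[child] >= v) break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = v;
    }

    private void ensureCapacity(int needed) {
        if (needed >= heap.length) {
            heap = Arrays.copyOf(heap, Math.max(needed + 1, heap.length * 2));
        }
    }

    public static void main(String[] args) {
        BulkHeapSketch h = new BulkHeapSketch(4);
        h.addAll(LongStream.of(5, 3, 8, 1, 9, 2).iterator());
        while (h.size() > 0) System.out.print(h.pop() + " "); // prints 1 2 3 5 8 9
    }
}
```

Because `addAll` consumes an iterator, the caller never has to materialize a second collection of candidates, which is the space argument made above. Re-heapifying the whole array also stays correct when the heap already holds elements before the bulk add.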

It's not clear to me if there are applications for this logic within jvector, which might determine the utility of this feature to the project. Note that there are several places where we call push() iteratively, which suggests they might benefit from this change. However, for small cardinalities, the performance difference is likely negligible.

Further, if this library takes on brute force calculations, this change will become meaningful, so I propose we add it.

Back-of-the-envelope math (a.k.a. ChatGPT) suggests that for the 400k example I provided, we're talking about roughly 800k comparisons in the bulk add scenario and roughly 8M in the iterative add scenario.
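As a rough check on those figures, the standard worst-case bounds can be evaluated for n = 400,000. This assumes a binary heap, about one comparison per level for each sift-up, and the usual 2n upper bound on comparisons for bottom-up heapify. These are upper bounds, not measurements.

$$
\begin{aligned}
\text{iterative push:}\quad & \sum_{i=1}^{n} \lfloor \log_2 i \rfloor \;\le\; n \log_2 n \approx 400{,}000 \times 18.6 \approx 7.4 \times 10^{6} \\
\text{bulk add + heapify:}\quad & 2n = 2 \times 400{,}000 = 8 \times 10^{5}
\end{aligned}
$$

Both numbers agree with the estimates above to within rounding, so the gap in the worst case is about one order of magnitude.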

See issue for additional detail.
michaeljmarshall added the enhancement (New feature or request) label Apr 3, 2025
michaeljmarshall self-assigned this Apr 3, 2025
@marianotepper (Contributor) commented

Looks very solid to me. Let's just add a pointer to the link that @michaeljmarshall shared with me, which contains the theoretical justification: https://stackoverflow.com/a/18742428

The rest looks good.

@marianotepper (Contributor) left a comment

LGTM

marianotepper merged commit 7a68c6c into main Apr 7, 2025
6 checks passed
marianotepper deleted the bulk-add branch April 7, 2025 18:41

Linked issue: Optimize NodeQueue/AbstractLongHeap with option to bulk add elements