
Extract the hnsw graph merging from being part of the vector writer #12657

Merged
merged 14 commits on Oct 17, 2023

Conversation

benwtrent
Member

While working on the quantization codec & thinking about how merging will evolve, it became clearer that having merging attached directly to the vector writer is weird.

I extracted it out to its own class and removed the "initializedNodes" logic from the base class builder.

Also, there was one other refactoring around grabbing sorted nodes from the neighbor iterator: I moved that static method so it's not attached to the writer (all bwc writers need it, and all future HNSW writers will as well).

@benwtrent benwtrent requested a review from zhaih October 11, 2023 19:21
Contributor

@zhaih zhaih left a comment


Overall I think this refactor is great, will take another closer look later

@benwtrent
Copy link
Member Author

@zhaih I updated the API a bit. This is more like what I was thinking: having a builder that accepts readers, doc maps, etc., and can then build with the final merge state.
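A minimal sketch of that API shape, with toy stand-in names (GraphMergerSketch, addReader, etc. are illustrative, not the actual Lucene classes): the merger accumulates per-segment state one reader at a time, then builds from the final merge state, here by picking the largest segment as the seed graph.

```java
import java.util.ArrayList;
import java.util.List;

public class GraphMergerSketch {
  // Toy stand-in for a per-segment reader plus its doc map
  record Segment(String name, int[] docMap) {}

  private final List<Segment> segments = new ArrayList<>();

  // Accumulate readers/doc maps one at a time
  public GraphMergerSketch addReader(String name, int[] docMap) {
    segments.add(new Segment(name, docMap));
    return this;
  }

  // "Build with the final merge state": choose the biggest segment's graph
  // as the initialization point for the merged graph
  public String build() {
    Segment biggest = segments.get(0);
    for (Segment s : segments) {
      if (s.docMap().length > biggest.docMap().length) {
        biggest = s;
      }
    }
    return biggest.name();
  }

  public static void main(String[] args) {
    String seed = new GraphMergerSketch()
        .addReader("seg0", new int[] {0, 2})
        .addReader("seg1", new int[] {1, 3, 4})
        .build();
    System.out.println(seed); // the largest segment is chosen as the init graph
  }
}
```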

*
* @lucene.internal
*/
public final class InitializedHnswGraphBuilder extends HnswGraphBuilder {
Contributor

I wonder whether it'll be better to allow HnswGraphBuilder to accept an OnHeapHnswGraph as a constructor parameter. Then, when we init from a graph, we would use this InitializedHnswGraphBuilder to build a graph, and then pass the built graph and a node filter (to avoid reindexing the same nodes) to the normal HnswGraphBuilder.

Then, for example, if we want to have a multi-threaded ConcurrentHnswGraphBuilder, we could still use this InitializedHnswGraphBuilder to build the init graph and pass it to the ConcurrentHnswGraphBuilder.

I mention this because in my draft concurrent HNSW merge PR #12660 I do need to pass the HNSW graph to a builder per thread, although it can be done in various ways. But I still feel using parent/child classes to separate this can make things a little hard later. Like, if I want a concurrent builder, will the concurrent builder extend from this class? If so we need to be quite careful not to inherit a wrong behavior from the original HnswGraphBuilder, and things can become quite complex.
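A toy sketch of the composition idea (Graph and Builder are hypothetical stand-ins for OnHeapHnswGraph and HnswGraphBuilder, not the real classes): the builder takes a pre-built graph plus a node filter in its constructor, so a concurrent builder could accept the same inputs without subclassing.

```java
import java.util.BitSet;

public class BuilderCompositionSketch {
  // Toy stand-in for OnHeapHnswGraph
  static class Graph {
    int nodeCount;
    Graph(int nodeCount) { this.nodeCount = nodeCount; }
  }

  // Toy stand-in for HnswGraphBuilder: accepts an init graph + node filter,
  // instead of getting that behavior via inheritance
  static class Builder {
    final Graph graph;
    final BitSet alreadyIndexed; // node filter: nodes present in the init graph

    Builder(Graph initGraph, BitSet alreadyIndexed) {
      this.graph = initGraph;
      this.alreadyIndexed = alreadyIndexed;
    }

    void addNode(int node) {
      if (alreadyIndexed.get(node)) {
        return; // avoid reindexing nodes from the init graph
      }
      graph.nodeCount++; // stand-in for the real graph insertion
    }
  }

  public static void main(String[] args) {
    Graph init = new Graph(3); // e.g. built from the largest segment's graph
    BitSet done = new BitSet();
    done.set(0, 3); // nodes 0..2 are already in the init graph
    Builder b = new Builder(init, done); // could equally be a concurrent builder
    b.addNode(1); // skipped: already indexed
    b.addNode(3); // newly added
    System.out.println(init.nodeCount);
  }
}
```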

Member Author

@zhaih I think we can do that. What I also want to do is remove all this Map<Integer, Integer> oldToNewOrdinalMap. The caller should handle all that, not the constructor, especially since there is probably a way faster and simpler way to do it.

Contributor

I didn't follow this closely, but do we really have a map like that? If so, let's at least switch to IntIntHashMap to avoid boxing (if we're unable to remove it)

Member Author

@msokolov I think we can just have int[], where int[oldOrd]=newOrd since old vector ordinals are continuous.

@benwtrent
Member Author

Just being paranoid, I tested and verified that recall is absolutely unchanged between these changes.

baseline:

0.500	 0.10	100000	10	4	50	20	11496	1.00	post-filter
0.533	 0.10	100000	10	4	100	20	18504	1.00	post-filter
0.844	 0.21	100000	10	16	50	20	22211	1.00	post-filter
0.875	 0.24	100000	10	16	100	20	44031	1.00	post-filter

candidate:

0.500	 0.10	100000	10	4	50	20	11778	1.00	post-filter
0.533	 0.10	100000	10	4	100	20	18439	1.00	post-filter
0.844	 0.20	100000	10	16	50	20	24012	1.00	post-filter
0.875	 0.25	100000	10	16	100	20	46249	1.00	post-filter

The performance numbers aren't reliable; this was run on my laptop while it was doing lots of other work.

Comment on lines -1150 to +1142
for (int offset = 0; offset < size; offset += random.nextInt(3) + 1) {
for (int offset = 0; offset < size; offset++) {
Member Author

I think this was to simulate sparse vectors. But the sparse vector iterator never returns null for the vectors. Instead it just skips to the next non-null vector from what I can tell.

Comment on lines 164 to 166
// Since there are no deleted documents in our chosen segment, we can assume that the
// ordinals are unchanged, meaning we only need to know what ordinal offset to select
// and apply it
Contributor

This can be wrong when index sort is configured? E.g. if we have 2 segments, each with 2 docs:

seg0
    doc0: rank=0
    doc1: rank=2
seg1
    doc0: rank=1
    doc1: rank=3

So after the merge those docs will be in an interleaved order, not simply old ordinal + base?
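To make the concern concrete, a tiny illustration with toy data (not Lucene code) of why "old ordinal + base" breaks under index sort:

```java
public class IndexSortSketch {
  public static void main(String[] args) {
    // seg0's docs have sort ranks {0, 2}; seg1's docs have sort ranks {1, 3}.
    // Under index sort, the doc with rank r lands at merged position r.
    int[] seg1Ranks = {1, 3};
    int base = 2;                // seg1's doc base after seg0's 2 docs
    int naive = base + 0;        // "old ordinal + base" for seg1/doc0
    int actual = seg1Ranks[0];   // actual merged position: interleaved by rank
    System.out.println(naive + " vs " + actual); // the two disagree
  }
}
```

The naive offset predicts position 2 for seg1/doc0, but the sorted merge places it at position 1, so a per-segment offset alone cannot describe the mapping.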

Member Author

You are correct! I will revert back to the integer map.

Member Author

@zhaih I switched to an int[] where the indices are the old ordinals (since these are contiguous values as we don't allow deleted docs...for now).

The values are the new ordinals, calculated in a loop similar to the one previously used to populate the Map<Integer, Integer>.
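Roughly, the construction might look like this (a hedged sketch with hypothetical names and toy data; the real code walks the MergeState doc maps rather than a plain array): new ordinals are assigned in merged-doc order and written into an int[] indexed by old ordinal, which is valid because old ordinals are contiguous when there are no deletes.

```java
import java.util.Arrays;

public class OrdinalMapSketch {
  // oldToNew[oldOrd] = newOrd; an array works because old ordinals are 0..n-1
  static int[] buildOldToNew(int[] mergedDocToOldOrd) {
    int[] oldToNew = new int[mergedDocToOldOrd.length];
    int newOrd = 0;
    // Assign new ordinals in merged-doc order, as the Map-based loop did
    for (int mergedDoc = 0; mergedDoc < mergedDocToOldOrd.length; mergedDoc++) {
      oldToNew[mergedDocToOldOrd[mergedDoc]] = newOrd++;
    }
    return oldToNew;
  }

  public static void main(String[] args) {
    // Toy doc map: merged doc order visits old ordinals 1, 0, 3, 2
    System.out.println(Arrays.toString(buildOldToNew(new int[] {1, 0, 3, 2})));
  }
}
```

This keeps the interleaved-order handling from the Map version while avoiding boxing entirely.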

* This selects the biggest Hnsw graph from the provided merge state and initializes a new
* HnswGraphBuilder with that graph as a starting point.
*/
public class IncrementalHnswGraphMerger {
Contributor

Should we put it in util/hnsw package instead?

Member Author

++

This would make sense for a ConcurrentMerger & "BetterFutureHnswGraphMerger" or whatever it will be.

@benwtrent benwtrent requested a review from zhaih October 16, 2023 18:27
Contributor

@zhaih zhaih left a comment


LGTM, only some minor nits

@benwtrent benwtrent merged commit ea272d0 into apache:main Oct 17, 2023
4 checks passed
@benwtrent benwtrent deleted the refactor/extract-hnsw-merging branch October 17, 2023 17:45
benwtrent added a commit that referenced this pull request Oct 17, 2023
…12657)
clayburn added a commit to runningcode/lucene that referenced this pull request Oct 20, 2023
…ache.org

* upstream/main: (239 commits)
  Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation (apache#12633)
  Fix index out of bounds when writing FST to different metaOut (apache#12697) (apache#12698)
  Avoid object construction when linear searching arcs (apache#12692)
  chore: update the Javadoc example in Analyzer (apache#12693)
  coorect position on entry in CHANGES.txt
  Refactor ByteBlockPool so it is just a "shift/mask big array" (apache#12625)
  Extract the hnsw graph merging from being part of the vector writer (apache#12657)
  Specialize `BlockImpactsDocsEnum#nextDoc()`. (apache#12670)
  Speed up TestIndexOrDocValuesQuery. (apache#12672)
  Remove over-counting of deleted terms (apache#12586)
  Use MergeSorter in StableStringSorter (apache#12652)
  Use radix sort to speed up the sorting of terms in TermInSetQuery (apache#12587)
  Add timeouts to github jobs. Estimates taken from empirical run times (actions history), with a generous buffer added. (apache#12687)
  Optimize OnHeapHnswGraph's data structure (apache#12651)
  Add createClassLoader to replicator permissions (block specific to jacoco). (apache#12684)
  Move changes entry before backporting
  CHANGES
  Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode
  simple cleanups to vector code (apache#12680)
  Better detect vector module in non-default setups (e.g., custom module layers) (apache#12677)
  ...