
LUCENE-10054 Make HnswGraph hierarchical #416

Closed
wants to merge 17 commits into from

Conversation

mayya-sharipova
Contributor

@mayya-sharipova mayya-sharipova commented Oct 27, 2021

Currently HNSW has only a single layer; this patch makes it multi-layered.

This PR is based on the feature branch that already went through several PRs' approval: #250, #267, #287, #315, #536.

To keep changes small, this PR only adds multiple layers in the HnswGraph class. It handles hierarchy in graph construction and search, but only in memory.

Changes:

- HnswGraphBuilder has an extra parameter: ml, the normalization factor for
  level generation. When ml = 0, the graph has only a single layer (SNW);
  when ml > 0, the graph has multiple layers.
  - The ml value recommended in the 2018 HNSW paper is
    ml = 1 / Math.log(1.0 * maxConn), which is used for tests in the
    TestHnswGraph class.
  - When ml = 0, we use the previous code in the method buildSNW.
  - When ml > 0, the method buildHNSW is used, which, following the paper,
    for every new node generates a random level and then places the node,
    with its connections, on the levels up to that level.
- HnswGraph's search method has also been modified to handle two cases:
  - When ml = 0, we use the previous code for a flat graph: generate
    boundedNumSeed random entry points and use them to search the flat
    graph.
  - When ml > 0, we search the hierarchical graph: from the max level down
    to level 1 with topK=1, and on level 0 with topK=boundedNumSeed.
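The level generation described above can be sketched as follows. This is an illustrative sketch of the paper's formula, not the actual code in this PR; the class and method names are made up.

```java
// Illustrative sketch of random level assignment for a new HNSW node
// (names are made up, not the actual code in this PR).
import java.util.SplittableRandom;

class LevelGenerator {
    private final double ml; // normalization factor, e.g. 1 / Math.log(1.0 * maxConn)
    private final SplittableRandom random = new SplittableRandom(42);

    LevelGenerator(double ml) {
        this.ml = ml;
    }

    /**
     * Returns a random level for a new node: level 0 is the most likely,
     * and the probability of higher levels decays exponentially.
     */
    int randomLevel() {
        if (ml == 0) {
            return 0; // flat graph: all nodes live on the single layer
        }
        double u = 1.0 - random.nextDouble(); // in (0, 1], avoids log(0)
        return (int) (-Math.log(u) * ml);
    }
}
```

With ml = 1 / ln(maxConn), roughly a 1/maxConn fraction of nodes on a level is promoted to the level above it.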

Work left for the future: handle hierarchy on disk.

Currently we store the neighbourhoods of all nodes on all levels,
even though levels > 0 contain fewer nodes.

This patch fixes this as follows:
- For each level, the graph stores a list of neighbourhoods only for the
  nodes present on that level. Neighbours on all levels are represented
  as level 0 ordinals.
- nodesByLevel defines which nodes are stored on each level, represented
  as level 0 node ordinals.
- To get the neighbours of a given node on a level > 0, we first do a
  binary search on nodesByLevel for that level to find the node's index
  in the graph's list of neighbourhoods for that level. We then use this
  index to retrieve the desired neighbourhood.
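The binary-search lookup described above can be sketched like this; the class and field names are illustrative, not the actual structures in the PR.

```java
// Illustrative sketch of the per-level neighbour lookup (names are made up,
// not the actual structures in this PR).
import java.util.Arrays;

class LevelNeighbours {
    // nodesByLevel[level] holds the level-0 ordinals present on that level, sorted
    private final int[][] nodesByLevel;
    // neighbourhoods[level][index] holds neighbour ordinals, as level-0 ordinals
    private final int[][][] neighbourhoods;

    LevelNeighbours(int[][] nodesByLevel, int[][][] neighbourhoods) {
        this.nodesByLevel = nodesByLevel;
        this.neighbourhoods = neighbourhoods;
    }

    /** Returns the neighbours of {@code node} (a level-0 ordinal) on {@code level}. */
    int[] neighbours(int level, int node) {
        if (level == 0) {
            return neighbourhoods[0][node]; // level 0 stores every node directly
        }
        // levels > 0 store only the nodes present there, so locate the node first
        int index = Arrays.binarySearch(nodesByLevel[level], node);
        if (index < 0) {
            throw new IllegalArgumentException("node " + node + " is not on level " + level);
        }
        return neighbourhoods[level][index];
    }
}
```

The binary search costs O(log n) per lookup on upper levels, traded for not storing a full-size neighbourhood list per level.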
Disk write and read of the hierarchical NSW graph.

Modify the graph files as follows:

- .vem - meta file
- .vec - vector data file
- .vex - vector graph index, index file for the vector graph file
- .vgr - vector graph file, stores graph connections (previously
  called .vex file)

.vem file
+-------------+--------+------------------+------------------+-
| FieldNumber | SimFun | VectorDataOffset | vectorDataLength |
+-------------+--------+------------------+------------------+-

-+------------------+------------------+-----------------+-----------------+-
 | graphIndexOffset | graphIndexLength | graphDataOffset | graphDataLength |
-+------------------+------------------+-----------------+-----------------+-

--+-------------+------------+------+--------+
  | NumOfLevels | Dimensions | Size | DocIds |
--+-------------+------------+------+--------+

- graph offsets are moved to the .vex file.
This keeps the metadata file smaller, and possibly
allows us later to load graph offsets on first use, or to make
them off-heap.

.vec file stays the same

.vex file:
+-------------+--------------------+--------------------+---------------+--
| NumOfLevels | NumOfNodesOnLevel0 | NumOfNodesOnLevel1 | NodesOnLevel1 | ...
+-------------+--------------------+--------------------+---------------+--

-+----------------------+-----------------+-
 | NumOfNodesOnLevelMax | NodesOnLevelMax |
-+----------------------+-----------------+--

--+------------------------------+----+--------------------------------+
  | GraphOffsetsForNodesOnLevel0 | ...| GraphOffsetsForNodesOnLevelMax |
--+------------------------------+----+--------------------------------+

.vgr file:
+------------------------+------------------------+-----+--------------------------+
| Level0NodesConnections | Level1NodesConnections | ....| LevelMaxNodesConnections |
+------------------------+------------------------+-----+--------------------------+
@mayya-sharipova
Contributor Author

mayya-sharipova commented Oct 27, 2021

Benchmarking based on @jtibshirani's setup

baseline: main branch
candidate: this PR

glove-25-angular

|              | baseline recall | baseline QPS | candidate recall | candidate QPS |
|--------------|-----------------|--------------|------------------|---------------|
| n_cands=10   | 0.626           | 10962.821    | 0.631            | 8869.807      |
| n_cands=50   | 0.888           | 4409.952     | 0.889            | 4111.685      |
| n_cands=100  | 0.946           | 2621.846     | 0.947            | 2734.787      |
| n_cands=500  | 0.994           | 661.253      | 0.994            | 686.700       |
| n_cands=800  | 0.997           | 430.172      | 0.997            | 459.356       |
| n_cands=1000 | 0.998           | 342.915      | 0.998            | 355.238       |

glove-200-angular

|              | baseline recall | baseline QPS | candidate recall | candidate QPS |
|--------------|-----------------|--------------|------------------|---------------|
| n_cands=10   | 0.285           | 4843.028     | 0.312            | 5208.453      |
| n_cands=50   | 0.556           | 2119.933     | 0.558            | 2250.213      |
| n_cands=100  | 0.655           | 1399.261     | 0.648            | 1454.996      |
| n_cands=500  | 0.806           | 379.745      | 0.806            | 410.553       |
| n_cands=800  | 0.836           | 252.796      | 0.836            | 276.456       |
| n_cands=1000 | 0.849           | 201.012      | 0.849            | 220.739      |

sift-128-euclidean

|              | baseline recall | baseline QPS | candidate recall | candidate QPS |
|--------------|-----------------|--------------|------------------|---------------|
| n_cands=10   | 0.601           | 6948.736     | 0.607            | 6677.931      |
| n_cands=50   | 0.889           | 3003.781     | 0.892            | 3202.925      |
| n_cands=100  | 0.952           | 1622.276     | 0.953            | 1996.992      |
| n_cands=500  | 0.996           | 444.135      | 0.996            | 540.368       |
| n_cands=800  | 0.998           | 296.835      | 0.998            | 367.316       |
| n_cands=1000 | 0.999           | 245.498      | 0.999            | 311.339       |

As can be seen from the comparison, the hierarchy brings only a slight change: a small increase in recall at the expense of lower QPS.

@mayya-sharipova mayya-sharipova requested review from msokolov, jtibshirani and jpountz and removed request for msokolov October 27, 2021 17:21
@jtibshirani
Member

As can be seen from the comparison, the hierarchy brings only a slight change: a small increase in recall at the expense of lower QPS

It looks like QPS is sometimes worse, but often better (like in all the sift-128-euclidean runs, num_cands >=50). I wonder if the first runs are affected by a lack of warm-up? My original set-up you linked to didn't include a warm-up, and in LUCENE-9937 we found that this can have a big impact on the first runs.

@mayya-sharipova
Contributor Author

mayya-sharipova commented Oct 27, 2021

@jtibshirani Thanks for the comment.

I wonder if the first runs are affected by a lack of warm-up?

In my benchmarks I had a warm-up stage as well, starting with bogus query args in the ann benchmarking algorithm.

@@ -205,6 +215,43 @@ private FieldEntry readField(DataInput input) throws IOException {
return new FieldEntry(input, similarityFunction);
}

private void fillGraphNodesAndOffsetsByLevel() throws IOException {
for (FieldEntry entry : fields.values()) {
IndexInput input =
Member

If I understand correctly, the graph index file is only used to populate these data structures on FieldEntry (the graph level information). It feels a bit surprising that FieldEntry is constructed across two different files (both the metadata and graph index files). It also means FieldEntry isn't immutable.

Hmm, I'm not sure what typically belongs in a metadata file. I assume it doesn't make sense to move the graph index information there, alongside the ord -> doc mapping. Maybe we could keep the file as-is but create a new class to hold the graph index, something like GraphLevels?

Contributor Author

Yes, indeed we populate FieldEntry from the graph index (levels) file.

It feels a bit surprising that FieldEntry is constructed across two different files

Indeed, I was not happy with this either. Alternatively, we can put all this information about levels and layers into a single meta file, as is currently done. Happy to hear other suggestions as well.

It also means FieldEntry isn't immutable

No, FieldEntry is immutable, as all its fields are final, they are initialized and filled in the Lucene90HnswVectorsReader constructor.

In #315 I was trying to fill graph data lazily on first use, but it looks like we don't like lazy loading.

Maybe we could keep the file as-is but create a new class to hold the graph index, something like GraphLevels?

Would GraphLevels replace FieldEntry? I like the approach of a single FieldEntry for every field, as it follows how other field types are organized.

Contributor Author

@mayya-sharipova mayya-sharipova Nov 22, 2021

@jpountz I wonder if you can suggest a better way to organize the vector files?

For context: currently, .vem (vector metadata) can be quite big, as it contains graph offsets (for 1M docs: 1M * 8 bytes = 8 MB; for 10M docs: 80 MB). This PR tries to store graph offsets in a separate file, but still loads them during initialization of FieldEntry. The problem then, as @jtibshirani noticed, is that FieldEntry data is constructed from two files: the metadata file and the file that stores graph offsets. Maybe it is worth keeping the original format and storing all this data in the metadata file?

Contributor Author

@jtibshirani

It feels a bit surprising that FieldEntry is constructed across two different files (both the metadata and graph index files)

As an alternative, to build FieldEntry from a single meta file only, I've created a PR that keeps the current 3 files (.vec, .vex, and .vem), where the .vem file stores all the levels' nodes; but instead of storing neighbours' offsets we calculate them, which keeps the .vem metadata file smaller. What do you think of this approach?

Member

I like this idea. How would you like to proceed -- work on that PR first (since it seems useful on its own), or move forward with this one and follow-up with a fix soon after?

To clarify, I was not thinking that GraphLevels would replace FieldEntry. It would be a second data structure.

Contributor Author

How would you like to proceed -- work on that PR first (since it seems useful on its own), or move forward with this one and follow-up with a fix soon after?

I was under the impression that we are not happy with the current state of this PR and would not want to merge it without some changes. No?

To clarify, I was not thinking that GraphLevels would replace FieldEntry. It would be a second data structure.

Can you please elaborate on how you see it organized? How is GraphLevels connected to a FieldEntry? Do you suggest putting GraphLevels into a separate file and loading it on first use?

Member

The issue is proving tricky to resolve and personally I don't feel like it needs to block merging. We could look into improving it in a follow-up PR.

*
* @param query search query vector
* @param topK the number of nodes to be returned
* @param numSeed the size of the queue maintained while searching, and controls the number of
* random entry points to sample
* @param numSeed the size of the queue maintained while searching
Member

Could we remove this parameter numSeed? It seems particularly out-of-place now that we don't sample random entry points.

I think it'd be a nice simplification to only have one parameter topK. I can't think of a case where callers would set numSeed to be something different from topK.

Contributor Author

@jtibshirani Thanks for the comment. I have thought long about this, and I think we should keep it until we have a better strategy.

Currently, we search HNSW graphs through KnnVectorQuery providing k as numCandidates (numSeed or fanout), as this query doesn't support a numCandidates parameter. I think it is not optimal that instead of returning, say, the top 10 docs from every segment (or shard in a distributed search), we return 100 (200, 800, etc.).

Also, the signatures of KnnVectorQuery and HnswGraph::search need to be updated to reflect the fact that k is indeed the number of candidates we are considering.

Maybe we can discuss outside this PR whether and where it makes sense to have numCandidates.

Member

It works for me to have a separate discussion. Maybe at least in this PR we can rename this to numCandidates, since the 'seed' naming no longer makes sense?

As context, I still think it makes sense to remove the numCandidates vs. k distinction in HnswGraph. The public signature KnnVectorsReader#search does not include a notion of "num candidates", so users have no way to even use this distinction. I'd be in favor of removing it from HnswGraph, then having a follow-up discussion about whether the vector search APIs should handle numCandidates vs k.

Contributor Author

@jtibshirani Thanks; for now, removing numSeed with a possible re-discussion makes sense to me. Addressed in b421dce

// Typically with diversity criteria we see nodes not fully occupied;
// average fanout seems to be about 1/2 maxConn.
// There is some indexing time penalty for under-allocating, but saves RAM
graph.get(i).add(new NeighborArray(Math.max(32, maxConn / 4)));
Member

I am a little confused why we are careful not to overallocate here, but in addNode we do new NeighborArray(maxConn + 1)? I see you maintained the existing behavior, so it's not critical to address in this PR; I was just curious about it.

Contributor Author

I am curious as well; this was indeed carried over from the existing code.

Contributor

Heh, this makes no sense at all! Git-spelunking, it looks like I added this fancy dancing when adding the diversity criteria, but overlooked addNode, which does the majority of the allocation! It could be worth taking a look to see if we can cheaply save some RAM.

* @param level level for which to get all nodes
* @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
*/
// TODO: return a more suitable iterator over nodes than DocIdSetIterator
Member

hmm, I guess we never addressed this TODO. I don't have a good alternative suggestion, maybe others do.

Contributor

I guess we could make one up

Contributor

Would IntStream be a good fit? Or PrimitiveIterator.OfInt?

@jtibshirani
Member

Got it, it sounds like you already adjusted my set-up to include warm-ups. Overall it looks like a positive performance improvement. I'm in favor of merging this even though the improvement is relatively small -- I think it's good to implement the actual algorithm that we claim to! I also think this sets us up well for future performance improvements, by closely comparing to other HNSW implementations.

One last thing to check regarding performance: does it have an impact on indexing speed?

Reviewing the code with fresh eyes, I found some more parts where I had questions. I know 9.0 feature freeze is coming up really soon, maybe we want to discuss the timing of this PR?

@mayya-sharipova
Contributor Author

mayya-sharipova commented Nov 17, 2021

@jtibshirani Thanks for your extensive review. I've tried to address some comments; some others I think we still need to discuss.

does it have an impact on indexing speed?

I've run another benchmarking experiment with KnnGraphTester, and as can be seen below there seems to be a small impact on indexing speed: a 6% increase in indexing time.

Comparison of glove-100

baseline: main branch
candidate: this PR

|          | baseline | candidate |
|----------|----------|-----------|
| indexing | built 1180000 in 23061/2280885 ms | built 1180000 in 23113/2295566 ms |
| flush    | segment=_1 ramUsed=474.061 MB newFlushedSize=513.284 MB docs/MB=2,305.768 | segment=_0 ramUsed=474.061 MB newFlushedSize=517.153 MB docs/MB=2,288.517 |
|             | baseline recall | baseline QPS | candidate recall | candidate QPS |
|-------------|-----------------|--------------|------------------|---------------|
| n_cands=10  | 0.482           | 4199.190     | 0.496            | 3982.133      |
| n_cands=20  | 0.553           | 3563.974     | 0.561            | 3412.251      |
| n_cands=40  | 0.631           | 2713.881     | 0.635            | 2455.465      |
| n_cands=80  | 0.707           | 1902.787     | 0.709            | 1863.012      |
| n_cands=120 | 0.748           | 1497.979     | 0.747            | 1400.822      |
| n_cands=200 | 0.790           | 1031.353     | 0.791            | 1018.244      |
| n_cands=400 | 0.840           | 577.106      | 0.841            | 600.762       |
| n_cands=600 | 0.865           | 417.961      | 0.865            | 436.488       |
| n_cands=800 | 0.881           | 319.900      | 0.881            | 334.503       |

@jtibshirani
Member

@mayya-sharipova thanks for all your work on this PR. For me, there aren't blockers to merging. I'll do a final review tomorrow (with my morning coffee :)) to make sure I didn't miss anything. What are the next steps -- are you also hoping for a review from Adrien or Mike, who are experienced at designing file formats? Any open questions for them?

@mayya-sharipova
Contributor Author

@jtibshirani Thanks for continuing your review.

My main question to @jpountz and @msokolov is indeed about file formats.

Is it better to keep the current design of 3 files: .vem (metadata), .vec (vectors), and .vex (graph neighbours), and put the extra information about graph levels into the metadata file? This makes the metadata file even bigger, but allows loading all info for FieldEntry from a single meta file.

Or is it better to break the current metadata file into 2 files: 1) .vem, a smaller field description file, and 2) a bigger .vel file with graph levels and neighbours' offsets? This makes the metadata file small, but FieldEntry is loaded from two files.

@msokolov
Contributor

@mayya-sharipova I don't have strong feelings about the file structure; either way seems OK to me. I guess if pressed I would probably choose to keep a smaller number of files since it will reduce the amount of managing/opening/closing files, and the directory will be less cluttered (most of the formats have one or two files?). I don't see any particular downside to having different kinds of information in a file, although then I guess the file parsing code can be a little more complicated, so it's a tradeoff without a clear deciding factor for me?

Currently we store, for each node, an offset in the graph
neighbours file from where to read this node's neighbours.
We also load these offsets onto the heap.

This patch, instead of storing offsets, calculates them when needed.
This saves both heap space and disk space for offsets.

To make offsets calculable:
1) we write neighbours as Int instead of the current VInt, to
have predictable offsets (some extra space here)
2) for each node we allocate ((maxConn + 1) * 4) bytes for
storing the node's neighbours, where "maxConn" is the maximum number
of connections the node can have. If a node has fewer than maxConn
neighbours, we add extra padding to fill the leftover space (some extra
space here). In big graphs most nodes have "maxConn" neighbours, so there
should not be much wasted space.
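Under this fixed-size layout, a node's offset becomes a simple multiplication. A minimal sketch, assuming one length int plus maxConn neighbour ints per node; the class name is made up, not the actual Lucene code.

```java
// Illustrative sketch of computing a node's neighbour-list offset under the
// fixed-size layout described above (names made up, not the actual Lucene code).
class GraphOffsets {
    private final long graphDataOffset; // start of the graph connections data
    private final int maxConn;          // max connections per node

    GraphOffsets(long graphDataOffset, int maxConn) {
        this.graphDataOffset = graphDataOffset;
        this.maxConn = maxConn;
    }

    /**
     * Each node occupies (maxConn + 1) ints: one int for the neighbour count,
     * followed by up to maxConn neighbour ordinals (padded if fewer), so no
     * per-node offset needs to be stored or loaded on the heap.
     */
    long offsetFor(int node) {
        return graphDataOffset + (long) node * (maxConn + 1) * Integer.BYTES;
    }
}
```

The (long) cast matters: for large graphs, node * (maxConn + 1) * 4 can overflow an int.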
@jpountz
Contributor

jpountz commented Jan 11, 2022

Sorry @mayya-sharipova I had missed your ping. I realize we don't have good guidelines on writing file formats so I started something at #598. I'll have a quick look at this PR now.

Contributor

@jpountz jpountz left a comment

The way files are organized makes sense to me.

* @param level level for which to get all nodes
* @return an iterator over nodes where {@code nextDoc} returns a next node ordinal
*/
// TODO: return a more suitable iterator over nodes than DocIdSetIterator
Contributor

Would IntStream be a good fit? Or PrimitiveIterator.OfInt?

Member

@jtibshirani jtibshirani left a comment

This looks great! I left some last tiny comments, but it looks ready to me. As we discussed on #536, before merging we plan to move these changes to a new format Lucene91HnswVectorsFormat to preserve backwards compatibility.

latch.await();
IndexSearcher searcher = manager.acquire();
try {
KnnVectorQuery query = new KnnVectorQuery("vector", new float[] {0f, 0.1f}, 5);
Member

Tiny comment, should we use assertGraphSearch or doKnnSearch here like we do elsewhere in the test?

Contributor Author

assertGraphSearch expects an IndexReader and uses the reader for search, while in this test we have a SearcherManager that uses an IndexSearcher for search; but the assertion of results is copied from assertGraphSearch.

The test verifies that there is no problem with accessing a graph from multiple threads at the same time. We can move this test to TestKnnVectorQuery if you think it will fit there better.

mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Jan 17, 2022
Currently HNSW has only a single layer.
This patch makes HNSW graph multi-layered.

This PR is based on the following PRs:
 apache#250, apache#267, apache#287, apache#315, apache#536, apache#416

Main changes:
- Multi layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat with new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter are introduced to encode graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching of graphs built
in pre 9.1 version. Lucene90HnswVectorsWriter is deleted.
- For backwards compatible tests, previous Lucene90 graph reading and
writing logic was copied into test files of
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Jan 17, 2022
TODO: tests for KNN search for graphs built in pre 9.1 version;
tests for merge of indices of pre 9.1 + current versions.
@mayya-sharipova
Contributor Author

Thanks everyone for the review. I am closing this PR in favour of #608 that copies these changes into a new Lucene91Codec and Lucene91HnswVectorsFormat.

mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Jan 17, 2022
mayya-sharipova added a commit that referenced this pull request Jan 25, 2022
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Jan 27, 2022
mayya-sharipova added a commit that referenced this pull request Jan 28, 2022