Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9004: KNN vector search using NSW graphs #2022

Merged
merged 17 commits into from Nov 13, 2020

Conversation

msokolov
Copy link
Contributor

Phew this has been a long time coming, but I think it is in good shape now. We started with a scratchy prototype about a year ago, then @mocobeta got it on a better footing by adding a new codec and also implemented the full hierarchical algorithm, making the graph search faithful to the published literature. Then we took a step back to add underlying vector format as a separate patch, now landed. This patch builds on the new vector format, providing KNN search with NSW graphs. It's the simplest implementation I could tease out (single layer graph, simple neighbor selection, no max fanout control), but I think it will be a good foundation. I've done some pretty extensive performance testing and hyperparameter exploration using the (included) KnnGraphTester with some proprietary data, and get good results. I will follow up later with specifics, but single-threaded latencies in a few ms on my i7 laptop over a 1M x 256-dim dataset seems pretty good. Followups will include repeatable benchmarks on public datasets.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is impressive @msokolov!

I left a bunch of small comments, mostly asking naive questions ;)

Are we planning on adding a VectorsQuery in a follow-on issue/PR, to make it simpler for users to combine ANN search with other clauses in a BooleanQuery, to ensure deletions are filtered out, etc.?

* vectors. Unlike relevance scores, vector scores may be negative.
* @param target the vector-valued query
* @param k the number of docs to return
* @param fanout control the accuracy/speed tradeoff - larger values give better recall at higher cost
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there also an index-time fanout control? Do we allow users to tune that (where?)? If so, maybe link to that from this search-time fanout javadoc too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think there needs to be a follow-on exposing the index-time controls, which indeed are much more potent than this search-time fanout, which has only a small impact on recall and latency. In this patch they are globals in HnswGraphBuilder, but there is no API for setting them. I am thinking the index-time hyperparameters would be specified in IndexWriterConfig?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think there needs to be a follow-on exposing the index-time controls, which indeed are much more potent than this search-time fanout, which has only a small impact on recall and latency. In this patch they are globals in HnswGraphBuilder, but there is no API for setting them.

OK, makes sense.

I am thinking the index-time hyperparameters would be specified in IndexWriterConfig?

Hmm, maybe these could be codec level controls? Or maybe FieldInfo? They would be per-vector-field configuration right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good point. While experimenting with GloVe I'm learning that different settings are appropriate for different vectors, so field-level control might be needed. I'm not sure how codec-level controls are exposed. Don't Codecs get created automatically using no-args constructors and service autodiscovery? Did you mean something like perFieldVectorFormat? Except I doubt we need a new format; it's more about some metadata values that we would store in the field, so I think yeah it would go in FieldInfo. But I'm reluctant to expose hnsw-specific hyperparameters in VectorField, which we want to support other algorithms as well. Maybe this is a good use case for IndexableField.getAttributes()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this probably needs to be field level. Different strokes for different collections of vectors. I'm not sure how to expose since the parameters will be different for different ANN implementations. Might be a good use case for generic IndexedField.attributes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't Codecs get created automatically using no-args constructors and service autodiscovery?

They do at read (search) time! But at write time, you can pass parameters that alter how the Codec does its work, as long as the resulting index is then readable at search time with no-args constructors.

I vaguely remember talking about having ways for Codec at read-time to also take options, but I'm not sure that was ever fully designed / pushed ... @s1monw may remember?

But I'm reluctant to expose hnsw-specific hyperparameters in VectorField, which we want to support other algorithms as well.
Might be a good use case for generic IndexedField.attributes?

Yeah, maybe? I agree it is not obvious where the API should live and how it then finds its way into the ANN data structure construction when writing each segment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikemccand it was pushed but removed again in https://issues.apache.org/jira/browse/LUCENE-9257 a little while ago. I get why it's removed but it seems useful maybe we add it back?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh OK thanks for the context @s1monw.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @msokolov, this is an awesome progress-not-perfection start to adding ANN to Lucene!

int versionMeta = -1;
long checksum = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm is this unused?

@Override
public TopDocs search(float[] vector, int topK, int fanout) throws IOException {
// use a seed that is fixed for the index so we get reproducible results for the same query
final Random random = new Random(checksumSeed);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever seed!

@msokolov msokolov merged commit b36b4af into apache:master Nov 13, 2020
@msokolov msokolov deleted the LUCENE-9004 branch November 13, 2020 13:54
@uschindler
Copy link
Contributor

uschindler commented Nov 16, 2020

Hi,
this broke javadocs:

> /home/jenkins/workspace/Lucene-Solr-master-Linux/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java:43: error: heading used out of sequence: <H3>, compared to implicit preceding heading: <H1>
 * <h3>Hyperparameters</h3>
   ^
1 error

This line is wrong:


(should be <h2>)

@msokolov
Copy link
Contributor Author

Thanks Uwe, I'll push a fix soon. True enough, I am building with Java 11, which is not as fussy about such things. Indeed, I'm used to using arbitrary heading levels to achieve different presentation, but I guess that's not OK!

msfroh pushed a commit to msfroh/lucene-solr that referenced this pull request Nov 18, 2020
epugh pushed a commit to epugh/lucene-solr-1 that referenced this pull request Jan 15, 2021
gsmiller pushed a commit to gsmiller/lucene-solr that referenced this pull request Mar 17, 2021
fix javadoc

Harden TestVectorValues.testAddIndexesDirectory01

Resolve merge conflict

Fix random TestVectorValues failures by use of forceMerge (and getOnlyLeafReader)

LUCENE-9322: Make sure to account for vectors in SortingCodecReader. (apache#2028)

LUCENE-9583: extract separate RandomAccessVectorValues interface (apache#2037)

LUCENE-9322: fix minor cosmetic refactoring error in logging string in IndexWriter's infoStream logging. It was always printing 'vector values' for all merging times instead of the other parts of Lucene index ('doc values', 'stored fields', etc.)

LUCENE-9322: Some fixes to SimpleTextVectorFormat. (apache#2071)

* Make sure the file extensions are unique.

* Fix bug in vector reading.

LUCENE-9004: KNN vector search using NSW graphs (apache#2022)

LUCENE-9610: fix TestKnnGraph.testMerge

LUCENE-9610: fix bug in previous test fix

Make Backward compatible
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants