New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9004: KNN vector search using NSW graphs #2022
Conversation
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is impressive @msokolov!
I left a bunch of small comments, mostly asking naive questions ;)
Are we planning on adding a VectorsQuery
in a follow-on issue/PR, to make it simpler for users to combine ANN search with other clauses in a BooleanQuery
, to ensure deletions are filtered out, etc.?
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
Outdated
Show resolved
Hide resolved
* vectors. Unlike relevance scores, vector scores may be negative. | ||
* @param target the vector-valued query | ||
* @param k the number of docs to return | ||
* @param fanout control the accuracy/speed tradeoff - larger values give better recall at higher cost |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there also an index-time fanout control? Do we allow users to tune that (where?)? If so, maybe link to that from this search-time fanout
javadoc too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think there needs to be a follow-on exposing the index-time controls, which indeed are much more potent than this search-time fanout, which has only a small impact on recall and latency. In this patch they are globals in HnswGraphBuilder, but there is no API for setting them. I am thinking the index-time hyperparameters would be specified in IndexWriterConfig?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think there needs to be a follow-on exposing the index-time controls, which indeed are much more potent than this search-time fanout, which has only a small impact on recall and latency. In this patch they are globals in HnswGraphBuilder, but there is no API for setting them.
OK, makes sense.
I am thinking the index-time hyperparameters would be specified in IndexWriterConfig?
Hmm, maybe these could be codec level controls? Or maybe FieldInfo
? They would be per-vector-field configuration right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a good point. While experimenting with GloVe I'm learning that different settings are appropriate for different vectors, so field-level control might be needed. I'm not sure how codec-level controls are exposed. Don't Codecs get created automatically using no-args constructors and service autodiscovery? Did you mean something like perFieldVectorFormat? Except I doubt we need a new format; it's more about some metadata values that we would store in the field, so I think yeah it would go in FieldInfo. But I'm reluctant to expose hnsw-specific hyperparameters in VectorField
, which we want to support other algorithms as well. Maybe this is a good use case for IndexableField.getAttributes()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this probably needs to be field level. Different strokes for different collections of vectors. I'm not sure how to expose since the parameters will be different for different ANN implementations. Might be a good use case for generic IndexedField.attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't Codecs get created automatically using no-args constructors and service autodiscovery?
They do at read (search) time! But at write time, you can pass parameters that alter how the Codec does its work, as long as the resulting index is then readable at search time with no-args constructors.
I vaguely remember talking about having ways for Codec at read-time to also take options, but I'm not sure that was ever fully designed / pushed ... @s1monw may remember?
But I'm reluctant to expose hnsw-specific hyperparameters in VectorField, which we want to support other algorithms as well.
Might be a good use case for generic IndexedField.attributes?
Yeah, maybe? I agree it is not obvious where the API should live and how it then finds its way into the ANN data structure construction when writing each segment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikemccand it was pushed but removed again in https://issues.apache.org/jira/browse/LUCENE-9257 a little while ago. I get why it's removed but it seems useful maybe we add it back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh OK thanks for the context @s1monw.
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraph.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorFormat.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/index/KnnGraphValues.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @msokolov, this is an awesome progress-not-perfection start to adding ANN to Lucene!
int versionMeta = -1; | ||
long checksum = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm is this unused?
@Override | ||
public TopDocs search(float[] vector, int topK, int fanout) throws IOException { | ||
// use a seed that is fixed for the index so we get reproducible results for the same query | ||
final Random random = new Random(checksumSeed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever seed!
Hi,
This line is wrong:
(should be <h2> )
|
Thanks Uwe, I'll push a fix soon. True enough, I am building with Java 11, which is not as fussy about such things. Indeed, I'm used to using arbitrary heading levels to achieve different presentation, but I guess that's not OK! |
fix javadoc Harden TestVectorValues.testAddIndexesDirectory01 Resolve merge conflict Fix random TestVectorValues failures by use of forceMerge (and getOnlyLeafReader) LUCENE-9322: Make sure to account for vectors in SortingCodecReader. (apache#2028) LUCENE-9583: extract separate RandomAccessVectorValues interface (apache#2037) LUCENE-9322: fix minor cosmetic refactoring error in logging string in IndexWriter's infoStream logging. It was always printing 'vector values' for all merging times instead of the other parts of Lucene index ('doc values', 'stored fields', etc.) LUCENE-9322: Some fixes to SimpleTextVectorFormat. (apache#2071) * Make sure the file extensions are unique. * Fix bug in vector reading. LUCENE-9004: KNN vector search using NSW graphs (apache#2022) LUCENE-9610: fix TestKnnGraph.testMerge LUCENE-9610: fix bug in previous test fix Make Backward compatible
Phew this has been a long time coming, but I think it is in good shape now. We started with a scratchy prototype about a year ago, then @mocobeta got it on a better footing by adding a new codec and also implemented the full hierarchical algorithm, making the graph search faithful to the published literature. Then we took a step back to add underlying vector format as a separate patch, now landed. This patch builds on the new vector format, providing KNN search with NSW graphs. It's the simplest implementation I could tease out (single layer graph, simple neighbor selection, no max fanout control), but I think it will be a good foundation. I've done some pretty extensive performance testing and hyperparameter exploration using the (included) KnnGraphTester with some proprietary data, and get good results. I will follow up later with specifics, but single-threaded latencies in a few ms on my i7 laptop over a 1M x 256-dim dataset seems pretty good. Followups will include repeatable benchmarks on public datasets.