LUCENE-10577: enable quantization of HNSW vectors to 8 bits #947
Conversation
@@ -129,6 +128,12 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
    try {
      // write the vector data to a temporary file
      DocsWithFieldSet docsWithField = writeVectorData(tempVectorData, vectors);
      int byteSize;
Oops, some left-over change that came in while rebasing; I'll remove it.
If folks have time to review, that would be great. The main thing to focus on, I think, is
@@ -41,11 +42,11 @@ abstract class OffHeapVectorValues extends VectorValues
    protected final int byteSize;
    protected final float[] value;

-   OffHeapVectorValues(int dimension, int size, IndexInput slice) {
+   OffHeapVectorValues(int dimension, int size, IndexInput slice, int byteSize) {
For the DOT_PRODUCT8 case, should the vectorValue(..) functions use slice.readBytes instead of slice.readFloats, and then convert the bytes to floats? Otherwise, it looks to me like we are reading wrong values.
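A minimal, self-contained sketch of the suggested fix, using java.nio.ByteBuffer as a stand-in for IndexInput (that substitution is an assumption here; the real code would call slice.readBytes): read the raw signed bytes, then widen each one to float.

```java
import java.nio.ByteBuffer;

// Sketch only: ByteBuffer stands in for IndexInput. Read the raw signed
// bytes of one vector, then widen each byte to float.
class ByteVectorDecoder {
  static float[] decode(ByteBuffer slice, int dimension) {
    byte[] raw = new byte[dimension];
    slice.get(raw); // analogous to slice.readBytes(raw, 0, dimension)
    float[] value = new float[dimension];
    for (int i = 0; i < dimension; i++) {
      value[i] = raw[i]; // widening conversion preserves the signed value
    }
    return value;
  }
}
```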
@msokolov Thanks for your work, this is a very exciting feature. Really looking forward to it. Extra benefit – less disk space used for vector values. Overall, this PR looks very good to me. I just have several questions.
I'm looking to address various comments; just pushed a commit that makes the vector encoding explicit by adding a new enum and parameter "vectorEncoding", splitting this out from "similarityFunction".
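A hypothetical sketch of the kind of enum described above (the field name and constants here are assumptions, not the actual Lucene API): the encoding carries its own per-dimension byte size, kept separate from the similarity function.

```java
// Sketch: an encoding enum split out from the similarity function, so a
// field can declare how its vector components are stored on disk.
enum VectorEncoding {
  BYTE(1),    // one signed byte per dimension
  FLOAT32(4); // Float.BYTES per dimension

  final int byteSize;

  VectorEncoding(int byteSize) {
    this.byteSize = byteSize;
  }
}
```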
Oh good catch, @mayya-sharipova I will look into addressing this.
I don't see how to do this efficiently (without many conversions from byte to float) and neatly (without code duplication in tricky algorithmic areas) and with complete API purity, so I sacrificed some purity. If you have any ideas how to do it better, I'm open to changing it though.
Also, if anybody has advice about how to rebase while maintaining this PR, I'd be interested. Should I
In fact, after checking, I don't think we are doing this expand/compress step even though getVectorValues() returns
OK, this last round of commits moves the new vector encoding parameter out of IndexableField and FieldInfo into the Codec constructor and, internally to the codec, into FieldEntry. It certainly has less visible surface area now. I also merged from main and resolved a bunch of conflicts with the scoring change. I think it is correct (all the unit tests pass), but it wasn't trivial, and I think it would be worth running some integration/performance tests just to make sure all is still well. There's a little bit of code duplication in HnswGraphSearcher, where we now have the logic for switching from approximate to exact knn in two places, which I don't like. Maybe that can be factored better?
This last commit moves "exact" KNN search from
I pushed an updated luceneutil PR adapting to these changes: mikemccand/luceneutil#181. Running that perf test, I saw consistent gains (20-55% depending on the test case) as compared to the earlier test runs. I also noticed that the profiler shows the most expensive function during indexing is FixedBitSet.clear(), which makes me think we might want to use sparse bitsets for the "upper" layers of the graph, which have many fewer nodes.
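To illustrate the sparse-visited-set idea (this is a hypothetical stand-in, not code from this PR): a hash-backed visited set costs time proportional to the nodes actually visited when reset between searches, instead of the O(#nodes) cost of FixedBitSet.clear() seen in the profile, which should pay off on upper graph layers that hold far fewer nodes.

```java
import java.util.HashSet;

// Sketch: a sparse visited set for graph search. Clearing is proportional
// to the number of nodes visited, not to the total node count.
class SparseVisitedSet {
  private final HashSet<Integer> visited = new HashSet<>();

  /** Returns true if the node was already visited; marks it visited either way. */
  boolean getAndSet(int node) {
    return !visited.add(node);
  }

  void clear() {
    visited.clear(); // O(#visited), unlike a dense bitset's O(#nodes)
  }
}
```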
I think the title of the PR is wrong? We shouldn't be quantizing anything. The user should be supplying a
switch (fieldEntry.vectorEncoding) {
  case BYTE -> numBytes = (long) fieldEntry.size() * dimension;
  case FLOAT32 -> numBytes = (long) fieldEntry.size() * dimension * Float.BYTES;
  default -> throw new AssertionError("unknown vector encoding " + fieldEntry.vectorEncoding);
Should we also update the error message that follows for the BYTE case?
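One way the follow-up length check could be made encoding-aware (the names and message wording here are assumptions, sketched for illustration): compute the expected length from a per-dimension byte size, so the BYTE case no longer reports a FLOAT32-based expectation.

```java
// Sketch: validate the on-disk vector data length against the expected
// size for the active encoding, reporting bytes-per-dimension in the
// message so the BYTE case produces an accurate error.
class VectorDataCheck {
  static void check(long actualBytes, long size, int dimension, int bytesPerDim) {
    long expected = size * dimension * bytesPerDim;
    if (actualBytes != expected) {
      throw new IllegalStateException(
          "Vector data length " + actualBytes + " does not match size (" + size
              + ") * dim (" + dimension + ") * bytesPerDim (" + bytesPerDim
              + ") = " + expected);
    }
  }
}
```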
  total += a.bytes[aOffset++] * b.bytes[bOffset++];
}
// divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
return (1 + total) / (float) (len * (1 << 15));
To make scores non-negative, should we instead do: `total / (float) (len * (1 << 15)) + 1`?
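A self-contained sketch of that normalization (a stand-in, not the PR's code): each per-component product of two signed bytes lies in [-2^14, 2^14], so `total / (len * 2^15)` lies in [-0.5, 0.5], and adding 1 after the division shifts every score into the non-negative range [0.5, 1.5].

```java
// Sketch: dot-product similarity over signed-byte vectors, normalized into
// [0.5, 1.5] by dividing first and adding 1 afterwards.
class ByteDotProduct {
  static float dotProductScore(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i]; // each product is in [-2^14, 2^14]
    }
    // total / (len * 2^15) is in [-0.5, 0.5]; + 1 keeps scores non-negative
    return total / (float) (a.length * (1 << 15)) + 1;
  }
}
```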
protected static TopDocs exhaustiveSearch(
    VectorValues vectorValues,
    DocIdSetIterator acceptDocs,
    VectorSimilarityFunction similarityFunction,
Looks like similarityFunction is not necessary here, as we always use dotProductScore.
This is PR #3 for this feature. It is very close to the previous one, just "rebased" on top of the Lucene93 Codec. In this PR I moved the new vector utility methods to be package private in util.hnsw so they would be easier to change in the future. I did not attempt any loop-unrolling optimizations. I have tried some incubating vector API implementations, but nothing is ready to share. I re-ran the luceneutil tests, and will open a separate PR for adding support for this format to luceneutil. Results continue to look promising.