
LUCENE-10577: enable quantization of HNSW vectors to 8 bits #947

Closed
wants to merge 14 commits

Conversation

@msokolov (Contributor) commented Jun 7, 2022:

This is PR #3 for this feature. It is very close to the previous one, just "rebased" on top of the Lucene93 Codec. In this PR I moved the new vector utility methods to be package-private in util.hnsw so they would be easier to change in the future. I did not attempt any loop-unrolling optimizations; I have tried some incubating vector API implementations, but nothing is ready to share. I re-ran the luceneutil tests, and will open a separate PR to add support for this format to luceneutil. Results continue to look promising:

| Task | QPS baseline | StdDev | QPS candidate | StdDev | Pct diff | p-value |
|------|-------------:|-------:|--------------:|-------:|:---------|--------:|
| PKLookup | 134.03 | 19.3% | 135.28 | 15.9% | 0.9% (-28% to 44%) | 0.868 |
| LowTermVector | 637.48 | 9.0% | 695.72 | 10.7% | 9.1% (-9% to 31%) | 0.003 |
| AndHighMedVector | 445.66 | 8.0% | 559.35 | 11.8% | 25.5% (5% to 49%) | 0.000 |
| HighTermVector | 578.16 | 9.5% | 737.93 | 16.2% | 27.6% (1% to 59%) | 0.000 |
| AndHighHighVector | 418.25 | 9.8% | 555.38 | 15.2% | 32.8% (7% to 64%) | 0.000 |
| MedTermVector | 383.02 | 8.4% | 550.12 | 13.5% | 43.6% (20% to 71%) | 0.000 |
| AndHighLowVector | 450.92 | 9.5% | 713.77 | 20.1% | 58.3% (26% to 97%) | 0.000 |

@@ -129,6 +128,12 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
try {
// write the vector data to a temporary file
DocsWithFieldSet docsWithField = writeVectorData(tempVectorData, vectors);
+ int byteSize;
@msokolov (Author):

Oops, some left-over change that came in while rebasing; I'll remove it.

@msokolov (Contributor) commented Jun 8, 2022:

If folks have time to review, that would be great. The main things to focus on, I think, are:

  1. how the HnswGraphSearcher/Writer implementation is forked into float[] and BytesRef variants using generics (see the sketch after this list)
  2. the addition of a VERSION to the index format so we can handle back-compat in a reasonable way
  3. the fact that some of the details of VectorSimilarityFunction are now handled directly in util.hnsw -- I didn't see a better way, but it blurs the abstraction boundary
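To illustrate item 1, here is a rough sketch of the generics split (class and method names are hypothetical, not the PR's actual code): one shared traversal, with the encoding-specific similarity supplied by the type parameter.

```java
import org.apache.lucene.util.BytesRef;

// Hypothetical sketch: the graph walk is written once against GraphSearcher<T>,
// and only the similarity computation differs per vector encoding.
abstract class GraphSearcher<T> {

  // Encoding-specific similarity; the shared traversal calls this while
  // walking neighbor lists.
  protected abstract float score(T query, T stored);

  static final class FloatSearcher extends GraphSearcher<float[]> {
    @Override
    protected float score(float[] query, float[] stored) {
      float dot = 0;
      for (int i = 0; i < query.length; i++) {
        dot += query[i] * stored[i];
      }
      return dot;
    }
  }

  static final class ByteSearcher extends GraphSearcher<BytesRef> {
    @Override
    protected float score(BytesRef query, BytesRef stored) {
      int dot = 0;
      for (int i = 0; i < query.length; i++) {
        dot += query.bytes[query.offset + i] * stored.bytes[stored.offset + i];
      }
      return dot;
    }
  }
}
```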

@@ -41,11 +42,11 @@ abstract class OffHeapVectorValues extends VectorValues
protected final int byteSize;
protected final float[] value;

- OffHeapVectorValues(int dimension, int size, IndexInput slice) {
+ OffHeapVectorValues(int dimension, int size, IndexInput slice, int byteSize) {
Contributor comment:

For the DOT_PRODUCT8 case, should the vectorValue(..) functions use slice.readBytes instead of slice.readFloats and then convert the bytes to floats? Otherwise it looks to me like we are reading the wrong values.
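A minimal sketch of what this suggests, assuming the slice, byteSize, value, and dimension fields from the class above (the method name and exact layout are illustrative):

```java
// Read one byte per component and widen to float, instead of calling
// readFloats() on byte-encoded data.
public float[] vectorValue(int targetOrd) throws IOException {
  slice.seek((long) targetOrd * byteSize); // byteSize == dimension for BYTE
  byte[] bytes = new byte[dimension];
  slice.readBytes(bytes, 0, dimension);    // raw signed bytes
  for (int i = 0; i < dimension; i++) {
    value[i] = bytes[i];                   // widen signed byte -> float
  }
  return value;
}
```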

@mayya-sharipova (Contributor) commented:
@msokolov Thanks for your work, this is a very exciting feature, and I'm really looking forward to it. An extra benefit: less disk space used for vector values.

Overall, this PR looks very good to me. I just have several questions:

  • During merging, when writing a merged vector field, it looks like we first expand vector values only to compress them again later. It would be nice to avoid this.
  • Not sure if it's possible at this stage, but it would be nice if HnswGraphBuilder and HnswGraphSearcher were not aware of the different calculations needed for different similarity functions, and instead deferred all these calculations (dotProduct(BytesRef a..)) to VectorSimilarityFunction.

@msokolov (Author) commented:
I'm working to address the various comments; I just pushed a commit that makes the vector encoding explicit by adding a new enum and parameter, "vectorEncoding", splitting this out from "similarityFunction".

> During merging, when writing a merged vector field, it looks like we first expand vector values only to compress them again later. It would be nice to avoid this.

Oh good catch, @mayya-sharipova I will look into addressing this.

> Not sure if it's possible at this stage, but it would be nice if HnswGraphBuilder and HnswGraphSearcher were not aware of the different calculations needed for different similarity functions, and instead deferred all these calculations (dotProduct(BytesRef a..)) to VectorSimilarityFunction.

I don't see how to do this efficiently (without many conversions from byte to float), neatly (without code duplication in tricky algorithmic areas), and with complete API purity, so I sacrificed some purity. If you have ideas for doing it better, though, I'm open to changing it.

@msokolov (Author) commented:

Also, if anybody has advice about how to rebase while maintaining this PR, I'd be interested. Should I git merge from main?

@msokolov (Author) commented:

> During merging, when writing a merged vector field, it looks like we first expand vector values only to compress them again later. It would be nice to avoid this.

In fact, after checking, I don't think we are doing this expand/compress step, even though getVectorValues() returns an ExpandingVectorValues to the merger. This is because the merger uses the binaryValue() call to write the vectors themselves, and that value is unchanged by ExpandingVectorValues; the graph is created by the (now) polymorphic hnsw utilities, which also call binaryValue() when they are dealing with a field encoded as bytes.
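To illustrate the pass-through (a hypothetical, simplified class, not the PR's actual ExpandingVectorValues):

```java
import org.apache.lucene.util.BytesRef;

// The expanding view only converts in vectorValue(); binaryValue() returns
// the stored bytes untouched, so the merger copies raw bytes with no
// expand/compress round-trip.
class ExpandingView {
  private final BytesRef stored;   // raw byte-encoded vector
  private final float[] expanded;  // scratch array for float consumers

  ExpandingView(BytesRef stored) {
    this.stored = stored;
    this.expanded = new float[stored.length];
  }

  float[] vectorValue() {
    for (int i = 0; i < stored.length; i++) {
      expanded[i] = stored.bytes[stored.offset + i]; // widen byte -> float
    }
    return expanded;
  }

  BytesRef binaryValue() {
    return stored; // pass-through: the merger writes these bytes verbatim
  }
}
```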

@msokolov (Author) commented:

OK, this last round of commits moves the new vector-encoding parameter out of IndexableField and FieldInfo and into the Codec constructor, and internally to the codec in FieldEntry. It certainly has a smaller visible surface area now. I also merged from main and resolved a bunch of conflicts with the scoring change. I think it is correct (all the unit tests pass), but it wasn't trivial, and I think it would be worth running some integration/performance tests just to make sure all is still well.

There's a little bit of code duplication in HnswGraphSearcher, where we now have the logic for switching from approximate to exact kNN in two places, which I don't like. Maybe that can be factored better?

@msokolov (Author) commented:

This last commit moves "exact" kNN search from KnnVectorQuery to KnnVectorsReader. I think it is the right thing to do, since it lets us optimize the exact kNN and return correct scores, but it does add a new method to the reader API. I called it "searchExhaustively" since I wanted to convey that it scans the whole index (or at least all the documents matching acceptDocs); we don't want to steer users toward a name like "searchExact", since of course exact must be better than approximate? I am open to better names, though - it is kind of long.
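Roughly, the shape of such a method is something like this (a simplified sketch, not the exact patch; signatures and queue handling are illustrative):

```java
import java.io.IOException;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.index.VectorValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.HitQueue;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

// Sketch: leapfrog the accepted docs against the vector values, score each
// match against the query, and keep the best k hits in a bounded queue.
static TopDocs searchExhaustively(
    VectorValues vectorValues,
    DocIdSetIterator acceptDocs,
    VectorSimilarityFunction similarity,
    float[] query,
    int k)
    throws IOException {
  HitQueue queue = new HitQueue(k, false); // min-heap holding the top k hits
  int doc = acceptDocs.nextDoc();
  int vDoc = vectorValues.nextDoc();
  while (doc != DocIdSetIterator.NO_MORE_DOCS
      && vDoc != DocIdSetIterator.NO_MORE_DOCS) {
    if (doc < vDoc) {
      doc = acceptDocs.advance(vDoc);      // catch acceptDocs up
    } else if (doc > vDoc) {
      vDoc = vectorValues.advance(doc);    // catch vectorValues up
    } else {
      float score = similarity.compare(query, vectorValues.vectorValue());
      queue.insertWithOverflow(new ScoreDoc(doc, score));
      doc = acceptDocs.nextDoc();
      vDoc = vectorValues.nextDoc();
    }
  }
  ScoreDoc[] hits = new ScoreDoc[queue.size()];
  for (int i = hits.length - 1; i >= 0; i--) {
    hits[i] = queue.pop(); // pop ascending, fill from the back: best first
  }
  return new TopDocs(new TotalHits(hits.length, TotalHits.Relation.EQUAL_TO), hits);
}
```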

@msokolov (Author) commented:

I pushed an updated luceneutil PR adapting to these changes: mikemccand/luceneutil#181. Running that perf test I saw consistent gains (20-55%, depending on the test case) compared to the earlier test runs.

I also noticed that the profiler shows the most expensive function during indexing is FixedBitSet.clear(), which makes me think we might want to use sparse bitsets for the "upper" layers of the graph, which have many fewer nodes.
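For example (a hypothetical helper, just to illustrate the idea): level 0 still wants a dense FixedBitSet, but upper levels, with exponentially fewer nodes, could use Lucene's SparseFixedBitSet to avoid clearing a large dense array on every search.

```java
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.SparseFixedBitSet;

// Hypothetical: choose the visited-set representation per graph level.
// Upper HNSW levels hold far fewer nodes, so a sparse bitset avoids the
// O(numNodes) cost of clearing a dense bitset between searches.
static BitSet createVisitedSet(int level, int numNodes) {
  return level == 0 ? new FixedBitSet(numNodes) : new SparseFixedBitSet(numNodes);
}
```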

@rmuir (Member) commented Jul 15, 2022:
I think the title of the PR is wrong? We shouldn't be quantizing anything. The user should be supplying a byte[] vector for 8-bit vectors. Floats should not be involved.

switch (fieldEntry.vectorEncoding) {
case BYTE -> numBytes = (long) fieldEntry.size() * dimension;
case FLOAT32 -> numBytes = (long) fieldEntry.size() * dimension * Float.BYTES;
default -> throw new AssertionError("unknown vector encoding " + fieldEntry.vectorEncoding);
Contributor comment:

Should we also update the error message that follows for the BYTE case?

total += a.bytes[aOffset++] * b.bytes[bOffset++];
}
// divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
return (1 + total) / (float) (len * (1 << 15));
Contributor comment:

To make scores non-negative, should we instead do: total / (float) (len * (1 << 15)) + 1?
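A worked check of the scaling, under the assumption that components are signed bytes in [-128, 127]: the largest absolute per-component product is 128 * 128 = 2^14, so total lies in [-len * 2^14, len * 2^14] and total / (len * 2^15) lies in [-0.5, 0.5]. Adding 1 inside the numerator, as in (1 + total) / (len * (1 << 15)), shifts the result by only a negligible 1 / (len * 2^15) and can still be negative, while total / (len * (1 << 15)) + 1 lands in [0.5, 1.5], which is always non-negative.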

protected static TopDocs exhaustiveSearch(
VectorValues vectorValues,
DocIdSetIterator acceptDocs,
VectorSimilarityFunction similarityFunction,
Contributor comment:

It looks like similarityFunction is not necessary here, as we always use dotProductScore.

@msokolov msokolov closed this Jul 30, 2022
@msokolov msokolov deleted the LUCENE-10577 branch July 30, 2022 22:38