Add BitVectors format and make flat vectors format easier to extend #13288

benwtrent · 2024-04-10T14:01:39Z

Instead of making a separate thing pluggable inside of the FieldFormat, this instead keeps the vector similarities as they are, but allows a custom scorer to be provided to the FlatVector storage used by HNSW.

This idea is akin to the compression extensions we have, but in this case it's for vector scorers.

To show how this would work in practice, I took the liberty of adding a new HnswBitVectorsFormat in the sandbox module.

A larger part of the change is a refactor of RandomAccessVectorValues<T> to remove the <T>. Nothing actually uses the type parameter any longer, and we should instead rely on well-defined classes and stop relying on casting with generics (yuck).
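To illustrate the shape of this refactor, here is a minimal sketch. All names in it (`VectorValuesSketch`, `OnHeapFloats`) are hypothetical; it only mirrors the `RandomAccessVectorValues.Floats` naming visible in the diff further down and is not Lucene's actual interface.

```java
// Hypothetical sketch, not Lucene's API: typed sub-interfaces replace a generic <T>.
interface VectorValuesSketch {
  int size();

  int dimension();

  /** Float-valued view; callers receive float[] without any cast. */
  interface Floats extends VectorValuesSketch {
    float[] vectorValue(int ord);
  }

  /** Byte-valued view; callers receive byte[] without any cast. */
  interface Bytes extends VectorValuesSketch {
    byte[] vectorValue(int ord);
  }
}

/** Simple on-heap float implementation, used only for this example. */
final class OnHeapFloats implements VectorValuesSketch.Floats {
  private final float[][] vectors;

  OnHeapFloats(float[][] vectors) {
    this.vectors = vectors;
  }

  @Override
  public int size() {
    return vectors.length;
  }

  @Override
  public int dimension() {
    return vectors.length == 0 ? 0 : vectors[0].length;
  }

  @Override
  public float[] vectorValue(int ord) {
    return vectors[ord];
  }
}
```

With this shape, code that needs floats asks for `VectorValuesSketch.Floats` in its signature, so the element type is checked by the compiler rather than by a cast at the call site.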

jpountz

I only had a quick look but I like the idea. (I also like that you removed the generics!)

lucene/core/src/java/org/apache/lucene/codecs/FlatVectorsScorer.java

lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java

lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/HnswBitVectorsFormat.java

ChrisHegarty

I like this, and will do a more detailed review once I get it into my IDE.

lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModel.java

jimczi

I like this approach, it isolates the customisation and extensibility to a specific case (the flat format). We have some cleanup to do with all the random vector scorers and suppliers, but this is a step forward in terms of simplification, thanks @benwtrent

lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapRandomAccessVectorValues.java

...sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/OnHeapFlatBitVectorsScorer.java

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java

lucene/core/src/java/org/apache/lucene/codecs/hnsw/OnHeapFlatVectorScorer.java

jimczi · 2024-04-10T22:04:02Z

...ard-codecs/src/java/org/apache/lucene/backward_codecs/lucene92/OffHeapFloatVectorValues.java

@@ -28,7 +28,7 @@

 /** Read the vector values from the index input. This supports both iterated and random access. */
 abstract class OffHeapFloatVectorValues extends FloatVectorValues
-    implements RandomAccessVectorValues<float[]> {
+    implements RandomAccessVectorValues.Floats {


Not for this PR, but I would like to try splitting FloatVectorValues and RandomAccessVectorValues.Floats. Having a single hierarchy that mixes the access patterns is not ideal. With the FlatVectorsFormat in the mix, we should be able to produce RandomAccessVectorValues and FloatVectorValues independently. This change should help with that simplification :)

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java

lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/HnswBitVectorsFormat.java

jpountz

My main suggestion is to move the new format to lucene/codecs. Otherwise LGTM.

lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java

lucene/core/src/java/module-info.java

lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatVectorScorer.java

lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatBitVectorsScorer.java

lucene/core/src/java/org/apache/lucene/codecs/hnsw/ScalarQuantizedVectorScorer.java

lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java

tteofili

+1 on moving the new format to codecs, other than that LGTM.

lucene/core/src/java/org/apache/lucene/codecs/hnsw/package-info.java

ChrisHegarty

LGTM.

benwtrent · 2024-04-12T16:30:15Z

Hey @uschindler I didn't want to move forward on merging without your thoughts. This is a separate idea from: #13200

This change is more in line with what we do with custom compression functions for other formats. It continues to rely on the "default formats" using the similarity, which is still an enumeration.

However, this allows a custom set of scorers to be provided by a custom codec. The first example of this is the bit vector codec.
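A rough sketch of what "a custom set of scorers provided by a custom codec" could look like is shown below. Every name in it (`OrdScorer`, `ByteScorerFactory`, `FlatByteStorage`, `ScorerFactories`) is hypothetical and chosen for illustration; none of them are the classes added in this PR.

```java
// Hypothetical sketch: illustrative names only, not Lucene classes.

/** Scores a stored vector (addressed by ordinal) against a fixed query. */
interface OrdScorer {
  float score(int ord);
}

/** Pluggable factory: the storage asks it for a scorer bound to a query. */
interface ByteScorerFactory {
  OrdScorer scorerFor(byte[][] stored, byte[] query);
}

/** Flat storage that delegates all scoring to the injected factory. */
final class FlatByteStorage {
  private final byte[][] stored;
  private final ByteScorerFactory factory;

  FlatByteStorage(byte[][] stored, ByteScorerFactory factory) {
    this.stored = stored;
    this.factory = factory;
  }

  /** Returns the ordinal of the best-scoring stored vector, or -1 if empty. */
  int nearest(byte[] query) {
    OrdScorer scorer = factory.scorerFor(stored, query);
    int best = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    for (int ord = 0; ord < stored.length; ord++) {
      float s = scorer.score(ord);
      if (s > bestScore) {
        bestScore = s;
        best = ord;
      }
    }
    return best;
  }
}

final class ScorerFactories {
  /** Dot product over signed bytes. */
  static final ByteScorerFactory DOT = (stored, query) -> ord -> {
    int sum = 0;
    for (int i = 0; i < query.length; i++) {
      sum += stored[ord][i] * query[i];
    }
    return sum;
  };

  /** Bit-level scorer: fewer differing bits means a higher score. */
  static final ByteScorerFactory HAMMING = (stored, query) -> ord -> {
    int diff = 0;
    for (int i = 0; i < query.length; i++) {
      diff += Integer.bitCount((stored[ord][i] ^ query[i]) & 0xFF);
    }
    return -diff;
  };
}
```

The same stored bytes can rank differently depending on which factory is injected, which is the point of keeping the storage fixed and making only the scorer pluggable.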

While working on #13200, it just kept looking more and more like a backwards compatibility nightmare and I just couldn't figure out a good interface with formats (like Scalar quantization) that need to know the exact similarity kind.

jimczi

@benwtrent I wish we could avoid introducing another hierarchy of vector scorers (FlatVectorScorer) and instead reuse the original RandomVectorScorer(Supplier). We have too many overlapping concepts IMO, so I tried to simplify here:
jimczi@cd7d6bf
The proposed simplification is to use the RandomVectorScorerSupplier consistently in the HNSW graph and in the flat vectors codec for customisation.
The change is built on top of this PR, let me know what you think.

benwtrent · 2024-04-12T18:05:31Z

@jimczi I do like the further simplification. I can see about pulling in some of your ideas.

uschindler · 2024-04-12T22:17:40Z

> Hey @uschindler I didn't want to move forward on merging without your thoughts. This is a separate idea from: #13200

Will check tomorrow.

ChrisHegarty · 2024-04-15T19:48:32Z

I would like to suggest that we reintroduce getSlice. The getSlice method is critical to any serious implementation that wants to take things into its own hands. It allows storing and retrieving additional per-vector metadata, as the current int8 SQ does with its per-vector float offset values. The interfaces here are "expert", so I see no issue with exposing getSlice. While it is not a requirement of this work, I would expect it to be possible to rewrite the existing int8 SQ atop this interface, which is a good reason to reintroduce getSlice. (I also eventually want to move towards direct off-heap access, but that is orthogonal.)
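As a sketch of why raw slice access matters for per-vector metadata, the example below reads quantized bytes followed by one float per record. It is an assumption-laden illustration: a `java.nio.ByteBuffer` stands in for the raw slice, the record layout is invented for this example, and `QuantizedSliceReader` is a hypothetical name, not Lucene's `IndexInput`-based API.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a ByteBuffer stands in for the raw slice; not Lucene's API.
final class QuantizedSliceReader {
  private final ByteBuffer slice;
  private final int dimension;
  private final int recordBytes; // quantized bytes followed by one float correction

  QuantizedSliceReader(ByteBuffer slice, int dimension) {
    this.slice = slice;
    this.dimension = dimension;
    this.recordBytes = dimension + Float.BYTES;
  }

  /** Reads the quantized bytes of the vector at the given ordinal. */
  byte[] vector(int ord) {
    byte[] out = new byte[dimension];
    ByteBuffer view = slice.duplicate(); // independent position, shared content
    view.position(ord * recordBytes);
    view.get(out);
    return out;
  }

  /** Reads the per-vector float stored right after the quantized bytes. */
  float offsetCorrection(int ord) {
    return slice.getFloat(ord * recordBytes + dimension);
  }

  /** Writes records in the same layout; used here only to build example data. */
  static ByteBuffer write(byte[][] vectors, float[] corrections) {
    int dim = vectors[0].length;
    ByteBuffer buf = ByteBuffer.allocate(vectors.length * (dim + Float.BYTES));
    for (int i = 0; i < vectors.length; i++) {
      buf.put(vectors[i]);
      buf.putFloat(corrections[i]);
    }
    buf.flip();
    return buf;
  }
}
```

An accessor that only returns the vector bytes would hide the trailing float; exposing the slice lets a custom scorer decide its own record layout.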

benwtrent · 2024-04-15T20:07:58Z

@jimczi OK, I read a bit more of your suggestion.

I am not a huge fan of how every scorer can now just get a "queryOrdinal" and overwrite whatever query was passed to it.

Some of the code reduction you did does seem nice, but I am not sure I like the API. I would need to fully flesh it out to see it in action.

jimczi · 2024-04-16T17:05:59Z

> I am not a huge fan of how every scorer can now just get a "queryOrdinal" and overwrite whatever query was passed to it.

Yep, that's tricky. I couldn't find a better way, since my goal was to avoid having three levels of vector scorers (FlatVectorScorer -> RandomVectorSupplier -> RandomVectorScorer). I'd still argue that this way of exposing things is more straightforward and reduces the amount of code that relies on generic interfaces that need to be cast. This setQueryOrd is only for the builder case though, so it is completely internal and not something a custom scorer should have to worry about.

navneet1v · 2024-05-26T04:52:59Z

@benwtrent I see that this PR made the flat vectors format easier to extend, and that you demonstrated it with BitVectorsFormat as an example.

Does this mean Lucene now officially supports BitVectorsFormat? Or was it more of a prototype, not intended for production use?
Another reason I am asking is that I cannot find Hamming distance in the VectorSimilarityFunction enum. So if the bit vector format is meant for production use, which VectorSimilarityFunction should be used for bit vectors? Ref:

lucene/lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java

Line 29 in b247afe

public enum VectorSimilarityFunction {

If a user overrides the scorer for the flat vectors format, does this mean the VectorSimilarityFunction is no longer a required attribute? If the answer is yes, are there plans to remove the VectorSimilarityFunction parameter when creating the vector field? Ref:

lucene/lucene/core/src/java/org/apache/lucene/document/FieldType.java

Lines 374 to 383 in b247afe

    
  public void setVectorAttributes(
      int numDimensions, VectorEncoding encoding, VectorSimilarityFunction similarity) {
    checkIfFrozen();
    if (numDimensions <= 0) {
      throw new IllegalArgumentException("vector numDimensions must be > 0; got " + numDimensions);
    }
    this.vectorDimension = numDimensions;
    this.vectorSimilarityFunction = Objects.requireNonNull(similarity);
    this.vectorEncoding = Objects.requireNonNull(encoding);
  }

If the format is not intended for production use, I would like to enhance the format. Please let me know your thoughts.

benwtrent · 2024-05-29T13:11:30Z

No, the BitVector format is not in the backwards compatible package.
Correct, there have been previous discussions about adding it as a similarity value, but those conversations are blocked until we come up with a better system. We don't want to add fully backwards-compatible similarities that our core formats must support until we have a path for deprecating the existing ones.
"If a user overriding the Scorer for the flatVectorsFormat..." This implies a custom vector format, so the user will handle that themselves. However, this doesn't obviate the need for configured similarities, as the default core (and fully bwc) formats still use them.

navneet1v · 2024-06-05T00:52:58Z

@benwtrent

I am a little confused here. I am still looking for an answer to this question: Does this mean Lucene now officially supports BitVectorsFormat? Or was it more of a prototype, not intended for production use?

Another place where I don't have clarity is: what is the point of VectorSimilarity functions in the case of the bit vectors format? I can set MAX_INNER_PRODUCT for bit vectors, but the codec will use Hamming distance for the similarity calculation. So the vector similarity set on a field is not the source of truth for which similarity function is actually used. Hence implementations have to come up with other ways to know which vector similarity function is in use.

benwtrent · 2024-06-05T13:34:43Z

@navneet1v

> Does this mean now Lucene supports BitVectorsFormat officially?

The answer is no.

> Or it was more a prototype and not intended for production use?

The answer is yes.

> what is the point of VectorSimilarity functions in case of bitvectors format.

Currently there is none. But I could see it being updated where cosine and dot-product aren't actually just hamming distance (as hamming is more akin to euclidean).
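The remark that Hamming is more akin to Euclidean can be checked directly: if each bit is treated as a 0/1 coordinate, squared Euclidean distance equals the number of differing bits. The sketch below uses hypothetical helper names (`BitDistanceSketch`) and is not the scorer shipped in this PR.

```java
// Sketch with hypothetical helper names; not the scorer added in this PR.
final class BitDistanceSketch {
  /** Number of differing bits between two equal-length packed bit vectors. */
  static int hamming(byte[] a, byte[] b) {
    int dist = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension of negative bytes is not counted.
      dist += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return dist;
  }

  /** Squared Euclidean distance after unpacking each bit to a 0/1 coordinate. */
  static int squaredEuclideanOnBits(byte[] a, byte[] b) {
    int dist = 0;
    for (int i = 0; i < a.length; i++) {
      for (int bit = 0; bit < 8; bit++) {
        int x = (a[i] >> bit) & 1;
        int y = (b[i] >> bit) & 1;
        dist += (x - y) * (x - y);
      }
    }
    return dist;
  }
}
```

Both methods return the same value for any pair of equal-length inputs, since `(x - y) * (x - y)` is 1 exactly when the two bits differ.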

> So it means the vector similarity set on a field is not the source of truth for which vector similarity function is used.

For the default and core codecs, keeping a nice separation so that users don't have to know about the codec and trusting it is doing the right thing is important.

Using the similarity in Field Info allows users to have a pick of some default supported vector similarity functions without futzing around with codecs (which is complicated for normal Lucene users). It is important for ease of use.

As for the format summarily ignoring the input, this could always be done. The format stores, reads, scores, etc. any way it wants. If an advanced user chooses a custom format that ignores the similarity applied to the field, it's their prerogative.

For example, it's conceivable that a format could ignore cosine altogether and always normalize, store the magnitude, and always do dot-product.
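The normalize-then-dot-product idea rests on the identity cosine(a, b) = dot(a/|a|, b/|b|). A minimal sketch, with hypothetical names (`NormalizeSketch`) and no claim about how any Lucene format actually stores its data:

```java
// Sketch with hypothetical names: cosine(a, b) equals dot(normalize(a), normalize(b)).
final class NormalizeSketch {
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Euclidean length; a format could store this alongside the unit vector. */
  static float magnitude(float[] v) {
    return (float) Math.sqrt(dot(v, v));
  }

  /** Returns a unit-length copy of v (assumes v is not the zero vector). */
  static float[] normalize(float[] v) {
    float mag = magnitude(v);
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[i] / mag;
    }
    return out;
  }

  static float cosine(float[] a, float[] b) {
    return dot(a, b) / (magnitude(a) * magnitude(b));
  }
}
```

Because the magnitude is kept separately, the original vector can be recovered by scaling the unit vector, while scoring only ever needs a dot product.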

I do not think the bit-vector format necessitates a different contract between vector similarities and formats.

Add BitVectors format and make flat vectors format easier to extend

4a368ef

benwtrent added this to the 9.11.0 milestone Apr 10, 2024

benwtrent requested a review from jimczi April 10, 2024 14:01

benwtrent mentioned this pull request Apr 10, 2024

Add new pluggable vector similarity to field info #13200

Closed

jpountz reviewed Apr 10, 2024

ChrisHegarty reviewed Apr 10, 2024

moving flat vector stuffs into a new hnsw package to indicate its usage

b5d76dd

jimczi reviewed Apr 11, 2024

benwtrent added 4 commits April 11, 2024 10:29

Moving and renaming things, making things simpler

9095006

Merge remote-tracking branch 'upstream/main' into feature/more-extensible-flat-vector-storage

407dc08

spotless

8df6fb7

Removing unused interface

dce5737

jpountz approved these changes Apr 12, 2024

tteofili approved these changes Apr 12, 2024

benwtrent added 2 commits April 12, 2024 09:05

addressing PR comments

7a85251

Merge remote-tracking branch 'upstream/main' into feature/more-extensible-flat-vector-storage

57847f5

jpountz approved these changes Apr 12, 2024

tteofili approved these changes Apr 12, 2024

ChrisHegarty approved these changes Apr 12, 2024

jimczi reviewed Apr 12, 2024

Merge remote-tracking branch 'upstream/main' into feature/more-extensible-flat-vector-storage

dbddaf4

benwtrent added 2 commits April 16, 2024 14:28

applying some of jims suggestions

36bbe96

Merge remote-tracking branch 'upstream/main' into feature/more-extensible-flat-vector-storage

bfe8c2e

benwtrent merged commit 3d86ff2 into apache:main Apr 17, 2024
4 checks passed

benwtrent deleted the feature/more-extensible-flat-vector-storage branch April 17, 2024 17:13

This was referenced Apr 22, 2024

Making vector comparisons pluggable #13182

Closed

Adding binary Hamming distance as similarity option for byte vectors #13076

Closed
