FieldInfosFormat translation should be independent of VectorSimilartyFunction enum #13119

ChrisHegarty · 2024-02-20T09:54:14Z

This commit updates the FieldInfosFormat translation of vector similarity functions to be independent of the VectorSimilarityFunction enum.

The VectorSimilarityFunction enum lives outside of the codec format, and the format should not inadvertently depend upon the declaration order or values in VectorSimilarityFunction. The format should be in charge of the translation of similarity function to format ordinal (and vice versa). In reality, and for now, the translation remains the same as the declaration order, but this may not be the case in the future.
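For illustration only (this is not the diff in this PR): a minimal sketch of the difference between relying on the enum's declaration order and letting the format own the mapping. `SimFunction` is a hypothetical stand-in for the real enum, and `FormatOrdinalSketch` and its method names are assumptions made up for this example.

```java
// Hypothetical stand-in for the similarity function enum (names and order assumed).
enum SimFunction { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT }

final class FormatOrdinalSketch {

  // Fragile: the byte written to disk changes silently if the enum is reordered.
  static byte fragileToOrdinal(SimFunction f) {
    return (byte) f.ordinal();
  }

  // Format-owned: each on-disk value is spelled out and does not depend on
  // the enum's declaration order.
  static byte toOrdinal(SimFunction f) {
    switch (f) {
      case EUCLIDEAN: return 0;
      case DOT_PRODUCT: return 1;
      case COSINE: return 2;
      case MAXIMUM_INNER_PRODUCT: return 3;
      default: throw new IllegalArgumentException("unknown function: " + f);
    }
  }

  // Inverse translation, also owned by the format.
  static SimFunction fromOrdinal(byte b) {
    switch (b) {
      case 0: return SimFunction.EUCLIDEAN;
      case 1: return SimFunction.DOT_PRODUCT;
      case 2: return SimFunction.COSINE;
      case 3: return SimFunction.MAXIMUM_INNER_PRODUCT;
      default: throw new IllegalArgumentException("invalid ordinal: " + b);
    }
  }

  public static void main(String[] args) {
    // Round-trip every function through its format ordinal.
    for (SimFunction f : SimFunction.values()) {
      byte ord = toOrdinal(f);
      System.out.println(f + " -> " + ord + " -> " + fromOrdinal(ord));
    }
  }
}
```

With the explicit mapping, reordering the enum constants would change what `fragileToOrdinal` writes but not what `toOrdinal` writes.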

Note: did we introduce a potential index corruption issue when adding maximum inner product in 9.8.0? The format was not updated when the enum value was added, so the ordinal for maximum inner product is unknown to Lucene 9.7.0, which uses the same format.

ChrisHegarty · 2024-02-20T09:59:41Z

This PR is a prerequisite for future work to make the similarity function in the format symbolic and lookup-able, see #13076 (comment).

lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94FieldInfosFormat.java

tteofili

Index format wise, I think the index corruption can occur when reading a Lucene 9.8.0 index with Lucene 9.7.0, as the format would allow that, but I am not sure this is an expected scenario.
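To make the scenario concrete, here is a hypothetical sketch (not actual Lucene 9.7.0 code) of an older reader that knows only three similarity functions and is handed an ordinal written by a newer writer. The class name, the string labels, and the exception type are assumptions for this example.

```java
final class OldReaderSketch {

  // Functions known to this hypothetical older reader, indexed by on-disk ordinal.
  static final String[] KNOWN = {"EUCLIDEAN", "DOT_PRODUCT", "COSINE"};

  // Translates an on-disk ordinal, rejecting values this reader has never heard of.
  static String read(byte ordinal) {
    if (ordinal < 0 || ordinal >= KNOWN.length) {
      throw new IllegalStateException("invalid similarity function ordinal: " + ordinal);
    }
    return KNOWN[ordinal];
  }

  public static void main(String[] args) {
    System.out.println(read((byte) 2)); // an ordinal the older reader knows
    try {
      read((byte) 3); // an ordinal only a newer writer would produce
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

In this sketch the older reader fails loudly on the unknown ordinal rather than misinterpreting it.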

lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94FieldInfosFormat.java

uschindler

Hi,
as stated in the other issue: I am not really happy to have that enum at all! The similarity/distance functions should be pluggable using NamedSPILoader. To implement that, the ordinals must be removed in a new file format version and names written instead, using the Codec utility classes.

As a first step this PR is fine, as it does not change the file format and just decouples the ordinals from the enum. In future, when we have SPI, we can use the current code of the ordinals.

In my opinion, the strings as lookup keys are not needed: just define it as List<VectorSimilarityFunction> to get the link between them. At a later stage the backwards layer could then fall back to the list with SPI instances to look up the legacy ordinals. The codec and the enum are still decoupled enough.
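A minimal sketch of the list-based idea, under assumptions: `SimilarityFn` is a hypothetical stand-in for the enum, and `ListOrdinalSketch`, `SIMILARITY_FUNCTIONS`, `toOrdinal`, and `fromOrdinal` are illustrative names, not necessarily what the PR uses. The position in the list is the on-disk ordinal, so new entries may only be appended.

```java
import java.util.List;

// Hypothetical stand-in for the similarity function enum (names and order assumed).
enum SimilarityFn { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT }

final class ListOrdinalSketch {

  // List position is the on-disk ordinal. Always add new functions at the
  // END of this list; never reorder or remove existing entries.
  static final List<SimilarityFn> SIMILARITY_FUNCTIONS =
      List.of(
          SimilarityFn.EUCLIDEAN,
          SimilarityFn.DOT_PRODUCT,
          SimilarityFn.COSINE,
          SimilarityFn.MAXIMUM_INNER_PRODUCT);

  // Function -> ordinal, via the function's position in the list.
  static byte toOrdinal(SimilarityFn f) {
    int i = SIMILARITY_FUNCTIONS.indexOf(f);
    if (i < 0) {
      throw new IllegalArgumentException("function not supported by this format: " + f);
    }
    return (byte) i;
  }

  // Ordinal -> function, with a range check for ordinals this format does not know.
  static SimilarityFn fromOrdinal(int ordinal) {
    if (ordinal < 0 || ordinal >= SIMILARITY_FUNCTIONS.size()) {
      throw new IllegalArgumentException("invalid ordinal: " + ordinal);
    }
    return SIMILARITY_FUNCTIONS.get(ordinal);
  }

  public static void main(String[] args) {
    for (SimilarityFn f : SIMILARITY_FUNCTIONS) {
      System.out.println(f + " <-> " + toOrdinal(f));
    }
  }
}
```

Compared with string keys or two maps, a single list keeps both directions of the translation in one place, at the cost of a linear `indexOf` scan that is negligible for a handful of entries.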

uschindler · 2024-02-20T12:24:25Z

> Index format wise, I think the index corruption can occur when reading a Lucene 9.8.0 index with Lucene 9.7.0, as the format would allow that, but I am not sure this is an expected scenario.

This is perfectly fine.

ChrisHegarty · 2024-02-20T12:40:16Z

> Hi, as stated in the other issue: I am not really happy to have that enum at all! The similarity/distance functions should be pluggable using NamedSPILoader. To implement that, the ordinals must be removed in a new file format version and names written instead, using the Codec utility classes.

Agreed on where we wanna get to. Just trying to get there incrementally, since format changes are quite noisy.

> As a first step this PR is fine, as it does not change the file format and just decouples the ordinals from the enum. In future, when we have SPI, we can use the current code of the ordinals.

Exactly, this is just a first step. It (for the most part) encapsulates the translation in the format. When we add a new format and/or evolve VectorSimilarityFunction, this format should be largely immune to the change.

> In my opinion, the strings as lookup keys are not needed: just define it as List<VectorSimilarityFunction> to get the link between them. At a later stage the backwards layer could then fall back to the list with SPI instances to look up the legacy ordinals. The codec and the enum are still decoupled enough.

Yeah, that's probably good enough for now. Updated.

ChrisHegarty · 2024-02-20T12:42:22Z

> > Index format wise, I think the index corruption can occur when reading a Lucene 9.8.0 index with Lucene 9.7.0, as the format would allow that, but I am not sure this is an expected scenario.

> This is perfectly fine.

Ok cool. I was worried for nothing then.

uschindler · 2024-02-20T12:56:07Z

> > Hi, as stated in the other issue: I am not really happy to have that enum at all! The similarity/distance functions should be pluggable using NamedSPILoader. To implement that, the ordinals must be removed in a new file format version and names written instead, using the Codec utility classes.

> Agreed on where we wanna get to. Just trying to get there incrementally, since format changes are quite noisy.

> > As a first step this PR is fine, as it does not change the file format and just decouples the ordinals from the enum. In future, when we have SPI, we can use the current code of the ordinals.

> Exactly, this is just a first step. It (for the most part) encapsulates the translation in the format. When we add a new format and/or evolve VectorSimilarityFunction, this format should be largely immune to the change.

> > In my opinion, the strings as lookup keys are not needed: just define it as List<VectorSimilarityFunction> to get the link between them. At a later stage the backwards layer could then fall back to the list with SPI instances to look up the legacy ordinals. The codec and the enum are still decoupled enough.

> Yeah, that's probably good enough for now. Updated.

The comment is a bit outdated. I was thinking of making it even more verbose by having 2 maps for the lookup...

+1 looks fine.

benwtrent

Depending on enum order was always trappy. Thanks for decoupling!

uschindler

if you fix the comment and remove "names" from

maybe be sure to explicitly say: add new ones always at end of list :-)

ChrisHegarty · 2024-02-20T14:01:31Z

I see now that we have a similar dependency in Lucene99HnswVectorsReader. I'll update in a similar way.

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java

ChrisHegarty · 2024-02-22T17:08:16Z

Thanks for the reviews. All comments have been addressed.

msokolov · 2024-02-25T23:09:53Z

> OK, this is really weird to me. For some reason, we are writing the dimension & similarity into the vector metadata, but that information is retained in the field info already.

@benwtrent I honestly don't remember, but I do know that early on we tried things a few different ways. There was some discussion about whether the similarity function and dimensions should be in the codec vs in the field info. I suspect it evolved and we did not end up removing the redundant version?

benwtrent · 2024-02-26T07:36:23Z

@msokolov thanks for clarifying. I just wanted to make sure there wasn't an important reason that I missed.

FieldInfosFormat translation should be independent of VectorSimilarit…

9c3f3c5

…yFunction enum

ChrisHegarty requested a review from uschindler February 20, 2024 10:01

ChrisHegarty requested a review from benwtrent February 20, 2024 10:46

tteofili reviewed Feb 20, 2024

uschindler reviewed Feb 20, 2024

review comments

f32c57e

remove names

e353a6b

benwtrent approved these changes Feb 20, 2024

uschindler approved these changes Feb 20, 2024

remove more names and add-to-end-of-list comment

bd7f777

Hnsw reader/writer

dd5424d

ChrisHegarty requested review from benwtrent and uschindler February 21, 2024 09:30

benwtrent approved these changes Feb 22, 2024

ChrisHegarty added 2 commits February 22, 2024 17:06

comment

e361d09

Merge branch 'main' into simFunc_fieldInfos_format

828371b

ChrisHegarty merged commit 17cbedc into apache:main Feb 22, 2024
3 checks passed

ChrisHegarty deleted the simFunc_fieldInfos_format branch February 22, 2024 17:29

benwtrent added a commit to benwtrent/lucene that referenced this pull request Feb 22, 2024

Fixing compilation after apache#13119 backport

e5058e7

iamsanjay mentioned this pull request Mar 16, 2024

SOLR-17164: Add 2 arg variant of vectorSimilarity() function apache/solr#2292

Closed

7 tasks

uschindler mentioned this pull request Apr 19, 2024

Deprecate COSINE VectorSimilarity function #13308

Open
