Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FieldInfosFormat translation should be independent of VectorSimilartyFunction enum #13119

Merged
merged 7 commits into from Feb 22, 2024

Conversation

ChrisHegarty
Copy link
Contributor

This commit updates the FieldInfosFormat translation of vector similarity functions to be independent of the VectorSimilartyFunction enum.

The VectorSimilartyFunction enum lives outside of the codec format, and the format should not inadvertently depend upon the declaration order or values in VectorSimilartyFunction. The format should be in charge of the translation of similarity function to format ordinal (and visa versa). In reality, and for now, the translation remains the same as the declaration order, but this may not be the case in the future.

Note: did we introduce a potential index corruption issue when adding maximum inner product in 9.8.0? since the format was not updated when the enum value was added - the ordinal for maximum inner product is unknown to Lucene 9.7.0, which uses the same format.

@ChrisHegarty
Copy link
Contributor Author

ChrisHegarty commented Feb 20, 2024

This PR is a prerequisite for future work to make the similarity function in the format symbolic and lookup-able, see #13076 (comment).

Copy link
Contributor

@tteofili tteofili left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index format wise, I think the index corruption can occur when reading a Lucene 9.8.0 index with Lucene 9.7.0, as the format would allow that, but I am not sure this is an expected scenario.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
as stated in the other issue: I am not really happy to have that enum at all! The similarity/distance functions should be pluggable using NamedSPILoader. To implement that, the ordinals must removed in a new file format version and instead names be written using the Codec utility classes.

As a first step this PR is fine as it does not change file format and just decouples the ordinals from the enum. In future, when we have SPI, we can use the current code of the ordinals

In my opinion, the strings as lookup keys are not needed: Just define it as List<VectorSimilarityFunction> to get the link between them. At a later stage the backwards layer could then fallback to the list with SPI instances to lookup the legacy ordinals. The coec and the enum are still enough decoupled.

@uschindler
Copy link
Contributor

Index format wise, I think the index corruption can occur when reading a Lucene 9.8.0 index with Lucene 9.7.0, as the format would allow that, but I am not sure this is an expected scenario.

This is perfectly fine.

@ChrisHegarty
Copy link
Contributor Author

Hi, as stated in the other issue: I am not really happy to have that enum at all! The similarity/distance functions should be pluggable using NamedSPILoader. To implement that, the ordinals must removed in a new file format version and instead names be written using the Codec utility classes.

Agreed on where we wanna get to. Just trying to get there incrementally, since format changes are quite noisy.

As a first step this PR is fine as it does not change file format and just decouples the ordinals from the enum. In future, when we have SPI, we can use the current code of the ordinals

Exactly, this is just a first step. It (for the most part) encapsulates the translation in the format. When we add a new format and/or evolve VectorSimilarityFunction, this format should be largely immune to the change.

In my opinion, the strings as lookup keys are not needed: Just define it as List<VectorSimilarityFunction> to get the link between them. At a later stage the backwards layer could then fallback to the list with SPI instances to lookup the legacy ordinals. The coec and the enum are still enough decoupled.

Yeah, that's probably good enough for now. Updated.

@ChrisHegarty
Copy link
Contributor Author

Index format wise, I think the index corruption can occur when reading a Lucene 9.8.0 index with Lucene 9.7.0, as the format would allow that, but I am not sure this is an expected scenario.

This is perfectly fine.

Ok cool. I was worried for nothing then.

@uschindler
Copy link
Contributor

Hi, as stated in the other issue: I am not really happy to have that enum at all! The similarity/distance functions should be pluggable using NamedSPILoader. To implement that, the ordinals must removed in a new file format version and instead names be written using the Codec utility classes.

Agreed on where we wanna get to. Just trying to get there incrementally, since format changes are quite noisy.

As a first step this PR is fine as it does not change file format and just decouples the ordinals from the enum. In future, when we have SPI, we can use the current code of the ordinals

Exactly, this is just a first step. It (for the most part) encapsulates the translation in the format. When we add a new format and/or evolve VectorSimilarityFunction, this format should be largely immune to the change.

In my opinion, the strings as lookup keys are not needed: Just define it as List<VectorSimilarityFunction> to get the link between them. At a later stage the backwards layer could then fallback to the list with SPI instances to lookup the legacy ordinals. The coec and the enum are still enough decoupled.

Yeah, that's probably good enough for now. Updated.

The comment is a bit outdated. I was thinking of making it even more vebose by having 2 maps for the lookup...

+1 looks fine.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on enum order was always trappy. Thanks for decoupling!

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you fix the comment and remove "names" from

maybe be sure to explicitly say: add new ones always at end of list :-)

@ChrisHegarty
Copy link
Contributor Author

I see now that we have a similar dependency in Lucene99HnswVectorsReader. I'll update in a similar way.

@ChrisHegarty
Copy link
Contributor Author

Thanks for the reviews. All comments have been addressed.

@ChrisHegarty ChrisHegarty merged commit 17cbedc into apache:main Feb 22, 2024
3 checks passed
ChrisHegarty added a commit that referenced this pull request Feb 22, 2024
…Function enum (#13119)

This commit updates the FieldInfosFormat translation of vector similarity functions to be independent of the VectorSimilartyFunction enum.

The VectorSimilartyFunction enum lives outside of the codec format, and the format should not inadvertently depend upon the declaration order or values in VectorSimilartyFunction. The format should be in charge of the translation of similarity function to format ordinal (and visa versa). In reality, and for now, the translation remains the same as the declaration order, but this may not be the case in the future.
@ChrisHegarty ChrisHegarty deleted the simFunc_fieldInfos_format branch February 22, 2024 17:29
benwtrent added a commit to benwtrent/lucene that referenced this pull request Feb 22, 2024
@msokolov
Copy link
Contributor

OK, this is really weird to me. For some reason, we are writing the dimension & similarity into the vector metdata but that information is retained in the field info already.

@benwtrent honestly don't remember, but I do know that early on we tried things a few different ways. There was some discussion about whether the similarity function and dimensions should be in the codec vs in the field info. I suspect it evolved and we did not end up removing the redundant version?

@benwtrent
Copy link
Member

@msokolov thanks for clarifying. I just wanted to make sure there wasn't an important reason that I missed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants