Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to compute vector similarity scores with DoubleValuesSource #12548

Merged
merged 3 commits into from
Oct 12, 2023

Conversation

shubhamvishu
Copy link
Contributor

@shubhamvishu shubhamvishu commented Sep 11, 2023

Description

This PR addresses the issue #12394. It adds an API similarityToQueryVector to DoubleValuesSource to compute vector similarity scores between the query vector and the KnnByteVectorField/KnnFloatVectorField for documents using the 2 new DVS implementations (ByteVectorSimilarityValuesSource for byte vectors and FloatVectorSimilarityValuesSource for float vectors). Below are the method signatures added to DVS in this PR:

  • DoubleValues similarityToQueryVector(LeafReaderContext ctx, float[] queryVector, String vectorField) (uses ByteVectorSimilarityValuesSource)
  • DoubleValues similarityToQueryVector(LeafReaderContext ctx, byte[] queryVector, String vectorField) (uses FloatVectorSimilarityValuesSource)

Closes #12394

@shubhamvishu shubhamvishu changed the title Add API to compute vector similarity in DoubleValuesSource Ability to compute vector similarity scores in DoubleValuesSource Sep 11, 2023
@shubhamvishu shubhamvishu changed the title Ability to compute vector similarity scores in DoubleValuesSource Ability to compute vector similarity scores with DoubleValuesSource Sep 11, 2023
*/
public static DoubleValues similarityToQueryVector(
LeafReaderContext ctx, byte[] queryVector, String vectorField) throws IOException {
assert ctx.reader().getFieldInfos().fieldInfo(vectorField).getVectorEncoding()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of throwing an exception even when assertions are not enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! But I wonder if thats the case in many places in codebase?. I think we should rather throw an IAE in this case.

* A {@link DoubleValuesSource} which computes the vector similarity scores between the query vector
* and the {@link org.apache.lucene.document.KnnFloatVectorField} for documents.
*/
class FloatVectorSimilarityValuesSource extends DoubleValuesSource {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and BytesVectorSimilarityValuesSource look very similar. Should they inherit from a VectorSimilarityValuesSource to ensure we can add common functionality in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some common bits. I have moved those to VectorSimilarityValuesSource now. Let me know if this looks more better or cleaner?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I think this is the right approach.

Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing the comments. The new version looks good!

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API looks OK to me.

@shubhamvishu
Copy link
Contributor Author

Thanks for reviewing!

@benwtrent I have added a CHANGES.txt entry now, could you help me merging this if all looks good?

@benwtrent
Copy link
Member

@shubhamvishu running CI :). I will see about merging and such soon.

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you! I left some minor comments. Mostly I think we should move the static methods to a vector-specific file

import org.apache.lucene.index.VectorSimilarityFunction;

/**
* A {@link DoubleValuesSource} which computes the vector similarity scores between the query vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: my English profs always said to use "that" in place of "which" when possible

* @return DoubleValues instance
* @throws IOException if an {@link IOException} occurs
*/
public static DoubleValues similarityToQueryVector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move these to VectorSimilaritySource? This file seem to be a dumping ground for all the things; let's not pile on.

Copy link
Contributor Author

@shubhamvishu shubhamvishu Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I see different DVS use cases implemented as static classes here making this less readable. How about we move those static classes and functions(like you pointed) into a new class DoubleValuesSourceUtil maybe(only thing those classes are private currently)?. I'll open a followup PR to address that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I don't think we should rename all the existing static methods here - that would just cause confusion. Maybe that's not what you were saying?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I didn't mean to move all the static methods here(sorry for the confusion). My idea was along your suggestion, should we also move the private static classes that are in this class(a lot in number, example - one, two and more) which is bloating this class out into another pkg-private class lets say DoubleValuesSourceUtil (the only implication being we would make these private classes pkg-private which is fine I think?). This way we could only keep the required API methods in DVS class and move out the rest stuff making this lean and readable. If that makes sense I'll put up these changes too with the followup CR(like below). Let me know what do you think?

Here is an short example commit I made to explain what I was saying - shubhamvishu@604f1fd

@msokolov
Copy link
Contributor

oh I missed that @benwtrent had approved -- it's OK w/me to merge as-is, just had some idea about organizing the static methods into a different class...

@shubhamvishu
Copy link
Contributor Author

oh I missed that @benwtrent had approved -- it's OK w/me to merge as-is, just had some idea about organizing the static methods into a different class...

Thanks @msokolov ! I'll open up a follow up PR to address your ideas if that sounds good?

@benwtrent
Copy link
Member

These are good ideas @msokolov , @shubhamvishu a follow up PR is most welcome.

@benwtrent benwtrent merged commit 52dfe50 into apache:main Oct 12, 2023
4 checks passed
benwtrent pushed a commit that referenced this pull request Oct 12, 2023
…12548)

### Description

This PR addresses the issue #12394. It adds an API **`similarityToQueryVector`** to `DoubleValuesSource` to compute vector similarity scores between the query vector and the `KnnByteVectorField`/`KnnFloatVectorField` for documents using the 2 new DVS implementations (`ByteVectorSimilarityValuesSource` for byte vectors and `FloatVectorSimilarityValuesSource` for float vectors). Below are the method signatures added to DVS in this PR:

- `DoubleValues similarityToQueryVector(LeafReaderContext ctx, float[] queryVector, String vectorField)` *(uses ByteVectorSimilarityValuesSource)*
- `DoubleValues similarityToQueryVector(LeafReaderContext ctx, byte[] queryVector, String vectorField)` *(uses FloatVectorSimilarityValuesSource)*

Closes #12394
@shubhamvishu
Copy link
Contributor Author

@msokolov @benwtrent Raised a followup PR #12671 for this and also moved some other relevant static classes etc out of DVS(it now looks much better to me). Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add the ability to compute vector similarity scores with the new ValuesSource API
4 participants