-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ability to compute vector similarity scores with DoubleValuesSource #12548
Conversation
lucene/core/src/java/org/apache/lucene/search/ByteVectorSimilarityValuesSource.java
Outdated
Show resolved
Hide resolved
*/ | ||
public static DoubleValues similarityToQueryVector( | ||
LeafReaderContext ctx, byte[] queryVector, String vectorField) throws IOException { | ||
assert ctx.reader().getFieldInfos().fieldInfo(vectorField).getVectorEncoding() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of throwing an exception even when assertions are not enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! But I wonder if thats the case in many places in codebase?. I think we should rather throw an IAE
in this case.
* A {@link DoubleValuesSource} which computes the vector similarity scores between the query vector | ||
* and the {@link org.apache.lucene.document.KnnFloatVectorField} for documents. | ||
*/ | ||
class FloatVectorSimilarityValuesSource extends DoubleValuesSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and BytesVectorSimilarityValuesSource
look very similar. Should they inherit from a VectorSimilarityValuesSource
to ensure we can add common functionality in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some common bits. I have moved those to VectorSimilarityValuesSource
now. Let me know if this looks more better or cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I think this is the right approach.
lucene/core/src/test/org/apache/lucene/search/TestVectorSimilarityValuesSource.java
Show resolved
Hide resolved
lucene/core/src/test/org/apache/lucene/search/TestVectorSimilarityValuesSource.java
Show resolved
Hide resolved
a7a19b5
to
97ff777
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing the comments. The new version looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API looks OK to me.
Thanks for reviewing! @benwtrent I have added a CHANGES.txt entry now, could you help me merging this if all looks good? |
@shubhamvishu running CI :). I will see about merging and such soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thank you! I left some minor comments. Mostly I think we should move the static methods to a vector-specific file
import org.apache.lucene.index.VectorSimilarityFunction; | ||
|
||
/** | ||
* A {@link DoubleValuesSource} which computes the vector similarity scores between the query vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: my English profs always said to use "that" in place of "which" when possible
* @return DoubleValues instance | ||
* @throws IOException if an {@link IOException} occurs | ||
*/ | ||
public static DoubleValues similarityToQueryVector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move these to VectorSimilaritySource? This file seem to be a dumping ground for all the things; let's not pile on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I see different DVS use cases implemented as static classes here making this less readable. How about we move those static classes and functions(like you pointed) into a new class DoubleValuesSourceUtil
maybe(only thing those classes are private currently)?. I'll open a followup PR to address that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well I don't think we should rename all the existing static methods here - that would just cause confusion. Maybe that's not what you were saying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I didn't mean to move all the static methods here(sorry for the confusion). My idea was along your suggestion, should we also move the private static classes that are in this class(a lot in number, example - one, two and more) which is bloating this class out into another pkg-private class lets say DoubleValuesSourceUtil
(the only implication being we would make these private classes pkg-private which is fine I think?). This way we could only keep the required API methods in DVS class and move out the rest stuff making this lean and readable. If that makes sense I'll put up these changes too with the followup CR(like below). Let me know what do you think?
Here is an short example commit I made to explain what I was saying - shubhamvishu@604f1fd
oh I missed that @benwtrent had approved -- it's OK w/me to merge as-is, just had some idea about organizing the static methods into a different class... |
Thanks @msokolov ! I'll open up a follow up PR to address your ideas if that sounds good? |
These are good ideas @msokolov , @shubhamvishu a follow up PR is most welcome. |
…12548) ### Description This PR addresses the issue #12394. It adds an API **`similarityToQueryVector`** to `DoubleValuesSource` to compute vector similarity scores between the query vector and the `KnnByteVectorField`/`KnnFloatVectorField` for documents using the 2 new DVS implementations (`ByteVectorSimilarityValuesSource` for byte vectors and `FloatVectorSimilarityValuesSource` for float vectors). Below are the method signatures added to DVS in this PR: - `DoubleValues similarityToQueryVector(LeafReaderContext ctx, float[] queryVector, String vectorField)` *(uses ByteVectorSimilarityValuesSource)* - `DoubleValues similarityToQueryVector(LeafReaderContext ctx, byte[] queryVector, String vectorField)` *(uses FloatVectorSimilarityValuesSource)* Closes #12394
@msokolov @benwtrent Raised a followup PR #12671 for this and also moved some other relevant static classes etc out of DVS(it now looks much better to me). Thanks! |
Description
This PR addresses the issue #12394. It adds an API
similarityToQueryVector
toDoubleValuesSource
to compute vector similarity scores between the query vector and theKnnByteVectorField
/KnnFloatVectorField
for documents using the 2 new DVS implementations (ByteVectorSimilarityValuesSource
for byte vectors andFloatVectorSimilarityValuesSource
for float vectors). Below are the method signatures added to DVS in this PR:DoubleValues similarityToQueryVector(LeafReaderContext ctx, float[] queryVector, String vectorField)
(uses ByteVectorSimilarityValuesSource)DoubleValues similarityToQueryVector(LeafReaderContext ctx, byte[] queryVector, String vectorField)
(uses FloatVectorSimilarityValuesSource)Closes #12394