Opening an issue to continue discussion originating here: #13076 (comment)
Making vector similarities pluggable via SPI will enable users to provide their own specialized similarities without the additional burden of Lucene core having to provide BWC for all the various similarity functions (e.g. hamming, jaccard, cosine).
It is probably best that the plug-and-play aspect is placed on FieldInfo, though this would be a bit of work as FieldInfo isn't currently pluggable. Attaching it directly to a particular vector format would place an undue burden on users, requiring a new format for any field that needs a different similarity.
While I am not 100% sure how to add it to FieldInfo, I do want to try and figure out the API for such a change.
When used within a particular vector format, the following scenarios would be useful:
- Indexing: comparing on-heap vectors accessible via ordinals.
- Merging: comparing off-heap vectors, possibly reading them directly on-heap via ordinals.
- Searching: comparing an on-heap, user-provided vector with off-heap vectors via ordinals (potentially reading them on-heap).
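To make the ordinal-based scenarios above concrete, here is a minimal sketch of a Hamming-style similarity over on-heap byte vectors. Everything in it is an assumption for illustration: `ByteVectorsByOrdinal`, `OrdinalComparator`, `OrdinalScorer`, and `HammingSimilaritySketch` are hypothetical stand-ins, not Lucene types, and the `1 / (1 + differingBits)` scoring is just one possible way to turn a distance into a score.

```java
import java.util.List;

/** Hypothetical stand-in for ordinal-based vector access; not a Lucene type. */
interface ByteVectorsByOrdinal {
  int size();

  byte[] vectorValue(int ord);
}

/** Hypothetical: compares two stored vectors addressed by ordinal (indexing, merging). */
interface OrdinalComparator {
  float compare(int vectorOrd1, int vectorOrd2);
}

/** Hypothetical: scores a fixed query against a stored vector addressed by ordinal (search). */
interface OrdinalScorer {
  float score(int targetOrd);
}

final class HammingSimilaritySketch {
  private HammingSimilaritySketch() {}

  /** Maps the Hamming distance to a score in (0, 1]; identical vectors score 1. */
  static float similarity(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
    }
    int differingBits = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension of the XOR result does not add spurious bits.
      differingBits += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return 1f / (1f + differingBits);
  }

  /** Indexing and merging: both sides are stored vectors, looked up by ordinal. */
  static OrdinalComparator comparator(ByteVectorsByOrdinal vectors) {
    return (ord1, ord2) -> similarity(vectors.vectorValue(ord1), vectors.vectorValue(ord2));
  }

  /** Search: the query lives on-heap, the stored side is looked up by ordinal. */
  static OrdinalScorer scorer(ByteVectorsByOrdinal vectors, byte[] query) {
    return ord -> similarity(vectors.vectorValue(ord), query);
  }

  /** Wraps an in-memory list so it can be addressed by ordinal. */
  static ByteVectorsByOrdinal onHeap(List<byte[]> vectors) {
    return new ByteVectorsByOrdinal() {
      @Override
      public int size() {
        return vectors.size();
      }

      @Override
      public byte[] vectorValue(int ord) {
        return vectors.get(ord);
      }
    };
  }
}
```

The comparator and the scorer share the same kernel; only the source of the second operand differs, which is the distinction the three scenarios draw.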
Some of the optimizations discussed in #12703 show significant gains from simply having a memory segment and an ordinal offset. While we are not there yet in Lucene, it indicates that we shouldn't force the API to read off-heap vectors into float[] or byte[] arrays.
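As a rough illustration of comparing by ordinal offset without copying into a float[], the sketch below computes a dot product directly against a direct `ByteBuffer`. The buffer is only a stand-in for a memory segment, and `OffHeapDotProductSketch` with its flat little-endian layout is a hypothetical construction for this example, not how Lucene stores vectors.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch: vectors stored back to back in one buffer, addressed by ordinal. */
final class OffHeapDotProductSketch {
  private final ByteBuffer data;
  private final int dimension;

  OffHeapDotProductSketch(ByteBuffer data, int dimension) {
    this.data = data;
    this.dimension = dimension;
  }

  /** Dot product of two stored vectors, read in place via absolute offsets. */
  float compare(int ord1, int ord2) {
    int base1 = ord1 * dimension * Float.BYTES;
    int base2 = ord2 * dimension * Float.BYTES;
    float sum = 0f;
    for (int i = 0; i < dimension; i++) {
      // Absolute reads: no intermediate float[] is allocated or filled.
      sum += data.getFloat(base1 + i * Float.BYTES) * data.getFloat(base2 + i * Float.BYTES);
    }
    return sum;
  }

  /** Test helper: packs on-heap vectors into a direct (off-heap) buffer. */
  static OffHeapDotProductSketch of(float[][] vectors) {
    int dimension = vectors[0].length;
    ByteBuffer buffer =
        ByteBuffer.allocateDirect(vectors.length * dimension * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
    for (float[] vector : vectors) {
      for (float value : vector) {
        buffer.putFloat(value);
      }
    }
    return new OffHeapDotProductSketch(buffer, dimension);
  }
}
```

An API that only hands out float[] or byte[] copies would rule out this kind of in-place access, which is the concern raised above.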
I was thinking of something like this. To be clear, this is not finalized; I just want to start the discussion.
    import java.io.Closeable;
    import org.apache.lucene.util.NamedSPILoader;
    import org.apache.lucene.util.hnsw.RandomAccessVectorValues;

    public abstract class VectorSimilarityInterface implements NamedSPILoader.NamedSPI {

      // Lazy holder so the SPI loader is only created on first lookup.
      private static final class Holder {
        private static final NamedSPILoader<VectorSimilarityInterface> LOADER =
            new NamedSPILoader<>(VectorSimilarityInterface.class);

        private Holder() {}

        static NamedSPILoader<VectorSimilarityInterface> getLoader() {
          if (LOADER == null) {
            throw new IllegalStateException(
                "You tried to lookup a VectorSimilarityInterface name before all formats could be initialized. "
                    + "This likely happens if you call VectorSimilarityInterface#forName from a VectorSimilarityInterface's ctor.");
          }
          return LOADER;
        }
      }

      public static VectorSimilarityInterface forName(String name) {
        return VectorSimilarityInterface.Holder.getLoader().lookup(name);
      }

      private final String name;

      protected VectorSimilarityInterface(String name) {
        NamedSPILoader.checkServiceName(name);
        this.name = name;
      }

      @Override
      public String getName() {
        return name;
      }

      // Comparing an "on heap" query with vectorValues that may or may not be on-heap
      // Maybe we don't need this and the `byte[]` version as we could hide the "on-heap query"
      // in an "IdentityRandomAccessVectorValues" which only returns the query vector...
      public abstract VectorScorer getVectorScorer(
          RandomAccessVectorValues<float[]> vectorValues, float[] target) throws Exception;

      public abstract VectorComparator getFloatVectorComparator(
          RandomAccessVectorValues<float[]> vectorValues) throws Exception;

      public abstract VectorScorer getVectorScorer(
          RandomAccessVectorValues<byte[]> vectorValues, byte[] target) throws Exception;

      public abstract VectorComparator getByteVectorComparator(
          RandomAccessVectorValues<byte[]> vectorValues) throws Exception;

      // Scores a fixed target (query) vector against a stored vector ordinal.
      static interface VectorScorer extends Closeable {
        float score(int targetOrd);
      }

      // Compares two stored vectors by their ordinals.
      static interface VectorComparator {
        float compare(int vectorOrd1, int vectorOrd2);
      }
    }
It looks like the SPI injection could occur in FieldInfosFormat#read and FieldInfosFormat#write, though a new format would have to be built (Lucene911FieldInfosFormat or something).
This would also require a new codec, as the field format will change.
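One way to picture the read/write injection is to persist only the similarity's name with the field metadata and resolve it by name when the metadata is read back. The sketch below is purely illustrative: `NamedSimilaritySketch`, `DotProductSketch`, `FieldMetadataSketch`, the attribute key, and the plain map registry (standing in for NamedSPILoader) are all hypothetical and say nothing about how FieldInfosFormat actually encodes fields.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical named similarity; a plain map stands in for NamedSPILoader here. */
abstract class NamedSimilaritySketch {
  private static final Map<String, NamedSimilaritySketch> REGISTRY = new HashMap<>();

  private final String name;

  protected NamedSimilaritySketch(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }

  abstract float compare(float[] a, float[] b);

  static void register(NamedSimilaritySketch similarity) {
    REGISTRY.put(similarity.getName(), similarity);
  }

  /** Resolves a similarity by name, failing loudly when none is registered. */
  static NamedSimilaritySketch forName(String name) {
    NamedSimilaritySketch similarity = REGISTRY.get(name);
    if (similarity == null) {
      throw new IllegalArgumentException("no similarity registered under '" + name + "'");
    }
    return similarity;
  }
}

/** Hypothetical example implementation. */
final class DotProductSketch extends NamedSimilaritySketch {
  DotProductSketch() {
    super("dot_product_sketch");
  }

  @Override
  float compare(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}

/** Hypothetical per-field metadata: only the similarity name is stored. */
final class FieldMetadataSketch {
  // Assumed attribute key, chosen for this example only.
  static final String SIMILARITY_KEY = "vector_similarity_name";

  private FieldMetadataSketch() {}

  /** Write side: record the name, not the implementation. */
  static Map<String, String> write(NamedSimilaritySketch similarity) {
    Map<String, String> attributes = new HashMap<>();
    attributes.put(SIMILARITY_KEY, similarity.getName());
    return attributes;
  }

  /** Read side: look the implementation up again by its stored name. */
  static NamedSimilaritySketch read(Map<String, String> attributes) {
    return NamedSimilaritySketch.forName(attributes.get(SIMILARITY_KEY));
  }
}
```

Storing only the name keeps the on-disk format independent of any particular similarity implementation, at the cost of failing at read time when the named implementation is not available.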
I am not 100% sold on how this API looks myself. I don't think RandomAccessVectorValues is quite the right API, as it exposes either too much (e.g. ordToDoc) or too little (for off-heap, we get access to neither the MemorySegment nor the files...).