Making vector comparisons pluggable #13182

benwtrent · 2024-03-13T15:31:57Z

Description

Opening an issue to continue discussion originating here:
#13076 (comment)

Making vector similarities pluggable via SPI will enable users to provide their own specialized similarities without the additional burden of Lucene core having to provide BWC for all the various similarity functions (e.g. hamming, jaccard, cosine).

It is probably best that the plug-and-play aspect is placed on FieldInfo, though this would be a bit of work as FieldInfo isn't currently pluggable. Attaching it directly to a particular Vector Format would place an undue burden on users, requiring a new format for any field that needs a different similarity.

While I am not 100% sure how to add it to FieldInfo, I do want to try and figure out the API for such a change.

When used within a particular vector format, the following scenarios would be useful:

- Indexing: comparing on-heap vectors accessible via ordinals.
- Merging: comparing off-heap vectors, possibly reading them directly on-heap via ordinals.
- Searching: comparing an on-heap, user-provided query vector with off-heap vectors via ordinals (potentially reading them on-heap).

Some of the optimizations discussed in #12703 show significant gains from simply having a memory segment and an ordinal offset. While we are not there yet in Lucene, this indicates that we shouldn't force the API to read off-heap vectors into float[] or byte[] arrays.
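To make the "memory segment plus ordinal offset" idea concrete, here is a minimal sketch, not Lucene code. It assumes a direct `ByteBuffer` as a stand-in for a memory-mapped vector file, and the class `OffHeapDotProduct` and its `pack` helper are hypothetical names used only for illustration. The comparison reads floats in place by ordinal and never materializes a `float[]`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical scorer: compares vectors in place inside a direct (off-heap) buffer. */
final class OffHeapDotProduct {
  private final ByteBuffer vectors; // stand-in for a memory-mapped vector file
  private final int dimension;

  OffHeapDotProduct(ByteBuffer vectors, int dimension) {
    this.vectors = vectors;
    this.dimension = dimension;
  }

  /** Dot product of the vectors stored at ordinals ord1 and ord2, with no float[] copy. */
  float compare(int ord1, int ord2) {
    int base1 = ord1 * dimension * Float.BYTES;
    int base2 = ord2 * dimension * Float.BYTES;
    float sum = 0f;
    for (int i = 0; i < dimension; i++) {
      int offset = i * Float.BYTES;
      // Absolute reads: the buffer position is never changed.
      sum += vectors.getFloat(base1 + offset) * vectors.getFloat(base2 + offset);
    }
    return sum;
  }

  /** Packs on-heap vectors into a direct buffer; only used to build demo data. */
  static ByteBuffer pack(float[][] data) {
    int dimension = data[0].length;
    ByteBuffer buf =
        ByteBuffer.allocateDirect(data.length * dimension * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
    for (float[] vector : data) {
      for (float value : vector) {
        buf.putFloat(value);
      }
    }
    return buf;
  }
}
```

An API that only ever hands out `float[]` or `byte[]` would rule out this style of scorer, which is the concern raised above.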

I was thinking of something like this. To be clear, this is not finalized; I just want to start the discussion.

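import java.io.Closeable;
import org.apache.lucene.util.NamedSPILoader;
import org.apache.lucene.util.hnsw.RandomAccessVectorValues;

public abstract class VectorSimilarityInterface implements NamedSPILoader.NamedSPI {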
  
  private static final class Holder {
    private static final NamedSPILoader<VectorSimilarityInterface> LOADER =
      new NamedSPILoader<>(VectorSimilarityInterface.class);
    private Holder() {}
    static NamedSPILoader<VectorSimilarityInterface> getLoader() {
      if (LOADER == null) {
        throw new IllegalStateException(
          "You tried to lookup a VectorSimilarityInterface name before all formats could be initialized. "
            + "This likely happens if you call VectorSimilarityInterface#forName from a VectorSimilarityInterface's ctor.");
      }
      return LOADER;
    }
  }
  
  public static VectorSimilarityInterface forName(String name) {
    return VectorSimilarityInterface.Holder.getLoader().lookup(name);
  }
  
  private final String name;
  protected VectorSimilarityInterface(String name) {
    NamedSPILoader.checkServiceName(name);
    this.name = name;
  }
  @Override
  public String getName() {
    return name;
  }
  
  // Comparing an "on heap" query with vectorValues that may or may not be on-heap
  // Maybe we don't need this and the `byte[]` version as we could hide the "on-heap query"
  // in an "IdentityRandomAccessVectorValues" which only returns the query vector...
  public abstract VectorScorer getVectorScorer(RandomAccessVectorValues<float[]> vectorValues, float[] target) throws Exception;
  
  public abstract VectorComparator getFloatVectorComparator(RandomAccessVectorValues<float[]> vectorValues) throws Exception;
  
  public abstract VectorScorer getVectorScorer(RandomAccessVectorValues<byte[]> vectorValues, byte[] target) throws Exception;
  
  public abstract VectorComparator getByteVectorComparator(RandomAccessVectorValues<byte[]> vectorValues) throws Exception;

  static interface VectorScorer extends Closeable {
    float score(int targetOrd);
  }
  
  static interface VectorComparator {
    float compare(int vectorOrd1, int vectorOrd2);
  }
}
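As a rough illustration of what a user-provided similarity might look like under this proposal, here is a self-contained sketch of a Hamming similarity over byte vectors. The interfaces below are simplified stand-ins, not the Lucene types: `ByteVectors` is a hypothetical replacement for `RandomAccessVectorValues<byte[]>` that exposes only ordinal lookup, and the scorer and comparator drop `Closeable` and the SPI plumbing. The mapping from distance to score is also an assumption made for the example.

```java
/** Simplified stand-in for RandomAccessVectorValues<byte[]>: ordinal lookup only. */
interface ByteVectors {
  byte[] vectorValue(int ord);
}

/** Stand-in mirroring the proposed scorer: query vector vs. a stored ordinal. */
interface VectorScorer {
  float score(int targetOrd);
}

/** Stand-in mirroring the proposed comparator: two stored ordinals. */
interface VectorComparator {
  float compare(int vectorOrd1, int vectorOrd2);
}

/** Hypothetical plug-in: Hamming similarity over byte vectors. */
final class HammingSimilarity {

  /** Number of differing bits between two equal-length byte vectors. */
  static int hammingDistance(byte[] a, byte[] b) {
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension does not add spurious set bits.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Maps a distance to a score in (0, 1]; higher means more similar. */
  static float toScore(int distance) {
    return 1f / (1f + distance);
  }

  /** Search case: an on-heap query compared against stored vectors by ordinal. */
  VectorScorer getVectorScorer(ByteVectors vectors, byte[] target) {
    return ord -> toScore(hammingDistance(target, vectors.vectorValue(ord)));
  }

  /** Indexing and merging case: two stored vectors compared by ordinal. */
  VectorComparator getByteVectorComparator(ByteVectors vectors) {
    return (ord1, ord2) ->
        toScore(hammingDistance(vectors.vectorValue(ord1), vectors.vectorValue(ord2)));
  }
}
```

Because callers only ever pass ordinals, the implementation is free to decide whether the underlying bytes are read on-heap or compared in place.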

It looks like the SPI injection could occur in FieldInfosFormat#read and FieldInfosFormat#write, though a new format would have to be built (Lucene911FieldInfosFormat or something).

This would also include a new codec as the field format will change.
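The read/write idea above can be sketched roughly as follows: on write, only the similarity's SPI name is persisted alongside the field; on read, the name is resolved back to an implementation. This is not the Lucene `FieldInfosFormat` API. `NamedSimilarity` and `FieldSimilarityCodec` are hypothetical names, a plain `Map` stands in for `NamedSPILoader`, and `java.io` data streams stand in for Lucene's index input and output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

/** Hypothetical stand-in for a named, SPI-loaded similarity. */
interface NamedSimilarity {
  String getName();
}

final class FieldSimilarityCodec {
  // Stand-in for NamedSPILoader: a fixed name -> implementation table.
  private final Map<String, NamedSimilarity> registry;

  FieldSimilarityCodec(Map<String, NamedSimilarity> registry) {
    this.registry = registry;
  }

  /** Write side: persist only the similarity's name next to the field name. */
  byte[] write(String fieldName, NamedSimilarity similarity) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(fieldName);
      out.writeUTF(similarity.getName());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Read side: resolve the stored name to an implementation, failing loudly if absent. */
  NamedSimilarity read(byte[] encoded) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
      in.readUTF(); // field name, unused in this sketch
      String name = in.readUTF();
      NamedSimilarity similarity = registry.get(name);
      if (similarity == null) {
        throw new IllegalArgumentException("No similarity registered under name '" + name + "'");
      }
      return similarity;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Storing only the name keeps the on-disk format independent of any particular similarity implementation, but it also means an index cannot be opened unless the named implementation is available at read time.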

I am not 100% sold on how this API looks myself. I don't think RandomAccessVectorValues is 100% the correct API as it either exposes too much (e.g. ordToDoc) or too little (for off-heap, we get access to neither the MemorySegment nor the files...).


benwtrent · 2024-04-22T20:50:17Z

Closing as fixed by #13288

benwtrent added the type:enhancement label Mar 13, 2024

benwtrent mentioned this issue Mar 14, 2024

Avoid recalculating the norm of the target vector when using cosine metric #13185

Closed

msokolov mentioned this issue Mar 18, 2024

Pass custom similarity function to similarityToQueryVector API #13187

Open

benwtrent mentioned this issue Mar 22, 2024

Add new pluggable vector similarity to field info #13200

Closed

benwtrent closed this as completed Apr 22, 2024

alessandrobenedetti added the vector-based-search label May 20, 2024
