Making vector comparisons pluggable #13182

Closed
benwtrent opened this issue Mar 13, 2024 · 1 comment

Description

Opening an issue to continue discussion originating here:
#13076 (comment)

Making vector similarities pluggable via SPI will enable users to provide their own specialized similarities without the additional burden of Lucene core having to provide BWC for all the various similarity functions (e.g. hamming, jaccard, cosine).

It is probably best that the plug-and-play aspect is placed on FieldInfo, though this would be a bit of work as FieldInfo isn't currently pluggable. Attaching it directly to a particular vector format would place an undue burden on users, requiring a new format for any field that needs a different similarity.

While I am not 100% sure how to add it to FieldInfo, I do want to try and figure out the API for such a change.

When used within a particular vector format, the API should support the following scenarios:

  • Indexing, comparing on-heap vectors accessible via ordinals
  • Merging, comparing off-heap vectors, possibly reading them directly on-heap via ordinals
  • Searching, comparing an on-heap user-provided vector with off-heap vectors via ordinals (potentially reading them on-heap)
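
The three scenarios above reduce to two shapes: comparing two stored vectors by ordinal (indexing, merging) and scoring a fixed query against a stored vector by ordinal (search). Here is a minimal, self-contained sketch of those two shapes. All names (`OrdinalScoringSketch`, `OrdComparator`, `OrdScorer`) are hypothetical, `IntFunction<float[]>` is only a stand-in for whatever ordinal-addressed vector source Lucene would provide, and dot product is used purely as an example similarity.

```java
import java.util.function.IntFunction;

/** Hypothetical sketch; none of these types are Lucene APIs. */
final class OrdinalScoringSketch {

  /** Compares two stored vectors, both addressed by ordinal (indexing, merging). */
  interface OrdComparator {
    float compare(int ord1, int ord2);
  }

  /** Scores a fixed target, e.g. the query, against a stored vector (search). */
  interface OrdScorer {
    float score(int targetOrd);
  }

  static float dotProduct(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Both sides are resolved from ordinals; the caller never sees the arrays. */
  static OrdComparator comparator(IntFunction<float[]> vectors) {
    return (ord1, ord2) -> dotProduct(vectors.apply(ord1), vectors.apply(ord2));
  }

  /** The on-heap query is captured once; only the stored side uses ordinals. */
  static OrdScorer scorer(IntFunction<float[]> vectors, float[] query) {
    return ord -> dotProduct(query, vectors.apply(ord));
  }

  public static void main(String[] args) {
    float[][] stored = {{1f, 0f}, {0f, 1f}, {1f, 1f}};
    IntFunction<float[]> byOrd = ord -> stored[ord];
    System.out.println(comparator(byOrd).compare(0, 2)); // 1.0
    System.out.println(scorer(byOrd, new float[] {2f, 3f}).score(1)); // 3.0
  }
}
```

Because the caller only ever passes ordinals, the implementation behind `byOrd` is free to read from the heap or from off-heap storage without changing the calling code.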

Some of the optimizations discussed in #12703 show significant gains from simply having a memory segment and an ordinal offset. While we are not there yet in Lucene, it indicates that we shouldn’t force the API to read off-heap vectors into float[] or byte[] arrays.
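
To illustrate why an ordinal plus a backing region is enough, here is a rough sketch that scores two vectors directly from an off-heap buffer without first copying them into a `float[]`. It uses a direct `java.nio.ByteBuffer` as a stand-in for a `MemorySegment` or memory-mapped file so that it runs on older JDKs; the class name, the contiguous fixed-dimension layout, and the little-endian byte order are all assumptions made for the example, not how Lucene lays out vectors.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch; a direct ByteBuffer stands in for off-heap storage. */
final class OffHeapScoringSketch {

  /** Packs vectors back to back, so vector `ord` starts at ord * dim * 4 bytes. */
  static ByteBuffer pack(float[][] vectors) {
    int dim = vectors[0].length;
    ByteBuffer buf =
        ByteBuffer.allocateDirect(vectors.length * dim * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
    for (float[] v : vectors) {
      for (float x : v) {
        buf.putFloat(x);
      }
    }
    return buf;
  }

  /** Dot product of two stored vectors, read in place via absolute offsets. */
  static float dotProduct(ByteBuffer buf, int dim, int ord1, int ord2) {
    int base1 = ord1 * dim * Float.BYTES;
    int base2 = ord2 * dim * Float.BYTES;
    float sum = 0f;
    for (int i = 0; i < dim; i++) {
      int offset = i * Float.BYTES;
      sum += buf.getFloat(base1 + offset) * buf.getFloat(base2 + offset);
    }
    return sum;
  }

  public static void main(String[] args) {
    ByteBuffer buf = pack(new float[][] {{1f, 2f}, {3f, 4f}, {5f, 6f}});
    System.out.println(dotProduct(buf, 2, 0, 2)); // 17.0
  }
}
```

An API that always hands the similarity a `float[]` would rule out this kind of in-place access, which is the concern raised above.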

I was thinking of something like this. To be clear, this is not finalized; I just want to start the discussion.

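import java.io.Closeable;
import org.apache.lucene.util.NamedSPILoader;
import org.apache.lucene.util.hnsw.RandomAccessVectorValues;

public abstract class VectorSimilarityInterface implements NamedSPILoader.NamedSPI {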
  
  private static final class Holder {
    private static final NamedSPILoader<VectorSimilarityInterface> LOADER =
      new NamedSPILoader<>(VectorSimilarityInterface.class);
    private Holder() {}
    static NamedSPILoader<VectorSimilarityInterface> getLoader() {
      if (LOADER == null) {
        throw new IllegalStateException(
          "You tried to lookup a VectorSimilarityInterface name before all formats could be initialized. "
            + "This likely happens if you call VectorSimilarityInterface#forName from a VectorSimilarityInterface's ctor.");
      }
      return LOADER;
    }
  }
  
  public static VectorSimilarityInterface forName(String name) {
    return VectorSimilarityInterface.Holder.getLoader().lookup(name);
  }
  
  private final String name;
  protected VectorSimilarityInterface(String name) {
    NamedSPILoader.checkServiceName(name);
    this.name = name;
  }
  @Override
  public String getName() {
    return name;
  }
  
  // Comparing an "on heap" query with vectorValues that may or may not be on-heap
  // Maybe we don't need this and the `byte[]` version as we could hide the "on-heap query"
  // in an "IdentityRandomAccessVectorValues" which only returns the query vector...
  public abstract VectorScorer getVectorScorer(RandomAccessVectorValues<float[]> vectorValues, float[] target) throws Exception;
  
  public abstract VectorComparator getFloatVectorComparator(RandomAccessVectorValues<float[]> vectorValues) throws Exception;
  
  public abstract VectorScorer getVectorScorer(RandomAccessVectorValues<byte[]> vectorValues, byte[] target) throws Exception;
  
  public abstract VectorComparator getByteVectorComparator(RandomAccessVectorValues<byte[]> vectorValues) throws Exception;

  static interface VectorScorer extends Closeable {
    float score(int targetOrd);
  }
  
  static interface VectorComparator {
    float compare(int vectorOrd1, int vectorOrd2);
  }
}
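
For a sense of what a user-provided similarity could look like under an API of this shape, here is a self-contained sketch of a Hamming similarity over byte vectors. It does not use Lucene: `NamedSimilarity` is a simplified, hypothetical stand-in for the class above, a plain map stands in for `NamedSPILoader`, `IntFunction<byte[]>` stands in for `RandomAccessVectorValues<byte[]>`, and the `1 / (1 + distance)` score mapping is an arbitrary choice made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical sketch; simplified stand-ins for the proposed API, not Lucene code. */
final class HammingSimilaritySketch {

  interface ByteVectorComparator {
    float compare(int ord1, int ord2);
  }

  /** Stand-in for the proposed abstract class: a name plus a comparator factory. */
  abstract static class NamedSimilarity {
    private final String name;

    NamedSimilarity(String name) {
      this.name = name;
    }

    String getName() {
      return name;
    }

    abstract ByteVectorComparator getByteVectorComparator(IntFunction<byte[]> vectors);
  }

  /** A user-provided similarity: counts differing bits between two byte vectors. */
  static final class Hamming extends NamedSimilarity {
    Hamming() {
      super("hamming");
    }

    @Override
    ByteVectorComparator getByteVectorComparator(IntFunction<byte[]> vectors) {
      return (ord1, ord2) -> {
        byte[] a = vectors.apply(ord1);
        byte[] b = vectors.apply(ord2);
        if (a.length != b.length) {
          throw new IllegalArgumentException("dimension mismatch");
        }
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
          distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        // Assumed mapping from distance to score: identical vectors score 1.0.
        return 1f / (1f + distance);
      };
    }
  }

  // A plain map stands in for name-based SPI lookup.
  private static final Map<String, NamedSimilarity> REGISTRY = new HashMap<>();

  static {
    register(new Hamming());
  }

  static void register(NamedSimilarity similarity) {
    REGISTRY.put(similarity.getName(), similarity);
  }

  static NamedSimilarity forName(String name) {
    NamedSimilarity similarity = REGISTRY.get(name);
    if (similarity == null) {
      throw new IllegalArgumentException("unknown similarity: " + name);
    }
    return similarity;
  }

  public static void main(String[] args) {
    byte[][] stored = {{0x01}, {0x00}};
    ByteVectorComparator cmp = forName("hamming").getByteVectorComparator(ord -> stored[ord]);
    System.out.println(cmp.compare(0, 1)); // 0.5
  }
}
```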

It looks like the SPI injection could occur in FieldInfosFormat#read and FieldInfosFormat#write (though a new format would have to be built, Lucene911FieldInfosFormat or something similar).

This would also require a new codec, as the field format will change.
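
One way to picture the read/write hook: the write side persists only the similarity's name next to the field, and the read side resolves that name back to an implementation through a lookup such as `forName`. The sketch below is an assumption about how that could work, not the actual Lucene file format. It uses `java.io` data streams in place of Lucene's own input/output classes, and `FieldInfoSimilaritySketch`, `FieldEntry`, and `roundTrip` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; java.io streams stand in for Lucene's index I/O. */
final class FieldInfoSimilaritySketch {

  /** A field name paired with its resolved similarity implementation. */
  static final class FieldEntry<S> {
    final String fieldName;
    final S similarity;

    FieldEntry(String fieldName, S similarity) {
      this.fieldName = fieldName;
      this.similarity = similarity;
    }
  }

  /** Write side: only the similarity's name is persisted, never the implementation. */
  static void write(DataOutput out, String fieldName, String similarityName)
      throws IOException {
    out.writeUTF(fieldName);
    out.writeUTF(similarityName);
  }

  /** Read side: the stored name is resolved through a name-based lookup. */
  static <S> FieldEntry<S> read(DataInput in, Function<String, S> forName)
      throws IOException {
    String fieldName = in.readUTF();
    String similarityName = in.readUTF();
    S similarity = forName.apply(similarityName);
    if (similarity == null) {
      throw new IllegalArgumentException("unknown similarity: " + similarityName);
    }
    return new FieldEntry<>(fieldName, similarity);
  }

  /** In-memory write followed by read, to show both halves together. */
  static <S> FieldEntry<S> roundTrip(
      String fieldName, String similarityName, Function<String, S> forName) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      write(new DataOutputStream(bytes), fieldName, similarityName);
      return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), forName);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> known = Map.of("hamming", "hamming-impl");
    FieldEntry<String> entry = roundTrip("my_vector", "hamming", known::get);
    System.out.println(entry.fieldName + " -> " + entry.similarity);
  }
}
```

Storing just the name is what would let Lucene core avoid carrying backwards-compatibility obligations for every similarity function: a segment can only be read if the named implementation is still available to the lookup.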

I am not 100% sold on how this API looks myself. I don't think RandomAccessVectorValues is 100% the correct API, as it exposes either too much (e.g. ordToDoc) or too little (for off-heap vectors, we get access to neither the MemorySegment nor the files).
