
Add a MemorySegment Vector scorer - for scoring without copying on-heap #13339

Merged 43 commits into apache:main on May 21, 2024

Conversation

@ChrisHegarty (Contributor) commented May 2, 2024

Add a MemorySegment Vector scorer - for scoring without copying on-heap.

The vector scorer loads values directly from the backing memory segment when available. Otherwise, if the vector data spans across segments, or is a query vector, the scorer copies the vector data on-heap.

A benchmark shows a ~2x performance improvement of this scorer over the default copy-on-heap scorer. The benchmark needs a little more scrutiny and evaluation on different platforms. Here are the results on an M2 Max:

Benchmark                                      (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductDefault    1024  thrpt    5  1.391 ± 0.016  ops/us
VectorScorerBenchmark.binaryDotProductMemSeg     1024  thrpt    5  3.013 ± 0.207  ops/us

The scorer currently only operates on vectors with an element size of byte, since loading vector data from float[] (the fallback) is only supported in JDK 22. We can evaluate if and how to support floats separately. See https://bugs.openjdk.org/browse/JDK-8318678

The vector scorer is implicitly tied to the Panama Vector Utils implementation - you can only have a MemorySegment scorer if the Panama vector implementation is present. There is a little room for improvement in how these things are initialised and structured.
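
To make the idea concrete, here is a minimal, self-contained sketch (not the PR's actual code; class and method names are illustrative) of a byte dot product that loads lanes directly from the backing MemorySegment instead of first copying each vector into a byte[]. It assumes the incubating Vector API is enabled (--add-modules jdk.incubator.vector) and, on JDK 21, --enable-preview for the foreign memory API.

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class OffHeapByteDotProduct {
  // 8 byte lanes widened to 8 int lanes, so the per-lane products cannot overflow.
  private static final VectorSpecies<Byte> BYTE_SPECIES = ByteVector.SPECIES_64;
  private static final VectorSpecies<Integer> INT_SPECIES = IntVector.SPECIES_256;

  static int dotProduct(MemorySegment a, MemorySegment b, int dims) {
    int i = 0;
    IntVector acc = IntVector.zero(INT_SPECIES);
    // Load lanes straight from the (memory-mapped) segments: no on-heap copy.
    for (; i < BYTE_SPECIES.loopBound(dims); i += BYTE_SPECIES.length()) {
      ByteVector va = ByteVector.fromMemorySegment(BYTE_SPECIES, a, i, ByteOrder.LITTLE_ENDIAN);
      ByteVector vb = ByteVector.fromMemorySegment(BYTE_SPECIES, b, i, ByteOrder.LITTLE_ENDIAN);
      IntVector ia = (IntVector) va.convertShape(VectorOperators.B2I, INT_SPECIES, 0);
      IntVector ib = (IntVector) vb.convertShape(VectorOperators.B2I, INT_SPECIES, 0);
      acc = acc.add(ia.mul(ib));
    }
    int result = acc.reduceLanes(VectorOperators.ADD);
    // Scalar tail for the dimensions not covered by the vector loop.
    for (; i < dims; i++) {
      result += a.get(ValueLayout.JAVA_BYTE, i) * b.get(ValueLayout.JAVA_BYTE, i);
    }
    return result;
  }
}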


protected final MemorySegment getSegment(int ord, byte[] scratch) throws IOException {
checkOrdinal(ord, maxOrd);
int byteOffset = ord * vectorByteSize; // TODO: random + meta size
ChrisHegarty (Contributor, Author)

I wanna generalise this so we can use it for scalar quantised too - so a vector byte size + N bytes
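
A hypothetical sketch of that generalisation (the names below are invented for illustration, not taken from the PR): each ordinal would occupy the raw vector bytes plus some trailing bytes, so the offset computation becomes:

final class VectorLayout {
  // Hypothetical helper, not the PR's code: each ordinal occupies the raw vector
  // bytes plus extraBytesPerVector trailing bytes (e.g. a scalar quantisation
  // correction term), so the start offset generalises as follows.
  static long startOffset(int ord, int vectorByteSize, int extraBytesPerVector) {
    return (long) ord * (vectorByteSize + extraBytesPerVector);
  }
}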

Member

While I don't foresee us doing this, it is conceivable that the flat vector storage will change.

I think we have adequate test coverage to catch such weird behavior, but maybe we need to change the name or add a comment reflecting that it relies on how Lucene95 stored the flat vectors?

@uschindler (Contributor) left a comment

I don't like the additional Provider interface. There should be only one provider that instantiates all implementations that can be vectorized.

@msokolov (Contributor) commented May 3, 2024

So excited to see this finally come to fruition! No more double-buffering!

@uschindler (Contributor) left a comment

Looks much better now. I will have a closer look later, so not yet a +1

uschindler previously approved these changes May 3, 2024

@uschindler (Contributor)

How can I change the review to "undecided"?

@ChrisHegarty (Contributor, Author)

How can I change the review to "undecided"?

I re-requested your review - so there is no official reviewer yet. Take your time. I have some luceneutil benchmarks to run, etc.

@uschindler (Contributor)

How can I change the review to "undecided"?

I re-requested your review - so there is no official reviewer yet. Take your time. I have some luceneutil benchmarks to run, etc.

The benchmark code seems broken after your changes.

ChrisHegarty dismissed uschindler's stale review May 3, 2024 15:58

Dismissing Uwe's review, since he is undecided. Can be explicitly added later, when we convince him ;-)

@ChrisHegarty (Contributor, Author)

Dismissing Uwe's review, since he is undecided. Can be explicitly added later, when we convince him ;-)

(Oh, this looks harsh!) I hope that I did this right. If not, I apologise. No offence intended. I just want to reflect @uschindler's comment above about being currently undecided.

@uschindler (Contributor) left a comment

To me this looks fine. Very clean.
It's not easy to understand, but at some point we may move this into core classes and get rid of all those wrappers.

One thing to think about: I had another idea, which might also allow direct use of our vectors from NIOFSDirectory and ByteBuffersDirectory (used by NRT):

How about changing from MemorySegment to ByteBuffer views? They can also be used by the vectorization code.

This would also allow us to have the interface we need in the main classes, rather than requiring it to be in the Java 21 sourceSet. In addition, ByteBuffersDirectory and NIOFSDirectory could directly return ByteBuffer views, too.

We may add a method like getByteBufferSlice(....).

What do you think?
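
For illustration, a zero-copy ByteBuffer view over a MemorySegment slice could be obtained as sketched below (the getByteBufferSlice method itself is only a proposal in this thread, not an existing Lucene API; the helper here is purely illustrative):

import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteBufferViews {
  // Sketch only: expose one vector's bytes as a ByteBuffer view without copying.
  // A proposed IndexInput#getByteBufferSlice(...) could return such a view.
  static ByteBuffer vectorView(MemorySegment segment, long byteOffset, int vectorByteSize) {
    return segment.asSlice(byteOffset, vectorByteSize)
        .asByteBuffer()
        .order(ByteOrder.LITTLE_ENDIAN);
  }
}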

@ChrisHegarty (Contributor, Author) commented May 21, 2024

We may add a method like getByteBufferSlice(....).

I experimented locally with something similar before, and the performance impact of converting to/from MemorySegment was horrible. I don't disagree that this could be useful for the NIO/BB directory implementations, but it would not intersect with the memory segment implementation, since converting from ByteBuffer to MemorySegment (and vice versa) performs horribly.

@uschindler (Contributor)

We may add a method like getByteBufferSlice(....).

I experimented locally with something similar before, and the performance impact of converting to/from MemorySegment was horrible. I don't disagree that this could be useful for the NIO/BB directory implementations, but it would not intersect with the memory segment implementation, since converting from ByteBuffer to MemorySegment (and vice versa) performs horribly.

OK, this is not good. From my code review it looked like the DirectByteBuffer also only has an address, and therefore the whole thing should be identical in performance (except that we have to get a MemorySegment slice first and then call asByteBuffer()). So this is interesting, but then we can't do anything.

Maybe there are some optimizations missing in the ByteBuffer support of the Vector API.

So looks fine then. Working on a final review.

@uschindler (Contributor) left a comment

One more question: How can we migrate to a later index format version? Because we have the MemorySegmentFlatVectorScorer (which has no knowledge about the version), yet it returns the Lucene99MemorySegment* classes.

So in my opinion this is only solvable if we also rename MemorySegmentFlatVectorScorer to Lucene99MemorySegmentFlatVectorScorer and adapt the getter name.

Apart from this inconsistency it looks fine. A bit hairy, but if performance is a high priority, I accept this as a step towards getting rid of IndexInput entirely.

@uschindler (Contributor) left a comment

Now all fine.

ChrisHegarty merged commit 05f04aa into apache:main May 21, 2024
3 checks passed
ChrisHegarty deleted the msscorer branch May 21, 2024 16:34
ChrisHegarty added a commit to ChrisHegarty/lucene that referenced this pull request May 21, 2024
…ap (apache#13339)

Add a MemorySegment Vector scorer - for scoring without copying on-heap.

The vector scorer loads values directly from the backing memory segment when available. Otherwise, if the vector data spans across segments the scorer copies the vector data on-heap.

A benchmark shows ~2x performance improvement of this scorer over the default copy-on-heap scorer.

The scorer currently only operates on vectors with an element size of byte. We can evaluate if and how to support floats separately.
@uschindler (Contributor)

Looks like the first Java 22 build also worked fine, so there are no API incompatibilities in the JDK (foreign preview vs final): https://jenkins.thetaphi.de/job/Lucene-main-Linux/48322/consoleText

ChrisHegarty added this to the 9.11.0 milestone May 29, 2024
ChrisHegarty self-assigned this May 29, 2024
@gautamworah96 (Contributor) commented Jun 6, 2024

Hey @ChrisHegarty, do we have an open issue for extending this to float vectors?
If not, I can create one

Edit: nevermind, I see you had pointed this out earlier

The scorer currently only operates on vectors with an element size of byte, since loading vector data from float[] (the fallback) is only supported in JDK 22. We can evaluate if and how to support floats separately. See https://bugs.openjdk.org/browse/JDK-8318678

Perhaps we can implement some logic to leverage fast dot product scoring for float[] if the runtime is >=22.
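
A hedged sketch of what such a runtime gate might look like (illustrative only; Lucene selects its Panama implementations through its own provider lookup, not through this exact check):

final class JdkSupport {
  // Illustrative sketch, not Lucene code: pick the off-heap float scorer only when
  // the running JDK is 22 or newer, otherwise fall back to the copy-on-heap path.
  static boolean offHeapFloatScoringSupported() {
    return Runtime.version().feature() >= 22;
  }
}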

@ChrisHegarty (Contributor, Author) commented Jun 21, 2024

Hey @ChrisHegarty, do we have an open issue for extending this to float vectors? If not, I can create one

Edit: nevermind, I see you had pointed this out earlier

The scorer currently only operates on vectors with an element size of byte, since loading vector data from float[] (the fallback) is only supported in JDK 22. We can evaluate if and how to support floats separately. See https://bugs.openjdk.org/browse/JDK-8318678

Perhaps we can implement some logic to leverage fast dot product scoring for float[] if the runtime is >=22.

I filed the following issue to evaluate adding more off-heap scoring: #13515

@uschindler (Contributor)

Hey @ChrisHegarty, do we have an open issue for extending this to float vectors? If not, I can create one
Edit: nevermind, I see you had pointed this out earlier

The scorer currently only operates on vectors with an element size of byte, since loading vector data from float[] (the fallback) is only supported in JDK 22. We can evaluate if and how to support floats separately. See https://bugs.openjdk.org/browse/JDK-8318678

Perhaps we can implement some logic to leverage fast dot product scoring for float[] if the runtime is >=22.

I filed the following issue to evaluate adding more off-heap scoring.

Hi,
if you map the MemorySegment slice into a ByteBuffer and add a float view on top of it, it might work.

But I remember that you said using Buffers instead of MemorySegments suffered from bad performance. That would have been my favourite way to do off-heap vector operations, because it would also work with the old MMapDirectory and ByteBuffersDirectory.

Uwe
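
For illustration, the view chain described above might look like this (a sketch under the assumption that a FloatBuffer view would be acceptable to the scoring code; this helper is not from the PR):

import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class FloatViews {
  // Sketch only: a float view over an off-heap vector, with no copying involved.
  static FloatBuffer floatView(MemorySegment segment, long byteOffset, int dims) {
    return segment.asSlice(byteOffset, (long) dims * Float.BYTES)
        .asByteBuffer()
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer();
  }
}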
