New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10183: KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues #534
LUCENE-10183: KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues #534
Conversation
…stead of VectorValues
…e-KnnVectorsReader
import org.apache.lucene.util.Bits; | ||
|
||
/** Abstract base class implementing a {@link KnnVectorsReader} that has no vector values. */ | ||
public abstract class EmptyKnnVectorsReader extends KnnVectorsReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class mirrors https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/EmptyDocValuesProducer.java for doc values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's kind of funny that the only way we use it here, it's not actually empty, due to overriding, but it's OK, naming has already been established; let's be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do it for doc values because doc values support 5 different types (numeric, sorted numeric, sorted, sorted set, binary) and the empty producer helps only implement the doc values type that we care about. Since there is a single type of vectors, I don't think we need this empty producer, let's remove it and extend KnnVectorsReader directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense. Thanks for the context! I've removed it and extended directly.
Maybe also and a CHANGELOG entry? I guess it's a somewhat internal API, being in codecs, but it's not marked @experimental, so if someone had implemented their own KnnVectorsWriter codec, they'd need to change it ... |
import org.apache.lucene.util.Bits; | ||
|
||
/** Abstract base class implementing a {@link KnnVectorsReader} that has no vector values. */ | ||
public abstract class EmptyKnnVectorsReader extends KnnVectorsReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's kind of funny that the only way we use it here, it's not actually empty, due to overriding, but it's OK, naming has already been established; let's be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, with a small comment around naming.
@@ -40,7 +41,8 @@ | |||
protected KnnVectorsWriter() {} | |||
|
|||
/** Write all values contained in the provided reader */ | |||
public abstract void writeField(FieldInfo fieldInfo, VectorValues values) throws IOException; | |||
public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorReader) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, we use both vectorsReader
and knnVectorReader
for this variable. It'd be nice to use one name consistently (with plural "vectors").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion! I've updated them to use knnVectorsReader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zacharymorn for cleaning this up, I left a few comments.
import org.apache.lucene.util.Bits; | ||
|
||
/** Abstract base class implementing a {@link KnnVectorsReader} that has no vector values. */ | ||
public abstract class EmptyKnnVectorsReader extends KnnVectorsReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do it for doc values because doc values support 5 different types (numeric, sorted numeric, sorted, sorted set, binary) and the empty producer helps only implement the doc values type that we care about. Since there is a single type of vectors, I don't think we need this empty producer, let's remove it and extend KnnVectorsReader directly?
new EmptyKnnVectorsReader() { | ||
@Override | ||
public VectorValues getVectorValues(String field) throws IOException { | ||
return sortMap != null ? new SortingVectorValues(vectorValues, sortMap) : vectorValues; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect as it would return the same instance every time it is called. We should instantiate a new BufferedVectorValues instance every time this method is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch! I think with the current usage it probably never matters since we only call this once and dispose of the reader, but it would be a trap waiting for some future usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ops sorry good catch! I've fixed it.
…consistent, and fix vector values reuse bug
Just added an entry into |
Thanks @msokolov @jtibshirani @jpountz for the review and suggestions! I've pushed an update to address the feedback. |
Hi @jpountz @msokolov @jtibshirani , just want to check back on this and see if the changes look good? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked it over; looks ok to me. Personally I don't think it's necessary to acknowledge casual reviewers in the CHANGES, but thank you!
Happy new year! Thanks @msokolov for the approval! I'll merge the PR in a few days if there's no further feedback from others. |
…e-KnnVectorsReader
|
||
@Override | ||
public VectorValues getVectorValues(String field) throws IOException { | ||
return new VectorValuesMerger(subs, mergeState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way, the VectorValues
in subs would be shared across all callers of getVectorValues
. I think we need to re-create subs
every time getVectorValues
is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ops sorry, I should have realized this when fixing for VectorValuesWriter
. I've fixed this one as well and asserted for new VectorValues
to be created upon calls in a test.
fieldsWritten.add(fieldInfo.name); | ||
writer.writeField(fieldInfo, values); | ||
// assert that knnVectorsReader#getVectorValues returns different instances upon repeated | ||
// calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the sort of things that we usually check via AssertingKnnVectorsReader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right. I've moved it to AssertingKnnVectorsReader
.
Thanks again for the review and approval @jpountz @msokolov @jtibshirani ! |
…stead of VectorValues (#534)
@zacharymorn as a heads up, I just backported this to |
Sorry I have been away for a bit. Yes that's my intention, thanks @jtibshirani for the backporting! |
Description
KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues
Tests
Passed existing tests via
./gradlew clean; ./gradlew check
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.