Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10183: KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues #534

Conversation

zacharymorn
Copy link
Contributor

Description

KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues

Tests

Passed existing tests via ./gradlew clean; ./gradlew check

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

import org.apache.lucene.util.Bits;

/** Abstract base class implementing a {@link KnnVectorsReader} that has no vector values. */
public abstract class EmptyKnnVectorsReader extends KnnVectorsReader {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's kind of funny that the only way we use it here, it's not actually empty, due to overriding, but it's OK, naming has already been established; let's be consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do it for doc values because doc values support 5 different types (numeric, sorted numeric, sorted, sorted set, binary) and the empty producer helps only implement the doc values type that we care about. Since there is a single type of vectors, I don't think we need this empty producer, let's remove it and extend KnnVectorsReader directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense. Thanks for the context! I've removed it and extended directly.

@msokolov
Copy link
Contributor

Maybe also and a CHANGELOG entry? I guess it's a somewhat internal API, being in codecs, but it's not marked @experimental, so if someone had implemented their own KnnVectorsWriter codec, they'd need to change it ...

import org.apache.lucene.util.Bits;

/** Abstract base class implementing a {@link KnnVectorsReader} that has no vector values. */
public abstract class EmptyKnnVectorsReader extends KnnVectorsReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's kind of funny that the only way we use it here, it's not actually empty, due to overriding, but it's OK, naming has already been established; let's be consistent.

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, with a small comment around naming.

@@ -40,7 +41,8 @@
protected KnnVectorsWriter() {}

/** Write all values contained in the provided reader */
public abstract void writeField(FieldInfo fieldInfo, VectorValues values) throws IOException;
public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorReader)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment, we use both vectorsReader and knnVectorReader for this variable. It'd be nice to use one name consistently (with plural "vectors").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I've updated them to use knnVectorsReader.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zacharymorn for cleaning this up, I left a few comments.

import org.apache.lucene.util.Bits;

/** Abstract base class implementing a {@link KnnVectorsReader} that has no vector values. */
public abstract class EmptyKnnVectorsReader extends KnnVectorsReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do it for doc values because doc values support 5 different types (numeric, sorted numeric, sorted, sorted set, binary) and the empty producer helps only implement the doc values type that we care about. Since there is a single type of vectors, I don't think we need this empty producer, let's remove it and extend KnnVectorsReader directly?

new EmptyKnnVectorsReader() {
@Override
public VectorValues getVectorValues(String field) throws IOException {
return sortMap != null ? new SortingVectorValues(vectorValues, sortMap) : vectorValues;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect as it would return the same instance every time it is called. We should instantiate a new BufferedVectorValues instance every time this method is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good catch! I think with the current usage it probably never matters since we only call this once and dispose of the reader, but it would be a trap waiting for some future usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ops sorry good catch! I've fixed it.

@zacharymorn
Copy link
Contributor Author

Maybe also and a CHANGELOG entry? I guess it's a somewhat internal API, being in codecs, but it's not marked @experimental, so if someone had implemented their own KnnVectorsWriter codec, they'd need to change it ...

Just added an entry into 9.1.0. Shall I also add @lucene.experimental for it as well in case it needs to change again? But I don't know about this API enough to foresee that need...

@zacharymorn
Copy link
Contributor Author

Thanks @msokolov @jtibshirani @jpountz for the review and suggestions! I've pushed an update to address the feedback.

@zacharymorn
Copy link
Contributor Author

Hi @jpountz @msokolov @jtibshirani , just want to check back on this and see if the changes look good?

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked it over; looks ok to me. Personally I don't think it's necessary to acknowledge casual reviewers in the CHANGES, but thank you!

@zacharymorn
Copy link
Contributor Author

I looked it over; looks ok to me. Personally I don't think it's necessary to acknowledge casual reviewers in the CHANGES, but thank you!

Happy new year! Thanks @msokolov for the approval! I'll merge the PR in a few days if there's no further feedback from others.


@Override
public VectorValues getVectorValues(String field) throws IOException {
return new VectorValuesMerger(subs, mergeState);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way, the VectorValues in subs would be shared across all callers of getVectorValues. I think we need to re-create subs every time getVectorValues is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ops sorry, I should have realized this when fixing for VectorValuesWriter. I've fixed this one as well and asserted for new VectorValues to be created upon calls in a test.

fieldsWritten.add(fieldInfo.name);
writer.writeField(fieldInfo, values);
// assert that knnVectorsReader#getVectorValues returns different instances upon repeated
// calls
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the sort of things that we usually check via AssertingKnnVectorsReader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. I've moved it to AssertingKnnVectorsReader.

@zacharymorn
Copy link
Contributor Author

Thanks again for the review and approval @jpountz @msokolov @jtibshirani !

@zacharymorn zacharymorn merged commit d0ad9f5 into apache:main Jan 7, 2022
jtibshirani pushed a commit that referenced this pull request Jan 18, 2022
@jtibshirani
Copy link
Member

@zacharymorn as a heads up, I just backported this to branch_9x (which I think was your original intention since it's listed under 9.1 in the changelog).

@zacharymorn
Copy link
Contributor Author

@zacharymorn as a heads up, I just backported this to branch_9x (which I think was your original intention since it's listed under 9.1 in the changelog).

Sorry I have been away for a bit. Yes that's my intention, thanks @jtibshirani for the backporting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants