Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9855: Rename nn search vector format #207

Closed
wants to merge 6 commits into from

Conversation

mocobeta
Copy link
Contributor

@mocobeta mocobeta commented Jul 8, 2021

This consists of three parts (commits).

0ae2804 refactors o.a.l.codec package.

  • rename VectorFormat to NnVectorsVormat, VectorReader to NnVectorsReader, and VectorWriter to NnVectorsWriter
  • rename Lucene90HnswVectorFormat to Lucene90HnswVectorsFormat (the SPI name was also changed), Lucene90HnswVectorReader to Lucene90HnswVectorsReader, and Lucene90HnswVectorWriter to Lucene90HnswVectorsWriter
  • rename PerFieldVectorFormat to PefFieldNnVectorsFormat
  • also includes some related changes (test cases etc.)

6e7e60d refactors o.a.l.index package.

  • rename VectorValues to NnVectors - I chose shorter name here, but this could be "NnVectorValues"
  • rename VectorValuesWriter to NnVectorsConsumer
  • rename RandomAccessVectorValues to RandomAccessNnVectors
  • rename RandomAccessVectorValuesProducer to RandomAccessNnVectorsProducer
  • also includes some related changes (test cases etc.)

8824a04 refactors o.a.l.document package.

  • rename VectorField to NnVectorField

--
See https://issues.apache.org/jira/browse/LUCENE-9855 for details.

@mocobeta
Copy link
Contributor Author

mocobeta commented Jul 8, 2021

Most changes were done by IDE's refactoring function, besides I manually modified some comments/javadocs that were figured out by git grep.

@@ -247,12 +246,12 @@ public TopDocs search(String field, float[] target, int k, int fanout) throws IO
return null;
}

OffHeapVectorValues vectorValues = getOffHeapVectorValues(fieldEntry);
OffHeapNnVectors nnVectors = getOffHeapNnVectors(fieldEntry);

// use a seed that is fixed for the index so we get reproducible results for the same query
final Random random = new Random(checksumSeed);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PREDICTABLE_RANDOM: This random generator (java.util.Random) is predictable (details)
(at-me in a reply with help or ignore)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but annoying. I think we can safely ignore this. The intent is predictable-randomness. @sonatype-lift

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've recorded this as ignored for this pull request. If you change your mind, just comment @sonatype-lift unignore.

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I found a few error/status messages that we might want to update. I also scanned Lucene90HnswVectorWriter, but didn't find any. I think there may be some messages in the old VectorWriter/NnVectorsWriter that need updating?

@@ -2284,28 +2284,28 @@ static void checkImpacts(Impacts impacts, int lastTarget) {
*
* @lucene.experimental
*/
public static Status.VectorValuesStatus testVectors(
public static Status.NnVectorsStatus testVectors(
CodecReader reader, PrintStream infoStream, boolean failFast) throws IOException {
if (infoStream != null) {
infoStream.print(" test: vectors..............");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we change the text here? "nn vectors"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4d301c5

if (fieldInfo.hasVectorValues()) {
int dimension = fieldInfo.getVectorDimension();
if (fieldInfo.hasNnVectors()) {
int dimension = fieldInfo.getNnVectorDimension();
if (dimension <= 0) {
throw new RuntimeException(
"Field \""
+ fieldInfo.name
+ "\" has vector values but dimension is "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"vector values" -> "nearest-neighbor vector values"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4d301c5

@@ -34,7 +34,7 @@
*
* @lucene.experimental
*/
class VectorValuesWriter {
class NnVectorsConsumer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -7,7 +7,7 @@ http://s.apache.org/luceneversions

New Features

* LUCENE-9322: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida)
* LUCENE-9322 LUCENE-9855: Vector-valued fields, Lucene90 Codec (Mike Sokolov, Julie Tibshirani, Tomoko Uchida)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be grouped with LUCENE-9322?

@mocobeta
Copy link
Contributor Author

Thanks @msokolov for reviewing. I'll wait for a few more days to let others give comments on this then merge it to the upstream if there is no another feedback.

@jpountz
Copy link
Contributor

jpountz commented Jul 11, 2021

rename VectorValues to NnVectors - I chose shorter name here, but this could be "NnVectorValues"

Maybe this one should not be renamed since it isn't related to nearest-neighbor search, it only allows iterating over vectors in doc ID order?

@jtibshirani
Copy link
Member

+1 to keeping the name VectorValues. Otherwise the changes look good to me too.

One thought: I actually find the "Nn" prefix hard to read/ type, it almost looks like a typo. For me either "NN" or "Knn" (which is @msokolov's preferred option) would be more elegant. I really don't mean to nitpick, and understand it was difficult to reach a decision, so feel free to ignore this comment if you'd like :)

@msokolov
Copy link
Contributor

Re: VectorValues; I think we changed it to avoid possible confusion w/term vectors, but perhaps we are agreed that it is distnguished enough already. I'm fine keeping as is. re: Nn vs Knn 🤷

@mocobeta
Copy link
Contributor Author

mocobeta commented Jul 15, 2021

Maybe this one should not be renamed since it isn't related to nearest-neighbor search, it only allows iterating over vectors in doc ID order?

VectorValues may be ok, but I think I don't fully understand your point yet. Besides iterating "vector"s, VectorValues has SimilarityFunction enum that is a clear indication of nearest neighbour search to me. If we will decouple the vector value class from any search algorithm, should we factor out the enum into somewhere?
It seems you all agree with that we should keep the name VectorValues and I don't have strong opinion about it, but would you help me to make things clear before reverting the change?

@jpountz
Copy link
Contributor

jpountz commented Jul 16, 2021

Good point. I think having the similarity function on VectorValues is a bit awkward too since VectorValues is only about retrieving vectors. Maybe we should remove it from VectorValues?

@jpountz
Copy link
Contributor

jpountz commented Jul 16, 2021

I opened #213 to remove VectorValues#getSimilarityFunction.

@mocobeta
Copy link
Contributor Author

mocobeta commented Jul 16, 2021

Thanks @jpountz, I like the simpler VectorValues interface in #213.
This and #213 could conflict with each other on some files I think, maybe I should wait until #213 is merged into main? By then, I'll revert NnVectors to VectorValues.

@jpountz
Copy link
Contributor

jpountz commented Jul 19, 2021

Sounds good @mocobeta I'll merge shortly.

A new question is whether VectorSimilarityFunction needs renaming now that it's been extracted out from VectorValues.

@mocobeta
Copy link
Contributor Author

I opened #218 since I found that it's much easier than to revert previous commits and resolve the conflicts with the latest main here. Can you have a look at it?

"Knn" vs. "Nn" - While both are fine for me, I noticed there are already several nearest-neighbor related classes/variables that have "knn" in their name (e.g. KnnGraphValues). For consistency and visibility, I used "Knn" this time.

@mocobeta
Copy link
Contributor Author

A new question is whether VectorSimilarityFunction needs renaming now that it's been extracted out from VectorValues.

I'm fine with VectorSimilarityFunction and haven't come up with a more suitable name for it, let me know if there are any suggestions on the name.

@mocobeta mocobeta closed this Jul 22, 2021
@mocobeta mocobeta deleted the jira/LUCENE-9855 branch July 22, 2021 06:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants