Move max vector dims limit to Codec #12436
Conversation
Move vector max dimension limit enforcement into the default Codec's KnnVectorsFormat implementation. This allows different implementations of kNN search algorithms to define their own limits on the maximum vector dimensions that they can handle. Closes apache#12309
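The idea behind the PR can be sketched in a minimal, self-contained form. The names below (VectorsFormat, HnswLikeFormat, addField) are illustrative stand-ins, not the real Lucene API: the point is that the vectors format owns the dimension limit, and an oversized field is rejected when it is first added to the writer.

```java
// Illustrative sketch of the pattern this PR introduces; not real Lucene code.
class MaxDimsSketch {

  interface VectorsFormat {
    // Per-field limit, so per-field format wrappers can delegate correctly.
    int getMaxDimensions(String fieldName);
  }

  static class HnswLikeFormat implements VectorsFormat {
    static final int MAX_DIMENSIONS = 1024; // default codec limit

    @Override
    public int getMaxDimensions(String fieldName) {
      return MAX_DIMENSIONS;
    }
  }

  // Mirrors the check proposed for the writer's addField(FieldInfo).
  static void addField(VectorsFormat format, String fieldName, int vectorDimension) {
    int max = format.getMaxDimensions(fieldName);
    if (vectorDimension > max) {
      throw new IllegalArgumentException(
          "Number of dimensions " + vectorDimension + " for field " + fieldName
              + " exceeds the limit of " + max);
    }
  }
}
```

With this shape, a codec that wants a different limit only needs to return a different value from getMaxDimensions.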
pf.fieldName,
s.vectorDimension,
indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
}
This is probably not going to do what we want when a PerFieldKnnVectorsFormat is used, as this would check the limit on PerFieldKnnVectorsFormat rather than on the actual format that is used for pf.fieldName. Maybe getMaxDimensions should be on KnnVectorsWriter and we could forward to pf.knnVectorsWriter here for checking?
Actually my suggestion wouldn't work, as the writer would already be created with the number of dimensions of the field type when we run the check. So I guess we either need to add the field name to getMaxDimensions or make the codec responsible for performing the check rather than IndexingChain.
I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. What about making the codec responsible for checking the limit? Something like below:
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
index cb3e5ef8b10..6c365e53528 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
@@ -108,6 +108,9 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
public static final int VERSION_START = 0;
public static final int VERSION_CURRENT = VERSION_START;
+ /** A maximum number of vector dimensions supported by this codec */
+ public static final int MAX_DIMENSIONS = 1024;
+
/**
* A maximum configurable maximum max conn.
*
@@ -177,7 +180,7 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
@Override
public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
- return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth);
+ return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth, MAX_DIMENSIONS);
}
@Override
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
index 5358d66f16e..196f12a21ad 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
@@ -60,13 +60,15 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
private final IndexOutput meta, vectorData, vectorIndex;
private final int M;
private final int beamWidth;
+ private final int maxDimension;
private final List<FieldWriter<?>> fields = new ArrayList<>();
private boolean finished;
- Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth) throws IOException {
+ Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth, int maxDimension) throws IOException {
this.M = M;
this.beamWidth = beamWidth;
+ this.maxDimension = maxDimension;
segmentWriteState = state;
String metaFileName =
IndexFileNames.segmentFileName(
@@ -117,6 +119,9 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
@Override
public KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException {
+ if (fieldInfo.getVectorDimension() > maxDimension) {
+ throw new IllegalArgumentException("Number of dimensions " + fieldInfo.getVectorDimension() + " for field " + fieldInfo.name + " exceeds the limit of " + maxDimension);
+ }
FieldWriter<?> newField =
FieldWriter.create(fieldInfo, M, beamWidth, segmentWriteState.infoStream);
fields.add(newField);
@jpountz Thank you for the additional feedback.
I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it.
This is not really a hot code path. We ask for getCodec().knnVectorsFormat().getMaxDimensions in the initializeFieldInfo function, which happens only once per new field per segment.
What about making the codec responsible for checking the limit?
Thanks for the suggestion, I experimented with this idea, and encountered the following difficulty with it:
- We need to create a new FieldInfo before passing it to KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo).
- The way we create it is FieldInfo fi = fieldInfos.add(...), by adding to the global fieldInfos. This means that if the FieldInfo contains an incorrect number of dimensions, it will be stored like this in the global fieldInfos, and we can't change it (for example with a second document with a correct number of dims).

Maybe as an alternative we can do the validation as a separate method of KnnVectorsWriter:
public void validateFieldDims(int dims)
What do you think?
Yes, this looks fine. The hashtable lookup is no issue at all.
As getMaxDimensions() is a public codec method, it lets us do the check here. A separate validateFieldDims() is not needed.
I only don't like the long call chain to actually get the vectors format from the indexWriterConfig, but I can live with that.
Add a field name for getMaxDimensions function
Thanks @mayya-sharipova, it took me time to understand how your change works but I think it's good, in particular the fact that it's transactional: a field will not make it to the IndexWriter's field infos if it has a number of dimensions above the limit.
@uschindler I'm curious what you think of this change since you seemed to support the idea of moving the limit to the codec in the past.
* @param fieldName the field name
* @return the maximum number of vector dimensions.
*/
public int getMaxDimensions(String fieldName) {
In main branch this should be abstract, only in 9.x we should have a default impl.
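The two variants being asked for could look roughly like this. This is an illustrative sketch, not real Lucene code: in main the method would be abstract so every format must declare its limit, while 9.x would keep a default implementation for backwards compatibility.

```java
// main branch: subclasses are forced to state their own limit.
abstract class KnnVectorsFormatMainSketch {
  public abstract int getMaxDimensions(String fieldName);
}

// 9.x branch: existing formats compile unchanged and inherit the old limit.
abstract class KnnVectorsFormat9xSketch {
  static final int DEFAULT_MAX_DIMENSIONS = 1024;

  public int getMaxDimensions(String fieldName) {
    return DEFAULT_MAX_DIMENSIONS;
  }
}
```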
@@ -80,6 +80,11 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException
    return new FieldsReader(state);
  }

  @Override
is this the only subclass of KnnVectorsFormat that we have?
In addition, we should also add explicit number of dimensions in backwards codecs, because at the time when it was implemented 1024 was their default. In backwards codec the method should be final, IMHO.
We have SimpleTextKnnVectorsFormat too.
"f",
new float[getVectorsMaxDimensions("f") + 1],
VectorSimilarityFunction.DOT_PRODUCT));
Exception exc = expectThrows(IllegalArgumentException.class, () -> w.addDocument(doc));
So basically the check is now delayed to addDocument()? Great!
I was afraid that the check would come later, while indexing or flushing is happening. So to me this looks good.
This looks fine to me. I am happy that we do not have stupid system properties. Somebody who wants to raise the limit (or lower it to 32 like in the test) can simply implement their own codec. One question I have: what happens if you open an index with a higher limit in field infos and you use the default codec? I think this is unsupported, but in that case the implementor of the codec should possibly use their own codec name.
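What "implement their own codec" could amount to is small. This is a hypothetical sketch with made-up names and values (VectorsFormat, CustomLimitFormat, the 2048 limit), not the real Lucene API: a custom format simply reports a different limit, and the write path enforces it.

```java
// Illustrative sketch: a custom format reporting a non-default limit.
class CustomLimitSketch {

  interface VectorsFormat {
    int getMaxDimensions(String fieldName);
  }

  static class DefaultFormat implements VectorsFormat {
    @Override
    public int getMaxDimensions(String fieldName) {
      return 1024; // the default codec's limit
    }
  }

  // Raising the limit above the default, or lowering it to 32 as in the
  // test, needs no other machinery than returning a different value.
  static class CustomLimitFormat implements VectorsFormat {
    private final int maxDims;

    CustomLimitFormat(int maxDims) {
      this.maxDims = maxDims;
    }

    @Override
    public int getMaxDimensions(String fieldName) {
      return maxDims;
    }
  }
}
```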
Thanks @jpountz and @uschindler for the reviews. I will do the following:
…ensionTooLarge Depending on whether a document with dimensions > maxDims is created on a new segment or an already existing segment, we may get different error messages. This fix adds another possible error message we may get. Relates to apache#12436
This change doesn't touch the read logic, it only adds write-time validation. So if you first index vectors above the default limit using a custom codec that reuses the same codec name as the default codec and then open this index with the default codec, you will be able to perform reads, but writes will fail. Using a different codec name would certainly make things simpler, to force Lucene to use the same codec at read time instead of the default codec.
That's what I expected. Thanks.
+ "]"
+ "vector's dimensions must be <= ["
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: #12605 proposes to have a space before the vector's dimensions words here.