-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[ES|QL] addressing vector similarity concurrency issue with byte vectors #137883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ES|QL] addressing vector similarity concurrency issue with byte vectors #137883
Conversation
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
|
Hi @pmpailis, I've created a changelog YAML for you. |
…y_issue_with_byte_vectors
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @pmpailis ! LGTM
A nit about adding two different classes - but I'm not sure it will be worth doing
|
|
||
| public byte[] vectorAsBytes() { | ||
| assert vectorAsBytes != null : "vectorAsBytes is null, call forByteVector() first"; | ||
| assert vectorAsBytes != null : "vectorAsBytes is null, maybe incorrect element type during construction?"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a similar assertion to the vector() method.
I think it would be cleaner to have two separate config classes 🤔 . WDYT? Is it worthwhile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this would also introduce some overhead in updating DenseVectorSimilarityProcessor and possibly the calls for the similarity functions.
I've updated to make use of org.elasticsearch.search.vectors.VectorData in 31ff07b, but honestly I don't feel strong about either approach. IMO it should be ok to leave as-is, but happy to update or split if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep as it is - thanks for checking it
| List<Object[]> params = new ArrayList<>(); | ||
|
|
||
| for (ElementType elementType : Set.of(ElementType.FLOAT, ElementType.BYTE)) { | ||
| for (ElementType elementType : Set.of(ElementType.FLOAT, ElementType.BYTE, ElementType.BIT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bit requires dims % 8 == 0, idk if that will cause flakiness here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When setting up the docs we have:
for (int j = 0; j < numDims; j++) {
switch (elementType) {
case FLOAT -> vector.add(randomFloat());
case BYTE, BIT -> vector.add((byte) randomIntBetween(-128, 127));
So, I think that while dims might be a bit misleading (as we're adding a byte for every dim) and also given that we don't specify dims in the mapping, we should probably be ok. Will update though to make it a bit more clear.
…ng tests to properly ensure valid dims for bit vectors
…h_byte_vectors' of github.com:pmpailis/elasticsearch into fix_137625_esql_vector_similarity_concurrency_issue_with_byte_vectors
…y_issue_with_byte_vectors
…r_similarity_concurrency_issue_with_byte_vectors
|
run elasticsearch-ci/part-5 |
…y_issue_with_byte_vectors
…r_similarity_concurrency_issue_with_byte_vectors
|
Just a heads up, this got muted recently. I changed the name of one of the parameters away from a java-default toString. |
…r_similarity_concurrency_issue_with_byte_vectors
|
Thanks @nik9000 ! Merged main & unmuted the failing |
For ES|QL vector similarity queries, it seems that we initialize the
VectorSimilarityFunctionConfigonce and then pass it to the workers. However, each worker could potentially update the underlying byte representation depending on theelementTypeinelasticsearch/server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
Line 2871 in 9a9bf3a
Given that the same instance might be accessed by different threads, there could be a scenario where the supporting
byte[]for a worker might be reset by another worker through the:elasticsearch/server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
Line 3419 in 9a9bf3a
so when we try to compute the vector similarity, we might end up with incorrect scores.
In this PR, we eagerly try to identify whether we need a
byteorfloatarray and initialize/read data accordingly, to avoid both passing this responsibility to each worker and the extra space from thefloatarray when we have to deal withbytevectors.Example log output (using the
6ea437abas reference) where the0...0vectors indicate the issue:Closes #137625