Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dims parameter to dense_vector mapping #43444

Conversation

mayya-sharipova
Copy link
Contributor

Typically, dense vectors of both documents and queries must have the same
number of dimensions. Different number of dimensions among documents
or query vector indicate an error. This PR enforces that all vectors
for the same field have the same number of dimensions. It also enforces
that query vectors have the same number of dimensions as doc vectors.

@mayya-sharipova mayya-sharipova added the :Search Relevance/Ranking Scoring, rescoring, rank evaluation. label Jun 20, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/default-distro

Typically, dense vectors of both documents and queries must have the same
number of dimensions. Different number of dimensions among documents
or query vector indicate an error. This PR enforces that all vectors
for the same field have the same number of dimensions. It also enforces
that query vectors have the same number of dimensions.
@mayya-sharipova mayya-sharipova force-pushed the add_number_of_dims_to_dense_vector branch from 3179461 to bac0c34 Compare June 21, 2019 13:38
Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mayya-sharipova, this change makes sense to me. For the kinds of use cases we're targeting with vector fields, it does seem that having vectors with mixed dimensions almost always indicates an error, and it is helpful to be alerted to it. I can't think of a case where this behavior would be overly restrictive (and couldn't be worked around easily).

While reviewing this change, I noticed that for the cosineSimilarity and dotProduct functions, we accept documents without the field and always return 0. This behavior isn't very intuitive to me -- I would expect for the similarity function to error, and that the user should explicitly handle documents without the field in the script. The current behavior could mask errors in the data (the type of error we're trying to avoid with the current change).

int offset = 0;
int dim = 0;
for (Token token = context.parser().nextToken(); token != Token.END_ARRAY; token = context.parser().nextToken()) {
if (dim++ >= dims) {
throw new MapperParsingException("Field [" + name() + "] of type [" + typeName() + "] of doc [" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like IllegalArgumentException would make more sense here, since we are parsing a document and not a mapping.

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani I agree with you. This should be IllegalArgumentException, but these exceptions will be caught by DocumentParser::parseDocument and wrapped into the MapperParsingException, which eventually come up as MapperParsingExceptions with a cause underneath it as IllegalArgumentException.
That's why I thought it makes sense to throw MapperParsingException from the beginning.
What do you think - does it still makes sense to have IllegalArgumentException underneath MapperParsingException?
Actually, I would really benefit from more clarification around exceptions, may be worth discussion with the team.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jun 28, 2019

@jtibshirani Thanks for your feedback, the next commit in the PR tries to address it.
Two things left unaddressed:

  1. Exceptions. I was wondering if it makes difference for you to have a single-level MapperParsingException or MapperParsingException with wrapped IllegalArgumentException?

  2. the cosineSimilarity and dotProduct functions, we accept documents without the field and always return 0.

The way the API and VectorScriptDocValues are designed right now it is NOT possible to check if a document has a value for a vector field. Usually for other fields this is done by doc['my_field'].value == null or doc['my_field'].size() == 0, but VectorScriptDocValues doesn't have getValue method and get and size methods are not supported. This is the decision we made not to leak the internal representations of vectors.

As there is not way for users to skip documents without a vector field, I chose to have a default value 0 for this case. But I wonder how much this is an issue, and if it is, we may redesign the vectors API and VectorScriptDocValues to have access to value instead.

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptions. I was wondering if it makes difference for you to have a single-level MapperParsingException or MapperParsingException with wrapped IllegalArgumentException?

Thanks @mayya-sharipova, I didn't see that MapperParsingException is also used when parsing documents. I don't have a strong preference, I think I'd still lean towards IllegalArgumentException since it's more commonly used in this context (I verified this through a quick code search).

The way the API and VectorScriptDocValues are designed right now it is NOT possible to check if a document has a value for a vector field.

I wonder if it'd at least make sense to support size()? In any case, it is not too relevant to this PR and I agree we can wait for more feedback to see if we need to rethink the VectorScriptDocValues API.

The last comment I had is that we should probably update the line "
If a document’s dense vector field has a number of dimensions different from the query’s vector..." from this documentation: https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-script-score-query.html#vector-functions

@@ -24,7 +22,8 @@ PUT my_index
"mappings": {
"properties": {
"my_vector": {
"type": "dense_vector"
"type": "dense_vector", <1>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this callout <1> should go on the next line.

@@ -80,11 +97,17 @@ public DenseVectorFieldMapper build(BuilderContext context) {
@Override
public Mapper.Builder<?,?> parse(String name, Map<String, Object> node, ParserContext parserContext) throws MapperParsingException {
DenseVectorFieldMapper.Builder builder = new DenseVectorFieldMapper.Builder(name);
return builder;
Object dimsField = node.remove("dims");
if (dimsField == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the same reasoning as above, it would be nice to move this check to DenseVectorFieldMapper.Builder#build.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani Thanks, but we need this check here to ensure we don't get NullPointerException on the line that follows after that:

int dims = XContentMapValues.nodeIntegerValue(dimsField);

Also, I was thinking that public Builder dims(int dims) can accept dims parameter as a primitive integer.

Copy link
Contributor

@jtibshirani jtibshirani Jul 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach would be to check if the node is null, and only parse it and call Builder#dims if it is not null. Then Builder#build could complain if Builder#dims has never been called. Anyways, this is a small comment, the change looks good to me!

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine test this please

@mayya-sharipova
Copy link
Contributor Author

@jtibshirani Thanks Julie, I have addressed your 2nd feedback in the last commit.

I wonder if it'd at least make sense to support size()? In any case, it is not too relevant to this PR and I agree we can wait for more feedback to see if we need to rethink the VectorScriptDocValues API

Right, let's do it in a separate PR.

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine test this please

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

1 similar comment
@mayya-sharipova
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

@mayya-sharipova
Copy link
Contributor Author

@elasticmachine test this please

@mayya-sharipova mayya-sharipova merged commit 66e1e56 into elastic:master Jul 2, 2019
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jul 2, 2019
Typically, dense vectors of both documents and queries must have the same
number of dimensions. Different number of dimensions among documents
or query vector indicate an error. This PR enforces that all vectors
for the same field have the same number of dimensions. It also enforces
that query vectors have the same number of dimensions.
@mayya-sharipova mayya-sharipova added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Search Relevance/Ranking Scoring, rescoring, rank evaluation. labels Jul 2, 2019
mayya-sharipova added a commit that referenced this pull request Jul 3, 2019
Typically, dense vectors of both documents and queries must have the same
number of dimensions. Different number of dimensions among documents
or query vector indicate an error. This PR enforces that all vectors
for the same field have the same number of dimensions. It also enforces
that query vectors have the same number of dimensions.
mayya-sharipova added a commit that referenced this pull request Jul 3, 2019
After backport to 7.3 adjust yml tests

Relates to #43444
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v7.3.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants