Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a `feature_vector` field. #31102

Merged
merged 5 commits into from Jun 7, 2018
Merged

Add a `feature_vector` field. #31102

merged 5 commits into from Jun 7, 2018

Conversation

@jpountz
Copy link
Contributor

@jpountz jpountz commented Jun 5, 2018

This field is similar to the feature field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.

Relates #27552

This field is similar to the `feature` field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.

Relates #27552
@elasticmachine
Copy link
Collaborator

@elasticmachine elasticmachine commented Jun 5, 2018

Pinging @elastic/es-search-aggs

@gibrown
Copy link
Contributor

@gibrown gibrown commented Jun 5, 2018

This looks exciting and does somewhat sound better to me than #27552 because the indexing syntax is a lot cleaner. But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there (though would be limited to integers).

But I do like the ability to specify floats associated with terms (I actually left that off the other thread). I don't fully understand the limitations, but supporting negative numbers would be pretty useful also. Not fully clear to me from the current docs if zero and floats are supported. I think it is saying that floating point is fine, but I'm not sure why it doesn't say it is just a Java float? I guess maybe this has to do with how the performance gets optimized?

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

@jpountz
Copy link
Contributor Author

@jpountz jpountz commented Jun 6, 2018

But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there

What kind of optimizations are you thinking about? I can't think of an optimization that gets disabled because of the way that we abstract the term frequency here.

supporting negative numbers would be pretty useful also

This is challenging due to the fact that scores are not allowed to be negative. We could take negative values in, but we'd have to turn them into a positive score somehow. Which then doesn't play well with the fact that documents without a term would get a score contribution of 0. It makes it hard to boost hits negatively. I think these use-cases would have to keep using function_score.

Not fully clear to me from the current docs if zero and floats are supported.

I'll clarify. Positive floats are accepted, 0 is not, and we do not retain full precision, only 9 bits which translates to (roughly) a 0.4% error. This allows to store term frequencies on 16 bits, which helps keep the index space-efficient.

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

This is good feedback, thanks!

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Thanks Adrien, very interesting and useful feature

throw new IllegalArgumentException("[feature_vector] fields do not support indexing multiple values for the same " +
"feature [" + key + "] in the same document");
}
context.doc().addWithKey(key, new FeatureField(name(), feature, value));

This comment has been minimized.

@mayya-sharipova

mayya-sharipova Jun 6, 2018
Contributor

Did I understand correctly - if we index in ES:

"topics": {
	"politics": 20,
	"economics": 50
}

in Lucene, it will translate to a field with a name topics, with values: "politics" with freq 20, and "economics" with freq 50?

This comment has been minimized.

@jpountz

jpountz Jun 6, 2018
Author Contributor

The actual term frequency will actually be Float.floatToIntBits(20) >>> 15 rather than 20, but other than that this is correct. You can look at FeatureField.tokenStream for more details. So basically it is a float but with only 8 bits for the mantissa rather than 23.

Copy link
Member

@colings86 colings86 left a comment

@jpountz I left some comments on the documentation but this LGTM and seems like a great enhancement on top of the feature field 😄

NOTE: `feature_vector` fields only support single-valued features and strictly
positive values. Multi-valued fields and negative values will be rejected.

NOTE: `feature_vector` fields do not support querying, sorting or aggregating.

This comment has been minimized.

@colings86

colings86 Jun 6, 2018
Member

This sounds a bit weird because we say querying is not supported but then say later that is is supported via the feature query. Maybe we should reword this to say:

`feature_vector` fields do not support sorting or aggregating and may 
only be queried using <<query-dsl-feature-query,`feature`>> queries.
}
}
}
--------------------------------------------------

This comment has been minimized.

@colings86

colings86 Jun 6, 2018
Member

I wonder if at least one of the example values above should be a fractional value to make it clearer that the field type accepts float values and not just integer values? It might also be worth adding a note to this effect below?

This comment has been minimized.

@jpountz

jpountz Jun 6, 2018
Author Contributor

Agreed, @gibrown had a similar comment

jpountz added 3 commits Jun 6, 2018
@gibrown
Copy link
Contributor

@gibrown gibrown commented Jun 6, 2018

That all makes sense to me, thanks for the explanations. Very cool, and ya I think this would address a number of my use cases.

@jpountz jpountz merged commit 458bca1 into elastic:master Jun 7, 2018
3 checks passed
3 checks passed
CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci Build finished.
Details
elasticsearch-ci/packaging-sample Build finished.
Details
@jpountz jpountz deleted the jpountz:feature_vector branch Jun 7, 2018
@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.