Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a feature_vector field. #31102

Merged
merged 5 commits into from
Jun 7, 2018
Merged

Add a feature_vector field. #31102

merged 5 commits into from
Jun 7, 2018

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jun 5, 2018

This field is similar to the feature field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.

Relates #27552

This field is similar to the `feature` field but is better suited to index
sparse feature vectors. A use-case for this field could be to record topics
associated with every documents alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged user is interested in.

Relates elastic#27552
@jpountz jpountz added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types v7.0.0 labels Jun 5, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@gibrown
Copy link
Contributor

gibrown commented Jun 5, 2018

This looks exciting and does somewhat sound better to me than #27552 because the indexing syntax is a lot cleaner. But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there (though would be limited to integers).

But I do like the ability to specify floats associated with terms (I actually left that off the other thread). I don't fully understand the limitations, but supporting negative numbers would be pretty useful also. Not fully clear to me from the current docs if zero and floats are supported. I think it is saying that floating point is fine, but I'm not sure why it doesn't say it is just a Java float? I guess maybe this has to do with how the performance gets optimized?

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

@jpountz
Copy link
Contributor Author

jpountz commented Jun 6, 2018

But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there

What kind of optimizations are you thinking about? I can't think of an optimization that gets disabled because of the way that we abstract the term frequency here.

supporting negative numbers would be pretty useful also

This is challenging due to the fact that scores are not allowed to be negative. We could take negative values in, but we'd have to turn them into a positive score somehow. Which then doesn't play well with the fact that documents without a term would get a score contribution of 0. It makes it hard to boost hits negatively. I think these use-cases would have to keep using function_score.

Not fully clear to me from the current docs if zero and floats are supported.

I'll clarify. Positive floats are accepted, 0 is not, and we do not retain full precision, only 9 bits which translates to (roughly) a 0.4% error. This allows to store term frequencies on 16 bits, which helps keep the index space-efficient.

I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.

This is good feedback, thanks!

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Adrien, very interesting and useful feature

throw new IllegalArgumentException("[feature_vector] fields do not support indexing multiple values for the same " +
"feature [" + key + "] in the same document");
}
context.doc().addWithKey(key, new FeatureField(name(), feature, value));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I understand correctly - if we index in ES:

"topics": {
	"politics": 20,
	"economics": 50
}

in Lucene, it will translate to a field with a name topics, with values: "politics" with freq 20, and "economics" with freq 50?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual term frequency will actually be Float.floatToIntBits(20) >>> 15 rather than 20, but other than that this is correct. You can look at FeatureField.tokenStream for more details. So basically it is a float but with only 8 bits for the mantissa rather than 23.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz I left some comments on the documentation but this LGTM and seems like a great enhancement on top of the feature field 😄

NOTE: `feature_vector` fields only support single-valued features and strictly
positive values. Multi-valued fields and negative values will be rejected.

NOTE: `feature_vector` fields do not support querying, sorting or aggregating.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds a bit weird because we say querying is not supported but then say later that is is supported via the feature query. Maybe we should reword this to say:

`feature_vector` fields do not support sorting or aggregating and may 
only be queried using <<query-dsl-feature-query,`feature`>> queries.

}
}
}
--------------------------------------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if at least one of the example values above should be a fractional value to make it clearer that the field type accepts float values and not just integer values? It might also be worth adding a note to this effect below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, @gibrown had a similar comment

@gibrown
Copy link
Contributor

gibrown commented Jun 6, 2018

That all makes sense to me, thanks for the explanations. Very cool, and ya I think this would address a number of my use cases.

@jpountz jpountz merged commit 458bca1 into elastic:master Jun 7, 2018
@jpountz jpountz deleted the feature_vector branch June 7, 2018 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants