Add a feature_vector field. #31102
This field is similar to the `feature` field but is better suited to indexing sparse feature vectors. A use-case for this field could be to record topics associated with every document alongside a metric that quantifies how well the topic is connected to this document, and then boost queries based on the topics that the logged-in user is interested in. Relates elastic#27552
Pinging @elastic/es-search-aggs
This looks exciting and sounds somewhat better to me than #27552 because the indexing syntax is a lot cleaner. But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there (though it would be limited to integers). I do like the ability to specify floats associated with terms (I actually left that off the other thread). I don't fully understand the limitations, but supporting negative numbers would be pretty useful also. It is not fully clear to me from the current docs whether zero and floats are supported. I think it is saying that floating point is fine, but I'm not sure why it doesn't just say it is a Java float? I guess maybe this has to do with how the performance gets optimized? I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses, where the text has to get analyzed.
What kind of optimizations are you thinking about? I can't think of an optimization that gets disabled because of the way that we abstract the term frequency here.
This is challenging due to the fact that scores are not allowed to be negative. We could take negative values in, but we'd have to turn them into a positive score somehow. Which then doesn't play well with the fact that documents without a term would get a score contribution of 0. It makes it hard to boost hits negatively. I think these use-cases would have to keep using
I'll clarify. Positive floats are accepted, 0 is not, and we do not retain full precision, only 9 bits, which translates to (roughly) a 0.4% error. This allows storing term frequencies in 16 bits, which helps keep the index space-efficient.
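To make the precision claim concrete, here is a minimal Java sketch that measures the relative error caused by dropping the 15 low mantissa bits of a float. The class and method names (`FeaturePrecisionSketch`, `truncate`, `relativeError`) are hypothetical and only illustrate the arithmetic; this is not the code from the PR.

```java
final class FeaturePrecisionSketch {
    // Hypothetical helper: keep the sign, the exponent and the top 8 mantissa
    // bits of a float, zeroing the 15 low mantissa bits.
    static float truncate(float value) {
        int bits = Float.floatToIntBits(value);
        return Float.intBitsToFloat((bits >>> 15) << 15);
    }

    // Relative error introduced by truncation, for a strictly positive value.
    static double relativeError(float value) {
        return (value - (double) truncate(value)) / value;
    }

    public static void main(String[] args) {
        float[] samples = {0.3f, 1.0f, 20f, 50f, 1234.5678f};
        for (float v : samples) {
            System.out.printf("%s -> %s (relative error %.4f%%)%n",
                    v, truncate(v), 100 * relativeError(v));
        }
    }
}
```

For normal positive floats the error stays below 2^-8 (about 0.39%), which matches the "roughly 0.4%" figure: 8 stored mantissa bits plus the implicit leading bit give 9 significant bits.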
This is good feedback, thanks!
Thanks Adrien, very interesting and useful feature
throw new IllegalArgumentException("[feature_vector] fields do not support indexing multiple values for the same " +
"feature [" + key + "] in the same document");
}
context.doc().addWithKey(key, new FeatureField(name(), feature, value));
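For readers without the full diff, the check quoted above can be sketched in isolation. This is a minimal stand-alone illustration, assuming features arrive as parallel arrays and are keyed by feature name alone; the class `FeatureVectorValidationSketch`, the method `collect`, and the wording of the positive-value message are hypothetical, and the real mapper works against the document parse context instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FeatureVectorValidationSketch {
    // Hypothetical helper: collect (feature, value) pairs for one document,
    // rejecting non-positive values and repeated features.
    static Map<String, Float> collect(String[] features, float[] values) {
        Map<String, Float> seen = new LinkedHashMap<>();
        for (int i = 0; i < features.length; i++) {
            float value = values[i];
            // Only strictly positive values are accepted; NaN also fails this test.
            if (!(value > 0f)) {
                throw new IllegalArgumentException(
                        "feature [" + features[i] + "] must have a strictly positive value, got " + value);
            }
            // A feature may appear at most once per document.
            if (seen.putIfAbsent(features[i], value) != null) {
                throw new IllegalArgumentException(
                        "[feature_vector] fields do not support indexing multiple values for the same "
                                + "feature [" + features[i] + "] in the same document");
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(new String[] {"politics", "economics"}, new float[] {20f, 50f}));
    }
}
```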
Did I understand correctly - if we index in ES:
"topics": {
"politics": 20,
"economics": 50
}
in Lucene, it will translate to a field with the name topics, with values: "politics" with freq 20, and "economics" with freq 50?
The actual term frequency will be Float.floatToIntBits(20) >>> 15 rather than 20, but other than that this is correct. You can look at FeatureField.tokenStream for more details. So basically it is a float but with only 8 bits for the mantissa rather than 23.
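The encoding described in this comment can be tried out with a short Java sketch. `encode` mirrors the expression quoted above; `decode` is an assumed inverse written for illustration, and the class name `FeatureFrequencySketch` is hypothetical.

```java
final class FeatureFrequencySketch {
    // Drop the 15 low mantissa bits, as in Float.floatToIntBits(value) >>> 15.
    // For positive floats the sign bit is 0, so the result fits in 16 bits.
    static int encode(float value) {
        return Float.floatToIntBits(value) >>> 15;
    }

    // Assumed inverse: shift the bits back into place and reinterpret as a float.
    static float decode(int freq) {
        return Float.intBitsToFloat(freq << 15);
    }

    public static void main(String[] args) {
        int freq = encode(20f);
        System.out.println(freq);          // prints 33600
        System.out.println(decode(freq));  // prints 20.0
        // Values that need more than 8 mantissa bits come back slightly smaller.
        System.out.println(decode(encode(0.3f)));
    }
}
```

So for the example above, "politics" would be indexed with a term frequency of 33600 rather than 20, and the original value is recovered by reversing the shift.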
@jpountz I left some comments on the documentation but this LGTM and seems like a great enhancement on top of the feature field 😄
NOTE: `feature_vector` fields only support single-valued features and strictly
positive values. Multi-valued fields and negative values will be rejected.

NOTE: `feature_vector` fields do not support querying, sorting or aggregating.
This sounds a bit weird because we say querying is not supported but then say later that it is supported via the feature query. Maybe we should reword this to say:
`feature_vector` fields do not support sorting or aggregating and may
only be queried using <<query-dsl-feature-query,`feature`>> queries.
}
}
}
--------------------------------------------------
I wonder if at least one of the example values above should be a fractional value to make it clearer that the field type accepts float values and not just integer values? It might also be worth adding a note to this effect below?
Agreed, @gibrown had a similar comment
That all makes sense to me, thanks for the explanations. Very cool, and ya I think this would address a number of my use cases.