This field is similar to the `feature` field but is better suited to indexing sparse feature vectors. A use-case for this field could be to record the topics associated with every document alongside a metric that quantifies how well each topic is connected to the document, and then boost queries based on the topics that the logged-in user is interested in. Relates #27552
This looks exciting and does somewhat sound better to me than #27552 because the indexing syntax is a lot cleaner. But I would have expected that being able to use the TF data would give even better performance by reusing all of the existing optimizations there (though would be limited to integers).
But I do like the ability to specify floats associated with terms (I actually left that off the other thread). I don't fully understand the limitations, but supporting negative numbers would be pretty useful also. Not fully clear to me from the current docs if zero and floats are supported. I think it is saying that floating point is fine, but I'm not sure why it doesn't say it is just a Java float? I guess maybe this has to do with how the performance gets optimized?
I like the idea of this sort of syntax for setting TF in the doc better than the kinda odd tokenization that DelimitedTermFrequencyTokenFilter uses where the text has to get analyzed.
What kind of optimizations are you thinking about? I can't think of an optimization that gets disabled because of the way that we abstract the term frequency here.
This is challenging due to the fact that scores are not allowed to be negative. We could take negative values in, but we'd have to turn them into a positive score somehow. Which then doesn't play well with the fact that documents without a term would get a score contribution of 0. It makes it hard to boost hits negatively. I think these use-cases would have to keep using
I'll clarify. Positive floats are accepted, 0 is not, and we do not retain full precision: only 9 significant bits are kept, which translates to (roughly) a 0.4% error. This allows term frequencies to be stored in 16 bits, which helps keep the index space-efficient.
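To make the precision trade-off concrete, here is a minimal sketch of one way a positive float can be squeezed into 16 bits while keeping 9 significant bits. It assumes the value is truncated to the top bits of its IEEE 754 single-precision representation (sign and 8 exponent bits, plus 8 stored mantissa bits and the implicit leading one). The function names `encode_feature_value` and `decode_feature_value` are hypothetical and only illustrate the arithmetic; this is not presented as the code in this PR.

```python
import math
import struct


def encode_feature_value(value: float) -> int:
    """Truncate a positive, finite float to a 16-bit integer (hypothetical helper).

    Assumption: the 15 lowest mantissa bits of the 32-bit float pattern are
    dropped, leaving 8 exponent bits and 8 stored mantissa bits.
    """
    if not math.isfinite(value) or value <= 0:
        raise ValueError("only positive, finite values are accepted")
    # Reinterpret the value as a 32-bit IEEE 754 float bit pattern.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    # The sign bit is 0 for positive values, so the result fits in 16 bits.
    return bits >> 15


def decode_feature_value(encoded: int) -> float:
    """Rebuild the (truncated) float from its 16-bit encoding."""
    return struct.unpack(">f", struct.pack(">I", encoded << 15))[0]


def relative_error(value: float) -> float:
    """Relative error introduced by an encode/decode round trip."""
    decoded = decode_feature_value(encode_feature_value(value))
    return abs(value - decoded) / value
```

Under this assumption the round trip never loses more than 2^-8 (about 0.39%) of the value, which matches the "roughly 0.4%" figure. Because the bit patterns of positive floats sort in the same order as the floats themselves, the encoding also preserves ordering. Zero and negative numbers are rejected here, in line with the restriction described above.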
This is good feedback, thanks!