Make significant terms work on fields that are indexed with points. #18031
Conversation
Many thanks for this, @jpountz. However, if we go down this route of deriving counts on the fly, what do you think of an approach where we try to re-use the aggregation framework - a broadening of "significant terms" into "significant buckets"? That would give us the flexibility to do the following analysis:
In each of these scenarios the significance algorithm is tuning out the unevenness in the data in order to identify buckets of interest, e.g. those with a propensity to commit a crime or purchase a product. We know so much data is Zipfian and skewed towards the popular, so background-diffing is a useful lens through which to view most data (in the same way "per-capita" stats provide a saner basis for comparisons). There are several advantages to re-using aggs for the bulk of this work:
There's a lot to think about in adopting an agg-based approach - the JSON syntax, the changes to existing aggs to support this background-stats use case. I appreciate this is likely too much to debate here on this PR, but I wanted to run it by you because it is related to the changes being made here.
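The "background-diffing" idea above can be illustrated with a toy score. This is a hypothetical sketch, not Elasticsearch's actual implementation, but it is in the spirit of the JLH-style heuristics significant_terms uses: a term is interesting when its share of the foreground set beats its share of the background set, both absolutely and relatively.

```java
public class SignificanceSketch {
    // Toy background-diffing score (hypothetical, for illustration only):
    // rewards terms whose foreground percentage exceeds their background
    // percentage, combining the absolute lift (fgPct - bgPct) with the
    // relative lift (fgPct / bgPct).
    public static double score(long fgCount, long fgSize, long bgCount, long bgSize) {
        double fgPct = (double) fgCount / fgSize;
        double bgPct = (double) bgCount / bgSize;
        if (bgPct == 0 || fgPct <= bgPct) {
            return 0.0; // no positive lift over the background
        }
        return (fgPct - bgPct) * (fgPct / bgPct);
    }
}
```

A term appearing in 10% of the foreground but only 0.1% of the background scores highly, while a term with the same share in both sets scores zero - which is exactly the "per-capita" normalization described above.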
I think that is worth exploring, but it looks quite ambitious at the same time. We should try to take baby steps as much as possible. The first one would probably be to compute doc counts of the background set using doc values (just like an aggregation would do). Then we could try to see how we can plug the terms agg in to do the job. And afterwards maybe support arbitrary numeric metric aggs.
Numeric fields have been refactored to use a different data structure that
performs better for range queries. However, since this data structure does
not record document frequencies, numeric fields can no longer be used for
significant terms aggregations. It is recommended to use <<keyword,`keyword`>>
Is it worth adding a note about potential performance degradation with numerics and a recommendation to reindex as keyword if this is an issue?
Was trying with my weblog data to test performance but hit this array index out of bounds exception:
Exception:
@markharwood I pushed a new commit to address your comment about the documentation. Regarding the test failure you got, we need #18003 to make terms aggs work with IP addresses. The issue here is that Elasticsearch tried to convert some IP bytes to a UTF-8 string, but this failed since the array boundary was crossed while reading what looked like a 4-byte UTF-8 code point.
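The failure mode described above can be reproduced in isolation. This is a hypothetical illustration (the class and method names are mine, not Elasticsearch's): the four raw bytes of an IPv4 address are not valid UTF-8, and when the last byte happens to look like the lead byte of a 4-byte UTF-8 sequence, a strict decoder runs off the end of the array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class IpUtf8Demo {
    // Returns true if the bytes form valid UTF-8. A freshly created
    // CharsetDecoder REPORTs errors, so malformed input throws instead
    // of being silently replaced.
    public static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the encoded form of 10.0.0.241 ends in the byte 0xF1, which announces a 4-byte UTF-8 code point; the decoder then hits the end of the 4-byte array while still expecting three continuation bytes, matching the "array boundary was crossed" symptom.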
For now I hacked SignificantStringTerms.getKeyAsString() so I can run some benchmarks without causing errors. There was a noticeable slow-down on IP address fields compared to the
        .add(filter, Occur.FILTER)
        .build();
}
return context.searchContext().searcher().count(query);
I had a concern that when sig_terms was embedded under a parent agg (e.g. most significant IP address inside a per-day date_histogram) we would search for the same terms over and over again. I experimented with a count cache for this use case on my test data but didn't get a noticeable improvement.
@markharwood Do you think this is good to merge or would you like me to change the way it works?
My only concern was around the likelihood of duplicated frequency lookups when a sig_terms agg is embedded under a parent terms agg or similar. I experimented with adding a term->count cache on some weblog data but failed to get a noticeable improvement. There may already be some caching effects occurring at the Lucene or OS file system level that make this agg-level caching redundant? Otherwise LGTM
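The term->count cache experiment mentioned above would amount to memoizing the per-term count queries. A minimal sketch (hypothetical - the class name and the `lookups` counter are mine, added to make the cache-hit behaviour observable):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Memoizes an expensive term -> count lookup (e.g. a per-term count query)
// so that repeated requests for the same term, such as those made by a
// sig_terms agg nested under a date_histogram, hit the map instead.
public class CountCache<T> {
    private final Map<T, Long> cache = new HashMap<>();
    private final ToLongFunction<T> counter;
    long lookups = 0; // counts how often the underlying counter actually ran

    public CountCache(ToLongFunction<T> counter) {
        this.counter = counter;
    }

    public long count(T term) {
        return cache.computeIfAbsent(term, t -> {
            lookups++;
            return counter.applyAsLong(t);
        });
    }
}
```

As the comments above note, the OS page cache and Lucene's query cache may already absorb most of the repeated work, which would explain why an agg-level cache showed no measurable benefit.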
There is indeed caching happening through the filesystem and the query cache. It is not as efficient as the terms enum though, which, even if it did not cache, would still have the benefit of reusing
Force-pushed from bd79bc8 to 866a545
It will keep using the caching terms enum for keyword/text fields and falls back
to IndexSearcher.count for fields that do not use the inverted index for
searching (such as numbers and ip addresses). Note that this probably means that
significant terms aggregations on these fields will be less efficient than they
used to be. It should be ok under a sampler aggregation though.
This moves tests back to the state they were in before numbers started using
points, and also adds a new test that significant terms aggs fail if a field is
not indexed.
In the long term, we might want to follow the approach that Robert initially
proposed, which consists of collecting all documents from the background filter
in order to compute frequencies using doc values. This would also mean that
significant terms aggregations would no longer require fields to be indexed.
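The dispatch described in the commit message can be sketched as follows. This is a simplified stand-in, not the real Elasticsearch code: a map simulates the terms dictionary of an inverted-index field (which records doc frequencies directly), and a flat list of values simulates a points-indexed field (which does not, forcing a per-term count over matching documents).

```java
import java.util.List;
import java.util.Map;

// Toy model of the two background-frequency paths in this PR:
// keyword/text fields answer lookups from the terms dictionary, while
// point-indexed fields (numbers, ip) fall back to counting matches,
// analogous to IndexSearcher.count.
public class FrequencySource {
    private final Map<String, Integer> termsDict; // term -> docFreq (inverted index)
    private final List<Long> pointValues;         // one value per doc (points field)

    public FrequencySource(Map<String, Integer> termsDict, List<Long> pointValues) {
        this.termsDict = termsDict;
        this.pointValues = pointValues;
    }

    // Fast path: one dictionary lookup per term (the PR reuses a caching
    // terms enum here).
    public int keywordFrequency(String term) {
        return termsDict.getOrDefault(term, 0);
    }

    // Fallback path: no stored frequencies, so every candidate term costs a
    // scan/count of matching documents - which is why the commit message
    // warns this is less efficient, and suggests a sampler agg to bound it.
    public long pointFrequency(long value) {
        return pointValues.stream().filter(v -> v == value).count();
    }
}
```

The fallback also explains the new test mentioned above: with neither a terms dictionary nor an index to count against, an unindexed field has no way to produce frequencies at all, so the agg now fails explicitly.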