
Make significant terms work on fields that are indexed with points. #18031

Merged
merged 1 commit into elastic:master from fix/significant_terms_on_points on May 11, 2016

Conversation

@jpountz (Contributor) commented Apr 28, 2016

This change keeps using the caching terms enum for keyword/text fields and falls back
to IndexSearcher.count for fields that do not use the inverted index for
searching (such as numbers and IP addresses). Note that this probably means that
significant terms aggregations on these fields will be less efficient than they
used to be; they should be OK under a sampler aggregation though.
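For illustration, here is a minimal sketch of that fallback, assuming a hypothetical helper class; for point-based fields the per-term query would come from the field mapper (an exact-match point query) rather than a TermQuery on the inverted index, so the exact wiring differs from the PR:

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import java.io.IOException;

// Hypothetical helper: derive a term's doc frequency by running a counting
// query instead of reading it from the (absent) terms dictionary.
final class FallbackFrequencies {
    static long frequency(IndexSearcher searcher, Query termQuery,
                          Query backgroundFilter) throws IOException {
        Query query = termQuery;
        if (backgroundFilter != null) {
            query = new BooleanQuery.Builder()
                    .add(termQuery, Occur.MUST)
                    .add(backgroundFilter, Occur.FILTER) // FILTER clauses match but do not score
                    .build();
        }
        return searcher.count(query);
    }
}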

This moves tests back to the state they were in before numbers started using
points, and also adds a new test verifying that significant terms aggs fail if a
field is not indexed.

In the long term, we might want to follow the approach that Robert initially
proposed, which consists of collecting all documents from the background filter in
order to compute frequencies using doc values. This would also mean that
significant terms aggregations would no longer require fields to be indexed.

@markharwood (Contributor) commented Apr 28, 2016

Many thanks for this, @jpountz.
I see this uses term queries to derive counts on the fly when they aren't stored.

However, if we go down this route of deriving counts on the fly, what do you think of an approach where we try to re-use the aggregation framework - a broadening of "significant terms" into "significant buckets"? That would give us the flexibility to do the following kinds of analysis:

  • Individual terms (the limit of what we do today)
  • Numeric ranges e.g. age bands
  • Geo areas
  • Term groups (e.g. Java=[jsp,ejb,jts....] vs Javascript=[react,angular...] vs ..)

In each of these scenarios the significance algorithm is tuning out the unevenness in the data in order to identify buckets of interest, e.g. those with a propensity to commit a crime or purchase a product. So much data is Zipfian and skewed towards the popular that background-diffing is a useful lens through which to view most data (in the same way "per-capita" stats provide a saner basis for comparison).

There are several advantages to re-using aggs for the bulk of this work:

  1. Bucket aggs provide a clean syntax for grouping low-level index entries (e.g. date agg's "1m", "1w" etc)
  2. Metric aggs allow us to sum things like paymentAmount - this can be a more useful number to consider as a basis for computing significance than what we use today (just volumes of documents)

There's a lot to think about in adopting an agg-based approach - the JSON syntax, and the changes to existing aggs to support this background-stats use case. I appreciate this is likely too much to debate here on this PR, but I wanted to run it by you because it is related to the changes being made here.

@jpountz (Contributor, Author) commented Apr 28, 2016

I think that is worth exploring, but it also looks quite ambitious. We should try to take baby steps as much as possible. The first one would probably be to compute doc counts of the background set using doc values (just like an aggregation would do). Then we could see how to plug the terms agg in to do the job, and afterwards maybe support arbitrary numeric metric aggs.
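As a rough illustration of that first step (a sketch only - the class name and wiring are hypothetical, and this uses the Lucene 6.x doc values API), counting background doc frequencies per term ordinal from doc values could look like:

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedSetDocValues;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch: walk the docs of one segment that match the background filter and
// tally per-term doc counts from SortedSetDocValues instead of the inverted index.
final class BackgroundCounts {
    static Map<Long, Long> countOrds(LeafReaderContext ctx, int[] backgroundDocs,
                                     String field) throws IOException {
        SortedSetDocValues dv = DocValues.getSortedSet(ctx.reader(), field);
        Map<Long, Long> counts = new HashMap<>();
        for (int doc : backgroundDocs) {
            dv.setDocument(doc); // Lucene 6.x API; later versions use advanceExact
            for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
                counts.merge(ord, 1L, Long::sum);
            }
        }
        return counts; // ordinals can be resolved back to terms via dv.lookupOrd
    }
}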

Quoted from the docs change under review:

Numeric fields have been refactored to use a different data structure that
performs better for range queries. However, since this data structure does
not record document frequencies, numeric fields can no longer be used for
significant terms aggregations. It is recommended to use <<keyword,`keyword`>>

Review comment (Contributor):

Is it worth adding a note about potential performance degradation with numerics and a recommendation to reindex as keyword if this is an issue?

@markharwood (Contributor):

I was trying to test performance with my weblog data but hit this ArrayIndexOutOfBoundsException:

DELETE test

PUT test
{
   "settings": {
      "number_of_replicas": 0,
      "number_of_shards":1
   },
   "mappings": {
      "test": {                
         "properties": {
            "ipField": {
               "type": "ip"
            }
         }
      }
   }
}


POST test/test
{"ipField":"95.108.241.251"}
POST test/test
{"ipField":"95.108.241.251"}
POST test/test
{"ipField":"95.108.241.251"}
POST test/test
{"ipField":"95.108.241.251"}


GET test/_search?q=95.108.241.251
{
   "size":0,
   "aggs": {
      "boom": {
         "significant_terms": {
            "field": "ipField"
         }
      }
   }
}

Exception:

Failed to execute phase [merge], [reduce] 
    at org.elasticsearch.action.search.SearchQueryAndFetchAsyncAction$1.onFailure(SearchQueryAndFetchAsyncAction.java:77)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:438)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 16
    at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:602)
    at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:152)
    at org.elasticsearch.search.aggregations.bucket.significant.SignificantStringTerms$Bucket.getKeyAsString(SignificantStringTerms.java:115)
    at org.elasticsearch.search.aggregations.bucket.significant.InternalSignificantTerms.doReduce(InternalSignificantTerms.java:176)
    at org.elasticsearch.search.aggregations.InternalAggregation.reduce(InternalAggregation.java:158)
    at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:159)
    at org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:403)
    at org.elasticsearch.action.search.SearchQueryAndFetchAsyncAction$1.doRun(SearchQueryAndFetchAsyncAction.java:65)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:452)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)    

@jpountz (Contributor, Author) commented May 4, 2016

@markharwood I pushed a new commit to address your comment about the documentation.

Regarding the test failure you got, we need #18003 to make terms aggs work with IP addresses. The issue here is that Elasticsearch tried to convert some IP bytes to a UTF-8 string, but this failed because the array boundary was crossed while reading what looked like a 4-byte UTF-8 code point.
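To make the failure mode concrete, here is a contrived stand-alone reproduction (not the aggregation code path itself): InetAddressPoint.encode maps an IPv4 address into 16 IPv6 bytes, and the 241 (0xF1) byte near the end looks like the lead byte of a 4-byte UTF-8 sequence, so decoding reads past index 15:

import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.util.BytesRef;

import java.net.InetAddress;

public class IpUtf8Repro {
    public static void main(String[] args) throws Exception {
        byte[] encoded = InetAddressPoint.encode(InetAddress.getByName("95.108.241.251"));
        // The encoded bytes are not valid UTF-8; this can throw
        // ArrayIndexOutOfBoundsException: 16, matching the trace above.
        System.out.println(new BytesRef(encoded).utf8ToString());
    }
}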

@markharwood (Contributor):

For now I hacked SignificantStringTerms.getKeyAsString() so I could run some benchmarks without hitting errors. There was a noticeable slowdown on IP address fields compared to the keyword equivalent on my test index (~4 times slower), but that is to be expected.

Quoted from the PR diff:

.add(filter, Occur.FILTER)
.build();
}
return context.searchContext().searcher().count(query);

Review comment (Contributor):

I had a concern that when sig_terms is embedded under a parent agg (e.g. most significant IP address inside a per-day date_histogram), we would search for the same terms over and over again. I experimented with a count cache for this use case on my test data but didn't get a noticeable improvement.

@jpountz (Contributor, Author) commented May 10, 2016

@markharwood Do you think this is good to merge or would you like me to change the way it works?

@markharwood
Copy link
Contributor

My only concern was the likelihood of duplicated frequency lookups when a sig_terms agg is embedded under a parent terms agg or similar. I experimented with adding a term->count cache on some weblog data but failed to get a noticeable improvement. There may already be caching effects at the Lucene or OS filesystem level that make this agg-level caching redundant.

Otherwise LGTM
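
For reference, a bare-bones version of the kind of term->count cache being discussed (hypothetical; not the code that was actually benchmarked):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-request cache so a sig_terms agg nested under a parent agg
// does not recount the same term's background frequency once per parent bucket.
final class BackgroundFrequencyCache {
    private final IndexSearcher searcher;
    private final Map<BytesRef, Long> counts = new HashMap<>();

    BackgroundFrequencyCache(IndexSearcher searcher) {
        this.searcher = searcher;
    }

    long count(BytesRef term, Query termQuery) throws IOException {
        Long cached = counts.get(term);
        if (cached == null) {
            cached = (long) searcher.count(termQuery);
            counts.put(BytesRef.deepCopyOf(term), cached); // deep copy: BytesRef instances are often reused
        }
        return cached;
    }
}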

@jpountz (Contributor, Author) commented May 10, 2016

There is indeed caching happening, through the filesystem and the query cache. It is not as efficient as the terms enum though, which, even if it did not cache, would still have the benefit of reusing IndexInputs.

@jpountz jpountz force-pushed the fix/significant_terms_on_points branch from bd79bc8 to 866a545 on May 11, 2016 at 14:53
@jpountz jpountz merged commit 866a545 into elastic:master May 11, 2016
@jpountz jpountz deleted the fix/significant_terms_on_points branch May 11, 2016 15:23