
Make significant terms work on fields that are indexed with points. #18031

Merged
merged 1 commit into elastic:master from fix/significant_terms_on_points on May 11, 2016

Conversation

@jpountz (Contributor) commented Apr 28, 2016

This change keeps using the caching terms enum for keyword/text fields and falls back
to IndexSearcher.count for fields that do not use the inverted index for
searching (such as numbers and IP addresses). Note that this probably means that
significant terms aggregations on these fields will be less efficient than they
used to be; they should be OK under a sampler aggregation though.
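For illustration, here is a minimal sketch of that fallback, assuming a hypothetical helper class; for point-based fields the per-term query would come from the field mapper (an exact-match point query) rather than a TermQuery on the inverted index, so the exact wiring differs from the PR:

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import java.io.IOException;

// Hypothetical helper: derive a term's doc frequency by running a counting
// query instead of reading it from the (absent) terms dictionary.
final class FallbackFrequencies {
    static long frequency(IndexSearcher searcher, Query termQuery,
                          Query backgroundFilter) throws IOException {
        Query query = termQuery;
        if (backgroundFilter != null) {
            query = new BooleanQuery.Builder()
                    .add(termQuery, Occur.MUST)
                    .add(backgroundFilter, Occur.FILTER) // FILTER clauses match but do not score
                    .build();
        }
        return searcher.count(query);
    }
}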

This moves tests back to the state they were in before numbers started using
points, and also adds a new test verifying that significant terms aggs fail if a
field is not indexed.

In the long term, we might want to follow the approach that Robert initially
proposed, which consists of collecting all documents from the background filter in
order to compute frequencies using doc values. This would also mean that
significant terms aggregations would no longer require fields to be indexed.

@markharwood (Contributor) commented Apr 28, 2016

Many thanks for this, @jpountz.
I see this uses term queries to derive counts on the fly when they aren't stored.

However, if we go down this route of deriving counts on the fly, what do you think of an approach where we try to re-use the aggregation framework - a broadening of "significant terms" into "significant buckets"? That would give us the flexibility to do the following kinds of analysis:

  • Individual terms (the limit of what we do today)
  • Numeric ranges e.g. age bands
  • Geo areas
  • Term groups (e.g. Java=[jsp,ejb,jts....] vs Javascript=[react,angular...] vs ..)

In each of these scenarios the significance algorithm is tuning out the unevenness in the data in order to identify buckets of interest, e.g. those with a propensity to commit a crime or purchase a product. So much data is Zipfian and skewed towards the popular that background-diffing is a useful lens through which to view most data (in the same way "per-capita" stats provide a saner basis for comparison).

There are several advantages to re-using aggs for the bulk of this work:

  1. Bucket aggs provide a clean syntax for grouping low-level index entries (e.g. date agg's "1m", "1w" etc)
  2. Metric aggs allow us to sum things like paymentAmount - this can be a more useful number to consider as a basis for computing significance than what we use today (just volumes of documents)

There's a lot to think about in adopting an agg-based approach - the JSON syntax, and the changes to existing aggs to support this background-stats use case. I appreciate this is likely too much to debate here on this PR, but I wanted to run it by you because it is related to the changes being made here.

@jpountz (Contributor, Author) commented Apr 28, 2016

I think that is worth exploring, but it also looks quite ambitious. We should try to take baby steps as much as possible. The first one would probably be to compute doc counts of the background set using doc values (just like an aggregation would do). Then we could see how to plug the terms agg in to do the job, and afterwards maybe support arbitrary numeric metric aggs.
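As a rough illustration of that first step (a sketch only - the class name and wiring are hypothetical, and this uses the Lucene 6.x doc values API), counting background doc frequencies per term ordinal from doc values could look like:

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedSetDocValues;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Sketch: walk the docs of one segment that match the background filter and
// tally per-term doc counts from SortedSetDocValues instead of the inverted index.
final class BackgroundCounts {
    static Map<Long, Long> countOrds(LeafReaderContext ctx, int[] backgroundDocs,
                                     String field) throws IOException {
        SortedSetDocValues dv = DocValues.getSortedSet(ctx.reader(), field);
        Map<Long, Long> counts = new HashMap<>();
        for (int doc : backgroundDocs) {
            dv.setDocument(doc); // Lucene 6.x API; later versions use advanceExact
            for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
                counts.merge(ord, 1L, Long::sum);
            }
        }
        return counts; // ordinals can be resolved back to terms via dv.lookupOrd
    }
}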

Quoted from the docs change under review:

Numeric fields have been refactored to use a different data structure that
performs better for range queries. However, since this data structure does
not record document frequencies, numeric fields can no longer be used for
significant terms aggregations. It is recommended to use <<keyword,`keyword`>>

Review comment (Contributor):

Is it worth adding a note about potential performance degradation with numerics and a recommendation to reindex as keyword if this is an issue?

@markharwood (Contributor):

I was trying to test performance with my weblog data but hit this ArrayIndexOutOfBoundsException:

DELETE test

PUT test
{
   "settings": {
      "number_of_replicas": 0,
      "number_of_shards":1
   },
   "mappings": {
      "test": {                
         "properties": {
            "ipField": {
               "type": "ip"
            }
         }
      }
   }
}


POST test/test
{"ipField":"95.108.241.251"}
POST test/test
{"ipField":"95.108.241.251"}
POST test/test
{"ipField":"95.108.241.251"}
POST test/test
{"ipField":"95.108.241.251"}


GET test/_search?q=95.108.241.251
{
   "size":0,
   "aggs": {
      "boom": {
         "significant_terms": {
            "field": "ipField"
         }
      }
   }
}

Exception:

Failed to execute phase [merge], [reduce] 
    at org.elasticsearch.action.search.SearchQueryAndFetchAsyncAction$1.onFailure(SearchQueryAndFetchAsyncAction.java:77)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:438)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 16
    at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:602)
    at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:152)
    at org.elasticsearch.search.aggregations.bucket.significant.SignificantStringTerms$Bucket.getKeyAsString(SignificantStringTerms.java:115)
    at org.elasticsearch.search.aggregations.bucket.significant.InternalSignificantTerms.doReduce(InternalSignificantTerms.java:176)
    at org.elasticsearch.search.aggregations.InternalAggregation.reduce(InternalAggregation.java:158)
    at org.elasticsearch.search.aggregations.InternalAggregations.reduce(InternalAggregations.java:159)
    at org.elasticsearch.search.controller.SearchPhaseController.merge(SearchPhaseController.java:403)
    at org.elasticsearch.action.search.SearchQueryAndFetchAsyncAction$1.doRun(SearchQueryAndFetchAsyncAction.java:65)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:452)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)    

@jpountz (Contributor, Author) commented May 4, 2016

@markharwood I pushed a new commit to address your comment about the documentation.

Regarding the test failure you got, we need #18003 to make terms aggs work with IP addresses. The issue here is that Elasticsearch tried to convert some IP bytes to a UTF-8 string, but this failed because the array boundary was crossed while reading what looked like a 4-byte UTF-8 code point.
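To make the failure mode concrete, here is a contrived stand-alone reproduction (not the aggregation code path itself): InetAddressPoint.encode maps an IPv4 address into 16 IPv6 bytes, and the 241 (0xF1) byte near the end looks like the lead byte of a 4-byte UTF-8 sequence, so decoding reads past index 15:

import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.util.BytesRef;

import java.net.InetAddress;

public class IpUtf8Repro {
    public static void main(String[] args) throws Exception {
        byte[] encoded = InetAddressPoint.encode(InetAddress.getByName("95.108.241.251"));
        // The encoded bytes are not valid UTF-8; this can throw
        // ArrayIndexOutOfBoundsException: 16, matching the trace above.
        System.out.println(new BytesRef(encoded).utf8ToString());
    }
}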

@markharwood (Contributor):

For now I hacked SignificantStringTerms.getKeyAsString() so I could run some benchmarks without hitting errors. There was a noticeable slowdown on IP address fields compared to the keyword equivalent on my test index (~4 times slower), but that is to be expected.

Quoted from the PR diff:

.add(filter, Occur.FILTER)
.build();
}
return context.searchContext().searcher().count(query);

Review comment (Contributor):

I had a concern that when sig_terms is embedded under a parent agg (e.g. most significant IP address inside a per-day date_histogram), we would search for the same terms over and over again. I experimented with a count cache for this use case on my test data but didn't get a noticeable improvement.

@jpountz (Contributor, Author) commented May 10, 2016

@markharwood Do you think this is good to merge or would you like me to change the way it works?

@markharwood
Copy link
Contributor

My only concern was the likelihood of duplicated frequency lookups when a sig_terms agg is embedded under a parent terms agg or similar. I experimented with adding a term->count cache on some weblog data but failed to get a noticeable improvement. There may already be caching effects at the Lucene or OS filesystem level that make this agg-level caching redundant.

Otherwise LGTM
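
For reference, a bare-bones version of the kind of term->count cache being discussed (hypothetical; not the code that was actually benchmarked):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-request cache so a sig_terms agg nested under a parent agg
// does not recount the same term's background frequency once per parent bucket.
final class BackgroundFrequencyCache {
    private final IndexSearcher searcher;
    private final Map<BytesRef, Long> counts = new HashMap<>();

    BackgroundFrequencyCache(IndexSearcher searcher) {
        this.searcher = searcher;
    }

    long count(BytesRef term, Query termQuery) throws IOException {
        Long cached = counts.get(term);
        if (cached == null) {
            cached = (long) searcher.count(termQuery);
            counts.put(BytesRef.deepCopyOf(term), cached); // deep copy: BytesRef instances are often reused
        }
        return cached;
    }
}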

@jpountz (Contributor, Author) commented May 10, 2016

There is indeed caching happening, through the filesystem and the query cache. It is not as efficient as the terms enum though, which, even if it did not cache, would still have the benefit of reusing IndexInputs.

@jpountz jpountz force-pushed the fix/significant_terms_on_points branch from bd79bc8 to 866a545 on May 11, 2016 at 14:53
@jpountz jpountz merged commit 866a545 into elastic:master May 11, 2016
@jpountz jpountz deleted the fix/significant_terms_on_points branch May 11, 2016 15:23