Terms agg with zero-result filtered query searches whole index #8406

Closed
avleen opened this issue Nov 9, 2014 · 12 comments

avleen commented Nov 9, 2014

We're using a filtered query followed by a terms agg, like so:

{
  "query":{
    "filtered":{
      "query":{ 
        "query_string":{
          "query": "type:web_info_log AND log_namespace:FOO AND message:\"bar\""
        }
      },
      "filter":{
        "bool":{
          "must":[{
            "range":{
              "@timestamp":{
                "from":1415464349189,"to":1415465564077
              }
            }
          }]
        }
      }
    }
  },
  "aggs":{
    "histogram_aggregation":{
      "histogram":{
        "interval":10,"field":"@timestamp"
      },
      "aggs":{
        "terms_bucket":{
          "terms":{
            "field":"log_message","execution_hint":"map"
          }
        }
      }
    }
  }
}

Normally the filtered query returns a small number of results (under 100) and everything is fine.
However, when it returns zero results, the terms aggregation runs on the entire index (in our case, ~4bn docs), causing repeated GCs which take many hours and make the cluster unusable.
It should, in fact, not run at all, because the filtered query returned zero results.

avleen commented Nov 9, 2014

This is on Elasticsearch 1.3.2.
If this is repeatable for others, any chance the fix could make it into 1.4? It's brought our cluster down repeatedly in the last 2 weeks and I can reproduce it here easily.

jpountz self-assigned this Nov 10, 2014

jpountz commented Nov 10, 2014

@avleen This looks wrong indeed. I just tried to reproduce the issue without success; can you please help me understand a bit more about what happens:

  • Do you actually get a doc_count for every term in the response, or did you infer that it ran on all docs because of CPU or I/O activity?
  • Can you confirm you are not setting min_doc_count to 0? That option, combined with the map execution hint, could make Elasticsearch search the whole index (see the sketch after this list).
  • If you can still reproduce this issue, could you try to capture the output of the hot threads while the query is running, so that I can see what is calling the aggregation?
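
For reference, here is a minimal request along these lines with the default min_doc_count of 1 written out explicitly so the setting is visible (a sketch reusing the field and agg names from the query above, not output from this thread):

{
  "size": 0,
  "aggs": {
    "terms_bucket": {
      "terms": {
        "field": "log_message",
        "execution_hint": "map",
        "min_doc_count": 1
      }
    }
  }
}

With min_doc_count set to 0, the terms agg has to produce a bucket for every term in the field, even terms that no matching document contains, which is why the setting matters here.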

avleen commented Nov 10, 2014

Hi @jpountz
Here are the results from the query when searching over 2 days of indices and running the agg on a field with very low cardinality (maybe a few dozen unique entries):

{
   "took": 6148,
   "timed_out": false,
   "_shards": {
      "total": 78,
      "successful": 78,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   },
   "aggregations": {
      "terms": {
         "buckets": []
      }
   }
}

When I run it against our log_message field, which is not_analyzed, holds a ton of large strings (up to 10kb long), and where the vast majority of the messages are unique:

{
   "took": 184340,
   "timed_out": false,
   "_shards": {
      "total": 39,
      "successful": 39,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   },
   "aggregations": {
      "terms": {
         "buckets": []
      }
   }
}

jpountz commented Nov 10, 2014

Hmm, this might be due to fielddata loading. That is expected, since we need to load fielddata before knowing what matches, but I will check that we are not loading global ordinals with the "map" execution hint, since they are not required.

Can you confirm that only the first few queries are slow?

avleen commented Nov 10, 2014

This seems to be pretty slow all the time. It's hard to confirm because running the query takes down our production cluster for several minutes. It just happened again and one node fell out of the cluster because it was trying to do garbage collection for several minutes :-)

Running the filtered query part is really fast. It rarely takes more than 5 or 10s for ~5bn docs.
Running the aggs on top of that, even when there are zero results from the query, takes many minutes.
So in this case, why is fielddata being loaded when there were zero results?

jpountz commented Nov 12, 2014

I can try to explain a bit more how execution works at the shard level: Elasticsearch creates an object called BulkScorer which is responsible for finding matching documents and optionally scoring them. This BulkScorer takes a Collector, which is basically a callback that gets called for every matching document. Examples of collectors include TotalHitCountCollector (to just count the number of matches), TopDocsCollector (to compute the top matching documents) or AggregationsCollector (to run aggregations). For instance, your request would use both a TopDocsCollector and an AggregationsCollector.

When we create the collector for aggregations, we need to prepare everything to get ready for collecting matches, which includes loading fielddata. But at this point we don't know yet whether the query is going to match anything. This is why even queries that do not match any hits load fielddata.

jpountz commented Nov 12, 2014

I just checked that the map execution hint does not load global ordinals, as expected. So the only explanation I have left is that these slow times for queries that match no documents are due to fielddata loading.

However, I'm still puzzled that you mentioned that everything works fine when the query returns ~100 matches. Do you mean that requests that match no documents are more harmful to your cluster than queries that match a few documents?

avleen commented Nov 24, 2014

It seemed that way, but don't take it as gospel. It's possible that during those instances the field data was already cached.

Could the field data loading be delayed until we know which documents actually need to be loaded in? It seems problematic to always load huge amounts of field data when you are running a filtered query precisely to reduce the number of documents for aggregation.

@clintongormley

@avleen If you want to use a field for sorting/aggregating, then you need it in fielddata. Perhaps this query doesn't match but the next one will. Fielddata is loaded from disk by uninverting the index, so it has to happen in one go. It would be much much much slower if we were to load only matching docs.

Instead, use doc values for this field. Then the memory requirements go away.
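
As a minimal sketch of that suggestion (the type name your_type is a placeholder; on 1.x doc values need a not_analyzed string field, and the setting only applies to newly indexed data, so a reindex is required):

{
  "mappings": {
    "your_type": {
      "properties": {
        "log_message": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}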

avleen commented Nov 25, 2014

We'll give doc values another shot, Clinton. When we tested them a while back, we traded lower memory usage for significantly higher disk I/O, which slowed indexing.
We'll test to see if the recent perf improvements to doc values make them more viable now :)

Thanks!

@clintongormley

@avleen Make sure you aren't using them on the message.raw field (in fact, just disable the message.raw field completely). Also, in Lucene 5, doc values merging is better than it is today, which should help as well. Would be interested in any stats you can give us about the effect of doc values on your indexing.
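
One way to read that suggestion (a sketch, assuming a Logstash-style template where message carries a not_analyzed raw multi-field): map message as an analyzed string only and leave the raw sub-field out of the template, so the huge, mostly-unique strings are never indexed a second time:

{
  "mappings": {
    "your_type": {
      "properties": {
        "message": {
          "type": "string",
          "index": "analyzed"
        }
      }
    }
  }
}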

@clintongormley

Nothing more to do here. Closing in favour of #8312
