
Search Profiler causes runaway CPU usage even on simple queries #26502

Status: Closed
joedj opened this issue Sep 5, 2017 · 3 comments
Labels: feedback_needed, :Search/Search (Search-related issues that do not fall into other categories)

Comments

joedj commented Sep 5, 2017

Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_141

Plugins installed:
discovery-ec2 ingest-geoip ingest-user-agent x-pack

JVM version (java -version):
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux 55d1beb7fa7e 4.9.43-17.38.amzn1.x86_64 #1 SMP Thu Aug 17 00:20:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
(running inside docker container)

Description of the problem including expected versus actual behavior:

Using the Search Profiler in Kibana causes runaway CPU usage on ES node(s), even when using the default query:

Index: _any
Query:

{
  "query":{
    "match_all" : {}
  }
}
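For context, the Search Profiler runs the query through the Profile API. A direct equivalent of the request above (the endpoint and index pattern here are my assumption, not necessarily what Kibana literally sends) looks like this and reproduces the same work outside Kibana:

GET /_search
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}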

I can reproduce this in production (8 nodes, 7.5TB primary, 73 indices, 524 shards), and in a test environment (1 node, 1TB primary, 93 indices, 97 shards), but not on an empty cluster running locally.

In the test env, it causes CPU on the single node to spike to 100% for ~5 minutes before returning to normal.

In production, it caused CPU to spike to 100% on 4 out of 8 nodes for over half an hour (at which point I used the Task Cancellation API).
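(By Task Cancellation API I mean the task management API: listing the long-running search tasks and then cancelling them. The node_id:task_id below is only an illustrative placeholder, and the exact requests are my reconstruction, along these lines:)

GET _tasks?actions=*search*&detailed
POST _tasks/node_id:task_id/_cancel
POST _tasks/_cancel?actions=*search*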

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make it for
us to reproduce it, the more likely it is that somebody will take the time to look at it.

I have not tried to put together a minimal example yet (other than trying it on an empty cluster). Reproduction for me (on real systems) just involves navigating to https://kibana/app/kibana#/dev_tools/searchprofiler?_g=() and clicking the 'Profile' button.

Provide logs (if relevant):

Hot threads (in test env, on a node that is otherwise almost idle):

::: {55d1beb7fa7e}{98KLc2f5SQebjsekSGmGDg}{TFUNTxuYRpqHvigoyx0kdg}{10.2.92.71}{10.2.92.71:9300}{aws_availability_zone=us-west-2c}
   Hot threads at 2017-09-05T05:54:17.904Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   50.2% (250.7ms out of 500ms) cpu usage by thread 'elasticsearch[55d1beb7fa7e][search][T#2]'
     8/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     2/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:244)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   50.0% (250ms out of 500ms) cpu usage by thread 'elasticsearch[55d1beb7fa7e][search][T#4]'
     9/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:244)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   49.8% (249.1ms out of 500ms) cpu usage by thread 'elasticsearch[55d1beb7fa7e][search][T#3]'
     9/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:244)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

::: {ip-10-38-75-80-62c1ea431ae5}{0Tq96erBTLWE7MT06ssANg}{Fkk03QXhTWaGX0jd4hhgUw}{10.38.75.80}{10.38.75.80:9300}{aws_availability_zone=us-west-2c}
   Hot threads at 2017-09-05T05:54:17.907Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
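(For reference, the trace above is output from the nodes hot threads API; a request with parameters matching the header, such as the following, produces it:)

GET /_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true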

Screenshot after clicking 'Profile', waiting for ES to recover, clicking it again, waiting for recovery:

[screenshot: screen shot 2017-09-05 at 4 04 48 pm]

markharwood added the feedback_needed and :Search/Search labels Sep 5, 2017
markharwood (Contributor) commented:

The volume of data, not the complexity of the query, is likely the issue here. Even the smallest per-document overhead adds up to a lot when it is incurred in a tight loop over a very large number of docs.
From the docs:

Like any profiler, the Profile API introduces a non-negligible overhead to search execution. 
The act of instrumenting low-level method calls ... can be fairly expensive, since these 
methods are called in tight loops

Can you confirm the number of docs you have? If it's very large, then I expect the profiler is working as designed.

joedj (Author) commented Sep 5, 2017

I see, makes sense.
Test env: 150 million docs
Production: 5 billion docs

Is there a way to disable the profile API?

jpountz (Contributor) commented Sep 5, 2017

I think @markharwood's explanation is the right one. org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246) is just a loop that iterates over the matches of the query, so the more documents, the longer it takes.
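A simplified sketch of that loop (my paraphrase, not the actual Lucene source) shows why the cost scales with the number of matching documents: with "profile": true the iterator and collector are wrapped in timing wrappers, so the instrumentation overhead is paid once per hit.

import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;

final class ScoreAllSketch {
    // Rough shape of Weight$DefaultBulkScorer.scoreAll: visit every matching doc in order.
    static void scoreAll(LeafCollector collector, DocIdSetIterator iterator) throws IOException {
        for (int doc = iterator.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = iterator.nextDoc()) {
            collector.collect(doc); // one call per matching document; timed when profiling is enabled
        }
    }
}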

For the record, profiling will have less overhead in 5.6 (#25772).

> Is there a way to disable the profile API?

No. I'm not sure it would make sense to make it possible to disable it given that there are many other ways to consume a lot of CPU on a cluster that you have read access to.

jpountz closed this as completed Sep 5, 2017