
Search Profiler causes runaway CPU usage even on simple queries #26502

Status: Closed
joedj opened this issue Sep 5, 2017 · 3 comments
Labels: feedback_needed, :Search/Search (Search-related issues that do not fall into other categories)

Comments

joedj commented Sep 5, 2017

Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_141

Plugins installed:
discovery-ec2 ingest-geoip ingest-user-agent x-pack

JVM version (java -version):
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux 55d1beb7fa7e 4.9.43-17.38.amzn1.x86_64 #1 SMP Thu Aug 17 00:20:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
(running inside docker container)

Description of the problem including expected versus actual behavior:

Using the Search Profiler in Kibana causes runaway CPU usage on ES node(s), even when using the default query:

Index: _any
Query:

{
  "query":{
    "match_all" : {}
  }
}
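For context, the Search Profiler runs the query through the Profile API. A direct equivalent of the request above (the endpoint and index pattern here are my assumption, not necessarily what Kibana literally sends) looks like this and reproduces the same work outside Kibana:

GET /_search
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}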

I can reproduce this in production (8 nodes, 7.5TB primary, 73 indices, 524 shards), and in a test environment (1 node, 1TB primary, 93 indices, 97 shards), but not on an empty cluster running locally.

In the test env, it causes CPU on the single node to spike to 100% for ~5 minutes before returning to normal.

In production, it caused CPU to spike to 100% on 4 out of 8 nodes for over half an hour (at which point I used the Task Cancellation API).
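(By Task Cancellation API I mean the task management API: listing the long-running search tasks and then cancelling them. The node_id:task_id below is only an illustrative placeholder, and the exact requests are my reconstruction, along these lines:)

GET _tasks?actions=*search*&detailed
POST _tasks/node_id:task_id/_cancel
POST _tasks/_cancel?actions=*search*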

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make it for
us to reproduce it, the more likely it is that somebody will take the time to look at it.

I have not tried to put together a minimal example yet (other than trying it on an empty cluster). Reproduction for me (on real systems) just involves navigating to https://kibana/app/kibana#/dev_tools/searchprofiler?_g=() and clicking the 'Profile' button.

Provide logs (if relevant):

Hot threads (in test env, on a node that is otherwise almost idle):

::: {55d1beb7fa7e}{98KLc2f5SQebjsekSGmGDg}{TFUNTxuYRpqHvigoyx0kdg}{10.2.92.71}{10.2.92.71:9300}{aws_availability_zone=us-west-2c}
   Hot threads at 2017-09-05T05:54:17.904Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   50.2% (250.7ms out of 500ms) cpu usage by thread 'elasticsearch[55d1beb7fa7e][search][T#2]'
     8/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     2/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:244)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   50.0% (250ms out of 500ms) cpu usage by thread 'elasticsearch[55d1beb7fa7e][search][T#4]'
     9/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:244)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
   49.8% (249.1ms out of 500ms) cpu usage by thread 'elasticsearch[55d1beb7fa7e][search][T#3]'
     9/10 snapshots sharing following 18 elements
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
     unique snapshot
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:244)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:197)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:668)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:248)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:263)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:330)
       org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:327)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

::: {ip-10-38-75-80-62c1ea431ae5}{0Tq96erBTLWE7MT06ssANg}{Fkk03QXhTWaGX0jd4hhgUw}{10.38.75.80}{10.38.75.80:9300}{aws_availability_zone=us-west-2c}
   Hot threads at 2017-09-05T05:54:17.907Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
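(For reference, the trace above is output from the nodes hot threads API; a request with parameters matching the header, such as the following, produces it:)

GET /_nodes/hot_threads?threads=3&interval=500ms&ignore_idle_threads=true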

Screenshot after clicking 'Profile', waiting for ES to recover, clicking it again, waiting for recovery:

[screenshot: screen shot 2017-09-05 at 4 04 48 pm]

markharwood added the feedback_needed and :Search/Search labels Sep 5, 2017
markharwood (Contributor) commented:

The volume of data, not the complexity of the query, is likely the issue here. Even the smallest per-document overhead adds up to a lot when it is incurred in a tight loop over a very large number of docs.
From the docs:

Like any profiler, the Profile API introduces a non-negligible overhead to search execution. 
The act of instrumenting low-level method calls ... can be fairly expensive, since these 
methods are called in tight loops

Can you confirm the number of docs you have? If it's very large, then I expect the profiler is working as designed.

joedj (Author) commented Sep 5, 2017

I see, makes sense.
Test env: 150 million docs
Production: 5 billion docs

Is there a way to disable the profile API?

jpountz (Contributor) commented Sep 5, 2017

I think @markharwood's explanation is the right one. org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:246) is just a loop that iterates over the matches of the query, so the more documents, the longer it takes.
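A simplified sketch of that loop (my paraphrase, not the actual Lucene source) shows why the cost scales with the number of matching documents: with "profile": true the iterator and collector are wrapped in timing wrappers, so the instrumentation overhead is paid once per hit.

import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;

final class ScoreAllSketch {
    // Rough shape of Weight$DefaultBulkScorer.scoreAll: visit every matching doc in order.
    static void scoreAll(LeafCollector collector, DocIdSetIterator iterator) throws IOException {
        for (int doc = iterator.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = iterator.nextDoc()) {
            collector.collect(doc); // one call per matching document; timed when profiling is enabled
        }
    }
}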

For the record, profiling will have less overhead in 5.6 (#25772).

> Is there a way to disable the profile API?

No. I'm not sure it would make sense to make it possible to disable it given that there are many other ways to consume a lot of CPU on a cluster that you have read access to.

jpountz closed this as completed Sep 5, 2017