ES java very high CPU usage #4288

Closed
ydnitin opened this issue Nov 28, 2013 · 26 comments
@ydnitin

ydnitin commented Nov 28, 2013

I am having issues similar to those described in this thread: #1940.
Although the leap second already happened back in June, I tried resetting the time anyway. It didn't work.
This is causing a lot of frustration. Any help is much appreciated.

I have upgraded ES to the latest version (elasticsearch-0.90.7-1) and restarted ES, the front end, and logstash. No joy at all.

java -version
java version "1.7.0_45"
OpenJDK Runtime Environment (rhel-2.4.3.2.el6_4-x86_64 u45-b15)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

After the upgrade I am seeing these entries in the logs, which seem like a totally different issue:

[2013-11-28 03:48:32,796][WARN ][discovery.zen ] [LA Logger] received a join request for an existing node [[Night Thrasher][ecqZvZhDSTGiVkrjj6G_hw][inet[/192.168.128.146:9300]]{client=true, data=false}]
[2013-11-28 03:48:36,006][WARN ][discovery.zen ] [LA Logger] received a join request for an existing node [[Night Thrasher][ecqZvZhDSTGiVkrjj6G_hw][inet[/192.168.128.146:9300]]{client=true, data=false}]

Hot thread output: https://gist.github.com/ydnitin/7687098

@ghost ghost assigned uboness and spinscale Nov 28, 2013
@spinscale
Contributor

Hey there,

Checking out your hot threads output, the last third of the output is the important part: as you can clearly see, a lot of queries are eating up your CPU.

Do you run some complex queries, including fuzzy support? Does your system behave differently if you do not query it at all?

And a last question to get a better overview: what is your setup (number of nodes, etc.)? It looks like the cluster state might not be up to date on the node the client is trying to rejoin (due to load), but this is just a wild assumption, not backed by facts.

@s1monw
Contributor

s1monw commented Nov 28, 2013

I agree with @spinscale: this very much looks like queries being fired against Elasticsearch and causing the load. Can you identify whether that is abnormal?

@ydnitin
Author

ydnitin commented Nov 28, 2013

I suspected the same, so I dropped all traffic to ES and the load came down right away; when I enabled the traffic again, the load jumped back up. My users query ES via the Kibana interface. I am not familiar with the fuzzy query concept; ES is totally new to me.
The load came down after I posted this last night. I guess a user closed their Kibana session?
Is there a tool to look at the running queries, like pgmonitor?
Is there simplified query syntax documentation out there? http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html contains too much information for my users.

I am running just one node, two shards.
Server: Physical
RAM: 24G
HeapSize set: 8G
CPU: 8

I am about to add another node to ES. Any suggestions on how I should profile my ES setup before ordering new hardware?

@spinscale
Contributor

Hey,

There is a slow search log mechanism in Elasticsearch. See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-slowlog.html#search-slow-log
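To turn it on, the thresholds are dynamic index settings and can be set with the settings API, roughly like this (a sketch; the index name logstash-2013.11.28 and the threshold values are only examples, so substitute your own):

curl -XPUT 'localhost:9200/logstash-2013.11.28/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'

Anything slower than a threshold is then written to the slowlog with its full query source, which makes it easier to spot the expensive Kibana queries.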

Furthermore, you should talk to the Kibana users and see what kind of queries they execute (almost every Kibana widget can reveal the query being executed). Also make sure that they are not reloading their data in Kibana too often; maybe that already helps a bit (without getting to the root cause, it may be just one query, so you would have to try the queries out one by one).

It is not possible to give hardware recommendations without knowing your data, configuration, mappings, documents, query load, index load, etc.

Adding more nodes and replicating the same data to them will most likely at least spread the load.
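To illustrate that last point: once a second node has joined the cluster, replication is a single dynamic settings change, something like (a sketch; this raises the replica count for all indices, and the new copies get allocated to the new node):

curl -XPUT 'localhost:9200/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'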

@s3k

s3k commented Feb 12, 2014

I have the same problem with high CPU usage
(MacBook Pro, OS X, standard Java 7, 2 cores, 2.5GHz, i5).

Here are some tips:
On my local machine I set the following in config/elasticsearch.yml:

index.number_of_shards: 1
index.number_of_replicas: 0

For one index with 185k docs, my CPU load is 2.5-5% for the ES Java process.

Also, plugins cause a HUGE performance hit.
For me those were the Marvel and analysis-morphology plugins. When I turned them off, it freed up 2x, maybe 4x, the CPU resources.

I think the yml config is the solution.

P.S. I also have a server (Ubuntu 12.04, OpenJDK 7, 2 Xeon cores at 2.4GHz, and 2GB RAM) with the default config and the same plugins (Marvel and analysis-morphology) that works fine with 1,500,000 docs in an index.

@davidgraca

Don't forget, if you installed Marvel, to delete all of its indices after you remove it, since removing the plugin doesn't do that for you.

@gentunian

Same issue with the Marvel plugin. I kept running out of heap space within a minute of starting (1GB heap). I removed the plugin and all of its indices and now the CPU is fine. Here is a graph showing the improved CPU user time:
[screenshot - 12092014 - 02:41:11 pm]

@bleskes
Contributor

bleskes commented Dec 10, 2014

@gentunian did you have the errors while the Marvel UI was open in a browser, or also without? It would be great to get the output of http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html#cluster-nodes-hot-threads to see what the node is doing.
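For reference, that output comes from a single call to the hot threads API, e.g. (a sketch; the threads and interval parameters are optional):

curl -XGET 'localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'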

@gentunian

@bleskes in both cases I ran out of heap space. I could see in elasticsearch.log, every 2 or 3 seconds, what I assume were GC runs (I can review the log tomorrow at work if you want).

curl -s -XGET 'localhost:9200/_cat/thread_pool?v' was all 0s for every column.

I had a lot of shards, maybe more than 500 (logstash defaults to 5 shards for every index and I have more than 100 indices).
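If most of those shards come from logstash's daily indices, an index template can cap the shard count for future indices, along these lines (a sketch; logstash_shards is a made-up template name, logstash-* assumes the default index naming, and it only affects indices created after the template is put in place):

curl -XPUT 'localhost:9200/_template/logstash_shards' -d '{
  "template": "logstash-*",
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}'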

In the Marvel stats, the Marvel indices were at the top of each column (index rate, search query, etc.; sorry if the names are wrong, I'm typing this from memory).

All of this is in QA. Tomorrow, after figuring out the CPU usage, I'm planning a production deployment.

@bleskes
Contributor

bleskes commented Dec 11, 2014

@gentunian I can think of a couple of things that may be going on. Maybe open a new issue with your details, as your primary concern is memory and not CPU?

@ACV2

ACV2 commented May 19, 2015

I'm using Elasticsearch 1.5, and it works perfectly most of the time. But every day at the same time it goes crazy: CPU jumps to ~70% when the average is around 3-5%. These are powerful servers with 32GB reserved for Lucene, swap is locked, and clearing the cache doesn't solve the problem (it doesn't bring the heap memory down).

Settings:

3 servers (nodes) 32 cores and 128GB RAM each
2 buckets (indices): one with ~18 million documents (this one doesn't receive updates very often, mostly just new documents being indexed); the other has around 7-8 million documents, but we are constantly bombarding it with updates, searches, deletes, and indexing as well

The best distribution for our structure was to have only 1 shard per node with no replicas; we can afford to have a percentage of the data offline for a few seconds, since it comes back as soon as the server is online again, and this process is fast because nothing needs to be relocated. Previously we had 3 shards with 1 replica, but the issue mentioned above occurred as well, so it is easy to figure out that the problem is not related to the distribution.

Things that I have already tried:

Merging: I tried using the Optimize API to take some load off the scheduled merges, but the merge process actually consumes a lot of disk R/W and doesn't substantially affect memory or CPU load.

Flushing: I tried flushing with both long and short intervals, and the results were the same; nothing changed, since flushing directly affects the merge process and, as mentioned above, merging doesn't take that much CPU or memory.
[screenshots: pic1-1, pic2-1, pic2-2, pic1-2]

Managing the cache: I clear it manually, but it doesn't seem to bring the CPU load back to a normal state, not even for a moment.
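For reference, the manual clear is a single call to the clear cache API (a sketch for 1.x; without the flags it clears all caches, with them only the fielddata and filter caches):

curl -XPOST 'localhost:9200/_cache/clear?fielddata=true&filter=true'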

Here are most of the elasticsearch.yml configs:

# Force all memory to be locked, forcing the JVM to never swap
bootstrap.mlockall: true

# Threadpool Settings

# Search pool
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 200

# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 3000

# Index pool
threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 1000

# Indices settings
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb

# Cache Sizes
indices.fielddata.cache.size: 30%
#indices.fielddata.cache.expire: 6h  # will be deprecated & devs recommend not to use it
indices.cache.filter.size: 30%
#indices.cache.filter.expire: 6h  # will be deprecated & devs recommend not to use it

# Indexing Settings for Writes
index.refresh_interval: 30s
#index.translog.flush_threshold_ops: 50000
#index.translog.flush_threshold_size: 1024mb
index.translog.flush_threshold_period: 5m
index.merge.scheduler.max_thread_count: 1

Here are the stats when the server is in a normal state:
node_stats_normal.txt

Node stats during the problem.
node_stats.txt

I would appreciate any help or discussion that can point me in the right direction to get rid of this behavior.

Thanks in advance.

Regards,

Daniel

@clintongormley

Hi @ACV2

This issue is closed. I suggest you follow the advice given above (i.e. check the hot threads output, enable the slow query log) and see what happens at the time of day when your cluster goes wild. If the answer isn't immediately obvious, I'd suggest asking about it in the forum: discuss.elastic.co

@travisbell

@ACV2 Did you learn anything since your post about this? We're seeing more or less the same problem. Sometimes daily, sometimes only once a week (it's that random), one of our ES hosts decides to go crazy. We usually restart the instance when we see this because it affects our end users' response times, but it's starting to be a problem.

In our situation, these hosts aren't idle, they're trucking along with 25-30% CPU and then BAM, 70-100%. Sometimes they correct themselves after 24-48 hours, other times they don't.

I was just curious if you learned anything that could help point us in the right direction. I'll see about checking the hot threads output and slowlogs.

@ACV2

ACV2 commented Jun 12, 2015

Hey there, @travisbell. Actually, it was a journey... Elasticsearch is everything but human friendly. After this post I checked every single configuration, trying to optimize performance and avoid this behavior, without success. I reduced the query time from 12 seconds to 2-3 seconds during those 10-15 minute periods, which is still huge and unacceptable, but we are working to optimize our back end to stabilize the performance. That being said, my findings:

-merge process
-flush process
-background process (script) bombarding the server with queries slowing it down

Can you post your server configurations so I can help you a little bit more :)

Regards
Daniel


@travisbell

Hi @ACV2

Thanks for the reply. We noticed this was usually tied to when we ran some big re-indexes, when the indices were merging a lot. During the re-indexes, we still carry a full request load.

At the end of the re-index I added a task to run an optimize on the index and, lo and behold, we haven't seen this problem crop up in the past ~12 days. Forcing an optimize to fix the runaway CPU problem doesn't feel like an actual fix, but it does alleviate the issue for us. ES seems to get stuck in some kind of weird state after these re-indexes, and the optimize seems to kick it out of that state.
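For anyone wanting to try the same thing, the post-reindex step is essentially one call (a sketch; my_index is a placeholder, and max_num_segments=1 forces a full merge down to a single segment, which is disk-heavy, so it is worth scheduling off-peak):

curl -XPOST 'localhost:9200/my_index/_optimize?max_num_segments=1'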

@suminda123

I had the same issue, high CPU and high memory; for me it was Marvel creating a big new index every day.
So I deleted all the Marvel indices and ran _flush; it cleared my memory and the CPU came down from 90% to 4%.
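Roughly what that looks like (a sketch, assuming Marvel's default .marvel-* daily index naming; it's worth listing the indices with the _cat/indices API first to double-check what the wildcard matches before deleting):

curl -XDELETE 'localhost:9200/.marvel-*'
curl -XPOST 'localhost:9200/_flush'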

@travisbell

Yes, the daily optimize reduced our CPU by a similar amount (~60% ➞ ~10%). We still run our daily optimize task and haven't seen this issue since enabling it at the beginning of June.

@Alino

Alino commented Jun 2, 2016

Has anyone tried upgrading to ES 2.x? Does it fix this CPU issue?

@travisbell

travisbell commented Jun 2, 2016

Hi @Alino, for us the problem was actually the fact that we were using G1GC. As soon as we switched to CMS the problem went away (our ES cluster has been up for 51 days now!). What I mentioned in my last post ^ was still true: it greatly reduced how often this happened, but the runaway CPU problems continued to crop up every ~7 days or so.
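The switch itself is just a JVM flag change, something like this (a sketch, assuming the flags are passed in via ES_JAVA_OPTS when starting the node; the CMS flags shown match what Elasticsearch shipped as defaults at the time):

# drop any -XX:+UseG1GC flag and restore the CMS collector
export ES_JAVA_OPTS="-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly"
bin/elasticsearch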

There's a long and interesting thread of some other users and I discussing this on the Elastic discussion board here: https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229

@speedplane
Contributor

@travisbell Hi Travis, I've been seeing very similar patterns. Everything runs perfectly for a full week, then all of a sudden we get really high CPU usage, and Elasticsearch is unresponsive. Most operations fail. Can you expand a bit on what you did to fix this issue? How do you switch from G1GC to CMS?

I have a feeling that this is caused by a node running out of Java heap space. But I would imagine that would lead to a crash, not 100% CPU, and I don't see OutOfMemory errors in the logs, so I'm not sure.

@speedplane
Contributor

speedplane commented Jun 23, 2016

I see the following in the logs, right before I get 100% CPU, which is why I suspect it's an out of memory issue.

[2016-06-23 17:44:23,621][WARN ][monitor.jvm] 
[elasticsearch1-elasticsearch-5-vm] [gc][old][218473][270] 
duration [1.2m], collections [1]/[1.2m], total [1.2m]/[17.7m], 
memory [25.2gb]->[25.2gb]/[25.9gb], 
all_pools {[young] [196.5mb]->[141.6mb]/[532.5mb]}{[survivor] 
[66.5mb]->[0b]/[66.5mb]}{[old] [25gb]->[25gb]/[25.3gb]}

(formatted for easier reading)

@jasontedor
Member

jasontedor commented Jun 23, 2016

Not quite, but very close.

[...] duration [1.2m] [...] memory [25.2gb]->[25.2gb]/[25.9gb] [...]

This indicates a completely ineffective garbage collection cycle lasting for over a minute. The JVM is consuming CPU trying to collect when collecting is ineffective. You'll likely hit GC overhead limits soon at which point you will see OutOfMemoryErrors.
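One way to see how close a node is getting to this state before it tips over is the per-node JVM stats (a sketch; watch jvm.mem.heap_used_percent, and a value stuck in the mid-90s is a sign of exactly this kind of GC thrashing):

curl -XGET 'localhost:9200/_nodes/stats/jvm?pretty'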

It's best to open a new post on the Elastic Discourse forum.

@speedplane
Contributor

@jasontedor I've started a discussion here. Any help on the matter would be appreciated.

@neumachen

@ACV2 what tool are you using for the visual monitoring data?

@ianchan0817

ianchan0817 commented May 2, 2018

@ACV2 this issue also happens to me when searching for *../../../../../../../../../../../../../../../../.*
It never times out.

indices:data/read/search - transport 1525253037206 02:23:57 13.9m x.x.x.x U5w91H7 indices[products], types[product], search_type[QUERY_THEN_FETCH], source[{"from":0,"size":24,"query":{"bool":{"must":[{"match":{"owner_id":{"query":"","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":1.0}}},{"match":{"status":{"query":"active","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":1.0}}}],"must_not":[{"match":{"status":{"query":"removed","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":1.0}}},{"match":{"type":{"query":"addon","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":1.0}}},{"match":{"type":{"query":"gift","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":1.0}}}],"should":[{"regexp":{"title_translations.zh-hant":{"value":".*../../../../../../../../../../../../../../../../.*","flags_value":65535,"max_determinized_states":10000,"boost":10.0}}},{"regexp":{"title_translations.en":{"value":".*../../../../../../../../../../../../../../../../.*","flags_value":65535,"max_determinized_states":10000,"boost":10.0}}},{"match":{"sku":{"query":"../../../../../../../../../../../../../../../../","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":100.0}}},{"match":{"title_translations.zh-hant":{"query":"../../../../../../../../../../../../../../../../","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":50.0}}},{"match":{"title_translations.en":{"query":"../../../../../../../../../../../../../../../../","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":50.0}}},{"match":{"categories.name_translations.en":{"query":"../../../../../../../../../../../../../../../../","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":20.0}}},{"match":{"categories.name_translations.zh-hant":{"query":"../../../../../../../../../../../../../../../../","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":20.0}}},{"match":{"variations.sku":{"query":"../../../../../../../../../../../../../../../../","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","boost":100.0}}}],"disable_coord":false,"adjust_pure_negative":true,"minimum_should_match":"1","boost":1.0}},"sort":[{"_score":{"order":"desc"}},{"created_at":{"order":"desc"}}],"ext":{}}]

BTW, I am using the Elasticsearch Service on AWS, so there is no server configuration available to me there.
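The runaway CPU here most likely comes from the leading and trailing .* in those regexp clauses, which force Lucene to walk a huge portion of the terms dictionary. Since the AWS service doesn't expose server configuration, one application-side mitigation is to lower max_determinized_states so pathological patterns are rejected with an error instead of being executed, for example (a sketch against the same products index and title_translations.en field from the query above; the value string is a placeholder for sanitized user input):

curl -XPOST 'localhost:9200/products/_search' -H 'Content-Type: application/json' -d '{
  "query": {
    "regexp": {
      "title_translations.en": {
        "value": ".*sanitized user input.*",
        "max_determinized_states": 1000
      }
    }
  }
}'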

@flat010

flat010 commented May 25, 2018

I also hit this problem, and the _nodes/hot_threads command returns the following:
::: {dmsnode-192}{q8IITWyORxO8a0zZ5rTIXA}{192.168.90.192}{192.168.90.192:9300}{rack=r1}
Hot threads at 2018-05-25T07:54:39.738Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

62.4% (312ms out of 500ms) cpu usage by thread 'elasticsearch[dmsnode-192][transport_client_worker][T#15]{New I/O worker #15}'
2/10 snapshots sharing following 10 elements
org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:434)
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
