
1.3.4 heap used growing and crashing ES #8249

Closed
johnc10uk opened this Issue Oct 28, 2014 · 29 comments

johnc10uk commented Oct 28, 2014

I have upgraded from v1.0.0 to 1.3.4 and now see heap usage grow from 60% to 99% over 7 days, which crashes ES. When heap usage hits 70%, I see 100% usage on one CPU at a time. ES is working fine otherwise; it just appears to be a memory leak or a failure to complete GC. Can I safely downgrade to v1.3.1 to see if that fixes the problem? We currently have 55M documents, 23GB in size.

Thanks, John.

clintongormley (Member) commented Oct 28, 2014

@johnc10uk A few questions:

  • What custom settings are you using?
  • What is your ES_HEAP_SIZE?
  • Do you have swap disabled?

Can you send the output of these two requests:

curl localhost:9200/_nodes > nodes.json
curl localhost:9200/_nodes/stats?fields=* > nodes_stats.json
johnc10uk (Author) commented Oct 28, 2014

No custom settings, 8.8GB heap, and no swap. Again, this was working fine on 1.0.0.

Gist: https://gist.github.com/johnc10uk/441b58a6181bc07af4c1
johnc10uk (Author) commented Oct 28, 2014

Tried doubling RAM to 16GB, but it just lasts longer before filling the heap. Upgraded Java from update 25 to update 55; no difference.

java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

clintongormley (Member) commented Oct 28, 2014

@johnc10uk thanks for the info. Could I ask you to provide this output as well please:

curl localhost:9200/_nodes/hot_threads > hot_threads.txt
johnc10uk (Author) commented Oct 28, 2014

clintongormley (Member) commented Oct 28, 2014

Thanks. Nothing terribly unusual so far. Do you have a heap dump left from when one of the nodes crashed? If not, could you take a heap dump when memory usage is high, and share it with us?

thanks

clintongormley (Member) commented Oct 28, 2014

Link for heap dump: http://stackoverflow.com/questions/17135721/generating-heap-dumps-java-jre7

Please could you .tar.gz it, and you can share its location with me privately: clinton dot gormley at elasticsearch dot com.

Also, your mappings and config please.

johnc10uk (Author) commented Oct 28, 2014

Had to add -F to get jmap to work ("Unable to open socket file: target process not responding or HotSpot VM not loaded").

jmap then took elasticsearch1 out of the cluster, and the dump was taking forever; I had to kill it, and the dump was only 10MB. Found out I need to run jmap as the elasticsearch user. Emailed a link to the files on S3.

Thanks, John.

johnc10uk (Author) commented Oct 28, 2014

Ended up restarting both the elasticsearch1 and elasticsearch3 server processes to get everything back in sync after node 1 dropped out. However, the heap used on node 2 also dropped, to about 10%, and all nodes are now at 40% heap used. Some sort of communication failure not completing jobs?

mikemccand (Contributor) commented Oct 28, 2014

I looked at the heap dump here: it's dominated by the filter cache, which for some reason is not clearing itself. What settings do you have for the filter cache?

It's filled with many (5.5 million) TermFilter instances, and the Term in each of these is typically very large (~200 bytes).

clintongormley (Member) commented Oct 29, 2014

@johnc10uk when I looked at the stats you sent, filters were only using 0.5GB of memory. It looks like you are using the default filter cache settings, where the cache defaults to 10% of heap, i.e. 0.9GB in your case (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html#filter).

Please could you keep an eye on the output from the following request, and see if the filter cache is growing unbounded:

curl -XGET "http://localhost:9200/_nodes/stats/indices/filter_cache?human"

In particular, I'd be interested in hearing if the filter cache usage doesn't grow, but memory usage does.
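This kind of check can be scripted against the nodes-stats output. As a rough sketch (not an official tool), the following parses the `filter_cache` section of a `_nodes/stats` response and sums the tracked memory across nodes; the node names and byte counts here are sample data shaped like the stats excerpts posted in this thread:

```python
def filter_cache_bytes(nodes_stats):
    """Sum tracked filter_cache memory across all nodes in a
    _nodes/stats-style response dict."""
    total = 0
    for node in nodes_stats.get("nodes", {}).values():
        total += node["indices"]["filter_cache"]["memory_size_in_bytes"]
    return total

# Sample payload mimicking the structure of the stats excerpts in this thread.
sample = {
    "nodes": {
        "node1": {"indices": {"filter_cache": {
            "memory_size_in_bytes": 337965460, "evictions": 0}}},
        "node2": {"indices": {"filter_cache": {
            "memory_size_in_bytes": 341677692, "evictions": 0}}},
    }
}

print(filter_cache_bytes(sample))  # 679643152
```

Polling this periodically and comparing it against the JVM heap-used figure would surface exactly the pattern asked about: heap growing while the tracked filter cache size stays flat.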

johnc10uk (Author) commented Oct 29, 2014

Yes, all defaults. The HQ node diagnostics is currently showing a 320MB filter cache, and I don't think it ever reached the 880MB needed to start any evictions. I'll keep a log of the filter cache; it is growing:

{"filter_cache":{"memory_size":"322.3mb","memory_size_in_bytes":337965460,"evictions":0}}

{"filter_cache":{"memory_size":"325.8mb","memory_size_in_bytes":341677692,"evictions":0}}

Thanks, John.

Yakx commented Oct 29, 2014

We are experiencing this in production as well, since the upgrade to 1.3.4. When do you plan to fix this?

johnc10uk (Author) commented Oct 29, 2014

We have turned off caching for the TermFilter you identified, which should stop the memory creep (it didn't need caching anyway). Restarted each node to clear out the heap and will monitor. Thanks for the help; very informative.
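For anyone following along: in ES 1.x a filter can opt out of caching with the `_cache` flag in the query DSL. Below is a minimal sketch of what such a request body looks like, built as a Python dict and printed as JSON; the `user_id` field and the ID values are hypothetical, not taken from this cluster:

```python
import json

# Hedged sketch: a filtered query whose terms filter opts out of the
# filter cache via "_cache": false (ES 1.x query DSL). The field name
# "user_id" and the ID values are made up for illustration.
query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "terms": {
                    "user_id": ["id-one", "id-two"],
                    "_cache": False,  # do not cache this filter
                }
            },
        }
    }
}

print(json.dumps(query, indent=2))
```

Sending this body to `_search` (e.g. via curl) keeps the terms filter out of the cache entirely, which avoids the key-size problem discussed below.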

cregev commented Oct 29, 2014

Hi,
We have upgraded our Elasticsearch cluster to version 1.3.4. After the upgrade, we see heap usage grow from 60% to 99% in a few hours; when the heap reaches 99%, some of the machines disconnect from the cluster and then reconnect. In addition, once heap usage reaches 90%+, it does not return to normal.

I have attached your esdiagdump.sh script output so you can check our cluster stats. Please advise:

https://drive.google.com/a/totango.com/file/d/0B17CHQUEl6W8dE15MGVveWlycFk/view?usp=sharing

Thanks,
Costya.

clintongormley (Member) commented Oct 29, 2014

@CostyaRegev you have high filter eviction rates. I'm guessing that you're also using filters with very long keys which don't match any documents?

cregev commented Oct 29, 2014

@clintongormley - as for your question:
It is a test environment of ours, and we don't query it much. We usually have less than 1 query per second, and at most 20 filter cache evictions per second; most of the time there are 0 evictions per second. And yet the JVM memory % of some nodes hits the high nineties and never goes down. The nodes also crash every now and then. BTW, the CPU usage percentage is not that high, which might indicate a memory leak of some sort.

Here's our Marvel screenshot: (screen shot 2014-10-29 at 6:07:02 PM)

clintongormley (Member) commented Oct 29, 2014

20 filter evictions per second is a lot - it indicates that you're caching lots of filters that shouldn't be cached. Also, you didn't answer my question about the length of cache keys. Are you using filters with, eg, many string IDs in them?

cregev commented Oct 29, 2014

Practically, no - not in our test environment. But then again, we have a search shard query rate of <0.1 almost always - it's mostly idle. And we have almost no indexing all day long (except for a few peaks)... Why is all the memory occupied to the point that nodes just crash?

clintongormley (Member) commented Oct 29, 2014

@CostyaRegev Please try this to confirm it is the same problem:

curl -XPOST 'http://localhost:9200/_cache/clear?filter=true'

Within one minute of sending the request, you should see your heap size decrease (although it may take a while longer for a GC to kick in).

If that doesn't work, then I suggest taking a heap dump of one of the nodes and using, e.g., the YourKit profiler to figure out what is taking so much memory.

cregev commented Oct 29, 2014

After trying the request you sent, I indeed saw our heap size decrease.
Is there a workaround for this problem? When will you fix it?

s1monw (Contributor) commented Oct 29, 2014

@clintongormley regarding this, maybe we should do the same thing here that we did for FieldData in 65ce5ac.

clintongormley (Member) commented Oct 30, 2014

@s1monw I've opened #8285 for that.

@CostyaRegev The problem is as follows:

The filter cache size is the sum of the size of the values in the cache. The size does not take the size of the cache keys into account, because this info is difficult to access.

Previously, cached filters used 1 bit for every document in the segment, so the sum of the size of the values worked as a reasonable heuristic for the total amount of memory consumed. When the cache filled up with too many values, old filters were evicted.

Now, cached filters which match (almost) all docs or (almost) no docs can be represented in a much more efficient manner, and consume very little memory indeed. The result is that many more of these filters can be cached before triggering evictions.

If you have:

  • many unique filters
  • with long cache keys (eg terms filters on thousands of IDs)
  • that match almost all or almost no docs

...then you will run into this situation, where the cache keys can consume all of your heap, while the sum of the size of the values in the cache is small.

We are looking at ways to deal with this situation better in #8268. Either we will figure out a way to include the size of cache keys in the calculation (difficult), or we will just limit the number of entries allowed in the cache (easy).
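A back-of-the-envelope illustration of that accounting gap, using the ~5.5 million TermFilter instances and ~200-byte terms reported from the heap dump earlier in this thread (the 8-byte per-filter value size is an assumed figure for a compacted match-all/match-none filter, not a measurement):

```python
# Rough illustration of the accounting gap described above.
# num_cached_filters and key_bytes_per_filter echo the heap-dump figures
# from earlier in the thread; value_bytes_per_filter is an assumption.
num_cached_filters = 5_500_000   # TermFilter instances seen in the heap dump
key_bytes_per_filter = 200       # ~200-byte terms, as noted above
value_bytes_per_filter = 8       # sparse/dense filters compress to almost nothing

untracked_key_mem = num_cached_filters * key_bytes_per_filter
tracked_value_mem = num_cached_filters * value_bytes_per_filter

print(f"tracked (values):  {tracked_value_mem / 2**20:.0f} MB")   # ~42 MB
print(f"untracked (keys):  {untracked_key_mem / 2**20:.0f} MB")   # ~1049 MB
```

Under these assumptions, the cache believes it holds tens of megabytes while the keys alone consume over a gigabyte, which is why the 10%-of-heap limit never triggers evictions.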

Workarounds for now:

johnc10uk (Author) commented Oct 30, 2014

@clintongormley as far as getting the size of the cache keys, is that something that could/should be added to Lucene API?

clintongormley (Member) commented Oct 30, 2014

@johnc10uk no - the caching happens on our side.

Yakx commented Nov 1, 2014

ES folks,
Thanks for the suggestions so far; they helped. However, IMO the problem is much bigger than you think. We have found the following:

  1. The ID cache has the same problem as field data and the filter cache. We had to stop using all our parent-child queries. This was not a big loss, as that model is practically unusable since it is way too slow.
  2. THE MAIN problem is cluster management. When a node is crashing after an OOM, cluster management does not recognize that, and then:
     • calls to the status API return all green
     • the cluster does not use replicas, and queries never return

Our production cluster is crashing every day. The workarounds suggested reduced the number of crashes in our test env and will be implemented in production. The main problem, though, is cluster recovery.

Let us know what other info/help you need in order to fix.

clintongormley (Member) commented Nov 1, 2014

@Yakx it sounds like you have a number of configuration issues, which are unrelated to this thread. Please ask these questions in the mailing list instead: http://elasticsearch.org/community

clintongormley (Member) commented Nov 3, 2014

Closed by #8304
