How to regulate off-heap usage? #26269

Closed

erik-stephens opened this issue Aug 18, 2017 · 42 comments

Labels
:Distributed/Network feedback_needed v5.5.0

Comments

@erik-stephens

Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.0, Build: 260387d/2017-06-30T23:16:05.735Z, JVM: 1.8.0_131

Plugins installed: []

JVM version (java -version):
openjdk version "1.8.0_66-internal"
OpenJDK Runtime Environment (build 1.8.0_66-internal-b17)
OpenJDK 64-Bit Server VM (build 25.66-b17, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 GNU/Linux

Docker version:
Docker version 1.6.2, build 7c8fca2

Description of the problem including expected versus actual behavior:

I expect the process RSS to not greatly exceed the sum of -Xmx & -XX:MaxDirectMemorySize. On a system with 128G of system memory, the elasticsearch process quickly reaches that sum (30G + 32G) and increases slowly until a host level OOM (OS, not JVM). I've ruled out MetaSpace and things related to threads (num+stack size, thread memory pools). Please refer to this thread for more background.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it
for us to reproduce the problem, the more likely it is that somebody will take the time to look at it.

  1. Start a data-only node with heap & native/direct memory size defined.
  2. The workload is index-heavy. What little querying there is gets handled by tribe nodes. No aggregations. This is not a heap issue.
  3. Wait for OOM.

Provide logs (if relevant):

@jasontedor
Member

That would not be the limit because Netty (our underlying networking framework) uses its own direct buffer space. With the settings we provide by default, that should lead to up to another MaxDirectMemorySize being used. Can you do a few things:

  • start Elasticsearch with logger.io.netty.util.internal.PlatformDependent=debug (either -E on the command line or in the elasticsearch.yml) and share the output of the log lines that match i.n.u.i.PlatformDependent?
  • use native memory tracking: start Elasticsearch with -XX:NativeMemoryTracking=detail; immediately after start, execute jcmd <PID> VM.native_memory baseline, then capture jcmd <PID> VM.native_memory detail.diff a few times as the leak grows
  • remove the JVM option -XX:+DisableExplicitGC (see Stop disabling explicit GC #25759 for why)
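
For convenience, the checklist above boils down to something like the following sketch (<PID> is a placeholder for the Elasticsearch process id; adjust paths to your install):

  # 1. Netty debug logging, command-line form (the same setting can live in elasticsearch.yml)
  bin/elasticsearch -Elogger.io.netty.util.internal.PlatformDependent=debug

  # 2. Native memory tracking: add -XX:NativeMemoryTracking=detail to the JVM options, then
  jcmd <PID> VM.native_memory baseline
  # ...wait while RSS grows, then repeat a few times and compare:
  jcmd <PID> VM.native_memory detail.diff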

@tlrx added the :Distributed/Network, feedback_needed, and v5.5.0 labels Aug 18, 2017
@Shizuu

Shizuu commented Aug 18, 2017

Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.1, Build: 19c13d0/2017-07-18T20:44:24.823Z, JVM: 1.8.0_144

Plugins installed: []

JVM version (java -version):
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux 3.13.0-128-generic #177-Ubuntu SMP Tue Aug 8 11:40:23 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
We have the same issue with 5.5.1, we upgraded from 5.3.0.
It only affects our master nodes though, especially the active master.
We didn't see the problem in staging, but our staging environment doesn't have the workload of our production environment (heavy indexing).

The HEAP is fine, it seems to be non HEAP. With 5.3.0, the RSS was always ~100% the HEAP size. With 5.5.1, the RSS will grow until ~200% the HEAP size and then vary between ~150% and ~200%. We confirmed this by reducing the HEAP size. We used to have OOM because our heap size was half of RAM, but now it's fine with heap size at the quarter of RAM.

Steps to reproduce:

  1. Start a cluster with fixed HEAP size, memory_lock = true
  2. Do heavy indexing
  3. Master RSS will grow until

Provide logs (if relevant):

  • With logger.io.netty.util.internal.PlatformDependent=debug :
[2017-08-18T11:10:19,079][DEBUG][i.n.u.i.PlatformDependent] Java version: 8
[2017-08-18T11:10:19,080][DEBUG][i.n.u.i.PlatformDependent] maxDirectMemory: 4056416256 bytes (maybe)
[2017-08-18T11:10:19,080][DEBUG][i.n.u.i.PlatformDependent] -Dio.netty.tmpdir: /tmp (java.io.tmpdir)
[2017-08-18T11:10:19,080][DEBUG][i.n.u.i.PlatformDependent] -Dio.netty.bitMode: 64 (sun.arch.data.model)
[2017-08-18T11:10:19,081][DEBUG][i.n.u.i.PlatformDependent] -Dio.netty.noPreferDirect: true
[2017-08-18T11:10:19,081][DEBUG][i.n.u.i.PlatformDependent] -Dio.netty.maxDirectMemory: -1 bytes
[2017-08-18T11:10:19,081][DEBUG][i.n.u.i.PlatformDependent] -Dio.netty.uninitializedArrayAllocationThreshold: -1
[2017-08-18T11:10:19,099][DEBUG][i.n.u.i.PlatformDependent] org.jctools-core.MpscChunkedArrayQueue: unavailable

@jasontedor
Member

@Shizuu Can you verify the JVM options that Elasticsearch is started with? They are printed in the logs on startup; can you share that log line here?

@erik-stephens
Author

I disabled the disabling of explicit GC and enabled that netty debug logging:

[2017-08-18T18:21:55,973][DEBUG][io.netty.util.internal.PlatformDependent] Java version: 8
[2017-08-18T18:21:55,974][DEBUG][io.netty.util.internal.PlatformDependent] maxDirectMemory: 32098877440 bytes (maybe)
[2017-08-18T18:21:55,975][DEBUG][io.netty.util.internal.PlatformDependent] -Dio.netty.tmpdir: /tmp (java.io.tmpdir)
[2017-08-18T18:21:55,975][DEBUG][io.netty.util.internal.PlatformDependent] -Dio.netty.bitMode: 64 (sun.arch.data.model)
[2017-08-18T18:21:55,977][DEBUG][io.netty.util.internal.PlatformDependent] -Dio.netty.noPreferDirect: true
[2017-08-18T18:21:55,977][DEBUG][io.netty.util.internal.PlatformDependent] -Dio.netty.maxDirectMemory: -1 bytes
[2017-08-18T18:21:55,977][DEBUG][io.netty.util.internal.PlatformDependent] -Dio.netty.uninitializedArrayAllocationThreshold: -1
[2017-08-18T18:21:56,003][DEBUG][io.netty.util.internal.PlatformDependent] org.jctools-core.MpscChunkedArrayQueue: unavailable

My environment doesn't allow me to enable & query NMT. Instead of enabling explicit GC's, does adding -Dio.netty.maxDirectMemory=0 [1] seem like a way to help cap total off-heap usage? Thank you for your help!

[1] https://netty.io/4.0/xref/io/netty/util/internal/PlatformDependent.html#141

@erik-stephens
Author

And in case this helps, this is what cmd-line looks like:

java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Xms30G -Xmx30G -XX:MaxDirectMemorySize=32G -XX:-DisableExplicitGC -Des.path.home=/elasticsearch -cp /elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Epath.conf=/app/config

@jasontedor
Member

You still have -XX:+DisableExplicitGC on that command line.

@Shizuu

Shizuu commented Aug 18, 2017

Here it is:

JVM arguments [-Xms2g, -Xmx2g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xms4g, -Xmx4g, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=70, -XX:+UseCMSInitiatingOccupancyOnly, -XX:NewRatio=2, -XX:SurvivorRatio=4, -XX:+HeapDumpOnOutOfMemoryError, -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution, -Xloggc:/var/opt/hosting/log/elasticsearch/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=3, -XX:GCLogFileSize=10M, -Dcom.sun.management.jmxremote.port=7199, -Dcom.sun.management.jmxremote.ssl=false, -Dcom.sun.management.jmxremote.authenticate=false, -javaagent:/usr/share/java/jolokia-jvm-1.1.4-agent.jar=port=6666,host=localhost, -XX:NativeMemoryTracking=detail, -Des.path.home=/usr/share/elasticsearch]

Explicit GC is not disabled (I made sure to check jvm.options when I updated to 5.5.1).

Before posting the issue here I tried:

  • re-enabling explicit GC
  • updating Java from 8u112 to 8u141
  • updating the kernel & system and rebooting

but none of it changed anything.

@jasontedor
Member

jasontedor commented Aug 18, 2017

You have JMX and an agent installed. Would you please remove those and see if the issue continues to reproduce? I want to be clear: I am not suggesting these are the cause (I doubt they are), rather I need to get this down to the essence that I can reproduce before I can make progress on this. The more variables we remove, the simpler our effort will be. And maybe we get lucky before rolling up our sleeves.

@Shizuu

Shizuu commented Aug 18, 2017

No problem, I can remove them. We use them mostly to monitor the JVM.
Here are the new options:

JVM arguments [-Xms2g, -Xmx2g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xms4g, -Xmx4g, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=70, -XX:+UseCMSInitiatingOccupancyOnly, -XX:NewRatio=2, -XX:SurvivorRatio=4, -XX:+HeapDumpOnOutOfMemoryError, -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution, -Xloggc:/var/opt/hosting/log/elasticsearch/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=3, -XX:GCLogFileSize=10M, -XX:NativeMemoryTracking=detail, -Des.path.home=/usr/share/elasticsearch]

But RSS is still growing above 6G.
Here is another memory detail:
memory-6G.txt

@jasontedor
Member

Thanks for the quick turnaround. Let's try another thing: from Elasticsearch 5.3.0 to 5.5.1, we upgraded Netty 4 from 4.1.7 to 4.1.11 and Lucene from 6.4.1 to 6.6.0. Those are quite some changes, on top of new features and so on (but the most likely culprits for leaking native memory are Netty and Lucene).

Let's start by seeing whether it's something new in Netty 4. Can you start Elasticsearch with -Etransport.type=netty3 -Ehttp.type=netty3 (or put them in the elasticsearch.yml)? If the leak still occurs, we know it's not Netty 4. If the leak does not occur, we know it's Netty 4. Either way, this is hugely valuable.
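
For reference, the elasticsearch.yml form of those two settings would look something like this sketch (illustrative, for the 5.x versions discussed in this thread):

  # elasticsearch.yml — fall back to the Netty 3 based transport implementations
  transport.type: netty3
  http.type: netty3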

@Shizuu

Shizuu commented Aug 19, 2017

I think you nailed it. The RSS stabilized around 4.7G. 👍
It's still moving, but only in small steps of less than 500 KB. (On 5.3, the RSS stops moving at all about two minutes after startup.)
I need to wait a day or two to see if it grows further.
If I can, I'll post some graphs of memory over time.

@jasontedor
Member

I would not call it nailed yet. We have only confirmed a strong hunch I had that it was Netty 4; now I need to understand what changed in Netty 4 that is doing this. 😄

@jasontedor
Member

So let's go back to the beginning: now that we have a culprit, we need to find what in Netty 4 is causing this. What can you offer in the way of a reproduction? Are there any straightforward steps I can follow that reliably reproduce the issue? With a reproduction in hand, and knowing where to look, it would be time to roll up our sleeves and find the root cause.

@Shizuu

Shizuu commented Aug 19, 2017

I don't know exactly how to reproduce it.
In staging, we didn't have the issue. Staging is 1 master node, 3 data nodes, and 2 additional nodes for Kibana, with an ingestion rate between 100 and 3,000 docs/s.
Our production is 3 master nodes, 24 data nodes, and 2 additional nodes, with an ingestion rate between 10,000 and 100,000 docs/s (1,700 indices, 8,700 primary shards, with 1 replica).

Another thing I just noticed: our active master is using way more CPU than before.
With 5.3.0, about 10% of CPU was used continuously; with 5.5.1 it's around 35%, with some cores at 70% continuously. Even with netty 3, the CPU usage is high.
Edit: I verified CPU usage on the data nodes too and it didn't change; again, the issue is only on the active master.

@jasontedor
Member

I tried a few different approaches tonight to see if I could come up with a reproduction and I do not have one. What is your method for ingesting documents into Elasticsearch?

@Shizuu

Shizuu commented Aug 19, 2017

We use bulk indexing across all 24 data nodes (we are in the process of building ingest nodes).
Bulk size is 3,000 documents at most. Indices are daily, with ~50 indices per day.
Mapping is dynamic and is updated every few minutes depending on the documents we receive.
We also get a few mapping exceptions every few seconds because some clients are sending us conflicting documents.

Since the CPU usage is abnormally high, I had an idea: I checked the hot_threads API on the active master and I have this:
hot_threads_master02.txt
There are always 3 or 4 threads named "{New I/O worker #18}" taking ~60-100% of a core.
I hope this is relevant. I don't see this type of thread on staging, nor do I see it on our second production cluster still on 5.3.0 (which has half the workload).
Edit: these New I/O worker threads may be from netty3; I will switch back to Netty 4 and look at hot_threads again.

@Shizuu

Shizuu commented Aug 19, 2017

With netty 4, the same threads are taking 100% of a core:
hot_threads_master01_netty4.txt

@imotov
Contributor

imotov commented Aug 19, 2017

Mapping is dynamic, and is updated every few minutes depending on the documents we receive.

@Shizuu how big is your cluster state at the moment? Looking at the hot threads, all these threads are doing is sending mappings over the network.

@Shizuu

Shizuu commented Aug 19, 2017

Yes, we have a pretty big cluster state because we use a lot of types; even for similar documents we can have dozens of types (I know it's bad, and we plan to reduce that to be ready for 6.0 & 7.0).
Right now a GET of the cluster state (without pretty) gives me a JSON of 65 MB.
I can count nearly 50,000 types and 1.5 million fields in the mappings.

@Shizuu

Shizuu commented Aug 23, 2017

Little update: I reproduced an equivalent cluster state in another environment and did heavy indexing while triggering a lot of mapping changes, but didn't manage to reproduce the issue.

Then I asked myself whether the issue is really linked to indexing and mapping changes.
So I completely stopped the indexing processes on the production cluster for a few minutes, then restarted all the masters to start clean.

And the result is that the issue was still there on the active master:

  • RSS memory still growing
  • several hot threads still taking 100% CPU and trying to update mappings, even though no mappings are changing (I checked the logs)

So I don't know; maybe my cluster is in a weird state after the upgrade.

@Shizuu

Shizuu commented Aug 23, 2017

I found the culprit! And I managed to reproduce the issue (kind of).

I looked into why the master was always spreading mapping information over the network, even when the mapping was not changing.
I looked at hot_threads on all the nodes, hoping to find a thread reading mappings, and I found that it was always my 2 non-master/non-data nodes.

What is on these nodes? Kibana. 15 instances of Kibana on each node, so 30 in total.
To test that it was indeed Kibana, I stopped it and restarted the masters, and the issue was gone. Then, as soon as I restarted Kibana, the issue was back.

I managed to isolate the query Kibana was making that generated this load on the master; here it is:

/_cluster/settings?include_defaults=true&filter_path=**.script.engine.*.inline

And each Kibana instance is doing this query every 4 seconds.

I tried spamming this query on our staging environment, which runs Elasticsearch 5.5.1, and observed the same behavior: RSS increasing and a lot of threads busy on the master.

Side note, though: Kibana 5.3.0 does not make this request.
So I tried spamming the query on a cluster that is still on 5.3.0: the CPU usage was there, but I didn't observe the RSS increasing much.

In conclusion:

  • Kibana 5.5 makes a request that generates a lot of CPU usage on the active master (not noticeable with 1 instance of Kibana but very noticeable with 30 instances)
  • The threads handling this request seem to use more memory in 5.5 than in 5.3 (which seems to tie back to netty4 from our previous investigation)

I'm still wondering why this specific request ends up working with mapping information, because the response has little to do with mappings (even without the filter_path).

What do you think of all of this?

(and sorry @erik-stephens I may have hijacked your issue for something completely unrelated)

@jasontedor
Member

jasontedor commented Aug 23, 2017

Thank you, this is massively helpful! I will see if it reproduces on my side as well and start serious debugging now that we have a possible reproduction.

@erik-stephens
Author

erik-stephens commented Aug 23, 2017

@Shizuu No worries, your involvement is welcome! To help summarize things on our end:

  • Setting -XX:MaxDirectMemorySize=32G stabilized things quite a bit. 30G heap + 32G NIO for elasticsearch/lucene + 32G NIO for netty + a small amount for miscellaneous (e.g., MetaSpace) keeps us from OOM'ing.

  • We switched to netty3 yesterday with no ill side effects. I probably should have isolated that change from the MaxDirectMemorySize change to better see how each helps.

  • We don't have any kibana instances pointing at this cluster.

  • I tried to duplicate locally by bulk-indexing to a single node cluster in a loop while watching NMT diffs. I couldn't reproduce anything that looked like a leak. I think I'd have to try with a multi node cluster and more indices, suspecting that a large cluster state is key to reproducing.

  • I have not run with explicit GC re-enabled.

@jasontedor
Member

Sadly, the leak does not reproduce as reported. Is there something special about your cluster state (aside from being large)?

@Shizuu

Shizuu commented Aug 24, 2017

We have some very long type names, but I don't think this is it.

Did you test with "transport.tcp.compress: true"? We have this setting enabled (it's false by default), and in the hot_threads you can see the threads doing deflating and inflating.
I can't test right now because it is not a dynamically updatable setting.

@Shizuu

Shizuu commented Aug 24, 2017

I'm trying to replicate the issue on another cluster we upgraded to 5.5.1, without success so far, but it has a small cluster state. I'm trying to generate a bigger state and to figure out what in it might be triggering the issue.

@jasontedor
Member

Yes, I was using transport.tcp.compress set to true. I have a cluster state that is 44955643 bytes of pretty-printed json and 992131 bytes of compressed pretty-printed json and the issue still does not reproduce for me.

@Shizuu

Shizuu commented Aug 25, 2017

Okay, I have done countless tests: tried generating many indices and mappings, played with the options we have enabled, etc.... no luck.
Finally I copied the state of my affected cluster onto an empty cluster, saw that the issue was there even without the data, and tried to trim the cluster state down to a minimum that could still generate the issue.
(I have spent really too much time on this... :) )

But I think this time I have a reliable reproduction (and if not, well... never mind!)

So you need:

  • 1 master node
  • 1 non master/non data node
  • no data node needed

Then:

  • Get this file: index_settings.txt
    These are the settings and mappings (anonymized) of one of our daily indices that I found was causing the issue.
  • Generate 40 indices with the settings and mappings from this file:
    for i in $(seq -w 1 40); do curl -s -XPUT localhost:9200/index_2017.08.$i -d @index_settings.txt & done
    (I have seen the issue trigger with 30 indices, but you may need more depending on your hardware.)
  • On the non-master node, spawn 6 or more shells and run this in each shell to simulate several Kibana instances:
    while true; do curl -s "localhost:9200/_cluster/settings">/dev/null; done
  • The RSS should increase quickly to 2x the heap size.
  • If you see nothing, try spawning more shells or creating more indices. (A consolidated sketch of these steps follows below.)
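
For convenience, a rough consolidation of the steps above into a single shell sketch (assumptions: a local test cluster listening on localhost:9200, index_settings.txt from the attachment, and the same counts as in the report):

  # create 40 daily indices from the attached settings/mappings
  for i in $(seq -w 1 40); do
    curl -s -XPUT "localhost:9200/index_2017.08.$i" -d @index_settings.txt &
  done
  wait

  # simulate several Kibana instances hammering the cluster settings endpoint
  # (stop them later with `kill %1 %2 ...` or by closing the shell)
  for n in $(seq 1 6); do
    ( while true; do curl -s "localhost:9200/_cluster/settings" > /dev/null; done ) &
  done

  # then watch the RSS of the master's java process, e.g. with: top -p <PID>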

In the end, I found the issue is caused by indices with too many types and with very long type names (I tried the same mapping with shorter type names like type-001, type-002, etc., but the issue didn't trigger).
So there may be a bug, but it should go away once we migrate to a single type.
And if you don't succeed in replicating the issue, I think I will not look into it further and will instead accelerate our single-type migration ;)

@jasontedor
Member

(I have spent really too much time on this... :) )

First, thank you for all your time on this. Second, I've been right there with you these last few days. I will try your reproduction immediately.

@jasontedor
Member

The RSS should increase quickly to 2x the HEAP memory.

This happened. That's okay though, it's expected. Did something problematic happen after this?

@Shizuu

Shizuu commented Aug 25, 2017

It didn't happen in 5.3.0, so we were surprised.
Like I said, the RSS would never increase beyond 1x the heap size before the upgrade.
We have 15 GB of RAM and the heap was set to 8 GB. Everything was stable.
After the upgrade we didn't immediately see that memory usage was going above the heap, and since our heap was set to half the RAM, the kernel would OOM-kill the active master; a second master would be elected and OOM-killed soon after, and then quorum could no longer be met, so the cluster was out of service.
We set the heap to 6 GB, but we still got some kernel OOM kills during the night, meaning usage had reached 2.5x the heap.
We set the heap to 4 GB and haven't seen OOM kills since, but I often see the elasticsearch process taking 10 GB of RAM.
If you say this is normal even though 5.3.0 did not have the issue, that's okay.
I mean, we don't use types the way we should, so the fault is more on us.
And if the issue was not visible in 5.3.0, it may be because Kibana was not making these cluster settings requests (needs testing).
Either way, thank you for your time too, it's appreciated ;)

@jasontedor
Member

jasontedor commented Aug 25, 2017

I tested your reproduction on 5.3.0 and it behaved the same as on 5.5.2.

@jasontedor
Member

I do not think there is a leak here. I think that Netty and the JVM are behaving as expected, with Netty using a pool of direct byte buffers up to the maximum direct memory. I compiled jemalloc with --enable-prof and profiled the allocations of the JVM while following your reproduction steps. The allocations all come from Unsafe_AllocateMemory and they are still tracked by the JVM (i.e., they are not leaked); these are direct byte buffers used for channel writes. I see the cleaner running from time to time freeing these native buffers.
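
For anyone wanting to repeat that kind of profiling, a hedged sketch of the jemalloc approach described above (the paths, MALLOC_CONF options, and jeprof invocation are assumptions to adapt to your own jemalloc build):

  # build jemalloc with profiling support
  ./configure --enable-prof && make

  # run Elasticsearch with the profiling allocator preloaded (paths are placeholders)
  export LD_PRELOAD=/usr/local/lib/libjemalloc.so
  export MALLOC_CONF="prof:true,prof_prefix:jeprof.out,lg_prof_interval:30"
  bin/elasticsearch

  # inspect where the native allocations come from
  jeprof --show_bytes "$(which java)" jeprof.out.*.heap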

The reason that Netty 3 does not behave this way is that it uses unpooled heap byte buffers for the large writes that result from sending the compressed cluster state across the wire. These heap byte buffers are copied to direct byte buffers managed by the JVM as opposed to being pooled by Netty (the JVM will use a cached direct buffer, or allocate a new one if no cached direct buffer is available). If you start the JVM with -Djdk.nio.maxCachedBufferSize=262144, it will neuter the JVM's ability to reuse these large cached direct buffers, forcing the JVM to constantly allocate and free direct buffers, which you can observe happening.

The reason that you only see this with transport.tcp.compress set to true and only with long type names is that with compression disabled, the size of the message is so large that Netty uses an unpooled direct byte buffer, while the size after compression has to be large enough that these allocations are never served from Netty's internal buffer cache and never satisfied from any pool list. This leads to repeated direct buffer allocations until max direct memory is reached.

You can control this by setting max direct memory, or by tuning components of the Netty buffer pool.
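
As a rough illustration of those two knobs, jvm.options entries along these lines could be used (values are placeholders, not recommendations, and the Netty allocator property is an assumption to verify against the documentation of the bundled Netty version):

  # hard cap on JDK direct buffers; Netty derives its own default limit from this
  -XX:MaxDirectMemorySize=4g

  # Netty's own direct-memory limit, discussed earlier in this thread (-1 is the default)
  -Dio.netty.maxDirectMemory=0

  # example Netty buffer-pool knob (assumption): fewer direct arenas shrink the pooled direct footprint
  -Dio.netty.allocator.numDirectArenas=2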

Thank you for the report, and I appreciate the responsiveness and the effort to work through this. Issues like this can be serious and nasty, so we investigate them thoroughly treating them as guilty until proven innocent.

In this case, I am going to close this as a non-issue.

@erik-stephens changed the title from "Possible memory leak" to "How to regulate off-heap usage?" Aug 30, 2017
@erik-stephens
Author

You can control this by setting max direct memory, or by tuning components of the Netty buffer pool.

With -Xmx30G and -XX:MaxDirectMemorySize=15G, I expect RSS not to greatly exceed 45G but it does (over 60G) currently:

64008820 19594 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Xms30G -Xmx30G -XX:MaxDirectMemorySize=15G -Des.path.home=/elasticsearch -cp /elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Epath.conf=/app/config -Epath.logs=/app/logs

Before focusing on netty, I used this script to quantify how much is attributable to elasticsearch/lucene; it looks like it is not honoring that setting:

# python parse_smaps.py -t Pss /proc/$pid/smaps  | awk '$3 ~ "indices" { n += $1 } END { print n / 1048576 }'
29.8111

Does that analysis look correct? Any other ideas or settings to regulate that usage? Thanks!
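
(parse_smaps.py isn't shown here; a rough awk-only equivalent over /proc/<pid>/smaps, under the assumption that it sums Pss in kB for mappings whose path contains "indices", might look like this:)

  awk '/^[0-9a-f]+-[0-9a-f]+ / { keep = ($NF ~ /indices/) }  # mapping header line: remember whether the path mentions "indices"
       keep && /^Pss:/        { kb += $2 }                   # accumulate Pss (reported in kB)
       END                    { printf "%.2f GB\n", kb / 1048576 }' /proc/<pid>/smaps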

@erik-stephens
Author

To address the question of whether re-allowing explicit GCs will help: should a non-explicit GC attempt to free the native buffers? We are seeing full GCs without a corresponding release of native memory. I was able to re-enable explicit GCs via ES_JVM_OPTIONS but haven't noticed a difference. This process is at 80G RSS with 49G from mmap'd files under the indices directory:

/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:NativeMemoryTracking=detail -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/app/logs/gc.log -Xms30G -Xmx30G -XX:MaxDirectMemorySize=15G -Des.path.home=/elasticsearch -cp /elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Epath.conf=/app/config -Epath.logs=/app/logs

Also, I appreciate that this might not be an elasticsearch issue and perhaps more a lucene or JVM issue; I'm bugging elasticsearch first in the hope that someone has already tackled this kind of problem. It sounds like most people are running on a dedicated machine (as opposed to in a container) and that letting usage grow unbounded, managed by the OS, is acceptable, but that does raise the question of why MaxDirectMemorySize (aka ES_DIRECT_SIZE) is mentioned in the docs.

@erik-stephens
Author

I just learned that mmap'd files are not considered direct memory, so that at least explains why MaxDirectMemorySize wasn't having any impact. The memory isn't locked (according to our configs & smaps output), so I'm still trying to figure out why we're OOM'ing when so much memory is available in the fs cache. I need to look at the system level and stop bugging y'all :)

@xgwu

xgwu commented Dec 28, 2017

@erik-stephens @jasontedor

I ran into a similar issue after upgrading a cluster from version 5.3.2 to 5.6.4 recently. Some data nodes with a high data ingestion rate seem to have been leaking memory since the upgrade.
mem_leakage

The sample node above is running on a physical server with 128 GB of RAM, with the JVM options below:

/home/deploy/env/jre/bin/java -Xms31g -Xmx31g -XX:+UseG1GC -XX:MaxGCPauseMillis=500 -XX:MaxDirectMemorySize=2g -XX:+AlwaysPreTouch -server -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -Dlog4j2.enable.threadlocals=false -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/home/deploy/app/elasticsearch -cp /home/deploy/app/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -d

i.e., I allocated 31 GB for the JVM heap, but the RSS of the process from the top command has been as high as 67 GB and keeps growing.

Pmap shows 422+ anonymous regions of 64 MB each, which accounts for about 26 GB of RAM.
pmap_output

Looking at the smaps of the ES process, these 64 MB anon regions are marked as Private_Dirty:
smap

Not sure if these growing anon regions will eventually cause an OOM.
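
(A hedged one-liner for counting such regions, assuming a procps pmap whose second column reports sizes in kB and that the regions show up as exactly 65536 kB:)

  pmap -x <PID> | awk '$2 == 65536 { n++ } END { printf "%d regions, ~%d GB reserved\n", n, n * 65536 / 1048576 }'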

@erik-stephens
Author

@xgwu For background, our elasticsearch process is co-hosted with another process that can try to allocate a lot of memory quickly. If your elasticsearch process is the only major process, then I doubt you'll have much risk of an OOM situation.

My understanding is that this usage is a good thing, in that the system is putting otherwise unused memory to good use. It stops being good once that usage starts putting pressure on the kernel's ability to allocate new pages quickly enough; if it can't, it will OOM-kill processes, and your large elasticsearch process makes a good target. I experimented with increasing vm.min_free_kbytes to help leave enough headroom for new memory allocations. I also considered OOMScoreAdjust in the elasticsearch systemd unit file as a way to help it survive any OOM-kill targeting.
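
A hedged sketch of those two mitigations (values and paths are illustrative, not recommendations):

  # reserve extra headroom for kernel page allocations (value is a placeholder)
  sysctl -w vm.min_free_kbytes=1048576

  # systemd drop-in (e.g. /etc/systemd/system/elasticsearch.service.d/oom.conf, path is an assumption)
  # to make the process a less attractive oom-kill target:
  # [Service]
  # OOMScoreAdjust=-500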

In short, it's a Linux memory management issue and not an elasticsearch issue. Some references that helped me:

@xgwu

xgwu commented Dec 29, 2017

@erik-stephens Thank you very much for the enlightenment! Then I have nothing to worry about since our elasticsearch is the only major process on the host.

@xgwu

xgwu commented Jan 18, 2018

@erik-stephens Our Elasticsearch process eventually consumed over 100 GB of memory and got OOM-killed by the OS. It's the only major process running on the box, so I think there is still something wrong. After more investigation, I suspect it's related to the java-bytebuffer-leak issue described in this blog post. I just added -Djdk.nio.maxCachedBufferSize=262144 to the JVM options of one ES instance and restarted it. I will watch for a couple of days to see whether the off-heap memory usage can be regulated.

@xgwu

xgwu commented Feb 22, 2018

@erik-stephens @jasontedor

FYI: for my particular case, the culprit of the off-heap memory leak seems to be G1 GC on the JVM version (1.8.0_77-b03) running in our environment. Either switching from G1 back to the default CMS or upgrading the JVM to the latest version (1.8.0_152-b16) solves the problem.

@praveenmak

praveenmak commented Oct 26, 2020

I seem to have a similar issue with Elasticsearch 7.7.0.

free -mh
              total        used        free      shared  buff/cache   available
Mem:           125G         34G        2.3G        16M         88G         90G
Swap:            0B          0B          0B

This is the java -version output. My ram.percent shows it's 99% full and buff/cache has 80-90 GB.

openjdk 14 2020-03-17
OpenJDK Runtime Environment AdoptOpenJDK (build 14+36)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 14+36, mixed mode, sharing)

Here are all 10 nodes in the ES cluster. All of them show almost 99% ram.percent

heap.percent ram.percent cpu 
33           98          0
48           97          0
31           99          0
40           96          0
37           99          0
48           98          0
38           29          0
31           94          0
54           98          0
31           98          0
