
OutOfMemoryError occurred in coordinating node #49699

Closed · wyukawa opened this issue Nov 29, 2019 · 9 comments

wyukawa (Contributor) commented Nov 29, 2019

Describe the feature:

Elasticsearch version (bin/elasticsearch --version):
7.4.2

Plugins installed: []

JVM version (java -version):

# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.1+9, mixed mode, sharing)

JVM options:

/usr/share/elasticsearch/jdk/bin/java -Xms16g -Xmx16g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-14015835058529761743 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Dio.netty.allocator.type=pooled -XX:MaxDirectMemorySize=10737418240 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet

OS version (uname -a if on a Unix-like system):

# uname -a
Linux ... 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
An OutOfMemoryError occurred on a coordinating node, i.e. a node that is neither a master nor a data node.
We have 3 coordinating nodes.

The Java heap of the coordinating node is 16 GB.
After this issue, I increased the Java heap to 20 GB.
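
For reference, a minimal sketch of where that heap change goes, assuming the default RPM layout implied by -Des.path.conf=/etc/elasticsearch above, where the heap size is set in jvm.options (file location may differ in other installations):

# /etc/elasticsearch/jvm.options (heap settings, illustrative)
-Xms20g
-Xmx20g

The node has to be restarted for the new heap size to take effect.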

Steps to reproduce:
I'm sorry, I haven't been able to reproduce it...

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make it for
us to reproduce it, the more likely it is that somebody will take the time to look at it.

Provide logs (if relevant):

java.lang.OutOfMemoryError: Java heap space
        at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:206) ~[?:?]
        at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:679) ~[?:?]
        at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:689) ~[?:?]
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:245) ~[?:?]
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:227) ~[?:?]
        at io.netty.buffer.PoolArena.reallocate(PoolArena.java:400) ~[?:?]
        at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:119) ~[?:?]
        at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:303) ~[?:?]
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1120) ~[?:?]
        at org.elasticsearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:125) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

Eclipse Memory Analyzer result

elasticmachine (Collaborator) commented Nov 29, 2019

Pinging @elastic/es-distributed (:Distributed/Network)

original-brownbear self-assigned this Nov 29, 2019
original-brownbear (Member) commented Nov 29, 2019

@wyukawa Would you be able to provide a better heap dump/analysis that contains the actual class names for the objects that are taking up most of the heap here?

wyukawa (Contributor, Author) commented Dec 2, 2019

I also really want to know the actual class names, but the resulting HTML report from Eclipse Memory Analyzer (generated with https://gist.github.com/mrbuk/5726c469234238a26327) shows the following:

One instance of "Unresolved Name 0x7f1ea1a113c0" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x429990548" occupies 13,640,405,400 (79.84%) bytes. The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".

It seems that a byte[] allocated inside Netty is holding a huge amount of memory:
https://github.com/netty/netty/blob/netty-4.1.38.Final/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L206

original-brownbear (Member) commented Dec 2, 2019

@wyukawa you could use a different tool to analyze the heap dump. Maybe try VisualVM? I think this issue may just be a manifestation of the things discussed in #44484, but let's make sure of that if possible.

wyukawa (Contributor, Author) commented Dec 3, 2019

Unfortunately, VisualVM doesn't work well on my Mac...
But I can say that network traffic between nodes increased roughly 6-7x during the outage.

wyukawa (Contributor, Author) commented Dec 3, 2019

VisualVM with JDK 12 worked!

original-brownbear (Member) commented Dec 3, 2019

@wyukawa interesting. Any indication of where these byte[] are referenced from or where they were created?

wyukawa (Contributor, Author) commented Dec 3, 2019

Hm, I'm not sure, but the following screenshot might be helpful to you.

original-brownbear (Member) commented Dec 3, 2019

@wyukawa ah, judging by its fields, that looks like HTTP request objects :) A fix that should greatly improve this situation is incoming in #44564.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 4, 2019
The way things currently work, we read up to 1M from the channel
and then potentially force all of it into the `ByteBuf` passed
by Netty. Since that `ByteBuf` tends to be `64k` in size by default,
large reads will force the buffer to grow, completely circumventing
the logic of `allocHandle`.

This seems like it could break
`io.netty.channel.RecvByteBufAllocator.Handle#continueReading`
since that method for the fixed-size allocator does check
whether the last read was equal to the attempted read size.
So if we set `64k` because that's what the buffer size is,
then write `1M` to the buffer, we will stop reading on the IO loop,
even though the channel may still have bytes that we can read right away.

More importantly though, this can lead to running OOM quite easily
under IO pressure as we are forcing the heap buffers passed to the read
to `reallocate`.

Closes elastic#49699
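
To make the failure mode above concrete, here is a minimal sketch of the read path the commit message describes, assuming a 1 MiB per-thread copy buffer; the class, field, and method names (CopyReadSketch, IO_BUFFER, doReadBytes) are illustrative, not the actual CopyBytesSocketChannel source:

import io.netty.buffer.ByteBuf;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

class CopyReadSketch {
    // One large direct copy buffer per IO thread (illustrative 1 MiB, as in the commit message).
    private static final int MAX_BYTES_PER_READ = 1 << 20;
    private static final ThreadLocal<ByteBuffer> IO_BUFFER =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(MAX_BYTES_PER_READ));

    static int doReadBytes(SocketChannel channel, ByteBuf byteBuf) throws IOException {
        ByteBuffer ioBuffer = IO_BUFFER.get();
        ioBuffer.clear();
        int bytesRead = channel.read(ioBuffer); // may pull up to 1 MiB off the socket
        if (bytesRead > 0) {
            ioBuffer.flip();
            // If byteBuf is the default 64 KiB allocation, writeBytes() has to grow it
            // (ensureWritable -> PoolArena.reallocate), which is exactly the path shown
            // in the OutOfMemoryError stack trace above.
            byteBuf.writeBytes(ioBuffer);
        }
        return bytesRead;
    }
}
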
original-brownbear added a commit that referenced this issue Dec 4, 2019
* Stop Allocating Buffers in CopyBytesSocketChannel (#49825)


Closes #49699
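
One way to avoid the reallocation described in the commit message is to cap each socket read at the destination buffer's writable space, so the pooled heap `ByteBuf` never has to grow. The sketch below illustrates that idea under that assumption only; it is not the actual change in #49825:

import io.netty.buffer.ByteBuf;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

class CappedReadSketch {
    private static final ThreadLocal<ByteBuffer> IO_BUFFER =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(1 << 20));

    static int doReadBytes(SocketChannel channel, ByteBuf byteBuf) throws IOException {
        ByteBuffer ioBuffer = IO_BUFFER.get();
        ioBuffer.clear();
        // Cap the read at what the destination ByteBuf can take without growing,
        // so writeBytes() below never triggers PoolArena.reallocate().
        ioBuffer.limit(Math.min(ioBuffer.capacity(), byteBuf.writableBytes()));
        int bytesRead = channel.read(ioBuffer);
        if (bytesRead > 0) {
            ioBuffer.flip();
            byteBuf.writeBytes(ioBuffer); // fits by construction, no buffer growth
        }
        return bytesRead;
    }
}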