
Elastic nodes go down as GC kicks in #27872

Closed
dhruvr opened this issue Dec 18, 2017 · 2 comments

dhruvr commented Dec 18, 2017

ES Version: 5.4.1
OS: Ubuntu
Java: 1.8

Issue

We recently upgraded our systems from ES 2.3 to ES 5.4.1. Everything was going well until we deployed it to production, where ES nodes started going down randomly for periods of time. On investigation, we saw the following logs in ES:

[2017-12-15T20:33:28,459][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][1723] overhead, spent [252ms] collecting in the last [1s]
[2017-12-15T20:42:53,599][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][2288] overhead, spent [288ms] collecting in the last [1s]
[2017-12-15T20:53:55,846][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][2950] overhead, spent [313ms] collecting in the last [1s]
[2017-12-15T21:03:18,145][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][3512] overhead, spent [325ms] collecting in the last [1s]
[2017-12-15T21:45:45,815][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6059] overhead, spent [340ms] collecting in the last [1s]
[2017-12-15T21:46:39,834][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6113] overhead, spent [445ms] collecting in the last [1s]
[2017-12-15T21:46:59,895][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6133] overhead, spent [281ms] collecting in the last [1s]
[2017-12-15T21:47:27,901][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6161] overhead, spent [306ms] collecting in the last [1s]
[2017-12-15T21:51:56,093][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6429] overhead, spent [297ms] collecting in the last [1.1s]
[2017-12-15T21:51:57,094][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6430] overhead, spent [353ms] collecting in the last [1s]
[2017-12-15T21:57:56,361][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6789] overhead, spent [398ms] collecting in the last [1.1s]
[2017-12-15T21:58:23,407][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6816] overhead, spent [284ms] collecting in the last [1s]
[2017-12-15T21:58:24,444][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6817] overhead, spent [341ms] collecting in the last [1s]
[2017-12-15T22:01:29,536][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7002] overhead, spent [341ms] collecting in the last [1s]
[2017-12-15T22:01:31,663][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7004] overhead, spent [399ms] collecting in the last [1.1s]
[2017-12-15T22:01:35,664][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7008] overhead, spent [273ms] collecting in the last [1s]
[2017-12-15T22:03:08,779][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7101] overhead, spent [281ms] collecting in the last [1s]
[2017-12-15T22:05:54,928][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7267] overhead, spent [310ms] collecting in the last [1.1s]
[2017-12-15T22:05:55,928][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7268] overhead, spent [305ms] collecting in the last [1s]
[2017-12-15T22:05:58,929][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7271] overhead, spent [287ms] collecting in the last [1s]
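
These INFO lines report the fraction of each wall-clock window spent collecting garbage; for example, [252ms] out of [1s] means roughly 25% of that second went to GC. Here is a minimal sketch for summing the reported figures across a log file (the log path is a placeholder, and the line format is assumed to match the snippet above):

import re

# Matches the JvmGcMonitorService overhead lines quoted above, e.g.
# "overhead, spent [252ms] collecting in the last [1s]".
PATTERN = re.compile(
    r"spent \[(\d+(?:\.\d+)?)(ms|s)\] collecting in the last \[(\d+(?:\.\d+)?)(ms|s)\]"
)

def to_seconds(value: str, unit: str) -> float:
    return float(value) / 1000.0 if unit == "ms" else float(value)

gc_time = window_time = 0.0
with open("elasticsearch.log") as log:  # placeholder path, not from the issue
    for line in log:
        match = PATTERN.search(line)
        if match:
            gc_time += to_seconds(match.group(1), match.group(2))
            window_time += to_seconds(match.group(3), match.group(4))

if window_time:
    print(f"GC overhead across the logged windows: {100 * gc_time / window_time:.1f}%")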

Earlier, ES 2.3 used to work with an 8 GB heap on a system with 32 GB of RAM. We increased the heap from 8 GB to 12 GB, and then increased the system RAM to 64 GB and gave Elasticsearch a 32 GB heap, but the ES nodes keep going down whenever there is a heavy query. Please note that everything was working fine before the upgrade.
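
For reference, in 5.x the heap size is set in config/jvm.options; the 12 GB configuration described above would look like the following (values taken from this report, not a recommendation):

-Xms12g
-Xmx12g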

For the 12 GB heap, the JVM node stats are as follows:

  "jvm" : {
        "timestamp" : 1513422077508,
        "uptime_in_millis" : 16750330,
        "mem" : {
          "heap_used_in_bytes" : 9402202880,
          "heap_used_percent" : 73,
          "heap_committed_in_bytes" : 12815171584,
          "heap_max_in_bytes" : 12815171584,
          "non_heap_used_in_bytes" : 97693480,
          "non_heap_committed_in_bytes" : 103424000,
          "pools" : {
            "young" : {
              "used_in_bytes" : 441150480,
              "max_in_bytes" : 558432256,
              "peak_used_in_bytes" : 558432256,
              "peak_max_in_bytes" : 558432256
            },
            "survivor" : {
              "used_in_bytes" : 42550952,
              "max_in_bytes" : 69730304,
              "peak_used_in_bytes" : 69730304,
              "peak_max_in_bytes" : 69730304
            },
            "old" : {
              "used_in_bytes" : 8918501448,
              "max_in_bytes" : 12187009024,
              "peak_used_in_bytes" : 8918501448,
              "peak_max_in_bytes" : 12187009024
            }
          }
        },
        "threads" : {
          "count" : 71,
          "peak_count" : 95
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 104,
              "collection_time_in_millis" : 7078
            },
            "old" : {
              "collection_count" : 1,
              "collection_time_in_millis" : 170
            }
          }
        },
        "buffer_pools" : {
          "direct" : {
            "count" : 67,
            "used_in_bytes" : 847203522,
            "total_capacity_in_bytes" : 847203521
          },
          "mapped" : {
            "count" : 1204,
            "used_in_bytes" : 16787345787,
            "total_capacity_in_bytes" : 16787345787
          }
        },
        "classes" : {
          "current_loaded_count" : 11025,
          "total_loaded_count" : 11025,
          "total_unloaded_count" : 0
        }
      }
    }
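
In the stats above the heap is already at 73%, almost all of it in the old generation pool (about 8.9 GB used of a 12.2 GB maximum), so a single heavy query can tip the node into constant collection. Here is a minimal sketch for polling this via the nodes stats API (assuming a node reachable at http://localhost:9200 and the Python requests package):

import requests

# Placeholder endpoint; adjust to your cluster.
resp = requests.get("http://localhost:9200/_nodes/stats/jvm", timeout=10)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    old = jvm["mem"]["pools"]["old"]
    old_pct = 100 * old["used_in_bytes"] / old["max_in_bytes"]
    # Flag nodes whose old generation is nearly full, as in the stats above.
    print(f"{node.get('name', node_id)}: heap {heap_pct}%, old gen {old_pct:.0f}%")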

But up to now we have not been able to find any concrete solution to the problem.

@jasontedor (Member)

If this is the case, there should be a note in the logs about a fatal error whose cause is an out-of-memory error; the GC logs are only a warning of this impending doom, and the node shuts itself down when any out-of-memory error is thrown. This is new behavior in 5.x: you could have been seeing out-of-memory errors in 2.x too, but the logging and the termination behavior are both new in 5.x.

We do not treat this as a bug without sufficient evidence. You need to look at heap dumps and find the queries that are causing problems. From there we can help triage this as:

  • pushing the system too far
  • known issues with bucket explosions in aggregations, for which we have taken or are taking steps to address
  • some other memory issue

To do this, I would ask that you discuss this on the forum. We reserve GitHub for verified bug reports and feature requests.
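
As a starting point for identifying the queries involved, here is a minimal sketch that lists in-flight search tasks and their descriptions via the tasks API (assuming a node reachable at http://localhost:9200 and the Python requests package):

import requests

# Placeholder endpoint; adjust to your cluster.
resp = requests.get(
    "http://localhost:9200/_tasks",
    params={"actions": "*search", "detailed": "true"},
    timeout=10,
)
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    for task_id, task in node["tasks"].items():
        runtime_s = task["running_time_in_nanos"] / 1e9
        # "description" holds the request body when detailed=true.
        print(f"{task_id}: {runtime_s:.1f}s  {task.get('description', '')[:200]}")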

@rana-prashant-208

Hi, it's very late, but we are facing the same issue.
We are getting the following error message in the ES logs.

ES version 5.6.4.

With ES 2.2.0 everything was working fine. The bulk index queue is full and the index requests are spread across many indices.

cause [auto(bulk api)], templates [logs_indices_template, logs_ela_indices_template], shards [1]/[0], mappings [_default_]
[2019-07-05T13:25:36,221][WARN ][o.e.m.j.JvmGcMonitorService] [node] [gc][147] overhead, spent [3.3s] collecting in the last [3.4s]
[2019-07-05T13:25:37,565][WARN ][o.e.m.j.JvmGcMonitorService] [node] [gc][148] overhead, spent [1.2s] collecting in the last [1.3s]
[2019-07-05T13:25:47,781][WARN ][o.e.m.j.JvmGcMonitorService] [node] [gc][149] overhead, spent [8.6s] collecting in the last [10.2s]
[2019-07-05T13:25:47,782][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
    at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1273)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at java.lang.Thread.run(Thread.java:745)
[2019-07-05T13:25:47,781][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node] fatal error in thread [elasticsearch[node][refresh][T#6]], exiting
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$1340/300859499.get$Lambda(Unknown Source) ~[?:?]
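
Since the bulk queue is full and the index requests fan out across many indices, a quick check of the bulk thread pool queue and rejection counts can show whether indexing pressure, rather than a single query, is filling the heap. Here is a minimal sketch using the _cat thread pool API (assuming a node reachable at http://localhost:9200 and the Python requests package):

import requests

# Placeholder endpoint; adjust to your cluster.
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/bulk",
    params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
    timeout=10,
)
resp.raise_for_status()
print(resp.text)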
