
Elastic nodes go down as GC kicks in #27872

Closed
dhruvr opened this issue Dec 18, 2017 · 2 comments

dhruvr commented Dec 18, 2017

ES Version: 5.4.1
OS: Ubuntu
Java: 1.8

Issue

We recently upgraded our systems from ES 2.3 to ES 5.4.1. Everything was going well until we deployed it to production, where ES nodes started going down randomly for periods of time. On investigation, we saw the following logs in ES:

[2017-12-15T20:33:28,459][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][1723] overhead, spent [252ms] collecting in the last [1s]
[2017-12-15T20:42:53,599][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][2288] overhead, spent [288ms] collecting in the last [1s]
[2017-12-15T20:53:55,846][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][2950] overhead, spent [313ms] collecting in the last [1s]
[2017-12-15T21:03:18,145][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][3512] overhead, spent [325ms] collecting in the last [1s]
[2017-12-15T21:45:45,815][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6059] overhead, spent [340ms] collecting in the last [1s]
[2017-12-15T21:46:39,834][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6113] overhead, spent [445ms] collecting in the last [1s]
[2017-12-15T21:46:59,895][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6133] overhead, spent [281ms] collecting in the last [1s]
[2017-12-15T21:47:27,901][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6161] overhead, spent [306ms] collecting in the last [1s]
[2017-12-15T21:51:56,093][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6429] overhead, spent [297ms] collecting in the last [1.1s]
[2017-12-15T21:51:57,094][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6430] overhead, spent [353ms] collecting in the last [1s]
[2017-12-15T21:57:56,361][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6789] overhead, spent [398ms] collecting in the last [1.1s]
[2017-12-15T21:58:23,407][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6816] overhead, spent [284ms] collecting in the last [1s]
[2017-12-15T21:58:24,444][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6817] overhead, spent [341ms] collecting in the last [1s]
[2017-12-15T22:01:29,536][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7002] overhead, spent [341ms] collecting in the last [1s]
[2017-12-15T22:01:31,663][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7004] overhead, spent [399ms] collecting in the last [1.1s]
[2017-12-15T22:01:35,664][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7008] overhead, spent [273ms] collecting in the last [1s]
[2017-12-15T22:03:08,779][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7101] overhead, spent [281ms] collecting in the last [1s]
[2017-12-15T22:05:54,928][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7267] overhead, spent [310ms] collecting in the last [1.1s]
[2017-12-15T22:05:55,928][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7268] overhead, spent [305ms] collecting in the last [1s]
[2017-12-15T22:05:58,929][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7271] overhead, spent [287ms] collecting in the last [1s]
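
These INFO lines report the fraction of each wall-clock window spent collecting garbage; for example, [252ms] out of [1s] means roughly 25% of that second went to GC. Here is a minimal sketch for summing the reported figures across a log file (the log path is a placeholder, and the line format is assumed to match the snippet above):

import re

# Matches the JvmGcMonitorService overhead lines quoted above, e.g.
# "overhead, spent [252ms] collecting in the last [1s]".
PATTERN = re.compile(
    r"spent \[(\d+(?:\.\d+)?)(ms|s)\] collecting in the last \[(\d+(?:\.\d+)?)(ms|s)\]"
)

def to_seconds(value: str, unit: str) -> float:
    return float(value) / 1000.0 if unit == "ms" else float(value)

gc_time = window_time = 0.0
with open("elasticsearch.log") as log:  # placeholder path, not from the issue
    for line in log:
        match = PATTERN.search(line)
        if match:
            gc_time += to_seconds(match.group(1), match.group(2))
            window_time += to_seconds(match.group(3), match.group(4))

if window_time:
    print(f"GC overhead across the logged windows: {100 * gc_time / window_time:.1f}%")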

Earlier, ES 2.3 used to work with an 8 GB heap on a system with 32 GB of RAM. We increased the heap from 8 GB to 12 GB, and then increased the system RAM to 64 GB and gave Elasticsearch a 32 GB heap, but the ES nodes keep going down whenever there is a heavy query. Please note that everything was working fine before the upgrade.
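
For reference, in 5.x the heap size is set in config/jvm.options; the 12 GB configuration described above would look like the following (values taken from this report, not a recommendation):

-Xms12g
-Xmx12g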

For the 12 GB heap, the JVM node stats are as follows:

  "jvm" : {
        "timestamp" : 1513422077508,
        "uptime_in_millis" : 16750330,
        "mem" : {
          "heap_used_in_bytes" : 9402202880,
          "heap_used_percent" : 73,
          "heap_committed_in_bytes" : 12815171584,
          "heap_max_in_bytes" : 12815171584,
          "non_heap_used_in_bytes" : 97693480,
          "non_heap_committed_in_bytes" : 103424000,
          "pools" : {
            "young" : {
              "used_in_bytes" : 441150480,
              "max_in_bytes" : 558432256,
              "peak_used_in_bytes" : 558432256,
              "peak_max_in_bytes" : 558432256
            },
            "survivor" : {
              "used_in_bytes" : 42550952,
              "max_in_bytes" : 69730304,
              "peak_used_in_bytes" : 69730304,
              "peak_max_in_bytes" : 69730304
            },
            "old" : {
              "used_in_bytes" : 8918501448,
              "max_in_bytes" : 12187009024,
              "peak_used_in_bytes" : 8918501448,
              "peak_max_in_bytes" : 12187009024
            }
          }
        },
        "threads" : {
          "count" : 71,
          "peak_count" : 95
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 104,
              "collection_time_in_millis" : 7078
            },
            "old" : {
              "collection_count" : 1,
              "collection_time_in_millis" : 170
            }
          }
        },
        "buffer_pools" : {
          "direct" : {
            "count" : 67,
            "used_in_bytes" : 847203522,
            "total_capacity_in_bytes" : 847203521
          },
          "mapped" : {
            "count" : 1204,
            "used_in_bytes" : 16787345787,
            "total_capacity_in_bytes" : 16787345787
          }
        },
        "classes" : {
          "current_loaded_count" : 11025,
          "total_loaded_count" : 11025,
          "total_unloaded_count" : 0
        }
      }
    }
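
In the stats above the heap is already at 73%, almost all of it in the old generation pool (about 8.9 GB used of a 12.2 GB maximum), so a single heavy query can tip the node into constant collection. Here is a minimal sketch for polling this via the nodes stats API (assuming a node reachable at http://localhost:9200 and the Python requests package):

import requests

# Placeholder endpoint; adjust to your cluster.
resp = requests.get("http://localhost:9200/_nodes/stats/jvm", timeout=10)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    old = jvm["mem"]["pools"]["old"]
    old_pct = 100 * old["used_in_bytes"] / old["max_in_bytes"]
    # Flag nodes whose old generation is nearly full, as in the stats above.
    print(f"{node.get('name', node_id)}: heap {heap_pct}%, old gen {old_pct:.0f}%")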

But up to now we have not been able to find any concrete solution to the problem.

@jasontedor (Member)

If this is the case, there should be a note in the logs about a fatal error whose cause is an out-of-memory error; the GC logs are only a warning of this impending doom, and the node shuts itself down when any out-of-memory error is thrown. This is new behavior in 5.x: you could have been seeing out-of-memory errors in 2.x too, but the logging and the termination behavior are both new in 5.x.

We do not treat this as a bug without sufficient evidence. You need to look at heap dumps and find the queries that are causing problems. From there we can help triage this as:

  • pushing the system too far
  • known issues with bucket explosions in aggregations, for which we have taken or are taking steps to address
  • some other memory issue

To do this, I would ask that you discuss this on the forum. We reserve GitHub for verified bug reports and feature requests.
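
As a starting point for identifying the queries involved, here is a minimal sketch that lists in-flight search tasks and their descriptions via the tasks API (assuming a node reachable at http://localhost:9200 and the Python requests package):

import requests

# Placeholder endpoint; adjust to your cluster.
resp = requests.get(
    "http://localhost:9200/_tasks",
    params={"actions": "*search", "detailed": "true"},
    timeout=10,
)
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    for task_id, task in node["tasks"].items():
        runtime_s = task["running_time_in_nanos"] / 1e9
        # "description" holds the request body when detailed=true.
        print(f"{task_id}: {runtime_s:.1f}s  {task.get('description', '')[:200]}")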

@rana-prashant-208

Hi, it's very late, but we are facing the same issue.
We are getting the following error message in the ES logs.

ES version 5.6.4.

With ES 2.2.0 everything was working fine. The bulk index queue is full and the index requests are spread across many indices.

cause [auto(bulk api)], templates [logs_indices_template, logs_ela_indices_template], shards [1]/[0], mappings [_default_]
[2019-07-05T13:25:36,221][WARN ][o.e.m.j.JvmGcMonitorService] [node] [gc][147] overhead, spent [3.3s] collecting in the last [3.4s]
[2019-07-05T13:25:37,565][WARN ][o.e.m.j.JvmGcMonitorService] [node] [gc][148] overhead, spent [1.2s] collecting in the last [1.3s]
[2019-07-05T13:25:47,781][WARN ][o.e.m.j.JvmGcMonitorService] [node] [gc][149] overhead, spent [8.6s] collecting in the last [10.2s]
[2019-07-05T13:25:47,782][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
    at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:185)
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:83)
    at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
    at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1273)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1084)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at java.lang.Thread.run(Thread.java:745)
[2019-07-05T13:25:47,781][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node] fatal error in thread [elasticsearch[node][refresh][T#6]], exiting
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.elasticsearch.common.util.concurrent.ThreadContext$$Lambda$1340/300859499.get$Lambda(Unknown Source) ~[?:?]
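
Since the bulk queue is full and the index requests fan out across many indices, a quick check of the bulk thread pool queue and rejection counts can show whether indexing pressure, rather than a single query, is filling the heap. Here is a minimal sketch using the _cat thread pool API (assuming a node reachable at http://localhost:9200 and the Python requests package):

import requests

# Placeholder endpoint; adjust to your cluster.
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/bulk",
    params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
    timeout=10,
)
resp.raise_for_status()
print(resp.text)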
