
When a broker goes OOM it should be easily observable #7807

Closed
deepthidevaki opened this issue Sep 13, 2021 · 10 comments
Assignees
Labels
kind/toil: Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
version:1.3.1: Marks an issue as being completely or in parts released in 1.3.1

Comments

@deepthidevaki
Contributor

When a broker goes OOM in direct memory, it logs the error but continues to work. However, the thread in which the OOM occurred does not seem to make any progress. This eventually causes an OOM on the heap. It would be better if the broker exited when an OOM occurs in direct memory. We have enabled -XX:+ExitOnOutOfMemoryError, but this only works for OOM on the heap.

2021-08-31 17:10:02.351 CEST
Exception in thread "Broker-2-zb-actors-3" java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.camunda.zeebe.util.allocation.DirectBufferAllocator.allocate(DirectBufferAllocator.java:20)
        at io.camunda.zeebe.util.allocation.BufferAllocators.allocateDirect(BufferAllocators.java:16)
        at io.camunda.zeebe.dispatcher.DispatcherBuilder.initAllocatedBuffer(DispatcherBuilder.java:147)
        at io.camunda.zeebe.dispatcher.DispatcherBuilder.build(DispatcherBuilder.java:93)
        at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.openAppender(LogStreamImpl.java:280)
        at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.createWriter(LogStreamImpl.java:208)
        at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.lambda$newLogStreamBatchWriter$1(LogStreamImpl.java:115)
        at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73)
        at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39)
        at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122)
        at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94)
        at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78)
        at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:191)

#7744 (comment)

@Zelldon
Member

Zelldon commented Sep 13, 2021

Similar things were observed here: #6059

@Zelldon
Member

Zelldon commented Sep 13, 2021

@deepthidevaki and I had a closer look at the code, and we think it just kills one of our actor threads here: https://github.com/camunda-cloud/zeebe/blob/develop/util/src/main/java/io/camunda/zeebe/util/sched/ActorThread.java#L93-L106. If we have multiple actor threads, the remaining ones will still continue to work.

We could solve it by catching the error and calling exit(1) 🙈
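
A minimal sketch of what that could look like in an actor thread's run loop (the class and method names below are illustrative only, not the actual ActorThread internals):

public final class FatalAwareActorThread extends Thread {
  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        executeCurrentTask(); // stand-in for the actor's real work loop
      } catch (final VirtualMachineError fatal) {
        // the JVM is in an unrecoverable state; better to die loudly than to
        // silently lose this actor thread while the others keep running
        fatal.printStackTrace();
        System.exit(1);
      } catch (final Exception e) {
        e.printStackTrace(); // non-fatal: log and keep the thread alive
      }
    }
  }

  private void executeCurrentTask() {
    // placeholder for the actual task execution
  }
}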

@Zelldon
Member

Zelldon commented Sep 13, 2021

We could also try -XX:OnError="<cmd args>;<cmd args>" https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html

@npepinpe npepinpe added this to Planned in Zeebe Sep 14, 2021
@npepinpe
Member

npepinpe commented Sep 14, 2021

Proposal:

Instrument every thread (including, if possible, Netty's) to kill the JVM when fatal errors occur. As a starting point, we can use VirtualMachineError (which includes StackOverflowError, OutOfMemoryError, etc.), which the Javadoc describes as:

Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating.

I think that's a good reason to just end things. We can think about also adding LinkageError in the future (for example, Scala's NonFatal treats it as fatal), but I'm not sure this is correct for us.
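
The "instrument every thread" part could boil down to a small shared helper along these lines (purely illustrative, not existing Zeebe code; the names are made up):

public final class FatalErrorHandler {
  private FatalErrorHandler() {}

  // To be called from every thread's top-level catch block or exception handler.
  public static void handleIfFatal(final Throwable error) {
    if (error instanceof VirtualMachineError) {
      error.printStackTrace();
      // halt() skips shutdown hooks, which may themselves fail once the VM is broken
      Runtime.getRuntime().halt(1);
    }
  }
}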

Possibly OnError is already called for any VirtualMachineError, but I couldn't find any documentation specifying this. The documentation just says it is triggered on fatal errors, which to me implies that, but doesn't really confirm it 😅

EDIT: OnError seems to be triggered only by segmentation faults.

@npepinpe
Member

Also regarding -XX:OnError, is this only for HotSpot or most JVMs?

@npepinpe
Member

So -XX:OnError is not triggered by running out of direct memory. See this test:

import java.nio.ByteBuffer;

public class Test {
  public static void main(final String[] args) throws InterruptedException {
    // Allocate more direct memory than -XX:MaxDirectMemorySize allows, so the
    // spawned thread dies with "OutOfMemoryError: Direct buffer memory".
    final var t =
        new Thread(
            () -> {
              final var buf = ByteBuffer.allocateDirect(128 * 1024 * 1024);
            });
    t.start();
    // Keep the JVM alive so we can observe whether -XX:OnError kills the process.
    Thread.sleep(60_000);
  }
}

Ran with: java -XX:OnError="kill -9 %p" -Xms64M -Xmx64M -XX:MaxDirectMemorySize=64M Test.java, which produces:

Exception in thread "Thread-0" java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.camunda.zeebe.broker.transport.externalapi.Test.lambda$main$0(Test.java:10)
        at java.base/java.lang.Thread.run(Thread.java:829)

And then the process just sleeps for a minute; OnError is never triggered.

@npepinpe
Member

npepinpe commented Sep 14, 2021

One quick fix could be to do this in the standalone broker/gateway:

    Thread.setDefaultUncaughtExceptionHandler(
        (thread, error) -> {
          if (error instanceof VirtualMachineError) {
            Loggers.SYSTEM_LOGGER.error(
                "An unexpected fatal error was thrown; exiting now.", error);
            System.exit(1);
          }
        });

I think Atomix is the only place where we overwrite the uncaught exception handler.

@npepinpe
Member

So this won't quite work everywhere since the thread pool executor will swallow exceptions 😅

So we would have to:

  • Instrument the actor threads to catch VirtualMachineError and die
  • Instrument the Atomix ThreadContext implementations to catch VirtualMachineError and die
  • Instrument the Netty ChannelHandler (e.g. BasicClientChannelInitializer and BasicServerChannelInitializer) to customize the exception catching behavior, catch VirtualMachineError, and die.

We should still set the default handler just in case, of course, but the above should cover most places.
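
For the Netty piece, a rough sketch could be a final handler in the pipeline that exits on fatal VM errors and otherwise delegates to the existing behavior (the class name and wiring are assumptions, not the actual channel initializers):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public final class FatalErrorExitHandler extends ChannelInboundHandlerAdapter {
  @Override
  public void exceptionCaught(final ChannelHandlerContext ctx, final Throwable cause) {
    if (cause instanceof VirtualMachineError) {
      cause.printStackTrace();
      Runtime.getRuntime().halt(1); // the JVM is unusable; die immediately
    } else {
      ctx.fireExceptionCaught(cause); // let the existing handlers deal with it
    }
  }
}

It would be registered as the last handler in each channel initializer, e.g. pipeline.addLast(new FatalErrorExitHandler()).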

@npepinpe npepinpe added the Impact: Observability, scope/broker, and kind/toil labels Oct 13, 2021
@npepinpe npepinpe moved this from Planned to Ready in Zeebe Nov 9, 2021
@lenaschoenburg lenaschoenburg self-assigned this Dec 7, 2021
@lenaschoenburg lenaschoenburg moved this from Ready to In progress in Zeebe Dec 7, 2021
@npepinpe npepinpe moved this from In progress to Review in progress in Zeebe Dec 8, 2021
ghost pushed a commit that referenced this issue Jan 6, 2022
8327: Always exit on unrecoverable VM errors r=oleschoenburg a=oleschoenburg

## Description

 * Sets a default uncaught exception handler that shuts down on any `VirtualMachineError`.
 * Shuts down on `VirtualMachineError`s in atomix threads.
 * Shuts down on `VirtualMachineError`s in actor threads.
 

## Related issues


relates to #7807 but does not close it. 



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Jan 6, 2022
8535: [Backport stable/1.2] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8327 to `stable/1.2`.

relates to #7807

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Jan 6, 2022
8519: [Backport stable/1.3] test(atomix): faster `RaftRule` tests r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8501 to `stable/1.3`.

relates to 

8536: [Backport stable/1.3] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8327 to `stable/1.3`.

relates to #7807

8538: [Backport stable/1.3] fix: print correct json input r=Zelldon a=github-actions[bot]

# Description
Backport of #8522 to `stable/1.3`.

relates to #8284

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
@lenaschoenburg
Member

OOMs should be handled correctly in most places now. I've created a new issue for handling Netty's OutOfDirectMemoryError: #8543

Zeebe automation moved this from Review in progress to Done Jan 6, 2022
@npepinpe npepinpe added the version:1.3.1 label Jan 17, 2022