
When a broker goes OOM it should be easily observable #7807

Closed
deepthidevaki opened this issue Sep 13, 2021 · 10 comments
Assignees
Labels
kind/toil: Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
version:1.3.1: Marks an issue as being completely or in parts released in 1.3.1

Comments

@deepthidevaki
Contributor

When a broker goes OOM in direct memory, it logs the error but continues to work. However, the thread in which the OOM occurred does not seem to make any progress. This eventually causes an OOM on the heap. It would be better if the broker exited when an OOM occurs in direct memory. We have enabled -XX:+ExitOnOutOfMemoryError, but this only works for OOM on the heap.

2021-08-31 17:10:02.351 CEST
Exception in thread "Broker-2-zb-actors-3" java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.camunda.zeebe.util.allocation.DirectBufferAllocator.allocate(DirectBufferAllocator.java:20)
        at io.camunda.zeebe.util.allocation.BufferAllocators.allocateDirect(BufferAllocators.java:16)
        at io.camunda.zeebe.dispatcher.DispatcherBuilder.initAllocatedBuffer(DispatcherBuilder.java:147)
        at io.camunda.zeebe.dispatcher.DispatcherBuilder.build(DispatcherBuilder.java:93)
        at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.openAppender(LogStreamImpl.java:280)
        at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.createWriter(LogStreamImpl.java:208)
        at io.camunda.zeebe.logstreams.impl.log.LogStreamImpl.lambda$newLogStreamBatchWriter$1(LogStreamImpl.java:115)
        at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73)
        at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39)
        at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122)
        at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94)
        at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78)
        at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:191)

#7744 (comment)

@Zelldon
Member

Zelldon commented Sep 13, 2021

Similar things were observed here: #6059

@Zelldon
Member

Zelldon commented Sep 13, 2021

@deepthidevaki and I had a closer look at the code, and we think it just kills one of our actor threads here: https://github.com/camunda-cloud/zeebe/blob/develop/util/src/main/java/io/camunda/zeebe/util/sched/ActorThread.java#L93-L106. If we have multiple actor threads, the remaining ones will still continue to work.

We could solve it by catching the error and calling exit(1) 🙈
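
A minimal sketch of what that could look like in an actor thread's run loop (the class and method names below are illustrative only, not the actual ActorThread internals):

public final class FatalAwareActorThread extends Thread {
  @Override
  public void run() {
    while (!isInterrupted()) {
      try {
        executeCurrentTask(); // stand-in for the actor's real work loop
      } catch (final VirtualMachineError fatal) {
        // the JVM is in an unrecoverable state; better to die loudly than to
        // silently lose this actor thread while the others keep running
        fatal.printStackTrace();
        System.exit(1);
      } catch (final Exception e) {
        e.printStackTrace(); // non-fatal: log and keep the thread alive
      }
    }
  }

  private void executeCurrentTask() {
    // placeholder for the actual task execution
  }
}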

@Zelldon
Member

Zelldon commented Sep 13, 2021

We could also try -XX:OnError="<cmd args>;<cmd args>" https://www.oracle.com/java/technologies/javase/vmoptions-jsp.html

@npepinpe npepinpe added this to Planned in Zeebe Sep 14, 2021
@npepinpe
Member

npepinpe commented Sep 14, 2021

Proposal:

Instrument every thread (including, if possible, Netty's) to kill the JVM when fatal errors occur. As a starting point, we can use VirtualMachineError (which includes StackOverflowError, OutOfMemoryError, etc.), which the Javadoc describes as:

Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating.

I think that's a good reason to just end things. We can think about also adding LinkageError in the future (for example, Scala's NonFatal treats it as fatal), but I'm not sure this is correct for us.
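
The "instrument every thread" part could boil down to a small shared helper along these lines (purely illustrative, not existing Zeebe code; the names are made up):

public final class FatalErrorHandler {
  private FatalErrorHandler() {}

  // To be called from every thread's top-level catch block or exception handler.
  public static void handleIfFatal(final Throwable error) {
    if (error instanceof VirtualMachineError) {
      error.printStackTrace();
      // halt() skips shutdown hooks, which may themselves fail once the VM is broken
      Runtime.getRuntime().halt(1);
    }
  }
}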

Possibly OnError is already called for any VirtualMachineError, but I couldn't find any documentation specifying this. The documentation just says it is triggered on fatal errors, which to me implies that, but doesn't really confirm it 😅

EDIT: OnError seems to be triggered only by segmentation faults.

@npepinpe
Member

Also regarding -XX:OnError, is this only for HotSpot or most JVMs?

@npepinpe
Member

So -XX:OnError is not triggered by running out of direct memory. See this test:

import java.nio.ByteBuffer;

public class Test {
  public static void main(final String[] args) throws InterruptedException {
    // Allocate more direct memory than -XX:MaxDirectMemorySize allows, so the
    // spawned thread dies with "OutOfMemoryError: Direct buffer memory".
    final var t =
        new Thread(
            () -> {
              final var buf = ByteBuffer.allocateDirect(128 * 1024 * 1024);
            });
    t.start();
    // Keep the JVM alive so we can observe whether -XX:OnError kills the process.
    Thread.sleep(60_000);
  }
}

Ran with: java -XX:OnError="kill -9 %p" -Xms64M -Xmx64M -XX:MaxDirectMemorySize=64M Test.java, which produces:

Exception in thread "Thread-0" java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.camunda.zeebe.broker.transport.externalapi.Test.lambda$main$0(Test.java:10)
        at java.base/java.lang.Thread.run(Thread.java:829)

And then the process just sleeps for a minute; OnError is never triggered.

@npepinpe
Member

npepinpe commented Sep 14, 2021

One quick fix could be to do this in the standalone broker/gateway:

    Thread.setDefaultUncaughtExceptionHandler(
        (thread, error) -> {
          if (error instanceof VirtualMachineError) {
            Loggers.SYSTEM_LOGGER.error(
                "An unexpected fatal error was thrown; exiting now.", error);
            System.exit(1);
          }
        });

I think Atomix is the only place where we overwrite the uncaught exception handler.

@npepinpe
Member

So this won't quite work everywhere since the thread pool executor will swallow exceptions 😅

So we would have to:

  • Instrument the actor threads to catch VirtualMachineError and die
  • Instrument the Atomix ThreadContext implementations to catch VirtualMachineError and die
  • Instrument the Netty ChannelHandler (e.g. BasicClientChannelInitializer and BasicServerChannelInitializer) to customize the exception catching behavior, catch VirtualMachineError, and die.

We should still set the default handler just in case, of course, but the above should cover most places.
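
For the Netty piece, a rough sketch could be a final handler in the pipeline that exits on fatal VM errors and otherwise delegates to the existing behavior (the class name and wiring are assumptions, not the actual channel initializers):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public final class FatalErrorExitHandler extends ChannelInboundHandlerAdapter {
  @Override
  public void exceptionCaught(final ChannelHandlerContext ctx, final Throwable cause) {
    if (cause instanceof VirtualMachineError) {
      cause.printStackTrace();
      Runtime.getRuntime().halt(1); // the JVM is unusable; die immediately
    } else {
      ctx.fireExceptionCaught(cause); // let the existing handlers deal with it
    }
  }
}

It would be registered as the last handler in each channel initializer, e.g. pipeline.addLast(new FatalErrorExitHandler()).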

@npepinpe npepinpe added the Impact: Observability, scope/broker, and kind/toil labels Oct 13, 2021
@npepinpe npepinpe moved this from Planned to Ready in Zeebe Nov 9, 2021
@lenaschoenburg lenaschoenburg self-assigned this Dec 7, 2021
@lenaschoenburg lenaschoenburg moved this from Ready to In progress in Zeebe Dec 7, 2021
@npepinpe npepinpe moved this from In progress to Review in progress in Zeebe Dec 8, 2021
ghost pushed a commit that referenced this issue Jan 6, 2022
8327: Always exit on unrecoverable VM errors r=oleschoenburg a=oleschoenburg

## Description

 * Sets a default uncaught exception handler that shuts down on any `VirtualMachineError`.
 * Shuts down on `VirtualMachineError`s in atomix threads.
 * Shuts down on `VirtualMachineError`s in actor threads.
 

## Related issues


relates to #7807 but does not close it. 



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Jan 6, 2022
8535: [Backport stable/1.2] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8327 to `stable/1.2`.

relates to #7807

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this issue Jan 6, 2022
8519: [Backport stable/1.3] test(atomix): faster `RaftRule` tests r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8501 to `stable/1.3`.

relates to 

8536: [Backport stable/1.3] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8327 to `stable/1.3`.

relates to #7807

8538: [Backport stable/1.3] fix: print correct json input r=Zelldon a=github-actions[bot]

# Description
Backport of #8522 to `stable/1.3`.

relates to #8284

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
@lenaschoenburg
Member

OOMs should be handled correctly in most places now. I've created a new issue for handling Netty's OutOfDirectMemoryError: #8543

Zeebe automation moved this from Review in progress to Done Jan 6, 2022
@npepinpe npepinpe added the version:1.3.1 label Jan 17, 2022