When a broker goes OOM it should be easily observable #7807
Comments
Similar behavior was observed in #6059.
@deepthidevaki and I had a closer look at the code, and we think it just kills one of our actor threads here: https://github.com/camunda-cloud/zeebe/blob/develop/util/src/main/java/io/camunda/zeebe/util/sched/ActorThread.java#L93-L106. If we have multiple actor threads, the others will continue to work. We could solve this by catching the error and calling exit(1) 🙈
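A minimal sketch of the "catch it and exit" idea, assuming a simplified worker thread. `WorkerThread`, its job queue, and the injected `exit` hook are hypothetical names used for illustration and are not Zeebe's actual `ActorThread`. Ordinary exceptions are reported and the loop continues, while a `VirtualMachineError` such as `OutOfMemoryError` leaves the loop and triggers the exit hook.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.IntConsumer;

/** Hypothetical stand-in for an actor thread; not Zeebe's real ActorThread. */
class WorkerThread extends Thread {
  private final BlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();
  // Injected so that production code can pass System::exit and tests can pass a recorder.
  private final IntConsumer exit;

  WorkerThread(final IntConsumer exit) {
    this.exit = exit;
  }

  void submit(final Runnable job) {
    jobs.add(job);
  }

  @Override
  public void run() {
    try {
      while (!isInterrupted()) {
        final Runnable job = jobs.take();
        try {
          job.run();
        } catch (final Exception e) {
          // Recoverable failure of a single job: report it and keep the thread alive.
          System.err.println("Job failed: " + e);
        }
      }
    } catch (final InterruptedException e) {
      // Interrupted while waiting for work: treat as a normal shutdown request.
      Thread.currentThread().interrupt();
    } catch (final VirtualMachineError e) {
      // Fatal error: stop the whole process instead of silently losing one thread.
      System.err.println("Fatal error in " + getName() + ": " + e);
      exit.accept(1);
    }
  }
}
```

In a real process the hook would likely be `System::exit`, so that losing one worker thread to an OOM takes down the whole JVM rather than leaving the remaining threads running.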
We could also try |
Proposal: Instrument every thread (including, if possible, Netty's) to kill the JVM when fatal errors occur. As a starting point, we can use the following
I think that's a good reason to just end things. We can think about adding also
EDIT: |
Also regarding |
So

    import java.nio.ByteBuffer;

    public class Test {
      public static void main(final String[] args) throws InterruptedException {
        final var t =
            new Thread(
                () -> {
                  final var buf = ByteBuffer.allocateDirect(128 * 1024 * 1024);
                });
        t.start();
        Thread.sleep(60_000);
      }
    }

Ran with:

    Exception in thread "Thread-0" java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.camunda.zeebe.broker.transport.externalapi.Test.lambda$main$0(Test.java:10)
        at java.base/java.lang.Thread.run(Thread.java:829)

And then sleeps for a minute.
One quick fix could be to do this in the standalone broker/gateway:

    Thread.setDefaultUncaughtExceptionHandler(
        (thread, error) -> {
          if (error instanceof VirtualMachineError) {
            Loggers.SYSTEM_LOGGER.error(
                "An unexpected fatal error was thrown; exiting now.", error);
            System.exit(1);
          }
        });

I think only in Atomix do we overwrite the uncaught exception handler.
So this won't quite work everywhere since the thread pool executor will swallow exceptions 😅 So we would have to:
We should still set the default one just in case, of course, but that should cover most places. |
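To illustrate why the default handler alone does not cover thread pools: a task passed to `submit()` is wrapped in a `Future` that captures any `Throwable`, so the error never reaches an uncaught exception handler. One possible way to surface it, sketched here under assumptions, is to override `ThreadPoolExecutor.afterExecute`. The class name `FatalErrorAwareExecutor` and the `onFatalError` callback are hypothetical and are not taken from the Zeebe or Atomix code base.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical executor that reports fatal VM errors instead of swallowing them. */
class FatalErrorAwareExecutor extends ThreadPoolExecutor {
  // Injected so that production code can exit the JVM and tests can record the error.
  private final Consumer<VirtualMachineError> onFatalError;

  FatalErrorAwareExecutor(final int threads, final Consumer<VirtualMachineError> onFatalError) {
    super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    this.onFatalError = onFatalError;
  }

  @Override
  protected void afterExecute(final Runnable task, final Throwable thrown) {
    super.afterExecute(task, thrown);
    Throwable failure = thrown;
    // For execute(), the throwable arrives directly as `thrown`. For submit(), it is
    // null here and the failure has to be unwrapped from the completed future.
    if (failure == null && task instanceof Future<?> && ((Future<?>) task).isDone()) {
      try {
        ((Future<?>) task).get();
      } catch (final ExecutionException e) {
        failure = e.getCause();
      } catch (final CancellationException e) {
        // A cancelled task is not a failure.
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    if (failure instanceof VirtualMachineError) {
      onFatalError.accept((VirtualMachineError) failure);
    }
  }
}
```

A caller could construct it as `new FatalErrorAwareExecutor(4, error -> System.exit(1))`, mirroring the `System.exit(1)` used in the default handler above. Scheduled executors and third-party pools such as Netty's would need their own treatment.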
8327: Always exit on unrecoverable VM errors r=oleschoenburg a=oleschoenburg

## Description

* Sets a default uncaught exception handler that shuts down on any `VirtualMachineError`.
* Shuts down on `VirtualMachineError`s in atomix threads.
* Shuts down on `VirtualMachineError`s in actor threads.

## Related issues

relates to #7807 but does not close it.

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
8519: [Backport stable/1.3] test(atomix): faster `RaftRule` tests r=oleschoenburg a=github-actions[bot]

# Description

Backport of #8501 to `stable/1.3`.

relates to

8536: [Backport stable/1.3] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot]

# Description

Backport of #8327 to `stable/1.3`.

relates to #7807

8538: [Backport stable/1.3] fix: print correct json input r=Zelldon a=github-actions[bot]

# Description

Backport of #8522 to `stable/1.3`.

relates to #8284

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
OOMs should be handled correctly in most places now. I've created a new issue for handling Netty's |
When a broker goes OOM in direct memory, it logs the error but continues to work. However, the thread in which the OOM occurred does not appear to make any progress. This eventually causes an OOM on the heap. It would be better if the broker exited when an OOM occurs in direct memory. We have enabled `-XX:+ExitOnOutOfMemoryError`, but this only takes effect for an OOM on the heap. #7744 (comment)