-
Notifications
You must be signed in to change notification settings - Fork 565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always exit on unrecoverable VM errors #8327
Conversation
The CI fails due to a timeout:
which is a bit suspicious but might just be i temporary issue in our CI environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
- ❌ Generally I think we can have a better error message here, one which explains why this is a fatal error and why we are exiting early, as I can imagine we might forget the "why" in the future, or it might be surprising for new developers/operators later on.
- ❌ Is it possible to test this at all in any way?
- 💭 Would it make sense to reuse this as well in the Netty and Actor code? Just delegate to this exception handler for fatal errors? e.g.
public final class FatalErrorHandler implements UncaughExceptionHandler {
private static final Logger DEFAULT_LOGGER = Loggers.SYSTEM_LOGGER;
private final Logger logger;
public FatalErrorHandler() {
this(DEFAULT_LOGGER);
}
public FatalErrorHandler(final Logger logger) {
this.logger = logger;
}
public void exitFatalErrorOrRethrow(final Throwable error) {
exitOnFatalError(error);
LangUtil.rethrowUnchecked(error);
}
public void exitOnFatalError(final Throwable error) {
if (!isFatalError(error)) {
return;
}
logger.error("An unexpected fatal error was thrown; exiting now.", e);
System.exit(1);
}
@VisibleForTesting // ? maybe?
boolean isFatalError(final Throwable error) {
return error instance VirtualMachineError;
}
@Override
public void uncaughtException(final Thread t, final Throwable e) {
exitOnFatalErrorOrRethrow(e);
}
}
Then, say, in ActorThread
:
try {
resubmit = currentTask.execute(this);
} catch (final Exception e) {
// TODO: check interrupt state?
// TODO: Handle Exception
LOG.error("Unexpected error occurred in task {}", currentTask, e);
// TODO: resubmit on exception?
// resubmit = true;
} catch (Throwable error) {
fatalErrorHandler.exitFatalErrorOrRethrow(error));
} finally {
MDC.remove("actor-name");
clock.update();
}
if (resubmit) {
currentTask.resubmit();
}
}
And in Netty:
@Override
public void exceptionCaught(final ChannelHandlerContext ctx, final Throwable cause)
throws Exception {
fatalErrorHandler.exitOnFatalError(cause);
future.completeExceptionally(cause);
super.exceptionCaught(ctx, cause);
}
Just a thought. I'm still not 100% sure if that's the right approach.
@@ -93,6 +93,9 @@ private void executeCurrentTask() { | |||
|
|||
try { | |||
resubmit = currentTask.execute(this); | |||
} catch (final VirtualMachineError e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ If we set the default thread exception handler, and we don't handle VirtualMachineError
here, would the default exception handler just handle it then? So maybe here it's not necessary?
import io.camunda.zeebe.broker.Loggers; | ||
import java.lang.Thread.UncaughtExceptionHandler; | ||
|
||
public final class ExitingUncaughtExceptionHandler implements UncaughtExceptionHandler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙃 FatalErrorExceptionHandler
? I guess we will use this to handle what we consider fatal errors, at which point the behavior would be to exit. I'm also fine with this name though.
I've incorporated your feedback into a new approach:
There is no rethrow functionality because it's not needed. Netty and Atomix manually call the I still need to verify that this works exactly as I described by either manually testing or ideally writing tests for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good. I have one request (split in two but it's basically the same), and one question: how will we test this? Can we even test this? I admit I didn't spend too much time thinking about it, but I'm curious what you came up with. Happy to do that with you too, ofc.
util/src/main/java/io/camunda/zeebe/util/FatalErrorHandler.java
Outdated
Show resolved
Hide resolved
util/src/main/java/io/camunda/zeebe/util/FatalErrorHandler.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Before merging, please document the FatalErrorHandler
, primarily why we consider VirtualMachineError
unrecoverable and why we decide to exit because of it. No need for another review (but feel free to request one if you'd like)
public void handleError(final Throwable e) { | ||
if (e instanceof VirtualMachineError) { | ||
log.error("An unrecoverable VM error was thrown, exiting now.", e); | ||
System.exit(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔧 Maybe we want to use a specific error code to indicate OOM? 137 is the one used by the Linux OOM killer (or 128 + 9). We could reuse this, or we could add one. My thoughts here are that when a user specifies "hey the program exited with X", then we know for sure it's because of this. I know we have the log statement of course, but having a specific status code can be useful if you want to react automatically to it. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like to idea to use a custom error code. I'd prefer not to use 137
for OOM precisely because it is so commonly used. I believe that this would make it a bit more difficult to distinguish between external OOM handling (the linux OOM killer, kubernetes etc.) and internal, JVM OOM handling. I'd suggest using exit code 156, which is ascii Z + B and as far as I can see only used by one product that isn't easily confused with Zeebe.
|
||
public void handleError(final Throwable e) { | ||
if (e instanceof VirtualMachineError) { | ||
log.error("An unrecoverable VM error was thrown, exiting now.", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔧 Operators are unlikely to read our Javadoc, so we might want to expand the error message to explain why this class of error is unrecoverable and why we decide to exit. If we had a public troubleshooting guide we'd probably want to link it here as well, but I guess there's no user facing doc to link here 😞 I'm open to any ideas or suggestions you have here though. But I'm just thinking, if I'm an SRE and I'm deploying a Zeebe cluster, and I run into this, do I know what I have to do?
@oleschoenburg - check in with Roman about this. He had some good ideas about better error handling for actors and we might want to update this PR now or after with those. |
I've discussed this with @romansmirnov and will try to summarize here: As Roman has described, actors threads only catch My work in this PR only partially solves the problem: As we are now setting a default unhandled exception handler, In my opinion it makes sense to also include a fix for this here as the work of verifying that exception handling is behaving correctly is likely to be the same. |
When I talked to Roman, we also discussed notifying actors on non fatal throwables. I think this is done in ActorJob, not ActorThread. Did you two discuss this, and if so, is there a reason this was discarded? Would we do this as a follow up? Doing this (or maybe instead closing the actor) would be one step towards a supervision like error handling strategy. |
I don't think we have talked about this, no. If I understand correctly, the current situation is that actors get notified about all |
Not a blocker, we can still proceed here and add it later. I was just wondering. |
@oleschoenburg, sorry, seems that I was not too explicit about it, but actually, we touched on handling non-fatal throwables in the actor. As a concrete example, we talked about closing the log stream, and then we tried to understand what will happen in a failure case during the starting phase and during the runtime phase (we tried to understand what the failure listeners are doing, and then figured out the Zeebe partition will trigger a step down in Raft but also try to recover on Zeebe-side). However, on any failure ( In addition, to what you already have implemented, I would suggest that the That way, the log stream can handle those scenarios gracefully and close as mentioned above. And the Let's synchronize on this tomorrow to ensure that we are on the same page :) |
No that was my bad, I wasn't connecting the dots here and I left out that part of our discussion because I wanted to understand the Logstream error handling first. I agree that extending the error handling in ActorJob would make sense. |
d610f37
to
291a99c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleschoenburg, I looked especially at the handling of non-fatal throwables in the actor. To me, it looks good. So, implementation-wise there is nothing to add. The only concern is about potential test cases though. I could imagine that it should be possible to write test cases for the actor to ensure that throwables are handled accordingly. WDYT?
291a99c
to
816a25f
Compare
FYI, there is a way to test methods which also call Unfortunately, package io.camunda.zeebe.util;
import static org.assertj.core.api.Assertions.assertThatCode;
import java.security.Permission;
import java.util.stream.Stream;
import java.util.zip.ZipError;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.extension.ExtendWith;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import org.slf4j.Logger;
@ExtendWith(MockitoExtension.class)
final class FatalErrorHandlerTest {
private final ExitTrappingSecurityManager exitTrappingSecurityManager =
new ExitTrappingSecurityManager();
private SecurityManager previousSecurityManager;
private @Mock Logger logger;
@BeforeEach
void beforeEach() {
previousSecurityManager = System.getSecurityManager();
System.setSecurityManager(exitTrappingSecurityManager);
}
@AfterEach
void afterEach() {
System.setSecurityManager(previousSecurityManager);
}
@ParameterizedTest
@MethodSource("provideFatalErrors")
void shouldExitOnFatalError(final Throwable fatalError) {
// given
final FatalErrorHandler handler = new FatalErrorHandler(logger);
// when - then
assertThatCode(() -> handler.handleError(fatalError))
.isInstanceOf(ExitException.class)
.hasFieldOrPropertyWithValue("status", FatalErrorHandler.EXIT_CODE);
}
private static Stream<Throwable> provideFatalErrors() {
return Stream.of(
new OutOfMemoryError(),
new InternalError(),
new StackOverflowError(),
new UnknownError(),
new ZipError("Failure"));
}
private static final class ExitException extends SecurityException {
private final int status;
public ExitException(final int status) {
super();
this.status = status;
}
public int getStatus() {
return status;
}
}
private static final class ExitTrappingSecurityManager extends SecurityManager {
// it's necessary to override checkPermission in general, as otherwise you this would explode
// before we get to throw our exception in checkExit
@Override
public void checkPermission(final Permission perm) {}
@Override
public void checkPermission(final Permission perm, final Object context) {}
@Override
public void checkExit(final int status) {
throw new ExitException(status);
}
}
} I figured I'd document it here for posterity - I'm not inclined to add this to the code base at the moment, but I would like to discuss its value a bit. |
One small thing - it seems So with this, does it mean that the application would exit when this happens? |
@npepinpe Yes, that's correct. |
Then I think we need to explicitly handle it in the Journal, at least until #7607 is done. I understand that with disk watermarks it shouldn't happen, but it can, due to race conditions (e.g. disk fills up before the disk limits kick in). In that case, this is actually a recoverable error. wdyt? |
@npepinpe How would we recover from this in the journal? If we are out of disk there is not much we can do, right? |
Hm, I suppose we can't compact without a new snapshot =/ I guess the user would have to manually expand the disk or make space? In Camunda Cloud, our disk will auto resize on the fly (up to a certain limit), so it would somewhat recover. |
That sounds to me like the behavior in this PR is ok then. If some hypervisor decides to restart the broker after exiting, that leaves a good opportunity to reattach resized disks (if I recall correctly not all managed k8s distributions can resize mounted disks, some need to detach the disk first). |
Hm, let's walk through it.
The behavior seems correct then, but I don't think users will know what to do with this error when it occurs. The error in this case is inscrutable, i.e. I don't think the user will have any idea on how to fix it from the error/message. See the original bug: #6504. The message: |
13e1ce7
to
c0c8a58
Compare
7bc5c6e
to
c27bd13
Compare
Here we are adding the concept of a FatalErrorHandler that ensures that we handle unrecoverable errors consistently. The main implementation is VirtualMachineErrorHandler which exits with a custom status code when the JVM reports VirtualMachineErrors. We are using the FatalErrorHandler in a couple of different places: * actors and actor threads * atomix threads * as a default uncaught exception handler for all threads There are still places where we are swallowing errors, most notably netty's OutOfDirectMemoryError but this patch should already improve the current situation.
c27bd13
to
5c3090f
Compare
As discussed with @npepinpe I've removed all netty-related changes from this PR. As such, it doesn't actually close #7807 but it's still an improvement because VM errors in actors and atomix threads are handled correctly. I've also rebased on develop and squashed the remaining commits as they were not providing any useful history. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Will you create a follow up, or update the other issue on what's left? I was also under the impression we'd discussed further tests for the actor scheduler, but I also can't remember anymore 😄
util/src/main/java/io/camunda/zeebe/util/error/FatalErrorHandlers.java
Outdated
Show resolved
Hide resolved
Let's see what we could test:
I think this is also what we discussed before. As this would only test that we don't remove code and it would require additional constructors to inject spy-able |
I was in favor because removing it would go entirely unnoticed, and would invalidate assumptions in the future about the system behavior, but perhaps the name is ominous enough that nobody would randomly remove it 😄 Let's go with the current state 👍 |
bors r+ |
8327: Always exit on unrecoverable VM errors r=oleschoenburg a=oleschoenburg ## Description * Sets a default uncaught exception handler that shuts down on any `VirtualMachineError`. * Shuts down on `VirtualMachineError`s in atomix threads. * Shuts down on `VirtualMachineError`s in actor threads. <!-- Please explain the changes you made here. --> ## Related issues <!-- Which issues are closed by this PR or are related --> relates to #7807 but does not close it. Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Build failed: |
bors retry |
/backport |
Successfully created backport PR #8535 for |
Successfully created backport PR #8536 for |
8519: [Backport stable/1.3] test(atomix): faster `RaftRule` tests r=oleschoenburg a=github-actions[bot] # Description Backport of #8501 to `stable/1.3`. relates to 8536: [Backport stable/1.3] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot] # Description Backport of #8327 to `stable/1.3`. relates to #7807 8538: [Backport stable/1.3] fix: print correct json input r=Zelldon a=github-actions[bot] # Description Backport of #8522 to `stable/1.3`. relates to #8284 Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Description
VirtualMachineError
.VirtualMachineError
s in atomix threads.VirtualMachineError
s in actor threads.Related issues
relates to #7807 but does not close it.
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/0.25
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation: