
Always exit on unrecoverable VM errors #8327

Merged — 2 commits merged into develop from 7807-always-exit-on-vm-error, Jan 6, 2022

Conversation

@lenaschoenburg (Member) commented Dec 7, 2021

Description

  • Sets a default uncaught exception handler that shuts down on any VirtualMachineError.
  • Shuts down on VirtualMachineErrors in atomix threads.
  • Shuts down on VirtualMachineErrors in actor threads.

Related issues

relates to #7807 but does not close it.

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/0.25) to the PR; in case that fails, you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement

@lenaschoenburg changed the title from "7807 Always exist on VM errors" to "Always exit on unrecoverable VM errors" on Dec 7, 2021
@lenaschoenburg (Member Author) commented:

The CI fails due to a timeout:

[2021-12-07T17:43:44.379Z] [INFO] Installing /home/jenkins/agent/workspace/ebe_7807-always-exit-on-vm-error/dist/target/camunda-cloud-zeebe-1.3.0-SNAPSHOT.tar.gz to /home/jenkins/agent/workspace/ebe_7807-always-exit-on-vm-error/m2-repository/io/camunda/camunda-cloud-zeebe/1.3.0-SNAPSHOT/camunda-cloud-zeebe-1.3.0-SNAPSHOT.tar.gz
[2021-12-07T17:48:37.563Z] Cancelling nested steps due to timeout
[2021-12-07T17:48:37.567Z] Sending interrupt signal to process
[2021-12-07T17:48:37.567Z] Killing processes
[2021-12-07T17:48:37.950Z] kill finished with exit code 0
[2021-12-07T17:48:51.002Z] Terminated

which is a bit suspicious but might just be a temporary issue in our CI environment. mvn install works on my machine ™

@npepinpe (Member) left a comment

🚀

  • ❌ Generally I think we can have a better error message here, one which explains why this is a fatal error and why we are exiting early, as I can imagine we might forget the "why" in the future, or it might be surprising for new developers/operators later on.
  • ❌ Is it possible to test this at all in any way?
  • 💭 Would it make sense to reuse this as well in the Netty and Actor code? Just delegate to this exception handler for fatal errors? e.g.
import com.google.common.annotations.VisibleForTesting;
import io.camunda.zeebe.broker.Loggers;
import java.lang.Thread.UncaughtExceptionHandler;
import org.agrona.LangUtil;
import org.slf4j.Logger;

public final class FatalErrorHandler implements UncaughtExceptionHandler {
  private static final Logger DEFAULT_LOGGER = Loggers.SYSTEM_LOGGER;
  private final Logger logger;

  public FatalErrorHandler() {
    this(DEFAULT_LOGGER);
  }

  public FatalErrorHandler(final Logger logger) {
    this.logger = logger;
  }

  public void exitOnFatalErrorOrRethrow(final Throwable error) {
    exitOnFatalError(error);
    LangUtil.rethrowUnchecked(error);
  }

  public void exitOnFatalError(final Throwable error) {
    if (!isFatalError(error)) {
      return;
    }

    logger.error("An unexpected fatal error was thrown; exiting now.", error);
    System.exit(1);
  }

  @VisibleForTesting // ? maybe?
  boolean isFatalError(final Throwable error) {
    return error instanceof VirtualMachineError;
  }

  @Override
  public void uncaughtException(final Thread t, final Throwable e) {
    exitOnFatalErrorOrRethrow(e);
  }
}

Then, say, in ActorThread:

    try {
      resubmit = currentTask.execute(this);
    } catch (final Exception e) {
      // TODO: check interrupt state?
      // TODO: Handle Exception
      LOG.error("Unexpected error occurred in task {}", currentTask, e);
      // TODO: resubmit on exception?
      //                resubmit = true;
    } catch (final Throwable error) {
      fatalErrorHandler.exitOnFatalErrorOrRethrow(error);
    } finally {
      MDC.remove("actor-name");
      clock.update();
    }
    if (resubmit) {
      currentTask.resubmit();
    }
  }

And in Netty:

    @Override
    public void exceptionCaught(final ChannelHandlerContext ctx, final Throwable cause)
        throws Exception {
      fatalErrorHandler.exitOnFatalError(cause);
      future.completeExceptionally(cause);
      super.exceptionCaught(ctx, cause);
    }

Just a thought. I'm still not 100% sure if that's the right approach.

@@ -93,6 +93,9 @@ private void executeCurrentTask() {

try {
resubmit = currentTask.execute(this);
} catch (final VirtualMachineError e) {
@npepinpe (Member) commented:

❓ If we set the default thread exception handler, and we don't handle VirtualMachineError here, would the default exception handler just handle it then? So maybe here it's not necessary?

import io.camunda.zeebe.broker.Loggers;
import java.lang.Thread.UncaughtExceptionHandler;

public final class ExitingUncaughtExceptionHandler implements UncaughtExceptionHandler {
@npepinpe (Member) commented:

🙃 FatalErrorExceptionHandler? I guess we will use this to handle what we consider fatal errors, at which point the behavior would be to exit. I'm also fine with this name though.

@lenaschoenburg (Member Author) commented:

I've incorporated your feedback into a new approach:

FatalErrorHandler should be easy to understand:

public final class FatalErrorHandler implements UncaughtExceptionHandler {
  private final Logger log;

  public FatalErrorHandler(final Logger log) {
    this.log = log;
  }

  public void handleError(final Throwable e) {
    if (e instanceof VirtualMachineError) {
      log.error("An unrecoverable VM error was thrown, exiting now.", e);
      System.exit(1);
    }
  }

  @Override
  public void uncaughtException(final Thread t, final Throwable e) {
    handleError(e);
  }
}

There is no rethrow functionality because it's not needed. Netty and Atomix manually call the handleError method to ensure that we properly handle VM errors. All other (normal) threads should do this implicitly because we use FatalErrorHandler as a default uncaught exception handler.
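
For illustration, registering it as the JVM-wide default could look roughly like this (a minimal sketch; the wrapper class and install site are assumptions, FatalErrorHandler is the class shown above):

import io.camunda.zeebe.broker.Loggers;

final class FatalErrorHandlerRegistration {
  static void install() {
    // Threads without their own handler fall back to this one, so a
    // VirtualMachineError anywhere triggers the exit path.
    Thread.setDefaultUncaughtExceptionHandler(new FatalErrorHandler(Loggers.SYSTEM_LOGGER));
  }
}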

I still need to verify that this works exactly as I described by either manually testing or ideally writing tests for this.

@npepinpe (Member) left a comment

👍 Looks good. I have one request (split in two but it's basically the same), and one question: how will we test this? Can we even test this? I admit I didn't spend too much time thinking about it, but I'm curious what you came up with. Happy to do that with you too, ofc.

@npepinpe (Member) left a comment

👍

Before merging, please document the FatalErrorHandler, primarily why we consider VirtualMachineError unrecoverable and why we decide to exit because of it. No need for another review (but feel free to request one if you'd like)

public void handleError(final Throwable e) {
if (e instanceof VirtualMachineError) {
log.error("An unrecoverable VM error was thrown, exiting now.", e);
System.exit(1);
@npepinpe (Member) commented:

🔧 Maybe we want to use a specific error code to indicate OOM? 137 is the one used by the Linux OOM killer (or 128 + 9). We could reuse this, or we could add one. My thoughts here are that when a user specifies "hey the program exited with X", then we know for sure it's because of this. I know we have the log statement of course, but having a specific status code can be useful if you want to react automatically to it. Let me know what you think.

@lenaschoenburg (Member Author) commented:

I like the idea of using a custom error code. I'd prefer not to use 137 for OOM precisely because it is so commonly used. I believe that would make it a bit more difficult to distinguish between external OOM handling (the Linux OOM killer, Kubernetes etc.) and internal, JVM OOM handling. I'd suggest using exit code 156, which is ASCII 'Z' + 'B' and, as far as I can see, only used by one product that isn't easily confused with Zeebe.
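
As a quick check of the arithmetic (the constant name matches the EXIT_CODE field referenced in the test sketch further down; placing it here is an assumption):

// In Java, char values are numeric: 'Z' is 90 and 'B' is 66, so 'Z' + 'B' == 156.
public static final int EXIT_CODE = 'Z' + 'B'; // 156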


public void handleError(final Throwable e) {
if (e instanceof VirtualMachineError) {
log.error("An unrecoverable VM error was thrown, exiting now.", e);
@npepinpe (Member) commented:

🔧 Operators are unlikely to read our Javadoc, so we might want to expand the error message to explain why this class of error is unrecoverable and why we decide to exit. If we had a public troubleshooting guide we'd probably want to link it here as well, but I guess there's no user facing doc to link here 😞 I'm open to any ideas or suggestions you have here though. But I'm just thinking, if I'm an SRE and I'm deploying a Zeebe cluster, and I run into this, do I know what I have to do?
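
For instance (hypothetical wording, not the message that was merged):

log.error(
    "An unrecoverable error of type {} was thrown. Errors of this kind indicate that "
        + "the JVM itself can no longer operate correctly (e.g. OutOfMemoryError, "
        + "StackOverflowError), so the broker exits immediately instead of continuing "
        + "in a potentially corrupted state. Expect your supervisor (systemd, "
        + "Kubernetes, ...) to restart the process; if this repeats, check memory "
        + "limits and disk space.",
    e.getClass().getSimpleName(),
    e);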

@npepinpe (Member) commented:

@oleschoenburg - check in with Roman about this. He had some good ideas about better error handling for actors and we might want to update this PR now or after with those.

@lenaschoenburg (Member Author) commented:

I've discussed this with @romansmirnov and will try to summarize here:

As Roman has described, actor threads only catch Exceptions and ignore other Throwables, allowing them to bubble up. This kills the actor thread where an actor is running and leaves an orphaned actor that can no longer make progress.

My work in this PR only partially solves the problem: as we are now setting a default unhandled exception handler, VirtualMachineErrors are correctly handled (by exiting), but other Throwables are not. To fully solve the problem, we'd need to let actor threads catch all Throwables, delegate to FatalErrorHandler, and then do what they previously did when catching Exceptions: log the error and continue working.
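
A sketch of that change, based on the executeCurrentTask snippet quoted earlier (same names as in that snippet):

    try {
      resubmit = currentTask.execute(this);
    } catch (final Throwable t) {
      // Exits the JVM for VirtualMachineErrors; returns normally otherwise.
      fatalErrorHandler.handleError(t);
      // Non-fatal Throwables now get the treatment previously reserved for
      // Exceptions: log and keep the actor thread alive.
      LOG.error("Unexpected error occurred in task {}", currentTask, t);
    } finally {
      MDC.remove("actor-name");
      clock.update();
    }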

In my opinion it makes sense to also include a fix for this here as the work of verifying that exception handling is behaving correctly is likely to be the same.

@npepinpe (Member) commented:

When I talked to Roman, we also discussed notifying actors on non-fatal Throwables. I think this is done in ActorJob, not ActorThread. Did you two discuss this, and if so, is there a reason this was discarded? Would we do this as a follow-up? Doing this (or maybe instead closing the actor) would be one step towards a supervision-like error-handling strategy.

@lenaschoenburg (Member Author) commented:

I don't think we have talked about this, no. If I understand correctly, the current situation is that actors get notified about all Exceptions (but not Throwables) as long as they are not currently starting or closing.

@npepinpe (Member) commented Dec 14, 2021

Not a blocker, we can still proceed here and add it later. I was just wondering.

@romansmirnov (Member) commented Dec 14, 2021

@oleschoenburg, sorry, it seems that I was not too explicit about it, but we actually touched on handling non-fatal Throwables in the actor. As a concrete example, we talked about closing the log stream, and then tried to understand what happens in a failure case during the starting phase and during the runtime phase (we looked at what the failure listeners are doing, and figured out that the Zeebe partition will trigger a step down in Raft but also try to recover on the Zeebe side). However, on any failure (Exceptions and non-fatal Throwables), the log stream must be closed during the starting phase, and in addition, the appender future needs to be completed exceptionally. Otherwise, the log stream would be in an unclear state and the stream processor would still be waiting for the completion of the future. Basically, if something goes wrong, the actor should have a chance to handle it gracefully so that the entire partition can transition into an expected state and try to recover accordingly.

In addition to what you have already implemented, I would suggest that the ActorJob catches unhandled non-fatal Throwables in the same way as it already does with regular uncaught Exceptions:

https://github.com/camunda-cloud/zeebe/blob/df9710cc95436a9b512ca751114a1e5fbdc7c685/util/src/main/java/io/camunda/zeebe/util/sched/ActorJob.java#L46-L48

That way, the log stream can handle those scenarios gracefully and close as mentioned above. And the ActorThread does not die.
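
A minimal sketch of what that could mean for ActorJob (the names around the catch block are assumptions; the point is the widened catch plus delegation):

try {
  invoke(); // the job's actual work, as in the linked lines
} catch (final Throwable t) {
  fatalErrorHandler.handleError(t); // exits on VirtualMachineError, no-op otherwise
  // Non-fatal Throwables take the existing failure path so that actors such as
  // the log stream can close gracefully and complete their futures exceptionally.
  task.onFailure(t); // hypothetical: mirrors the existing handling of Exceptions
}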

Let's synchronize on this tomorrow to ensure that we are on the same page :)

@lenaschoenburg (Member Author) commented:

sorry, seems that I was not too explicit about it, but actually, we touched on handling non-fatal throwables in the actor

No, that was my bad. I wasn't connecting the dots here, and I left out that part of our discussion because I wanted to understand the log stream error handling first. I agree that extending the error handling in ActorJob would make sense.

@lenaschoenburg force-pushed the 7807-always-exit-on-vm-error branch 2 times, most recently from d610f37 to 291a99c, December 15, 2021 16:59
@romansmirnov self-requested a review December 17, 2021 08:53
@romansmirnov (Member) left a comment

@oleschoenburg, I looked especially at the handling of non-fatal throwables in the actor. To me, it looks good. So, implementation-wise there is nothing to add. The only concern is about potential test cases though. I could imagine that it should be possible to write test cases for the actor to ensure that throwables are handled accordingly. WDYT?

@npepinpe (Member) commented Dec 22, 2021

FYI, there is a way to test methods which also call System.exit by overriding the default SecurityManager and changing the default behavior to throw an exception instead of allowing the system to exit.

Unfortunately, SecurityManager is deprecated and marked for removal in Java 18, with no replacement. There's an open issue to introduce an API for this, as several IDEs rely on it, but I have no idea if it's being used or not. Here's what a simple unit test would look like:

package io.camunda.zeebe.util;

import static org.assertj.core.api.Assertions.assertThatCode;

import java.security.Permission;
import java.util.stream.Stream;
import java.util.zip.ZipError;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.extension.ExtendWith;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import org.slf4j.Logger;

@ExtendWith(MockitoExtension.class)
final class FatalErrorHandlerTest {
  private final ExitTrappingSecurityManager exitTrappingSecurityManager =
      new ExitTrappingSecurityManager();
  private SecurityManager previousSecurityManager;
  private @Mock Logger logger;

  @BeforeEach
  void beforeEach() {
    previousSecurityManager = System.getSecurityManager();
    System.setSecurityManager(exitTrappingSecurityManager);
  }

  @AfterEach
  void afterEach() {
    System.setSecurityManager(previousSecurityManager);
  }

  @ParameterizedTest
  @MethodSource("provideFatalErrors")
  void shouldExitOnFatalError(final Throwable fatalError) {
    // given
    final FatalErrorHandler handler = new FatalErrorHandler(logger);

    // when - then
    assertThatCode(() -> handler.handleError(fatalError))
        .isInstanceOf(ExitException.class)
        .hasFieldOrPropertyWithValue("status", FatalErrorHandler.EXIT_CODE);
  }

  private static Stream<Throwable> provideFatalErrors() {
    return Stream.of(
        new OutOfMemoryError(),
        new InternalError(),
        new StackOverflowError(),
        new UnknownError(),
        new ZipError("Failure"));
  }

  private static final class ExitException extends SecurityException {
    private final int status;

    public ExitException(final int status) {
      super();
      this.status = status;
    }

    public int getStatus() {
      return status;
    }
  }

  private static final class ExitTrappingSecurityManager extends SecurityManager {
    // it's necessary to override checkPermission in general, as otherwise this would explode
    // before we get to throw our exception in checkExit
    @Override
    public void checkPermission(final Permission perm) {}

    @Override
    public void checkPermission(final Permission perm, final Object context) {}

    @Override
    public void checkExit(final int status) {
      throw new ExitException(status);
    }
  }
}

I figured I'd document it here for posterity - I'm not inclined to add this to the code base at the moment, but I would like to discuss its value a bit.

@npepinpe (Member) commented Jan 3, 2022

One small thing - it seems InternalError is covered here, correct? We know that InternalError is the error that is thrown when operations fail with MappedByteBuffer - for example, if the underlying file cannot be grown because there's no disk space anymore. See #7607.

So with this, does it mean that the application would exit when this happens?

@lenaschoenburg (Member Author) commented:

@npepinpe Yes, that's correct. InternalError is a VirtualMachineError so we exit.
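
That follows directly from the JDK class hierarchy:

// java.lang.InternalError extends java.lang.VirtualMachineError, so the
// instanceof check in handleError matches it (as does ZipError, which
// extends InternalError - see provideFatalErrors in the test above).
assert new InternalError() instanceof VirtualMachineError;
assert new java.util.zip.ZipError("x") instanceof VirtualMachineError;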

@npepinpe (Member) commented Jan 3, 2022

Then I think we need to explicitly handle it in the Journal, at least until #7607 is done. I understand that with disk watermarks it shouldn't happen, but it can, due to race conditions (e.g. disk fills up before the disk limits kick in). In that case, this is actually a recoverable error. wdyt?

@lenaschoenburg (Member Author) commented:

@npepinpe How would we recover from this in the journal? If we are out of disk there is not much we can do, right?

@npepinpe (Member) commented Jan 3, 2022

Hm, I suppose we can't compact without a new snapshot =/ I guess the user would have to manually expand the disk or make space? In Camunda Cloud, our disk will auto resize on the fly (up to a certain limit), so it would somewhat recover.

@lenaschoenburg (Member Author) commented:

That sounds to me like the behavior in this PR is OK then. If some hypervisor decides to restart the broker after exiting, that leaves a good opportunity to reattach resized disks (if I recall correctly, not all managed k8s distributions can resize mounted disks; some need to detach the disk first).

@npepinpe (Member) commented Jan 3, 2022

Hm, let's walk through it.

InternalError may be thrown when operations against a MappedByteBuffer fail. This is primarily because the Buffer API doesn't really allow for IOException without breaking its API. So a file being truncated, or not being able to grow, or suddenly having its permissions changed, would result in an InternalError, which would in turn (most likely) cause us to exit. Now, if this happened, it means we've run out of disk space - let's ignore the cases where permissions are changed midway, or the file is truncated by an outside observer. Deleting the file itself has little impact as we know; the only thing affecting it is truncation. We don't truncate files in Zeebe (as of yet), therefore the only other case is an external agent doing so - we can ignore this for now. So the only case is running out of disk space.

If you're actually out of disk space, it most likely means you can't compact, since you can't produce/replicate a new snapshot (remember - no disk space to store it). I think there's a small window where you're about to compact (but haven't yet) in which this could happen. There's also the case where we leak readers, which might hold onto files and prevent them from being deleted. This may then resolve itself (?), depending on the cause. But I imagine the common case is that it's not recoverable without manual intervention.

The behavior seems correct then, but I don't think users will know what to do with this error when it occurs. The error in this case is inscrutable, i.e. I don't think the user will have any idea how to fix it from the error/message. See the original bug: #6504. The message: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code. The stack trace will hint at what the error could be, but I don't expect most users/operators to know how to fix this. This is a bit out of scope, so it could be a follow-up issue. But it should be clear how to fix it, and if we backport this, then we should also backport the improved logging.

Here we are adding the concept of a FatalErrorHandler that ensures that
we handle unrecoverable errors consistently.
The main implementation is VirtualMachineErrorHandler which exits with a
custom status code when the JVM reports VirtualMachineErrors.
We are using the FatalErrorHandler in a couple of different places:
* actors and actor threads
* atomix threads
* as a default uncaught exception handler for all threads

There are still places where we are swallowing errors, most notably
Netty's OutOfDirectMemoryError, but this patch should already improve
the current situation.
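
Putting the review threads together, the merged handler plausibly looks like this (a reconstruction from the commit message and the discussion above, not the verbatim source):

import java.lang.Thread.UncaughtExceptionHandler;
import org.slf4j.Logger;

public final class VirtualMachineErrorHandler implements UncaughtExceptionHandler {
  // 'Z' + 'B' = 156, as suggested above, to stay distinguishable from the
  // Linux OOM killer's 137.
  static final int EXIT_CODE = 'Z' + 'B';
  private final Logger log;

  VirtualMachineErrorHandler(final Logger log) {
    this.log = log;
  }

  public void handleError(final Throwable e) {
    // VirtualMachineErrors (OutOfMemoryError, StackOverflowError, InternalError, ...)
    // mean the JVM can no longer be trusted, so exit rather than limp on.
    if (e instanceof VirtualMachineError) {
      log.error("An unrecoverable VM error was thrown, exiting now.", e);
      System.exit(EXIT_CODE);
    }
  }

  @Override
  public void uncaughtException(final Thread t, final Throwable e) {
    handleError(e);
  }
}
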
@lenaschoenburg (Member Author) commented:

As discussed with @npepinpe I've removed all netty-related changes from this PR. As such, it doesn't actually close #7807 but it's still an improvement because VM errors in actors and atomix threads are handled correctly.

I've also rebased on develop and squashed the remaining commits as they were not providing any useful history.

@npepinpe (Member) left a comment

👍

Will you create a follow up, or update the other issue on what's left? I was also under the impression we'd discussed further tests for the actor scheduler, but I also can't remember anymore 😄

@lenaschoenburg (Member Author) commented:

I was also under the impression we'd discussed further tests for the actor scheduler, but I also can't remember anymore

Let's see what we could test:

  • ActorJob#invoke throws -> ActorJob's FatalErrorHandler is notified
  • ActorTask#execute throws -> ActorThread's FatalErrorHandler is notified
  • ActorThread throws -> the default UncaughtExceptionHandler is notified

I think this is also what we discussed before. As this would only test that we don't remove code, and it would require additional constructors to inject spy-able FatalErrorHandlers, I'm a bit hesitant to add it. @npepinpe I think you were in favor of this anyway, right?
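
A rough sketch of what such a test would require (both the handler-injecting factory and the task-submitting helper are hypothetical - this is exactly the extra surface mentioned above):

// assumes org.junit.jupiter.api.Test and org.mockito.Mockito are on the classpath
@Test
void shouldNotifyFatalErrorHandlerWhenTaskThrows() {
  // given - a mockable handler, injected through a hypothetical constructor
  final FatalErrorHandler handler = Mockito.mock(FatalErrorHandler.class);
  final ActorThread thread = newActorThreadWith(handler); // hypothetical factory

  // when - a submitted task throws a non-fatal Throwable
  submitThrowingTask(thread, new AssertionError("expected")); // hypothetical helper

  // then - the handler saw the error (non-fatal, so no System.exit)
  Mockito.verify(handler, Mockito.timeout(1_000)).handleError(Mockito.any(AssertionError.class));
}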

@npepinpe (Member) commented Jan 6, 2022

I was in favor because removing it would go entirely unnoticed, and would invalidate assumptions in the future about the system behavior, but perhaps the name is ominous enough that nobody would randomly remove it 😄 Let's go with the current state 👍

@lenaschoenburg (Member Author) commented:

bors r+

ghost pushed a commit that referenced this pull request Jan 6, 2022
8327: Always exit on unrecoverable VM errors r=oleschoenburg a=oleschoenburg

## Description

 * Sets a default uncaught exception handler that shuts down on any `VirtualMachineError`.
 * Shuts down on `VirtualMachineError`s in atomix threads.
 * Shuts down on `VirtualMachineError`s in actor threads.
 

## Related issues


relates to #7807 but does not close it. 



Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
@ghost commented Jan 6, 2022

Build failed:

@lenaschoenburg (Member Author) commented:

bors retry

@ghost merged commit 73e5c7b into develop Jan 6, 2022
@ghost deleted the 7807-always-exit-on-vm-error branch January 6, 2022 11:50
@lenaschoenburg (Member Author) commented:

/backport

@github-actions bot (Contributor) commented Jan 6, 2022

Successfully created backport PR #8535 for stable/1.2.

@github-actions bot (Contributor) commented Jan 6, 2022

Successfully created backport PR #8536 for stable/1.3.

ghost pushed a commit that referenced this pull request Jan 6, 2022
8535: [Backport stable/1.2] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8327 to `stable/1.2`.

relates to #7807

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
ghost pushed a commit that referenced this pull request Jan 6, 2022
8519: [Backport stable/1.3] test(atomix): faster `RaftRule` tests r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8501 to `stable/1.3`.

relates to 

8536: [Backport stable/1.3] Always exit on unrecoverable VM errors r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8327 to `stable/1.3`.

relates to #7807

8538: [Backport stable/1.3] fix: print correct json input r=Zelldon a=github-actions[bot]

# Description
Backport of #8522 to `stable/1.3`.

relates to #8284

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
@npepinpe added the version:1.3.1 label (marks an issue as being completely or in parts released in 1.3.1) Jan 17, 2022