
OutOfMemory in follower #7992

Closed
deepthidevaki opened this issue Oct 14, 2021 · 24 comments · Fixed by #8605
Labels
kind/bug: Categorizes an issue or PR as a bug
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
severity/mid: Marks a bug as having a noticeable impact but with a known workaround
version:1.3.0: Marks an issue as being completely or in parts released in 1.3.0
version:1.3.2: Marks an issue as being completely or in parts released in 1.3.2

Comments

@deepthidevaki
Contributor

Describe the bug

In the release-1.2.0 benchmark, we observed that the node which is the follower for all partitions went out of memory.
Might be related to #7784


https://console.cloud.google.com/errors/CMz-q-2yh8X-pQE?service=zeebe-broker&time=P7D&project=zeebe-io

Environment:

  • Zeebe Version: 1.2.0
@deepthidevaki deepthidevaki added the kind/bug Categorizes an issue or PR as a bug label Oct 14, 2021
@deepthidevaki
Contributor Author

Here is the heap dump from the node.

@npepinpe npepinpe added 1.2.1 scope/broker Marks an issue or PR to appear in the broker section of the changelog severity/mid Marks a bug as having a noticeable impact but with a known workaround labels Oct 15, 2021
@npepinpe npepinpe added this to Planned in Zeebe Oct 15, 2021
@npepinpe npepinpe removed this from Planned in Zeebe Oct 15, 2021
@npepinpe
Member

@deepthidevaki please timebox and check if the issue is obvious from the heap dump, in which case we can just go ahead and fix it.

@deepthidevaki
Contributor Author

Looks very similar to #7744. There are many "StreamProcessor" instances which are already closed but not garbage collected.

@npepinpe
Member

npepinpe commented Oct 15, 2021

So we didn't fix it? 😅 Do we expect the cause to be the same?

@deepthidevaki
Contributor Author

I don't know if it is the same reason. I'm trying to get more info from it. VisualVM goes out of memory while I'm analyzing this heap dump 😀

@Zelldon
Member

Zelldon commented Oct 15, 2021

Lol 😆

@lenaschoenburg
Member

@deepthidevaki If that helps: I managed to open the heap dump in VisualVM; it just took a while.

@deepthidevaki
Contributor Author

Thanks @oleschoenburg. Are you able to compute the GC roots? That gets stuck for me: it makes progress for a while and then everything hangs. Increasing memory did not help.

@lenaschoenburg
Member

@deepthidevaki I'm assuming you want to find the GC Root of one of the StreamProcessors? I've started the search but it looks like it's going to take a while.

@deepthidevaki
Contributor Author

deepthidevaki commented Oct 15, 2021

Meanwhile, I'm looking at the logs to understand what happened.

It seems like the transition is stuck. Initially you see the logs for the cleanup and transition of each step. After some time, we see logs that trigger the transitions, but no steps are executed. It looks like the tasks are just queued on the actor, waiting for something to happen. This goes on for hours until the node went OOM.

I 2021-10-09T10:46:21.814549Z Broker-2-ZeebePartition-1 Cleanup of transition to FOLLOWER on term 4 completed  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.814790Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 starting  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815031Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 - executing LogStorage  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815178Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 - executing LogStream  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815306Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 - executing ZeebeDb  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815424Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 - executing QueryService  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815521Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 - executing StreamProcessor  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815623Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 - executing SnapshotDirector  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815718Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 - executing ExporterDirector  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:21.815883Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 completed  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.146910Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.147277Z Broker-2-ZeebePartition-1 Received cancel signal for transition to INACTIVE on term -1  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.147379Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 starting (in preparation for new transition to FOLLOWER)  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.147943Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 - executing ExporterDirector  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148144Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 - executing SnapshotDirector  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148205Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 - executing StreamProcessor  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148259Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 - executing QueryService  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148329Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 - executing ZeebeDb  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148395Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 - executing LogStream  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148589Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 - executing LogStorage  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148690Z Broker-2-ZeebePartition-1 Cleanup of transition to INACTIVE on term -1 completed  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148774Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 starting  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148884Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 - executing LogStorage  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.148983Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 - executing LogStream  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.149203Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 - executing ZeebeDb  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.255477Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 - executing QueryService  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.256492Z Broker-2-ZeebePartition-1 Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@6a2644c6)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@7bccb22d, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@6b2e26cb, configuration: Configuration(false)]  Broker-2-ZeebePartition-1
I 2021-10-09T10:46:22.257444Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 - executing StreamProcessor  Broker-2-ZeebePartition-1
I 2021-10-09T10:51:21.403454Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T10:51:21.403915Z Broker-2-ZeebePartition-1 Received cancel signal for transition to FOLLOWER on term 4  Broker-2-ZeebePartition-1
I 2021-10-09T10:51:21.832878Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T10:51:21.833497Z Broker-2-ZeebePartition-1 Received cancel signal for transition to INACTIVE on term -1  Broker-2-ZeebePartition-1
I 2021-10-09T10:56:21.565062Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T10:56:21.565514Z Broker-2-ZeebePartition-1 Received cancel signal for transition to FOLLOWER on term 4  Broker-2-ZeebePartition-1
I 2021-10-09T10:56:22.025631Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T10:56:22.026027Z Broker-2-ZeebePartition-1 Received cancel signal for transition to INACTIVE on term -1  Broker-2-ZeebePartition-1
I 2021-10-09T11:01:21.699949Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T11:01:21.700398Z Broker-2-ZeebePartition-1 Received cancel signal for transition to FOLLOWER on term 4  Broker-2-ZeebePartition-1
I 2021-10-09T11:01:22.156694Z Broker-2-ZeebePartition-1 Transition to FOLLOWER on term 4 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T11:01:22.157104Z Broker-2-ZeebePartition-1 Received cancel signal for transition to INACTIVE on term -1  Broker-2-ZeebePartition-1
I 2021-10-09T11:11:21.659003Z Broker-2-ZeebePartition-1 Transition to INACTIVE on term -1 requested.  Broker-2-ZeebePartition-1
I 2021-10-09T11:11:21.659994Z Broker-2-ZeebePartition-1 Received cancel signal for transition to FOLLOWER on term 4  Broker-2-ZeebePartition-1

Similar case also observed for partition 2 on the same broker.

@deepthidevaki
Contributor Author

There were two times when this broker's memory use went high. In both cases, the noticeable thing that happened was an exception on the LogStream: "java.lang.IllegalStateException: Segment not open" (#7962). This happened a few hours before the above logs.

@deepthidevaki
Contributor Author

"Segment not open" exception happened on partition 3. But we don't see an increase in number of log segments for partition 3.

We do see an increase in number of segments for partition 1 and 2. This increase is correlating to the above logs when the partitions gets stuck in "transitioning".

@deepthidevaki
Contributor Author

deepthidevaki commented Oct 15, 2021

After the "segment not open" exception, partition 3 is also not making any progress. There is an ongoing snapshot replication for hours.
EDIT: All requests for partition 3 to this node from the leader timesout. That explains why snapshot replication did not complete.


@lenaschoenburg
Member

lenaschoenburg commented Oct 15, 2021

@deepthidevaki Okay, VisualVM didn't make any progress after a while. MAT was much more successful, however.

@deepthidevaki
Contributor Author

Thanks @oleschoenburg. I will look into it next week.

@deepthidevaki
Contributor Author

I checked why the transition was stuck. The symptoms are similar to #7694: the StreamProcessor install never completed. It seems "onActorStarting()" is never executed.

@npepinpe npepinpe added this to In progress in Zeebe Oct 18, 2021
@deepthidevaki
Contributor Author

Update:

I was not able to root cause it. But here are my observations:

(Following @Zelldon's 5 why's 🙂 )

Out of memory happens because:

  1. Many ActorJobs are enqueued in the actor. Most of these tasks are transitions triggered by InstallRequests. These tasks are never dequeued because they are never executed.

The tasks are queued because:

  2. The ZeebePartition transition is stuck,

because

  3. the StreamProcessor startup is stuck. Starting the StreamProcessor for one transition never finishes, so all the transitions enqueued after it are waiting for the startup to complete before that transition can be cancelled and new transitions can start.

The StreamProcessor startup is stuck because:

  4. Hypothesis: creating the logStreamWriter has not completed. From the heap dump, we can see that the "appenderFuture" is created, but the appender and write buffer are never created.

The logStreamWriter is not created because:

  5. No idea. We don't see any errors in the logs. It is possible that, if there was an error, the future was not completed (because of improper error handling) and the StreamProcessor waiting on that future waits forever. But we don't see any error in the logs. Another hypothesis is that it is stuck in a loop, so the task never completes. One idea was that the reader is stuck trying to read the last position from the journal; however, there is no evidence in the logs or heap dump to confirm this. What is interesting is that there are two partitions whose StreamProcessor is in this state.

I can confirm that the root cause is not similar to #7744, but it is the same as #7694.

@deepthidevaki
Contributor Author

@npepinpe I don't think I can make further progress on finding the root cause.

One suggestion for improvement:
Proper error handling in LogStreamImpl::newLogStreamBatchWriter. For example, in openAppender, if an exception is thrown while reading the last position, the future is never completed.
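A minimal sketch of that suggestion, with hypothetical names (`openAppender`, `readLastPosition`) and a plain `CompletableFuture` standing in for the actor future; the point is simply that the future is completed on both the success and the error path:

```java
import java.util.concurrent.CompletableFuture;

final class AppenderOpenerSketch {

  // Hypothetical reader interface; stands in for whatever reads the last position.
  interface LastPositionReader {
    long readLastPosition();
  }

  CompletableFuture<Long> openAppender(final LastPositionReader reader) {
    final CompletableFuture<Long> appenderFuture = new CompletableFuture<>();
    try {
      // May throw, e.g. "Segment not open".
      appenderFuture.complete(reader.readLastPosition());
    } catch (final Throwable error) {
      // Without this branch, a caller blocked on the future would wait forever.
      appenderFuture.completeExceptionally(error);
    }
    return appenderFuture;
  }
}
```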

@npepinpe
Member

Then let's do that. If anything occurs to you that could simplify this diagnosis process or help us find the root cause faster next time, please include that as well.

@npepinpe
Member

npepinpe commented Nov 8, 2021

Do we think this might now be fixed?

@deepthidevaki
Contributor Author

Not sure. There might have been multiple reasons for it. #8044 might be one, and it is already fixed. But the state was also similar to #7694, for which the root cause is not clear but which may be fixed by #8038.

@romansmirnov
Member

romansmirnov commented Dec 13, 2021

Re #7992 (comment): raised #8369, which solves the situation where a lock is not released, blocking the installation of a received snapshot.

@romansmirnov
Member

romansmirnov commented Dec 13, 2021

Re #7992 (comment):

No idea. We don't see any errors in the logs. It is possible that, if there was an error, the future was not completed (because of improper error handling) and the StreamProcessor waiting on that future waits forever. But we don't see any error in the logs. Another hypothesis is that it is stuck in a loop, so the task never completes. One idea was that the reader is stuck trying to read the last position from the journal; however, there is no evidence in the logs or heap dump to confirm this. What is interesting is that there are two partitions whose StreamProcessor is in this state.

To my understanding, the two Actor Threads Broker-2-zb-actors-0 and Broker-2-zb-actors-2 are not scheduled anymore. Thus, they do not execute any Actor Tasks (and Actor Jobs).

When checking the heap dump, the Actor Threads do not have the Thread label assigned.

Additionally, the thread group (which manages all scheduled threads) contains neither of the two Actor Threads.

There is also other information indicating that the threads are no longer scheduled (i.e., they have terminated).

Looking at the Actor's source code, the actor only handles uncaught exceptions of type Exception:

https://github.com/camunda-cloud/zeebe/blob/2dee25fcb748d8a9c6457eee465d08fd5a4d5471/util/src/main/java/io/camunda/zeebe/util/sched/ActorJob.java#L36-L58
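A simplified illustration of the pattern being described (not the actual ActorJob source): because only `Exception` is caught, any `Error` escapes the job loop and terminates the actor thread, while other components keep enqueuing jobs for it.

```java
import java.util.function.Consumer;

final class JobLoopSketch {

  // Runs one job; only Exception is handled, so the actor survives it.
  void execute(final Runnable job, final Consumer<Exception> failureHandler) {
    try {
      job.run();
    } catch (final Exception e) {
      failureHandler.accept(e); // handled: the actor stays alive and keeps draining its queue
    }
    // Any Throwable that is not an Exception (e.g. an Error) propagates from here
    // and kills the thread; the actor's queue keeps growing with nobody to drain it.
  }
}
```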

As mentioned by @deepthidevaki, when opening the appender an exception occurred that was not handled correctly. Given the facts above, I assume that an exception of type Throwable (but not Exception) was thrown, which was handled neither by the LogStream nor by the Actor. As a consequence of the unhandled throwable, the actor threads terminated. The impact is that other Actors and Raft threads assume that the Log Stream Actor is still active, and thus they submit additional work (i.e., Actor Jobs) to the Log Stream Actor. The Actor Jobs pile up, which leads to an increasing Java heap size and, at some point, results in an OutOfMemoryError.

To my understanding, the fix provided with #8038 does not solve the issue sufficiently: it may still happen that a Throwable is thrown and not handled, which terminates the corresponding actor thread.

#7807 already describes the issue of terminated actor threads, which was observed in the case of an OutOfMemoryError in direct memory. The fix implements a handler for uncaught exceptions, and in the case of a VirtualMachineError the application exits. However, if any other Throwable (but not Exception) is thrown, this issue might still occur, as such errors are not handled properly.

A possible solution would be to catch Throwable in the actor job and, if it is not a VirtualMachineError, let the actor handle the situation. In the case of the LogStream, it could close itself and complete the appender future exceptionally so that the StreamProcessor can handle it. Otherwise, the error is logged. Either way, the actor threads remain running and do not get terminated.
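A hedged sketch of that proposal, with hypothetical handler names: catch Throwable, rethrow VirtualMachineError so the existing uncaught-exception handler can exit the VM, and hand everything else to the actor so it can fail pending futures and close itself instead of dying silently.

```java
import java.util.function.Consumer;

final class SaferJobLoopSketch {

  void execute(final Runnable job, final Consumer<Throwable> actorFailureHandler) {
    try {
      job.run();
    } catch (final VirtualMachineError fatal) {
      // e.g. OutOfMemoryError: don't try to recover, let the process exit.
      throw fatal;
    } catch (final Throwable error) {
      // Let the actor decide: log, complete pending futures exceptionally, close itself.
      // The actor thread itself keeps running and can process other tasks.
      actorFailureHandler.accept(error);
    }
  }
}
```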

@npepinpe npepinpe moved this from Planned to Ready in Zeebe Dec 14, 2021
@romansmirnov
Member

Depends on #8327

@korthout korthout added the version:1.3.0 Marks an issue as being completely or in parts released in 1.3.0 label Jan 4, 2022
@npepinpe npepinpe moved this from Ready to In progress in Zeebe Jan 17, 2022
@romansmirnov romansmirnov moved this from In progress to Review in progress in Zeebe Jan 18, 2022
ghost pushed a commit that referenced this issue Jan 19, 2022
8582: fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=romansmirnov

## Description

Yield the thread when the log storage appender experiences backpressure while trying to append fragments to the log storage. That way, the actual actor task (the log storage appender) is resubmitted to the work queue, and the actor thread is released to execute other actor tasks.
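A rough sketch of that yield-on-backpressure idea, using hypothetical LogStorage/Actor interfaces rather than Zeebe's real API: instead of blocking the actor thread while the storage pushes back, the append is re-submitted to the actor's work queue and the method returns.

```java
final class AppenderBackpressureSketch {

  // Hypothetical interfaces, not Zeebe's actual API.
  interface LogStorage {
    boolean tryAppend(byte[] block); // false signals backpressure
  }

  interface Actor {
    void submit(Runnable task); // enqueue a task on this actor's work queue
  }

  private final LogStorage storage;
  private final Actor actor;

  AppenderBackpressureSketch(final LogStorage storage, final Actor actor) {
    this.storage = storage;
    this.actor = actor;
  }

  void append(final byte[] block) {
    if (!storage.tryAppend(block)) {
      // Backpressure: return now so the actor thread can run other tasks,
      // and retry this append later from the queue.
      actor.submit(() -> append(block));
    }
  }
}
```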

## Related issues

closes #8540 



8605: fix(log/stream): ensure the appender future always gets completed r=romansmirnov a=romansmirnov

## Description

* Handles any kind of thrown `Throwable` in the `LogStream` actor, so that the appender future gets completed exceptionally.
* Handles the situation where, while opening the appender, the `LogStream` actor is supposed to be closed. In this situation, the appender future gets completed exceptionally as well.

## Related issues

closes #7992



8615: deps(maven): bump value from 2.8.9-ea-1 to 2.9.0 r=npepinpe a=dependabot[bot]

Bumps [value](https://github.com/immutables/immutables) from 2.8.9-ea-1 to 2.9.0.

Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@ghost ghost closed this as completed in 9e189a1 Jan 19, 2022
Zeebe automation moved this from Review in progress to Done Jan 19, 2022
ghost pushed a commit that referenced this issue Jan 21, 2022
8628: [Backport stable/1.3] fix(log/stream): ensure the appender future always gets completed r=oleschoenburg a=romansmirnov

## Related issues

backports #8605
relates #7992



Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Jan 21, 2022
8627: [Backport stable/1.2] fix(log/stream): ensure the appender future always gets completed r=oleschoenburg a=romansmirnov

## Related issues

backports #8605 
relates #7992



Co-authored-by: Roman <roman.smirnov@camunda.com>
@npepinpe npepinpe added version:1.3.2 Marks an issue as being completely or in parts released in 1.3.2 Release: 1.2.10 labels Jan 28, 2022
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.