OutOfMemory in follower #7992
Here is the heap dump from the node.
@deepthidevaki please timebox and check if the issue is obvious from the heap dump, in which case we can just go ahead and fix it.
Looks very similar to #7744. There are many "StreamProcessor" instances which are already closed but not garbage collected.
So we didn't fix it? 😅 Do we expect the cause to be the same?
Don't know if it is the same reason. I'm trying to get more info from it. VisualVM goes out of memory when I'm analyzing this heap dump 😀
Lol 😆
@deepthidevaki If that helps: I managed to open the heap dump in VisualVM, it just took a while.
Thanks @oleschoenburg. Are you able to compute the GC root? That gets stuck for me. It makes progress for a while and then everything hangs. Increasing memory did not help.
@deepthidevaki I'm assuming you want to find the GC root of one of the StreamProcessors? I've started the search, but it looks like it's going to take a while.
Meanwhile I'm looking at the logs to understand what happened. It seems like the transition is stuck. Initially you see the logs for the cleanup and transition of each step. After some time, we see logs that trigger the transitions, but no steps are executed. It looks like the tasks are just queued on the actor, waiting for something to happen. This went on for hours until the node went OOM.
A similar case was also observed for partition 2 on the same broker.
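A minimal sketch of that failure mode (all names hypothetical, not Zeebe's actual `ActorScheduler`): an actor stuck in a transition keeps accepting new jobs, so its queue grows without bound until the JVM runs out of memory.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch, not Zeebe's actual scheduler: an actor blocked in a
// transition keeps accepting new jobs, so its queue grows without bound
// until the JVM runs out of memory.
final class StuckActorSketch {
  private final Queue<Runnable> jobQueue = new ConcurrentLinkedQueue<>();
  private volatile boolean transitionComplete = false; // never becomes true

  void submit(final Runnable job) {
    jobQueue.add(job); // transition triggers keep piling up here
  }

  void runOnce() {
    if (!transitionComplete) {
      return; // nothing is dequeued while the transition is stuck
    }
    final Runnable job = jobQueue.poll();
    if (job != null) {
      job.run();
    }
  }
}
```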
There were two times when this broker's memory use went high. In both cases, the noticeable thing that happened was an exception.
A "Segment not open" exception happened on partition 3. But we don't see an increase in the number of log segments for partition 3. We do see an increase in the number of segments for partitions 1 and 2. This increase correlates with the above logs where the partitions get stuck in "transitioning".
@deepthidevaki Okay, VisualVM didn't make any progress after a while. MAT was much more successful, however. |
Thanks @oleschoenburg. I will look into it next week.
I checked why the transition was stuck. The symptoms are similar to #7694: the StreamProcessor install never completed. It seems the "onActorStarting()" is never executed.
Update: I was not able to root cause it. But here are my observations (following @Zelldon's 5 whys 🙂):
* Out of memory happens because: tasks are queued up on the actor.
* The tasks are queued because: the StreamProcessor startup is stuck.
* The StreamProcessor startup is stuck because: the logStreamWriter is not created.
* The logStreamWriter is not created because: unknown (this is as far as I got; the chain is sketched below).

I can confirm that the root cause is not similar to #7744, but it is the same as #7694.
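One way to picture this chain (hypothetical names, not the actual StreamProcessor code): the install step waits on a startup future that only `onActorStarting()` would complete, so if that task never runs, the install hangs and retries keep queueing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the chain above: the install step waits on a
// startup future that only onActorStarting() would complete. If that task
// never runs, the install step hangs and retries keep queueing.
final class StuckStartupSketch {
  static final CompletableFuture<Void> startupFuture = new CompletableFuture<>();

  public static void main(final String[] args) throws Exception {
    // onActorStarting() would create the logStreamWriter and then call
    // startupFuture.complete(null) - but in the incident it never ran.
    try {
      startupFuture.get(1, TimeUnit.SECONDS); // stands in for the install step
    } catch (final TimeoutException e) {
      System.out.println("startup future never completed -> transition stuck");
    }
  }
}
```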
@npepinpe I don't think I can make progress on finding the root cause here. One suggestion for improvement: […]
Then let's do that. If anything occurs to you that could be done to simplify this diagnosis process or that could help root cause this better the next time, please also include it. |
Do we think this might now be fixed? |
Re #7992 (comment): Raised #8369, which solves the situation where a lock is not released, blocking the installation of a received snapshot.
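The general shape of such a fix (illustrative only; the names are hypothetical and this is not the actual #8369 change) is to release the lock in a `finally` block so a failed installation cannot leave it held:

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative pattern only (hypothetical names, not the actual #8369 diff):
// releasing the lock in a finally block guarantees that a failing snapshot
// installation cannot leave the lock held and block all later attempts.
final class SnapshotInstallSketch {
  private final ReentrantLock installLock = new ReentrantLock();

  void install(final Runnable copySnapshotFiles) {
    installLock.lock();
    try {
      copySnapshotFiles.run(); // may throw
    } finally {
      installLock.unlock(); // released even when installation fails
    }
  }
}
```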
Re #7992 (comment):
To my understanding, the two Actor Threads […]

When checking the heap dump, the Actor Threads do not have the […]

Additionally, the thread group (which manages all scheduled threads) does not contain both Actor Threads: […]

There is also other information indicating that the threads are not scheduled anymore (i.e., they are terminated).

When checking the Actor's source code, the actor only handles unhandled exceptions of type […]

As mentioned by @deepthidevaki, when opening the appender an exception occurred that was not handled correctly. But given the facts above, I assume that an exception of type […]

To my understanding, the fix provided with #8038 does not solve the issue sufficiently. Meaning, it might still happen that a throwable exception is thrown and not handled, which terminates the corresponding actor thread. #7807 already describes the issue of terminated actor threads, which was observed in the case of an […]

A possible solution would be to catch `Throwable` instead (sketched below).
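A minimal sketch of that idea (simplified, not the actual actor implementation): a task loop that catches only `Exception` dies on any `Error`, stranding all queued tasks, while catching `Throwable` keeps the thread alive and surfaces the failure instead.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch, not the actual actor implementation: catching only
// Exception lets an Error escape and terminate the thread, stranding all
// queued tasks; catching Throwable keeps the loop alive instead.
final class ActorThreadSketch implements Runnable {
  private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();

  void submit(final Runnable task) {
    tasks.add(task);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        tasks.take().run();
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt(); // allow clean shutdown
      } catch (final Throwable t) { // previously: catch (final Exception e)
        // fail the current task (e.g. complete its future exceptionally)
        // instead of letting the whole actor thread die
        t.printStackTrace();
      }
    }
  }
}
```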
Depends on #8327 |
8582: fix(log/appender): yield thread when experiencing backpressure r=romansmirnov a=romansmirnov

## Description
Yield the thread when the log storage appender experiences backpressure while trying to append the fragments to the log storage. That way, the actual actor task (the log storage appender) is resubmitted to the work queue, and the actor thread is released to execute other actor tasks.

## Related issues
closes #8540

8605: fix(log/stream): ensure the appender future always gets completed r=romansmirnov a=romansmirnov

## Description
* Handles any kind of thrown `Throwable` in the `LogStream` actor, so that the appender future gets completed exceptionally.
* Handles the situation where, while opening the appender, the `LogStream` actor is supposed to be closed. In this situation, the appender future gets completed exceptionally as well.

## Related issues
closes #7992

8615: deps(maven): bump value from 2.8.9-ea-1 to 2.9.0 r=npepinpe a=dependabot[bot]

Bumps [value](https://github.com/immutables/immutables) from 2.8.9-ea-1 to 2.9.0.

Co-authored-by: Roman <roman.smirnov@camunda.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
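A rough sketch of the yield-on-backpressure idea from #8582 (simplified; the `ActorControl` interface here is a hypothetical stand-in): instead of spinning when an append is rejected, the appender resubmits itself and returns, freeing the shared thread.

```java
import java.util.function.Predicate;

// Rough sketch of the #8582 idea (simplified; ActorControl is a hypothetical
// stand-in): when the log storage rejects an append due to backpressure, the
// appender resubmits itself instead of spinning, so the shared actor thread
// can run other actors' tasks in the meantime.
final class AppenderSketch {
  interface ActorControl {
    void submit(Runnable task); // re-queues a task on the actor's job queue
  }

  private final ActorControl actor;

  AppenderSketch(final ActorControl actor) {
    this.actor = actor;
  }

  void append(final byte[] block, final Predicate<byte[]> tryAppend) {
    if (!tryAppend.test(block)) { // backpressure: append rejected for now
      actor.submit(() -> append(block, tryAppend)); // yield and retry later
      return;
    }
    // appended successfully; the next block can be processed
  }
}
```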
8628: [Backport stable/1.3] fix(log/stream): ensure the appender future always gets completed r=oleschoenburg a=romansmirnov

## Related issues
backports #8605
relates #7992

Co-authored-by: Roman <roman.smirnov@camunda.com>
8627: [Backport stable/1.2] fix(log/stream): ensure the appender future always gets completed r=oleschoenburg a=romansmirnov

## Related issues
backports #8605
relates #7992

Co-authored-by: Roman <roman.smirnov@camunda.com>
Describe the bug
In the release-1.2.0 benchmark, we observed that the node which is the follower for all partitions went out of memory.
Might be related to #7784
https://console.cloud.google.com/errors/CMz-q-2yh8X-pQE?service=zeebe-broker&time=P7D&project=zeebe-io
Environment: