StreamProcessors are kept alive #8044
Comments
Same as #7992?
🤯 I totally forgot. We suspend processing when transitioning, no? But the checker and other things might still run? 😖
Correct, this is my current hypothesis. I'm pretty sure this is actually the case 😅
Did you mention it to @pihme as well? They listed some hypotheses in the other issue, but I don't think that was one of them.
It looks like the following block is not executed: https://github.com/camunda-cloud/zeebe/blob/develop/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/StreamProcessorTransitionStep.java#L58-L65, which is the reason why the stream processor is still "alive". We can see in the heap dump that the CriticalHealthMonitor component still references the StreamProcessor.
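For context, here is a minimal, self-contained sketch of what such a close step typically has to do. All names below are illustrative assumptions, not Zeebe's actual API (the real code is in the linked StreamProcessorTransitionStep lines): the point is that the step must both stop the old processor and drop every reference to it, including the health monitor's.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only (not the Zeebe implementation): the close step must
// stop the old processor and clear every reference to it; otherwise components
// such as a health monitor keep it reachable and it stays "alive" in the heap.
final class StreamProcessorCloseSketch {

  interface Processor {
    CompletableFuture<Void> closeAsync();
  }

  interface HealthMonitor {
    void removeComponent(String name);
  }

  private Processor streamProcessor; // currently installed processor, if any
  private final HealthMonitor healthMonitor;

  StreamProcessorCloseSketch(final HealthMonitor healthMonitor) {
    this.healthMonitor = healthMonitor;
  }

  CompletableFuture<Void> closeExisting() {
    if (streamProcessor == null) {
      return CompletableFuture.completedFuture(null); // nothing to close
    }
    final Processor toClose = streamProcessor;
    streamProcessor = null;                            // drop our own reference
    healthMonitor.removeComponent("StreamProcessor");  // drop the monitor's reference
    return toClose.closeAsync();                       // then actually shut it down
  }
}
```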
On Friday @pihme and I had a look at the transitions and we drew the following matrix:
If every transition goes step by step (with time in between), everything should be fine, but I think the problem occurs if multiple transitions come in. @pihme and I checked some tests to verify that the close calls happen correctly. That is the case for the tests we have written, but we haven't tested our real implementation, and not with all transition possibilities, like INACTIVE and CANDIDATE. I think I found a scenario where we can end up overwriting the StreamProcessor. Scenario:
Now we have a duplicated StreamProcessor for the same partition: one suspended, one running.
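To make the scenario concrete, here is a minimal, self-contained sketch (all names are assumptions for illustration, not the actual Zeebe code) of how installing a new processor while the previous transition's cleanup was skipped leaves two live processors for the same partition:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: if a transition installs a new processor while the
// cleanup of the previous (cancelled/skipped) transition never ran, the old,
// suspended processor stays referenced and both are "alive" for one partition.
final class DuplicateProcessorSketch {

  static final class Processor {
    boolean shouldProcess = true;
    void suspend() { shouldProcess = false; }
    void close() { /* release resources */ }
  }

  // external component (e.g. a health monitor) that also keeps references
  private final List<Processor> monitored = new ArrayList<>();
  private Processor current;

  void transition(final boolean previousCleanupRan) {
    if (previousCleanupRan && current != null) {
      current.close();
      monitored.remove(current);
    }
    // Overwriting the reference without cleanup leaks the old instance.
    current = new Processor();
    monitored.add(current);
  }

  public static void main(String[] args) {
    final var sketch = new DuplicateProcessorSketch();
    sketch.transition(true);   // first transition installs a processor
    sketch.current.suspend();  // next transition starts: processing suspended
    sketch.transition(false);  // but its cleanup step is skipped
    // 'monitored' now holds two processors for the same partition:
    // one with shouldProcess == false (suspended), one running.
    System.out.println("live processors: " + sketch.monitored.size()); // 2
  }
}
```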
There are probably also other possible scenarios, but maybe I have overlooked something. @pihme or someone else, please correct me if I'm wrong. 🙏

Additional Info

I checked the heap dump again to see what processing mode the suspended processors had. It seems to be both modes, so different scenarios can lead to this case. Query:
I think you are correct @Zelldon. The shutdown logic of the previous transition assumes that a transition to |
Yes I think it is exactly this :) @deepthidevaki |
I'm sorry. I should have investigated #7939 further 😞 |
Hey @deepthidevaki, no worries! It is not your fault or anyone else's. I think we should just concentrate now on fixing it :)
If I understand correctly, the issue isn't just a memory leak but has more implications, like corrupting the state, no? If that's the case, I would raise the severity to critical.
Could #7767 also be related? i.e. if the stream processor is not closed properly, then maybe that explains why some readers are still alive? |
Yes it is the same or related. This is also described in the disk spike issue. |
Not sure. But what I found in #7767 was that there were no references to the "zombie" readers. The only reference was from the journal where it keeps the list of readers. |
8062: Prevent multiple stream processors (develop branch) r=pihme a=pihme

## Description

Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition
- cleanup is renamed to preparation

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped", only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition cleaned up the steps that were started by it
- The cleanup assumed that the transitionTo would immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it prepared for a role that maybe never came.

## Related issues

relates to #8044 (the issue it is trying to fix)
relates to #8059 (the PR for `stable/1.2` and also the source of review comments which have been implemented herein)

Co-authored-by: pihme <pihme@users.noreply.github.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
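A minimal sketch of the idempotent-cleanup idea described in the PR (names are assumptions for illustration, not the actual Zeebe code): the first successful cleanup closes the processor and nulls the context field, so any repeated cleanup call, e.g. for a skipped transition, is a cheap no-op.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of idempotent cleanup: cleaning up more often than
// transitionTo is cheap because a successful cleanup clears the context field,
// and subsequent invocations find nothing to close.
final class IdempotentCleanupSketch {

  interface Processor {
    CompletableFuture<Void> closeAsync();
  }

  private Processor streamProcessor; // held in the transition context

  CompletableFuture<Void> prepareTransition() {
    final Processor toClose = streamProcessor;
    if (toClose == null) {
      // Already cleaned up by an earlier invocation: fast no-op.
      return CompletableFuture.completedFuture(null);
    }
    streamProcessor = null; // clear the reference so repeated calls do nothing
    return toClose.closeAsync();
  }
}
```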
8101: [backport - sort of - 1.2] Prevent multiple stream processors - fix and narrow test r=pihme a=pihme

## Description

Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped", only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition cleaned up the steps that were started by it
- The cleanup assumed that the transitionTo would immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it prepared for a role that maybe never came.

## Related issues

closes #8044
subset of #8059
Essentially the same changes as #8062 (develop branch)

## Review Hints

This PR is a subset of #8059. It contains the fix and commits related to the first round of review comments. It does not contain the property-based test, which is still the object of further discussion. The purpose of this PR is to separate consensus from ongoing discussions. The property-based test has been run and, in my mind, validated the fix. Last week I also ran it with chains of six consecutive transitions and found no inconsistency. This gives me the confidence to merge the fix into the stable branch. Please also indicate whether you think we need to add the property-based test to `stable/1.2`. In my mind, it will be sufficient to introduce it in develop.

Co-authored-by: pihme <pihme@users.noreply.github.com>
8059: [1.2] Prevent multiple stream processors r=pihme a=pihme

## Description

Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped", only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition cleaned up the steps that were started by it
- The cleanup assumed that the transitionTo would immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it prepared for a role that maybe never came.

## Related issues

closes #8044

Co-authored-by: pihme <pihme@users.noreply.github.com>
Describe the bug
We have a support issue where the customer encountered several OOM issues. I had a deeper look at the dump and found some interesting parts where StreamProcessors seem to be kept alive.
Impact: the system has been killed by an OOM error at some point. We seem to keep some references to processors, which might lead to inconsistencies.
Investigation
We have the following configuration:
This means on each partition we have:
StreamProcessors
We would expect at most 24 StreamProcessors on each node, i.e. in the heap dump. We can see 30.
Using an OQL query:
select s from instanceof io.camunda.zeebe.engine.processing.streamprocessor.StreamProcessor s where s.shouldProcess == false
We find several paused/suspended StreamProcessors
We can see that we have duplicate stream processors, one which is paused and another which is probably active. This can be shown if we query for their partition ids.
E.g. for partition 8 we have the two processors:
If we check the suspended processor, we can only see that it is still referenced by an actor task and jobs plus timer subscriptions.
TypedRecordProcessor[]
Furthermore, we see 26 TypedRecordProcessor arrays, which have 1300 entries. I feel this is too much (?).
Taking a look at this, we can see that the same processor references occur several times.
Partition
Here we can see we have 24 partitions as expected.
What is weird is that the Partition object has two fields called
context
ZeebeDBState
We can see that we have 26 states (more than partitions).
And for some partitions there are duplicates.
To Reproduce
Not sure.
Expected behavior
No duplicates of processors. Resources should be closed correctly.
Log/Stacktrace
Will be added later
Environment: