
StreamProcessors are kept alive #8044

Closed
Zelldon opened this issue Oct 22, 2021 · 17 comments
Assignees
Labels
kind/bug · scope/broker · severity/critical · support · version:1.3.0

Comments

@Zelldon (Member) commented Oct 22, 2021

Describe the bug

We have a support issue where the customer encountered several OOM errors. I had a deeper look at the heap dump and found some interesting parts where StreamProcessors seem to be kept alive.

Impact: The system has been killed by an OOM error at some point. We seem to keep references to processors, which might lead to inconsistencies.

Investigation

We have the following configuration:

Setup:
cluster size: 4
partitions: 24
replication factor: 4
process: 150 second duration
PIPS: 2x100
workers: 10
embedded gateway

This means on each node we have:

Partitions per Node:
N 0: 24
N 1: 24
N 2: 24
N 3: 24
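
For reference, a quick sketch of this expectation (assuming each partition is replicated to `replication factor` brokers and the replicas are distributed evenly; with a replication factor equal to the cluster size, every broker hosts every partition):

```java
public class ExpectedStreamProcessors {
  public static void main(final String[] args) {
    final int clusterSize = 4;
    final int partitionCount = 24;
    final int replicationFactor = 4;

    // total partition replicas spread evenly over all brokers
    final int partitionsPerBroker = partitionCount * replicationFactor / clusterSize;

    // one StreamProcessor is expected per hosted partition
    System.out.println("Expected StreamProcessors per broker: " + partitionsPerBroker); // 24
  }
}
```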

StreamProcessors

We would expect at most 24 StreamProcessors on each node, and therefore in the heap dump. We can see 30.

[screenshot: streamprocessors-general]

Using an OQL query: select s from instanceof io.camunda.zeebe.engine.processing.streamprocessor.StreamProcessor s where s.shouldProcess == false

We find several paused/suspended StreamProcessors

[screenshot: suspendedProcessors]

We can see that we have duplicate stream processors: one which is paused and another which is probably active. This can be shown if we query for their partition ids.

[screenshot: duplicateProcessors]

E.g. for partition 8 we have two processors.

If we check the suspended processor, we can see that it is still referenced only by an actor task and jobs, plus timer subscriptions.
[screenshot: processor-refs]

TypedRecordProcessor[]

Furthermore, we see 26 TypedRecordProcessor arrays, which have 1300 entries. I feel this is too much (?).
[screenshot: recordProcessors]

Taking a look at this, we can see that the same processor references occur several times.

[screenshot: refprocessors]

Partition

[screenshot: partition]

Here we can see we have 24 partitions as expected.

What is weird is that the Partition object has two fields called context.

[screenshot: contextinPartition]

ZeebeDBState

We can see that we have 26 states (more than the number of partitions).
[screenshot: dbstate]

And for some partitions there are duplicates.
[screenshot: dbstate-duplicates]

To Reproduce

Not sure.

Expected behavior

No duplicates of processors. Resources should be closed correctly.

Log/Stacktrace

Will be added later

Environment:

  • OS:
  • Zeebe Version: 1.2.0

Setup:
cluster size: 4
partitions: 24
replication factor: 4
process: 150 second duration
PIPS: 2x100
workers: 10
embedded gateway

@Zelldon added kind/bug, scope/broker, severity/high labels Oct 22, 2021
@deepthidevaki (Contributor)

Same as #7992 ?

@Zelldon (Member Author) commented Oct 22, 2021

Maybe. It might also relate to #7988

If we suspend the processor, other actor consumers are still running, like the Message TTL checker etc., which means they can try to delete a message twice, because it is async via the command? 🤔 @saig0

@npepinpe (Member)

🤯 I totally forgot. We suspend processing when transitioning, no? But the checker and other things might still run? 😖

@Zelldon (Member Author) commented Oct 22, 2021

Correct, this is my current hypothesis. I'm pretty sure this is actually the case 😅

@npepinpe (Member) commented Oct 22, 2021

Did you mention it to @pihme as well? They listed some hypotheses in the other issue but I don't think that was one of them

@Zelldon (Member Author) commented Oct 22, 2021

It looks like the following block is not executed: https://github.com/camunda-cloud/zeebe/blob/develop/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/StreamProcessorTransitionStep.java#L58-L65, which is the reason why the stream processor is still "alive".
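
To make that concrete, here is a toy sketch (assumed shape only, not the actual StreamProcessorTransitionStep code) of what the skipped block is expected to do: close the previous StreamProcessor and drop the reference so it can be garbage collected.

```java
import java.util.concurrent.CompletableFuture;

// Toy model of a transition step; the names and types are assumptions, not Zeebe's real API.
final class StreamProcessorTransitionSketch {
  private AutoCloseable streamProcessor; // the currently installed processor, if any

  CompletableFuture<Void> closeExistingStreamProcessor() {
    final AutoCloseable existing = streamProcessor;
    if (existing == null) {
      return CompletableFuture.completedFuture(null); // nothing to close
    }
    streamProcessor = null; // drop the reference so the old processor can be collected
    return CompletableFuture.runAsync(() -> {
      try {
        existing.close(); // if this block never runs, the old processor stays "alive"
      } catch (final Exception e) {
        throw new IllegalStateException("failed to close previous stream processor", e);
      }
    });
  }
}
```

If a transition skips such a block, the old processor keeps its actor jobs and timer subscriptions, which matches what the heap dump shows.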

We can see in the heap dump that the CriticalHealthMonitor component still references the StreamProcessor, which is obviously not closed.
[screenshot: streamprocessorreferences]

@Zelldon (Member Author) commented Oct 23, 2021

On Friday @pihme and I had a look at the transitions and we drew the following matrix:

  // current \ target   FOLLOWER    LEADER    CANDIDATE   INACTIVE
  // FOLLOWER           close       close     no close    close
  // LEADER             close       close     close       close
  // CANDIDATE          no close    close     close       close
  // INACTIVE           close       close     close       close

If every transition goes step by step (with time in between) everything should be fine, but I think the problem occurs if multiple transitions come in. @pihme and I checked some tests to verify that the closes are called correctly. This is the case for the tests we have written, but we haven't tested our real implementation, and not with all transition possibilities, like INACTIVE and CANDIDATE.

I think I found a scenario where it can come to the situation that we overwrite the StreamProcessor.

Scenario

Now we have a duplicated StreamProcessor for the same Partition: one suspended, one running.

  • If the suspended StreamProcessor now fails, because it is iterating while the other StreamProcessor deletes messages from the state (as seen in NPE when deleting message #7988), then it will fail and mark the Partition as unhealthy!

There are probably also other possible scenarios, but maybe I have overlooked something. @pihme or someone else, please correct me if I'm wrong. 🙏

Additional Info

I checked the heap dump again to see what processing mode the suspended processors had. It seems to be both modes, so there are different scenarios which can lead to this case.

[screenshot: modes]

Query:

select s.processingContext.streamProcessorMode
from io.camunda.zeebe.engine.processing.streamprocessor.StreamProcessor s
where s.shouldProcess == false

@pihme (Contributor) commented Oct 24, 2021

I think you are correct @Zelldon. The shutdown logic of the previous transition assumes that a transition to targetRole will follow immediately. If the following transition is skipped or cancelled, then the shutdown logic essentially prepared for the wrong thing.

@deepthidevaki (Contributor)

Thanks @Zelldon @pihme. So the scenario that you described is caused by skipping transitions, is that right? So it is similar to the scenario explained in #7939?

@Zelldon (Member Author) commented Oct 25, 2021

Yes, I think it is exactly this :) @deepthidevaki

@deepthidevaki (Contributor)

I'm sorry. I should have investigated #7939 further 😞

@Zelldon (Member Author) commented Oct 25, 2021

Hey @deepthidevaki, no worries! It is not your fault or anyone else's. I think we should just concentrate on fixing it now :)

@npepinpe (Member) commented Oct 25, 2021

If I understand correctly, the issue isn't just a memory leak but has more implications, like corrupting the state, no? If that's the case, I would raise the severity to critical.

@npepinpe added Impact: Data, severity/critical and removed severity/high labels Oct 25, 2021
@npepinpe added 1.2.1, support labels Oct 25, 2021
@npepinpe (Member)

Could #7767 also be related? i.e. if the stream processor is not closed properly, then maybe that explains why some readers are still alive?

@Zelldon (Member Author) commented Oct 25, 2021

Yes it is the same or related. This is also described in the disk spike issue.

@deepthidevaki (Contributor)

Could #7767 also be related? i.e. if the stream processor is not closed properly, then maybe that explains why some readers are still alive?

Not sure. But what I found in #7767 was that there were no references to the "zombie" readers. The only reference was from the journal where it keeps the list of readers.

pihme added a commit that referenced this issue Oct 25, 2021
pihme added a commit that referenced this issue Oct 26, 2021
ghost pushed a commit that referenced this issue Oct 27, 2021
8062: Prevent multiple stream processors (develop branch) r=pihme a=pihme

## Description
Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition. 
- cleanup is renamed to preparation

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped" then only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor (see the sketch below).

Previously:
- The old transition did clean up the steps that were started by it
- The cleanup assumed that the transitionTo will immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it did prepare for a role that maybe never came.
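
To illustrate the idempotent cleanup described above, a minimal sketch (hypothetical names, not the actual transition code):

```java
// Sketch only: cleanup runs before every transition attempt and is cheap when nothing is installed.
final class PartitionTransitionSketch {
  private Object streamProcessor; // whatever the previous transition installed, or null

  // Called for every transition, even if the subsequent transitionTo is skipped or cancelled.
  void prepareTransition() {
    if (streamProcessor == null) {
      return; // already cleaned up; repeated calls do nothing
    }
    close(streamProcessor);
    streamProcessor = null; // guarantees the next prepareTransition() call is a no-op
  }

  void transitionTo(final String targetRole) {
    prepareTransition(); // safe even if cleanup already ran, because it is idempotent
    streamProcessor = new Object(); // placeholder for the newly created StreamProcessor
  }

  private void close(final Object processor) {
    // placeholder for the asynchronous close of the real StreamProcessor
  }
}
```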

## Related issues
relates to #8044 (the issue it is trying to fix)
relates to #8059 (the PR for `stable/1.2` and also the source of review comments which have been implemented herein)



Co-authored-by: pihme <pihme@users.noreply.github.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
pihme added a commit that referenced this issue Oct 27, 2021
@Zelldon removed their assignment Nov 1, 2021
pihme added a commit that referenced this issue Nov 1, 2021
ghost pushed a commit that referenced this issue Nov 1, 2021
8101: [backport - sort of - 1.2] Prevent multiple stream processors - fix and narrow test r=pihme a=pihme

## Description

Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition.

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped" then only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition did clean up the steps that were started by it
- The cleanup assumed that the transitionTo will immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it did prepare for a role that maybe never came.


## Related issues

closes #8044
subset of #8059 
Essentially the same changes as #8062 (develop branch)

## Review Hints
This PR is a subset of #8059. It contains the fix and commits related to the first round of review comments. 
It does not contain the property-based test, which is still the subject of further discussion.

The purpose of this PR is to separate consensus from ongoing discussions.

The property-based test has been run and, in my mind, validated the fix. Last week I also ran it with chains of six consecutive transitions and found no inconsistency. This gives me the confidence to merge the fix into the stable branch.

Please also indicate whether you think we need to add the property-based test to `stable/1.2`. In my mind, it will be sufficient to introduce it in develop.



Co-authored-by: pihme <pihme@users.noreply.github.com>
ghost pushed a commit that referenced this issue Nov 1, 2021
ghost pushed a commit that referenced this issue Nov 1, 2021
@pihme (Contributor) commented Nov 2, 2021

Fixed with #8062 in develop and #8101 in stable/1.2

@pihme closed this as completed Nov 2, 2021
ghost pushed a commit that referenced this issue Nov 5, 2021
8059: [1.2] Prevent multiple stream processors r=pihme a=pihme

## Description
Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition. 

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped" then only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition did clean up the steps that were started by it
- The cleanup assumed that the transitionTo will immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it did prepare for a role that maybe never came.

## Related issues

closes #8044



Co-authored-by: pihme <pihme@users.noreply.github.com>
ghost pushed a commit that referenced this issue Nov 5, 2021
@korthout added the version:1.3.0 label Jan 4, 2022