StreamProcessors are kept alive #8044
Comments
Same as #7992?
🤯 I totally forgot. We suspend processing when transitioning, no? But the checker and other things might still run? 😖
Correct, this is my current hypothesis. I'm pretty sure this is actually the case 😅
Did you mention it to @pihme as well? They listed some hypotheses in the other issue, but I don't think that was one of them.
It looks like the following block is not executed: https://github.com/camunda-cloud/zeebe/blob/develop/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/StreamProcessorTransitionStep.java#L58-L65, which is the reason why the stream processor is still "alive". We can see in the heap dump that the CriticalHealthMonitor component still references the StreamProcessor.
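For context, here is a minimal, self-contained sketch of what such a close step typically has to do. All names below are illustrative assumptions, not Zeebe's actual API (the real code is in the linked StreamProcessorTransitionStep lines): the point is that the step must both stop the old processor and drop every reference to it, including the health monitor's.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only (not the Zeebe implementation): the close step must
// stop the old processor and clear every reference to it; otherwise components
// such as a health monitor keep it reachable and it stays "alive" in the heap.
final class StreamProcessorCloseSketch {

  interface Processor {
    CompletableFuture<Void> closeAsync();
  }

  interface HealthMonitor {
    void removeComponent(String name);
  }

  private Processor streamProcessor; // currently installed processor, if any
  private final HealthMonitor healthMonitor;

  StreamProcessorCloseSketch(final HealthMonitor healthMonitor) {
    this.healthMonitor = healthMonitor;
  }

  CompletableFuture<Void> closeExisting() {
    if (streamProcessor == null) {
      return CompletableFuture.completedFuture(null); // nothing to close
    }
    final Processor toClose = streamProcessor;
    streamProcessor = null;                            // drop our own reference
    healthMonitor.removeComponent("StreamProcessor");  // drop the monitor's reference
    return toClose.closeAsync();                       // then actually shut it down
  }
}
```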
On Friday @pihme and I had a look at the transitions and we drew the following matrix:
If every transition goes step by step (with time in between), everything should be fine, but I think the problem occurs if multiple transitions come in. @pihme and I checked some tests to verify that the close calls happen correctly. That is the case for the tests we have written, but we haven't tested our real implementation, and not with all transition possibilities, like INACTIVE and CANDIDATE. I think I found a scenario where we can end up overwriting the StreamProcessor. Scenario:
Now we have a duplicated StreamProcessor for the same partition: one suspended, one running.
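To make the scenario concrete, here is a minimal, self-contained sketch (all names are assumptions for illustration, not the actual Zeebe code) of how installing a new processor while the previous transition's cleanup was skipped leaves two live processors for the same partition:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: if a transition installs a new processor while the
// cleanup of the previous (cancelled/skipped) transition never ran, the old,
// suspended processor stays referenced and both are "alive" for one partition.
final class DuplicateProcessorSketch {

  static final class Processor {
    boolean shouldProcess = true;
    void suspend() { shouldProcess = false; }
    void close() { /* release resources */ }
  }

  // external component (e.g. a health monitor) that also keeps references
  private final List<Processor> monitored = new ArrayList<>();
  private Processor current;

  void transition(final boolean previousCleanupRan) {
    if (previousCleanupRan && current != null) {
      current.close();
      monitored.remove(current);
    }
    // Overwriting the reference without cleanup leaks the old instance.
    current = new Processor();
    monitored.add(current);
  }

  public static void main(String[] args) {
    final var sketch = new DuplicateProcessorSketch();
    sketch.transition(true);   // first transition installs a processor
    sketch.current.suspend();  // next transition starts: processing suspended
    sketch.transition(false);  // but its cleanup step is skipped
    // 'monitored' now holds two processors for the same partition:
    // one with shouldProcess == false (suspended), one running.
    System.out.println("live processors: " + sketch.monitored.size()); // 2
  }
}
```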
There are probably also other possible scenarios, but maybe I have overlooked something. @pihme or someone else, please correct me if I'm wrong. 🙏

Additional Info

I checked the heap dump again to see what processing mode the suspended processors had. It seems to be both modes, so different scenarios can lead to this case. Query:
I think you are correct @Zelldon. The shutdown logic of the previous transition assumes that a transition to |
Yes I think it is exactly this :) @deepthidevaki |
I'm sorry. I should have investigated #7939 further 😞 |
Hey @deepthidevaki, no worries! It is not your fault or anyone else's. I think we should just concentrate now on fixing it :)
If I understand correctly, the issue isn't just a memory leak but has more implications, like corrupting the state, no? If that's the case, I would raise the severity to critical.
Could #7767 also be related? i.e. if the stream processor is not closed properly, then maybe that explains why some readers are still alive? |
Yes it is the same or related. This is also described in the disk spike issue. |
Not sure. But what I found in #7767 was that there were no references to the "zombie" readers. The only reference was from the journal where it keeps the list of readers. |
8062: Prevent multiple stream processors (develop branch) r=pihme a=pihme

## Description

Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition
- cleanup is renamed to preparation

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped", only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition cleaned up the steps that were started by it
- The cleanup assumed that the transitionTo would immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it prepared for a role that maybe never came.

## Related issues

relates to #8044 (the issue it is trying to fix)
relates to #8059 (the PR for `stable/1.2` and also the source of review comments which have been implemented herein)

Co-authored-by: pihme <pihme@users.noreply.github.com>
Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
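A minimal sketch of the idempotent-cleanup idea described in the PR (names are assumptions for illustration, not the actual Zeebe code): the first successful cleanup closes the processor and nulls the context field, so any repeated cleanup call, e.g. for a skipped transition, is a cheap no-op.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of idempotent cleanup: cleaning up more often than
// transitionTo is cheap because a successful cleanup clears the context field,
// and subsequent invocations find nothing to close.
final class IdempotentCleanupSketch {

  interface Processor {
    CompletableFuture<Void> closeAsync();
  }

  private Processor streamProcessor; // held in the transition context

  CompletableFuture<Void> prepareTransition() {
    final Processor toClose = streamProcessor;
    if (toClose == null) {
      // Already cleaned up by an earlier invocation: fast no-op.
      return CompletableFuture.completedFuture(null);
    }
    streamProcessor = null; // clear the reference so repeated calls do nothing
    return toClose.closeAsync();
  }
}
```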
8101: [backport - sort of - 1.2] Prevent multiple stream processors - fix and narrow test r=pihme a=pihme

## Description

Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped", only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition cleaned up the steps that were started by it
- The cleanup assumed that the transitionTo would immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it prepared for a role that maybe never came.

## Related issues

closes #8044
subset of #8059
Essentially the same changes as #8062 (develop branch)

## Review Hints

This PR is a subset of #8059. It contains the fix and commits related to the first round of review comments. It does not contain the property-based test, which is still the object of further discussion. The purpose of this PR is to separate consensus from ongoing discussions. The property-based test has been run and, in my mind, validated the fix. Last week I also ran it with chains of six consecutive transitions and found no inconsistency. This gives me the confidence to merge the fix into the stable branch. Please also indicate whether you think we need to add the property-based test to `stable/1.2`. In my mind, it will be sufficient to introduce it in develop.

Co-authored-by: pihme <pihme@users.noreply.github.com>
8059: [1.2] Prevent multiple stream processors r=pihme a=pihme

## Description

Changes the transition logic as follows:
- preparation/cleanup is done for all steps (not just the steps started by the last iteration)
- preparation/cleanup is done in the context of the next transition, not in the context of the last transition

As a consequence, preparation/cleanup will be executed more often than transitionTo. This can also be seen in the log for the new test case. Essentially, when a transition is "skipped", only its transitionTo is skipped, but the cleanup is executed anyway. I think one could improve that by making the cleanup react to the cancel signal, but I want to be conservative here. Also, multiple cleanup calls should be fast, because if a cleanup succeeds it sets e.g. the stream processor to null in the context, and any subsequent call will do nothing if it finds no stream processor.

Previously:
- The old transition cleaned up the steps that were started by it
- The cleanup assumed that the transitionTo would immediately follow, but this was not a given. The transitionTo might be cancelled, and might eventually transition to a completely different role.
- So in essence, it prepared for a role that maybe never came.

## Related issues

closes #8044

Co-authored-by: pihme <pihme@users.noreply.github.com>
Describe the bug
We have a support issue where the customer encountered several OOM issues. I had a deeper look at the dump and found some interesting parts where StreamProcessors seem to be kept alive.
Impact: the system has been killed by an OOM error at some point. We seem to keep some references to processors, which might lead to inconsistencies.
Investigation
We have the following configuration:
This means on each partition we have:
StreamProcessors
We would expect at most 24 StreamProcessors on each node, i.e. in the heap dump. We can see 30.
Using an OQL query:
select s from instanceof io.camunda.zeebe.engine.processing.streamprocessor.StreamProcessor s where s.shouldProcess == false
We find several paused/suspended StreamProcessors
We can see that we have duplicate stream processors, one which is paused and another which is probably active. This can be shown if we query for their partition ids.
E.g. for partition 8 we have the two processors:
If we check the suspended processor, we can only see that it is still referenced by an actor task and jobs plus timer subscriptions.
TypedRecordProcessor[]
Furthermore, we see 26 TypedRecordProcessor arrays, which have 1300 entries. I feel this is too much (?).
Taking a look at this, we can see that the same processor references occur several times.
Partition
Here we can see we have 24 partitions as expected.
What is weird is that the Partition object has two fields called
context
ZeebeDBState
We can see that we have 26 states (more than partitions).
And for some partitions there are duplicates.
To Reproduce
Not sure.
Expected behavior
No duplicates of processors. Resources should be closed correctly.
Log/Stacktrace
Will be added later
Environment: