-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not take snapshots after shutdown was requested #7571
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you pointed out, this reduces the rate of incidence of the bug but doesn't completely fix it. Since taking a snapshot may happen on a different thread, it could even be that db
is already null when the check is done on line 187, but the value read is outdated. IIRC, the state controller is called from the AsyncSnapshotDirector and from the ZeebePartition, both of which are actors, and as such executed at times on different threads. It could be that we should make the state controller an actor as well, though that would widen the scope of this PR.
❓ Do you think that the race condition you highlighted is not likely enough/important enough to warrant spending more time, or do you see value in immediately mitigating it with the idea that we'll come back and fix it properly later? If it's the latter, then we shouldn't close the issue by merging this PR, as it's not really fixed. I personally would opt to fix it properly, but I'd like to hear your thoughts.
I think it is an improvement over status quo. I also hope that with the other work we are doing and that Chris is doing this error won't be able to occur anymore, because we have better control over the shutdown sequence. But I am also fine with leaving the issue open and revisiting it later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Let's merge this for now, but remove the closes
from the PR description as I wouldn't close the issue yet.
bors r+ |
Build succeeded: |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/1.0
git worktree add -d .worktree/backport-7571-to-stable/1.0 origin/stable/1.0
cd .worktree/backport-7571-to-stable/1.0
git checkout -b backport-7571-to-stable/1.0
ancref=$(git merge-base be2bfabcd49de3185e0faa952f4bc7a227259148 c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193)
git cherry-pick -x $ancref..c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193 |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/1.1
git worktree add -d .worktree/backport-7571-to-stable/1.1 origin/stable/1.1
cd .worktree/backport-7571-to-stable/1.1
git checkout -b backport-7571-to-stable/1.1
ancref=$(git merge-base be2bfabcd49de3185e0faa952f4bc7a227259148 c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193)
git cherry-pick -x $ancref..c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193 |
It looks like we never backported this in the end? /cc @pihme Or at least I don't see any linked PRs 🤔 |
Seems like it slipped through, yes. |
Description
The change sets the db to null right away, so that any concurrent calls thereafter have no access to the database while it is closing.
It is not completely safe yet. The database could be set to 'null
between checking for
null` in line 187 and using the reference in line 194.Related issues
related to #7188
Definition of Done
Code changes:
backport stable/0.25
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation: