Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not take snapshots after shutdown was requested #7571

Merged
1 commit merged into from
Aug 4, 2021

Conversation

pihme
Copy link
Contributor

@pihme pihme commented Aug 2, 2021

Description

The change sets the db to null right away, so that any concurrent calls thereafter have no access to the database while it is closing.

It is not completely safe yet. The database could be set to 'nullbetween checking fornull` in line 187 and using the reference in line 194.

Related issues

related to #7188

Definition of Done

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/0.25) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement

Copy link
Member

@npepinpe npepinpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you pointed out, this reduces the rate of incidence of the bug but doesn't completely fix it. Since taking a snapshot may happen on a different thread, it could even be that db is already null when the check is done on line 187, but the value read is outdated. IIRC, the state controller is called from the AsyncSnapshotDirector and from the ZeebePartition, both of which are actors, and as such executed at times on different threads. It could be that we should make the state controller an actor as well, though that would widen the scope of this PR.

❓ Do you think that the race condition you highlighted is not likely enough/important enough to warrant spending more time, or do you see value in immediately mitigating it with the idea that we'll come back and fix it properly later? If it's the latter, then we shouldn't close the issue by merging this PR, as it's not really fixed. I personally would opt to fix it properly, but I'd like to hear your thoughts.

@pihme
Copy link
Contributor Author

pihme commented Aug 3, 2021

question Do you think that the race condition you highlighted is not likely enough/important enough to warrant spending more time, or do you see value in immediately mitigating it with the idea that we'll come back and fix it properly later?

I think it is an improvement over status quo. I also hope that with the other work we are doing and that Chris is doing this error won't be able to occur anymore, because we have better control over the shutdown sequence.

But I am also fine with leaving the issue open and revisiting it later.

Copy link
Member

@npepinpe npepinpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Let's merge this for now, but remove the closes from the PR description as I wouldn't close the issue yet.

@pihme
Copy link
Contributor Author

pihme commented Aug 4, 2021

bors r+

@ghost
Copy link

ghost commented Aug 4, 2021

Build succeeded:

@ghost ghost merged commit 9588418 into develop Aug 4, 2021
@ghost ghost deleted the 7188-no-snaphots-after-shutdown branch August 4, 2021 08:32
@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2021

Backport failed for stable/1.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/1.0
git worktree add -d .worktree/backport-7571-to-stable/1.0 origin/stable/1.0
cd .worktree/backport-7571-to-stable/1.0
git checkout -b backport-7571-to-stable/1.0
ancref=$(git merge-base be2bfabcd49de3185e0faa952f4bc7a227259148 c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193)
git cherry-pick -x $ancref..c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193

@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2021

Backport failed for stable/1.1, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/1.1
git worktree add -d .worktree/backport-7571-to-stable/1.1 origin/stable/1.1
cd .worktree/backport-7571-to-stable/1.1
git checkout -b backport-7571-to-stable/1.1
ancref=$(git merge-base be2bfabcd49de3185e0faa952f4bc7a227259148 c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193)
git cherry-pick -x $ancref..c8fe46c88a785bbc52fdb6b54cec4fa1bcafc193

@npepinpe
Copy link
Member

npepinpe commented Sep 7, 2021

It looks like we never backported this in the end? /cc @pihme

Or at least I don't see any linked PRs 🤔

@pihme
Copy link
Contributor Author

pihme commented Sep 7, 2021

Seems like it slipped through, yes.

ghost pushed a commit that referenced this pull request Sep 7, 2021
7782: [Backports stable/1.1] Do not take snapshots after shutdown was requested r=pihme a=npepinpe

## Description

Backport of #7571 to stable/1.0.

## Related issues

relates to #7188 

Co-authored-by: pihme <pihme@users.noreply.github.com>
ghost pushed a commit that referenced this pull request Sep 7, 2021
7781: [Backports stable/1.0] Do not take snapshots after shutdown was requested r=pihme a=npepinpe

## Description

Backport of #7571 to stable/1.0.

## Related issues

relates to #7188 

Co-authored-by: pihme <pihme@users.noreply.github.com>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants