Snapshot is not valid. There is no checksum file. #6377

Closed
Zelldon opened this issue Feb 18, 2021 · 1 comment · Fixed by #6383
Labels

  • kind/bug: Categorizes an issue or PR as a bug
  • scope/broker: Marks an issue or PR to appear in the broker section of the changelog
  • severity/mid: Marks a bug as having a noticeable impact but with a known workaround

Comments

@Zelldon
Member

Zelldon commented Feb 18, 2021

Describe the bug

The snapshot can't be persisted because the checksum file is missing.

It seems like we have a race condition between taking and persisting the snapshot.

D 2021-02-17T13:09:11.883106Z Broker-2-SnapshotDirector-1 Based on lowest exporter position '591' and last processed position '714', determined '591' as snapshot position.  Broker-2-SnapshotDirector-1
D 2021-02-17T13:09:11.892135Z io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore Taking temporary snapshot into /usr/local/zeebe/data/raft-partition/partitions/1/pending/499-1-1613567351888-714-591.  io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore
D 2021-02-17T13:09:11.921155Z Broker-2-SnapshotDirector-1 Created pending snapshot for Broker-2-StreamProcessor-1  Broker-2-SnapshotDirector-1
I 2021-02-17T13:09:11.922269Z Broker-2-SnapshotDirector-1 Finished taking snapshot, need to wait until last written event position 715 is committed, current commit position is 715. After that snapshot can be marked as valid.  Broker-2-SnapshotDirector-1
D 2021-02-17T13:09:11.924762Z io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore Purging snapshots older than FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/499-1-1613567351888-714-591, metadata=FileBasedSnapshotMetadata{index=499, term=1, timestamp=2021-02-17 01:09:11,888, processedPosition=714, exporterPosition=591}}  io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore
D 2021-02-17T13:09:11.931943Z io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore Search for orphaned snapshots below oldest valid snapshot with index FileBasedSnapshotMetadata{index=499, term=1, timestamp=2021-02-17 01:09:11,888, processedPosition=714, exporterPosition=591} in /usr/local/zeebe/data/raft-partition/partitions/1/pending  io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore
D 2021-02-17T13:09:11.934254Z io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore New snapshot 499-1-1613567351888-714-591 was persisted. Start replicating.  io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore
D 2021-02-17T13:09:11.935008Z Broker-2-DeletionService-1 Compacting Atomix log up to index 499  Broker-2-DeletionService-1
D 2021-02-17T13:09:11.957536Z io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore Created new snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/499-1-1613567351888-714-591, metadata=FileBasedSnapshotMetadata{index=499, term=1, timestamp=2021-02-17 01:09:11,888, processedPosition=714, exporterPosition=591}}  io.zeebe.snapshots.broker.impl.FileBasedSnapshotStore
I 2021-02-17T13:09:11.958251Z Broker-2-SnapshotDirector-1 Current commit position 715 >= 715, snapshot 499-1-1613567351888-714-591 is valid and has been persisted.  Broker-2-SnapshotDirector-1
E 2021-02-17T13:09:11.958650Z Broker-2-SnapshotDirector-1 Unexpected exception occurred on moving valid snapshot.  Broker-2-SnapshotDirector-1

There are no other errors that indicate a problem writing the snapshot.
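For reading the log above: the snapshot directory name `499-1-1613567351888-714-591` appears to line up with the logged metadata as index, term, timestamp (epoch milliseconds), processed position, and exporter position, and the director reports the lower of the exporter and processed positions as the snapshot position. The following is a minimal sketch of that reading. `SnapshotIdSketch` and its methods are hypothetical names made up for illustration, not Zeebe classes, and the field order is inferred only from this log.

```java
import java.time.Instant;

// Hypothetical helper, not part of Zeebe: splits a snapshot directory name
// into the fields that the FileBasedSnapshotMetadata log line shows.
final class SnapshotIdSketch {
  final long index;
  final long term;
  final Instant timestamp;
  final long processedPosition;
  final long exporterPosition;

  private SnapshotIdSketch(
      long index, long term, Instant timestamp, long processedPosition, long exporterPosition) {
    this.index = index;
    this.term = term;
    this.timestamp = timestamp;
    this.processedPosition = processedPosition;
    this.exporterPosition = exporterPosition;
  }

  // Assumed layout (inferred from the log): index-term-timestamp-processed-exported.
  static SnapshotIdSketch parse(String id) {
    String[] parts = id.split("-");
    if (parts.length != 5) {
      throw new IllegalArgumentException("Expected 5 dash-separated fields, got: " + id);
    }
    return new SnapshotIdSketch(
        Long.parseLong(parts[0]),
        Long.parseLong(parts[1]),
        Instant.ofEpochMilli(Long.parseLong(parts[2])),
        Long.parseLong(parts[3]),
        Long.parseLong(parts[4]));
  }

  // The director log reports the lower of the two positions as the snapshot position.
  static long snapshotPosition(long exporterPosition, long processedPosition) {
    return Math.min(exporterPosition, processedPosition);
  }

  public static void main(String[] args) {
    SnapshotIdSketch id = parse("499-1-1613567351888-714-591");
    System.out.println("index=" + id.index + ", term=" + id.term + ", taken at " + id.timestamp);
    System.out.println(
        "snapshot position=" + snapshotPosition(id.exporterPosition, id.processedPosition));
  }
}
```

Read this way, the timestamp in the directory name falls at 13:09:11.888 UTC, which matches the "Taking temporary snapshot" line.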

Error group https://console.cloud.google.com/errors/CIzr1vKOuue-DA?service=zeebe&time=P7D&project=camunda-cloud-240911&authuser=1

To Reproduce

Not sure; it seems to be a race condition.

Expected behavior
No race condition, and the snapshot can be persisted.

Log/Stacktrace
If possible add the full stacktrace or Zeebe log which contains the issue.

Full Stacktrace

java.lang.IllegalStateException: Snapshot is not valid. There is no checksum file.
	at io.zeebe.snapshots.broker.impl.FileBasedTransientSnapshot.lambda$persist$3(FileBasedTransientSnapshot.java:114) ~[zeebe-snapshots-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:136) ~[zeebe-util-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62) [zeebe-util-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94) [zeebe-util-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:191) [zeebe-util-1.0.0-SNAPSHOT.jar:1.0.0-SNAPSHOT]

Environment:

  • OS: k8
  • Zeebe Version: SNAPSHOT
  • Configuration:
@Zelldon Zelldon added the kind/bug, scope/broker, and severity/mid labels Feb 18, 2021
@Zelldon
Member Author

Zelldon commented Feb 18, 2021

It looks like persist is called twice, which causes this problem.

We see "Current commit position {} >= {}, snapshot {} is valid and has been persisted." before the error, and that message is only logged after persisting has succeeded.

This might happen because the SnapshotDirector calls persistSnapshotIfLastWrittenPositionCommitted both when the commit position is updated and when taking a snapshot has completed.
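To make the suspected sequence concrete, here is a minimal, single-threaded sketch of how two triggers could both try to persist the same pending snapshot, and how clearing the pending reference before persisting would turn the second trigger into a no-op. All class and method names (`DoublePersistSketch`, `Director`, `PendingSnapshot`, `persistIfCommitted`) are hypothetical stand-ins, not Zeebe code, and the guard is only one possible approach; it is not necessarily what #6383 does.

```java
// Hypothetical stand-ins, not Zeebe classes: a pending snapshot that can be
// moved into place only once, and a director with two persist triggers.
final class DoublePersistSketch {

  static final class PendingSnapshot {
    private boolean moved = false;

    void persist() {
      if (moved) {
        // Mirrors the reported symptom: the pending directory was already moved.
        throw new IllegalStateException("Snapshot is not valid. There is no checksum file.");
      }
      moved = true;
    }
  }

  static final class Director {
    private final boolean guarded;
    private PendingSnapshot pending;
    private long lastWrittenPosition;
    private long commitPosition;
    int persistCalls = 0;

    Director(boolean guarded) {
      this.guarded = guarded;
    }

    // Trigger 1: taking the snapshot has completed.
    void onSnapshotTaken(PendingSnapshot snapshot, long lastWritten) {
      pending = snapshot;
      lastWrittenPosition = lastWritten;
      persistIfCommitted();
    }

    // Trigger 2: the commit position was updated.
    void onCommitPositionUpdated(long newCommitPosition) {
      commitPosition = newCommitPosition;
      persistIfCommitted();
    }

    private void persistIfCommitted() {
      if (pending == null || commitPosition < lastWrittenPosition) {
        return;
      }
      PendingSnapshot toPersist = pending;
      if (guarded) {
        pending = null; // clear first, so a later trigger finds nothing to persist
      }
      persistCalls++;
      toPersist.persist();
    }
  }

  public static void main(String[] args) {
    for (boolean guarded : new boolean[] {false, true}) {
      Director director = new Director(guarded);
      director.onCommitPositionUpdated(715);
      director.onSnapshotTaken(new PendingSnapshot(), 715); // first persist succeeds
      try {
        director.onCommitPositionUpdated(716); // second trigger
        System.out.println("guarded=" + guarded + ": second trigger was a no-op");
      } catch (IllegalStateException e) {
        System.out.println("guarded=" + guarded + ": " + e.getMessage());
      }
    }
  }
}
```

The positions 715 and 716 are chosen to match the log, where the commit position had already reached the last written position by the time the snapshot was taken. The sketch is deterministic and only models the ordering of the two calls, not the actor scheduling.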

@zeebe-bors zeebe-bors bot closed this as completed in 5871e47 Feb 23, 2021
zeebe-bors bot added a commit that referenced this issue Feb 23, 2021
6428: [BACKPORT 0.26] fix(broker): fix race condition in persisting snapshot r=Zelldon a=MiguelPires

## Description

Backports #6383. No changes were made to the PR.

## Related issues

closes #6377

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [ ] The changes are backwards compatible with previous versions
* [ ] If it fixes a bug then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR; if that fails, you need to create backports manually.

Testing:
* [ ] There are unit/integration tests that verify all acceptance criteria of the issue
* [ ] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The change has been verified by a QA run
* [ ] The impact of the changes is verified by a benchmark 

Documentation: 
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)


Co-authored-by: Miguel Pires <miguel.pires@camunda.com>
zeebe-bors bot added a commit that referenced this issue Feb 23, 2021
6429: [BACKPORT 0.25] fix(broker): fix race condition in persisting snapshot r=Zelldon a=MiguelPires

## Description

Backports #6383. No changes were made to the PR.

## Related issues

closes #6377

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [ ] The changes are backwards compatible with previous versions
* [ ] If it fixes a bug then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR; if that fails, you need to create backports manually.

Testing:
* [ ] There are unit/integration tests that verify all acceptance criteria of the issue
* [ ] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The change has been verified by a QA run
* [ ] The impact of the changes is verified by a benchmark 

Documentation: 
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)


Co-authored-by: Miguel Pires <miguel.pires@camunda.com>