IllegalStateException: Expected to find event with the snapshot position ... #5500
Observed this again https://console.cloud.google.com/errors/COCMmvvLj-XUcw?service=zeebe&time=P7D&project=camunda-cloud-240911&authuser=1 on version 0.24.4
I'm seeing something similar with creating new snapshots on the latest versions of Zeebe: on creating the snapshot, it is not able to find the related event in the log. Might be related. It happens, for example, if I run the Simple Monitor docker-compose setup with no load on it.
Interestingly, this only happens for the exporter 🤔
If we take a look at the log, we can see that the processor is able to recover, but the exporter is not.
Interestingly, the stream processor recovers from a snapshot with position
The exporter position is smaller. Did we maybe compact that already?
Please verify if this still affects 0.25 - we should not be compacting exported events!
Analysis of the logs from 0.23.7:
After 5 minutes, Broker-2 takes a snapshot at index 57599, which is strange.
But it replicates snapshot 57699 (which makes sense, since it has the higher snapshot id).
Meanwhile, Broker-0 was restarted. Since its log was lagging behind, it received the snapshot at index 57699. On receiving the snapshot, it would clear the log.
When Broker-0 recovers from snapshot 57699, it cannot find the exported position:
After the leader change, the new leaders recover successfully (by pure chance) because they have not compacted the log.
More logs
What is strange here is that the snapshot position goes backwards from 67699 to 67599. We should find out why that happened; it would be the root cause of this issue.
Hypothesis:
Observed this again in a 0.25.3 log.
Broker-1 was trying to recover from snapshot 39606899-318-1607641121168-63861246-63860153, which was created 2 days ago.
Logs also indicate that Broker-1 was receiving newer snapshots and compacting the log:
On Broker-1, when the said snapshot (39606899-318-1607641121168-63861246-63860153) was received, it was not deleted when the next snapshot arrived. Instead, an older snapshot was deleted.
When loading snapshots on restart, if more than one snapshot is available for some reason, we don't pick the latest one. That explains why the broker has an older snapshot as its latest one. In both cases this happened after the broker was restarted.
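To illustrate what "pick the latest snapshot" means here, below is a minimal, hypothetical sketch (not the actual `FileBasedSnapshotStore` code). It assumes the snapshot directory names seen in this thread, e.g. `39606899-318-1607641121168-63861246-63860153`, encode `index-term-timestamp-processedPosition-exportedPosition`, and orders snapshots by index, then term, then timestamp:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: choose the latest snapshot from a list of snapshot
// directory names, assuming the format index-term-timestamp-procPos-expPos.
final class SnapshotPicker {

  // Only the first three segments are needed for ordering.
  record SnapshotId(long index, long term, long timestamp) {}

  static SnapshotId parse(final String dirName) {
    final String[] parts = dirName.split("-");
    return new SnapshotId(
        Long.parseLong(parts[0]), Long.parseLong(parts[1]), Long.parseLong(parts[2]));
  }

  // Latest = highest index; ties broken by term, then timestamp.
  static Optional<String> pickLatest(final List<String> snapshotDirs) {
    return snapshotDirs.stream()
        .max(
            Comparator.comparing(
                SnapshotPicker::parse,
                Comparator.comparingLong(SnapshotId::index)
                    .thenComparingLong(SnapshotId::term)
                    .thenComparingLong(SnapshotId::timestamp)));
  }
}
```

With such an ordering, a store containing both the stale and the new snapshot would always resolve to the new one on restart, instead of whichever happens to be listed first.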
Nice catch! I see two issues here:
Not sure. But probably as a consequence of #5611? What if there was an error after copying the snapshot, but before the previous snapshot was deleted?
In that case, when we get a new snapshot, we should delete any number of older snapshots. It could also be done when opening the store (trim to one). WDYT?
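The "trim to one when opening the store" idea could look roughly like the following. This is a hypothetical sketch, not the Zeebe implementation; it assumes each snapshot lives in its own directory whose name starts with the snapshot index (as in the IDs above), keeps only the newest directory, and recursively deletes the rest:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch: on store open, keep only the latest snapshot
// directory and delete any older leftovers (e.g. from a failed cleanup).
final class SnapshotStoreTrimmer {

  // Assumes directory names like "39606899-318-...", i.e. index first.
  static long indexOf(final Path dir) {
    return Long.parseLong(dir.getFileName().toString().split("-")[0]);
  }

  static void trimToLatest(final Path storeRoot) throws IOException {
    final List<Path> snapshots;
    try (Stream<Path> entries = Files.list(storeRoot)) {
      snapshots = entries.filter(Files::isDirectory).toList();
    }
    snapshots.stream()
        .sorted(Comparator.comparingLong(SnapshotStoreTrimmer::indexOf).reversed())
        .skip(1) // keep only the snapshot with the highest index
        .forEach(SnapshotStoreTrimmer::deleteRecursively);
  }

  private static void deleteRecursively(final Path dir) {
    // Walk depth-first so files are deleted before their parent directories.
    try (Stream<Path> tree = Files.walk(dir)) {
      tree.sorted(Comparator.reverseOrder())
          .forEach(
              p -> {
                try {
                  Files.delete(p);
                } catch (final IOException e) {
                  throw new UncheckedIOException(e);
                }
              });
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Running this once at store-open time would remove stale snapshots regardless of how many were left behind, without needing to coordinate with concurrent replication.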
Ya. We could do that. On the other hand - is it possible that the snapshot is partially copied at this point https://github.com/zeebe-io/zeebe/blob/40da7bf3fc6a91d3652a31ab5b164202e3b00196/snapshot/src/main/java/io/zeebe/snapshots/broker/impl/FileBasedSnapshotStore.java#L277 ? /cc @MiguelPires as potential review buddy
To delete more than one older snapshot when we receive a new snapshot, we would have to synchronize access to the SnapshotStore because of concurrent snapshot replication. Currently it is coordinated under the assumption that we only have to delete the previous one, which we synchronize via an AtomicReference.
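The AtomicReference-based coordination described above can be sketched as follows. This is an illustrative, hypothetical class (not the actual Zeebe code); it assumes snapshot IDs start with the index so newness can be compared, and it highlights the limitation being discussed: only the single replaced snapshot is handed back for deletion, so older leftovers survive.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: concurrent replication threads publish snapshots via
// an AtomicReference; the winner of the swap gets exactly one previous
// snapshot back to delete.
final class LatestSnapshotRef {

  private final AtomicReference<String> current = new AtomicReference<>();

  // Assumes IDs like "39606899-318-...", i.e. the index comes first.
  static long indexOf(final String snapshotId) {
    return Long.parseLong(snapshotId.split("-")[0]);
  }

  // Publishes the candidate if it is newer than the current snapshot.
  // Returns the single snapshot that was replaced (for the caller to
  // delete), or null if nothing was replaced. Because only the directly
  // previous snapshot is returned, any older leftovers are never cleaned up.
  String publishIfNewer(final String candidate) {
    while (true) {
      final String prev = current.get();
      if (prev != null && indexOf(prev) >= indexOf(candidate)) {
        return null; // candidate is stale; keep the current snapshot
      }
      if (current.compareAndSet(prev, candidate)) {
        return prev;
      }
      // lost a race with a concurrent replication; re-read and retry
    }
  }

  String current() {
    return current.get();
  }
}
```

Deleting an arbitrary number of older snapshots would break this single-slot scheme, which is why broader synchronization of the store would be needed.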
6017: fix(broker): pick latest snapshot on load when multiple snapshots exist r=deepthidevaki a=deepthidevaki

## Description

- Pick the latest snapshot on startup when multiple snapshots exist
- Delete the snapshot directory if the atomic move fails. If a snapshot was partially moved, it would be considered a valid snapshot after a restart.

## Related issues

closes #5500

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [x] The changes are backwards compatible with previous versions
* [x] If it fixes a bug, then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR; in case that fails, you need to create the backports manually.

Testing:
* [x] There are unit/integration tests that verify all acceptance criteria of the issue
* [ ] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The change has been verified by a QA run
* [ ] The impact of the changes is verified by a benchmark

Documentation:
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Describe the bug
Observed as error in production logs.
Log/Stacktrace
https://console.cloud.google.com/errors/CKnfqtj3lqTxaw?service=zeebe&time=P7D&project=camunda-cloud-240911
Environment:
Camunda Cloud / Prod / 0.24.3