Abort snapshot concurrent to a backup #9906

deepthidevaki · 2022-07-27T14:46:22Z

From ZEP

Edge cases due to concurrent snapshots and compaction

While taking the backup, the available snapshot may have processedPosition >= checkpointPosition. In that case we cannot take a valid backup, because there is no way to retrieve a state that reflects processing only up to the checkpoint command.

There are two scenarios in which this can happen:

1. The available snapshot is after the checkpointPosition. That is, snapshotId.processedPosition > checkpointPosition.
2. The available snapshot was started at a position before checkpointPosition, but was taken in parallel to the backup process. In this case snapshotId.processedPosition < checkpointPosition. But since the snapshot is taken concurrently, the actual processedPosition in the state is > checkpointPosition.

To prevent case 1, we fail the backup if snapshotId.processedPosition >= checkpointPosition.

In case 2, it is difficult to find the actual processedPosition in the snapshot without opening the database. Hence, to be safe, we abort the snapshot if snapshotId.processedPosition <= checkpointPosition < lastWrittenPosition. This means that we abort any snapshot that is taken in parallel to a backup. This is not ideal, but it is a simple solution to prevent inconsistencies.
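
The two conditions above can be written down as a minimal Java sketch. The class and method names (`ConcurrentSnapshotCheck`, `shouldFailBackup`, `shouldAbortSnapshot`) are hypothetical and only illustrate the comparisons described here; they are not taken from the Zeebe code base.

```java
/** Hypothetical helper; names are illustrative and not Zeebe's actual API. */
final class ConcurrentSnapshotCheck {

  private ConcurrentSnapshotCheck() {}

  /** Case 1: the snapshot id already points at or past the checkpoint, so the backup fails. */
  static boolean shouldFailBackup(long snapshotProcessedPosition, long checkpointPosition) {
    return snapshotProcessedPosition >= checkpointPosition;
  }

  /**
   * Case 2: the snapshot started at or before the checkpoint, but the log has already
   * been written past the checkpoint, so the snapshot may contain later state.
   */
  static boolean shouldAbortSnapshot(
      long snapshotProcessedPosition, long checkpointPosition, long lastWrittenPosition) {
    return snapshotProcessedPosition <= checkpointPosition
        && checkpointPosition < lastWrittenPosition;
  }

  public static void main(String[] args) {
    // Snapshot started at 10, checkpoint at 15, log already written up to 20.
    System.out.println(shouldAbortSnapshot(10, 15, 20));
    // Snapshot id is already past the checkpoint.
    System.out.println(shouldFailBackup(20, 15));
  }
}
```

Note that both conditions hold when snapshotId.processedPosition equals checkpointPosition and the log has advanced, which matches the conditions as stated in this comment.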

Possible solution

AsyncSnapshotDirector uses the CheckpointListener to find the concurrent checkpoints.

Note: Check whether multiple concurrent checkpoints are a problem.

deepthidevaki · 2022-08-15T13:36:20Z

We found some edge cases that are not covered by #10061 /cc @oleschoenburg

Let's assume the following logstream, where ci = command, fi = follow-up record of ci, cp = checkpoint record, and fcp = follow-up record of cp.

Case 1:

c1 | c2 | f1 | f2 | cp | fcp | ...

Snapshot position = c2. This is the acceptable scenario. When the backup is taken, it contains a snapshot that is before the checkpoint position. The system can restore from the backup safely.

Case 2:
c1 | cp | c2 | f1 | fcp | f2 |...

ProcessedPosition in the snapshot = c1
SnapshotPosition = c2 (because the snapshot is taken async)
Including this snapshot in a backup makes the backup inconsistent, because the snapshot already contains state from after the checkpoint.

This is the case that we tried to prevent in the proposed solution.

However, we found the following edge cases:

Case 3:

c1 | c2 | c3 | cp | c4 | f1 | f2 | f3 | fcp
Snapshot position = c3

Case 4:

c1 | c2 | ......| c100 | cp | c101 | f1 | f2 | ......... | fcp
Snapshot position = c2

In both cases, snapshot position < checkpoint position, so the state in the snapshot is from before the checkpoint. But the follow-up records of the commands already processed in the snapshot are written after the checkpoint position, and the backup only includes the log up to the checkpoint. As a result, if we include the snapshot in the backup, we cannot recover from it after a restore, because the follow-up events for the processed position are not available in the logstream. This would lead to an inconsistent state after recovery.
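
To make cases 3 and 4 concrete, here is a small Java model of the log. It is only a sketch: `LogRecord`, its `sourcePosition` field (the position of the command that produced a follow-up record, or -1 for a command), and `followUpsLostByBackup` are assumptions made for illustration and do not mirror Zeebe's real record types. The method lists the follow-up records that belong to commands already processed in the snapshot but lie after the checkpoint, and would therefore be missing from the backup.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical log model used only to illustrate cases 1 and 3. */
final class MissingFollowUpCheck {

  private MissingFollowUpCheck() {}

  /** A log record; sourcePosition is -1 for commands (assumed convention). */
  static final class LogRecord {
    final long position;
    final long sourcePosition;

    LogRecord(long position, long sourcePosition) {
      this.position = position;
      this.sourcePosition = sourcePosition;
    }
  }

  /**
   * Returns positions of follow-up records whose command is already processed in the
   * snapshot, but which were written after the checkpoint position.
   */
  static List<Long> followUpsLostByBackup(
      List<LogRecord> log, long snapshotProcessedPosition, long checkpointPosition) {
    List<Long> lost = new ArrayList<>();
    for (LogRecord record : log) {
      boolean isFollowUp = record.sourcePosition >= 0;
      if (isFollowUp
          && record.sourcePosition <= snapshotProcessedPosition
          && record.position > checkpointPosition) {
        lost.add(record.position);
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    // Case 3: c1 | c2 | c3 | cp | c4 | f1 | f2 | f3 | fcp at positions 1..9
    List<LogRecord> case3 = new ArrayList<>();
    case3.add(new LogRecord(1, -1)); // c1
    case3.add(new LogRecord(2, -1)); // c2
    case3.add(new LogRecord(3, -1)); // c3
    case3.add(new LogRecord(4, -1)); // cp
    case3.add(new LogRecord(5, -1)); // c4
    case3.add(new LogRecord(6, 1)); // f1
    case3.add(new LogRecord(7, 2)); // f2
    case3.add(new LogRecord(8, 3)); // f3
    case3.add(new LogRecord(9, 4)); // fcp
    // Snapshot position = c3 (3), checkpoint position = cp (4).
    System.out.println(followUpsLostByBackup(case3, 3, 4));
  }
}
```

For the case 1 log, the same check returns an empty list, because f1 and f2 are written before cp.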

To prevent this case, we discussed the following solution:

Add lastWrittenPosition as metadata of the snapshot. This information must be persisted with the snapshot.
BackupManager will fail the backup if there is no valid snapshot for the backup. A snapshot is valid if snapshot.processedPosition < snapshot.lastWrittenPosition < checkpointPosition.
To minimize the chance of failing a backup because no valid snapshot is available, we increase the number of snapshots that are retained. This way, if the latest snapshot is not suitable for the backup, BackupManager can choose a previous snapshot.
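
The selection described in these steps could look roughly like the following Java sketch. `BackupSnapshotSelector`, `SnapshotMetadata`, and `selectSnapshot` are hypothetical names introduced for illustration, and the list of retained snapshots is assumed to be ordered from oldest to newest; the validity check is the condition stated above.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of choosing a snapshot for a backup; not Zeebe's actual API. */
final class BackupSnapshotSelector {

  private BackupSnapshotSelector() {}

  /** Positions persisted with a snapshot (assumed shape). */
  static final class SnapshotMetadata {
    final long processedPosition;
    final long lastWrittenPosition;

    SnapshotMetadata(long processedPosition, long lastWrittenPosition) {
      this.processedPosition = processedPosition;
      this.lastWrittenPosition = lastWrittenPosition;
    }
  }

  /** processedPosition < lastWrittenPosition < checkpointPosition, as stated in the issue. */
  static boolean isValidForBackup(SnapshotMetadata snapshot, long checkpointPosition) {
    return snapshot.processedPosition < snapshot.lastWrittenPosition
        && snapshot.lastWrittenPosition < checkpointPosition;
  }

  /**
   * Walks the retained snapshots from newest to oldest and returns the first valid one.
   * An empty result means the backup has to fail.
   */
  static Optional<SnapshotMetadata> selectSnapshot(
      List<SnapshotMetadata> retainedOldestFirst, long checkpointPosition) {
    for (int i = retainedOldestFirst.size() - 1; i >= 0; i--) {
      SnapshotMetadata candidate = retainedOldestFirst.get(i);
      if (isValidForBackup(candidate, checkpointPosition)) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }
}
```

With a checkpoint at position 10 and retained snapshots (processed 2, last written 4) and (processed 6, last written 12), the newer snapshot is rejected and the older one is chosen. If every retained snapshot has lastWrittenPosition at or after the checkpoint, the result is empty and the backup fails.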

One downside is that in case 3 and case 4 the snapshot actually contains valid state, but we must treat it as invalid because the backup only includes records up to the checkpoint record. We do this because we do not want to include any commands after the checkpoint position in the backup, as that would make the checkpoint inconsistent.

As a consequence, there are situations in which we cannot take a backup at all. Consider the case where there are a lot of unprocessed records in the logstream, for example due to a bug that makes the StreamProcessor progress very slowly. If we try to take a backup, we end up in case 4, and BackupManager fails the backup because it can never find a valid snapshot.

deepthidevaki mentioned this issue Jul 27, 2022

Zeebe can backup its data to an external storage without downtime and restore from it #9606

Closed

58 tasks

deepthidevaki self-assigned this Aug 4, 2022

deepthidevaki mentioned this issue Aug 15, 2022

Abort snapshot concurrent to a backup #10061

Closed

10 tasks

deepthidevaki mentioned this issue Aug 18, 2022

Add last written position as metadata in snapshot #10115

Closed

deepthidevaki closed this as completed Aug 18, 2022
