Abort snapshot concurrent to a backup #9906
We found some edge cases that are not covered by #10061. /cc @oleschoenburg

Let's assume the following logstream, where ci = command, fi = follow-up record of ci, and cp = checkpoint record.

Case 1: c1 | c2 | f1 | f2 | cp | fcp | ...
Snapshot position = c2. This is the acceptable scenario. When the backup is taken, it contains a snapshot that is before the checkpoint position. The system can restore from the backup safely.

Case 2: ProcessedPosition in the snapshot = c1. This is the case that we tried to prevent in the proposed solution.

However, we found the following edge cases:

Case 3: c1 | c2 | c3 | cp | c4 | f1 | f2 | f3 | cpf
Case 4: c1 | c2 | ... | c100 | cp | c101 | f1 | f2 | ... | fcp

In both of these cases, snapshot position < checkpoint position. That means the state in the snapshot is before the checkpoint position, but the follow-up records are after the checkpoint position. As a result, if we include the snapshot in the backup, we cannot recover from the snapshot after a restore, because the follow-up events of the processed position are not available in the logstream. This would lead to an inconsistent state after the recovery.

To prevent this case we discussed the following solution:
One downside of this is that cases 3 and 4 actually have valid state in the snapshot, but the snapshot is unusable because we only include the records up to the checkpoint record in the backup. We do this because we do not want to include any commands after the checkpoint position in the backup, as that would make the checkpoint inconsistent.

There are some cases in which we would not be able to take a backup at all. Consider the case where there are a lot of unprocessed records in the logstream due to a bug that causes the StreamProcessor to make progress very slowly. If we try to take a backup, this will lead to case 4, and the BackupManager will fail the backup as it can never find a valid snapshot.
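To make the cases above concrete, here is a minimal sketch of the validity check they imply: a snapshot is only usable in a backup if both its processed position and the follow-up records of that position lie before the checkpoint. The class `SnapshotBackupCheck`, the method `isValidForBackup`, and the parameter `lastFollowUpPosition` are hypothetical names for illustration and are not taken from the Zeebe code base. The sketch also assumes that the position of the last follow-up record is known for a snapshot, and it numbers log positions from 1 in the order shown in the cases.

```java
/** Hypothetical model of the check discussed in this issue; not Zeebe's actual API. */
final class SnapshotBackupCheck {

  /**
   * @param processedPosition position of the last command processed in the snapshot
   * @param lastFollowUpPosition position of the last follow-up record of that command
   *     (assumed to be known for the snapshot)
   * @param checkpointPosition position of the checkpoint record (cp)
   */
  static boolean isValidForBackup(
      long processedPosition, long lastFollowUpPosition, long checkpointPosition) {
    // The backup only contains records up to the checkpoint, so the processed
    // command and all of its follow-up records must lie before the checkpoint.
    return processedPosition < checkpointPosition
        && lastFollowUpPosition < checkpointPosition;
  }

  public static void main(String[] args) {
    // Case 1: c1 | c2 | f1 | f2 | cp | fcp  (positions 1..6), snapshot at c2.
    System.out.println(isValidForBackup(2, 4, 5)); // true

    // Case 3: c1 | c2 | c3 | cp | c4 | f1 | f2 | f3 | cpf  (positions 1..9).
    // Assuming the snapshot has processed c3, its follow-up f3 is at position 8.
    System.out.println(isValidForBackup(3, 8, 4)); // false
  }
}
```

In case 4 the same check keeps failing for as long as the processor lags behind, which matches the concern that the BackupManager may never find a valid snapshot.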
From ZEP
Possible solution
AsyncSnapshotDirector uses the CheckpointListener to find the concurrent checkpoints.
Note: Check whether multiple concurrent checkpoints are a problem.
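The idea of aborting a snapshot when a checkpoint is created concurrently could be sketched roughly as follows. Everything here is an assumption for illustration: the `CheckpointListener` interface below is a local stand-in with a guessed method signature, and `SnapshotAttempt` is a hypothetical helper, not the real `AsyncSnapshotDirector`. The sketch further assumes that the listener callback and `finish` run on the same thread (for example, the same actor), so no synchronization is shown.

```java
/** Local stand-in for the listener; the real interface and signature may differ. */
interface CheckpointListener {
  void onNewCheckpointCreated(long checkpointId);
}

/** Hypothetical model of a single snapshot attempt that observes checkpoints. */
final class SnapshotAttempt implements CheckpointListener {
  // Set as soon as any checkpoint is created while this attempt is in progress.
  private boolean checkpointSeen = false;

  @Override
  public void onNewCheckpointCreated(long checkpointId) {
    checkpointSeen = true;
  }

  /**
   * Completes the attempt: runs {@code abort} if a checkpoint was observed,
   * otherwise runs {@code persist}.
   *
   * @return true if the snapshot was persisted, false if it was aborted
   */
  boolean finish(Runnable persist, Runnable abort) {
    if (checkpointSeen) {
      abort.run();
      return false;
    }
    persist.run();
    return true;
  }
}
```

A caller would register a new `SnapshotAttempt` as a listener when the snapshot starts and call `finish` once the snapshot is ready to be persisted. In this sketch a single flag is set no matter how many checkpoints arrive during the attempt; whether that is sufficient in practice is the open question raised in the note.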