Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: incremental backups can miss non-transactional writes #62564

Closed
pbardea opened this issue Mar 25, 2021 · 3 comments
Closed

backupccl: incremental backups can miss non-transactional writes #62564

pbardea opened this issue Mar 25, 2021 · 3 comments
Assignees
Labels
A-disaster-recovery branch-release-20.1 Used to mark technical advisories for 20.1 branch-release-20.2 Used to mark technical advisories for 20.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory

Comments

@pbardea
Copy link
Contributor

pbardea commented Mar 25, 2021

Non-transactional writes, such as AddSSTable requests may write at a timestamp arbitrarily far in the past. As such, the incremental iterator used by backup will miss these writes if they write to a timestamp prior to that of the previous backup.

Indexes that are being added to tables should only be backed up when they appear online. Additionally, these spans should be considered newly introduced so that backup performs a full scan of the index, rather than an incremental read.

Offline tables should be included in the backup to account for the case where a table that was previously online was taken offline. The writes to the table while it was OFFLINE will need to be captured in a subsequent incremental backup once the table returns online.

@pbardea pbardea added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery labels Mar 25, 2021
@pbardea pbardea self-assigned this Mar 25, 2021
@pbardea pbardea added this to Triage in Disaster Recovery Backlog via automation Mar 25, 2021
@mwang1026
Copy link

As discussed offline (har, har) we also need to exclude non-public tables.

@pbardea pbardea changed the title backupccl: only back up public indexes backupccl: only back up public spans Mar 26, 2021
@pbardea pbardea moved this from Triage to Bug in Disaster Recovery Backlog Mar 29, 2021
@pbardea
Copy link
Contributor Author

pbardea commented Apr 2, 2021

Chatted with @dt about the approaches on handing OFFLINE tables. I'm summarizing our conversation here and outlining plans on how to move forward here.

Today, database backups do not include OFFLINE tables. This means that they don't run into this particular issue, but an incremental backup that would have missed data would have failed will run into #61571 instead. So, this ticket will address solving the problem for OFFLINE tables include in cluster backups. The solution is similar for database backups with OFFLINE tables once those are included.

In general, the fix that we need (until ExportRequest supports non-MVCC operations) is for the first incremental backup after the offline table returns to public to capture all changes since the table was brought offline (ie all of the AddSSTable traffic).

What We Do Today

Backup/restore manages the covering of span/time space that it needs to include in incremental backups. This can be visualized by plotting the span/time space that each incremental backup covers. Today, each incremental backup of an offline table will just cover the difference from the last backup:
broken

However inc2 and inc3 will be missing AddSSTable requests that have written at time t(import_start) after inc1 was taken. This means that restores at t > t(import_end) will be missing data.

Fix 1

The first step to fixing this is for the first incremental backup after the table comes back online (inc3 in this case), to take another full backup of this table (rather than incrementally capture the difference). This is the easiest option to implement since we already have a mechanism to do this (introduced spans). This would yield in:
fix1

The drawback is that inc3 will need to re-backup all of the data that was in the table pre-import as well as all of the imported data.

Fix 2

To resolve the drawback, further changes can be made so that instead of covering all the way back to t=0, inc3 just covers back until the import started:
fix2

This optimization will only be implemented after fix 1 since it requires a more disruptive code change, which may not be backportable.

This solution would perform an incremental read from t(import_start) until t(inc3). Today, this won't work because we check that the start time of ExportRequests are above the GC threshold and the GC threshold may be at any point below t(inc2) - including times above t(import_start). However, today IMPORT, IMPORT INTO, and RESTORE do not overwrite keys that previously existed in the table. And since we know that t(import_end) is above the GC threshold, we can conclude that it is safe to read the KVs from this table all the way back to t(import_start). ExportRequest is not aware of this though, so we would need to convince it it is safe (possibly by passing in a flag overriding it's check of the start time against the GC threshold).

craig bot pushed a commit that referenced this issue Apr 5, 2021
62572: backupccl: only backup public indexes r=pbardea a=pbardea

This commit ensures that backup only backs up PUBLIC indexes. This means
that it will not back up ADDING indexes.

This is because AddSSTable requests may write in the past, so
incremental backups may miss writes when performing a time-based scan
from the previous incremental backup.

Imforms #62564.

Test test previously failed with:
```
--- FAIL: TestBackupOnlyPublicSpans (1.92s)
    backup_test.go:7802: 
                Error Trace:    backup_test.go:7802
                Error:          Received unexpected error:
                                expected incremental backup to not contain any data, found spans [/Table/53/{2-3}]
```

Release note (bug fix): Fixes a bug where index backfill data may have
been missed by BACKUP in incremental backups.

Co-authored-by: Paul Bardea <pbardea@gmail.com>
craig bot pushed a commit that referenced this issue Apr 9, 2021
63121: backupccl: re-backup spans that come online during incremental backups r=pbardea a=pbardea

This commit fixes a bug where backup would miss non-transactional writes
(via AddSSTable) during incremental backups. These backups were missed
because AddSSTable can write to a timestamp that is before the previous
incremental backup. So, if a table was written to while OFFLINE (e.g. by
a RESTORE or an IMPORT), during a backup, the following incremental
backup may miss some data.

To resolve this, BACKUP now re-backs up all of the data of OFFLINE
tables on incremental backups that put this table back online. This
comes with the drawback of some incremental backups (when a restore or
import completes) will be much slower since it has to recapture all of
the data.

There is planned future work so that these incrementals only the new
data written by the RESTORE or IMPORT, rather than resorting to backing
up the entire table again. This will be addressed in a later PR.

Implements fix 1 as described in #62564, which addresses
the correctness concerns at the expense of incremental backup performance.

Release note (bug fix): Incremental cluster backups may have missed data
written to tables while they were OFFLINE. In practice this can happen
if a RESTORE or IMPORT was running across incremental backups.


Co-authored-by: Paul Bardea <pbardea@gmail.com>
@pbardea pbardea changed the title backupccl: only back up public spans backupccl: incremental backups can miss non-transactional writes Apr 9, 2021
@pbardea pbardea added this to In Progress in Disaster Recovery Sprint Status Apr 9, 2021
@dt
Copy link
Member

dt commented May 19, 2021

Closing as done by #63121

@dt dt closed this as completed May 19, 2021
Disaster Recovery Sprint Status automation moved this from In Progress to Acceptance Test Queue May 19, 2021
Disaster Recovery Backlog automation moved this from Bug to Done May 19, 2021
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jan 10, 2023
The backup fixtures in this test were generated on 20.2 where
cockroachdb#62564 existed.
Given this was >1 major version in the past we do not expect to
see a reoccurrence of the bug in 22.2+ backups that are restoreable
in this release.

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Mar 10, 2023
The backup fixtures in this test were generated on 20.2 where
cockroachdb#62564 existed.
Given this was >1 major version in the past we do not expect to
see a reoccurrence of the bug in 22.2+ backups that are restoreable
in this release.

Release note: None
@rytaft rytaft added C-technical-advisory Caused a technical advisory branch-release-20.2 Used to mark technical advisories for 20.2 branch-release-20.1 Used to mark technical advisories for 20.1 labels Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery branch-release-20.1 Used to mark technical advisories for 20.1 branch-release-20.2 Used to mark technical advisories for 20.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory
Projects
Disaster Recovery Sprint Status
  
Acceptance Test Queue
Development

No branches or pull requests

4 participants