backup/restore: all tables in an incremental backup must be present in the full backup for restore to work #18633
Comments
You need to specify both the full and incremental backups to restore from an incremental backup: https://www.cockroachlabs.com/docs/stable/restore.html#restore-from-incremental-backups |
You might have to re-explain this to me. Also, I'm sorry - I pasted in the wrong query, since I tried the two commands in both orders to see if the documentation was wrong. Don't I have both the full (crdbcsvtest/database) and the incremental (crdbcsvtest/database_inc) specified above? |
Yeah, talked to diana offline and it looks like there's a real issue here (which was hidden by the initial copy-paste error). At a high level, if a table is added after a full backup and then an incremental backup is run, we can't restore that table from the incremental backup. In the 1.1 timeframe, this will have to be a known limitation. I'm going to replace the issue text with a technical description of what's going on and think some more about how we'd fix this |
@cuongdo unless there's an easier fix I'm missing, this is going to require that we have more than one start time associated with a backup, which means changes to the BackupDescriptor proto that we serialize next to a backup as well as the BackupDetails in the jobs table. Which likely qualifies this as a 1.1 known limitation and a 1.2 fix. Thoughts? |
Could this be a potential 1.1 workaround? Haven't actually tried it yet.
BACKUP DATABASE foo TO 'nodelocal://a';
CREATE TABLE foo.new (k INT);
BACKUP DATABASE foo TO 'nodelocal://b' INCREMENTAL FROM 'nodelocal://a';
RESTORE foo.* FROM 'nodelocal://a', 'nodelocal://b';
-- Oh te noes!
BACKUP TABLE foo.new TO 'nodelocal://c';
RESTORE foo.* FROM 'nodelocal://a', 'nodelocal://c', 'nodelocal://b'; |
Nice idea, though I think it will reject that since foo.new is present in overlapping times in b and c |
The really unfortunate thing here is that trying to restore |
Documented this 1.1 limitation in cockroachdb/docs#1990. |
The worst of this is already fixed, even in 1.1: #19286 |
ditto what @benesch said, with the added point that we might want to "fix" the issue (not just error) in 1.2. Doing so "just" requires adding more granular time bounds information to the backup metadata, then using that to determine whether the previous backups indeed cover the right tables over all of time. However, there's a UX question of if/when we actually want to do that -- automatically include essentially a full backup of one or more of the tables when doing an "incremental" backup.

One use case (A):

Another use case (B):

And finally, an almost silly case (C):

In all three cases, the set of tables in the first backup does not match the set of tables being backed up. (A) seems like it should probably Just Work. (C) seems like it is more likely the operator has just mistakenly pointed at the wrong previous backup. A full backup might be much bigger or more expensive, so they may be unpleasantly surprised when their "incremental" backup contains the entire orders table. (B) isn't quite as clear cut -- under the hood it is the same as (A) except the new table might be old/huge, so it has some of the same potential for unpleasant surprises as (C).

One possible rule that catches (C) would be that there must be some overlap in the previous and current backup. Another possible option might be to set the start of the time range that must be covered to the creation time of the table. Then a previous backup that doesn't include a new table is OK. |
@dt I think you are right in that cases B and C seem odd. The customer who ran into this was backing up the entire database, not adding new tables to an existing backup -> that definitely seems like a weird edge case that I don't particularly think we need to support. If they want to add a table that existed prior to the full backup's timestamp to an existing full backup, I think it's reasonable to force them to run a full backup. |
backing up a database with nightly incrementals is, I expect, the default use case, so I think it would be ideal if that Just Works, even if you add/drop/truncate tables in that DB. Under the hood, supporting that is about the same as supporting B and C, but IMO, B and C look like usage errors that I'd expect to fail rather than silently de-incrementalize themselves. |
makes sense to me! what do you think the work would be to support this in terms of time? I'm worried about adding more things to our full plate. |
@danhhz @benesch off the top of my head, the obvious but complicated approach we already discussed seems to be to start keeping per-span time bounds and then de-incrementalize new keyspace. To reject B and C, we'd want to check something like table creation time. Alternatively, I think we could also relax the coverage requirement to start at table creation time rather than time 0? |
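A hypothetical sketch of that relaxed per-table check (Python for illustration only; `creation_time` is assumed metadata, since TableDescriptor has no such field):

```python
# Hypothetical sketch of relaxing the coverage floor from time 0 to the
# table's creation time. `intervals` holds one (start, end] pair per backup
# in which the table appears; `creation_time` is an assumed input, not an
# existing TableDescriptor field.

def table_fully_covered(intervals, creation_time, as_of):
    covered = creation_time  # the strict rule would start from 0
    for start, end in sorted(intervals):
        if start > covered:
            return False  # gap in this table's history
        covered = max(covered, end)
    return covered >= as_of
```

Under the strict floor of 0, a table created at t1 that only appears in backups covering (t1, ...] is rejected even though its full history is present; with the relaxed floor it passes, and old tables (creation time 0) behave exactly as today.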
What are you thinking of using as "table creation time"? The MVCC timestamp doesn't work (think about backing up a table that was created via RESTORE; all the restored data will have MVCC timestamps that are less than the descriptor's), and I don't see a creation time on TableDescriptor. My personal opinion is that the easiest thing to do correctly is handle all of A, B, and C by making start time per-file instead of per-backup. Then you have to figure out the UX issues of a hybrid full/incremental backup, but that seems tractable. |
I was thinking we'd put an HLC timestamp in the table descriptor -- old tables would have 0, which is fine since that matches current behavior, but new tables would have it set -- which is fine, since it is only new tables where it matters. |
That could work. I started to think through some of the edge cases (txn writing the desc gets pushed, etc), but as long as it's not a tight lower bound, I don't think they're so bad. I still think making start time per-file is the way to go, but your call. |
I don't think (C) should work -- if you said "incremental" but just pointed at the wrong backup, silently switching to a full backup and ignoring the unrelated base backup seems likely to do more harm than good -- it reduces operational predictability, suddenly running a longer/bigger/more expensive operation than the administrator expected. That's why I was thinking it might be nice if, instead of expanding the window of changes we capture (and thus potentially capturing more than was intended), we narrow the required range. |
That said, I can go either way on (B) working, which, if we do want to support, I think implies we want per-file time bounds / per-range start-times, in which case maybe we just reject (c) with an explicit intersection check if we want that. Hmm. |
Yeah. Disallowing (c) and perhaps warning or something on (b) is what I meant by "Then you have to figure out the UX issues of a hybrid full/incremental backup" |
in lunch discussion with mjibson, it seems like |
@dt this is only a problem in case B right? |
When a restore is run, it validates that all time ranges are accounted for to prevent the user from footgun-ing. Consider the following:
A full backup is run for (0, t1]
An incremental backup is run for (t1, t2]
An incremental backup is run for (t2, t3]
If the user tries to restore but only specifies the (0, t1] and (t2, t3] backups, we error instead of restoring incorrect data. Unfortunately, this check breaks if a new table was added in either of the incremental backups: the check falsely thinks there is missing history for that table.
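A rough sketch of that validation (Python for illustration, not the actual CockroachDB code), treating each backup as a half-open (start, end] interval of MVCC time:

```python
# Sketch of the restore-time footgun check described above: the supplied
# backups must cover a contiguous range of MVCC time starting at 0.

def validate_chain(intervals):
    """intervals: one (start, end] pair per supplied backup."""
    covered = 0
    for start, end in sorted(intervals):
        if start > covered:
            raise ValueError(f"no backup covers ({covered}, {start}]")
        covered = max(covered, end)
    return covered  # latest restorable timestamp
```

validate_chain([(0, 1), (1, 2), (2, 3)]) succeeds, while dropping the middle backup raises. Applied per table with the floor pinned at 0, the same check wrongly rejects a table that did not exist until t1.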
One potential fix: when generating the export requests for an incremental backup, use 0 as the start time for any table that was not present in the full backup (so, in essence, it's a full backup for that table).
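A minimal sketch of that fix, with illustrative names (not the real export-planning code):

```python
# Sketch of the proposed fix: when planning an incremental backup, any table
# missing from the base backup gets start time 0, so its portion of the
# "incremental" backup is effectively a full backup. Names are illustrative.

def plan_export_requests(tables, base_backup_tables, incr_start, end_time):
    requests = []
    for table in tables:
        start = incr_start if table in base_backup_tables else 0
        requests.append((table, start, end_time))
    return requests
```

For example, incrementally backing up tables "users" and "new_tbl" over (5, 9] against a base backup containing only "users" would export "users" from time 5 but "new_tbl" from time 0.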