btrbk does not delete partially received subvolumes on error #17

Closed
diraimondo opened this Issue May 9, 2015 · 11 comments


@diraimondo

I'm attempting to replace my personal bash script, which remotely backs up my laptop to a local server using the btrfs send/receive feature, with btrbk. In my setup the backup is started by the server. My problem concerns the cases where the transfer is interrupted by a network problem (or because I walk away with the laptop). In this case, if 'resume_missing' is 'yes', btrbk attempts to resume the previous snapshot, but this fails. This looks like a bug (it should delete the previous, partially received backup).

$ sudo btrbk -l info run      
btrbk command line client, version 0.17.0  (Sat May  9 11:13:13 2015)
Using configuration: /etc/btrbk/btrbk.conf
Creating subvolume snapshot for: {gandalf}/mnt/btrfs-pool/PACMAN
>>> {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
Creating subvolume backup (send-receive) for: {gandalf}/mnt/btrfs-pool/PACMAN
Receiving from snapshot: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
No common parent subvolume present, creating full backup
>>> /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
At subvol /mnt/btrfs-pool/snapshots/PACMAN.20150509
At subvol PACMAN.20150509
^C

$ sudo btrbk -l info run           
btrbk command line client, version 0.17.0  (Sat May  9 11:14:03 2015)
Using configuration: /etc/btrbk/btrbk.conf
Creating subvolume snapshot for: {gandalf}/mnt/btrfs-pool/PACMAN
>>> {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509_1
Checking for missing backups of subvolume "{gandalf}/mnt/btrfs-pool/PACMAN" in: /mnt/btrfs-backup-pool/BTRBK/
Resuming subvolume backup (send-receive) for: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
Receiving from snapshot: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
No common parent subvolume present, creating full backup
>>> /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
At subvol /mnt/btrfs-pool/snapshots/PACMAN.20150509
At subvol PACMAN.20150509
ERROR: creating subvolume PACMAN.20150509 failed. File exists
ERROR: Failed to send/receive btrfs subvolume: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509  -> /mnt/btrfs-backup-pool/BTRBK
ERROR: Error while resuming backups, aborting
Resumed 1 backups
WARNING: Skipping cleanup of snapshots for subvolume "{gandalf}/mnt/btrfs-pool/PACMAN", as at least one target aborted earlier
Completed within: 2s  (Sat May  9 11:14:05 2015)
--------------------------------------------------------------------------------
Backup Summary (btrbk command line client, version 0.17.0)

    Date:   Sat May  9 11:14:03 2015
    Config: /etc/btrbk/btrbk.conf

Legend:
    +++  created subvolume (source snapshot)
    ---  deleted subvolume
    ***  received subvolume (non-incremental)
    >>>  received subvolume (incremental)
--------------------------------------------------------------------------------
{gandalf}/mnt/btrfs-pool/PACMAN
+++ {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509_1
!!! /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
!!! Target "/mnt/btrfs-backup-pool/BTRBK" aborted: Failed to send/receive subvolume

NOTE: Some errors occurred, which may result in missing backups!
Please check warning and error messages above.

I'm not really interested in resuming the transfer of all the previous backups: if I set 'resume_missing' to 'no', btrbk doesn't attempt to resume the previous transfer. It takes a fresh snapshot on the laptop and tries to transfer it to the server. This fits my scenario well, but I would like btrbk to remove all the incomplete snapshots on the server: they are left behind, cluttering the server's file system, and I can't even distinguish between completed and incomplete backups.
Note: my bash script handles this problem by adding keywords like 'transfering' / 'synced' to the snapshot names.

@digint
Owner
digint commented May 9, 2015

btrbk has a problem here, since btrfs receive does not provide me with enough information to correctly decide whether the received subvolume really needs to be deleted.

"Delete partially received subvolumes on error" is on the btrfs-progs TODO list [1]. Until then, I think the only way of fixing this problem is to introduce an unsafe feature that always tries to delete the received subvolume as soon as btrfs receive returns ANY error, with the drawback of possibly deleting the wrong thing.

I asked on the btrfs mailing list for some hints on this issue, and will probably implement a temporary fix soon.

[1] https://btrfs.wiki.kernel.org/index.php/Design_notes_on_Send/Receive

@digint digint added the bug label May 9, 2015
@digint digint changed the title from [FR] possibility to remove previous uncompleted snapshots on target / [BUG] impossibility to resume previous transfer to btrbk does not delete partially received subvolumes on error May 9, 2015
@diraimondo

Thank you for the quick reply. Could I suggest the following strategy?

  • add a tag #transfering to the source snapshot before the send: in this way the copy on the target will get the same tag;
  • upon a successful transfer, you can rename both the snapshots (source and target);
  • upon a new invocation, if you find a tagged snapshot on target, you know that it is incomplete (delete it); on source you can just remove the tag.
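
A minimal sketch of the tagging strategy above, in Python, with plain directories standing in for btrfs subvolumes. The `.transferring` suffix and all function names are hypothetical and only illustrate the idea; a real implementation would have to rename and delete subvolumes with the btrfs tools, and do the source-side steps on the remote host.

```python
import shutil
from pathlib import Path

TAG = ".transferring"  # hypothetical marker, not a btrbk or btrfs convention


def mark_in_progress(snapshot: Path) -> Path:
    """Tag the source snapshot before sending, so the copy on the target
    is created under the same tagged name."""
    tagged = snapshot.with_name(snapshot.name + TAG)
    snapshot.rename(tagged)
    return tagged


def finalize(tagged: Path) -> Path:
    """Remove the tag again (used on both sides after a successful transfer)."""
    final = tagged.with_name(tagged.name[: -len(TAG)])
    tagged.rename(final)
    return final


def cleanup_on_start(source_dir: Path, target_dir: Path) -> list:
    """On a new invocation: a tagged entry on the target is incomplete and is
    deleted; a tagged entry on the source only needs its tag removed."""
    removed = []
    for entry in sorted(target_dir.iterdir()):
        if entry.name.endswith(TAG):
            shutil.rmtree(entry)  # stand-in for deleting the partial subvolume
            removed.append(entry.name)
    for entry in sorted(source_dir.iterdir()):
        if entry.name.endswith(TAG):
            finalize(entry)
    return removed
```

Note that `finalize` is itself a rename that can fail, so the sketch does not remove the need to handle errors after the transfer.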
@digint
Owner
digint commented May 9, 2015

Thanks for the suggestion, but this solution would also cause trouble since I would have to rename the subvolume after transferring, which can also possibly fail and leave garbage behind. I'm really trying hard to avoid this kind of trouble.

Thread on btrfs mailing list: http://thread.gmane.org/gmane.comp.file-systems.btrfs/45024

@digint
Owner
digint commented May 9, 2015

Implemented a bugfix (465a3eb) which always tries to delete the (possibly garbled) subvolume on failure. This should be safe unless you start messing around in target subvolumes by hand.

Branch for this:
https://github.com/digint/btrbk/tree/unsafe_delete_on_receive_errors

@diraimondo

A premise: the fix in the branch would solve the problem in the case of resume_missing=yes. It would not solve the other scenario (my preferred one), in which I don't transfer old snapshots but just the last one.

Given this, I've tested your fix, but there are some problems:

$ pwd
/mnt/btrfs-backup-pool/BTRBK

$ ls 
(nothing!)

$ sudo btrbk -l debug run               
btrbk command line client, version 0.17.0  (Sat May  9 16:45:15 2015)
Using configuration: /etc/btrbk/btrbk.conf
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs subvolume show '/mnt/btrfs-pool' 2>/dev/null
Command execution successful
found btrfs root: {gandalf}/mnt/btrfs-pool
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs subvolume list -a -c -u -q -R '/mnt/btrfs-pool'
Command execution successful
Parsed 19 total subvolumes for filesystem at: {gandalf}/mnt/btrfs-pool
Found 19 subvolume children of: {gandalf}/mnt/btrfs-pool
### /sbin/btrfs subvolume show '/mnt/btrfs-backup-pool/BTRBK' 2>/dev/null
Command execution successful
Parsed 11 subvolume detail items: /mnt/btrfs-backup-pool/BTRBK
### /sbin/btrfs subvolume list -a -c -u -q -R '/mnt/btrfs-backup-pool/BTRBK'
Command execution successful
Parsed 195 total subvolumes for filesystem at: /mnt/btrfs-backup-pool/BTRBK
Found 0 subvolume children of: /mnt/btrfs-backup-pool/BTRBK
Creating subvolume snapshot for: {gandalf}/mnt/btrfs-pool/PACMAN
[btrfs] snapshot (ro):
[btrfs]   host  : gandalf
[btrfs]   source: /mnt/btrfs-pool/PACMAN
[btrfs]   target: /mnt/btrfs-pool/snapshots/PACMAN.20150509
>>> {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs subvolume snapshot -r '/mnt/btrfs-pool/PACMAN' '/mnt/btrfs-pool/snapshots/PACMAN.20150509'
Command execution successful
Checking for missing backups of subvolume "{gandalf}/mnt/btrfs-pool/PACMAN" in: /mnt/btrfs-backup-pool/BTRBK/
Found 0 snapshot children of: {gandalf}/mnt/btrfs-pool/PACMAN
No missing backups found
Creating subvolume backup (send-receive) for: {gandalf}/mnt/btrfs-pool/PACMAN
Found 0 snapshot children of: {gandalf}/mnt/btrfs-pool/PACMAN
No common snapshots of "ssh://gandalf/mnt/btrfs-pool/PACMAN" found in src="{gandalf}/mnt/btrfs-pool/", target="/mnt/btrfs-backup-pool/BTRBK/"
Receiving from snapshot: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
No common parent subvolume present, creating full backup
>>> /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
[btrfs] send/receive (complete):
[btrfs]   source: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
[btrfs]   target: /mnt/btrfs-backup-pool/BTRBK
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs send  '/mnt/btrfs-pool/snapshots/PACMAN.20150509' |  /sbin/btrfs receive -v '/mnt/btrfs-backup-pool/BTRBK/'
At subvol /mnt/btrfs-pool/snapshots/PACMAN.20150509
At subvol PACMAN.20150509
receiving subvol PACMAN.20150509 uuid=98791e4b-4d15-634b-a441-557457d22014, stransid=96274
^C

$ ls -l
drwxr-xr-x 1 root root 6 mag  9 16:45 PACMAN.20150509

$ sudo btrbk -l debug run
btrbk command line client, version 0.17.0  (Sat May  9 16:46:47 2015)
Using configuration: /etc/btrbk/btrbk.conf
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs subvolume show '/mnt/btrfs-pool' 2>/dev/null
Command execution successful
found btrfs root: {gandalf}/mnt/btrfs-pool
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs subvolume list -a -c -u -q -R '/mnt/btrfs-pool'
Command execution successful
Parsed 20 total subvolumes for filesystem at: {gandalf}/mnt/btrfs-pool
Found 20 subvolume children of: {gandalf}/mnt/btrfs-pool
### /sbin/btrfs subvolume show '/mnt/btrfs-backup-pool/BTRBK' 2>/dev/null
Command execution successful
Parsed 11 subvolume detail items: /mnt/btrfs-backup-pool/BTRBK
### /sbin/btrfs subvolume list -a -c -u -q -R '/mnt/btrfs-backup-pool/BTRBK'
Command execution successful
Parsed 196 total subvolumes for filesystem at: /mnt/btrfs-backup-pool/BTRBK
Found 1 subvolume children of: /mnt/btrfs-backup-pool/BTRBK
Creating subvolume snapshot for: {gandalf}/mnt/btrfs-pool/PACMAN
[btrfs] snapshot (ro):
[btrfs]   host  : gandalf
[btrfs]   source: /mnt/btrfs-pool/PACMAN
[btrfs]   target: /mnt/btrfs-pool/snapshots/PACMAN.20150509_1
>>> {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509_1
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs subvolume snapshot -r '/mnt/btrfs-pool/PACMAN' '/mnt/btrfs-pool/snapshots/PACMAN.20150509_1'
Command execution successful
Checking for missing backups of subvolume "{gandalf}/mnt/btrfs-pool/PACMAN" in: /mnt/btrfs-backup-pool/BTRBK/
Found 1 snapshot children of: {gandalf}/mnt/btrfs-pool/PACMAN
Found 0 receive targets in "/mnt/btrfs-backup-pool/BTRBK/" for: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
No matching receive targets found, adding resume candidate: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
Checking schedule for resume candidates
Preserving 2/2 items
Resuming subvolume backup (send-receive) for: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
Found 1 snapshot children of: {gandalf}/mnt/btrfs-pool/PACMAN
No common snapshots of "ssh://gandalf/mnt/btrfs-pool/PACMAN#96487" found in src="{gandalf}/mnt/btrfs-pool/", target="/mnt/btrfs-backup-pool/BTRBK/"
Receiving from snapshot: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
No common parent subvolume present, creating full backup
>>> /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
[btrfs] send/receive (complete):
[btrfs]   source: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
[btrfs]   target: /mnt/btrfs-backup-pool/BTRBK
### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs send  '/mnt/btrfs-pool/snapshots/PACMAN.20150509' |  /sbin/btrfs receive -v '/mnt/btrfs-backup-pool/BTRBK/'
At subvol /mnt/btrfs-pool/snapshots/PACMAN.20150509
At subvol PACMAN.20150509
receiving subvol PACMAN.20150509 uuid=98791e4b-4d15-634b-a441-557457d22014, stransid=96274
ERROR: creating subvolume PACMAN.20150509 failed. File exists
Command execution failed (exitcode=1): "/usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs send  '/mnt/btrfs-pool/snapshots/PACMAN.20150509' |  /sbin/btrfs receive -v '/mnt/btrfs-backup-pool/BTRBK/'"
ERROR: Failed to send/receive btrfs subvolume: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509  -> /mnt/btrfs-backup-pool/BTRBK
send/received failed, deleting (possibly present and garbled) received subvolume: /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
[btrfs] delete (commit-after):
[btrfs]   subvolume: /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
### /sbin/btrfs subvolume delete --commit-after '/mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509'
Command execution successful
WARNING: Deleted partially received (garbled) subvolume: /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
ERROR: Error while resuming backups, aborting
Resumed 1 backups
WARNING: Skipping cleanup of snapshots for subvolume "{gandalf}/mnt/btrfs-pool/PACMAN", as at least one target aborted earlier
Completed within: 3s  (Sat May  9 16:46:50 2015)
--------------------------------------------------------------------------------
Backup Summary (btrbk command line client, version 0.17.0)

    Date:   Sat May  9 16:46:47 2015
    Config: /etc/btrbk/btrbk.conf

Legend:
    +++  created subvolume (source snapshot)
    ---  deleted subvolume
    ***  received subvolume (non-incremental)
    >>>  received subvolume (incremental)
--------------------------------------------------------------------------------
{gandalf}/mnt/btrfs-pool/PACMAN
+++ {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509_1
!!! /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
!!! Target "/mnt/btrfs-backup-pool/BTRBK" aborted: Failed to send/receive subvolume

NOTE: Some errors occurred, which may result in missing backups!
Please check warning and error messages above.

$ ls -l
(nothing again!)

(gandalf host) $ ls -l /mnt/btrfs-pool/snapshots/ | grep PACMAN
drwxr-xr-x 1 root root   6  8 mag 16.38 PACMAN.20150509
drwxr-xr-x 1 root root   6  8 mag 16.38 PACMAN.20150509_1

Please note that the partial target snapshot is gone...

@digint
Owner
digint commented May 10, 2015

This is as expected. Let's read the log file:

First run:

$ sudo btrbk -l debug run               
...
Creating subvolume snapshot for: {gandalf}/mnt/btrfs-pool/PACMAN
>>> {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
...
Creating subvolume backup (send-receive) for: {gandalf}/mnt/btrfs-pool/PACMAN
>>> /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
...
^C

Now you hit ctrl-c, killing the btrbk process (note that btrbk has no signal handler, and I intend to keep it that way).

Now we have

  • {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
  • /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509 (garbled)

Second run:

$ sudo btrbk -l debug run
...
Creating subvolume snapshot for: {gandalf}/mnt/btrfs-pool/PACMAN
>>> {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509_1
...
Resuming subvolume backup (send-receive) for: {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
>>> /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
ERROR: creating subvolume PACMAN.20150509 failed. File exists
send/received failed, deleting (possibly present and garbled) received subvolume: /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
WARNING: Deleted partially received (garbled) subvolume: /mnt/btrfs-backup-pool/BTRBK/PACMAN.20150509
...

Now everything is clean again. btrbk cleaned up the garbled subvolume. By design, btrbk always stops all further action on any failure (we found a garbled subvolume, which is considered a failure).

The subvolumes created by now are:

  • {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509
  • {gandalf}/mnt/btrfs-pool/snapshots/PACMAN.20150509_1

Now, if you ran btrbk a third time, it would resume the backups of those snapshots.
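
The sequence of runs can be traced with a toy model in Python. This is a deliberate simplification and not btrbk's actual logic: `run_backup`, the state strings, and the assumption that each run first adds a new snapshot and then processes every snapshot lacking a complete backup are all made up for illustration.

```python
def run_backup(source, target, new_snapshot, interrupted=False):
    """One simplified backup run.

    source: list of snapshot names on the source side.
    target: dict mapping backup name -> "complete" or "garbled".
    """
    source.append(new_snapshot)
    for name in source:
        if target.get(name) == "complete":
            continue  # already backed up
        if name in target:
            # receive fails with "File exists": delete the leftover, then abort
            del target[name]
            return "aborted"
        if interrupted:
            # Ctrl-C kills the process, so nothing cleans up the partial copy
            target[name] = "garbled"
            return "killed"
        target[name] = "complete"
    return "ok"


source, target = [], {}
print(run_backup(source, target, "PACMAN.20150509", interrupted=True))  # killed
print(run_backup(source, target, "PACMAN.20150509_1"))                  # aborted
print(run_backup(source, target, "PACMAN.20150509_2"))                  # ok
```

In this model the second run only cleans up and aborts, and the third run completes all pending backups.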

Note that in your test, by hitting ctrl-c, you don't let btrbk delete the garbled subvolume. btrbk would have done so if the following command had terminated with exitcode!=0:

### /usr/bin/ssh -i /root/.ssh/id_rsa root@gandalf /sbin/btrfs send  '/mnt/btrfs-pool/snapshots/PACMAN.20150509' |  /sbin/btrfs receive -v '/mnt/btrfs-backup-pool/BTRBK/'

This is why three runs are needed in your example. Either way, at the end all backups should be resumed correctly.
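
The "delete on any non-zero exit code" behaviour can be sketched roughly as follows. This is an illustrative Python sketch, not btrbk's code (btrbk is not written in Python); the function name `send_receive` is hypothetical, and the three commands are passed in as argument lists by the caller.

```python
import subprocess


def send_receive(send_cmd, receive_cmd, delete_cmd):
    """Pipe send_cmd into receive_cmd; if either side exits non-zero,
    run delete_cmd to remove the (possibly garbled) received subvolume.

    Returns True on success, False if the transfer failed.
    """
    sender = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(receive_cmd, stdin=sender.stdout)
    # Close our copy of the pipe so the sender sees a broken pipe
    # if the receiver exits early.
    sender.stdout.close()
    receiver_rc = receiver.wait()
    sender_rc = sender.wait()
    if sender_rc != 0 or receiver_rc != 0:
        # Unsafe by nature: this deletes whatever is at the target path.
        subprocess.run(delete_cmd, check=False)
        return False
    return True
```

With the commands from the log, `send_cmd` would be the `ssh ... btrfs send` call, `receive_cmd` the `btrfs receive` call, and `delete_cmd` the `btrfs subvolume delete --commit-after` call. If the Python process itself is killed with ctrl-c while waiting, the cleanup branch is never reached, which is the situation described above.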

@digint
Owner
digint commented May 10, 2015

Of course there is further improvement possible here, but in my opinion it is crucial to have the problem fixed upstream. Else, btrbk would have to install signal handlers or even store state information. Until then, btrbk will have to live with a workaround which should not be too complicated.

Regarding your scenario "(my preferred) in which I don't transfer old snapshots but just the last-one":
This scenario is more or less covered: the btrbk resume policy is the same as your backup policy. By setting target_preserve_daily 0, btrbk will only keep the last snapshot (which is always needed for incremental backups) until it is scheduled to be preserved as a weekly or monthly backup. Why would you not want to resume all the backups you specified in the config? All you can possibly gain is lower bandwidth usage.
(EDIT: discussion continued in #18)

@diraimondo

OK, now I see your point: I was assuming that hitting CTRL+C was equivalent to letting the transfer fail. I've now tried killing the receiving btrfs instance during the first run. In this case btrbk deletes the incomplete target snapshot without needing a further run.

Thank you for your time.

@digint
Owner
digint commented May 15, 2015

Fixed in 2d445a8
Fix included in v0.17.1

@digint digint closed this May 15, 2015
@gergoerdi

I've just tried using btrbk with the raw transport mechanism (since I'm going to need encryption for my remote backup), and as of 0.22.2, interrupted transfers are not detected at all. This is super dangerous as that means I can end up with what I think is a valid snapshot, then create further incremental snapshots on top of it.

I've tried btrbk clean but it doesn't do anything. btrbk list backups shows the offending truncated image as "up-to-date".

Besides keeping an extra local volume, backing up there using raw, and then rsync'ing it myself, how can I make sure (or at least check after the fact) that btrbk finished uploading my backup?

@digint
Owner
digint commented Mar 22, 2016

I just created a new issue for that (#75).

Regarding your last question: No, there is no way of really checking the integrity of a raw backup other than receiving it into a btrfs filesystem. This makes it very dangerous for incremental backups (see the comment on this in btrbk.conf(5), TARGET TYPES).
If the send stream is corrupt in any way, you basically lose your complete incremental chain.

This is basically why "raw" backups are still considered experimental.
