
Usage with cloud storage like Amazon S3 or Glacier #123

Open
vote539 opened this issue Jan 9, 2017 · 13 comments

@vote539

vote539 commented Jan 9, 2017

I would like to set up a BTRFS filesystem with backups, and was happy to find this project. However, I would like the backups sent to a cloud storage solution rather than to a hard drive or an SSH server. Most of these cloud storage solutions expose RESTful APIs, and you have no control over the storage medium they use on their end.

Does btrbk support sending backups to an arbitrary REST interface?

@digint
Owner

digint commented Jan 9, 2017

Does btrbk support sending backups to an arbitrary REST interface?

No, this is neither implemented nor planned. If you want to push "target raw" backups to your Amazon S3 storage, you need to mount it locally somehow. You could use s3fs for this, which should do exactly that. So your setup could be something like this:

  1. Mount Amazon S3 to /mnt/mys3drive using s3fs.
  2. Configure target raw /mnt/mys3drive/btrbk_backups/... in btrbk.conf (rough sketch below).
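Untested, but a rough sketch of that setup (bucket name, mount point and credentials file are just examples):

# mount the bucket with s3fs (bucket name and credentials file are examples)
s3fs my-bucket /mnt/mys3drive -o passwd_file=/etc/passwd-s3fs

# btrbk.conf
volume /mnt/btr_pool
  subvolume home
    target raw /mnt/mys3drive/btrbk_backups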

If you get this working, please post a note here so that I can add a section on this to the FAQ.

digint added the question label on Jan 9, 2017
@vote539
Author

vote539 commented Jan 17, 2017

Thanks for the reply! Here's what ended up working for me. AWS block storage is in the same price range per gigabyte as S3, so I created a block storage volume and formatted it as BTRFS. I attached the volume to a "nano" head node whose only job is to run btrbk. This setup gives me 500 GB of backup storage for about US$20/mo.
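Roughly, the setup looks like this (device name, mount point and hostname below are illustrative, not my exact values):

# on the backup node: format the attached block storage device as BTRFS and mount it
mkfs.btrfs /dev/xvdf
mount /dev/xvdf /mnt/backup

# on the source machine, btrbk.conf pushes snapshots to the node over SSH
volume /mnt/btr_pool
  subvolume home
    target send-receive ssh://backup-node/mnt/backup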

@Witiko
Contributor

Witiko commented May 24, 2017

Amazon S3 is quite pricey if you are looking only for long-term archival. However, services such as Amazon Glacier don't seem to be easily mountable. It would be convenient if btrbk provided a target type for piping incremental backups into arbitrary commands. Think

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -

where {} would expand to the name of the file being passed on stdin, and where the /usr/bin/glacier command comes from basak/glacier-cli. It seems trivial to just add

btrbk run && btrfs send -p `find snapshot_dir/ -mindepth 1 -maxdepth 1 | sort | tail -2` |
  (insert a compression and encryption pipeline) |
  glacier archive upload my_vault --name=`ls snapshot_dir | tail -1`.btrfs -

to one's crontab and be done with it, but then you also need to keep a journal of unsuccessful uploads (due to the machine being offline, for example), so that everything gets backed up eventually. This is not an insurmountable task, but direct support for this kind of usage in btrbk would definitely be welcome.
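A very rough sketch of such a journal-keeping wrapper (paths, vault name and the compression step are illustrative, and it assumes at least one older snapshot exists to serve as the parent):

#!/bin/sh
# retry pending uploads, then queue and upload the newest snapshot
snapdir=/mnt/btr_pool/.snapshots
journal=/var/lib/btrbk/glacier.pending

btrbk run || exit 1

# remember the newest snapshot so that a failed upload is retried on the next run
latest=$(ls "$snapdir" | tail -1)
grep -qxF "$latest" "$journal" 2>/dev/null || echo "$latest" >> "$journal"

# try to upload every snapshot that is still pending
for snap in $(cat "$journal"); do
  parent=$(ls "$snapdir" | grep -x -B1 "$snap" | head -1)
  if btrfs send -p "$snapdir/$parent" "$snapdir/$snap" \
       | gzip \
       | glacier archive upload my_vault --name="$snap.btrfs.gz" -
  then
    sed -i "\|^$snap\$|d" "$journal"   # drop the entry once the upload succeeded
  fi
done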

@digint
Owner

digint commented May 31, 2017

This is a nice idea, but it's incomplete: as btrbk is stateless, it always needs information about which subvolumes are already present on the target side. For target send-receive, this information is fetched via btrfs subvolume list; for target raw, the UUIDs are encoded in the filenames.

In order to complete this, we should define some data structure: timestamp, UUID, received-UUID, parent-UUID (similar to btrfs subvolume list), and also have a user-defined command which generates it. btrbk would then parse this data and figure out which subvolumes need to be sent to the target according to the configured target_preserve policy, and which parents to pick for incremental send.
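Purely as an illustration (field names and format are just a sketch, not anything btrbk understands today), the user-defined command could print one record per subvolume present on the target:

name=home.20170524  timestamp=20170524T0000  uuid=...  received_uuid=...  parent_uuid=...
name=home.20170531  timestamp=20170531T0000  uuid=...  received_uuid=...  parent_uuid=...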

PS: sorry for the late reply, I'm really busy with other things at the moment...

@Witiko
Contributor

Witiko commented May 31, 2017

My original idea was that btrbk would keep tabs on the successful invocations to automatically infer which volumes need sending. If /usr/bin/glacier archive upload my_vault --name={} - from my example returned with a zero exit code, btrbk would add {} to a list. Note that the user could specify where they want this list stored:

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -
    journal /var/lib/btrbk/glacier

Deleted subvolumes could be removed from the list, so that it does not grow ad infinitum.
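The journal itself could be as simple as one successfully uploaded snapshot name per line, for example:

home.20170524
home.20170531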

@digint
Owner

digint commented May 31, 2017

Yeah well, but then people start deleting files on the target by hand, and the mess with the journal starts...

I guess Glacier also provides some sort of directory listing, so if btrbk generated the filenames the same way as it does for target raw, we could always fetch and parse them in the same way.

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -
      list_cmd /usr/bin/glacier <insert list command here> my_vault

@Witiko
Contributor

Witiko commented May 31, 2017

That would be /usr/bin/glacier archive list my_vault in this case. However, my idea was that the pipe target would be a fire-and-forget kind of thing. If the user wants to start deleting data from the target, that is not our problem. Suppose I am just piping the data to a mail transfer agent over SMTP, or to a remote shell; I may well not be able to report on what is stored on “the other side”. I find this concept more flexible than what you propose.

P.S.: I guess target pipe is a slightly confusing name, as it implies that the target is a named pipe. Both target command and target pipeline resolve this ambiguity.

@digint
Owner

digint commented Jun 1, 2017

However, my idea was that the pipe target would be a fire-and-forget kind of thing

Yes, I understand, and I see the benefit in this, but that's not how btrbk works. Maybe we could introduce a new sub-command for this kind of thing, something like btrbk oneshot, which would simply create a new snapshot and transfer it (always non-incremental) to the target. The main problem here would be to keep the config consistent and non-confusing. Maybe something like this:

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -
      target_type oneshot

@Witiko
Contributor

Witiko commented Jun 1, 2017

and transfer it (always non-incremental) to the target.

Note that keeping a journal would make it possible to transfer incremental backups even in this setting.
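For example (journal path, snapshot directory and naming are illustrative), the parent for the incremental send could simply be the last entry known to have been uploaded:

snapdir=/mnt/btr_pool/.snapshots
latest=$(ls "$snapdir" | tail -1)
parent=$(tail -1 /var/lib/btrbk/glacier)   # last snapshot known to be on the target
btrfs send -p "$snapdir/$parent" "$snapdir/$latest" |
  glacier archive upload my_vault --name="$latest.btrfs" -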

@sbrudenell
Contributor

s3fs

I've been trying to get this to work. There are a number of issues.

  • FUSE is an operational burden, and Docker doesn't help.
  • s3fs is not production-quality
    • After weeks of testing, I haven't been able to use it to upload large files.
    • s3fs' cache options do not play well with btrbk
      • s3fs has a metadata cache, but cat /s3/file; cat /s3/file will still issue two HeadObject requests. This is bad with btrbk as it reads all the *.info files on a raw target on every run.
      • s3fs will cache huge amounts of data to disk during file uploads, rather than streaming them
    • It's not clear s3fs issues will be resolved. Its codebase is undocumented, has heavy copy-paste duplication, uses non-meaningful naming schemes, and interlaces high-level business logic with utility functions. A large portion of it is dedicated to complex ad-hoc manipulations of a userspace cache. The design of this cache is questionable, and I certainly can't get it to perform well. Its user documentation is incoherent.

It would be a huge win if btrbk could use S3 APIs directly. Dozens of cloud providers expose an S3 API now.

The S3 API is a large surface, though. Minimal S3 support probably still requires multiple signature versions, autodetection of multipart uploads, and likely other things.

In the meantime, I suggest btrbk.conf should offer a set of command endpoints, something like:

target pipe
  pipe_target_list_files /usr/local/bin/list_files_from_s3.sh my_bucket
  pipe_target_read_file /usr/local/bin/read_file_from_s3.sh my_bucket
  pipe_target_write_file /usr/local/bin/write_file_to_s3.sh my_bucket

The expected interactions would then mirror target raw: the scripts would be used to read and write the *.info files in the same patterns that are currently used.
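As a strawman, those endpoints could be thin wrappers around the official AWS CLI, which already handles request signing and multipart uploads (script names, bucket and key prefix are illustrative; assume btrbk appends the file name as the last argument):

# list_files_from_s3.sh <bucket> -- print one object key per line
aws s3 ls "s3://$1/btrbk/" | awk '{print $4}'

# read_file_from_s3.sh <bucket> <file> -- stream an object to stdout
aws s3 cp "s3://$1/btrbk/$2" -

# write_file_to_s3.sh <bucket> <file> -- stream stdin into an object
aws s3 cp - "s3://$1/btrbk/$2"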

@lpyparmentier

lpyparmentier commented Jun 6, 2023

Looking for a similar solution: I just want to push an encrypted archive of a snapshot into S3-compatible long-term storage such as https://www.ovhcloud.com/en-ca/public-cloud/cold-archive/. At $2/month/TB it's worth it! I guess I can do it another way, but having it directly integrated with btrbk is a must.

@bojidar-bg

Hmm, instead of implementing the whole S3 API ourselves or jumping the gun with custom scripts, how about adding rclone support for uploading and managing files? It seems to have all the necessary commands, e.g.:

  • rclone rcat -- can be used to pipe directly into storage.
  • rclone lsf -- can be used to list current archives in storage.
  • rclone cat -- can be used to pipe directly out of storage.

The only downside is that rclone has its own config format, which might make it messier than just allowing custom scripts.
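For example, a raw-style backup round-trip could look roughly like this (remote name, bucket, path and snapshot names are placeholders):

# push one snapshot as a compressed stream
btrfs send /mnt/btr_pool/.snapshots/home.20230606 \
  | zstd \
  | rclone rcat s3remote:my-bucket/btrbk/home.20230606.btrfs.zst

# list what is already on the remote
rclone lsf s3remote:my-bucket/btrbk/

# restore: stream an archive back and receive it
rclone cat s3remote:my-bucket/btrbk/home.20230606.btrfs.zst \
  | zstd -d \
  | btrfs receive /mnt/btr_pool/restore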

@kubrickfr

Shameless plug: my simple solution to this problem, https://github.com/kubrickfr/btrfs-send-to-s3
