IMPORTANT: Before using GridBackup, please contact Shawn Willden at email@example.com. The code is at a state where some advice will be required to use it successfully. This should change soon.
GridBackup is a cross-platform, peer-to-peer backup and restore tool designed to provide a high degree of reliability by backing your files up to a distributed peer-to-peer file system, the allmydata.org Tahoe system.
It is designed to create a series of "snapshots" of your file system, which will ultimately be browsable through a graphical tool (somewhat similar to Apple's Time Machine, though probably not as pretty, and hopefully easier to use). To create a snapshot, it scans your files quickly and logs the current state of each. For each file it checks whether a backup has already been made and, if not, adds a "backup job" to a queue. An upload process is responsible for working its way through the queue to put files in the Tahoe grid.
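The scan-then-queue flow described above can be sketched roughly like this (a simplified illustration, not GridBackup's actual code; the function and variable names are invented):

```python
import os
import hashlib

def scan(root, known_hashes, job_queue):
    """Walk the tree; queue an upload job for any file whose content is new.

    known_hashes maps path -> content hash from the previous snapshot;
    job_queue is a plain list standing in for the persistent queue.
    """
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            snapshot[path] = digest  # recorded in the backup log
            if known_hashes.get(path) != digest:
                job_queue.append(path)  # content not yet backed up
    return snapshot
```

A second scan with the previous snapshot's hashes queues nothing, which is the point of the design: scanning is cheap even when uploading is slow.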
The reason for separating scanning and uploading is that uploading is slow. Tahoe splits your data into multiple pieces, and stores those pieces on different storage servers. Only a subset of the pieces are needed to restore your data, so the sum of the pieces is larger than the original file, usually by a factor of 3 or so. This means backing up 100 GiB of data may require uploading 300 GiB of data.
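The expansion factor comes straight from Tahoe's erasure-coding parameters: with k-of-n encoding, any k of the n shares can reconstruct the file, so the total uploaded is n/k times the original. Tahoe's default encoding of 3-of-10 gives roughly the factor of 3 mentioned above (the function below is just illustrative arithmetic, not part of GridBackup):

```python
def expansion_factor(k, n):
    """Ratio of uploaded bytes to original bytes for k-of-n erasure coding."""
    return float(n) / k

# Tahoe's default is 3-of-10: any 3 of the 10 shares suffice to restore,
# at the cost of ~3.3x the upload volume.
print(expansion_factor(3, 10) * 100)  # ~333 GiB uploaded for 100 GiB of data
```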
An implication of the separation, though, is that by the time the uploader gets around to processing a job, the file it's supposed to upload may have changed or been deleted. To reduce the chances of that, the uploader prioritizes its work, favoring recently-changed files on the theory that files that haven't changed for a long time are less likely to change. It also favors small files over large files, on the theory that it's better to get many small files backed up than a few large ones.
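That prioritization amounts to a sort key over queued jobs: newer modification times first, and among equally-fresh files, smaller sizes first (a hypothetical sketch; GridBackup's actual queue ordering may differ):

```python
import time

def job_priority(mtime, size, now=None):
    """Lower value = upload sooner.

    Recently-modified files get priority (they are the most likely to
    change again before upload), and small files beat large ones.
    """
    if now is None:
        now = time.time()
    age = now - mtime
    return (age, size)

jobs = [
    ('old_big.iso',   {'mtime': 1000, 'size': 4 * 2**30}),
    ('new_small.txt', {'mtime': 9000, 'size': 512}),
    ('new_big.mov',   {'mtime': 9000, 'size': 2**30}),
]
jobs.sort(key=lambda item: job_priority(item[1]['mtime'],
                                        item[1]['size'], now=10000))
print([name for name, _ in jobs])  # new_small.txt first, old_big.iso last
```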
Once one version of a file is backed up, though, GridBackup will optimize backing up future revisions of the same file by storing just the changes, rather than the whole file. This should mean that once everything on your machine has been backed up successfully, once, future backups will be quick.
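Storing "just the changes" can be pictured as block-level change detection: hash fixed-size blocks of the new revision and upload only the blocks whose hash differs from the previous revision. This is a deliberately simplified sketch of the idea (GridBackup builds on librsync-style deltas, which are more sophisticated; these names are invented):

```python
import hashlib

BLOCK = 4  # tiny block size for illustration; real tools use kilobytes

def block_hashes(data):
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(old, new):
    """Indices of blocks in `new` that differ from the previous revision."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or old_h[i] != h]

print(changed_blocks(b'aaaabbbbcccc', b'aaaaXXXXcccc'))  # [1]
```

Only block 1 changed, so only block 1 would need uploading; that's why subsequent backups of a mostly-unchanged machine are quick.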
Another optimization is achieved by using what Tahoe calls a "shared convergence secret". This is optional, and using it compromises your privacy very slightly (not in a way that most people would care about), but it also means that if you have the same file as someone else using GridBackup on the same Tahoe grid, then only one of you will actually have to upload it. So common system files will only be uploaded once per grid.
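The deduplication works because the encryption key is derived from the file's content plus the shared secret, so two users holding the same file (and the same convergence secret) produce the same ciphertext, and the grid stores it once. Roughly (a simplified illustration; Tahoe's real key derivation is a tagged hash construction, and the function below is not its actual API):

```python
import hashlib

def convergent_key(convergence_secret, content):
    """Derive an encryption key from the content itself plus a shared secret."""
    return hashlib.sha256(convergence_secret + content).hexdigest()

secret = b'grid-wide shared secret'
alice = convergent_key(secret, b'/bin/ls contents...')
bob   = convergent_key(secret, b'/bin/ls contents...')
print(alice == bob)  # True: identical files yield identical keys,
                     # so the grid stores only one encrypted copy
```

The slight privacy cost is exactly this property: anyone who knows your convergence secret can test whether you have stored a particular known file.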
So you want to start using GridBackup, or at least experimenting with it? Well, at the present, it's really only for the adventurous who are also programmers -- and for my family, who have top-notch technical support :-). If you are a developer, though, and want to pitch in and help things along, the code is pretty approachable, and I try to be very responsive.
- The rudiments of a configuration system are in place, so you can control what gets scanned. If you look in ~/.GridBackup (after running GridBackup once) you'll find a config.ini file.
- The scanner is pretty solid. It scans the configured portions (by default the root partition) and generates a backup log and an upload job queue. Later scans generate additional backup logs and job queues, documenting content and metadata changes (and generating new signatures).
- The uploader is in fairly good shape, but it doesn't yet support backing up deltas (the infrastructure is in place; I'm just holding off on doing deltas until the verifier is done), and it needs some work to properly handle unstable files.
- The restore engine doesn't exist at all.
- The verifier is in a very preliminary state. It works, but needs a lot of improvement.
- The GUI doesn't exist. Like all proper software, GridBackup will be usable from the command line, but will have a GUI built on top of it.
- The whole thing needs a lot of polishing to make it usable.
If you want to fiddle with the code in its current state, you'll need to install Python 2.5 or later (but not 3.0 or later) and pycryptopp. Note that pycryptopp is NOT the same as python-crypto. pycryptopp is what Tahoe uses for its cryptography. The simplest way to get it is with easy_install. Running:
easy_install pycryptopp
(as root) should do it. The easy_install script should have been installed as part of the base Python installation.
To test GridBackup, run:
python setup.py build
and then copy the _librsync library from where the build process places it in build/lib.<platform>/grid_backup to grid_backup (where all of the source files are). Then you should be able to use run_test.sh to execute the unit tests.
To install GridBackup to your system so you can run it, run (as root):
python setup.py install
That will build and install GridBackup into your Python distribution, and will put the "GridBackup" script on your path (e.g. in /usr/local/bin).
GridBackup is configured with a config.ini file which is stored in the .GridBackup subdirectory of your home directory. If you run GridBackup, it will be created automatically with default values. Alternatively, you can create the .GridBackup directory and copy the provided sample.config.ini there, renaming it config.ini. Edit it as appropriate. Unless your username is the same as mine, it won't work without editing.
There are three scripts that you run to use GridBackup: GridBackup, GridUpload and GridVerify.
"GridBackup" scans your hard drive (or the portions of it specified by the config.ini file) and creates backup "logs" and puts them into the grid, and a job queue which is stored locally. The logs describe the content and metadata of your file system, but do not contain the contents of the files. The logs tell the system what files need to be restored to put restore the backup, and what ownership, permissions, etc. should be applied. The job queue is the list of files that need their content uploaded so that the content described in the logs is actually available for restore.
GridBackup will take a fair amount of time the first time it runs, because it has to read every byte of every file. On subsequent runs it only reads the content of changed files, so it's fairly fast.
"GridUpload" processes the job queue and puts all of the file content in the grid. GridUpload takes a long time to run if you have a lot of data. Perhaps months. You can stop it at any time, though. Just hit Control-C. Note that it may take GridUpload a few minutes to shut down when you hit Control-C, though. It tries to shut down in an orderly way. Even if it's stopped "hard", nothing should be lost, but some rework may be done the next time you start it.
"GridVerify" goes through your backup logs and checks if each mentioned file is in the grid, or in the job queue ready to be uploaded to the grid. If not, it prints an error message, and if the file is still available for upload, it re-adds the job to the queue.
NOTE: All three programs use the same database, and it doesn't like concurrent access. This means that you can only run one of them at a time. Generally, this means that you need to stop GridUpload before running GridBackup or GridVerify, and then restart GridUpload when the other program is done. This problem will be fixed.
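Until that's fixed, the single-writer constraint has to be enforced by hand. One generic way to guard against accidentally starting a second program is an exclusive lockfile; this is a sketch of the general pattern, not something GridBackup currently does:

```python
import os

def acquire_lock(path):
    """Create a lockfile exclusively; return None if another instance holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        return None  # lockfile exists: another program is using the database
    os.write(fd, str(os.getpid()).encode())
    return fd

def release_lock(fd, path):
    os.close(fd)
    os.unlink(path)
```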
GridBackup Development Environment
If you want to hack on GridBackup, the easiest way to set it up is to:
- Install pycryptopp, as above
- Run "sudo python setup.py develop". That will replace the files installed by "setup.py build" with links to your source tree, so that running the copy of GridBackup from your path (e.g. /usr/local/bin) actually runs the code from your source tree. Very convenient.
If you want to help out, please send me e-mail at firstname.lastname@example.org. I'll set up a mailing list.