Updated documentation.

cubiclesoft committed Apr 26, 2018
1 parent 14ac1bd commit ab10c2100d214627e2078a86ef17df65d562598d
Showing with 309 additions and 11 deletions.
  1. +1 −0 .gitignore
  2. +105 −11 README.md
  3. +85 −0 docs/cloud-backup-design-spec-and-benchmarks.md
  4. +118 −0 docs/minimum-requirements-for-backup-software.md
.gitignore
/test.php
/cache/*
/restore/*
/retired-support/*
README.md
Cloud Backup
============
A flexible, powerful, and easy to use rolling incremental backup system that pushes collated, compressed, and encrypted data to online cloud storage services, local attached storage, and network storage.
Cloud Backup is third-generation backup software that works on every major platform that matters.
[![Donate](https://cubiclesoft.com/res/donate-shield.png)](https://cubiclesoft.com/donate/)
Features
--------
* All the [standard things you need in a backup system](https://github.com/cubiclesoft/cloud-backup/blob/master/docs/minimum-requirements-for-backup-software.md).
* Transparent compression and encryption before sending data to the storage provider.
* Supports these cloud storage services: [OpenDrive](https://www.opendrive.com/) and [Cloud Storage Server](https://github.com/cubiclesoft/cloud-storage-server).
* Supports local attached storage.
* Block-based storage for [major reductions in the number of API calls made](https://github.com/cubiclesoft/cloud-backup/blob/master/docs/cloud-backup-design-spec-and-benchmarks.md).
* Also has a liberal open source license. MIT or LGPL, your choice.
* Designed for relatively painless integration into your environment.
* Sits on GitHub for all of that pull request and issue tracker goodness, making it easy to submit changes and ideas.
Getting Started
---------------
Download or clone the latest software release. If you do not have PHP installed, then download and install the command-line (CLI) version for your OS (e.g. 'apt install php-cli' on Debian/Ubuntu). Windows users can try [Portable Apache + PHP + MariaDB](https://github.com/cubiclesoft/portable-apache-maria-db-php-for-windows).
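For example, on a Debian/Ubuntu system with git available, getting set up might look like this (adjust the package manager command for your OS):

```
sudo apt install php-cli
git clone https://github.com/cubiclesoft/cloud-backup.git
cd cloud-backup
```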
From a command-line, run:
```
php configure.php
```
The installer asks a series of questions that configure the backup. The configuration tool may be re-run at any time, although some options, such as service selection, can't be changed. Be sure to take advantage of the e-mail notification and file monitoring features.
After the backup has been configured, run it:
```
php backup.php
```
If you encounter any problems, you can test e-mail notifications and service connectivity respectively with these two commands:
```
php test_notifications.php
php test_service.php
```
Once the first backup completes, be sure to verify that it is functioning properly by running:
```
php verify.php
```
Once everything about the backup looks good, which might take several days of running manual backups and verifications, use your system's built-in task scheduler to run 'backup.php' on a regular basis. Under Windows, use [Task Scheduler](http://windows.microsoft.com/en-US/windows/schedule-task). Under most other OSes, use [cron](https://help.ubuntu.com/community/CronHowto).
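For example, a crontab entry along the following lines runs the backup nightly (the PHP binary path, installation path, and log location are assumptions; adjust them for your system):

```
# Run the backup every night at 1:30 AM (assumes Cloud Backup lives in /opt/cloud-backup).
30 1 * * * /usr/bin/php /opt/cloud-backup/backup.php >>/var/log/cloud-backup.log 2>&1
```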
Install and configure a second copy of Cloud Backup for a different backup location. A good backup strategy has one installation for an on-site backup (e.g. an attached hard drive) and one installation that uses an off-site cloud backup service. If the location where the backup tools are installed is in the backup path, be sure to exclude each installation from the other or else they will constantly back up each other's cached files.
Go into the directories where the backup software is installed. Locate the file called 'config.dat'. This is a plain text JSON file containing your backup configuration, but, more importantly, it also contains your encryption keys. Without the file, the backup data is useless. Copy the files to a couple of external thumbdrives and put those thumbdrives somewhere safe. A safe-deposit box at a bank and a decent hiding place at home/work can do wonders here. Cloud Backup makes it possible to accurately recover data even in the face of disaster scenarios.
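For example, on a *NIX system with a thumbdrive mounted at /media/usb (both paths here are purely illustrative):

```
cp /opt/cloud-backup/config.dat /media/usb/cloud-backup-onsite-config.dat
```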
At this point, Cloud Backup is set up. Adding a reminder to a calendar to verify backups on a monthly basis is highly recommended. To verify a backup, run:
```
php verify.php
```
Verification is easy: it spot-checks the backup to confirm that the data still looks valid and displays vital statistics about the files database, which tracks details about the directories and files in the backup.
Example Prebackup Scripts
-------------------------
* [Database export via CSDB](https://github.com/cubiclesoft/csdb/blob/master/docs/csdb_queries.md#generic-database-exportimport)
Restoring Data
--------------
In the event that data needs to be restored from the backup, first verify the backup (sanity check):
```
php verify.php
```
Then start the restoration shell:
```
php restore.php
```
After retrieving the backup information, 'restore.php' asks which backup to load a view of and, once loaded, presents a shell-like command-line interface to access the backup. This extensible interface has the following commands:
* cd, chdir - Change directory.
* dir, ls - List current directory.
* restore - Restores one or more files or directories to a 'restore' subdirectory where the backup software is located.
* groups, users - Show a unique list of groups/users (relevant for *NIX OSes only).
* mapgroup, mapuser - Change all files and directories matching one specific group/user to another (relevant for *NIX OSes only). Temporarily affects groups and users in the SQLite database so that restored files correctly map to the available groups and users on the host.
* stats - Show database statistics.
* help - Show the help screen.
* exit, quit - Leave the shell.
Depending on how much data is being restored, the process can, of course, take a while.
Defragmenting
-------------
A good rule of thumb is to defragment backups once a year. Defragmentation only affects shared blocks. Non-shared blocks are self-defragmenting.
To defragment a backup, manually run:
```
php backup.php -d
```
See [Cloud Backup Design Specifications and Benchmarks](https://github.com/cubiclesoft/cloud-backup/blob/master/docs/cloud-backup-design-spec-and-benchmarks.md) for more details on how the backup system works with regard to shared blocks. As smaller files are added, removed, and changed, the shared block numbers they point to also change. This, over time, implicitly fragments shared blocks that were created earlier. Each shared block still contains the original data but fewer and fewer references to the shared block will exist.
The defragmentation procedure determines whether a shared block has more reclaimable space than two times the small file limit (default 2MB) and, if so, both schedules the shared block for deletion and removes the associated files from the database. The rest of the backup then proceeds normally and perceives the deleted database entries as new files, which are placed into new shared blocks. The end result is an incremental that eventually makes fairly significant changes once it merges into the base. How long that takes, of course, depends on how many incrementals are kept around and the frequency of backups.
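A minimal sketch of that rule follows, with made-up block data standing in for the real database (this is not the actual Cloud Backup code):

```
<?php
	// Hypothetical sketch of the defragmentation rule (illustrative only).
	// Each entry maps a shared block ID to its total size and the bytes still
	// referenced by files in the database.
	$sharedblocks = array(
		427 => array("size" => 10485760, "referenced" => 10100000),
		428 => array("size" => 10485760, "referenced" => 7000000)
	);

	$smallfilelimit = 1048576;            // 1MB default small file limit.
	$wastedlimit = 2 * $smallfilelimit;   // Defragment once this much space is reclaimable.

	foreach ($sharedblocks as $blockid => $info)
	{
		$unused = $info["size"] - $info["referenced"];

		if ($unused > $wastedlimit)
		{
			// The real code schedules the block for deletion and removes its file
			// entries from the database; the next backup sees those entries as new
			// files and places them into new shared blocks.
			echo "Defragmenting shared block " . $blockid . "\n";
		}
	}
?>
```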
Adding Cloud Services
---------------------
Once a cloud storage service meets the minimum criteria, two things have to be built.
And then, of course, it takes time to test the whole thing to make sure it all works properly.
If a service changes its policies so that the above list is no longer true, then support will be dropped. The following services have been retired: Amazon Cloud Drive.
Other Thoughts
--------------
[Cloud Storage Server](https://github.com/cubiclesoft/cloud-storage-server) is a self-hosted cloud storage API that was developed to create a really nice baseline that plays nicely with the Cloud Backup software. Useful for backing up your data to your neighbors' or friends' residences.
There are hundreds of backup/sync software products out there. I've evaluated quite a few of them. Besides Cloud Backup (this software), only two other products are, in my opinion, worth your attention: [rclone](http://rclone.org/) and [Restic](https://restic.github.io/). (Restic appears to use rclone or some variant of it under the hood.) They meet most of the criteria for good backup/sync software and have a decent following. Be aware that those products rely on deltas, which I've found serious fault with.
Check out the [DataHoarder](https://www.reddit.com/r/DataHoarder/) subreddit. It's fun to watch people with 100TB+ attempting to back up their data to various places.
Finally, OpenDrive is a bit weird in that the service is only occasionally mentioned on the Internet, rarely reviewed, and can even be hard to find, but it happens to have a complete, published API (quite rare), no API rate limits (rare), and a non-restrictive EULA (extremely rare). They do have some occasional API connectivity/stability/uptime issues (Update 2018: these finally seem to be fixed?) but I just find it very bizarre that they aren't mentioned/reviewed more frequently - maybe it is the $13/month that drives reviewers away. However, OpenDrive is an excellent choice for businesses that want to securely back up their data off-site via Cloud Backup - mostly because of that EULA. OpenDrive was the first service to be included in Cloud Backup. It took a long time to find them too because Google Search kept burying OpenDrive results.
docs/cloud-backup-design-spec-and-benchmarks.md
Cloud Backup Design Specification and Benchmarks
================================================
This is a technical document with details about how Cloud Backup works. Before using a backup software solution, it is a good idea to be familiar with how it functions under the hood. Backup systems are not perfect - some are better than others depending on the type(s) of data being backed up.
Cloud Backup sends data to remote, possibly untrusted hosts over the Internet (i.e. the cloud). There are certain requirements that need to be met before sending data to such hosts, such as encrypting it in advance.
Files
-----
Let's talk about files for a bit. Directory names and symbolic links are extremely minor bits of information to back up. They are important, sure, for maintaining structure, but they occupy little space and are relatively unimportant. Files, on the other hand, are where data is stored. That data is what is important to most people, and it is a reasonable expectation that a backup system will take good care of it.
Files come in two main types: plain text and binary. Plain text files can be opened in Notepad or another text editor. However, text files are really just a special case of binary files and, from a backup system design perspective, all files should be treated as binary, opaque data.
Files come in all sizes. There are small files, big files, zero-byte files, and everything in-between. A backup system should handle all sizes of files. The most challenging file sizes are those over 2GB due to 32-bit limitations and...thousands of tiny files.
Transferring 1,000 files over to another computer across a network, especially using protocols like (S)FTP, is pretty slow. Transferring a single file that exceeds the total size of the 1,000 separate files over the same network completes in a fraction of the time. This is a repeatable problem. The issue is one of data coalescence. This brings us back to Cloud Backup. Suffice it to say, sending a zillion little tiny files to a cloud storage provider would take forever and cost many, many API calls. Cloud Backup uses a block-based strategy to solve this and other problems with sending data over a network to a destination host.
Blocks
------
A block is a chunk of data. In Cloud Backup, a block may contain one or more files. Blocks may be broken up into parts for easier transmission and error handling over a network.
Cloud Backup has two types of blocks: Shared and non-shared. During a backup, the following logic is used:
* If the file size is under the small file limit after compressing it (1MB by default), the file is placed into the current shared block if there is space; if not, the current shared block is encrypted and uploaded and a new shared block is started for the file.
* Otherwise, a new non-shared block is used and the file is compressed, encrypted, and uploaded solo (see the sketch below).
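Here is a minimal sketch of that placement logic. The variable and function names ($filestobackup, $sharedblocklimit, EncryptAndUploadBlock()) are hypothetical and do not match the actual Cloud Backup internals:

```
<?php
	// Hypothetical sketch only - not the actual implementation.
	$smallfilelimit = 1048576;     // 1MB default small file limit.
	$sharedblocklimit = 10485760;  // Assumed maximum shared block size for illustration.
	$sharedblock = "";             // Data accumulated for the current shared block.

	foreach ($filestobackup as $filename)
	{
		$data = gzcompress(file_get_contents($filename));

		if (strlen($data) < $smallfilelimit)
		{
			// Small file:  goes into the current shared block.  When the block fills
			// up, it is encrypted and uploaded and a new shared block is started.
			if (strlen($sharedblock) + strlen($data) > $sharedblocklimit)
			{
				EncryptAndUploadBlock($sharedblock);

				$sharedblock = "";
			}

			$sharedblock .= $data;
		}
		else
		{
			// Larger file:  gets its own non-shared block and is uploaded solo.
			EncryptAndUploadBlock($data);
		}
	}
?>
```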
The Benchmarks section below shows the impact that the above rules have: approximately 275,000 fewer network requests are made! Gathering the smaller files on a host into larger files first makes a dramatic difference.
Block Parts
-----------
Most blocks have one part, and counting starts at 0. For example, '0_0.dat' is read as block 0, part 0.
Block numbers increment over time and correspond to a matching number in the database.
The default upper limit on the size of a block part in Cloud Backup is 10MB. This limit exists for a number of reasons, but mostly to keep RAM and network usage down. In order to decrypt a block part, it has to be loaded completely into RAM. Due to how PHP works, there might be 2-3 copies of the block part at any given point in time when it is being read/written, which translates to about 30MB RAM. Throw in not wanting to waste transfer limits with failed uploads and 10MB becomes a decent default limit. The configuration file can be modified to change the limits if your backup needs are different but, generally-speaking, the default setting is a good enough starting point for most users.
Block File Naming
-----------------
Cloud Backup names files in a mostly opaque manner. However, there are a few reserved blocks that are stored in the target in specific ways. For example, '0_0.dat' is the compressed, _encrypted_ 'files.db' SQLite database file.
Beyond the first three blocks, determining what data is contained in a given block requires decrypting the block. Without the decryption keys, the data and knowledge about the data is useless.
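As an illustration, a backup target might contain file names like the following. The pattern is '<blocknumber>_<partnumber>.dat' and the non-zero block numbers below are made up:

```
0_0.dat       Block 0, part 0 - the encrypted 'files.db' database (reserved).
1482_0.dat    Block 1482, part 0.
1482_1.dat    Block 1482, part 1 (the block exceeded the 10MB part size limit).
1483_0.dat    Block 1483, part 0.
```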
Encryption
----------
Cloud Backup uses two AES-256-CBC symmetric key and IV pairs to encrypt all data and uses the [standard CubicleSoft two-step encryption method](http://cubicspot.blogspot.com/2013/02/extending-block-size-of-any-symmetric.html) to extend the block size to a minimum of 1MB.
Anyone who wants to reverse-engineer the dual encryption keys has to decrypt 1MB of data twice for every attempt. Even if AES is ever fully broken, your data is still probably safe and secure from prying eyes. The data being encrypted is surrounded with random bytes so that even the same input data results in completely different output. Each block part also includes the size of the data and a hash for verification purposes. The data is also padded with random bytes out to the nearest 4096 byte boundary (4K increments). All of this helps make it that much more difficult for an attacker to guess what a file might contain.
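A highly simplified sketch of layered AES-256-CBC encryption using PHP's OpenSSL functions is shown below. This is illustrative only - the real implementation uses the CubicleSoft two-step method and also prepends random bytes, the data size, and a hash before padding and encrypting:

```
<?php
	// Illustrative only - not the actual Cloud Backup encryption code.
	function PadTo4K($data)
	{
		// Pad with random bytes out to the next 4096-byte boundary.
		$padlen = 4096 - (strlen($data) % 4096);

		return ($padlen < 4096 ? $data . random_bytes($padlen) : $data);
	}

	// Two independent key/IV pairs (normally loaded from 'config.dat').
	$key1 = random_bytes(32);  $iv1 = random_bytes(16);
	$key2 = random_bytes(32);  $iv2 = random_bytes(16);

	$data = "example file contents";

	// Two chained AES-256-CBC passes.
	$pass1 = openssl_encrypt(PadTo4K($data), "aes-256-cbc", $key1, OPENSSL_RAW_DATA, $iv1);
	$pass2 = openssl_encrypt($pass1, "aes-256-cbc", $key2, OPENSSL_RAW_DATA, $iv2);
?>
```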
Since data is encrypted, the keys must be kept safe and a copy of the Cloud Backup configuration should be kept offline so that data can be recovered.
Benchmarks
----------
When it comes to moving large quantities of data, performance is important. Keep in mind that benchmarks are merely demonstrative and that Cloud Backup performs both transparent compression and two rounds of encryption of the data being backed up. In PHP.
The following system was used for the benchmarks:
* Intel Core i7-6700K (6th Gen CPU)
* 32GB RAM - DDR4 2133MHz SDRAM
* Windows 10 Pro 64-bit
* Windows Security Essentials
* Internal 640GB 7200 RPM Western Digital Hard Drive, Caviar, Black, SATA II connection with ~240GB of data to back up (258,162,126,382 bytes)
* External 3TB Western Digital Hard Drive, Green, USB 3.0 connection to back up to
The worst-performing components in that mix were the external hard drive to which data was written and Windows Security Essentials checking every file that was opened. The measured write speed of the drive varied fairly wildly. One moment it plugged along at 25MB/sec and the next it inexplicably plummeted to 5MB/sec. It was a hard drive bought at a bargain basement price and its primary purpose is longer-term storage rather than heavy-duty use.
An initial backup, using the 'local' option during 'configure.php', resulted in the following useful stats:
* Cloud Backup took around 5.5 hours to complete. Data moved through at an average rate of 13.04MB/sec.
* 240GB of data compressed to 161GB. A 32.8% reduction in the amount of data stored.
* 280,537 files were backed up across 39,534 directories. Of those, 275,145 ended up in 427 shared blocks.
The second run, performed 24 hours later, resulted in the following useful stats for the first incremental:
* The backup software took approximately 3 minutes to scan the entire system and create the incremental.
* 244MB of compressed data was stored in the incremental across 23 blocks.
* 30 new folders, 168 shared files, and 4 non-shared files were added.
* 45.5MB compressed (118MB uncompressed) of additional data was added to the total. Several large files obviously changed between the two runs.
All in all, this is a very solid showing for a backup system written in PHP. The cloud service portions of this tool obviously have much longer, slower times - potentially taking days to move the same amount of data over a network that might also have monthly data caps applied.