I'm wondering if it would be hard to implement generating a checksum for each archive. If you don't have time to implement it, maybe you have some advice on where to put this? Perhaps even a separate module?
@szymonpk sounds like a neat idea. I currently have zero knowledge of how to generate a checksum on files; I assume there are unix utilities that can take care of this? It would be great to be able to check whether the local and the remote files are the same, and raise an exception if not.
What's important is that this utility is available on both Linux and OSX. Is it a default utility, or one that you must install via a package manager? Or are there built-in Ruby functions that can do this? As I said, I have zero experience with this, so if you could provide some information on what tools should be used, maybe some simple command snippets so I can try it locally, etc., that would be great!
Also, something like this would likely want its own class/module depending on how it's being implemented. A new object would have to be created inside the storage objects (FTP/SFTP/SCP/etc). Also, I'm not sure whether this would work with services like Amazon S3, Cloud Files, Ninefold, Dropbox, etc. I would assume you need the checksum utility on both the local as well as the remote machine? In which case it would only work with the FTP/SFTP/SCP storage methods.
But, if you could provide me with some more insight into which utilities to use, how they should be used, maybe some code snippets to get a good checksum going so I can test it locally, then I can better tell what/how/where and if it could/should be implemented.
There are many algorithms for generating file checksums. The most popular are MD5 and SHA-1, plus the more advanced variants SHA-224, SHA-256, SHA-384, and SHA-512; most Linux distros ship tools for them. But once we have one implementation, adding others shouldn't be very hard.
I think that generating a checksum on the remote machine won't be possible (except on machines with shell access), but we can dump the generated checksums to a file and upload it to the storage space along with the archive. The most important thing is having a checksum at all! :)
Now let's talk about tools. Ruby's standard Digest library (after require 'digest') can generate each of the mentioned checksums, for example:
# generating md5
Digest::MD5.file('archive.tar.gz').hexdigest
# generating sha1
Digest::SHA1.file('archive.tar.gz').hexdigest
# generating sha512
Digest::SHA512.file('archive.tar.gz').hexdigest
Each of the above examples will generate a completely portable checksum (generating the checksum of the same file will give you the same output on any machine, with any tool that implements the given algorithm).
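That portability claim is easy to demonstrate against a published test vector; for example, the FIPS 180 test string "abc" has a fixed, well-known SHA-1 digest that every conforming implementation must produce. A small sketch with Ruby's standard Digest library:

```ruby
require 'digest'

# SHA-1 of "abc" must match the published FIPS 180 test vector
# on every machine and in every conforming implementation.
sha1 = Digest::SHA1.hexdigest('abc')
puts sha1  # => "a9993e364706816aba3e25717850c26c9cd0d89d"
```

Running `echo -n abc | sha1sum` on any Linux box produces the same digest.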
The only minor problem is the speed of Ruby. Generating the SHA-512 checksum (the most complex one; it takes the most time to compute) of a 1GB file on an Intel Core i7 3.33GHz takes about ~5 seconds in Ruby, versus about ~4 seconds with the system tools. I'm fine with that ;-) but I think you should know.
And one last thing: the checksum output should go to a file, with one entry per archive (if we have multiple archives in one task).
Each line contains a checksum followed by the file name; such a structure makes validating the data easy for command-line tools (with test.sha512 as such a file):
$ sha512sum -c test.sha512
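To sketch what generating such a sum file could look like from Ruby (the archive names here are hypothetical stand-ins; the two-space separator between checksum and name is the format coreutils' sha512sum expects):

```ruby
require 'digest'

# hypothetical archives from one backup task; written here
# as stand-in files so the example is self-contained
archives = ['archive-1.tar.gz', 'archive-2.tar.gz']
archives.each { |f| File.write(f, "dummy data for #{f}") }

# one "<checksum>  <filename>" line per archive, as sha512sum emits
File.open('test.sha512', 'w') do |out|
  archives.each do |f|
    out.puts "#{Digest::SHA512.file(f).hexdigest}  #{f}"
  end
end
```

The resulting test.sha512 can then be verified with `sha512sum -c test.sha512` on the command line.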
And I forgot one thing: the command-line utils are provided by the coreutils package - http://www.gnu.org/software/coreutils/.
Cool, thanks for the detailed explanation!
This, and #241 are what I want to get incorporated if possible in one of the future releases of backup. I'll have to look at the possibilities.
A question though, as I don't know how to get it working; I'm probably doing something wrong.
I have a file called foo.rb and I rename it so the SHA-1 string is prefixed to its name with:
mv foo.rb "$(shasum foo.rb)"
So then I have a file named:
57e585af50fec18c1373e26aa679de8386cb0680 foo.rb
Now when I run this:
shasum -c "57e585af50fec18c1373e26aa679de8386cb0680 foo.rb"
It doesn't return anything and stays silent, but when I run it with the -w flag (which shows warnings):
shasum -wc "57e585af50fec18c1373e26aa679de8386cb0680 foo.rb"
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 1: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 2: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 3: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 4: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 5: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 6: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 7: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680 foo.rb: 8: improperly formatted SHA checksum line
Any idea what I'm doing wrong?
→ shasum -h
Usage: shasum [OPTION] [FILE]...
or: shasum [OPTION] --check [FILE]
Print or check SHA checksums.
With no FILE, or when FILE is -, read standard input.
-a, --algorithm 1 (default), 224, 256, 384, 512
-b, --binary read files in binary mode (default on DOS/Windows)
-c, --check check SHA sums against given list
-p, --portable read files in portable mode
produces same digest on Windows/Unix/Mac
-t, --text read files in text mode (default)
The following two options are useful only when verifying checksums:
-s, --status don't output anything, status code shows success
-w, --warn warn about improperly formatted SHA checksum lines
-h, --help display this help and exit
-v, --version output version information and exit
The sums are computed as described in FIPS PUB 180-2. When checking, the
input should be a former output of this program. The default mode is to
print a line with checksum, a character indicating type (`*' for binary,
`?' for portable, ` ' for text), and name for each FILE.
I think you misunderstood me a little. You don't create a file named '57e585af50fec18c1373e26aa679de8386cb0680 foo.rb', but a file foo.sum WITH the content:
# compute sum and write it to file
$ shasum foo.rb >> foo.sum
# check sum
$ shasum -c foo.sum
Ah, now it works. Yeah, I misunderstood that part. So you basically append multiple sums to that single file when backing up a backup in multiple chunks?
Yes, exactly as you've described. It's easy to check if data is consistent.
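The same multi-line sum file can also be verified from Ruby without shelling out to shasum. A sketch, assuming a foo.sum in the shasum output format (checksum, separator, filename; the split accounts for shasum's type character: a space for text mode, `*` for binary):

```ruby
require 'digest'

# verify every "<sha1>  <filename>" line in a shasum-style sum file;
# returns true only if all listed files match their recorded checksums
def sums_ok?(sum_file)
  File.foreach(sum_file).all? do |line|
    sum, name = line.chomp.split(/ [ *]/, 2)
    Digest::SHA1.file(name).hexdigest == sum
  end
end

# stand-in file and its sum file, so the example is self-contained
File.write('foo.rb', "puts 'hello'\n")
File.write('foo.sum', "#{Digest::SHA1.file('foo.rb').hexdigest}  foo.rb\n")
puts sums_ok?('foo.sum')  # => true
```

If any file in the list has been modified since the sums were written, `sums_ok?` returns false.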
Have you tried to implement this? If not, do you have any ideas about where it should be implemented?
I think @burns might be on to something already. It's something that we'd like to have implemented, but preferably only if it works across all storages (for consistency). We just finally pushed out 3.0.20, so we're now looking into issue #241 and this.
@meskyanichi I would add that, if the backup file is being stored locally first and then transferred to remote storage, it's important to generate a checksum on the local disk before uploading to the remote drive. Once it's on the remote drive, a checksum should be generated there as well to verify file integrity.
That is to say: checksum the file locally, upload it, then checksum it again on the remote side and compare the two results.
I don't know about all the remote storages you have but S3 supports passing an MD5 checksum with the PUT request to save a file: http://aws.amazon.com/articles/1904
"Amazon S3's REST PUT operation provides the ability to specify an MD5 checksum (http://en.wikipedia.org/wiki/Checksum) for the data being sent to S3. When the request arrives at S3, an MD5 checksum will be recalculated for the object data received and compared to the provided MD5 checksum. If there's a mismatch, the PUT will be failed, preventing data that was corrupted on the wire from being written into S3. At that point, you can retry the PUT."
"MD5 checksums are also returned in the response to REST GET requests and may be used client-side to ensure that the data returned by the GET wasn't corrupted in transit. If you need to ensure that values returned by a GET request are byte-for-byte what was stored in the service, calculate the returned value's MD5 checksum and compare it to the checksum returned along with the value by the service."
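For the S3 case, note that the Content-MD5 header on the PUT must carry the base64-encoded binary MD5 digest, not the 32-character hex string. A minimal sketch of computing it with Ruby's Digest, using a stand-in archive file:

```ruby
require 'digest'

# stand-in archive so the example is self-contained
File.write('archive.tar.gz', 'dummy archive contents')

# S3 expects Content-MD5 as the base64 encoding of the raw
# 16-byte MD5 digest, not the usual hex representation
md5_b64 = Digest::MD5.file('archive.tar.gz').base64digest
puts md5_b64  # a 24-character base64 string ending in "=="
```

The resulting string would be passed as the Content-MD5 header on the PUT request; S3 recomputes the digest server-side and fails the request on a mismatch, as the quoted article describes.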
Issue moved to the features repository: backup/backup-features#7