Generating backup checksum #238

Closed
szymonpk opened this Issue Nov 30, 2011 · 11 comments


@szymonpk
Contributor

I'm wondering if it would be hard to implement generating a checksum for each archive. If you don't have time to implement it, maybe you have some advice on where to put this? Maybe even a separate module?

@mrrooijen
Member

@szymonpk sounds like a neat idea. I currently have zero knowledge of how to generate a checksum for files; I assume there are Unix utilities that can take care of this? It would be great to be able to check whether the local and the remote files are the same, and raise an exception if not.

What's important is that this utility is available on both Linux and OSX. Is it a default utility, or one that you must install via a package manager? Or are there built-in Ruby functions that can do this? As I said, I have zero experience with this, so if you could provide some information on what tools should be used, maybe some simple command snippets so I can try it locally, etc., that would be great!

Also, something like this would likely want its own class/module, depending on how it's implemented. A new object would have to be created inside the storage objects (FTP/SFTP/SCP/etc.). Also, I'm not sure whether this would work with services like Amazon S3, Cloud Files, Ninefold, Dropbox, etc. I would assume you need the checksum utility on both the local and the remote machine? In which case it would only work with the FTP/SFTP/SCP storage methods.

But, if you could provide me with some more insight into what utilities to use, how they should be used, and maybe some code snippets to get a good checksum going so I can test it locally, then I can better tell what/how/where and if it could/should be implemented.

Cheers!

@szymonpk
Contributor
szymonpk commented Dec 1, 2011

There are many algorithms for generating file checksums. The most popular methods are MD5 and SHA-1, or the more advanced variants SHA-224, SHA-256, SHA-384 and SHA-512; most Linux distros ship them. But if we have one implementation, adding the others shouldn't be very hard.

I think that generating the checksum on the remote machine won't be possible (besides machines with shell access), but we can dump the generated checksum to a file and upload it to the storage space along with the archive. The most important thing is having a checksum at all! :)

Now let's talk about tools. Ruby has libraries for generating each of the mentioned checksums, for example:

# generating md5
require 'digest/md5'
Digest::MD5.file("data-2011-11-30.sql.gz").hexdigest

# generating sha1
require 'digest/sha1'
Digest::SHA1.file("data-2011-11-30.sql.gz").hexdigest

# generating sha512
require 'digest/sha2' # SHA256/384/512 live in digest/sha2, not digest/sha1
Digest::SHA512.file("data-2011-11-30.sql.gz").hexdigest

Each of the above examples will generate a completely portable checksum (generating the checksum of the same file will give you the same output on any machine, by any tool which implements the given algorithm).

The only minor problem is the speed of Ruby. Generating a SHA-512 checksum (the most complex one, so it takes the most time to compute) of a 1GB file on an Intel Core i7 3.33GHz takes about ~5 seconds in Ruby; generating one with the system tools takes about ~4 seconds. I'm fine with that ;-) but I think you should know.
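
If you want to reproduce that comparison yourself, here's a rough sketch (the file name is an assumption; point it at any large local file):

require 'benchmark'
require 'digest/sha2'

file = "data-2011-11-30.sql.gz" # assumed: any large local file

# wall-clock seconds for Ruby's Digest vs. the coreutils tool
puts Benchmark.realtime { Digest::SHA512.file(file).hexdigest }
puts Benchmark.realtime { `sha512sum #{file}` }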

And one last thing: the checksum output should go to a file, and the file should look like the one below (if we have multiple archives in one task):

04a3e70a42c47df13a47ab0e5ce2a8e3bd5e044fb0e0bf5cb0bf8136d5864ef6c594c0fbcd7424aff90c325ba4cd60038e59fc1898f7ecfebac3bcc79ce1d389  data-2011-11-30.sql.gz
e4b060a5e5d52664af70d45423733f92aabbf37f502e42456cf81a6c547778cd52feb8ec361f89b4470ed58471aca3fea4a167fa6353acf0ad5fd72e50c71380  data-2011-11-29.sql.gz
9c361a80e572fcf2b8dda257c7d09ff30730964ee9f25aa5a3c817de284a8802161ba7f8fa82d30fcb43b64d75fc0faee0edc22b5e5bfd2085f01e169df15975  data-2011-11-28.sql.gz

Each line contains the checksum followed by two spaces and the file name; such a structure makes validating the data easier for command-line tools (test.sha512 is the file from above):

$ sha512sum -c test.sha512 
data-2011-11-30.sql.gz: OK
data-2011-11-29.sql.gz: OK
data-2011-11-28.sql.gz: OK
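
In Ruby, writing such a manifest could look like this (a minimal sketch; the archive names are the example files from above):

require 'digest/sha2'

archives = ["data-2011-11-30.sql.gz",
            "data-2011-11-29.sql.gz",
            "data-2011-11-28.sql.gz"]

# one "<checksum>  <filename>" line per archive (two spaces),
# which is the format `sha512sum -c` expects
File.open("test.sha512", "w") do |manifest|
  archives.each do |path|
    manifest.puts "#{Digest::SHA512.file(path).hexdigest}  #{path}"
  end
end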

And I've forgotten one thing: the command-line utils are provided by the coreutils package - http://www.gnu.org/software/coreutils/.

@mrrooijen
Member

Cool, thanks for the detailed explanation!

This and #241 are what I want to get incorporated, if possible, in one of the future releases of Backup. I'll have to look at the possibilities.

A question though, as I don't know how to get it working; I'm probably doing something wrong.

I have a file called foo.rb, and I attach a SHA-1 string (as a prefix) to its name with:

mv foo.rb "$(shasum foo.rb)"

So then I have:

57e585af50fec18c1373e26aa679de8386cb0680  foo.rb

Now when I run this:

shasum -c "57e585af50fec18c1373e26aa679de8386cb0680  foo.rb"

It doesn't return anything and stays silent, but when I run it with the -w flag (warn about improperly formatted checksum lines):

shasum -wc "57e585af50fec18c1373e26aa679de8386cb0680  foo.rb"
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 1: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 2: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 3: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 4: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 5: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 6: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 7: improperly formatted SHA checksum line
shasum: 57e585af50fec18c1373e26aa679de8386cb0680  foo.rb: 8: improperly formatted SHA checksum line

Any idea what I'm doing wrong?

→ shasum -h
Usage: shasum [OPTION] [FILE]...
   or: shasum [OPTION] --check [FILE]
Print or check SHA checksums.
With no FILE, or when FILE is -, read standard input.

  -a, --algorithm    1 (default), 224, 256, 384, 512
  -b, --binary       read files in binary mode (default on DOS/Windows)
  -c, --check        check SHA sums against given list
  -p, --portable     read files in portable mode
                         produces same digest on Windows/Unix/Mac
  -t, --text         read files in text mode (default)

The following two options are useful only when verifying checksums:
  -s, --status       don't output anything, status code shows success
  -w, --warn         warn about improperly formatted SHA checksum lines

  -h, --help         display this help and exit
  -v, --version      output version information and exit

The sums are computed as described in FIPS PUB 180-2.  When checking, the
input should be a former output of this program.  The default mode is to
print a line with checksum, a character indicating type (`*' for binary,
`?' for portable, ` ' for text), and name for each FILE.

@szymonpk
Contributor
szymonpk commented Dec 3, 2011

I think you misunderstood me a little. You don't create a file named '57e585af50fec18c1373e26aa679de8386cb0680  foo.rb', but a file foo.sum WITH the content:

57e585af50fec18c1373e26aa679de8386cb0680  foo.rb

Full example:

# compute sum and write it to file
$ shasum foo.rb >> foo.sum
# check sum
$ shasum -c foo.sum
foo.rb: OK

@mrrooijen
Member

Ah, now it works. Yeah, I misunderstood that part. So you basically append multiple sums to that single file in the case of backing up a backup in multiple chunks?

@szymonpk
Contributor
szymonpk commented Dec 4, 2011

Yes, exactly as you've described. It makes it easy to check whether the data is consistent.

@szymonpk
Contributor

Have you tried to implement this? If not, do you have any ideas about where it should be implemented?

@mrrooijen
Member

I think @burns might be on to something already. It's something that we'd like to have implemented, but preferably only if it works across all storages (for consistency). We just finally pushed out 3.0.20, so we're now looking into issue #241 and this.

@bitops
bitops commented Jun 12, 2012

@meskyanichi I would add that, if the backup file is being stored locally first and then transferred to remote storage, it's important to generate a checksum on the local disk before uploading to the remote drive. Once it's on the remote drive, a checksum should be generated there as well to verify file integrity.

That is to say (a sketch of this flow follows the list):

  • generate backup file
  • take checksum
  • upload to remote
  • take checksum
  • compare => they should be identical
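
A minimal sketch of that flow, assuming an SSH-accessible remote; the host name and the plain scp/ssh calls are illustrative assumptions, not Backup's actual storage API:

require 'digest/sha2'

archive = "data-2011-11-30.sql.gz"
remote  = "user@backup-host" # assumption: SCP/SFTP-style storage with shell access

# take checksum on the local disk before uploading
local_sum = Digest::SHA512.file(archive).hexdigest

# upload to remote
system("scp", archive, "#{remote}:#{archive}") or raise "upload failed"

# take checksum on the remote side and compare
remote_sum = `ssh #{remote} sha512sum #{archive}`.split.first
raise "checksum mismatch for #{archive}" unless local_sum == remote_sum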

@bmarini
bmarini commented Oct 10, 2012

I don't know about all the remote storages you have, but S3 supports passing an MD5 checksum with the PUT request to save a file: http://aws.amazon.com/articles/1904

"Amazon S3's REST PUT operation provides the ability to specify an MD5 checksum (http://en.wikipedia.org/wiki/Checksum) for the data being sent to S3. When the request arrives at S3, an MD5 checksum will be recalculated for the object data received and compared to the provided MD5 checksum. If there's a mismatch, the PUT will be failed, preventing data that was corrupted on the wire from being written into S3. At that point, you can retry the PUT."

"MD5 checksums are also returned in the response to REST GET requests and may be used client-side to ensure that the data returned by the GET wasn't corrupted in transit. If you need to ensure that values returned by a GET request are byte-for-byte what was stored in the service, calculate the returned value's MD5 checksum and compare it to the checksum returned along with the value by the service."


@tombruijn
Member

Issue moved to the features repository: backup/backup-features#7

@tombruijn tombruijn closed this Sep 20, 2014