Fake the ability to upload an empty archive if requested #7

Merged
merged 0 commits into from Nov 10, 2012

Adds a new command line switch --allow-empty-archives to enable this
behavior. Unfortunately git-annex requires this as it asks the
glacier-cli remote hook to upload empty files if they exist. Everything
except the actual upload, retrieve or delete is done; the archive id is
still stored in the cache like any other archive would be so the
behavior should be identical.

Note that boto itself doesn't handle empty archive upload attempts
properly; you'll need a fixed version that raises an EmptyArchiveError
exception when this happens.

Currently that fix is available in the glacier branch of
https://github.com/petertodd/boto.git at revision
05915453e3887515338ef9c8a062f78c44430058

boto/boto#1083 <- the pull request at boto
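
For illustration only, the behavior described above can be sketched roughly as follows. This is not the code from the patch: the function name, the dict-based cache, the `uploader` callable and the `fake-empty-` id prefix are all assumptions made for the example, and the size check stands in for catching the exception from the patched boto.

```python
import os
import uuid


class EmptyArchiveError(Exception):
    """Stand-in for the exception the patched boto is described as raising."""


def upload_archive(path, cache, uploader, allow_empty_archives=False):
    """Upload `path`, or record a made-up archive id if the file is empty.

    `cache` is a plain dict mapping archive name -> archive id, and
    `uploader` is any callable that performs the real upload and returns
    an archive id. Both are hypothetical stand-ins for this sketch.
    """
    name = os.path.basename(path)
    if os.path.getsize(path) == 0:
        if not allow_empty_archives:
            raise EmptyArchiveError(name)
        # Nothing is sent to Glacier; the id exists only in the local cache.
        archive_id = "fake-empty-" + uuid.uuid4().hex
    else:
        archive_id = uploader(path)
    cache[name] = archive_id
    return archive_id
```

Because the made-up id lives only in the local cache, losing the cache loses the record that the empty archive was ever "uploaded".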

Owner

basak commented Oct 29, 2012

Peter,

Thanks for your work on this! I'm not sure what the right thing to do is here and I'll think about it some further, but I did want to give you my initial reaction.

A few thoughts:

  1. I think this will break the archive sync command, but this could be fixed up fairly easily.
  2. But more importantly, it breaks the cache in that the cache is no longer a cache but a store of authoritative information. If the cache is lost (e.g. after a disaster or use on another machine), then the glacier archive would become invalid from the point of view of git-annex.

I'm not sure if there's an easy solution to this problem. The only one I can think of is to modify git-annex to special case zero length archives. I'm not sure if Joey will want to do this, as it is a special case that wouldn't otherwise have to be, but it is also sort of meaningless to store something zero length in a special annex remote as it would take minimal metadata to deal with restores by recreating them instead of refetching them. But there may be some complications to do with the git-annex backend used - I'm not sure if the size is always known to git-annex or not.

In exactly what case are you getting git-annex storing zero length archives? Do you have encryption disabled?

I'm getting zero-length archives because I'm telling git-annex to put zero-length files into annexes. Basically I've put every single bit of data I care about, which includes a lot of ancient and redundant harddrive images, into a massive annex with a few hundred thousand files in it. Unfortunately it turns out a number of those files were empty, and sure, they could have been just as efficiently stored with plain git add, but they're in annex and glacier-annex should be able to handle that case. You're right about the remote not being used in encrypted mode. It's an easy workaround, but I'm dubious about adding encryption to a backup where it just isn't required.

Why would it break annex sync? Annex sync seems to work just fine for me with this patch...

I think I know where you're going with it breaking the cache, but remember that the information it is authoritative about in this case is what totally made-up archive ids were created for the purpose of pretending that we'd uploaded an empty key. For a trusted glacier remote that will do nothing, because annex will assume that the remote never loses data. Thus glacier-cli checkpresent is never run so even on a different machine the cache is never asked if the key exists. That the key still doesn't exist after an inventory is IMO not a big deal: inventories are painful enough to use as it is.

If the glacier remote is untrusted then yeah, checkpresent returns false. But if the user follows that up by re-uploading the archive all is well again, because the fact that glacier-cli thinks the archive maps to a different archive id than a different repo is immaterial: git annex only knows about annex keys, not amazon archive ids.

However, having said all that... there is one ugly hack you could do to avoid all this: pad every file with a single byte at the beginning. The biggest disadvantage of this approach is probably that you now can't use this fix if, like myself, you realize after the fact that your glacier remote has a problem. This approach is also awkward to implement in that on job retrieval boto is given a file descriptor to write into, and on upload boto is given a file. Also the boto upload API has to be given a file rather than a file descriptor if you want to upload >100MiB files efficiently. (non-multipart)

Owner

basak commented Oct 31, 2012

On Mon, Oct 29, 2012 at 08:09:16PM -0700, Peter Todd wrote:

> I'm getting zero-length archives because I'm telling git-annex to put
> zero-length files into annexes. Basically I've put every single bit of

Could you please confirm if you're using encryption or have it disabled?
If it's disabled, an easy (if inefficient) mechanism might be to enable
encryption, since I think encryption of a zero-byte file would still
lead to some non-zero size of encrypted file. But I'm not sure.

> data I care about, which includes a lot of ancient and redundant
> harddrive images, into a massive annex with a few hundred thousand
> files in it. Unfortunately it turns out a number of those files were
> empty, and sure, they could have been just as efficiently stored with
> plain git add, but they're in annex and glacier-annex should be able
> to handle that case.

I agree that we should be able to handle the case, but I'm not sure if
manipulating the cache will fix it. I'll explain the sync issue below,
but there's still the problem that the archive would not retrieve
correctly from another machine without the hacked cache. I'd consider
this to be a data loss bug.

> Why would it break annex sync? Annex sync seems to work just fine for
> me with this patch...

annex sync will eventually decide that the archive no longer exists in
Glacier since it doesn't appear in an inventory that it should appear
in. At this point it'll warn you and drop the entry from the cache.

This particular issue can be fixed, of course, by having the sync code
special case the cache entry as you have done in the other cases.
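
As a rough sketch of what such a special case might look like, assuming the cache is a simple name-to-id mapping and that made-up ids carry a recognizable marker (the `fake-empty-` prefix and the function name here are hypothetical, not taken from glacier-cli):

```python
FAKE_PREFIX = "fake-empty-"  # hypothetical marker for made-up archive ids


def reconcile_cache(cache, inventory_ids):
    """Drop cached archives missing from the inventory, except fake ones.

    `cache` maps archive name -> archive id; `inventory_ids` is the set of
    archive ids reported by a vault inventory. Returns the names dropped.
    """
    dropped = []
    for name, archive_id in list(cache.items()):
        if archive_id.startswith(FAKE_PREFIX):
            # Never uploaded, so it can never appear in an inventory.
            continue
        if archive_id not in inventory_ids:
            dropped.append(name)
            del cache[name]
    return dropped
```

This keeps sync from discarding the fake entries, but it does nothing for a second machine that never had those entries in its cache.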

> I think I know where you're going with it breaking the cache, but
> remember that the information it is authoritative about in this case is
> what totally made-up archive ids were created for the purpose of
> pretending that we'd uploaded an empty key. For a trusted glacier
> remote that will do nothing, because annex will assume that the remote
> never loses data. Thus glacier-cli checkpresent is never run so even
> on a different machine the cache is never asked if the key exists.
> That the key still doesn't exist after an inventory is IMO not a big
> deal: inventories are painful enough to use as it is.

But on a different machine, an annex get on an empty archive would fail,
right? That's what I'm concerned about.

> If the glacier remote is untrusted then yeah, checkpresent returns
> false. But if the user follows that up by re-uploading the archive all
> is well again, because the fact that glacier-cli thinks the archive
> maps to a different archive id than a different repo is immaterial:
> git annex only knows about annex keys, not amazon archive ids.

What I'm worried about is:

  1. User creates an annex, adds an empty file to the annex, adds a
    glacier remote and moves the file there.
  2. User pushes the annex to another machine for a backup of the annex.
    At this point, the user reasonably expects to be safe from the loss of
    any one machine.
  3. User loses the original machine.
  4. User tries to retrieve the archive on the second machine, and the
    glacier fetch fails.

Now it was an empty archive so the user could recreate it, of course.
But then why are we trying to fix this in glacier-cli in the first
place? Can't we just no-op the writing of zero-byte archives for the same
effect?

> However, having said all that... there is one ugly hack you could do
> to avoid all this: pad every file with a single byte at the beginning.
> The biggest disadvantage of this approach is probably that you now
> can't use this fix if, like myself, you realize after the fact that
> your glacier remote has a problem. This approach is also awkward to
> implement in that on job retrieval boto is given a file descriptor to
> write into, and on upload boto is given a file. Also the boto upload
> API has to be given a file rather than a file descriptor if you want
> to upload >100MiB files efficiently. (non-multipart)

The part I like least about this is that suddenly glacier-cli can't
interoperate with other tools. Unless we added a --pad option that only
the annex hook uses for users who need this. Though actually, this might
be the least invasive fix. I'm sure that boto will accept a pull request
to support uploading of a file with a file object (actually I think it
already does). It would probably be best to pad at the end, since then
retrieving the file without knowing to drop the padding will still work
in many cases, which might save the minds of people recovering from
disasters by hand.
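
A minimal sketch of padding at the end, working on in-memory bytes for brevity rather than the file objects discussed above. The pad value and the function names are assumptions; nothing in this thread fixes them.

```python
PAD = b"\0"  # hypothetical one-byte pad; no value is specified in the thread


def pad(data: bytes) -> bytes:
    """Append one pad byte so that even empty input becomes non-empty."""
    return data + PAD


def unpad(stored: bytes) -> bytes:
    """Strip the trailing pad byte added by pad()."""
    if not stored.endswith(PAD):
        raise ValueError("archive does not end with the expected pad byte")
    return stored[:-len(PAD)]
```

With the pad at the end, a tool that does not know about padding still retrieves the original content followed by one extra byte, which many formats tolerate.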

To be clear, this isn't a decision either way. I'm still in discussion
mode. What do you think?

Robie

Using encryption=shared does avoid this problem; you wind up with bare minimum-sized gpg files of 62 bytes. Of course, it does make it more difficult to get your data back if something goes really wrong.

Your example is quite correct and would fail. With a crypto-hash backend you could at least add an empty file to the store and your file would be available again, but with the WORM backend you'd be screwed. On the other hand while the hash backend doesn't include the size of the file in the key, the WORM backend does, so in the latter case you could manually figure out what files were actually empty.

That said I think all of this points to the most appropriate place for this hack to be in git-annex itself, although it's tricky because the current repository format doesn't explicitly store file sizes except for the WORM backend. You'd have to either change the format, or add in manual special cases for every hash digest that happens to be for empty files.
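
To make the "special case every empty-file digest" idea concrete, here is a rough sketch. It assumes annex keys look like `BACKEND[-sSIZE][-mMTIME]--NAME`, with NAME being a hex digest for hash backends; the function name and the set of backends covered are illustrative only, and this is not how git-annex itself is implemented.

```python
import hashlib

# Digests of zero-length input, computed rather than hard-coded.
EMPTY_DIGESTS = {
    "SHA1": hashlib.sha1(b"").hexdigest(),
    "SHA256": hashlib.sha256(b"").hexdigest(),
    "SHA512": hashlib.sha512(b"").hexdigest(),
}


def key_is_empty_file(key: str) -> bool:
    """Guess whether an annex key refers to a zero-length file.

    Assumes keys look like BACKEND[-sSIZE][-mMTIME]--NAME (an assumption
    of this sketch), where NAME is a hex digest for hash backends.
    """
    fields, sep, name = key.partition("--")
    if not sep:
        return False
    backend, *extras = fields.split("-")
    # A recorded size settles the question without looking at the digest.
    for extra in extras:
        if extra.startswith("s") and extra[1:].isdigit():
            return int(extra[1:]) == 0
    return EMPTY_DIGESTS.get(backend) == name.lower()
```

A key with a recorded size of zero, or a hash key whose digest equals the digest of empty input, is reported as empty; anything else is not.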

I agree w/ interoperability re: pads. The whole point of handling this without special cases is for interoperability mainly, otherwise encryption=shared isn't a bad solution, albeit a little worrisome. (what if you lose every repo? at least without encryption you can get the data back minus metadata)

At least my patch, while not a good idea in the long run, is compatible with a future proper fix in git-annex itself, unlike adding pads. I'll keep using it personally. :) I could write up some documentation for the README to at least warn of the problem and the disadvantages of this hack, as well as the other solution of just using encryption. I'm also sending an email to Joey to discuss this problem; you should get a CC.

@petertodd petertodd merged commit fbd774d into basak:master Nov 10, 2012
