Fake the ability to upload an empty archive if requested #7
boto/boto#1083 <- the pull request at boto
Peter, Thanks for your work on this! I'm not sure what the right thing to do is here and I'll think about it some further, but I did want to give you my initial reaction. A few thoughts:
I'm not sure there's an easy solution to this problem. The only one I can think of is to modify git-annex to special-case zero-length archives. I'm not sure Joey will want to do this, since it's a special case that wouldn't otherwise be needed, and it's also somewhat meaningless to store something zero-length in a special annex remote: it would take minimal metadata to handle restores by recreating such files instead of refetching them. There may also be complications depending on the git-annex backend used; I'm not sure whether the size is always known to git-annex.

In exactly what case are you getting git-annex storing zero-length archives? Do you have encryption disabled?
I'm getting zero-length archives because I'm telling git-annex to put zero-length files into annexes. Basically I've put every single bit of data I care about, which includes a lot of ancient and redundant hard drive images, into a massive annex with a few hundred thousand files in it. Unfortunately it turns out a number of those files were empty, and sure, they could have been stored just as efficiently with plain git add, but they're in the annex and glacier-annex should be able to handle that case.

You're right about the remote not being used in encrypted mode. It's an easy workaround, but I'm dubious about adding encryption to a backup where it just isn't required.

Why would it break annex sync? Annex sync seems to work just fine for me with this patch. I think I know where you're going with it breaking the cache, but remember that the information it is authoritative about in this case is what totally made-up archive ids were created for the purpose of pretending that we'd uploaded an empty key. For a trusted glacier remote that will do nothing, because annex will assume that the remote never loses data; thus glacier-cli checkpresent is never run, so even on a different machine the cache is never asked if the key exists. That the key still doesn't exist after an inventory is IMO not a big deal: inventories are painful enough to use as it is. If the glacier remote is untrusted then yeah, checkpresent returns false. But if the user follows that up by re-uploading the archive, all is well again, because the fact that glacier-cli thinks the archive maps to a different archive id than a different repo does is immaterial: git-annex only knows about annex keys, not Amazon archive ids.

However, having said all that, there is one ugly hack you could do to avoid all this: pad every file with a single byte at the beginning.
The biggest disadvantage of this approach is probably that you now can't use the fix if, like myself, you realize after the fact that your glacier remote has a problem. It's also awkward to implement: on job retrieval boto is given a file descriptor to write into, while on upload boto is given a file. Also, the (non-multipart) boto upload API has to be given a file rather than a file descriptor if you want to upload >100MiB files efficiently.
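The padding workaround discussed above could look something like the following. This is a rough sketch only; `pad_for_upload` and `strip_after_download` are hypothetical helper names, not part of glacier-cli or boto:

```python
# Sketch of the pad-on-upload / strip-on-retrieve workaround discussed
# above. All names here are hypothetical, not glacier-cli's actual API.
import io

PAD = b"\x00"  # single sentinel byte prepended to every archive

def pad_for_upload(path):
    """Return a file-like object whose contents are PAD + original file.

    Note the drawback mentioned above: boto's non-multipart upload path
    wants a real file, so an in-memory copy like this is only practical
    for small archives.
    """
    with open(path, "rb") as f:
        return io.BytesIO(PAD + f.read())

def strip_after_download(src, dst_fd):
    """Copy src into dst_fd, dropping the leading pad byte."""
    first = src.read(1)
    assert first == PAD, "archive was not written by the padding scheme"
    while True:
        chunk = src.read(64 * 1024)
        if not chunk:
            break
        dst_fd.write(chunk)
```

The point of the sentinel byte is that an empty file becomes a one-byte archive, which Glacier will accept; the cost is that existing archives in the remote, written without the pad, can no longer be read back through the same code path.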
On Mon, Oct 29, 2012 at 08:09:16PM -0700, Peter Todd wrote:

Could you please confirm if you're using encryption or have it disabled?

I agree that we should be able to handle the case, but I'm not sure if annex sync will eventually decide that the archive no longer exists in […]

This particular issue can be fixed, of course, by having the sync code […]

But on a different machine, an annex get on an empty archive would fail, […]

What I'm worried about is: […]

Now it was an empty archive so the user could recreate it, of course.

The part I like least about this is that suddenly glacier-cli can't […]

To be clear, this isn't a decision either way. I'm still in discussion […]

Robie
Using encryption=shared does avoid this problem; you wind up with bare minimum-sized gpg files of 62 bytes. Of course, it does make it more difficult to get your data back if something goes really wrong.

Your example is quite correct and would fail. With a crypto-hash backend you could at least add an empty file to the store and your file would be available again, but with the WORM backend you'd be screwed. On the other hand, while the hash backend doesn't include the size of the file in the key, the WORM backend does, so in the latter case you could manually figure out which files were actually empty.

That said, I think all of this points to the most appropriate place for this hack being git-annex itself, although it's tricky because the current repository format doesn't explicitly store file sizes except for the WORM backend. You'd have to either change the format, or add manual special cases for every hash digest that happens to be for an empty file.

I agree with your interoperability point re: pads. The whole point of handling this without special cases is mainly interoperability; otherwise encryption=shared isn't a bad solution, albeit a little worrisome. (What if you lose every repo? At least without encryption you can get the data back, minus metadata.) At least my patch, while not a good idea in the long run, can coexist with a future proper fix in git-annex itself, unlike adding pads. I'll keep using it personally. :)

I could write up some documentation for the README to at least warn of the problem and the disadvantages of this hack, as well as the other solution of just using encryption. I'm also sending an email to Joey to discuss this problem; you should get a CC.
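Since WORM keys embed the file size while hash-backend keys carry a recognizable digest, empty keys can in principle be spotted from the key string alone. A best-effort sketch, assuming the usual git-annex key layouts (`key_is_empty` is a hypothetical helper, not git-annex or glacier-cli code):

```python
import hashlib

# The SHA-256 of zero bytes; any SHA256/SHA256E key carrying this
# digest refers to an empty file.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()

def key_is_empty(key):
    """Best-effort check whether a git-annex key names a zero-length file.

    Assumes the usual key layouts: 'WORM-s<size>-m<mtime>--<name>' and
    'SHA256-s<size>--<digest>' (or SHA256E with an extension). Keys
    that omit the -s<size> field fall back to the digest comparison.
    """
    fields = key.split("--", 1)[0].split("-")
    for f in fields[1:]:
        if f.startswith("s") and f[1:].isdigit():
            return int(f[1:]) == 0
    # No size field: recognize the well-known empty-input digest.
    return key.startswith(("SHA256-", "SHA256E-")) and EMPTY_SHA256 in key
```

This is the "manual special case for every hash digest that happens to be for an empty file" mentioned above, done for one backend; a real fix in git-annex would need the equivalent for each supported digest algorithm.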
Adds a new command line switch --allow-empty-archives to enable this
behavior. Unfortunately git-annex requires this as it asks the
glacier-cli remote hook to upload empty files if they exist. Everything
except the actual upload, retrieve or delete is done; the archive id is
still stored in the cache like any other archive would be so the
behavior should be identical.
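The behavior described above can be sketched roughly as follows. All names here (`upload`, `vault`, `cache`) are hypothetical stand-ins for glacier-cli's real code, not its actual API; `create_archive_from_file` is boto's Layer2 vault upload call:

```python
import os
import uuid

def upload(vault, path, cache, allow_empty_archives=False):
    """Sketch of the --allow-empty-archives behavior described above."""
    if os.path.getsize(path) == 0:
        if not allow_empty_archives:
            raise RuntimeError("refusing to upload an empty archive")
        # Glacier cannot store zero-length archives, so fake it:
        # invent an archive id and record it in the cache exactly as a
        # real upload would be recorded, but never talk to Glacier.
        archive_id = "FAKE-EMPTY-" + uuid.uuid4().hex
    else:
        archive_id = vault.create_archive_from_file(path)
    cache[archive_id] = os.path.basename(path)
    return archive_id
```

Retrieval and deletion would be short-circuited the same way: a fake archive id is answered with a zero-length download, or silently dropped from the cache, without any Glacier request.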
Note that boto itself doesn't handle empty archive upload attempts
properly; you'll need a fixed version that raises an EmptyArchiveError
exception when this happens.
Currently that fix is available in the glacier branch of
https://github.com/petertodd/boto.git at revision
05915453e3887515338ef9c8a062f78c44430058
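Since stock boto does not define such an exception, a caller of the patched branch would guard the upload roughly like this. `EmptyArchiveError` is declared as a placeholder here purely to illustrate the control flow; in practice it would be imported from the patched boto:

```python
# Stand-in for the EmptyArchiveError raised by the patched boto branch
# mentioned above; stock boto does not define it, so we declare a
# placeholder to make the control flow runnable.
class EmptyArchiveError(Exception):
    pass

def upload_or_fake(do_upload, fake_id):
    """Try a real upload; fall back to a fake id for empty archives."""
    try:
        return do_upload()
    except EmptyArchiveError:
        return fake_id
```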