Commits on Feb 14, 2010
  1. cmd-ls: use the new vfs layer.

    committed Feb 14, 2010
  2. cmd-ftp: a new command-line client you can use for browsing your repo.

    It acts kind of like the 'ftp' command; hence the name.  It even has
    readline and filename autocompletion!
    The new vfs layer stuff should be useful for cmd-ls and cmd-fuse too.
    committed Feb 14, 2010
  3. Another suspicious fix for CatPipe parallelism.

    This really shouldn't be necessary: it's clear to me that the 'it' object
    should be going out of scope right away, and thus getting cleaned up by the
    garbage collector.
    But on one of my Linux PCs (with python 2.4.4) it fails the unit tests
    unless I add this patch.  Oh well, let's do it then.
    committed Feb 14, 2010
  4. hashsplit: smallish files (less than BLOB_MAX) weren't getting split.

    This buglet was introduced when doing my new fanout cleanups.  It's
    relatively unimportant, but it would cause a bit of space wastage for
    smallish files that changed by a bit, since we couldn't take advantage of
    deduplication for their blocks.
    This also explains why the --fanout argument test broke earlier.  I thought
    I was going crazy (since the whole fanout implementation had changed and the
    number now means something slightly different), so I just removed it.  But
    now we can bring it back and it passes again.
    committed Feb 14, 2010
Commits on Feb 13, 2010
  1. Make CatPipe objects more resilient when interrupted.

    If we stopped iterating halfway through a particular object, the iterator
    wouldn't finish reading all the data, which would mess up the state of
    the git-cat-file pipe.  Now we read all the data even if we're going to just
    throw it away.
    committed Feb 13, 2010
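A minimal sketch of the draining idea from the CatPipe commit above. It is not bup's actual code: `read_object`, its parameters, and the in-memory `pipe` are hypothetical stand-ins for an iterator reading one object from a shared git-cat-file pipe. The point is that the unread remainder of the current object is consumed even when the caller stops iterating early, so the next object starts at the right position.

```python
import io

def read_object(stream, size, chunk_size=4):
    """Yield `size` bytes from `stream` in chunks, always consuming all of them.

    Hypothetical helper: the stream plays the role of a shared pipe that
    carries several objects back to back.
    """
    remaining = size
    try:
        while remaining > 0:
            chunk = stream.read(min(chunk_size, remaining))
            if not chunk:
                break  # stream ended early; nothing left to drain
            remaining -= len(chunk)
            yield chunk
    finally:
        # Runs when the generator finishes, is closed, or is garbage
        # collected: read and discard whatever the caller did not consume.
        while remaining > 0:
            skipped = stream.read(min(chunk_size, remaining))
            if not skipped:
                break
            remaining -= len(skipped)

# Two objects back to back on the same "pipe": 10 bytes, then 5 bytes.
pipe = io.BytesIO(b"AAAAAAAAAABBBBB")

it = read_object(pipe, 10)
first = next(it)  # the caller reads one chunk and then gives up
it.close()        # explicit close triggers the drain in `finally`

second = b"".join(read_object(pipe, 5))  # next object is read intact
```

Calling `close()` explicitly, as in the usage above, avoids relying on the garbage collector to run the cleanup at a predictable time.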
  2. bup join: continue gracefully if one of the requested files does not …

    This makes it work more like 'cat'.  If any of the requested files is
    missing, the final return code is nonzero.
    committed Feb 13, 2010
Commits on Feb 12, 2010
  1. _hashsplit.c: right shifting 32 bits doesn't work.

    In C, if you do
    	uint32_t i = 0xffffffff;
    	i >>= 32;
    then the answer can be 0xffffffff, not 0 as you might expect, because
    shifting a 32-bit value by 32 or more bits is undefined behavior in C.
    Let's shift it by less than 32 at a time, which will give the right
    results.  This fixes a rare infinite loop when counting the bits in the
    hashsplit.
    committed Feb 12, 2010
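The shift problem in the commit above can be illustrated with a small Python emulation. This is a sketch under an assumption: it models hardware that keeps only the low 5 bits of the shift count for 32-bit operands (as x86 does), which is one common way a shift by 32 ends up leaving the value unchanged. The function names are hypothetical and this is not the code in _hashsplit.c.

```python
MASK32 = 0xFFFFFFFF

def shr32_masked_count(value, count):
    """Emulate a 32-bit right shift whose count is reduced modulo 32.

    Assumption: this models x86-style masking of the shift count.
    """
    return (value & MASK32) >> (count & 31)

def shr32_in_steps(value, count, step=16):
    """Shift right by `count` bits using pieces that are each below 32."""
    value &= MASK32
    while count > 0:
        n = min(step, count)
        value = shr32_masked_count(value, n)
        count -= n
    return value

single = shr32_masked_count(0xFFFFFFFF, 32)  # count 32 is masked to 0
stepped = shr32_in_steps(0xFFFFFFFF, 32)     # two shifts of 16 bits each

print(hex(single))   # 0xffffffff
print(hex(stepped))  # 0x0
```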
  2. Fix building under cygwin.

    I attempted to build the latest under cygwin and ran into this:
    creating build/temp.cygwin-1.7.1-i686-2.5
    gcc -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/include/python2.5 -c _hashsplit.c -o b
    creating build/lib.cygwin-1.7.1-i686-2.5
    gcc -shared -Wl,--enable-auto-image-base build/temp.cygwin-1.7.1-i686-2.5/_hashsplit.o -L/usr/lib/python2.5/config -lpython2.5 -o build/lib.cygwin-1.7.1-i686-2.5/_hashsplit.dll
    cp build/*/_hashsplit..dll .
    cp: cannot stat `build/*/_hashsplit..dll': No such file or directory
    make: *** [_hashsplit.dll] Error 1
    Some investigation turned up that the Makefile was mistakenly referencing
    _hashsplit..dll (as in the cp command above) instead of _hashsplit.dll.
    Changing the Makefile to expand the extension macro for the detected
    platform now allows problem-free building.
    Steve Diver committed with Feb 12, 2010
  3. hashsplit: totally change the way the fanout stuff works.

    Useless code churn or genius innovation?  You decide.
    The previous system for naming chunks of a split file was kind of lame.  We
    tried to name the files something that was "almost" their offset, so that
    filenames wouldn't shuffle around too much if a few bytes were added/deleted
    here and there.  But that totally failed to work if a *lot* of bytes were
    added, and it also lost the useful feature that you could seek to a specific
    point in a file (like a VM image) without restoring the whole thing.
    "Approximate" offsets aren't much good for seeking to.
    The new system is even more crazy than the original hashsplit: we now use
    the "extra bits" of the rolling checksum to define progressively larger
    chunks.  For example, we might define a normal chunk if the checksum ends in
    0xFFF (12 bits).  Now we can group multiple chunks together when the
    checksum ends in 0xFFFF (16 bits).  Because of the way the checksum works,
    this happens about every 2^4 = 16 chunks.  Similarly, 0xFFFFF (20 bits) will
    happen 16 times less often than that, and so on.  We can use this effect to
    define a tree.
    Then, in each branch of the tree, we name files based on their (exact, not
    approximate) offset *from the start of that tree*.
    Essentially, inserting/deleting/changing bytes will affect more "levels" of
    the rolling checksum, mangling bigger and bigger branches of the overall
    tree and causing those branches to change.  However, only the content of
    that sub-branch (and the *names*, ie offsets, of the following branches at
    that and further-up levels) end up getting changed, so the effect can be
    mostly localized.  The subtrees of those renamed trees are *not* affected,
    because all their offsets are relative to the start of their own tree.  This
    means *most* of the sha1sums in the resulting hierarchy don't need to
    change, no matter how much data you add/insert/delete.
    Anyway, the net result is that "git diff -M" now actually does something
    halfway sensible when comparing the trees corresponding to huge split files.
    Only halfway (because the chunk boundaries can move around a bit, and such
    large files are usually binary anyway) but it opens the way for much cooler
    algorithms in the future.
    Also, it'll now be possible to make 'bup fuse' open files without restoring
    the entire thing to a temp file first.  That means restoring (or even
    *using*) snapshotted VMs ought to become possible.
    committed Feb 12, 2010
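The "extra bits" idea in the fanout commit above can be sketched as follows. The constants and function names are hypothetical, chosen to match the 12-bit and 4-bit figures in the commit message. The sketch only shows how a tree level could be derived from the number of trailing one-bits in a checksum, not how bup computes the rolling checksum or builds the tree.

```python
BLOB_BITS = 12    # assumed: a chunk ends when the low 12 bits are all ones
FANOUT_BITS = 4   # assumed: every 4 further one-bits moves one level up

def count_trailing_ones(value):
    """Count consecutive one-bits at the low end of a non-negative int."""
    n = 0
    while value & 1:
        value >>= 1
        n += 1
    return n

def split_level(checksum):
    """Return None if `checksum` is not a chunk boundary, else its tree level.

    Level 0 ends an ordinary chunk, level 1 also ends a group of chunks,
    level 2 ends a group of groups, and so on.
    """
    ones = count_trailing_ones(checksum)
    if ones < BLOB_BITS:
        return None
    return (ones - BLOB_BITS) // FANOUT_BITS

print(split_level(0x0FFE))   # None (not a boundary)
print(split_level(0x0FFF))   # 0
print(split_level(0xFFFF))   # 1
print(split_level(0xFFFFF))  # 2
```

If the checksum bits are roughly uniform, each extra level requires 4 more one-bits, which is why a level-1 boundary shows up about once per 2^4 = 16 ordinary chunks.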
  4. cmd-split and hashsplit: cleaning up in preparation for refactoring.

    Theoretically, this doesn't actually change any functionality.
    committed Feb 12, 2010
  5. cmd-join: don't restart git cat-file so frequently.

    We would restart cat-file for every id passed on the command line or via
    stdin, which was needlessly inefficient.
    committed Feb 12, 2010
Commits on Feb 11, 2010
  1. Replace randomgen with a new 'bup random' command.

    Now we can override the random seed.  Plus we can specify units
    thanks to a new helpers.parse_num() function, so it's not always kb.
    Thus, we can now just do
    	bup random 50G
    to generate 50 gigs of random data for testing.
    Update "bup split" parameter parsing to use parse_num() too while we're
    at it.
    committed Feb 11, 2010
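For illustration, a unit-aware size parser along the lines the commit above describes might look like this. It is a hypothetical `parse_size`, not the real helpers.parse_num(). The accepted suffixes and the use of binary multiples (1k = 1024) are assumptions.

```python
import re

# Assumed multipliers: binary units, case-insensitive, optional trailing "b".
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}

def parse_size(text):
    """Parse strings such as '1024', '50G' or '1.5k' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt]?)b?\s*", text.lower())
    if not match:
        raise ValueError("invalid size: %r" % text)
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])

print(parse_size("50G"))   # 53687091200
print(parse_size("1.5k"))  # 1536
```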
  2. Documentation: correctly mark .md.tmp files as "precious".

    Otherwise, if you don't have pandoc installed, they get repeatedly
    regenerated for no good reason.
    committed Feb 11, 2010
Commits on Feb 10, 2010
  1. midx: automatically ignore .midx files if one of their .idx is missing.

    That implies that a pack has been deleted, so the entire .midx is pretty
    much worthless.  'bup midx -a' will generate a new one.
    committed Feb 10, 2010
  2. midx: prune redundant midx files automatically.

    After running 'bup midx -f', all previous midx files become redundant.
    Throw them away if we end up opening a midx file that supersedes them.
    Also cleans up some minor code bits in
    committed Feb 10, 2010
  3. Fix building on MacOS X on PowerPC.

    bup failed to build on one of my machines, an older iMac; make
    died ~40 lines in with "gcc-4.0: Invalid arch name : Power".
    On PPC machines, uname -m returns the helpfully descriptive
    "Power Macintosh", which gcc doesn't recognize. Some googling
    revealed e.g.
    where they use $(shell arch) to get the necessary info.
    With that little change, bup built on ppc and i386 machines for
    me, and passed all tests.
    andrewschleifer committed with Feb 10, 2010
Commits on Feb 9, 2010
  1. README: bup now has more reasons it's cool and fewer not to use it.

    Clearly we're making some progress.  I look forward to a world in which
    we can finally delete the "reasons bup is stupid" section.
    committed Feb 9, 2010
  2. index: if files were already deleted, don't dirty the index.

    We had a bug where any deleted files in the index would always dirty all
    their parent directories when refreshing, which is inefficient.
    committed Feb 9, 2010
  3. cmd-save: don't recurse into already-valid subdirs.

    When iterating through the index, if we find out that a particular dir (like
    /usr) has a known-valid sha1sum and isn't marked as changed, there's no need
    to recurse into it at all.  This saves some pointless grinding through the
    index when entire swaths of the tree are known to be already valid.
    committed Feb 9, 2010
  4. cmd-index/cmd-save: correctly mark directories as dirty/clean.

    Previously, we just ignored the IX_HASHVALID on directories, and regenerated
    their hashes on every backup regardless.  Now we correctly save directory
    hashes and mark them IX_HASHVALID after doing a backup, as well as removing
    IX_HASHVALID all the way up the tree whenever a file is marked as invalid.
    committed Feb 9, 2010
  5. Fix some list comprehensions that I thought were generator comprehens…

    Apparently [x for x in whatever] yields a list, not an iterator, which means
    two things:
      - it might use more memory than I thought
      - you definitely don't need to write list([...]) since it's already a
        list.
    Clean up a few of these.  You learn something new every day.
    committed Feb 9, 2010
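A short illustration of the distinction drawn in the commit above, using made-up variable names: square brackets build the whole list immediately, while parentheses create a generator expression that produces values on demand.

```python
squares_list = [n * n for n in range(5)]  # list: fully built right away
squares_gen = (n * n for n in range(5))   # generator: values made lazily

# Wrapping a list comprehension in list() just copies an existing list.
redundant = list([n * n for n in range(5)])

print(type(squares_list).__name__)  # list
print(type(squares_gen).__name__)   # generator
print(redundant == squares_list)    # True
```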
Commits on Feb 8, 2010
  1. don't try non-quick fsck on damaged repositories.

    It turns out that older versions of git (1.5.x or so) have a git-verify-pack
    that goes into an endless loop when it hits certain kinds of corruption, and
    our test would trigger it almost every time.  Using --quick avoids calling
    git-verify-pack, so it won't exhibit the problem.
    Unfortunately this means a slightly less thorough test of non-quick
    bup-fsck, but it'll have to do.  Better than failing tests nonstop, anyway.
    Reported by Eduardo Kienetz.
    committed Feb 8, 2010
Commits on Feb 6, 2010
  1. Merge remote branch 'origin/master'

    * origin/master:
      cmd-margin: work correctly in python 2.4 when a midx is present.
    committed Feb 6, 2010
  2. cmd-save: fix a potential divide by zero error.

    In the progress calculation stuff.
    committed Feb 6, 2010
  3. cmd-ls: a line got lost and it didn't work at all.

    Also add a trivial test for bup ls to prevent this sort of thing in the
    future.
    committed Feb 6, 2010
  4. cmd-margin: work correctly in python 2.4 when a midx is present.

    And add a test so this doesn't happen again.
    committed Feb 6, 2010
Commits on Feb 5, 2010
  1. Infrastructure for generating a markdown-based man page using pandoc.

    The man page (bup.1) is total drivel for the moment, though.  And arguably
    we could split up the manpages per subcommand like git does, but maybe
    that's overkill at this stage.
    committed Jan 24, 2010
  2. list subcommands in alphabetical order.

    We were forgetting to sort the output of listdir().
    committed Feb 5, 2010
  3. bup save: try to estimate the time remaining.

    Naturally, estimating the time remaining is one of those things that sounds
    super easy, but isn't.  So the numbers wobble around a bit more than I'd
    like, especially at first.  But apply a few scary heuristics, and boom!
    Stuff happens.
    committed Feb 5, 2010
  4. When receiving an index from the server, print a handy progress message.

    This is less boring than seeing a blank screen while we download 5+ megs of
    committed Feb 5, 2010
  5. bup-server: revert to non-midx indexes when suggesting a pack.

    Currently midx files can't tell you *which* index contains a particular
    hash, just that *one* of them does.  So bup-server was barfing when it
    expected MultiPackIndex.exists() to return a pack name, and was getting a
    .midx file instead.
    We could have loosened the assertion and allowed the server to suggest a
    .midx file... but those can be huge, and it defeats the purpose of only
    suggesting the minimal set of packs so that lightweight clients aren't
    committed Feb 5, 2010
  6. Narrow the exception handling in cmd-save.

    If we encountered an error *writing* the pack, we were counting it as a
    non-fatal error, which was not the intention.  Only *reading* files we want
    to back up should be considered non-fatal.
    committed Feb 5, 2010
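The narrowing described in the cmd-save commit above can be sketched like this. `back_up`, `read_file`, and `write_pack` are hypothetical names, not bup's API. The `try` block covers only the read, so a failure while writing is never swallowed.

```python
def back_up(paths, read_file, write_pack):
    """Back up each path; collect read errors, let write errors propagate."""
    errors = []
    for path in paths:
        try:
            data = read_file(path)      # only the read is guarded
        except OSError as exc:
            errors.append((path, exc))  # non-fatal: note it and move on
            continue
        write_pack(data)                # outside the try: failures are fatal
    return errors

def demo_read(path):
    """Stand-in reader that fails for one particular path."""
    if path == "missing":
        raise OSError("cannot read %s" % path)
    return path.encode()

written = []
errors = back_up(["a", "missing", "b"], demo_read, written.append)
print(written)                         # [b'a', b'b']
print([path for path, _ in errors])    # ['missing']
```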