Commits on Feb 12, 2010
  1. _hashsplit.c: right shifting 32 bits doesn't work.

    in C, if you do
    	uint32_t i = 0xffffffff;
    	i >>= 32;
    then the behavior is undefined in C, because the shift count equals the
    width of the type.  On x86 you typically get 0xffffffff, not 0 as you might
    expect.  Let's shift it by less than 32 at a time, which will give the right
    results.  This fixes a rare infinite loop when counting the bits in the
    hashsplit.
    committed Feb 12, 2010
  2. Fix building under cygwin.

    I attempted to build the latest under cygwin and ran into this:
    creating build/temp.cygwin-1.7.1-i686-2.5
    gcc -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-
    prototypes -I/usr/include/python2.5 -c _hashsplit.c -o b
    creating build/lib.cygwin-1.7.1-i686-2.5
    gcc -shared -Wl,--enable-auto-image-base build/temp.cygwin-1.7.1-
    i686-2.5/_hashsplit.o -L/usr/lib/python2.5/config -lpyt
    hon2.5 -o build/lib.cygwin-1.7.1-i686-2.5/_hashsplit.dll
    cp build/*/_hashsplit..dll .
    cp: cannot stat `build/*/_hashsplit..dll': No such file or directory
    make: *** [_hashsplit.dll] Error 1
    Some investigation turned up that Makefile was mistakenly referencing instead of _hashsplit.dll
    Changing the Makefile to expand the extension macro for the detected
    platform, now allows problem free building.
    Steve Diver committed with Feb 12, 2010
  3. hashsplit: totally change the way the fanout stuff works.

    Useless code churn or genius innovation?  You decide.
    The previous system for naming chunks of a split file was kind of lame.  We
    tried to name the files something that was "almost" their offset, so that
    filenames wouldn't shuffle around too much if a few bytes were added/deleted
    here and there.  But that totally failed to work if a *lot* of bytes were
    added, and it also lost the useful feature that you could seek to a specific
    point in a file (like a VM image) without restoring the whole thing.
    "Approximate" offsets aren't much good for seeking to.
    The new system is even more crazy than the original hashsplit: we now use
    the "extra bits" of the rolling checksum to define progressively larger
    chunks.  For example, we might define a normal chunk if the checksum ends in
    0xFFF (12 bits).  Now we can group multiple chunks together when the
    checksum ends in 0xFFFF (16 bits).  Because of the way the checksum works,
    this happens about every 2^4 = 16 chunks.  Similarly, 0xFFFFF (20 bits) will
    happen 16 times less often than that, and so on.  We can use this effect to
    define a tree.
    Then, in each branch of the tree, we name files based on their (exact, not
    approximate) offset *from the start of that tree*.
    Essentially, inserting/deleting/changing bytes will affect more "levels" of
    the rolling checksum, mangling bigger and bigger branches of the overall
    tree and causing those branches to change.  However, only the content of
    that sub-branch (and the *names*, ie offsets, of the following branches at
    that and further-up levels) end up getting changed, so the effect can be
    mostly localized.  The subtrees of those renamed trees are *not* affected,
    because all their offsets are relative to the start of their own tree.  This
    means *most* of the sha1sums in the resulting hierarchy don't need to
    change, no matter how much data you add/insert/delete.
    Anyway, the net result is that "git diff -M" now actually does something
    halfway sensible when comparing the trees corresponding to huge split files.
    Only halfway (because the chunk boundaries can move around a bit, and such
    large files are usually binary anyway) but it opens the way for much cooler
    algorithms in the future.
    Also, it'll now be possible to make 'bup fuse' open files without restoring
    the entire thing to a temp file first.  That means restoring (or even
    *using*) snapshotted VMs ought to become possible.
    committed Feb 12, 2010
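    A minimal sketch of the level rule described in this commit, assuming the
    12-bit base mask and 4 extra bits per tree level used in the example above.
    The function name and its parameters are hypothetical and are not taken from
    bup's code.

```python
def fanout_level(checksum, base_bits=12, fanout_bits=4):
    """Return None if this checksum is not a chunk boundary.

    Otherwise return the tree level the boundary closes: 0 for an ordinary
    chunk, 1 for a group of chunks, 2 for a group of groups, and so on.
    base_bits and fanout_bits are illustrative assumptions.
    """
    base_mask = (1 << base_bits) - 1
    if checksum & base_mask != base_mask:
        return None  # low bits are not all ones: no boundary here

    # Count how many additional full groups of one-bits follow the base mask.
    rest = checksum >> base_bits
    group_mask = (1 << fanout_bits) - 1
    level = 0
    while rest & group_mask == group_mask:
        level += 1
        rest >>= fanout_bits  # always shifts by fewer bits than the word size
    return level
```

    Under these assumptions, a checksum ending in 0xFFFF closes a level-1 group
    and one ending in 0xFFFFF closes a level-2 group.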
  4. cmd-split and hashsplit: cleaning up in preparation for refactoring.

    Theoretically, this doesn't actually change any functionality.
    committed Feb 12, 2010
  5. cmd-join: don't restart git cat-file so frequently.

    We would restart cat-file for every id passed on the command line or via
    stdin, which was needlessly inefficient.
    committed Feb 12, 2010
Commits on Feb 11, 2010
  1. Replace randomgen with a new 'bup random' command.

    Now we can override the random seed.  Plus we can specify units
    thanks to a new helpers.parse_num() function, so it's not always kb.
    Thus, we can now just do
    	bup random 50G
    to generate 50 gigs of random data for testing.
    Update "bup split" parameter parsing to use parse_num() too while we're
    at it.
    committed Feb 11, 2010
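    A rough sketch of what a unit-aware size parser could look like.  This is
    not the actual helpers.parse_num(); the name parse_size_sketch, the accepted
    suffixes, the use of binary multiples (1k = 1024) and treating a bare number
    as bytes are all assumptions made for illustration.

```python
import re

# Assumed unit table: binary multiples, case-insensitive, optional trailing "b".
_UNITS = {
    '': 1, 'b': 1,
    'k': 1024, 'kb': 1024,
    'm': 1024 ** 2, 'mb': 1024 ** 2,
    'g': 1024 ** 3, 'gb': 1024 ** 3,
    't': 1024 ** 4, 'tb': 1024 ** 4,
}


def parse_size_sketch(text):
    """Parse strings such as '50G' or '1.5k' into a byte count (hypothetical)."""
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*', text)
    if not match:
        raise ValueError('not a number: %r' % text)
    number, unit = match.groups()
    unit = unit.lower()
    if unit not in _UNITS:
        raise ValueError('unknown unit: %r' % unit)
    return int(float(number) * _UNITS[unit])
```

    With these assumptions, parse_size_sketch('50G') gives the byte count that
    a command like 'bup random 50G' would need.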
  2. Documentation: correctly mark .md.tmp files as "precious".

    Otherwise, if you don't have pandoc installed, they get repeatedly
    regenerated for no good reason.
    committed Feb 11, 2010
Commits on Feb 10, 2010
  1. midx: automatically ignore .midx files if one of their .idx is missing.

    That implies that a pack has been deleted, so the entire .midx is pretty
    much worthless.  'bup midx -a' will generate a new one.
    committed Feb 10, 2010
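    The check described here can be sketched as follows, assuming the list of
    .idx file names covered by a .midx is already available.  The function name
    and its arguments are hypothetical; how bup actually stores and reads that
    list is not shown.

```python
import os


def midx_is_usable(idx_names, pack_dir):
    """Return True only if every .idx the midx refers to still exists.

    idx_names is an assumed list of file names recorded in the midx;
    pack_dir is the directory where the packs and indexes live.
    """
    return all(os.path.exists(os.path.join(pack_dir, name))
               for name in idx_names)
```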
  2. midx: prune redundant midx files automatically.

    After running 'bup midx -f', all previous midx files become redundant.
    Throw them away if we end up opening a midx file that supersedes them.
    Also cleans up some minor code bits in
    committed Feb 10, 2010
  3. Fix building on MacOS X on PowerPC.

    bup failed to build on one of my machines, an older iMac; make
    died ~40 lines in with "gcc-4.0: Invalid arch name : Power".
    On PPC machines, uname -m returns the helpfully descriptive
    "Power Macintosh", which gcc doesn't recognize. Some googling
    revealed e.g.
    where they use $(shell arch) to get the necessary info.
    With that little change, bup built on ppc and i386 machines for
    me, and passed all tests.
    andrewschleifer committed with Feb 10, 2010
Commits on Feb 9, 2010
  1. README: bup now has more reasons it's cool and fewer not to use it.

    Clearly we're making some progress.  I look forward to a world in which
    we can finally delete the "reasons bup is stupid" section.
    committed Feb 9, 2010
  2. index: if files were already deleted, don't dirty the index.

    We had a bug where any deleted files in the index would always dirty all
    their parent directories when refreshing, which is inefficient.
    committed Feb 9, 2010
  3. cmd-save: don't recurse into already-valid subdirs.

    When iterating through the index, if we find out that a particular dir (like
    /usr) has a known-valid sha1sum and isn't marked as changed, there's no need
    to recurse into it at all.  This saves some pointless grinding through the
    index when entire swaths of the tree are known to be already valid.
    committed Feb 9, 2010
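    The pruning idea can be illustrated with a toy tree.  The node layout (dicts
    with 'path', 'valid' and 'children' keys) is an assumption for this sketch
    and does not reflect bup's real index format.

```python
def paths_to_save(node):
    """Yield paths that still need saving, skipping known-valid subtrees.

    Each node is a hypothetical dict: {'path': str, 'valid': bool,
    'children': [nodes]}.
    """
    if node['valid']:
        return  # whole subtree is known-good: do not visit its children
    yield node['path']
    for child in node.get('children', []):
        yield from paths_to_save(child)
```

    A valid directory such as /usr is skipped in one step, however many entries
    sit below it.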
  4. cmd-index/cmd-save: correctly mark directories as dirty/clean.

    Previously, we just ignored the IX_HASHVALID on directories, and regenerated
    their hashes on every backup regardless.  Now we correctly save directory
    hashes and mark them IX_HASHVALID after doing a backup, as well as removing
    IX_HASHVALID all the way up the tree whenever a file is marked as invalid.
    committed Feb 9, 2010
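    A small sketch of clearing the valid flag all the way up the tree.  The
    flag's numeric value and the dict-based "index" are placeholders chosen for
    illustration, not bup's actual representation.

```python
import posixpath

IX_HASHVALID = 0x1  # placeholder bit value, assumed for this sketch


def invalidate(flags, path):
    """Clear IX_HASHVALID on path and on every parent directory.

    flags is a hypothetical dict mapping path -> flag bits.
    """
    while True:
        if path in flags:
            flags[path] &= ~IX_HASHVALID
        parent = posixpath.dirname(path)
        if parent == path:
            break  # reached the root (or an empty relative path)
        path = parent
```

    Sibling directories keep their flag, so a later save can still skip them.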
  5. Fix some list comprehensions that I thought were generator comprehens…

    Apparently [x for x in whatever] yields a list, not an iterator, which means
    two things:
      - it might use more memory than I thought
      - you definitely don't need to write list([...]) since it's already a
        list.
    Clean up a few of these.  You learn something new every day.
    committed Feb 9, 2010
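    The difference is easy to see in a few lines.  This example uses current
    Python syntax; the variable names are made up.

```python
numbers = range(5)

# Square brackets build the whole list in memory immediately.
as_list = [n * n for n in numbers]

# Parentheses build a lazy generator that produces values on demand.
as_gen = (n * n for n in numbers)

# Wrapping a list comprehension in list() only makes a redundant copy.
redundant_copy = list([n * n for n in numbers])

# A generator expression can be passed directly to a consuming function.
total = sum(n * n for n in numbers)
```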
Commits on Feb 8, 2010
  1. don't try non-quick fsck on damaged repositories.

    It turns out that older versions of git (1.5.x or so) have a git-verify-pack
    that goes into an endless loop when it hits certain kinds of corruption, and
    our test would trigger it almost every time.  Using --quick avoids calling
    git-verify-pack, so it won't exhibit the problem.
    Unfortunately this means a slightly less thorough test of non-quick
    bup-fsck, but it'll have to do.  Better than failing tests nonstop, anyway.
    Reported by Eduardo Kienetz.
    committed Feb 8, 2010
Commits on Feb 6, 2010
  1. Merge remote branch 'origin/master'

    * origin/master:
      cmd-margin: work correctly in python 2.4 when a midx is present.
    committed Feb 6, 2010
  2. cmd-save: fix a potential divide by zero error.

    In the progress calculation stuff.
    committed Feb 6, 2010
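    One way to guard a progress percentage against a zero total.  The function
    name is hypothetical, and reporting 100% when there is nothing to back up is
    an arbitrary choice made for this sketch.

```python
def percent_done(done, total):
    """Return progress as a percentage without dividing by zero."""
    if total <= 0:
        return 100.0  # assumed convention: nothing to do counts as finished
    return done * 100.0 / total
```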
  3. cmd-ls: a line got lost and it didn't work at all.

    Also add a trivial test for bup ls to prevent this sort of thing in the
    future.
    committed Feb 6, 2010
  4. cmd-margin: work correctly in python 2.4 when a midx is present.

    And add a test so this doesn't happen again.
    committed Feb 6, 2010
Commits on Feb 5, 2010
  1. Infrastructure for generating a markdown-based man page using pandoc.

    The man page (bup.1) is total drivel for the moment, though.  And arguably
    we could split up the manpages per subcommand like git does, but maybe
    that's overkill at this stage.
    committed Jan 24, 2010
  2. list subcommands in alphabetical order.

    We were forgetting to sort the output of listdir().
    committed Feb 5, 2010
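    os.listdir() returns entries in arbitrary order, so the fix amounts to
    sorting.  In this sketch the 'cmd-' prefix and '.py' suffix are assumptions
    about how subcommand files are named, and list_subcommands is a made-up
    helper.

```python
import os


def list_subcommands(directory, prefix='cmd-', suffix='.py'):
    """Return subcommand names found in directory, in alphabetical order."""
    names = []
    for entry in os.listdir(directory):  # order is not guaranteed
        if entry.startswith(prefix) and entry.endswith(suffix):
            names.append(entry[len(prefix):-len(suffix)])
    return sorted(names)  # sort the stripped names, not the raw file names
```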
  3. bup save: try to estimate the time remaining.

    Naturally, estimating the time remaining is one of those things that sounds
    super easy, but isn't.  So the numbers wobble around a bit more than I'd
    like, especially at first.  But apply a few scary heuristics, and boom!
    Stuff happens.
    committed Feb 5, 2010
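    For comparison, the naive estimate looks like the sketch below: assume the
    average rate so far continues.  This is not the heuristic the commit uses
    (which is not described here); the function name and arguments are
    hypothetical.

```python
def estimate_remaining(elapsed_secs, bytes_done, bytes_total):
    """Naive linear estimate of seconds remaining, or None if unknown."""
    if bytes_done <= 0 or elapsed_secs <= 0:
        return None  # no rate can be computed yet
    rate = bytes_done / elapsed_secs  # average bytes per second so far
    return max(bytes_total - bytes_done, 0) / rate
```

    An estimate like this swings widely early on, when very little data has been
    processed, which is the wobble the message mentions.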
  4. When receiving an index from the server, print a handy progress message.

    This is less boring than seeing a blank screen while we download 5+ megs of
    committed Feb 5, 2010
  5. bup-server: revert to non-midx indexes when suggesting a pack.

    Currently midx files can't tell you *which* index contains a particular
    hash, just that *one* of them does.  So bup-server was barfing when it
    expected MultiPackIndex.exists() to return a pack name, and was getting a
    .midx file instead.
    We could have loosened the assertion and allowed the server to suggest a
    .midx file... but those can be huge, and it defeats the purpose of only
    suggesting the minimal set of packs so that lightweight clients aren't
    committed Feb 5, 2010
  6. Narrow the exception handling in cmd-save.

    If we encountered an error *writing* the pack, we were counting it as a
    non-fatal error, which was not the intention.  Only *reading* files we want
    to back up should be considered non-fatal.
    committed Feb 5, 2010
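    The narrowing can be sketched like this: only the read is wrapped in the
    try block, so a failed write propagates.  save_files, read_file and
    write_chunk are hypothetical names, not functions from cmd-save.

```python
def save_files(paths, read_file, write_chunk):
    """Back up each path; collect read errors, let write errors propagate."""
    errors = []
    for path in paths:
        try:
            data = read_file(path)  # unreadable source file: non-fatal
        except (IOError, OSError) as e:
            errors.append((path, e))
            continue
        write_chunk(data)  # outside the try: a failure here is fatal
    return errors
```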
Commits on Feb 4, 2010
  1. bup index: fix progress message printing when using -v.

    It wasn't printing often enough, and thus was absent more often than
    committed Feb 4, 2010
  2. On python 2.4 on MacOS X, __len__() must return an int.

    We were already returning integers, which seem to be "long ints" in this
    case, even though they're relatively small.  Whatever, we'll typecast them
    to int first, and now unit tests pass.
    committed Feb 4, 2010
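    The workaround is an explicit cast inside __len__().  The class below is a
    made-up stand-in; note that modern Python has a single int type, so the
    int versus long distinction from Python 2.4 no longer arises there.

```python
class PackIndexSketch:
    """Hypothetical container whose stored count may not be a plain int."""

    def __init__(self, count):
        self._count = count

    def __len__(self):
        # Cast explicitly so len() always receives a plain int.
        return int(self._count)
```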
  3. Merge branch 'indexrewrite'

    * indexrewrite:
      Greatly improved progress reporting during index/save.
      Fix bugs in new indexing code.
      Speed up cmd-drecurse by 40%.
      Split directory recursion stuff from into
      Massive speedups to bupindex code.
    committed Feb 4, 2010
  4. Greatly improved progress reporting during index/save.

    Now that the index reading stuff is much faster, we can afford to waste time
    reading through it just to count how many bytes we're planning to back up.
    And that lets us print really friendly progress messages during bup save, in
    which we can tell you exactly what fraction of your bytes have been backed
    up so far.
    committed Feb 4, 2010
  5. Fix bugs in new indexing code.

    The logic was way too screwy, so I've simplified it a lot.  Also extended
    the unit tests quite a bit to replicate the weird problems I was having.  It
    seems pretty stable - and pretty fast - now.
    Iterating through an index of my whole home directory (bup index -p ~) now
    takes about 5.1 seconds, vs. 3.5 seconds before the rewrite.  However,
    iterating through just a *fraction* of the index can now bypass all the
    parts we don't care about, so it's much much faster than before.
    Could probably still stand some more optimization eventually, but at least
    the file format allows for speed.  The rest is just code :)
    committed Feb 3, 2010
Commits on Feb 3, 2010
  1. Speed up cmd-drecurse by 40%.

    It's now 40% faster, ie. 1.769 seconds or so to go through my home
    directory, instead of the previous 2.935.
    Still sucks compared to the native C 'find' command, but that's probably
    about as good as it's getting in pure python.
    committed Feb 3, 2010
  2. Split directory recursion stuff from into

    Also add a new command, 'bup drecurse', which just recurses through a
    directory tree and prints all the filenames.  This is useful for timing
    performance vs. the native 'find' command.
    The result is a bit embarrassing; for my home directory of about 188000
    files, drecurse is about 10x slower:
    $ time bup drecurse -q ~
    real	0m2.935s
    user	0m2.312s
    sys	0m0.580s
    $ time find ~ -printf ''
    real	0m0.385s
    user	0m0.096s
    sys	0m0.284s
    $ time find ~ -printf '%s\n' >/dev/null
    real	0m0.662s
    user	0m0.208s
    sys	0m0.456s
    committed Feb 3, 2010
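    A pure-Python directory walk of the kind being timed here can be sketched
    with os.walk().  This is only an illustration of the task, not the code
    behind 'bup drecurse'; the function name is made up.

```python
import os


def drecurse_sketch(top):
    """Yield the path of every file under top, directory by directory."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # in-place sort makes the traversal order stable
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```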