Commits on Feb 9, 2010
  1. index: if files were already deleted, don't dirty the index.

    apenwarr committed Feb 9, 2010
    We had a bug where any deleted files in the index would always dirty all
    their parent directories when refreshing, which is inefficient.
  2. cmd-save: don't recurse into already-valid subdirs.

    apenwarr committed Feb 9, 2010
    When iterating through the index, if we find out that a particular dir (like
    /usr) has a known-valid sha1sum and isn't marked as changed, there's no need
    to recurse into it at all.  This saves some pointless grinding through the
    index when entire swaths of the tree are known to be already valid.
  3. cmd-index/cmd-save: correctly mark directories as dirty/clean.

    apenwarr committed Feb 9, 2010
    Previously, we just ignored the IX_HASHVALID on directories, and regenerated
    their hashes on every backup regardless.  Now we correctly save directory
    hashes and mark them IX_HASHVALID after doing a backup, as well as removing
    IX_HASHVALID all the way up the tree whenever a file is marked as invalid.
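The upward invalidation walk can be sketched like this (a minimal sketch; the `Entry` class and `parent` field are invented here, only the IX_HASHVALID flag name comes from the commit — bup's actual index code differs):

```python
IX_HASHVALID = 1 << 1

class Entry:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.flags = IX_HASHVALID  # assume valid until something changes

    def invalidate(self):
        """Clear IX_HASHVALID here and on every ancestor directory."""
        e = self
        while e and (e.flags & IX_HASHVALID):
            e.flags &= ~IX_HASHVALID
            e = e.parent  # stop early once an ancestor is already invalid

root = Entry('/')
usr = Entry('/usr', root)
f = Entry('/usr/bin/python', usr)
f.invalidate()  # marks the file, /usr, and / as needing a re-hash
```

The early exit matters: once a parent is already invalid, all of its ancestors must be too, so the walk never does redundant work.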
  4. Fix some list comprehensions that I thought were generator comprehensions.

    apenwarr committed Feb 9, 2010
    Apparently [x for x in whatever] yields a list, not an iterator, which means
    two things:
      - it might use more memory than I thought
      - you definitely don't need to write list([...]) since it's already a
        list.
    
    Clean up a few of these.  You learn something new every day.
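The distinction the commit describes is easy to demonstrate:

```python
squares_list = [x * x for x in range(5)]   # builds the whole list in memory
squares_gen = (x * x for x in range(5))    # lazy generator expression

print(type(squares_list).__name__)  # list
print(type(squares_gen).__name__)   # generator

# list([...]) is redundant: the comprehension is already a list.
assert list([x * x for x in range(5)]) == squares_list
```

Only the parenthesized form is lazy; the square-bracket form always materializes every element up front.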
Commits on Feb 8, 2010
  1. test.sh: don't try non-quick fsck on damaged repositories.

    apenwarr committed Feb 8, 2010
    It turns out that older versions of git (1.5.x or so) have a git-verify-pack
    that goes into an endless loop when it hits certain kinds of corruption, and
    our test would trigger it almost every time.  Using --quick avoids calling
    git-verify-pack, so it won't exhibit the problem.
    
    Unfortunately this means a slightly less thorough test of non-quick
    bup-fsck, but it'll have to do.  Better than failing tests nonstop, anyway.
    
    Reported by Eduardo Kienetz.
Commits on Feb 6, 2010
  1. Merge remote branch 'origin/master'

    apenwarr committed Feb 6, 2010
    * origin/master:
      cmd-margin: work correctly in python 2.4 when a midx is present.
  2. cmd-save: fix a potential divide by zero error.

    apenwarr committed Feb 6, 2010
    In the progress calculation stuff.
  3. cmd-ls: a line got lost and it didn't work at all.

    apenwarr committed Feb 6, 2010
    Also add a trivial test for bup ls to prevent this sort of thing in the
    future.
  4. cmd-margin: work correctly in python 2.4 when a midx is present.

    apenwarr committed Feb 6, 2010
    And add a test so this doesn't happen again.
Commits on Feb 5, 2010
  1. Infrastructure for generating a markdown-based man page using pandoc.

    apenwarr committed Jan 24, 2010
    The man page (bup.1) is total drivel for the moment, though.  And arguably
    we could split up the manpages per subcommand like git does, but maybe
    that's overkill at this stage.
  2. bup.py: list subcommands in alphabetical order.

    apenwarr committed Feb 5, 2010
    We were forgetting to sort the output of listdir().
  3. bup save: try to estimate the time remaining.

    apenwarr committed Feb 5, 2010
    Naturally, estimating the time remaining is one of those things that sounds
    super easy, but isn't.  So the numbers wobble around a bit more than I'd
    like, especially at first.  But apply a few scary heuristics, and boom!
    Stuff happens.
  4. When receiving an index from the server, print a handy progress message.

    apenwarr committed Feb 5, 2010
    This is less boring than seeing a blank screen while we download 5+ megs of
    stuff.
  5. bup-server: revert to non-midx indexes when suggesting a pack.

    apenwarr committed Feb 5, 2010
    Currently midx files can't tell you *which* index contains a particular
    hash, just that *one* of them does.  So bup-server was barfing when it
    expected MultiPackIndex.exists() to return a pack name, and was getting a
    .midx file instead.
    
    We could have loosened the assertion and allowed the server to suggest a
    .midx file... but those can be huge, and it defeats the purpose of only
    suggesting the minimal set of packs so that lightweight clients aren't
    overwhelmed.
  6. Narrow the exception handling in cmd-save.

    apenwarr committed Feb 5, 2010
    If we encountered an error *writing* the pack, we were counting it as a
    non-fatal error, which was not the intention.  Only *reading* files we want
    to back up should be considered non-fatal.
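The narrowing amounts to shrinking the `try` block so only the read is guarded (a sketch of the idea, not cmd-save's actual code; the function and the `pack` object are invented):

```python
import io, os, tempfile

def save_file(path, pack):
    try:
        with open(path, 'rb') as f:   # reading the source may fail non-fatally
            data = f.read()
    except OSError as e:
        print('skipping %r: %s' % (path, e))
        return
    pack.write(data)  # a write error here propagates and aborts the save

pack = io.BytesIO()
fd, tmp = tempfile.mkstemp()
os.write(fd, b'hello')
os.close(fd)
save_file(tmp, pack)               # backed up normally
save_file('/no/such/file', pack)   # skipped with a warning, not fatal
os.unlink(tmp)
```

Anything raised by `pack.write()` now escapes the function, which is the intended fatal behaviour.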
Commits on Feb 4, 2010
  1. bup index: fix progress message printing when using -v.

    apenwarr committed Feb 4, 2010
    It wasn't printing often enough, and thus was absent more often than
    present.
  2. On python 2.4 on MacOS X, __len__() must return an int.

    apenwarr committed Feb 4, 2010
    We were already returning integers, which seem to be "long ints" in this
    case, even though they're relatively small.  Whatever, we'll typecast them
    to int first, and now unit tests pass.
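The fix is just an explicit cast in `__len__` (class and attribute names here are invented for illustration):

```python
class PackIndex:
    def __init__(self, count):
        self.count = count

    def __len__(self):
        # Python 2.4 on MacOS X required a plain int from __len__;
        # returning a long (even a small one) raised TypeError, hence the cast.
        return int(self.count)

print(len(PackIndex(42)))
```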
  3. Merge branch 'indexrewrite'

    apenwarr committed Feb 4, 2010
    * indexrewrite:
      Greatly improved progress reporting during index/save.
      Fix bugs in new indexing code.
      Speed up cmd-drecurse by 40%.
      Split directory recursion stuff from cmd-index.py into drecurse.py.
      Massive speedups to bupindex code.
  4. Greatly improved progress reporting during index/save.

    apenwarr committed Feb 4, 2010
    Now that the index reading stuff is much faster, we can afford to waste time
    reading through it just to count how many bytes we're planning to back up.
    
    And that lets us print really friendly progress messages during bup save, in
    which we can tell you exactly what fraction of your bytes have been backed
    up so far.
  5. Fix bugs in new indexing code.

    apenwarr committed Feb 3, 2010
    The logic was way too screwy, so I've simplified it a lot.  Also extended
    the unit tests quite a bit to replicate the weird problems I was having.  It
    seems pretty stable - and pretty fast - now.
    
    Iterating through an index of my whole home directory (bup index -p ~) now
    takes about 5.1 seconds, vs. 3.5 seconds before the rewrite.  However,
    iterating through just a *fraction* of the index can now bypass all the
    parts we don't care about, so it's much much faster than before.
    
    Could probably still stand some more optimization eventually, but at least
    the file format allows for speed.  The rest is just code :)
Commits on Feb 3, 2010
  1. Speed up cmd-drecurse by 40%.

    apenwarr committed Feb 3, 2010
    It's now 40% faster, i.e. 1.769 seconds or so to go through my home
    directory, instead of the previous 2.935.
    
    Still sucks compared to the native C 'find' command, but that's probably
    about as good as it's getting in pure python.
  2. Split directory recursion stuff from cmd-index.py into drecurse.py.

    apenwarr committed Feb 3, 2010
    Also add a new command, 'bup drecurse', which just recurses through a
    directory tree and prints all the filenames.  This is useful for timing
    performance vs. the native 'find' command.
    
    The result is a bit embarrassing; for my home directory of about 188000
    files, drecurse is about 10x slower:
    
    $ time bup drecurse -q ~
    real	0m2.935s
    user	0m2.312s
    sys	0m0.580s
    
    $ time find ~ -printf ''
    real	0m0.385s
    user	0m0.096s
    sys	0m0.284s
    
    $ time find ~ -printf '%s\n' >/dev/null
    real	0m0.662s
    user	0m0.208s
    sys	0m0.456s
Commits on Feb 2, 2010
  1. Massive speedups to bupindex code.

    apenwarr committed Jan 31, 2010
    The old file format was modeled after the git one, but it was kind of dumb;
    you couldn't search through the file except linearly, which is pretty slow
    when you have hundreds of thousands, or millions, of files.  It also stored
    the entire pathname of each file, which got very wasteful as filenames got
    longer.
    
    The new format is much quicker; each directory has a pointer to its list of
    children, so you can jump around rather than reading linearly through the
    file.  Thus you can now 'bup index -p' any subdirectory pretty much
    instantly.  The code is still not completely optimized, but the remaining
    algorithmic silliness doesn't seem to matter.
    
    And it even still passes unit tests!  Which is too bad, actually, because I
    still get oddly crashy behaviour when I repeatedly update a large index. So
    there are still some screwy bugs hanging around.  I guess that means we need
    better unit tests...
  2. cmd-save: add --smaller option.

    apenwarr committed Feb 2, 2010
    This makes it only back up files smaller than the given size.  bup can
    handle big files, but you might want to do quicker incremental backups and
    skip bigger files except once a day, or something.
    
    It's also handy for testing.
  3. midx: the fanout table entries can be 4 bytes, not 8.

    apenwarr committed Feb 2, 2010
    I was trying to be future-proof, but it was kind of overkill, since a 32-bit
    fanout entry could handle a total of 4 billion *hashes* per midx.  That
    would be 20*4bil = 80 gigs in a single midx.  This corresponds to about 10
    terabytes of packs, which isn't inconceivable... but if it happens, you
    could just use more than one midx.  Plus you'd likely run into other weird
    bup problems before your midx files get anywhere near 80 gigs.
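The arithmetic checks out, assuming 20-byte sha1 entries:

```python
entries = 2 ** 32        # max hashes addressable by a 32-bit fanout entry
sha1_bytes = 20          # size of one sha1 hash
midx_bytes = entries * sha1_bytes

print(midx_bytes / 2 ** 30)  # ~80 GiB of raw hash data in a single .midx
```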
  4. cmd-midx: correctly handle a tiny nonzero number of objects.

    apenwarr committed Feb 2, 2010
    If all the sha1sums would have fit in a single page, the number of bits in
    the table would be negative, with odd results.  Now we just refuse to create
    the midx if there are too few objects *and* too few files, since it would be
    useless anyway.
    
    We're still willing to create a very small midx if it allows us to merge
    several indexes into one, however small, since that will still speed up
    searching.
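One way the bit count can go negative (a sketch under assumptions: the 4 KiB page size and the log-based formula are guesses at the shape of the calculation, not bup's actual code):

```python
import math

ENTRIES_PER_PAGE = 4096 // 20   # assumed: 4 KiB pages of 20-byte sha1s

def table_bits(total):
    """Bits for the midx lookup table; negative when one page suffices."""
    return int(math.ceil(math.log(total / float(ENTRIES_PER_PAGE), 2)))

print(table_bits(1000000))  # positive for a big midx
print(table_bits(50))       # negative: all sums fit in one page, so refuse
```

Guarding against the tiny case up front, as the commit does, is simpler than clamping the result everywhere it gets used.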
  5. Use a heapq object to accelerate git.idxmerge().

    apenwarr committed Feb 2, 2010
    This greatly accelerates bup margin and bup midx when you're iterating
    through a large number of packs.
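The core trick is a heap-based k-way merge of already-sorted hash streams, which the stdlib also offers as `heapq.merge` (a sketch of the technique, not bup's actual `git.idxmerge()`; the short hex strings stand in for sha1s):

```python
import heapq

# Three packs' sorted hash lists.
idx1 = ['0a', '3f', '9c']
idx2 = ['11', '3f', 'e0']
idx3 = ['2b', '77']

# One sorted stream, without concatenating and re-sorting everything.
merged = list(heapq.merge(idx1, idx2, idx3))
print(merged)
```

With a heap the merge costs O(n log k) for k packs instead of O(n log n) for a full re-sort, which is what makes it pay off over many packs.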
  6. cmd-margin: a command to find out the max bits of overlap between hashes.

    apenwarr committed Feb 2, 2010
    Run 'bup margin' to go through the list of all the objects in your bup
    directory and count the number of overlapping prefix bits between each two
    consecutive objects.  That is, find the longest hash length (in bits) that
    *would* have caused an overlap, if sha1 hashes had been that length.
    
    On my system with 111 gigs of packs, I get 44 bits.  Out of a total of 160.
    That means I'm still safe from collisions for about 2^116 times over.  Or is
    it only the square root of that?  Anyway, it's such a large number that my
    brain explodes just thinking about it.
    
    Mark my words: 2^160 ought to be enough for anyone.
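The measurement can be sketched as: sort all the hashes, count the leading bits shared by each consecutive pair, and take the maximum (a toy version of the idea; real bup margin walks the pack indexes instead of hashing integers):

```python
import hashlib

def common_prefix_bits(a, b):
    """Number of leading bits shared by two equal-length byte strings."""
    bits = 0
    for x, y in zip(a, b):
        if x == y:
            bits += 8
            continue
        diff = x ^ y
        while diff < 0x80:   # count matching high bits of the first
            bits += 1        # differing byte
            diff <<= 1
        return bits
    return bits

hashes = sorted(hashlib.sha1(str(i).encode()).digest() for i in range(1000))
margin = max(common_prefix_bits(a, b) for a, b in zip(hashes, hashes[1:]))
print(margin)  # grows roughly with 2*log2(object count)
```

For uniformly distributed hashes the expected maximum overlap is about twice the log2 of the object count, which matches the 44 bits observed for 111 gigs of packs.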
Commits on Jan 31, 2010
  1. Update README.md to reflect recent developments.

    apenwarr committed Jan 31, 2010
    - Remove the version number since I never remember to update it
    - We now work with earlier versions of python and MacOS
    - There's now a mailing list
    - 'bup fsck' allows us to remove one of the things from the "stupid" list.
  2. Move testfile[12] into t/

    apenwarr committed Jan 31, 2010
    Since they're only used for testing, they belong there, after all.
  3. fsck: add a -j# (run multiple threads) option.

    apenwarr committed Jan 31, 2010
    Sort of like make -j.  par2 can be pretty slow, so this lets us verify
    multiple files in parallel.  Since the files are so big, though, this might
    actually make performance *worse* if you don't have a lot of RAM.  I haven't
    benchmarked this too much, but on my quad-core with 6 gigs of RAM, -j4 does
    definitely make it go "noticeably" faster.
  4. Basic cmd-fsck for checking integrity of packfiles.

    apenwarr committed Jan 30, 2010
    It also uses the 'par2' command, if available, to automatically generate
    redundancy data, or to use that data for repair purposes.
    
    Includes handy unit test.
  5. cmd-damage: a program for randomly corrupting file contents.

    apenwarr committed Jan 30, 2010
    Sure, that *sounds* like a terrible idea.  But it's fun for testing recovery
    algorithms, at least.