Commits on Mar 1, 2010
  1. Rename PackIndex->PackIdx and MultiPackIndex->PackIdxList.

    apenwarr committed Mar 1, 2010
    This corresponds to the PackMidx renaming I did earlier, and helps avoid
    confusion between index.py (which talks to the 'bupindex' file and has
    nothing to do with packs) and git.py (which talks to packs and has nothing
    to do with the bupindex).  Now pack indexes are always called Idx, and the
    bupindex is always Index.
    
    Furthermore, MultiPackIndex could easily be assumed to be the same thing as
    a Midx, which it isn't.  PackIdxList is a more accurate description of what
    it is: a list of pack indexes.  A Midx is an index of a list of packs.
  2. main: list common commands before other ones.

    apenwarr committed Mar 1, 2010
    When you just type 'bup' or 'bup help', we print a list of available
    commands.  Now we improve this list by:
    
    1) Listing the common commands (with one-line descriptions) before listing
    the automatically-generated list.
    
    2) Printing the automatically-generated list in columns, so it takes up less
    vertical space.
    
    This whole concept was stolen from how git does it.  I think it should be a
    bit more user friendly for beginners this way.
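    
    As a rough illustration of the column layout in (2), here is a minimal
    Python sketch.  It is not the code from this commit: the columnate() name,
    the 78-character width and the two-space padding are assumptions made for
    the example.

```python
def columnate(names, width=78, prefix="    "):
    """Lay out names in as many equal-width columns as fit in `width`."""
    if not names:
        return ""
    col_width = max(len(n) for n in names) + 2   # widest name plus padding
    ncols = max((width - len(prefix)) // col_width, 1)
    nrows = (len(names) + ncols - 1) // ncols    # ceiling division
    rows = []
    for r in range(nrows):
        # Every nrows-th name lands in the same row, so reading down a
        # column follows the original (e.g. alphabetical) order.
        cells = names[r::nrows]
        line = "".join(c.ljust(col_width) for c in cells)
        rows.append(prefix + line.rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(columnate(["fsck", "ftp", "fuse", "index", "join", "ls", "midx"],
                    width=40))
```

    Filling column by column keeps a sorted list readable from top to bottom,
    which appears to be how git lays out its own command list.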
Commits on Feb 28, 2010
  1. Add a 'bup help' command.

    apenwarr committed Feb 28, 2010
    It works like 'git help xxx', ie. it runs 'man bup-xxx' where xxx is the
    command name.
  2. vfs: supply ctime/mtime for the root of each commit.

    apenwarr committed Feb 28, 2010
    This makes it a little more obvious which backups were made when.
    
    Mostly useful with 'bup fuse'.
  3. Move cmd-*.py to cmd/*-cmd.py.

    apenwarr committed Feb 28, 2010
    The bup-* programs shouldn't need to be installed into /usr/bin; we should
    search for them in /usr/lib somewhere.
    
    I could have left the names as cmd/cmd-*.py, but the cmd-* was annoying me
    because of tab completion.  Now I can type cmd/ran<tab> to get
    random-cmd.py.
  4. Move python library files to lib/bup/

    apenwarr committed Feb 28, 2010
    ...and update other programs so that they import them correctly from their
    new location.
    
    This is necessary so that the bup library files can eventually be installed
    somewhere other than wherever the 'bup' executable ends up.  Plus it's
    clearer and safer to say 'from bup import options' instead of just 'import
    options', in case someone else writes an 'options' module.
    
    I wish I could have named the directory just 'bup', but I can't; there's
    already a program with that name.
    
    Also, in the name of sanity, rename memtest.py to 'bup memtest' so that it
    can get the new paths automatically.
  5. bup index --check: detect broken index entries.

    apenwarr committed Feb 28, 2010
    Entries with invalid gitmode or sha1 are actually invalid, so if
    IX_HASHVALID is set, that's a bug.  Detect it right away when it happens.
    
    Also clean up a bit of log output related to checking and status.
  6. cmd-index: auto-invalidate entries without a valid sha1 or gitmode.

    apenwarr committed Feb 28, 2010
    Not exactly sure where these entries came from; possibly a failed save or an
    earlier buggy version of bup.  But previously, they weren't auto-fixable
    without deleting your bupindex.
  7. Add a new 'bup newliner' that fixes progress message whitespace.

    apenwarr committed Feb 28, 2010
    If we have multiple processes producing status messages to stderr and/or
    stdout, and some of the lines ended in \r (ie. a progress message that was
    supposed to be overwritten later) they would sometimes stomp on each other
    and leave ugly bits lying around.
    
    Now bup.py automatically pipes stdout/stderr to the new 'bup newliner'
    command to fix this, but only if they were previously pointing at a tty.
    Thus, if you redirect stdout to a file, nothing weird will happen, but if
    you don't, stdout and stderr won't conflict with each other.
    
    Anyway, the output is prettier now.  Trust me on this.
  8. Add an options.fatal() function and use it.

    apenwarr committed Feb 28, 2010
    Every existing call to o.usage() was preceded by an error message that
    printed the exename, then the error message.  So let's add a fatal()
    function that does it all in one step.  This reduces the net number of lines
    plus improves consistency.
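    
    The idea can be sketched roughly as follows.  This is a simplified
    stand-in, not the real options module: the Options class shape, the
    message format and the exit code 99 are assumptions for illustration.

```python
import sys


class Options:
    """Hypothetical, stripped-down option parser holding only what fatal() needs."""

    def __init__(self, exe, usage_text):
        self.exe = exe
        self.usage_text = usage_text

    def usage(self):
        # Print the usage text and exit with a nonzero status.
        sys.stderr.write(self.usage_text)
        sys.exit(99)  # exit code chosen arbitrarily for this sketch

    def fatal(self, msg):
        # One call replaces the old "print exename + error, then usage()" pair.
        sys.stderr.write("%s: error: %s\n" % (self.exe, msg))
        self.usage()
```

    A caller would then write o.fatal("no filenames given") instead of
    logging the error and calling o.usage() separately.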
Commits on Feb 14, 2010
  1. cmd-ftp: a new command-line client you can use for browsing your repo.

    apenwarr committed Feb 14, 2010
    It acts kind of like the 'ftp' command; hence the name.  It even has
    readline and filename autocompletion!
    
    The new vfs layer stuff should be useful for cmd-ls and cmd-fuse too.
  2. Another suspicious fix for CatPipe parallelism.

    apenwarr committed Feb 14, 2010
    This really shouldn't be necessary: it's clear to me that the 'it' object
    should be going out of scope right away, and thus getting cleaned up by the
    garbage collector.
    
    But on one of my Linux PCs (with python 2.4.4) it fails the unit tests
    unless I add this patch.  Oh well, let's do it then.
  3. hashsplit: smallish files (less than BLOB_MAX) weren't getting split.

    apenwarr committed Feb 14, 2010
    This buglet was introduced when doing my new fanout cleanups.  It's
    relatively unimportant, but it would cause a bit of space wastage for
    smallish files that changed by a bit, since we couldn't take advantage of
    deduplication for their blocks.
    
    This also explains why the --fanout argument test broke earlier.  I thought
    I was going crazy (since the whole fanout implementation had changed and the
    number now means something slightly different), so I just removed it.  But
    now we can bring it back and it passes again.
Commits on Feb 13, 2010
  1. Make CatPipe objects more resilient when interrupted.

    apenwarr committed Feb 13, 2010
    If we stopped iterating halfway through a particular object, the iterator
    wouldn't finish reading all the data, which would mess up the state of
    the git-cat-file pipe.  Now we read all the data even if we're going to just
    throw it away.
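    
    A minimal sketch of the "drain the rest" pattern, assuming a generator
    that reads a known number of bytes from a shared stream.  The
    read_object() name and its parameters are hypothetical and this is not
    the actual CatPipe code.

```python
def read_object(stream, size, chunk_size=4096):
    """Yield `size` bytes from `stream` in chunks, always consuming all of them."""
    remaining = size
    try:
        while remaining > 0:
            data = stream.read(min(chunk_size, remaining))
            if not data:
                raise EOFError("stream ended %d bytes early" % remaining)
            remaining -= len(data)
            yield data
    finally:
        # If the caller stopped iterating early, read and discard what is
        # left so the next object starts at the right place in the stream.
        while remaining > 0:
            data = stream.read(min(chunk_size, remaining))
            if not data:
                break
            remaining -= len(data)
```

    The finally block only runs once the generator is closed, either by an
    explicit close() or when it is garbage collected, so a caller that bails
    out early is safest calling close() itself.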
  2. bup join: continue gracefully if one of the requested files does not exist.

    apenwarr committed Feb 13, 2010
    This makes it work more like 'cat'.  If any of the requested files is
    missing, the final return code is nonzero.
Commits on Feb 12, 2010
  1. _hashsplit.c: right shifting 32 bits doesn't work.

    apenwarr committed Feb 12, 2010
    In C, if you do
    	uint32_t i = 0xffffffff;
    	i >>= 32;
    
    then the answer can be 0xffffffff, not 0 as you might expect, because
    shifting a 32-bit value by 32 or more bits is undefined behavior.  Let's
    shift it by less than 32 at a time, which will give the right results.  This
    fixes a rare infinite loop when counting the bits in the hashsplit.
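    
    The effect can be modelled in Python.  The sketch below assumes x86-style
    hardware, where a 32-bit shift instruction only looks at the low 5 bits of
    the shift count; that is one common way the 0xffffffff result arises, not
    something the C language guarantees.  The function names are made up for
    the example.

```python
MASK32 = 0xFFFFFFFF


def x86_shr32(value, count):
    """Model a 32-bit right shift that masks the count to 5 bits."""
    return (value & MASK32) >> (count & 31)


def shr32_in_steps(value, count, step=16):
    """Shift right by `count` bits in total, never by 32 or more at once."""
    value &= MASK32
    while count > 0:
        n = min(count, step)
        value = x86_shr32(value, n)
        count -= n
    return value


if __name__ == "__main__":
    print(hex(x86_shr32(0xFFFFFFFF, 32)))       # count 32 is masked down to 0
    print(hex(shr32_in_steps(0xFFFFFFFF, 32)))  # two 16-bit shifts
```

    With the count masked to zero the value never changes, which is how a
    loop that waits for it to reach 0 can spin forever.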
  2. Fix building under cygwin.

    Steve Diver authored and apenwarr committed Feb 12, 2010
    I attempted to build the latest under cygwin and ran into this:
    
    <snip>
    ...
    creating build/temp.cygwin-1.7.1-i686-2.5
    gcc -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-
    prototypes -I/usr/include/python2.5 -c _hashsplit.c -o b
    uild/temp.cygwin-1.7.1-i686-2.5/_hashsplit.o
    creating build/lib.cygwin-1.7.1-i686-2.5
    gcc -shared -Wl,--enable-auto-image-base build/temp.cygwin-1.7.1-
    i686-2.5/_hashsplit.o -L/usr/lib/python2.5/config -lpyt
    hon2.5 -o build/lib.cygwin-1.7.1-i686-2.5/_hashsplit.dll
    cp build/*/_hashsplit..dll .
    cp: cannot stat `build/*/_hashsplit..dll': No such file or directory
    make: *** [_hashsplit.dll] Error 1
    
    </snip>
    
    Some investigation turned up that the Makefile was mistakenly referencing
    _hashsplit.so instead of _hashsplit.dll.
    
    Changing the Makefile to expand the extension macro for the detected
    platform now allows problem-free building.
  3. hashsplit: totally change the way the fanout stuff works.

    apenwarr committed Feb 12, 2010
    Useless code churn or genius innovation?  You decide.
    
    The previous system for naming chunks of a split file was kind of lame.  We
    tried to name the files something that was "almost" their offset, so that
    filenames wouldn't shuffle around too much if a few bytes were added/deleted
    here and there.  But that totally failed to work if a *lot* of bytes were
    added, and it also lost the useful feature that you could seek to a specific
    point in a file (like a VM image) without restoring the whole thing.
    "Approximate" offsets aren't much good for seeking to.
    
    The new system is even more crazy than the original hashsplit: we now use
    the "extra bits" of the rolling checksum to define progressively larger
    chunks.  For example, we might define a normal chunk if the checksum ends in
    0xFFF (12 bits).  Now we can group multiple chunks together when the
    checksum ends in 0xFFFF (16 bits).  Because of the way the checksum works,
    this happens about every 2^4 = 16 chunks.  Similarly, 0xFFFFF (20 bits) will
    happen 16 times less often than that, and so on.  We can use this effect to
    define a tree.
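    
    A small Python sketch of how a checksum could be mapped to a tree level
    under the example numbers above (12 bits for a normal chunk, 4 more bits
    per level).  The constant and function names are assumptions for the
    example, not the names used in the hashsplit code.

```python
BLOB_BITS = 12    # a chunk ends when the low 12 bits are all ones (0xFFF)
FANOUT_BITS = 4   # every 4 additional one-bits moves one level up the tree


def trailing_ones(checksum):
    """Count how many consecutive low bits of a 32-bit checksum are set."""
    checksum &= 0xFFFFFFFF
    n = 0
    while checksum & 1:
        n += 1
        checksum >>= 1
    return n


def split_level(checksum):
    """Return None if there is no chunk boundary, else the tree level.

    Level 0 is an ordinary chunk boundary, level 1 closes a group of
    chunks, level 2 closes a group of groups, and so on.
    """
    bits = trailing_ones(checksum)
    if bits < BLOB_BITS:
        return None
    return (bits - BLOB_BITS) // FANOUT_BITS
```

    For instance, split_level(0x0FFF) is 0 and split_level(0xFFFF) is 1, so
    with a well-mixed checksum roughly one boundary in 16 also closes a
    level-1 group.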
    
    Then, in each branch of the tree, we name files based on their (exact, not
    approximate) offset *from the start of that tree*.
    
    Essentially, inserting/deleting/changing bytes will affect more "levels" of
    the rolling checksum, mangling bigger and bigger branches of the overall
    tree and causing those branches to change.  However, only the content of
    that sub-branch (and the *names*, ie offsets, of the following branches at
    that and further-up levels) end up getting changed, so the effect can be
    mostly localized.  The subtrees of those renamed trees are *not* affected,
    because all their offsets are relative to the start of their own tree.  This
    means *most* of the sha1sums in the resulting hierarchy don't need to
    change, no matter how much data you add/insert/delete.
    
    Anyway, the net result is that "git diff -M" now actually does something
    halfway sensible when comparing the trees corresponding to huge split files.
    Only halfway (because the chunk boundaries can move around a bit, and such
    large files are usually binary anyway) but it opens the way for much cooler
    algorithms in the future.
    
    Also, it'll now be possible to make 'bup fuse' open files without restoring
    the entire thing to a temp file first.  That means restoring (or even
    *using*) snapshotted VMs ought to become possible.
  4. cmd-split and hashsplit: cleaning up in preparation for refactoring.

    apenwarr committed Feb 12, 2010
    Theoretically, this doesn't actually change any functionality.
  5. cmd-join: don't restart git cat-file so frequently.

    apenwarr committed Feb 12, 2010
    We would restart cat-file for every id passed on the command line or via
    stdin, which was needlessly inefficient.
Commits on Feb 11, 2010
  1. Replace randomgen with a new 'bup random' command.

    apenwarr committed Feb 11, 2010
    Now we can override the random seed.  Plus we can specify units
    thanks to a new helpers.parse_num() function, so it's not always kb.
    
    Thus, we can now just do
    	bup random 50G
    to generate 50 gigs of random data for testing.
    
    Update "bup split" parameter parsing to use parse_num() too while we're
    there.
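    
    To give a feel for what such a helper does, here is a hedged Python
    sketch of a size parser.  It is not the helpers.parse_num()
    implementation: the accepted suffixes, the use of binary (1024-based)
    units and the error messages are all assumptions.

```python
import re

# Assumed unit table: binary multiples, case-insensitive, optional "b" suffix.
_UNITS = {
    "": 1, "b": 1,
    "k": 1024, "kb": 1024,
    "m": 1024 ** 2, "mb": 1024 ** 2,
    "g": 1024 ** 3, "gb": 1024 ** 3,
    "t": 1024 ** 4, "tb": 1024 ** 4,
}


def parse_num(s):
    """Parse strings like '50G', '1.5k' or '10' into a number of bytes."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*", s)
    if not m:
        raise ValueError("can't parse %r as a number" % s)
    number, unit = m.group(1), m.group(2).lower()
    if unit not in _UNITS:
        raise ValueError("invalid unit %r in number %r" % (unit, s))
    return int(float(number) * _UNITS[unit])
```

    Under these assumptions parse_num("50G") gives 50 * 1024**3 bytes.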
  2. Documentation: correctly mark .md.tmp files as "precious".

    apenwarr committed Feb 11, 2010
    Otherwise, if you don't have pandoc installed, they get repeatedly
    regenerated for no good reason.
Commits on Feb 10, 2010
  1. midx: automatically ignore .midx files if one of their .idx is missing.

    apenwarr committed Feb 10, 2010
    That implies that a pack has been deleted, so the entire .midx is pretty
    much worthless.  'bup midx -a' will generate a new one.
  2. midx: prune redundant midx files automatically.

    apenwarr committed Feb 10, 2010
    After running 'bup midx -f', all previous midx files become redundant.
    Throw them away if we end up opening a midx file that supersedes them.
    
    Also cleans up some minor code bits in cmd-midx.py.
  3. Fix building on MacOS X on PowerPC.

    andrewschleifer authored and apenwarr committed Feb 10, 2010
    bup failed to build on one of my machines, an older iMac; make
    died ~40 lines in with "gcc-4.0: Invalid arch name : Power".
    
    On PPC machines, uname -m returns the helpfully descriptive
    "Power Macintosh", which gcc doesn't recognize. Some googling
    revealed e.g.
    http://www.opensource.apple.com/source/ld64/ld64-95.2.12/unit-tests/include/common.makefile
    where they use $(shell arch) to get the necessary info.
    
    With that little change, bup built on ppc and i386 machines for
    me, and passed all tests.
Commits on Feb 9, 2010
  1. README: bup now has more reasons it's cool and fewer not to use it.

    apenwarr committed Feb 9, 2010
    Clearly we're making some progress.  I look forward to a world in which
    we can finally delete the "reasons bup is stupid" section.
  2. index: if files were already deleted, don't dirty the index.

    apenwarr committed Feb 9, 2010
    We had a bug where any deleted files in the index would always dirty all
    their parent directories when refreshing, which is inefficient.
  3. cmd-save: don't recurse into already-valid subdirs.

    apenwarr committed Feb 9, 2010
    When iterating through the index, if we find out that a particular dir (like
    /usr) has a known-valid sha1sum and isn't marked as changed, there's no need
    to recurse into it at all.  This saves some pointless grinding through the
    index when entire swaths of the tree are known to be already valid.
  4. cmd-index/cmd-save: correctly mark directories as dirty/clean.

    apenwarr committed Feb 9, 2010
    Previously, we just ignored the IX_HASHVALID on directories, and regenerated
    their hashes on every backup regardless.  Now we correctly save directory
    hashes and mark them IX_HASHVALID after doing a backup, as well as removing
    IX_HASHVALID all the way up the tree whenever a file is marked as invalid.
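    
    The propagation rule can be sketched with a toy tree in Python.  The
    Entry class, its parent pointer and the numeric flag value are
    hypothetical; the real index stores entries differently.

```python
IX_HASHVALID = 0x4000  # illustrative value only, not taken from the index code


class Entry:
    """Toy index entry with a link to its parent directory."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.flags = 0

    def is_valid(self):
        return bool(self.flags & IX_HASHVALID)

    def validate(self):
        # Called after a backup has stored this entry's hash.
        self.flags |= IX_HASHVALID

    def invalidate(self):
        # A changed file makes the hash of every enclosing directory stale,
        # so clear the flag on this entry and on each ancestor.
        e = self
        while e is not None:
            e.flags &= ~IX_HASHVALID
            e = e.parent
```

    Siblings of the changed file keep their flag, which is what lets a later
    save skip subtrees that are still valid.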
  5. Fix some list comprehensions that I thought were generator comprehensions.

    apenwarr committed Feb 9, 2010
    Apparently [x for x in whatever] yields a list, not an iterator, which means
    two things:
      - it might use more memory than I thought
      - you definitely don't need to write list([...]) since it's already a
        list.
    
    Clean up a few of these.  You learn something new every day.
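    
    A short Python illustration of the difference; the variable names are
    invented for the example.

```python
import sys
import types

n = 10000
as_list = [x * x for x in range(n)]  # square brackets: the whole list is built now
as_gen = (x * x for x in range(n))   # parentheses: items are computed on demand

print(type(as_list).__name__)
print(type(as_gen).__name__)

# The list holds references to all n results; the generator only holds its state.
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))

# Wrapping a list comprehension in list() just copies a list that already exists.
print(list([x for x in range(3)]) == [x for x in range(3)])
```

    Swapping the brackets for parentheses is usually all it takes when the
    result is only iterated over once.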
Commits on Feb 8, 2010
  1. test.sh: don't try non-quick fsck on damaged repositories.

    apenwarr committed Feb 8, 2010
    It turns out that older versions of git (1.5.x or so) have a git-verify-pack
    that goes into an endless loop when it hits certain kinds of corruption, and
    our test would trigger it almost every time.  Using --quick avoids calling
    git-verify-pack, so it won't exhibit the problem.
    
    Unfortunately this means a slightly less thorough test of non-quick
    bup-fsck, but it'll have to do.  Better than failing tests nonstop, anyway.
    
    Reported by Eduardo Kienetz.
Commits on Feb 6, 2010