Commits on Jan 31, 2010
  1. fsck: add a -j# (run multiple threads) option.

    apenwarr committed Jan 31, 2010
    Sort of like make -j.  par2 can be pretty slow, so this lets us verify
    multiple files in parallel.  Since the files are so big, though, this might
    actually make performance *worse* if you don't have a lot of RAM.  I haven't
    benchmarked this too much, but on my quad-core with 6 gigs of RAM, -j4 does
    definitely make it go "noticeably" faster.
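
A minimal sketch of the "-j" idea, assuming a thread pool and a plain SHA-1 read as a stand-in for the slow per-file check. The names `checksum` and `verify_all` are hypothetical and this is not bup's actual fsck code, which shells out to par2.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def checksum(path):
    """Stand-in for a slow per-file verification: SHA-1 of the file contents."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(65536), b''):
            h.update(block)
    return h.hexdigest()


def verify_all(paths, jobs=1):
    """Check every path, running at most `jobs` checks at the same time."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map returns results in the same order as `paths`.
        return dict(zip(paths, pool.map(checksum, paths)))
```

With `jobs=1` this degrades to a sequential scan, which may be the better choice when the files are large compared to available RAM.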
  2. Basic cmd-fsck for checking integrity of packfiles.

    apenwarr committed Jan 30, 2010
    It also uses the 'par2' command, if available, to automatically generate
    redundancy data, or to use that data for repair purposes.
    Includes handy unit test.
  3. cmd-damage: a program for randomly corrupting file contents.

    apenwarr committed Jan 30, 2010
    Sure, that *sounds* like a terrible idea.  But it's fun for testing recovery
    algorithms, at least.
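
A rough sketch of what such a corruption tool might do, assuming it overwrites a few randomly placed regions in place. The function name `damage` and its parameters are made up for illustration and are not the options of the real command.

```python
import os
import random


def damage(path, num_blocks=1, block_size=16, seed=None):
    """Overwrite `num_blocks` randomly placed regions of `path` with random bytes.

    The file length is left unchanged. Passing a seed makes the damage
    reproducible, which is handy in tests.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        for _ in range(num_blocks):
            length = min(block_size, size)
            offset = rng.randrange(size - length + 1)
            f.seek(offset)
            f.write(bytes(rng.randrange(256) for _ in range(length)))
```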
  4. Use mkstemp() when creating temporary packfiles.

    apenwarr committed Jan 30, 2010
    Using getpid() was an okay hack, but there's no good excuse for doing it
    that way when there are perfectly good tempfile-naming functions around
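
For illustration, a small sketch using Python's `tempfile.mkstemp()`. The helper name `write_temp_pack` and the `pack-`/`.tmp` naming are assumptions, not what bup uses.

```python
import os
import tempfile


def write_temp_pack(directory, data):
    """Write `data` to a uniquely named temporary file and return its path.

    mkstemp() creates and opens the file atomically, so two writers in the
    same directory cannot collide the way pid-based names can.
    """
    fd, path = tempfile.mkstemp(prefix='pack-', suffix='.tmp', dir=directory)
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path
```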
  5. rewire a try/finally with a yield inside to be compatible with python 2.4

    apenwarr committed Jan 31, 2010
    Apparently you can't put a 'yield' inside a try/finally in older versions of
    python.  lame.
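
One possible way to do such a rewrite, shown as a sketch rather than the actual change made here: replace the try/finally with a try/except that re-raises, and repeat the cleanup on the normal path. The name `items_with_cleanup` is hypothetical.

```python
def items_with_cleanup(items, cleanup):
    """Yield each item, then call cleanup() once, on success or on error.

    Equivalent in effect to wrapping the loop in try/finally, but the yield
    sits inside a try/except instead.
    """
    try:
        for item in items:
            yield item
    except:
        cleanup()   # error path: clean up, then let the exception continue
        raise
    cleanup()       # normal path: the loop ran to completion
```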
Commits on Jan 30, 2010
  1. client: fix a race condition when the server suggests an index.

    apenwarr committed Jan 30, 2010
    If we finished our current pack too quickly after getting the suggestion,
    the client would get confused, resulting in 'expected "ok", got %r' type
Commits on Jan 27, 2010
  1. cmd-ls and cmd-fuse: toys for browsing your available backups.

    apenwarr committed Jan 27, 2010
    'bup ls' lets you browse the set of backups on your current system.  It's a
    bit useless, so it might go away or be rewritten eventually.
    'bup fuse' is a simple read-only FUSE filesystem that lets you mount your
    backup sets as a filesystem (on Linux only).  You can then export this
    filesystem over samba or NFS or whatever, and people will be able to restore
    their own files from backups.
    Warning: we still don't support file metadata in 'bup save', so all the file
    permissions will be wrong (and users will probably be able to see things
    they shouldn't!).  Also, anything that has been split into chunks will show
    you the chunks instead of the full file, which is a bit silly.  There are
    also tons of places where performance could be improved.
    But it's a pretty neat toy nevertheless.  To try it out:
       mkdir /tmp/backups
       sudo bup fuse /tmp/backups
Commits on Jan 25, 2010
  1. cmd-midx: some performance optimizations.

    apenwarr committed Jan 25, 2010
    Approximately doubles the speed of generating indexes.
  2. cmd-midx: add --auto and --force options.

    apenwarr committed Jan 25, 2010
    Rather than having to list the indexes you want to merge, now it can do it
    for you automatically.  The output filename is now also optional; it'll
    generate it in the right place in the git repo automatically.
  3. When there are multiple overlapping .midx files, discard redundant ones.

    apenwarr committed Jan 25, 2010
    That way if someone generates a .midx for a subset of .idx files, then
    another for the *entire* set of .idx files, we'll automatically ignore the
    former one, thus increasing search speed and improving memory thrashing
    behaviour even further.
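
A sketch of how redundant .midx files could be detected, assuming each one is described simply by the set of .idx names it covers. The function `drop_redundant` and the dict layout are hypothetical. Note that it also drops a .midx covered by the union of several larger ones, which is slightly more aggressive than the single-superset case described above.

```python
def drop_redundant(midx_contents):
    """Return the .midx names worth keeping, largest first.

    midx_contents maps a .midx name to the set of .idx names it merges.
    A .midx is skipped when every .idx it covers is already covered by the
    ones kept so far.
    """
    keep = []
    covered = set()
    by_size = sorted(midx_contents.items(),
                     key=lambda kv: (-len(kv[1]), kv[0]))
    for name, idxs in by_size:
        idxs = set(idxs)
        if idxs <= covered:
            continue
        keep.append(name)
        covered |= idxs
    return keep
```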
  4. MultiPackIndex: use .midx files if they exist.

    apenwarr committed Jan 25, 2010
    Wow, using a single .midx file that merges my 435 megs of packfile indexes
    (across 169 files) reduces memory churn in by at least two orders
    of magnitude.  (ie. we need to map 100x fewer memory pages in order to
    search for each nonexistent object when creating a new backup)
    runs *visibly* faster.
    We can also remove the PackBitmap code now, since it's not nearly as good as
    the PackMidx stuff and is now an unnecessary layer of indirection.
  5. cmd-midx: a command for merging multiple .idx files into one.

    apenwarr committed Jan 25, 2010
    This introduces a new "multi-index" index format, as suggested by Lukasz
    .midx files have a variable-bit-width fanout table that's supposedly
    optimized to be able to find any sha1 while dirtying only two pages (one for
    the fanout table lookup, and one for the final binary search).  Each entry
    in the fanout table should correspond to approximately one page's worth of
    Also adds a PackMidx class, which acts just like PackIndex, but for .midx
    files.  Not using it for anything yet, though.  The idea is to greatly
    reduce memory burn when searching through lots of pack files.
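
To illustrate the fanout-plus-binary-search lookup, here is an in-memory sketch. It is not the .midx file format: the names `prefix`, `build_fanout` and `exists` are hypothetical, and `bits` is a free parameter (at most 32 here), whereas the commit picks the width from the index size.

```python
import bisect


def prefix(sha, bits):
    """Top `bits` bits of a binary sha, as an integer (bits <= 32)."""
    return int.from_bytes(sha[:4], 'big') >> (32 - bits)


def build_fanout(sorted_shas, bits):
    """fanout[p] = number of shas whose top `bits` bits are <= p."""
    fanout = [0] * (1 << bits)
    for sha in sorted_shas:
        fanout[prefix(sha, bits)] += 1
    total = 0
    for i, count in enumerate(fanout):
        total += count
        fanout[i] = total
    return fanout


def exists(sorted_shas, fanout, bits, sha):
    """One fanout lookup narrows the range; a binary search finishes the job."""
    p = prefix(sha, bits)
    start = fanout[p - 1] if p else 0
    end = fanout[p]
    i = bisect.bisect_left(sorted_shas, sha, start, end)
    return i < end and sorted_shas[i] == sha
```

The wider the fanout table, the shorter the range left for the binary search, at the cost of a larger table.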
  6. Rename to

    apenwarr committed Jan 25, 2010
    Looks like the python standard is when making a C helper for
    a module named, so let's do it that way.
    Also get rid of the annoying "module" suffix in the .c's filename.  Not sure
    why I ever thought that was needed.
Commits on Jan 24, 2010
  1. toplevel exit() doesn't work in python 2.4.

    apenwarr committed Jan 24, 2010
    Use sys.exit() instead.
  2. In some versions of python, comparing buffers with < gives a warning.

    apenwarr committed Jan 24, 2010
    It seems to be a buggy warning.  But we only really do it in one place, and
    buffers in question are only 20 bytes long, so forcing them into strings
    seems harmless enough.
  3. Wrap mmap calls to help with portability.

    apenwarr committed Jan 24, 2010
    python2.4 in 'fink' on MacOS X seems to not like it when you pass a file
    length of 0, even though that's supposed to mean "determine map size
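
A sketch of the kind of wrapper this could be, assuming the portable approach is to always pass an explicit length taken from `os.fstat()`. The helper name `mmap_read` is made up and the real wrapper may differ.

```python
import mmap
import os


def mmap_read(f):
    """Map all of the open file `f` read-only.

    The size is passed explicitly instead of relying on length 0 meaning
    "whole file". An empty file cannot be mapped, so return b'' for it.
    """
    size = os.fstat(f.fileno()).st_size
    if size == 0:
        return b''
    return mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
```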
  4. Makefile: build module using python distutils instead of manually.

    apenwarr committed Jan 24, 2010
    This makes it work with fink's version of python, among possibly other
    So now we can build even on MacOS X tiger, even though tiger's
    python 2.3 is too old, by installing fink's python24 package first.
  5. executable files: don't assume python2.5.

    apenwarr committed Jan 24, 2010
    The forcing of version 2.5 was left over from before, when it was
    accidentally selecting python 2.4 on some distros when both
    versions are installed.  But actually that's fine; bup works in python 2.4
    without problems.
    So let's not cause potentially *more* portability problems by forcing python
    2.5 when it might not exist.
  6. Makefile: oops, all the $^ and $< were backwards.

    apenwarr committed Jan 24, 2010
    Not that it mattered, since all our files only had one dependency each.  But
    it causes confusion if you ever add extra ones.
  7. Make README a symlink to

    apenwarr committed Jan 24, 2010
    So as not to confuse anyone who has linked to the README file on github in
    the past.
  8. Minor README format changes to make it markdown-compliant.

    apenwarr committed Jan 24, 2010
    That way it looks prettier on github.
Commits on Jan 14, 2010
  1. Change t/ to pass on Mac OS.

    dcoombs committed Jan 14, 2010
    It turns out /etc is a symlink (to /private/etc) on Mac OS, so checking
    that the realpath of t/sampledata/etc is /etc fails.  Instead we now check
    against the realpath of /etc.
Commits on Jan 12, 2010
  1. Use a PackBitmap file as a quicker way to check .idx files.

    apenwarr committed Jan 12, 2010
    When we receive a new .idx file, we auto-generate a .map file from it.  It's
    essentially an allocation bitmap: for each 20-bit prefix, we assign one bit
    to tell us if that particular prefix is in that particular packfile.  If it
    isn't, there's no point searching the .idx file at all, so we can avoid
    mapping in a lot of pages.  If it is, though, we then have to search the
    .idx *too*, so we suffer a bit.
    On the whole this reduces memory thrashing quite a bit for me, though.
    Probably the number of bits needs to be variable in order to work over a
    wider range of packfile sizes/numbers.
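
The allocation bitmap can be sketched in memory like this. The names are hypothetical and this says nothing about the on-disk .map layout; the only detail taken from the text above is one bit per 20-bit sha prefix.

```python
def prefix20(sha):
    """First 20 bits of a binary sha, as an integer."""
    return int.from_bytes(sha[:3], 'big') >> 4


def build_bitmap(shas):
    """One bit per 20-bit prefix: 2**20 bits = 128 KiB per packfile."""
    bitmap = bytearray((1 << 20) // 8)
    for sha in shas:
        p = prefix20(sha)
        bitmap[p >> 3] |= 1 << (p & 7)
    return bitmap


def might_contain(bitmap, sha):
    """False means the pack definitely lacks `sha`, so its .idx can be skipped.
    True means the .idx still has to be searched."""
    p = prefix20(sha)
    return bool(bitmap[p >> 3] & (1 << (p & 7)))
```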
  2. a standalone program for testing memory usage in PackIndex.

    apenwarr committed Jan 12, 2010
    The majority of the memory usage in bup split/save is now caused by
    searching pack indexes for sha1 hashes.  The problem with this is that, in
    the common case for a first full backup, *none* of the object hashes will be
    found, so we'll *always* have to search *all* the packfiles.  With just 45
    packfiles of 200k objects each, that makes about (18-8)*45 = 450 binary
    search steps, or 100+ 4k pages that need to be loaded from disk, to check
    *each* object hash. lets us see how fast RSS creeps up under
    various conditions, and how different optimizations affect the result.
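
The step count above can be reproduced with a few lines. The reading of "(18-8)" is an assumption: about 18 halvings for 200k sorted entries, minus the 8 bits already resolved by the 256-entry fanout table in a git .idx file.

```python
import math

packfiles = 45
objects_per_pack = 200_000
fanout_bits = 8  # assumption: first byte of the sha is resolved by the fanout table

steps_per_pack = math.ceil(math.log2(objects_per_pack)) - fanout_bits
total_steps = steps_per_pack * packfiles

print(steps_per_pack, total_steps)  # 10 450
```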
  3. options parser: automatically convert strings to ints when appropriate.

    apenwarr committed Jan 12, 2010
    If the given parameter is exactly an int (ie. str(int(v)) == v) then convert
    it to an int automatically.  This helps avoid weird bugs in apps using the
    option parser.
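
The rule can be sketched as a small helper. The name `intify` is hypothetical; only the `str(int(v)) == v` test comes from the text above.

```python
def intify(v):
    """Return int(v) if v is exactly the text of an int, else v unchanged."""
    try:
        if str(int(v)) == v:
            return int(v)
    except (ValueError, TypeError):
        pass
    return v
```

Under this rule '42' and '-7' become ints, while '042', ' 5' and '4.2' stay strings because they do not round-trip exactly.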
  4. cmd-save: if verbose==1, don't bother printing unmodified names.

    apenwarr committed Jan 12, 2010
    That just clutters the output; clearly what people *really* want to see is
    the list of files we're actually modifying.
    But if you want more, increase the verbosity and you'll get more.
  5. client-server: only retrieve index files when actually needed.

    apenwarr committed Jan 11, 2010
    A busy server could end up with a *large* number of index files, mostly
    referring to objects from other clients.  Downloading all the indexes not only
    wastes bandwidth, but causes a more insidious problem: small servers end up
    having to mmap a huge number of large index files, which sucks lots of RAM.
    In general, the RAM on a server is roughly proportional to the disk space on
    that server.  So it's okay for larger clients to need more RAM in order
    to complete a backup.  However, it's not okay for the existence of larger
    clients to make smaller clients suffer.  Hopefully this change will settle
    it a bit.
  6. Reduce default max objects per pack to 200,000 to save memory.

    apenwarr committed Jan 12, 2010
    After some testing, it seems each object sha1 we need to cache while writing
    a pack costs us about 83 bytes of memory.  (This isn't so great, so
    optimizing it in C later could cut this down a lot.)  The new limit of 200k
    objects takes about 16.6 megs of RAM, which nowadays is pretty acceptable.
    It also corresponds to roughly 1GB of packfile for my random selection of
    sample data, so (since the default packfile limit is about 1GB anyway), this
    *mostly* won't matter.
    It will have an effect if your data is highly compressible, however; an
    8192-byte object could compress down to a very small size and you'd end up
    with a large number of objects.  The previous default limit of 10 million
    objects was ridiculous, since that would take 830 megs of RAM.
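
The memory figures follow directly from the measured 83 bytes per cached object. The helper name `cache_megs` is only for illustration, and a "meg" is taken as 10**6 bytes.

```python
BYTES_PER_OBJECT = 83  # measured cost per cached sha1, from the text above


def cache_megs(max_objects):
    """Approximate RAM, in megabytes, needed to cache `max_objects` sha1s."""
    return max_objects * BYTES_PER_OBJECT / 1e6


print(cache_megs(200_000))     # new default limit
print(cache_megs(10_000_000))  # previous default limit
```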
  7. split_to_blob_or_tree was accidentally not using the 'fanout' setting.

    apenwarr committed Jan 12, 2010
    Thus, 'bup save' on huge files would suck lots of RAM.
Commits on Jan 11, 2010
  1. Merge branch 'cygwin'

    apenwarr committed Jan 11, 2010
    * cygwin:
      Assorted cleanups to Luke's cygwin fixes.
      Makefile: work with cygwin on different windows versions.
      .gitignore sanity.
      Makefile:  On Windows, executable files must end with .exe.
      Windows files don't support ':', so rename cachedir.
      os.rename() fails on Windows if dstfile already exists.
      Don't try to rename tmpfiles into existing open files.
      Cygwin doesn't support `hostname -f`, use `hostname`.
      Retry without O_LARGEFILE if not supported.
      Makefile:  Build on Windows under Cygwin.
  2. Assorted cleanups to Luke's cygwin fixes.

    apenwarr committed Jan 11, 2010
    There were a few things that weren't quite done how I would have done them,
    so I changed the implementation.  Should still work in cygwin, though.
    The only actual functional changes are:
     - index.Reader.close() now actually sets m=None rather than just closing it
     - removed the "if rename fails, then unlink first" logic, which is
       seemingly not needed after all.
     - rather than special-casing cygwin to use "hostname" instead of "hostname
       -f", it turns out python has a socket.getfqdn() that does what we want.