Commits on Feb 4, 2010
  1. Merge branch 'indexrewrite'

    * indexrewrite:
      Greatly improved progress reporting during index/save.
      Fix bugs in new indexing code.
      Speed up cmd-drecurse by 40%.
      Split directory recursion stuff from cmd-index.py into drecurse.py.
      Massive speedups to bupindex code.
    committed Feb 4, 2010
  2. Greatly improved progress reporting during index/save.

    Now that the index reading stuff is much faster, we can afford to waste time
    reading through it just to count how many bytes we're planning to back up.
    
    And that lets us print really friendly progress messages during bup save, in
    which we can tell you exactly what fraction of your bytes have been backed
    up so far.
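
    A rough sketch of the two-pass idea (not the actual bup code): count the
    bytes first, then report a fraction as each file is saved.  The names
    save_with_progress, backup_one and report are made up for illustration.

```python
def save_with_progress(entries, backup_one, report):
    """entries is a list of (path, size) pairs, e.g. read from the index.

    backup_one(path) and report(message) are hypothetical callbacks.
    """
    # First pass: cheap now that reading the index is fast.
    total = sum(size for _path, size in entries)
    done = 0
    # Second pass: do the real work and report the fraction of bytes done.
    for path, size in entries:
        backup_one(path)
        done += size
        pct = 100.0 * done / total if total else 100.0
        report("%.2f%% (%d/%d bytes)" % (pct, done, total))
    return done, total
```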
    committed Feb 4, 2010
  3. Fix bugs in new indexing code.

    The logic was way too screwy, so I've simplified it a lot.  Also extended
    the unit tests quite a bit to replicate the weird problems I was having.  It
    seems pretty stable - and pretty fast - now.
    
    Iterating through an index of my whole home directory (bup index -p ~) now
    takes about 5.1 seconds, vs. 3.5 seconds before the rewrite.  However,
    iterating through just a *fraction* of the index can now bypass all the
    parts we don't care about, so it's much much faster than before.
    
    Could probably still stand some more optimization eventually, but at least
    the file format allows for speed.  The rest is just code :)
    committed Feb 3, 2010
Commits on Feb 3, 2010
  1. Speed up cmd-drecurse by 40%.

    It's now 40% faster, ie. 1.769 seconds or so to go through my home
    directory, instead of the previous 2.935.
    
    Still sucks compared to the native C 'find' command, but that's probably
    about as good as it's getting in pure python.
    committed Feb 3, 2010
  2. Split directory recursion stuff from cmd-index.py into drecurse.py.

    Also add a new command, 'bup drecurse', which just recurses through a
    directory tree and prints all the filenames.  This is useful for timing
    performance vs. the native 'find' command.
    
    The result is a bit embarrassing; for my home directory of about 188000
    files, drecurse is about 10x slower:
    
    $ time bup drecurse -q ~
    real	0m2.935s
    user	0m2.312s
    sys	0m0.580s
    
    $ time find ~ -printf ''
    real	0m0.385s
    user	0m0.096s
    sys	0m0.284s
    
    $ time find ~ -printf '%s\n' >/dev/null
    real	0m0.662s
    user	0m0.208s
    sys	0m0.456s
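
    For reference, a minimal pure-python walker along the same lines, built on
    os.walk.  This is only a sketch; the real drecurse.py may work quite
    differently.

```python
import os


def drecurse(top):
    """Yield the path of every file and directory below top.

    os.walk does not follow symlinks to directories by default.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # in-place sort keeps the traversal order stable
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
        for name in dirnames:
            # Trailing separator marks directories in the output.
            yield os.path.join(dirpath, name) + os.sep
```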
    committed Feb 3, 2010
Commits on Feb 2, 2010
  1. Massive speedups to bupindex code.

    The old file format was modeled after the git one, but it was kind of dumb;
    you couldn't search through the file except linearly, which is pretty slow
    when you have hundreds of thousands, or millions, of files.  It also stored
    the entire pathname of each file, which got very wasteful as filenames got
    longer.
    
    The new format is much quicker; each directory has a pointer to its list of
    children, so you can jump around rather than reading linearly through the
    file.  Thus you can now 'bup index -p' any subdirectory pretty much
    instantly.  The code is still not completely optimized, but the remaining
    algorithmic silliness doesn't seem to matter.
    
    And it even still passes unit tests!  Which is too bad, actually, because I
    still get oddly crashy behaviour when I repeatedly update a large index. So
    there are still some screwy bugs hanging around.  I guess that means we need
    better unit tests...
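
    To illustrate the "each directory points at its children" idea, here is a
    toy in-memory version.  The record layout (name, first_child, child_count)
    and the function names are assumptions for the example, not the real
    bupindex file format.

```python
def build_index(tree):
    """Flatten a nested dict into records with child pointers.

    tree maps name -> dict (a subdirectory) or None (a file).
    Returns (records, root); each record is (name, first_child, child_count).
    """
    records = []

    def add_children(subtree):
        names = sorted(subtree)
        start = len(records)
        # Reserve one contiguous run so siblings sit next to each other.
        records.extend([None] * len(names))
        for i, name in enumerate(names):
            child = subtree[name]
            if child is None:
                records[start + i] = (name, -1, 0)  # file: no children
            else:
                first, count = add_children(child)
                records[start + i] = (name, first, count)
        return start, len(names)

    first, count = add_children(tree)
    records.append(("", first, count))  # root record goes last
    return records, len(records) - 1


def find(records, root, path):
    """Follow child pointers along path, skipping every unrelated subtree."""
    idx = root
    for part in path.strip("/").split("/"):
        _name, first, count = records[idx]
        for j in range(first, first + count):
            if records[j][0] == part:
                idx = j
                break
        else:
            return None  # no such entry
    return idx
```

    Looking up a subdirectory only touches the sibling lists on the way down,
    which is why listing one subtree no longer means reading the whole file.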
    committed Jan 31, 2010
  2. cmd-save: add --smaller option.

    This makes it only back up files smaller than the given size.  bup can
    handle big files, but you might want to do quicker incremental backups and
    skip bigger files except once a day, or something.
    
    It's also handy for testing.
    committed Feb 2, 2010
  3. midx: the fanout table entries can be 4 bytes, not 8.

    I was trying to be future-proof, but it was kind of overkill, since a 32-bit
    fanout entry could handle a total of 4 billion *hashes* per midx.  That
    would be 20*4bil = 80 gigs in a single midx.  This corresponds to about 10
    terabytes of packs, which isn't inconceivable... but if it happens, you
    could just use more than one midx.  Plus you'd likely run into other weird
    bup problems before your midx files get anywhere near 80 gigs.
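
    The arithmetic above, spelled out (this counts only the 20-byte sha1
    entries and ignores the header and the fanout table itself):

```python
SHA1_BYTES = 20          # size of one sha1 hash
max_hashes = 2 ** 32     # largest count a 32-bit fanout entry can hold

midx_bytes = SHA1_BYTES * max_hashes
print(midx_bytes)                # 85899345920
print(midx_bytes / 2.0 ** 30)    # 80.0 (GiB)
```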
    committed Feb 2, 2010
  4. cmd-midx: correctly handle a tiny nonzero number of objects.

    If all the sha1sums would have fit in a single page, the number of bits in
    the table would be negative, with odd results.  Now we just refuse to create
    the midx if there are too few objects *and* too few files, since it would be
    useless anyway.
    
    We're still willing to create a very small midx if it allows us to merge
    several indexes into one, however small, since that will still speed up
    searching.
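
    A sketch of why the bit count goes negative and what the check looks like.
    The 4096-byte page size, the "at least 2 files" threshold and the function
    names are assumptions for illustration, not the values cmd-midx really
    uses.

```python
import math

PAGE_SIZE = 4096                      # assumed page size
SHA1_BYTES = 20
SHA_PER_PAGE = PAGE_SIZE / float(SHA1_BYTES)


def fanout_bits(total_hashes):
    """Bits needed so each fanout entry covers about one page of hashes."""
    pages = total_hashes / SHA_PER_PAGE
    # With well under one page of hashes, log2(pages) drops below zero.
    return int(math.ceil(math.log2(pages)))


def should_create_midx(total_hashes, num_idx_files, min_files=2):
    """Refuse only when there are too few objects *and* too few files."""
    if total_hashes <= 0:
        return False
    too_few_objects = total_hashes < SHA_PER_PAGE
    too_few_files = num_idx_files < min_files  # hypothetical threshold
    return not (too_few_objects and too_few_files)
```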
    committed Feb 2, 2010
  5. Use a heapq object to accelerate git.idxmerge().

    This greatly accelerates bup margin and bup midx when you're iterating
    through a large number of packs.
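
    Roughly how a heap helps here: keep one "current" entry per index on the
    heap, so finding the next-smallest hash costs O(log n) instead of scanning
    all n indexes.  This is a standalone sketch, not the real git.idxmerge().

```python
import heapq


def idxmerge(idxlist):
    """Yield every entry from several sorted iterables in global sorted order."""
    heap = []
    for n, idx in enumerate(idxlist):
        it = iter(idx)
        for first in it:
            # n breaks ties, so the iterators themselves are never compared.
            heap.append((first, n, it))
            break
    heapq.heapify(heap)
    while heap:
        value, n, it = heap[0]
        yield value
        for nxt in it:
            # Swap the smallest entry for the next one from the same index.
            heapq.heapreplace(heap, (nxt, n, it))
            break
        else:
            heapq.heappop(heap)  # that index is exhausted
```

    The standard library's heapq.merge() does the same job on newer pythons.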
    committed Feb 2, 2010
  6. cmd-margin: a command to find out the max bits of overlap between hashes.
    
    Run 'bup margin' to go through the list of all the objects in your bup
    directory and count the number of overlapping prefix bits between each two
    consecutive objects.  That is, find the longest hash length (in bits) that
    *would* have caused an overlap, if sha1 hashes had been that length.
    
    On my system with 111 gigs of packs, I get 44 bits.  Out of a total of 160.
    That means I'm still safe from collisions for about 2^116 times over.  Or is
    it only the square root of that?  Anyway, it's such a large number that my
    brain explodes just thinking about it.
    
    Mark my words: 2^160 ought to be enough for anyone.
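
    The core of the calculation can be sketched like this (function names are
    made up; the real command walks the pack indexes rather than a list).
    Because the hashes are sorted, the longest shared prefix between any two
    of them always shows up between two neighbours.

```python
def common_prefix_bits(a, b):
    """Number of leading bits shared by two equal-length byte strings."""
    bits = 0
    for x, y in zip(bytearray(a), bytearray(b)):
        diff = x ^ y
        if diff == 0:
            bits += 8
            continue
        # Leading zero bits of the first differing byte.
        bits += 8 - diff.bit_length()
        break
    return bits


def margin(hashes):
    """Longest shared prefix, in bits, between consecutive sorted hashes."""
    ordered = sorted(hashes)
    pairs = zip(ordered, ordered[1:])
    return max((common_prefix_bits(a, b) for a, b in pairs), default=0)
```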
    committed Feb 2, 2010
Commits on Jan 31, 2010
  1. Update README.md to reflect recent developments.

    - Remove the version number since I never remember to update it
    - We now work with earlier versions of python and MacOS
    - There's now a mailing list
    - 'bup fsck' allows us to remove one of the things from the "stupid" list.
    committed Jan 31, 2010
  2. Move testfile[12] into t/

    Since they're only used for testing, they belong there, after all.
    committed Jan 31, 2010
  3. fsck: add a -j# (run multiple threads) option.

    Sort of like make -j.  par2 can be pretty slow, so this lets us verify
    multiple files in parallel.  Since the files are so big, though, this might
    actually make performance *worse* if you don't have a lot of RAM.  I haven't
    benchmarked this too much, but on my quad-core with 6 gigs of RAM, -j4 does
    definitely make it go "noticeably" faster.
    committed Jan 31, 2010
  4. Basic cmd-fsck for checking integrity of packfiles.

    It also uses the 'par2' command, if available, to automatically generate
    redundancy data, or to use that data for repair purposes.
    
    Includes handy unit test.
    committed Jan 30, 2010
  5. cmd-damage: a program for randomly corrupting file contents.

    Sure, that *sounds* like a terrible idea.  But it's fun for testing recovery
    algorithms, at least.
    committed Jan 30, 2010
  6. Use mkstemp() when creating temporary packfiles.

    Using getpid() was an okay hack, but there's no good excuse for doing it
    that way when there are perfectly good tempfile-naming functions around
    already.
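
    For example, something like the following; the ".pack" suffix and the
    helper name are just for illustration.

```python
import os
import tempfile


def open_temp_pack(objdir):
    """Create a uniquely named temporary file in objdir and open it."""
    # mkstemp creates the file atomically, so two writers can't collide the
    # way they could with a name built from getpid().
    fd, path = tempfile.mkstemp(suffix=".pack", dir=objdir)
    return os.fdopen(fd, "w+b"), path
```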
    committed Jan 30, 2010
  7. rewire a try/finally with a yield inside to be compatible with python 2.4
    
    Apparently you can't put a 'yield' inside a try/finally in older versions of
    python.  lame.
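
    One way to get the same effect without a 'yield' inside try/finally is to
    use try/except, re-raise, and repeat the cleanup on the normal path.  This
    is a generic sketch of that pattern, not necessarily the exact change made
    here; iter_with_cleanup and cleanup are made-up names.

```python
def iter_with_cleanup(items, cleanup):
    """Yield each item, then call cleanup() whether or not iteration failed."""
    try:
        for item in items:
            yield item
    except:
        # Error path: clean up, then let the exception continue upward.
        cleanup()
        raise
    # Normal path: the same cleanup a 'finally' clause would have run.
    cleanup()
```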
    committed Jan 31, 2010
Commits on Jan 30, 2010
  1. client: fix a race condition when the server suggests an index.

    If we finished our current pack too quickly after getting the suggestion,
    the client would get confused, resulting in 'expected "ok", got %r' type
    errors.
    committed Jan 30, 2010
Commits on Jan 27, 2010
  1. cmd-ls and cmd-fuse: toys for browsing your available backups.

    'bup ls' lets you browse the set of backups on your current system.  It's a
    bit useless, so it might go away or be rewritten eventually.
    
    'bup fuse' is a simple read-only FUSE filesystem that lets you mount your
    backup sets as a filesystem (on Linux only).  You can then export this
    filesystem over samba or NFS or whatever, and people will be able to restore
    their own files from backups.
    
    Warning: we still don't support file metadata in 'bup save', so all the file
    permissions will be wrong (and users will probably be able to see things
    they shouldn't!).  Also, anything that has been split into chunks will show
    you the chunks instead of the full file, which is a bit silly.  There are
    also tons of places where performance could be improved.
    
    But it's a pretty neat toy nevertheless.  To try it out:
    
       mkdir /tmp/backups
       sudo bup fuse /tmp/backups
    committed Jan 27, 2010
Commits on Jan 25, 2010
  1. cmd-midx: some performance optimizations.

    Approximately doubles the speed of generating indexes.
    committed Jan 25, 2010
  2. cmd-midx: add --auto and --force options.

    Rather than having to list the indexes you want to merge, now it can do it
    for you automatically.  The output filename is now also optional; it'll
    generate it in the right place in the git repo automatically.
    committed Jan 25, 2010
  3. When there are multiple overlapping .midx files, discard redundant ones.

    That way if someone generates a .midx for a subset of .idx files, then
    another for the *entire* set of .idx files, we'll automatically ignore the
    former one, thus increasing search speed and improving memory thrashing
    behaviour even further.
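
    The pruning rule can be sketched as follows, assuming we already know
    which .idx files each .midx covers (the dict layout and function name are
    hypothetical):

```python
def prune_midxes(midxes):
    """Return the .midx names worth keeping.

    midxes maps a .midx name to the set of .idx names it covers.
    """
    kept, covered = [], set()
    # Biggest first, so a midx covering everything wins over its subsets.
    by_size = sorted(midxes.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, idxs in by_size:
        if set(idxs) <= covered:
            continue  # everything in here is already covered: redundant
        kept.append(name)
        covered |= set(idxs)
    return kept
```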
    committed Jan 25, 2010
  4. MultiPackIndex: use .midx files if they exist.

    Wow, using a single .midx file that merges my 435 megs of packfile indexes
    (across 169 files) reduces memory churn in memtest.py by at least two orders
    of magnitude.  (ie. we need to map 100x fewer memory pages in order to
    search for each nonexistent object when creating a new backup)  memtest.py
    runs *visibly* faster.
    
    We can also remove the PackBitmap code now, since it's not nearly as good as
    the PackMidx stuff and is now an unnecessary layer of indirection.
    committed Jan 25, 2010
  5. cmd-midx: a command for merging multiple .idx files into one.

    This introduces a new "multi-index" index format, as suggested by Lukasz
    Kosewski.
    
    .midx files have a variable-bit-width fanout table that's supposedly
    optimized to be able to find any sha1 while dirtying only two pages (one for
    the fanout table lookup, and one for the final binary search).  Each entry
    in the fanout table should correspond to approximately one page's worth of
    sha1sums.
    
    Also adds a PackMidx class, which acts just like PackIndex, but for .midx
    files.  Not using it for anything yet, though.  The idea is to greatly
    reduce memory burn when searching through lots of pack files.
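
    A toy in-memory version of the lookup: one fanout read narrows the search
    to a small range, then a binary search runs inside that range.  The layout
    (cumulative counts keyed by the top bits of the hash) and the names are
    assumptions for the example; the on-disk .midx format may differ.

```python
import bisect
import struct


def prefix(h, bits):
    """Top `bits` bits of a hash (assumes 0 <= bits <= 32, len(h) >= 4)."""
    return struct.unpack(">I", h[:4])[0] >> (32 - bits)


def build_fanout(sorted_hashes, bits):
    """fanout[i] = number of hashes whose top `bits` bits are <= i."""
    fanout = [0] * (1 << bits)
    for h in sorted_hashes:
        fanout[prefix(h, bits)] += 1
    for i in range(1, len(fanout)):
        fanout[i] += fanout[i - 1]
    return fanout


def exists(sorted_hashes, fanout, bits, h):
    """True if h is present; searches only the slice the fanout points at."""
    p = prefix(h, bits)
    lo = fanout[p - 1] if p else 0
    hi = fanout[p]
    i = bisect.bisect_left(sorted_hashes, h, lo, hi)
    return i < hi and sorted_hashes[i] == h
```

    With enough bits, each [lo, hi) slice is about one page of sha1sums, which
    is where the "only two pages" estimate comes from.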
    committed Jan 25, 2010
  6. Rename chashsplit.so to _hashsplit.so.

    Looks like the python standard is _modulename.so when making a C helper for
    a module named modulename.py, so let's do it that way.
    
    Also get rid of the annoying "module" suffix in the .c's filename.  Not sure
    why I ever thought that was needed.
    committed Jan 25, 2010
Commits on Jan 24, 2010
  1. toplevel exit() doesn't work in python 2.4.

    Use sys.exit() instead.
    committed Jan 24, 2010
  2. In some versions of python, comparing buffers with < gives a warning.

    It seems to be a buggy warning.  But we only really do it in one place, and
    buffers in question are only 20 bytes long, so forcing them into strings
    seems harmless enough.
    committed Jan 24, 2010
  3. Wrap mmap calls to help with portability.

    python2.4 in 'fink' on MacOS X seems to not like it when you pass a file
    length of 0, even though that's supposed to mean "determine map size
    automatically."
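
    A sketch of such a wrapper: look up the file size ourselves instead of
    passing 0.  The function name is made up, and returning an empty byte
    string for an empty file is an assumption about what callers can accept.

```python
import mmap
import os


def mmap_read(f, size=0):
    """Map an open file read-only, never handing a length of 0 to mmap."""
    if not size:
        size = os.fstat(f.fileno()).st_size
    if not size:
        # Nothing to map; an empty string behaves like a zero-length map.
        return b""
    return mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
```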
    committed Jan 24, 2010
  4. Makefile: build module using python distutils instead of manually.

    This makes it work with fink's version of python, among possibly other
    things.
    
    So now we can build chashsplit.so even on MacOS X tiger, even though tiger's
    python 2.3 is too old, by installing fink's python24 package first.
    committed Jan 24, 2010
  5. executable files: don't assume python2.5.

    The forcing of version 2.5 was leftover from before, when it was
    accidentally selecting python 2.4 on some distros when both
    versions are installed.  But actually that's fine; bup works in python 2.4
    without problems.
    
    So let's not cause potentially *more* portability problems by forcing python
    2.5 when it might not exist.
    committed Jan 24, 2010
  6. Makefile: oops, all the $^ and $< were backwards.

    Not that it mattered, since all our files only had one dependency each.  But
    it causes confusion if you ever add extra ones.
    committed Jan 24, 2010
  7. Make README a symlink to README.md

    So as not to confuse anyone who has linked to the README file on github in
    the past.
    committed Jan 24, 2010