Sort of like make -j. par2 can be pretty slow, so this lets us verify multiple files in parallel. Since the files are so big, though, this might actually make performance *worse* if you don't have a lot of RAM. I haven't benchmarked this much, but on my quad-core with 6 gigs of RAM, -j4 definitely makes it go "noticeably" faster.
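A rough sketch of the fan-out idea, assuming one 'par2 verify' run per backup file (the pool size and helper names here are illustrative, not bup's actual code):

    import subprocess
    from multiprocessing import Pool

    def verify(path):
        # par2 exits nonzero if verification fails
        rc = subprocess.call(['par2', 'verify', path + '.par2'])
        return (path, rc == 0)

    def verify_all(paths, jobs=4):
        pool = Pool(processes=jobs)              # roughly "make -j4"
        return dict(pool.map(verify, paths))

Each worker is a separate par2 process chewing through its own big file, which is presumably why it can get slower, not faster, on a machine without much RAM.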
It also uses the 'par2' command, if available, to automatically generate redundancy data, or to use that data for repair purposes. Includes a handy unit test.
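For reference, the shape of the par2 calls involved (a hedged sketch; the -r10 redundancy level is just an example, not necessarily what bup passes):

    import subprocess

    def par2_generate(base):
        # writes base.par2 plus recovery volumes next to the file
        subprocess.check_call(['par2', 'create', '-r10', base + '.par2', base])

    def par2_repair(base):
        # rebuilds damaged blocks of `base` from the recovery volumes
        subprocess.check_call(['par2', 'repair', base + '.par2'])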
Sure, that *sounds* like a terrible idea. But it's fun for testing recovery algorithms, at least.
… 2.4 Apparently you can't put a 'yield' inside a try/finally in python versions before 2.5. lame.
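For reference (generic python history, not bup code): before 2.5 the first form below is a SyntaxError, so the cleanup has to move out of the finally clause:

    # Illegal before Python 2.5:
    #
    #   def lines(path):
    #       f = open(path)
    #       try:
    #           for line in f:
    #               yield line      # 'yield' not allowed inside try/finally
    #       finally:
    #           f.close()
    #
    # A 2.4-compatible restructuring:
    def lines(path):
        f = open(path)
        for line in f:
            yield line
        f.close()                   # only runs if the caller exhausts the generator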
If we finished our current pack too quickly after getting the suggestion, the client would get confused, resulting in 'expected "ok", got %r' type errors.
'bup ls' lets you browse the set of backups on your current system. It's a bit useless, so it might go away or be rewritten eventually.

'bup fuse' is a simple read-only FUSE filesystem that lets you mount your backup sets as a filesystem (on Linux only). You can then export this filesystem over samba or NFS or whatever, and people will be able to restore their own files from backups.

Warning: we still don't support file metadata in 'bup save', so all the file permissions will be wrong (and users will probably be able to see things they shouldn't!). Also, anything that has been split into chunks will show you the chunks instead of the full file, which is a bit silly. There are also tons of places where performance could be improved. But it's a pretty neat toy nevertheless.

To try it out:

    mkdir /tmp/backups
    sudo bup fuse /tmp/backups
That way if someone generates a .midx for a subset of .idx files, then another for the *entire* set of .idx files, we'll automatically ignore the former one, thus increasing search speed and improving memory thrashing behaviour even further.
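Roughly the rule being described, as a sketch (names are mine, not the real implementation):

    def prune_subsumed(midxes):
        # midxes: {midx_name: set of .idx filenames it was built from}
        keep = {}
        for name, idxs in midxes.items():
            covered = any(idxs < other
                          for oname, other in midxes.items() if oname != name)
            if not covered:         # not a proper subset of any other .midx
                keep[name] = idxs
        return keep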
Wow, using a single .midx file that merges my 435 megs of packfile indexes (across 169 files) reduces memory churn in memtest.py by at least two orders of magnitude. (ie. we need to map 100x fewer memory pages in order to search for each nonexistent object when creating a new backup) memtest.py runs *visibly* faster. We can also remove the PackBitmap code now, since it's not nearly as good as the PackMidx stuff and is now an unnecessary layer of indirection.
This introduces a new "multi-index" index format, as suggested by Lukasz Kosewski. .midx files have a variable-bit-width fanout table that's supposedly optimized to be able to find any sha1 while dirtying only two pages (one for the fanout table lookup, and one for the final binary search). Each entry in the fanout table should correspond to approximately one page's worth of sha1sums. Also adds a PackMidx class, which acts just like PackIndex, but for .midx files. Not using it for anything yet, though. The idea is to greatly reduce memory burn when searching through lots of pack files.
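The lookup described above boils down to something like this (a sketch with assumed names, not the actual PackMidx internals):

    import struct

    def midx_contains(sha, bits, fanout, shalist):
        # fanout[i] = number of sha1s whose top `bits` bits are <= i, so each
        # fanout slice holds roughly one page worth of sorted sha1s.
        prefix = struct.unpack('!I', sha[:4])[0] >> (32 - bits)
        start = fanout[prefix - 1] if prefix > 0 else 0
        end = fanout[prefix]
        while start < end:          # binary search within a single slice
            mid = (start + end) // 2
            if shalist[mid] < sha:
                start = mid + 1
            elif shalist[mid] > sha:
                end = mid
            else:
                return True
        return False

One page for the fanout entry, one (ish) for the slice being searched: that's where the "only two dirty pages" claim comes from.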
Looks like the python standard is _modulename.so when making a C helper for a module named modulename.py, so let's do it that way. Also get rid of the annoying "module" suffix in the .c's filename. Not sure why I ever thought that was needed.
It seems to be a buggy warning. But we only really do it in one place, and the buffers in question are only 20 bytes long, so forcing them into strings seems harmless enough.
This makes it work with fink's version of python, among possibly other things. So now we can build chashsplit.so even on MacOS X tiger, even though tiger's python 2.3 is too old, by installing fink's python24 package first.
The forcing of version 2.5 was leftover from before, when it was accidentally selecting python 2.4 on some distros where both versions were installed. But actually that's fine; bup works in python 2.4 without problems. So let's not cause potentially *more* portability problems by forcing python 2.5 when it might not exist.
That way it looks prettier on github.
When we receive a new .idx file, we auto-generate a .map file from it. It's essentially an allocation bitmap: for each 20-bit prefix, we assign one bit to tell us if that particular prefix is in that particular packfile. If it isn't, there's no point searching the .idx file at all, so we can avoid mapping in a lot of pages. If it is, though, we then have to search the .idx *too*, so we suffer a bit. On the whole this reduces memory thrashing quite a bit for me, though. Probably the number of bits needs to be variable in order to work over a wider range of packfile sizes/numbers.
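The check itself is tiny; something like this (field names are mine), with one bit per 20-bit prefix, so the whole .map is 2^20 / 8 = 128 kbytes:

    import struct

    MAP_BITS = 20

    def might_contain(bitmap, sha):
        # bitmap: a bytearray of 2**MAP_BITS / 8 bytes built from the .idx
        prefix = struct.unpack('!I', sha[:4])[0] >> (32 - MAP_BITS)
        return bool(bitmap[prefix >> 3] & (1 << (prefix & 7)))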
The majority of the memory usage in bup split/save is now caused by searching pack indexes for sha1 hashes. The problem with this is that, in the common case for a first full backup, *none* of the object hashes will be found, so we'll *always* have to search *all* the packfiles. With just 45 packfiles of 200k objects each, that makes about (18-8)*45 = 450 binary search steps, or 100+ 4k pages that need to be loaded from disk, to check *each* object hash. memtest.py lets us see how fast RSS creeps up under various conditions, and how different optimizations affect the result.
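The measurement itself is just periodic RSS sampling; roughly this (Linux-only, assuming 4k pages; not necessarily the exact memtest.py code):

    def rss_kb():
        # second field of /proc/self/statm is resident pages
        with open('/proc/self/statm') as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * 4   # assuming 4k pages

    print('RSS: %d kbytes' % rss_kb())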
If the given parameter is exactly an int (ie. str(int(v)) == v) then convert it to an int automatically. This helps avoid weird bugs in apps using the option parser.
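In other words, something along these lines (a sketch of the rule, not necessarily the exact option-parser code):

    def intify(v):
        # only convert values that round-trip exactly through int()
        try:
            i = int(v)
            if str(i) == v:
                return i
        except (TypeError, ValueError):
            pass
        return v

    assert intify('42') == 42
    assert intify('007') == '007'   # doesn't round-trip; left as a string
    assert intify('4.5') == '4.5'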
That just clutters the output; clearly what people *really* want to see is the list of files we're actually modifying. But if you want more, increase the verbosity and you'll get more.
A busy server could end up with a *large* number of index files, mostly referring to objects from other clients. Downloading all the indexes not only wastes bandwidth, but causes a more insidious problem: small servers end up having to mmap a huge number of large index files, which sucks lots of RAM. In general, the RAM on a server is roughly proportional to the disk space on that server. So it's okay for larger clients to need more RAM in order to complete a backup. However, it's not okay for the existence of larger clients to make smaller clients suffer. Hopefully this change will settle it a bit.
After some testing, it seems each object sha1 we need to cache while writing a pack costs us about 83 bytes of memory. (This isn't so great, so optimizing it in C later could cut this down a lot.) The new limit of 200k objects takes about 16.6 megs of RAM, which nowadays is pretty acceptable. It also corresponds to roughly 1GB of packfile for my random selection of sample data, so (since the default packfile limit is about 1GB anyway), this *mostly* won't matter. It will have an effect if your data is highly compressible, however; an 8192-byte object could compress down to a very small size and you'd end up with a large number of objects. The previous default limit of 10 million objects was ridiculous, since that would take 830 megs of RAM.
Thus, 'bup save' on huge files would suck lots of RAM.
* cygwin: Assorted cleanups to Luke's cygwin fixes.
  - Makefile: work with cygwin on different windows versions.
  - .gitignore sanity.
  - Makefile: On Windows, executable files must end with .exe.
  - client.py: Windows files don't support ':', so rename cachedir.
  - index.py: os.rename() fails on Windows if dstfile already exists. Don't try to rename tmpfiles into existing open files.
  - helpers.py: Cygwin doesn't support `hostname -f`, use `hostname`.
  - cmd-index.py: Retry os.open without O_LARGEFILE if not supported.
  - Makefile: Build on Windows under Cygwin.
There were a few things that weren't quite done how I would have done them, so I changed the implementation. Should still work in cygwin, though. The only actual functional changes are:
 - index.Reader.close() now actually sets m=None rather than just closing it
 - removed the "if rename fails, then unlink first" logic, which is seemingly not needed after all.
 - rather than special-casing cygwin to use "hostname" instead of "hostname -f", it turns out python has a socket.getfqdn() that does what we want.