Naturally, estimating the time remaining is one of those things that sounds super easy, but isn't. So the numbers wobble around a bit more than I'd like, especially at first. But apply a few scary heuristics, and boom! Stuff happens.
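For the record, the general shape of such a heuristic looks something like this (a sketch of one common approach, with made-up names; not the actual bup code):

    import time

    class ETAEstimator:
        """Guess time remaining from an exponentially smoothed transfer rate."""
        def __init__(self, total_bytes, smoothing=0.1):
            self.total = total_bytes
            self.done = 0
            self.rate = None              # smoothed bytes per second
            self.smoothing = smoothing
            self.last_time = time.time()

        def update(self, new_bytes):
            now = time.time()
            elapsed = now - self.last_time
            if elapsed > 0:
                instant = new_bytes / elapsed
                if self.rate is None:
                    self.rate = instant
                else:
                    # smooth the instantaneous rate so the ETA doesn't jump around
                    self.rate += self.smoothing * (instant - self.rate)
            self.done += new_bytes
            self.last_time = now

        def eta_seconds(self):
            if not self.rate:
                return None               # not enough data yet
            return (self.total - self.done) / self.rate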
This is less boring than seeing a blank screen while we download 5+ megs of stuff.
Currently midx files can't tell you *which* index contains a particular hash, just that *one* of them does. So bup-server was barfing when it expected MultiPackIndex.exists() to return a pack name, and was getting a .midx file instead. We could have loosened the assertion and allowed the server to suggest a .midx file... but those can be huge, and it defeats the purpose of only suggesting the minimal set of packs so that lightweight clients aren't overwhelmed.
If we encountered an error *writing* the pack, we were counting it as a non-fatal error, which was not the intention. Only *reading* files we want to back up should be considered non-fatal.
It wasn't printing often enough, and thus was absent more often than present.
We were already returning integers, which seem to be "long ints" in this case, even though they're relatively small. Whatever, we'll typecast them to int first, and now unit tests pass.
* indexrewrite:
  - Greatly improved progress reporting during index/save.
  - Fix bugs in new indexing code.
  - Speed up cmd-drecurse by 40%.
  - Split directory recursion stuff from cmd-index.py into drecurse.py.
  - Massive speedups to bupindex code.
Now that the index reading stuff is much faster, we can afford to waste time reading through it just to count how many bytes we're planning to back up. And that lets us print really friendly progress messages during bup save, in which we can tell you exactly what fraction of your bytes have been backed up so far.
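Conceptually it's just two passes over the index (a sketch with hypothetical names, not the actual bup save internals):

    class Entry:
        def __init__(self, name, size, needs_backup):
            self.name, self.size, self.needs_backup = name, size, needs_backup

    def save_with_progress(entries, save_file):
        # First pass: total up the bytes we expect to write.
        total = sum(e.size for e in entries if e.needs_backup)
        done = 0
        # Second pass: do the real work, reporting the exact fraction as we go.
        for e in entries:
            if not e.needs_backup:
                continue
            save_file(e)
            done += e.size
            pct = 100.0 * done / total if total else 100.0
            print('Saving: %.2f%% (%d/%d bytes)' % (pct, done, total))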
The logic was way too screwy, so I've simplified it a lot. Also extended the unit tests quite a bit to replicate the weird problems I was having. It seems pretty stable - and pretty fast - now. Iterating through an index of my whole home directory (bup index -p ~) now takes about 5.1 seconds, vs. 3.5 seconds before the rewrite. However, iterating through just a *fraction* of the index can now bypass all the parts we don't care about, so it's much much faster than before. Could probably still stand some more optimization eventually, but at least the file format allows for speed. The rest is just code :)
It's now 40% faster, i.e. 1.769 seconds or so to go through my home directory, instead of the previous 2.935. Still sucks compared to the native C 'find' command, but that's probably about as good as it's going to get in pure python.
Also add a new command, 'bup drecurse', which just recurses through a directory tree and prints all the filenames. This is useful for timing performance vs. the native 'find' command. The result is a bit embarrassing; for my home directory of about 188000 files, drecurse is about 10x slower:

    $ time bup drecurse -q ~
    real    0m2.935s
    user    0m2.312s
    sys     0m0.580s

    $ time find ~ -printf ''
    real    0m0.385s
    user    0m0.096s
    sys     0m0.284s

    $ time find ~ -printf '%s\n' >/dev/null
    real    0m0.662s
    user    0m0.208s
    sys     0m0.456s
The old file format was modeled after the git one, but it was kind of dumb; you couldn't search through the file except linearly, which is pretty slow when you have hundreds of thousands, or millions, of files. It also stored the entire pathname of each file, which got very wasteful as filenames got longer. The new format is much quicker; each directory has a pointer to its list of children, so you can jump around rather than reading linearly through the file. Thus you can now 'bup index -p' any subdirectory pretty much instantly. The code is still not completely optimized, but the remaining algorithmic silliness doesn't seem to matter. And it even still passes unit tests! Which is too bad, actually, because I still get oddly crashy behaviour when I repeatedly update a large index. So there are still some screwy bugs hanging around. I guess that means we need better unit tests...
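To make the 'jump around' part concrete, here's a sketch of the idea (illustrative layout only, not the real bupindex format):

    import struct

    # Illustrative sketch -- each directory entry stores the file offset of its
    # child list, so finding a subdirectory takes a handful of seeks instead of
    # a linear scan through the whole index.
    def find_subdir(f, root_offset, path_parts):
        """path_parts: list of byte strings, e.g. [b'home', b'me', b'src']."""
        offset = root_offset
        for part in path_parts:
            f.seek(offset)
            (nchildren,) = struct.unpack('!I', f.read(4))
            offset = None
            for _ in range(nchildren):
                namelen, child_offset = struct.unpack('!HQ', f.read(10))
                name = f.read(namelen)
                if name == part:
                    offset = child_offset     # jump straight to the child list
                    break
            if offset is None:
                return None                   # path component not found
        return offset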
This makes it only back up files smaller than the given size. bup can handle big files, but you might want to do quicker incremental backups and skip bigger files except once a day, or something. It's also handy for testing.
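For example, something along these lines (assuming the option ended up spelled --smaller and takes a byte count; check 'bup save --help' for the real spelling):

    # quick incremental that skips anything 100000 bytes or larger
    bup save --smaller=100000 -n quick /home/me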
I was trying to be future-proof, but it was kind of overkill, since a 32-bit fanout entry could handle a total of 4 billion *hashes* per midx. That would be 20*4bil = 80 gigs in a single midx. This corresponds to about 10 terabytes of packs, which isn't inconceivable... but if it happens, you could just use more than one midx. Plus you'd likely run into other weird bup problems before your midx files get anywhere near 80 gigs.
If all the sha1sums would have fit in a single page, the number of bits in the table would be negative, with odd results. Now we just refuse to create the midx if there are too few objects *and* too few files, since it would be useless anyway. We're still willing to create a very small midx if it allows us to merge several indexes into one, however small, since that will still speed up searching.
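Roughly where the negative number came from (a sketch of the calculation, not the exact midx-writing code):

    import math

    SHA_PER_PAGE = 4096 / 20.0    # roughly how many 20-byte sha1s fit in a 4k page

    def fanout_bits(total_objects):
        # Each fanout entry should cover about one page worth of sha1s.  When
        # everything fits in a single page, log2() of a value below 1 goes
        # negative -- the source of the odd results.
        return int(math.ceil(math.log(total_objects / SHA_PER_PAGE, 2)))

    print(fanout_bits(100))       # -1: the degenerate case we now refuse
    print(fanout_bits(500000))    # 12: a sensible table size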
This greatly accelerates bup margin and bup midx when you're iterating through a large number of packs.
…hes. Run 'bup margin' to go through the list of all the objects in your bup directory and count the number of overlapping prefix bits between each pair of consecutive objects. That is, find the longest hash length (in bits) that *would* have caused an overlap, if sha1 hashes had been that length. On my system with 111 gigs of packs, I get 44 bits. Out of a total of 160. That means I'm still safe from collisions for about 2^116 times over. Or is it only the square root of that? Anyway, it's such a large number that my brain explodes just thinking about it. Mark my words: 2^160 ought to be enough for anyone.
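The core of the calculation is something like this (a sketch, not the actual bup margin source):

    def overlap_bits(a, b):
        """Count how many leading bits two 20-byte sha1 digests have in common."""
        bits = 0
        for x, y in zip(bytearray(a), bytearray(b)):
            diff = x ^ y
            if not diff:
                bits += 8
                continue
            # count the matching high-order bits of the first differing byte
            while not (diff & 0x80):
                bits += 1
                diff <<= 1
            break
        return bits

    def margin(sorted_hashes):
        """Longest prefix (in bits) shared by any two consecutive hashes."""
        return max(overlap_bits(a, b)
                   for a, b in zip(sorted_hashes, sorted_hashes[1:]))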
- Remove the version number since I never remember to update it
- We now work with earlier versions of python and MacOS
- There's now a mailing list
- 'bup fsck' allows us to remove one of the things from the "stupid" list.
Since they're only used for testing, they belong there, after all.
Sort of like make -j. par2 can be pretty slow, so this lets us verify multiple files in parallel. Since the files are so big, though, this might actually make performance *worse* if you don't have a lot of RAM. I haven't benchmarked this too much, but on my quad-core with 6 gigs of RAM, -j4 does definitely make it go "noticeably" faster.
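The mechanism is just a bounded pool of par2 child processes, roughly like this (a sketch assuming 'par2' is on the PATH; not the actual bup fsck code):

    import os, subprocess

    def verify_all(par2_files, jobs=4):
        """Run 'par2 verify' on several .par2 files, at most `jobs` at a time."""
        devnull = open(os.devnull, 'wb')
        pending = list(par2_files)
        running = []
        failed = []
        while pending or running:
            # top up the pool
            while pending and len(running) < jobs:
                f = pending.pop()
                p = subprocess.Popen(['par2', 'verify', f],
                                     stdout=devnull, stderr=devnull)
                running.append((f, p))
            # wait for the oldest job to finish
            f, p = running.pop(0)
            if p.wait() != 0:
                failed.append(f)
        return failed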
It also uses the 'par2' command, if available, to automatically generate redundancy data, or to use that data for repair purposes. Includes handy unit test.
Sure, that *sounds* like a terrible idea. But it's fun for testing recovery algorithms, at least.
Using getpid() was an okay hack, but there's no good excuse for doing it that way when there are perfectly good tempfile-naming functions around already.
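In other words, something like this, using the standard tempfile module (a sketch of the general idea, not the exact change):

    import os, tempfile

    # Instead of inventing a name by hand (e.g. 'tmpfile.%d' % os.getpid()),
    # let the standard library pick a safe, unique one.
    fd, tmpname = tempfile.mkstemp(suffix='.tmp', dir='.')
    try:
        os.write(fd, b'some data')
    finally:
        os.close(fd)
    os.rename(tmpname, 'output.idx')   # rename into place when we're done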
… 2.4 Apparently you can't put a 'yield' inside a try/finally in older versions of python. lame.
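For reference, this is the construct that Python before 2.5 rejects with a SyntaxError, plus the portable workaround of doing the cleanup in the caller (the function names here are made up for illustration):

    # Older pythons (before 2.5) reject this with a SyntaxError:
    # yield isn't allowed in a try block that has a finally clause.
    def read_objects_new_style(f):
        try:
            while True:
                obj = f.read(20)
                if not obj:
                    break
                yield obj
        finally:
            f.close()

    # Portable workaround: keep the generator plain and do the cleanup
    # in the caller instead.
    def read_objects(f):
        while True:
            obj = f.read(20)
            if not obj:
                break
            yield obj

    def process(path):
        f = open(path, 'rb')
        try:
            for obj in read_objects(f):
                pass                      # do something with each object
        finally:
            f.close()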
If we finished our current pack too quickly after getting the suggestion, the client would get confused, resulting in 'expected "ok", got %r' type errors.
'bup ls' lets you browse the set of backups on your current system. It's a bit useless, so it might go away or be rewritten eventually.

'bup fuse' is a simple read-only FUSE filesystem that lets you mount your backup sets as a filesystem (on Linux only). You can then export this filesystem over samba or NFS or whatever, and people will be able to restore their own files from backups.

Warning: we still don't support file metadata in 'bup save', so all the file permissions will be wrong (and users will probably be able to see things they shouldn't!). Also, anything that has been split into chunks will show you the chunks instead of the full file, which is a bit silly. There are also tons of places where performance could be improved.

But it's a pretty neat toy nevertheless. To try it out:

    mkdir /tmp/backups
    sudo bup fuse /tmp/backups
Approximately doubles the speed of generating indexes.
Rather than having to list the indexes you want to merge, now it can do it for you automatically. The output filename is now also optional; it'll generate it in the right place in the git repo automatically.
That way if someone generates a .midx for a subset of .idx files, then another for the *entire* set of .idx files, we'll automatically ignore the former one, thus increasing search speed and improving memory thrashing behaviour even further.
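Conceptually, when scanning the index directory, it's something like this (a sketch of the idea, not the actual bup code):

    def choose_midxes(midxes):
        """midxes: list of (midx_name, set_of_idx_names_it_covers) tuples.
        Keep only the ones not already covered by a bigger .midx."""
        ordered = sorted(midxes, key=lambda m: len(m[1]), reverse=True)
        kept = []
        covered = set()
        for name, idxs in ordered:
            if idxs <= covered:
                continue              # a bigger .midx already covers all of these
            kept.append(name)
            covered |= idxs
        return kept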
Wow, using a single .midx file that merges my 435 megs of packfile indexes (across 169 files) reduces memory churn in memtest.py by at least two orders of magnitude (i.e. we need to map 100x fewer memory pages in order to search for each nonexistent object when creating a new backup). memtest.py runs *visibly* faster. We can also remove the PackBitmap code now, since it's not nearly as good as the PackMidx stuff and is now an unnecessary layer of indirection.
This introduces a new "multi-index" index format, as suggested by Lukasz Kosewski. .midx files have a variable-bit-width fanout table that's supposedly optimized to be able to find any sha1 while dirtying only two pages (one for the fanout table lookup, and one for the final binary search). Each entry in the fanout table should correspond to approximately one page's worth of sha1sums. Also adds a PackMidx class, which acts just like PackIndex, but for .midx files. Not using it for anything yet, though. The idea is to greatly reduce memory burn when searching through lots of pack files.
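The lookup itself is a two-step affair, roughly like this (a sketch of the scheme, not the real PackMidx code): the top few bits of the sha1 index into the fanout table, which narrows the final binary search down to about one page worth of entries.

    import bisect, struct

    def midx_exists(fanout_bits, fanout, sha_list, want):
        """Look up a 20-byte sha1 in a sorted list using a fanout table.
        fanout[i] = how many sha1s have their top `fanout_bits` bits <= i."""
        (top,) = struct.unpack('!I', want[:4])
        prefix = top >> (32 - fanout_bits)
        start = fanout[prefix - 1] if prefix > 0 else 0    # page one: fanout table
        end = fanout[prefix]
        i = bisect.bisect_left(sha_list, want, start, end) # page two: binary search
        return i < end and sha_list[i] == want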
Looks like the python standard is _modulename.so when making a C helper for a module named modulename.py, so let's do it that way. Also get rid of the annoying "module" suffix in the .c's filename. Not sure why I ever thought that was needed.