From a76207b92720dc1fd7ef7ce6e6793d3d30bb556f Mon Sep 17 00:00:00 2001
From: Avery Pennarun
Date: Sun, 8 May 2011 01:31:59 -0400
Subject: [PATCH] DESIGN: mention bloom filters.

Jeff Anderson-Lee discovered the missing information and posted to the
mailing list. Gabriel Filion reminded me to actually update the docs :)

Signed-off-by: Avery Pennarun
---
 DESIGN | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/DESIGN b/DESIGN
index 3e4737a89..dfe30f360 100644
--- a/DESIGN
+++ b/DESIGN
@@ -281,7 +281,7 @@ they're written.
 But that leads us to our next problem.
 
 
-Huge numbers of huge packfiles (midx.py, cmd/midx)
+Huge numbers of huge packfiles (midx.py, bloom.py, cmd/midx, cmd/bloom)
 ------------------------------
 
 Git isn't actually designed to handle super-huge repositories. Most git
@@ -354,9 +354,13 @@ You generate midx files with 'bup midx'. The downside of midx files is that
 generating one takes a while, and you have to regenerate it every time you
 add a few packs.
 
-(Computer Sciency observers will note that there are some interesting data
-structures out there that could help make things even better. A very
-promising sounding one is called a "bloom filter." Look it up in Wikipedia.)
+UPDATE: Brandon Low contributed an implementation of "bloom filters", which
+have even better characteristics than midx for certain uses. Look it up in
+Wikipedia. He also massively sped up both midx and bloom by rewriting the
+key parts in C. The nicest thing about bloom filters is we can update them
+incrementally every time we get a new idx, without regenerating from
+scratch. That makes the update phase much faster, and means we can also get
+away with generating midxes less often.
 
 midx files are a bup-specific optimization and git doesn't know what to do
 with them. However, since they're stored as separate files, they don't
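
Reviewer note: the incremental-update property the new DESIGN text relies on
can be illustrated with a toy Bloom filter. This is only a sketch under stated
assumptions, not bup's bloom.py or its on-disk format: the class name
BloomFilter, the parameters nbits and nhashes, and the salted SHA-1 hashing
scheme are all hypothetical choices made for illustration. The point is that
adding the object IDs from a new idx only sets more bits, so the existing
filter never has to be rebuilt.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; not bup's bloom.py API or format)."""

    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        # Round up so every bit position has a backing byte.
        self.bits = bytearray((nbits + 7) // 8)

    def _positions(self, key):
        # Derive nhashes bit positions from salted SHA-1 digests of the key.
        for i in range(self.nhashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        # Adding only ever sets bits, which is why updates are incremental.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Stand-ins for the object IDs listed in two idx files (made-up data).
first_idx = [hashlib.sha1(b"object %d" % i).digest() for i in range(100)]
new_idx = [hashlib.sha1(b"object %d" % i).digest() for i in range(100, 150)]

bloom = BloomFilter()
for oid in first_idx:
    bloom.add(oid)
snapshot = bytes(bloom.bits)

# A new idx arrives: fold its IDs into the existing filter, no rebuild.
for oid in new_idx:
    bloom.add(oid)
```

A lookup that returns False proves the object is in none of the indexed
packs, so the slower idx or midx search can be skipped; a True result still
has to be confirmed against the real indexes because false positives are
possible.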