
DESIGN: mention bloom filters.

Jeff Anderson-Lee discovered the missing information and posted to the
mailing list.  Gabriel Filion reminded me to actually update the docs :)

Signed-off-by: Avery Pennarun <>
commit a76207b92720dc1fd7ef7ce6e6793d3d30bb556f (1 parent: 64abc2f)
@apenwarr committed May 8, 2011
Showing with 8 additions and 4 deletions.
  1. +8 −4 DESIGN
@@ -281,7 +281,7 @@ they're written.
But that leads us to our next problem.
-Huge numbers of huge packfiles (, cmd/midx)
+Huge numbers of huge packfiles (,, cmd/midx, cmd/bloom)
Git isn't actually designed to handle super-huge repositories. Most git
@@ -354,9 +354,13 @@ You generate midx files with 'bup midx'. The downside of midx files is that
generating one takes a while, and you have to regenerate it every time you
add a few packs.
-(Computer Sciency observers will note that there are some interesting data
-structures out there that could help make things even better. A very
-promising sounding one is called a "bloom filter." Look it up in Wikipedia.)
+UPDATE: Brandon Low contributed an implementation of "bloom filters", which
+have even better characteristics than midx for certain uses. Look it up in
+Wikipedia. He also massively sped up both midx and bloom by rewriting the
+key parts in C. The nicest thing about bloom filters is we can update them
+incrementally every time we get a new idx, without regenerating from
+scratch. That makes the update phase much faster, and means we can also get
+away with generating midxes less often.
midx files are a bup-specific optimization and git doesn't know what to do
with them. However, since they're stored as separate files, they don't
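
The incremental-update property described in the added DESIGN text can be illustrated with a minimal sketch. This is not bup's implementation (which, per the text above, has its key parts written in C). The class name `BloomFilter`, the method `add_idx`, and the parameters `nbits` and `k` are hypothetical, and slicing bit positions directly out of each SHA-1 is an assumption made for brevity.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter over 20-byte SHA-1 object IDs (illustrative only)."""

    def __init__(self, nbits=1 << 16, k=4):
        # Assumptions: nbits is a multiple of 8, and k <= 5 so that
        # k 4-byte slices fit inside a 20-byte SHA-1.
        self.nbits = nbits
        self.k = k
        self.bits = bytearray(nbits // 8)

    def _positions(self, sha):
        # A SHA-1 is already uniformly distributed, so take k 4-byte
        # slices of it as the k hash values instead of hashing again.
        for i in range(self.k):
            chunk = sha[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.nbits

    def add(self, sha):
        # Adding only ever sets bits; nothing already stored is disturbed.
        for pos in self._positions(sha):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def add_idx(self, shas):
        # Incremental update: fold in the objects of one new idx
        # without regenerating the filter from scratch.
        for sha in shas:
            self.add(sha)

    def __contains__(self, sha):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(sha))


def sha1(data):
    return hashlib.sha1(data).digest()


bloom = BloomFilter()
bloom.add_idx(sha1(b"object %d" % i) for i in range(100))       # first idx
bloom.add_idx(sha1(b"object %d" % i) for i in range(100, 150))  # a new idx arrives
print(sha1(b"object 120") in bloom)  # True: members are never reported absent
```

Because adding an object only ever sets bits, a new idx can be merged in without touching earlier entries. The cost is that a lookup can return a false positive, so a hit still has to be confirmed against a real index.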
