Expand documentation.

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>
commit bff21db47f9c56b306ca6dc9eecffddb08da3d62 1 parent 4869142
@ezyang authored
Showing with 186 additions and 23 deletions.
  1. +3 −0  .gitignore
  2. +58 −0 NOTES
  3. +50 −23 cminsketch.ml
  4. +75 −0 cminsketch.mli
3  .gitignore
@@ -0,0 +1,3 @@
+*.html
+*.cmi
+*.css
58 NOTES
@@ -0,0 +1,58 @@
+Preliminaries: Pairwise Independent Hash Functions
+--------------------------------------------------
+
+One benefit of the count-min sketch is that it does not require p-wise
+independent hash families (a strong independence guarantee that requires
+sophisticated mathematical machinery); it only requires pairwise
+independent hash functions. Unfortunately, the subject of universal
+hash families is not one usually covered in undergraduate computer
+science, so in this section I will attempt to explicate enough of the
+background theory so that we can generate the set of 2-independent hash
+functions necessary to implement a count-min sketch.
+
+----
+
+Our first question is one of terminology. There is a somewhat diverse
+set of naming conventions for hash families which I will attempt to
+explicate here. We will assume the following shared conventions: our
+hash family is a family of functions from ``M → N`` where ``M = {0, 1,
+..., m-1}`` and ``N = {0, 1, ..., n-1}`` with ``m >= n``. M corresponds
+to our “universe”, the possible values being hashed, while N is the
+range of the hash function. This is the convention assumed by Motwani
+and Raghavan in *Randomized Algorithms*, which Cormode and Muthukrishnan
+reference in their paper.
+
+The weakest independence guarantee is as follows:
+
+ (WEAK) UNIVERSAL HASH FAMILY or (WEAK) 2-UNIVERSAL HASH FAMILY
+
+ For all x, y ∈ M such that x != y, and for h chosen uniformly at
+ random from H, ::
+
+ Pr[h(x) = h(y)] ≤ 1/n
+
+ We shall abbreviate this subsequently as::
+
+ ∀ x,y ∈ M, x != y. Pr[h(x) = h(y)] ≤ 1/n
+
+(Errata: One set of notes [1] I consulted claimed that
+``Pr[h(x) = h(y)] = d/n`` indicates a d-universal family; however, this
+appears to contradict the common formulation of a weak 2-universal hash
+family. Maybe this is an off-by-one?)
+
+Note that, definitionally speaking, 2-universal != 2-independent. The
+name 2-universal alludes to the fact that we are only verifying the
+probabilities pairwise between elements of the universe, so they behave
+*like* pairwise independent random variables. Strictly speaking, this
+is weaker than saying they are actually pairwise independent, as the
+following definition says:
+
+ STRONG UNIVERSAL HASH FAMILY
+ or (STRONGLY) 2-INDEPENDENT UNIVERSAL HASH FAMILY::
+
+ ∀ x,y ∈ M, x != y, a,b ∈ N.
+ Pr[h(x) = a ∧ h(y) = b] = 1/n²
+
+
+
+[1] http://courses.csail.mit.edu/6.851/spring07/erik/L11.pdf
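
The two definitions in NOTES can be checked exhaustively on a toy example. The sketch below is illustrative only: the prime `p`, the range size `n`, and the helper names are assumptions, and the two families used (the textbook Carter-Wegman constructions over a prime field) are not defined anywhere in this commit.

```ocaml
(* Toy parameters, chosen only to keep the exhaustive search small. *)
let p = 11   (* prime; the universe is M = {0, ..., p-1} *)
let n = 4    (* range size used for the weak family *)

(* Weak family: h_{a,b}(x) = ((a*x + b) mod p) mod n, with a <> 0. *)
let weak a b x = ((a * x + b) mod p) mod n

(* Strong family: h_{a,b}(x) = (a*x + b) mod p, so here N = M. *)
let strong a b x = (a * x + b) mod p

(* Count the pairs (a, b), a in [a_lo, p-1] and b in [0, p-1],
   for which [pred a b] holds. *)
let count a_lo pred =
  let c = ref 0 in
  for a = a_lo to p - 1 do
    for b = 0 to p - 1 do
      if pred a b then incr c
    done
  done;
  !c

(* WEAK: for all x <> y, collisions / (p * (p-1)) <= 1/n. *)
let weak_ok =
  let ok = ref true in
  for x = 0 to p - 1 do
    for y = 0 to p - 1 do
      if x <> y then begin
        let collisions = count 1 (fun a b -> weak a b x = weak a b y) in
        if collisions * n > p * (p - 1) then ok := false
      end
    done
  done;
  !ok

(* STRONG: for all x <> y and all targets c, d, exactly one of the
   p^2 functions sends x to c and y to d, i.e. probability 1/p^2. *)
let strong_ok =
  let ok = ref true in
  for x = 0 to p - 1 do
    for y = 0 to p - 1 do
      if x <> y then
        for c = 0 to p - 1 do
          for d = 0 to p - 1 do
            let hits =
              count 0 (fun a b -> strong a b x = c && strong a b y = d) in
            if hits <> 1 then ok := false
          done
        done
    done
  done;
  !ok

let () = Printf.printf "weak: %b, strong: %b\n" weak_ok strong_ok
```

The weak check compares integer counts rather than floating-point probabilities, so the inequality is tested exactly.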
73 cminsketch.ml
@@ -2,18 +2,33 @@
* 32-bit or 64-bit (usually, they have one bit less precision), so
* care must be taken. *)
+(** Number of bits in an OCaml int (not native int.) *)
(* Warning: If this value is too small you *will* get out of bounds
* errors. Thus, a mythical OCaml implementation that does not
* reserve any bits for bookkeeping will not handle this properly.
* Fortunately, this is unlikely to change. *)
let int_size = Sys.word_size - 1
-let multiply_shift m a x = (a * x) lsr (int_size - m)
+(* The original paper suggested using Carter and Wegman's universal hash
+ * family, which is strongly 2-independent, and the paper stipulates
+ * this requirement. Multiply shift, however, has only been proven
+ * to be weakly universal. Fortunately, the proofs
+ * in the paper don't require 2-independence, and the author of the
+ * paper verified that multiply shift should have the necessary
+ * theoretical properties. *)
+let multiply_shift ~m ~a ~x = (a * x) lsr (int_size - m)
+(** Euler's number, e *)
let euler = exp 1.
+
+(** Base 2 logarithm *)
(* XXX probably there is something more efficient, but this is only
* called for sketch creation, which is fairly infrequent *)
let lg x = log x /. log 2.
+
+(** Rounds a float up to the nearest integer. *)
let int_ceil x = int_of_float (ceil x)
+
+(** Generates a random odd int. May be negative. *)
(* Sys.word_size - 2 bits is "just right" (30 bits for 32-bit
* and 62 bits for 64-bit), because one bit is thrown out due to
* implementation details, and one bit is thrown out due to us
@@ -25,52 +40,64 @@ let random_odd_int () =
* no way to ask for less entropy, since all of the functions
* in the Random module use bits internally *)
else (Random.bits () lor (Random.bits () lsl 30) lor (Random.bits () lsl 60)) * 2 + 1
+
+(** Increases the value at [(i, j)] in matrix [a] by [c]. *)
let step_matrix a i j c = a.(i).(j) <- a.(i).(j) + c
+
+(** Finds the minimum value of an array. *)
let minimum a = Array.fold_left min max_int a
-type cminsketch = CMinSketch of int * int array array * int array
+type sketch = { lg_width : int;
+ count: int array array;
+ hash_functions : int array }
-(* utilizes RNG *)
-(* arguments are to be interpeted as thus: the error in answering
- * a query is within a factor of epsilon with probability delta.
- * You get more accurate results for small epsilon and large delta.
- * Of course, don't set epsilon = 0 or delta = 1: then you're
- * no longer using a probabilistic data structure. *)
-let make epsilon delta =
+let make ~epsilon ~delta =
+ if epsilon <= 0.0 then invalid_arg "Cminsketch.make: epsilon must be greater than 0.0" else
+ if delta >= 1.0 then invalid_arg "Cminsketch.make: delta must be less than 1.0" else
+ if delta <= 0.0 then invalid_arg "Cminsketch.make: delta must be greater than 0.0" else
(* We fudge the width to be a little larger, ensuring that it
* is a power of two for the benefit of our algorithm. This means
* the actual epsilon you will get is smaller than what you
* originally specified. *)
- (* XXX Add error checking for the parameters *)
let m = int_ceil (lg (euler /. epsilon)) in
+ if m < 0 then failwith "Cminsketch.make: internal error, lg_width less than 0" else
let width = 1 lsl m
and depth = int_ceil (log (1. /. delta)) in
- CMinSketch
- (m,
- Array.make_matrix depth width 0,
- Array.init depth (fun _ -> random_odd_int ()))
+ if width <= 0 then failwith "Cminsketch.make: internal error, width less than 1" else
+ if depth <= 0 then failwith "Cminsketch.make: internal error, depth less than 1" else
+ { lg_width = m;
+ count = Array.make_matrix depth width 0;
+ hash_functions = Array.init depth (fun _ -> random_odd_int ());
+ }
-let update (CMinSketch (m, sketch, hfs)) ix c =
- Array.iteri (fun i a -> step_matrix sketch i (multiply_shift m a ix) c) hfs
+let epsilon s = euler /. (float_of_int (1 lsl s.lg_width))
+let delta s = 1. /. exp (float_of_int (Array.length s.count))
-let query (CMinSketch (m, sketch, hfs)) ix =
+let update s ~ix ~c =
+ Array.iteri (fun i a -> step_matrix s.count i (multiply_shift ~m:s.lg_width ~a ~x:ix) c) s.hash_functions
+
+let query s ~ix =
(* No fusion :-( so this generates an intermediate data structure
* that is immediately discarded. We could probably write a
* minimum_mapi function but meh *)
- minimum (Array.mapi (fun i a -> sketch.(i).(multiply_shift m a ix)) hfs)
-
+ minimum (Array.mapi (fun i a -> s.count.(i).(multiply_shift ~m:s.lg_width ~a ~x:ix)) s.hash_functions)
(* To implement: *)
-(* range_query, inner_product_query *)
+(* range_query - needs customized array of sketches *)
+(* inner_product_query *)
(* phi_quantiles, heavy_hitters *)
let () =
- let x = make 0.998 0.002 in
+ let x = make 2.0 0.9 in
update x 3 4;
update x 3 2;
- update x 24435 1;
+ update x 24435 5;
update x 2323434 1;
update x 223434 1;
- print_int (query x 223434);
+ print_int (query x 234);
+ print_string "\n";
+ print_float (epsilon x);
+ print_string "\n";
+ print_float (delta x);
print_string "\n";
()
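
The comment on `multiply_shift` says the family is only known to be weakly universal. A scaled-down version of it can be brute-forced to see what that means in practice. This is a sketch under stated assumptions: it uses an 8-bit word (`w = 8`) instead of OCaml's native int width, masks explicitly instead of relying on overflow, and the names `w`, `mask`, `h` and `bound` are hypothetical. The bound tested is the collision probability 2/2^m commonly quoted for this family.

```ocaml
(* Scaled-down multiply shift: w-bit words, m output bits. *)
let w = 8
let m = 3
let mask = (1 lsl w) - 1

(* h_a(x) = ((a * x) mod 2^w) shifted right by (w - m) bits. *)
let h a x = ((a * x) land mask) lsr (w - m)

(* For every pair x < y, count how many odd multipliers a collide,
   and remember the worst pair. *)
let worst_case_collisions =
  let worst = ref 0 in
  for x = 0 to mask do
    for y = x + 1 to mask do
      let c = ref 0 in
      let a = ref 1 in
      while !a <= mask do
        if h !a x = h !a y then incr c;
        a := !a + 2              (* step through odd multipliers only *)
      done;
      if !c > !worst then worst := !c
    done
  done;
  !worst

(* There are 2^(w-1) odd multipliers, so a collision probability of
   2/2^m corresponds to at most 2^(w-m) colliding multipliers. *)
let bound = 1 lsl (w - m)

let () =
  Printf.printf "worst-case collisions: %d (bound %d)\n"
    worst_case_collisions bound
```

Passing this check for one small word size is evidence, not a proof; the general statement comes from the paper cited in the interface documentation.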
75 cminsketch.mli
@@ -0,0 +1,75 @@
+(** This module implements the count-min sketch, a sublinear
+ space, probabilistic data structure
+ invented by Graham Cormode and S. Muthukrishnan, described in
+ "An Improved Data Stream Summary: The Count-Min Sketch and its
+ Applications." It is well suited for summarizing data streams and
+ finding quantiles/frequent items. It has also found novel
+ uses: see "Popularity is Everything" by Schechter, Herley
+ and Mitzenmacher for an approach that uses the count min sketch
+ to protect passwords from statistical guessing attacks.
+
+ This implementation is presently incomplete: it still needs
+ more of the functions described in the original paper,
+ functional and statistical tests, and a serialization
+ format. Future directions include incorporating the saturation
+ protection mechanisms described in "The Eternal Sunshine of the
+ Sketch Data Structure." *)
+
+type sketch
+
+(** Multiply shift, a weak universal hash family that this implementation
+ uses to back its sketches (a more conventional choice is Carter
+ and Wegman's universal hash family). It was presented in
+ "A Reliable Randomized Algorithm for the Closest-Pair Problem"
+ and has the property that for all ints [x] and [y] such that
+ [x != y], and for [h] chosen uniformly at random from [H =
+ {multiply_shift m a | a is odd}], which maps ints to ints of range
+ [0] to [2^m - 1],
+
+ {[ Pr[h(x) = h(y)] <= 2/2^m ]}
+
+ A notable quirk about this hash function is that the size of its
+ output range must be a power of two.
+ *)
+val multiply_shift : m:int -> a:int -> x:int -> int
+
+(** Create a count-min sketch for which the error in answering
+ a query is within a factor of [epsilon] with probability at least
+ [1 - delta]. You get more accurate results for smaller epsilon and
+ delta, but use less memory for larger epsilon and delta.
+ You can only trade accuracy for memory so far: in one direction, if
+ epsilon is sufficiently large ([> 2.72]) the sketch degenerates into
+ a single-valued counter; in the opposite direction, there's no point
+ using a count-min sketch if you're going to demand perfect results.
+
+ More detailed bounds regarding [epsilon] and [delta] can be found
+ in the relevant estimation functions.
+
+ Side effects: Uses the random number generator to generate
+ the hash functions.
+
+ @raise Invalid_argument if [epsilon <= 0.0], [delta >= 1.0] or
+ [delta <= 0.0] *)
+val make : epsilon:float -> delta:float -> sketch
+
+(** Returns the true error factor for a sketch. *)
+val epsilon : sketch -> float
+
+(** Returns the true error probability for a sketch. *)
+val delta : sketch -> float
+
+(** Updates a sketch adding [c] to the field [ix]. *)
+val update : sketch -> ix:int -> c:int -> unit
+
+(** Estimates the count of the field [ix].
+
+ If all actual counts are non-negative, this estimate is never
+ less than the true value and, with probability of at least
+ [1 - delta], the overestimation is no greater than [epsilon * |a|1],
+ where [|a|1] denotes the L1 (taxicab) norm on the actual vector
+ [a] (i.e. the sum of all updates done to all keys in the sketch.)
+
+ If some actual counts are negative, with probability of at least
+ [1 - delta^(1/4)], the estimate falls within [3 * epsilon * |a|1] of
+ the true value. *)
+val query : sketch -> ix:int -> int
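
A usage sketch of the interface above. So that it runs on its own, it carries a condensed stand-in for the module rather than the commit's exact code: argument validation is omitted, `random_odd_int` is a simplified assumption (the commit's version is only partly visible in the diff), and `query` folds the minimum directly instead of building an intermediate array.

```ocaml
(* Condensed stand-in for Cminsketch; see the caveats above. *)
let int_size = Sys.word_size - 1
let multiply_shift ~m ~a ~x = (a * x) lsr (int_size - m)

(* Assumption: simplified generator of random odd multipliers. *)
let random_odd_int () =
  if Sys.word_size = 32 then Random.bits () * 2 + 1
  else (Random.bits () lor (Random.bits () lsl 30)
        lor (Random.bits () lsl 60)) * 2 + 1

type sketch = { lg_width : int;
                count : int array array;
                hash_functions : int array }

let make ~epsilon ~delta =
  let lg x = log x /. log 2. in
  (* Width is rounded up to a power of two, depth to an integer. *)
  let m = int_of_float (ceil (lg (exp 1. /. epsilon))) in
  let depth = int_of_float (ceil (log (1. /. delta))) in
  { lg_width = m;
    count = Array.make_matrix depth (1 lsl m) 0;
    hash_functions = Array.init depth (fun _ -> random_odd_int ()) }

(* Effective parameters after rounding. *)
let epsilon s = exp 1. /. float_of_int (1 lsl s.lg_width)
let delta s = exp (-. float_of_int (Array.length s.count))

let update s ~ix ~c =
  Array.iteri (fun i a ->
      let j = multiply_shift ~m:s.lg_width ~a ~x:ix in
      s.count.(i).(j) <- s.count.(i).(j) + c)
    s.hash_functions

let query s ~ix =
  let best = ref max_int in
  Array.iteri (fun i a ->
      let j = multiply_shift ~m:s.lg_width ~a ~x:ix in
      best := min !best s.count.(i).(j))
    s.hash_functions;
  !best

(* Ask for 1% error with 1% failure probability. *)
let s = make ~epsilon:0.01 ~delta:0.01

let () =
  update s ~ix:3 ~c:4;
  update s ~ix:3 ~c:2;
  update s ~ix:24435 ~c:5;
  Printf.printf "rows = %d, columns = %d\n"
    (Array.length s.count) (Array.length s.count.(0));
  Printf.printf "effective epsilon = %g, effective delta = %g\n"
    (epsilon s) (delta s);
  Printf.printf "estimate for 3 = %d (true count 6)\n" (query s ~ix:3)
```

Because every update here is non-negative, each estimate is at least the true count and at most the sum of all updates (11), whichever hash functions are drawn. The effective `epsilon` and `delta` come out no larger than the requested values because width and depth are rounded up.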