Update docs
leerho committed May 3, 2022
1 parent 43c5f6e commit 549503ff5330c8095d842f7356065d97e87e2d61
Showing 2 changed files with 6 additions and 6 deletions.
@@ -24,17 +24,17 @@ layout: doc_page

There are lots of clever and useful algorithms that are sometimes called "sketches". However, due to limited resources, in order to be included in the DataSketches library, we had to clearly define what we meant by the term "sketch". Otherwise, we would end up with a hodgepodge of algorithms and have to answer the question: "Why don't we include algorithm X?"

-In order to be in our library, a *Sketch* must exhibit these properties:
+In order to be in our library, a *Sketch* should exhibit these properties:

## Streaming / One-Touch
-Sketches are a class of streaming algorithms by definition, which means they only touch or process each item in a stream once. This is absolutely required for real-time applications.
+Sketches are a class of streaming algorithms by definition, which means they only touch or process each item in a stream once. This is absolutely essential for real-time applications.
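
As a rough illustration of the one-touch property, here is a minimal K-Minimum-Values (KMV) distinct-count sketch in plain Java. It is not the library's implementation: the class name `KmvSketch`, the parameter `k`, and the SplitMix64-style hash mixer are assumptions made for this example. Each item is hashed and examined exactly once in `update`, and the retained state never exceeds `k` hash values no matter how long the stream is.

```java
import java.util.TreeSet;

// Hypothetical example class, not part of org.apache.datasketches.
final class KmvSketch {
    private final int k;                                  // max number of retained hashes
    private final TreeSet<Long> minHashes = new TreeSet<>();

    KmvSketch(int k) {
        if (k < 2) { throw new IllegalArgumentException("k must be >= 2"); }
        this.k = k;
    }

    // SplitMix64-style mixer; the result is shifted to a non-negative long.
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        x = x ^ (x >>> 31);
        return x >>> 1;
    }

    // One touch: the item is hashed, compared, and then forgotten.
    void update(long item) {
        long h = hash(item);
        if (minHashes.size() < k) {
            minHashes.add(h);                             // duplicates are ignored by the set
            return;
        }
        // Keep h only if it is new and smaller than the current k-th smallest hash.
        if (h < minHashes.last() && minHashes.add(h)) {
            minHashes.pollLast();
        }
    }

    // Exact count while fewer than k hashes are retained, otherwise (k - 1) / (k-th smallest hash in [0, 1)).
    double estimate() {
        if (minHashes.size() < k) { return minHashes.size(); }
        double kthFraction = minHashes.last() / 0x1p63;   // normalize to [0, 1)
        return (k - 1) / kthFraction;
    }

    int retained() { return minHashes.size(); }
}
```

Feeding the same items a second time leaves the state unchanged, so the sketch never needs to revisit earlier parts of the stream.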

## Small in Size
One of the key properties of any sketch is that it is a synopsis or summary of a much larger data set. The whole point of a small summary is that it is faster to read and merge. In this context, *small* means small with respect to the original data. If the original data is terabytes in size, a single sketch of 100KB may not seem very different from a sketch of 50KB as both are very small compared to the original data.

But *small* can also be important in a systems context. If that original terabyte of data generates 10,000 sketches, each sketch consuming 100KB, that amounts to a GB of storage. Now the total memory use starts to be a concern. Being able to reduce that by 50% by using a smaller (and otherwise equivalent) sketch can be a big deal.

-Nonetheless, *small* is relevant to the specific application. Sketches can vary from a few bytes to many megabytes depending on the specific sketch and how it has been configured. Whether it is small enough is up to the system engineers to determine.
+Nonetheless, *small* is relevant to the specific application. Sketches can vary from a few bytes to many megabytes depending on the specific sketch and how it has been configured. Whether it is small enough depends on the use-case and the specific environment.

## Sublinear in Size Growth
Not only should a sketch start small, it needs to stay small as the size of the input stream grows. Some sketches have an upper bound of size independent of the size of the input stream, which clearly makes them sublinear. Other sketches may need to continue to increase their size as the stream grows. For these sketches it is important that they do so very slowly. They should grow sublinearly by no more than *O(log(n))* or preferably by *O(k log(n/k))* or less.
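
To get a feel for how slowly these bounds grow, the following small Java helper evaluates the two expressions above for a given stream length `n`. It is only a numeric illustration: the class name `GrowthBounds` is hypothetical, base-2 logarithms are an arbitrary choice, and the constant factors hidden by the *O(·)* notation are ignored.

```java
// Hypothetical helper, not part of org.apache.datasketches.
final class GrowthBounds {
    private static final double LN2 = Math.log(2.0);

    // log2(n): the O(log(n)) bound with its constant factor dropped.
    static double logBound(long n) {
        return Math.log((double) n) / LN2;
    }

    // k * log2(n / k): the O(k log(n/k)) bound with its constant factor dropped.
    static double kLogBound(long n, int k) {
        return k * (Math.log((double) n / k) / LN2);
    }
}
```

For example, with an assumed `k = 128`, growing the stream from 2^20 to 2^30 items (a factor of 1024) raises `kLogBound` from about 1664 to about 2944, which is less than a factor of two.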
@@ -42,11 +42,11 @@ org.apache.datasketches.frequencies | Frequent Item Sketches, for both longs and
org.apache.datasketches.hash | The 128-bit MurmurHash3 and adaptors
org.apache.datasketches.hll | Unique counting HLL sketches for both heap and off-heap.
org.apache.datasketches.hllmap | The (HLL) Unique Count Map Sketch
-org.apache.datasketches.kll | Quantiles sketch with better accuracy per size than the standard quantiles sketch. Includes PMF, CDF funtions, for floats. Only on-heap.
-org.apache.datasketches.quantiles | Standard Quantiles sketch, plus PMF and CDF functions, for doubles and generics and for heap and off-heap.
+org.apache.datasketches.kll | Quantiles sketch with better accuracy per size than the standard quantiles sketch. Includes PMF, CDF functions, for floats, doubles. On-heap & off-heap.
+org.apache.datasketches.quantiles | Standard Quantiles sketch, plus PMF and CDF functions, for doubles and generics. On-heap & off-heap.
org.apache.datasketches.req | Relative Error Quantiles (REQ) sketch, plus PMF and CDF functions for floats, on-heap. Extremely high accuracy for very high ranks (e.g., 99.999%ile), or very low ranks (e.g., .00001%ile).
org.apache.datasketches.sampling | Weighted and uniform reservoir sampling with generics
-org.apache.datasketches.theta | Unique counting Theta Sketches for both heap and off-heap
+org.apache.datasketches.theta | Unique counting Theta Sketches for both on-heap & off-heap
org.apache.datasketches.tuple | Tuple sketches for both primitives and generics
org.apache.datasketches.tuple.adouble | A Tuple sketch with a Summary of a single double
org.apache.datasketches.tuple.aninteger | A Tuple sketch with a Summary of a single integer
