You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The ThetaSketch is mostly an abstract "interface" with common methods. It has no data and is immutable.
The UpdatableThetaSketch extends ThetaSketch and manages a hash table, which is mutable.
The CompactThetaSketch compacts the hash table and optionally sorts it and is immutable. It is the optimal form for storage and transport. The CompactThetaSketch is the simplest Theta sketch as it only has the cache of hash data and the value theta. It does not store the K value, for example, as it doesn't need it. The estimate is simply computed as the size of the cache divided by theta. Depending on how it is derived the size of the cache can be smaller than or larger than K.
The "Direct*" classes allow for off-heap managment of the sketches in both mutable and immutable forms.
The "Empty*" and "Single*" classes are designed for small size (thus speed). Because of the natural power-law distribution of stream sizes prevalent in big data, expect to observe vastly more single item sketches than big ones, thus the importance of optimizing the performance of the small sketches.
The "WrappedCompactCompressedSketch" takes advantage of the fact that sketches created from long streams have lots of leading zeros amongst the hashes, which can be compressed out. This will reduce the size of these sketches by up to 40% or so.
The "Concurrent*" classes were actually an experiment by a scientist from Haifa, where they were able to achieve an order-of-magnitude increase in throughput in processing very long streams with a single sketch (but with lots of threads). This work resulted in a published research paper. I wouldn't bother with it. In general, all of our sketches are single threaded (not thread safe) as concurrency is usually handled at a higher level. It is far simpler to just launch a lot of single threaded sketches in parallel and then merge them at the end. The merge operation is quite fast.
The "QuickSelect" classes are named after the Quickselect algorithm, which is used to find a new, smaller value of theta, once the cache has reached its maximum size. It is basically half of the QuickSort algorithm
I doubt you need all this complexity in Rust, but I'll leave it to you to figure out :)
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
I doubt you need all this complexity in Rust, but I'll leave it to you to figure out :)
Beta Was this translation helpful? Give feedback.
All reactions