
Recent updates for HyperLogLog and MinHash-like algorithms for your book #28

Open
jianshu93 opened this issue Jun 4, 2023 · 4 comments

Comments

@jianshu93

Dear Andrii,

This is Jianshu, a Ph.D. student in CSE & Bioinformatics studying probabilistic data structures and their applications in biological sequence analysis. I read your book on probabilistic data structures and algorithms for big data applications, which is a very detailed and thorough introduction to many probabilistic data structures from the very beginning. I love it, and I think it is the first book on this topic, despite the fact that many data structures and algorithms books cover some of these structures but not all.

Since the book was written around 2019, it covers almost all the cutting-edge algorithms of that time in cardinality estimation and similarity search (MinHash et al.). However, in recent years (including the two or three years just before 2019), there have been many new breakthroughs, such as those related to HyperLogLog and MinHash.

  1. In 2017, Ertl proposed new estimators for HyperLogLog (https://arxiv.org/pdf/1706.07290.pdf): an improved estimator and a maximum likelihood estimator (MLE), which is theoretically optimal (revisited below). The improved estimator is much better at small cardinalities and needs no bias correction like HyperLogLog++, although it still has large variation at very small and very large cardinalities. The MLE has the smallest relative variance both in theory and in practice, but it is expensive to compute because of the need to solve a three-dimensional optimization problem, which does not easily translate into a fast and robust algorithm. The cardinalities of two sets A and B can be used to estimate the Jaccard index via the inclusion-exclusion rule for similarity search (competing with MinHash, which also approximates the Jaccard index); this requires much less memory than MinHash for similar accuracy, but is slower. The paper also proposes a joint maximum likelihood estimator (JMLE), which estimates the Jaccard index from the HyperLogLog sketches of A and B (by first estimating the intersection and union). It is also theoretically optimal and has smaller variance than the inclusion-exclusion approach just mentioned; it is even slower for Jaccard estimation but the most accurate. Ertl's work builds on Daniel Ting's pioneering work (https://dl.acm.org/doi/abs/10.1145/2939672.2939772) and Cohen's work (https://dl.acm.org/doi/abs/10.1145/3097983.3097999).
  2. Seth Pettie's team developed a more general framework called Tau-GRA (generalized remaining area), which subsumes LogLog, PCSA, and HyperLogLog. It reduces the HyperLogLog estimator's excess relative variance over the Cramér-Rao lower bound (1.074/m) roughly tenfold, from 1.079/m to about 1.075/m when τ ≈ 0.88989, and it needs no correction for small cardinalities. The maximum likelihood estimator mentioned above meets the Cramér-Rao lower bound but is very expensive; Tau-GRA is much easier to compute and update than MLE methods, and although not optimal, its variance is very small. The paper is here: https://arxiv.org/abs/2208.10578. This work builds on Daniel Ting's pioneering work here: https://dl.acm.org/doi/abs/10.1145/2623330.2623669.
  3. New MinHash algorithms: SuperMinHash, which improves the MSE of the original MinHash (https://arxiv.org/abs/1706.05698), and ProbMinHash (https://ieeexplore.ieee.org/abstract/document/9185081?casa_token=R8ncquggKvsAAAAA:JU4MPQOraTFHfBWCvufavLjEhIN67N269yu9-MVDr29-6kzE0iJ9h9RK-brFvhnRMVNoAvLQ), which estimates Jp rather than the exact Jaccard index and takes into account set multiplicity/abundance and set size (e.g., several identical elements in a set each have to be counted, or relative frequencies, since it also normalizes by the total set size). BagMinHash (https://dl.acm.org/doi/abs/10.1145/3219819.3220089) and DartMinHash (https://arxiv.org/abs/2005.11547) both take into account the weights of elements in a set (but not set size); these two approximate Jw instead of the simple Jaccard index, and the latter is the fastest weighted MinHash algorithm.
  4. b-bit one-permutation MinHash with optimal densification, the theoretically fastest MinHash algorithm. It is based on work from several other papers but is summarized as a whole in this paper: http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf
  5. A new data structure, SetSketch, which combines the benefits of MinHash and HyperLogLog and is both fast and space-efficient. The interesting point of the paper is that MinHash and HLL can actually be merged into one such data structure. Several estimators are also derived; see the paper here: https://vldb.org/pvldb/vol14/p2244-ertl.pdf
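To make the inclusion-exclusion route from item 1 concrete, here is a minimal sketch in Python: a bare-bones HyperLogLog with the raw estimator (no small/large-range corrections, and not Ertl's improved/ML estimators), two sketches merged by register-wise max to estimate the union, and the Jaccard index derived from the three cardinality estimates. All class and variable names are illustrative.

```python
import hashlib

class HLL:
    """Minimal HyperLogLog sketch (raw estimator only, for illustration)."""
    def __init__(self, p=10):
        self.p = p                    # 2^p registers
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        # 64-bit hash: first p bits pick a register, the rest give the rank
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other):
        # union sketch: register-wise maximum
        out = HLL(self.p)
        out.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # alpha_m for m >= 128
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)

a, b = HLL(), HLL()
for i in range(10_000):
    a.add(f"item-{i}")
for i in range(5_000, 15_000):
    b.add(f"item-{i}")

u = a.merge(b)
# inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|
jaccard = (a.estimate() + b.estimate() - u.estimate()) / u.estimate()
# true Jaccard here is 5000 / 15000 ≈ 0.333
```

With m = 1024 registers per sketch, each cardinality estimate carries a few percent of relative error, and those errors compound in the inclusion-exclusion step, which is exactly why the JMLE operating on the raw registers has lower variance.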
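For reference against the newer schemes in items 3 and 4, this is the classic k-hash MinHash that they improve on: each of the k signature slots uses its own hash function (simulated here by keying blake2b with the slot index), costing O(dk) work for d elements, and the fraction of matching slots estimates the Jaccard index. A sketch with illustrative names, not any of the papers' algorithms.

```python
import hashlib

def minhash_signature(items, k=128):
    """Classic MinHash: one keyed hash per signature slot, O(d*k) work."""
    sig = []
    for i in range(k):
        key = i.to_bytes(4, "big")   # distinct key simulates hash function i
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, key=key).digest(),
                "big")
            for x in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # P[the two sets share a slot's minimum] = J, so the match rate estimates J
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = {f"x{i}" for i in range(1_000)}
B = {f"x{i}" for i in range(500, 1_500)}
j_hat = estimate_jaccard(minhash_signature(A), minhash_signature(B))
# true Jaccard here is 500 / 1500 ≈ 0.333
```

The estimator's standard error is sqrt(J(1-J)/k); the densification line of work in item 4 keeps this accuracy while replacing the O(dk) hashing loop with a single hash per element.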

I am wondering whether there will be a new edition of the book in the coming years. If so, those topics could be very relevant and interesting to many readers, and I would be happy to be part of it.

Looking forward to your reply.

Thanks,

Jianshu

@gakhov
Owner

gakhov commented Jun 4, 2023 via email

@jianshu93
Author

Hello Andrii,

I would also be happy to read the Chinese version (as you know, I am from China). I am working on implementing those structures in Rust, a high-performance compiled language with memory and thread safety. For real-world large data, performance and parallelization are even more important, e.g., parallelizing HLL on multithreaded machines. Is the Chinese version available now? I cannot find it online.

We have already implemented SetSketch and ProbMinHash/SuperMinHash; next will be BagMinHash or DartMinHash and the generalized remaining area for PCSA and HLL, and perhaps also b-bit one-permutation hashing with optimal densification. These are not just small changes but better estimators with theoretically minimal relative variance, and they should be very interesting in a lot of applications.

I also forgot to mention the new Count-Min sketch estimator from Daniel Ting, also an MLE-based method, with much better relative variance than the original estimator. See here: https://dl.acm.org/doi/abs/10.1145/3219819.3219975
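For context, this is a minimal sketch of the plain Count-Min sketch with the classic min-estimator, i.e. the baseline whose variance Ting's MLE-based method improves; names and parameters are illustrative, and the MLE itself is not implemented here.

```python
import hashlib

class CountMin:
    """Plain Count-Min sketch with the classic min-estimator."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        # one hash function per row, simulated by keying blake2b with the row
        d = hashlib.blake2b(item.encode(), digest_size=8,
                            key=row.to_bytes(4, "big")).digest()
        return int.from_bytes(d, "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._col(r, item)] += count

    def query(self, item):
        # min over rows: never underestimates, overestimates only on collisions
        return min(self.table[r][self._col(r, item)] for r in range(self.depth))

cm = CountMin()
cm.add("heavy", 100)
for i in range(1_000):
    cm.add(f"noise-{i}")
```

The min over rows gives the one-sided error guarantee of the original paper; the MLE-based estimator instead combines the information in all depth counters to reduce variance.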

I think explaining those new ideas on top of your detailed exposition would be excellent for many beginners who want to understand and implement all these cutting-edge algorithms.

Thanks for the quick response; it is greatly appreciated.

Jianshu

@jianshu93
Author

jianshu93 commented Jun 11, 2023

The most important one for MinHash is #4 mentioned above, which reduces the complexity from O(dk) to O(d+k), where d is the number of nonzero elements in the set and k is the number of min-hashes, thanks to the densification idea with a 2-universal hash function. This had never been achieved before, and my implementation and others (e.g., BinDash) show that it is at least 100x faster than the classic MinHash with the same standard error, sqrt(J(1-J)/k). It should be a replacement for the traditional MinHash.
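The one-hash-per-element idea behind the O(d+k) bound can be sketched as follows: each element is hashed once, the hash both picks one of k bins and serves as the value minimized within that bin, and empty bins are filled by deterministically probing other bins. The probing below is a simplified stand-in for the actual optimal densification of the paper, and all names are illustrative.

```python
import hashlib
import random

def oph_signature(items, k=64, seed=12345):
    """One-permutation MinHash: O(d + k) total hashing work."""
    bins = [None] * k
    for x in items:
        h = int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(),
                           "big")
        b, v = h % k, h // k          # bin index and within-bin value
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    # densification: each empty bin copies from an originally non-empty bin,
    # probed identically across sketches so similar sets densify the same way
    filled = [b is not None for b in bins]
    if not any(filled):
        raise ValueError("empty input set")
    for j in range(k):
        if not filled[j]:
            rng = random.Random(seed * 1_000_003 + j)
            while True:
                t = rng.randrange(k)
                if filled[t]:
                    bins[j] = bins[t]
                    break
    return bins

def estimate_jaccard(sa, sb):
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

A = {f"x{i}" for i in range(2_000)}
B = {f"x{i}" for i in range(1_000, 3_000)}
j_hat = estimate_jaccard(oph_signature(A), oph_signature(B))
# true Jaccard here is 1000 / 3000 ≈ 0.333
```

Note the single unkeyed hash per element versus k keyed hashes in the classic scheme; the optimal densification of Shrivastava additionally guarantees the densified estimator keeps the sqrt(J(1-J)/k) standard error, which this simplified probe only approximates.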

Thanks,

Jianshu

@jianshu93
Author

Dear Andrii,

It seems that recent breakthroughs in densification methods for one-permutation MinHash have reached the theoretical limit. In addition to the optimal densification method mentioned above, fast densification (http://proceedings.mlr.press/v115/mai20a.html) and bidirectional densification (https://dl.acm.org/doi/abs/10.1145/3448016.3452833) push further and achieve a better average-case running time, the limit of this method.

Thanks,

Jianshu
