
Recent updates for HyperLogLog and MinHash-like algorithms for your book #28

Open
jianshu93 opened this issue Jun 4, 2023 · 4 comments

Comments

@jianshu93

Dear Andrii,

This is Jianshu, a Ph.D. student in CSE & Bioinformatics studying probabilistic data structures and their applications in biological sequence analysis. I read your book on probabilistic data structures and algorithms for big data applications, which is a very detailed and thorough introduction to many probabilistic data structures from the very beginning. I love it, and I think it is the first book on this topic, despite the fact that many data structures and algorithms books cover some of these structures but not all.

Since the book was written around 2019, it covers almost all the cutting-edge algorithms of that time in cardinality estimation and similarity search (MinHash et al.). However, in recent years (including the two or three years just before 2019), there have been many new breakthroughs, such as those related to HyperLogLog and MinHash.

  1. In 2017, Ertl proposed new estimators for HyperLogLog (https://arxiv.org/pdf/1706.07290.pdf): an improved estimator and a maximum likelihood estimator (MLE), which is theoretically optimal (revisited below). The improved estimator is much better at small cardinalities and needs no bias correction like HyperLogLog++, although it still has large variation at very small and very large cardinalities. The MLE has the smallest relative variance both in theory and in practice, but it is expensive to compute because of the need to solve a three-dimensional optimization problem, which does not easily translate into a fast and robust algorithm. The cardinalities of two sets A and B can be used to estimate the Jaccard index via the inclusion-exclusion rule for similarity search (competing with MinHash, which also approximates the Jaccard index); this requires much less memory than MinHash for similar accuracy, but is slower. The paper also proposes a joint maximum likelihood estimator (JMLE), which estimates the Jaccard index from the HyperLogLog sketches of A and B (by first estimating the intersection and union). It is also theoretically optimal and has smaller variance than the inclusion-exclusion approach just mentioned; it is even slower for Jaccard estimation but the most accurate. Ertl's work builds on Daniel Ting's pioneering work (https://dl.acm.org/doi/abs/10.1145/2939672.2939772) and Cohen's work (https://dl.acm.org/doi/abs/10.1145/3097983.3097999).
  2. Seth Pettie's team developed a more general framework called Tau-GRA (generalized remaining area), which subsumes LogLog, PCSA, and HyperLogLog. It reduces the HyperLogLog estimator's excess relative variance over the Cramér-Rao lower bound (1.074/m) roughly tenfold, from 1.079/m to about 1.075/m when τ ≈ 0.88989, and it needs no correction for small cardinalities. The maximum likelihood estimator mentioned above meets the Cramér-Rao lower bound but is very expensive; Tau-GRA is much easier to compute and update than MLE methods, and although not optimal, its variance is very small. The paper is here: https://arxiv.org/abs/2208.10578. This work builds on Daniel Ting's pioneering work here: https://dl.acm.org/doi/abs/10.1145/2623330.2623669.
  3. New MinHash algorithms: SuperMinHash, which improves the MSE of the original MinHash (https://arxiv.org/abs/1706.05698), and ProbMinHash (https://ieeexplore.ieee.org/abstract/document/9185081?casa_token=R8ncquggKvsAAAAA:JU4MPQOraTFHfBWCvufavLjEhIN67N269yu9-MVDr29-6kzE0iJ9h9RK-brFvhnRMVNoAvLQ), which estimates Jp rather than the exact Jaccard index and takes into account set multiplicity/abundance and set size (e.g., several identical elements in a set each have to be counted, or relative frequencies, since it also normalizes by the total set size). BagMinHash (https://dl.acm.org/doi/abs/10.1145/3219819.3220089) and DartMinHash (https://arxiv.org/abs/2005.11547) both take into account the weights of elements in a set (but not set size); these two approximate Jw instead of the simple Jaccard index, and the latter is the fastest weighted MinHash algorithm.
  4. b-bit one-permutation MinHash with optimal densification, the theoretically fastest MinHash algorithm. It is based on work from several other papers but is summarized as a whole in this paper: http://proceedings.mlr.press/v70/shrivastava17a/shrivastava17a.pdf
  5. A new data structure, SetSketch, which combines the benefits of MinHash and HyperLogLog and is both fast and space-efficient. The interesting point of the paper is that MinHash and HLL can actually be merged into one such data structure. Several estimators are also derived; see the paper here: https://vldb.org/pvldb/vol14/p2244-ertl.pdf
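To make the inclusion-exclusion route from item 1 concrete, here is a minimal sketch in Python: a bare-bones HyperLogLog with the raw estimator (no small/large-range corrections, and not Ertl's improved/ML estimators), two sketches merged by register-wise max to estimate the union, and the Jaccard index derived from the three cardinality estimates. All class and variable names are illustrative.

```python
import hashlib

class HLL:
    """Minimal HyperLogLog sketch (raw estimator only, for illustration)."""
    def __init__(self, p=10):
        self.p = p                    # 2^p registers
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        # 64-bit hash: first p bits pick a register, the rest give the rank
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other):
        # union sketch: register-wise maximum
        out = HLL(self.p)
        out.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # alpha_m for m >= 128
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)

a, b = HLL(), HLL()
for i in range(10_000):
    a.add(f"item-{i}")
for i in range(5_000, 15_000):
    b.add(f"item-{i}")

u = a.merge(b)
# inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|
jaccard = (a.estimate() + b.estimate() - u.estimate()) / u.estimate()
# true Jaccard here is 5000 / 15000 ≈ 0.333
```

With m = 1024 registers per sketch, each cardinality estimate carries a few percent of relative error, and those errors compound in the inclusion-exclusion step, which is exactly why the JMLE operating on the raw registers has lower variance.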
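For reference against the newer schemes in items 3 and 4, this is the classic k-hash MinHash that they improve on: each of the k signature slots uses its own hash function (simulated here by keying blake2b with the slot index), costing O(dk) work for d elements, and the fraction of matching slots estimates the Jaccard index. A sketch with illustrative names, not any of the papers' algorithms.

```python
import hashlib

def minhash_signature(items, k=128):
    """Classic MinHash: one keyed hash per signature slot, O(d*k) work."""
    sig = []
    for i in range(k):
        key = i.to_bytes(4, "big")   # distinct key simulates hash function i
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, key=key).digest(),
                "big")
            for x in items))
    return sig

def estimate_jaccard(sig_a, sig_b):
    # P[the two sets share a slot's minimum] = J, so the match rate estimates J
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = {f"x{i}" for i in range(1_000)}
B = {f"x{i}" for i in range(500, 1_500)}
j_hat = estimate_jaccard(minhash_signature(A), minhash_signature(B))
# true Jaccard here is 500 / 1500 ≈ 0.333
```

The estimator's standard error is sqrt(J(1-J)/k); the densification line of work in item 4 keeps this accuracy while replacing the O(dk) hashing loop with a single hash per element.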

I am wondering whether there will be a new edition of the book in the coming years. If so, those topics could be very relevant and interesting to many readers, and I would be happy to be part of it.

Looking forward to your reply.

Thanks,

Jianshu

@gakhov
Owner

gakhov commented Jun 4, 2023 via email

@jianshu93
Author

Hello Andrii,

I would also be happy to read the Chinese version (as you know, I am from China). I am working on implementing those structures in Rust, a high-performance compiled language with memory and thread safety. For real-world large data, performance and parallelization are even more important, e.g., parallelizing HLL on multithreaded machines. Is the Chinese version available now? I cannot find it online.

We have already implemented SetSketch and ProbMinHash/SuperMinHash; next will be BagMinHash or DartMinHash and the generalized remaining area for PCSA and HLL, and perhaps also b-bit one-permutation hashing with optimal densification. These are not just small changes but better estimators with theoretically minimal relative variance, and they should be very interesting in a lot of applications.

I also forgot to mention the new Count-Min sketch estimator from Daniel Ting, also an MLE-based method, with much better relative variance than the original estimator. See here: https://dl.acm.org/doi/abs/10.1145/3219819.3219975
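For context, this is a minimal sketch of the plain Count-Min sketch with the classic min-estimator, i.e. the baseline whose variance Ting's MLE-based method improves; names and parameters are illustrative, and the MLE itself is not implemented here.

```python
import hashlib

class CountMin:
    """Plain Count-Min sketch with the classic min-estimator."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        # one hash function per row, simulated by keying blake2b with the row
        d = hashlib.blake2b(item.encode(), digest_size=8,
                            key=row.to_bytes(4, "big")).digest()
        return int.from_bytes(d, "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._col(r, item)] += count

    def query(self, item):
        # min over rows: never underestimates, overestimates only on collisions
        return min(self.table[r][self._col(r, item)] for r in range(self.depth))

cm = CountMin()
cm.add("heavy", 100)
for i in range(1_000):
    cm.add(f"noise-{i}")
```

The min over rows gives the one-sided error guarantee of the original paper; the MLE-based estimator instead combines the information in all depth counters to reduce variance.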

I think explaining those new ideas on top of your detailed exposition would be excellent for many beginners who want to understand and implement all these cutting-edge algorithms.

Thanks for the quick response; it is greatly appreciated.

Jianshu

@jianshu93
Author

jianshu93 commented Jun 11, 2023

The most important one for MinHash is #4 mentioned above, which reduces the complexity from O(dk) to O(d+k), where d is the number of nonzero elements in the set and k is the number of min-hashes, thanks to the densification idea with a 2-universal hash function. This had never been achieved before, and my implementation and others (e.g., BinDash) show that it is at least 100x faster than the classic MinHash with the same standard error, sqrt(J(1-J)/k). It should be a replacement for the traditional MinHash.
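The one-hash-per-element idea behind the O(d+k) bound can be sketched as follows: each element is hashed once, the hash both picks one of k bins and serves as the value minimized within that bin, and empty bins are filled by deterministically probing other bins. The probing below is a simplified stand-in for the actual optimal densification of the paper, and all names are illustrative.

```python
import hashlib
import random

def oph_signature(items, k=64, seed=12345):
    """One-permutation MinHash: O(d + k) total hashing work."""
    bins = [None] * k
    for x in items:
        h = int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(),
                           "big")
        b, v = h % k, h // k          # bin index and within-bin value
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    # densification: each empty bin copies from an originally non-empty bin,
    # probed identically across sketches so similar sets densify the same way
    filled = [b is not None for b in bins]
    if not any(filled):
        raise ValueError("empty input set")
    for j in range(k):
        if not filled[j]:
            rng = random.Random(seed * 1_000_003 + j)
            while True:
                t = rng.randrange(k)
                if filled[t]:
                    bins[j] = bins[t]
                    break
    return bins

def estimate_jaccard(sa, sb):
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

A = {f"x{i}" for i in range(2_000)}
B = {f"x{i}" for i in range(1_000, 3_000)}
j_hat = estimate_jaccard(oph_signature(A), oph_signature(B))
# true Jaccard here is 1000 / 3000 ≈ 0.333
```

Note the single unkeyed hash per element versus k keyed hashes in the classic scheme; the optimal densification of Shrivastava additionally guarantees the densified estimator keeps the sqrt(J(1-J)/k) standard error, which this simplified probe only approximates.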

Thanks,

Jianshu

@jianshu93
Author

Dear Andrii,

It seems that recent breakthroughs in densification methods for one-permutation MinHash have reached the theoretical limit. In addition to the optimal densification method mentioned above, fast densification (http://proceedings.mlr.press/v115/mai20a.html) and bidirectional densification (https://dl.acm.org/doi/abs/10.1145/3448016.3452833) push further and achieve a better average-case running time, the limit of this method.

Thanks,

Jianshu
