-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is CMash metric? #32
Comments
Hi Jianshu, The Jaccard index is symmetric: J(A, B) = J(B, A). However, I don't think the Jaccard index fits the 2nd rule. Here is a counter-example:
For the containment index we estimated in CMash, it's NOT symmetric by definition: The Jaccard index can be used for distance measurement. Please note that the Jaccard index J(A, B) and the containment index C(A, B) or C(B, A) are interchangeable if we know the set size of A and B. |
Hello Shaopeng, Yes Jaccard index itself is not metric, sorry for the bad description, Jaccard distance, or 1-Jaccard index, is a proper metric. So my question would be will 1-CMash distance/index a metric? Again still the mentioned 2 rules considered. According to you description, it is not symmetric, therefore not metric. Just to confirm. With the metric property, there many other aspects, e.g., nearest neighbor search algorithms, can be applied/combined with the Minhash like algorithms for even faster search and retrieval of similar information in genomic database. Thanks, Jianshu |
Hello Shaopeng, A quick question on how CMash estimate the intersection of kmer in two genomes A and B, especially for large number of genome pairs. Also the number of unique kmers (cardinality) in set A and B is calculated for large number of genomes. Yes I agree that when set size A and B differ by quite a lot, using C(A,B) to calculated J(A,B) is more robust than classic MinHash, but the estimation of intersection of A and B, or cardinality of A and B based on sketching algorithms is never easy for large scale applications. New HyperLogLog estimators from Ertl 2017 (especially the MLE methods), is thought to be the most accurate estimator of cardinality and also intersections or unions. What is the RMSE of JS based on JC, assuming the Flajolet 2007 hyperloglog is used, compare to MinHash, especially for small Jaccard? MinHash is sqrt(J(1-J)/m), while cardinality estimator (Flajolet HLL) itself is already sqrt(1.079/m), assuming using the same registers m when hashing. For small Jaccard like 0.01, MinHash is 100 times more accurate assuming intersection is perfect (which is not and even worse if use cardinality from Flajolet HLL of A and B for intersection). MinHash is very popular because its very small variance, this is important because small jaccard like 0.01 or 0.05 correspond to, 79% to 85% mutation rate (ANI, see Mash distance equation), among which, JS from JC will be significantly less accurate than Minhash. At the same time, we compare genomes only when they are complete, meaning they do not differ by size more than 10% (based on the quality of genomes), otherwise the genome is not trusted. Thanks, Jianshu |
Hi Jianshu, Sorry for my late response. To your 1st post:
To your 2nd post: The way to estimate intersection is similar to Fig 2 in the Mash Screen paper. It's called "streaming method". Here is a brief workflow:
Shaopeng |
Dear CMash team,
I am wondering the CMash is a metric, meaning, is CMash (A, B) the same with CMash (B, A), or CMash (A, B) + CMash (B, C) >= CMash (A, C), that is the last 2 rules of metric space: https://en.wikipedia.org/wiki/Metric_space. Since MinHash is unbiased estimation of Jaccard index (a metric), it is a metric. But I am not sure whether CMash estimated mutation rate is also a metric distance.
Many Thanks,
Jianshu
The text was updated successfully, but these errors were encountered: