Distances > 0.05 but < 1 are unreliable? #42
Comments
Hi! I recently encountered a similar issue. When I built a tree from the distance matrix and looked at how the different bacterial strains clustered, some species were mixed together.
Mihkel
This is rather interesting. For the paper, our end result was measured in terms of Jaccard index accuracy, whereas the Mash distance is a log transform applied downstream. Part of the issue could be sketch size (though -S20 is getting to be comparable to the genome sizes): Mash defaults to 4Kb sketches and dashing defaults to 1Kb. (The equivalent setting would be -S/--sketch-size 12.)

From the paper, it seems that b-bit minhash is marginally more accurate at low Jaccard index but less accurate at higher ones. For low JI (larger distances), I might instead try --use-bb-minhash/-8, which is still about as fast and accurate. (You can tune the number of bits with -B/--bbits.)

As a comment, -J is a calculation method for HLLs, so it isn't used when --use-range-minhash is active; whether it applies depends on your use case. For HLLs, using -J is more accurate than omitting it, though at a runtime penalty.

I agree that --full-mash-dist is likely worth doing, as it removes a layer of approximation in the calculation. It only makes a difference for small Jaccards, but that seems to be important here.
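For reference, the log transform mentioned above is the Mash distance, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index and k is the k-mer length. The snippet below is a minimal sketch of that formula, not dashing's actual code; the function names are hypothetical and k = 31 is an assumed value chosen only for illustration. It shows how strongly the transform reacts to small Jaccard values: an estimate of exactly zero maps to a distance of 1, while any small positive estimate maps to a distance well below 1. The `sketch_bytes` helper assumes that -S is the log2 of the sketch size in bytes, which is what the 12 to 4Kb equivalence in the comment suggests.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance from a Jaccard estimate: D = -(1/k) * ln(2j / (1 + j))."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers detected: distance is capped at 1
    if jaccard >= 1.0:
        return 0.0  # identical k-mer sets
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(d, 1.0)


def sketch_bytes(log2_size: int) -> int:
    """Sketch size in bytes, ASSUMING -S is the log2 of the size in bytes."""
    return 2 ** log2_size


K = 31  # assumed k-mer length, for illustration only

for j in (0.0, 1e-4, 1e-3, 1e-2, 0.5, 1.0):
    print(f"J = {j:<7g} D = {mash_distance(j, K):.4f}")

for s in (10, 12, 20):
    print(f"-S {s}: {sketch_bytes(s)} bytes per sketch")
```

If the Jaccard estimate for an unrelated pair is slightly above zero because of estimator noise, the resulting distance falls well below 1, which may be one way pairs end up in the range discussed in this issue.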
Hi again,
I've been using dashing as a prefilter for genome dereplication, since it is much faster than FastANI. I'd previously been using mash for this. I've noticed that some genome pairs are given distances between 0.05 and 0.10 that seem to be spurious. For instance, here's mash distance vs. dashing distance calculated with -M:

I tested 10 randomly chosen genome pairs from that top stripe where mash=1 and dashing<1, and none seemed to be closely related, so it doesn't seem that dashing is simply producing better estimates. The issue does seem to be reasonably widespread, at least in this dataset: dashing gives 49% of genome pairs a distance < 1, whereas mash gives 4%.
Is this a known issue? Am I not using dashing correctly? Is there some way I can detect these cases?
Thanks, ben.
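The comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: both tools' pairwise distances are assumed to have already been parsed into dictionaries keyed by genome pair, the function names are hypothetical, and the distance values are toy numbers, not results from the dataset in this issue.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def fraction_below(dist: Dict[Pair, float], cap: float = 1.0) -> float:
    """Fraction of genome pairs whose distance is strictly below `cap`."""
    if not dist:
        return 0.0
    return sum(d < cap for d in dist.values()) / len(dist)


def discordant_pairs(
    mash: Dict[Pair, float], dashing: Dict[Pair, float], cap: float = 1.0
) -> List[Pair]:
    """Pairs in the 'top stripe': mash reports `cap`, dashing reports less."""
    shared = mash.keys() & dashing.keys()
    return sorted(p for p in shared if mash[p] >= cap and dashing[p] < cap)


# Toy values for illustration only.
mash = {("A", "B"): 0.02, ("A", "C"): 1.0, ("B", "C"): 1.0}
dashing = {("A", "B"): 0.021, ("A", "C"): 0.08, ("B", "C"): 1.0}

print(discordant_pairs(mash, dashing))
print(fraction_below(mash), fraction_below(dashing))
```

Pairs returned by `discordant_pairs` are the candidates to check with a slower, more exact method such as FastANI.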