Distances > 0.05 but < 1 are unreliable? #42

Open
wwood opened this issue Feb 10, 2020 · 2 comments

Comments

@wwood

wwood commented Feb 10, 2020

Hi again,

I've been using dashing as a prefilter for genome dereplication, since it is much faster than FastANI. I'd previously been using mash for this. I've noticed that some genomes are given distances that are between 0.05 and 0.10, but seem to be spurious. For instance, here's mash distance vs. dashing distance calculated with -M:

[Image: plot of mash distance vs. dashing distance]

I tested 10 randomly chosen genomes from that top stripe where mash=1 and dashing<1, and none seemed to be closely related genomes, so it doesn't seem that dashing is simply producing better estimates. The issue does seem to be reasonably widespread, at least in this dataset: dashing gives 49% of genome pairs a distance < 1, whereas mash gives 4%.
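One way to list such cases is to compare the two distance tables pair by pair. The following is only a sketch: the dictionary layout, the function names, and the toy values are hypothetical, and it assumes both tools' outputs have already been parsed into `{(genome_a, genome_b): distance}` mappings.

```python
def fraction_below(distances, threshold=1.0):
    """Fraction of pairs whose distance is strictly below `threshold`."""
    if not distances:
        return 0.0
    return sum(d < threshold for d in distances.values()) / len(distances)


def discordant_pairs(mash, dashing, threshold=1.0):
    """Pairs that mash puts at >= threshold but dashing puts below it."""
    shared = mash.keys() & dashing.keys()
    return sorted(
        pair for pair in shared
        if mash[pair] >= threshold and dashing[pair] < threshold
    )


# Hypothetical toy values, not taken from the dataset in this issue.
mash = {("g1", "g2"): 0.02, ("g1", "g3"): 1.0,
        ("g2", "g3"): 1.0, ("g1", "g4"): 1.0}
dashing = {("g1", "g2"): 0.021, ("g1", "g3"): 0.07,
           ("g2", "g3"): 1.0, ("g1", "g4"): 0.09}

print(fraction_below(mash), fraction_below(dashing))
print(discordant_pairs(mash, dashing))
```

The discordant pairs are the ones worth re-checking with a slower but more accurate method such as FastANI.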

Is this a known issue? Am I not using dashing correctly? Is there some way I can detect these cases?

Thanks, ben.

@mihkelvaher

Hi!

I just recently encountered a similar issue. When making a tree from the distance matrix and just taking a look at how different bacterial strains were clustered together, some species were mixed.
I got a decent result with these arguments:
--sketch-size 20 -J --use-range-minhash --full-mash-dist
If I remember correctly, -J and --use-range-minhash had the greatest impact.

Mihkel

@dnbaker
Owner

dnbaker commented Feb 10, 2020

This is rather interesting: for the paper, our end result was a measurement of Jaccard index accuracy, whereas the mash distance is a log transform applied downstream of that.
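For reference, the log transform in question is the Mash distance from the Mash paper, D = -(1/k) ln(2j / (1 + j)). Here is a minimal Python sketch of it. The function name, the clamping to 1, and k=21 are illustrative assumptions, not dashing's actual code.

```python
import math


def mash_distance(jaccard, k):
    """Mash distance for a Jaccard index estimate and k-mer length k.

    A Jaccard of 0 would give an infinite distance, so the result is
    clamped to 1.0 here (an assumption made for this sketch).
    """
    if jaccard <= 0.0:
        return 1.0
    return min(1.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


# Even a tiny nonzero Jaccard estimate lands well below 1.
for j in (0.0, 0.001, 0.01, 0.1, 1.0):
    print(f"J={j:<6} D={mash_distance(j, k=21):.4f}")
```

Because the transform is so steep near zero, a small positive error in the Jaccard estimate for two unrelated genomes can move the distance from 1 to a value far below 1, which may be part of what the plot shows.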

Some of the issue could be sketch size (though -S20 is getting to be comparable to the genome sizes): Mash defaults to 4 KB sketches, while dashing defaults to 1 KB. (The equivalent setting would be -S/--sketch-size 12.) From the paper, it seems that b-bit minhash is marginally more accurate at low Jaccard index but less accurate at higher ones. For low JI (larger distances), I might instead try --use-bb-minhash/-8, which is still about as fast and accurate. (You can tune the number of bits with -B/--bbits.)
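The sizes above are consistent with -S being the base-2 logarithm of the sketch size in bytes. That reading is an assumption inferred from the numbers in this comment, and the helper names below are hypothetical. A quick arithmetic check:

```python
def sketch_bytes(log2_size):
    """Sketch size in bytes, assuming -S is log2 of the size in bytes."""
    return 1 << log2_size


def log2_size_for(target_bytes):
    """Smallest -S value whose sketch holds at least `target_bytes` bytes."""
    return max(0, (target_bytes - 1).bit_length())


for s in (10, 12, 20):
    print(f"-S {s}: {sketch_bytes(s):,} bytes")

# -S value needed to match a 4 KB sketch under this assumption
print(log2_size_for(4096))
```

Under that assumption, -S 20 is a 1 MiB sketch, which is why it starts to approach the size of a bacterial genome.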

As a comment, -J is a calculation method for HLLs, so it isn't used when --use-range-minhash is active; which of the two to use depends on your use case. For HLLs, -J is more accurate than the default calculation, though at a runtime penalty.

I'll agree that --full-mash-dist is likely worth doing, as it removes a layer of approximation in the calculation. It only makes a difference for small Jaccards, but that seems to be important.
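As a rough illustration of why this only matters at small Jaccards: the sketch below assumes the approximation in question is the one from the Mash paper, where -(1/k) ln(x) stands in for 1 - x^(1/k) with x = 2j / (1 + j). This has not been checked against dashing's source, and the function names are hypothetical.

```python
import math


def approx_mash_distance(jaccard, k):
    """Log-based Mash distance: -(1/k) * ln(2j / (1 + j))."""
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


def exact_mash_distance(jaccard, k):
    """Distance without the log approximation: 1 - (2j / (1 + j)) ** (1/k)."""
    return 1.0 - (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / k)


# The gap between the two grows as the Jaccard index shrinks.
for j in (0.9, 0.5, 0.1, 0.01, 0.001):
    a = approx_mash_distance(j, 21)
    e = exact_mash_distance(j, 21)
    print(f"J={j:<6} approx={a:.4f} exact={e:.4f} gap={a - e:.4f}")
```

The two forms agree closely for similar genomes and drift apart as the Jaccard index approaches zero, which matches the remark that the option only matters for small Jaccards.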
