less accurate estimator compare to MASH/sourmash/ANI #97

Open
jianshu93 opened this issue Apr 15, 2023 · 1 comment

Comments


Hello Daniel,

I am attaching a real-world genome from the global Tara Oceans metagenomic study. I compared it against all GTDB representative genomes (https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/genomic_files_reps/gtdb_genomes_reps_r207.tar.gz) to find the top 20 best matches in terms of ANI, computed with OrthoANI (https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.000760). Both Mash and sourmash perform well: they normally recover 16 to 17 of the top 20 ANI best hits. However, Dashing (with both the default MLE estimator and JMLE) does very badly when ANI is below 80%: only 9 of the 20 are found (the top 10 are fine). This means that for less similar genomes, Dashing is much worse than Mash or sourmash, both of which use MinHash rather than HyperLogLog. I was under the impression that the Jaccard index estimated by HLL should be as good as the one from MinHash.

These are the commands used:
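To make the "N of the top 20 found" comparison concrete, here is a minimal sketch of how such an overlap could be computed. The function name `top_k_recall` and the genome IDs are hypothetical and only for illustration; the real rankings would come from the OrthoANI results and each tool's output.

```python
def top_k_recall(reference_hits, tool_hits, k=20):
    """Fraction of the reference top-k hits that also appear in the tool's top-k.

    Both arguments are lists of genome IDs ordered from best to worst match.
    """
    ref_top = set(reference_hits[:k])
    tool_top = set(tool_hits[:k])
    if not ref_top:
        raise ValueError("reference_hits must not be empty")
    return len(ref_top & tool_top) / len(ref_top)


# Hypothetical rankings: the tool agrees with ANI on the first 10 hits only.
ani_ranked = [f"g{i}" for i in range(1, 31)]
tool_ranked = ani_ranked[:10] + [f"g{i}" for i in range(21, 31)]

print(top_k_recall(ani_ranked, tool_ranked, k=20))  # prints 0.5
```

This treats the top-k lists as unordered sets, so it measures only whether a hit is recovered, not whether its rank is correct.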

dashing sketch -k 16 --nthreads 128 -S 14 --ertl-joint-mle --suffix dashing_hll -F name.txt &
dashing sketch -k 16 --nthreads 128 -S 14 --ertl-joint-mle --suffix dashing_hll -F query_name.txt

Then collect all the HLL files from the genome folder and create a list of them.

dashing dist -F ./query_name_dashing_hll_JMLE.txt -Q name_dashing_hll_JMLE.txt --full-tsv --nthreads 128 --presketched -O ./OceanDNA-b42278.dashing.hll.JMLE.gtdb.txt

I am using the same k and sketch size (2^14) in Mash and sourmash. The top 10 are fine; nearly all are found. I also compared with our most recent SetSketch 1 implementation (equivalent to HLL), and our results are consistent with sourmash and Mash. The attached PDF shows the 10th to 20th best hits to the query (OceanDNA-b42278.fa) found by the tools mentioned above, for you to double-check (ignore "top 10" in the table title; it actually covers the 10th to 20th hits). Should I use an even larger sketch size to better approximate ANI? I think not, because the top 10 are already very good, which suggests the sketch size is sufficient. Dashing is certainly faster than Mash, so I am wondering what the downside of that speed could be, e.g., lower accuracy for very small Jaccard indices (dissimilar genomes).
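For reference, a Jaccard estimate is commonly turned into an ANI-like value through the Mash distance, D = -(1/k) * ln(2J / (1 + J)), with ANI approximated as 1 - D. The sketch below applies that formula with k=16 as in the commands above. It is an illustration only: the function names are made up here, and whether each tool applies exactly this conversion is an assumption.

```python
import math


def mash_distance(jaccard, k):
    """Mash distance D = -(1/k) * ln(2J / (1 + J)), clamped to [0, 1]."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: treat as maximally distant
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(max(d, 0.0), 1.0)


def approx_ani(jaccard, k):
    """ANI approximated as 1 - Mash distance (an approximation, not OrthoANI)."""
    return 1.0 - mash_distance(jaccard, k)


for j in (0.5, 0.1, 0.02, 0.01):
    print(f"J={j:<5} approx ANI={approx_ani(j, k=16):.4f}")
```

Because the conversion is logarithmic in J, a small absolute error in a Jaccard value near 0.01 shifts the ANI estimate much more than the same error would at high similarity.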

Thanks,

Jianshu

OceanDNA-b42278.fa.zip

Blastn-ANI-dashing-setsketch.pdf


jianshu93 commented Jun 9, 2023

Hello Daniel,

I found that bindash is much more accurate than Dashing for small Jaccard values, such as those around 0.01. Please see the attached results, which add bindash to the comparison using the same query and database genomes mentioned above. As you can see, bindash is the best while Dashing is the worst, so there must be some place where Dashing sacrifices accuracy for speed. A Jaccard index around 0.01 is very important because it corresponds to an ANI of 75% to 78%, where most tools lose accuracy. I do not understand why, as claimed in the paper, a theoretical variance of J*(1-J)/m (bindash) would be larger in practice than 1.074/m (MLE methods) (using m=10000 or so, and assuming inclusion-exclusion is perfect, which is not always true). Assuming the same sketch size, bindash is at least 1000 times more accurate than MLE with the inclusion-exclusion rule. The same holds for SetSketch: it also has much larger variance than bindash because of the nature of the approximation used in SetSketch 1 (b=0.001, m=4096). Can you please explain in more detail why the Jaccard estimate would be more accurate in Dashing than in bindash? @BenLangmead @dnbaker
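To put rough numbers on the two variance figures quoted above, here is a back-of-the-envelope sketch. It is not Dashing's or bindash's actual estimator. It assumes that (a) the MinHash-style estimator has variance J*(1-J)/m, (b) each HLL cardinality estimate (|A|, |B|, and the union) has relative variance c/m with c=1.074, (c) those three errors are independent, which is a crude simplification, and (d) Jaccard is obtained by plain inclusion-exclusion, J = (|A| + |B|)/|A ∪ B| - 1. The function names are invented for this illustration.

```python
import math


def minhash_se(j, m):
    """Standard error of a MinHash-style Jaccard estimate: sqrt(J(1-J)/m)."""
    return math.sqrt(j * (1.0 - j) / m)


def inclusion_exclusion_se(j, m, c=1.074, size_a=1.0, size_b=1.0):
    """First-order (delta-method) standard error of J = (A + B)/U - 1.

    Assumes independent cardinality errors with relative variance c/m each.
    """
    ratio = 1.0 + j  # (A + B)/U equals 1 + J exactly
    rel_var_sum = (c / m) * (size_a**2 + size_b**2) / (size_a + size_b) ** 2
    rel_var_union = c / m
    return ratio * math.sqrt(rel_var_sum + rel_var_union)


m = 2**14
for j in (0.5, 0.1, 0.01):
    mh = minhash_se(j, m)
    ie = inclusion_exclusion_se(j, m)
    print(f"J={j:<5} MinHash SE={mh:.5f}  inclusion-exclusion SE={ie:.5f}")
```

Under these assumptions the inclusion-exclusion error is roughly constant in absolute terms (about 0.01 at m=2^14), so it becomes comparable to J itself once J is near 0.01, while the MinHash-style error shrinks along with J. A joint estimator such as JMLE may behave differently from this simplified model.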

Thanks

Jianshu

bindash_dashing.pdf
