Gzipping all training files results in a nice reduction: add feature that allows scripts/modules to handle this #28

dkoslicki opened this issue Mar 23, 2020

For example, using the Metalign default training database (199807 genomes) and running

python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10

produces the following uncompressed files:

16G Mar 22 03:39 cmash_db_n1000_k60.h5
9.3G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf
6.9G Mar 22 04:34 cmash_db_n1000_k60.tst

yet, once gzipped:

4.6G Mar 22 03:39 cmash_db_n1000_k60.h5.gz
3.6G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf.gz
3.6G Mar 22 04:34 cmash_db_n1000_k60.tst.gz

so roughly 2-3.5x compression, depending on the file.

This would need one of the following:

  • Enable MakeStreamingDNADatabase.py and MakeStreamingPrefilter.py to detect compressed training data and decompress it within the script, or (better yet)
  • Enable decompression in the modules MinHash.py and Query.py themselves (see the sketch after this list).
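A minimal sketch of what the second option could look like. The helper names (`is_gzipped`, `ensure_decompressed`) are hypothetical and not part of CMash's current API; the approach assumes the HDF5 and trie readers need random file access, so a gzipped database is streamed out to a temporary file rather than wrapped in a file-like object:

```python
# Hypothetical sketch: ensure_decompressed() is illustrative only, not an
# existing CMash function.
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

def is_gzipped(path):
    """Detect gzip via the two-byte magic number, not the file extension."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

def ensure_decompressed(path):
    """Return a path to uncompressed data for `path`.

    Readers of the .h5 and .tst files need random access, so gzipped
    input is streamed out to a temporary file first; input that is
    already uncompressed is returned unchanged.
    """
    if not is_gzipped(path):
        return path
    fd, tmp_path = tempfile.mkstemp(suffix=".decompressed")
    with gzip.open(path, "rb") as src, os.fdopen(fd, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return tmp_path
```

The loading routines in MinHash.py and Query.py could then run each database path through a helper like `ensure_decompressed()` before opening it, letting both the scripts and the modules accept .gz inputs transparently, at the cost of temporary disk space roughly equal to the uncompressed size.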