Gzipping all training files results in a nice reduction: add feature that allows scripts/modules to handle this #28

dkoslicki opened this issue Mar 23, 2020

For example, using the Metalign default training database (199807 genomes) and running

python MakeStreamingDNADatabase.py ${trainingFiles} ${outputDir}/${cmashDatabase} -n ${numHashes} -k 60 -v
python MakeStreamingPrefilter.py ${outputDir}/${cmashDatabase} ${outputDir}/${prefilterName} 30-60-10

produces the following uncompressed files:

16G Mar 22 03:39 cmash_db_n1000_k60.h5
9.3G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf
6.9G Mar 22 04:34 cmash_db_n1000_k60.tst

yet, once gzipped:

4.6G Mar 22 03:39 cmash_db_n1000_k60.h5.gz
3.6G Mar 22 08:07 cmash_db_n1000_k60_30-60-10.bf.gz
3.6G Mar 22 04:34 cmash_db_n1000_k60.tst.gz

so roughly 2-3.5x compression, depending on the file.

This would need one of the following:

  • Enable MakeStreamingDNADatabase.py and MakeStreamingPrefilter.py to detect compressed training data and decompress it within the script, or (better yet)
  • Enable decompression in the modules MinHash.py and Query.py themselves (see the sketch after this list).
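A minimal sketch of what the second option could look like. The helper names (`is_gzipped`, `ensure_decompressed`) are hypothetical and not part of CMash's current API; the approach assumes the HDF5 and trie readers need random file access, so a gzipped database is streamed out to a temporary file rather than wrapped in a file-like object:

```python
# Hypothetical sketch: ensure_decompressed() is illustrative only, not an
# existing CMash function.
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

def is_gzipped(path):
    """Detect gzip via the two-byte magic number, not the file extension."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

def ensure_decompressed(path):
    """Return a path to uncompressed data for `path`.

    Readers of the .h5 and .tst files need random access, so gzipped
    input is streamed out to a temporary file first; input that is
    already uncompressed is returned unchanged.
    """
    if not is_gzipped(path):
        return path
    fd, tmp_path = tempfile.mkstemp(suffix=".decompressed")
    with gzip.open(path, "rb") as src, os.fdopen(fd, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return tmp_path
```

The loading routines in MinHash.py and Query.py could then run each database path through a helper like `ensure_decompressed()` before opening it, letting both the scripts and the modules accept .gz inputs transparently, at the cost of temporary disk space roughly equal to the uncompressed size.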