Potential speed up #2
Comments
That is great to know, for this and in general. Thanks! I'll expose it in the hts-nim API. Can this be run at any time, or does it have to be set when the htsFile is first opened? E.g. can I do (with a possible API):

```nim
var cram = hts.open("some.cram")
for r in cram.query("1", 233, 333):
  # do stuff
cram.set_option(INCLUDE_FIELDS, SAM_FLAG or SAM_MAPQ or SAM_RNEXT)
cram.set_option(DECODE_MD, 0)
for r in cram.query("1", 444, 555):
  # do stuff faster with only flag, mapq, rnext
```
This is implemented. I didn't get much of a speedup, as I also had to request SAM_AUX (not sure why). I'll debug further later, but it is exposed in hts-nim.
Hmm, not sure about the need for aux tags. Maybe something related to NM and MD. Odd, though. As for your question: I wouldn't trust widening the set of fields to decode part way through reading a stream. Upon starting each new slice, the CRAM decoder looks at the settings to figure out what to decompress and to work out the dependencies between data types (theoretically this could differ per slice, depending on data layout). I don't know if it will spot the need to rescan and decompress more blocks if the fields are changed mid-slice, but I doubt it.
I actually have this implemented. It results in a 2X (!) speed improvement for CRAM. It required checking that the return from […]. It does cause some issues in my test CRAMs, though; I'm trying to determine whether it's the CRAMs that are problematic or whether it's something in htslib. It only occurs when I stop decoding […]
This only happens when the reference or REF_PATH is not specified, so I have enforced that.
No problem - glad it was of use. :-)
Nice work btw.
You don't need to decode things like quality values, auxiliary tags and read names. These are all intermingled together in BAM, but are (usually) in separate blocks for CRAM. This means there are potential speed gains from selectively decoding only the bits you need.
I don't know anything about the Nimble API to htslib, but see for example `samtools flagstat`, which is much faster on CRAM than BAM due to this:
https://github.com/samtools/samtools/blob/10b403702571299cb9a85333935b6d762badbb88/bam_stat.c#L142-L146