Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential speed up #2

Closed
jkbonfield opened this issue Aug 25, 2017 · 6 comments
Closed

Potential speed up #2

jkbonfield opened this issue Aug 25, 2017 · 6 comments

Comments

@jkbonfield
Copy link

Nice work btw.

You don't need to decode things like quality values, auxiliary tags and read names. These are all intermingled together in BAM, but are (usually) in separate blocks for CRAM. This means there can be potential speed gains to selectively decoding only the bits you need.

I don't know anything about the Nimble API to htslib, but see for example samtools flagstat which is much faster on CRAM than BAM due to this:

https://github.com/samtools/samtools/blob/10b403702571299cb9a85333935b6d762badbb88/bam_stat.c#L142-L146

if (hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS,
                SAM_FLAG | SAM_MAPQ | SAM_RNEXT)) {
    fprintf(stderr, "Failed to set CRAM_OPT_REQUIRED_FIELDS value\n");
    return 1;
}
@brentp
Copy link
Owner

brentp commented Aug 26, 2017

that is great to know, for this and in general. thanks!

I'll expose it in the hts-nim API. Can this be run any time or does it have to be when the htsFile is first opened? e.g. can I do (with possible API):

var cram = hts.open("some.cram")
for r in cram.query("1", 233, 333):
   # do stuff

cram.set_option(INCLUDE_FIELDS, SAM_FLAG or SAM_MAPQ or SAM_RNEXT)
cram.set_option(DECODE_MD, 0)

for r in cram.query("1", 444, 555):
   # do stuff faster with only flag, mapq, rnext 

brentp added a commit to brentp/hts-nim that referenced this issue Aug 26, 2017
brentp added a commit to brentp/hts-nim that referenced this issue Aug 26, 2017
brentp added a commit to brentp/hts-nim that referenced this issue Aug 26, 2017
@brentp
Copy link
Owner

brentp commented Aug 28, 2017

this is implented. Didn't get too much of a speedup as I had to use SAM_AUX, (not sure why). I'll debug further later, but it is exposed in hts-nim.

@jkbonfield
Copy link
Author

Hmm, not sure on the need for aux tags. Maybe something related to NM and MD. Odd though.

As for your question; I wouldn't trust widening of fields to decode part way through reading a stream. Upon starting each new slice it looks at the settings to figure out what to decompress and to figure out the dependencies between data types (theoretically this could differ each slice depending on data layout). I don't know if will spot the need to rescan and decompress more blocks if the fields are changed mid-slice, but I doubt it.

@brentp
Copy link
Owner

brentp commented Oct 19, 2017

I actually have this implemented. It results in a 2X!! speed improvement for CRAM. It required checking that the return from sam_itr_next is >= 0 whereas I previously had > 0.

It does result in some issues in my test crams. I'm trying to determine if it's the crams that are problematic or if it's something in htslib. It only occurs when I stop decoding SAM_QUAL and SAM_AUX.

@brentp
Copy link
Owner

brentp commented Oct 19, 2017

this only happens when the reference or REF_PATH is not specified so I have enforced that.
thanks for the idea!

@brentp brentp closed this as completed Oct 19, 2017
@jkbonfield
Copy link
Author

No problem - glad it was of use. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants