Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRAM support #619

Closed
dpryan79 opened this issue Oct 20, 2017 · 12 comments
Closed

CRAM support #619

dpryan79 opened this issue Oct 20, 2017 · 12 comments
Assignees
Milestone

Comments

@dpryan79
Copy link
Collaborator

I presume that this already works, but it should be briefly tested and then mentioned in the documentation.

@dpryan79 dpryan79 added this to the 3.0 milestone Oct 20, 2017
@dpryan79 dpryan79 self-assigned this Oct 20, 2017
@dpryan79
Copy link
Collaborator Author

dpryan79 commented Nov 5, 2017

My presumption was only partly correct. Anything that uses idxstats sorts of functionality (e.g., bamCoverage) will fail since summary statistics aren't present in .crai files. Since most tools check to see how many alignments are present (for the sake of sampling), most will fail.

@fidelram
Copy link
Collaborator

fidelram commented Nov 5, 2017

I just read part of the motivation to left this out of the .crai file: this will slow down the index process. Maybe we can argue how this information is quite relevant to speed up other downstream computations?

Alternatively, we could implement a two-pass process where we first count reads, then compute scaling factors, then we apply the scaling factor. But this will require quite some work.

@dpryan79
Copy link
Collaborator Author

dpryan79 commented Nov 5, 2017

I don't foresee a change to the cram index format at this point, that ship has sailed.

I suspect we'll need to check bam.format for CRAM and actually count reads if needed. In theory that could be multithreaded, so I guess it wouldn't take forever.

@dpryan79
Copy link
Collaborator Author

dpryan79 commented Nov 6, 2017

Xref: samtools/hts-specs#137

So maybe some day we won't need two passes.

@dpryan79
Copy link
Collaborator Author

For what it's worth, I need to modify pysam for this, since we need to be able to fetch unmapped reads at the end of the file.

@dpryan79
Copy link
Collaborator Author

Support for fetching reads with no contig has just been added to pysam, so once the next release is out we can finally start supporting CRAM files (I still need to rewrite a bunch of stuff to actually support this, but it's at least possible now).

@dpryan79
Copy link
Collaborator Author

dpryan79 commented Jan 3, 2018

This is mostly implemented, but annoying to test since all of the nose tests need to be run in the test_data directory, due to there being no method to specify a relative fasta file path in CRAM file URIs. An alternative would be to allow specify the fields to decode in CRAM files, since they're rarely needed by deepTools (only the GC bias tools). I might try to make a pysam PR to allow skipping base decoding like one can in htslib.

@dpryan79
Copy link
Collaborator Author

dpryan79 commented Jan 5, 2018

I've gotten around needing a fasta file by not decoding the sequence (I made a PR to add that to pysam), so now testing can be done in any directory. However, it turns out that htslib has a bug wherein not decoding some fields in a CRAM file can (in certain cases) affect whether a given alignment is returned by fetch(). This is, of course, a bit of a problem that requires a fix in htslib. See samtools/htslib#640 for more details.

@dpryan79
Copy link
Collaborator Author

dpryan79 commented Jan 5, 2018

Relatedly, if the htslib issue is resolved and a new version of it and pysam pushed out then deepTools can support CRAM in Galaxy as well as on the command line, where I expect users to be able to handle decoding issues themselves.

@dpryan79
Copy link
Collaborator Author

dpryan79 commented Feb 6, 2018

With htslib 1.7 the issues there should be resolved. Once a new pysam is released the CRAM branch can be merged.

@dpryan79
Copy link
Collaborator Author

pysam 0.14 is now in bioconda, so this should be mergable. I've made not decoding read sequences the default, so it should be possible to support CRAM files in Galaxy.

@dpryan79
Copy link
Collaborator Author

This is now merged into develop for imminent release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants