consider samtools depth to replace sambamba | bedtools in callable #1549
Brent;
awesome! thanks for implementing!
I haven't grokked the existing code fully, but would it be possible to combine this step with the subsequent one? Maybe instead of using bedtools groupby, it could use a python script to do the grouping and also output 250bp regions of sufficient depth to a separate bed file; then those could be checked as needed for really high depth instead of forcing sambamba to re-run on the whole genome. I can prototype if you think this might work, but I'm not sure of the full use of the sambamba depth output.
I put some of this into python, replacing the bedtools groupby step, so it basically cuts the time in half again and does even better for cpu time. Here is gr.py:
In addition to the speed benefits, this might make it possible to group into 250 base windows (or any size) and output the window information to a separate file, so it's not doing per-base samtools depth and then immediately doing sambamba depth window.
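The gr.py script itself wasn't preserved in this thread. As a rough illustration of the approach it describes — not Brent's actual code — here is a minimal sketch that reads `samtools depth` output on stdin and collapses consecutive positions at or above a depth cutoff into BED intervals (the cutoff value and field handling are assumptions):

```python
import sys

def collapse(lines, min_depth=4):
    """Collapse per-base `samtools depth` output (chrom, pos, depth)
    into BED intervals where depth >= min_depth.
    min_depth is an illustrative cutoff, not bcbio's actual default."""
    start = end = None
    cur_chrom = None
    for line in lines:
        chrom, pos, depth = line.split()[:3]
        pos, depth = int(pos), int(depth)
        ok = depth >= min_depth
        if ok and chrom == cur_chrom and end is not None and pos == end + 1:
            end = pos  # extend the current interval
        else:
            if start is not None:
                # samtools depth positions are 1-based; BED is 0-based half-open
                yield (cur_chrom, start - 1, end)
            start, end = (pos, pos) if ok else (None, None)
            cur_chrom = chrom
    if start is not None:
        yield (cur_chrom, start - 1, end)

if __name__ == "__main__":
    for chrom, s, e in collapse(sys.stdin):
        print("%s\t%d\t%d" % (chrom, s, e))
```

Piped as `samtools depth sample.bam | python gr.py > callable.bed`, this avoids the intermediate files that bedtools groupby required.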
Brent -- awesome, thank you. This is a great idea. I'm also working on merging high depth reporting into this to avoid the double depth calculation. These are all great suggestions. I'll work on rolling in this prototype approach and push a fix soon for testing. Thank you again for all this help and benchmarking.
cool. here's what I just hacked together to do the window-depth in the same pass. Writing the window stuff to stderr currently, but you get the idea.
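The hacked-together code was lost from this thread. A sketch of the single-pass idea — accumulate per-base depths into fixed 250bp windows while scanning, then report window means — under the assumption that window output goes to stderr as described (a real single-pass version would interleave this with the callable-region collapsing rather than buffering in a dict):

```python
import sys
from collections import defaultdict

WINDOW = 250  # 250bp windows, per the discussion above

def depth_windows(lines, window=WINDOW):
    """Accumulate per-base depths (chrom, pos, depth) into fixed-size
    windows and yield (chrom, win_start, win_end, mean_depth)."""
    totals = defaultdict(int)
    for line in lines:
        chrom, pos, depth = line.split()[:3]
        win = (int(pos) - 1) // window  # 1-based position -> window index
        totals[(chrom, win)] += int(depth)
    for (chrom, win), total in sorted(totals.items()):
        yield chrom, win * window, (win + 1) * window, total / float(window)

if __name__ == "__main__":
    # window summaries to stderr, as in the comment above
    for chrom, s, e, mean in depth_windows(sys.stdin):
        sys.stderr.write("%s\t%d\t%d\t%.2f\n" % (chrom, s, e, mean))
```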
Awesome work, guys. Much appreciated.
Brent -- thanks so much for the code and approaches. I've integrated this into the current development pipeline and it's giving the same outputs on my tests. Please let me know if you notice any issues or have other approaches for improving speed. Thanks so much for looking at this and all the suggestions.
So I set my stuff going again with this; the log shows that the first thing to run is
Ok, I see that was piping to head and finished quickly, and now the samtools depth + bcbio.bam.highdepth.bin_depths step is running.
Just FYI, the sambamba depth window command didn't get logged.
Hi Brad, I just saw this:
I guess it was for the phix chromosome, which apparently had no alignments. So we should guard that line against empty input.
Brent -- thanks for the heads up on the edge case. I pushed a fix. You're right on with the sambamba depth -- it's a short call estimating median depth across windows, which we then use to categorize high depth regions. This could probably be done smarter, but I replicated what we had before to not make too many changes all at once. Right now we don't have a way to capture output and also log everything, so the sambamba call uses standard subprocess instead of our wrapper around it. Thanks again for all the thoughts and help.
Cool. The sambamba depth ran quickly, so I'm not worried about it. I can already tell the pipeline is proceeding much faster with these changes.
Hi Brad, just another clarification on how the samtools depth step is being run. Also, will the new CWL stuff parallelize the samtools depth part at a finer level than chromosome?
Brent;
Hi Brad, I wrote a small tool for this (goleft depth). It's quite fast for WGS or for a target: I can get coverage for a 60X genome in 12 minutes (with 24 cpus). I think the output matches what you're doing in bcbio. It can also output GC content and masked info, which I know @etal uses in cnvkit, so maybe we could output something that prevents the depth recalculation in cnvkit? BTW, even if you decide not to use this, samtools depth might be faster without the extra flag.
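The callable categories that goleft's output mirrors come down to thresholding per-position or per-window depth. A minimal sketch of that classification, using category names in the GATK CallableLoci style that bcbio follows — the cutoff values here are illustrative assumptions, not bcbio's actual defaults:

```python
def classify(depth, min_depth=4, max_depth=1000):
    """Map a depth value to a bcbio-style callable category.
    min_depth/max_depth are assumed example cutoffs."""
    if depth == 0:
        return "NO_COVERAGE"
    if depth < min_depth:
        return "LOW_COVERAGE"
    if depth > max_depth:
        return "EXCESSIVE_COVERAGE"
    return "CALLABLE"
```

Adjacent positions or windows sharing a category then get merged into the final callable BED regions.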
Brent -- this is so awesome, thank you. I'll work on incorporating it into bcbio, and it would be great to do a single calculation of depth in windows that we can pick up directly with CNVkit, and also with the downstream coverage assessment done by @Rorky and @lpantano in QC. Eric, thinking through this a bit, it seems like we'd have to do some upstream binning to output in a format compatible with making target/anti coverage cnns. Then we'd have a more general coverage summary we could evaluate for callable regions by combining overlapping callable regions. Thanks again for working on this, looking forward to getting it implemented.
I haven't tried this out yet, but it seems likely that binning these results with a target BED file would be much faster than repeating the coverage calculation within CNVkit. I'm happy to help with the conversion to .cnn format. Incidentally, having the average on- and off-target coverages before binning would make it possible to calculate the best average antitarget bin size for the given samples instead of relying on a fixed default. That might improve CNV calls significantly for exomes.
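As a sketch of what that .cnn conversion might look like — the column layout and the log2 definition below are assumptions from memory, so check CNVkit's format documentation before relying on this — binned depths could be rewritten as a coverage table roughly like so:

```python
import math

def bins_to_cnn(bins, gene="-"):
    """Convert (chrom, start, end, mean_depth) bins into rows resembling
    a CNVkit .cnn coverage table. Column order and log2 handling here are
    assumptions, not CNVkit's verified spec."""
    rows = [("chromosome", "start", "end", "gene", "depth", "log2")]
    for chrom, start, end, depth in bins:
        # floor log2 for zero-depth bins to avoid math domain errors
        log2 = math.log(depth, 2) if depth > 0 else -20.0
        rows.append((chrom, start, end, gene, depth, round(log2, 5)))
    return rows
```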
Thanks Eric, that's a great idea for improving anti-target bin sizes as well. Would producing 50bp binned coverage, GC estimates and masking outputs be a good default? Then we could combine and merge as needed to feed into CNVkit cnn calculation. Does that make sense, or were you thinking about another path? Thanks again for helping with this.
Brad, right now goleft depth will not parallelize well if you give it a -b with a single interval that covers the whole genome. I can make it split large intervals to the same size it uses for whole genome if you need. Also, let me know any problems or features you need.
For CNVkit -- for target captures and exomes, I think 50bp is probably too coarse for fitting target bins, but 10bp might be all right. For antitarget intervals and WGS, 50bp or even larger should be OK, as those boundaries are arbitrary. The antitarget bins are not necessarily equal size, since CNVkit squeezes them to fit evenly into large introns and small intergenic regions.
Brent;
wget https://s3.amazonaws.com/chapmanb/testcases/goleft_test.tar.gz
Also, in thinking through this, would it be possible to calculate coverage with duplicates removed? Thanks again for the help and apologies in advance if I'm doing something wrong with goleft.
Hi Brad, you need to specify --prefix. I'll also look into removing duplicates.
It looks like samtools depth already filters on flags. It uses:
I fixed the goleft depth in master so an error is raised when --prefix is not specified, but could you try the version that you have with a --prefix specified and make sure it outputs what you need?
Brent -- sorry about missing the prefix input, that makes total sense now. I was expecting it to stream, so I fixated on that. I'm reading the code now; what do you think about merging the depth and callable outputs into a single tab delimited output file (that we could potentially stream into bgzip)? It's not really BED, and the callable file will need merging as is, since I'm breaking it up in a more finely grained way for CNVkit and QC input. Practically, I'll be using this to generate an initial tab delimited file and then post-processing it for callability, CNVkit and QC. Thanks for checking on samtools depth, that is perfect as is. Nice one. Thanks again for all the help.
Do you mean to just intermix the lines, or only output fixed regions? But yes, I'm fine with changing it. I'd be happy to do it, but I'm not sure I grok what you mean.
Brent -- I need to stop requesting things because you've already implemented everything. Apologies, I hadn't realized you were already collapsing from windows into the final callable.bed file. This is great as is and I'll continue to work to integrate it into bcbio. Thanks much for listening to my rambling ideas.
No worries. Let me know if you have any troubles. I'll make a new release soon as well.
Super valuable work, thanks guys. |
Brent;
Removes unused code after moving coverage assessment to goleft #1549
Brent, Eric and all; I'm open for suggestions about how best to handle this. We could use some combination of defining target/anti-targets for depth (around genes, CNVkit-style) and a better on-disk representation of depth that is not a gigantic text file full of numbers. Great to have goleft replacing a lot of bcbio code in the short term, and happy to work on better depth in the longer term.
We could also see how many bins could be merged if the mean is > 50 and within some delta. And we could merge 0 coverage bins? I can probably look into this next week.
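The bin-merging idea above can be sketched directly: merge adjacent bins when both means clear the cutoff and differ by at most some delta, or when both are zero coverage. The thresholds below are the illustrative values from the comment, not a tuned default:

```python
def merge_bins(bins, min_mean=50, delta=5):
    """Merge adjacent (chrom, start, end, mean) bins when both means
    exceed min_mean and differ by at most delta, or both are zero.
    min_mean/delta are illustrative values from the discussion."""
    merged = []
    for chrom, start, end, mean in bins:
        if merged:
            pchrom, pstart, pend, pmean = merged[-1]
            adjacent = pchrom == chrom and pend == start
            similar = (pmean > min_mean and mean > min_mean
                       and abs(pmean - mean) <= delta)
            both_zero = pmean == 0 and mean == 0
            if adjacent and (similar or both_zero):
                # length-weighted mean of the two merged bins
                w1, w2 = pend - pstart, end - start
                merged[-1] = (chrom, pstart, end,
                              (pmean * w1 + mean * w2) / float(w1 + w2))
                continue
        merged.append((chrom, start, end, mean))
    return merged
```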
CNVkit: For WGS, the placement of bins doesn't matter much, so a bin size of 500 or 1000 (i.e. the usual in bcbio) should be OK for creating a targetcoverage.cnn file, and would save a lot of time. For captures, the coverage calculation doesn't take an unbearable amount of time in my experience, while the placement of target bins matters a lot. So, for captures, it may be better to use the goleft coverages to generate only the antitargetcoverage.cnn file, and (re)calculate targetcoverage.cnn with CNVkit. Merging similar-enough bins sounds great to me. For quick storage:
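The storage suggestion was cut off in this thread. One cheap representation — sketched here as an assumption, not necessarily Eric's actual proposal — is to pack fixed-width bin depths as a compressed binary array instead of a gigantic text file of numbers:

```python
import struct
import zlib

def pack_depths(depths):
    """Pack a list of non-negative integer bin depths into a
    zlib-compressed little-endian uint32 array."""
    raw = struct.pack("<%dI" % len(depths), *depths)
    return zlib.compress(raw)

def unpack_depths(blob):
    """Inverse of pack_depths."""
    raw = zlib.decompress(blob)
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))
```

With fixed bin widths, the genomic coordinates are implicit in the array index, so only the depths need storing; depth data is highly repetitive, so the compression ratio is usually large.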
Just a quick aside concerning 'samtools depth': the current development version is running an improper command, at least as far as I can tell.
samtools depth --help does not show a -d option, at least for the version I am using, which is 1.1.
Brad -- the samtools in bcbio is 1.3.1. Is it possible to update to use the one installed by bcbio in your PATH first? That should hopefully resolve the problem.
I think it's safe to close this now. |
Replaces bedtools genomecov calculation. Provides a 6x speedup over previous approach in initial benchmarking. Also allows piping directly into groupby coverage, avoiding disk IO for the intermediate files. Thanks to Brent Pedersen. Fixes bcbio#1549
Use custom python script from @brentp instead of awk/bedtools to collapse into callable regions from samtools depth. Include identification of highdepth windows in same process to avoid extra work. Fixes bcbio#1549
Swaps to use goleft for coverage estimation, including production of a standard coverage file with 10bp resolution for downstream use by other tools (bcbio#1549, bcbio#1583). Remaining work to do:
- Clean up unused code after change
- Use pre-calculated coverage bins for CNVkit, seq2c and QC reporting
Removes unused code after moving coverage assessment to goleft bcbio#1549
Ah, my apologies. I completely forgot to comment on its success. Thank you -bwubb
On Mon, Oct 24, 2016 at 10:45 AM, Brent Pedersen - Bioinformatics wrote:
I just did this experiment:
So the samtools depth version uses 6x less compute, for the same answer.