consider samtools depth to replace sambamba | bedtools in callable #1549

Closed

brentp opened this issue Sep 9, 2016 · 38 comments
@brentp (Contributor) commented Sep 9, 2016

I just did this experiment:

$ time samtools depth -r 1:1-20000000 -Q 1 --reference $fasta $bam > st.bed
real    0m34.300s
user    0m33.170s
sys 0m1.006s

$ time sambamba view -F 'mapping_quality > 0' -f bam -l 1 $bam 1:1-20000000 | bedtools genomecov -split -ibam stdin -bga -g $fasta > bt.bed
real    3m21.922s
user    5m58.229s
sys 1m47.049s

$ head st.bed
1   9997    1
1   9998    1
1   9999    1
1   10000   3
1   10001   9
1   10002   15
1   10003   23
1   10004   32
1   10005   42
1   10006   47

$ head bt.bed
1   0   9996    0
1   9996    9999    1
1   9999    10000   3
1   10000   10001   9
1   10001   10002   15
1   10002   10003   23
1   10003   10004   32
1   10004   10005   42
1   10005   10006   47
1   10006   10007   53

So the samtools depth version uses 6x less compute, for the same answer.

@chapmanb (Member) commented

Brent;
Thanks so much for the idea and benchmarking. I swapped over to use samtools depth and also feed the output directly into the downstream groupby, so it avoids writing one intermediate file to disk. Great to have this simplified and faster; thank you again for identifying the slow point in the process and providing a workaround.

@brentp (Contributor, Author) commented Sep 11, 2016

awesome! thanks for implementing!

@brentp (Contributor, Author) commented Sep 11, 2016

I haven't fully grokked the existing code, but would it be possible to combine this step with the subsequent sambamba depth window step (which, if I understand correctly, is looking for very high-depth regions)?

Maybe instead of using bedtools groupby it could use a python script to do the grouping and also output 250bp regions of sufficient depth to a separate bed file and then those could be checked as needed for really high depth instead of forcing sambamba to re-run on the whole genome?

I can prototype if you think this might work, but I'm not sure of the full use of the sambamba depth output.

@brentp (Contributor, Author) commented Sep 12, 2016

I put some of this into python, replacing awk -> bedtools groupby -> cut with a single python snippet.
The comparison looks like this:

time samtools depth -r 1:1-20000000 -Q 1 --reference $fasta $bam \
    | awk '{if ($3 == 0) {print $1"\t"$2-1"\t"$2"\tNO_COVERAGE"} else if ($3 < 4) {print $1"\t"$2-1"\t"$2"\tLOW_COVERAGE"} else {print $1"\t"$2-1"\t"$2"\tCALLABLE"} }' \
    | bedtools groupby -prec 21 -g 1,4 -c 1,2,3,4 -o first,min,max,first \
    | cut -f 3-6 \
    > st.bed
real    1m15.183s
user    2m26.165s
sys 0m2.391s

time samtools depth -r 1:1-20000000 -Q 1 --reference $fasta $bam \
    | python gr.py > gr.bed
real    0m36.280s
user    1m6.930s
sys 0m1.906s

md5sum st.bed gr.bed
beff75c0c3599e125e6b8d22ef3f816f  st.bed
beff75c0c3599e125e6b8d22ef3f816f  gr.bed

So it basically cuts the time in half again and does even better on CPU time. Here is gr.py:

import sys

# Track the (chrom, label) of the current run and its buffered lines.
last = (None, None)
cache = []

# samtools depth lines are: chrom <tab> 1-based pos <tab> depth.
viter = (l.rstrip("\r\n").split("\t", 3) for l in sys.stdin)
for chrom, pos, depth in viter:
    key = (chrom, "NO_COVERAGE" if depth == "0" else "LOW_COVERAGE" if int(depth) < 4 else "CALLABLE")
    if key != last:
        # Flush the finished run as a 0-based, half-open BED interval.
        if last[0] is not None:
            print "\t".join((last[0], str(int(cache[0][1]) - 1), cache[-1][1], last[1]))
        last = key
        cache = cache[:0]
    cache.append((chrom, pos, depth))
print "\t".join((chrom, str(int(cache[0][1]) - 1), cache[-1][1], last[1]))

In addition to the speed benefits, this might make it possible to group into 250 base windows (or any) and output the window information to a separate file so it's not doing per-base samtools depth and then immediately doing sambamba depth window.

@chapmanb (Member) commented

Brent -- awesome, thank you. This is a great idea. I'm also working on merging high-depth reporting into this to avoid the double depth calculation. These are all great suggestions. I'll work on rolling in this prototype approach and will push a fix soon for testing. Thank you again for all this help and benchmarking.

@brentp (Contributor, Author) commented Sep 12, 2016

cool. Here's what I just hacked together to do the window-depth in the same pass. It writes the window stuff to stderr currently, but you get the idea.

import sys

WINDOW_SIZE = 250
WINDOW_DEPTH_CUTOFF = 300

last = (None, None)
last_window = -1
depth_cache = []
cache = []

def mean(vals):
    # Note: divides by the full window size, so partial windows average low.
    return sum(vals) / float(WINDOW_SIZE)

viter = (l.rstrip("\r\n").split("\t", 3) for l in sys.stdin)
for chrom, pos, depth in viter:
    depth, ipos = int(depth), int(pos)
    # On crossing a window boundary, report the finished window if high depth.
    if ipos / WINDOW_SIZE != last_window:
        if mean(depth_cache) > WINDOW_DEPTH_CUTOFF:
            sys.stderr.write("%s\t%d\t%d\t%.2f\n" % (chrom, last_window * WINDOW_SIZE, last_window * WINDOW_SIZE + WINDOW_SIZE, mean(depth_cache)))
            sys.stderr.flush()
        depth_cache = depth_cache[:0]
        last_window = ipos / WINDOW_SIZE

    depth_cache.append(depth)

    # Same run-length grouping into callability classes as gr.py above.
    key = (chrom, "NO_COVERAGE" if depth == 0 else "LOW_COVERAGE" if depth < 4 else "CALLABLE")
    if key != last:
        if last[0] is not None:
            print "\t".join((last[0], str(int(cache[0][1]) - 1), cache[-1][1], last[1]))
        last = key
        cache = cache[:0]
    cache.append((chrom, pos, depth))
print "\t".join((chrom, str(int(cache[0][1]) - 1), cache[-1][1], last[1]))

@schelhorn commented

Awesome work, guys. Much appreciated.

chapmanb added a commit that referenced this issue Sep 13, 2016
Use custom python script from @brentp instead of awk/bedtools to
collapse into callable regions from samtools depth. Include
identification of highdepth windows in same process to avoid extra
work. Fixes #1549
@chapmanb (Member) commented

Brent -- thanks so much for the code and approaches. I've integrated this into the current development pipeline and it's giving the same outputs on my tests. Please let me know if you notice any issues or have other approaches for improving speed. Thanks so much for looking at this and all the suggestions.

@brentp (Contributor, Author) commented Sep 13, 2016

So I set my stuff going again with this, and the log shows that the first thing to run is multiprocessing: calc_callable_loci using sambamba depth window. I was expecting it to use samtools depth. My algorithm section looks like this:

- algorithm:
    mark_duplicates: false
    realign: false
    recalibrate: false
    svcaller:
    - cnvkit
    - lumpy

@brentp (Contributor, Author) commented Sep 13, 2016

ok. I see that was piping to head and finished quickly, and now the samtools depth + bcbio.bam.highdepth.bin_depths step is running.

@brentp (Contributor, Author) commented Sep 13, 2016

Just fyi, the sambamba depth window didn't get logged to log/bcbio-nextgen-commands.log AFAICT.

@brentp (Contributor, Author) commented Sep 13, 2016

Hi Brad, I just saw this:

CalledProcessError: Command 'set -o pipefail; /scratch/ucgd/lustre/u6000771/bcbio/galaxy/../anaconda/bin/samtools depth -a -Q 1 -b /scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953-callable-split/15-0022953-NC_007605-callable-coverageregions.bed --reference /scratch/ucgd/lustre/u6000771/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953.bam | /uufs/chpc.utah.edu/common/home/u6000771/bcbio/anaconda/bin/python -c 'import bcbio.bam.callable; bcbio.bam.highdepth.bin_depths(4, 1370, 250, "/scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953-callable-split/tx/tmp_wQTbP/15-0022953-NC_007605-callable.bed", "/scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953-callable-split/tx/tmp_wQTbP/15-0022953-NC_007605-highdepth.bed")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/uufs/chpc.utah.edu/common/home/u6000771/bcbio/anaconda/lib/python2.7/site-packages/bcbio_nextgen-1.0.0a0-py2.7.egg/bcbio/bam/highdepth.py", line 64, in bin_depths
    callable_handle.write("\t".join((chrom, str(int(cache[0][1]) - 1), str(cache[-1][1]), last[1])) + "\n")
UnboundLocalError: local variable 'chrom' referenced before assignment

I guess it was for a contig with no alignments (NC_007605 here). So we should prefix that line with if cache: ...
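
A minimal sketch of that fix, in the same Python 2 style as the scripts above (cache, chrom and last as defined there):

# Only flush the trailing interval if any depth lines were read at all;
# with no alignments the loop body never runs and chrom is never bound.
if cache:
    print "\t".join((chrom, str(int(cache[0][1]) - 1), cache[-1][1], last[1]))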

@chapmanb (Member) commented

Brent -- thanks for the heads up on the edge case. I pushed the if cache fix to avoid it. Definitely let me know if you run into anything else.

You're right on with the sambamba depth -- it's a short call estimating median depth across windows, which we then use to categorize high-depth regions. This could probably be done smarter, but I replicated what we had before to avoid making too many changes all at once. Right now we don't have a way to capture output and also log everything, so the sambamba call uses standard subprocess instead of our wrapper around it.
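
As a rough illustration of that windowed median categorization (a sketch over samtools depth-style per-base input, not bcbio's actual code):

import sys
import numpy as np

WINDOW = 250  # illustrative window size

# Report the median depth of each window seen in chrom/pos/depth input.
depths, win, cur_chrom = [], None, None
for line in sys.stdin:
    chrom, pos, depth = line.rstrip("\r\n").split("\t")[:3]
    w = (int(pos) - 1) // WINDOW
    if (chrom, w) != (cur_chrom, win):
        if depths:
            print("%s\t%d\t%d\t%.1f" % (cur_chrom, win * WINDOW, (win + 1) * WINDOW, np.median(depths)))
        depths, cur_chrom, win = [], chrom, w
    depths.append(int(depth))
if depths:
    print("%s\t%d\t%d\t%.1f" % (cur_chrom, win * WINDOW, (win + 1) * WINDOW, np.median(depths)))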

Thanks again for all the thoughts and help.

@brentp (Contributor, Author) commented Sep 13, 2016

cool. The sambamba depth ran quickly, so I'm not worried about it. I can already tell the pipeline is proceeding much faster with these changes.

@brentp (Contributor, Author) commented Sep 18, 2016

Hi Brad, just another clarification: the samtools depth is running with a -b callable-coverageregions.bed argument. That is a single-line BED file that covers the entire chromosome. Is this expected?

Also, will the new CWL stuff parallelize the samtools depth part at a finer level than chromosome?

@chapmanb (Member) commented

Brent;
That's right for whole genome runs. For exome/targeted runs it should correspond to the regions on that chromosome. CWL doesn't parallelize in more detail than at the chromosome level; it's replicating what is done in bcbio now. Do you think that's needed?

@brentp (Contributor, Author) commented Sep 20, 2016

Hi Brad, I wrote a samtools depth parallelizer here: https://github.com/brentp/goleft/releases/tag/v0.1.0
with docs here: https://github.com/brentp/goleft/tree/master/depth

It's quite fast for WGS or for a target. I can get coverage for a 60X genome in 12 minutes (with 24 cpus). I think the output matches what you're doing in bcbio, including the CALLABLE/LOW_COVERAGE/NO_COVERAGE tag.

It can also output GC content and masked info which I know @etal uses in cnvkit so maybe we could output something that prevents the depth recalculation in cnvkit?

BTW, even if you decide not to use this, samtools depth might be faster without the -b argument for WGS. I found that having a -b argument slows things down somewhat -- though maybe it's not an issue for only a single interval.

@chapmanb (Member) commented

Brent -- this is so awesome, thank you. I'll work on incorporating it into bcbio. It would be great to do a single calculation of depth in windows that we can pick up directly with CNVkit, and also use for the downstream coverage assessment done by @Rorky and @lpantano in QC. Eric, thinking through this a bit, it seems like we'd have to do some upstream binning to output in a format compatible with making target/antitarget coverage cnns. Then we'd have a more general coverage summary we could evaluate for callable regions by combining overlapping callable regions. Thanks again for working on this, looking forward to getting it implemented.

chapmanb reopened this Sep 21, 2016
@etal (Contributor) commented Sep 21, 2016

I haven't tried this out yet, but it seems likely that binning these results with a target BED file would be much faster than repeating the coverage calculation within CNVkit. I'm happy to help with the conversion to .cnn format.

Incidentally, having the average on- and off-target coverages before binning would make it possible to calculate the best average antitarget bin size for the given samples instead of relying on a fixed default. That might improve CNV calls significantly for exomes.
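
One plausible back-of-the-envelope version of that calculation (purely illustrative, not CNVkit's actual logic): size antitarget bins so they collect roughly as many reads as the target bins do.

def antitarget_bin_size(target_bin, mean_target_depth, mean_antitarget_depth):
    # Lower off-target depth implies proportionally wider antitarget bins
    # if we want about the same read count per bin. Illustrative only.
    return int(target_bin * mean_target_depth / max(mean_antitarget_depth, 1e-6))

# e.g. 267bp target bins, 120x on-target and 2x off-target coverage:
print(antitarget_bin_size(267, 120.0, 2.0))  # -> 16020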

@chapmanb (Member) commented

Thanks Eric, that's a great idea for improving anti-target bin sizes as well. Would producing 50bp binned coverage, GC estimates and masking outputs be a good default? Then we could combine and merge as needed to feed into CNVkit cnn calculation. Does that make sense, or were you thinking about another path? Thanks again for helping with this.

@brentp (Contributor, Author) commented Sep 22, 2016

Brad, right now goleft depth will not parallelize well if you give it a -b with a single interval that covers the whole genome. I can make it split large intervals to the same size it uses for whole genome runs if you need. Also let me know any problems or features you need.

@etal (Contributor) commented Sep 28, 2016

For CNVkit -- for target captures and exomes, I think 50bp is probably too coarse for fitting target bins, but 10bp might be all right. For antitarget intervals and WGS, 50bp or even larger should be OK as those boundaries are arbitrary. The antitarget bins are not necessarily equal size, since CNVkit squeezes them to fit evenly into large introns and small intergenic regions.
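
To illustrate the squeezing idea (a sketch only, not CNVkit's implementation): pick the number of bins so the actual size is as close as possible to the desired size, then split the region evenly.

def squeeze_bins(start, end, desired=50):
    # Split [start, end) into n near-equal bins sized close to `desired`;
    # the last bin absorbs any remainder from integer division.
    length = end - start
    n = max(1, int(round(length / float(desired))))
    step = length // n
    edges = [start + i * step for i in range(n)] + [end]
    return list(zip(edges[:-1], edges[1:]))

print(squeeze_bins(1000, 1180, 50))
# -> [(1000, 1045), (1045, 1090), (1090, 1135), (1135, 1180)]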

@chapmanb (Member) commented

Brent;
Thanks again for putting together the goleft depth approach. I've been working on testing this and for some reason can't get goleft to generate any output. I'm not sure what I'm doing wrong and here's a small test case with almost no arguments that generates nothing for me:

wget https://s3.amazonaws.com/chapmanb/testcases/goleft_test.tar.gz

It outputs a depth.pprof file but nothing else, so I'm confused. Can you provide pointers about what I'm doing wrong?

Also in thinking through this, would it be possible to calculate coverage with duplicates removed? We could use a samtools view piped to depth, or alternatively use the filters in sambamba depth. I'm not sure how either of those options look in terms of speed. The de-duplicated output would help with more reliable estimates of coverage actually going into variant calling, improve QC metrics and also hopefully be better for CNVkit input.

Thanks again for the help and apologies in advance if I'm doing something wrong with goleft.

@brentp (Contributor, Author) commented Sep 30, 2016

Hi Brad, you need to specify --prefix. Otherwise, it's writing to ".depth.bed" and ".callable.bed" which you're not seeing. I'll adjust the code to make that required.

I'll also look into removing duplicates.

@brentp (Contributor, Author) commented Oct 1, 2016

It looks like samtools depth already filters on flags. It uses:

if ( b->core.flag & (BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP) ) continue;
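
The equivalent filter expressed via pysam's read properties, for reference (a sketch; the BAM path is hypothetical):

import pysam

# Count depth-eligible reads, skipping the same flags samtools depth
# skips by default: unmapped, secondary, QC-fail and duplicate.
with pysam.AlignmentFile("sample.bam") as bam:
    kept = sum(1 for read in bam
               if not (read.is_unmapped or read.is_secondary
                       or read.is_qcfail or read.is_duplicate))
    print(kept)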

I fixed the goleft depth in master so an error is raised when --prefix is not specified, but could you try the version that you have with a --prefix specified and make sure it outputs what you need?

@chapmanb (Member) commented Oct 1, 2016

Brent -- sorry about missing the prefix input, that makes total sense now. I was expecting it to stream, so I fixated on that. I'm reading the code now; what do you think about merging the depth and callable outputs into a single tab-delimited output file (that we could potentially stream into bgzip)? It's not really BED, and the callable file will need merging as-is since I'm breaking it up in a more finely grained way for CNVkit and QC input. Practically, I'll be using this to generate an initial tab-delimited file and then post-processing it for callability, CNVkit and QC.

Thanks for checking on samtools depth, that is perfect as is. Nice one. Thanks again for all the help.

@brentp (Contributor, Author) commented Oct 1, 2016

do you mean to just intermix the lines? or only output fixed regions?

but yes, I'm fine with changing it. I'd be happy to do it, but not sure I grok what you mean.

@chapmanb (Member) commented Oct 1, 2016

Brent -- I need to stop requesting things because you've already implemented everything. Apologies, I hadn't realized you were already collapsing from windows into the final callable.bed file. This is great as is and I'll continue to work to integrate into bcbio. Thanks much for listening to my rambling ideas.

@brentp (Contributor, Author) commented Oct 1, 2016

no worries. let me know if you have any troubles. I'll make a new release soon as well.

@schelhorn commented

Super valuable work, thanks guys.

chapmanb added a commit that referenced this issue Oct 5, 2016
Swaps to use goleft for coverage estimation including production
of a standard coverage file with 10bp resolution for downstream use
by other tools (#1549, #1583). Remaining work to do:

- Cleanup unused code after change
- Use pre-calculated coverage bins for CNVkit, seq2c and QC reporting
@chapmanb (Member) commented Oct 5, 2016

Brent;
Thanks again for all the help getting goleft working. I checked in coverage estimation with goleft, replacing the custom python code. Please let me know if you run into any problems at all. I'll work next on cleaning up the older code and then starting to migrate CNVkit, seq2c and coverage QC to use the 10bp binned outputs produced by goleft. Thank you again for helping push this forward.

chapmanb added a commit that referenced this issue Oct 6, 2016
Removes unused code after moving coverage assessment to goleft #1549
@chapmanb (Member) commented Oct 6, 2016

Brent, Eric and all;
I did some testing with the goleft approach and checked in a couple of fixes to get things working cleanly with the current development version. The biggest hurdle right now is that the depth files are gigantic at 10bp resolution -- 100s of GBs even for exome experiments. Sorry, I severely underestimated that. I turned the bin size up (a72db6c) to avoid the issue, but we'll have to think of a new approach that prepares coverage binning a single time, as this won't work practically for feeding into CNVkit and QC to avoid re-calculating later.

I'm open for suggestions about how best to handle this. We could use some combination of defining target/anti-targets for depth (around genes, CNVkit-style) and a better on-disk representation of depth that is not a gigantic text file full of numbers.

Great to have goleft replacing a lot of bcbio code in the short term, and happy to work on better depth in the longer term.

@brentp (Contributor, Author) commented Oct 6, 2016

We could also see how many bins could be merged if the mean is > 50 and within some delta. And we could merge 0 coverage bins? I can probably look into this next week.
Or do those things not work with assumptions in downstream code?
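
A quick sketch of that kind of merge (illustrative only; min_mean and delta stand in for the "> 50" and "some delta" above):

def merge_bins(bins, min_mean=50, delta=0.1):
    # bins: sorted, contiguous (start, end, mean_depth) tuples for one chrom.
    # Merge adjacent bins that are both zero coverage, or both above
    # min_mean with means within a relative delta of each other.
    merged = [list(bins[0])]
    for start, end, mean in bins[1:]:
        pstart, pend, pmean = merged[-1]
        both_zero = pmean == 0 and mean == 0
        similar = (min(pmean, mean) > min_mean
                   and abs(pmean - mean) <= delta * max(pmean, mean))
        if start == pend and (both_zero or similar):
            # Length-weighted mean for the combined bin.
            tot = pmean * (pend - pstart) + mean * (end - start)
            merged[-1] = [pstart, end, tot / float(end - pstart)]
        else:
            merged.append([start, end, mean])
    return [tuple(b) for b in merged]

print(merge_bins([(0, 50, 60.0), (50, 100, 62.0), (100, 150, 0.0), (150, 200, 0.0)]))
# -> [(0, 100, 61.0), (100, 200, 0.0)]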

@etal (Contributor) commented Oct 6, 2016

CNVkit: For WGS, the placement of bins doesn't matter much, so a bin size of 500 or 1000 (i.e. the usual in bcbio) should be OK for creating a targetcoverage.cnn file, and would save a lot of time. For captures, the coverage calculation doesn't take an unbearable amount of time in my experience, while the placement of target bins matters a lot. So, for captures, it may be better to use the goleft coverages to generate only the antitargetcoverage.cnn file, and (re)calculate targetcoverage.cnn with CNVkit.

Merging similar-enough bins sounds great to me.

For quick storage: numpy.ndarray.tofile can write an array of numbers to a local file in compact binary form that is easily reloaded with numpy.fromfile. Not sure how this compares with pickle, or if saving space with gzip is worth the additional time spent. Or, since this file is coming from goleft initially -- write to HDF5, load with pandas?
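
For example, the numpy binary round-trip is just (file name illustrative):

import numpy as np

depths = np.array([0, 0, 3, 9, 15, 23], dtype=np.int32)

# Raw binary dump (no header); reload with the matching dtype.
depths.tofile("sample.depths.int32.bin")
reloaded = np.fromfile("sample.depths.int32.bin", dtype=np.int32)
assert (depths == reloaded).all()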

@bwubb commented Oct 12, 2016

Just a quick aside concerning samtools depth: the current development version is running an improper command, at least as far as I can tell.

[2016-10-12T15:16Z] depth: invalid option -- 'd'
[2016-10-12T15:16Z] depth: invalid option -- '-'
[2016-10-12T15:16Z] [E::hts_open] fail to open file '1000'
[2016-10-12T15:16Z] samtools: Could not open "1000": No such file or directory
[2016-10-12T15:16Z] /bin/bash: line 1:  4966 Segmentation fault      (core dumped) samtools depth -Q 1 -d 1000 -r 1:14026643-14126642 --reference /home/bwubb/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /home/bwubb/NGS/MelanomaTargeted/bcbio/data/work/1205-CRT-20Aplin/align/1205-CRT-20Aplin/1205-CRT-20Aplin-ready.bam

samtools depth --help does not show a -d option, at least for the version I am using, which is 1.1.

@chapmanb (Member) commented

Brad -- the samtools in bcbio is 1.3.1. Is it possible to update to use the one installed by bcbio in your PATH first? That should hopefully resolve the problem.

@brentp (Contributor, Author) commented Oct 24, 2016

I think it's safe to close this now.

brentp closed this as completed Oct 24, 2016
vladsavelyev pushed a commit to vladsavelyev/bcbio-nextgen that referenced this issue Oct 24, 2016
Replaces bedtools genomecov calculation. Provides a 6x speedup over
previous approach in initial benchmarking. Also allows piping directly
into groupby coverage, avoiding disk IO for the intermediate files.
Thanks to Brent Pedersen. Fixes bcbio#1549
@bwubb commented Oct 25, 2016

Ah, my apologies. I completely forgot to comment on its success. Thank you for the help; sorry it was a trivial issue.

