consider samtools depth to replace sambamba | bedtools in callable #1549

Closed

brentp opened this issue Sep 9, 2016 · 38 comments
@brentp (Contributor) commented Sep 9, 2016

I just did this experiment:

$ time samtools depth -r 1:1-20000000 -Q 1 --reference $fasta $bam > st.bed
real    0m34.300s
user    0m33.170s
sys 0m1.006s

$ time sambamba view -F 'mapping_quality > 0' -f bam -l 1 $bam 1:1-20000000 | bedtools genomecov -split -ibam stdin -bga -g $fasta > bt.bed
real    3m21.922s
user    5m58.229s
sys 1m47.049s

$ head st.bed
1   9997    1
1   9998    1
1   9999    1
1   10000   3
1   10001   9
1   10002   15
1   10003   23
1   10004   32
1   10005   42
1   10006   47

$ head bt.bed
1   0   9996    0
1   9996    9999    1
1   9999    10000   3
1   10000   10001   9
1   10001   10002   15
1   10002   10003   23
1   10003   10004   32
1   10004   10005   42
1   10005   10006   47
1   10006   10007   53

So the samtools depth version uses 6x less compute, for the same answer.

@chapmanb (Member) commented

Brent;
Thanks so much for the idea and benchmarking. I swapped over to use samtools depth and also feed the output directly into the downstream groupby, so it avoids writing one intermediate file to disk. Great to have this simplified and faster; thank you again for identifying the slow point in the process and providing a workaround.

@brentp (Contributor, Author) commented Sep 11, 2016

awesome! thanks for implementing!

@brentp (Contributor, Author) commented Sep 11, 2016

I haven't fully grokked the existing code, but would it be possible to combine this step with the subsequent sambamba depth window step (which, if I understand correctly, is looking for very high-depth regions)?

Maybe instead of using bedtools groupby it could use a python script to do the grouping and also output 250bp regions of sufficient depth to a separate bed file and then those could be checked as needed for really high depth instead of forcing sambamba to re-run on the whole genome?

I can prototype if you think this might work, but I'm not sure of the full use of the sambamba depth output.

@brentp (Contributor, Author) commented Sep 12, 2016

I put some of this into python, replacing awk -> bedtools groupby -> cut with a single python snippet.
The comparison looks like this:

time samtools depth -r 1:1-20000000 -Q 1 --reference $fasta $bam \
    | awk '{if ($3 == 0) {print $1"\t"$2-1"\t"$2"\tNO_COVERAGE"} else if ($3 < 4) {print $1"\t"$2-1"\t"$2"\tLOW_COVERAGE"} else {print $1"\t"$2-1"\t"$2"\tCALLABLE"} }' \
    | bedtools groupby -prec 21 -g 1,4 -c 1,2,3,4 -o first,min,max,first \
    | cut -f 3-6 \
    > st.bed
real    1m15.183s
user    2m26.165s
sys 0m2.391s

time samtools depth -r 1:1-20000000 -Q 1 --reference $fasta $bam \
    | python gr.py > gr.bed
real    0m36.280s
user    1m6.930s
sys 0m1.906s

md5sum st.bed gr.bed
beff75c0c3599e125e6b8d22ef3f816f  st.bed
beff75c0c3599e125e6b8d22ef3f816f  gr.bed

So it basically cuts the time in half again and does even better on CPU time. Here is gr.py:

import sys

# Track the (chrom, label) of the current run and its buffered lines.
last = (None, None)
cache = []

# samtools depth lines are: chrom <tab> 1-based pos <tab> depth.
viter = (l.rstrip("\r\n").split("\t", 3) for l in sys.stdin)
for chrom, pos, depth in viter:
    key = (chrom, "NO_COVERAGE" if depth == "0" else "LOW_COVERAGE" if int(depth) < 4 else "CALLABLE")
    if key != last:
        # Flush the finished run as a 0-based, half-open BED interval.
        if last[0] is not None:
            print "\t".join((last[0], str(int(cache[0][1]) - 1), cache[-1][1], last[1]))
        last = key
        cache = cache[:0]
    cache.append((chrom, pos, depth))
print "\t".join((chrom, str(int(cache[0][1]) - 1), cache[-1][1], last[1]))

In addition to the speed benefits, this might make it possible to group into 250 base windows (or any) and output the window information to a separate file so it's not doing per-base samtools depth and then immediately doing sambamba depth window.

@chapmanb (Member) commented

Brent -- awesome, thank you. This is a great idea. I'm also working on merging high-depth reporting into this to avoid the double depth calculation. These are all great suggestions. I'll work on rolling in this prototype approach and will push a fix soon for testing. Thank you again for all this help and benchmarking.

@brentp (Contributor, Author) commented Sep 12, 2016

cool. Here's what I just hacked together to do the window-depth in the same pass. It writes the window stuff to stderr currently, but you get the idea.

import sys

WINDOW_SIZE = 250
WINDOW_DEPTH_CUTOFF = 300

last = (None, None)
last_window = -1
depth_cache = []
cache = []

def mean(vals):
    # Note: divides by the full window size, so partial windows average low.
    return sum(vals) / float(WINDOW_SIZE)

viter = (l.rstrip("\r\n").split("\t", 3) for l in sys.stdin)
for chrom, pos, depth in viter:
    depth, ipos = int(depth), int(pos)
    # On crossing a window boundary, report the finished window if high depth.
    if ipos / WINDOW_SIZE != last_window:
        if mean(depth_cache) > WINDOW_DEPTH_CUTOFF:
            sys.stderr.write("%s\t%d\t%d\t%.2f\n" % (chrom, last_window * WINDOW_SIZE, last_window * WINDOW_SIZE + WINDOW_SIZE, mean(depth_cache)))
            sys.stderr.flush()
        depth_cache = depth_cache[:0]
        last_window = ipos / WINDOW_SIZE

    depth_cache.append(depth)

    # Same run-length grouping into callability classes as gr.py above.
    key = (chrom, "NO_COVERAGE" if depth == 0 else "LOW_COVERAGE" if depth < 4 else "CALLABLE")
    if key != last:
        if last[0] is not None:
            print "\t".join((last[0], str(int(cache[0][1]) - 1), cache[-1][1], last[1]))
        last = key
        cache = cache[:0]
    cache.append((chrom, pos, depth))
print "\t".join((chrom, str(int(cache[0][1]) - 1), cache[-1][1], last[1]))

@schelhorn commented

Awesome work, guys. Much appreciated.

chapmanb added a commit that referenced this issue Sep 13, 2016
Use custom python script from @brentp instead of awk/bedtools to
collapse into callable regions from samtools depth. Include
identification of highdepth windows in same process to avoid extra
work. Fixes #1549
@chapmanb (Member) commented

Brent -- thanks so much for the code and approaches. I've integrated this into the current development pipeline and it's giving the same outputs on my tests. Please let me know if you notice any issues or have other approaches for improving speed. Thanks so much for looking at this and all the suggestions.

@brentp (Contributor, Author) commented Sep 13, 2016

So I set my stuff going again with this, and the log shows that the first thing to run is multiprocessing: calc_callable_loci using sambamba depth window. I was expecting it to use samtools depth. My algorithm section looks like this:

- algorithm:
    mark_duplicates: false
    realign: false
    recalibrate: false
    svcaller:
    - cnvkit
    - lumpy

@brentp (Contributor, Author) commented Sep 13, 2016

ok. I see that was piping to head and finished quickly, and now the samtools depth + bcbio.bam.highdepth.bin_depths step is running.

@brentp (Contributor, Author) commented Sep 13, 2016

Just fyi, the sambamba depth window didn't get logged to log/bcbio-nextgen-commands.log AFAICT.

@brentp (Contributor, Author) commented Sep 13, 2016

Hi Brad, I just saw this:

CalledProcessError: Command 'set -o pipefail; /scratch/ucgd/lustre/u6000771/bcbio/galaxy/../anaconda/bin/samtools depth -a -Q 1 -b /scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953-callable-split/15-0022953-NC_007605-callable-coverageregions.bed --reference /scratch/ucgd/lustre/u6000771/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953.bam | /uufs/chpc.utah.edu/common/home/u6000771/bcbio/anaconda/bin/python -c 'import bcbio.bam.callable; bcbio.bam.highdepth.bin_depths(4, 1370, 250, "/scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953-callable-split/tx/tmp_wQTbP/15-0022953-NC_007605-callable.bed", "/scratch/ucgd/lustre/u6000771/bcbio/eiee/work/align/15-0022953/15-0022953-callable-split/tx/tmp_wQTbP/15-0022953-NC_007605-highdepth.bed")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/uufs/chpc.utah.edu/common/home/u6000771/bcbio/anaconda/lib/python2.7/site-packages/bcbio_nextgen-1.0.0a0-py2.7.egg/bcbio/bam/highdepth.py", line 64, in bin_depths
    callable_handle.write("\t".join((chrom, str(int(cache[0][1]) - 1), str(cache[-1][1]), last[1])) + "\n")
UnboundLocalError: local variable 'chrom' referenced before assignment

I guess it was for a contig with no alignments (NC_007605 here). So we should prefix that line with if cache: ...
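
A minimal sketch of that fix, in the same Python 2 style as the scripts above (cache, chrom and last as defined there):

# Only flush the trailing interval if any depth lines were read at all;
# with no alignments the loop body never runs and chrom is never bound.
if cache:
    print "\t".join((chrom, str(int(cache[0][1]) - 1), cache[-1][1], last[1]))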

@chapmanb (Member) commented

Brent -- thanks for the heads up on the edge case. I pushed the if cache fix to avoid it. Definitely let me know if you run into anything else.

You're right on with the sambamba depth -- it's a short call estimating median depth across windows, which we then use to categorize high-depth regions. This could probably be done smarter, but I replicated what we had before to avoid making too many changes all at once. Right now we don't have a way to capture output and also log everything, so the sambamba call uses standard subprocess instead of our wrapper around it.
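
As a rough illustration of that windowed median categorization (a sketch over samtools depth-style per-base input, not bcbio's actual code):

import sys
import numpy as np

WINDOW = 250  # illustrative window size

# Report the median depth of each window seen in chrom/pos/depth input.
depths, win, cur_chrom = [], None, None
for line in sys.stdin:
    chrom, pos, depth = line.rstrip("\r\n").split("\t")[:3]
    w = (int(pos) - 1) // WINDOW
    if (chrom, w) != (cur_chrom, win):
        if depths:
            print("%s\t%d\t%d\t%.1f" % (cur_chrom, win * WINDOW, (win + 1) * WINDOW, np.median(depths)))
        depths, cur_chrom, win = [], chrom, w
    depths.append(int(depth))
if depths:
    print("%s\t%d\t%d\t%.1f" % (cur_chrom, win * WINDOW, (win + 1) * WINDOW, np.median(depths)))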

Thanks again for all the thoughts and help.

@brentp (Contributor, Author) commented Sep 13, 2016

cool. The sambamba depth ran quickly, so I'm not worried about it. I can already tell the pipeline is proceeding much faster with these changes.

@brentp (Contributor, Author) commented Sep 18, 2016

Hi Brad, just another clarification: the samtools depth is running with a -b callable-coverageregions.bed argument. That is a single-line BED file that covers the entire chromosome. Is this expected?

Also, will the new CWL stuff parallelize the samtools depth part at a finer level than chromosome?

@chapmanb (Member) commented

Brent;
That's right for whole genome runs. For exome/targeted runs it should correspond to the regions on that chromosome. CWL doesn't parallelize in more detail than at the chromosome level; it's replicating what is done in bcbio now. Do you think that's needed?

@brentp (Contributor, Author) commented Sep 20, 2016

Hi Brad, I wrote a samtools depth parallelizer here: https://github.com/brentp/goleft/releases/tag/v0.1.0
with docs here: https://github.com/brentp/goleft/tree/master/depth

It's quite fast for WGS or for a target. I can get coverage for a 60X genome in 12 minutes (with 24 cpus). I think the output matches what you're doing in bcbio, including the CALLABLE/LOW_COVERAGE/NO_COVERAGE tag.

It can also output GC content and masked info which I know @etal uses in cnvkit so maybe we could output something that prevents the depth recalculation in cnvkit?

BTW, even if you decide not to use this, samtools depth might be faster without the -b argument for WGS. I found that having a -b argument slows things down somewhat -- though maybe it's not an issue for only a single interval.

@chapmanb (Member) commented

Brent -- this is so awesome, thank you. I'll work on incorporating it into bcbio. It would be great to do a single calculation of depth in windows that we can pick up directly with CNVkit, and also use for the downstream coverage assessment done by @Rorky and @lpantano in QC. Eric, thinking through this a bit, it seems like we'd have to do some upstream binning to output in a format compatible with making target/antitarget coverage cnns. Then we'd have a more general coverage summary we could evaluate for callable regions by combining overlapping callable regions. Thanks again for working on this, looking forward to getting it implemented.

chapmanb reopened this Sep 21, 2016
@etal (Contributor) commented Sep 21, 2016

I haven't tried this out yet, but it seems likely that binning these results with a target BED file would be much faster than repeating the coverage calculation within CNVkit. I'm happy to help with the conversion to .cnn format.

Incidentally, having the average on- and off-target coverages before binning would make it possible to calculate the best average antitarget bin size for the given samples instead of relying on a fixed default. That might improve CNV calls significantly for exomes.
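
One plausible back-of-the-envelope version of that calculation (purely illustrative, not CNVkit's actual logic): size antitarget bins so they collect roughly as many reads as the target bins do.

def antitarget_bin_size(target_bin, mean_target_depth, mean_antitarget_depth):
    # Lower off-target depth implies proportionally wider antitarget bins
    # if we want about the same read count per bin. Illustrative only.
    return int(target_bin * mean_target_depth / max(mean_antitarget_depth, 1e-6))

# e.g. 267bp target bins, 120x on-target and 2x off-target coverage:
print(antitarget_bin_size(267, 120.0, 2.0))  # -> 16020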

@chapmanb (Member) commented

Thanks Eric, that's a great idea for improving anti-target bin sizes as well. Would producing 50bp binned coverage, GC estimates and masking outputs be a good default? Then we could combine and merge as needed to feed into CNVkit cnn calculation. Does that make sense, or were you thinking about another path? Thanks again for helping with this.

@brentp (Contributor, Author) commented Sep 22, 2016

Brad, right now goleft depth will not parallelize well if you give it a -b with a single interval that covers the whole genome. I can make it split large intervals to the same size it uses for whole genome runs if you need. Also let me know any problems or features you need.

@etal (Contributor) commented Sep 28, 2016

For CNVkit -- for target captures and exomes, I think 50bp is probably too coarse for fitting target bins, but 10bp might be all right. For antitarget intervals and WGS, 50bp or even larger should be OK as those boundaries are arbitrary. The antitarget bins are not necessarily equal size, since CNVkit squeezes them to fit evenly into large introns and small intergenic regions.
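
To illustrate the squeezing idea (a sketch only, not CNVkit's implementation): pick the number of bins so the actual size is as close as possible to the desired size, then split the region evenly.

def squeeze_bins(start, end, desired=50):
    # Split [start, end) into n near-equal bins sized close to `desired`;
    # the last bin absorbs any remainder from integer division.
    length = end - start
    n = max(1, int(round(length / float(desired))))
    step = length // n
    edges = [start + i * step for i in range(n)] + [end]
    return list(zip(edges[:-1], edges[1:]))

print(squeeze_bins(1000, 1180, 50))
# -> [(1000, 1045), (1045, 1090), (1090, 1135), (1135, 1180)]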

@chapmanb (Member) commented

Brent;
Thanks again for putting together the goleft depth approach. I've been working on testing this and for some reason can't get goleft to generate any output. I'm not sure what I'm doing wrong and here's a small test case with almost no arguments that generates nothing for me:

wget https://s3.amazonaws.com/chapmanb/testcases/goleft_test.tar.gz

It outputs a depth.pprof file but nothing else, so I'm confused. Can you provide pointers about what I'm doing wrong?

Also in thinking through this, would it be possible to calculate coverage with duplicates removed? We could use a samtools view piped to depth, or alternatively use the filters in sambamba depth. I'm not sure how either of those options look in terms of speed. The de-duplicated output would help with more reliable estimates of coverage actually going into variant calling, improve QC metrics and also hopefully be better for CNVkit input.

Thanks again for the help and apologies in advance if I'm doing something wrong with goleft.

@brentp (Contributor, Author) commented Sep 30, 2016

Hi Brad, you need to specify --prefix. Otherwise, it's writing to ".depth.bed" and ".callable.bed" which you're not seeing. I'll adjust the code to make that required.

I'll also look into removing duplicates.

@brentp (Contributor, Author) commented Oct 1, 2016

It looks like samtools depth already filters on flags. It uses:

if ( b->core.flag & (BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP) ) continue;
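
The equivalent filter expressed via pysam's read properties, for reference (a sketch; the BAM path is hypothetical):

import pysam

# Count depth-eligible reads, skipping the same flags samtools depth
# skips by default: unmapped, secondary, QC-fail and duplicate.
with pysam.AlignmentFile("sample.bam") as bam:
    kept = sum(1 for read in bam
               if not (read.is_unmapped or read.is_secondary
                       or read.is_qcfail or read.is_duplicate))
    print(kept)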

I fixed the goleft depth in master so an error is raised when --prefix is not specified, but could you try the version that you have with a --prefix specified and make sure it outputs what you need?

@chapmanb (Member) commented Oct 1, 2016

Brent -- sorry about missing the prefix input, that makes total sense now. I was expecting it to stream, so I fixated on that. I'm reading the code now; what do you think about merging the depth and callable outputs into a single tab-delimited output file (that we could potentially stream into bgzip)? It's not really BED, and the callable file will need merging as-is since I'm breaking it up in a more finely grained way for CNVkit and QC input. Practically, I'll be using this to generate an initial tab-delimited file and then post-processing it for callability, CNVkit and QC.

Thanks for checking on samtools depth, that is perfect as is. Nice one. Thanks again for all the help.

@brentp (Contributor, Author) commented Oct 1, 2016

do you mean to just intermix the lines? or only output fixed regions?

but yes, I'm fine with changing it. I'd be happy to do it, but not sure I grok what you mean.

@chapmanb (Member) commented Oct 1, 2016

Brent -- I need to stop requesting things because you've already implemented everything. Apologies, I hadn't realized you were already collapsing from windows into the final callable.bed file. This is great as is and I'll continue to work to integrate into bcbio. Thanks much for listening to my rambling ideas.

@brentp (Contributor, Author) commented Oct 1, 2016

no worries. let me know if you have any troubles. I'll make a new release soon as well.

@schelhorn commented

Super valuable work, thanks guys.

chapmanb added a commit that referenced this issue Oct 5, 2016
Swaps to use goleft for coverage estimation including production
of a standard coverage file with 10bp resolution for downstream use
by other tools (#1549, #1583). Remaining work to do:

- Cleanup unused code after change
- Use pre-calculated coverage bins for CNVkit, seq2c and QC reporting
@chapmanb (Member) commented Oct 5, 2016

Brent;
Thanks again for all the help getting goleft working. I checked in coverage estimation with goleft, replacing the custom python code. Please let me know if you run into any problems at all. I'll work next on cleaning up the older code and then starting to migrate CNVkit, seq2c and coverage QC to use the 10bp binned outputs produced by goleft. Thank you again for helping push this forward.

chapmanb added a commit that referenced this issue Oct 6, 2016
Removes unused code after moving coverage assessment to goleft #1549
@chapmanb (Member) commented Oct 6, 2016

Brent, Eric and all;
I did some testing with the goleft approach and checked in a couple of fixes to get things working cleanly with the current development version. The biggest hurdle right now is that the depth files are gigantic at 10bp resolution -- 100s of GBs even for exome experiments. Sorry, I severely underestimated that. I turned the bin size up (a72db6c) to avoid the issue, but we'll have to think of a new approach that prepares coverage binning a single time, as this won't work practically for feeding into CNVkit and QC to avoid re-calculating later.

I'm open for suggestions about how best to handle this. We could use some combination of defining target/anti-targets for depth (around genes, CNVkit-style) and a better on-disk representation of depth that is not a gigantic text file full of numbers.

Great to have goleft replacing a lot of bcbio code in the short term, and happy to work on better depth in the longer term.

@brentp (Contributor, Author) commented Oct 6, 2016

We could also see how many bins could be merged if the mean is > 50 and within some delta. And we could merge 0 coverage bins? I can probably look into this next week.
Or do those things not work with assumptions in downstream code?
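
A quick sketch of that kind of merge (illustrative only; min_mean and delta stand in for the "> 50" and "some delta" above):

def merge_bins(bins, min_mean=50, delta=0.1):
    # bins: sorted, contiguous (start, end, mean_depth) tuples for one chrom.
    # Merge adjacent bins that are both zero coverage, or both above
    # min_mean with means within a relative delta of each other.
    merged = [list(bins[0])]
    for start, end, mean in bins[1:]:
        pstart, pend, pmean = merged[-1]
        both_zero = pmean == 0 and mean == 0
        similar = (min(pmean, mean) > min_mean
                   and abs(pmean - mean) <= delta * max(pmean, mean))
        if start == pend and (both_zero or similar):
            # Length-weighted mean for the combined bin.
            tot = pmean * (pend - pstart) + mean * (end - start)
            merged[-1] = [pstart, end, tot / float(end - pstart)]
        else:
            merged.append([start, end, mean])
    return [tuple(b) for b in merged]

print(merge_bins([(0, 50, 60.0), (50, 100, 62.0), (100, 150, 0.0), (150, 200, 0.0)]))
# -> [(0, 100, 61.0), (100, 200, 0.0)]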

@etal (Contributor) commented Oct 6, 2016

CNVkit: For WGS, the placement of bins doesn't matter much, so a bin size of 500 or 1000 (i.e. the usual in bcbio) should be OK for creating a targetcoverage.cnn file, and would save a lot of time. For captures, the coverage calculation doesn't take an unbearable amount of time in my experience, while the placement of target bins matters a lot. So, for captures, it may be better to use the goleft coverages to generate only the antitargetcoverage.cnn file, and (re)calculate targetcoverage.cnn with CNVkit.

Merging similar-enough bins sounds great to me.

For quick storage: numpy.ndarray.tofile can write an array of numbers to a local file in compact binary form that is easily reloaded with numpy.fromfile. Not sure how this compares with pickle, or if saving space with gzip is worth the additional time spent. Or, since this file is coming from goleft initially -- write to HDF5, load with pandas?
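
For example, the numpy binary round-trip is just (file name illustrative):

import numpy as np

depths = np.array([0, 0, 3, 9, 15, 23], dtype=np.int32)

# Raw binary dump (no header); reload with the matching dtype.
depths.tofile("sample.depths.int32.bin")
reloaded = np.fromfile("sample.depths.int32.bin", dtype=np.int32)
assert (depths == reloaded).all()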

@bwubb commented Oct 12, 2016

Just a quick aside concerning samtools depth: the current development version is running an improper command, at least as far as I can tell.

[2016-10-12T15:16Z] depth: invalid option -- 'd'
[2016-10-12T15:16Z] depth: invalid option -- '-'
[2016-10-12T15:16Z] [E::hts_open] fail to open file '1000'
[2016-10-12T15:16Z] samtools: Could not open "1000": No such file or directory
[2016-10-12T15:16Z] /bin/bash: line 1:  4966 Segmentation fault      (core dumped) samtools depth -Q 1 -d 1000 -r 1:14026643-14126642 --reference /home/bwubb/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /home/bwubb/NGS/MelanomaTargeted/bcbio/data/work/1205-CRT-20Aplin/align/1205-CRT-20Aplin/1205-CRT-20Aplin-ready.bam

samtools depth --help does not show a -d option, at least for the version I am using, which is 1.1.

@chapmanb (Member) commented

Brad -- the samtools in bcbio is 1.3.1. Is it possible to update to use the one installed by bcbio in your PATH first? That should hopefully resolve the problem.

@brentp (Contributor, Author) commented Oct 24, 2016

I think it's safe to close this now.

brentp closed this as completed Oct 24, 2016
vladsavelyev pushed a commit to vladsavelyev/bcbio-nextgen that referenced this issue Oct 24, 2016
Replaces bedtools genomecov calculation. Provides a 6x speedup over
previous approach in initial benchmarking. Also allows piping directly
into groupby coverage, avoiding disk IO for the intermediate files.
Thanks to Brent Pedersen. Fixes bcbio#1549
@bwubb commented Oct 25, 2016

Ah, my apologies. I completely forgot to comment on its success. Thank you for the help; sorry it was a trivial issue.

