Commit

better document --snp stuff
brentp committed Oct 9, 2018
1 parent cc8930b commit 87b763f
Showing 4 changed files with 26 additions and 4 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -2,3 +2,7 @@
data/
*.bcf
src/duphold
*.bam*
*.cram*
syndip/
*.vcf*
23 changes: 20 additions & 3 deletions README.md
@@ -18,7 +18,7 @@ single sample with:
+ **DHBFC**: fold-change for the variant depth *relative to bins in the genome with similar GC-content*.
+ **DHD**: rapid change in depth at one of the break-points (1 for higher (DUP), 0 for no or conflicting changes, -1 for drop (DEL), 2 or -2 for both break points)

If a SNP/Indel VCF is given, `duphold` will annotate each DEL/DUP call with:
If a SNP/Indel VCF/BCF is given, `duphold` will annotate each DEL/DUP call with (see below for more detail on what it does):

+ **DHET**: counts of SNP heterozygotes in the SV supporting: [0] a normal heterozygote, [1] a triploid heterozygote.
for a DUP, we expect most hets to have an allele balance closer to 0.33 or 0.67 than to 0.5. A good heterozygous
@@ -29,6 +29,20 @@ If a SNP/Indel VCF is given, `duphold` will annotate each DEL/DUP call with:

It also adds **GCF** to the INFO field indicating the fraction of G or C bases in the variant.


## SNP/Indel annotation

**NOTE**: it is strongly recommended to use BCF for the `--snp` argument; otherwise VCF parsing will be a bottleneck.

+ A DEL call with many HETs is unlikely to be valid.
+ A DUP call that has many HETs that have a 0.5 allele balance is unlikely to be valid.

When the user specifies a `--snp` VCF, `duphold` finds the appropriate sample in that file and extracts high-quality (> 20), bi-allelic
SNP calls. For each chromosome, it stores a minimal (low-memory) representation in a sorted data structure for fast access. It then
queries this data structure for each SV and counts the number of heterozygotes supporting a diploid HET (allele balance close to 0.5)
or a triploid HET (allele balance close to 0.33 or 0.67) into `DHET`. It stores the number of Hom-Ref, Hom-Alt, and Unknown calls in
`DHHU`.
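
The counting described above can be sketched in Python. This is only an illustration, not `duphold`'s actual (Nim) implementation: the `Snp` record, the function names, the inclusive interval bounds, and the "nearest expected allele balance" rule used to decide between a diploid and a triploid HET are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Tuple


class Snp(NamedTuple):
    """Hypothetical minimal per-sample SNP record (not duphold's own type)."""
    pos: int     # position on the chromosome
    qual: float  # variant quality
    n_alt: int   # number of ALT alleles (1 means bi-allelic)
    gt: str      # "het", "hom_ref", "hom_alt" or "unknown"
    ab: float    # allele balance: alt reads / (ref + alt reads)


def build_index(snps: List[Snp]) -> Tuple[List[int], List[Snp]]:
    """Keep quality > 20, bi-allelic calls and sort them by position."""
    kept = sorted((s for s in snps if s.qual > 20 and s.n_alt == 1),
                  key=lambda s: s.pos)
    return [s.pos for s in kept], kept


def is_triploid_like(ab: float) -> bool:
    """Assumed rule: a het is triploid-like if its allele balance is
    strictly closer to 1/3 or 2/3 than to 1/2."""
    return min(abs(ab - 1 / 3), abs(ab - 2 / 3)) < abs(ab - 0.5)


def annotate(positions: List[int], kept: List[Snp], start: int, end: int):
    """Count calls inside [start, end] (inclusive bounds are an assumption).

    Returns (dhet, dhhu): dhet = [diploid hets, triploid hets],
    dhhu = [hom-ref, hom-alt, unknown].
    """
    lo = bisect_left(positions, start)   # first SNP with pos >= start
    hi = bisect_right(positions, end)    # one past last SNP with pos <= end
    dhet = [0, 0]
    dhhu = [0, 0, 0]
    for s in kept[lo:hi]:
        if s.gt == "het":
            dhet[int(is_triploid_like(s.ab))] += 1
        elif s.gt == "hom_ref":
            dhhu[0] += 1
        elif s.gt == "hom_alt":
            dhhu[1] += 1
        else:
            dhhu[2] += 1
    return dhet, dhhu


# Made-up calls for one sample on one chromosome.
snps = [
    Snp(100, 50, 1, "het", 0.48),      # diploid-like het
    Snp(150, 60, 1, "het", 0.35),      # triploid-like het
    Snp(200, 10, 1, "het", 0.50),      # dropped: low quality
    Snp(250, 45, 2, "het", 0.50),      # dropped: not bi-allelic
    Snp(300, 99, 1, "hom_alt", 1.00),
    Snp(320, 99, 1, "unknown", 0.00),
    Snp(900, 99, 1, "het", 0.50),      # outside the SV queried below
]
positions, kept = build_index(snps)
dhet, dhhu = annotate(positions, kept, 100, 400)
print(dhet, dhhu)
```

Because the positions are sorted, each SV lookup is a pair of binary searches plus a scan of only the SNPs inside the SV, which is one plausible reason to keep a sorted per-chromosome structure.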

## Performance

### Speed
@@ -48,10 +62,13 @@ coming soon.
## Usage

```
duphold -t 4 -v $svvcf -b $cram -f $fasta -o $output.bcf
duphold --threads 4 --vcf $svvcf --bam $cram --fasta $fasta --output $output.bcf
duphold -s $gatk_vcf -t 4 -v $svvcf -b $cram -f $fasta -o $output.bcf
duphold --snp $gatk_bcf --threads 4 --vcf $svvcf --bam $cram --fasta $fasta --output $output.bcf
```

`--snp` can be a multi-sample VCF/BCF. `duphold` will be much faster with a BCF, especially if
the snp/indel file contains many (>20 or so) samples.

The threads are decompression threads, so increasing them up to about 4 helps.

## Examples
1 change: 1 addition & 0 deletions nim.cfg
@@ -1 +1,2 @@
threads:on
--passC:"-flto"
2 changes: 1 addition & 1 deletion src/duphold.nim
@@ -665,7 +665,7 @@ Options:
-v --vcf <path> path to sorted SV VCF/BCF
-b --bam <path> path to indexed BAM/CRAM
-f --fasta <path> indexed fasta reference.
-s --snp <path> optional path to snp/indel VCF with which to annotate SVs.
-s --snp <path> optional path to snp/indel VCF/BCF with which to annotate SVs. BCF is highly recommended as it's much faster to parse.
-t --threads <int> number of decompression threads. [default: 4]
-o --output <string> output VCF/BCF (default is VCF to stdout) [default: -]
-d --drop drop all samples from a multi-sample --vcf *except* the sample in --bam. useful for parallelization by sample followed by merge.