Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit 0e03033
Showing 10 changed files with 4,873 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
Package: SNVLFDR | ||
Title: Empirical Bayes Single Nucleotide Variant Calling | ||
Version: 1.0.1 | ||
Authors@R: c(person("Ali", "Karimnezhad", role = c("aut", "cre", "ctb"), | ||
email = "ali.karimnezhad@gmail.com")) | ||
Author: Ali Karimnezhad [aut, cre, ctb] | ||
Maintainer: Ali Karimnezhad <ali.karimnezhad@gmail.com> | ||
Description: Identifies single nucleotide variants in next-generation sequencing data by estimating their local false discovery rates. For more details, see Karimnezhad, A. and Perkins, T. J. (2024) <doi:10.1038/s41598-024-51958-z>. | ||
Encoding: UTF-8 | ||
License: GPL (>= 3) | ||
RoxygenNote: 7.2.3 | ||
NeedsCompilation: no | ||
Packaged: 2024-01-24 16:32:08 UTC; alikarimnezhad | ||
Repository: CRAN | ||
Date/Publication: 2024-01-25 13:30:02 UTC |
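The DESCRIPTION file above uses the Debian control file layout (`Field: value`), which base R can read with `read.dcf()`. The following is a minimal sketch, assuming only a few of the fields shown above are needed; the `desc_text` vector is an illustrative copy, not a file shipped with the package.

```r
# Parse DESCRIPTION-style (Debian control file) text with base R.
# desc_text is an illustrative excerpt of the fields shown above.
desc_text <- c(
  "Package: SNVLFDR",
  "Version: 1.0.1",
  "License: GPL (>= 3)",
  "NeedsCompilation: no"
)
con <- textConnection(desc_text)
desc <- read.dcf(con)
close(con)

# read.dcf() returns a character matrix with one column per field.
version <- package_version(desc[1, "Version"])
print(desc[1, "Package"])
print(version >= "1.0.0")
```

Converting the version string with `package_version()` allows version comparisons that respect the dotted numbering, rather than comparing plain strings.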
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
dada0eda7b34d8b9bd5be5685d9930b5 *DESCRIPTION | ||
158d41e1626c314614db731fd3b8d9c8 *NAMESPACE | ||
07fac9673c049d7b71d7d14550ee489b *R/Source.R | ||
aab1cbbcc25a49abb8025edcc369813e *README.md | ||
bf689758e67890bc5237d84e05bb7567 *inst/extdata/bam_input.csv | ||
0bac5d292ed444362d78b7c9b4a48f7c *inst/extdata/calls.vcf | ||
85766890945f0af61eaffa34ea7688cf *inst/extdata/regions.bed | ||
773dd98fef3d9691a2e883a524e96001 *man/get_LFDRs.Rd | ||
c6b47fd680db7d09127c6b851c72bb49 *man/get_LFDRs_given_caller.Rd |
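Each manifest entry above pairs a 32-character MD5 hash with a file path in `md5sum` format (`<hash> *<path>`). The following is a minimal sketch of how such entries could be checked in R with `tools::md5sum()`. The helper `verify_md5()` and the temporary directory are hypothetical and used for illustration only; they are not part of SNVLFDR.

```r
# Sketch: verify files against an MD5 manifest in "<hash> *<path>" format.
# verify_md5() is a hypothetical helper, not part of SNVLFDR.
verify_md5 <- function(manifest_lines, root = ".") {
  # Split each line into the 32-character hash and the path after " *".
  hashes <- substr(manifest_lines, 1, 32)
  paths  <- sub("^[0-9a-f]{32} \\*", "", manifest_lines)
  actual <- unname(tools::md5sum(file.path(root, paths)))
  # md5sum() returns NA for unreadable files; treat those as mismatches.
  ok <- !is.na(actual) & actual == hashes
  setNames(ok, paths)
}

# Self-contained demonstration on a temporary file.
root <- tempfile("md5demo")
dir.create(root)
writeLines("export(get_LFDRs)", file.path(root, "NAMESPACE"))
manifest <- paste0(unname(tools::md5sum(file.path(root, "NAMESPACE"))),
                   " *NAMESPACE")
result <- verify_md5(manifest, root)
print(result)
```

For an installed package, `tools::checkMD5sums()` performs a comparable check against the package's own MD5 file.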
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# Generated by roxygen2: do not edit by hand | ||
|
||
export(get_LFDRs) | ||
export(get_LFDRs_given_caller) | ||
import(stats) | ||
import(utils) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# SNVLFDR | ||
Empirical Bayes Single Nucleotide Variant Calling for Next Generation Sequencing Data | ||
|
||
### LFDR-based Variant Calling | ||
Identifies single nucleotide variants in next-generation sequencing data by estimating their local false discovery rates. For more details, see <https://doi.org/10.1038/s41598-024-51958-z>. | ||
|
||
|
||
## Installation | ||
|
||
The package can be installed from the GitHub repository: | ||
```r | ||
devtools::install_github("empiricalbayes/SNVLFDR") | ||
``` | ||
|
||
## Getting Started | ||
You can load SNVLFDR as follows: | ||
|
||
```r | ||
library(SNVLFDR) | ||
``` | ||
|
||
## How to run SNVLFDR to call variants on example data | ||
```r | ||
bam_input <- system.file("extdata", "bam_input.csv", package="SNVLFDR") | ||
bedfile <- system.file("extdata", "regions.bed", package="SNVLFDR") | ||
BQ.T=20 | ||
MQ.T=20 | ||
pi0.initial=0.95 | ||
AF.T=0.01 | ||
DP.T=10 | ||
LFDR.T=0.01 | ||
error=NULL | ||
method='empirical' | ||
epsilon=0.01 | ||
output=get_LFDRs(bam_input,bedfile,BQ.T,MQ.T,pi0.initial,AF.T,DP.T,LFDR.T,error,method,epsilon) | ||
|
||
#Estimated LFDRs | ||
output$estimated.LFDRs | ||
|
||
#Estimated proportion of non-mutant sites | ||
output$estimated.pi0 | ||
|
||
#Filtered Bam matrix that includes estimated LFDRs | ||
output$filtered.bam | ||
``` | ||
|
||
## How to run SNVLFDR to prioritize variants called by another variant caller | ||
```r | ||
bam_path <- system.file("extdata", "bam_input.csv", package="SNVLFDR") | ||
calls_path <- system.file("extdata", "calls.vcf", package="SNVLFDR") | ||
output=get_LFDRs_given_caller(bam_input=bam_path,calls=calls_path,LFDR.T=0.01,error=NULL) | ||
|
||
#Updated VCF file that includes estimated LFDRs | ||
output$updated.vcf | ||
|
||
|
||
``` | ||
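The `pi0.initial` and `LFDR.T` arguments used above relate to the local false discovery rate (LFDR), which under the standard two-group model is `pi0 * f0(x) / (pi0 * f0(x) + (1 - pi0) * f1(x))`. The sketch below illustrates that formula only. The binomial null and alternative, the fixed allele fraction `af`, and the helper `lfdr_binomial()` are assumptions made for illustration and are not SNVLFDR's estimator; the cited paper describes the actual method.

```r
# Two-group sketch: each site is non-mutant with probability pi0.
# Assumed for illustration only (not SNVLFDR's internals):
#   null:        alt reads ~ Binomial(depth, error)
#   alternative: alt reads ~ Binomial(depth, af)
lfdr_binomial <- function(alt, depth, pi0 = 0.95, error = 0.01, af = 0.3) {
  f0 <- dbinom(alt, depth, error)  # likelihood if the site is non-mutant
  f1 <- dbinom(alt, depth, af)     # likelihood if the site is mutant
  pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)
}

alt   <- c(1, 5, 30)
depth <- c(100, 100, 100)
lfdr  <- lfdr_binomial(alt, depth)

# Call a site a variant when its LFDR falls below the threshold.
called <- lfdr < 0.01
print(data.frame(alt, depth, lfdr = signif(lfdr, 3), called))
```

With these illustrative settings, sites with 1 or 5 alternate reads out of 100 are explained well by sequencing error and keep an LFDR near 1, while the site with 30 alternate reads falls below the 0.01 threshold.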
|
||
|
||
## References | ||
Karimnezhad, A. and Perkins, T. J. (2024). Empirical Bayes single nucleotide variant-calling for next-generation sequencing data. Scientific Reports 14, 1550. <https://doi.org/10.1038/s41598-024-51958-z> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
##fileformat=VCFv4.2 | ||
##FILTER=<ID=base_qual,Description="alt median base quality"> | ||
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor"> | ||
##FILTER=<ID=contamination,Description="contamination"> | ||
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates"> | ||
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length"> | ||
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic"> | ||
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype."> | ||
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold"> | ||
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality"> | ||
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD"> | ||
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio"> | ||
##FILTER=<ID=normal_artifact,Description="artifact_in_normal"> | ||
##FILTER=<ID=numt_chimera,Description="NuMT variant with too many ALT reads originally from autosome"> | ||
##FILTER=<ID=numt_novel,Description="Alt depth is below expected coverage of NuMT in autosome"> | ||
##FILTER=<ID=orientation,Description="orientation bias detected by the orientation bias mixture model"> | ||
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals"> | ||
##FILTER=<ID=position,Description="median distance of alt variants from end of reads"> | ||
##FILTER=<ID=slippage,Description="Site filtered due to contraction of short tandem repeat region"> | ||
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only"> | ||
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions"> | ||
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold"> | ||
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> | ||
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor"> | ||
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> | ||
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele"> | ||
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele"> | ||
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> | ||
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> | ||
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"> | ||
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"> | ||
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> | ||
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)"> | ||
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> | ||
##GATKCommandLine=<ID=FilterMutectCalls,CommandLine="FilterMutectCalls --output /global/online/ohri/ohri1/projects/gapp_ali/analysis/mutect2/TDNA1_filtered.vcf --variant /global/online/ohri/ohri1/projects/gapp_ali/analysis/mutect2/TDNA1_unfiltered.vcf --reference /global/online/ohri/ohri1/projects/gapp/rawdata/genomes/hg19_chrs.fa --threshold-strategy OPTIMAL_F_SCORE --f-score-beta 1.0 --false-discovery-rate 0.05 --initial-threshold 0.1 --mitochondria-mode false --max-events-in-region 2 --max-alt-allele-count 1 --unique-alt-read-count 0 --min-median-mapping-quality 30 --min-median-base-quality 20 --max-median-fragment-length-difference 10000 --min-median-read-position 1 --max-n-ratio Infinity --min-reads-per-strand 0 --autosomal-coverage 0.0 --max-numt-fraction 0.85 --min-allele-fraction 0.0 --contamination-estimate 0.0 --log-snv-prior -13.815510557964275 --log-indel-prior -16.11809565095832 --log-artifact-prior -2.302585092994046 --normal-p-value-threshold 0.001 --min-slippage-length 8 --pcr-slippage-rate 0.1 --distance-on-haplotype 100 --long-indel-length 5 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.1.2.0",Date="August 26, 2022 1:03:37 EDT PM"> | ||
##GATKCommandLine=<ID=Mutect2,CommandLine="Mutect2 --callable-depth 10 --minimum-allele-fraction 0.01 --output /global/online/ohri/ohri1/projects/gapp_ali/analysis/mutect2/TDNA1_unfiltered.vcf --intervals /global/online/ohri/ohri1/projects/gapp/rawdata/trusight170/TST170_DNA_target.bed --input /global/online/ohri/ohri1/projects/gapp_ali/analysis/bwa_rmdup/TDNA1_sorted_rmdups.bam --reference /global/online/ohri/ohri1/projects/gapp/rawdata/genomes/hg19_chrs.fa --f1r2-median-mq 50 --f1r2-min-bq 20 --f1r2-max-depth 200 --genotype-pon-sites false --genotype-germline-sites false --af-of-alleles-not-in-resource -1.0 --mitochondria-mode false --tumor-lod-to-emit 3.0 --initial-tumor-lod 2.0 --pcr-snv-qual 40 --pcr-indel-qual 40 --max-population-af 0.01 --downsampling-stride 1 --max-suspicious-reads-per-alignment-start 0 --normal-lod 2.2 --ignore-itr-artifacts false --gvcf-lod-band -2.5 --gvcf-lod-band -2.0 --gvcf-lod-band -1.5 --gvcf-lod-band -1.0 --gvcf-lod-band -0.5 --gvcf-lod-band 0.0 --gvcf-lod-band 0.5 --gvcf-lod-band 1.0 --genotype-filtered-alleles false --disable-adaptive-pruning false --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --max-unpruned-variants 100 --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 1 --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --max-read-length 2147483647 --min-read-length 30 --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version="4.1.2.0",Date="August 26, 2022 12:18:06 EDT PM"> | ||
##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination"> | ||
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> | ||
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype"> | ||
##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants"> | ||
##INFO=<ID=MBQ,Number=R,Type=Integer,Description="median base quality"> | ||
##INFO=<ID=MFRL,Number=R,Type=Integer,Description="median fragment length"> | ||
##INFO=<ID=MMQ,Number=R,Type=Integer,Description="median mapping quality"> | ||
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="median distance from end of read"> | ||
##INFO=<ID=NALOD,Number=A,Type=Float,Description="Negative log 10 odds of artifact in normal with same allele fraction as tumor"> | ||
##INFO=<ID=NCount,Number=1,Type=Integer,Description="Count of N bases in the pileup"> | ||
##INFO=<ID=NLOD,Number=A,Type=Float,Description="Normal log 10 likelihood ratio of diploid het or hom alt genotypes"> | ||
##INFO=<ID=OCM,Number=1,Type=Integer,Description="Number of alt reads whose original alignment doesn't match the current contig."> | ||
##INFO=<ID=PON,Number=0,Type=Flag,Description="site found in panel of normals"> | ||
##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative log 10 population allele frequencies of alt alleles"> | ||
##INFO=<ID=ROQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to read orientation artifact"> | ||
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)"> | ||
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)"> | ||
##INFO=<ID=SEQQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not sequencing errors"> | ||
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat"> | ||
##INFO=<ID=STRANDQ,Number=1,Type=Integer,Description="Phred-scaled quality of strand bias artifact"> | ||
##INFO=<ID=STRQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles in STRs are not polymerase slippage errors"> | ||
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing"> | ||
##INFO=<ID=UNIQ_ALT_READ_COUNT,Number=1,Type=Integer,Description="Number of ALT reads with unique start and mate end positions at a variant site"> | ||
##MutectVersion=2.2 | ||
##contig=<ID=chr1,length=249250621> | ||
##contig=<ID=chr2,length=243199373> | ||
##contig=<ID=chr3,length=198022430> | ||
##contig=<ID=chr4,length=191154276> | ||
##contig=<ID=chr5,length=180915260> | ||
##contig=<ID=chr6,length=171115067> | ||
##contig=<ID=chr7,length=159138663> | ||
##contig=<ID=chr8,length=146364022> | ||
##contig=<ID=chr9,length=141213431> | ||
##contig=<ID=chr10,length=135534747> | ||
##contig=<ID=chr11,length=135006516> | ||
##contig=<ID=chr12,length=133851895> | ||
##contig=<ID=chr13,length=115169878> | ||
##contig=<ID=chr14,length=107349540> | ||
##contig=<ID=chr15,length=102531392> | ||
##contig=<ID=chr16,length=90354753> | ||
##contig=<ID=chr17,length=81195210> | ||
##contig=<ID=chr18,length=78077248> | ||
##contig=<ID=chr19,length=59128983> | ||
##contig=<ID=chr20,length=63025520> | ||
##contig=<ID=chr21,length=48129895> | ||
##contig=<ID=chr22,length=51304566> | ||
##contig=<ID=chrX,length=155270560> | ||
##contig=<ID=chrY,length=59373566> | ||
##contig=<ID=chrM,length=16571> | ||
##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS. | ||
##source=FilterMutectCalls | ||
##source=Mutect2 | ||
##tumor_sample=SAMPLE | ||
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE | ||
chr1 11174415 . G A . PASS CONTQ=93;DP=1544;ECNT=1;GERMQ=72;MBQ=20,20;MFRL=127,133;MMQ=60,60;MPOS=25;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=1536.87 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:838,661:0.440:1499:366,297:465,356:409,429,299,362 | ||
chr1 11184544 . C T . PASS CONTQ=93;DP=1862;ECNT=1;GERMQ=93;MBQ=0,20;MFRL=0,137;MMQ=60,60;MPOS=24;POPAF=7.30;SEQQ=93;STRANDQ=93;TLOD=5303.00 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:0,1794:0.999:1794:0,876:0,888:0,0,954,840 |
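Each VCF record above pairs the colon-separated keys in the FORMAT column with the values in the SAMPLE column. The following is a minimal base R sketch of that pairing, applied to the chr1:11174415 record; the helper `parse_sample()` is hypothetical and is not how SNVLFDR reads `calls.vcf`.

```r
# Parse one VCF sample column against its FORMAT key list (base R only).
# parse_sample() is a hypothetical helper, not part of SNVLFDR.
parse_sample <- function(format, sample) {
  keys   <- strsplit(format, ":", fixed = TRUE)[[1]]
  values <- strsplit(sample, ":", fixed = TRUE)[[1]]
  setNames(as.list(values), keys)
}

# FORMAT and SAMPLE columns of the chr1:11174415 record shown above.
fields <- parse_sample(
  "GT:AD:AF:DP:F1R2:F2R1:SB",
  "0/1:838,661:0.440:1499:366,297:465,356:409,429,299,362"
)

# AD lists ref and alt depths; the raw alt fraction is alt / (ref + alt).
ad <- as.integer(strsplit(fields$AD, ",", fixed = TRUE)[[1]])
af_from_ad <- ad[2] / sum(ad)
print(round(af_from_ad, 3))
```

The raw ratio from AD (about 0.441) is close to, but not identical with, the reported AF of 0.440, so the AF field should not be assumed to be a simple read-count ratio.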