# Parabricks FQ2BAM, HaplotypeCaller, and dbSNP: GPU-accelerated DNA WGS from reads to annotated variants

---

High-throughput DNA sequencing (NGS) transformed genomics from single-genome moonshots to population-scale studies. The first commercial “next-gen” platform (454 pyrosequencing, 2005) demonstrated massively parallel reads and precipitous cost declines, enabling efforts like the 1000 Genomes Project to chart common human variation and set de-facto file/analysis standards still used today.<sup>[1],[2]</sup> To move raw reads into analysis, the community coalesced on the **SAM/BAM** formats and a CPU toolchain centered on **BWA-MEM** for alignment and the **GATK Best Practices** for variant discovery—an architecture that remains the reference baseline for accuracy and interoperability.<sup>[3],[4],[5]</sup>  

As cohorts grew from dozens to thousands of genomes, turnaround time and cost became bottlenecks. Published benchmarks report that moving genomics workloads to **GPUs** can cut runtime and cost by an order of magnitude or more while producing results equivalent to CPU baselines. This paved the way for production suites like **NVIDIA Parabricks**, which mirror gold-standard CPU pipelines while exploiting GPU parallelism.<sup>[12],[13]</sup>

This notebook introduces three Parabricks building blocks that together take you from raw reads to annotated variants (**FASTQ ➜ BAM ➜ VCF ➜ annotated VCF**) for whole-genome sequencing (WGS):

---

## 1) **Parabricks FQ2BAM** — GPU-accelerated read mapping & pre-processing

**What it does.** FQ2BAM ingests raw FASTQs, aligns them with a GPU-accelerated **BWA-MEM**, and emits a coordinate-sorted, duplicate-marked BAM. When known-sites files are supplied, it can also write a Base Quality Score Recalibration (BQSR) report for downstream callers. Outputs follow GATK conventions, so they feed directly into downstream variant callers. In short: the canonical CPU steps, run on GPUs.<sup>[8]</sup>

**Why this step matters.** Aligners place reads onto the reference; SAM/BAM define how those alignments are stored; duplicate marking (Picard/MarkDuplicates-compatible) and BQSR standardize error profiles for robust variant calling. BWA-MEM and SAM/BAM are the community workhorses; Parabricks accelerates them while preserving parity with the baseline implementations.<sup>[3],[4],[5],[7],[14],[15],[16a],[16b]</sup>
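To make the duplicate-marking step concrete: in SAM/BAM, duplicate status is stored as one bit of each record's FLAG field. The sketch below decodes a FLAG integer using the bit values from the SAM v1 specification. The function names are illustrative and not part of any Parabricks or samtools API.

```python
# Bit meanings follow the FLAG field definition in the SAM v1 specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_sam_flag(flag: int) -> list[str]:
    """Return the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit (0x400) is set, as written by duplicate marking."""
    return bool(flag & 0x400)
```

For example, `decode_sam_flag(99)` describes a properly paired first-in-pair read whose mate is on the reverse strand; adding 1024 to that value marks the same read as a duplicate.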

---

## 2) **Parabricks HaplotypeCaller** — GPU-accelerated germline variant calling

**What it does.** A GPU implementation of the **GATK HaplotypeCaller** workflow: it performs local de-novo assembly within “active regions” to jointly model SNPs and indels, producing high-quality gVCFs/VCFs consistent with GATK Best Practices—at a fraction of the CPU runtime.<sup>[9]</sup>

**Where it comes from.** HaplotypeCaller emerged from the GATK framework that unified mapping, recalibration, and assembly-based variant discovery; the Best Practices continue to be stewarded and updated by the Broad Institute community (see also the 2020 *Genomics in the Cloud* reference). Parabricks maintains algorithmic equivalence while accelerating execution on GPUs.<sup>[5],[6],[9],[13]</sup>

---

## 3) **Parabricks dbSNP** — GPU-accelerated variant annotation

**What it does.** Annotates your VCF against **dbSNP**, the long-running NCBI archive of known variants. The result is an annotated VCF with rsIDs and related metadata that downstream tools and databases recognize, produced quickly on GPUs.<sup>[10],[11]</sup>

---

## What you’ll do in this notebook

1. **Run FQ2BAM** on example FASTQs to generate a sorted, duplicate-marked **BAM**.  
2. **Call variants** with **HaplotypeCaller** to produce **gVCF/VCF**.  
3. **Annotate variants** against **dbSNP** to add rsIDs and related fields.  
4. (Optional) Compare GPU/CPU wall-time to quantify speedups and cost impacts on your hardware/cloud.  

> Parabricks pipelines are designed to be **drop-in replacements** for the standard CPU workflows—same algorithms and outputs, dramatically faster on NVIDIA GPUs.<sup>[8],[12],[13]</sup>
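For the optional wall-time comparison in step 4, a small timing helper is usually enough. This is a minimal sketch: `timed_run` and `speedup` are hypothetical helper names, and the CPU baseline command is whatever equivalent pipeline you choose to run on your own hardware.

```python
import shlex
import subprocess
import time


def timed_run(command: str) -> float:
    """Run a command given as a single string and return elapsed wall-clock seconds.

    Raises subprocess.CalledProcessError if the command exits with a non-zero status.
    """
    start = time.perf_counter()
    subprocess.run(shlex.split(command), check=True)
    return time.perf_counter() - start


def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Ratio of CPU to GPU wall time; values above 1 mean the GPU run was faster."""
    return cpu_seconds / gpu_seconds
```

Time each pipeline once with `timed_run`, then pass the two durations to `speedup`. Wall time includes disk I/O, so run both variants against the same storage for a fair comparison.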

---

## References

1. Margulies M. *et al.* **[Genome sequencing in microfabricated high-density picolitre reactors][1]**. *Nature* (2005).  
2. 1000 Genomes Project Consortium. **[A map of human genome variation from population-scale sequencing][2]**. *Nature* (2010).  
3. Li H. *et al.* **[The Sequence Alignment/Map (SAM) format and SAMtools][3]**. *Bioinformatics* (2009).  
4. Li H. **[Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM][4]**. *arXiv* (2013).  
5. Van der Auwera GA *et al.* **[From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline][5]**. *Curr Protoc Bioinformatics* (2013).  
6. **[HaplotypeCaller (overview & algorithm notes)][6]**. Broad Institute GATK Docs.  
7. **Picard MarkDuplicates** — [documentation][7], [repository][7a]. Broad Institute.  
8. **[Parabricks FQ2BAM Tutorial][8]**. NVIDIA Docs.  
9. **[Parabricks HaplotypeCaller][9]**. NVIDIA Docs.  
10. **[Parabricks dbSNP annotator][10]**. NVIDIA Docs.  
11. Sherry ST *et al.* **[dbSNP: the NCBI database of genetic variation][11]**. *Nucleic Acids Res* (2001).  
12. Taylor-Weiner A *et al.* **[Scaling computational genomics to millions of individuals with GPUs][12]**. *Genome Biology* (2019).  
13. **Parabricks Documentation** — [overview][13], [output accuracy & CPU parity/benchmarks][13b]. NVIDIA Docs & Guides.  
14. **[BWA GitHub repository][14]**.  
15. **[HTS-specs (SAM/BAM/VCF) GitHub repository][15]**.  
16. **Official specifications (PDFs):** [SAM v1][16a]; [VCF v4.3][16b].

<!-- Link definitions -->
[1]: https://www.nature.com/articles/nature03959
[2]: https://www.nature.com/articles/nature09534
[3]: https://pmc.ncbi.nlm.nih.gov/articles/PMC2723002/
[4]: https://arxiv.org/abs/1303.3997
[5]: https://pmc.ncbi.nlm.nih.gov/articles/PMC4243306/
[6]: https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller
[7]: https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard
[7a]: https://github.com/broadinstitute/picard
[8]: https://docs.nvidia.com/clara/parabricks/latest/Tutorials/FQ2BAM_Tutorial.html
[9]: https://docs.nvidia.com/clara/parabricks/latest/Documentation/ToolDocs/man_haplotypecaller.html
[10]: https://docs.nvidia.com/clara/parabricks/latest/Documentation/ToolDocs/man_dbsnp.html
[11]: https://academic.oup.com/nar/article/29/1/308/1116004
[12]: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1836-7
[13]: https://docs.nvidia.com/clara/parabricks/4.5.1/Overview.html
[13b]: https://docs.nvidia.com/clara/parabricks/latest/Documentation/ToolDocs/OutputAccuracyAndCompatibleCpuSoftwareVersions.html
[14]: https://github.com/lh3/bwa
[15]: https://github.com/samtools/hts-specs
[16a]: https://samtools.github.io/hts-specs/SAMv1.pdf
[16b]: https://samtools.github.io/hts-specs/VCFv4.3.pdf

## Download and preprocess datasets
Download the reference FASTA, the paired-end FASTQ files, and the dbSNP VCF together with its tabix index and MD5 checksums.

In [None]:
base = "../datasets"
study_dir = f"{base}/wgs"
# Create directories for files (-p creates parents and does not fail if they already exist)
!mkdir -p $study_dir
# Reference FASTA (GRCh38)
fasta_url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.fna.gz"
fasta_gz = f"{study_dir}/GCF_000001405.26_GRCh38_genomic.fna.gz"
!wget -nc -P $study_dir $fasta_url
# Decompress in place; this produces the .fna file that is indexed later
!gzip -d $fasta_gz
# FastQ files
fastq1_url = (
    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR133/022/ERR13301022/ERR13301022_1.fastq.gz"
)
fastq2_url = (
    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR133/022/ERR13301022/ERR13301022_2.fastq.gz"
)
!wget -nc -P $study_dir $fastq1_url
!wget -nc -P $study_dir $fastq2_url
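# Known sites (dbSNP VCF, tabix index, and MD5 checksums)
# NOTE: GCF_000001405.25 is the RefSeq accession for GRCh37.p13, while the reference
# FASTA above is GRCh38 (GCF_000001405.26). dbSNP also publishes a GRCh38 build
# (GCF_000001405.40.gz) in the same directory; make sure the known-sites build
# matches your reference before running BQSR or annotation.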
vcf_url = "https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz"
vcf_md5_url = "https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.md5"
vcf_tbi_url = "https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi"
vcf_tbi_md5_url = "https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz.tbi.md5"
knowns_sites = f"{study_dir}/GCF_000001405.25.vcf.gz"
vcf_md5_name = f"{study_dir}/GCF_000001405.25.vcf.gz.md5"
vcf_tbi_name =  f"{study_dir}/GCF_000001405.25.vcf.gz.tbi"
vcf_tbi_md5_name = f"{study_dir}/GCF_000001405.25.vcf.gz.tbi.md5"
!wget -nc -O $knowns_sites $vcf_url
!wget -nc -O $vcf_md5_name $vcf_md5_url
!wget -nc -O $vcf_tbi_name $vcf_tbi_url
!wget -nc -O $vcf_tbi_md5_name $vcf_tbi_md5_url
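The cell above downloads `.md5` files but never checks them. A minimal sketch of that check follows, assuming each `.md5` file starts with the hexadecimal digest as its first whitespace-separated token (the usual `md5sum` layout). The helper names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the one stored in a checksum file.

    Assumption: the first whitespace-separated token in md5_path is the hex digest.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```

With the variables defined above, `matches_md5_file(knowns_sites, vcf_md5_name)` and `matches_md5_file(vcf_tbi_name, vcf_tbi_md5_name)` should both return `True` for intact downloads.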

Index the reference FASTA with BWA. This step is slow for a human genome and only needs to run once per reference.

In [None]:
fasta_file = "../datasets/wgs/GCF_000001405.26_GRCh38_genomic.fna"
# Index fasta with BWA
!/opt/bwa/bwa index $fasta_file
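Because indexing is expensive, it can help to check whether the index already exists before rerunning the cell. The sketch below looks for the five files that `bwa index` writes next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`); the helper name is hypothetical.

```python
from pathlib import Path

# File suffixes that `bwa index` appends to the reference FASTA path.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(fasta_path):
    """Return the expected BWA index files that are not present on disk."""
    fasta_path = str(fasta_path)
    return [
        fasta_path + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(fasta_path + suffix).exists()
    ]
```

An empty list from `missing_bwa_index_files(fasta_file)` means the index is complete and the indexing cell can be skipped.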

Trim the FASTQ files with fastp to remove low-quality reads.

In [None]:
fastq1_file = "../datasets/wgs/ERR13301022_1.fastq.gz"
fastq2_file = "../datasets/wgs/ERR13301022_2.fastq.gz"
trimmed_fastq1_file = "../datasets/wgs/trimmed_ERR13301022_1.fastq.gz"
trimmed_fastq2_file = "../datasets/wgs/trimmed_ERR13301022_2.fastq.gz"
# Trim and quality-filter the paired FASTQ files with fastp (-w sets the worker thread count)
!/opt/fastp \
    -i $fastq1_file -I $fastq2_file \
    -o $trimmed_fastq1_file -O $trimmed_fastq2_file \
    -w 16
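To see how much data trimming removed, you can count reads before and after. This is a rough sketch that assumes standard four-line FASTQ records and reads each file in full, so it will take a while on WGS-sized inputs. The function names are illustrative.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file (plain or gzip-compressed).

    Assumption: every record spans exactly four lines.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def fraction_retained(raw_reads, trimmed_reads):
    """Share of reads that survived trimming (0.0 if the input was empty)."""
    return trimmed_reads / raw_reads if raw_reads else 0.0
```

For instance, `fraction_retained(count_fastq_reads(fastq1_file), count_fastq_reads(trimmed_fastq1_file))` gives the share of forward reads kept by fastp.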

## Align reads to reference genome

In [None]:
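# Outputs: the BQSR recalibration report and the sorted, duplicate-marked BAM
bqsr_file = f"{study_dir}/recal_file.txt"
bam_file = f"{study_dir}/study.bam"
# Known-sites VCF downloaded earlier; fq2bam uses it to build the BQSR report
knowns_sites = f"{study_dir}/GCF_000001405.25.vcf.gz"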

!pbrun fq2bam \
    --in-fq $trimmed_fastq1_file $trimmed_fastq2_file \
    --knownSites $knowns_sites \
    --out-bam $bam_file \
    --ref $fasta_file \
    --out-recal-file $bqsr_file
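BQSR matches known sites to alignments by contig name and position, so the known-sites VCF and the reference FASTA must use the same assembly and naming. RefSeq accessions differ between assemblies (for example, chromosome 1 is `NC_000001.10` in GRCh37 and `NC_000001.11` in GRCh38). The sketch below compares contig names from both files. It is a plain-Python check with hypothetical helper names, not a Parabricks feature. It scans the whole FASTA for header lines, which is slow on a full genome, and samples only the first records of the VCF.

```python
import gzip


def _open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def fasta_contigs(fasta_path):
    """Collect contig names: the first token after '>' on each header line."""
    names = set()
    with _open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                names.add(line[1:].split()[0])
    return names


def vcf_contigs(vcf_path, max_records=100_000):
    """Collect CHROM values from the first max_records data lines of a VCF."""
    names = set()
    seen = 0
    with _open_text(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            names.add(line.split("\t", 1)[0])
            seen += 1
            if seen >= max_records:
                break
    return names


def shared_contigs(fasta_path, vcf_path):
    """Contig names present in both files; an empty set signals a build mismatch."""
    return fasta_contigs(fasta_path) & vcf_contigs(vcf_path)
```

If `shared_contigs(fasta_file, knowns_sites)` comes back empty, the two files almost certainly describe different assemblies or use different naming schemes.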

## Variant calling

In [None]:
variant_file = f"{study_dir}/ERR13301022.vcf"

!pbrun haplotypecaller \
    --ref $fasta_file \
    --in-bam $bam_file \
    --in-recal-file $bqsr_file \
    --out-variants $variant_file
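A quick sanity check on the caller's output is to tally SNPs and indels. The sketch below classifies records by comparing REF and ALT allele lengths in an uncompressed VCF. It is a simplified classification with illustrative function names: symbolic alleles and equal-length multi-base substitutions are counted as `other`, and multi-allelic sites with different allele types as `mixed`.

```python
from collections import Counter


def classify_record(ref, alt):
    """Classify one VCF record from its REF and (comma-separated) ALT fields."""
    kinds = set()
    for allele in alt.split(","):
        if allele in (".", "*") or allele.startswith("<"):
            kinds.add("other")  # missing, spanning-deletion, or symbolic allele
        elif len(ref) == 1 and len(allele) == 1:
            kinds.add("snp")
        elif len(ref) != len(allele):
            kinds.add("indel")
        else:
            kinds.add("other")  # equal-length multi-base substitution
    return kinds.pop() if len(kinds) == 1 else "mixed"


def count_variant_types(lines):
    """Tally variant classes over an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[classify_record(fields[3], fields[4])] += 1
    return counts
```

Usage in the notebook would look like `with open(variant_file) as handle: print(count_variant_types(handle))`.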

## Annotate the VCF file

In [None]:
annotated_variant_file = f"{study_dir}/ERR13301022_annotated.vcf"

!pbrun dbsnp \
    --in-vcf $variant_file \
    --out-vcf $annotated_variant_file \
    --in-dbsnp-file $knowns_sites
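To check that annotation had an effect, you can measure how many records carry an rsID. This sketch assumes the annotator writes rsIDs into the VCF `ID` column (the third field), possibly as a semicolon-separated list; confirm that against your own output before relying on it. The function name is illustrative.

```python
def rsid_fraction(lines):
    """Fraction of VCF data records whose ID column contains at least one rsID.

    Assumption: rsIDs appear in the ID column and start with the prefix 'rs'.
    """
    total = annotated = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        total += 1
        variant_id = line.split("\t", 3)[2]
        if any(part.startswith("rs") for part in variant_id.split(";")):
            annotated += 1
    return annotated / total if total else 0.0
```

Usage would look like `with open(annotated_variant_file) as handle: print(rsid_fraction(handle))`. A value near zero on human WGS data suggests the dbSNP file and the reference use different builds or contig names.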