# Parabricks **RNA-FQ2BAM**: GPU-accelerated RNA-seq, from FASTQ to BAM

---

RNA sequencing (RNA-seq) reshaped transcriptomics by replacing hybridization microarrays with sequencing-based, digital counting of transcripts. Landmark studies in 2008 established the approach in yeast and mammals, demonstrating that short-read sequencing could map and quantify entire transcriptomes with unprecedented dynamic range [1, 2].

As throughput climbed, specialized **spliced aligners** emerged. Early tools such as **TopHat** (2009) pioneered junction discovery, then **STAR** (2013) delivered order-of-magnitude speedups while improving alignment sensitivity and supporting chimeric/fusion reads—advances that became the backbone of modern RNA-seq pipelines and remain reflected in community best practices [3–6].

Meanwhile, the field’s scale exploded, from single samples to cohorts of thousands, driving adoption of **GPU acceleration** to keep costs and turnaround times manageable. Systematic benchmarks reported >200× runtime reductions and ~5–10× cost savings when genomics workloads moved to GPUs, catalyzing production-grade suites such as **NVIDIA Parabricks** [8, 9].

## Where **RNA-FQ2BAM** fits

**Parabricks RNA-FQ2BAM** is the GPU-accelerated entry point for short-read RNA-seq alignment. It takes paired-end **FASTQ** files and outputs a coordinate-sorted, duplicate-marked **BAM**—ready for variant and fusion analyses—while adhering to widely used CPU workflows. Under the hood it runs the splice-aware **STAR** aligner and mirrors GATK-style data-cleanup steps, enabling “same results, far faster” execution on NVIDIA GPUs [4, 6, 8].

**At a glance, RNA-FQ2BAM:**
- Aligns reads with **STAR** (splice-aware, chimeric-read friendly).  
- Performs **coordinate sorting** and **duplicate marking**.  
- Produces a standards-compliant **BAM** file (SAM/BAM spec by Li _et al._ 2009) [7].  
- Serves as a drop-in for downstream **GATK RNA-seq** short-variant workflows, or as feedstock for fusion callers like **STAR-Fusion** and **Arriba** [6, 10].

> **Why it matters:** Parabricks pipelines are built to match reference CPU tools while delivering dramatic speedups on GPUs—reducing time-to-answer and cloud costs without changing scientific outputs [8, 9].

## How we got here (a brief lineage)

1. **RNA-seq foundations (2008–2012).** First transcriptome-wide sequencing studies establish digital expression quantification and novel transcript discovery, displacing microarrays [1, 2].  
2. **Splice-aware mappers (2009–2016).** TopHat enables *de novo* junction finding; STAR and HISAT/HISAT2 push speed and memory efficiency, becoming defaults in many labs [3–5].  
3. **Community best practices.** The Broad’s **GATK RNA-seq** recommendations standardize mapping (two-pass STAR), data cleanup, and short-variant calling—shaping the canonical FQ→BAM→VCF flow [6].  
4. **Fusion detection maturation.** Chimeric-aware alignment enables accurate fusion callers; benchmarking highlights **STAR-Fusion** and **Arriba** among top performers for cancer transcriptomes [10].  
5. **GPU era and Parabricks.** As datasets scale, Parabricks ports core steps (including STAR-based RNA-seq alignment) to GPUs, preserving CPU-tool parity while shrinking runtimes from hours to minutes [8, 9].

## What you’ll do in this notebook

- **Download and preprocess datasets**: fetch the sample reads and reference files, then build the STAR genome index and trim the reads so RNA-FQ2BAM can run.  
- **Run RNA-FQ2BAM** on example FASTQs to obtain a sorted, duplicate-marked **BAM**.  
- **Inspect alignment quality** and the metadata recorded in the BAM, as defined by the **SAM/BAM** specification [7].  
- **Optionally branch**:  
  - feed BAMs into GATK’s **RNA-seq short-variant** workflow, or  
  - export chimeric evidence to fusion callers like **STAR-Fusion** or **Arriba** [6, 10].
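
As a small illustration of the inspection step above, the sketch below decodes the FLAG field and the `@HD` sort-order tag defined in the SAM/BAM specification [7]. It works on plain text and integers rather than on a real BAM, so it does not depend on any BAM library; the label strings (`"qc_fail"`, `"read1"`, and so on), the helper names, and the example header and flag value are illustrative assumptions, not output from RNA-FQ2BAM.

```python
# SAM FLAG bits as defined in the SAM/BAM specification [7].
# The label strings are informal names chosen for this sketch.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def sort_order(header_text):
    """Return the SO value from the @HD header line, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Hypothetical header and flag, for illustration only.
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422"
print(sort_order(example_header))  # sort order declared in the header
print(decode_flag(1107))           # 1107 = 0x400 + 0x40 + 0x10 + 0x2 + 0x1
```

A coordinate-sorted, duplicate-marked BAM would be expected to declare `SO:coordinate` in its header and to set the `0x400` bit on reads flagged as duplicates.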

---

## References & Resources (selected)

1. Nagalakshmi U. _et al._ (2008) **The transcriptional landscape of the yeast genome defined by RNA-Seq**. *Science*.  
2. Mortazavi A. _et al._ (2008) **Mapping and quantifying mammalian transcriptomes by RNA-Seq**. *Nat Methods*.  
3. Trapnell C. _et al._ (2009) **TopHat**: discovering splice junctions with RNA-Seq. *Bioinformatics*.  
4. Dobin A. _et al._ (2013) **STAR**: ultrafast universal RNA-seq aligner. *Bioinformatics*. GitHub: [alexdobin/STAR].  
5. Kim D. _et al._ (2015) **HISAT**: fast spliced aligner with low memory requirements. *Nat Methods*.  
6. **GATK RNA-seq best practices** (Broad Institute).  
7. Li H. _et al._ (2009) **The Sequence Alignment/Map format and SAMtools**. *Bioinformatics*. Spec: [hts-specs].  
8. **NVIDIA Parabricks** overview & pipelines; **RNA-FQ2BAM** docs.  
9. Taylor-Weiner A. _et al._ (2019) **Scaling computational genomics to millions of individuals with GPUs**. *Genome Biology*.  
10. Fusion callers: **STAR-Fusion** (Haas _et al._), **Arriba** (Uhrig _et al._). GitHub: [STAR-Fusion], [Arriba].

[alexdobin/STAR]: https://github.com/alexdobin/STAR  
[hts-specs]: https://github.com/samtools/hts-specs  
[STAR-Fusion]: https://github.com/STAR-Fusion/STAR-Fusion  
[Arriba]: https://github.com/suhrig/arriba

## Download and preprocess datasets

### Download raw reads and reference files

In [None]:
# Create directories for files (-p creates parents and does not fail if they already exist)
!mkdir -p ../datasets/rna_seq
# Fasta
fasta_url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.fna.gz"
fasta_gz = "../datasets/rna_seq/GCF_000001405.26_GRCh38_genomic.fna.gz"
!wget -nc -P ../datasets/rna_seq/ $fasta_url
# Decompress the FASTA; the leading $ makes IPython substitute the Python variable
!gzip -d $fasta_gz
# GTF File
gtf_url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GCF_000001405.26_GRCh38_genomic.gtf.gz"
gtf_gz = "../datasets/rna_seq/GCF_000001405.26_GRCh38_genomic.gtf.gz"
!wget -nc -P ../datasets/rna_seq/ $gtf_url
# Decompress the GTF; the leading $ makes IPython substitute the Python variable
!gzip -d $gtf_gz
# FastQ files
fastq1_url = (
    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR223/074/SRR22331474/SRR22331474_1.fastq.gz"
)
fastq2_url = (
    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR223/074/SRR22331474/SRR22331474_2.fastq.gz"
)
!wget -nc -P ../datasets/rna_seq/ $fastq1_url
!wget -nc -P ../datasets/rna_seq/ $fastq2_url
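
Before spending time on indexing and alignment, it can help to confirm that the two FASTQ downloads are intact and still paired. The following is a minimal sketch, assuming standard four-line FASTQ records; `count_fastq_records` and `check_pair` are hypothetical helper names introduced here, not part of Parabricks or any other tool used in this notebook.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_pair(fastq1, fastq2):
    """Return the shared record count, or raise if the mate files disagree."""
    n1 = count_fastq_records(fastq1)
    n2 = count_fastq_records(fastq2)
    if n1 != n2:
        raise ValueError(f"mate files differ: {n1} vs {n2} records")
    return n1
```

It could be called as `check_pair("../datasets/rna_seq/SRR22331474_1.fastq.gz", "../datasets/rna_seq/SRR22331474_2.fastq.gz")`. Because each file is read to the end, a truncated download should surface as a decompression error, at the cost of a full pass over both files.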

### Preprocess fasta file

In [None]:
fasta_file = "../datasets/rna_seq/GCF_000001405.26_GRCh38_genomic.fna"
gtf_file = "../datasets/rna_seq/GCF_000001405.26_GRCh38_genomic.gtf"
genome_dir = "../datasets/rna_seq/"
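# Build the STAR genome index from the FASTA and GTF (runMode genomeGenerate).
# Note: --genomeSAindexNbases defaults to 14 in STAR; smaller values use less memory but slow down searches.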
!/opt/STAR-2.7.1a/bin/Linux_x86_64/STAR \
    --runThreadN 32 \
    --runMode genomeGenerate \
    --genomeSAindexNbases 10 \
    --genomeDir $genome_dir \
    --genomeFastaFiles $fasta_file \
    --sjdbGTFfile $gtf_file \
    --sjdbOverhang 100
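
Two of the index parameters above follow simple rules of thumb from the STAR manual: `--sjdbOverhang` is ideally the maximum read length minus one, and for small genomes `--genomeSAindexNbases` should be scaled down to `min(14, log2(GenomeLength)/2 - 1)`. The sketch below just evaluates those two expressions; the function names are hypothetical, and rounding to the nearest integer is an assumption chosen to agree with the manual's examples (7 for a 100 kb genome, 9 for a 1 Mb genome).

```python
import math


def suggested_sa_index_nbases(genome_length):
    """Evaluate min(14, log2(GenomeLength)/2 - 1), rounded to the nearest integer."""
    return min(14, round(math.log2(genome_length) / 2 - 1))


def suggested_sjdb_overhang(max_read_length):
    """Maximum read length minus one, the value the STAR manual calls ideal."""
    return max_read_length - 1


# Approximate genome sizes, for illustration only.
for label, length in [("100 kb", 100_000), ("1 Mb", 1_000_000), ("~3.1 Gb", 3_100_000_000)]:
    print(label, suggested_sa_index_nbases(length))
print("overhang for 101-base reads:", suggested_sjdb_overhang(101))
```

For a human-sized genome the expression saturates at 14, which is also STAR's default, whereas the cell above passes 10. Whether 10 was chosen deliberately (for example to limit memory) is not stated here, so treat the computed value as a point of comparison rather than a correction.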

### Preprocess fastq files

In [None]:
fastq1_file = "../datasets/rna_seq/SRR22331474_1.fastq.gz"
fastq2_file = "../datasets/rna_seq/SRR22331474_2.fastq.gz"
trimmed_fastq1_file = "../datasets/rna_seq/trimmed_SRR22331474_1.fastq.gz"
trimmed_fastq2_file = "../datasets/rna_seq/trimmed_SRR22331474_2.fastq.gz"
# Trim and quality-filter the paired FASTQ files with fastp (-w sets the worker thread count)
!/opt/fastp \
    -i $fastq1_file -I $fastq2_file \
    -o $trimmed_fastq1_file -O $trimmed_fastq2_file \
    -w 16
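
By default fastp also writes a JSON report (`fastp.json`) to the working directory, which gives a quick view of how many reads survived trimming. The sketch below assumes the report contains `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads`, which matches the report layout as I understand it; verify against your own `fastp.json`. The helper names and the example numbers are hypothetical.

```python
import json


def load_fastp_report(path="fastp.json"):
    """Load a fastp JSON report from disk."""
    with open(path) as handle:
        return json.load(handle)


def summarize_fastp(report):
    """Return read counts before and after filtering and the retained fraction."""
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    return {
        "reads_before": before,
        "reads_after": after,
        "retained_fraction": after / before if before else 0.0,
    }


# Made-up numbers, shaped like the assumed report layout.
example_report = {
    "summary": {
        "before_filtering": {"total_reads": 1000},
        "after_filtering": {"total_reads": 950},
    }
}
print(summarize_fastp(example_report))
```

On a real run, `summarize_fastp(load_fastp_report())` would read the report that fastp left next to the notebook. An unexpectedly low retained fraction is worth investigating before alignment.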

## Run RNA-FQ2BAM 

In [None]:
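# --out-bam must name a BAM file, not a directory; the file name SRR22331474.bam is an illustrative choice.
!pbrun rna_fq2bam \
    --in-fq $trimmed_fastq1_file $trimmed_fastq2_file \
    --read-files-command zcat \
    --genome-lib-dir ../datasets/rna_seq \
    --output-dir ../datasets/rna_seq_homo/ \
    --out-bam ../datasets/rna_seq_homo/SRR22331474.bam \
    --ref $fasta_file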
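
When the same run is repeated across samples, it can be convenient to assemble the command in Python and check the inputs first. This is a minimal sketch that uses only the flags shown in the cell above; `build_rna_fq2bam_cmd`, `missing_inputs`, and the `bam_name` argument are hypothetical names introduced for illustration, and the sketch only prints the command rather than launching it.

```python
import os
import shlex


def build_rna_fq2bam_cmd(fastq1, fastq2, genome_lib_dir, ref, output_dir, bam_name):
    """Assemble the pbrun argument list; the BAM is placed inside output_dir."""
    return [
        "pbrun", "rna_fq2bam",
        "--in-fq", fastq1, fastq2,
        "--read-files-command", "zcat",
        "--genome-lib-dir", genome_lib_dir,
        "--output-dir", output_dir,
        "--out-bam", os.path.join(output_dir, bam_name),
        "--ref", ref,
    ]


def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


cmd = build_rna_fq2bam_cmd(
    "../datasets/rna_seq/trimmed_SRR22331474_1.fastq.gz",
    "../datasets/rna_seq/trimmed_SRR22331474_2.fastq.gz",
    "../datasets/rna_seq",
    "../datasets/rna_seq/GCF_000001405.26_GRCh38_genomic.fna",
    "../datasets/rna_seq_homo",
    "SRR22331474.bam",  # hypothetical output file name
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the job; calling `missing_inputs` on the FASTQ and reference paths beforehand gives a clearer error than a failure partway into a GPU run.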