# From FASTQ to **Personalized FASTA** for Genome Language Models  
**fastp ➜ Parabricks FQ2BAM (GPU WGS) ➜ Parabricks HaplotypeCaller (GPU) ➜ bcftools filter & consensus**

---

Modern genome language models (GLMs) benefit from **subject-specific sequences** rather than a generic reference. This notebook introduces a streamlined, GPU-accelerated pipeline that turns raw short-read WGS FASTQs into a **consensus FASTA** suitable for GLM pretraining or fine-tuning—while following community best practices.

Each section below covers the historical context and the “why” of that step.

---

## 1) Read preprocessing with **fastp**

Early NGS workflows chained multiple tools for QC, adapter removal, and quality trimming (e.g., separate programs for each task), incurring I/O overheads and slower turnarounds. **fastp** unified these operations into a single, ultra-fast, multi-threaded preprocessor with integrated QC reports, adapter/quality trimming, poly-G tail handling for two-color chemistries, and UMI support—all in one binary ([fastp paper][fastp-paper], [fastp GitHub][fastp-github]).

**Typical command (example):**
```bash
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o sample.trim.R1.fq.gz -O sample.trim.R2.fq.gz \
      --detect_adapter_for_pe --cut_right --cut_mean_quality 20 \
      --length_required 50 --thread 16 --html fastp.html --json fastp.json
```
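
The JSON report written by `--json` can be inspected programmatically before committing GPU time to alignment. The sketch below assumes the report follows the `summary.before_filtering` / `summary.after_filtering` layout with `total_reads` and `q30_rate` fields; the helper name `summarize_fastp` is hypothetical and not part of fastp.

```python
import json


def summarize_fastp(report_path: str) -> dict:
    """Return read retention and post-filter Q30 rate from a fastp JSON report."""
    with open(report_path) as fh:
        summary = json.load(fh)["summary"]
    before = summary["before_filtering"]
    after = summary["after_filtering"]
    # Guard against an empty input file so the ratio never divides by zero.
    retained = (
        after["total_reads"] / before["total_reads"]
        if before["total_reads"]
        else 0.0
    )
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "fraction_retained": retained,
        "q30_rate_after": after["q30_rate"],
    }
```

A call such as `summarize_fastp("fastp.json")` could then gate the next step, for example by flagging samples whose retained fraction falls below a threshold chosen for the dataset.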

---

## 2) Mapping & BAM production with **Parabricks FQ2BAM** (GPU WGS)

Short-read WGS mapping crystallized around **BWA-MEM** (2013), which balances speed and accuracy across read lengths and supports paired-end and chimeric alignment ([BWA-MEM][bwa-mem]). Alignments are stored in the SAM/BAM format introduced by Li *et al.* (2009), now the standard container for NGS alignments; an indexed BAM allows random access to any region of a large genome ([SAM/BAM][sam-bam]). **GATK Best Practices** standardized the downstream steps (duplicate marking, base quality score recalibration, etc.) for reproducible variant discovery ([GATK Best Practices][gatk-best-practices]).

**NVIDIA Parabricks** ports this canonical CPU pipeline to GPUs. The **FQ2BAM** tool wraps **BWA-MEM** and performs sorting, duplicate marking, and (optionally) generation of a BQSR recalibration report in one GPU-accelerated stage, delivering **CPU-parity results** with large wall-clock speedups ([Parabricks FQ2BAM][pbr-fq2bam], [Parabricks overview][pbr-overview]). GPU acceleration is also a key enabler at population scale: for other genomics workloads (QTL mapping and Bayesian matrix factorization), one study reports **>200× runtime decreases and \~5–10× cost reductions** relative to CPUs ([GPU genomics scaling][gpu-scaling]).

**Typical command (example):**

```bash
# Duplicates are marked by default; --knownSites with --out-recal-file also writes a BQSR report
pbrun fq2bam \
  --ref hg38.fa \
  --in-fq sample.trim.R1.fq.gz sample.trim.R2.fq.gz \
  --out-bam sample.bam \
  --knownSites dbsnp.vcf.gz \
  --out-recal-file sample.recal.txt
```
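
Because FQ2BAM wraps BWA-MEM, the reference is expected to have been indexed with `bwa index` beforehand (this notebook does so in its preprocessing cell). A minimal pre-flight check, assuming the five index files that `bwa index` conventionally writes next to the FASTA; the helper name `missing_bwa_index_files` is hypothetical:

```python
from pathlib import Path

# Suffixes that `bwa index ref.fa` appends to the reference path.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(ref_fasta: str) -> list:
    """List the BWA index files that are absent next to the reference FASTA."""
    return [
        f"{ref_fasta}{suffix}"
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(f"{ref_fasta}{suffix}").exists()
    ]
```

An empty list means the index looks complete; anything else names the files to regenerate before launching the GPU job.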

---

## 3) Germline variant calling with **Parabricks HaplotypeCaller** (GPU)

**GATK HaplotypeCaller** shifted from pileup-based genotyping to **local de-novo assembly in “active regions”**, improving accuracy in complex contexts and producing either VCF or gVCF for joint genotyping ([HaplotypeCaller docs][haplotypecaller-docs]). Parabricks provides a GPU-accelerated implementation that mirrors GATK behavior with greatly reduced runtime, enabling rapid turnarounds without altering scientific outputs ([Parabricks overview][pbr-overview]).

**Typical command (example):**

```bash
pbrun haplotypecaller \
  --ref hg38.fa \
  --in-bam sample.bam \
  --out-variants sample.g.vcf.gz \
  --gvcf
```

> **Option:** Joint-genotype multiple gVCFs (e.g., `pbrun genotypegvcf`), then proceed with filtering.
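
To make the VCF versus gVCF distinction concrete: in GATK-style gVCF output, stretches with no evidence of variation are emitted as reference blocks whose only ALT allele is the symbolic `<NON_REF>`, while variant records list real alleles first. The sketch below separates the two kinds of record; the function names and the two example lines are made up for illustration.

```python
def is_reference_block(vcf_line: str) -> bool:
    """True when a gVCF data line's only ALT allele is the symbolic <NON_REF>."""
    fields = vcf_line.rstrip("\n").split("\t")
    return fields[4] == "<NON_REF>"  # column 5 of a VCF line is ALT


def split_gvcf_records(lines):
    """Separate gVCF data lines into (reference_blocks, variant_records)."""
    ref_blocks, variants = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        (ref_blocks if is_reference_block(line) else variants).append(line)
    return ref_blocks, variants


# Toy records, not real calls.
example = [
    "##fileformat=VCFv4.2",
    "chr1\t1\t.\tA\t<NON_REF>\t.\t.\tEND=100\tGT:DP:GQ\t0/0:30:90",
    "chr1\t101\t.\tC\tT,<NON_REF>\t250\t.\tDP=28\tGT:AD\t1/1:0,28,0",
]
blocks, calls = split_gvcf_records(example)
```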

---

## 4) Filtering & **consensus FASTA** with **bcftools**

After variant calling, apply **transparent, auditable filters**—e.g., depth (DP), genotype quality (GQ), strand bias (FS), mapping quality (MQ), allele balance (AB)—before generating a personalized FASTA. The **bcftools** toolkit provides both filtering and a **`consensus`** subcommand that applies variants onto a reference to yield a consensus/individualized sequence ([bcftools manual][bcftools-manual], [bcftools consensus how-to][bcftools-consensus]).

**Filtering examples (tune to your coverage & organism):**

```bash
# Hard filters (illustrative thresholds)
bcftools filter -i 'TYPE="snp" && DP>=10 && GQ>=20 && MQ>=40' sample.vcf.gz \
  -Oz -o sample.filtered.vcf.gz
tabix -p vcf sample.filtered.vcf.gz
```
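
For readers who want to see what the filter expression does per record, here is an approximate pure-Python equivalent. It is a sketch only: it assumes a single-sample VCF line with `DP` and `MQ` in INFO and `GQ` in FORMAT, treats missing values as failing, and uses a simplified definition of "SNP" (single-base REF and ALT alleles). The function names are hypothetical.

```python
def parse_info(info: str) -> dict:
    """Turn an INFO string such as 'DP=30;MQ=60.0' into a dict of strings."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def passes_hard_filter(vcf_line: str, min_dp=10, min_gq=20, min_mq=40.0) -> bool:
    """Approximate TYPE="snp" && DP>=10 && GQ>=20 && MQ>=40 for one VCF line."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, info, fmt, sample = fields[3], fields[4], fields[7], fields[8], fields[9]
    # Simplified SNP test: every allele is a single nucleotide.
    is_snp = len(ref) == 1 and all(a in ("A", "C", "G", "T") for a in alt.split(","))
    info_fields = parse_info(info)
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))
    try:
        dp = int(info_fields["DP"])
        mq = float(info_fields["MQ"])
        gq = int(sample_fields["GQ"])
    except (KeyError, ValueError):
        return False  # assumption: missing or non-numeric values fail the filter
    return is_snp and dp >= min_dp and gq >= min_gq and mq >= min_mq
```

Writing the thresholds as keyword arguments keeps them visible and easy to audit, which is the same motivation as using explicit `bcftools filter` expressions.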

**Create a consensus FASTA** (choose haplotype handling):

* **Haploid or pick a phase:** `-H 1` or `-H 2`
* **IUPAC codes** for heterozygous sites: `--iupac-codes`

```bash
# Apply variants to the reference to build a subject-specific FASTA
cat hg38.fa | bcftools consensus \
  --sample SAMPLE_ID \
  --haplotype 1 \
  sample.filtered.vcf.gz > sample.consensus.fa
```
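
Conceptually, building a consensus means substituting each ALT allele for its REF allele at the stated position. The toy function below illustrates that idea on a plain string; it is not how `bcftools consensus` is implemented, and it assumes non-overlapping variants given as hypothetical `(pos, ref, alt)` tuples with 1-based positions.

```python
def apply_variants(reference: str, variants) -> str:
    """Apply (pos, ref, alt) substitutions to a sequence; pos is 1-based as in VCF."""
    seq = reference
    # Work from the rightmost variant so indels never shift pending coordinates.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


# One SNP, one 1-bp deletion, one 2-bp insertion on a toy reference.
consensus = apply_variants(
    "ACGTACGT",
    [(2, "C", "T"), (5, "AC", "A"), (8, "T", "TGG")],
)
print(consensus)  # ATGTAGTGG
```

The REF check mirrors a useful safeguard in practice: a mismatch usually means the VCF was called against a different reference than the one being edited.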

---

## Why this matters for **Genome LMs**

GLMs trained on **individualized sequences** can capture real genetic variation (SNPs/indels) instead of the “average” reference. The pipeline above keeps the **scientific lineage intact** (fastp → BWA-MEM/SAM/BAM → GATK Best Practices → HaplotypeCaller) while leveraging **GPU acceleration** to shorten iteration cycles—crucial when generating many per-sample FASTAs for large-scale ML ([BWA-MEM][bwa-mem], [SAM/BAM][sam-bam], [GATK Best Practices][gatk-best-practices], [HaplotypeCaller docs][haplotypecaller-docs], [Parabricks overview][pbr-overview]).


<!-- Reference-style link definitions -->

[fastp-paper]: https://academic.oup.com/bioinformatics/article/34/17/i884/5093234 "fastp: an ultra-fast all-in-one FASTQ preprocessor"
[fastp-github]: https://github.com/OpenGene/fastp "OpenGene/fastp"
[bwa-mem]: https://arxiv.org/pdf/1303.3997 "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM"
[sam-bam]: https://academic.oup.com/bioinformatics/article/25/16/2078/204688 "Sequence Alignment/Map format and SAMtools"
[gatk-best-practices]: https://pmc.ncbi.nlm.nih.gov/articles/PMC4243306/ "From FastQ data to high confidence variant calls (GATK Best Practices)"
[haplotypecaller-docs]: https://gatk.broadinstitute.org/hc/en-us/articles/21905025322523-HaplotypeCaller "GATK HaplotypeCaller documentation"
[pbr-fq2bam]: https://docs.nvidia.com/clara/parabricks/4.4.0/Documentation/ToolDocs/man_fq2bam.html "Parabricks FQ2BAM (BWA-MEM + GATK)"
[pbr-overview]: https://docs.nvidia.com/clara/parabricks/4.3.1/index.html "NVIDIA Parabricks overview"
[gpu-scaling]: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1836-7 "Scaling computational genomics to millions of individuals with GPUs"
[bcftools-manual]: https://samtools.github.io/bcftools/bcftools.html "bcftools manual"
[bcftools-consensus]: https://samtools.github.io/bcftools/howtos/consensus-sequence.html "bcftools consensus how-to"


## Define wrapper functions for the pipeline

In [None]:
import os
import subprocess
from typing import Dict

# Configurations
THREADS = 16
MEMORY = 128
MIN_VAR_QUAL = 30
MIN_DEPTH = 6
MIN_VAR_FREQ = 0.6  # For haploid, variants should be majority frequency or greater
MAX_DEPTH = 500

def trim_fastq(
        job: Dict[str,str],
        min_illumina_quality:int=20,
        min_illumina_length:int=50,
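        min_illumina_complexity:int=30):
    """Trim adapters and low-quality reads from a read pair with fastp; starts a fresh log file."""
    print(job['OUTPUT_PREFIX'])
    cmd = [
        "/opt/fastp",
        "-i", job["UNPROCESSED_FASTQ_R1"],
        "-I", job["UNPROCESSED_FASTQ_R2"],
        "-o", job["FASTQ_R1"],
        "-O", job["FASTQ_R2"],
        "-w", str(THREADS),  # worker threads, taken from the configuration above
        "-q", str(min_illumina_quality),  # Phred score at which a base counts as qualified
        "--detect_adapter_for_pe",
        "--length_required", str(min_illumina_length),
        # -c: overlap-based base correction; -y/-Y: low-complexity filter and its threshold (%)
        "-c", "-y", "-Y", str(min_illumina_complexity)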
    ]
    with open(job["LOG_FILE"], "w") as f_out:
        subprocess.run(
            cmd,
            stdout=f_out,
            stderr=subprocess.STDOUT,
            text=True,
            check=True,
        )

def alignment(job: Dict[str,str]):
    """Align the trimmed read pair to the reference with Parabricks fq2bam and write a sorted BAM."""
    print(job['OUTPUT_PREFIX'])
    cmd = [
        "pbrun", "fq2bam",
        "--ref", job["REF_FASTA"],
        "--in-fq", job["FASTQ_R1"], job["FASTQ_R2"],
        "--out-bam", job["BAM_FILE"],
    ]
    with open(job["LOG_FILE"], "a") as f_out:
        subprocess.run(
            cmd,
            stdout=f_out,
            stderr=subprocess.STDOUT,
            text=True,
            check=True,
        )

def variant_calling(job:Dict[str,str]):
    """Call variants with Parabricks HaplotypeCaller in haploid mode (ploidy 1)."""
    print(job['OUTPUT_PREFIX'])
    cmd = [
        "pbrun", "haplotypecaller",
        "--ref", job["REF_FASTA"],
        "--in-bam", job["BAM_FILE"],
        "--out-variants", job["RAW_VCF"],
        "--ploidy", "1",
        "--minimum-mapping-quality", "20",
        "--haplotypecaller-options", "-standard-min-confidence-threshold-for-calling 10"
    ]
    with open(job["LOG_FILE"], "a") as f_out:
        subprocess.run(
            cmd,
            stdout=f_out,
            stderr=subprocess.STDOUT,
            text=True,
            check=True,
        )

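def basic_snp_filter(job:Dict[str,str]):
    """Keep records whose QUAL and total depth (INFO/DP) fall within the configured bounds.

    Despite the name, no variant-type restriction is applied, so indels pass as well.
    """
    print(job['OUTPUT_PREFIX'])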
    cmd = [
        "bcftools", "view",
        "-i", f"QUAL >= {MIN_VAR_QUAL} && INFO/DP >= {MIN_DEPTH} && INFO/DP <= {MAX_DEPTH}",
        job["RAW_VCF"],
        "-O", "v",
        "-o", job["FILTERED_VCF"]
    ]
    with open(job["LOG_FILE"], "a") as f_out:
        subprocess.run(
            cmd,
            stdout=f_out,
            stderr=subprocess.STDOUT,
            text=True,
            check=True,
        )

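def haploid_snp_filter(job:Dict[str,str]):
    """Keep calls where the first ALT allele has at least MIN_VAR_FREQ of the REF+ALT reads.

    In the expression, AD[0:1] is the ALT depth and AD[0:0] the REF depth of the first sample.
    """
    print(job['OUTPUT_PREFIX'])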
    cmd = [
        "bcftools", "view",
        "-i", f"FORMAT/AD[0:1]/(FORMAT/AD[0:0]+FORMAT/AD[0:1]) >= {MIN_VAR_FREQ}",
        job["FILTERED_VCF"],
        "-O", "v",
        "-o", job["HIGH_CONF_VCF"]
    ]
    with open(job["LOG_FILE"], "a") as f_out:
        subprocess.run(
            cmd,
            stdout=f_out,
            stderr=subprocess.STDOUT,
            text=True,
            check=True,
        )

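def generate_fasta(job:Dict[str,str]):
    """Normalize and split multiallelic records, bgzip and index the VCF, then build the consensus FASTA."""
    print(job['OUTPUT_PREFIX'])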
    cmd1 = [
        "bcftools", "norm",
        "-f", job["REF_FASTA"],
        "-m", "-both",
        job["HIGH_CONF_VCF"],
        "-O", "v",
        "-o", f"{job['HIGH_CONF_VCF']}.normalized.vcf"
    ]
    cmd2 = [
        "bcftools", "view",
        f"{job['HIGH_CONF_VCF']}.normalized.vcf", 
        "-Oz", "-o", f"{job['HIGH_CONF_VCF']}.normalized.vcf.gz", 
    ]
    # -f overwrites an existing index so the step can be re-run
    cmd3 = ["bcftools", "index", "-f", f"{job['HIGH_CONF_VCF']}.normalized.vcf.gz"]
    cmd4 = [
        "bcftools", "consensus",
        "-f", job["REF_FASTA"],
        "-o", job["CONSENSUS_FASTA"],
        f"{job['HIGH_CONF_VCF']}.normalized.vcf.gz"
    ]
    with open(job["LOG_FILE"], "a") as f_out:
        for cmd in [cmd1,cmd2,cmd3,cmd4]:
            subprocess.run(
                cmd,
                stdout=f_out,
                stderr=subprocess.STDOUT,
                text=True,
                check=True,
            )
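
The allele-fraction test in `haploid_snp_filter` can be read as plain arithmetic on the `AD` (allelic depth) field: ALT reads divided by REF plus ALT reads. The sketch below reproduces that calculation for a single `AD` string; the function names are hypothetical, and treating missing or zero-depth values as a fraction of 0.0 is an assumption of this example rather than documented bcftools behavior.

```python
def alt_allele_fraction(ad_field: str) -> float:
    """Fraction of REF+ALT reads supporting the first ALT allele, e.g. '2,18' -> 0.9."""
    parts = ad_field.split(",")
    if len(parts) < 2 or "." in parts[:2]:
        return 0.0  # assumption: missing depths count as no support
    ref_reads, alt_reads = int(parts[0]), int(parts[1])
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


def keep_haploid_call(ad_field: str, min_var_freq: float = 0.6) -> bool:
    """Mirror the MIN_VAR_FREQ threshold used by the filter above."""
    return alt_allele_fraction(ad_field) >= min_var_freq
```

In a haploid genome a true variant should be supported by nearly all reads, so a call with an even REF/ALT split (`"5,5"`) is more likely a mapping artifact or contamination and is dropped.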

### Preprocess datasets

In [None]:
base = "/workspace/datasets"
study_dir = f"{base}/haploid"
bam_dir = f"{study_dir}/bam"
vcf_dir = f"{study_dir}/vcf"
tmp_dir = f"{study_dir}/tmp"
fastq_dir = f"{study_dir}/fastq"
fastq_raw_delimiters = ["_unprocessed_illumina_"]

# Create directories for files
!mkdir -p $base
!mkdir -p $study_dir
!mkdir -p $bam_dir
!mkdir -p $vcf_dir
!mkdir -p $tmp_dir

In [None]:
# Preprocess fasta file
fasta_file = f"{study_dir}/fasta/GCA_000027005.1_ASM2700v1_genomic.fna"
genomic_gtf = f"{study_dir}/fasta/genomic.gtf"

# Index fasta with BWA
!/opt/bwa/bwa index $fasta_file

### Build jobs

In [None]:
fastq_files = [f"{fastq_dir}/{i}" for i in os.listdir(fastq_dir) if all([j in i for j in fastq_raw_delimiters])]
sample_names = ["your-sample-names"]
# My process to identify filenames based on path
# sample_names = sorted(list(set([i.split("fastq_unprocessed_illumina_")[-1].split("_R")[0] for i in fastq_files]))) 

jobs = [
    {
        "UNPROCESSED_FASTQ_R1":[i for i in fastq_files if sample_name in i and "_R1." in i][0],
        "UNPROCESSED_FASTQ_R2":[i for i in fastq_files if sample_name in i and "_R2." in i][0],
        "FASTQ_R1":[i.replace("_unprocessed_","_trimmed_") for i in fastq_files if sample_name in i and "_R1." in i][0],
        "FASTQ_R2":[i.replace("_unprocessed_","_trimmed_") for i in fastq_files if sample_name in i and "_R2." in i][0],
        "REF_FASTA":fasta_file,
        "ANNOTATION_GTF":genomic_gtf,
        "OUTPUT_PREFIX":sample_name,
        "BAM_FILE":f"{study_dir}/bam/{sample_name}.sorted.bam",
        "RAW_VCF":f"{study_dir}/vcf/{sample_name}.raw.vcf",
        "FILTERED_VCF":f"{study_dir}/vcf/{sample_name}.filtered.vcf",
        "HIGH_CONF_VCF":f"{study_dir}/vcf/{sample_name}.high_confidence.vcf",
        "CONSENSUS_FASTA":f"{study_dir}/fasta/{sample_name}.consensus.fasta",
        "STATS_FILE":f"{study_dir}/{sample_name}.stats.txt",
        "LOG_FILE":f"{study_dir}/{sample_name}.pipeline.log",
    }
    for sample_name in sample_names
]
print(len(jobs))
jobs
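
The job dictionaries above take the first match for each mate with `[...][0]`, which raises a bare `IndexError` when a file is missing and silently picks one when several match. A small helper, sketched here under the same `_R1.` / `_R2.` naming assumption, makes both cases explicit; `find_read_pair` is a hypothetical name.

```python
def find_read_pair(fastq_files, sample_name):
    """Return (R1, R2) paths for a sample, or raise if a mate is missing or ambiguous."""
    pair = []
    for tag in ("_R1.", "_R2."):
        matches = [p for p in fastq_files if sample_name in p and tag in p]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one {tag} file for {sample_name!r}, "
                f"found {len(matches)}"
            )
        pair.append(matches[0])
    return tuple(pair)
```

Because matching is by substring, a sample called `s1` would also match files for `s10`; the exactly-one check surfaces that collision instead of pairing the wrong files.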

### Run pipeline

In [None]:
for job in jobs:
    trim_fastq(job)
    alignment(job)
    variant_calling(job)
    basic_snp_filter(job)
    haploid_snp_filter(job)
    generate_fasta(job)
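
With `check=True`, the first failing command raises `subprocess.CalledProcessError` and stops the whole loop, including samples that have not started. When processing many samples, one option is to record the failure and continue with the next sample. The sketch below shows that pattern; `run_jobs` is a hypothetical helper, and it assumes each step is a function taking a job dictionary, as in this notebook.

```python
import subprocess


def run_jobs(jobs, steps):
    """Run each step for each job; on a failed command, record it and move to the next job."""
    failures = {}
    for job in jobs:
        for step in steps:
            try:
                step(job)
            except subprocess.CalledProcessError as err:
                failures[job["OUTPUT_PREFIX"]] = (step.__name__, err.returncode)
                break  # later steps depend on this step's output
    return failures
```

It could be called as `run_jobs(jobs, [trim_fastq, alignment, variant_calling, basic_snp_filter, haploid_snp_filter, generate_fasta])`, after which the returned dictionary lists which samples need attention and at which step; the per-sample log file holds the tool output.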

## References

1. Chen S. *et al.* (2018) **fastp**: an ultra-fast all-in-one FASTQ preprocessor. *Bioinformatics*. ([paper][fastp-paper], [code][fastp-github])
2. Li H. (2013) **Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM**. *arXiv*. ([preprint][bwa-mem])
3. Li H. *et al.* (2009) **The Sequence Alignment/Map format and SAMtools**. *Bioinformatics*. ([article][sam-bam])
4. Van der Auwera G.A. *et al.* (2013) **From FastQ data to high-confidence variant calls**. *Curr Protoc Bioinformatics*. ([overview][gatk-best-practices])
5. **GATK HaplotypeCaller** documentation. ([page][haplotypecaller-docs])
6. **NVIDIA Parabricks**: FQ2BAM tool and suite overview. ([FQ2BAM][pbr-fq2bam], [overview][pbr-overview])
7. Taylor-Weiner A. *et al.* (2019) **Scaling computational genomics to millions of individuals with GPUs**. *Genome Biology*. ([article][gpu-scaling])
8. **bcftools** manual and consensus how-to. ([manual][bcftools-manual], [how-to][bcftools-consensus])