# Germline Analysis Blueprint 

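This notebook shows how to run germline variant analysis on whole-exome sequencing (WES) data: alignment with Parabricks fq2bam, variant calling with Parabricks DeepVariant, and a concordance check with hap.py.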

## Dataset

The dataset used in this lab is an NA12878 exome from the [NIH](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_indexes/NA12878/sequence.index.NA12878_Illumina_HiSeq_Exome_Garvan_fastq_09252015), sequenced on an Illumina HiSeq. The two FASTQ files and the reference files are in `data` and total 12 GB.

In [2]:
! tree data

data
├── data_source.txt
├── NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
├── NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
└── ref
    ├── Homo_sapiens_assembly38.dict
    ├── Homo_sapiens_assembly38.fasta
    ├── Homo_sapiens_assembly38.fasta.amb
    ├── Homo_sapiens_assembly38.fasta.ann
    ├── Homo_sapiens_assembly38.fasta.bwt
    ├── Homo_sapiens_assembly38.fasta.fai
    ├── Homo_sapiens_assembly38.fasta.pac
    ├── Homo_sapiens_assembly38.fasta.sa
    ├── Homo_sapiens_assembly38.known_indels.vcf.gz
    └── Homo_sapiens_assembly38.known_indels.vcf.gz.tbi

1 directory, 13 files
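
Before starting a long GPU job it can help to confirm that the inputs are in place. The following is a minimal sketch, not part of the original notebook: `missing_inputs` is a hypothetical helper, and the list of index suffixes is simply what the listing above shows next to the FASTA. Whether fq2bam strictly requires every one of them is an assumption.

```python
from pathlib import Path

# Index files that sit next to the FASTA in the listing above (assumed to be needed).
FASTA_SIDECARS = [".amb", ".ann", ".bwt", ".fai", ".pac", ".sa"]


def missing_inputs(
    data_dir,
    fasta="ref/Homo_sapiens_assembly38.fasta",
    fastqs=(
        "NIST7035_TAAGGCGA_L001_R1_001.fastq.gz",
        "NIST7035_TAAGGCGA_L001_R2_001.fastq.gz",
    ),
):
    """Return the expected input files (relative to data_dir) that do not exist."""
    data_dir = Path(data_dir)
    fasta_path = data_dir / fasta

    # The FASTA itself, its sequence dictionary, and the index sidecar files.
    expected = [fasta_path, fasta_path.with_suffix(".dict")]
    expected += [Path(str(fasta_path) + suffix) for suffix in FASTA_SIDECARS]
    # The paired FASTQ files.
    expected += [data_dir / name for name in fastqs]

    return [p.relative_to(data_dir).as_posix() for p in expected if not p.exists()]
```

Calling `missing_inputs("data")` from the notebook's directory returns an empty list when everything in the listing is present.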


## Alignment

In this step we will run BWA alignment using the [Parabricks fq2bam](https://docs.nvidia.com/clara/parabricks/latest/documentation/tooldocs/man_fq2bam.html) tool, which produces a sorted BAM file and its index.

In [None]:
%%sh

DOCKER_IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"

DATA_DIR="$PWD/data"

REF="ref/Homo_sapiens_assembly38.fasta"
FASTQ_1="NIST7035_TAAGGCGA_L001_R1_001.fastq.gz"
FASTQ_2="NIST7035_TAAGGCGA_L001_R2_001.fastq.gz"
OUT_BAM="NIST7035_TAAGGCGA_L001_R1_001.bam"

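# Mount the data directory at the same path inside the container and run from it,
# so the relative paths above resolve identically on the host and in the container.
docker run --gpus all --rm \
    -v "${DATA_DIR}:${DATA_DIR}" \
    -w "${DATA_DIR}" \
    "${DOCKER_IMAGE}" pbrun fq2bam \
    --ref "${REF}" \
    --in-fq "${FASTQ_1}" "${FASTQ_2}" \
    --out-bam "${OUT_BAM}"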


[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Set --bwa-options="-K #" to produce compatible pair-ended results with previous versions of fq2bam or BWA MEM.
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /home/gburnett/repos/parabricks-germline-brevdev/data/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz and /home/gburnett/repos/parabricks-germline-brevdev/data/NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
[Parabricks Options Mesg]: @RG\tID:H7AP8ADXX.1\tLB:lib1\tPL:bar\tSM:sample\tPU:H7AP8ADXX.1


[PB Info 2025-Feb-13 19:57:53] ------------------------------------------------------------------------------
[PB Info 2025-Feb-13 19:57:53] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2025-Feb-13 19:57:53] ||                              Version 4.4.0-1                             ||
[PB Info 2025-Feb-13 19:57:53] ||                      GPU-PBBWA mem, Sorting Phase-I                      ||
[PB Info 2025-Feb-13 19:57:53] ------------------------------------------------------------------------------
[PB Info 2025-Feb-13 19:57:53] Mode = pair-ended-gpu
[PB Info 2025-Feb-13 19:57:53] Running with 4 GPU(s), using 4 stream(s) per device with 16 worker threads per GPU
[PB Info 2025-Feb-13 19:58:03] # 100  0  3  0  0   0 pool:  3 741485922 bases/GPU/minute: 1112228883.0 
[PB Info 2025-Feb-13 19:58:12] Time spent reading: 6.017737 seconds
[PB Info 2025-Feb-13 19:58:13] # 43  0  4  0  0  16 pool: 60 3528668815 bases/GPU/minute: 4180774339.5 
[PB Inf

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation




Listing the data folder now shows the generated BAM file and its index.

In [11]:
! ls data | grep '\.bam'

NIST7035_TAAGGCGA_L001_R1_001.bam
NIST7035_TAAGGCGA_L001_R1_001.bam.bai
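
The fq2bam log above reports the read group it generated automatically (`@RG\tID:H7AP8ADXX.1\t...`). The sample name in that read group (`SM`) is what ends up as the sample column in downstream files, so it is worth knowing what was written. The sketch below splits such a line into its tags; `parse_read_group` is a hypothetical helper written for this illustration, not a Parabricks function.

```python
def parse_read_group(line):
    """Split an @RG header line into a {tag: value} dictionary."""
    # The log prints tabs as the two characters backslash + t; turn them into real tabs.
    fields = line.strip().replace("\\t", "\t").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Each remaining field looks like TAG:value; split on the first colon only.
    return dict(field.split(":", 1) for field in fields[1:])


# Read group copied from the fq2bam log above.
read_group = parse_read_group(
    r"@RG\tID:H7AP8ADXX.1\tLB:lib1\tPL:bar\tSM:sample\tPU:H7AP8ADXX.1"
)
print(read_group["SM"])
```

For this run the sample name is the generic value `sample`, which is the value shown in the log line.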



## Variant Calling

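In this step we will run [Parabricks DeepVariant](https://docs.nvidia.com/clara/parabricks/latest/documentation/tooldocs/man_deepvariant.html) on the BAM file from the alignment step. Because the input is exome data, we must include the `--use-wes-model` flag so that DeepVariant loads its WES model instead of the default one.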

In [None]:
%%sh

DOCKER_IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"

DATA_DIR="$PWD/data"

REF="ref/Homo_sapiens_assembly38.fasta"
IN_BAM="NIST7035_TAAGGCGA_L001_R1_001.bam"
OUT_VCF="NIST7035_TAAGGCGA_L001_R1_001.vcf"

# Same mount and working-directory setup as the alignment step.
docker run --gpus all --rm \
    -v "${DATA_DIR}:${DATA_DIR}" \
    -w "${DATA_DIR}" \
    "${DOCKER_IMAGE}" pbrun deepvariant \
    --ref "${REF}" \
    --in-bam "${IN_BAM}" \
    --out-variants "${OUT_VCF}" \
    --use-wes-model

Detected 4 CUDA Capable device(s), considering 4 device(s)
  CUDA Driver Version / Runtime Version          12.7 / 12.3
Using model for CUDA Capability Major/Minor version number:    80


[PB Info 2025-Feb-13 20:01:03] ------------------------------------------------------------------------------
[PB Info 2025-Feb-13 20:01:03] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2025-Feb-13 20:01:03] ||                              Version 4.4.0-1                             ||
[PB Info 2025-Feb-13 20:01:03] ||                                deepvariant                               ||
[PB Info 2025-Feb-13 20:01:03] ------------------------------------------------------------------------------
[PB Info 2025-Feb-13 20:01:03] Starting DeepVariant
[PB Info 2025-Feb-13 20:01:03] Running with 4 GPU devices, each with 2 group instances and 6 workers
[PB Info 2025-Feb-13 20:01:03] ProgressMeter -	Current-Locus	Elapsed-Minutes
[PB Info 2025-Feb-13 20:01:09] ProgressMeter -	chr1:14000	0.1
[PB Info 2025-Feb-13 20:01:15] ProgressMeter -	chr3:52228000	0.2
[PB Info 2025-Feb-13 20:01:21] ProgressMeter -	chr7:105174000	0.3
[PB Info 2025-Feb-13 20:01:

/usr/local/parabricks/binaries/bin/deepvariant /home/gburnett/repos/parabricks-germline-brevdev/data/ref/Homo_sapiens_assembly38.fasta /home/gburnett/repos/parabricks-germline-brevdev/data/NIST7035_TAAGGCGA_L001_R1_001.bam 4 2 -o /home/gburnett/repos/parabricks-germline-brevdev/data/NIST7035_TAAGGCGA_L001_R1_001.vcf -n 6 --model /usr/local/parabricks/binaries/model/80+/shortread/deepvariant_wes.eng --channel_insert_size --pileup_image_width 221 --max_reads_per_partition 1500 --partition_size 1000 --vsc_min_count_snps 2 --vsc_min_count_indels 2 --vsc_min_fraction_snps 0.12 --min_mapping_quality 5 --min_base_quality 10 --alt_aligned_pileup none --variant_caller VERY_SENSITIVE_CALLER --dbg_min_base_quality 15 --ws_min_windows_distance 80 --aux_fields_to_keep HP --p_error 0.001 --max_ins_size 10
Variant caller done, total time: 0.6 min
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation



Listing the data folder now shows the generated VCF file.

In [12]:
! ls data | grep '\.vcf'

NIST7035_TAAGGCGA_L001_R1_001.vcf
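
A quick sanity check on the VCF is to count its records by the value in the FILTER column. This matters for the concordance step, which runs hap.py with `--pass-only`. The sketch below is a plain-text parser written for illustration; `count_filters` is a hypothetical helper and does not validate the file beyond splitting on tabs.

```python
import gzip
from collections import Counter


def count_filters(vcf_path):
    """Count VCF records by the value of the FILTER column (7th column)."""
    # Handle both plain and gzip-compressed VCF files.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            columns = line.rstrip("\n").split("\t")
            counts[columns[6]] += 1
    return counts
```

For example, `count_filters("data/NIST7035_TAAGGCGA_L001_R1_001.vcf")` returns a `Counter` whose `"PASS"` entry is the number of records that a PASS-only comparison would consider.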


## Concordance

In this step we compare the DeepVariant calls against the Genome in a Bottle (GIAB) HG001 (NA12878) v4.2.1 benchmark using hap.py with the `vcfeval` comparison engine. The truth VCF and BED files referenced in the cell are not part of the `data` listing shown in the Dataset section, so they must be placed in `data` before running it.

In [15]:
%%sh

DATA_DIR="$PWD/data"

REF="ref/Homo_sapiens_assembly38.fasta"
EVAL_VCF="NIST7035_TAAGGCGA_L001_R1_001.vcf"
TRUTH_VCF="HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
TRUTH_BED="HG001_GRCh38_1_22_v4.2.1_benchmark.bed"
OUT_FILE="NIST7035_TAAGGCGA_L001_R1_001.output"

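# Positional arguments are the truth VCF followed by the query VCF;
# -f restricts the comparison to the benchmark's confident regions.
docker run --rm \
    -v "${DATA_DIR}:${DATA_DIR}" \
    -w "${DATA_DIR}" \
    jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
    "${TRUTH_VCF}" \
    "${EVAL_VCF}" \
    -f "${TRUTH_BED}" \
    -r "${REF}" \
    -o "${OUT_FILE}" \
    --engine=vcfeval \
    --pass-only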



[W] overlapping records at chr6:29747431 for sample 0
[W] Variants that overlap on the reference allele: 6
[I] Total VCF records:         3893341
[I] Non-reference VCF records: 3893341
[W] overlapping records at chr10:104040598 for sample 0
[W] Variants that overlap on the reference allele: 3
[I] Total VCF records:         332986
[I] Non-reference VCF records: 257370


Hap.py 




Benchmarking Summary:
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       467702     11766    455936        23007      4955       6217   3034    507       0.025157          0.704884        0.270222         0.048580                     NaN                     NaN                   1.456462                   0.543054
INDEL   PASS       467702     11766    455936        23007      4955       6217   3034    507       0.025157          0.704884        0.270222         0.048580                     NaN                     NaN                   1.456462                   0.543054
  SNP    ALL      3254386    133786   3120600       226829     69163      23867  34485   2834       0.041109          0.659232        0.105220         0.077393                2.110694          
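
To make the summary columns easier to interpret, the sketch below recomputes recall, precision, Frac_NA, and F1 from the raw counts in the table. The formulas are an assumption in the sense that they were chosen because they reproduce the printed values, not taken from the hap.py source; `summary_metrics` is a hypothetical helper.

```python
def summary_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recompute the METRIC.* columns from the count columns of the summary table."""
    # Recall: fraction of truth variants that were found.
    recall = truth_tp / (truth_tp + truth_fn)
    # Query calls outside the confident regions (UNK) are not assessed.
    assessed = query_total - query_unk
    # Precision: fraction of assessed query calls that are not false positives.
    precision = (assessed - query_fp) / assessed
    # Frac_NA: fraction of query calls that could not be assessed.
    frac_na = query_unk / query_total
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "frac_na": frac_na, "f1": f1}


# Counts from the INDEL row of the table above.
indel = summary_metrics(
    truth_tp=11766, truth_fn=455936, query_total=23007, query_fp=4955, query_unk=6217
)
print({name: round(value, 6) for name, value in indel.items()})
```

Note that recall is measured against every truth variant inside the benchmark BED, so a truth set that spans regions the exome capture does not cover lowers recall regardless of how accurate the calls are within the captured regions.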

## Next Steps

Try running the germline analysis on your own data by pointing the FASTQ, reference, and output variables in the cells above at your own files.