# Germline Analysis Blueprint 

This notebook shows how to run germline analysis on whole-exome sequencing (WES) data using [Parabricks](https://docs.nvidia.com/clara/parabricks/latest/index.html). Parabricks is a free software suite for secondary analysis of next-generation sequencing (NGS) DNA and RNA data. It uses GPU acceleration to deliver fast results, and its output matches that of commonly used software, which makes the accuracy of the results straightforward to verify.

## Dataset

The dataset used in this lab is an exome of sample NA12878 from the [NIH](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_indexes/NA12878/sequence.index.NA12878_Illumina_HiSeq_Exome_Garvan_fastq_09252015), sequenced on an Illumina HiSeq instrument. The two FASTQ files and the HG38 reference files can be downloaded using the `download_data.sh` script.

In [None]:
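import os
import subprocess

data_dir = "./data"

# Download the FASTQ and reference files only if the data directory is absent.
if not os.path.isdir(data_dir):
    # check=True raises an error if the download script fails.
    subprocess.run(["sh", "./download_data.sh"], check=True)
else:
    print("Parabricks data file found")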

Data directory found
Parabricks data file found


It can take up to 15 minutes to download and organize the data into the following directory structure. 

```
data
├── NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
├── NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
└── ref
    ├── Homo_sapiens_assembly38.dict
    ├── Homo_sapiens_assembly38.fasta
    ├── Homo_sapiens_assembly38.fasta.amb
    ├── Homo_sapiens_assembly38.fasta.ann
    ├── Homo_sapiens_assembly38.fasta.bwt
    ├── Homo_sapiens_assembly38.fasta.fai
    ├── Homo_sapiens_assembly38.fasta.pac
    ├── Homo_sapiens_assembly38.fasta.sa
    ├── Homo_sapiens_assembly38.known_indels.vcf.gz
    └── Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
```

Let's verify that the data downloaded correctly. 

In [7]:
! ls data

NIST7035_TAAGGCGA_L001_R1_001.bam	NIST7035_TAAGGCGA_L001_R1_001_chrs.txt
NIST7035_TAAGGCGA_L001_R1_001.bam.bai	NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
NIST7035_TAAGGCGA_L001_R1_001.fastq.gz	ref
NIST7035_TAAGGCGA_L001_R1_001.vcf


In [8]:
! ls data/ref

Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta
Homo_sapiens_assembly38.fasta.amb
Homo_sapiens_assembly38.fasta.ann
Homo_sapiens_assembly38.fasta.bwt
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.pac
Homo_sapiens_assembly38.fasta.sa
Homo_sapiens_assembly38.known_indels.vcf.gz
Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
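
The same check can be scripted so that a missing or empty file is reported by name rather than spotted by eye. The sketch below is a minimal example that assumes the layout shown in the directory tree earlier in this notebook; `EXPECTED_FILES` and `find_missing` are names chosen here for illustration and are not part of the notebook or of Parabricks.

```python
from pathlib import Path

# Relative paths copied from the directory tree shown earlier in this notebook.
REF_PREFIX = "ref/Homo_sapiens_assembly38"
EXPECTED_FILES = (
    [
        "NIST7035_TAAGGCGA_L001_R1_001.fastq.gz",
        "NIST7035_TAAGGCGA_L001_R2_001.fastq.gz",
        f"{REF_PREFIX}.dict",
        f"{REF_PREFIX}.fasta",
    ]
    + [f"{REF_PREFIX}.fasta.{ext}" for ext in ("amb", "ann", "bwt", "fai", "pac", "sa")]
    + [
        f"{REF_PREFIX}.known_indels.vcf.gz",
        f"{REF_PREFIX}.known_indels.vcf.gz.tbi",
    ]
)


def find_missing(data_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are absent or empty under data_dir."""
    root = Path(data_dir)
    missing = []
    for rel in expected:
        path = root / rel
        # A zero-byte file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `find_missing("./data")` returns an empty list when every file is in place, so the result can gate the later cells.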


## Alignment

The following cell shows how to run alignment on the FASTQ files in the data directory using [Parabricks fq2bam](https://docs.nvidia.com/clara/parabricks/latest/documentation/tooldocs/man_fq2bam.html). Below is a diagram of the exact steps involved. 

![fq2bam_diagram](images/fq2bam.png)

Parabricks is packaged as a Docker container, so the cells that run it execute shell code that invokes Docker commands. The image comes from the [NVIDIA GPU Cloud (NGC) registry](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/containers/clara-parabricks).

The following shell script is an example that aligns the two FASTQ files against the reference with `pbrun fq2bam`:

```
REF="data/ref/Homo_sapiens_assembly38.fasta"
FASTQ_1="data/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz"
FASTQ_2="data/NIST7035_TAAGGCGA_L001_R2_001.fastq.gz"
OUT_BAM="data/NIST7035_TAAGGCGA_L001_R1_001.bam"

pbrun fq2bam \
    --ref ${REF} \
    --in-fq ${FASTQ_1} ${FASTQ_2} \
    --out-bam ${OUT_BAM}
```
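
For readers who prefer to stay in Python, the same invocation can be assembled as an argument list and, if needed, wrapped in `docker run`. This is a sketch under stated assumptions rather than the notebook's own method: `fq2bam_command` and `in_docker` are hypothetical helpers, the `/workdir` mount point is arbitrary, and the image tag is an assumption inferred from the registry link above and the version printed in the log, so substitute whichever image you actually pulled. Only the `pbrun fq2bam` flags that appear in the script above are used.

```python
import os
import shlex
import subprocess


def fq2bam_command(ref, fastq_1, fastq_2, out_bam):
    """Build the same pbrun fq2bam invocation as the shell script above."""
    return [
        "pbrun", "fq2bam",
        "--ref", ref,
        "--in-fq", fastq_1, fastq_2,
        "--out-bam", out_bam,
    ]


def in_docker(command, image, workdir):
    """Wrap a command in `docker run`, mounting workdir at /workdir (arbitrary choice)."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "--volume", f"{workdir}:/workdir",
        "--workdir", "/workdir",
        image,
    ] + list(command)


def run_checked(command):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


# Assumption: image path and tag are illustrative, not taken from the notebook.
PARABRICKS_IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"

cmd = fq2bam_command(
    "data/ref/Homo_sapiens_assembly38.fasta",
    "data/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz",
    "data/NIST7035_TAAGGCGA_L001_R2_001.fastq.gz",
    "data/NIST7035_TAAGGCGA_L001_R1_001.bam",
)
# Print both forms for inspection; nothing is executed here.
print(shlex.join(cmd))
print(shlex.join(in_docker(cmd, PARABRICKS_IMAGE, os.getcwd())))
```

Passing either list to `run_checked` executes it on a machine with a GPU and Parabricks available. Building the command as a list avoids shell quoting problems when paths contain spaces.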

In [4]:
!sh pb_alignment.sh

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation



[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Set --bwa-options="-K #" to produce compatible pair-ended results with previous versions of
fq2bam or BWA MEM.
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /root/germline-blueprint/data/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
and /root/germline-blueprint/data/NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
[Parabricks Options Mesg]: @RG\tID:H7AP8ADXX.1\tLB:lib1\tPL:bar\tSM:sample\tPU:H7AP8ADXX.1
[PB Info 2025-Mar-04 00:19:04] ------------------------------------------------------------------------------
[PB Info 2025-Mar-04 00:19:04] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2025-Mar-04 00:19:04] ||                              Version 4.4.0-1                             ||
[PB Info 2025-Mar-04 00:19:04] ||          
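
The log above reports an automatically generated read group (the `@RG` line). A short sketch like the one below unpacks that line into its tags, which makes it easier to see which values look like placeholders (here `SM:sample` and `PL:bar`) before the BAM is passed to downstream tools. `parse_read_group` is a hypothetical helper written for this illustration; the fq2bam documentation linked above is the place to check how these values can be set.

```python
def parse_read_group(header):
    """Split an @RG header line into a tag -> value dictionary.

    Accepts real tab characters or the literal two-character sequence
    backslash-t, which is how the line is printed in the log.
    """
    fields = header.replace("\\t", "\t").split("\t")
    if fields[0] != "@RG":
        raise ValueError(f"not a read group line: {header!r}")
    # Split each field on the first colon only, since values may contain colons.
    return dict(field.split(":", 1) for field in fields[1:])


# The read group line copied from the log output above.
rg = parse_read_group(
    r"@RG\tID:H7AP8ADXX.1\tLB:lib1\tPL:bar\tSM:sample\tPU:H7AP8ADXX.1"
)
print(rg["ID"], rg["SM"], rg["PL"])
```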

Looking in the data folder, we now see the generated BAM file and its index.

In [5]:
! ls data | grep -F .bam

NIST7035_TAAGGCGA_L001_R1_001.bam
NIST7035_TAAGGCGA_L001_R1_001.bam.bai



## Variant Calling

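The next cell runs variant calling with [Parabricks DeepVariant](https://docs.nvidia.com/clara/parabricks/latest/documentation/tooldocs/man_deepvariant.html) on the BAM file produced by the alignment step. The `--use-wes-model` flag selects the WES model, which matches the exome data used here.

This is the code in `variant_calling.sh`: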
```
REF="data/ref/Homo_sapiens_assembly38.fasta"
IN_BAM="data/NIST7035_TAAGGCGA_L001_R1_001.bam"
OUT_VCF="data/NIST7035_TAAGGCGA_L001_R1_001.vcf"

pbrun deepvariant \
    --ref ${REF} \
    --in-bam ${IN_BAM} \
    --out-variants ${OUT_VCF} \
    --use-wes-model
```

In [9]:
!sh variant_calling.sh

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

Detected 1 CUDA Capable device(s), considering 1 device(s)
  CUDA Driver Version / Runtime Version          12.4 / 12.3
Using model for CUDA Capability Major/Minor version number:    89
/usr/local/parabricks/binaries/bin/deepvariant /root/germline-blueprint/data/ref/Homo_sapiens_assembly38.fasta /root/germline-blueprint/data/NIST7035_TAAGGCGA_L001_R1_001.bam 1 2 -o /root/germline-blueprint/data/NIST7035_TAAGGCGA_L001_R1_001.vcf -n 6 --model /usr/local/parabricks/binaries/model/80+/shortread/deepvariant_wes.eng --channel_insert_size --pileup_image_width 221 --max_reads_per_partition 1500 --partition_size 1000 --vsc_min_count_snps 2 --vsc_min_count_indels 2 --vsc_min_fraction_snps 0.12 --min_mapping_quality 5 --min_base_quality 10 --alt_aligned_pileup none --variant_caller VERY_SENSITIVE_CALLER --dbg_min_base_quality 15 --ws_min_windows_distance 80 --aux_fields_to_keep HP --p_error 0.001 --max_ins_size 10


Looking in the data folder, we now see the generated VCF file.

In [10]:
! ls data | grep -F .vcf

NIST7035_TAAGGCGA_L001_R1_001.vcf
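
Checking that the file exists says little about its contents, so a quick summary of the records can serve as a sanity check. The sketch below is a minimal example that relies only on the standard tab-separated VCF column layout (`REF`, `ALT` and `FILTER` are columns 4, 5 and 7); `summarize_vcf` is a hypothetical helper, not a Parabricks feature, and the SNP test is deliberately simple.

```python
import gzip
from collections import Counter


def summarize_vcf(path):
    """Count variant records in a VCF, split into SNPs and other variants."""
    # Plain-text and gzip-compressed VCFs are both handled.
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip meta-information, the column header, and blank lines
            fields = line.rstrip("\n").split("\t")
            ref, alt, flt = fields[3], fields[4], fields[6]
            counts["records"] += 1
            counts["pass"] += flt == "PASS"
            # A record counts as a SNP when REF and every ALT allele is a single base.
            alleles = alt.split(",")
            is_snp = len(ref) == 1 and all(
                len(a) == 1 and a not in (".", "*") for a in alleles
            )
            counts["snp" if is_snp else "other"] += 1
    return counts
```

For the file generated above, `summarize_vcf("data/NIST7035_TAAGGCGA_L001_R1_001.vcf")` returns the counts as a `Counter`. A result with zero records would suggest that the run did not complete.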


## Next Steps

Try running germline analysis on your own data by pointing the alignment and variant calling steps at your own FASTQ and reference files.
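
As a starting point, the two steps from this notebook can be parameterized by sample. The sketch below only builds the command lines, using the `pbrun` flags that appear in the scripts above; `germline_commands`, its arguments, and the output naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def germline_commands(ref, fastq_1, fastq_2, out_dir, sample, wes=True):
    """Return the fq2bam and deepvariant commands for one paired-end sample."""
    out = Path(out_dir)
    # Hypothetical naming scheme: <out_dir>/<sample>.bam and <out_dir>/<sample>.vcf
    bam = (out / f"{sample}.bam").as_posix()
    vcf = (out / f"{sample}.vcf").as_posix()
    align = [
        "pbrun", "fq2bam",
        "--ref", ref,
        "--in-fq", fastq_1, fastq_2,
        "--out-bam", bam,
    ]
    call = [
        "pbrun", "deepvariant",
        "--ref", ref,
        "--in-bam", bam,
        "--out-variants", vcf,
    ]
    if wes:
        call.append("--use-wes-model")  # exome data, as in this notebook
    return [align, call]


# To execute the steps in order, stopping at the first failure:
#   import subprocess
#   for cmd in germline_commands(ref, fq1, fq2, "results", "my_sample"):
#       subprocess.run(cmd, check=True)
```

The variant calling command reads the BAM written by the alignment command, so the two must run in the order returned.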