# Germline Analysis Blueprint 

This notebook shows how to run germline analysis on whole-exome sequencing (WES) data using [Parabricks](https://docs.nvidia.com/clara/parabricks/latest/index.html). Parabricks is a free software suite for secondary analysis of next-generation sequencing (NGS) DNA and RNA data. It uses GPU acceleration to shorten run times, and its output matches that of commonly used software, which makes the results straightforward to verify.

## Dataset

The dataset used in this notebook is an NA12878 exome from the Genome in a Bottle (GIAB) reference samples hosted by the [NIH](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_indexes/NA12878/sequence.index.NA12878_Illumina_HiSeq_Exome_Garvan_fastq_09252015), sequenced on an Illumina HiSeq. The two FASTQ files and the hg38 reference files can be downloaded using the `download_data.sh` script.

In [None]:
! ./download_data.sh

It will take around 15 minutes to download and organize the data into the following directory structure. 

```
data
├── NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
├── NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
└── ref
    ├── Homo_sapiens_assembly38.dict
    ├── Homo_sapiens_assembly38.fasta
    ├── Homo_sapiens_assembly38.fasta.amb
    ├── Homo_sapiens_assembly38.fasta.ann
    ├── Homo_sapiens_assembly38.fasta.bwt
    ├── Homo_sapiens_assembly38.fasta.fai
    ├── Homo_sapiens_assembly38.fasta.pac
    ├── Homo_sapiens_assembly38.fasta.sa
    ├── Homo_sapiens_assembly38.known_indels.vcf.gz
    └── Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
```

Check that the data downloaded correctly. 

In [None]:
! ls data

In [None]:
! ls data/ref
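
The listing above can also be checked programmatically. The following is a minimal sketch, assuming the layout shown in the directory tree; the helper names `expected_files` and `missing_files` are hypothetical and not part of Parabricks or `download_data.sh`. It reports any expected file that is absent or empty.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
FASTQ_FILES = [
    "NIST7035_TAAGGCGA_L001_R1_001.fastq.gz",
    "NIST7035_TAAGGCGA_L001_R2_001.fastq.gz",
]
REF_PREFIX = "Homo_sapiens_assembly38"
REF_SUFFIXES = [
    ".dict", ".fasta", ".fasta.amb", ".fasta.ann", ".fasta.bwt",
    ".fasta.fai", ".fasta.pac", ".fasta.sa",
    ".known_indels.vcf.gz", ".known_indels.vcf.gz.tbi",
]


def expected_files():
    """Return the relative paths listed in the directory tree."""
    ref = [f"ref/{REF_PREFIX}{suffix}" for suffix in REF_SUFFIXES]
    return FASTQ_FILES + ref


def missing_files(data_dir="data"):
    """Return the expected files that are absent or empty under data_dir."""
    root = Path(data_dir)
    missing = []
    for rel in expected_files():
        path = root / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


gaps = missing_files("data")
print("All expected files present." if not gaps else f"Missing or empty: {gaps}")
```

A size check only catches empty files; it does not detect a truncated download, so a checksum comparison would be a stronger test where checksums are available.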

## Alignment

The following cell shows how to run alignment on the FASTQ files in the data directory using [Parabricks fq2bam](https://docs.nvidia.com/clara/parabricks/latest/documentation/tooldocs/man_fq2bam.html). Below is a diagram of the exact steps involved. 

![fq2bam_diagram](images/fq2bam.png)

Parabricks is distributed as a Docker image, so the cells that run it execute shell commands that call `docker run`. The image comes from the [NVIDIA NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/containers/clara-parabricks).
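
Before launching a container, it can help to confirm that the command-line tools the cells rely on are installed. This is a minimal sketch using only the Python standard library; the helper `find_tools` is hypothetical, and treating `nvidia-smi` on the `PATH` as a sign that an NVIDIA driver is present is an assumption, not a Parabricks requirement check.

```python
import shutil


def find_tools(names=("docker", "nvidia-smi")):
    """Map each tool name to its location on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

Finding both tools does not guarantee that Docker can access the GPU; it only rules out the most common setup gaps.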

In [None]:
%%sh

DOCKER_IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"

DATA_DIR="$PWD/data"

REF="ref/Homo_sapiens_assembly38.fasta"
FASTQ_1="NIST7035_TAAGGCGA_L001_R1_001.fastq.gz"
FASTQ_2="NIST7035_TAAGGCGA_L001_R2_001.fastq.gz"
OUT_BAM="NIST7035_TAAGGCGA_L001_R1_001.bam"

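# Mount the data directory at the same path inside the container and use it
# as the working directory, so the relative paths above resolve unchanged.
docker run --gpus all --rm \
    -v "${DATA_DIR}:${DATA_DIR}" \
    -w "${DATA_DIR}" \
    "${DOCKER_IMAGE}" pbrun fq2bam \
    --ref "${REF}" \
    --in-fq "${FASTQ_1}" "${FASTQ_2}" \
    --out-bam "${OUT_BAM}"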
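
The same command can be assembled in Python as an argument list, which sidesteps shell quoting when paths contain spaces. This is a sketch, not part of Parabricks: the function `fq2bam_command` is hypothetical, and it uses only the image tag and the `pbrun fq2bam` options that appear in the cell above.

```python
import shlex
from pathlib import Path

IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"


def fq2bam_command(data_dir, ref, fastq_1, fastq_2, out_bam, image=IMAGE):
    """Build the docker argument list used in the cell above."""
    # Docker needs an absolute host path for the bind mount.
    data_dir = str(Path(data_dir).resolve())
    return [
        "docker", "run", "--gpus", "all", "--rm",
        "-v", f"{data_dir}:{data_dir}",  # same path inside the container
        "-w", data_dir,                  # relative file names resolve here
        image, "pbrun", "fq2bam",
        "--ref", ref,
        "--in-fq", fastq_1, fastq_2,
        "--out-bam", out_bam,
    ]


cmd = fq2bam_command(
    "data",
    "ref/Homo_sapiens_assembly38.fasta",
    "NIST7035_TAAGGCGA_L001_R1_001.fastq.gz",
    "NIST7035_TAAGGCGA_L001_R2_001.fastq.gz",
    "NIST7035_TAAGGCGA_L001_R1_001.bam",
)
# Print the shell-quoted command for inspection instead of running it.
print(shlex.join(cmd))
```

To launch the container from Python, the list can be passed to `subprocess.run(cmd, check=True)`.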

The data folder now contains the generated BAM output.

In [None]:
! ls data | grep -F .bam
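
A file name alone does not show that the alignment output is a valid BAM file. BAM files are BGZF-compressed, which is readable as gzip, and the decompressed stream starts with the magic bytes `BAM\1`. The following sketch checks for that signature; the helper `looks_like_bam` is hypothetical, and it is only a quick sanity check, not a full validation.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file decompresses to the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Missing file, not gzip/BGZF data, or truncated input.
        return False


print(looks_like_bam("data/NIST7035_TAAGGCGA_L001_R1_001.bam"))
```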


## Variant Calling

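The following cell shows how to run variant calling on the aligned BAM file using [Parabricks DeepVariant](https://docs.nvidia.com/clara/parabricks/latest/documentation/tooldocs/man_deepvariant.html). Because the input is exome data, the command passes `--use-wes-model`.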

In [None]:
%%sh

DOCKER_IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1"

DATA_DIR="$PWD/data"

REF="ref/Homo_sapiens_assembly38.fasta"
IN_BAM="NIST7035_TAAGGCGA_L001_R1_001.bam"
OUT_VCF="NIST7035_TAAGGCGA_L001_R1_001.vcf"

# Same mount and working-directory setup as the alignment step.
docker run --gpus all --rm \
    -v "${DATA_DIR}:${DATA_DIR}" \
    -w "${DATA_DIR}" \
    "${DOCKER_IMAGE}" pbrun deepvariant \
    --ref "${REF}" \
    --in-bam "${IN_BAM}" \
    --out-variants "${OUT_VCF}" \
    --use-wes-model

The data folder now contains the generated VCF output.

In [None]:
! ls data | grep -F .vcf
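
As a quick look at the result, the number of variant records in the VCF can be counted with the standard library. In the VCF format, header lines start with `#` and every other non-empty line is one record. This is a minimal sketch; the helper `count_variants` is hypothetical, and the path assumes the output file name used in the cell above.

```python
import gzip
from pathlib import Path


def count_variants(vcf_path):
    """Count data lines (non-empty, not starting with '#') in a VCF file."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )


vcf = Path("data/NIST7035_TAAGGCGA_L001_R1_001.vcf")
if vcf.exists():
    print(f"{vcf.name}: {count_variants(vcf)} variant records")
else:
    print(f"{vcf} not found; run the variant calling cell first.")
```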

## Next Steps

Try running germline analysis on your own data by pointing `DATA_DIR` and the FASTQ and reference file names in the cells above at your own files.