In [None]:
!apt update
!apt install -y wget

## Download the Sample Data

In [None]:
# The tar file is 9.3 GB; extracting it needs an additional 14 GB of disk space
!mkdir -p sample_data
%cd sample_data
!wget -O parabricks_sample.tar.gz "https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"
!tar xvf parabricks_sample.tar.gz
!mv parabricks_sample/* .
%cd ..
!mkdir outputdir
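
The download cell above needs roughly 9.3 GB for the archive plus 14 GB for the extracted files. As a minimal sketch (not part of the original workflow), free space can be checked before starting the download. The helper name `has_room` and the choice of 1 GB = 10^9 bytes are assumptions made for illustration.

```python
import shutil

# Sizes quoted in the download cell's comment (1 GB taken as 10**9 bytes here).
ARCHIVE_GB = 9.3
EXTRACTED_GB = 14.0


def has_room(path=".", needed_gb=ARCHIVE_GB + EXTRACTED_GB):
    """Return (ok, free_gb) for the filesystem that holds `path`.

    Hypothetical helper: `ok` is True when the free space is at least `needed_gb`.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb, free_gb


ok, free_gb = has_room(".")
print(f"free: {free_gb:.1f} GB, needed: about {ARCHIVE_GB + EXTRACTED_GB:.1f} GB, ok={ok}")
```

If `ok` is False, point `path` at a larger volume or free some space before running the download.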

In [None]:
!ls sample_data

In [None]:
!ls sample_data/Data

In [None]:
!ls sample_data/Ref

## GPU Monitoring

In [None]:
!nvidia-smi
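
To record GPU utilization from inside the notebook rather than reading the `nvidia-smi` table by eye, the CSV query mode of `nvidia-smi` can be parsed in Python. This is a sketch under assumptions: the helper names `parse_gpu_csv` and `sample_gpus` are hypothetical, the sample line fed to the parser is made up for illustration (not a real measurement), and fields that `nvidia-smi` reports as `[N/A]` are not handled.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,name,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse output of `nvidia-smi --query-gpu=<QUERY> --format=csv,noheader,nounits`."""
    rows = []
    for line in text.strip().splitlines():
        idx, name, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(idx),
            "name": name,
            "util_pct": int(util),        # GPU utilization in percent
            "mem_used_mib": int(used),    # memory in MiB (nounits strips the suffix)
            "mem_total_mib": int(total),
        })
    return rows


def sample_gpus():
    """Run nvidia-smi once and return one dict per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Illustrative input only; call sample_gpus() on a machine with a GPU.
print(parse_gpu_csv("0, Example GPU, 87, 20480, 40960"))
```

Calling `sample_gpus()` in a loop while a `pbrun` command runs in another process gives a simple utilization log.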

In [None]:
# For continuous monitoring, run this in a separate terminal:
#   watch -n 0.5 nvidia-smi

## Alignment: FASTQ to BAM

In [None]:
WORKDIR = '/ws'

In [None]:
!pbrun fq2bam -h

In [None]:
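# BQSR is not recommended for the DeepVariant pipeline, so the recalibration
# options are left out. To produce a recalibration table, add these two options
# to the command (as real option lines, before --num-gpus):
#   --knownSites $WORKDIR/sample_data/Ref/Homo_sapiens_assembly38.known_indels.vcf.gz
#   --out-recal-file $WORKDIR/outputdir/recall.table
# A '#' placed inside the continued command would comment out every option after it.
!pbrun fq2bam \
      --ref $WORKDIR/sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-fq $WORKDIR/sample_data/Data/sample_1.fq.gz $WORKDIR/sample_data/Data/sample_2.fq.gz \
      --out-bam $WORKDIR/outputdir/fq2bam_output.bam \
      --num-gpus 1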

In [None]:
!ls outputdir/

## Variant Calling

#### GATK HaplotypeCaller

In [None]:
!pbrun haplotypecaller -h

- vcf

In [None]:
!pbrun haplotypecaller \
      --ref $WORKDIR/sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-bam $WORKDIR/outputdir/fq2bam_output.bam \
      --out-variants $WORKDIR/outputdir/variants_gatk.vcf \
      --num-gpus 1

- gvcf

In [None]:
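# One possible solution: same inputs as the VCF run, plus --gvcf and a .g.vcf output name
!pbrun haplotypecaller \
      --ref $WORKDIR/sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-bam $WORKDIR/outputdir/fq2bam_output.bam \
      --out-variants $WORKDIR/outputdir/variants_gatk.g.vcf \
      --gvcf \
      --num-gpus 1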

"The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps." (https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format)

#### DeepVariant

In [None]:
!pbrun deepvariant -h

- vcf

In [None]:
!pbrun deepvariant \
    --ref $WORKDIR/sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-bam $WORKDIR/outputdir/fq2bam_output.bam \
    --out-variants $WORKDIR/outputdir/variants_dv.vcf \
    --num-streams-per-gpu 2 \
    --run-partition \
    --gpu-num-per-partition 1 \
    --num-gpus 1

Parabricks DeepVariant can run multiple streams on a GPU. How many streams are usable depends on the available resources. The default is two streams, and the number can be raised to a maximum of six for better performance. Finding the best value for your system takes some experimentation; the next cell repeats the same call with four streams. (https://docs.nvidia.com/clara/parabricks/4.1.0/bestperformance.html#best-performance-for-deepvariant)

In [None]:
!pbrun deepvariant \
    --ref $WORKDIR/sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-bam $WORKDIR/outputdir/fq2bam_output.bam \
    --out-variants $WORKDIR/outputdir/variants_dv.vcf \
    --num-streams-per-gpu 4 \
    --run-partition \
    --gpu-num-per-partition 1 \
    --num-gpus 1
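
Since the best stream count has to be found by experiment, the two runs above can be generalized into a small timing loop. This is a sketch, not an NVIDIA-provided tool: the helpers `deepvariant_cmd` and `time_runs` and the per-run output name `variants_dv_s<N>.vcf` are hypothetical, while the `pbrun` options are the ones already used in the cells above.

```python
import shlex
import subprocess
import time


def deepvariant_cmd(workdir, streams, num_gpus=1):
    """Build the pbrun deepvariant command used above for a given stream count."""
    return [
        "pbrun", "deepvariant",
        "--ref", f"{workdir}/sample_data/Ref/Homo_sapiens_assembly38.fasta",
        "--in-bam", f"{workdir}/outputdir/fq2bam_output.bam",
        # Hypothetical per-run output name so runs do not overwrite each other.
        "--out-variants", f"{workdir}/outputdir/variants_dv_s{streams}.vcf",
        "--num-streams-per-gpu", str(streams),
        "--run-partition",
        "--gpu-num-per-partition", "1",
        "--num-gpus", str(num_gpus),
    ]


def time_runs(workdir, stream_counts, runner=subprocess.run):
    """Run each configuration once and return {streams: wall-clock seconds}."""
    timings = {}
    for streams in stream_counts:
        start = time.perf_counter()
        runner(deepvariant_cmd(workdir, streams), check=True)
        timings[streams] = time.perf_counter() - start
    return timings


# Dry run: print the commands without executing them.
for streams in (2, 4, 6):
    print(shlex.join(deepvariant_cmd("/ws", streams)))
```

On a machine with Parabricks installed, `time_runs(WORKDIR, [2, 4, 6])` would execute the three runs back to back; a single run per setting is only a rough comparison.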

- gvcf

Using the --run-partition, --proposed-variants, and --gvcf options together leads to a substantial slowdown, so the GVCF run below omits --run-partition.

In [None]:
!pbrun deepvariant \
    --ref $WORKDIR/sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-bam $WORKDIR/outputdir/fq2bam_output.bam \
    --out-variants $WORKDIR/outputdir/variants_dv.g.vcf \
    --num-streams-per-gpu 4 \
    --gvcf \
    --num-gpus 1
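
As the GATK description quoted earlier says, a GVCF contains records for all sites, not only variant calls. One way to see this in the generated files is to count reference-block records against variant records. The sketch below rests on assumptions: it treats a record whose ALT column is exactly `<NON_REF>` (GATK convention) or `<*>` (DeepVariant convention) as a reference block, the helper names are hypothetical, and the sample lines are synthetic, not taken from the tutorial data.

```python
import gzip

# ALT values that mark a pure reference block (assumed conventions, see lead-in).
REF_BLOCK_ALTS = {"<NON_REF>", "<*>"}


def open_text(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def count_records(lines):
    """Count variant records and reference-block records in VCF/GVCF lines."""
    counts = {"variant": 0, "reference_block": 0}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        alt = line.split("\t")[4]  # ALT is the fifth tab-separated column
        kind = "reference_block" if alt in REF_BLOCK_ALTS else "variant"
        counts[kind] += 1
    return counts


# Synthetic example lines for illustration only.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chr1\t10001\t.\tT\t<NON_REF>\t.\t.\tEND=10050\tGT:DP:GQ\t0/0:30:60",
    "chr1\t10051\t.\tA\tG,<NON_REF>\t50\t.\t.\tGT:DP:GQ\t0/1:28:50",
]
print(count_records(sample))
```

On real output, something like `count_records(open_text("outputdir/variants_dv.g.vcf"))` should report many reference blocks for a GVCF and none for the plain VCF files.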