# Retraining DeepVariant v1.5 using Parabricks

#### What is Parabricks?

NVIDIA Parabricks® is a GPU-accelerated computational genomics toolkit that delivers fast and accurate analysis for sequencing centers, clinical teams, genomics researchers, and next-generation sequencing instrument developers. Parabricks provides GPU-accelerated versions of tools used every day by computational biologists and bioinformaticians, which enables significantly faster runtimes, workflow scalability, and lower compute costs.

The toolkit is compatible with workflow languages and managers (WDL, Nextflow, Cromwell), so GPU- and CPU-powered tasks can be combined in one pipeline, and it supports cloud deployment (AWS, GCP, Terra, and DNAnexus).

#### What is DeepVariant?

[DeepVariant](https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ), developed by Google, is a deep learning-based variant caller that takes aligned reads, produces pileup image tensors from them, classifies each tensor using a convolutional neural network, and then outputs the results in a VCF or gVCF file.

# Downloading the data

The example data for this notebook is hosted on Google Cloud Storage and is downloaded with the [gsutil](https://cloud.google.com/storage/docs/gsutil) tool. We will keep it in a folder called `data`. In total it is ~14 GB and should take a few minutes to download.

In [None]:
%%sh 

DATA_BUCKET="gs://deepvariant/training-case-study/BGISEQ-HG001"
DATA_DIR="data"
OUTPUT_DIR="output"

# -p avoids an error if the directories already exist (e.g. when re-running this cell)
mkdir -p "${DATA_DIR}"
mkdir -p "${OUTPUT_DIR}"

gsutil -m cp "${DATA_BUCKET}/BGISEQ_PE100_NA12878.sorted.chr*.bam*" "${DATA_DIR}"
gsutil -m cp -r "${DATA_BUCKET}/ucsc_hg19.fa*" "${DATA_DIR}"
gsutil -m cp -r "${DATA_BUCKET}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_*" "${DATA_DIR}"
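
Because the download is large, it can be worth confirming that every input the later cells rely on actually arrived before starting `make_examples`. The following is a minimal sketch, not part of the original workflow: the helper name `missing_inputs` is hypothetical, and the file names are copied from the download and `make_examples` cells in this notebook. Index files (for example `.bai` or `.fai`) are not checked here.

```python
from pathlib import Path

# File names taken from the download and make_examples cells in this notebook.
REQUIRED_FILES = [
    "BGISEQ_PE100_NA12878.sorted.chr1.bam",
    "BGISEQ_PE100_NA12878.sorted.chr21.bam",
    "ucsc_hg19.fa",
    "HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer_chrs_FIXED.vcf.gz",
    "HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_chr.bed",
]


def missing_inputs(data_dir="data", required=REQUIRED_FILES):
    """Return the required file names that are absent or empty in data_dir."""
    root = Path(data_dir)
    missing = []
    for name in required:
        path = root / name
        # Treat zero-byte files as missing: they usually indicate an interrupted copy.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_inputs()` from the notebook directory returns an empty list when all five files are present in `data`.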

# Building a dataset to retrain DeepVariant

To retrain the WGS baseline model, we need a dataset to train on. We will use Chromosome 1 from HG001 to generate a training dataset. 

Note: The filepaths for mounting data might have to change depending on where you cloned this repo. 

Note: This took ~7 minutes on two GPUs. 

In [None]:
%%sh

INPUT_DIR="/data"
OUTPUT_DIR="/output"
REF="${INPUT_DIR}/ucsc_hg19.fa"
BAM_CHR1="${INPUT_DIR}/BGISEQ_PE100_NA12878.sorted.chr1.bam"
TRUTH_VCF="${INPUT_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer_chrs_FIXED.vcf.gz"
TRUTH_BED="${INPUT_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_chr.bed"
TRAIN_EXAMPLES="${OUTPUT_DIR}/training_set_gpu.with_label.tfrecord.gz"
CONTAINER="nvcr.io/nv-parabricks-dev/clara-parabricks:4.2.0-1.dvtraining"

docker run \
    --runtime "nvidia" \
    --rm \
    -v ${PWD}/data:/data \
    -v ${PWD}/output:/output \
    ${CONTAINER} pbrun make_examples \
    --ref ${REF} \
    --reads ${BAM_CHR1} \
    --num-streams-per-gpu 4 \
    --num-gpus 2 \
    --num-cpu-threads-per-stream 8 \
    -L "chr1" \
    --disable-use-window-selector-model \
    --truth-variants ${TRUTH_VCF} \
    --confident-regions ${TRUTH_BED} \
    --examples ${TRAIN_EXAMPLES} \
    --channel-insert-size


We will use Chromosome 21 from HG001 to generate a validation dataset. 

Note: This took ~3 minutes on 2 GPUs. 

In [None]:
%%sh 

INPUT_DIR="/data"
OUTPUT_DIR="/output"
REF="${INPUT_DIR}/ucsc_hg19.fa"
BAM_CHR21="${INPUT_DIR}/BGISEQ_PE100_NA12878.sorted.chr21.bam"
TRUTH_VCF="${INPUT_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer_chrs_FIXED.vcf.gz"
TRUTH_BED="${INPUT_DIR}/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel_chr.bed"
VAL_EXAMPLES="${OUTPUT_DIR}/validation_set_gpu.with_label.tfrecord.gz"
CONTAINER="nvcr.io/nv-parabricks-dev/clara-parabricks:4.2.0-1.dvtraining"

docker run \
    --runtime "nvidia" \
    --rm \
    -v ${PWD}/data:/data \
    -v ${PWD}/output:/output \
    ${CONTAINER} pbrun make_examples \
    --ref ${REF} \
    --reads ${BAM_CHR21} \
    --num-streams-per-gpu 4 \
    --num-gpus 2 \
    --num-cpu-threads-per-stream 8 \
    -L "chr21" \
    --disable-use-window-selector-model \
    --truth-variants ${TRUTH_VCF} \
    --confident-regions ${TRUTH_BED} \
    --examples ${VAL_EXAMPLES} \
    --channel-insert-size
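
The `--examples` argument is given as a single file name, but the shuffle cells in this notebook glob for files such as `training_set_gpu.with_label.tfrecord-?????-of-00004.gz`, which suggests the output is written as numbered shards. The sketch below shows how such shard names can be derived from the base name. The helper `shard_names` is hypothetical, and both the naming scheme and the shard count of 4 are inferred from that glob rather than from Parabricks documentation.

```python
def shard_names(examples_path, num_shards):
    """Expand 'x.tfrecord.gz' into 'x.tfrecord-00000-of-0000N.gz' style shard names.

    Assumption: the shard suffix is inserted before the trailing '.gz',
    matching the glob pattern used by the shuffle cells in this notebook.
    """
    suffix = ".gz"
    if not examples_path.endswith(suffix):
        raise ValueError("expected a path ending in .gz")
    stem = examples_path[: -len(suffix)]
    return [f"{stem}-{i:05d}-of-{num_shards:05d}{suffix}" for i in range(num_shards)]
```

For example, `shard_names("/output/training_set_gpu.with_label.tfrecord.gz", 4)` yields four names, the first being `/output/training_set_gpu.with_label.tfrecord-00000-of-00004.gz`.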

### Shuffling the training data

Before we can train the model, we need to shuffle each set of examples and generate a dataset config file. This has to be done for both the training and validation datasets.

In [None]:
%%sh

OUTPUT_DIR="/output"
CONTAINER="nvcr.io/nv-parabricks-dev/clara-parabricks:4.2.0-1.dvtraining"
INPUT_FILES=$(ls ${PWD}/output/training_set_gpu.with_label.tfrecord-?????-of-00004.gz | sed 's|'${PWD}/output'|'${OUTPUT_DIR}'|')

# shuffle training set 
docker run \
    --runtime "nvidia" \
    --rm \
    -v ${PWD}/output:${OUTPUT_DIR} \
    ${CONTAINER} pbrun shuffle \
    --input-pattern-list ${INPUT_FILES} \
    --output-pattern-prefix=${OUTPUT_DIR}/training_set_gpu.with_label.shuffled \
    --output-dataset-config=${OUTPUT_DIR}/training_set_gpu.pbtxt \
    --output-dataset-name="HG001" \
    --direct-num-workers=4

In [None]:
%%sh

OUTPUT_DIR="/output"
CONTAINER="nvcr.io/nv-parabricks-dev/clara-parabricks:4.2.0-1.dvtraining"
INPUT_FILES=$(ls ${PWD}/output/validation_set_gpu.with_label.tfrecord-?????-of-00004.gz | sed 's|'${PWD}/output'|'${OUTPUT_DIR}'|')

# shuffle validation set
docker run \
    --runtime "nvidia" \
    --rm \
    -v ${PWD}/output:/output \
    ${CONTAINER} pbrun shuffle \
    --input-pattern-list ${INPUT_FILES} \
    --output-pattern-prefix=${OUTPUT_DIR}/validation_set_gpu.with_label.shuffled \
    --output-dataset-config=${OUTPUT_DIR}/validation_set_gpu.pbtxt \
    --output-dataset-name="HG001" \
    --direct-num-workers=4
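
In both shuffle cells, the `ls ... | sed` line lists the shard files on the host and rewrites each path so that it points to the same file as seen inside the container (the host `output` folder is mounted at `/output`). The same mapping can be sketched in Python as follows. The function name `container_paths` is hypothetical and is only meant to illustrate what the shell one-liner does.

```python
from pathlib import Path, PurePosixPath


def container_paths(host_dir, container_dir, pattern):
    """Glob files in host_dir and rewrite each path to its location under container_dir."""
    host_root = Path(host_dir)
    return [
        # Only the directory prefix changes; the file name stays the same.
        str(PurePosixPath(container_dir) / p.name)
        for p in sorted(host_root.glob(pattern))
    ]
```

For instance, `container_paths("output", "/output", "training_set_gpu.with_label.tfrecord-?????-of-00004.gz")` returns the training shards as `/output/...` paths in sorted order.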

# Training the DeepVariant model

Next, run the following two code cells at the same time: one evaluates checkpoints as they appear, and the other trains the model.

The first cell polls the `training_dir` folder for new model checkpoints. Whenever the training script writes a new checkpoint, the evaluation job scores it and keeps track of which checkpoint performs best.

In [None]:
%%sh

# Note: this job does not exit on its own; stop it manually once model_train has stopped generating checkpoints

BIN_VERSION="1.5.0"
OUTPUT_DIR="/output"
TRAINING_DIR="/training_dir"
LOG_DIR="/logs"

# -p avoids an error if the directories already exist
mkdir -p logs
mkdir -p training_dir

docker run \
    -v ${PWD}/output:/output \
    -v ${PWD}/training_dir:/training_dir \
    -v ${PWD}/logs:/logs \
    google/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/model_eval \
    --dataset_config_pbtxt="${OUTPUT_DIR}/validation_set_gpu.pbtxt" \
    --checkpoint_dir="${TRAINING_DIR}" \
    --batch_size=512 > "logs/eval_gpu.log" 2>&1 &
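
Since the evaluation job runs in the background and has to be stopped by hand, it helps to see which checkpoints exist in `training_dir` at any moment. The sketch below lists the checkpoint step numbers found on disk. It assumes the usual TensorFlow naming of `model.ckpt-<step>.index` for checkpoint index files; that naming is an assumption here and is not stated in this notebook, and `checkpoint_steps` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed TensorFlow checkpoint index naming: model.ckpt-<step>.index
_CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(training_dir="training_dir"):
    """Return the sorted step numbers of checkpoints found in training_dir."""
    root = Path(training_dir)
    if not root.is_dir():
        return []
    steps = []
    for entry in root.iterdir():
        match = _CKPT_INDEX.match(entry.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

If the list returned by `checkpoint_steps()` stops growing and training has finished, the evaluation job can be stopped.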

This cell kicks off the DeepVariant training. 

In [None]:
%%sh

# All parameters below are examples only. They are not optimized for this dataset and are not recommended as defaults.

BIN_VERSION="1.5.0"
OUTPUT_DIR="/output"
TRAINING_DIR="/training_dir"
LOG_DIR="/logs"

MODEL_BUCKET="gs://deepvariant/models/DeepVariant/${BIN_VERSION}/DeepVariant-inception_v3-${BIN_VERSION}+data-wgs_standard"
GCS_PRETRAINED_WGS_MODEL="${MODEL_BUCKET}/model.ckpt"

(docker run \
    --runtime "nvidia" \
    --rm \
    -v ${PWD}/output:/output \
    -v ${PWD}/training_dir:/training_dir \
    -v ${PWD}/logs:/logs \
    google/deepvariant:"${BIN_VERSION}-gpu" \
    /opt/deepvariant/bin/model_train \
    --dataset_config_pbtxt="${OUTPUT_DIR}/training_set_gpu.pbtxt" \
    --train_dir="${TRAINING_DIR}" \
    --model_name="inception_v3" \
    --number_of_steps=5000 \
    --save_interval_secs=300 \
    --batch_size=32 \
    --learning_rate=0.0005 \
    --start_from_checkpoint="${GCS_PRETRAINED_WGS_MODEL}" \
    ) > "logs/train_gpu.log" 2>&1
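
With `--number_of_steps=5000` and `--batch_size=32`, training processes 5000 × 32 = 160,000 examples in total. To relate that to the size of the training set, the example count can be read from the dataset config written by the shuffle step. The sketch below assumes the config contains a `num_examples: <N>` line, as the configs produced by the upstream DeepVariant shuffle script do; whether the Parabricks output has exactly this layout is an assumption, and both function names are hypothetical.

```python
import re


def num_examples_from_pbtxt(text):
    """Extract the integer after 'num_examples:' from a dataset config's text."""
    # Commented-out lines (starting with '#') are not matched.
    match = re.search(r"^\s*num_examples:\s*(\d+)", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no num_examples field found")
    return int(match.group(1))


def epochs_covered(number_of_steps, batch_size, num_examples):
    """Return how many passes over the dataset the given training schedule makes."""
    return number_of_steps * batch_size / num_examples
```

As an illustration with a made-up dataset size of 320,000 examples, `epochs_covered(5000, 32, 320000)` gives `0.5`, meaning the run would see half of the training set.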

# Choose the best model

Finally, we want to pick the best model. The evaluation job records the best-performing checkpoint in `training_dir/best_checkpoint.txt`, so we can find out which model to use by printing that file.

In [None]:
! cat training_dir/best_checkpoint.txt
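
To use the result programmatically rather than reading it off the screen, the file can also be loaded in Python. This is a minimal sketch that assumes the first non-empty line of `best_checkpoint.txt` holds the checkpoint path; the exact file layout is not documented in this notebook, and `read_best_checkpoint` is a hypothetical helper.

```python
from pathlib import Path


def read_best_checkpoint(path="training_dir/best_checkpoint.txt"):
    """Return the first non-empty line of the best-checkpoint file."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if not lines:
        raise ValueError(f"{path} is empty; has model_eval evaluated a checkpoint yet?")
    return lines[0]
```

Note that the evaluation job ran inside a container with `training_dir` mounted at `/training_dir`, so the recorded path may be a container path that corresponds to `./training_dir` on the host.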