# GATK Convolutional Neural Network (CNN) Filtering Tutorial <a class="tocSkip">
   

**January 2020**

<font size=4>This GATK tutorial will help you become familiar with using a Convolutional Neural Network (CNN) to filter annotated variants. The notebook illustrates the following steps:

- Use GATK to annotate a VCF with scores from a Convolutional Neural Network (CNN)
- Run the default 1D and 2D CNN models
- Apply tranche filtering to VCF based on scores from an annotation in the INFO field  
- Calculate concordance metrics</font>

_This tutorial was last tested with GATK v4.1.4.1 and IGV v2.8.0._ See [GATK Tool Documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360037224712) for further information on the tools we use below.

# Set up your Notebook

## Set cloud environment values
If you opened this notebook and didn't adjust any cloud environment values, now's the time to edit them. Click on the gear icon in the upper right to edit your Cloud Environment form. Set the values as specified below:

| Option | Value |
| ------ | ------ |
| Environment | Default |
| Profile | Custom |
| CPU | 4 |
| Disk size | 100 GB |
| Memory | 15 GB |

Click the "Update" button when you are done, and Terra will begin to create a new runtime with your settings. When it is finished, it will pop up asking you to apply the new settings. In the meantime, you can continue with the setup instructions below. 

## Check kernel type
A kernel is a _computational engine_ that executes the code in the notebook. This notebook uses a Python 3 kernel so that we can run GATK commands as shell commands by prefixing them with `!` (_Python Magic_). In the upper right corner of the notebook, just under the Notebook Runtime, it should say `Python3`. If this notebook isn't running a Python 3 kernel, you can switch by opening the Kernel menu and selecting `Change kernel`.

## Set up your files
Your notebook has a temporary folder that exists so long as your cluster is running. To see what files are in your notebook environment at any time, you can click on the Jupyter logo in the upper left corner. 

For this tutorial, we need to copy some files between this temporary folder and our workspace bucket. Run the cells below to set up the workspace bucket variable, the workshop variable, and the directories inside your notebook.

<font color = "green"> **Tool Tip:** To run a cell in a notebook, press `SHIFT + ENTER`</font>

In [None]:
# Set your workspace bucket variable for this notebook.
import os
BUCKET = os.environ['WORKSPACE_BUCKET']

In [None]:
# Set workshop variable to access the most recent materials
WORKSHOP = "workshop_2002"

In [None]:
# Create directories for your files to live inside this notebook
! mkdir -p /home/jupyter/notebooks/2-germline-vd/sandbox/
! mkdir -p /home/jupyter/notebooks/2-germline-vd/ref
! mkdir -p /home/jupyter/notebooks/2-germline-vd/resources
! mkdir -p /home/jupyter/notebooks/CNN/Output/

## Check data permissions
For this tutorial, we have hosted the starting files in a public Google bucket. We will first check that the data is available to your user account. If it is not, install the Google Cloud Storage Python package using the commented-out cell below.

In [None]:
# Check if data is accessible. The command should list several gs:// URLs.
! gsutil ls gs://gatk-tutorials/$WORKSHOP/2-germline/

In [None]:
# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. 
# Afterwards, restart the kernel with Kernel > Restart.
#! pip install google-cloud-storage

## Download Data to the Notebook 
Some tools are not able to read directly from a Google bucket, so we download their files to our local notebook folder.

In [None]:
! gsutil cp gs://gatk-tutorials/$WORKSHOP/2-germline/ref/* /home/jupyter/notebooks/2-germline-vd/ref
! gsutil cp gs://gatk-tutorials/$WORKSHOP/2-germline/trio.ped /home/jupyter/notebooks/2-germline-vd/
! gsutil cp gs://gatk-tutorials/$WORKSHOP/2-germline/resources/* /home/jupyter/notebooks/2-germline-vd/resources/

---
# Run the default 1D model on the VCF with CNNScoreVariants

CNNScoreVariants is a tool that scores variants with a pre-trained Convolutional Neural Network. It uses machine learning to differentiate between good variants and artifacts of the sequencing process, a fairly new approach that is especially effective at correctly calling indels.

VQSR and hard-filtering take only variant annotations into account. The CNNScoreVariants 1D model, however, evaluates **annotations** AND the **reference sequence** within 64 bases on either side of the variant. This lets it account, for example, for regions of the reference that are difficult to sequence.
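
To make the idea of a reference window concrete, the sketch below pulls out the bases within 64 positions of a variant and one-hot encodes them. It is an illustration only, not how CNNScoreVariants is implemented: the function names `reference_window` and `one_hot`, the toy sequence, and the choice to pad with `N` at contig edges are all assumptions made for this example.

```python
# Illustrative sketch only; not GATK's implementation.
FLANK = 64      # bases taken on each side of the variant, as described above
BASES = "ACGT"

def reference_window(sequence, pos, flank=FLANK):
    """Return the bases from pos-flank to pos+flank (pos is 1-based).

    Positions that fall off either end of the sequence are padded with N.
    """
    center = pos - 1  # convert the 1-based VCF position to a 0-based index
    window = []
    for i in range(center - flank, center + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else "N")
    return "".join(window)

def one_hot(window):
    """Encode each base as a 4-element list; N (or any other symbol) is all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in window]

toy_reference = "ACGT" * 50   # hypothetical 200-base contig
window = reference_window(toy_reference, pos=100)
encoded = one_hot(window)
print(len(window), len(encoded[0]))
```

A fixed-width numeric encoding of this kind is a common way to hand sequence context to a neural network alongside the variant annotations.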

To enable the models to accurately score and filter variants from VCF files, we **trained** them on validated VCFs (from truth sets including **SynDip, Genome in a Bottle, and Platinum Genomes**) together with unvalidated VCFs aligned to different reference builds (**HG19, HG38**), sequenced on **different machines**, using **different protocols**.

In [None]:
!gatk CNNScoreVariants \
-V gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz \
-O /home/jupyter/notebooks/CNN/Output/my_1d_cnn_scored.vcf \
-R gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta

The output VCF `my_1d_cnn_scored.vcf` now has an INFO field, `CNN_1D`, which holds the score assigned by the 1D model.

In [None]:
# Show the column header and the first few records, skipping the ## meta-information lines
!grep -v '^##' /home/jupyter/notebooks/CNN/Output/my_1d_cnn_scored.vcf | head -5
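
If you want to work with the scores in Python rather than read them by eye, a minimal sketch like the one below can extract the `CNN_1D` value from a record's INFO column. The helper names `parse_info` and `cnn_score` are hypothetical, and the example record is made up for illustration; it is not taken from the tutorial data.

```python
def parse_info(info_field):
    """Split a VCF INFO string such as 'AC=1;DB;CNN_1D=2.5' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # flag entries carry no value
    return info

def cnn_score(vcf_line, key="CNN_1D"):
    """Return the score stored under `key` in a tab-separated VCF record, or None."""
    fields = vcf_line.rstrip("\n").split("\t")
    value = parse_info(fields[7]).get(key)  # INFO is the 8th VCF column
    return float(value) if value is not None else None

# Made-up record; the position and values are not from the tutorial data.
example = "20\t1000072\t.\tA\tG\t50.0\t.\tAC=1;DB;CNN_1D=2.5\tGT\t0/1"
print(cnn_score(example))
```

Passing `key="CNN_2D"` would read the 2D model's score from a VCF scored with that model instead.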

## Apply filters to the VCF based on the CNN_1D score with the FilterVariantTranches tool

After scoring, you can filter your VCF by applying a sensitivity threshold with the tool FilterVariantTranches. 

In [None]:
!gatk FilterVariantTranches \
-V /home/jupyter/notebooks/CNN/Output/my_1d_cnn_scored.vcf \
--resource gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz \
--resource gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz \
--info-key CNN_1D \
--snp-tranche 99.9 \
--indel-tranche 99.9 \
-O /home/jupyter/notebooks/CNN/Output/my_1d_filtered.vcf \
--invalidate-previous-filters 

**Now you have a neural network filtered VCF!**
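
To build intuition for what a sensitivity tranche means, the sketch below picks a score cutoff that keeps a chosen percentage of the variants that also appear in the resource VCFs. This is a simplified illustration under stated assumptions, not the FilterVariantTranches algorithm: the function names `tranche_cutoff` and `passes` are hypothetical, and the scores are made up.

```python
import math

def tranche_cutoff(truth_scores, tranche_percent):
    """Return the lowest score to keep so that at least `tranche_percent`
    of the resource-overlapping variants pass the filter."""
    if not truth_scores:
        raise ValueError("need at least one score")
    ranked = sorted(truth_scores, reverse=True)  # best scores first
    keep = math.ceil(len(ranked) * tranche_percent / 100.0)
    keep = max(1, min(keep, len(ranked)))
    return ranked[keep - 1]

def passes(score, cutoff):
    """A variant passes when its score is at or above the cutoff."""
    return score >= cutoff

# Made-up scores for variants that overlap the resource VCFs.
scores = [4.2, 3.9, 3.1, 2.8, 2.0, 1.5, 0.7, -0.4, -1.3, -5.0]
cutoff = tranche_cutoff(scores, 90.0)
print(cutoff)
```

In this simplified picture, raising the tranche percentage lowers the cutoff, so more variants pass: sensitivity goes up at the cost of letting through more artifacts.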

## Evaluate the 1D Model 

In [None]:
!gatk Concordance \
-truth gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \
-eval /home/jupyter/notebooks/CNN/Output/my_1d_filtered.vcf \
-L 20:1000000-9467292 \
-S /home/jupyter/notebooks/CNN/Output/my_1d_filtered_concordance.txt


## Evaluate the unfiltered VCF

In [None]:
!gatk Concordance \
-truth gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \
-eval gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_15871.vcf.gz \
-L 20:1000000-9467292 \
-S /home/jupyter/notebooks/CNN/Output/unfiltered_concordance.txt


**Now look at how precision goes up (and sensitivity goes down) as we filter.**

In [None]:
!cat /home/jupyter/notebooks/CNN/Output/unfiltered_concordance.txt

In [None]:
!cat /home/jupyter/notebooks/CNN/Output/my_1d_filtered_concordance.txt
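
As a reminder of how these two metrics are conventionally defined, the sketch below computes precision and sensitivity from counts of true positives (TP), false positives (FP), and false negatives (FN). The counts are made up and were chosen only to show the direction of the trade-off; they are not results from this tutorial.

```python
def precision(tp, fp):
    """Fraction of called variants that are present in the truth set."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def sensitivity(tp, fn):
    """Fraction of truth-set variants that were called (also known as recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up counts for illustration only.
unfiltered = {"tp": 950, "fp": 200, "fn": 50}
filtered = {"tp": 940, "fp": 60, "fn": 60}

for name, c in (("unfiltered", unfiltered), ("filtered", filtered)):
    p = precision(c["tp"], c["fp"])
    s = sensitivity(c["tp"], c["fn"])
    print(name, round(p, 3), round(s, 3))
```

Filtering removes many false positives, which raises precision, but it also discards a few true variants, which lowers sensitivity.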

# (Try on your own) Run the default 2D model on the VCF with CNNScoreVariants

Due to time constraints, we encourage you to try the following 2D modelling on your own after the workshop.

The process is quite similar for the 2D model, except that we also need to supply a BAM file with DNA read data to CNNScoreVariants. We tell the tool to use the 2D read processing model with the `--tensor-type` argument.

> **The CNNScoreVariants 2D model evaluates a) annotations, b) the reference sequence, and c) the read data from the BAM file.**

Copy and paste the following code into a new code cell block to run it.

```
!gatk CNNScoreVariants \
-I gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/bams/g94982_chr20_1m_10m_bamout.bam \
-V gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf \
-R gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta \
-O /home/jupyter/notebooks/CNN/Output/my_2d_cnn_scored.vcf \
--tensor-type read_tensor \
--transfer-batch-size 8 \
--inference-batch-size 8
```

## Now apply filters to the VCF based on the CNN_2D score with the FilterVariantTranches tool

Copy and paste the following code into a new code cell block to run it.

```
!gatk FilterVariantTranches \
-V /home/jupyter/notebooks/CNN/Output/my_2d_cnn_scored.vcf \
--resource gs://gcp-public-data--broad-references/hg19/v0/1000G_omni2.5.b37.vcf.gz \
--resource gs://gcp-public-data--broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz \
--info-key CNN_2D \
--snp-tranche 95.9 \
--indel-tranche 95.0 \
-O /home/jupyter/notebooks/CNN/Output/my_2d_filtered.vcf \
--invalidate-previous-filters
```

## Evaluate the 2D Model
Now let's evaluate how the filter did by running the Concordance tool.

Copy and paste the following code into a new code cell block to run it.

```
!gatk Concordance \
-truth gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \
-eval /home/jupyter/notebooks/CNN/Output/my_2d_filtered.vcf \
-L 20:1000000-1432828 \
-S /home/jupyter/notebooks/CNN/Output/2d_filtered_concordance.txt
```


## Evaluate the unfiltered VCF

Copy and paste the following code into a new code cell block to run it.

```
!gatk Concordance \
-truth gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/hg001_na12878_b37_truth.vcf.gz \
-eval gs://gatk-tutorials/$WORKSHOP/2-germline/CNNScoreVariants/vcfs/g94982_b37_chr20_1m_895.vcf \
-L 20:1000000-1432828 \
-S /home/jupyter/notebooks/CNN/Output/unfiltered_2d_concordance.txt
```

**Now look at how precision goes up (and sensitivity goes down) as we filter.**

Copy and paste the following code into a new code cell block to run it.

`!cat /home/jupyter/notebooks/CNN/Output/unfiltered_2d_concordance.txt`

Copy and paste the following code into a new code cell block to run it.

`!cat /home/jupyter/notebooks/CNN/Output/2d_filtered_concordance.txt`

**Finally, you can train your own models with the tools CNNVariantWriteTensors and CNNVariantTrain, as long as you have validated VCFs to use as training data.**