# GATK Germline Variant Discovery Tutorial <a class="tocSkip">

**February 2020**  

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image1.png" alt="drawing" width="40%" align="left" style="margin:0px 20px"/> 
<font size="4">The tutorial demonstrates an effective workflow for joint calling germline SNPs and indels in cohorts of multiple samples. The workflow applies to whole genome or exome data. Specifically, the tutorial uses a trio of WG sample snippets to demonstrate HaplotypeCaller's GVCF workflow for joint variant analysis. We use a GenomicsDB database structure, perform a genotype refinement based on family pedigree, and evaluate the effects of refinement.</font>

_This tutorial was last tested with GATK v4.1.4.1 and IGV v2.8.0._
 See [GATK Tool Documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360037224712) for further information on the tools we use.

# Set up your Notebook
## Set cloud environment values
If you opened this notebook and didn't adjust any cloud environment values, now's the time to edit them. Click on the gear icon in the upper right to edit your Cloud Environment form. Set the values as specified below:

| Option | Value |
| ------ | ------ |
| Environment | Default |
| Profile | Custom |
| CPU | 4 |
| Disk size | 100 GB |
| Memory | 15 GB |
| Startup Script | `gs://gatk-tutorials/scripts/gatk_4141.sh` |

Click the "Update" button when you are done, and Terra will begin to create a new cloud environment with your settings. When it is finished, it will pop up asking you to apply the new settings.

## Check kernel type
A kernel is a _computational engine_ that executes the code in the notebook. For this particular notebook, we will be using a Python 3 kernel so we can execute GATK commands using _Python Magic_ (`!`). In the upper right corner of the notebook, just under the Notebook Runtime, it should say `Python3`. If this notebook isn't running a Python 3 kernel, you can switch it by navigating to the Kernel menu and selecting `Change kernel`.

## Set up your files
Your notebook has a temporary folder that exists so long as your cluster is running. To see what files are in your notebook environment at any time, you can click on the Jupyter logo in the upper left corner. 

For this tutorial, we need to copy files between this temporary folder and our workspace bucket. Run the commands below to set up environment variables and the file paths inside your notebook.

<font color = "green"> **Tool Tip:** To run a cell in a notebook, press `SHIFT + ENTER`</font>

In [None]:
# Set your workspace bucket variable for this notebook.
import os
BUCKET = os.environ['WORKSPACE_BUCKET']

# Set workshop variable to access the most recent materials
WORKSHOP = "workshop_2002"

In [None]:
# Create directories for your files to live inside this notebook
! mkdir -p /home/jupyter/notebooks/2-germline-vd/sandbox/
! mkdir -p /home/jupyter/notebooks/2-germline-vd/ref
! mkdir -p /home/jupyter/notebooks/2-germline-vd/resources
! mkdir -p /home/jupyter/notebooks/2-germline-vd/gvcfs
! mkdir -p /home/jupyter/notebooks/CNN/Output/

## Check data permissions
For this tutorial, we have hosted the starting files in a public Google bucket. We will first check that the data is available to your user account, and if it is not, we simply need to install Google Cloud Storage.

In [None]:
# Check if data is accessible. The command should list several gs:// URLs.
! gsutil ls gs://gatk-tutorials/$WORKSHOP/2-germline/

In [None]:
# If you do not see gs:// URLs listed above, uncomment the last line in this cell
# and run it to install Google Cloud Storage. 
# Afterwards, restart the kernel with Kernel > Restart.
#! pip install google-cloud-storage

## Download Data to the Notebook 
Some tools are not able to read directly from a Google bucket, so we download their files to our local notebook folder.

In [None]:
! gsutil cp gs://gatk-tutorials/$WORKSHOP/2-germline/ref/* /home/jupyter/notebooks/2-germline-vd/ref
! gsutil cp gs://gatk-tutorials/$WORKSHOP/2-germline/trio.ped /home/jupyter/notebooks/2-germline-vd/
! gsutil cp gs://gatk-tutorials/$WORKSHOP/2-germline/resources/* /home/jupyter/notebooks/2-germline-vd/resources/
! gsutil cp gs://gatk-tutorials/$WORKSHOP/2-germline/gvcfs/* /home/jupyter/notebooks/2-germline-vd/gvcfs/

## Set up Integrative Genomics Viewer (IGV)
We will be using IGV in this tutorial to view BAM and VCF files. To do so without downloading each individual file, we will connect IGV with our Google account.
- [Download IGV](https://software.broadinstitute.org/software/igv/download) to your local machine if you haven't already done so.
- Follow [these instructions](https://googlegenomics.readthedocs.io/en/latest/use_cases/browse_genomic_data/igv.html) to connect your Google account to IGV.


-----------------------------------------------------------------------------------------------------------

# Call variants with HaplotypeCaller in default VCF mode
In this first step we run HaplotypeCaller in its simplest form on a single sample to get familiar with its operation and to learn some useful tips and tricks.  

For this command and further commands in this tutorial, we will be working with data from the CEUTrio. The mother (NA12878) is used for our first command and is the most sequenced individual in the world. As such, she makes for a great case study to demonstrate how variant calling works, because we can verify the results against a larger body of knowledge. We will also be working with a father (NA12877) and a son (NA12882) in later portions of this notebook.


In [None]:
! gatk HaplotypeCaller \
    -R gs://gatk-tutorials/$WORKSHOP/2-germline/ref/ref.fasta \
    -I gs://gatk-tutorials/$WORKSHOP/2-germline/bams/mother.bam \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/motherHC.vcf \
    -L 20:10,000,000-10,200,000

In [None]:
# copy files from your notebook sandbox to your workspace bucket sandbox
! gsutil cp /home/jupyter/notebooks/2-germline-vd/sandbox/* $BUCKET/sandbox

Open IGV and <font color=red>set the genome to b37</font>. It is important you do this first, as changing the genome later will require you to reopen all files you may have already loaded into the program. 

Load the input BAM (mother.bam) and output VCF (motherHC.vcf), both printed below, in IGV and go to the coordinates **20:10,002,294-10,002,623**.

In [None]:
# prints out the file paths you will need to open in IGV
! echo gs://gatk-tutorials/$WORKSHOP/2-germline/bams/mother.bam
! echo $BUCKET/sandbox/motherHC.vcf

We see that HaplotypeCaller called a homozygous variant insertion of three T bases. How is this possible when so few reads seem to support an insertion at this position? When you encounter indel-related weirdness, turn on the display of soft-clips, which IGV turns off by default. Go to View > Preferences > Alignments and select “Show soft-clipped bases”.

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image1-IGVDesktop.png" alt="drawing" width="60%"/>

With soft clip display turned on, the region lights up with mismatching bases. **For these reads, the aligner (BWA MEM in our case) found the penalty of soft-clipping mismatching bases less than the penalty of inserting bases or inserting a gap.**

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image2-IGVDesktop.png" alt="drawing" width="100%"/>
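Soft clips are recorded in each read's CIGAR string as `S` operations. As a rough illustration (this is not part of GATK or IGV), the Python sketch below counts the soft-clipped bases in a CIGAR string. The CIGAR strings are invented for the example and do not come from mother.bam.

```python
import re

# Matches one CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_bases(cigar):
    """Return the total number of soft-clipped (S) bases in a CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op == "S")

# Hypothetical CIGAR strings, invented for illustration only.
for cigar in ["101M", "85M16S", "12S80M3I6M"]:
    print(cigar, soft_clipped_bases(cigar))
```

A read with many soft-clipped bases next to a suspected indel is a hint that the aligner preferred clipping over opening a gap.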

<font color=green>**Tool Tip**</font>

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image3-IGVDesktop.png" alt="drawing" width="25px" align=left style="margin:20px 10px">

By default, IGV shows details of each read when you hover over them. To change this, click on the yellow box icon in the top bar and select `Show Details on Click`.



## View realigned reads and assembled haplotypes
Let's take a peek under the hood of HaplotypeCaller. HaplotypeCaller has a parameter called `-bamout`, which allows you to ask for the realigned reads. **These realigned reads are what HaplotypeCaller uses to make its variant calls**, so you will be able to see if a realignment fixed the messy region in the original bam.

Run the following command:

In [None]:
! gatk HaplotypeCaller \
    -R gs://gatk-tutorials/$WORKSHOP/2-germline/ref/ref.fasta \
    -I gs://gatk-tutorials/$WORKSHOP/2-germline/bams/mother.bam \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/motherHCdebug.vcf \
    -bamout /home/jupyter/notebooks/2-germline-vd/sandbox/motherHCdebug.bam \
    -L 20:10,002,000-10,003,000

In [None]:
# copy files from your notebook sandbox to your workspace bucket sandbox
! gsutil cp /home/jupyter/notebooks/2-germline-vd/sandbox/* $BUCKET/sandbox

Load the output BAM (motherHCdebug.bam) in IGV, and switch to Collapsed view (right-click>Collapsed). You should still be zoomed in on the same coordinates (**20:10,002,294-10,002,623**), and have the mother.bam track loaded for comparison.

In [None]:
# prints out the file paths you will need to open in IGV
! echo $BUCKET/sandbox/motherHCdebug.bam

Since we are only interested in looking at that messy region, we gave the tool a narrowed interval with `-L 20:10,002,000-10,003,000`. This is why the reads seem to sharply cut off when you compare the original BAM with the realigned BAM.
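If you want to double-check that a viewing window falls inside the interval you passed with `-L`, a small helper like the one below can do it. This is a sketch, not GATK code: the function names are made up, and it assumes the simple `contig:start-end` form used in this notebook, with commas as thousands separators and 1-based inclusive coordinates.

```python
def parse_interval(interval):
    """Split a 'contig:start-end' string into (contig, start, end).

    Commas in the coordinates are treated as thousands separators and removed.
    """
    contig, _, span = interval.partition(":")
    start, _, end = span.replace(",", "").partition("-")
    return contig, int(start), int(end)

def contains(outer, inner):
    """True if interval `inner` lies entirely within interval `outer`."""
    o_contig, o_start, o_end = parse_interval(outer)
    i_contig, i_start, i_end = parse_interval(inner)
    return o_contig == i_contig and o_start <= i_start and i_end <= o_end

calling = "20:10,002,000-10,003,000"   # the -L interval used above
igv_view = "20:10,002,294-10,002,623"  # the IGV coordinates used above

contig, start, end = parse_interval(calling)
print(contig, end - start + 1)          # contig name and interval length in bases
print(contains(calling, igv_view))      # is the IGV window inside the calling interval?
```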

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image4-IGVDesktop.png" alt="drawing" width="100%"/>

After realignment by HaplotypeCaller (the bottom track), almost all the reads show the insertion, and the messy soft clips from the original bam are gone. **HaplotypeCaller will utilize soft-clipped sequences towards realignment**. Expand the reads in the output BAM (right-click>Expanded view), and you can see that all the insertions are in phase with the C/T SNP. 

This shows that HaplotypeCaller found a different alignment after performing its local graph assembly step. The reassembled region provided HaplotypeCaller with enough support to call the indel, which position-based callers like UnifiedGenotyper would have missed.

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image5-IGVDesktop.png" alt="drawing" width="60%"/>

➤ Focus on the insertion locus. **How many different types of insertions do you see?** Which one did HaplotypeCaller call in the VCF? What do you think of this choice?

There is more to a BAM than meets the eye--or at least, what you can see in this view of IGV. Right-click on the motherHCdebug.bam track to bring up the view options menu. **Select Color alignments by, and choose read group.** Your gray reads should now be colored similar to the screenshot below.

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image6-IGVDesktop.png" alt="drawing" width="60%"/>

Some of the first reads, shown in **red at the top of the pile, are not real reads.** These represent artificial haplotypes that were constructed by HaplotypeCaller, and are tagged with a special read group identifier, **RG:Z:ArtificialHaplotypeRG** to differentiate them from actual reassembled reads. You can click on an artificial read to see this tag under Read Group. 

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image7-IGVDesktop.png" alt="drawing" width="40%" align=left style="margin:0px 20px"/> 

<br>
➤ How does each of the three artificial haplotypes differ from the others? 
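Outside IGV, the same distinction can be made by reading the `RG` tag of each record. The Python sketch below works on SAM-formatted text lines, for example the output of `samtools view` on the bamout file. The two records are heavily shortened and invented for illustration; only the read group name `ArtificialHaplotypeRG` comes from the tutorial.

```python
def read_group(sam_line):
    """Return the RG tag value of a SAM record, or None if it has no RG tag."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags follow the 11 mandatory SAM fields
        if tag.startswith("RG:Z:"):
            return tag[len("RG:Z:"):]
    return None

def is_artificial_haplotype(sam_line):
    """True if the record carries the read group used for assembled haplotypes."""
    return read_group(sam_line) == "ArtificialHaplotypeRG"

# Hypothetical SAM records, invented for illustration only.
records = [
    "HC_hap_1\t0\t20\t10002294\t60\t8M\t*\t0\t0\tACGTACGT\t*\tRG:Z:ArtificialHaplotypeRG",
    "read_1\t99\t20\t10002300\t60\t8M\t=\t10002400\t108\tACGTACGT\tFFFFFFFF\tRG:Z:sample_rg",
]

real_reads = [r for r in records if not is_artificial_haplotype(r)]
print(len(records), "records,", len(real_reads), "real read(s)")
```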

Let's separate these artificial reads to the top of the track. Right click on a read, then select **Sort alignments by**, and choose **base**.

If you click on the purple insertion bars, you can see what call they correspond to. There is a `TTT` insertion and a `TT` insertion. When we sort by base, it will push the reads with evidence for a `TTT` insertion up to the top.


<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image8-IGVDesktop.png" alt="drawing" width="50%" align=left style="margin:20px 20px"/>

Now we will color the reads differently. Right click on a read again and select **Color alignments by**, choose **tag**, and type in **HC**. HaplotypeCaller labels reassembled reads that have unequivocal support for a haplotype (based on likelihood calculations) with an HC tag value that matches the HC tag value of the corresponding haplotype. The gray color on some reads indicates that they could support more than one possible haplotype.



➤ Again, what do you think of HaplotypeCaller's choice to call the three-base insertion instead of the two-base insertion? Is there more evidence for one or the other?

If you zoom out, you will also see the three active regions within the scope of the interval we provided. HaplotypeCaller considered twelve, three, and six putative haplotypes, respectively, for the regions, and performed local reassembly for each of the three regions. 

-----------------------------------------------------------------------------------------------------------

# GVCF workflow

## Run HaplotypeCaller on a single bam file in GVCF mode

It is possible to genotype a multi-sample cohort simultaneously with HaplotypeCaller. However, this scales poorly. **For a scalable analysis, GATK offers the GVCF workflow**, which separates BAM-level variant calling from genotyping. In the GVCF workflow, HaplotypeCaller is run with the `-ERC GVCF` option on each individual BAM file and produces a GVCF, which adheres to VCF format specifications while giving information about the data at every genomic position. GenotypeGVCFs then genotypes the samples in a cohort via the given GVCFs.

Run HaplotypeCaller in GVCF mode on the mother’s bam. This will produce a GVCF file that contains likelihoods for each possible genotype for the variant alleles, including a symbolic `<NON_REF>` allele. You'll see what this looks like soon.

In [None]:
! gatk HaplotypeCaller \
    -R gs://gatk-tutorials/$WORKSHOP/2-germline/ref/ref.fasta \
    -I gs://gatk-tutorials/$WORKSHOP/2-germline/bams/mother.bam \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/mother.g.vcf \
    -ERC GVCF \
    -L 20:10,000,000-10,200,000

In [None]:
# copy files from your notebook sandbox to your workspace bucket sandbox
! gsutil cp /home/jupyter/notebooks/2-germline-vd/sandbox/* $BUCKET/sandbox

**In the interest of time, we have supplied the other sample GVCFs in the bundle, but normally you would run them individually in the same way as the first.**
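As a sketch of what "running them individually in the same way" could look like, the Python snippet below builds one HaplotypeCaller command per sample without executing anything. The flags are the ones used in the cell above. It assumes that `father.bam` and `son.bam` sit next to `mother.bam` in the tutorial bucket, which this notebook does not verify, and the helper name `gvcf_command` is made up.

```python
WORKSHOP = "workshop_2002"
BASE = "gs://gatk-tutorials/" + WORKSHOP + "/2-germline"
SANDBOX = "/home/jupyter/notebooks/2-germline-vd/sandbox"

def gvcf_command(sample, interval="20:10,000,000-10,200,000"):
    """Build (but do not run) the HaplotypeCaller GVCF-mode command for one sample."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", BASE + "/ref/ref.fasta",
        "-I", BASE + "/bams/" + sample + ".bam",   # assumed location for father/son
        "-O", SANDBOX + "/" + sample + ".g.vcf",
        "-ERC", "GVCF",
        "-L", interval,
    ]

# Print the three commands so they can be reviewed before running them.
for sample in ["mother", "father", "son"]:
    print(" ".join(gvcf_command(sample)))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later, once you have confirmed the paths exist.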

Let's take a look at a GVCF in IGV. Start a new session to clear your IGV screen (File>New Session), then load the GVCF for each family member, printed with the command below. Zoom in on **20:10,002,371-10,002,546**.

In [None]:
# prints out the file paths you will need to open in IGV
! echo gs://gatk-tutorials/$WORKSHOP/2-germline/gvcfs/father.g.vcf.gz
! echo $BUCKET/sandbox/mother.g.vcf
! echo gs://gatk-tutorials/$WORKSHOP/2-germline/gvcfs/son.g.vcf.gz

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image9-IGVDesktop.png" alt="drawing" width="100%"/>

Notice anything different from the VCF? Along with the colorful variant sites, you see many gray blocks in the GVCF representing reference confidence intervals. The gray blocks cover stretches where the sample **appears** to be **homozygous reference or invariant**. The likelihoods are evaluated against an abstract non-reference allele, so these are referred to, somewhat **counterintuitively, as NON_REF** blocks of the GVCF. Adjacent blocks belong to different **GVCFBlock** quality bands, each spanning **contiguous sites of similar quality**. 

If we peek into the GVCF file using the command below, we actually see in the ALT column a **symbolic `<NON_REF>` allele, which represents non-called but possible non-reference alleles**. During joint genotyping, the likelihoods against the `<NON_REF>` allele are used to assign likelihoods to alleles that weren’t seen in the current sample. Additionally, for NON_REF blocks, the **INFO field gives the end position** of the homozygous-reference block. The **FORMAT field gives Phred-scaled likelihoods (PL) for each potential genotype** given the alleles, including the NON_REF allele.

Later, the genotyping step will retain only sites that are confidently variant against the reference. 
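To make the block structure concrete before looking at the real file, here is a minimal Python sketch that unpacks one single-sample GVCF data line into its ALT alleles, `END` position, and `PL` values. The record is invented in the style of a reference block and is not taken from mother.g.vcf. The parser assumes exactly one sample column.

```python
def parse_gvcf_record(line):
    """Parse one single-sample GVCF data line into a small dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info, fmt, sample = fields[:10]
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_map = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in info.split(";") if item != "."
    )
    # FORMAT names the ':'-separated values found in the sample column.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    pl = sample_map.get("PL")
    return {
        "chrom": chrom,
        "start": int(pos),
        "end": int(info_map.get("END", pos)),  # blocks carry END; single sites do not
        "ref": ref,
        "alts": alt.split(","),
        "pl": [int(x) for x in pl.split(",")] if pl else None,
    }

# Hypothetical reference-block record, invented for illustration only.
line = "20\t10000000\t.\tT\t<NON_REF>\t.\t.\tEND=10000116\tGT:DP:GQ:MIN_DP:PL\t0/0:44:99:38:0,104,1238"
record = parse_gvcf_record(line)
is_reference_block = record["alts"] == ["<NON_REF>"]
print(record["start"], record["end"], record["pl"], is_reference_block)
```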


In [None]:
!head -n100 /home/jupyter/notebooks/2-germline-vd/sandbox/mother.g.vcf

## Consolidate GVCFs using GenomicsDBImport
For the next step, we need to consolidate the GVCFs into a GenomicsDB datastore. That might sound complicated but it's actually very straightforward.

In [None]:
! rm -rf /home/jupyter/notebooks/2-germline-vd/sandbox/trio

In [None]:
! gatk GenomicsDBImport \
    -V gs://gatk-tutorials/$WORKSHOP/2-germline/gvcfs/mother.g.vcf.gz \
    -V gs://gatk-tutorials/$WORKSHOP/2-germline/gvcfs/father.g.vcf.gz \
    -V gs://gatk-tutorials/$WORKSHOP/2-germline/gvcfs/son.g.vcf.gz \
    --genomicsdb-workspace-path /home/jupyter/notebooks/2-germline-vd/sandbox/trio \
    --intervals 20:10,000,000-10,200,000

For those who cannot use GenomicsDBImport, the alternative is to consolidate GVCFs with CombineGVCFs. Keep in mind, though, that the GenomicsDB intermediate allows you to scale analyses to large cohort sizes efficiently, and to add data incrementally (which is not possible in CombineGVCFs). **Because it's not trivial to examine the data within the database, we will extract the trio's combined data from the GenomicsDB database using SelectVariants.**

In [None]:
# Create a soft link to sandbox.
! rm -f sandbox
! ln -s /home/jupyter/notebooks/2-germline-vd/sandbox/ sandbox

In [None]:
! gatk SelectVariants \
    -R /home/jupyter/notebooks/2-germline-vd/ref/ref.fasta \
    -V gendb://sandbox/trio \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/trio_selectvariants.g.vcf

➤ Take a look inside the combined GVCF. How many samples are represented? What is going on with the genotype field (GT)? What does this genotype notation mean?

In [None]:
! cat /home/jupyter/notebooks/2-germline-vd/sandbox/trio_selectvariants.g.vcf | head -100

## Run joint genotyping on the trio to generate the VCF
The last step is to joint genotype variant sites for the samples using GenotypeGVCFs. 

In [None]:
! gatk GenotypeGVCFs \
    -R /home/jupyter/notebooks/2-germline-vd/ref/ref.fasta \
    -V gendb://sandbox/trio \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/trioGGVCF.vcf \
    -L 20:10,000,000-10,200,000

In [None]:
# copy files from your notebook sandbox to your workspace bucket sandbox
! gsutil cp /home/jupyter/notebooks/2-germline-vd/sandbox/* $BUCKET/sandbox

The calls made by GenotypeGVCFs and HaplotypeCaller run in multisample mode should mostly be equivalent, especially as cohort sizes increase. However, there can be some marginal differences in borderline calls, i.e. low-quality variant sites, in particular for small cohorts with low coverage. For such cases, joint genotyping directly with HaplotypeCaller and/or using the new quality score model with GenotypeGVCFs (turned on with `-new-qual`) may be preferable.

```
gatk HaplotypeCaller \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -I bams/father.bam \
    -I bams/son.bam \
    -O sandbox/trio_hcjoint_nq.vcf \
    -L 20:10,000,000-10,200,000 \
    -new-qual \
    -bamout sandbox/trio_hcjoint_nq.bam
```

In the interest of time, we do not run the above command. Note the BAMOUT will contain reassembled reads for all the input samples. 

Let's circle back to the locus we examined at the start. Load sandbox/trioGGVCF.vcf into IGV and navigate to <b>20:10,002,376-10,002,550</b>.

In [None]:
# prints out the file paths you will need to open in IGV
! echo $BUCKET/sandbox/trioGGVCF.vcf

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image10-IGVDesktop.png" alt="drawing" width="50%"/>

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image11-IGVDesktop.png" alt="drawing" width="20%" align=right style="margin:20px 20px"/>

Take a look at the father's genotype call at **20:10002458** (the leftmost variant call). Knowing the familial relationship for the three samples and the child's homozygous-variant genotype, what do you think about the father's HOM_REF call?
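The inheritance rule behind this question can be written down in a few lines. The Python sketch below checks whether a child's diploid genotype can be formed from one paternal and one maternal allele. It is a simplified illustration: it ignores de novo mutations and sex chromosomes, and since the mother's genotype at this site is not stated here, the example tries two possibilities.

```python
from itertools import product

def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed by taking one allele
    from the father and one allele from the mother."""
    child_sorted = sorted(child)
    return any(sorted(pair) == child_sorted for pair in product(father, mother))

# Alleles are coded as in the VCF GT field: 0 = reference, 1 = first ALT allele.
father = (0, 0)  # HOM_REF call for the father
child = (1, 1)   # homozygous-variant call for the son

# The mother's genotype is assumed here; the outcome is the same either way.
for mother in [(0, 1), (1, 1)]:
    print(mother, is_mendelian_consistent(father, mother, child))
```

A homozygous-variant child needs a variant allele from each parent, so a HOM_REF father is inconsistent regardless of the mother's genotype.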

Results from GATK v4.0.1.0 also show HOM_REF but give PLs (Phred-scaled likelihoods) of 0,0,460. Changes since then improve hom-ref GQs near indels in GVCFs, as seen in the results from GATK v4.1.1.0 in the picture on the right. The table below shows this is an ambiguous site for other callers as well. 

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image3.png" alt="drawing" width="60%" align=left style="margin:0px 20px"/> 

If you recall the pipeline at the top of this notebook, you'll remember that there are several post-processing steps after we get raw variant calls. It's possible that those filtering steps would improve the call to resolve the Mendelian inheritance violation, but we don't have time to look into that further today.

Now let's take a look at the father's other variant call at **20:10002470** (the rightmost one). It also doesn't follow familial inheritance rules, but if you click on that site you'll notice something very interesting. The genotype is marked as `./.` and the PL is `0,0,0`. This indicates that HaplotypeCaller emitted **no call** at that location--it did not find evidence for either a reference call or a variant call for the father.
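One way to see why PLs such as `0,0,0` or `0,0,460` carry no confidence is to look at how genotype quality relates to them. As commonly described for GATK output, GQ is the gap between the best and second-best PL, capped at 99. The Python sketch below applies that rule. Treat it as an approximation of the convention rather than GATK's actual code.

```python
def genotype_quality(pls, cap=99):
    """Approximate GQ as the gap between the two smallest PLs, capped at `cap`."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)

# PL triples mentioned in this section, plus one confident example.
for pls in [[0, 0, 0], [0, 0, 460], [0, 104, 1238]]:
    print(pls, genotype_quality(pls))
```

When two genotypes tie for the lowest PL, the gap is zero, so the data alone cannot separate them. That is the situation where priors from pedigree or population data can help.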

This is a great candidate for genotype refinement.

------------------------------------------------------------------------------------

# Genotype Refinement

## Refine the genotype calls with CalculateGenotypePosteriors
If you are running this notebook as a part of the GATK workshop series, then you will shortly hear more about Genotype Refinement. The basic principle is that we can systematically refine our calls for the trio using a tool called CalculateGenotypePosteriors. For starters, we can use pedigree information, which is provided in the trio.ped file. Second, we can use population priors; we use a population allele frequencies resource derived from gnomAD.
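Pedigree files conventionally use six whitespace-separated columns: family ID, individual ID, father ID, mother ID, sex, and phenotype, with `0` marking an unknown parent. The Python sketch below parses such a file and lists the trios it describes. The contents of trio.ped are not shown in this notebook, so the text below is an assumed example that reuses the sample names from this tutorial; the family ID `FAM01` is invented.

```python
from collections import namedtuple

PedRecord = namedtuple("PedRecord", "family individual father mother sex phenotype")

def parse_ped(text):
    """Parse PED-formatted text into a list of PedRecord tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        records.append(PedRecord(*line.split()[:6]))
    return records

def trios(records):
    """Return (child, father, mother) for individuals with both parents listed."""
    return [
        (r.individual, r.father, r.mother)
        for r in records
        if r.father != "0" and r.mother != "0"
    ]

# Assumed example content; the real trio.ped may differ.
PED_TEXT = """\
FAM01 NA12877 0 0 1 -9
FAM01 NA12878 0 0 2 -9
FAM01 NA12882 NA12877 NA12878 1 -9
"""

print(trios(parse_ped(PED_TEXT)))
```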

In [None]:
! gatk CalculateGenotypePosteriors \
    -V /home/jupyter/notebooks/2-germline-vd/sandbox/trioGGVCF.vcf \
    -ped /home/jupyter/notebooks/2-germline-vd/trio.ped \
    --skip-population-priors \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/trioCGP.vcf

In [None]:
! gatk CalculateGenotypePosteriors \
    -V /home/jupyter/notebooks/2-germline-vd/sandbox/trioGGVCF.vcf \
    -ped /home/jupyter/notebooks/2-germline-vd/trio.ped \
    --supporting-callsets /home/jupyter/notebooks/2-germline-vd/resources/af-only-gnomad.chr20subset.b37.vcf.gz \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/trioCGP_gnomad.vcf

In [None]:
# copy files from your notebook sandbox to your workspace bucket sandbox
! gsutil cp /home/jupyter/notebooks/2-germline-vd/sandbox/* $BUCKET/sandbox

In [None]:
# prints out the file paths you will need to open in IGV
! echo $BUCKET/sandbox/trioCGP.vcf
! echo $BUCKET/sandbox/trioCGP_gnomad.vcf

<img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image12-IGVDesktop.png" alt="drawing" width="30%" align=right style="margin:0px 20px"/> Add both sandbox/trioCGP.vcf and sandbox/trioCGP_gnomad.vcf to the IGV session. 

➤ What has changed? What has not changed?

You'll notice that the difficult-to-call site on the left (position 10002458) hasn't adjusted its calls at all, but it has become a lot less confident in its call. Compare the GQ values, and you'll see that its confidence in that G/G call is much lower: from 42 to 2. A GQ of 2 is a site that would certainly be filtered out in post-processing.

CalculateGenotypePosteriors adds the Phred-scaled Posterior Probability (**PP**), which basically refines the PL values. It incorporates the prior expectations for the given pedigree and/or population allele frequencies. Compare the PP and PL of the final gnomAD file, and you'll see that there was another genotype it ranked at a likelihood of 4. This means this site is pretty closely torn between 3 different possible genotypes.
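The general idea of turning likelihoods into posteriors can be sketched with plain Bayes' rule. The Python snippet below is a generic illustration and not the model CalculateGenotypePosteriors implements: it converts PLs back to relative likelihoods, multiplies by a set of genotype priors, normalizes, and rescales to Phred so that the best genotype scores 0. The prior values are invented.

```python
import math

def phred_posteriors(pls, priors):
    """Combine Phred-scaled likelihoods with genotype priors via Bayes' rule.

    Returns Phred-scaled posteriors, shifted so the most probable genotype is 0.
    All priors must be greater than zero.
    """
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    unnormalized = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(unnormalized)
    phred = [-10 * math.log10(u / total) for u in unnormalized]
    best = min(phred)
    return [round(x - best) for x in phred]

pls = [0, 0, 0]               # uninformative likelihoods, as at the no-call site
priors = [0.90, 0.09, 0.01]   # hypothetical priors for hom-ref, het, hom-var

print(phred_posteriors(pls, priors))
```

With flat likelihoods, the posterior simply mirrors the prior, which is why an uninformative PL of `0,0,0` can still end up with a confident refined genotype.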

On the other hand, the ambiguous site on the right (position 10002470) was improved as we had predicted! With information from both the population priors and pedigree data, the father's new variant call at that site is `HOM_REF` with a genotype quality of 72. This is now a confident `HOM_REF` call.

The PL stays the same, still calling 0,0,0. If you add to the cohort again in the future, you'll be able to re-evaluate. In our case, it looks like CalculateGenotypePosteriors found that the population, including the family, had a high frequency for the T allele at this site.

 <img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image13-IGVDesktop.png" alt="drawing" width="43%" align=left style="margin:0px 20px"/> <img src="https://storage.googleapis.com/gatk-tutorials/images/2-germline/vd-image14-IGVDesktop.png" alt="drawing" width="43%" align=left style="margin:0px 20px"/> 

You can learn more about the Genotype Refinement workflow [here](https://software.broadinstitute.org/gatk/documentation/article?id=11074).  

## Compare changes with CollectVariantCallingMetrics 
There are a few different GATK/Picard tools to compare site-level and genotype-level concordance. If you are in a GATK workshop while running this tutorial, you will see the presentation soon. If you aren't, here's a quick summary. `CollectVariantCallingMetrics` collects summary and per-sample metrics about variant calls in a VCF file. We are going to compare our callsets before and after Genotype Refinement to see if we've improved them overall.

In [None]:
! gatk CollectVariantCallingMetrics \
    -I /home/jupyter/notebooks/2-germline-vd/sandbox/trioGGVCF.vcf \
    --DBSNP /home/jupyter/notebooks/2-germline-vd/resources/dbsnp.vcf \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/trioGGVCF_metrics

In [None]:
! cat /home/jupyter/notebooks/2-germline-vd/sandbox/trioGGVCF_metrics.variant_calling_detail_metrics | grep -v "##" | grep -v "#" | cut -f1,6,11,13,18 


In [None]:
! gatk CollectVariantCallingMetrics \
    -I /home/jupyter/notebooks/2-germline-vd/sandbox/trioCGP.vcf \
    --DBSNP /home/jupyter/notebooks/2-germline-vd/resources/dbsnp.vcf \
    -O /home/jupyter/notebooks/2-germline-vd/sandbox/trioCGP_metrics

In [None]:
! cat /home/jupyter/notebooks/2-germline-vd/sandbox/trioCGP_metrics.variant_calling_detail_metrics | grep -v "##" | grep -v "#" | cut -f1,6,11,13,18 


CollectVariantCallingMetrics produces both summary and detail metrics. The summary metrics provide cohort-level variant metrics, while the detail metrics segment the variant metrics for each sample in the callset, and add a few more fields. (You can read about all metrics more in-depth [here](https://broadinstitute.github.io/picard/picard-metric-definitions.html).)

For our purposes, we have subset the detailed metrics to a smaller number of columns to discuss here.
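Selecting columns by position with `cut` works, but selecting by name is easier to read. The Python sketch below does that for a Picard-style metrics table: it skips blank lines and lines starting with `#`, treats the first remaining line as the header, and keeps only the requested columns. The metrics text and its values are invented, and the sketch assumes the file holds a single table with no histogram section.

```python
import csv
import io

def read_metrics_table(text):
    """Return the table of a Picard-style metrics file as a list of dicts."""
    rows = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(rows)), delimiter="\t"))

def select_columns(rows, columns):
    """Keep only the named columns from each row."""
    return [{name: row[name] for name in columns} for row in rows]

# Hypothetical metrics content with made-up values, for illustration only.
METRICS_TEXT = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# CollectVariantCallingMetrics (example header line)",
    "",
    "## METRICS CLASS\texample",
    "SAMPLE_ALIAS\tTOTAL_SNPS\tDBSNP_TITV\tTOTAL_INDELS\tDBSNP_INS_DEL_RATIO\tOTHER",
    "sampleA\t100\t2.0\t20\t1.0\tx",
    "sampleB\t110\t1.9\t25\t0.9\ty",
])

wanted = ["SAMPLE_ALIAS", "TOTAL_SNPS", "TOTAL_INDELS", "DBSNP_TITV", "DBSNP_INS_DEL_RATIO"]
for row in select_columns(read_metrics_table(METRICS_TEXT), wanted):
    print(row)
```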

**Total SNPS and Total INDELS**
Comparing the two files, you will see that we recovered 3 SNP sites in the mother (NA12878) and 4 in the father (NA12877). We also recovered indels for each of the three samples.

**DBSNP_TITV**
This column shows the transition (Ti) to transversion (Tv) ratio. In whole-genome samples, we expect the ratio to be between 2 and 2.1. We see a slight improvement in this ratio for the father (NA12877) and the mother (NA12878), but they are still below 2, which could indicate a higher rate of false positives in the callset. Further filtering would improve this score.
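For reference, a transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T), and every other single-base change is a transversion. The Python sketch below computes the ratio from a list of (REF, ALT) pairs. The SNP list is invented, and the function assumes biallelic SNPs where REF and ALT differ.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def titv_ratio(snps):
    """Ratio of transitions to transversions for a list of (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# Hypothetical SNPs, invented for illustration only.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(titv_ratio(snps))
```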

**DBSNP_INS_DEL_RATIO**
This column shows the ratio of insertions to deletions, which we expect to be about 1 for common variant studies. As we haven't specifically picked these samples to diagnose a rare disease, common variation has equal selective pressure on insertion and deletion events, so we find those to be about even. In rare disease, we often see a ratio of 0.2-0.5. Our results show that the numbers stay about the same or increase very slightly. It's not a strong indicator of success, but it does show that we didn't completely imbalance the callset by applying Genotype Refinement to fix those two odd sites.