# GATK TUTORIAL :: Mutect2 Basics :: Notebook

March 2019


In this hands-on tutorial, we will call somatic short mutations, both single nucleotide variants and indels, using GATK4 Mutect2 and FilterMutectCalls.

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/BP_somatic_snps_indels_4_1_00.png" width="900"/>



If you need a primer on what somatic calling is about, see GATK forum Article#11127 at <https://software.broadinstitute.org/gatk/documentation/article?id=11127>.

---
The tutorial was last tested with the GATK4.1.0.0 Docker and IGV v2.4.15. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and are known as HCC1143 and HCC1143_BL (blood normal), respectively. The data are 2x76 paired-end whole exome sequences aligned to GRCh38.



Keyboard shortcuts:

 - `SHIFT + ENTER` or `CTRL + ENTER` to evaluate a cell
 - `ESC` to return to navigation mode
 - `y` to turn a markdown cell into code
 - `m` to turn a code cell into markdown
 - `a` to add a new cell **above** the currently selected cell
 - `b` to add a new cell **below** the currently selected cell
 - `d, d` (repeated) to delete the currently selected cell
 - `TAB` to activate code completion
 
To try this out, create a new cell below this one using `b`, and print `my_variable` by starting with `print(my` and pressing `TAB`!

## Setup Instructions for this Tutorial

### 1.) Make sure the notebook is using a Python 3 kernel in the top right corner.

A kernel is a computational engine that executes the code in the notebook. We can execute GATK bash commands from a Python cell by prefixing them with `!` (Python Magic).

Unlike the other tutorials, this one requires file uploads to our IGV Browser, so we need to make the files visible to our Google bucket. This notebook is running inside a Docker container on the cluster, so any files that we create here are not visible to the bucket.

After we run our analysis, we will copy our Jupyter files into the bucket so we can use the IGV viewer.


### Launch the Jupyter Notebook Cluster in Terra using gatk_script.sh

If you are reading this, you have already started a notebook cluster.

It is okay to reprogram the cluster to have settings that you need to do the work (such as more memory, etc.)  

Navigate to the "Notebook Runtime" tab in the upper corner and follow the images below for resetting the cluster.  

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/Cluster1.png" width="600"/>

The complete link to paste into the url box is:

```
gs://gatk-tutorials/scripts/install_gatk_4100.sh

```




<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/Cluster3.png" width="600"/>

It will take a few minutes to create the cluster. Click "Apply" once it is available.


### 2.)  Find the name of the google bucket.


<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/mutect2-image2.png" width="900"/>

Replace the bucket name in the following command with your bucket name:

```
bucket="gs://fc-your-bucket-name"

```



In [None]:
bucket = "gs://fc-your-bucket-name"

Check to make sure that your variable loaded correctly.  We will use this at the end to load a file we have created into IGV.

In [None]:
!echo $bucket

### Check if data is accessible. The command should list several gs:// URLs.


In [None]:
!gsutil ls gs://gatk-tutorials/workshop_1903/3-somatic/

If you do not see gs:// URLs listed above, create a new cell below and run this command to install Google Cloud Storage. 
Afterwards, restart the kernel with Kernel > Restart.

```
! pip install google-cloud-storage
```

### Download Data to the Jupyter Notebooks in this Workspace

The files can be called directly from the tutorial bucket, but to avoid accidentally overwriting other people's work, we will take the extra precaution of bringing the files into our local jupyter notebook.



In [None]:
!gsutil -m cp -r gs://gatk-tutorials/workshop_1903/3-somatic/bams /home/jupyter-user/
!gsutil -m cp -r gs://gatk-tutorials/workshop_1903/3-somatic/ref /home/jupyter-user/
!gsutil -m cp -r gs://gatk-tutorials/workshop_1903/3-somatic/resources /home/jupyter-user/
!gsutil -m cp -r gs://gatk-tutorials/workshop_1903/3-somatic/mutect2_precomputed /home/jupyter-user/

In [None]:
!mkdir -p /home/jupyter-user/sandbox

Check to see if the files are now in our local notebook directory.

In [None]:
!ls /home/jupyter-user

### Test to see if gatk is installed on your notebook cluster already
```
!gatk --list
```

In [None]:
!gatk --list

### Use these instructions ONLY IF gatk did not install after waiting for the cluster to reload (1.1.2)

If the previous command reported that gatk was not found, download and unzip a precompiled copy of GATK4.1.

```
! wget -nc -P /home/jupyter-user/ https://github.com/broadinstitute/gatk/releases/download/4.1.0.0/gatk-4.1.0.0.zip
```


In [None]:
#ONLY RUN THIS IF GATK DID NOT LOAD OR IF YOU WANT TO RUN THE TUTORIAL LOCALLY
#! wget -nc -P /home/jupyter-user/ https://github.com/broadinstitute/gatk/releases/download/4.1.0.0/gatk-4.1.0.0.zip



Unzip GATK4.1

```
! if [ ! -d /home/jupyter-user/gatk-4.1.0.0 ] ; then unzip -o /home/jupyter-user/gatk-4.1.0.0.zip -d /home/jupyter-user/ ; fi

```


In [None]:
#ONLY RUN THIS IF GATK DID NOT LOAD OR IF YOU WANT TO RUN THE TUTORIAL LOCALLY
#!unzip -o /home/jupyter-user/gatk-4.1.0.0.zip -d /home/jupyter-user/



Set GATK variable 

```
gatk="/home/jupyter-user/gatk-4.1.0.0/gatk"

```

In [None]:
#ONLY RUN THIS IF GATK DID NOT LOAD OR IF YOU WANT TO RUN THE TUTORIAL LOCALLY
#gatk="/home/jupyter-user/gatk-4.1.0.0/gatk"



**PLEASE NOTE THAT IF YOU HAD TO INSTALL GATK MANUALLY, THE FOLLOWING COMMANDS WILL REQUIRE \"\$\" IN FRONT OF GATK**


For example, if gatk was already loaded:


```

!gatk --list

```

provides a list of available programs.

If a local install was necessary:

```

!$gatk --list

```

is the correct command



# CALL SOMATIC SNV & INDELS WITH MUTECT2

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/BP_somatic_snps_indels_4_1_00.png" width="900"/>





## Call somatic SNVs and indels and generate a BAMOUT



We start by calling somatic short mutations on our HCC1143 tumor sample and matched normal using Mutect2.




```

# Notes on the arguments below (kept out of the command so the line continuations work):
#  - Mutect2 may input more tumor samples from the same individual (additional -I)
#  - Mutect2 may input matched normals from the same individual (-I with -normal)
#  - Mutect2 may input a panel of normals (-pon) to help identify technical artifacts
#  - For most purposes Mutect2 should be supplied with gnomAD (--germline-resource)
#  - Mutect2 may input parameters learned by CollectF1R2Counts and
#    LearnReadOrientationModel for one or more tumor samples (not set currently):
#      --orientation-bias-artifact-priors tumor1-artifact-prior.tsv
!gatk Mutect2 \
      -R /home/jupyter-user/ref/Homo_sapiens_assembly38.fasta \
      -I /home/jupyter-user/bams/tumor.bam \
      -I /home/jupyter-user/bams/normal.bam \
      -tumor HCC1143_tumor \
      -normal HCC1143_normal \
      -pon /home/jupyter-user/resources/chr17_m2pon.vcf.gz \
      --germline-resource /home/jupyter-user/resources/chr17_af-only-gnomad_grch38.vcf.gz \
      -L /home/jupyter-user/resources/chr17plus.interval_list \
      -O /home/jupyter-user/sandbox/1_somatic_m2.vcf.gz \
      -bamout /home/jupyter-user/sandbox/2_tumor_normal_m2.bam

```

This command produces a raw unfiltered somatic callset restricted to the specified intervals list plus a BAM containing reassembled alignments. 

In [None]:
!gatk Mutect2 \
     -R /home/jupyter-user/ref/Homo_sapiens_assembly38.fasta \
     -I /home/jupyter-user/bams/tumor.bam \
     -I /home/jupyter-user/bams/normal.bam \
     -tumor HCC1143_tumor \
     -normal HCC1143_normal \
     -pon /home/jupyter-user/resources/chr17_m2pon.vcf.gz \
     --germline-resource /home/jupyter-user/resources/chr17_af-only-gnomad_grch38.vcf.gz \
     -L /home/jupyter-user/resources/chr17plus.interval_list \
     -O /home/jupyter-user/sandbox/1_somatic_m2.vcf.gz \
     -bamout /home/jupyter-user/sandbox/2_tumor_normal_m2.bam




*  GATK4.0.2.1+ Mutect2 disables the `MateOnSameContigOrNoMappedMateReadFilter` by default. Previous versions of the tutorial add a parameter to disable this read filter.

* GATK4.0.4.0+ instantiates different default values for the `--af-of-alleles-not-in-resource` parameter depending on mode. By default, case-only calling now uses 5e-8 and matched-control calling uses 1e-5. Previously, the default was 0.001, the average heterozygosity of humans, and the recommendation was to change this to 1/(2\*samples in resource), which is 2.5e-6 for our particular resource.

Mutect2 skips sites that are likely variant in the matched control (germline) and sites in the PoN (likely artifactual). It still includes borderline variant sites. If needed, you can include all sites with `--genotype-germline-sites` and `--genotype-pon-sites`.

The tool considers germline resource alleles. Namely, the tool annotates variant alleles with the germline resource population allele frequencies or, for alleles absent from the resource, with the frequency defined by `--af-of-alleles-not-in-resource`. Downstream filtering uses these values.





**➤ What is the value of using a matched normal control?**





### A. Mutect2 uses the matched normal to additionally exclude rare germline variation not captured by the germline resource and individual-specific artifacts. 


<img style="float: right;" src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/Matchednormal.png" width="300"/>

To illustrate, suppose we run our tumor sample through Mutect2 without the matched normal (we do not recommend this). We get 48,371 calls, an order of magnitude more calls than with the matched normal. Yikes. Note that without a matched normal, the tool skips any allele that corresponds to a population allele frequency of 0.001 or higher, so the count excludes common germline variant sites. The 0.001 is the average heterozygosity of humans. Basically, knowing nothing at all about a site, this is a rough prior probability that there is a germline event there.

### Mutect2 uses a germline population resource towards evidence of alleles being germline.

The simplified sites-only [gnomAD](http://gnomad.broadinstitute.org/) resource retaining allele-specific frequencies is available at <ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2>. It represents germline calls from ~200K exomes and ~16K genomes. Mutect2 uses the allele-specific frequency in its likelihood calculations for a variant being of somatic origin. In our example, the gnomAD resource `af-only-gnomad_grch38.vcf.gz` represents ~200k exomes and the tutorial data is exome data, so we use the tool default of 1e-5 for `--af-of-alleles-not-in-resource`, which is roughly in the same ballpark as 0.0000025 (1/(2*exome samples)). It's not an exact formula since you're just guessing allele frequency based on its absence in gnomAD. This value is an expected imputed value and not a bound.
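The rule of thumb quoted above, 1/(2 × samples in resource), is easy to check by hand. The following is a minimal sketch of that arithmetic only; the function name is made up for illustration and the sample count is the approximate figure given in the text.

```python
def af_not_in_resource(n_samples: int) -> float:
    """Rule-of-thumb frequency for an allele absent from a resource
    built from n_samples diploid individuals: 1 / (2 * n_samples)."""
    return 1.0 / (2 * n_samples)


# Approximate number of gnomAD exomes quoted in the text (an estimate, not exact).
exome_samples = 200_000

estimate = af_not_in_resource(exome_samples)
print(f"rule of thumb: {estimate:.1e}")   # 2.5e-06
print(f"tool default:  {1e-5:.1e}")       # default for matched-control calling
```

The two values differ by a factor of four, which is what the text means by "roughly in the same ballpark".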



### A panel of normals (PoN) has a vital role that fills a gap between the matched normal and the population resource. 


Mutect2 uses the PoN to catch additional sites of noise in sequencing data, like mapping artifacts or other somewhat random but systematic artifacts of sequencing and data processing. In GATK 4.1.0, the PoN targets technical artifacts, not germline variants. The next version, GATK 4.1.1, will use `GenomicsDBImport` and be much more efficient.




##  Make a panel of normals (PoN)

* The PoN used here was made using GATK4.beta.6 with 40 exome samples aligned to GRCh38 from the [1000 Genomes Project](http://www.internationalgenome.org/). 

* Ideally, the PoN should *include* technically similar samples that were sequenced on the same platform, e.g. HiSeqX, using the same chemistry and analyzed using the same reference genome and tool-chain. 

* However, even an unmatched PoN is better than no PoN at all. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.




### To make your own PoN, run Mutect2 in tumor-only mode on each normal BAM individually then run `CreateSomaticPanelOfNormals` on the resulting VCFs.

For this tutorial, to practice the commands, we will generate a PoN from three precomputed VCFs, but will not run the tumor-only mode command. An example command is shown (DO NOT RUN). 


```

!gatk Mutect2 \
       -R /home/jupyter-user/ref/Homo_sapiens_assembly38.fasta \
       -I /home/jupyter-user/bams/HG00701.bam \
       -tumor HG00701 \
       -L /home/jupyter-user/resources/chr17plus.interval_list \
       -O /home/jupyter-user/sandbox/3_HG00701.vcf.gz

```





This generates `3_HG00701.vcf.gz` M2 callset and a matching index. 

NOTE: This takes an extremely long time to run, so we are not doing it today.  To run it at a later date/time, make sure to remove the "#" from the front of the lines in the code cell.

---

In [None]:
#!gatk Mutect2 \
#       -R /home/jupyter-user/ref/Homo_sapiens_assembly38.fasta \
#       -I /home/jupyter-user/bams/HG00701.bam \
#       -tumor HG00701 \
#       -L /home/jupyter-user/resources/chr17plus.interval_list \
#       -O /home/jupyter-user/sandbox/3_HG00701.vcf.gz

### Next, collate all the normal VCFs into a single callset with `CreateSomaticPanelOfNormals`.


For the tutorial, we run this command on three small normal sample VCFs from [1KGP](http://www.internationalgenome.org/). This generates a PoN VCF `6_threesamplepon.vcf.gz` and an index.


```

!gatk CreateSomaticPanelOfNormals \
     -vcfs /home/jupyter-user/mutect2_precomputed/3_HG00701.vcf.gz \
     -vcfs /home/jupyter-user/mutect2_precomputed/4_NA19771.vcf.gz \
     -vcfs /home/jupyter-user/mutect2_precomputed/5_HG02759.vcf.gz \
     -O /home/jupyter-user/sandbox/6_threesamplepon.vcf.gz

```
        
`CreateSomaticPanelOfNormals` retains sites with variants in two or more samples. 

The `--min-sample-count` is set to two by default and you can adjust this to any number. 

The tool retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF. 
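To make the `--min-sample-count` rule concrete, here is a minimal sketch of the counting idea in plain Python. It is not the tool's implementation; the function name and the site positions are invented for illustration.

```python
from collections import Counter


def panel_sites(per_sample_sites, min_sample_count=2):
    """Return sites seen in at least min_sample_count samples.

    per_sample_sites: one iterable of (contig, position) tuples per normal sample.
    """
    counts = Counter()
    for sites in per_sample_sites:
        counts.update(set(sites))  # count each site at most once per sample
    return sorted(site for site, n in counts.items() if n >= min_sample_count)


# Three hypothetical normals with made-up positions.
samples = [
    [("chr17", 100), ("chr17", 250)],
    [("chr17", 100), ("chr17", 400)],
    [("chr17", 250), ("chr17", 100)],
]

print(panel_sites(samples))                      # sites in two or more samples
print(panel_sites(samples, min_sample_count=3))  # a stricter panel
```

Raising the threshold shrinks the panel to sites that recur more often, which trades sensitivity to rare artifacts for fewer spurious panel entries.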

---

NOTE: This takes an extremely long time to run, so we are not doing it today. To run it at a later date/time, make sure to remove the "#" from the front of the lines in the code cell.

In [None]:
#!gatk CreateSomaticPanelOfNormals \
#     -vcfs /home/jupyter-user/mutect2_precomputed/3_HG00701.vcf.gz \
#     -vcfs /home/jupyter-user/mutect2_precomputed/4_NA19771.vcf.gz \
#     -vcfs /home/jupyter-user/mutect2_precomputed/5_HG02759.vcf.gz \
#     -O /home/jupyter-user/sandbox/6_threesamplepon.vcf.gz



**➤ What annotations does the M2 PoN contain?**




## Mutect2 mutation calls can be multiallelic

To illustrate how Mutect2 applies annotations, below is a multiallelic site from the callset. Pull this out by running `gzcat` or `!zcat sandbox/1_somatic_m2.vcf.gz | awk '$5 ~","'`. The `awk '$5 ~","'` expression subsets records that contain a comma in the fifth column (ALT).

Running the first cell shows mutation calls that have a single alternate allele.

Running the second cell shows mutation calls that have multiple alternate alleles.


In [None]:
!zcat /home/jupyter-user/sandbox/1_somatic_m2.vcf.gz | grep -v "##"  | awk -F "\t" '$5 !~","'

In [None]:
!zcat /home/jupyter-user/sandbox/1_somatic_m2.vcf.gz | awk -F "\t" '$5 ~","'
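The same split can be done in Python if you prefer to stay in the notebook's own language. This is a minimal sketch using only the standard library; the function name is made up, and the records in the example are invented rather than taken from the callset.

```python
import gzip


def split_by_allele_count(lines):
    """Split VCF records into (single_alt, multi_alt) lists.

    A record is multiallelic when column 5 (ALT) contains a comma.
    Header lines starting with '#' are skipped.
    """
    single, multi = [], []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        (multi if "," in fields[4] else single).append(fields)
    return single, multi


# Invented records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr17\t100\t.\tA\tT\t.\t.\tDP=50\n",
    "chr17\t200\t.\tG\tC,T\t.\t.\tDP=60\n",
]
single, multi = split_by_allele_count(example)
print(len(single), "single-allele record(s);", len(multi), "multiallelic record(s)")

# On the real callset (bgzipped VCFs can be read with the gzip module):
# with gzip.open("/home/jupyter-user/sandbox/1_somatic_m2.vcf.gz", "rt") as handle:
#     single, multi = split_by_allele_count(handle)
```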

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/bluebox1.png" width="900"/>

We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. `PGT` and `PID` that relate to phasing. For a summary description of each annotation, view the header ##FORMAT and ##INFO lines.

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/Format.png" width="900"/>
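Sample-level annotations are easier to read once the FORMAT keys are paired with the values in a sample column. The sketch below shows that pairing; the function name and the sample values are hypothetical, not taken from the callset.

```python
def sample_annotations(format_field, sample_field):
    """Pair colon-separated FORMAT keys (column 9) with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


# Hypothetical FORMAT and tumor-sample columns for a biallelic site.
record = sample_annotations("GT:AD:AF:DP", "0/1:30,12:0.286:42")

ref_depth, alt_depth = (int(x) for x in record["AD"].split(","))
print("genotype:", record["GT"])
print("ref reads:", ref_depth, "alt reads:", alt_depth)
```

At a multiallelic site, allele-specific fields such as `AD` and `AF` carry one comma-separated value per allele, so the same split on `,` applies with more entries.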



#  FILTER FOR CONFIDENT SOMATIC CALLS

In section 1.1, we generated an unfiltered Mutect2 callset. Now, we will use filtering tools to identify which mutation candidates are likely to be real somatic mutations.





## Estimate cross-sample contamination 

We estimate cross-sample contamination with two tools, `GetPileupSummaries` and `CalculateContamination`. The estimation uses a known germline variant resource to limit analyses to sites that are not commonly variant, which we set to less than 20% population allele frequency. The statistics of contamination estimation are tuned to expectations in humans. `GetPileupSummaries` checks the given sample at these sites, and the estimation uses the sites determined to be homozygous-variant, counting contaminating alleles there to estimate contamination. The approach accounts for regions of potential loss of heterozygosity as well as copy number variation in its estimation.





### `GetPileupSummaries` to summarize read support for a set number of known variant sites. 

Here we use a human population germline resource, [gnomAD](http://gnomad.broadinstitute.org/), filtered to contain biallelic SNP variants present at 0.051 to 0.499 allele frequency in the population. The tool tabulates read counts that support REF, ALT and OTHER alleles for the sites in the resource. Let's run the tool on the tumor and the normal.


```
!gatk GetPileupSummaries \
    -I /home/jupyter-user/bams/tumor.bam \
    -V /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -L /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -O /home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table
```

---
This first run is on the tumor sample.

In [None]:
!gatk GetPileupSummaries \
    -I /home/jupyter-user/bams/tumor.bam \
    -V /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -L /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -O /home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table




```
!gatk GetPileupSummaries \
    -I /home/jupyter-user/bams/normal.bam \
    -V /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -L /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -O /home/jupyter-user/sandbox/7_normal_getpileupsummaries.table
```

---
The second run is on the normal sample.


In [None]:
!gatk GetPileupSummaries \
    -I /home/jupyter-user/bams/normal.bam \
    -V /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -L /home/jupyter-user/resources/chr17_small_exac_common_3_grch38.vcf.gz \
    -O /home/jupyter-user/sandbox/7_normal_getpileupsummaries.table


Each command produces a six-column table as shown. The `alt_count` is the count of reads that support the ALT allele in the germline resource. The `allele_frequency` corresponds to that given in the germline resource. Counts for `other_alt_count` refer to reads that support all other alleles.

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/bluebox2.png" width="700"/>



In [None]:
!head /home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table

In [None]:
!head /home/jupyter-user/sandbox/7_normal_getpileupsummaries.table
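If you want to work with the pileup table in Python rather than with `head`, the sketch below parses it with the standard library. It assumes the header names the six columns `contig`, `position`, `ref_count`, `alt_count`, `other_alt_count` and `allele_frequency`, and that any leading metadata lines start with `#`; check both against your own file. The example rows are invented.

```python
import csv
import io


def read_pileup_table(handle):
    """Parse a GetPileupSummaries-style table into a list of dicts.

    Lines starting with '#' are treated as metadata and skipped.
    """
    rows = (line for line in handle if not line.startswith("#"))
    table = []
    for row in csv.DictReader(rows, delimiter="\t"):
        table.append({
            "contig": row["contig"],
            "position": int(row["position"]),
            "ref_count": int(row["ref_count"]),
            "alt_count": int(row["alt_count"]),
            "other_alt_count": int(row["other_alt_count"]),
            "allele_frequency": float(row["allele_frequency"]),
        })
    return table


# Invented content; the metadata line is an assumption about the file layout.
example = io.StringIO(
    "#<METADATA>SAMPLE=example\n"
    "contig\tposition\tref_count\talt_count\tother_alt_count\tallele_frequency\n"
    "chr17\t1000\t40\t2\t0\t0.05\n"
    "chr17\t2000\t1\t55\t0\t0.12\n"
)
rows = read_pileup_table(example)
print(len(rows), "sites; first alt_count =", rows[0]["alt_count"])

# On the real output:
# with open("/home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table") as handle:
#     rows = read_pileup_table(handle)
```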

The tool considers *homozygous-variant* sites in the sample where the alternate allele frequency (AF) in the population resource ranges between 0.01 and 0.2. This range is adjustable. We can expect a lot of contamination by alternate alleles at sites where the alternate AF is large, so those sites wouldn't tell us much. Conversely, at homozygous-alternate sites where the variant allele is rare in the population, we are more likely to observe the presence of REF or other alleles if there was cross-sample contamination, and therefore we will be able to measure contamination more accurately.
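The reasoning above can be sketched as a back-of-the-envelope estimate. At a homozygous-alternate site, a contaminating read carries the REF allele with probability of roughly 1 minus the population AF, so the REF reads observed there, divided by depth × (1 − AF), approximate the contamination fraction. This is a deliberately simplified illustration and not the `CalculateContamination` algorithm, which also models loss of heterozygosity and copy number; the function name, the 0.8 homozygous-alt cutoff and the example counts are all assumptions.

```python
def naive_contamination(sites, min_af=0.01, max_af=0.2, hom_alt_fraction=0.8):
    """Crude contamination estimate from (ref_count, alt_count, population_af) tuples.

    Only sites that look homozygous-alternate in the sample and whose
    population AF lies in [min_af, max_af] contribute to the estimate.
    """
    ref_reads = 0.0
    expected_ref_per_unit = 0.0
    for ref_count, alt_count, af in sites:
        depth = ref_count + alt_count
        if depth == 0 or not (min_af <= af <= max_af):
            continue
        if alt_count / depth < hom_alt_fraction:
            continue  # not convincingly homozygous-alternate
        ref_reads += ref_count
        expected_ref_per_unit += depth * (1.0 - af)
    if expected_ref_per_unit == 0:
        return None  # no informative sites
    return ref_reads / expected_ref_per_unit


# Invented counts: two informative hom-alt sites, one het site, one common site.
sites = [(2, 98, 0.10), (1, 99, 0.05), (50, 50, 0.10), (0, 100, 0.50)]
print(naive_contamination(sites))
```

The heterozygous site and the common-allele site are skipped, which mirrors why the text says sites with large alternate AF "wouldn't tell us much".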





### Estimate contamination with `CalculateContamination`.

The tool gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls. 

```
!gatk CalculateContamination \
    -I /home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table \
    -O /home/jupyter-user/sandbox/8_tumor_calculatecontamination.table
```



In [None]:
!gatk CalculateContamination \
    -I /home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table \
    -O /home/jupyter-user/sandbox/8_tumor_calculatecontamination.table



Let's also try out an additional feature of the tool. We can provide both the tumor and the matched normal pileup table. The pairing can allow for a slightly more accurate estimate. 


```
# Notes on the arguments below (kept out of the command so the line continuations work):
#  - the normal pileups (-matched) are useful but optional
#  - it is highly recommended to produce segments for FilterMutectCalls (not set currently):
#      -tumor-segmentation segments.table
!gatk CalculateContamination \
    -I /home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table \
    -matched /home/jupyter-user/sandbox/7_normal_getpileupsummaries.table \
    -O /home/jupyter-user/sandbox/8_pair_calculatecontamination.table

```



In [None]:
!gatk CalculateContamination \
    -I /home/jupyter-user/sandbox/7_tumor_getpileupsummaries.table \
    -matched /home/jupyter-user/sandbox/7_normal_getpileupsummaries.table \
    -O /home/jupyter-user/sandbox/8_pair_calculatecontamination.table 


The resulting files from the two variations each give the fraction contamination. Run these to view results: 

* `!cat /home/jupyter-user/sandbox/8_tumor_calculatecontamination.table`
* `!cat /home/jupyter-user/sandbox/8_pair_calculatecontamination.table` 



In [None]:
!cat /home/jupyter-user/sandbox/8_tumor_calculatecontamination.table

In [None]:
!cat /home/jupyter-user/sandbox/8_pair_calculatecontamination.table

For our small tumor BAM file, you can see the contamination is ~0.0191 with an error of ~0.0022. We get a slightly lower number, ~0.0120 +/- 0.00454, for the matched estimate. For the full BAM file, we see a slightly larger contamination number. Be wary of calls whose alternate allele fraction falls below the contamination estimate, because contaminating reads alone could explain them.
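As a quick illustration of that rule of thumb, the sketch below compares a few allele fractions against the tumor-only estimate quoted above. It is only a manual-review aid under the stated numbers, not how `FilterMutectCalls` decides; the function name and the example allele fractions are invented.

```python
def below_contamination(alt_fraction, contamination, error=0.0):
    """True when an allele fraction is no higher than contamination plus its error."""
    return alt_fraction <= contamination + error


# Tumor-only estimate and error quoted in the text.
contamination, error = 0.0191, 0.0022

for af in (0.015, 0.05, 0.30):  # made-up allele fractions
    if below_contamination(af, contamination, error):
        verdict = "within the contamination range, review carefully"
    else:
        verdict = "above the contamination estimate"
    print(f"AF={af:.3f}: {verdict}")
```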

---


## Apply filters with `FilterMutectCalls`

In this step, we filter the small data, `1_somatic_m2.vcf.gz`, with `FilterMutectCalls`. [Tutorial#11136](https://software.broadinstitute.org/gatk/documentation/article?id=11136) provides the full Mutect2 callset using v4.0.0.0 and tallies filtering results. The tool uses the annotations within the callset and, if provided, the contamination table in filtering. Default settings are tuned for human somatic analyses.

```
# Notes on the arguments below (kept out of the command so the line continuations work):
#  - FilterMutectCalls may input segmentation for one or more tumor samples
#    from CalculateContamination (not set currently):
#      --tumor-segmentation segments1.table
#  - FilterMutectCalls may input contamination estimates for one or more
#    tumor samples from CalculateContamination (--contamination-table)
!gatk FilterMutectCalls \
    -V /home/jupyter-user/sandbox/1_somatic_m2.vcf.gz \
    --contamination-table /home/jupyter-user/sandbox/8_tumor_calculatecontamination.table \
    --stats /home/jupyter-user/sandbox/9_somatic_oncefiltered.stats.txt \
    -O /home/jupyter-user/sandbox/9_somatic_oncefiltered.vcf.gz
```



In [None]:
!gatk FilterMutectCalls \
    -V /home/jupyter-user/sandbox/1_somatic_m2.vcf.gz \
    --contamination-table /home/jupyter-user/sandbox/8_tumor_calculatecontamination.table \
    --stats /home/jupyter-user/sandbox/9_somatic_oncefiltered.stats.txt \
    -O /home/jupyter-user/sandbox/9_somatic_oncefiltered.vcf.gz



This produces a VCF callset `9_somatic_oncefiltered.vcf.gz` and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using 

* `!zgrep '##FILTER' /home/jupyter-user/sandbox/9_somatic_oncefiltered.vcf.gz`




In [None]:
!zgrep '##FILTER' /home/jupyter-user/sandbox/9_somatic_oncefiltered.vcf.gz

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/bluebox3.png" width="900"/>

This step seemingly applies 20 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the `duplicate_evidence` filter requires a nonstandard annotation that our callset omits. 
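To see which filters actually fired, as opposed to those merely declared in the header, you can tally the FILTER column. The following is a minimal sketch with the standard library; the function name is made up and the example records are invented.

```python
import gzip
from collections import Counter


def tally_filters(lines):
    """Count each FILTER value (column 7) across VCF records.

    A record failing several filters lists them separated by ';',
    and each one is counted separately.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        filter_field = line.rstrip("\n").split("\t")[6]
        counts.update(filter_field.split(";"))
    return counts


# Invented records for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr17\t100\t.\tA\tT\t.\tPASS\tDP=50\n",
    "chr17\t200\t.\tG\tC\t.\tcontamination;t_lod\tDP=60\n",
    "chr17\t300\t.\tC\tA\t.\tt_lod\tDP=40\n",
]
counts = tally_filters(example)
for name, n in counts.most_common():
    print(name, n)

# On the real callset:
# with gzip.open("/home/jupyter-user/sandbox/9_somatic_oncefiltered.vcf.gz", "rt") as handle:
#     counts = tally_filters(handle)
```

A filter that appears in the header but never in this tally either did not trigger or was skipped because its annotation is absent.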

---




## (Optional) Filter on sequence context artifacts 

To mitigate the effects of sequence context artifacts, e.g. OxoG and FFPE, the workflow uses another filtering tool, `FilterByOrientationBias`. The tool requires metrics generated by `CollectSequencingArtifactMetrics`, which we provide. 

This is the old version of the code; we are not going to run it.

---

For details on generating the metrics, see Section 5 of [Tutorial#11136](https://gatkforums.broadinstitute.org/dsde/discussion/11136#5)


```
!gatk FilterByOrientationBias \
    -AM C/T \
    -AM G/T \
    -V /home/jupyter-user/sandbox/9_somatic_oncefiltered.vcf.gz \
    -P /home/jupyter-user/mutect2_precomputed/10_tumor_artifact.pre_adapter_detail_metrics.txt \
    -O /home/jupyter-user/sandbox/11_somatic_twicefiltered.vcf.gz

```
---

This tool has been replaced by two tools, `CollectF1R2Counts` and `LearnReadOrientationModel`. Run both to generate a TSV containing the artifact priors, which is passed to the initial Mutect2 command with this parameter:

```
--orientation-bias-artifact-priors tumor1-artifact-prior.tsv
```


It is no longer necessary to specify ref/alt biases such as C/T or G/T, as all of them are checked in the newer code.

Here is the newer code version:

```
# Outputs (kept out of the command so the line continuations work):
#  -alt-table: tab-separated table of pileup data over alt sites
#  -ref-hist:  metrics file with overall summary metrics and
#              reference context-specific depth histograms (required)
#  -alt-hist:  histogram of alt sites with alt depth = 1 (required)
!gatk CollectF1R2Counts \
     -R reference.fasta \
     -I tumor.bam \
     -alt-table tumor-alt.tsv \
     -ref-hist tumor-ref.metrics \
     -alt-hist tumor-alt.metrics
```


```
!gatk LearnReadOrientationModel \
   -alt-table tumor-alt.tsv \
   -ref-hist tumor-ref.metrics \
   -alt-hist tumor-alt.metrics \
   -O tumor-artifact-prior.tsv


```

The read orientation artifact, also known as the orientation bias artifact, arises due to a chemical change in the nucleotide during library prep that results in, for example, G base-pairing with A. This kind of artifact has a clear signature (e.g. a C to A SNP that occurs predominantly for the middle C in the DNA sequence CCG), and it is single-stranded in nature. Downstream, this artifact manifests as low allele fraction SNPs whose evidence for the alt allele consists almost entirely of F1R2 reads or almost entirely of F2R1 reads. A read pair is F1R2 (forward 1st, reverse 2nd) if the sequence of bases in Read 1 maps to the forward strand of the reference (F1), and the sequence of Read 2 to the reverse strand of the reference (R2). F2R1 is defined similarly.

Thank you Takuto Sato for the loan of this image.

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/orientationbias.png" width="750"/>
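The F1R2 and F2R1 definitions map directly onto two SAM flag bits: `0x40` (first in pair) and `0x10` (read on the reverse strand). The sketch below classifies a single read's pair orientation from its flag. It assumes properly paired reads whose mates are on opposite strands, and the function name is made up for illustration.

```python
FIRST_IN_PAIR = 0x40   # SAM flag bit: this read is Read 1
REVERSE_STRAND = 0x10  # SAM flag bit: this read maps to the reverse strand


def pair_orientation(flag):
    """Return 'F1R2' or 'F2R1' for a read from a properly oriented pair.

    F1R2: Read 1 forward, Read 2 reverse.
    F2R1: Read 2 forward, Read 1 reverse.
    """
    is_read1 = bool(flag & FIRST_IN_PAIR)
    is_reverse = bool(flag & REVERSE_STRAND)
    # Read 1 on the forward strand, or Read 2 on the reverse strand, implies F1R2.
    return "F1R2" if is_read1 != is_reverse else "F2R1"


# Common flag values for properly paired reads.
for flag in (99, 147, 83, 163):
    print(flag, pair_orientation(flag))
```

Tallying these labels over the reads that support an alt allele shows whether the evidence is lopsided toward one orientation, which is the signature described above.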

In [None]:
#!gatk FilterByOrientationBias \
#    -AM C/T \
#    -AM G/T \
#    -V /home/jupyter-user/sandbox/9_somatic_oncefiltered.vcf.gz \
#    -P /home/jupyter-user/mutect2_precomputed/10_tumor_artifact.pre_adapter_detail_metrics.txt \
#    -O /home/jupyter-user/sandbox/11_somatic_twicefiltered.vcf.gz


This produces a VCF, index, and summary table. For our small data, this step does not filter any sites with the orientation_bias filter and this can be seen in the Num_Artifact_Mode_Filtered column of `11_somatic_twicefiltered.vcf.gz.summary`. 


We are skipping the older version of the code, but if you want to practice generating artifact priors, the following two code cells demonstrate how to set up the commands.


In [None]:
!gatk CollectF1R2Counts  \
     -R /home/jupyter-user/ref/Homo_sapiens_assembly38.fasta  \
     -I /home/jupyter-user/bams/tumor.bam \
     -alt-table tumor-alt.tsv \
     -ref-hist tumor-ref.metrics \
     -alt-hist tumor-alt.metrics

In [None]:
!gatk LearnReadOrientationModel \
   -alt-table tumor-alt.tsv \
   -ref-hist tumor-ref.metrics \
   -alt-hist tumor-alt.metrics \
   -O tumor-artifact-prior.tsv

In [None]:
!head -n50 tumor-ref.metrics | awk '{print $1 "\t" $2 "\t" $3 "\t" $4}'

In [None]:
!head tumor-artifact-prior.tsv 



# REVIEW CALLS WITH IGV

Deriving a good somatic callset involves comparing callsets from different callers, manually reviewing passing and filtered calls and, if necessary, additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer. 




## Setup IGV to review somatic calls


How do we account for variant calls based on the read data? Remember Mutect2 reassembles reads just like HaplotypeCaller, so the clean alignments will not necessarily reflect the calls. We must examine the BAMOUT that Mutect2's graph-assembly produces. We already generated this BAMOUT in section 1.1 (`sandbox/2_tumor_normal_m2.bam`).  We are going to copy it into our bucket for loading into the IGV.


### Install or upgrade IGV Desktop to ensure you have a recent version. 

IGV Desktop can be obtained from http://www.broadinstitute.org/software/igv/download

### Copy the result of our analysis into the workspace bucket so we can load it into IGV.

We use the Google Cloud utilities (`gsutil`) copy command (`cp`) to put our sandbox files into the bucket, where we can load them into IGV. The other files (`bams`, `resources`, `mutect2_precomputed`) were already made available through our gatk-tutorials bucket, so we don't have to copy those again.

In [None]:
!gsutil cp /home/jupyter-user/sandbox/2_tumor_normal_m2.bam $bucket
!gsutil cp /home/jupyter-user/sandbox/2_tumor_normal_m2.bai $bucket

List the directory and make note of the full bucket name with the `2_tumor_normal_m2.bam` file in it.

In [None]:
!gsutil ls $bucket

### Start IGV

### In IGV, load Human (hg38) as the reference

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/IGV.changegenome.png" width="400"/>




### In IGV, go to `View`-->`Preferences` and check the box to enable `Google Access`

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/IGV.Changepreferences.png" width="400"/>

### In IGV, go to `File`-->`Load from URL` then load these files in order: 



```
#These three files are coming from the workshop bucket; no changes need to be made to the path.

A. gs://gatk-tutorials/workshop_1903/3-somatic/resources/chr17_m2pon.vcf.gz
B. gs://gatk-tutorials/workshop_1903/3-somatic/resources/chr17_af-only-gnomad_grch38.vcf.gz
C. gs://gatk-tutorials/workshop_1903/3-somatic/mutect2_precomputed/9_somatic_oncefiltered.vcf.gz



#Replace the following path with your bucket name in front of "/2_tumor_normal_m2.bam"

D. gs://fc-your-bucket-name/2_tumor_normal_m2.bam


#These two files are coming from the workshop bucket; no changes need to be made to the path.

E. gs://gatk-tutorials/workshop_1903/3-somatic/bams/tumor.bam
F. gs://gatk-tutorials/workshop_1903/3-somatic/bams/normal.bam

```


<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/IGV.loadgsfiles.png" width="600"/>




### Navigate to the location of the genome where variants were called

With the exception of the somatic callset, the regions the data cover are again in the `chr17plus.interval_list`. Navigate IGV to the **TP53** locus at **chr17:7,666,402-7,689,550.** 


<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/IGV.navigate.png" width="600"/>

* Right-click on track \[B] and collapse the view.
* Zoom into the somatic call in \[C], **chr17:7,673,333-7,675,077**
* Hover over or click on the gray call in track \[C] to view annotations.
* Scroll through the data and notice the coverage for the samples. 

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/IGV.hover.png" width="600"/>
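IGV locus strings such as the two used above follow a `contig:start-end` pattern with optional thousands separators. The sketch below parses them and checks that the zoomed window sits inside the TP53 window. The names `Locus` and `parse_locus` are hypothetical helpers, not part of IGV or GATK.

```python
import re
from typing import NamedTuple


class Locus(NamedTuple):
    contig: str
    start: int
    end: int

    @property
    def width(self) -> int:
        # Both coordinates are inclusive, hence the +1.
        return self.end - self.start + 1

    def contains(self, other: "Locus") -> bool:
        return (self.contig == other.contig
                and self.start <= other.start
                and other.end <= self.end)


def parse_locus(text: str) -> Locus:
    """Parse 'chr17:7,666,402-7,689,550' into a Locus."""
    match = re.fullmatch(r"([^:\s]+):([\d,]+)-([\d,]+)", text.strip())
    if match is None:
        raise ValueError(f"not a locus string: {text!r}")
    contig, start, end = match.groups()
    return Locus(contig, int(start.replace(",", "")), int(end.replace(",", "")))


tp53_window = parse_locus("chr17:7,666,402-7,689,550")
zoom_window = parse_locus("chr17:7,673,333-7,675,077")
print(tp53_window.width)                   # 23149
print(zoom_window.width)                   # 1745
print(tp53_window.contains(zoom_window))   # True
```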



####  ➤ We see a C→T variant light up in red for the tumor but not the normal. What do you think is happening in D, 2_tumor_normal_m2.bam? 	



####  ➤ What does the coverage tell you?
	


If these alignments seem hard to decipher, it is because we need to tweak some settings.

* Make room to focus on track \[D]. Shift+select on the left panels for tracks \[E], \[F] and their coverages.

* Right-click and `Remove Tracks` to remove the tumor and normal BAMs.

* Go to `View>Preferences>Alignments`. Uncheck `Downsample reads`.
* Right-click on the alignments track and select:
    * Group by sample
    * Color alignments by tag: HC
    * Sort by base
* Scroll and click on a read in each group to determine which group belongs to which sample. 



#### ➤ What are the three grouped tracks for the bamout? What do the colors indicate? What differentiates the pastel versus gray reads? 



#### ➤ How do you feel about this somatic call? 

---




##  Review filtered indels to study the logic behind different filters

Explore a few insertion and deletion sites in IGV and consider the evidence that supports the filtering decisions. 

| CHROM | POS | REF | ALT | FILTER |
| --- | --- | --- | --- | --- |
| chr17 | 7,221,420| CACTGCCCTAGGTCAGGA | C | artifact_in_normal;contamination;panel_of_normals;str_contraction |
| chr17 | 19,748,387| G | GA | contamination;str_contraction;t_lod |
| chr17 | 50,124,771 | GCACACACACACACACA | G,GCACA,GCACACA | artifact_in_normal;clustered_events;germline_risk;multiallelic;panel_of_normals |




Here are a few more filtered indel calls to explore.


| CHROM | POS | REF | ALT | FILTER |
| --- | --- | --- | --- | --- |
| chr17 | 26,982,033 | G | GC | artifact_in_normal;bad_haplotype;clustered_events |
| chr17 | 35,671,734 | CTT | C,CT,CTTT | artifact_in_normal;multiallelic;panel_of_normals |
| chr17 | 47,157,394 | CAA | C,CAAA | artifact_in_normal;germline_risk;panel_of_normals |
| chr17 | 68,907,890 | GA | G,GAA | artifact_in_normal;base_quality;germline_risk;panel_of_normals;str_contraction |
| chr17 | 69,182,632 | C | CA | artifact_in_normal;contamination;str_contraction;t_lod |
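The FILTER column holds a semicolon-separated list of every filter a call failed. To see which filters recur across the eight sites listed in the two tables, a small tally can help. This is only a sketch: the positions and filter strings are copied from the tables, and `tally_filters` is a hypothetical helper name.

```python
from collections import Counter

# (POS, FILTER) pairs copied from the two tables above.
FILTERED_INDELS = [
    (7221420, "artifact_in_normal;contamination;panel_of_normals;str_contraction"),
    (19748387, "contamination;str_contraction;t_lod"),
    (50124771, "artifact_in_normal;clustered_events;germline_risk;multiallelic;panel_of_normals"),
    (26982033, "artifact_in_normal;bad_haplotype;clustered_events"),
    (35671734, "artifact_in_normal;multiallelic;panel_of_normals"),
    (47157394, "artifact_in_normal;germline_risk;panel_of_normals"),
    (68907890, "artifact_in_normal;base_quality;germline_risk;panel_of_normals;str_contraction"),
    (69182632, "artifact_in_normal;contamination;str_contraction;t_lod"),
]


def tally_filters(records):
    """Count how many records carry each filter name."""
    counts = Counter()
    for _pos, filter_field in records:
        counts.update(filter_field.split(";"))
    return counts


for name, n in tally_filters(FILTERED_INDELS).most_common():
    print(f"{n}  {name}")
```

In this small set, `artifact_in_normal` appears at seven of the eight sites and `panel_of_normals` at five, so evidence from the matched normal and from the panel of normals is worth checking first when reviewing these sites in IGV.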





# ANNOTATE MUTATIONS WITH FUNCOTATOR

Another approach to filtering mutation calls is by the significance of their functional impact. For example, a stop codon in the middle of a protein coding region or a missense mutation that changes how a protein functions is more significant than a silent mutation or a mutation in the middle of an intron. 

To gauge functional impact, we must know which regions of the genome code for protein sequence and which correspond to elements important to gene expression. Transcript annotation resources such as [GENCODE](https://www.gencodegenes.org/) capture such information in a standardized format, [General Transfer Format (GTF)](https://www.gencodegenes.org/pages/data_format.html).
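For orientation, a GTF record is a single line of nine tab-separated columns, the last of which holds `key "value";` attribute pairs. The sketch below parses one such line. The record is made up for illustration and is not taken from GENCODE, and `parse_gtf_line` is a hypothetical helper name.

```python
import re

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame")

# Matches `key "quoted value";` as well as bare values such as `level 2;`.
_ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_gtf_line(line):
    """Split one GTF line into its eight fixed columns plus an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for key, quoted, bare in _ATTRIBUTE.findall(fields[8]):
        # Some keys can repeat within a record; this sketch keeps the first value.
        attributes.setdefault(key, quoted if quoted else bare)
    record["attributes"] = attributes
    return record


# Made-up record for illustration only.
example = "\t".join([
    "chr17", "EXAMPLE", "exon", "1000", "1200", ".", "-", ".",
    'gene_id "GENE0001"; gene_name "EXAMPLE1"; level 2;',
])
parsed = parse_gtf_line(example)
print(parsed["feature"], parsed["start"], parsed["end"], parsed["strand"])
print(parsed["attributes"])
```

The feature type, coordinates and strand are what allow an annotation tool to decide whether a variant falls in an exon, an intron or outside a gene.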

GATK4 Funcotator annotates variant alleles with information from any number of annotation resources. The annotation resources must be organized in a particular way. You can download prepared Funcotator resource bundles from `gs://broad-public-datasets/funcotator/` or use GATK4 `FuncotatorDataSourceDownloader` to download the latest data sources directly from your GATK4 install. For this tutorial, we have specially prepared a small annotation resource. 



Annotate the `9_somatic_oncefiltered.vcf.gz` mutation callset with the resource.

```
!gatk Funcotator \
    --data-sources-path /home/jupyter-user/resources/funcotator_dataSources_GATK_Workshop_20181205 \
    --ref-version hg38 \
    -R /home/jupyter-user/ref/Homo_sapiens_assembly38.fasta \
    -V /home/jupyter-user/mutect2_precomputed/9_somatic_oncefiltered.vcf.gz \
    -O /home/jupyter-user/sandbox/12_somatic_oncefiltered_funcotate.vcf.gz \
    --output-file-format VCF
```


In [None]:
!gatk Funcotator \
    --data-sources-path /home/jupyter-user/resources/funcotator_dataSources_GATK_Workshop_20181205/ \
    --ref-version hg38 \
    -R /home/jupyter-user/ref/Homo_sapiens_assembly38.fasta \
    -V /home/jupyter-user/mutect2_precomputed/9_somatic_oncefiltered.vcf.gz \
    -O /home/jupyter-user/sandbox/12_somatic_oncefiltered_funcotate.vcf.gz \
    --output-file-format VCF



This produces a VCF callset with annotations. If needed, Funcotator can instead write results in historic [Mutation Annotation Format (MAF)](http://software.broadinstitute.org/software/igv/MutationAnnotationFormat) given `--output-file-format MAF`.




**➤ Examine the annotations for the TP53 mutation that we viewed earlier in IGV, at chr17:7674220.**



In [None]:
!zgrep chr17 /home/jupyter-user/sandbox/12_somatic_oncefiltered_funcotate.vcf.gz | grep 7674220

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/bluebox4.png" width="900"/>

We see an arginine to glutamine missense mutation. In our 124 mutation records, 21 are annotated with MISSENSE, and of these, ten PASS filters. 
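Counts like these can be tallied directly from the annotated VCF. The following is a rough sketch, assuming the token `MISSENSE` appears in the INFO column only as Funcotator's variant classification; a plain substring match would also count any other annotation text that happens to contain it. The helper names and the demo records are made up for illustration.

```python
import gzip


def count_missense(lines):
    """Return (records, missense, missense_and_pass) for an iterable of VCF lines."""
    total = missense = missense_pass = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        filter_field, info = fields[6], fields[7]  # VCF columns 7 and 8
        total += 1
        if "MISSENSE" in info:
            missense += 1
            if filter_field == "PASS":
                missense_pass += 1
    return total, missense, missense_pass


def count_missense_in_file(path):
    """Same tally, read from a gzip-compressed VCF on disk."""
    with gzip.open(path, "rt") as handle:
        return count_missense(handle)


# Made-up records with a simplified FUNCOTATION value, for illustration only.
DEMO = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t100\t.\tC\tT\t.\tPASS\tFUNCOTATION=[GENE_A|MISSENSE]",
    "chr17\t200\t.\tG\tA\t.\tt_lod\tFUNCOTATION=[GENE_B|MISSENSE]",
    "chr17\t300\t.\tA\tG\t.\tPASS\tFUNCOTATION=[GENE_C|SILENT]",
]
print(count_missense(DEMO))  # (3, 2, 1)
```

To try it on the tutorial output, call `count_missense_in_file("/home/jupyter-user/sandbox/12_somatic_oncefiltered_funcotate.vcf.gz")`. If the assumption above holds, the three numbers should be comparable to the record, MISSENSE and PASS counts quoted in the text.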
