# GATK Somatic Copy Number Alteration Tutorial <a class="tocSkip">

**February 2020**  

<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image1.July2019.png" alt="drawing" width="50%" align="left" style="margin:10px 20px"/> 
<font size=3>This hands-on tutorial outlines steps to sensitively detect alterations in total and allelic copy ratios using GATK4's ModelSegments CNA workflow. The workflow is suitable for detecting somatic copy ratio alterations, more familiarly copy number alterations (CNAs) or copy number variants (CNVs), in whole genomes and targeted exomes.</font>


_This tutorial was last tested with the GATK v4.1.4.1 and IGV v2.8.0._
 See [GATK Tool Documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360037224712) for further information on the tools we use below.


# Set up your Notebook

<b><font color=red>The instructions below are slightly different from what you ran this morning. Run all the commands to ensure the notebook will work properly.</font></b>

## Set cloud environment values
If you are opening a notebook for the first time today and you didn't adjust any runtime values, now's the time to edit them. Click on the gear icon in the upper right to edit your Notebook Runtime. Set the values as specified below:

| Option | Value |
| ------ | ------ |
| Environment | Default |
| Profile | Custom |
| CPU | 4 |
| Disk size | 100 GB |
| Memory | 15 GB |

Click the "Update" button when you are done, and Terra will begin to create a new runtime with your settings. When it is finished, it will pop up asking you to apply the new settings. In the meantime, you can continue with the setup instructions below. 

## Check kernel type
A kernel is a _computational engine_ that executes the code in the notebook. For this particular notebook, we will be using a Python 3 kernel so we can execute GATK commands using _Python Magic_ (`!`). In the upper right corner of the notebook, just under the Notebook Runtime, it should say `Python3`. If this notebook isn't running a Python 3 kernel, you can switch it by navigating to the Kernel menu and selecting `Change kernel`.

## Install required packages

<font color = "green"> **Tool Tip:** To run a cell in a notebook, press `SHIFT + ENTER`</font>

In [None]:
!pip install rpy2
import rpy2
%load_ext rpy2.ipython

In [None]:
%%R
install.packages("optparse")
library("optparse")

## Set up your files
Your notebook has a temporary folder that exists so long as your cluster is running. To see what files are in your notebook environment at any time, you can click on the Jupyter logo in the upper left corner. 

For this tutorial, we need to copy some files from this temporary folder to and from our workspace bucket. Run the two commands below to set up the workspace bucket variable and the file paths inside your notebook.

In [None]:
# Set your workspace bucket variable for this notebook.
import os
BUCKET = os.environ['WORKSPACE_BUCKET']

In [None]:
# Set workshop variable to access the most recent materials
WORKSHOP = "workshop_2002"

In [None]:
# Create sandbox and file directories
! mkdir -p /home/jupyter/notebooks/3-somatic-cna/sandbox/
! mkdir -p /home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots/
! mkdir -p /home/jupyter/notebooks/3-somatic-cna/ref/
! mkdir -p /home/jupyter/notebooks/3-somatic-cna/cna_inputs

# Removes any old symbolic linked sandbox directory and adds a new link, in case you've run this before
! rm -rf sandbox
! ln -s /home/jupyter/notebooks/3-somatic-cna/sandbox sandbox

In [None]:
# Adds Display function for viewing images from the notebook folder. We will use this to look
# at the outputs at different steps in this pipeline
from IPython.display import display, Image

## Check data permissions
For this tutorial, we have hosted the starting files in a public Google bucket. We will first check that the data is available to your user account, and if it is not, we simply need to install Google Cloud Storage.

In [None]:
# Check if data is accessible. The command should list several gs:// URLs.
! gsutil ls gs://gatk-tutorials/$WORKSHOP/3-somatic/

In [None]:
# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. 
# Afterwards, restart the kernel with Kernel > Restart.
#! pip install google-cloud-storage

## Download Data to the Notebook 
Some tools are not able to read directly from a Google bucket, so we download their files to our local notebook folder.

In [None]:
! gsutil cp gs://gatk-tutorials/$WORKSHOP/3-somatic/cna_inputs/* /home/jupyter/notebooks/3-somatic-cna/cna_inputs/
! gsutil cp gs://gatk-tutorials/$WORKSHOP/3-somatic/ref/Homo_sapiens_assembly38.dict /home/jupyter/notebooks/3-somatic-cna/ref/
    

---

# PERFORM COVERAGE ANALYSIS: MODELSEGMENTS CNA

## Prepare intervals for coverage collection
<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image3.July2019.png" alt="drawing" width="300" align="right" style="margin:0px 20px"/> 
Collecting coverage counts forms the basis of copy number variant detection. However, before we can do that, we must define the resolution of the analysis with a genomic intervals list. Since we are using exome data, we will pad the target regions. Padding target regions 250 bases on each side has been shown to increase sensitivity for the CNA workflow. In the case of whole genome data, we would divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.
 
Set `--bin-length` according to the data type, e.g. the default of 1000 for whole genomes or 0 for exomes. For the tutorial exome data, we provide a snippet of the capture kit target regions and set `--bin-length` to zero.

In [None]:
 ! gatk PreprocessIntervals \
    -L gs://gatk-tutorials/$WORKSHOP/3-somatic/resources/targets_chr17.interval_list \
    -R gs://gatk-tutorials/$WORKSHOP/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    --padding 250 \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list

This produces a Picard-style intervals list targets_chr17.preprocessed.interval_list with 11,307 targets for use in the coverage collection step.

➤ Peruse both the before and after intervals. Do we have the same number of intervals as before? How does the tool pad intervals that are less than 500bp apart?  
➤ Take a look at the tool doc description for `-imr OVERLAPPING_ONLY` [here](https://gatk.broadinstitute.org/hc/en-us/articles/4405443672347-PreprocessIntervals#--interval-merging-rule). What does this option ensure?

In [None]:
!gsutil cat gs://gatk-tutorials/$WORKSHOP/3-somatic/resources/targets_chr17.interval_list | egrep -v "^@" | wc -l
    

In [None]:
!egrep -v "^@" /home/jupyter/notebooks/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list | wc -l


In [None]:
!gsutil cat gs://gatk-tutorials/$WORKSHOP/3-somatic/resources/targets_chr17.interval_list | egrep -v "^@" | awk '{$6=$3-$2;print}' | head
    

In [None]:
!grep -v "^@" /home/jupyter/notebooks/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list | awk '{$6=$3-$2;print}' | head
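
The same checks can also be done in Python. The sketch below is only illustrative: `interval_lengths` is a hypothetical helper, and the example text is made up rather than taken from the tutorial files. It assumes Picard-style lines of tab-separated contig, start, end, strand, and name, preceded by `@`-prefixed header lines. It reports inclusive lengths (end - start + 1), which is one more than the `awk` column above.

```python
def interval_lengths(interval_list_text):
    """Return (contig, start, end, length) for each non-header line."""
    records = []
    for line in interval_list_text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        contig, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        # Picard interval lists are 1-based and inclusive on both ends.
        records.append((contig, start, end, end - start + 1))
    return records


# Made-up example, not real capture targets.
example = "\n".join([
    "@HD\tVN:1.6",
    "@SQ\tSN:chr17\tLN:83257441",
    "chr17\t1001\t1200\t+\ttarget_1",
    "chr17\t2001\t2100\t+\ttarget_2",
])
records = interval_lengths(example)
print(len(records), [r[3] for r in records])
```

To use it on the real files, read the preprocessed interval list from the sandbox folder into a string and pass that string to the helper.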


---
## Collect read counts for samples across target intervals

<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image4.July2019.png" alt="drawing" width="300" align="right" style="margin:0px 20px"/> 

The basis for detecting amplification and deletion events from sequencing data is read coverage. In this step, we count the number of read starts that overlap each interval using CollectReadCounts. We perform this step for the tumor sample and for the normal sample.

By default, the tool writes [HDF5 format](https://software.broadinstitute.org/gatk/documentation/article?id=11508) data, which is handled more efficiently by downstream tools (decreases runtime by reducing time spent on IO). Here we change the output format to TSV for teaching purposes. 

In [None]:
! gatk CollectReadCounts \
    -I gs://gatk-tutorials/$WORKSHOP/3-somatic/bams/tumor.bam \
    -L /home/jupyter/notebooks/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list \
    -R gs://gatk-tutorials/$WORKSHOP/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    --format TSV \
    -imr OVERLAPPING_ONLY \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/tumor.counts.tsv

In [None]:
! gatk CollectReadCounts \
    -I gs://gatk-tutorials/$WORKSHOP/3-somatic/bams/normal.bam \
    -L /home/jupyter/notebooks/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list \
    -R gs://gatk-tutorials/$WORKSHOP/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    --format TSV \
    -imr OVERLAPPING_ONLY \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/normal.counts.tsv

Here we show the raw counts per target (y-axis) for the normal and the tumor across 23 chromosomes (x-axis). Each target is represented by a point.
<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/cna-image1.png" alt="drawing" width="1000"/>
➤ Can you tell if either sample has copy number variants?
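
If you would rather inspect the counts in Python than in a plot, something like the following sketch may help. It assumes the TSV output has `@`-prefixed header lines followed by a column header with `CONTIG`, `START`, `END`, and `COUNT`; `read_counts` is a hypothetical helper and the values are made up.

```python
import csv
import statistics


def read_counts(tsv_text):
    """Parse counts TSV text into (contig, start, end, count) tuples."""
    # Drop blank lines and the '@'-prefixed SAM-style header lines.
    body = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("@")]
    reader = csv.DictReader(body, delimiter="\t")
    return [
        (row["CONTIG"], int(row["START"]), int(row["END"]), int(row["COUNT"]))
        for row in reader
    ]


# Made-up values for illustration only.
example = "\n".join([
    "@HD\tVN:1.6",
    "CONTIG\tSTART\tEND\tCOUNT",
    "chr17\t1001\t1500\t120",
    "chr17\t2001\t2500\t80",
    "chr17\t3001\t3500\t100",
])
counts = read_counts(example)
median_count = statistics.median(c for _, _, _, c in counts)
print(len(counts), median_count)
```

Raw counts depend on target size and sequencing depth, so a summary like this is a sanity check rather than evidence of copy number change.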

---
## Create a 1-sample panel of normals (PoN)

<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image5.July2019.png" alt="drawing" width="300" align="right" style="margin:0px 20px"/> 

Now we generate the CNA PoN with CreateReadCountPanelOfNormals. The panel of normals forms the baseline against which the workflow compares case samples. The tool uses singular value decomposition (SVD), a technique closely related to principal component analysis (PCA), to capture systematic noise as distinct from statistical noise.

Normally, you will want to create a PoN with some number of **normal samples that were ideally subject to the same batch effects as your case sample** under scrutiny. This tutorial will use a PoN made of forty 1KGP normal samples and generated with the following command: 

```
gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I file1_clean.counts.hdf5 \
    … 
    -I file40_clean.counts.hdf5 \
    -O cnaponC.pon.hdf5
```

CreateReadCountPanelOfNormals performs several other filtering steps across samples and across targets, and [this article section](https://gatk.broadinstitute.org/hc/en-us/articles/4405443742491-CreateReadCountPanelOfNormals) outlines these.

**At the least, the PoN should consist of 10 normal samples** that were ideally subject to the same batch effects as that of the tumor sample. Our **recommendation is 40 or more normal samples**. To illustrate tool features, we create a PoN with our normal sample with the following command.

In [None]:
#For multiple samples, the -I would be specified multiple times
! gatk CreateReadCountPanelOfNormals \
    -I /home/jupyter/notebooks/3-somatic-cna/sandbox/normal.counts.tsv \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/normal.pon.hdf5

➤ Study the stdout. 

**Are we losing any data during the filtering steps?** 

Given the reasons one might want to use a matched normal, **would you change this command?** 

Remember also __PoN medians are used to standardize case counts (by dividing)__.

So far we have been using subset data.

## Run CreateReadCountPanelOfNormals 

Run the command using the full data file cna_inputs/hcc1143_N_clean.counts.hdf5. Here we have adjusted the parameters to include `--minimum-interval-median-percentile`.

Changing the --minimum-interval-median-percentile argument from the default of 10.0 to a smaller value of 5.0 allows retention of more data, which is appropriate for this carefully selected normals cohort.

In [None]:
! gatk CreateReadCountPanelOfNormals \
    -I /home/jupyter/notebooks/3-somatic-cna/cna_inputs/hcc1143_N_clean.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/normal.pon.hdf5

➤ Which do you think will perform better in revealing copy number events in the tumor, the 40-sample PoN or the matched-normal? Why?

If you are curious to see for yourself how the matched-normal PoN pans out, it is possible to substitute it in to the remaining steps. Instructions continue with the 40-sample PoN.

---
## Remove noise from sample coverage using the PoN

<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image6.July2019.png" alt="drawing" width="300" align="right" style="margin:0px 20px"/>

&nbsp;

We use DenoiseReadCounts and the PoN to standardize and then denoise sample read counts. 

This produces two files, the standardized copy ratios hcc1143_T_clean.standardizedCR.tsv and the denoised copy ratios hcc1143_T_clean.denoisedCR.tsv, each representing a data transformation. In the first transformation, the tool standardizes counts by the PoN median counts. The standardization includes log2 transformation and normalizing the counts data to center around one. In the second transformation, the tool denoises the standardized copy ratios using the principal components of the PoN.

In the single-sample-PoN case, the two results will be identical to each other, as the tool only performs standardization.  
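
To make the first transformation concrete, here is a minimal sketch of standardization under simplifying assumptions. It is not the tool's implementation, which performs additional filtering and processing. The function name `standardize` and the example numbers are hypothetical, and all counts are assumed to be nonzero.

```python
import math
import statistics


def standardize(sample_counts, pon_medians):
    """Return log2 copy ratios after dividing by PoN medians and centering."""
    # Divide each interval's count by the matching PoN median count.
    ratios = [count / median for count, median in zip(sample_counts, pon_medians)]
    # Center on the sample median so a typical interval has a ratio of one.
    center = statistics.median(ratios)
    centered = [ratio / center for ratio in ratios]
    # Log2-transform; a ratio of one becomes zero.
    return [math.log2(ratio) for ratio in centered]


# Made-up counts: the third interval has twice the coverage the PoN expects.
sample_counts = [100, 200, 400, 100]
pon_medians = [100, 200, 200, 100]
log2_ratios = standardize(sample_counts, pon_medians)
print(log2_ratios)
```

Denoising would then remove the variation explained by the PoN's principal components from these values, which this sketch does not attempt.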

### Denoise Read Counts

In [None]:
! gatk --java-options "-Xmx7g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" DenoiseReadCounts \
    -I /home/jupyter/notebooks/3-somatic-cna/cna_inputs/hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals /home/jupyter/notebooks/3-somatic-cna/cna_inputs/cnaponC.pon.hdf5 \
    --standardized-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv

In [None]:
! gatk --java-options "-Xmx7g" DenoiseReadCounts \
    -I /home/jupyter/notebooks/3-somatic-cna/cna_inputs/hcc1143_N_clean.counts.hdf5 \
    --count-panel-of-normals /home/jupyter/notebooks/3-somatic-cna/cna_inputs/cnaponC.pon.hdf5 \
    --standardized-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv

➤ Skim the stdout to get a sense of the data transformations during standardization vs. denoising. 

The tool uses the maximum number of eigensamples available in the PoN. Changing the `--number-of-eigensamples` in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. Using a larger number of principal components will result in a higher level of denoising and a larger difference in the MADs. The level of denoising should be chosen with some care, as it will ultimately affect the sensitivity of the analysis.

### Plot Denoised Copy Ratios 
Let's take a look at the data in its current state.

In [None]:
! gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --sequence-dictionary /home/jupyter/notebooks/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output-prefix hcc1143_T_clean \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots

View the plot by running the cell below.

In [None]:
display(Image('/home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots/hcc1143_T_clean.denoised.png'))

In [None]:
! gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \
    --sequence-dictionary /home/jupyter/notebooks/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_N_clean

View the plot by running the cell below.

In [None]:
display(Image('/home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots/hcc1143_N_clean.denoised.png'))

➤ Skim the stdout to get a sense of the data transformations during standardization and denoising. 

Each command produces two sets of data: plots and QC values. 
- In the plots, standardized copy ratios are shown in blue. Standardization involves median-centering and log-transformation. Denoised copy ratios are in green. Denoising is performed using the principal components of the PoN. 
- The QC values pertain to the median-absolute-deviation (MAD) in different contexts, including the change between standardized and denoised (.deltaMAD.txt) and the change between the two scaled by the standardized MAD (.scaledDeltaMAD.txt).
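
As a rough illustration of these QC values, the sketch below computes a textbook median absolute deviation for two made-up series and then the change and scaled change between them. The exact formula in the GATK plotting script may differ, and taking the change as standardized minus denoised is an assumption made here.

```python
import statistics


def mad(values):
    """Median of absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


# Made-up log2 copy ratios before and after denoising.
standardized = [-0.4, 0.3, -0.2, 0.5, 0.0]
denoised = [-0.1, 0.1, 0.0, 0.2, 0.0]

standardized_mad = mad(standardized)
denoised_mad = mad(denoised)
delta_mad = standardized_mad - denoised_mad   # assumed sign convention
scaled_delta_mad = delta_mad / standardized_mad
print(standardized_mad, denoised_mad, delta_mad, scaled_delta_mad)
```

A larger scaled change suggests that denoising removed a larger share of the spread in the standardized data.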

| &nbsp; | &nbsp; |
| --- | --- |
| <img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image2.png" alt="drawing" width="500"/> | <img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image3.png" alt="drawing" width="500"/> |

---


## Perform segmentation based on coverage alone

At the heart of the GATK4 CNA workflow is ModelSegments, a tool that groups contiguous copy ratios into segments. Copy ratios, allelic copy ratios, or both can inform segmentation. So far, the tutorial has focused only on coverage data, so let's see what segmentation with coverage alone looks like.

<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image7.July2019.png" alt="drawing" width="800" align="left" style="margin:0px 20px"/> 




### Model segments on coverage alone 

In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox  \
    --output-prefix hcc1143_T_clean

In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox  \
    --output-prefix hcc1143_N_clean

Each command produces nine files.

### Plot Modeled Segments
Let's see what modeling segments on coverage alone looks like. Here we provide a second plotting tool, PlotModeledSegments, with the denoised copy ratios (from DenoiseReadCounts), the segments (from ModelSegments), and the reference sequence dictionary. 

In [None]:
! gatk PlotModeledSegments \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --segments /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary /home/jupyter/notebooks/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_T_clean

View the plot generated from the previous command by running the cell below.

In [None]:
display(Image('/home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots/hcc1143_T_clean.modeled.png'))

In [None]:
! gatk PlotModeledSegments \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \
    --segments /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.modelFinal.seg \
    --sequence-dictionary /home/jupyter/notebooks/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_N_clean


View the plot generated from the previous command by running the cell below.

In [None]:
display(Image('/home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots/hcc1143_N_clean.modeled.png'))

The command produces a plot with extension .modeled.png, where denoised copy ratios in alternate segments are colored in blue and orange and segment medians are drawn in black. For noisy data, box plots of the available posteriors for each segment become visible. 

<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image4.png" alt="drawing" width="900"/>
<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image5.png" alt="drawing" width="900"/>

➤ The tumor sample shows a lot of activity. Specifically, it has 235 segments. Is this surprising?
➤ Focus on chr2 of the normal sample. How do you interpret its copy ratio of ~1.3? How about the ~0.9 copy ratio of chr6? 
 
At a glance, segments appear to separate into roughly evenly spaced ratios, which represent absolute copy numbers, e.g. 1, 2, 3 and so on. Segments that fall between these likely represent subclonal populations. 
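
One way to see why the ratios are evenly spaced is a simple mixture model. The sketch below is an assumption-laden illustration, not part of the GATK workflow: `expected_copy_ratio`, `purity`, and `ploidy` are hypothetical names, and normal cells are assumed to carry two copies.

```python
def expected_copy_ratio(copy_number, purity=1.0, ploidy=2.0):
    """Expected copy ratio of a segment in a tumor/normal cell mixture."""
    # Signal from tumor cells plus signal from diploid normal cells.
    segment_signal = purity * copy_number + (1.0 - purity) * 2.0
    # Average signal across the genome for the same mixture.
    average_signal = purity * ploidy + (1.0 - purity) * 2.0
    return segment_signal / average_signal


# Clonal copy numbers 1 to 4 in a pure, diploid-on-average sample.
clonal = [expected_copy_ratio(n) for n in (1, 2, 3, 4)]

# A one-copy gain present in only half of the cells averages 2.5 copies.
subclonal = expected_copy_ratio(0.5 * 3 + 0.5 * 2)
print(clonal, subclonal)
```

Under these assumptions, each additional clonal copy shifts the ratio by the same step, while an event present in only some cells lands between two steps.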

### (Optional) Call Copy Ratio Segments
If you need callsets with amplifications (+), deletions (-) and neutral segments (0) clearly marked, then CallCopyRatioSegments can do this for you. These designations are appended as a new column to the segmented copy-ratio .cr.seg file from ModelSegments.

In [None]:
! gatk CallCopyRatioSegments \
    -I /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.cr.seg \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.called.seg
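
To give a feel for what such calls mean, here is a deliberately simplified sketch that labels segments from their mean log2 copy ratio. It is not the algorithm CallCopyRatioSegments uses. The function `call_segments` and the neutral bounds of 0.9 and 1.1 are illustrative assumptions.

```python
def call_segments(mean_log2_ratios, neutral_bounds=(0.9, 1.1)):
    """Label each segment '+', '-', or '0' from its mean log2 copy ratio."""
    lower, upper = neutral_bounds
    calls = []
    for log2_ratio in mean_log2_ratios:
        ratio = 2.0 ** log2_ratio  # convert back to linear copy ratio
        if ratio < lower:
            calls.append("-")      # deletion
        elif ratio > upper:
            calls.append("+")      # amplification
        else:
            calls.append("0")      # neutral
    return calls


# Made-up segment means in log2 space.
calls = call_segments([-1.0, 0.0, 0.58])
print(calls)
```

A fixed threshold like this ignores segment noise, which is one reason to rely on the tool's own calls for real analyses.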

---
# INCORPORATE ALLELIC DATA: MODELSEGMENTS CNA

## Perform segmentation jointly with coverage and allelic data
<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image8.July2019.png" alt="drawing" width="35%" align="right" style="margin:0px 20px"/> We just saw what segmentation with coverage data alone looks like. But we can squeeze more juice out of the lemon! In this section, we will model segments using both allelic counts and coverage data for a matched-control case. 

➤ How do allelic counts improve the detection of copy alteration?

Consider how, in normal germline sequencing, we decide whether a site's genotype is heterozygous or homozygous. For a heterozygous site, which presents two alleles in a diploid sample, we can be highly confident that the sample has at least two chromosomes. A hundred adjacent heterozygous sites become strong evidence for the multi-copy state of the genomic interval. 

<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image2.July2019.png" alt="drawing" width="50%" align="right" style="margin:20px 20px"/>We can extend this concept further towards detection of a type of zygosity that has implications for cancer. We can take allele counts for sites that are commonly variant in the population. For sites where the normal control is heterozygous, if the tumor sample is homozygous, then we can deduce the tumor underwent loss of heterozygosity (LOH) for the allele. With a string of adjacent LOH sites, we can be confident of an LOH segment. Here, either the tumor simply lost the chromosome segment or underwent a slightly more complicated event called copy-neutral LOH (cnLOH). Coverage data can offer clues towards deducing which type of loss is likely.   

Note it is possible to use allelic counts alone with ModelSegments. Furthermore, the tool will model segments for either a matched case or for a case sample alone. The latter can be useful in revealing clonal subpopulations. 
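
The LOH reasoning above can be sketched in a few lines of Python. This is a toy illustration rather than how ModelSegments works: the helper names, the heterozygosity bounds of 0.3 to 0.7, and the minor-allele cutoff of 0.1 are all assumptions, and the counts are made up.

```python
def alt_fraction(ref_count, alt_count):
    """Alternate-allele fraction, or None when there is no coverage."""
    total = ref_count + alt_count
    return alt_count / total if total else None


def loh_like_sites(normal, tumor, het_bounds=(0.3, 0.7), max_minor=0.1):
    """Sites heterozygous in the normal but near-homozygous in the tumor."""
    flagged = []
    for site, normal_counts in normal.items():
        if site not in tumor:
            continue
        normal_af = alt_fraction(*normal_counts)
        tumor_af = alt_fraction(*tumor[site])
        if normal_af is None or tumor_af is None:
            continue
        is_het_in_normal = het_bounds[0] <= normal_af <= het_bounds[1]
        tumor_minor = min(tumor_af, 1.0 - tumor_af)
        if is_het_in_normal and tumor_minor <= max_minor:
            flagged.append(site)
    return flagged


# Made-up (ref_count, alt_count) pairs keyed by (contig, position).
normal = {("chr17", 100): (20, 22), ("chr17", 200): (40, 0), ("chr17", 300): (18, 21)}
tumor = {("chr17", 100): (41, 1), ("chr17", 200): (35, 0), ("chr17", 300): (19, 20)}
flagged = loh_like_sites(normal, tumor)
print(flagged)
```

Only the first site is informative here: the second is homozygous in the normal, and the third stays balanced in the tumor.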

### Collect allelic counts from pileups (chr17 data)
CollectAllelicCounts tabulates counts of the reference allele and counts of the dominant alternate allele for sites in a given genomic intervals list. The tool filters out reads with MAPQ below 30 and discounts bases with base quality less than 20.
 
We perform this step on the chr17 subset data. In later steps, we will use precomputed results from the full data. Here, theta_snps_paddedC_chr17.vcf.gz contains lifted-over gnomAD SNPs-only sites subset to the padded target regions from section 1. 

In [None]:
! gatk CollectAllelicCounts \
    -L gs://gatk-tutorials/$WORKSHOP/3-somatic/resources/theta_snps_paddedC_chr17.vcf.gz \
    -I gs://gatk-tutorials/$WORKSHOP/3-somatic/bams/tumor.bam \
    -R gs://gatk-tutorials/$WORKSHOP/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.allelicCounts.tsv

In [None]:
! gatk CollectAllelicCounts \
    -L gs://gatk-tutorials/$WORKSHOP/3-somatic/resources/theta_snps_paddedC_chr17.vcf.gz \
    -I gs://gatk-tutorials/$WORKSHOP/3-somatic/bams/normal.bam \
    -R gs://gatk-tutorials/$WORKSHOP/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    -O /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.allelicCounts.tsv

The resulting tables list the read counts for REF and ALT, as well as the REF allele and the ALT allele, for every site provided in the intervals list.

➤ For sites lacking ALT allele counts, what is in the field for ALT_NUCLEOTIDE?

In [None]:
!grep -v ^@ /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_N_clean.allelicCounts.tsv | head 
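
For a closer look at the table, a short parser such as the one below may be useful. It assumes `@`-prefixed header lines followed by the columns `CONTIG`, `POSITION`, `REF_COUNT`, `ALT_COUNT`, `REF_NUCLEOTIDE`, and `ALT_NUCLEOTIDE`. The helper `read_allelic_counts` is hypothetical and the rows are made up.

```python
import csv


def read_allelic_counts(tsv_text):
    """Return (contig, position, depth, alt_fraction) for each site."""
    # Drop blank lines and the '@'-prefixed SAM-style header lines.
    body = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("@")]
    sites = []
    for row in csv.DictReader(body, delimiter="\t"):
        ref_count, alt_count = int(row["REF_COUNT"]), int(row["ALT_COUNT"])
        depth = ref_count + alt_count
        fraction = alt_count / depth if depth else None  # None when uncovered
        sites.append((row["CONTIG"], int(row["POSITION"]), depth, fraction))
    return sites


# Made-up rows for illustration only.
example = "\n".join([
    "@HD\tVN:1.6",
    "CONTIG\tPOSITION\tREF_COUNT\tALT_COUNT\tREF_NUCLEOTIDE\tALT_NUCLEOTIDE",
    "chr17\t1000\t30\t10\tA\tG",
    "chr17\t2000\t12\t12\tC\tT",
])
sites = read_allelic_counts(example)
print(sites)
```

Sites with a fraction near 0.5 in the normal sample are the ones most likely to be heterozygous.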




### Model segments jointly on coverage and allelic data (full data)  
In this step, the full spectrum of data converge. We provide precomputed allelic counts from the cna_inputs folder and tumor denoised read counts. Here we use default parameters. Adjusting tool parameters can change the resolution and smoothness of the segmentation results and we recommend researchers tune the parameters for their data.

It is useful to have the matched normal in this case because it makes it easier to detect real heterozygous SNPs, both their locations and their counts.

In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts /home/jupyter/notebooks/3-somatic-cna/cna_inputs/hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts /home/jupyter/notebooks/3-somatic-cna/cna_inputs/hcc1143_N_clean.allelicCounts.tsv \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox \
    --output-prefix hcc1143_TN_clean

➤ Skim the stdout to get a sense of the preprocessing and analysis. The tool filters sites with total allelic counts less than how many? How many control heterozygous sites does the tool retain? Does the tool then use all of these towards the joint analysis?  
➤ How many segments does the MultidimensionalKernelSegmenter initially find? After smoothing, how many final segments are there? 

The step produces eleven files. See [ModelSegments tool documentation](https://gatk.broadinstitute.org/hc/en-us/articles/4405451294747-ModelSegments) for details. Of note, we have two files with .hets. in the extension, .hets.normal.tsv and .hets.tsv. The former contains the normal control's heterozygous sites. The latter contains the tumor's allele counts for the normal's heterozygous sites. Finally, the .modelFinal.seg file contains the segmentation results.
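
To answer the segment-count questions without scrolling through stdout, you could tally rows in the .modelFinal.seg file. The sketch below assumes the file has `@`-prefixed header lines followed by a tab-separated column header that includes `CONTIG`. The helper `segments_per_contig` is hypothetical, and the example omits the file's other columns.

```python
from collections import Counter


def segments_per_contig(seg_text):
    """Count segment rows per contig in .seg-style text."""
    # Drop blank lines and the '@'-prefixed SAM-style header lines.
    body = [ln for ln in seg_text.splitlines() if ln and not ln.startswith("@")]
    header, rows = body[0].split("\t"), body[1:]
    contig_column = header.index("CONTIG")
    return Counter(row.split("\t")[contig_column] for row in rows)


# Made-up segments; real files carry additional posterior columns.
example = "\n".join([
    "@HD\tVN:1.6",
    "CONTIG\tSTART\tEND",
    "chr1\t1000\t50000",
    "chr1\t50001\t90000",
    "chr2\t1000\t80000",
])
per_contig = segments_per_contig(example)
total_segments = sum(per_contig.values())
print(total_segments, dict(per_contig))
```

Running the same tally on the coverage-only, joint, and allelic-only outputs gives a quick side-by-side comparison of segment counts.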

### Plot coverage copy ratios and alternate allele fractions 
We provide PlotModeledSegments the case sample's denoised copy ratios, .hets allele counts, and final segmentation results. 
<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image9.July2019.png" alt="drawing" width="50%" style="margin:20px 20px"/>


In [None]:
! gatk PlotModeledSegments \
    --denoised-copy-ratios /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_TN_clean.hets.tsv \
    --segments /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_TN_clean.modelFinal.seg \
    --sequence-dictionary /home/jupyter/notebooks/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_TN_clean

View the plot generated from the previous command by running the cell below.

In [None]:
display(Image('/home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots/hcc1143_TN_clean.modeled.png'))

This produces a file with two plots, each with 398 segments. The top plot shows segmented copy ratios and the bottom plot shows segmented alternate-allele fractions. Box plots for the major and minor allele fractions mark the 10th, 50th and 90th percentile credible intervals. Vertical streaks appear for very short segments as fewer supporting data points make estimates more uncertain.
<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image6.png" alt="drawing" width="900"/>
➤ What do the allelic segments at 0 and 1 indicate? For example, at chr4, chr5 and chr17?

---

## Perform segmentation with allelic data alone
Perform one final comparison. Run ModelSegments and PlotModeledSegments for the matched-case using allelic data alone. 


In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --allelic-counts /home/jupyter/notebooks/3-somatic-cna/cna_inputs/hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts /home/jupyter/notebooks/3-somatic-cna/cna_inputs/hcc1143_N_clean.allelicCounts.tsv \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox \
    --output-prefix hcc1143_TN_allelic

In [None]:
! gatk PlotModeledSegments \
    --allelic-counts /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_TN_allelic.hets.tsv \
    --segments /home/jupyter/notebooks/3-somatic-cna/sandbox/hcc1143_TN_allelic.modelFinal.seg \
    --sequence-dictionary /home/jupyter/notebooks/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_TN_allelic

View the plot generated from the previous command by running the cell below.

In [None]:
display(Image('/home/jupyter/notebooks/3-somatic-cna/sandbox/cna_plots/hcc1143_TN_allelic.modeled.png'))

This produces an allelic ratios plot with 105 segments. 
<img src="https://storage.googleapis.com/gatk-tutorials/images/3-somatic/somatic-cna-image7.png" alt="drawing" width="900"/>
➤ This is ~4x fewer segments than the CR + allelic analysis, and ~2x fewer than the 235 segments from the copy ratios alone. How do you explain such differences? 

Remember that joint calling groups contiguous segments with the same copy ratio and the same minor allele fraction, for high-resolution results. Finally, remember that the CNA workflow produces copy ratios and not copy numbers. GATK is developing a tool to call absolute somatic copy numbers. For germline absolute copy number detection, see GATK4's [GermlineCNVCaller](https://gatk.broadinstitute.org/hc/en-us/articles/4405451329179-GermlineCNVCaller). 