GATK TUTORIAL :: Somatic CNA :: Worksheet
====================

**March 2019**  

This GATK tutorial corresponds to the worksheet named GATK TUTORIAL Somatic CNA Worksheet, available at [https://drive.google.com/drive/folders/1CZnuBm0z0sbLL6UA8cKHxV7uiFZxCl3M](https://drive.google.com/drive/folders/1CZnuBm0z0sbLL6UA8cKHxV7uiFZxCl3M). 

This hands-on tutorial outlines steps to sensitively detect alterations in total and allelic copy ratios using GATK4's ModelSegments CNA workflow. The workflow is suitable for detecting somatic copy ratio alterations, more familiarly known as copy number alterations (CNAs) or copy number variants (CNVs), in whole genomes and targeted exomes.

<img src="https://us.v-cdn.net/5019796/uploads/editor/t8/f03soh0gs0ve.png" alt="drawing" width="2000"/> 

This tutorial was last tested with the GATK4.1.0.0 Docker image. If the system's Docker engine limits memory, increase the memory available to Docker to at least 8GB; otherwise, the commands will fail with errors. 

---
**Table of Contents**   
1. NOTES ON THE WORKFLOW BETA STATUS AND TUTORIAL DATA	 
   1.1 Differences between GATK4 and GATK4.beta CNA workflows  
   1.2 Tutorial switches between data subset to chr17 and full data

2. PERFORM COVERAGE ANALYSIS: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA CNA  
   2.1 Prepare intervals for coverage collection  
   2.2 Collect read counts for samples across target intervals  
   2.3 Create CNA panel of normals (PoN)  
   2.4 Remove noise from sample coverage using the PoN  
   2.5 Perform segmentation based on coverage alone  

3. INCORPORATE ALLELIC DATA: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA ACNA  
   3.1 Perform segmentation jointly with coverage and allelic data  
   3.2 Perform segmentation with allelic data alone  
---

## Differences between GATK4 and GATK4.beta CNA workflows
The workflow in the official GATK4 release differs from that of the GATK4.BETA release. On the surface, two differences stand out. First, the official workflow is capable of efficiently handling WGS data. Second, it incorporates functionality that considers allelic data, which previously was a separate workflow. Note that the official GATK4 release CNA workflow itself is still in beta. This means the workflow is still undergoing adjustments. 

Tool-wise, differences are as follows. Note that you cannot substitute a PoN created with one workflow in the other workflow. 


| GATK4.0.3+ | GATK4.beta | Description |
| :---- | :---- | :---- |
| PreprocessIntervals | PadTargets | Pad or bin intervals |
| CollectReadCounts* | CalculateTargetCoverage | Collect read counts |
| CreateReadCountPanelOfNormals | CreatePanelOfNormals| Create the PoN |
| DenoiseReadCounts | NormalizeSomaticReadCounts | Denoise case sample counts against the PoN |
| CollectAllelicCounts | CollectAllelicCounts | Count alleles |
| ModelSegments | PerformSegmentation, ACNV workflow tools | Group and model contiguous copy-ratios and allele fractions |
| CallCopyRatioSegments | CallSegments | Call copy-neutral (0), loss (-), and gain (+) segments |
| PlotDenoisedCopyRatios & PlotModeledSegments | PlotSegmentedCopyRatio, PlotACNVResults | Plot copy ratios and allele fractions to visualize denoising and segmentation |

---


## Tutorial switches between data subset to chr17 and full data
We use 1000 Genomes Project (1KGP) data and the HCC1143 matched normal and tumor samples that we also use in the Mutect2 tutorial. Note the tutorial coverage data originates from a previous iteration of the workflow prior to v4.0.3 that used CollectFragmentCounts instead of CollectReadCounts*.

- Panel of normals samples are Phase 3 1KGP samples aligned to GRCh38.
- Case sample data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are appropriately consented and known as HCC1143 and HCC1143_BL, respectively. 
- Target intervals are an intersection of the HCC capture kit targets and 1KGP WES targets. Targets were converted from GRCh37 to GRCh38 coordinates using UCSC liftOver.

Note the tutorial switches between data subset to chr17 and full data. At any point, you can use the input files provided in the cna_precomputed folder instead of the sandbox files generated during the tutorial. [gs://gatk-tutorials/workshop_1903/3-somatic/cna_precomputed](https://console.cloud.google.com/storage/browser/gatk-tutorials/workshop_1903/3-somatic/cna_precomputed/?project=broad-dsde-outreach&organizationId=548622027621)

### First, make sure the notebook is using a Python 3 kernel in the top right corner.
A kernel is a _computational engine_ that executes the code in the notebook. We can execute GATK commands using _Python Magic_ (`!`).

### How to run this notebook:
- **Click to select a gray cell and then press SHIFT+ENTER to run the cell.**
- **Write results to `/home/jupyter-user/3-somatic-cna/sandbox/`. To access the directory, click on the upper-left jupyter icon.**

In [None]:
# Create your sandbox directory
! mkdir -p /home/jupyter-user/3-somatic-cna/sandbox/
# Create directory to store plots (-p avoids an error if it already exists)
! mkdir -p /home/jupyter-user/3-somatic-cna/sandbox/cna_plots/
# Remove any old symbolic link to the sandbox directory (-f ignores a missing link) and add a new link
! rm -f sandbox
! ln -s /home/jupyter-user/3-somatic-cna/sandbox sandbox
! ls sandbox/

### Enable reading Google bucket data 

In [None]:
# Check if data is accessible. The command should list several gs:// URLs.
! gsutil ls gs://gatk-tutorials/workshop_1903/3-somatic/

In [None]:
# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. 
# Afterwards, restart the kernel with Kernel > Restart.
#! pip install google-cloud-storage

### Install R packages

In [None]:
# Install R packages for plotting 
! echo "install.packages(c(\"optparse\",\"data.table\"))" | R --no-save

### Download Data Locally
Some tools cannot read directly from a Google bucket, so here we download the files locally.

In [None]:
! mkdir -p /home/jupyter-user/3-somatic-cna/cna_inputs
! gsutil cp gs://gatk-tutorials/workshop_1903/3-somatic/cna_inputs/* /home/jupyter-user/3-somatic-cna/cna_inputs/
! mkdir -p /home/jupyter-user/3-somatic-cna/ref/
! gsutil cp gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.dict /home/jupyter-user/3-somatic-cna/ref/

---

# PERFORM COVERAGE ANALYSIS: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA CNA

## Prepare intervals for coverage collection
We define the genomic regions in which we expect read coverage. Since we are using exome data, we will pad the target regions. Padding target regions 250 bases on each side has been shown to increase sensitivity for the CNA workflow. In the case of whole genome data, we would divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.
 
The --bin-length value must be set for different data types, e.g. default 1000 for whole genome or 0 for exomes. For the tutorial exome data, we provide a snippet of the capture kit target regions and set --bin-length to zero.
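
To make the whole-genome case concrete, here is a minimal Python sketch of dividing one contig into equally sized bins. It only illustrates the idea; `make_bins` and the toy contig are hypothetical, and keeping a shorter final bin is an assumption rather than documented PreprocessIntervals behavior.

```python
def make_bins(contig, contig_length, bin_length):
    """Split one contig into consecutive 1-based, closed intervals of bin_length bases.

    Assumption: the last bin is simply kept shorter when the contig length
    is not a multiple of bin_length.
    """
    if bin_length <= 0:
        raise ValueError("bin_length must be positive when binning")
    bins = []
    start = 1
    while start <= contig_length:
        end = min(start + bin_length - 1, contig_length)
        bins.append((contig, start, end))
        start = end + 1
    return bins


# Toy contig of 3,500 bases with the whole-genome default bin length of 1000
toy_bins = make_bins("toy_contig", 3500, 1000)
for toy_bin in toy_bins:
    print(toy_bin)
```

For exome data, where `--bin-length` is 0, no such binning applies and the padded targets themselves are the intervals.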

In [None]:
 ! gatk PreprocessIntervals \
    -L gs://gatk-tutorials/workshop_1903/3-somatic/resources/targets_chr17.interval_list \
    -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    --padding 250 \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O /home/jupyter-user/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list

This produces a Picard-style intervals list targets_chr17.preprocessed.interval_list with 11,307 targets for use in the coverage collection step.

➤ Peruse both the before and after intervals. Do we have the same number of intervals as before? How does the tool pad intervals that are less than 500bp apart?  
➤ Take a look at the tool doc description for -imr OVERLAPPING_ONLY at <https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_CollectReadCounts.php#--interval-merging-rule>. What does this option ensure?

---
## Collect read counts for samples across target intervals

The basis for detecting amplification and deletion events from sequencing data is read coverage. In this step, we count the number of read starts that overlap each interval using CollectReadCounts. We perform this step for the tumor sample and for the normal sample.
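
The counting idea can be sketched in a few lines of Python. This is a toy illustration on a single contig, not the CollectReadCounts implementation; `count_read_starts` and the toy positions are hypothetical, and read filtering is ignored.

```python
from bisect import bisect_left, bisect_right


def count_read_starts(read_starts, intervals):
    """Count read start positions that fall inside each closed interval [start, end]."""
    sorted_starts = sorted(read_starts)
    counts = []
    for start, end in intervals:
        # Index of the first read start >= interval start
        lo = bisect_left(sorted_starts, start)
        # Index just past the last read start <= interval end
        hi = bisect_right(sorted_starts, end)
        counts.append(hi - lo)
    return counts


toy_intervals = [(100, 199), (300, 399), (500, 599)]
toy_read_starts = [95, 100, 150, 199, 200, 310, 399, 650]
toy_counts = count_read_starts(toy_read_starts, toy_intervals)
print(toy_counts)
```

Sorting the read starts once and using binary search keeps the work per interval small, which matters when there are many intervals.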

By default, the tool writes HDF5 format <https://software.broadinstitute.org/gatk/documentation/article?id=11508> data, which is handled more efficiently by downstream tools (decreases runtime by reducing time spent on IO). Here we change the output format to TSV for teaching purposes. 

In [None]:
! gatk CollectReadCounts \
    -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/tumor.bam \
    -L /home/jupyter-user/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list \
    -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    --format TSV \
    -imr OVERLAPPING_ONLY \
    -O /home/jupyter-user/3-somatic-cna/sandbox/tumor.counts.tsv

In [None]:
! gatk CollectReadCounts \
    -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/normal.bam \
    -L /home/jupyter-user/3-somatic-cna/sandbox/targets_chr17.preprocessed.interval_list \
    -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    --format TSV \
    -imr OVERLAPPING_ONLY \
    -O /home/jupyter-user/3-somatic-cna/sandbox/normal.counts.tsv
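
Because the counts are written as TSV, they are easy to inspect from Python. The sketch below assumes the file starts with header lines beginning with `@`, followed by a tab-separated table with the columns CONTIG, START, END and COUNT; check your own file to confirm. `read_gatk_tsv` is a hypothetical helper and the table contents are toy values.

```python
import csv
import io


def read_gatk_tsv(handle):
    """Parse a tab-separated table, skipping header lines that start with '@'."""
    data_lines = (line for line in handle if not line.startswith("@"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


# Toy stand-in for a counts file (assumed layout, made-up counts)
toy_text = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr17\tLN:83257441\n"
    "CONTIG\tSTART\tEND\tCOUNT\n"
    "chr17\t1000\t1500\t42\n"
    "chr17\t2000\t2600\t0\n"
)
rows = read_gatk_tsv(io.StringIO(toy_text))
total = sum(int(row["COUNT"]) for row in rows)
print(len(rows), "intervals,", total, "reads in total")
```

To read a real file, pass an open file handle instead, e.g. `with open("sandbox/tumor.counts.tsv") as handle: rows = read_gatk_tsv(handle)`.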

Here we show the raw counts per target (y-axis) for the normal and the tumor across 23 chromosomes (x-axis), produced by a previous iteration of the workflow that used a now deprecated tool, CollectFragmentCounts. Each target is represented by a point.
<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/cna-image1.png" alt="drawing" width="1000"/>
➤ Can you tell if either sample has copy number variants?

---
## Create CNA panel of normals (PoN)

Now we generate the CNA PoN with CreateReadCountPanelOfNormals. The tool creates a panel of normals that serves as the baseline against which the workflow compares case samples. The tool uses singular value decomposition (SVD), the matrix factorization that underlies principal component analysis (PCA), to capture systematic noise as distinct from statistical noise.

Normally, you will want to create a PoN with some number of normal samples that were ideally subject to the same batch effects as your case sample under scrutiny. This tutorial will use a PoN made of forty 1KGP normal samples and generated with the following command: 

```
gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I file1_clean.counts.hdf5 \
    … 
    -I file40_clean.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O cnaponC.pon.hdf5
```

Changing the `--minimum-interval-median-percentile` argument from the default of 10.0 to a smaller value of 5.0 allows retention of more data, which is appropriate for this carefully selected normals cohort. With this parameter, the tool filters out targets or bins with a median fractional coverage below this percentile. The median is across the samples. The fractional coverage is the target coverage divided by the sum of the coverage of all targets for a sample.
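
The filter described above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's code: the helper names are hypothetical, the percentile uses simple linear interpolation (GATK's exact convention may differ), and the toy counts are made up.

```python
from statistics import median


def fractional_coverage(sample_counts):
    """Each target's count divided by the sum of counts over all targets of the sample."""
    total = sum(sample_counts)
    return [count / total for count in sample_counts]


def percentile(values, pct):
    """Linear-interpolation percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def keep_intervals(counts_by_sample, pct):
    """Indices of targets whose median fractional coverage is not below the percentile."""
    fractions = [fractional_coverage(sample) for sample in counts_by_sample]
    n_intervals = len(counts_by_sample[0])
    # The median is taken across samples, separately for each target
    medians = [median(f[i] for f in fractions) for i in range(n_intervals)]
    cutoff = percentile(medians, pct)
    return [i for i, m in enumerate(medians) if m >= cutoff]


# Toy cohort: 3 samples x 21 targets, with coverage increasing along the targets
base = list(range(1, 22))
toy_cohort = [base, [2 * c for c in base], [3 * c for c in base]]
kept_default = keep_intervals(toy_cohort, 10.0)
kept_relaxed = keep_intervals(toy_cohort, 5.0)
print(len(kept_default), len(kept_relaxed))
```

In this toy cohort, lowering the percentile from 10.0 to 5.0 retains one more target, which mirrors the point that a smaller value keeps more data.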

CreateReadCountPanelOfNormals performs several other filtering steps across samples and across targets, and <https://gatkforums.broadinstitute.org/dsde/discussion/11682#2> outlines these.

At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as the tumor sample. Our recommendation is forty or more normal samples. To illustrate tool features, we create a PoN from our single normal sample with the following command.

In [None]:
! gatk CreateReadCountPanelOfNormals \
    -I /home/jupyter-user/3-somatic-cna/sandbox/normal.counts.tsv \
    -O /home/jupyter-user/3-somatic-cna/sandbox/normal.pon.hdf5

➤ Study the stdout. Are we losing any data during the filtering steps? Given the reasons one might want to use a matched normal, would you change this command? Remember also PoN medians are used to standardize case counts (by dividing).

So far we have been using subset data. Now run the CreateReadCountPanelOfNormals command using the full data file cna_inputs/hcc1143_N_clean.counts.hdf5, and add the --minimum-interval-median-percentile argument with a value of your choice. The cell below uses 5.0.

In [None]:
! gatk CreateReadCountPanelOfNormals \
    -I /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O /home/jupyter-user/3-somatic-cna/sandbox/normal.pon.hdf5

➤ Which do you think will perform better in revealing copy number events in the tumor, the 40-sample PoN or the matched-normal? Why?
 
If you are curious to see for yourself how the matched-normal PoN pans out, it is possible to substitute it in to the remaining steps. Instructions continue with the 40-sample PoN.

---
## Remove noise from sample coverage using the PoN

We use DenoiseReadCounts and the PoN to standardize and then denoise sample read counts. The resulting two files each capture a step. In the single-sample-PoN case, the two results will be identical to each other, as the tool only performs standardization.  

**A. Denoise Read Counts**

In [None]:
! gatk --java-options "-Xmx7g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" DenoiseReadCounts \
    -I /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals /home/jupyter-user/3-somatic-cna/cna_inputs/cnaponC.pon.hdf5 \
    --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv

In [None]:
! gatk --java-options "-Xmx7g" DenoiseReadCounts \
    -I /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.counts.hdf5 \
    --count-panel-of-normals /home/jupyter-user/3-somatic-cna/cna_inputs/cnaponC.pon.hdf5 \
    --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv

➤ Skim the stdout to get a sense of the data transformations during standardization vs. denoising. 

The tool uses the maximum number of eigensamples available in the PoN. Changing the `--number-of-eigensamples` in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. Using a larger number of principal components will result in a higher level of denoising and a larger difference in the MADs. The level of denoising should be chosen with some care, as it will ultimately affect the sensitivity of the analysis.

**B. Plot Denoised Copy Ratios**  
Let's take a look at the data in its current state.

In [None]:
! gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_T_clean

View the plot generated from the previous command below. Remove quotes around `!` to view plot in this notebook.  
"!"[hcc1143_T_clean.denoised.png](sandbox/cna_plots/hcc1143_T_clean.denoised.png)

In [None]:
! gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.standardizedCR.tsv \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \
    --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_N_clean

View the plot generated from the previous command below. Remove quotes around `!` to view plot in this notebook.  
"!"[hcc1143_N_clean.denoised.png](sandbox/cna_plots/hcc1143_N_clean.denoised.png)

➤ Skim the stdout to get a sense of the data transformations during standardization and denoising. 

Each command produces two sets of data: plots and QC values. 
- In the plots, standardized copy ratios are shown in blue. Standardization involves median-centering and log-transformation. Denoised copy ratios are in green. Denoising is performed using the principal components of the PoN. 
- The QC values pertain to the median-absolute-deviation (MAD) in different contexts, including the change between standardized and denoised (.deltaMAD.txt)  and the change between the two scaled by the standardized MAD (.deltaScaledMAD.txt).
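
To make the two bullet points concrete, here is a rough Python sketch of standardization and of the two MAD-based QC values. It is a simplification under stated assumptions: the order of operations, the textbook MAD definition, and the sign of the delta (standardized minus denoised) are assumptions, the function names are hypothetical, and the PCA-based denoising step itself is not shown.

```python
import math
from statistics import median


def standardize(case_counts, pon_medians):
    """Fractional coverage divided by PoN medians, log2-transformed, then median-centered.

    Assumes all counts and PoN medians are positive.
    """
    total = sum(case_counts)
    fractional = [count / total for count in case_counts]
    log2_ratios = [math.log2(f / m) for f, m in zip(fractional, pon_medians)]
    center = median(log2_ratios)
    return [value - center for value in log2_ratios]


def mad(values):
    """Textbook median absolute deviation (the tool's exact definition may differ)."""
    values = list(values)
    center = median(values)
    return median(abs(value - center) for value in values)


def delta_mad(standardized, denoised):
    """Return (delta MAD, delta MAD scaled by the standardized MAD)."""
    mad_standardized = mad(standardized)
    delta = mad_standardized - mad(denoised)
    return delta, delta / mad_standardized


# Toy values only
toy_standardized = standardize([10, 20, 40, 30], [0.1, 0.2, 0.2, 0.5])
toy_delta, toy_scaled = delta_mad([-0.4, -0.2, 0.0, 0.2, 0.4],
                                  [-0.2, -0.1, 0.0, 0.1, 0.2])
print(toy_standardized)
print(toy_delta, toy_scaled)
```

Under this sign convention, a positive delta means the denoised copy ratios are less dispersed than the standardized ones.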

| . | . |
| --- | --- |
| <img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/cna-image2.png" alt="drawing" width="500"/> | <img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/cna-image3.png" alt="drawing" width="500"/> |

---


## Perform segmentation based on coverage alone

At the heart of the GATK4 CNA workflow is ModelSegments, a tool that groups contiguous copy ratios into segments. Copy ratios, allelic data, or both can inform segmentation. So far, the tutorial has focused only on coverage data, so let's see what segmentation with coverage alone looks like.

**A. Model segments on coverage alone**  

In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --output /home/jupyter-user/3-somatic-cna/sandbox  \
    --output-prefix hcc1143_T_clean

In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \
    --output /home/jupyter-user/3-somatic-cna/sandbox  \
    --output-prefix hcc1143_N_clean

Each command produces nine files.

Under the hood, a Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from the GATK4.beta tool it replaces, PerformSegmentation, which used a circular binary-segmentation (CBS) algorithm. The kernel algorithm in ModelSegments enables efficient segmentation of dense data, e.g. that of whole genome sequences. The tool (i) performs multidimensional kernel segmentation and (ii) performs Markov-Chain Monte Carlo (MCMC) sampling and segment smoothing iteratively.  

**B. Plot Modeled Segments**  
Let's see what modeling segments on coverage alone looks like. Here we provide a second plotting tool, PlotModeledSegments, with the denoised copy ratios (from DenoiseReadCounts), the segments (from ModelSegments), and the reference sequence dictionary. 

In [None]:
! gatk PlotModeledSegments \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_T_clean

View the plot generated from the previous command below. Remove quotes around `!` to view plot in this notebook.  
"!"[hcc1143_T_clean.modeled.png](sandbox/cna_plots/hcc1143_T_clean.modeled.png)

In [None]:
! gatk PlotModeledSegments \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.denoisedCR.tsv \
    --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.modelFinal.seg \
    --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_N_clean


View the plot generated from the previous command below. Remove quotes around `!` to view plot in this notebook.  
"!"[hcc1143_N_clean.modeled.png](sandbox/cna_plots/hcc1143_N_clean.modeled.png)

The command produces a plot with extension .modeled.png, where denoised copy ratios in alternate segments are colored in blue and orange and segment medians are drawn in black. For noisy data, box plots of the available posteriors for each segment become visible. 

<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/cna-image4.png" alt="drawing" width="900"/>
<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/cna-image5.png" alt="drawing" width="900"/>

➤ The tumor sample shows a lot of activity. Specifically, it has 235 segments. Is this surprising?  
➤ Focus on chr2 of the normal sample. How do you interpret its copy ratio of ~1.3? How about the ~0.9 copy ratio of chr6? 
 
At a glance, segments appear to separate into roughly evenly spaced ratios, which represent absolute copy numbers, e.g. 1, 2, 3 and so on. Segments that fall between these likely represent subclonal populations. 
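
One way to see why integer copy numbers give evenly spaced copy ratios is a simple mixing model. The sketch below is not part of the GATK workflow, which reports copy ratios rather than copy numbers; `expected_copy_ratio` and its purity and ploidy parameters are hypothetical, and the model assumes a single tumor clone mixed with diploid normal cells.

```python
def expected_copy_ratio(copy_number, purity, tumor_ploidy, normal_ploidy=2.0):
    """Expected copy ratio of a segment in a tumor/normal mixture.

    Assumptions: one tumor clone, normal cells at normal_ploidy copies everywhere,
    and a copy ratio of 1 corresponding to the average ploidy of the mixture.
    """
    mixed_copies = purity * copy_number + (1.0 - purity) * normal_ploidy
    mixed_ploidy = purity * tumor_ploidy + (1.0 - purity) * normal_ploidy
    return mixed_copies / mixed_ploidy


for purity in (1.0, 0.5):
    ratios = [expected_copy_ratio(n, purity, tumor_ploidy=2.0) for n in range(5)]
    print(purity, [round(r, 3) for r in ratios])
```

In this model each additional copy shifts the ratio by the same step, and that step shrinks as purity drops, so the levels move closer together in impure samples.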

**C. (Optional) Call Copy Ratio Segments**  
If you need callsets with amplifications (+), deletions (-) and neutral segments (0) clearly marked, then CallCopyRatioSegments can do this for you. These designations are appended as a new column to the segmented copy-ratio .cr.seg file from ModelSegments. As of July 2018, this part of the workflow is still under active development.

In [None]:
! gatk CallCopyRatioSegments \
    -I /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.cr.seg \
    -O /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.called.seg
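
As a mental model for what such calls mean, the sketch below labels a segment from its mean log2 copy ratio using two cutoffs. It is a deliberately simplified illustration, not the CallCopyRatioSegments algorithm, whose logic this tutorial does not describe; `call_segment` and the cutoff values are hypothetical.

```python
def call_segment(mean_log2_copy_ratio, neutral_lower=0.9, neutral_upper=1.1):
    """Label a segment '+', '-' or '0' from its mean log2 copy ratio.

    The neutral bounds are on the linear copy-ratio scale and are
    illustrative values chosen for this sketch.
    """
    copy_ratio = 2.0 ** mean_log2_copy_ratio
    if copy_ratio < neutral_lower:
        return "-"
    if copy_ratio > neutral_upper:
        return "+"
    return "0"


# Toy segment means on the log2 scale
toy_segments = [-1.0, -0.05, 0.0, 0.58]
toy_calls = [call_segment(value) for value in toy_segments]
print(toy_calls)
```

A real caller also has to account for how noisy each segment is, which fixed cutoffs like these ignore.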

---
# INCORPORATE ALLELIC DATA: MODELSEGMENTS CNA EQUIVALENT OF GATK4.BETA ACNA

## Perform segmentation jointly with coverage and allelic data
We just saw what segmentation with coverage data alone looks like. But we can squeeze more juice out of the lemon! In this section, we will model segments using both allelic counts and coverage data for a matched-control case. 

➤ How is allelic counts data useful in detecting copy alteration?

Consider how, in normal germline sequencing, we decide whether a site's genotype is heterozygous or homozygous. For a heterozygous site, one that presents two alleles in a diploid sample, the confidence that the sample has at least two chromosomes is high. A hundred heterozygous sites adjacent to each other become strong evidence for the multi-copy state of the genomic interval. 

We can extend this concept further towards detection of a type of zygosity that has implications for cancer. We can take allele counts for sites that are commonly variant in the population. For sites where the normal control is heterozygous, if the tumor sample is homozygous, then we can deduce the tumor underwent loss of heterozygosity (LOH) for the allele. With a string of adjacent LOH sites, we can be confident of an LOH segment. Here, either the tumor simply lost the chromosome segment or underwent a slightly more complicated event called copy-neutral LOH (cnLOH). Coverage data can offer clues towards deducing which type of loss is likely.   
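
The LOH reasoning can be sketched with hard thresholds on alternate-allele fractions. This is only a toy illustration of the logic, not how ModelSegments works; the function names, the thresholds and the counts are all hypothetical.

```python
def alt_fraction(ref_count, alt_count):
    """Alternate-allele fraction, or None when the site has no reads."""
    total = ref_count + alt_count
    return alt_count / total if total > 0 else None


def looks_het(fraction, low=0.3, high=0.7):
    """Hypothetical cutoff: alt fraction near 0.5."""
    return fraction is not None and low <= fraction <= high


def looks_hom(fraction, margin=0.1):
    """Hypothetical cutoff: alt fraction near 0 or near 1."""
    return fraction is not None and (fraction <= margin or fraction >= 1.0 - margin)


def loh_sites(normal_counts, tumor_counts):
    """Positions where the normal looks heterozygous and the tumor looks homozygous."""
    flagged = []
    for position, (n_ref, n_alt) in normal_counts.items():
        if position not in tumor_counts:
            continue
        t_ref, t_alt = tumor_counts[position]
        if looks_het(alt_fraction(n_ref, n_alt)) and looks_hom(alt_fraction(t_ref, t_alt)):
            flagged.append(position)
    return sorted(flagged)


# Toy (REF count, ALT count) pairs keyed by position
toy_normal = {100: (20, 22), 200: (40, 1), 300: (18, 20)}
toy_tumor = {100: (45, 2), 200: (50, 0), 300: (21, 19)}
toy_loh = loh_sites(toy_normal, toy_tumor)
print(toy_loh)
```

A single flagged site is weak evidence on its own; as the text notes, it is a run of adjacent flagged sites that supports an LOH segment.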

Note it is possible to use allelic counts alone with ModelSegments. Furthermore, the tool will model segments for either a matched case or for a case sample alone. The latter can be useful in revealing clonal subpopulations. 

**A. Collect allelic counts from pileups (chr17 data)**  
CollectAllelicCounts tabulates counts of the reference allele and counts of the dominant alternate allele for sites in a given genomic intervals list. The tool filters out reads with MAPQ below 30 and discounts bases with base quality less than 20.
 
We perform this step on the chr17 subset data. In later steps, we will use precomputed results from the full data. Here, theta_snps_paddedC_chr17.vcf.gz contains lifted-over gnomAD SNPs-only sites subset to the padded target regions from section 1. 

In [None]:
! gatk CollectAllelicCounts \
    -L gs://gatk-tutorials/workshop_1903/3-somatic/resources/theta_snps_paddedC_chr17.vcf.gz \
    -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/tumor.bam \
    -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    -O /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.allelicCounts.tsv

In [None]:
! gatk CollectAllelicCounts \
    -L gs://gatk-tutorials/workshop_1903/3-somatic/resources/theta_snps_paddedC_chr17.vcf.gz \
    -I gs://gatk-tutorials/workshop_1903/3-somatic/bams/normal.bam \
    -R gs://gatk-tutorials/workshop_1903/3-somatic/ref/Homo_sapiens_assembly38.fasta \
    -O /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_N_clean.allelicCounts.tsv

The resulting tables list the read counts for REF and ALT, as well as the REF allele and the ALT allele, for every site provided in the intervals list.

➤ For sites lacking ALT allele counts, what is in the field for ALT_NUCLEOTIDE?

**B. Model segments jointly on coverage and allelic data (full data)**  
In this step, the full spectrum of data converge. We provide precomputed allelic counts from the cna_inputs folder and tumor denoised read counts. Here we use default parameters. Adjusting tool parameters can change the resolution and smoothness of the segmentation results and we recommend researchers tune the parameters for their data.

In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.allelicCounts.tsv \
    --output /home/jupyter-user/3-somatic-cna/sandbox \
    --output-prefix hcc1143_TN_clean

➤ Skim the stdout to get a sense of the preprocessing and analysis. The tool filters sites with total allelic counts less than how many? How many control heterozygous sites does the tool retain? Does the tool then use all of these towards the joint analysis?  
➤ How many segments does the MultidimensionalKernelSegmenter initially find? After smoothing, how many final segments are there? 

The step produces eleven files. See ModelSegments tool documentation for details <https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.1/org_broadinstitute_hellbender_tools_copynumber_ModelSegments.php>. Of note, we have two files with .hets. in the extension, .hets.normal.tsv and .hets.tsv. The former contains the normal control's heterozygous sites. The latter contains the tumor's allele counts for the normal's heterozygous sites. Finally, the .modelFinal.seg file contains the segmentation results.

**C. Plot coverage copy ratios and alternate allele fractions**  
We provide PlotModeledSegments the case sample's denoised copy ratios, .hets allele counts, and final segmentation results. 

In [None]:
! gatk PlotModeledSegments \
    --denoised-copy-ratios /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_clean.hets.tsv \
    --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_clean.modelFinal.seg \
    --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_TN_clean

View the plot generated from the previous command below. Remove quotes around `!` to view plot in this notebook.  
"!"[hcc1143_TN_clean.modeled.png](sandbox/cna_plots/hcc1143_TN_clean.modeled.png)

This produces a file with two plots, each with 398 segments. The top plot shows segmented copy ratios and the bottom plot shows segmented alternate-allele fractions. Box plots for the major and minor allele fractions mark the 10th, 50th and 90th percentile credible intervals. Vertical streaks appear for very short segments as fewer supporting data points make estimates more uncertain.
<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/cna-image6.png" alt="drawing" width="900"/>
➤ What do the allelic segments at 0 and 1 indicate? For example, at chr4, chr5 and chr17?

---

## Perform segmentation with allelic data alone
Perform one final comparison. Run ModelSegments and PlotModeledSegments for the matched-case using allelic data alone. 


In [None]:
! gatk --java-options "-Xmx7g" ModelSegments \
    --allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts /home/jupyter-user/3-somatic-cna/cna_inputs/hcc1143_N_clean.allelicCounts.tsv \
    --output /home/jupyter-user/3-somatic-cna/sandbox \
    --output-prefix hcc1143_TN_allelic

In [None]:
! gatk PlotModeledSegments \
    --allelic-counts /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_allelic.hets.tsv \
    --segments /home/jupyter-user/3-somatic-cna/sandbox/hcc1143_TN_allelic.modelFinal.seg \
    --sequence-dictionary /home/jupyter-user/3-somatic-cna/ref/Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output /home/jupyter-user/3-somatic-cna/sandbox/cna_plots \
    --output-prefix hcc1143_TN_allelic

View the plot generated from the previous command below. Remove quotes around `!` to view plot in this notebook.  
"!"[hcc1143_TN_allelic.modeled.png](sandbox/cna_plots/hcc1143_TN_allelic.modeled.png)

This produces an allelic ratios plot with 105 segments. 
<img src="https://storage.googleapis.com/gatk-tutorials/workshop_1903/3-somatic/images/cna-image7.png" alt="drawing" width="900"/>
➤ This is ~4x fewer segments than the CR + allelic analysis, and ~2x fewer than the 235 segments from the copy ratios alone. How do you explain such differences? 

Remember that joint calling groups contiguous segments with the same copy ratio and the same minor allele fraction, for high-resolution results. Finally, remember that the CNA workflow produces copy ratios and not copy numbers. GATK is developing a tool to call absolute somatic copy numbers. For germline absolute copy number detection, see GATK4's GermlineCNVCaller <https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_GermlineCNVCaller.php>. 