
# GATK Germline Copy Number Variation Tutorial <a class="tocSkip">

**March 2023**  

<img src="https://storage.googleapis.com/broad-dsde-methods-gatk-workshop-public/images/germline-cnv-tutorial/gatk-gcnv-pipeline-overview.png" alt="drawing" width="100%" align="center"/>


The tutorial outlines the steps for detecting germline copy number variants with GATK-gCNV and illustrates two workflow modes: **cohort mode** and **case mode**. Cohort mode simultaneously generates a cohort model and calls CNVs for the cohort samples. Case mode analyzes a single sample against an already constructed cohort model. The same workflow steps apply to both targeted exome and whole genome sequencing (WGS) data. The workflow can call both rare and common events and handles allosomal ploidies, i.e. cohorts of mixed male and female samples.

For the cohort mode, the general recommendation is at least a hundred samples to start. Researchers should expect to tune workflow parameters from the provided defaults. In particular, GermlineCNVCaller's default inference parameters are conservatively set for efficient run times.

The figure above diagrams the workflow tools. **Section 1** sets up the Notebook environment and downloads the data and resource files needed to complete this tutorial. **Section 2** creates an intervals list and counts read alignments overlapping the intervals. **Section 3** shows optional but recommended cohort mode steps to annotate intervals with covariates for use in filtering intervals as well as for use in explicit modeling. The section also removes intervals with outlier counts. **Section 4** generates global baseline observations for the data and models and calls the ploidy of each contig (ploidy means the overall baseline copy number of the contig). **Section 5** is at the heart of the workflow and models per-interval copy number. Because model fitting is compute-intensive, the section shows how to analyze data in parts. Finally, **Section 6** calls per-sample copy number events per interval and per segment. Results are in VCF format.

➤ This guided Notebook demo is largely based on [Tutorial #11684](https://gatk.broadinstitute.org/hc/en-us/articles/360035531152--How-to-Call-common-and-rare-germline-copy-number-variants)  
➤ A highly recommended [paper](https://www.nature.com/articles/s41588-023-01449-0) detailing the methods was published in _Nature Genetics_ in 2023.  
➤ For pipelined workflows, see the WDL scripts in this Terra workspace.  
➤ This workflow is **not** appropriate for bulk tumor samples, as it infers absolute copy numbers. For somatic copy number alteration calling, see [Tutorial #11682](https://gatk.zendesk.com/hc/en-us/articles/360035531092).

_This tutorial was last tested with GATK v4.2.4.0 and IGV v2.8.0._
See [GATK Tool Documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360037224712) for further information on the tools we use below.

# Set up your Notebook


## Set cloud environment values
If you are opening a notebook for the first time today and you didn't adjust any runtime values, now's the time to edit them. Click on the gear icon in the upper right to edit your Notebook Runtime. Set the values as specified below:

| Option | Value |
| ------ | ------ |
| Environment | [Custom](https://github.com/broadinstitute/gatk-workshop-terra-jupyter-image/wiki/Using-the-gatk%E2%80%90workshop%E2%80%90terra%E2%80%90jupyter%E2%80%90image-in-the-Terra-Jupyter-environment#6-type-the-command-setup_gatk_env-in-the-terminal-and-hit-enter) |
| Profile | Custom |
| CPU | 4 |
| Disk size | 100 GB |
| Memory | 15 GB |

**Please Note:** This notebook currently requires the use of a custom environment, as described [here](https://github.com/broadinstitute/gatk-workshop-terra-jupyter-image/wiki/Using-the-gatk%E2%80%90workshop%E2%80%90terra%E2%80%90jupyter%E2%80%90image-in-the-Terra-Jupyter-environment#6-type-the-command-setup_gatk_env-in-the-terminal-and-hit-enter).

Click the "Create"/"Update" button when you are done, and Terra will begin to create a new runtime with your settings. When it is finished, a pop-up will ask you to apply the new settings. In the meantime, you can continue with the setup instructions below.



## Set up the GATK conda environment

Run the following command in the Terra terminal (available from the right panel):

```
setup_gatk_env
```

When you are done, close this notebook and reopen it.


## Check kernel type selected in the Jupyter notebook
For this particular notebook, we will be using a specialized kernel that we've added to our **Kernel** menu by following the previous two steps for creating the custom environment. If you've successfully done this, you should be able to select a kernel called `Python [conda env:gatkconda]` under the **Kernel** > **Change Kernel** menu.

## Set up your files
Your notebook has a temporary folder that exists so long as your cluster is running. To see what files are in your notebook environment at any time, you can click on the Jupyter logo in the upper left corner. 

For this tutorial, we need to copy some files from this temporary folder to and from our workspace bucket. Run the two commands below to set up the workspace bucket variable and the file paths inside your notebook.

<font color = "green"> **Tool Tip:** To run a cell in a notebook, press `SHIFT + ENTER`</font>

In [None]:
# Set your workspace bucket variable for this notebook.
import os

WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']
WORKSHOP_BUCKET = 'gs://broad-dsde-methods-gatk-workshop-public'
REFERENCE_BUCKET = 'gs://genomics-public-data/resources/broad/hg38/v0'
WORKSPACE_LOCAL = '/home/jupyter/notebooks/germline-cnv'

In [None]:
# Create sandbox and file directories
! mkdir -p $WORKSPACE_LOCAL/sandbox/
! mkdir -p $WORKSPACE_LOCAL/ref/
! mkdir -p $WORKSPACE_LOCAL/data/

# Removes any old symbolic linked sandbox directory and adds a new link, in case you've run this before
! rm -rf sandbox
! ln -s $WORKSPACE_LOCAL/sandbox sandbox

## Check data permissions
For this tutorial, we have hosted the starting files in a public Google bucket. We will first check that the data is available to your user account, and if it is not, we simply need to install Google Cloud Storage.

In [None]:
# Check if data is accessible. The command should list several gs:// URLs.
! gsutil ls $WORKSHOP_BUCKET/germline-cnv

In [None]:
# If you do not see gs:// URLs listed above, run this cell to install Google Cloud Storage. 
# Afterwards, restart the kernel with Kernel > Restart.
#! pip install google-cloud-storage

## Download data to local disk

The tutorial provides example small WGS data sourced from the _1000 Genomes Project_. Cohort mode illustrations use 7 samples, while case mode illustrations analyze one sample against a cohort model made from the remaining 6 samples. The tutorial uses a fraction of the workflow's recommended hundred samples for ease of illustration. Furthermore, commands in each step use one of three differently sized intervals lists for efficiency. Coverage data are from the entirety of chr20, chrX and chrY. So although a step may analyze a subset of regions, it is possible to instead analyze all three contigs in case or cohort modes.

Download `tutorial_11684.tar.gz` from the Workshop Google bucket. The bundle includes data for [Notebook #11685](https://gatk.zendesk.com/hc/en-us/articles/360035890031) and [Notebook #11686](https://gatk.zendesk.com/hc/en-us/articles/360035889891). The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. The example data is from the [1000 Genomes project](http://www.internationalgenome.org/) Phase 3 aligned to GRCh38.

In [None]:
! gsutil cp $WORKSHOP_BUCKET/germline-cnv/tutorial_11684.tar.gz /home/jupyter/notebooks/germline-cnv/data/

In [None]:
! tar -xvf $WORKSPACE_LOCAL/data/tutorial_11684.tar.gz -C $WORKSPACE_LOCAL/data > /dev/null 2>&1

In [None]:
# let's make a shortcut variable
TUTORIAL_DATA_PATH = '/home/jupyter/notebooks/germline-cnv/data/Tutorial11684'

---

# Collect raw counts data with _PreprocessIntervals_ and _CollectReadCounts_
PreprocessIntervals pads exome targets and bins WGS intervals. Binning refers to creating equally sized intervals across the reference. For example, 1000 base binning would define chr1:1-1000 as the first bin. Because counts of reads on reference N bases are not meaningful, the tool automatically excludes bins with all Ns. For GRCh38 chr1, non-N sequences start at base 10,001, so the first few bins become chr1:10001-11000, chr1:11001-12000 and chr1:12001-13000.
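
As a rough illustration of the binning rule described above, the toy Python sketch below cuts a sequence into fixed-width bins and drops bins that consist entirely of N bases. The function `make_bins` and the synthetic contig are hypothetical and only mimic the idea; the actual tool reads the reference FASTA and also handles padding and interval merging.

```python
def make_bins(sequence, bin_length=1000):
    """Return 1-based inclusive (start, end) bins, skipping bins made entirely of N."""
    bins = []
    for offset in range(0, len(sequence), bin_length):
        chunk = sequence[offset:offset + bin_length]
        if set(chunk.upper()) == {"N"}:
            continue  # counts on all-N bins are not meaningful
        bins.append((offset + 1, offset + len(chunk)))
    return bins


# Synthetic contig (an assumption for illustration): 10,000 Ns, then 3,000 non-N bases.
toy_contig = "N" * 10_000 + "ACGT" * 750
bins = make_bins(toy_contig)
print(bins)
```

With this toy contig, the first retained bin starts at base 10,001, mirroring the GRCh38 chr1 example.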

## For WGS data, bin entirety of reference, e.g. with 1000 base intervals.
(We skip this step in this tutorial, but the command-line is shown below for your reference)

In [None]:
#  ! gatk PreprocessIntervals \
#         -R $REFERENCE_BUCKET/Homo_sapiens_assembly38.fasta \
#         --padding 0 \
#         -imr OVERLAPPING_ONLY \
#         -O $WORKSPACE_LOCAL/sandbox/grch38.preprocessed.interval_list

This produces a Picard-style intervals list of 1000 base bins.

## For exome data, pad target regions, e.g. with 250 bases.
(We skip this step in this tutorial since the data is WGS, but the command-line is shown below for your reference)

In [None]:
# ! gatk PreprocessIntervals \
#         -R $REFERENCE_BUCKET/Homo_sapiens_assembly38.fasta \
#         -L $TUTORIAL_DATA_PATH/targets.interval_list \
#         --bin-length 0 \
#         -imr OVERLAPPING_ONLY \
#         -O $WORKSPACE_LOCAL/sandbox/targets.preprocessed.interval_list

## For the tutorial, we only bin two contigs
Let's subset the reference to just chr20 and chrY. Coverage data exists for chr20, chrX and chrY in this tutorial, so changing the following arguments to look at all three contigs is left as an exercise.

In [None]:
! gatk PreprocessIntervals \
        -R $REFERENCE_BUCKET/Homo_sapiens_assembly38.fasta \
        --bin-length 1000 \
        --padding 0 \
        -L chr20 -L chrY \
        -imr OVERLAPPING_ONLY \
        -O $WORKSPACE_LOCAL/sandbox/chr20Y.interval_list

In [None]:
# let us take a look
! grep -v '^@' $WORKSPACE_LOCAL/sandbox/chr20Y.interval_list | head -n 10

### Comments on select parameters of _PreprocessIntervals_

- For WGS, the default 1000 `--bin-length` is the recommended starting point for typical 30x data. Be sure to set `--padding 0` to disable padding outside of given genomic regions. Bin size should correlate with depth of coverage, e.g. lower coverage data should use larger bin size while higher coverage data can support smaller bin size. The size of the bin defines the resolution of CNV calls. The factors to consider in sizing include how noisy the data is, average coverage depth and how even coverage is across the reference.

- For targeted exomes, provide the exome capture kit's target intervals with `-L`, set `--bin-length 0` to disable binning and pad the intervals with `--padding 250` or other desired length.

- Provide intervals to exclude from analysis with `--exclude-intervals` or `-XL`, e.g. centromeric regions. Consider using this option especially if data is aligned to a reference other than GRCh38. The workflow also allows excluding regions at later steps with `-XL`. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the _FilterIntervals_ step (section 3), the _DetermineGermlineContigPloidy_ step (section 4), at the _GermlineCNVCaller_ step (section 5) and/or post-calling.
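
To make the relationship between bin size and depth of coverage concrete, here is a back-of-the-envelope sketch. It assumes uniform coverage and 150 bp reads (both assumptions, not values from this tutorial) and approximates the number of reads that start in a bin; `expected_reads_per_bin` is a hypothetical helper, not part of GATK.

```python
def expected_reads_per_bin(mean_coverage, bin_length, read_length):
    """Approximate number of reads starting in a bin under uniform coverage."""
    return mean_coverage * bin_length / read_length


# Read length of 150 bp is an assumption for illustration only.
for coverage, bin_length in [(30, 1000), (10, 1000), (10, 3000)]:
    reads = expected_reads_per_bin(coverage, bin_length, read_length=150)
    print(f"{coverage}x coverage, {bin_length} bp bins: ~{reads:.0f} reads per bin")
```

Under these assumptions, 10x data with 3000 base bins yields roughly the same number of reads per bin as 30x data with 1000 base bins, at the cost of coarser CNV resolution.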

## Count reads per bin using _CollectReadCounts_
_CollectReadCounts_ tabulates the raw integer counts of reads overlapping an interval. The tutorial has already collected read counts ahead of time for the three contigs: chr20, chrX and chrY. Here, we collect read counts on a small subset of the data.

In [None]:
! gatk CollectReadCounts \
        -L $TUTORIAL_DATA_PATH/chr20sub.interval_list \
        -R $REFERENCE_BUCKET/Homo_sapiens_assembly38.fasta \
        -imr OVERLAPPING_ONLY \
        -I $TUTORIAL_DATA_PATH/NA19017.chr20sub.bam \
        --format TSV \
        -O $WORKSPACE_LOCAL/sandbox/NA19017.tsv 

This generates a TSV format table of read counts. Let's take a look. After the SAM format header section, denoted by lines starting with `@`, the body of the data has a column header line followed by read counts for every interval:

In [None]:
! grep -v '^@' $WORKSPACE_LOCAL/sandbox/NA19017.tsv | head -n 10

### Comments on select parameters of _CollectReadCounts_

- The tutorial generates text-based TSV (tab-separated-value) format data instead of the default HDF5 format by adding `--format TSV` to the command. Omit this option to generate the default HDF5 format. Downstream tools process HDF5 format more efficiently.

- Here and elsewhere in the workflow, set `--interval-merging-rule` (`-imr`) to `OVERLAPPING_ONLY`, to prevent the tool from merging abutting intervals.

- The tool employs a number of engine-level read filters. Of note are _NotDuplicateReadFilter_ and _MappingQualityReadFilter_. This means the tool excludes reads marked as duplicate and excludes reads with mapping quality less than `10`. Change the mapping quality threshold with the `--minimum-mapping-quality` option.
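
For downstream inspection in Python, the TSV layout described in this section (SAM-style header lines starting with `@`, then a column header line, then one row per interval) can be read with the standard library. The sketch below is a minimal example on made-up content; the column names `CONTIG`, `START`, `END` and `COUNT` and all values are assumptions for illustration, so check them against your own file.

```python
import csv
import io

# Made-up example content; real files come from CollectReadCounts --format TSV.
toy_counts_tsv = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr20\tLN:64444167\n"
    "@RG\tID:GATKCopyNumber\tSM:TOY_SAMPLE\n"
    "CONTIG\tSTART\tEND\tCOUNT\n"
    "chr20\t60001\t61000\t229\n"
    "chr20\t61001\t62000\t198\n"
)


def read_counts(handle):
    """Return a list of (contig, start, end, count), skipping '@' header lines."""
    body = (line for line in handle if not line.startswith("@"))
    reader = csv.DictReader(body, delimiter="\t")
    return [
        (row["CONTIG"], int(row["START"]), int(row["END"]), int(row["COUNT"]))
        for row in reader
    ]


counts = read_counts(io.StringIO(toy_counts_tsv))
print(counts)
```

To read a real file, pass an open file handle, e.g. `open(path)`, instead of the `io.StringIO` object.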

# Annotate intervals with features using _AnnotateIntervals_ and subset regions of interest using _FilterIntervals_

The steps in this section pertain to the **cohort mode**.

Researchers may desire to subset the intervals that _GermlineCNVCaller_ will analyze, either to exclude potentially problematic regions or to retain only regions of interest. For example one may wish to exclude regions where all samples in a large cohort have copy number zero, or regions around segmental duplications (low-copy repeats) that often harbor common CNVs (e.g. in case the researcher is primarily interested in _de novo_ CNVs). Filtering intervals can be especially impactful for analyses that utilize references other than GRCh38 or that are based on sequencing technologies affected by sequence context, e.g. targeted exomes. The tutorial data is WGS data aligned to GRCh38, and the gCNV workflow can process the entirety of the data, without the need for any interval filtering.

Towards deciding which regions to exclude, _AnnotateIntervals_ labels the given intervals with GC content and additionally with mappability and segmental duplication content if given the respective optional resource files. _FilterIntervals_ then subsets the intervals list based on the annotations and other tunable thresholds. Later, _GermlineCNVCaller_ also takes in the annotated intervals to use as covariates towards analysis.

Explicit GC-correction, although optional, is recommended. The default v4.1.0.0 `cnv_germline_cohort_workflow.wdl` pipeline workflow omits explicit GC-correction; we activate it in the pipeline by setting `"do_explicit_gc_correction":"True"`. The tutorial illustrates the optional _AnnotateIntervals_ step by performing the recommended explicit GC-content-based filtering.

## _AnnotateIntervals_ with GC content

In [None]:
! gatk AnnotateIntervals \
        -L $WORKSPACE_LOCAL/sandbox/chr20Y.interval_list \
        -R $REFERENCE_BUCKET/Homo_sapiens_assembly38.fasta \
        -imr OVERLAPPING_ONLY \
        -O $WORKSPACE_LOCAL/sandbox/chr20Y.annotated.tsv

This produces a four-column table where the fourth column gives the fraction of GC content. Let's take a look:

In [None]:
! grep -v '^@' $WORKSPACE_LOCAL/sandbox/chr20Y.annotated.tsv | head -n 10

### Comments on select parameters of _AnnotateIntervals_

- The tool requires the `-R` reference and the `-L` intervals. The tool calculates GC-content for the intervals using the reference. Although optional for the tool, we recommend annotating mappability by providing a `--mappability-track` regions file in either .bed or .bed.gz format. Be sure to merge any overlapping intervals beforehand. The tutorial omits use of this resource.

- GATK recommends use of the single-read mappability track, as the multi-read track requires much longer times to process. For example, the Hoffman lab at the University of Toronto provides human and mouse mappability BED files for various kmer lengths at https://bismap.hoffmanlab.org. The accompanying publication is titled [Umap and Bismap: quantifying genome and methylome mappability](https://doi.org/10.1093/nar/gky677).

- Optionally and additionally, annotate segmental duplication content by providing a `--segmental-duplication-track` regions file in either .bed or .bed.gz format. 

- Exclude undesirable intervals with the `-XL` parameter, e.g. intervals corresponding to centromeric regions.
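
The GC-content annotation is conceptually simple. The sketch below computes the GC fraction of a short sequence; `gc_fraction` is a hypothetical helper, and dividing by the full sequence length (including any N bases) is an assumption rather than a statement about how _AnnotateIntervals_ treats ambiguous bases.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases among all bases in the sequence."""
    sequence = sequence.upper()
    if not sequence:
        return float("nan")
    gc_bases = sequence.count("G") + sequence.count("C")
    return gc_bases / len(sequence)


# Ten bases, six of which are G or C.
print(gc_fraction("ACGTACGTGG"))
```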

## _FilterIntervals_ based on GC-content and cohort extreme counts

_FilterIntervals_ takes preprocessed intervals and either annotated intervals or read counts or both. It can also exclude intervals given with `-XL`. When given both types of data, the tool retains the intervals that intersect from filtering on each data type. The v4.1.0.0 `cnv_germline_cohort_workflow.wdl` pipeline script requires read counts files, and so by default the pipeline script always performs the _FilterIntervals_ step on read counts.

In [None]:
! gatk FilterIntervals \
        -L $WORKSPACE_LOCAL/sandbox/chr20Y.interval_list \
        --annotated-intervals $WORKSPACE_LOCAL/sandbox/chr20Y.annotated.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG00096.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG00268.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG00419.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG00759.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG01051.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG01112.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG01500.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG01565.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG01583.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG01595.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG01879.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG02568.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG02922.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG03006.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG03052.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG03642.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/HG03742.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18525.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18939.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19017.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19625.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19648.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20502.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20845.tsv \
        -imr OVERLAPPING_ONLY \
        -O $WORKSPACE_LOCAL/sandbox/chr20Y.cohort.gc.filtered.interval_list

In [None]:
! grep -v '^@' $WORKSPACE_LOCAL/sandbox/chr20Y.interval_list | wc -l
! grep -v '^@' $WORKSPACE_LOCAL/sandbox/chr20Y.cohort.gc.filtered.interval_list | wc -l

This produces a Picard-style intervals list containing a subset of the starting intervals (80,534 of 87,641).

### Comments on select parameters of _FilterIntervals_

- The tool requires the preprocessed intervals, provided with `-L`, from the previous section. Given annotated intervals with `--annotated-intervals`, the tool filters intervals on the given annotation(s).

- GC-content thresholds are set by `--minimum-gc-content` and `--maximum-gc-content`, where defaults are `0.1` and `0.9`, respectively.

- Mappability thresholds are set by `--minimum-mappability` and `--maximum-mappability`. Defaults are `0.9` and `1.0`, respectively.

- Segmental duplication content thresholds are set by `--minimum-segmental-duplication-content` and `--maximum-segmental-duplication-content`. Defaults are `0.0` and `0.5`, respectively.

- Given read counts files, each with `-I` and in either HDF5 or TSV format, the tool filters intervals on low and extreme read counts with the following tunable thresholds.
    - `--low-count-filter-count-threshold` default is `5`
    - `--low-count-filter-percentage-of-samples` default is `90.0`
    - `--extreme-count-filter-minimum-percentile` default is `1.0`
    - `--extreme-count-filter-maximum-percentile` default is `99.0`
    - `--extreme-count-filter-percentage-of-samples` default is `90.0`    
The read counts data must match each other in intervals. For the default parameters, the tool first filters intervals with a count less than `5` in greater than `90%` of the samples. The tool then filters the remaining intervals with a count percentile less than `1` or greater than `99` in a percentage of samples greater than `90%`. These parameters effectively exclude intervals where all samples have extreme outlier counts, e.g. are deleted.

- To disable counts based filtering, omit the read counts or, e.g. when using the v4.1.0.0 _cnv_germline_cohort_workflow.wdl_ pipeline script, set the two percentage-of-samples parameters as follows.
    - `--low-count-filter-percentage-of-samples 100`
    - `--extreme-count-filter-percentage-of-samples 100`    

- Provide intervals to exclude from analysis with `--exclude-intervals` or `-XL`, e.g. centromeric regions. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the _FilterIntervals_ step (section 3), the _DetermineGermlineContigPloidy_ step (section 4), at the _GermlineCNVCaller_ step (section 5) and/or post-calling.
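
The threshold logic described in this section can be sketched in plain Python. The example below applies only the GC-content filter and the low-count filter with the default values listed above; it omits the extreme-count percentile filter, treats the GC bounds as inclusive (an assumption), and uses made-up intervals and counts. The helper names are hypothetical and this is not the tool's implementation.

```python
def passes_gc_filter(gc_content, minimum=0.1, maximum=0.9):
    """Keep intervals whose GC fraction lies within [minimum, maximum]."""
    return minimum <= gc_content <= maximum


def passes_low_count_filter(counts, count_threshold=5, percentage_of_samples=90.0):
    """Drop an interval if more than the given percentage of samples have a low count."""
    n_low = sum(1 for count in counts if count < count_threshold)
    return 100.0 * n_low / len(counts) <= percentage_of_samples


# Made-up intervals: GC fraction and one read count per sample (10 samples).
intervals = {
    "bin_a": {"gc": 0.45, "counts": [210, 190, 205, 0, 220, 198, 201, 215, 207, 199]},
    "bin_b": {"gc": 0.95, "counts": [180, 175, 190, 185, 170, 200, 195, 188, 179, 182]},
    "bin_c": {"gc": 0.50, "counts": [0, 1, 0, 2, 0, 0, 3, 0, 1, 0]},
}

kept = [
    name
    for name, data in intervals.items()
    if passes_gc_filter(data["gc"]) and passes_low_count_filter(data["counts"])
]
print(kept)
```

Here `bin_b` fails the GC filter and `bin_c` fails the low-count filter because every sample has fewer than 5 reads, so only `bin_a` remains.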

# Call autosomal and allosomal contig ploidy with _DetermineGermlineContigPloidy_

_DetermineGermlineContigPloidy_ calls contig level ploidies (i.e. copy number) for both autosomal, e.g. human chr20, and allosomal contigs, e.g. human chrX. The tool determines baseline contig ploidies using sample coverages and contig ploidy priors that give the prior probabilities for each ploidy state for each contig. In this process, the tool generates global baseline coverage and noise data _GermlineCNVCaller_ will use in section 5.

The tool determines baseline contig ploidies using the total read count per contig. Researchers should consider the impact of this for their data. For example, for the tutorial WGS data, the contribution of the PAR regions to total coverage counts on chrX is small and the tool correctly calls allosomal ploidies. However, consider blacklisting PAR regions for data where the contribution is disproportionate, e.g. targeted panels.

## _DetermineGermlineContigPloidy_ in cohort mode

The cohort mode requires a `--contig-ploidy-priors` table and produces a ploidy model. The ploidy prior table specifies the prior probability of each ploidy state for each contig and is formatted as follows:

In [None]:
! cat $TUTORIAL_DATA_PATH/chr20XY_contig_ploidy_priors.tsv

According to this table, chr20 has a 98% prior probability of being diploid (PLOIDY_PRIOR_2), a 0% prior probability of homozygous deletion (PLOIDY_PRIOR_0; would you be alive without chr20?), and a small prior probability of 1% each for heterozygous deletion (PLOIDY_PRIOR_1) and trisomy (PLOIDY_PRIOR_3).
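
A quick sanity check on a priors table is that the probabilities for each contig sum to 1. The sketch below parses a one-row toy table that uses the chr20 values quoted above; the `CONTIG_NAME` column header is an assumption about the file layout, and `load_priors` is a hypothetical helper.

```python
import csv
import io
import math

# Toy table with the chr20 values quoted in the text; the header row is an assumption.
toy_priors_tsv = (
    "CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3\n"
    "chr20\t0\t0.01\t0.98\t0.01\n"
)


def load_priors(handle):
    """Return {contig: {ploidy: prior}} and check that each row sums to 1."""
    priors = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        contig = row.pop("CONTIG_NAME")
        # Column names look like PLOIDY_PRIOR_<n>; keep <n> as the integer ploidy.
        probs = {int(name.rsplit("_", 1)[1]): float(value) for name, value in row.items()}
        if not math.isclose(sum(probs.values()), 1.0, abs_tol=1e-9):
            raise ValueError(f"Priors for {contig} do not sum to 1")
        priors[contig] = probs
    return priors


priors = load_priors(io.StringIO(toy_priors_tsv))
most_likely = max(priors["chr20"], key=priors["chr20"].get)
print(priors["chr20"], "most likely ploidy a priori:", most_likely)
```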

  

In [None]:
# --max-training-epochs 20 --num-thermal-advi-iters 1000 --max-advi-iter-subsequent-epochs 500 args only used 
# for tutorial to decrease training time. Not recommended for use with full size data.

! gatk DetermineGermlineContigPloidy \
        -L $WORKSPACE_LOCAL/sandbox/chr20Y.cohort.gc.filtered.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        -I $TUTORIAL_DATA_PATH/cvg/NA18525.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18939.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19017.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19625.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19648.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20502.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20845.tsv \
        --contig-ploidy-priors $TUTORIAL_DATA_PATH/chr20XY_contig_ploidy_priors.tsv \
        --max-training-epochs 20 --num-thermal-advi-iters 1000 --max-advi-iter-subsequent-epochs 500 \
        --output $WORKSPACE_LOCAL/sandbox \
        --output-prefix ploidy-cohort7 \
        --verbosity DEBUG

This produces a `<output-prefix>-calls` directory and a `<output-prefix>-model` directory.

The `ploidy-cohort7-calls` directory contains a folder of data for each sample in the cohort, including the contig ploidy calls. Each sample directory, e.g. `ploidy-cohort7-calls/SAMPLE_0`, contains five files:

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-calls

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-calls/SAMPLE_0

In [None]:
# let's look at the ploidy calls for NA19017
! cat $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-calls/SAMPLE_2/sample_name.txt
! cat $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-calls/SAMPLE_2/contig_ploidy.tsv

Description of output files:
- `contig_ploidy.tsv` notes the ploidy and genotype quality (GQ) of the ploidy call for each contig.
- `global_read_depth.tsv` notes an average depth value and an average ploidy across all the intervals of the sample.
- `mu_psi_s_log__.tsv` captures the posterior mean for all of the modeled parameters.
- `sample_name.txt` contains the readgroup sample (RG SM) name.
- `std_psi_s_log__.tsv` captures the standard deviation for all of the modeled parameters.
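
If you want to work with the ploidy calls programmatically, a file like `contig_ploidy.tsv` can be parsed in a few lines. The sketch below runs on made-up content: the column names `CONTIG`, `PLOIDY` and `PLOIDY_GQ`, the header line and all values are assumptions for illustration, so compare them with the output of the `cat` command above before reusing the code.

```python
import csv
import io

# Hypothetical file content; column names and values are illustrative assumptions.
toy_contig_ploidy_tsv = (
    "@RG\tID:GATKCopyNumber\tSM:TOY_SAMPLE\n"
    "CONTIG\tPLOIDY\tPLOIDY_GQ\n"
    "chr20\t2\t120.5\n"
    "chrY\t1\t95.0\n"
)


def read_ploidy_calls(handle):
    """Return {contig: (ploidy, gq)}, skipping header lines that start with '@'."""
    body = (line for line in handle if not line.startswith("@"))
    reader = csv.DictReader(body, delimiter="\t")
    return {
        row["CONTIG"]: (int(row["PLOIDY"]), float(row["PLOIDY_GQ"]))
        for row in reader
    }


calls = read_ploidy_calls(io.StringIO(toy_contig_ploidy_tsv))
non_diploid = sorted(contig for contig, (ploidy, _) in calls.items() if ploidy != 2)
print(calls)
print("Contigs called as non-diploid:", non_diploid)
```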

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-model

The ploidy-model directory contains aggregated model data for the cohort. This is the model to provide to a case-mode _DetermineGermlineContigPloidy_ analysis and to _GermlineCNVCaller_. The tutorial ploidy-model directory contains the eight files as follows.

- `contig_ploidy_prior.tsv` is a copy of the ploidy priors given to the tool.
- `gcnvkernel_version.json` notes the version of the kernel.
- `interval_list.tsv` recapitulates the intervals used, e.g. the filtered intervals.
- `mu_mean_bias_j_lowerbound__.tsv` is the estimated contig-level capture bias (mean).
- `mu_psi_j_log__.tsv` is the estimated contig-level capture noise (mean).
- `ploidy_config.json` is the configuration of the ploidy model.
- `std_mean_bias_j_lowerbound__.tsv` is the estimated contig-level capture bias (std).
- `std_psi_j_log__.tsv` is the estimated contig-level capture noise (std).   

Note: The PyMC3/Theano model automatically generates mu_ and std_ files and may append transformations it performs to the file name, e.g. log or lowerbound as we see above. These are likely of interest only to advanced users.

In preparation for the next section (i.e. running the tool in case mode), let us run _DetermineGermlineContigPloidy_ one more time, this time leaving the `NA19017` sample out of the cohort:

In [None]:
# --max-training-epochs 20 --num-thermal-advi-iters 1000 --max-advi-iter-subsequent-epochs 500 args only used 
# for tutorial to decrease training time. Not recommended for use with full size data.

! gatk DetermineGermlineContigPloidy \
        -L $WORKSPACE_LOCAL/sandbox/chr20Y.cohort.gc.filtered.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        -I $TUTORIAL_DATA_PATH/cvg/NA18525.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18939.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19625.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19648.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20502.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20845.tsv \
        --contig-ploidy-priors $TUTORIAL_DATA_PATH/chr20XY_contig_ploidy_priors.tsv \
        --max-training-epochs 20 --num-thermal-advi-iters 1000 --max-advi-iter-subsequent-epochs 500 \
        --output $WORKSPACE_LOCAL/sandbox \
        --output-prefix ploidy-cohort6 \
        --verbosity DEBUG

## _DetermineGermlineContigPloidy_ in case mode

The case mode calls contig ploidies for each sample against the ploidy model given by `--model`. The following command runs sample NA19017 against the 6-sample cohort model we built above:

In [None]:
# --max-training-epochs 20 --num-thermal-advi-iters 1000 --max-advi-iter-subsequent-epochs 500 args only used 
# for tutorial to decrease training time. Not recommended for use with full size data.

! gatk DetermineGermlineContigPloidy \
        --model $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-model \
        -I $TUTORIAL_DATA_PATH/cvg/NA19017.tsv \
        -O $WORKSPACE_LOCAL/sandbox \
        --output-prefix ploidy-cohort6-case \
        --max-training-epochs 20 --num-thermal-advi-iters 1000 --max-advi-iter-subsequent-epochs 500 \
        --verbosity DEBUG

This produces a `ploidy-cohort6-case-calls` directory, which in turn contains a directory of sample data, `SAMPLE_0`. It holds the same five files described above for the cohort-mode calls:

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-case-calls/SAMPLE_0

In [None]:
# let's look at the ploidy calls for the case sample (NA19017)
! cat $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-case-calls/SAMPLE_0/sample_name.txt
! cat $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-case-calls/SAMPLE_0/contig_ploidy.tsv

### Comments on select parameters of _DetermineGermlineContigPloidy_

- It is possible to analyze multiple samples simultaneously in a case mode command. Provide each sample with `-I`.
- For the `-L` intervals, supply the most processed intervals list. For the tutorial, this is the filtered intervals. Note the case mode does not require explicit intervals because the ploidy model provides them.
- Provide a `--contig-ploidy-priors` table containing the per-contig prior probabilities for each integer ploidy state. Again, the case mode does not require an explicit priors file as the ploidy model provides them. The tool index describes this resource in detail.
- Optionally provide intervals to exclude from analysis with `--exclude-intervals` or `-XL`, e.g. [pseudoautosomal (PAR) regions](https://gatk.zendesk.com/hc/en-us/articles/360035891071), which can skew the results on sex chromosomes.

# Call copy-number variants with _GermlineCNVCaller_

_GermlineCNVCaller_ learns a denoising model per scattered shard while consistently calling CNVs across the shards. The tool models systematic biases and CNVs simultaneously, which allows for sensitive detection of both rare and common CNVs. As the tool index states under Important Remarks (v4.1.0.0), the tool should see data from a large enough genomic region so as to be exposed to diverse genomic features. The current recommendation is to provide at least ~10–50Mbp genomic coverage per scatter. This applies to exomes or WGS. This allows reliable inference of bias factors including GC bias. The limitation of analyzing larger regions is available memory. As an analysis covers more data, memory requirements increase.

For expediency, the tutorial commands below analyze small data, specifically the 1400 bins in `twelveregions.cohort.gc.filtered.interval_list` and use default parameters. The tutorial splits the 1400 bins into two shards with 700 bins each to illustrate scattering. This results in ~0.7Mbp genomic coverage per shard. See section 5.2.3 for how to split interval lists by a given number of intervals. Default inference parameters are conservatively set for efficient run times.
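
The shard arithmetic above is easy to reproduce. The sketch below splits 1400 synthetic 1000-base bins into shards of 700 and reports the genomic coverage of each shard. The contiguous bins and the helper functions are hypothetical stand-ins; the tutorial's actual twelve regions are not contiguous, and the provided shard files were prepared separately.

```python
def split_into_shards(intervals, intervals_per_shard):
    """Split a list of intervals into consecutive shards of at most the given size."""
    return [
        intervals[i:i + intervals_per_shard]
        for i in range(0, len(intervals), intervals_per_shard)
    ]


def shard_coverage_mbp(shard):
    """Total number of bases covered by a shard, in Mbp (1-based inclusive intervals)."""
    return sum(end - start + 1 for _, start, end in shard) / 1e6


# 1400 synthetic, contiguous 1000-base bins (an assumption for illustration).
toy_bins = [("chr20", 1000 * i + 1, 1000 * (i + 1)) for i in range(1400)]
shards = split_into_shards(toy_bins, 700)

for index, shard in enumerate(shards, start=1):
    print(f"shard {index}: {len(shard)} bins, {shard_coverage_mbp(shard):.1f} Mbp")
```

By the same arithmetic, the recommended ~10–50 Mbp per shard corresponds to roughly 10,000–50,000 bins of 1000 bases each.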

## _GermlineCNVCaller_ in cohort mode

This cell produces per-interval gCNV calls for each of the cohort samples and a gCNV model for the cohort. Each command produces three directories within `gcnv-cohort7-twelve`:
- a `gcnv-cohort7-twelve-1of2-calls` folder of per sample gCNV call results,
- a `gcnv-cohort7-twelve-1of2-model` folder of cohort model data,
- a `gcnv-cohort7-twelve-1of2-tracking` folder of data that tracks model fitting.

Note: Each shard should finish under 10 minutes in this Terra VM (4 vCPUs).  When running the default parameters of the v4.1.0.0 WDL cohort-mode workflow on the cloud, the majority of the shard analyses complete in half an hour.

**First shard (7 samples)**

In [None]:
! gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L $TUTORIAL_DATA_PATH/scatter-sm/twelve_1of2.interval_list \
        -I $TUTORIAL_DATA_PATH/cvg/NA18525.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18939.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19017.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19625.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19648.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20502.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20845.tsv \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-calls \
        --annotated-intervals $TUTORIAL_DATA_PATH/twelveregions.annotated.tsv \
        --interval-merging-rule OVERLAPPING_ONLY \
        --output $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve \
        --output-prefix gcnv-cohort7-twelve-1of2 \
        --verbosity DEBUG

**Second shard (7 samples)**

In [None]:
! gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L $TUTORIAL_DATA_PATH/scatter-sm/twelve_2of2.interval_list \
        -I $TUTORIAL_DATA_PATH/cvg/NA18525.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18939.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19017.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19625.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19648.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20502.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20845.tsv \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-calls \
        --annotated-intervals $TUTORIAL_DATA_PATH/twelveregions.annotated.tsv \
        --interval-merging-rule OVERLAPPING_ONLY \
        --output $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve \
        --output-prefix gcnv-cohort7-twelve-2of2 \
        --verbosity DEBUG

In [None]:
# let's get acquainted with the output
! ls $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-1of2-model

**(advanced)** The model directory contains aggregated model data for the cohort and the analyzed interval shard. This is the model to provide to a case-mode _GermlineCNVCaller_. The model files are as follows:

- `calling_config.json` is the configuration of the copy-number caller sub-model (the two-level HMM)
- `denoising_config.json` is the configuration of the read-count noise model
- `gcnvkernel_version.json` notes the version of the gcnvkernel
- `interval_list.tsv` recapitulates the intervals used in this shard
- `log_q_tau_tk.tsv` is the posterior probability of region class (rare vs. common) in log-scale
- `mu_ard_u_log__.tsv` is the ARD coefficient of the bias factors in log-scale (mean)
- `mu_log_mean_bias_t.tsv` is the mean read-count capture bias of each interval (mean)
- `mu_psi_t_log__.tsv` is the read-count unexplained variance (overdispersion) of each interval in log-scale (mean)
- `mu_W_tu.tsv` is the interval x number of bias factors matrix of learned bias factors (mean)
- `std_ard_u_log__.tsv` is the ARD coefficient of the bias factors in log-scale (std)
- `std_log_mean_bias_t.tsv` is the mean read-count capture bias of each interval (std)
- `std_psi_t_log__.tsv` is the read-count unexplained variance (overdispersion) of each interval in log-scale (std)
- `std_W_tu.tsv` is the interval x number of bias factors matrix of learned bias factors (std)   
Note: The PyMC3/Theano model automatically generates `mu_` and `std_` prefixed files and may append transformations it performs to the file name (e.g. `log__`) as we see above. These are likely of interest only to advanced users who may want to inspect the model parameters.

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-1of2-tracking

**(advanced)** Loss function trajectory during model training:

- `warm_up_elbo_history.tsv` is the evidence lower bound (ELBO) history during the initial model warm-up period
- `main_elbo_history.tsv` is the evidence lower bound (ELBO) history during the main model training period

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-1of2-calls

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-1of2-calls/SAMPLE_0

**(advanced)** The calls directory contains various sample-specific estimated quantities, including per-interval estimated copy-number states. These intermediate results are later post-processed by `PostprocessGermlineCNVCalls` to produce VCF files (see the next section). The per-sample call files are as follows:

- `baseline_copy_number_t.tsv` is the per-interval baseline copy-number of the sample (recapitulated from DetermineGermlineContigPloidy)
- `log_c_emission_tc.tsv` is the emission probability of each interval to different copy-number states in log-scale
- `log_q_c_tc.tsv` is the per-interval posterior probability of each copy-number state in log-scale
- `mu_denoised_copy_ratio_t.tsv` is the per-interval denoised copy-ratio (mean)
- `mu_psi_s_log__.tsv` is the global unexplained variance (overdispersion) of this sample in log-scale (mean)
- `mu_read_depth_s_log__.tsv` is the global read-depth of this sample in log-scale (mean)
- `mu_z_sg.tsv` is the loading of each GC bin (related to GC bias correction) (mean)
- `mu_z_su.tsv` is the loading of each bias factor (mean)
- `sample_name.txt` is the name of the sample
- `std_denoised_copy_ratio_t.tsv` is the per-interval denoised copy-ratio (std)
- `std_psi_s_log__.tsv` is the global unexplained variance (overdispersion) of this sample in log-scale (std)
- `std_read_depth_s_log__.tsv` is the global read-depth of this sample in log-scale (std)
- `std_z_sg.tsv` is the loading of each GC bin (related to GC bias correction) (std)
- `std_z_su.tsv` is the loading of each bias factor (std)  
Note: The PyMC3/Theano model automatically generates `mu_` and `std_` prefixed files and may append transformations it performs to the file name (e.g. `log__`) as we see above. These are likely of interest only to advanced users who may want to inspect the model parameters.
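
To illustrate how the log-scale posteriors in `log_q_c_tc.tsv` relate to a copy-number call, here is a minimal sketch that picks the most probable copy-number state for one interval. It assumes natural-log values ordered from CN0 upward; both are assumptions, not something this tutorial states. The function name and the example row are hypothetical, and file parsing is left out because the exact file layout is not described here.

```python
import math


def call_copy_number(log_posteriors):
    """Return (copy number, posterior probability) for one interval.

    Assumes natural-log posteriors ordered by copy-number state, CN0 first.
    """
    best_cn = max(range(len(log_posteriors)), key=lambda cn: log_posteriors[cn])
    return best_cn, math.exp(log_posteriors[best_cn])


# Made-up example row for a single interval with states CN0..CN5.
example_row = [math.log(p) for p in (0.01, 0.04, 0.90, 0.03, 0.01, 0.01)]
copy_number, probability = call_copy_number(example_row)
print(copy_number, round(probability, 2))
```

In practice, `PostprocessGermlineCNVCalls` performs this kind of summarization along with segmentation, so a sketch like this is mainly useful for spot-checking individual intervals.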

To test _GermlineCNVCaller_ in case mode, let us assume we didn't have sample `NA19017` in the initial cohort and let us build a cohort of 6 samples. We will then run _GermlineCNVCaller_ in case mode using the model learned from the 6-sample cohort.

**First shard (6 samples)** 

In [None]:
! gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L $TUTORIAL_DATA_PATH/scatter-sm/twelve_1of2.interval_list \
        -I $TUTORIAL_DATA_PATH/cvg/NA18525.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18939.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19625.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19648.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20502.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20845.tsv \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-calls \
        --annotated-intervals $TUTORIAL_DATA_PATH/twelveregions.annotated.tsv \
        --interval-merging-rule OVERLAPPING_ONLY \
        --output $WORKSPACE_LOCAL/sandbox/gcnv-cohort6-twelve \
        --output-prefix gcnv-cohort6-twelve-1of2 \
        --verbosity DEBUG

**Second shard (6 samples)** 

In [None]:
! gatk GermlineCNVCaller \
        --run-mode COHORT \
        -L $TUTORIAL_DATA_PATH/scatter-sm/twelve_2of2.interval_list \
        -I $TUTORIAL_DATA_PATH/cvg/NA18525.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA18939.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19625.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA19648.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20502.tsv \
        -I $TUTORIAL_DATA_PATH/cvg/NA20845.tsv \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-calls \
        --annotated-intervals $TUTORIAL_DATA_PATH/twelveregions.annotated.tsv \
        --interval-merging-rule OVERLAPPING_ONLY \
        --output $WORKSPACE_LOCAL/sandbox/gcnv-cohort6-twelve \
        --output-prefix gcnv-cohort6-twelve-2of2 \
        --verbosity DEBUG

## _GermlineCNVCaller_ in case mode

Call gCNVs on a sample against a cohort model. The case analysis must use the same scatter approach as the model generation. So, as above, we run two shard analyses. Here, `--model` and `--output-prefix` differ between the scatter commands.
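
Because only `--model` and `--output-prefix` change between shards, the per-shard commands can be generated in a loop. The sketch below builds the argument lists and prints them without running anything. The helper name and the `/path/to/...` placeholders are hypothetical; the flags are the ones used in the cells that follow.

```python
import shlex


def case_shard_command(shard, n_shards, counts_tsv, ploidy_calls,
                       model_dir, model_prefix, output_dir, output_prefix):
    """Build the GermlineCNVCaller case-mode argument list for one shard."""
    tag = f"{shard}of{n_shards}"  # e.g. "1of2", matching the tutorial's naming
    return [
        "gatk", "GermlineCNVCaller",
        "--run-mode", "CASE",
        "-I", counts_tsv,
        "--contig-ploidy-calls", ploidy_calls,
        "--model", f"{model_dir}/{model_prefix}-{tag}-model",
        "--output", output_dir,
        "--output-prefix", f"{output_prefix}-{tag}",
        "--verbosity", "DEBUG",
    ]


# Placeholder locations (hypothetical); substitute the notebook's own paths.
sandbox = "/path/to/sandbox"
for shard in (1, 2):
    command = case_shard_command(
        shard, 2,
        counts_tsv="/path/to/cvg/NA19017.tsv",
        ploidy_calls=f"{sandbox}/ploidy-cohort6-case-calls",
        model_dir=f"{sandbox}/gcnv-cohort6-twelve",
        model_prefix="gcnv-cohort6-twelve",
        output_dir=f"{sandbox}/gcnv-case-twelve-vs-cohort6",
        output_prefix="gcnv-case-twelve-vs-cohort6",
    )
    print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run` as is, which avoids shell quoting problems.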

**First shard (case-calling sample `NA19017` against the model built from the 6-sample cohort)** 

In [None]:
! gatk GermlineCNVCaller \
        --run-mode CASE \
        -I $TUTORIAL_DATA_PATH/cvg/NA19017.tsv \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-case-calls \
        --model $WORKSPACE_LOCAL/sandbox/gcnv-cohort6-twelve/gcnv-cohort6-twelve-1of2-model \
        --output $WORKSPACE_LOCAL/sandbox/gcnv-case-twelve-vs-cohort6 \
        --output-prefix gcnv-case-twelve-vs-cohort6-1of2 \
        --verbosity DEBUG

**Second shard (case-calling sample `NA19017` against the model built from the 6-sample cohort)** 

In [None]:
! gatk GermlineCNVCaller \
        --run-mode CASE \
        -I $TUTORIAL_DATA_PATH/cvg/NA19017.tsv \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-case-calls \
        --model $WORKSPACE_LOCAL/sandbox/gcnv-cohort6-twelve/gcnv-cohort6-twelve-2of2-model \
        --output $WORKSPACE_LOCAL/sandbox/gcnv-case-twelve-vs-cohort6 \
        --output-prefix gcnv-case-twelve-vs-cohort6-2of2 \
        --verbosity DEBUG

In [None]:
# let's get acquainted with the output
! ls $WORKSPACE_LOCAL/sandbox/gcnv-case-twelve-vs-cohort6

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/gcnv-case-twelve-vs-cohort6/gcnv-case-twelve-vs-cohort6-1of2-calls

In [None]:
! ls $WORKSPACE_LOCAL/sandbox/gcnv-case-twelve-vs-cohort6/gcnv-case-twelve-vs-cohort6-1of2-calls/SAMPLE_0

The call output in the case mode is similar to the cohort mode.

At this point, the workflow has done most of its heavy lifting to produce data towards copy number calling. In Section 6, we consolidate the data from the scattered GermlineCNVCaller runs, perform segmentation and call copy number states.


### Comments on select parameters of _GermlineCNVCaller_

- The `-O` output directory must be extant before running the command. Future releases (v4.1.1.0) will create the directory.
- The default `--max-copy-number` is capped at 5. This means the tool reports any events with more copies as CN5.
- For the cohort mode, optionally provide `--annotated-intervals` to include the annotations as covariates. The annotated intervals must contain all of the `-L` intervals, i.e. the `-L` intervals must exactly match or be a subset of the annotated intervals.
- For the case mode, the tool accepts only a single `--model` directory at a time. So the case must be analyzed with the same number of scatters as the cohort model run. The case mode takes fewer parameters than the cohort mode because the `--model` directory provides the seemingly missing requirements, i.e. the scatter intervals and the annotated intervals.
- For both modes, provide the `--contig-ploidy-calls` results from _DetermineGermlineContigPloidy_ (Section 4).
- `--verbosity DEBUG` allows tracking the Python `gcnvkernel` model fitting in the stdout, e.g. with information on denoising epochs and whether the model converged. The default INFO level verbosity is the next most verbose and emits only GATK Engine level messages.

### How do I increase the sensitivity of CNV detection?

The tutorial uses default _GermlineCNVCaller_ modeling parameters. However, researchers should expect to tune parameters for data, e.g. from different sequencing technologies. For tuning, first consider the coherence length parameters, `p-alt`, `p-active` and the `psi-scale` parameters. These hyperparameters are just a few of the plethora of adjustable parameters _GermlineCNVCaller_ offers. Refer to the GermlineCNVCaller tool documentation for detailed explanations, and ask on the GATK Forum for further guidance.

One set of parameter changes for WGS data that dramatically increases the sensitivity of calling on the tutorial data:
```
    --class-coherence-length 1000.0 \
    --cnv-coherence-length 1000.0 \
    --enable-bias-factors false \
    --interval-psi-scale 1.0E-6 \
    --log-mean-bias-standard-deviation 0.01 \
    --sample-psi-scale 1.0E-6 \
```

Article #11687 and Notebook #11686 compare the results of using default vs. the increased-sensitivity parameters. Given the absence of off-the-shelf filtering solutions for CNV calls, when tuning parameters to increase sensitivity, researchers should expect to perform additional due diligence, especially for analyses requiring high precision calls.

**Comments on select sensitivity parameters**

- Decreasing `--class-coherence-length` from its default of `10,000bp` to `1000bp` decreases the expected length of contiguous segments of genomic region classes (silent vs. active). Factor in bin size when tuning.
- Decreasing `--cnv-coherence-length` from its default of `10,000bp` to `1000bp` decreases the expected length of per-sample CNV events. Factor in bin size when tuning.
- Turning off `--enable-bias-factors` from the default true state to false turns off active discovery of learnable bias factors. This should always be on for targeted exome data and in general can be turned off for WGS data.
- Decreasing `--interval-psi-scale` from its default of `0.001` to `1.0E-6` reduces the scale the tool considers normal in per-interval noise.
- Decreasing `--log-mean-bias-standard-deviation` from its default of `0.1` to `0.01` reduces what is considered normal noise in capture bias.
- Decreasing `--sample-psi-scale` from its default of `0.0001` to `1.0E-6` reduces the scale that is considered normal in sample-specific global read-count overdispersion.
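
One simple way to factor in bin size is to express a coherence length as a number of bins. The sketch below does this for the default and the increased-sensitivity values, assuming the tutorial's equally sized 1Kbp bins. The function name is hypothetical, and the conversion is a rough rule of thumb rather than a description of how the tool uses the parameter internally.

```python
def coherence_length_in_bins(coherence_length_bp: float, bin_size_bp: float) -> float:
    """Express a coherence length (bp) as a number of equally sized bins."""
    return coherence_length_bp / bin_size_bp


# Default (10,000 bp) and increased-sensitivity (1000 bp) values with 1 Kbp bins.
for length_bp in (10_000.0, 1_000.0):
    n_bins = coherence_length_in_bins(length_bp, bin_size_bp=1_000.0)
    print(f"{length_bp:.0f} bp coherence length spans about {n_bins:.0f} bin(s)")
```

With 1Kbp bins, the default corresponds to roughly ten bins and the reduced value to a single bin, so the reduced setting leaves little smoothing across neighboring bins.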


Additional parameters to consider include `--depth-correction-tau`, `--p-active` and `--p-alt`:

- `--depth-correction-tau` has a default of `10000.0` (10K) and defines the precision of read-depth concordance with the prior estimate.
- `--p-active` has a default of `1e-2` (0.01) and defines the prior probability of designating common CNV genomic regions.
- `--p-alt` has a default of `1e-6` (0.000001) and defines the expected probability of CNV events (in silent genomic regions).

### How do I make interval lists for scattering (genomic sharding)?

This step applies to the cohort mode. It is unnecessary for case mode analyses as the model implies the scatter intervals.

The v4.1.0.0 `cnv_germline_cohort_workflow.wdl` pipeline workflow scatters the _GermlineCNVCaller_ step. Each scattered analysis is on genomic intervals subset from intervals produced either from _PreprocessIntervals_ (section 2) or from _FilterIntervals_ (section 3). The workflow uses Picard _IntervalListTools_ to break up the intervals list into roughly balanced lists:

In [None]:
! mkdir -p $WORKSPACE_LOCAL/sandbox/scatter
! gatk IntervalListTools \
        --INPUT $TUTORIAL_DATA_PATH/chr20sub.cohort.gc.filtered.interval_list \
        --SUBDIVISION_MODE INTERVAL_COUNT \
        --SCATTER_CONTENT 5000 \
        --OUTPUT $WORKSPACE_LOCAL/sandbox/scatter

This produces three interval lists with ~5K intervals each. For the tutorial's 1Kbp bins, this gives ~5Mbp genomic coverage per scatter. Each list is identically named `scattered.interval_list` within its own folder within the scatter directory. _IntervalListTools_ systematically names the intermediate folders, e.g. `temp_0001_of_3`, `temp_0002_of_3` and `temp_0003_of_3`.

### Comments on select parameters of _IntervalListTools_

- The `--SUBDIVISION_MODE INTERVAL_COUNT` mode scatters intervals into similarly sized lists according to the count of intervals regardless of the base count. The tool intelligently breaks up the chr20sub.cohort.gc.filtered.interval_list's ~15K intervals into lists of 5031, 5031 and 5033 intervals. This is preferable to having a fourth interval list with just 95 intervals.
- The tool has another useful feature in the context of the gCNV workflow. To subset `-I` binned intervals, provide the regions of interest with `-SI` (`--SECOND_INPUT`) and use the `--ACTION OVERLAPS` mode to create a new intervals list of the overlapping bins. Adding `--SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_CONTENT 5000` will produce scatter intervals concurrently with the subsetting.
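
The scatter arithmetic above can be approximated before running the tool. This sketch assumes the number of lists is the total interval count divided by `SCATTER_CONTENT`, rounded down, which reproduces the three lists reported here. It is not Picard's actual algorithm, and the function name is hypothetical.

```python
def estimate_scatter(n_intervals: int, scatter_content: int, bin_size_bp: int):
    """Estimate list count, mean intervals per list and mean coverage (Mbp) per list."""
    n_lists = max(1, n_intervals // scatter_content)
    mean_intervals = n_intervals / n_lists
    mean_coverage_mbp = mean_intervals * bin_size_bp / 1_000_000
    return n_lists, mean_intervals, mean_coverage_mbp


# 5031 + 5031 + 5033 intervals in total, scattered with SCATTER_CONTENT 5000 and 1 Kbp bins.
n_lists, mean_intervals, mean_coverage_mbp = estimate_scatter(15095, 5000, 1000)
print(n_lists, round(mean_intervals), round(mean_coverage_mbp, 2))
```

The actual list sizes (5031, 5031 and 5033) differ slightly from the mean because the tool decides where to break the lists.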

# Call copy number segments and consolidate sample results with _PostprocessGermlineCNVCalls_

_PostprocessGermlineCNVCalls_ consolidates the scattered _GermlineCNVCaller_ results, performs segmentation and calls copy number states. The tool generates per-interval and per-segment CNV calls in VCF format and runs on a single sample at a time. It also produces consolidated denoised copy-ratio estimates from all interval shards.

## _PostprocessGermlineCNVCalls_ in cohort mode

Process a single sample from the 7-sample cohort using the sample index. For `NA19017`, the sample index is 2.
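
If the index for a sample is not known, it can be looked up from a calls directory, since each `SAMPLE_<i>` folder holds a `sample_name.txt`. The sketch below builds a name-to-index map. It assumes the numeric folder suffix is the value to pass to `--sample-index`, and the function name is hypothetical.

```python
from pathlib import Path


def sample_indices(calls_dir):
    """Map sample name -> index by reading SAMPLE_<i>/sample_name.txt in a calls directory."""
    mapping = {}
    for sample_dir in Path(calls_dir).glob("SAMPLE_*"):
        name_file = sample_dir / "sample_name.txt"
        if name_file.is_file():
            # Folder names look like SAMPLE_0, SAMPLE_1, ...; keep the numeric suffix.
            index = int(sample_dir.name.split("_", 1)[1])
            mapping[name_file.read_text().strip()] = index
    return mapping


# Example call (path as used in this tutorial):
# sample_indices(f"{WORKSPACE_LOCAL}/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-1of2-calls")
```

The same lookup works for either shard, because the sample order is the same in both.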

In [None]:
! gatk PostprocessGermlineCNVCalls \
        --model-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-1of2-model \
        --model-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-2of2-model \
        --calls-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-1of2-calls \
        --calls-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-cohort7-twelve/gcnv-cohort7-twelve-2of2-calls \
        --allosomal-contig chrX --allosomal-contig chrY \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort7-calls \
        --sample-index 2 \
        --output-genotyped-intervals $WORKSPACE_LOCAL/sandbox/genotyped-intervals-cohort7-twelve-NA19017.vcf.gz \
        --output-genotyped-segments $WORKSPACE_LOCAL/sandbox/genotyped-segments-cohort7-twelve-NA19017.vcf.gz \
        --output-denoised-copy-ratios $WORKSPACE_LOCAL/sandbox/denoised-copy-ratios-cohort7-twelve-NA19017.tsv \
        --sequence-dictionary $REFERENCE_BUCKET/Homo_sapiens_assembly38.dict

## _PostprocessGermlineCNVCalls_ in case mode

`NA19017` is the single sample, with index 0, in the case mode run from section 5.

In [None]:
! gatk PostprocessGermlineCNVCalls \
        --model-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-cohort6-twelve/gcnv-cohort6-twelve-1of2-model \
        --model-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-cohort6-twelve/gcnv-cohort6-twelve-2of2-model \
        --calls-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-case-twelve-vs-cohort6/gcnv-case-twelve-vs-cohort6-1of2-calls \
        --calls-shard-path $WORKSPACE_LOCAL/sandbox/gcnv-case-twelve-vs-cohort6/gcnv-case-twelve-vs-cohort6-2of2-calls \
        --allosomal-contig chrX --allosomal-contig chrY \
        --contig-ploidy-calls $WORKSPACE_LOCAL/sandbox/ploidy-cohort6-case-calls \
        --sample-index 0 \
        --output-genotyped-intervals $WORKSPACE_LOCAL/sandbox/genotyped-intervals-case-twelve-vs-cohort6.vcf.gz \
        --output-genotyped-segments $WORKSPACE_LOCAL/sandbox/genotyped-segments-case-twelve-vs-cohort6.vcf.gz \
        --output-denoised-copy-ratios $WORKSPACE_LOCAL/sandbox/denoised-copy-ratios-cohort6-twelve-NA19017.tsv \
        --sequence-dictionary $REFERENCE_BUCKET/Homo_sapiens_assembly38.dict

Each command generates two VCFs with indices, and a TSV file. The `genotyped-intervals` VCF contains variant records for each genomic interval and therefore data covers only the interval regions. For the tutorial's small data, this gives 1400 records. The `genotyped-segments` VCF contains records for each contiguous copy number state segment. For the tutorial's small data, this is 30 and 31 records for cohort and case mode analyses, respectively. Finally, the `denoised-copy-ratios` TSV contains the denoised copy-ratio estimates for each genomic interval.
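
The record counts quoted above can be checked with a few lines of standard-library Python. This minimal sketch counts the non-header lines of a compressed VCF. The function name is hypothetical, and it assumes the `.vcf.gz` outputs are readable by the `gzip` module.

```python
import gzip


def count_vcf_records(vcf_gz_path):
    """Count variant records (non-blank lines that do not start with '#')."""
    with gzip.open(vcf_gz_path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


# Example call (path as used in this tutorial):
# count_vcf_records(f"{WORKSPACE_LOCAL}/sandbox/genotyped-intervals-cohort7-twelve-NA19017.vcf.gz")
```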

The CNV calls for sample `NA19017` are expected to be highly concordant between cohort and case modes. The slight difference is due to the contribution of the sample itself to model training, which is absent in the 6-sample model used for case-calling the sample. Let's take a look.

In [None]:
# let's extract a few things from the VCF file to a nice table
! gatk VariantsToTable \
        -V $WORKSPACE_LOCAL/sandbox/genotyped-intervals-case-twelve-vs-cohort6.vcf.gz \
        -F CHROM -F POS -F END -GF CN \
        -O $WORKSPACE_LOCAL/sandbox/genotyped-intervals-case-twelve-vs-cohort6.table.txt
! gatk VariantsToTable \
        -V $WORKSPACE_LOCAL/sandbox/genotyped-intervals-cohort7-twelve-NA19017.vcf.gz \
        -F CHROM -F POS -F END -GF CN \
        -O $WORKSPACE_LOCAL/sandbox/genotyped-intervals-cohort7-twelve-NA19017.table.txt

In [None]:
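def load_copy_number_table(table_file_path):
    """Load (start, end, copy number) tuples from a VariantsToTable output file."""
    table_data = []
    with open(table_file_path, 'r') as f:
        for line in f:
            line_tokens = line.strip().split()
            # skip blank lines and the header line, which starts with CHROM
            if not line_tokens or line_tokens[0] == 'CHROM':
                continue
            # columns are CHROM, POS, END and the sample's CN; the contig is not kept
            start_pos = int(line_tokens[1])
            end_pos = int(line_tokens[2])
            cn = int(line_tokens[3])
            table_data.append((start_pos, end_pos, cn))
    return table_data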


cohort_cn_calls_data = load_copy_number_table(
    f'{WORKSPACE_LOCAL}/sandbox/genotyped-intervals-cohort7-twelve-NA19017.table.txt')

case_cn_calls_data = load_copy_number_table(
    f'{WORKSPACE_LOCAL}/sandbox/genotyped-intervals-case-twelve-vs-cohort6.table.txt')

In [None]:
len(cohort_cn_calls_data)

In [None]:
len(case_cn_calls_data)

In [None]:
n_agreeing_intervals = sum(
    cohort_cn_call[2] == case_cn_call[2]
    for cohort_cn_call, case_cn_call in zip(cohort_cn_calls_data, case_cn_calls_data))

print(n_agreeing_intervals)

The cell above prints the number of intervals on which the two modes agree. Only 13 of the 1400 intervals have copy-number calls that differ between cohort and case mode for sample `NA19017`.
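
To see which intervals disagree rather than only how many, a small helper can list them. This sketch works on the `(start, end, cn)` tuples returned by `load_copy_number_table` and assumes both tables list the same intervals in the same order; it raises an error otherwise. The function name is hypothetical.

```python
def discordant_intervals(cohort_calls, case_calls):
    """Return (start, end, cohort_cn, case_cn) for intervals whose calls differ."""
    if len(cohort_calls) != len(case_calls):
        raise ValueError("tables have different numbers of intervals")
    discordant = []
    for (c_start, c_end, c_cn), (k_start, k_end, k_cn) in zip(cohort_calls, case_calls):
        if (c_start, c_end) != (k_start, k_end):
            raise ValueError("tables are not aligned on the same intervals")
        if c_cn != k_cn:
            discordant.append((c_start, c_end, c_cn, k_cn))
    return discordant


# Toy data for illustration only; not tutorial results.
toy_cohort = [(1, 1000, 2), (1001, 2000, 1), (2001, 3000, 2)]
toy_case = [(1, 1000, 2), (1001, 2000, 2), (2001, 3000, 2)]
print(discordant_intervals(toy_cohort, toy_case))
```

Calling `discordant_intervals(cohort_cn_calls_data, case_cn_calls_data)` on the tables loaded earlier lists the coordinates to inspect in IGV.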

# Where to go from here?

[Article #11687](https://gatk.zendesk.com/hc/en-us/articles/360035531452) visualizes the results in IGV and provides follow-up discussion.

Towards data exploration, here are two illustrative Jupyter Notebook reports that dissect the results:
- [Notebook #11685](https://gatk.zendesk.com/hc/en-us/articles/360035890031/) shows an approach to measuring concordance of sample NA19017 gCNV calls to 1000 Genomes Project truth set calls using tutorial chr20sub small data.
- [Notebook #11686](https://gatk.zendesk.com/hc/en-us/articles/360035889891) examines gCNV callset annotations using larger data, namely chr20 gCNV results using a 24-sample cohort.
