# Common Somatic Tertiary Analysis (COSTA) Notebooks

This series of notebooks performs common tertiary analyses of somatic genetic variants. The series consists of the following notebooks:

- Notebook 0: Somatic Variant Source Data (not in OpenBio)
- Notebook 1: Somatic VCF to annotated MAF
- Notebook 2: Kaplan-Meier Survival Curve: Phenotype Based Cohort
- Notebook 3: Population Level Somatic Mutation Analysis
- Notebook 4: Kaplan-Meier Survival Curve: Somatic Variant Based Cohort
- Notebook 5: Gene Level Somatic Mutation Analysis

# Notebook 3: Population level somatic mutation analysis
This notebook demonstrates how to identify the top 10 most mutated genes in a cohort and how to visualize the mutation status of the cohort. We also generate a list of the genes co-occurring with the top 10 mutated genes.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## 1. Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: R
* Instance type: mem1_ssd1_v2_x16
* Cost: < $0.2
* Runtime: ~10 min
* Data description: File input for this notebook is:

    * MAF file produced from a previous notebook in this series, or project-level TCGA somatic MAF file from GDC Data Portal.
    
    _Note: The input MAF file(s) need to contain variants in at least two genes for the Oncoplot and Exclusive/Co-occurrence analysis to run properly._

### Package and tools dependency:

| Package | License | 
| --- | --- |
| <a href="https://bioconductor.org/packages/maftools">maftools</a> | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| <a href="https://readr.tidyverse.org/">readr</a> |  <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |


**Install Packages**

Uncomment the install commands if you are comfortable with the library licenses and want to install the libraries and run the parts of the notebook that depend on them.

_Note: Package installation takes ~5 minutes_

In [None]:
# BiocManager::install("maftools")
# install.packages("readr")
# install.packages("dplyr")  # dplyr is also loaded in section 2

**Declare input and output file/folder names**

Here we use the individual MAF files generated from individual VCF files in `Notebook 1: Somatic VCF to annotated MAF` as our input.

Our output files will be:
* A csv file containing all variants for the top 10 genes, one variant per row, including the following information for each variant: Chromosome number, start position, end position, HGNC gene symbol, sample IDs with the variant, and protein change.
* A csv file containing the co-occurrence analysis for the top 10 mutated genes, including at least the following information: Gene symbols of the co-occurrence, P-value, odds ratio, and the specific event (co-occurrence or exclusive).

In [None]:
# Input folder
maf_folder <- "individual_maf"

# Output files
top_10_file <- "tcga-brca-top10-genes.csv"
co_occurrence_file <- "tcga-brca-top10-co-occurrence.csv"

**Download Data**

If we want to use the project-level MAF file obtained from the GDC Data Portal and saved in our project, we download it using `dx download <file_name>`.

In [None]:
# # Get data from GDC Data portal (open, Mutect2 Variant Aggregation and Masking)
# # File = TCGA.BRCA.mutect.995c0111-d90b-4140-bee7-3845436c3b42.DR-10.0.somatic.maf.gz
# # UUID = 995c0111-d90b-4140-bee7-3845436c3b42
# # Data Category =  Simple Nucleotide Variation
# # Data Type = Masked Somatic Mutation
# # File page link: https://portal.gdc.cancer.gov/files/995c0111-d90b-4140-bee7-3845436c3b42

# system("dx download gdc_download_20220310_213209.934185.tar.gz
# tar -xvf gdc_download_20220310_213209.934185.tar.gz
# mv 995c0111-d90b-4140-bee7-3845436c3b42/*.maf.gz .
# mv *.maf.gz tcga-brca.somatic.maf.gz
# gunzip tcga-brca.somatic.maf.gz")

## 2. Load Libraries

In [None]:
library(maftools)
library(readr)
library(dplyr)

_Note: At this point, we suggest creating a snapshot of the environment for reuse (DNAnexus > Create Snapshot). Once a snapshot is created, the object may be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 3. Load Data
We will read and merge the individual level MAF files into one MAF object and use it for analysis.

In [None]:
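# Build full paths to the individual MAF files so that the working
# directory does not need to be changed with setwd()
maf_files <- list.files(
  path = file.path("/mnt/project", maf_folder),
  full.names = TRUE
)
# Merge all individual MAF files into a single MAF object
my_maf <- merge_mafs(
  mafs = maf_files,
  verbose = TRUE
)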
my_maf
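
If the input folder might hold files other than MAFs, one option is to filter the listing by file extension before merging. The following is a minimal sketch in base R. It uses a temporary folder with made-up file names, and the `.maf` / `.maf.gz` naming is an assumption about how the Notebook 1 outputs are named.

```r
# Hypothetical folder standing in for /mnt/project/<maf_folder>
demo_dir <- file.path(tempdir(), "individual_maf_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(
  file.path(demo_dir, c("sample_a.maf", "sample_b.maf.gz", "README.txt"))
))

# Keep only files ending in .maf or .maf.gz; full.names = TRUE returns
# paths that can be passed on without changing the working directory
maf_files <- list.files(
  path = demo_dir,
  pattern = "\\.maf(\\.gz)?$",
  full.names = TRUE
)
basename(maf_files)
```

The resulting character vector of paths could then be passed to `merge_mafs()` in place of an unfiltered listing.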

Alternatively, use the project-level MAF file from the GDC Data Portal.

In [None]:
# my_maf = read.maf(maf = "tcga-brca.somatic.maf")
# my_maf

## 4. Population Level Somatic Mutation Analysis 

###  Plotting MAF summary

Plot a visual summary of the mutational status of the cohort using the `plotmafSummary` function.

In [None]:
plotmafSummary(
  maf = my_maf,
  rmOutlier = TRUE,
  addStat = "median",
  dashboard = TRUE,
  titvRaw = FALSE
)

###  Get top 10 mutated genes
Get gene summary and obtain the top 10 mutated genes.

In [None]:
top10_genes <- getGeneSummary(my_maf)[1:10]
top10_genes
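
To make the idea of "most mutated" concrete, the sketch below ranks genes by the number of distinct samples carrying at least one variant in the gene. It uses a toy table with made-up gene and sample names and base R only. It illustrates the concept and is not maftools' internal implementation; the exact ordering used by `getGeneSummary` should be checked against the maftools documentation.

```r
# Toy variant table; gene and sample names are made up for illustration
variants <- data.frame(
  Hugo_Symbol = c("GENE_A", "GENE_A", "GENE_A", "GENE_A",
                  "GENE_B", "GENE_B", "GENE_C"),
  Tumor_Sample_Barcode = c("S1", "S1", "S2", "S3", "S1", "S3", "S2"),
  stringsAsFactors = FALSE
)

# Drop repeated gene/sample pairs so each sample counts once per gene
gene_sample <- unique(variants[, c("Hugo_Symbol", "Tumor_Sample_Barcode")])

# Count mutated samples per gene and sort from most to least mutated
mutated_samples <- sort(table(gene_sample$Hugo_Symbol), decreasing = TRUE)
head(mutated_samples, 2)
```

Counting distinct samples rather than raw variants keeps a single heavily mutated sample from dominating the ranking.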

Subset the MAF file to obtain variant information for the top 10 genes.

In [None]:
maf_subset <- subsetMaf(
  maf = my_maf,
  genes = top10_genes$Hugo_Symbol,
  mafObj = FALSE
)

Add the following variant information from the MAF subset data.table to the top 10 genes data.table:
* Chromosome number
* Start position
* End position
* Sample IDs (tumor samples which contain those variants, and their matched normal samples)
* SIFT and PolyPhen predictions
* Protein change (amino acid change)

In [None]:
top10_genes <- top10_genes %>%
  inner_join(
    maf_subset %>%
      select(
        Chromosome,
        Start_Position,
        End_Position,
        Tumor_Sample_Barcode,
        Matched_Norm_Sample_Barcode,
        SIFT,
        PolyPhen,
        Amino_acids,
        Hugo_Symbol
      ),
    by = "Hugo_Symbol"
  )

head(top10_genes)
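
The join above is one-to-many: the gene summary has one row per gene, while the variant table has one row per variant, so the result has one row per variant with the gene-level columns repeated. A minimal sketch of that behavior follows, using toy data with made-up names and values, and base R `merge()` as a stand-in for `inner_join` so that it runs without extra packages.

```r
# Toy gene summary: one row per gene (names and counts are made up)
gene_summary <- data.frame(
  Hugo_Symbol = c("GENE_A", "GENE_B"),
  MutatedSamples = c(3, 2)
)

# Toy variant table: one row per variant, so gene symbols repeat
variant_info <- data.frame(
  Hugo_Symbol = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  Chromosome = c("chr1", "chr1", "chr2", "chr3"),
  Start_Position = c(100, 250, 400, 900)
)

# Inner join on the gene symbol: rows without a match (GENE_C) are dropped,
# and each gene's summary values repeat once per matching variant
joined <- merge(gene_summary, variant_info, by = "Hugo_Symbol")
nrow(joined)
```

Because the summary counts are repeated on every variant row, they should not be summed across rows of the joined table.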

### Oncoplot for the top 10 mutated genes

In [None]:
oncoplot(
  maf = my_maf,
  top = 10
)

### Exclusive/co-occurrence analysis of top 10 genes

In [None]:
somatic_interaction <- somaticInteractions(
  maf = my_maf,
  top = 10,
  pvalue = c(0.05, 0.1)
)
head(somatic_interaction)
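
`somaticInteractions` is documented as performing pairwise Fisher's exact tests. The sketch below shows what such a test looks like for a single gene pair, using base R and made-up mutation statuses for ten samples. Labeling the event by whether the odds ratio is above 1 is an assumption made here for illustration and may not match maftools' exact rule.

```r
# Toy mutation status per sample (TRUE = mutated); values are made up
gene_a <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
gene_b <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

# 2x2 contingency table of mutation status across samples
status_table <- table(
  gene_a = factor(gene_a, levels = c(FALSE, TRUE)),
  gene_b = factor(gene_b, levels = c(FALSE, TRUE))
)

# Fisher's exact test on the 2x2 table
test <- fisher.test(status_table)

# Label the direction of the association by the estimated odds ratio
event <- if (test$estimate > 1) "co-occurrence" else "mutually exclusive"
c(p_value = test$p.value, odds_ratio = unname(test$estimate))
event
```

In this toy data the two genes are mutated together in three samples and apart in only two, so the estimated odds ratio is above 1 and the pair is labeled as co-occurring.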

## 5. Upload results to the project
We write the output data frames to csv files and upload them to our project using the dx-toolkit CLI command `dx upload <file_name>`.

In [None]:
# Top 10 genes
write_csv(top10_genes, top_10_file)
# Co-occurrence analysis
write_csv(somatic_interaction, co_occurrence_file)

In [None]:
system(paste("dx upload", shQuote(top_10_file)))
system(paste("dx upload", shQuote(co_occurrence_file)))
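
`system()` returns the exit status of the command it runs, so a failed upload can otherwise pass unnoticed. The following is a minimal sketch of a wrapper that stops on a non-zero status. `run_checked` is a hypothetical helper name, not part of dx-toolkit, and the example uses a harmless `echo` command in place of `dx upload`.

```r
# Run a shell command and stop if it reports a non-zero exit status.
# run_checked is a hypothetical helper name, not part of dx-toolkit.
run_checked <- function(command) {
  status <- system(command)
  if (status != 0) {
    stop("Command failed with exit status ", status, ": ", command)
  }
  invisible(status)
}

# Demonstrated with a harmless command; in the notebook the argument would be
# paste("dx upload", shQuote(top_10_file))
run_checked("echo upload-check")
```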