# Common Somatic Tertiary Analysis (COSTA) Notebooks

This series of notebooks demonstrates common tertiary analysis of somatic genetic variants. The series consists of the following notebooks:

- Notebook 0: Somatic Variant Source Data (not in OpenBio)
- Notebook 1: Somatic VCF to annotated MAF
- Notebook 2: Kaplan-Meier Survival Curve: Phenotype Based Cohort
- Notebook 3: Population Level Somatic Mutation Analysis
- Notebook 4: Kaplan-Meier Survival Curve: Somatic Variant Based Cohort
- Notebook 5: Gene Level Somatic Mutation Analysis

# Notebook 5: Gene Level Somatic Mutation Analysis
This notebook demonstrates how to analyze the effects of variants on protein structure and active domains using lollipop plots. A lollipop plot is a common tool for visualizing mutation activity, as it gives a concise view of somatic variants in a specific genomic region.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## 1. Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: R
* Instance type: mem1_ssd1_v2_x16
* Cost: < $0.2
* Runtime: ~8 min
* Data description: File input for this notebook is:

    * MAF files produced from a previous notebook in this series, or project-level TCGA somatic MAF file from GDC Data Portal.
    * A phenotype table containing clinical information of the patients like vital status, days to death, days to last contact etc.
    * A flat file containing the top 10 mutated genes across all samples. This information is in the file `tcga-brca-top10-genes.csv`, which we created in `Notebook 3: Population Level Somatic Mutation Analysis`.
    
### Package and tools dependency:

| Package | License | 
| --- | --- |
| <a href="https://bioconductor.org/packages/maftools">maftools</a> | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| <a href="https://readr.tidyverse.org/">readr</a> |  <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |

**Install Packages**

Uncomment the install commands if you are comfortable with the library licenses and want to install the libraries and run the parts of the notebook that depend on them.

_Note: Package installation takes ~5 minutes_

In [None]:
# BiocManager::install("maftools")
# install.packages("readr")
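
Before uncommenting the install commands, it can help to see which packages are already present. The following is a minimal sketch, assuming the four packages loaded later in this notebook are the ones required; `missing_packages` is a hypothetical helper, not part of any library.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in the "Load Libraries" section of this notebook.
required <- c("maftools", "readr", "data.table", "dplyr")
to_install <- missing_packages(required)
print(to_install)
```

An empty result suggests the install step can be skipped, for example when the instance was launched from a snapshot.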

**Declare input file/folder names**

Here we use the individual MAF files generated from individual VCF files in `Notebook 1: Somatic VCF to annotated MAF` as our input.
We also use the phenotype file `tcga-brca-phenotype.csv`, which has clinical information about the patients, and the file `tcga-brca-top10-genes.csv`, which lists the top 10 mutated genes.

In [None]:
# Input folder containing individual level MAF files
maf_folder <- "individual_maf"

# Phenotype file
pheno_file <- "tcga-brca-phenotype.csv"

# Top 10 mutated genes file
gene_file <- "tcga-brca-top10-genes.csv"
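
A quick existence check can catch a misnamed input before the longer loading steps run. This is a sketch only: `check_inputs` is a hypothetical helper, and it assumes the project is mounted at `/mnt/project`, as in the loading cells of this notebook.

```r
# Hypothetical helper: check which inputs exist under a base directory.
check_inputs <- function(base_dir, inputs) {
  paths <- file.path(base_dir, inputs)
  found <- file.exists(paths)  # TRUE for existing files and directories
  names(found) <- inputs
  found
}

# Names declared in the cell above.
inputs <- c("individual_maf", "tcga-brca-phenotype.csv", "tcga-brca-top10-genes.csv")
status <- check_inputs("/mnt/project", inputs)
print(status)
```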

**Download Data**

If we want to use the project level MAF file obtained from GDC Data Portal that we have saved in our project, download it using `dx download <file_name>`.

In [None]:
# # Get data from GDC Data portal (open, Mutect2 Variant Aggregation and Masking)
# # File = TCGA.BRCA.mutect.995c0111-d90b-4140-bee7-3845436c3b42.DR-10.0.somatic.maf.gz
# # UUID = 995c0111-d90b-4140-bee7-3845436c3b42
# # Data Category =  Simple Nucleotide Variation
# # Data Type = Masked Somatic Mutation
# # File page link: https://portal.gdc.cancer.gov/files/995c0111-d90b-4140-bee7-3845436c3b42

# system("dx download gdc_download_20220310_213209.934185.tar.gz
# tar -xvf gdc_download_20220310_213209.934185.tar.gz
# mv 995c0111-d90b-4140-bee7-3845436c3b42/*.maf.gz .
# mv *.maf.gz tcga-brca.somatic.maf.gz
# gunzip tcga-brca.somatic.maf.gz")

## 2. Load Libraries

In [None]:
library(maftools)
library(data.table)
library(readr)
library(dplyr)

_Note: At this point, we suggest creating a snapshot of the environment for reuse (DNAnexus --> Create Snapshot). Once a snapshot is created, the object may be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 3. Load and Transform Data

**Clinical data**

Load the phenotype csv file with appropriate column types.

In [None]:
pheno_df <- readr::read_csv(
  paste0("/mnt/project/", pheno_file),
  show_col_types = FALSE,
  na = c("NA", "null"),
  col_types = list(
    last_contact_days_to = col_integer(),
    death_days_to = col_integer(),
    vital_status = col_factor()
  )
)
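
To illustrate what the `na` and `col_types` arguments do, here is a small sketch that parses made-up CSV text with the same settings. The values are invented for illustration, and passing literal text via `I()` assumes readr 2.0 or later.

```r
library(readr)

# Toy CSV text (made-up values) using the column names from the cell above.
csv_text <- "vital_status,death_days_to,last_contact_days_to
Alive,null,1200
Dead,850,NA
"

toy_pheno <- read_csv(
  I(csv_text),
  show_col_types = FALSE,
  na = c("NA", "null"),          # both strings are read as missing values
  col_types = list(
    last_contact_days_to = col_integer(),
    death_days_to = col_integer(),
    vital_status = col_factor()  # levels are taken from the data
  )
)

str(toy_pheno)
```

Both `null` and `NA` end up as missing integers, so downstream survival calculations do not have to handle two encodings of "unknown".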

**MAF files**

We will read and merge the individual level MAF files into one MAF object, add clinical data to it, and use it for analysis.

In [None]:
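# List the individual MAF files with full paths, so the working directory stays unchanged
maf_files <- list.files(
  path = file.path("/mnt/project", maf_folder),
  full.names = TRUE
)
my_maf <- merge_mafs(
  mafs = maf_files,
  verbose = TRUE
)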

my_maf@clinical.data <- as.data.table(pheno_df)
my_maf
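
Because the clinical table is assigned directly to the `clinical.data` slot, it may be worth confirming that its sample barcodes line up with those in the MAF. The sketch below uses toy barcodes; `compare_barcodes` is a hypothetical helper, not a `maftools` function.

```r
# Hypothetical helper: compare sample barcodes between a MAF and a clinical table.
compare_barcodes <- function(maf_barcodes, clinical_barcodes) {
  maf_barcodes <- unique(as.character(maf_barcodes))
  clinical_barcodes <- unique(as.character(clinical_barcodes))
  list(
    missing_clinical = setdiff(maf_barcodes, clinical_barcodes),  # in MAF only
    unused_clinical = setdiff(clinical_barcodes, maf_barcodes)    # in clinical table only
  )
}

# Toy barcodes (made up) standing in for the real data.
maf_ids <- c("S1", "S2", "S3")
clin_ids <- c("S2", "S3", "S4")
print(compare_barcodes(maf_ids, clin_ids))
```

With the objects above, the inputs would presumably be `getSampleSummary(my_maf)$Tumor_Sample_Barcode` and `pheno_df$Tumor_Sample_Barcode`, assuming the phenotype file carries a `Tumor_Sample_Barcode` column.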

Alternatively, use the project level MAF file from GDC Data Portal.

In [None]:
# my_maf <- read.maf(
#   maf = "tcga-brca.somatic.maf",
#   clinicalData = pheno_df,
#   verbose = FALSE
# )
# my_maf

**Top 10 genes file**

The genes are arranged in descending order of the total number of mutations observed. We select the gene with the maximum number of total mutations, and hence load only one row and the columns containing gene names and total mutations.

In [None]:
top_gene_df <- readr::read_csv(
  paste0("/mnt/project/", gene_file),
  show_col_types = FALSE,
  n_max = 1,
  col_select = c(
    "Hugo_Symbol",
    "total"
  )
)
top_gene_df
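
Reading only the first row relies on the file already being sorted. If that ordering is in doubt, the top gene can be selected explicitly instead. A minimal sketch on a toy table with made-up gene names and counts, using the same column names as the file:

```r
# Toy gene table (made-up counts), deliberately not sorted.
gene_df <- data.frame(
  Hugo_Symbol = c("GENE_B", "GENE_A", "GENE_C"),
  total = c(12L, 40L, 7L),
  stringsAsFactors = FALSE
)

# Order rows by descending total and keep the first one.
top_row <- gene_df[order(gene_df$total, decreasing = TRUE), ][1, ]
print(top_row$Hugo_Symbol)
```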

## 4. Somatic Mutation Analysis

We use the `lollipopPlot` function from `maftools` to perform somatic mutation analysis. We choose the gene with the highest number of total mutations and use `lollipopPlot` to plot its variants along the encoded protein.

First, we examine the mutation landscape of the entire cohort, which shows the variants that are commonly seen for the phenotype. Then we visualize a single sample, which helps us understand how that sample stands out compared to the entire cohort.

### Whole cohort
We visualize the functional amino acid mutations and their impact on various protein domains across all the samples.

**Select top mutated gene**

In [None]:
top_gene <- top_gene_df %>% pull(Hugo_Symbol)

**Lollipop plot**

In [None]:
options(repr.plot.width = 20, repr.plot.height = 15)
lollipopPlot(
  maf = my_maf,
  gene = top_gene,
  AACol = "HGVSp_Short"
)
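
The height of each lollipop reflects how many mutations fall at a given protein position. As a rough illustration of that tally (not how `maftools` computes it internally), the sketch below extracts residue positions from `HGVSp_Short`-style strings and counts them. The protein changes are illustrative values, and `protein_position` is a hypothetical helper.

```r
# Hypothetical helper: extract the first residue position from HGVSp_Short-style strings.
protein_position <- function(hgvsp) {
  pos <- rep(NA_integer_, length(hgvsp))
  hit <- grepl("[0-9]+", hgvsp)
  pos[hit] <- as.integer(regmatches(hgvsp, regexpr("[0-9]+", hgvsp)))
  pos
}

# Toy protein changes (illustrative values only); the empty string has no position.
changes <- c("p.H1047R", "p.E545K", "p.H1047L", "p.E542K", "p.H1047R", "")
positions <- protein_position(changes)

# Count mutations per position, most frequent first (missing positions are dropped).
counts <- sort(table(positions), decreasing = TRUE)
print(counts)
```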

### Single sample
We visualize the functional amino acid mutations and their impact on various protein domains in one sample.

**Load the individual level MAF file**

In [None]:
path_to_maf <- paste0("/mnt/project/", maf_folder, "/TCGA-3C-AALI-01A-11D-A41F-09_vs_TCGA-3C-AALI-10A-01D-A41F-09.maf")
my_single_maf <- read.maf(path_to_maf, clinicalData = pheno_df)

**Get the top mutated gene**

In [None]:
genes <- getGeneSummary(my_single_maf) %>% pull(Hugo_Symbol)
genes[1]

**Lollipop plot**

In [None]:
lollipopPlot(
  maf = my_single_maf,
  gene = genes[1],
  AACol = "HGVSp_Short"
)