# Common Somatic Tertiary Analysis (COSTA) Notebooks

This series of notebooks was created to perform common tertiary analysis of somatic genetic variants. The series consists of the following notebooks:

- Notebook 0: Somatic Variant Source Data (not in OpenBio)
- Notebook 1: Somatic VCF to annotated MAF
- Notebook 2: Kaplan-Meier Survival Curve: Phenotype Based Cohort
- Notebook 3: Population Level Somatic Mutation Analysis
- Notebook 4: Kaplan-Meier Survival Curve: Somatic Variant Based Cohort
- Notebook 5: Gene Level Somatic Mutation Analysis

# Notebook 4: Kaplan-Meier Survival Curve: Somatic Variant Based Cohort
This notebook demonstrates how to analyze the survival rate of a cohort carrying mutations in a set of genes compared with a cohort that does not carry those mutations.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## 1. Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: R
* Instance type: mem1_ssd1_v2_x16
* Cost: < $0.2
* Runtime: ~10 min
* Data description: File input for this notebook is:

    * MAF files produced by a previous notebook in this series, or a project-level TCGA somatic MAF file from the GDC Data Portal.
    * A phenotype table containing clinical information about the patients, such as vital status, days to death, and days to last contact.
    * A flat file containing the top 10 co-occurring genes. The co-occurrence information can be found in the file `tcga-brca-top10-co-occurrence.csv`, which we created in the previous step, `Notebook 3: Population Level Somatic Mutation Analysis`.
    
### Package and tools dependency:

| Package | License | 
| --- | --- |
| <a href="https://bioconductor.org/packages/maftools">maftools</a> | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| <a href="https://readr.tidyverse.org/">readr</a> |  <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |

**Install Packages**

Uncomment the install commands if you are comfortable with the library licenses and want to install the packages and run the parts of the notebook that depend on them.

_Note: Package installation takes ~5 minutes_

In [None]:
# BiocManager::install("maftools")
# install.packages("readr")

**Declare input file/folder names**

Here we use the individual MAF files generated from individual VCF files in `Notebook 1: Somatic VCF to annotated MAF` as our input.
We also use the phenotype file `tcga-brca-phenotype.csv`, which contains clinical information about the patients.

In [None]:
# Input folder containing individual level MAF files
maf_folder <- "individual_maf"

# Phenotype file
pheno_file <- "tcga-brca-phenotype.csv"

# Co-occurrence file
co_occurrence_file <- "tcga-brca-top10-co-occurrence.csv"

**Download Data**

If we want to use the project level MAF file obtained from GDC Data Portal that we have saved in our project, download it using `dx download <file_name>`.

In [None]:
# # Get data from GDC Data portal (open, Mutect2 Variant Aggregation and Masking)
# # File = TCGA.BRCA.mutect.995c0111-d90b-4140-bee7-3845436c3b42.DR-10.0.somatic.maf.gz
# # UUID = 995c0111-d90b-4140-bee7-3845436c3b42
# # Data Category =  Simple Nucleotide Variation
# # Data Type = Masked Somatic Mutation
# # File page link: https://portal.gdc.cancer.gov/files/995c0111-d90b-4140-bee7-3845436c3b42

# system("dx download gdc_download_20220310_213209.934185.tar.gz
# tar -xvf gdc_download_20220310_213209.934185.tar.gz
# mv 995c0111-d90b-4140-bee7-3845436c3b42/*.maf.gz .
# mv *.maf.gz tcga-brca.somatic.maf.gz
# gunzip tcga-brca.somatic.maf.gz")

## 2. Load Libraries

In [None]:
library(maftools)
library(readr)
library(dplyr)
library(data.table)

_Note: At this point, we suggest creating a snapshot of the environment for reuse (DNAnexus > Create Snapshot). Once a snapshot is created, it can be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 3. Load and Transform Data

**Clinical data**

Load the phenotype csv file with appropriate column types.

In [None]:
pheno_df <- readr::read_csv(
  paste0("/mnt/project/", pheno_file),
  show_col_types = FALSE,
  na = c("NA", "null"),
  col_types = list(
    last_contact_days_to = col_integer(),
    death_days_to = col_integer(),
    vital_status = col_factor()
  )
)
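
The column-type handling above can be tried on a small inline table. This is a minimal sketch: the rows and the `patient` column are made up for illustration, and `I()` is used to pass literal CSV text to `read_csv` (this requires a recent `readr`). It shows that both `"NA"` and `"null"` are read as missing values, and that `col_factor()` without explicit levels takes its levels in the order in which values first appear.

```r
library(readr)

# Hypothetical inline CSV standing in for the phenotype file.
toy_csv <- I("patient,death_days_to,last_contact_days_to,vital_status
P1,300,null,Dead
P2,null,1200,Alive
P3,NA,45,Alive")

toy_df <- read_csv(
  toy_csv,
  show_col_types = FALSE,
  na = c("NA", "null"),
  col_types = cols(
    death_days_to = col_integer(),
    last_contact_days_to = col_integer(),
    vital_status = col_factor()
  )
)

toy_df

# Factor levels follow the order of first appearance in the data.
levels(toy_df$vital_status)
```

Because the level order depends on the data, it is worth checking `levels()` on the real phenotype table before converting the factor to numbers.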

Add a column for time (days) to terminal event. In this case, terminal event is either death or patient dropping out of the study.

In [None]:
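pheno_df <- pheno_df %>%
  mutate(
    # Use days to death when available, otherwise days to last contact
    termination_days_to = ifelse(is.na(death_days_to), last_contact_days_to, death_days_to)
  ) %>%
  filter(
    # Keep only patients with more than one day of follow-up
    termination_days_to > 1
  ) %>%
  mutate(
    # as.numeric() on a factor returns the level index (1, 2, ...), and the
    # levels follow the order of first appearance in the file. Check
    # levels(pheno_df$vital_status) beforehand to confirm which code means death.
    vital_status = as.numeric(vital_status)
  )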

head(pheno_df)
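
The derivation of the time-to-event column can be illustrated on a toy table. This is a minimal sketch, not the notebook's method: the patients are made up, the labels `"Dead"` and `"Alive"` are assumed values for `vital_status`, and the `event` column is a hypothetical explicit 0/1 indicator added for illustration. `coalesce()` is used here as an equivalent of the `ifelse(is.na(...))` pattern.

```r
library(dplyr)

# Hypothetical toy phenotype table; column names follow the notebook.
toy_pheno <- data.frame(
  patient = c("P1", "P2", "P3", "P4"),
  death_days_to = c(300L, NA, NA, 1L),
  last_contact_days_to = c(NA, 1200L, 0L, NA),
  vital_status = c("Dead", "Alive", "Alive", "Dead")  # assumed labels
)

toy_result <- toy_pheno %>%
  mutate(
    # Prefer days to death; fall back to days to last contact.
    termination_days_to = coalesce(death_days_to, last_contact_days_to),
    # Explicit event indicator: 1 = death observed, 0 = censored.
    event = as.integer(vital_status == "Dead")
  ) %>%
  # Drop patients with one day of follow-up or less.
  filter(termination_days_to > 1)

toy_result
```

In this toy table, P3 and P4 are removed by the filter because their follow-up time is not greater than one day.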

**MAF files**

We will read and merge the individual level MAF files into one MAF object, add clinical data to it, and use it for analysis.

In [None]:
setwd(paste0("/mnt/project/", maf_folder))
my_maf <- merge_mafs(
  mafs = list.files(path = "."),
  verbose = TRUE
)
setwd("/opt/notebooks/")

my_maf@clinical.data <- as.data.table(pheno_df)
my_maf
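
As an alternative to changing the working directory, the file paths can be collected with `list.files(full.names = TRUE)`. The sketch below uses a temporary directory with empty placeholder files. The directory name and the `.maf` extension are assumptions for illustration, so adjust the pattern to the actual file names produced by Notebook 1.

```r
# Stand-in for the MAF folder: a temporary directory with placeholder files.
toy_dir <- file.path(tempdir(), "individual_maf_demo")
dir.create(toy_dir, showWarnings = FALSE)
file.create(file.path(toy_dir, c("sample1.maf", "sample2.maf", "notes.txt")))

# Full paths of files ending in ".maf"; other files in the folder are skipped.
maf_paths <- list.files(path = toy_dir, pattern = "\\.maf$", full.names = TRUE)
basename(maf_paths)
```

A vector of full paths like `maf_paths` could then be passed as the `mafs` argument of `merge_mafs` without calling `setwd()`.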

Alternatively, use the project level MAF file from GDC Data Portal.

In [None]:
# my_maf <- read.maf(
#   maf = "tcga-brca.somatic.maf",
#   clinicalData = pheno_df,
#   verbose = FALSE
# )
# my_maf

**Gene co-occurrence file**

The co-occurring gene pairs are arranged in ascending order of p-value in this file. We select the gene pair with the smallest p-value, so we load only the first row, keeping the columns that contain the gene names and the p-value.

In [None]:
co_occurrence_df <- readr::read_csv(
  paste0("/mnt/project/", co_occurrence_file),
  show_col_types = FALSE,
  n_max = 1,
  col_select = c(
    "gene1",
    "gene2",
    "pValue"
  )
)
co_occurrence_df
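
Reading only the first row relies on the file already being sorted by p-value. If that cannot be guaranteed, the top pair can be selected explicitly. The following sketch uses a made-up table with placeholder gene names and p-values; only the column names `gene1`, `gene2`, and `pValue` come from the notebook.

```r
library(dplyr)

# Hypothetical co-occurrence table, deliberately not sorted by p-value.
toy_co <- data.frame(
  gene1 = c("GENE_A", "GENE_C", "GENE_E"),
  gene2 = c("GENE_B", "GENE_D", "GENE_F"),
  pValue = c(0.03, 0.0004, 0.01)
)

# Keep the single row with the smallest p-value.
top_pair <- toy_co %>%
  slice_min(pValue, n = 1, with_ties = FALSE)

top_pair
```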

## 4. Survival Analysis

We use the `mafSurvival` function from `maftools` to perform survival analysis and compare cohorts.

We choose the pair of co-occurring genes with the lowest p-value and create cohorts based on mutations observed in these genes. We then compare survival between the group of samples carrying mutations in the gene set (`Mutant`) and the group with no mutations in either gene (`Wildtype`).

In [None]:
gene1 <- co_occurrence_df %>% pull(gene1)
gene2 <- co_occurrence_df %>% pull(gene2)

In [None]:
mafSurvival(
  maf = my_maf,
  genes = c(gene1, gene2),
  time = "termination_days_to",
  Status = "vital_status"
)
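
To make the comparison more concrete, the same kind of analysis can be sketched directly with the `survival` package on made-up data. This is an illustration under stated assumptions, not a reproduction of the call above: the follow-up times, events, and group labels are invented, and the `group` column stands in for the mutation-based cohorts.

```r
library(survival)

# Made-up follow-up data: time in days, event (1 = death, 0 = censored),
# and a group label standing in for the mutation-based cohorts.
toy_surv <- data.frame(
  time  = c(200, 450, 800, 1200, 1500, 300, 950, 1800, 2100, 2500),
  event = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 1),
  group = rep(c("Mutant", "Wildtype"), each = 5)
)

# Kaplan-Meier estimate for each group.
km_fit <- survfit(Surv(time, event) ~ group, data = toy_surv)
print(km_fit)

# Log-rank test for a difference between the two survival curves.
lr_test <- survdiff(Surv(time, event) ~ group, data = toy_surv)
print(lr_test)

# Plot both curves with a legend taken from the fitted strata.
plot(km_fit, col = c("red", "blue"), xlab = "Days", ylab = "Survival probability")
legend("bottomleft", legend = names(km_fit$strata), col = c("red", "blue"), lty = 1)
```

With only five samples per group, the test has very little power, so the toy output is useful for checking the mechanics rather than for drawing conclusions.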