# Common Somatic Tertiary Analysis (COSTA) Notebooks

This series of notebooks demonstrates common tertiary analyses of somatic genetic variants. The series consists of the following notebooks:

- Notebook 0: Somatic Variant Source Data (not in OpenBio)
- Notebook 1: Somatic VCF to annotated MAF
- Notebook 2: Kaplan-Meier Survival Curve: Phenotype Based Cohort
- Notebook 3: Population Level Somatic Mutation Analysis
- Notebook 4: Kaplan-Meier Survival Curve: Somatic Variant Based Cohort
- Notebook 5: Gene Level Somatic Mutation Analysis

# Notebook 2: Kaplan-Meier Survival Curve: Phenotype Based Cohort
This notebook demonstrates how to perform survival analysis with the Kaplan-Meier (K-M) survival curve and how to visualize the survival rate of two cohorts.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.



## 1. Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: R
* Instance type: mem1_ssd1_v2_x16
* Cost: < $0.2
* Runtime: ~8 min
* Data description: File input for this notebook is:

    A phenotype table with clinical data for all the individuals in the study. The columns of interest are listed in the table below.
    * A _derived_ column has values that we obtain using values from other columns. In this case, we have `end_days_to` as either `death_days_to` or `last_contact_days_to`.
    * A _synthetic_ column has synthetically generated data. In this case, we have generated synthetic data to be able to analyze multi-state events, as described in section 4 below.

| Column | Description | 
| --- | --- |
| vital_status | Whether the patient is Alive or Dead. |
| history_other_malignancy | The yes/no/unknown indicator used to describe the patient's history of prior cancer diagnosis. |
| surgical_procedure_first | First surgical procedure related to the diagnosis performed on the patient. |
| tumor_status | Whether the patient has tumor or is tumor free. |
| last_contact_days_to | Number of days between the date used for index and the date of the patient's last follow-up appointment or contact. NA if the patient dies. |
| death_days_to | Number of days between the date used for index and the date of the patient's death. NA if the person drops out of the study or loses contact. |
| end_days_to (_derived_) | Number of days between the date used for index and the date of the terminal event (death or dropping out/last contact). |
| new_tumor_dx_days_to (_synthetic_) | Number of days between the date used for index and the date of diagnosis of a new tumor. |
| surgical_procedure_first_days_to (_synthetic_) | Number of days between the date used for index and the date of first surgical procedure. |
| complete_response_days_to (_synthetic_) | Number of days between the date used for index and the date when the patient attains complete response. |    
        
### Package and tools dependency:

| Package | License | 
| --- | --- |
| <a href="https://readr.tidyverse.org/">readr</a> | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| <a href="https://cran.r-project.org/package=ggfortify">ggfortify</a> | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |
| <a href="https://cran.r-project.org/package=survival">survival</a> | <a href="https://cran.r-project.org/web/licenses/LGPL-2">LGPL (>= 2)</a> |
| <a href="https://patchwork.data-imaginist.com/">patchwork</a> | <a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> + <a href="https://cran.r-project.org/web/packages/tidyverse/LICENSE">file LICENSE</a> |

**Install Packages**

Uncomment the install commands if you are comfortable with the library licenses and want to install the libraries and run the parts of the notebook that depend on them.

_Note: Package installation takes ~5 minutes_

In [None]:
# install.packages("survival")
# install.packages("ggfortify")
# install.packages("readr")
# install.packages("patchwork")

**Declare input file name**

Here we use the phenotype table generated using the source MAF file and clinical data of the BRCA project that we downloaded from GDC.

In [None]:
# Input files
pheno_file <- "tcga-brca-phenotype.csv"

## 2. Load Libraries

In [None]:
library(survival)
library(ggfortify)
library(readr)
library(patchwork)
library(dplyr)

_Note: At this point, we suggest creating a snapshot of the environment for reuse --> DNAnexus/Create Snapshot. Once a snapshot is created, it can be used when launching a new JupyterLab instance and will contain all installed packages and any downloaded data._

## 3. Load Data
Read all the "time to event" columns as numeric and the categorical columns as factors.

In [None]:
pheno_df <- readr::read_csv(
  paste0("/mnt/project/", pheno_file),
  show_col_types = FALSE,
  na = c("NA", "null"),
  col_types = list(
    birth_days_to = col_integer(),
    last_contact_days_to = col_integer(),
    death_days_to = col_integer(),
    vital_status = col_factor(),
    gender = col_factor(),
    history_other_malignancy = col_factor(),
    surgical_procedure_first = col_factor(),
    her2_status_by_ihc = col_factor(),
    tumor_status = col_factor()
  )
)
head(pheno_df, 5)
colnames(pheno_df)

## 4. Add synthetic data and transform the data for Survival analysis

#### Add a column for time (days) to terminal/end event
In this case, the terminal/end event is either death or the patient dropping out of the study.

In [None]:
pheno_df <- pheno_df %>%
  rowwise() %>%
  mutate(end_days_to = if_else(is.na(death_days_to), last_contact_days_to, death_days_to)) %>%
  filter(end_days_to > 1)

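# Convert vital status to a numeric event indicator, where 1 denotes death and
# 0 denotes alive (thus censored). as.numeric() on a factor returns level codes
# rather than 0/1, so compare against the label explicitly
# (assumes deaths are labelled "Dead", in any letter case).
pheno_df <- pheno_df %>%
  mutate(vital_status = as.numeric(toupper(as.character(vital_status)) == "DEAD"))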

head(pheno_df)
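
The conversion of a factor to a 0/1 event indicator is easy to get wrong, so here is a minimal sketch with toy values (the `status` vector is hypothetical and not taken from the phenotype table). It shows that `as.numeric()` on a factor returns the integer level codes, while an explicit comparison against the label yields the 0/1 coding.

```r
# Toy vital-status values (hypothetical, not from the phenotype table)
status <- factor(c("Alive", "Dead", "Alive"))

# as.numeric() on a factor returns the integer level codes, not 0/1
codes <- as.numeric(status)

# An explicit comparison gives a 0/1 indicator (1 = death, 0 = censored)
indicator <- as.numeric(status == "Dead")

print(codes)
print(indicator)
```

Which level receives which code depends on how the factor was built, so the explicit comparison is the safer of the two forms.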

#### Add synthetic data for multi-state events
Our clinical data describes only a two-state event (the patient is either alive or in the terminal state). It has no time to event information for intermediate events, so it is not sufficient to showcase how K-M can support multi-state analysis. Hence we add synthetic data to our existing phenotype table.

In addition to the alive and terminal states, we introduce three new time to event columns: time to surgery (`surgical_procedure_first_days_to`), time to complete response (`complete_response_days_to`) and time to development of a new tumor (`new_tumor_dx_days_to`). Each of these is a random day some time before the terminal event day. We assign `surgical_procedure_first_days_to` only to the patients who have definitive surgical procedures (either Simple Mastectomy, Modified Radical Mastectomy, or Lumpectomy).

In [None]:
set.seed(5)

surgery_types <- c(
  "Simple Mastectomy",
  "Modified Radical Mastectomy",
  "Lumpectomy"
)

# new_tumor_dx_days_to = a random day some time before the terminal day
# surgical_procedure_first_days_to = a random day some time before the terminal day
#                                    only if surgery in surgery types
# complete_response_days_to = a random day some time before the terminal day
pheno_df <- pheno_df %>%
  rowwise() %>%
  mutate(
    new_tumor_dx_days_to = end_days_to - floor(runif(1, min = 1, max = end_days_to))
  ) %>%
  mutate(
    surgical_procedure_first_days_to = ifelse(
      surgical_procedure_first %in% surgery_types,
      end_days_to - floor(runif(1, min = 1, max = end_days_to)),
      NA
    )
  ) %>%
  mutate(
    complete_response_days_to = end_days_to - floor(runif(1, min = 1, max = end_days_to))
  )
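
As a minimal sketch of the sampling rule used above, the following applies the same expression to a few hypothetical terminal days (the `end_days_to` values here are made up). It illustrates that every synthetic day falls between day 1 and the day before the terminal event.

```r
set.seed(5)  # fixed seed, used here only for reproducibility

# Hypothetical terminal-event days for four patients
end_days_to <- c(2, 10, 365, 2000)

# One uniform draw per patient, floored, then subtracted from the terminal day
offset <- floor(runif(length(end_days_to), min = 1, max = end_days_to))
synthetic_day <- end_days_to - offset

# Every synthetic day lies between day 1 and the day before the terminal event
print(data.frame(end_days_to, synthetic_day))
```

This is also why rows with `end_days_to` of 1 or less need to be filtered out beforehand: they leave no room for an earlier event day.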

#### Adjust the time intervals

If two events occur on the same day, it results in a length 0 interval. Intervals of length 0 are illegal for the Surv objects of the survival package. To overcome the problem of length 0 intervals, we correct the data by forcing one of the overlapping events to fall a day prior to its original date.

In [None]:
pheno_df <- pheno_df %>%
  rowwise() %>%
  mutate(
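    complete_response_days_to = if_else(
      complete_response_days_to == surgical_procedure_first_days_to,
      complete_response_days_to - 1,
      complete_response_days_to,
      # Patients without a surgery day give an NA comparison; keep their value
      missing = complete_response_days_to
    )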
  ) %>%
  mutate(
    complete_response_days_to = if_else(
      complete_response_days_to == new_tumor_dx_days_to,
      complete_response_days_to - 1,
      complete_response_days_to
    )
  ) %>%
  mutate(
    surgical_procedure_first_days_to = if_else(
      surgical_procedure_first_days_to == new_tumor_dx_days_to,
      surgical_procedure_first_days_to - 1,
      surgical_procedure_first_days_to
    )
  )

When we deal with multiple events per patient, we create intervals (start time, end time) instead of using only the time to event information. The `tmerge` function of the survival package creates these time intervals for multi-state data and adds columns indicating whether a patient has had a prior surgery or complete response.

In [None]:
cols <- c(
        "case_id", 
        "gender", 
        "tumor_status", 
        "history_other_malignancy", 
        "surgical_procedure_first", 
        "her2_status_by_ihc", 
        "vital_status",
        "end_days_to")

merged_df <- tmerge(
  pheno_df[, cols],
  pheno_df,
  id = case_id,
  death = event(end_days_to, vital_status),
  surgery = event(surgical_procedure_first_days_to),
  complete_response = event(complete_response_days_to),
  new_tumor = event(new_tumor_dx_days_to),
  prior_complete_response = tdc(complete_response_days_to),
  prior_surgery = tdc(surgical_procedure_first_days_to)
)
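
To make the interval structure concrete, here is a minimal sketch of `tmerge` on a hypothetical two-patient table (the `toy` data frame and its column names are assumptions for illustration). Patient 1 contributes two intervals, split at the surgery day, while patient 2 contributes one.

```r
library(survival)

# Hypothetical table: patient 1 has surgery on day 30 and dies on day 100;
# patient 2 has no surgery and is censored on day 80
toy <- data.frame(
  id = 1:2,
  end = c(100, 80),
  dead = c(1, 0),
  surg = c(30, NA)
)

intervals <- tmerge(
  toy[, c("id", "end")],
  toy,
  id = id,
  death = event(end, dead),   # sets the follow-up range and the death flag
  surgery = event(surg),      # flags the interval that ends with surgery
  prior_surgery = tdc(surg)   # 0 before surgery, 1 afterwards
)

# Drop the tmerge class so the result prints as a plain data frame
out <- as.data.frame(intervals)
print(out[, c("id", "tstart", "tstop", "death", "surgery", "prior_surgery")])
```

Missing event times, such as the `NA` surgery day of patient 2, simply add no split for that patient.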

#### Decorate the dataframe with columns required for Survival Analysis

* Add a column denoting the event occurring during every interval in the table.
* Add columns denoting Complete Response Status and Surgery Status and a column that tells whether the terminal event occurred before or after surgery.


In [None]:
merged_df <- merged_df %>%
  rowwise() %>%
  mutate(
    event = factor(
      (complete_response + 2 * surgery + 4 * new_tumor + 8 * death),
      c(0, 1, 2, 4, 8),
      c("none", "complete_response", "surgery", "new_tumor", "death")
    )
  ) %>%
  mutate(
    cr_stat = factor(
      if_else(prior_complete_response == 1, 0, c(0, 1, 0, 0, 2)[event]),
      0:2,
      c("none", "CR", "death")
    )
  ) %>%
  mutate(
    surgery_stat = factor(
      (ifelse(prior_surgery, 0, c(0, 0, 1, 0, 2)[event])),
      0:2,
      c("censor", "surgery", "death")
    )
  ) %>%
  mutate(
    surgery2 = factor(
      (c(0, 0, 1, 0, 2)[event] + prior_surgery),
      0:3,
      c("censor", "surgery", "death w/o surgery", "death after surgery")
    )
  )

head(merged_df, 10)
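
The event encoding above can be sketched on a few hypothetical indicator values (all vectors below are made up for illustration). Each event gets its own power of two, so a sum that matches none of the listed levels means two events ended the same interval, and the factor becomes `NA`.

```r
labels <- c("none", "complete_response", "surgery", "new_tumor", "death")
codes <- c(0, 1, 2, 4, 8)

# Hypothetical indicator values for four intervals
complete_response <- c(0, 1, 0, 1)
surgery <- c(0, 0, 0, 1)
new_tumor <- c(0, 0, 0, 0)
death <- c(0, 0, 1, 0)

# The fourth interval has two simultaneous events, so its score is 3
score <- complete_response + 2 * surgery + 4 * new_tumor + 8 * death
event <- factor(score, levels = codes, labels = labels)
print(event)

# Indexing a lookup vector by the factor uses its integer level codes
cr_code <- c(0, 1, 0, 0, 2)[event]
print(cr_code)
```

This is one reason the earlier step moves overlapping events onto different days: simultaneous events have no level in this encoding.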

#### Convert time to event from days to months

In [None]:
merged_df <- merged_df %>%
  rowwise() %>%
  mutate(end_days_to = end_days_to * 12 / 365.25) %>%
  mutate(tstart = tstart * 12 / 365.25) %>%
  mutate(tstop = tstop * 12 / 365.25)
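
As a quick check of the conversion factor, the sketch below wraps the same arithmetic in a hypothetical helper, `days_to_months`, and confirms that five years of follow-up corresponds to 60 months.

```r
days_per_year <- 365.25

# Hypothetical helper: same arithmetic as the conversion above
days_to_months <- function(days) days * 12 / days_per_year

five_years_in_days <- 5 * days_per_year
print(five_years_in_days)
print(days_to_months(five_years_in_days))
```

Only `merged_df` is converted here, so models fitted on `pheno_df` still measure time in days.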

## 5. Survival Analysis

We use the <a href="https://cran.r-project.org/web/packages/survival/vignettes/survival.pdf">survival</a> package to perform survival analysis on our data.

#### Single event

Fit a model that gives the probability of survival with respect to time.

In [None]:
model_se1 <- survfit(Surv(end_days_to, vital_status) ~ 1, data = pheno_df)
model_se1

In [None]:
# Obtain summary of the model
summary(model_se1)
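
To show what `survfit` estimates, here is a minimal sketch on hypothetical follow-up data (the `time` and `status` vectors are made up). It computes the Kaplan-Meier product-limit estimate by hand, multiplying `1 - deaths / at_risk` over the distinct times, and prints it next to the `survfit` result.

```r
library(survival)

# Hypothetical follow-up times and event flags (1 = death, 0 = censored)
time <- c(2, 3, 3, 5, 8)
status <- c(1, 1, 0, 1, 0)

fit <- survfit(Surv(time, status) ~ 1)

# Hand calculation of the product-limit estimate at each distinct time
event_times <- sort(unique(time))
at_risk <- sapply(event_times, function(t) sum(time >= t))
deaths <- sapply(event_times, function(t) sum(time == t & status == 1))
km_by_hand <- cumprod(1 - deaths / at_risk)

print(data.frame(time = event_times, at_risk, deaths, km_by_hand, survfit = fit$surv))
```

Censored patients lower the number at risk at later times without causing a drop in the curve themselves.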

In [None]:
# Survival plot
autoplot(
    model_se1, 
    main = "Survival plot of patients", 
    xlab = "Time (in days)", 
    ylab = "Probability of survival"
)

Fit a model that gives the probability of survival with respect to time, based on tumor status.

In [None]:
# Fit model based on tumor status
model_se2 <- survfit(
    Surv(end_days_to, vital_status) ~ tumor_status, 
    data = pheno_df[pheno_df$tumor_status %in% c("TUMOR FREE", "WITH TUMOR"), ])

# Survival plot
autoplot(
    model_se2, 
    xlab = "Time (in days)", 
    ylab = "Probability of survival", 
    main = "Tumor status based Survival plot") +
labs(color = "Tumor status", fill = "Tumor status")

#### Multi state event


Fit a model that gives the probability of events like Complete Response, Surgery and Death occurring at different points in time post enrollment, based on the Surgery that was performed.

In [None]:
# Compare the probabilities of Simple Mastectomy -vs- Modified Radical Mastectomy
data <- merged_df %>%
  filter(surgical_procedure_first %in% c("Modified Radical Mastectomy", "Simple Mastectomy"))

sfit1 <- survfit(Surv(end_days_to, vital_status) ~ surgical_procedure_first, data) # Survival
sfit2 <- survfit(Surv(tstart, tstop, cr_stat) ~ surgical_procedure_first, data = data, id = case_id) # Complete Response
sfit3 <- survfit(Surv(tstart, tstop, surgery_stat) ~ surgical_procedure_first, data = data, id = case_id) # Surgery

In [None]:
p1 <- autoplot(
  sfit1,
  xlab = "Time (in months)",
  ylab = "Probability of survival",
  main = "Survival plot based on Surgical procedure"
) +
  labs(color = "Surgery", fill = "Surgery")

In [None]:
p2 <- autoplot(
  sfit2[, "CR"],
  xlab = "Time (in months)",
  ylab = "Fraction with the endpoint",
  main = "Percentage of patients reaching Complete Response based on Surgical procedure"
) +
  labs(color = "Surgery", fill = "Surgery")

In [None]:
p3 <- autoplot(
  sfit3[, "surgery"],
  xlab = "Time (in months)",
  ylab = "Fraction with the endpoint",
  main = "Percentage of patients having surgery based on Surgical procedure"
) +
  labs(color = "Surgery", fill = "Surgery")

In [None]:
options(repr.plot.width = 30, repr.plot.height = 10)
p1 + p2 + p3

#### Other statistics
**Estimating x-time survival:** Estimate the probability of survival after 5 years, as a function of malignancy history.

In [None]:
sfit4 <- survfit(Surv(end_days_to, vital_status) ~ history_other_malignancy, data = pheno_df[pheno_df$history_other_malignancy %in% c("No", "Yes"),])

In [None]:
# end_days_to in pheno_df is measured in days, so 5 years is 5 * 365.25 days
summary(sfit4, times = 5 * 365.25)

For patients with a history of malignancy, the estimated probability of surviving 5 years is 62%; for patients without such a history, it is 79%.
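
The `times` argument of `summary` is interpreted in the same unit as the time variable of the fitted model. The sketch below, on hypothetical follow-up times recorded in days, shows how the estimate differs when five years is passed as days versus as the bare number 60.

```r
library(survival)

# Hypothetical follow-up recorded in days (1 = death, 0 = censored)
time_days <- c(200, 900, 1500, 2000, 2500)
status <- c(1, 0, 1, 1, 0)

fit <- survfit(Surv(time_days, status) ~ 1)

# Five years expressed in days, the unit of the time variable
surv_5y <- summary(fit, times = 5 * 365.25)$surv

# The bare number 60 is read as 60 days, before any event has occurred
surv_60 <- summary(fit, times = 60)$surv

print(c(five_years = surv_5y, sixty_days = surv_60))
```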

**Comparing survival times between groups**: Compare survival times between patients with and without malignancy history.

In [None]:
survdiff(Surv(end_days_to, vital_status) ~ history_other_malignancy, data = pheno_df[pheno_df$history_other_malignancy %in% c("No", "Yes"),])
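
`survdiff` performs a log-rank test by default. As a minimal sketch of extracting the test statistic and its p-value programmatically, the example below uses the `lung` dataset bundled with the survival package as a stand-in for the phenotype table.

```r
library(survival)

# `lung` ships with the survival package; it stands in for the phenotype table
test <- survdiff(Surv(time, status) ~ sex, data = lung)

# Log-rank chi-square statistic and its p-value (degrees of freedom = groups - 1)
df <- length(test$n) - 1
p_value <- pchisq(test$chisq, df = df, lower.tail = FALSE)

print(c(chisq = test$chisq, df = df, p_value = p_value))
```

A small p-value indicates that the survival curves of the groups differ more than would be expected by chance.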