# Somatic Variant Survival Analysis Using `dx extract_assay somatic`
<hr/>
***As-Is Software Disclaimer***

This content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates how to use the `dx` command `extract_assay somatic` to perform a Kaplan-Meier survival analysis on a somatic variant dataset.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Python
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.1
* Runtime: ~2 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID. The example dataset used in this notebook has 10 patients and around 3.5K variants.

### Setup dxpy to retrieve somatic variants

This notebook requires dxpy >= v0.352.0. Check the installed version:

In [None]:
%%bash
dx --version

If necessary, upgrade from a package repository (uncomment the line to run it)

In [None]:
%%bash
#pip install -U dxpy

or from a local wheel file

In [None]:
%%bash
#pip install dxpy-0.352.0-py2.py3-none-any.whl
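
The version requirement can also be checked in code. The following is a minimal sketch, assuming that `dx --version` prints a string such as `dx v0.352.0`; the helper names `parse_version` and `meets_minimum` are hypothetical and are not part of dxpy.

```python
import re

MIN_VERSION = (0, 352, 0)  # minimum dxpy version required by this notebook


def parse_version(text):
    """Extract the first X.Y.Z triple found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MIN_VERSION):
    """Return True if the version embedded in text is at least the minimum."""
    return parse_version(text) >= minimum


# Assumed example strings; real output may be formatted differently
print(meets_minimum("dx v0.352.0"))
print(meets_minimum("dx v0.349.1"))
```

In the notebook, the text could come from `subprocess.run(["dx", "--version"], capture_output=True, text=True).stdout`. Comparing integer tuples avoids the pitfalls of comparing version strings lexically.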

`dx extract_assay somatic` is the command that retrieves somatic variants from a dataset. Its available options can be examined with the `--help` flag.

In [None]:
%%bash
dx extract_assay somatic --help

### Select the dataset

Choose a dataset with a somatic variant assay using the `project-id:record-id` identifier.

In [None]:
%env dataset project-GX0Jpp00ZJ46qYPq5G240k1k:record-GXjb4Pj01KjxKYKJvvY50PqZ

## Extract data and sample identifiers

### Retrieve somatic variant data from the dataset 

Download somatic variants in the genes TP53 and PIK3CA to a TSV file by filtering variants with the `--retrieve-variant` option. For a description of the filter and its JSON template, use the `--json-help` flag. Pass `sample_id` to `--additional-fields` to get the sample identifiers needed to merge with phenotypes, and `CLIN_SIG` to get the clinical significance annotation used later to flag pathogenic variants.

In [None]:
%%bash

dx extract_assay somatic ${dataset} \
--retrieve-variant '{
  "annotation": {
    "symbol": ["PIK3CA", "TP53"]
  }
}' \
--additional-fields sample_id,CLIN_SIG \
-o somatic_variants.tsv

cat somatic_variants.tsv
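
When the gene list changes often, the same command can be assembled in Python so that the JSON filter is always well formed. This is a sketch only: `build_somatic_command` is a hypothetical helper, the dataset identifier is a placeholder, and the flags are limited to the ones used in the cell above.

```python
import json
import shlex


def build_somatic_command(dataset, symbols, additional_fields, output_path):
    """Return the argument list for the extract_assay somatic call shown above."""
    # Same filter structure as the inline JSON: annotation -> symbol -> list of genes
    variant_filter = {"annotation": {"symbol": sorted(symbols)}}
    return [
        "dx", "extract_assay", "somatic", dataset,
        "--retrieve-variant", json.dumps(variant_filter),
        "--additional-fields", ",".join(additional_fields),
        "-o", output_path,
    ]


cmd = build_somatic_command(
    "project-xxxx:record-yyyy",  # placeholder, not a real dataset ID
    ["TP53", "PIK3CA"],
    ["sample_id", "CLIN_SIG"],
    "somatic_variants.tsv",
)
print(shlex.join(cmd))  # shell-quoted form, for inspection before running
```

The list can be passed to `subprocess.run(cmd, check=True)`. Passing an argument list rather than a shell string sidesteps quoting problems with the embedded JSON.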

### Retrieve phenotype data from the dataset

Download the phenotype data linked to somatic variant samples using `extract_dataset`. See [this notebook](https://github.com/dnanexus/OpenBio/blob/master/dx-toolkit/dx_extract_dataset_bash.ipynb) on `extract_dataset` for more details. The `last_contact_days_to`, `death_days_to`, and `vital_status` fields have been selected for the survival analysis.

In [None]:
%%bash

dx extract_dataset ${dataset} \
--fields sample.sample_id,patient.last_contact_days_to,patient.death_days_to,patient.vital_status \
-o sample_phenotypes.tsv

cat sample_phenotypes.tsv

`last_contact_days_to` is included to show an example of real values in the phenotypic data, but it is not used in the survival analysis. `death_days_to` and `vital_status` are uniform in this dataset, so these values will be replaced with artificial ones later to provide an example survival analysis.

### Derive features from somatic variants and merge with phenotypic data

Use pandas in Python to aggregate somatic variant features and then merge with the phenotypic data to use as the basis for the survival analysis.

In [None]:
import pandas as pd

Load the somatic variant TSV file into a pandas DataFrame.

In [None]:
somatic_variants_df = pd.read_csv("somatic_variants.tsv", sep="\t")
somatic_variants_df

Load the sample-linked phenotype data TSV file into a pandas DataFrame.

In [None]:
sample_phenotypes_df = pd.read_csv("sample_phenotypes.tsv")
sample_phenotypes_df
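
The phenotype file is named `.tsv` but is read above with the default comma separator, which suggests that the `extract_dataset` output in this example is comma-delimited. Rather than relying on the file extension, the delimiter can be inferred from the header line. The sketch below uses in-memory files with made-up rows; `detect_delimiter` and `detect_file_delimiter` are hypothetical helpers.

```python
import io


def detect_delimiter(header_line):
    """Return a tab if the header has more tabs than commas, else a comma."""
    return "\t" if header_line.count("\t") > header_line.count(",") else ","


def detect_file_delimiter(handle):
    """Read the header from an open text file, rewind, and report its delimiter."""
    header = handle.readline()
    handle.seek(0)  # leave the handle ready for a full read
    return detect_delimiter(header)


# Made-up examples standing in for real files
comma_file = io.StringIO("sample.sample_id,patient.vital_status\ns1,Alive\n")
tab_file = io.StringIO("sample.sample_id\tpatient.vital_status\ns1\tAlive\n")
print(repr(detect_file_delimiter(comma_file)))
print(repr(detect_file_delimiter(tab_file)))
```

With a real file, the result can be passed on as `pd.read_csv(path, sep=delimiter)`.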

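Aggregate somatic variants to the sample level, with a flag indicating whether the sample has at least one variant annotated as "pathogenic" in the `CLIN_SIG` field. The regular expression requires the term to be delimited by quotes or commas, so values such as "likely_pathogenic" do not match.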

In [None]:
# Flag variants whose CLIN_SIG lists the exact term "pathogenic" (delimited by quotes or commas);
# na=False treats missing CLIN_SIG values as not pathogenic
somatic_variants_df['pathogenic'] = somatic_variants_df["CLIN_SIG"].str.contains(r'[",]pathogenic[",]', na=False)
# A sample is flagged if any of its variants is flagged
pathogenic_variants_df = somatic_variants_df.groupby('sample_id')['pathogenic'].any()
pathogenic_variants_df = pathogenic_variants_df.reset_index()
pathogenic_variants_df
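
To see what the regular expression in the cell above does and does not match, it can be tried on a few strings with the standard `re` module. The `CLIN_SIG` values below are assumptions about the format (a quoted, comma-separated list), not values taken from the dataset, and `has_pathogenic` is a hypothetical helper.

```python
import re

# Same pattern as in the cell above: "pathogenic" bounded by a quote or comma
PATHOGENIC = re.compile(r'[",]pathogenic[",]')


def has_pathogenic(clin_sig):
    """Return True if clin_sig is a string that lists the exact term."""
    return isinstance(clin_sig, str) and PATHOGENIC.search(clin_sig) is not None


examples = [
    '["pathogenic"]',
    '["uncertain_significance","pathogenic"]',
    '["likely_pathogenic"]',
    '["benign"]',
    'pathogenic',  # no surrounding quote or comma, so no match
    None,          # missing annotation
]
for value in examples:
    print(repr(value), has_pathogenic(value))
```

Note that a bare `pathogenic` with no delimiters is not matched, so the pattern depends on how the annotation is serialized in the extracted file.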

Merge the aggregated somatic variant data with the phenotype data using `sample_id`. Some samples in the phenotypic data might not have any variants for the filter selected. Set the `pathogenic` flag to `False` for samples with missing values.

In [None]:
pathogenic_variant_phenotypes_df = sample_phenotypes_df.merge(pathogenic_variants_df,
                                                           how='left',
                                                           left_on='sample.sample_id',
                                                           right_on='sample_id')
# Cast to bool so the column can be used, and negated with ~, as a boolean mask
pathogenic_variant_phenotypes_df['pathogenic'] = pathogenic_variant_phenotypes_df['pathogenic'].fillna(False).astype(bool)
pathogenic_variant_phenotypes_df

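Fill in `vital_status` and `death_days_to` with artificial values for a more interesting example survival analysis. The hard-coded lists match the example dataset, in which 3 samples are flagged as pathogenic and 7 are not; with a different dataset the list lengths must be adjusted.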

In [None]:
pathogenic = pathogenic_variant_phenotypes_df['pathogenic']

pathogenic_variant_phenotypes_df.loc[pathogenic, 'patient.vital_status'] = 'Dead'
pathogenic_variant_phenotypes_df.loc[pathogenic, 'patient.death_days_to'] = [100, 350, 550]

pathogenic_variant_phenotypes_df.loc[~pathogenic, 'patient.vital_status'] = \
    ['Dead', 'Dead', 'Alive', 'Alive', 'Alive', 'Alive', 'Alive']
pathogenic_variant_phenotypes_df.loc[~pathogenic, 'patient.death_days_to'] = \
    [240, 600, 800, 900, 1000, 1005, 1010]

### Perform the survival analysis and visualization

The [lifelines](https://lifelines.readthedocs.io/) Python package performs survival analysis and visualization. If it is not installed, uncomment and run the line below.

In [None]:
#pip install lifelines

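Variable `T` holds the durations from `death_days_to`, and variable `E` indicates whether `vital_status` is "Dead", that is, whether the event was observed; samples with any other status are treated as censored. Kaplan-Meier fits are performed with `T` and `E` separately for the patients with and without pathogenic variants, and the two survival curves are plotted together.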

In [None]:
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()

T = pathogenic_variant_phenotypes_df['patient.death_days_to']
E = pathogenic_variant_phenotypes_df['patient.vital_status'] == 'Dead'
pathogenic = pathogenic_variant_phenotypes_df['pathogenic']

kmf.fit(T[~pathogenic], E[~pathogenic], label='no pathogenic variants')
ax = kmf.plot_survival_function()

kmf.fit(T[pathogenic], E[pathogenic], label='pathogenic variants')
ax = kmf.plot_survival_function(ax=ax)
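
To make the plotted curves easier to interpret, the Kaplan-Meier estimate can be reproduced by hand. At each time `t_i` with `d_i` observed deaths among `n_i` patients still at risk, the survival estimate is multiplied by `(1 - d_i / n_i)`; censored patients only reduce the number at risk. The sketch below is a plain-Python illustration using the artificial values assigned earlier. It is not how lifelines is implemented and computes no confidence intervals; `kaplan_meier` is a hypothetical helper.

```python
from fractions import Fraction


def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct time with an observed event."""
    at_risk = len(durations)
    survival = Fraction(1)
    curve = []
    for t in sorted(set(durations)):
        # Observed deaths at t, and everyone (dead or censored) leaving at t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= Fraction(at_risk - deaths, at_risk)
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Artificial values from the notebook: True means vital_status == 'Dead'
pathogenic_curve = kaplan_meier([100, 350, 550], [True, True, True])
other_curve = kaplan_meier(
    [240, 600, 800, 900, 1000, 1005, 1010],
    [True, True, False, False, False, False, False],
)

for label, curve in [("pathogenic", pathogenic_curve), ("no pathogenic", other_curve)]:
    print(label, [(t, str(s)) for t, s in curve])
```

With these values the pathogenic group steps through 2/3, 1/3, and 0, while the other group drops to 6/7 and then 5/7 and stays there, because its remaining five patients are censored.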