# Using Somatic Variant Meta Information from `dx extract_assay somatic`
<hr/>
***As-Is Software Disclaimer***

This content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates how to use the `dx` command `extract_assay somatic` to filter somatic variants by INFO and FORMAT fields, using VCF meta-information and allele-specific values.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Python
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.1
* Runtime: ~2 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID. The example dataset used in this notebook has 10 patients and around 3.5K variants.

### Setup dxpy to retrieve somatic variants

This notebook requires dxpy >= v0.352.0. You can check your version

In [None]:
%%bash
dx --version

and if necessary upgrade from a repository

In [None]:
%%bash
#pip install -U dxpy

or a file

In [None]:
%%bash
#pip install dxpy-0.352.0-py2.py3-none-any.whl

`dx extract_assay somatic` is the command that retrieves somatic variants from a dataset. Its available options can be listed with the `--help` flag.

In [None]:
%%bash
dx extract_assay somatic --help

### Select the dataset

Choose a dataset with a somatic variant assay using the `project-id:record-id` identifier.

In [None]:
%env dataset project-GX0Jpp00ZJ46qYPq5G240k1k:record-GXjb4Pj01KjxKYKJvvY50PqZ

## Extract meta-information and data

### Retrieve the meta-information from the dataset 

Display somatic variant assay information and genotype field meta-information using the `--retrieve-meta-info` flag. INFO and FORMAT (Field) definitions (ID, Type, Number, and Description) are displayed as a tab-separated list. See the [VCF 4.3 specification](https://samtools.github.io/hts-specs/VCFv4.3.pdf) for full descriptions of the values.

In [None]:
%%bash
dx extract_assay somatic ${dataset} --retrieve-meta-info --output -

### Retrieve somatic variant data from the dataset 

Download the somatic variants in genes TP53 and PIK3CA to a TSV file by filtering with the `--retrieve-variant` option. Run the command with the `--json-help` flag for a description of the filter and a JSON template. To include the `FORMAT` and `GENOTYPE` information for each variant, add the argument `--additional-fields 'FORMAT,GENOTYPE'`.

In [None]:
%%bash

dx extract_assay somatic ${dataset} \
--retrieve-variant '{
  "annotation": {
    "symbol": ["PIK3CA", "TP53"]
  }
}' \
--additional-fields 'FORMAT,GENOTYPE' \
-o format_genotype.tsv

cat format_genotype.tsv
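
When the gene list comes from an earlier analysis step, the filter can be generated instead of typed by hand. The following is a minimal sketch that builds the same JSON filter with Python's `json` module. The helper name `build_symbol_filter` is hypothetical, and only the `annotation` and `symbol` keys shown above are assumed to be valid; consult `--json-help` for the full template.

```python
import json


def build_symbol_filter(symbols):
    """Return a JSON filter string for a list of gene symbols (hypothetical helper)."""
    # Sort and de-duplicate so the same gene set always yields the same filter.
    return json.dumps({"annotation": {"symbol": sorted(set(symbols))}})


variant_filter = build_symbol_filter(["TP53", "PIK3CA", "TP53"])
print(variant_filter)
```

The resulting string has the same structure as the filter passed to `--retrieve-variant` above.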

### Filter somatic variants by genotype information with pandas

Load the somatic variant data into a pandas DataFrame from the TSV file. The combination of `assay_sample_id` and `allele_id` can be used as the unique index for the DataFrame.

In [None]:
import pandas as pd

original_df = pd.read_csv('format_genotype.tsv', sep='\t')
index_columns = ['assay_sample_id', 'allele_id']
original_df = original_df.set_index(index_columns).sort_values(index_columns)
original_df.head()

Parse the `GENOTYPE` column by using the colon-separated keys in the `FORMAT` column as the index for the corresponding `GENOTYPE` values. The result is a DataFrame with one column per `FORMAT` field.

In [None]:
def parse_genotype(x):
    return pd.Series(x['GENOTYPE'].split(':'), index=x['FORMAT'].split(':'))

parsed_df = original_df.apply(parse_genotype, axis='columns')
parsed_df.head()
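
To make the per-row logic explicit, here is a plain-Python sketch of the same pairing for a single row. The function name `parse_genotype_row` and the example values are hypothetical and are not taken from the dataset.

```python
def parse_genotype_row(format_field, genotype_field):
    """Pair colon-separated FORMAT keys with GENOTYPE values for one row."""
    keys = format_field.split(':')
    values = genotype_field.split(':')
    # A length mismatch means the row cannot be paired one-to-one.
    if len(keys) != len(values):
        raise ValueError(
            'FORMAT has {} keys but GENOTYPE has {} values'.format(len(keys), len(values))
        )
    return dict(zip(keys, values))


# Hypothetical example values for illustration only.
row = parse_genotype_row('GT:AD:DP', '0/1:150,50:200')
print(row)
```

All values are still strings at this point; numeric fields such as `DP` need an explicit cast before they can be compared.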

Replace the genotype information in the original DataFrame with the parsed columns. Adding a second column level (`keys=['', 'GENOTYPE']`) distinguishes the parsed fields from the original columns.

In [None]:
df = pd.concat(
    [original_df.drop(['FORMAT', 'GENOTYPE'], axis='columns'), parsed_df],
    axis='columns',
    keys=['', 'GENOTYPE'],
)
df.head()

Recall from the `--retrieve-meta-info` output that the field with `ID DP` is a single-value field (`Number 1`) of integer type (`Type Integer`) representing total read depth. The DP column can therefore be cast to type `int`.

In [None]:
df[('GENOTYPE', 'DP')] = df[('GENOTYPE', 'DP')].astype(int)
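
A direct integer cast raises an error if any value is the VCF missing-value marker `.`. Whether this dataset contains such values is not stated here, so the following is only a defensive sketch in plain Python; the function name `parse_depth` and the sample strings are hypothetical.

```python
def parse_depth(value):
    """Convert one DP string to int, or None for the VCF missing marker '.'."""
    value = value.strip()
    return None if value == '.' else int(value)


# Hypothetical DP strings for illustration only.
raw_depths = ['250', '.', '180']
depths = [parse_depth(v) for v in raw_depths]

# Keep only known depths above the threshold.
deep = [d for d in depths if d is not None and d > 200]
print(depths, deep)
```

In pandas, `pd.to_numeric(df[('GENOTYPE', 'DP')], errors='coerce')` gives a comparable result, with `NaN` in place of `None`.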

Somatic variants can now be filtered by read depth using numeric comparisons. Here, only variants with a total read depth greater than 200 are selected.

In [None]:
df[df[('GENOTYPE', 'DP')] > 200]

Using allele-specific values requires an additional parsing step. The field with `ID AD` holds one value per allele, including the reference (`Number R`), of integer type (`Type Integer`), representing the read depth of each allele. Split the comma-separated values into one column per allele and cast to `int`.

In [None]:
allelic_depths_df = df[('GENOTYPE', 'AD')].str.split(',', expand=True).astype(int)
allelic_depths_df.head()
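
The `Type` and `Number` values from the meta-information determine how each raw string should be parsed. The sketch below generalizes the two cases seen so far (`DP` with `Number 1`, `AD` with `Number R`) in plain Python. The names `CASTERS` and `parse_field` are hypothetical, and `Flag` fields, which carry no value, are deliberately not handled.

```python
# Map VCF Type names to Python converters (Flag is omitted on purpose).
CASTERS = {'Integer': int, 'Float': float, 'String': str, 'Character': str}


def parse_field(raw, vcf_type, number):
    """Parse one raw field string according to its VCF Type and Number."""
    cast = CASTERS[vcf_type]

    def cast_one(value):
        # '.' is the VCF marker for a missing value.
        return None if value == '.' else cast(value)

    if number == '1':
        return cast_one(raw)
    # Any other Number (e.g. 'A', 'R', 'G', '.', or an integer > 1) is a list.
    return [cast_one(value) for value in raw.split(',')]


# Hypothetical example values for illustration only.
print(parse_field('200', 'Integer', '1'))
print(parse_field('150,50', 'Integer', 'R'))
```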

Use this information to calculate the variant allele frequencies by dividing each alternate allele read depth by the sum of read depths from the reference allele and all alternate alleles.

In [None]:
# Divide each alternate-allele depth (columns 1 onward) by the row's total depth
variant_allele_frequency = allelic_depths_df.iloc[:, 1:].div(allelic_depths_df.sum(axis='columns'), axis='index')
variant_allele_frequency.head()
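
As a worked example of the arithmetic, the following plain-Python sketch computes the frequencies for a single `AD` string. The function name and the `AD` values are hypothetical; returning `NaN` when the total depth is zero is a design choice made for this sketch.

```python
import math


def variant_allele_frequencies(ad_field):
    """Return one frequency per alternate allele from a comma-separated AD string."""
    depths = [int(d) for d in ad_field.split(',')]
    total = sum(depths)  # reference depth plus all alternate depths
    if total == 0:
        # No reads at all: the frequency is undefined.
        return [math.nan] * (len(depths) - 1)
    # Skip the first entry, which is the reference allele.
    return [alt / total for alt in depths[1:]]


# Hypothetical AD value: 150 reference reads, 50 alternate reads.
vafs = variant_allele_frequencies('150,50')
print(vafs)  # [0.25]
```

With two alternate alleles, for example `'90,6,4'`, the function returns one frequency per alternate allele, each divided by the same total of 100 reads.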

Somatic variants can now be filtered by variant allele frequency. For example, select the variants for which every alternate allele has a frequency below 0.2:

In [None]:
variant_allele_frequency[(variant_allele_frequency < 0.2).all(axis='columns')]

Variant allele frequency can be added back to the full DataFrame by formatting each numeric value as a string and joining the values for a variant with commas.

In [None]:
# '.6g' keeps six significant digits; a bare width ('6g') would pad short values with spaces
variant_allele_frequency = variant_allele_frequency.applymap(lambda x: '{:.6g}'.format(x))
variant_allele_frequency = variant_allele_frequency.agg(','.join, axis='columns')
variant_allele_frequency.head()
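
The choice of format specifier affects the resulting strings. As a small illustration in plain Python (the helper name `format_vafs` is hypothetical), the precision form `.6g` keeps six significant digits without padding, whereas a bare width such as `6g` right-aligns short values in a six-character field.

```python
def format_vafs(vafs):
    """Join frequencies into one comma-separated string with six significant digits."""
    return ','.join('{:.6g}'.format(v) for v in vafs)


print(format_vafs([0.06, 0.04]))       # 0.06,0.04
print(repr('{:6g}'.format(0.25)))      # '  0.25' (padded to width 6)
print(repr('{:.6g}'.format(0.25)))     # '0.25'
```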

Add the variant allele frequency to the full DataFrame as a new `VAF` column.

In [None]:
df[('GENOTYPE', 'VAF')] = variant_allele_frequency
df.head()