# “dx extract_assay somatic” in R
<hr/>
***As-Is Software Disclaimer***

This content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates usage of the dx command `extract_assay somatic` to:
* Retrieve somatic variants and associated annotations from a Dataset and Cohort.
* Compare a tumor sample to a respective normal sample from a Dataset and Cohort.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: R
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.2
* Runtime: ~5 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID.

### Install DNAnexus supported package, dxpy

In [None]:
# dx extract_assay somatic requires dxpy v0.352.1 or greater.
# A sufficiently recent version of dxpy may already be installed,
# in which case the "pip" install below is unnecessary.
system("pip3 install -U dxpy==0.363.0")
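
The version requirement in the comment above can be checked programmatically. The following is a minimal sketch in base R: `dxpy_meets_minimum` is a hypothetical helper, not part of dxpy, and it assumes that `dx --version` prints a line such as `dx v0.363.0`.

```r
# Minimum dxpy version stated in this notebook for `dx extract_assay somatic`.
min_dxpy <- "0.352.1"

# Hypothetical helper: pull the first "X.Y.Z" number out of a string such as
# "dx v0.363.0" and compare it numerically against the minimum version.
dxpy_meets_minimum <- function(version_string, minimum = min_dxpy) {
  found <- regmatches(version_string,
                      regexpr("[0-9]+(\\.[0-9]+)+", version_string))
  if (length(found) == 0) {
    stop("No version number found in: ", version_string)
  }
  numeric_version(found) >= numeric_version(minimum)
}

# Assumption: `dx --version` prints a line like "dx v0.363.0".
# Uncomment to check the installed client:
# dxpy_meets_minimum(system("dx --version", intern = TRUE))

dxpy_meets_minimum("dx v0.363.0")  # TRUE
dxpy_meets_minimum("dx v0.351.0")  # FALSE
```

`numeric_version` compares each component as a number, so `0.99.9` correctly sorts below `0.352.1`, which a plain string comparison would get wrong.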

In [None]:
install.packages("readr")
install.packages("jsonlite")

In [None]:
library(dplyr)
library(readr)
library(stringr)
library(jsonlite)

In [None]:
cmd <- "dx extract_assay somatic --help"
system(cmd, intern = TRUE)
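
Every cell in this notebook runs a command with `system(cmd, intern = TRUE)`, which returns the output lines but only warns when the command fails. A small wrapper can turn a non-zero exit status into an R error. This is a sketch only: `run_cmd` is a hypothetical helper name, and it is demonstrated with `echo` rather than `dx` so that it runs anywhere a POSIX shell is available.

```r
# Hypothetical helper: run a shell command, return its output lines,
# and stop with a message if the exit status is non-zero.
run_cmd <- function(cmd) {
  # With intern = TRUE, a non-zero exit status is attached to the result
  # as the "status" attribute (and a warning is raised, silenced here).
  out <- suppressWarnings(system(cmd, intern = TRUE))
  status <- attr(out, "status")
  if (!is.null(status) && status != 0) {
    stop("Command failed with exit status ", status, ": ", cmd)
  }
  out
}

# Illustrative call with a harmless shell command
run_cmd("echo hello")
```

The same wrapper could be used in place of `system(cmd, intern = TRUE)` in the cells that follow, so that a failed `dx` call halts the notebook instead of silently returning an empty result.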

### Assign environment variables

In [None]:
# The referenced dataset is not public; it is listed here only as an example input.
# Supply a valid project ID and record ID that you have permission to access.

# Assign a project-qualified dataset ID in the form project-id:record-id
dataset <- "project-GXYZ82Q04YzF5q68Fyqp9PbX:record-GXYpp3j04YzGx2zFyk8Yykvz"
cohort <- "project-GXYZ82Q04YzF5q68Fyqp9PbX:record-GXqQPK804YzJ3yB9Y6B231gX"

## Extract data from a dataset

### 1. Explore the somatic assays of a dataset

Check which somatic variant assays are available in the dataset using the `--list-assays` flag.

In [None]:
cmd <- paste("dx extract_assay somatic", dataset, "--list-assays")
system(cmd, intern = TRUE)

### 2. Retrieve somatic data from the dataset

Data may be retrieved using the `--retrieve-variant` option. The option accepts a JSON object for filtering, supplied either as a \*.json file or as a string. For additional help, add `--json-help` after `--retrieve-variant` to print both a template and a list of the filters available to that option.

In [None]:
cmd <- paste("dx extract_assay somatic", dataset, "--retrieve-variant --json-help")
print(system(cmd, intern = TRUE), quote=FALSE)

#### Example on how to create a JSON of filters to retrieve data

JSON may be created in many ways, including writing the object to a file.

In [None]:
filter <- '{"location": [{"chromosome": "chr1", "starting_position": "10000000", "ending_position": "14000000"}]}'
write(filter, "filter.json")
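
When the region changes often, the filter string can be generated from arguments rather than typed by hand. The sketch below uses base R only. `make_location_filter` is a hypothetical helper, and it mirrors the filter in this notebook by emitting positions as quoted strings.

```r
# Hypothetical helper: build a single-region "location" filter as a JSON string.
# Positions are emitted as quoted strings, mirroring the filter used above.
make_location_filter <- function(chromosome, start, end) {
  start <- as.numeric(start)
  end <- as.numeric(end)
  if (is.na(start) || is.na(end) || start > end) {
    stop("start and end must be numeric, with start <= end")
  }
  # "%.0f" avoids scientific notation such as 1e+07 in the output.
  sprintf(
    '{"location": [{"chromosome": "%s", "starting_position": "%.0f", "ending_position": "%.0f"}]}',
    chromosome, start, end
  )
}

region_filter <- make_location_filter("chr1", 10000000, 14000000)

# Write to a temporary file here; the notebook itself uses "filter.json".
filter_path <- tempfile(fileext = ".json")
writeLines(region_filter, filter_path)
readLines(filter_path)
```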

#### Use the generated JSON file and retrieve variant data

List allele IDs and related information based on the variant filters defined in `filter.json`. The output can be printed to STDOUT or written to a file. Here we print it to STDOUT.

In [None]:
cmd <- paste("dx extract_assay somatic", dataset, "--retrieve-variant filter.json -o -")
print(system(cmd, intern=TRUE), quote = FALSE)
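
Output captured from STDOUT arrives as a character vector of lines, which can be parsed into a data frame without writing a file. This sketch assumes the output is tab-separated with a header row. The column names and rows below are made up for illustration and are not real output from the command.

```r
# Made-up stand-in for the lines returned by system(cmd, intern = TRUE).
# Column names and values are illustrative placeholders only.
lines <- c(
  "allele_id\tCHROM\tPOS\tREF\tallele",
  "chr1_10000100_A_G\tchr1\t10000100\tA\tG",
  "chr1_10000200_C_T\tchr1\t10000200\tC\tT"
)

# Hypothetical helper: parse tab-separated lines into a data frame.
# Reading every column as character prevents a column of "T"/"F" alleles
# from being converted to logical TRUE/FALSE.
parse_tsv_lines <- function(lines) {
  read.delim(text = lines, sep = "\t", header = TRUE, quote = "",
             colClasses = "character", stringsAsFactors = FALSE)
}

variants <- parse_tsv_lines(lines)
variants$POS <- as.integer(variants$POS)  # convert selected columns explicitly
variants
```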

#### Add additional fields to output

For a full list of fields available for output, use the `--additional-fields-help` flag.

In [None]:
cmd <- paste("dx extract_assay somatic", dataset, "--additional-fields-help")
print(system(cmd, intern=TRUE), quote = FALSE)

Specify additional fields using `--additional-fields`:

In [None]:
cmd <- paste("dx extract_assay somatic", dataset,
             "--retrieve-variant filter.json",
             "--additional-fields 'sample_id,tumor_normal'",
             "-o", "-")
print(system(cmd, intern=TRUE), quote = FALSE)

#### Additionally, list normal samples alongside tumor samples

By default, normal samples are excluded from results. To include them, use the `--include-normal-sample` flag.

In [None]:
cmd <- paste("dx extract_assay somatic", dataset,
             "--retrieve-variant filter.json",
             "--additional-fields 'sample_id,tumor_normal'",
             "--include-normal-sample",
             "-o", "-")
print(system(cmd, intern=TRUE), quote = FALSE)
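
The commands in this section differ only in their field list and the normal-sample flag, so they can be assembled by one function. The sketch below is a hypothetical helper, `build_retrieve_cmd`, that uses only the options already shown in this notebook. The project and record IDs are placeholders.

```r
# Hypothetical helper: assemble a `dx extract_assay somatic` command string
# from the options demonstrated in this notebook.
build_retrieve_cmd <- function(target, filter_file, fields = character(0),
                               include_normal = FALSE, output = "-") {
  parts <- c("dx extract_assay somatic",
             shQuote(target, type = "sh"),
             "--retrieve-variant", shQuote(filter_file, type = "sh"))
  if (length(fields) > 0) {
    # The field list is passed as a single comma-separated argument.
    parts <- c(parts, "--additional-fields",
               shQuote(paste(fields, collapse = ","), type = "sh"))
  }
  if (include_normal) {
    parts <- c(parts, "--include-normal-sample")
  }
  parts <- c(parts, "-o", shQuote(output, type = "sh"))
  paste(parts, collapse = " ")
}

# "project-xxxx:record-yyyy" is a placeholder, not a real object ID.
cmd_example <- build_retrieve_cmd("project-xxxx:record-yyyy", "filter.json",
                                  fields = c("sample_id", "tumor_normal"),
                                  include_normal = TRUE)
cmd_example
```

Keeping the fields in an R character vector makes it easy to add or drop columns without editing a quoted string by hand.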

#### Generate a list of unique sample IDs

Here, we write the retrieved data to the `retrieve_output.tsv` file using the `--output <filename>` option. We also request the fields that map each tumor sample to its normal sample, along with VCF data such as the FORMAT and GENOTYPE fields, which are reported verbatim.

In [None]:
cmd <- paste("dx extract_assay somatic", dataset,
             "--retrieve-variant filter.json",
             "--additional-fields 'sample_id,tumor_normal,normal_assay_sample_id,normal_allele_ids,FORMAT,GENOTYPE'",
             "--include-normal-sample",
             "--output retrieve_output.tsv")
print(system(cmd, intern=TRUE), quote = FALSE)

In [None]:
retrieve_allele_output <- readr::read_tsv("retrieve_output.tsv", show_col_types = FALSE)
head(retrieve_allele_output)

Parse the output for unique sample IDs:

In [None]:
distinct(retrieve_allele_output["sample_id"])
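
Beyond listing unique sample IDs, the same table can be summarized to pair each tumor sample with its normal sample. The sketch below runs on a small made-up data frame rather than real output. It assumes that `tumor_normal` holds the values `"tumor"` and `"normal"`, which should be confirmed against your own data.

```r
# Made-up rows standing in for retrieve_output.tsv; all IDs are placeholders.
# Assumption: tumor_normal takes the values "tumor" and "normal".
calls <- data.frame(
  sample_id              = c("T1", "T1", "T2", "N1"),
  tumor_normal           = c("tumor", "tumor", "tumor", "normal"),
  normal_assay_sample_id = c("N1", "N1", "N2", NA),
  stringsAsFactors = FALSE
)

# One row per tumor sample, with the normal sample it is paired with.
tumor_rows <- calls[calls$tumor_normal == "tumor",
                    c("sample_id", "normal_assay_sample_id")]
tumor_normal_pairs <- unique(tumor_rows)
tumor_normal_pairs

# Number of rows (variant calls) retrieved per sample.
variants_per_sample <- table(calls$sample_id)
variants_per_sample
```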

#### Use VCF metadata to assist in parsing data

VCF fields such as INFO and FORMAT may contain multiple IDs whose meaning is specific to your data. These IDs are described in the VCF metadata, which is available using the `--retrieve-meta-info` flag. This context is helpful, and often necessary, for further parsing the INFO and FORMAT sections of a VCF.

In [None]:
cmd <- paste("dx extract_assay somatic", dataset,
             "--retrieve-meta-info",
             "--output -")
print(system(cmd, intern=TRUE), quote = FALSE)
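
Once the metadata explains what each FORMAT ID means, a verbatim FORMAT string and its matching GENOTYPE string can be split into keyed values. This sketch assumes both columns follow the standard colon-delimited VCF layout. `parse_genotype` is a hypothetical helper, and the example values are made up.

```r
# Hypothetical helper: pair the keys in a FORMAT string with the values in a
# sample's GENOTYPE string, assuming the standard VCF colon-delimited layout.
parse_genotype <- function(format, genotype) {
  keys <- strsplit(format, ":", fixed = TRUE)[[1]]
  values <- strsplit(genotype, ":", fixed = TRUE)[[1]]
  if (length(values) > length(keys)) {
    stop("More genotype values than FORMAT keys")
  }
  # VCF permits trailing fields to be dropped; pad those with NA.
  length(values) <- length(keys)
  setNames(values, keys)
}

# Made-up example values
gt <- parse_genotype("GT:AD:DP", "0/1:10,5:15")
gt[["AD"]]
```

Values such as `AD` remain comma-separated strings here. Whether to split them further depends on the `Number` and `Type` declared for that ID in the metadata.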

## Extract data from a cohort

Here we will retrieve all the variants in the gene `ENSG00000120942` and their associated annotations from a saved cohort.

#### Retrieve variants for a gene using a JSON string

Filter variants by gene ID using the `--retrieve-variant` option. Here, we use a JSON string for filtering as opposed to a .json file.

In [None]:
filter <- toJSON(list(annotation = list(gene = "ENSG00000120942")))
# Query the saved cohort rather than the full dataset
cmd <- paste("dx extract_assay somatic", cohort, "--retrieve-variant '", as.character(filter), "' -o -")
print(system(cmd, intern=TRUE), quote = FALSE)
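
When a JSON string is passed on the command line, it has to reach the shell as a single argument. Base R's `shQuote` handles the quoting, which avoids hand-placed quote characters and the extra spaces that `paste` inserts around them. This is a sketch with a placeholder object ID, and the JSON is written by hand so that no extra package is needed.

```r
# Gene filter as a JSON string (same gene ID as used in this notebook).
gene_filter <- '{"annotation": {"gene": ["ENSG00000120942"]}}'

# Placeholder target; substitute a real project-id:record-id.
target <- "project-xxxx:record-yyyy"

# shQuote(type = "sh") wraps a string that contains no single quotes
# in single quotes, so the shell passes the JSON through unchanged.
cmd_json <- paste("dx extract_assay somatic", shQuote(target, type = "sh"),
                  "--retrieve-variant", shQuote(gene_filter, type = "sh"),
                  "-o -")
cmd_json
```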

## Generate SQL for data extraction

Alternatively, we can generate a SQL query instead of extracting the data. This query can be run on a Spark-enabled cluster to retrieve the data. To generate the query, we use the `--sql` flag and print it to STDOUT.

In [None]:
filter <- toJSON(list(annotation = list(gene = "ENSG00000120942")))
cmd <- paste("dx extract_assay somatic", dataset,
             "--retrieve-variant '", as.character(filter),
             "' --sql -o -")
print(system(cmd, intern=TRUE), quote = FALSE)

## Return data from a specific assay

`dx extract_assay somatic` supports datasets that contain multiple assays, including several assays of the same type. If a dataset contains multiple assays of the same type, use the `--assay-name` argument to specify which assay to query. By default, data is returned from the first eligible assay in the dataset.

In [None]:
filter <- toJSON(list(annotation = list(gene = "ENSG00000120942")))
cmd <- paste("dx extract_assay somatic", dataset,
             "--assay-name example_sval_202307062200",
             "--retrieve-variant '", as.character(filter),
             "' --sql -o out.sql")
print(system(cmd, intern=TRUE), quote = FALSE)