# “dx extract_assay germline” in R
<hr/>
***As-Is Software Disclaimer***

This content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates usage of the dx command `extract_assay germline` to:
* Get a list of sample IDs corresponding to a set of rsIDs from a dataset.
* Retrieve all the alleles in one gene and their associated annotation from a cohort.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML, Image Processing
* Kernel: R
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.2
* Runtime: ~= 5 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID. In the example considered below, the dataset and cohort have 50K samples and around 16M variants.

### Install the DNAnexus-supported package, dxpy

In [None]:
# For dx extract_assay germline, dxpy must be v0.349.1 or greater
# However, a more recent version of dxpy on PyPI may already be available
# and installed, making the below "pip" install unnecessary.
system("pip3 install -U dxpy==0.350.1")
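
Before reinstalling, it can help to check whether the installed version already meets the minimum. The sketch below compares version strings with `utils::compareVersion`. It assumes that `dx --version` prints a string of the form `dx v0.350.1`; `version_line` is a stand-in for that output so the cell runs without calling `dx`.

```r
# Minimum dxpy version needed for `dx extract_assay germline` (from the note above).
required_version <- "0.349.1"

# Stand-in for `system("dx --version", intern = TRUE)`.
# Assumption: the output looks like "dx v<major>.<minor>.<patch>".
version_line <- "dx v0.350.1"
installed_version <- sub("^dx v", "", version_line)

# compareVersion() returns -1, 0, or 1; a value >= 0 means the minimum is met.
meets_minimum <- utils::compareVersion(installed_version, required_version) >= 0
print(meets_minimum)
```

If `meets_minimum` is `TRUE`, the `pip3` install can be skipped.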

In [None]:
install.packages("readr")

In [None]:
library(dplyr)
library(readr)
library(stringr)
library(jsonlite)

In [None]:
cmd <- "dx extract_assay germline --help"
system(cmd, intern = TRUE)

### Assign environment variables

In [None]:
# The referenced dataset is not public and is listed here only as an example input.
# You will need to supply a valid project ID and record ID that you have permission to access.

# Assign a project-qualified dataset, project-id:record-id
dataset="project-GXYXjj00j49b5qX2Kzq1qbZk:record-FykXjyj0pZ7B8gKvKkxFF7QJ"
cohort="project-GXYXjj00j49b5qX2Kzq1qbZk:record-GXYkQyQ0j49jZ2fKpKZbxZ57"
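
A malformed ID is an easy mistake to make when pasting values, so a quick shape check before calling `dx` can save a round trip. The sketch below is a hypothetical helper, not part of dxpy. It assumes each ID is a class prefix followed by 24 alphanumeric characters, as in the example IDs above, and it checks only the format, not whether the record exists or is accessible.

```r
# Hypothetical helper: checks the "project-<id>:record-<id>" shape only.
# Assumption: each <id> is 24 alphanumeric characters, as in the examples above.
is_project_qualified_record <- function(x) {
  grepl("^project-[A-Za-z0-9]{24}:record-[A-Za-z0-9]{24}$", x)
}

# A project-qualified ID passes the check.
print(is_project_qualified_record(
  "project-GXYXjj00j49b5qX2Kzq1qbZk:record-FykXjyj0pZ7B8gKvKkxFF7QJ"
))

# A bare record ID, without the project prefix, does not.
print(is_project_qualified_record("record-FykXjyj0pZ7B8gKvKkxFF7QJ"))
```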

## Extract data from a dataset

### 1. Explore the genetic assays of a dataset

Check which genetic variant assays are available in the dataset using the `--list-assays` flag.

In [None]:
cmd <- paste("dx extract_assay germline", dataset, "--list-assays")
print(system(cmd, intern = TRUE))

### 2. Retrieve genomic data from the dataset

Data may be retrieved using one of three distinct options: `--retrieve-allele`, `--retrieve-annotation`, or `--retrieve-genotype`. Each option accepts a JSON object as a filter, supplied either as a \*.json file or as a string. For a template and a list of the filters available to a given option, add `--json-help` after `--retrieve-allele`, `--retrieve-annotation`, or `--retrieve-genotype`.

In [None]:
cmd <- paste("dx extract_assay germline", dataset, "--retrieve-allele --json-help")
print(system(cmd, intern = TRUE), quote=FALSE)

#### Example of how to create a JSON filter to retrieve data

A common way to create the JSON object used for retrieving data is to take the JSON template from `--json-help` and fill in the filters needed for the situation.

In [None]:
allele_filter <- list(rsid = c("rs1342568097", "rs1267100748"), type = c("SNP"))

write_json(allele_filter, "allele_filter.json")
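
To confirm what was actually written, the file can be printed and read back. The sketch below repeats the filter from the cell above but writes to a temporary path (`filter_path` is an illustrative name) so it does not overwrite `allele_filter.json`. Note that `write_json` leaves `auto_unbox` at `FALSE` by default, so a length-one vector such as `type` is still written as a JSON array.

```r
library(jsonlite)

# Same filter as above, written to a temporary file for inspection.
allele_filter <- list(rsid = c("rs1342568097", "rs1267100748"), type = c("SNP"))
filter_path <- tempfile(fileext = ".json")
write_json(allele_filter, filter_path)

# Print the raw JSON text that would be passed to dx.
cat(readLines(filter_path, warn = FALSE), sep = "\n")

# Read it back and confirm that the values survived the round trip.
round_trip <- read_json(filter_path, simplifyVector = TRUE)
stopifnot(identical(round_trip$rsid, allele_filter$rsid))
stopifnot(identical(round_trip$type, allele_filter$type))
```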

#### Use the generated JSON file and retrieve allele data

List the allele IDs and related information that match the filters in `allele_filter.json`, and save the results to `retrieve_allele_output.tsv` using the `--output <filename>` option (`-o` for short).

In [None]:
cmd <- paste("dx extract_assay germline", dataset, "--retrieve-allele allele_filter.json -o retrieve_allele_output.tsv")
system(cmd)
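
When `system()` is called without `intern = TRUE`, it returns the command's exit status, which is easy to overlook in a notebook. The sketch below wraps the call in a hypothetical helper, `run_checked`, that stops with an error on a non-zero status. It is demonstrated with plain shell commands rather than `dx` so that it runs anywhere a POSIX shell is available.

```r
# Hypothetical helper: run a shell command and stop if it fails.
run_checked <- function(cmd) {
  status <- system(cmd)  # exit status; 0 means success
  if (status != 0) {
    stop("Command failed with exit status ", status, ": ", cmd)
  }
  invisible(status)
}

# A command that succeeds returns 0 invisibly.
run_checked("echo ok")

# A command that fails raises an R error, captured here for display.
failure <- tryCatch(run_checked("exit 3"), error = function(e) conditionMessage(e))
print(failure)
```

In the notebook, `run_checked(cmd)` could replace `system(cmd)` so that a failed extraction is caught before the output file is read.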

In [None]:
retrieve_allele_output <- readr::read_tsv("retrieve_allele_output.tsv", show_col_types = FALSE)
head(retrieve_allele_output)

#### Retrieve sample IDs based on allele IDs

Use the allele IDs generated in the previous step as a filter to get sample IDs and related genotype information with the `--retrieve-genotype` option. Here, we pass the filter as a JSON string rather than a \*.json file.

In [None]:
# Create filter
genotype_filter <- toJSON(list(allele_id = retrieve_allele_output$allele_id))

# Run extract_assay command
cmd <- paste("dx extract_assay germline", dataset, "--retrieve-genotype '", as.character(genotype_filter), "' -o retrieve_genotype_output.tsv")
system(cmd)
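
Pasting single quotes around the JSON by hand works for this filter, but base R's `shQuote` is a more robust way to quote arguments for the shell, and it avoids the extra spaces that `paste` inserts inside the quotes. The sketch below only builds and prints the command. The allele IDs and `dataset_id` are made-up placeholders, not values from the dataset.

```r
library(jsonlite)

# Placeholder values for illustration only.
dataset_id <- "project-xxxx:record-yyyy"
allele_ids <- c("1_1000_A_T", "1_2000_G_C")

# Serialize the filter, then quote each argument for a POSIX shell.
genotype_filter <- as.character(toJSON(list(allele_id = allele_ids)))
cmd <- paste(
  "dx extract_assay germline", shQuote(dataset_id, type = "sh"),
  "--retrieve-genotype", shQuote(genotype_filter, type = "sh"),
  "-o retrieve_genotype_output.tsv"
)
cat(cmd, "\n")
```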

In [None]:
retrieve_genotype_output <- readr::read_tsv("retrieve_genotype_output.tsv", show_col_types = FALSE)
head(retrieve_genotype_output)

Get the list of sample IDs:

In [None]:
print("List of sample IDs with previously derived allele IDs.")
retrieve_genotype_output$sample_id
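
If a sample can appear on more than one row (for example, once per matching allele), the column above will contain repeats. The sketch below deduplicates the IDs and saves them to a text file. `toy_genotypes` is made-up data that stands in for `retrieve_genotype_output`, and only the `sample_id` column name is taken from the notebook.

```r
# Made-up stand-in for retrieve_genotype_output; only sample_id is used.
toy_genotypes <- data.frame(
  sample_id = c("sample_1", "sample_2", "sample_1"),
  stringsAsFactors = FALSE
)

# Keep each sample ID once, in order of first appearance.
sample_ids <- unique(toy_genotypes$sample_id)
print(sample_ids)

# Save one ID per line for use in later steps.
ids_path <- tempfile(fileext = ".txt")
writeLines(sample_ids, ids_path)
```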

## Extract data from a cohort

Here we will retrieve all the alleles in the gene `TP53` and their associated annotations from a saved cohort.

#### Check the available filters in the `--retrieve-annotation` option

In [None]:
cmd <- paste("dx extract_assay germline", cohort, "--retrieve-annotation --json-help")
print(system(cmd, intern=TRUE), quote = FALSE)

We will use the `gene_name` filter and save the output data to a .tsv file.

In [None]:
cmd <- paste("dx extract_assay germline", cohort, "--retrieve-annotation '{\"gene_name\": [\"TP53\"]}' -o TP53_allele_annotation.tsv")
system(cmd)

In [None]:
TP53_allele_annotation <- readr::read_tsv("TP53_allele_annotation.tsv", show_col_types = FALSE)
head(TP53_allele_annotation)

## Generate SQL for data extraction

Alternatively, we can generate a SQL query instead of extracting the data. This query can then be run on a Spark-enabled cluster to retrieve the data. To generate the query, add the `--sql` flag; here `-o -` prints it to STDOUT.

In [None]:
cmd <- paste("dx extract_assay germline", cohort, "--retrieve-annotation '{\"gene_name\": [\"TP53\"]}' --sql -o -")
print(system(cmd, intern = TRUE), quote=FALSE)
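
With `intern = TRUE`, `system()` returns a character vector with one element per line of output, so a multi-line query arrives split into pieces. The sketch below joins the lines into a single string and saves it. `sql_lines` is placeholder text that stands in for the captured output; it is not a real query generated by `dx`.

```r
# Placeholder standing in for `system(cmd, intern = TRUE)`; not real dx output.
sql_lines <- c("SELECT allele_id", "FROM example_table;")

# Join the lines into one string so it can be passed on as a single query.
sql_query <- paste(sql_lines, collapse = "\n")
cat(sql_query, "\n")

# Save the query for later use.
sql_path <- tempfile(fileext = ".sql")
writeLines(sql_query, sql_path)
```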

## Return data from a specific assay

A dataset may contain one or several assays of the same type. If multiple assays of the same type exist in a dataset, use the `--assay-name` argument to specify exactly which assay to query. By default, data is returned from the first eligible assay in the dataset.

In [None]:
cmd <- paste("dx extract_assay germline", dataset, "--assay-name 'UK Biobank Synthetic 100k_assay' --retrieve-annotation '{\"gene_name\": [\"TP53\"]}' --sql -o out.sql")
system(cmd)