# “dx extract_assay germline” in Bash
<hr/>
***As-Is Software Disclaimer***

This content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates usage of the dx command `extract_assay germline` to:
* Get a list of sample IDs corresponding to a set of rsIDs from a dataset.
* Retrieve all the alleles in one gene and their associated annotation from a cohort.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Bash
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.1
* Runtime: ~2 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID. In the example considered below, the dataset and cohort have 50K samples and around 16M variants.

### Download and install dxpy

In [None]:
# For dx extract_assay germline, dxpy must be v0.349.1 or greater
pip3 install -U dxpy==0.350.1
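
To confirm that the installed version meets the minimum stated above, a check along the following lines may help. This is a minimal sketch: the helper name `version_ge` is hypothetical, it relies on `sort -V` from GNU coreutils, and it assumes `dx --version` prints a string such as `dx v0.350.1`.

```shell
# Return success when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum dxpy version for dx extract_assay germline (from the comment above).
required="0.349.1"

# Assumption: `dx --version` prints a string such as "dx v0.350.1";
# the sed expression keeps only what follows the last "v".
installed="$( { dx --version 2>/dev/null || true; } | sed 's/^.*v//' )"

if [ -n "$installed" ] && version_ge "$installed" "$required"; then
    echo "dxpy ${installed} supports dx extract_assay germline"
else
    echo "dxpy ${installed:-(not found)} is older than ${required} or missing" >&2
fi
```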

In [None]:
dx extract_assay germline --help

### Assign environment variables

In [None]:
# The dataset referenced here is not public; it is listed only as an example input.
# Supply a valid project ID and record ID that you have permission to access.

# Assign a project qualified dataset, project-id:record-id
dataset="project-GXYXjj00j49b5qX2Kzq1qbZk:record-FykXjyj0pZ7B8gKvKkxFF7QJ"
cohort="project-GXYXjj00j49b5qX2Kzq1qbZk:record-GXYkQyQ0j49jZ2fKpKZbxZ57"
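
Before running any queries, it can be worth checking that each variable has the `project-id:record-id` shape. The sketch below is one way to do that; the function name `is_qualified_record_id` is hypothetical, and the pattern assumes that each ID consists of its class prefix followed by 24 alphanumeric characters, as in the example values above.

```shell
# Return success when $1 looks like project-<24 chars>:record-<24 chars>.
# Assumption: IDs are a class prefix followed by 24 alphanumeric characters.
is_qualified_record_id() {
    printf '%s\n' "$1" \
    | grep -Eq '^project-[A-Za-z0-9]{24}:record-[A-Za-z0-9]{24}$'
}

# Warn about any variable that is unset or does not match the expected shape.
for value in "${dataset:-}" "${cohort:-}"; do
    if ! is_qualified_record_id "$value"; then
        echo "Not a project-qualified record ID: '${value}'" >&2
    fi
done
```

This only checks the format of the string; it does not confirm that the record exists or that you have access to it.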

## Extract data from a dataset

### 1. Explore the genetic assays of a dataset

Check which genetic variant assays are available in the dataset using the `--list-assays` flag.

In [None]:
dx extract_assay germline ${dataset} --list-assays

### 2. Retrieve genomic data from the dataset

Data can be retrieved using one of three options: `--retrieve-allele`, `--retrieve-annotation`, or `--retrieve-genotype`. Each option accepts a JSON object of filters, supplied either as a \*.json file or as a string. To see a template and the list of filters available for an option, add `--json-help` after it.

In [None]:
dx extract_assay germline ${dataset} --retrieve-allele --json-help

#### Example: creating a JSON filter to retrieve data

A common way to create the JSON object used for retrieving data is to take the JSON template printed by `--json-help` and update it with the filters you need.

In [None]:
dx extract_assay germline ${dataset} --retrieve-allele --json-help \
| grep -v '^#' \
> allele_filter_template.json

In [None]:
cat allele_filter_template.json

Set the `rsid` filter to the desired rsIDs, set `type` to SNP, and remove all other filters from the JSON file.

In [None]:
jq '{rsid, type}' allele_filter_template.json \
| jq '.rsid |= ["rs1342568097", "rs1267100748"]' \
| jq '.type |= ["SNP"]' \
> allele_filter.json

In [None]:
cat allele_filter.json

Alternatively, the \*.json file may be generated *de novo*.
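
As a minimal sketch of the *de novo* route, the same two filters used above (`rsid` and `type`) can be written directly with a here-document. The optional validation step assumes `jq` is installed and is skipped otherwise.

```shell
# Write the filter file directly; the quoted 'EOF' prevents shell expansion
# inside the here-document, so the JSON is written exactly as typed.
cat > allele_filter.json <<'EOF'
{
  "rsid": ["rs1342568097", "rs1267100748"],
  "type": ["SNP"]
}
EOF

# Optional: confirm that the file parses as JSON when jq is available.
if command -v jq >/dev/null 2>&1; then
    jq empty allele_filter.json && echo "allele_filter.json is valid JSON"
fi

cat allele_filter.json
```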

#### Use the generated JSON file and retrieve allele data

List the allele IDs and related information that match the filters in `allele_filter.json`. The output can be printed to STDOUT or written to a .tsv file. Here we print it to STDOUT.

In [None]:
dx extract_assay germline ${dataset} \
--retrieve-allele allele_filter.json \
-o -

#### Retrieve sample IDs based on allele IDs

Use the allele IDs returned in the previous step as a filter to get sample IDs and related genotype information with the `--retrieve-genotype` option. Here, we pass the filter as a JSON string rather than a \*.json file, and write the retrieved data to `retrieve_genotype_output.tsv` using the `--output` option.

In [None]:
dx extract_assay germline ${dataset} \
--retrieve-genotype '{"allele_id": ["18_47359_G_T", "18_47360_C_T"]}' \
--output 'retrieve_genotype_output.tsv'
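
Rather than copying allele IDs by hand, the JSON string could be built from a saved allele table. The sketch below rests on assumptions: that the `--retrieve-allele` output was written to a TSV file with a header row containing a column named `allele_id`, and that the file name `allele_output_example.tsv` and helper `build_genotype_filter` are hypothetical. A two-row stand-in file (not real output) is created so the example runs on its own.

```shell
# Stand-in for a TSV written by --retrieve-allele; the column layout here
# is an assumption made for illustration, not actual command output.
printf 'allele_id\tchromosome\n18_47359_G_T\t18\n18_47360_C_T\t18\n' \
    > allele_output_example.tsv

# Build {"allele_id": [...]} from the column whose header is "allele_id".
build_genotype_filter() {
    awk -F '\t' '
        # Header row: locate the allele_id column, then skip to the next line.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "allele_id") col = i; next }
        # Data rows: append each non-empty ID as a quoted, comma-separated item.
        col && $col != "" { ids = ids (ids == "" ? "" : ", ") "\"" $col "\"" }
        END { printf "{\"allele_id\": [%s]}\n", ids }
    ' "$1"
}

genotype_filter="$(build_genotype_filter allele_output_example.tsv)"
echo "$genotype_filter"
```

The resulting string can then be passed in place of the literal filter, for example `--retrieve-genotype "$genotype_filter"`.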

In [None]:
head retrieve_genotype_output.tsv

Get the list of sample IDs from the first column of the output:

In [None]:
cut -f1 retrieve_genotype_output.tsv
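
A sample that carries more than one of the queried alleles appears on several rows, so the column above may contain repeated IDs. The following sketch reduces it to unique IDs. It assumes the file starts with a header row; the stand-in file `genotype_example.tsv`, its sample IDs, and the helper `unique_sample_ids` are hypothetical and only there to make the example self-contained.

```shell
# Stand-in for retrieve_genotype_output.tsv; the sample IDs are made up
# and the two-column layout is an assumption for illustration.
printf 'sample_id\tallele_id\nS001\t18_47359_G_T\nS002\t18_47359_G_T\nS001\t18_47360_C_T\n' \
    > genotype_example.tsv

# Drop the header row, keep the first column, and de-duplicate.
unique_sample_ids() {
    tail -n +2 "$1" | cut -f1 | sort -u
}

unique_sample_ids genotype_example.tsv > sample_ids.txt
cat sample_ids.txt
```

If the real output has no header row, drop the `tail -n +2` step.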

## Extract data from a cohort

Here we retrieve all the alleles in the TP53 gene, together with their annotations, from a saved cohort.

#### Check the available filters for the `--retrieve-annotation` option

In [None]:
dx extract_assay germline ${cohort} --retrieve-annotation --json-help

#### Use the `gene_name` filter and save the output to a .tsv file

In [None]:
dx extract_assay germline ${cohort} \
--retrieve-annotation '{"gene_name": ["TP53"]}' \
--output "TP53_allele_annotation.tsv"

In [None]:
head TP53_allele_annotation.tsv

## Generate SQL for data extraction

Alternatively, we can generate a SQL query instead of extracting the data. The query can then be run on a Spark-enabled cluster to retrieve the data. To generate the query, add the `--sql` flag; here we print it to STDOUT.

In [None]:
dx extract_assay germline ${cohort} \
--retrieve-annotation '{"gene_name": ["TP53"]}' \
--sql \
-o -

## Return data from a specific assay

`dx extract_assay germline` supports datasets with multiple assays, including several assays of the same type. If a dataset contains more than one assay of the same type, use the `--assay-name` argument to specify which assay to query. By default, data is returned from the first eligible assay in the dataset.

In [None]:
dx extract_assay germline ${cohort} \
--assay-name 'UK Biobank Synthetic 50k_assay' \
--retrieve-annotation '{"gene_name": ["TP53"]}' \
--sql \
-o 'out.sql'
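
Assay names such as the one above can contain spaces, so if the name is kept in a variable it needs to be quoted when passed to `--assay-name`. The sketch below illustrates the difference. The helper `count_args` and the output file name `out_named_assay.sql` are hypothetical, and the `dx` call only runs when `dx` is installed and `cohort` is set.

```shell
assay_name='UK Biobank Synthetic 50k_assay'
annotation_filter='{"gene_name": ["TP53"]}'

# Print how many arguments a command would receive.
count_args() { echo "$#"; }

# Quoted: the assay name stays one argument (flag + name).
quoted="$(count_args --assay-name "$assay_name")"
# Unquoted (deliberately, to show the problem): the name splits on spaces.
unquoted="$(count_args --assay-name $assay_name)"
echo "quoted: ${quoted} arguments, unquoted: ${unquoted} arguments"

# Run the query only when dx is available and a cohort has been assigned.
if command -v dx >/dev/null 2>&1 && [ -n "${cohort:-}" ]; then
    dx extract_assay germline "$cohort" \
        --assay-name "$assay_name" \
        --retrieve-annotation "$annotation_filter" \
        --sql \
        -o 'out_named_assay.sql'
fi
```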