# “dx extract_assay germline” in Python
<hr/>
***As-Is Software Disclaimer***

The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates usage of the dx command `extract_assay germline` to:
* Get a list of sample IDs corresponding to a set of rsIDs from a dataset.
* Retrieve all the alleles in one gene and their associated annotation from a cohort.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Python
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.1
* Runtime: < 2 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID. In the example considered below, the dataset and cohort have 50K samples and around 16M variants.

### Download and install dxpy

These steps will be removed once a new version of dxpy is available in JupyterLab.

In [None]:
# For dx extract_assay germline, dxpy must be v0.349.1 or greater
!pip3 install -U dxpy==0.350.1
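
To confirm that an installed version satisfies the minimum stated above, the version strings can be compared as tuples of integers. The sketch below is a minimal illustration, not part of dxpy: `version_tuple` and `meets_minimum` are hypothetical helper names, and the code assumes purely numeric dotted versions. In practice the installed version string could be read with the standard library's `importlib.metadata.version("dxpy")`.

```python
MIN_DXPY = "0.349.1"  # minimum version stated in the comment above


def version_tuple(version):
    """Convert a dotted version string such as '0.350.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum=MIN_DXPY):
    """Return True if `installed` is at least `minimum`."""
    # Tuples compare element by element, so (0, 350, 1) >= (0, 349, 1)
    return version_tuple(installed) >= version_tuple(minimum)


# The version pinned in this notebook
print(meets_minimum("0.350.1"))
```

Comparing tuples rather than raw strings avoids lexical surprises such as `"0.99.0" > "0.349.1"` evaluating to true.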

In [None]:
import subprocess
import dxpy
import pandas as pd
import os
import glob
import json
pd.set_option('display.max_columns', None)

In [None]:
cmd = ["dx", "extract_assay", "germline", "--help"]
help_text = subprocess.check_output(cmd)
print(help_text.decode(), end="\n")

### Assign environment variables

In [None]:
# The referenced dataset is not public; it is listed here only as an example input.
# Supply a valid project ID and record ID that you have permission to access.

# Assign a project-qualified dataset and cohort, in the form project-id:record-id
dataset="project-GXYXjj00j49b5qX2Kzq1qbZk:record-FykXjyj0pZ7B8gKvKkxFF7QJ"
cohort="project-GXYXjj00j49b5qX2Kzq1qbZk:record-GXYkQyQ0j49jZ2fKpKZbxZ57"

## Extract data from a dataset

### 1. Explore the genetic assays of a dataset

Check which genetic variant assays are available in the dataset using the `--list-assays` flag.

In [None]:
cmd = ["dx", "extract_assay", "germline", dataset, "--list-assays"]
list_assay = subprocess.check_output(cmd)
print(list_assay.decode(), end="\n")

### 2. Retrieve genomic data from the dataset

Data can be retrieved with one of three options: `--retrieve-allele`, `--retrieve-annotation`, or `--retrieve-genotype`. Each option accepts a JSON object that filters the results, supplied either as a \*.json file or as a string. For a template and the list of filters available to a given option, add `--json-help` after that option.

In [None]:
cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-allele", "--json-help"]
allele_json_help = subprocess.check_output(cmd).decode()
print(allele_json_help)

#### Example of how to create a JSON filter to retrieve data

A common way to create the JSON object used for retrieving data is to take the JSON template from `--json-help` and fill in the filters you need.

In [None]:
allele_filter_dict = {
    "rsid":
        ["rs1342568097", 
         "rs1267100748"], 
    "type":
        ["SNP"]
}
allele_filter = json.dumps(allele_filter_dict)

with open("allele_filter.json", "w") as op:
    op.write(allele_filter)

In [None]:
cmd = ["cat", "allele_filter.json"]
allele_filter_file = subprocess.check_output(cmd).decode()
print(allele_filter_file)
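
As an alternative to `cat`, the filter file can be read back with `json.load`, which also confirms that the file parses as valid JSON. The sketch below is illustrative only: it writes to a temporary directory so that it does not touch the notebook's own `allele_filter.json`.

```python
import json
import os
import tempfile

allele_filter_dict = {
    "rsid": ["rs1342568097", "rs1267100748"],
    "type": ["SNP"],
}

# Write to a temporary directory so the sketch leaves the notebook's files alone
path = os.path.join(tempfile.mkdtemp(), "allele_filter.json")
with open(path, "w") as handle:
    json.dump(allele_filter_dict, handle)

# json.load raises json.JSONDecodeError if the file is not valid JSON
with open(path) as handle:
    loaded = json.load(handle)

print(loaded == allele_filter_dict)
```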

#### Use the generated JSON file and retrieve allele data

List the allele IDs and related information that match the filters in `allele_filter.json`, and save the results to `retrieve_allele_output.tsv` using the `--output OUTPUT` option (short form `-o`).

In [None]:
cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-allele", "allele_filter.json", "-o", "retrieve_allele_output.tsv"]
subprocess.check_call(cmd)

In [None]:
retrieve_allele_output = pd.read_csv("retrieve_allele_output.tsv", sep="\t")
retrieve_allele_output.head()

#### Retrieve sample IDs based on allele IDs

Use the allele IDs from the previous step as a filter to get sample IDs and related genotype information with the `--retrieve-genotype` option. Here, the filter is passed as a JSON string rather than as a \*.json file.

In [None]:
# Create the filter as a JSON string; json.dumps emits valid JSON (double quotes) directly
genotype_filter = json.dumps({"allele_id": retrieve_allele_output["allele_id"].tolist()})

# Run the extract_assay command and write the result to a TSV file
cmd = ["dx", "extract_assay", "germline", dataset, "--retrieve-genotype", genotype_filter, "--output", "retrieve_genotype_output.tsv"]
subprocess.check_call(cmd)
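
When a JSON filter string is built from Python objects, `json.dumps` is generally more robust than converting with `str()` and swapping quote characters. The sketch below illustrates the difference; the allele IDs and the `flag` key are hypothetical values chosen for the demonstration, not real filter fields.

```python
import json

# Hypothetical allele IDs, for illustration only
allele_ids = ["1_1000_A_T", "1_2000_G_C"]

# json.dumps always produces valid JSON
genotype_filter = json.dumps({"allele_id": allele_ids})
print(genotype_filter)

# str() plus quote replacement happens to work for plain strings ...
via_str = str({"allele_id": allele_ids}).replace("'", '"')
same_result = json.loads(via_str) == json.loads(genotype_filter)

# ... but breaks on values that Python renders differently from JSON,
# for example True (Python) versus true (JSON)
fragile = str({"flag": True}).replace("'", '"')
try:
    json.loads(fragile)
    fragile_is_valid = True
except json.JSONDecodeError:
    fragile_is_valid = False

print(same_result, fragile_is_valid)
```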

In [None]:
retrieve_genotype_output = pd.read_csv("retrieve_genotype_output.tsv", sep="\t")
retrieve_genotype_output.head()

Get the list of sample IDs:

In [None]:
print("List of sample IDs with previously derived allele IDs.")
print(list(retrieve_genotype_output["sample_id"]))
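
If the genotype table has one row per sample and allele (an assumption about the output layout, not something the help text confirms here), a sample carrying several of the queried alleles would appear more than once in the list. A minimal sketch for removing duplicates while keeping first-seen order follows; `unique_in_order` is a hypothetical helper, and the sample IDs are made up.

```python
def unique_in_order(values):
    """Return the distinct values, keeping the order in which they first appear."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(values))


# Hypothetical sample IDs; in the notebook, pass retrieve_genotype_output["sample_id"]
sample_ids = ["sample_1", "sample_2", "sample_1", "sample_3"]
print(unique_in_order(sample_ids))
```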

## Extract data from a cohort

Here we will retrieve all the alleles in the gene `TP53` and their associated annotations from a saved cohort.

#### Check the available filters for the `--retrieve-annotation` option

In [None]:
cmd = ["dx", "extract_assay", "germline", cohort, "--retrieve-annotation", "--json-help"]
annotation_json_help = subprocess.check_output(cmd).decode()
print(annotation_json_help)

#### Use the `gene_name` filter and save the output to a .tsv file

In [None]:
cmd = ["dx", "extract_assay", "germline", cohort, "--retrieve-annotation", "{\"gene_name\": [\"TP53\"]}", "-o", "TP53_allele_annotation.tsv"]
subprocess.check_call(cmd)

In [None]:
TP53_allele_annotation = pd.read_csv("TP53_allele_annotation.tsv", sep="\t")
TP53_allele_annotation.head()

## Generate SQL for data extraction

Alternatively, we can generate a SQL query instead of extracting the data. The query can then be run on a Spark-enabled cluster to retrieve the data. To generate it, add the `--sql` flag; passing `-o -` prints the query to STDOUT.

In [None]:
cmd = ["dx", "extract_assay", "germline", cohort, "--retrieve-annotation", "{\"gene_name\": [\"TP53\"]}", "--sql", "-o", "-"]
print(subprocess.check_output(cmd).decode())

## Return data from a specific assay

`dx extract_assay germline` supports datasets with multiple assays, including several assays of the same type. If a dataset contains more than one assay of the same type, use the `--assay-name` argument to specify which assay to query. By default, data is returned from the first eligible assay in the dataset.

In [None]:
cmd = ["dx", "extract_assay", "germline", cohort, "--assay-name", "UK Biobank Synthetic 50k_assay", "--retrieve-annotation", "{\"gene_name\": [\"TP53\"]}", "--sql", "-o", "out.sql"]
subprocess.check_call(cmd)
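
The command lists used throughout this notebook share one shape, so they can be assembled by a small helper. The sketch below is a hypothetical convenience function, not part of dxpy; it uses only the options shown in this notebook, and the `project-xxxx:record-yyyy` path is a placeholder.

```python
import json

RETRIEVE_OPTIONS = ("--retrieve-allele", "--retrieve-annotation", "--retrieve-genotype")


def build_extract_cmd(path, retrieve_option, filters, output, assay_name=None, sql=False):
    """Assemble the argument list for `dx extract_assay germline` (hypothetical helper)."""
    if retrieve_option not in RETRIEVE_OPTIONS:
        raise ValueError(f"unknown retrieve option: {retrieve_option}")
    cmd = ["dx", "extract_assay", "germline", path]
    if assay_name is not None:
        cmd += ["--assay-name", assay_name]
    # A dict is serialised to a JSON string; a str (e.g. a *.json path) is passed through
    cmd += [retrieve_option, filters if isinstance(filters, str) else json.dumps(filters)]
    if sql:
        cmd.append("--sql")
    cmd += ["--output", output]
    return cmd


cmd = build_extract_cmd(
    "project-xxxx:record-yyyy",  # placeholder, not a real object ID
    "--retrieve-annotation",
    {"gene_name": ["TP53"]},
    "out.sql",
    assay_name="UK Biobank Synthetic 50k_assay",
    sql=True,
)
print(cmd)
```

The returned list can be passed to `subprocess.check_call` or `subprocess.check_output` in the same way as the hand-written lists above.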