# “dx extract_assay expression” in R
<hr/>
***As-Is Software Disclaimer***

The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>
This notebook demonstrates usage of the dx command `extract_assay expression` to:

* Filter expression data in a DNAnexus dataset
* Generate an expression matrix from a DNAnexus dataset

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: R
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.1
* Runtime: < 2 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID.

In [None]:
# dx extract_assay expression requires dxpy v0.364.0 or later.
# A sufficiently recent version of dxpy from PyPI may already be installed,
# in which case the "pip" install below is unnecessary.
system('pip3 install -U "dxpy>=0.364.0"')
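
To confirm that the installed toolkit meets the minimum version, the version string can be parsed and compared in R. The following is a minimal sketch: the helper names `parse_dx_version` and `dx_version_ok` are hypothetical, and the assumption that `dx --version` prints a line such as `dx v0.364.0` should be checked against your installation.

```r
# Minimum dxpy version required by dx extract_assay expression (stated above).
min_version <- package_version("0.364.0")

# Extract the dotted version number from a line such as "dx v0.364.0".
# The exact layout of that line is an assumption about the CLI output.
parse_dx_version <- function(version_line) {
    match <- regmatches(version_line, regexpr("[0-9]+(\\.[0-9]+)+", version_line))
    if (length(match) == 0) stop("Could not find a version number in: ", version_line)
    package_version(match)
}

# TRUE when the reported version is at least the required minimum.
dx_version_ok <- function(version_line, minimum = min_version) {
    parse_dx_version(version_line) >= minimum
}

# In the notebook, pass the live CLI output:
# dx_version_ok(system("dx --version", intern = TRUE)[1])
```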

### Print help message

In [None]:
cmd <- "dx extract_assay expression --help"
writeLines(system(cmd, intern = TRUE))
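
With `system(cmd, intern = TRUE)`, a command that exits with a non-zero status only raises an R warning, so a failed `dx` call can be easy to miss. A possible wrapper that stops on failure is sketched below; `run_cmd` is a hypothetical helper, not part of dxpy or this notebook.

```r
# Run a command and return its output lines.
# Stops with the captured output if the command exits with a non-zero status.
run_cmd <- function(command, args = character()) {
    # stderr = TRUE merges error messages into the captured output.
    out <- suppressWarnings(system2(command, args, stdout = TRUE, stderr = TRUE))
    status <- attr(out, "status")
    if (!is.null(status) && status != 0) {
        stop(sprintf("'%s' exited with status %d:\n%s",
                     command, as.integer(status), paste(out, collapse = "\n")))
    }
    out
}

# Example (requires the dx CLI on the PATH):
# writeLines(run_cmd("dx", c("extract_assay", "expression", "--help")))
```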

### Assign environment variables

In [None]:
# The dataset referenced here is not public; it is listed only as an example input.
# Supply a valid project ID and record ID that you have permission to access.

# Assign a project-qualified dataset record, project-id:record-id
dataset <- "project-G5Bzk5806j8V7PXB678707bv:record-GYPg9Jj06j8pp3z41682J23p"
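
A quick format check can catch a mistyped ID before any `dx` command is run. This sketch assumes each ID consists of its class prefix followed by 24 alphanumeric characters, which matches the example above but should be treated as an assumption; `is_project_qualified_record` is a hypothetical helper name.

```r
# TRUE when the string looks like "project-<24 chars>:record-<24 chars>".
# The 24-character alphanumeric pattern is an assumption based on the example ID.
is_project_qualified_record <- function(id) {
    grepl("^project-[0-9A-Za-z]{24}:record-[0-9A-Za-z]{24}$", id)
}

# In the notebook:
# stopifnot(is_project_qualified_record(dataset))
```

The check only validates the shape of the string. It does not confirm that the record exists or that you have access to it.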

### Explore the expression assays of a dataset

Check which expression assays are available in the dataset using the `--list-assays` flag.

In [None]:
cmd <- paste("dx extract_assay expression", dataset, "--list-assays", sep = " ")
writeLines(system(cmd, intern = TRUE))

### Retrieve expression data from the dataset

Data is retrieved with the `--retrieve-expression` option. The option takes a JSON object of filters, supplied either as a \*.json file or as a string. For additional help, add `--json-help` after `--retrieve-expression` to print a template and the list of filters available for this option.

In [None]:
cmd <- paste(
    "dx extract_assay expression",
    dataset,
    "--retrieve-expression",
    "--json-help",
    sep = " "
)
writeLines(system(cmd, intern = TRUE))

#### Example: create a JSON of filters to retrieve data

JSON may be created in many ways, including writing the object to a file.

In [None]:
filter <- '{
    "location": [
        {
            "chromosome": "1",
            "starting_position": "1",
            "ending_position": "250000000"
        }
    ]
}'
write(filter, "filter.json")

In [None]:
cmd <- "cat filter.json"
writeLines(system(cmd, intern = TRUE))
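
When the same filter structure is needed for several regions, the JSON can also be generated from a small function. The sketch below uses only base R and reproduces the `location` structure shown above; `make_location_filter` is a hypothetical helper, and it assumes the inputs contain no characters that need JSON escaping.

```r
# Build a single-location filter as a JSON string.
# Positions are written as quoted strings, as in the filter defined above.
make_location_filter <- function(chromosome, start, end) {
    sprintf(
        '{"location": [{"chromosome": "%s", "starting_position": "%s", "ending_position": "%s"}]}',
        chromosome,
        # scientific = FALSE avoids output such as "2.5e+08" for large positions.
        format(start, scientific = FALSE, trim = TRUE),
        format(end, scientific = FALSE, trim = TRUE)
    )
}

# In the notebook:
# write(make_location_filter("1", 1, 250000000), "filter.json")
```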

#### Use the generated JSON file and retrieve expression data

Retrieve expression data and related information for the features that match the filters defined in `filter.json`. The output can be printed to STDOUT or written to a file. Here we print it to STDOUT.

In [None]:
cmd <- paste(
    "dx extract_assay expression",
    dataset,
    "--retrieve-expression",
    "--filter-json-file filter.json",
    "-o -",
    sep = " "
)
writeLines(system(cmd, intern=TRUE))

#### Add additional fields to output

For a full list of the fields available for output, use the `--additional-fields-help` flag.

In [None]:
cmd <- paste(
    "dx extract_assay expression",
    dataset,
    "--additional-fields-help",
    sep = " "
)
writeLines(system(cmd, intern=TRUE))

Specify additional fields with `--additional-fields`, passing the field names as a comma-separated list:

In [None]:
cmd <- paste(
    "dx extract_assay expression",
    dataset,
    "--retrieve-expression",
    "--filter-json-file filter.json",
    "--additional-fields 'start,end,strand'",
    "-o std_output.csv",
    sep = " "
)
system(cmd, intern=TRUE)
cmd <- "cat std_output.csv"
writeLines(system(cmd, intern=TRUE))
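
Once the output is on disk, it is usually more convenient to load it into a data frame than to print it with `cat`. The sketch below writes a small stand-in file so that it runs on its own; the column names and values in it are hypothetical, so inspect the header of your own `std_output.csv` before relying on them.

```r
# Stand-in for std_output.csv so the sketch is self-contained.
# Column names and values are hypothetical, not real output.
example_csv <- tempfile(fileext = ".csv")
writeLines(c(
    "feature_id,sample_id,value",
    "feature_A,sample_1,10.5",
    "feature_A,sample_2,3.0",
    "feature_B,sample_1,7.25"
), example_csv)

# Load the long-format table: one row per feature/sample pair.
expression_long <- read.csv(example_csv, stringsAsFactors = FALSE)
str(expression_long)

# Example summary: mean value per feature.
feature_means <- tapply(expression_long$value, expression_long$feature_id, mean)
```

In the notebook, replace `example_csv` with `"std_output.csv"`.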

## Generate an expression matrix from a dataset
`dx extract_assay expression` can be configured to output an expression matrix instead of its standard output by adding the `--expression-matrix` flag. The matrix contains a `sample_id` column and one additional column for every `feature_id` in the dataset, with one row per `sample_id`. Note that filtering is limited when creating an expression matrix, and additional fields are not included in the matrix.

In [None]:
cmd <- paste(
    "dx extract_assay expression",
    dataset,
    "--retrieve-expression",
    "--filter-json-file filter.json",
    "--expression-matrix",
    "-o expression_matrix.csv",
    sep = " "
)
system(cmd, intern=TRUE)
cmd <- "cat expression_matrix.csv"
writeLines(system(cmd, intern=TRUE))
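
Because the matrix has one `sample_id` column followed by one column per feature, it can be converted into a numeric R matrix with samples as row names. The following sketch uses a small stand-in file with hypothetical feature names and values; in the notebook, read `"expression_matrix.csv"` instead.

```r
# Stand-in for expression_matrix.csv (hypothetical features and values).
example_matrix_csv <- tempfile(fileext = ".csv")
writeLines(c(
    "sample_id,feature_A,feature_B",
    "sample_1,10.5,7.25",
    "sample_2,3.0,0.0"
), example_matrix_csv)

# check.names = FALSE keeps feature IDs exactly as written in the header.
wide <- read.csv(example_matrix_csv, check.names = FALSE, stringsAsFactors = FALSE)

# Keep every column except sample_id as matrix values; use sample_id as row names.
expr <- as.matrix(wide[, setdiff(names(wide), "sample_id"), drop = FALSE])
rownames(expr) <- wide$sample_id
dim(expr)
```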

## Generate SQL for data extraction

Alternatively, we can generate a SQL query instead of extracting the data. The query can then be run on a Spark-enabled cluster to retrieve the data. To generate the query, use the `--sql` flag; here it is printed to STDOUT.

In [None]:
cmd <- paste(
    "dx extract_assay expression",
    dataset,
    "--retrieve-expression",
    "--filter-json-file filter.json",
    "--sql -o -",
    sep = " "
)
writeLines(system(cmd, intern=TRUE))

## Return data from a specific assay

Multiple assays are supported: a dataset may contain one or many assays of the same type. If multiple assays of the same type exist in a dataset, use the `--assay-name` argument to specify which assay to query. By default, data is returned from the first eligible assay in the dataset.

In [None]:
cmd <- paste(
    "dx extract_assay expression",
    dataset,
    "--assay-name enst_short_multiple_assays_sample_2_enst",
    "--retrieve-expression",
    "--filter-json-file filter.json",
    "-o expression_long.csv",
    sep = " "
)
system(cmd, intern=TRUE)
cmd <- "cat expression_long.csv"
writeLines(system(cmd, intern=TRUE))
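
When the assay name comes from a variable, for example a name copied from the `--list-assays` output, quoting each argument protects against spaces and shell metacharacters. This is a minimal sketch using only the flags shown in this notebook; `build_assay_cmd` is a hypothetical helper.

```r
# Assemble the command string for a named assay, shell-quoting every
# user-supplied value so it is passed to dx as a single argument.
build_assay_cmd <- function(dataset, assay_name, filter_file, output) {
    paste(
        "dx extract_assay expression",
        shQuote(dataset, type = "sh"),
        "--assay-name", shQuote(assay_name, type = "sh"),
        "--retrieve-expression",
        "--filter-json-file", shQuote(filter_file, type = "sh"),
        "-o", shQuote(output, type = "sh")
    )
}

# In the notebook:
# cmd <- build_assay_cmd(dataset, "enst_short_multiple_assays_sample_2_enst",
#                        "filter.json", "expression_long.csv")
# system(cmd, intern = TRUE)
```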