# “dx extract_dataset” in Bash
<hr/>
***As-Is Software Disclaimer***

This content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates usage of the dx command `extract_dataset` for:
* Retrieval of Apollo-stored data, as referenced within entities and fields of a Dataset or Cohort object on the platform
* Retrieval of the underlying data dictionary files used to generate a Dataset object on the platform

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Bash
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.2
* Runtime: ≈ 10 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID

### dxpy version
`extract_dataset` requires dxpy version >= 0.329.0. If you run the command from a local environment (i.e. off the DNAnexus platform), you may also need to install pandas, for example with `pip3 install -U dxpy[pandas]`.
The listing options (`--list-entities` and `--list-fields`) are available in dxpy version >= 0.341.0.

In [None]:
pip3 install -U dxpy 
dx --version
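
The two version thresholds above can be checked before running the rest of the notebook. The following is a minimal sketch and not part of dxpy: `version_ge` is a hypothetical helper, it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils), and the installed version is hard-coded for illustration rather than parsed from `dx --version`.

```shell
# Hypothetical helper (not part of dxpy): succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    # After a version sort, the first line is the smaller version.
    # If that is $2, then $1 is at least $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: replace this hard-coded value with the version reported by `dx --version`.
installed="0.341.0"

if version_ge "${installed}" "0.329.0"; then
    echo "extract_dataset is supported"
fi
if version_ge "${installed}" "0.341.0"; then
    echo "listing options are supported"
fi
```

A version sort is used instead of a plain string comparison so that each dot-separated component is compared numerically rather than character by character.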

### 1. Assign environment variables

In [None]:
# The referenced Dataset is private and is shown only as an example input. Supply the ID of a valid record that you have permission to access.
# Assign the joint dataset ID in the form project-id:record-id
dataset="project-G5BzYk80kP5bvbXy5J7PQZ36:record-GJ3Y7jQ0VKyy592yPxB4yG7Y"
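
Since the value has to follow the `project-id:record-id` form, a quick shape check can catch copy-and-paste mistakes before any `dx` command is run. This is only a sketch: the variable names `project_id` and `record_id` are introduced here for illustration, and the check only looks at the prefixes and the colon, not at whether the IDs exist or are accessible.

```shell
dataset="project-G5BzYk80kP5bvbXy5J7PQZ36:record-GJ3Y7jQ0VKyy592yPxB4yG7Y"

# Split the joint ID at the colon (illustrative variable names).
project_id="${dataset%%:*}"
record_id="${dataset##*:}"

# Loose shape check only: it does not verify that the IDs exist or are accessible.
case "${dataset}" in
    project-*:record-*)
        echo "project: ${project_id}"
        echo "record:  ${record_id}"
        ;;
    *)
        echo "Unexpected format, expected project-id:record-id" >&2
        ;;
esac
```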

### 2. Inspecting the dataset structure

#### A) Extract the three dictionary files
`<record_name>.data_dictionary.csv`, `<record_name>.entity_dictionary.csv`, and `<record_name>.codings.csv`

In [None]:
dx extract_dataset ${dataset} -ddd --delimiter ","

#### Preview data in the three dictionary (*.csv) files

In [None]:
head -5 *.csv

#### B) List names and titles for entities and fields 
Names and titles are printed as tab-separated columns.

In [None]:
dx extract_dataset ${dataset} --list-entities

List the fields in the main entity.

In [None]:
dx extract_dataset ${dataset} --list-fields

List the fields in the specified entities, given as a comma-separated list.

In [None]:
dx extract_dataset ${dataset} --list-fields --entities=doctor,baseline

### 3. Parse metadata to get entity/field names in format for extraction

#### A) Parsing dictionary files

In [None]:
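# Take the first two columns (entity and field name) of the data dictionary, drop the header row,
# join each pair as entity.field, and combine all pairs into one comma-separated string
entity_field_input=$(cut -d "," -f 1,2 *.data_dictionary.csv | tail -n +2 | tr ',' '.' | tr '\n' ',' | sed 's/,$//')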
echo ${entity_field_input}
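
To see what each stage of this pipeline does without access to a Dataset, it can be run against a small stand-in file. The sketch below writes a made-up three-column dictionary to a temporary directory; the entity and field names are invented for illustration and do not come from the example Dataset.

```shell
# Made-up stand-in for <record_name>.data_dictionary.csv (illustrative names only).
workdir=$(mktemp -d)
cat > "${workdir}/example.data_dictionary.csv" <<'EOF'
entity,name,type
patient,patient_id,string
patient,age,integer
test,test_id,string
EOF

# cut:   keep columns 1 and 2 (entity, field name)
# tail:  drop the header row
# tr:    turn "entity,name" into "entity.name", then join lines with commas
# sed:   remove the trailing comma left by the last newline
entity_field_input=$(cut -d "," -f 1,2 "${workdir}"/*.data_dictionary.csv \
    | tail -n +2 | tr ',' '.' | tr '\n' ',' | sed 's/,$//')
echo "${entity_field_input}"
# prints: patient.patient_id,patient.age,test.test_id

rm -r "${workdir}"
```

Splitting on every comma assumes that neither of the first two columns contains quoted commas.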

#### B) Parsing output of `dx extract_dataset ${dataset} --list-fields` 
The output can be filtered further to keep only the fields of interest, e.g. those whose names contain `risk`:

In [None]:
entity_field_input=$(dx extract_dataset "${dataset}" --list-fields | cut -f1 | grep risk | tr '\n' ',' | sed 's/,$//')
echo ${entity_field_input}
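
The same filtering can be tried offline on simulated listing output. In the sketch below the tab-separated name and title pairs are invented for illustration, and `paste -s -d ','` is swapped in for the `tr`/`sed` pair as an alternative way to join lines without leaving a trailing comma.

```shell
# Simulated `--list-fields` output: name<TAB>title (made-up values).
list_fields_output=$(printf '%s\t%s\n' \
    "patient.patient_id"  "Patient ID" \
    "patient.risk_level"  "Risk level" \
    "baseline.risk_score" "Baseline risk score" \
    "baseline.visit_date" "Date of risk assessment visit")

# cut runs before grep, so only the name column is searched:
# the last row is excluded even though its title contains "risk".
entity_field_input=$(printf '%s\n' "${list_fields_output}" \
    | cut -f1 | grep risk | paste -s -d ',' -)
echo "${entity_field_input}"
# prints: patient.risk_level,baseline.risk_score
```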

### 4. Use the extracted entity and field names as input to “dx extract_dataset” to extract data

In [None]:
dx extract_dataset "${dataset}" --fields "${entity_field_input}" -o extracted_data.csv

#### Preview the retrieved data file

In [None]:
head -3 extracted_data.csv
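
Beyond previewing the first rows, a quick sanity check is to compare the number of columns in the file with the number of fields requested. The following sketch runs on a made-up stand-in for `extracted_data.csv`; it assumes a comma delimiter, one header row, and no quoted commas in the header.

```shell
# Made-up stand-in for extracted_data.csv and the matching field list.
workdir=$(mktemp -d)
cat > "${workdir}/extracted_data.csv" <<'EOF'
patient.patient_id,patient.age
P001,34
P002,51
EOF
entity_field_input="patient.patient_id,patient.age"

# Number of requested fields, number of columns in the header, and number of data rows.
n_requested=$(( $(printf '%s\n' "${entity_field_input}" | tr ',' '\n' | wc -l) ))
n_columns=$(( $(head -n 1 "${workdir}/extracted_data.csv" | awk -F',' '{print NF}') ))
n_rows=$(( $(wc -l < "${workdir}/extracted_data.csv") - 1 ))

echo "requested fields: ${n_requested}, columns: ${n_columns}, data rows: ${n_rows}"
if [ "${n_requested}" -ne "${n_columns}" ]; then
    echo "Column count does not match the number of requested fields" >&2
fi

rm -r "${workdir}"
```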

### 5. Upload extracted dictionaries and data back to the project

In [None]:
dx upload *.csv