# “dx extract_dataset” in Bash
<hr/>
***As-Is Software Disclaimer***

The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates how to use the `dx extract_dataset` command to:
* Retrieve Apollo-stored data from a dataset or cohort for a chosen set of entities and fields.
* Retrieve the data dictionary files that describe a dataset.

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Bash
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.2
* Runtime: approximately 10 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object, given either by name or by ID in the form project-id:record-id (the short form ":record-id", with the project part left empty, refers to the currently selected project)

### Install dxpy
`extract_dataset` requires dxpy version >= 0.329.0. The version of dxpy preinstalled in JupyterLab is 0.314.0, so update it by running `pip3 install -U dxpy[pandas]` in the terminal.
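
Before updating, it can help to confirm whether the installed version already meets the requirement. The sketch below is one possible check, not part of the original notebook: the helper `version_ge` is a hypothetical name, it relies on GNU `sort -V`, and it assumes that `dxpy` exposes `dxpy.__version__` (it falls back to `0.0.0` when the import fails).

```bash
# version_ge A B: succeeds when version A >= version B.
# Hypothetical helper; relies on GNU sort -V for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="0.329.0"
# Assumption: dxpy is importable by python3 and exposes __version__.
# If the import fails, treat the installed version as 0.0.0.
installed=$(python3 -c 'import dxpy; print(dxpy.__version__)' 2>/dev/null || echo "0.0.0")

if version_ge "${installed}" "${required}"; then
    echo "dxpy ${installed} is recent enough"
else
    echo "dxpy ${installed} is older than ${required}; run: pip3 install -U 'dxpy[pandas]'"
fi
```

Quoting `'dxpy[pandas]'` in the suggested command keeps the shell from treating the square brackets as a glob pattern.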

### 1. Assign environment variables

In [None]:
# Assign project-id of dataset
pid=project-G5BzYk80kP5bvbXy5J7PQZ36
# Assign dataset record-id
rid=record-GJ3Y7jQ0VKyy592yPxB4yG7Y
# Assign joint dataset project-id:record-id
dataset="${pid}:${rid}"
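
A typo in either ID only surfaces later as a failed `dx` call, so a quick format check can save a round trip. The sketch below assumes that an ID is a class prefix followed by a hyphen and 24 alphanumeric characters; that pattern is inferred from the example IDs above rather than taken from a specification, and `is_dx_id` is a hypothetical helper name.

```bash
# is_dx_id CLASS ID: succeeds when ID looks like "<CLASS>-<24 alphanumerics>".
# Assumption: the 24-character pattern is inferred from the example IDs.
is_dx_id() {
    printf '%s' "$2" | grep -Eq "^$1-[A-Za-z0-9]{24}\$"
}

pid=project-G5BzYk80kP5bvbXy5J7PQZ36
rid=record-GJ3Y7jQ0VKyy592yPxB4yG7Y

if is_dx_id project "${pid}" && is_dx_id record "${rid}"; then
    echo "IDs look well-formed: ${pid}:${rid}"
else
    echo "Malformed project or record ID" >&2
fi
```

The check is purely syntactic: it does not confirm that the project or record exists or that you have access to it.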

### 2. Call “dx extract_dataset” using a supplied dataset

In [None]:
# -ddd dumps the dataset's dictionary files as comma-delimited text; no data is extracted yet
dx extract_dataset "${dataset}" -ddd --delimiter ","

#### Preview data in the three dictionary (*.csv) files

In [None]:
head -5 *.csv

### 3. Parse returned metadata and extract entity/field names

In [None]:
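# Build an array of "entity.field" names from the first two columns of the data dictionary
entity_field=()
while IFS="," read -r entity field
do
    entity_field+=("${entity}.${field}")
done < <(cut -d "," -f 1,2 *.data_dictionary.csv | tail -n +2)
# Show the first ten names
echo "${entity_field[@]:0:10}"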
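
To see what the parsing loop produces without touching a real dataset, the same logic can be run on a small stand-in file. In the sketch below the file name, header, and rows are all made up for illustration; only the position of the first two columns (entity, then field name) mirrors what the loop above relies on. It also assumes that neither column contains a quoted comma.

```bash
# Create a tiny, hypothetical data dictionary in a temporary directory.
tmpdir=$(mktemp -d)
cat > "${tmpdir}/demo.data_dictionary.csv" <<'EOF'
entity,name,type
patient,patient_id,string
patient,age,integer
sample,sample_id,string
EOF

# Same loop as above: keep columns 1 and 2, skip the header, join with a dot.
demo_fields=()
while IFS="," read -r entity field
do
    demo_fields+=("${entity}.${field}")
done < <(cut -d "," -f 1,2 "${tmpdir}/demo.data_dictionary.csv" | tail -n +2)

# Print one "entity.field" name per line.
printf '%s\n' "${demo_fields[@]}"
rm -r "${tmpdir}"
```

With this stand-in file the loop yields `patient.patient_id`, `patient.age`, and `sample.sample_id`.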

### 4. Pass the extracted entity and field names to “dx extract_dataset” to extract data

In [None]:
# Join the array into one comma-separated string; IFS is changed only inside the subshell
entity_field_input=$(IFS=, ; echo "${entity_field[*]}")
echo "${entity_field_input}"
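
The join relies on two Bash behaviours: `"${array[*]}"` joins elements with the first character of `IFS`, and an assignment inside `$( ... )` does not leak into the parent shell. The sketch below illustrates both on a hypothetical three-element array, and also shows one way to restrict the list to a single entity before joining (the entity name `patient` is made up).

```bash
# Hypothetical field names standing in for the parsed dictionary.
demo_fields=("patient.patient_id" "patient.age" "sample.sample_id")

# Join all names with commas; IFS is only modified inside the subshell.
joined=$(IFS=, ; echo "${demo_fields[*]}")
echo "${joined}"

# Keep only the fields that belong to one entity, then join those.
patient_only=()
for name in "${demo_fields[@]}"; do
    case "${name}" in
        patient.*) patient_only+=("${name}") ;;
    esac
done
joined_patient=$(IFS=, ; echo "${patient_only[*]}")
echo "${joined_patient}"
```

Extracting a subset of fields this way keeps the output file smaller than passing every field in the dictionary.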

In [None]:
dx extract_dataset "${dataset}" --fields "${entity_field_input}" -o extracted_data.csv

#### Preview the first rows of the retrieved data file

In [None]:
head -3 extracted_data.csv
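
Beyond eyeballing the first rows, a simple sanity check is to compare the number of columns in the header with the number of fields requested. The sketch below runs on a made-up stand-in file rather than on `extracted_data.csv`; it assumes one output column per requested field and a header with no quoted commas, neither of which is confirmed by this notebook.

```bash
# Hypothetical request and a stand-in output file with matching columns.
requested="patient.patient_id,patient.age,sample.sample_id"
tmpfile=$(mktemp)
printf '%s\n' "${requested}" "P001,42,S001" "P002,37,S002" > "${tmpfile}"

# Count comma-separated items in the request and in the file's header line.
n_requested=$(printf '%s\n' "${requested}" | awk -F',' '{print NF}')
n_columns=$(head -n 1 "${tmpfile}" | awk -F',' '{print NF}')

if [ "${n_columns}" -eq "${n_requested}" ]; then
    echo "Header has ${n_columns} columns, as requested"
else
    echo "Expected ${n_requested} columns but found ${n_columns}" >&2
fi
rm "${tmpfile}"
```

To apply the same check to real output, replace `${tmpfile}` with `extracted_data.csv` and `${requested}` with `${entity_field_input}`.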

### 5. Upload extracted dictionaries and data back to the project

In [None]:
dx upload *.csv