# “dx extract_dataset” in Python
<hr/>
***As-Is Software Disclaimer***

The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

<hr/>

This notebook demonstrates usage of the dx command `extract_dataset` for:
* Retrieval of Apollo-stored data, as referenced within entities and fields of a Dataset or Cohort object on the platform
* Retrieval of the underlying data dictionary files used to generate a Dataset object on the platform

<a href="https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md">MIT License</a> applies to this notebook.

## Preparing your environment
### Launch spec:

* App name: JupyterLab with Python, R, Stata, ML
* Kernel: Python
* Instance type: mem1_ssd1_v2_x2
* Cost: < $0.2
* Runtime: ~10 min
* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID

### dxpy version
`extract_dataset` requires dxpy version >= 0.329.0. If a sufficiently recent version of dxpy from PyPI is already installed, the "pip" install below is unnecessary. When running the command from your local environment (i.e. off the DNAnexus platform), you may also need to install pandas, for example with `pip3 install -U dxpy[pandas]`.

In [None]:
!pip3 install -U dxpy==0.363.0

In [None]:
import subprocess
import dxpy
import pandas as pd
import os
import glob
pd.set_option('display.max_columns', None)

In [None]:
dxpy.__version__
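
The version printed above can be compared against the stated 0.329.0 minimum before deciding whether to reinstall. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `meets_minimum` are illustrative and not part of dxpy, and pre-release suffixes are simply ignored.

```python
import re

MIN_DXPY = "0.329.0"  # minimum version stated in this notebook


def version_tuple(version):
    """Turn '0.363.0' into (0, 363, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '0.329'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_DXPY):
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("0.363.0"))
```

In the notebook, the call would be `meets_minimum(dxpy.__version__)`.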

### 1. Assign environment variables

In [None]:
# The referenced Dataset is private and serves only as an example input.
# Supply a valid record ID for a Dataset or Cohort that you have permission to access.
# Assign project-id of dataset
pid = 'project-G5BzYk80kP5bvbXy5J7PQZ36'
# Assign dataset record-id
rid = 'record-GJ3Y7jQ0VKyy592yPxB4yG7Y'
# Assign joint dataset project-id:record-id
dataset = (':').join([pid, rid])
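
Because the example IDs must be replaced with your own, a quick format check can catch copy-and-paste mistakes before `dx` is called. This is a sketch under an assumption: the pattern below (class name, hyphen, 24 alphanumeric characters) is inferred from the example IDs in this notebook, and `build_dataset_ref` is a hypothetical helper, not a dxpy function.

```python
import re

# Assumed pattern, inferred from the example IDs above:
# "<class>-" followed by 24 alphanumeric characters.
ID_PATTERN = re.compile(r"^(project|record)-[0-9A-Za-z]{24}$")


def build_dataset_ref(project_id, record_id):
    """Validate both IDs and return the joined 'project-id:record-id' string."""
    for expected, value in (("project", project_id), ("record", record_id)):
        match = ID_PATTERN.match(value)
        if match is None or match.group(1) != expected:
            raise ValueError(f"{value!r} does not look like a {expected} ID")
    return f"{project_id}:{record_id}"


print(build_dataset_ref("project-G5BzYk80kP5bvbXy5J7PQZ36",
                        "record-GJ3Y7jQ0VKyy592yPxB4yG7Y"))
```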

### 2. Call “dx extract_dataset” using a supplied dataset

In [None]:
cmd = ["dx", "extract_dataset", dataset, "-ddd", "--delimiter", ","]
subprocess.check_call(cmd)

#### Preview data in the three dictionary (*.csv) files

In [None]:
path = os.getcwd()

In [None]:
data_dict_csv = glob.glob(os.path.join(path, "*.data_dictionary.csv"))[0]
data_dict_df = pd.read_csv(data_dict_csv)
data_dict_df.head()

In [None]:
codings_csv = glob.glob(os.path.join(path, "*.codings.csv"))[0]
# Read codes as strings so the lookups in step 5 compare like with like
codings_df = pd.read_csv(codings_csv, dtype={"code": str})
codings_df.head()

In [None]:
entity_dict_csv = glob.glob(os.path.join(path, "*.entity_dictionary.csv"))[0]
entity_dict_df = pd.read_csv(entity_dict_csv)
entity_dict_df.head()

### 3. Parse returned metadata and extract entity/field names

In [None]:
data_dict_df['ent_field'] = data_dict_df['entity'].astype(str) + \
                            '.' + data_dict_df['name'].astype(str)
        
entity_field = data_dict_df.ent_field.values.tolist()
entity_field[:10]

### 4. Pass the extracted entity and field names to “dx extract_dataset” to extract data

In [None]:
str_entity_field = ','.join(entity_field)
cmd = ["dx", "extract_dataset", dataset, "--fields", str_entity_field, 
       "-o", "extracted_data.csv"]
subprocess.check_call(cmd)

#### Print data in the retrieved data file

In [None]:
fields_df = pd.read_csv("extracted_data.csv", float_precision='round_trip')
fields_df.head()

#### Alternatively, save the extracted entity and field names to a file and supply it with the "--fields-file" option

In [None]:
with open("entity_fields.txt", "w") as f:
    for e in entity_field:
        f.write("{}\n".format(e))
        
cmd = ["dx", "extract_dataset", dataset, "--fields-file", "entity_fields.txt", 
       "-o", "extracted_data_fields_file.csv"]
subprocess.check_call(cmd)

fields_file_df = pd.read_csv("extracted_data_fields_file.csv", float_precision='round_trip')
fields_file_df.head()
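
The two ways of passing field names shown above can be wrapped in one helper that chooses between them. The sketch below only builds the argument list and uses the flags already seen in this notebook (`--fields`, `--fields-file`, `-o`). The function name `build_extract_cmd` and the `max_inline` threshold of 50 are hypothetical choices for illustration, not limits imposed by `dx`.

```python
def build_extract_cmd(dataset, fields, output,
                      fields_file="entity_fields.txt", max_inline=50):
    """Build the argument list for `dx extract_dataset`.

    Short field lists go inline via --fields; longer ones are written to
    `fields_file` (one name per line) and passed via --fields-file.
    `max_inline` is an arbitrary illustrative threshold.
    """
    cmd = ["dx", "extract_dataset", dataset]
    if len(fields) <= max_inline:
        cmd += ["--fields", ",".join(fields)]
    else:
        with open(fields_file, "w") as handle:
            handle.write("\n".join(fields) + "\n")
        cmd += ["--fields-file", fields_file]
    return cmd + ["-o", output]


# Placeholder dataset reference and field names, for illustration only
print(build_extract_cmd("project-xxxx:record-yyyy",
                        ["patient.age", "patient.sex"],
                        "extracted_data.csv"))
```

The returned list can be handed to `subprocess.check_call` exactly as in the cells above.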

### 5. Replace coded values in the extracted data with their meanings

In [None]:
data_dict_df['coding_value_type'] = data_dict_df['is_multi_select'].apply(
                                    lambda x: "list" if x == "yes" else "string")

In [None]:
import ast

fields_file_df_decoded = fields_file_df.copy(deep=True)

def get_meaning(code_name, code):
    """Return the 'meaning' entries of codings_df matching a coding name and code."""
    def lookup(value):
        return codings_df.loc[(codings_df["coding_name"] == code_name) &
                              (codings_df["code"] == value), "meaning"]

    if isinstance(code, int):
        code = str(code)
    elif isinstance(code, float):
        code = str(code)
        # If field type is float, and an integer sparse code is used for the field
        # (example `1`), the retrieved data represents it as a float (`1.0`).
        # Strip the `.0` suffix and search for the code in the codings dataframe.
        if lookup(code).empty and code.endswith('.0'):
            code = code[:-2]
    return lookup(code)

for (columnName, columnData) in fields_file_df_decoded.items():
    code_name, data_type = data_dict_df[data_dict_df["ent_field"] == columnName][
                           ["coding_name", "coding_value_type"]].values[0]
    if pd.isna(code_name):
        continue  # this field has no coding
    # Cast to object so string meanings can be written into numeric columns
    fields_file_df_decoded[columnName] = fields_file_df_decoded[columnName].astype(object)
    for val in set(columnData.dropna()):
        if data_type == "list":
            # Multi-select values are stored as a stringified list; decode each item
            new_val = []
            for i in ast.literal_eval(val):
                meaning = get_meaning(code_name, i)
                new_val.append(i if meaning.empty else meaning.values.item())
            fields_file_df_decoded.loc[fields_file_df_decoded[columnName] == val,
                                       columnName] = str(new_val)
        elif data_type == "string":
            meaning = get_meaning(code_name, val)
            if not meaning.empty:
                fields_file_df_decoded.loc[fields_file_df_decoded[columnName] == val,
                                           columnName] = meaning.values.item()
fields_file_df_decoded.head()

In [None]:
fields_file_df_decoded.to_csv("extracted_data_with_code_meanings.csv", index=False)
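
The decoding rule used in this step can be seen in isolation on a toy coding. The sketch below is not the notebook's implementation: it replaces the `codings_df` lookup with a plain dictionary, and the names `lookup`, `decode_value` and the sample codes are made up for illustration. It keeps the same two behaviors, namely the `.0` fallback for codes read back as floats and item-by-item decoding of multi-select values.

```python
import ast


def lookup(codes, code):
    """Return the meaning for `code`, or None if the code is not in `codes`."""
    key = str(code)
    if key not in codes and key.endswith(".0"):
        key = key[:-2]  # a code stored as 1 may come back as 1.0
    return codes.get(key)


def decode_value(value, codes, multi_select=False):
    """Decode a single cell value; uncoded values are returned unchanged."""
    if multi_select:
        decoded = []
        for item in ast.literal_eval(value):  # e.g. '["1", "7"]'
            meaning = lookup(codes, item)
            decoded.append(item if meaning is None else meaning)
        return str(decoded)
    meaning = lookup(codes, value)
    return value if meaning is None else meaning


toy_codes = {"1": "Yes", "0": "No"}  # hypothetical coding
print(decode_value(1.0, toy_codes))
print(decode_value('["1", "7"]', toy_codes, multi_select=True))
```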

### 6. Drop sparsely coded values

In [None]:
fields_sparse_code = fields_file_df.copy(deep=True)

In [None]:
import ast

for (columnName, columnData) in fields_sparse_code.items():
    code_name, data_type, is_sparse_coding = data_dict_df[
        data_dict_df["ent_field"] == columnName][
        ["coding_name", "coding_value_type", "is_sparse_coding"]].values[0]
    # Only fields that have a coding flagged as sparse are processed
    if pd.isna(code_name) or is_sparse_coding != 'yes':
        continue
    for val in set(columnData.dropna()):
        if data_type == "list":
            # Keep only the list items that are not sparse codes
            new_val = []
            for i in ast.literal_eval(val):
                if get_meaning(code_name, i).empty:
                    new_val.append(i)
            fields_sparse_code.loc[fields_sparse_code[columnName] == val,
                                   columnName] = str(new_val)
        elif data_type == "string":
            # A value that matches a sparse code is replaced with a missing value
            if not get_meaning(code_name, val).empty:
                fields_sparse_code.loc[fields_sparse_code[columnName] == val,
                                       columnName] = None
fields_sparse_code.head()

In [None]:
fields_sparse_code.to_csv("extracted_data_with_sparse_code_drop.csv", index=False)
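
Dropping sparse codes differs from decoding in that matching values are removed rather than translated. A minimal standalone sketch of that rule follows; it assumes the sparse codes are available as a set of strings, and `drop_sparse` and the sample codes are hypothetical names and data, not part of the notebook's dataframes.

```python
import ast


def drop_sparse(value, sparse_codes, multi_select=False):
    """Return `value` with sparse codes removed.

    Single values that match a sparse code become None; for multi-select
    values (stringified lists) only the matching items are removed.
    """
    if multi_select:
        kept = [item for item in ast.literal_eval(value)
                if str(item) not in sparse_codes]
        return str(kept)
    key = str(value)
    if key not in sparse_codes and key.endswith(".0"):
        key = key[:-2]  # a code stored as -1 may come back as -1.0
    return None if key in sparse_codes else value


toy_sparse = {"-1", "-3"}  # hypothetical sparse codes
print(drop_sparse(-1.0, toy_sparse))
print(drop_sparse("[-1, 5]", toy_sparse, multi_select=True))
```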

### 7. Replace the column titles (field names) of extracted data with the field titles

In [None]:
current_columns = list(fields_file_df.columns)

In [None]:
new_columns = {}
titles = []
duplicate_titles = []
for val in current_columns:
    meaning = data_dict_df.loc[data_dict_df["ent_field"]==val, 
                               "title"].values.item()
    if meaning not in titles:
        titles.append(meaning)
    elif meaning not in duplicate_titles:
        duplicate_titles.append(meaning)
for val in current_columns:
    meaning = data_dict_df.loc[data_dict_df["ent_field"]==val, 
                               "title"].values.item()
    if meaning not in duplicate_titles:
        new_columns[val] = meaning
    else:
        new_columns[val] = val.replace(".", "-")

In [None]:
fields_file_df.rename(columns = new_columns, inplace = True)
fields_file_df.head()

In [None]:
fields_file_df.to_csv("extracted_data_with_updated_titles.csv", index=False)
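
The renaming rule in this step (use the field title unless it is shared by more than one column, in which case fall back to `entity-field`) can also be written with `collections.Counter`. This is an alternative sketch on toy data rather than the notebook's code; `build_column_map` and the sample titles are made up for illustration.

```python
from collections import Counter


def build_column_map(columns, titles):
    """Map 'entity.field' column names to display titles.

    `titles` maps each column name to its field title. Titles used by more
    than one column are ambiguous, so those columns keep their own name
    with '.' replaced by '-'.
    """
    counts = Counter(titles[col] for col in columns)
    return {
        col: titles[col] if counts[titles[col]] == 1 else col.replace(".", "-")
        for col in columns
    }


toy_titles = {"patient.age": "Age", "visit.age": "Age", "patient.sex": "Sex"}
print(build_column_map(list(toy_titles), toy_titles))
```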

### 8. Upload extracted dictionaries and data back to the project

In [None]:
cmd = "dx upload *.csv"
subprocess.check_call(cmd, shell=True)
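
The upload above relies on the shell to expand `*.csv`. A possible alternative, sketched here, expands the pattern in Python with `glob` and passes an explicit argument list, which avoids `shell=True`. It assumes `dx upload` accepts several file paths in one call; `build_upload_cmd` is a hypothetical helper name.

```python
import glob


def build_upload_cmd(pattern="*.csv"):
    """Return the `dx upload` argument list for all files matching `pattern`."""
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return ["dx", "upload", *files]


# Usage in the notebook (not executed here):
#   subprocess.check_call(build_upload_cmd("*.csv"))
```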