# Working with dxdata 
#### Datasets and Cohorts

`dxdata` is a Python module developed and maintained by DNAnexus. It provides functionality to access and describe data within Apollo through Dataset and Cohort objects. A Dataset object describes the logical organization of the data and maps that organization to the physical location of the data stored in a database. For example, it lets you assign _something_ a "Sample ID", a term with specific attributes, and then store, retrieve, and use that term in a consistent manner. Below, we walk through the highlights of `dxdata`: initiating a Spark cluster, loading a dataset or cohort, exploring the data, and retrieving data into a dataframe.

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, [DNAnexus] will have no warranty, support, or other obligations with respect to [Materials] provided hereunder.

### Import dxdata and initiate Spark cluster

In [None]:
import dxdata
import databricks.koalas as ks

In [None]:
# Connect to Spark
engine = dxdata.connect(dialect="hive+pyspark")

### dxdata.dataset

`dxdata` contains the following classes and respective attributes within the `dataset` sub-module. Items in **bold** are attributes that are commonly used in the provided JupyterLab Notebooks that leverage `dxdata`.

- Datasets
    - **_entities_** (list of Entity): Entities in the dataset.
    - _edges_ (list of Edge): Graph edges connecting entities.
    - _folders_ (list of dict/string): Folder hierarchy to organize fields in the UI.
    - _primary_entity_ (str): Name of the entity containing the global primary key.
    - _dashboards_ (dict or None): Mapping of dashboard names to dxlinks
    - **_pheno_geno_link_info_** (dict or None): Identifiers for the subject-assay linking table.


- Entity
    - **_name_** (str): Logical name for the entity. From the data dictionary column "entity".
    - **_fields_** (list of Field): Fields associated with this entity.
    - **_database_name_** (str): Name of the database containing this entity's data.
    - _database_id_ (str): Platform ID of the data object, e.g. "database-xxxx".
    - _primary_key_ (str): Name of the field to use as the primary key for database operations. Derived from the data dictionary column "primary_key_type".
    - _longitudinal_axis_ (str): Name of the field to use as the default longitudinal axis for analysis. Derived from the data dictionary column "is_longitudinal_axis".


- Field - Represents all other DataDictionary columns not covered by Entity and Edge 
    - **_name_** (str): Field's internal name.
    - _type_ (str): Primitive type name, e.g. "integer", "string", "date" ...
    - **_table_name_** (str): Database table where this field's data values are stored.
    - **_column_name_** (str): Database column where field values are stored.
    - **_coding_** (Coding): Coding instance that applies to this field's values.
    - _is_multi_select_ (bool): Whether the field can contain multiple values per cell (array/set type)
    - _is_sparse_coding_ (bool): Whether all data values should be covered by codings.
    - **_title_** (str): Optional.
    - _description_ (str): Optional.
    - _units_ (str): Optional.
    - _concept_ (str): Optional.
    - _linkout_ (str): Optional.
    - ** kwargs (dict of strings): Arbitrary additional attributes with string values.


- Edge - Represents DataDictionary columns: referenced_entity_field, relationship. Determines join_info. All joins are left joins.
    - _source_entity_ (str): Existing entity name
    - _source_field_ (str): Name of existing field in `source_entity`
    - _dest_entity_ (str): Existing entity name
    - _dest_field_ (str): Name of existing field in `dest_entity`
    - _relationship_ (str): "one_to_one" or "many_to_one"


- Coding - Capture meanings and hierarchy for categorical values.
    - _name_ (str): Name for the coding. (May be applied to multiple fields.)
    - **_codes_** (dict): Corresponds to 'encoding' element in descriptor JSON.
    - _hierarchy_ (list): Corresponds to 'hierarchy' element in descriptor JSON.
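
To make the nesting of these classes concrete, here is a minimal sketch using plain dataclasses. These are hypothetical stand-ins that only mirror a few of the attribute names listed above; they are not the `dxdata` classes, and the entity names, field names, and codes are made up for illustration. The sketch shows how each logical field can be resolved to a physical database location.

```python
import dataclasses
from dataclasses import dataclass
from typing import Dict, List, Optional


# Hypothetical stand-ins: NOT the dxdata classes, only the same attribute names.
@dataclass
class Coding:
    name: str
    codes: Dict[str, str]


@dataclass
class Field:
    name: str
    type: str
    table_name: str
    column_name: str
    title: Optional[str] = None
    coding: Optional[Coding] = None


@dataclass
class Entity:
    name: str
    database_name: str
    primary_key: str
    fields: List[Field] = dataclasses.field(default_factory=list)


@dataclass
class Edge:
    source_entity: str
    source_field: str
    dest_entity: str
    dest_field: str
    relationship: str


@dataclass
class Dataset:
    primary_entity: str
    entities: List[Entity] = dataclasses.field(default_factory=list)
    edges: List[Edge] = dataclasses.field(default_factory=list)


# Made-up example content.
sex_coding = Coding(name="sex_coding", codes={"0": "Female", "1": "Male"})
participant = Entity(
    name="participant",
    database_name="demo_db",
    primary_key="participant_id",
    fields=[
        Field("participant_id", "string", "participant", "participant_id", title="Participant ID"),
        Field("sex", "string", "participant", "sex", title="Sex", coding=sex_coding),
    ],
)
sample = Entity(
    name="sample",
    database_name="demo_db",
    primary_key="sample_id",
    fields=[
        Field("sample_id", "string", "sample", "sample_id", title="Sample ID"),
        Field("participant_id", "string", "sample", "participant_id", title="Participant ID"),
    ],
)
demo = Dataset(
    primary_entity="participant",
    entities=[participant, sample],
    edges=[Edge("sample", "participant_id", "participant", "participant_id", "many_to_one")],
)

# Map every logical field (entity.field) to its physical location (database.table.column).
locations = {
    f"{e.name}.{f.name}": f"{e.database_name}.{f.table_name}.{f.column_name}"
    for e in demo.entities
    for f in e.fields
}
for logical, physical in locations.items():
    print(f"{logical} -> {physical}")
```

The real objects carry more attributes than shown here, but the containment is the same: a Dataset holds entities and edges, an Entity holds fields, and a Field may reference a Coding.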

### Global functions in `dxdata` for loading data:
- _load_cohort_ : Read the contents of a Cohort record on the platform.
- _load_dataset_ : Read the contents of a Dataset record on the platform.
   

### A Dataset contains defined attributes, such as entities
#### Let's load a dataset to examine all attributes

In [None]:
# load with record ID
dataset = dxdata.load_dataset(id='record-G3814k006Fjgk5J74Kx3X1QG')

# OR with path to the dataset
# dataset = dxdata.load_dataset("/Datasets/Demo Dataset")

#### Entities

Let's look at the first attribute, `entities`, in the `Dataset` object. The `entities` attribute returns a list of `Entity` objects.

In [None]:
entities = dataset.entities
print(*entities, sep = "\n") 

#### Edges

The entities of a dataset are linked to each other by edges, and each edge has a specific relationship type (e.g., "one_to_one" or "many_to_one").

In [None]:
# Graph edges connecting entities.
for edge in dataset.edges[:4]:
    print(f"* Edge: {edge}")
    print(f"\tSource entity: {edge.source_entity}")
    print(f"\tDestination entity: {edge.dest_entity}")
    print(f"\tRelationship from source to destination: {edge.relationship}")
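
Because each edge names a source field and a destination field, a list of edges can be turned into a simple lookup of join conditions. The sketch below assumes nothing about `dxdata` beyond the five Edge attribute names listed earlier. The `Edge` namedtuple, the `joins_by_source` helper, and the entity names are hypothetical stand-ins used so that the example runs on its own.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same attribute names as a dxdata Edge.
Edge = namedtuple(
    "Edge", "source_entity source_field dest_entity dest_field relationship"
)

# Made-up edges for illustration.
edges = [
    Edge("sample", "participant_id", "participant", "participant_id", "many_to_one"),
    Edge("hospital_visit", "participant_id", "participant", "participant_id", "many_to_one"),
    Edge("participant", "participant_id", "baseline", "participant_id", "one_to_one"),
]


def joins_by_source(edge_list):
    """Group edges by source entity as (destination, join condition, relationship)."""
    joins = defaultdict(list)
    for e in edge_list:
        condition = f"{e.source_entity}.{e.source_field} = {e.dest_entity}.{e.dest_field}"
        joins[e.source_entity].append((e.dest_entity, condition, e.relationship))
    return dict(joins)


join_map = joins_by_source(edges)
for source, targets in join_map.items():
    for dest, condition, relationship in targets:
        print(f"{source} -> {dest} ({relationship}): {condition}")
```

With a loaded dataset, the same helper could presumably be applied to `dataset.edges`, since it only reads the attributes shown above.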

#### Primary Entity
The `primary_entity` attribute holds the name of the entity containing the global primary key.

In [None]:
# primary_entity (str): Name of the entity containing the global primary key.
prim = dataset.primary_entity
prim

## Explore Entity Object

Attributes of an entity object that are of type `str` can be displayed like this:

In [None]:
for entity in dataset.entities:
    print(f'Entity: {entity.name}')
    print(f'    * Entity Title: {entity.entity_title}')
    print(f'    * Entity Label Singular: {entity.entity_label_singular}')
    print(f'    * Entity Label Plural: {entity.entity_label_plural}')
    print(f'    * Entity Primary Key: {entity.primary_key}')

Since `entity.fields` returns a list of `Field` objects, which have their own attributes, we use a nested for loop to iterate over both the list of entities and each entity's list of fields.

In [None]:
for entity in dataset.entities:
    print(f'Entity: {entity.name}')
    # To print only the first few fields, slice the list, e.g. entity.fields[:5]
    for field in entity.fields:
        print(f'     Field name: {field.name}')
        print(f'     Field title: {field.title}')
        print('\n')

    print('----------')

### Class Function in `Entity`
The `dataset` object has 5 entities, and each entity has a name that can be displayed with `entity.name`. To call the `find_field` method, we first need a single `entity` object, selected either from the list returned by `dataset.entities` or by name.

`find_field` returns a `field` object, so we access that object's attributes to display its content.

In [None]:
# First, assign the entity
pheno = dataset["phenotype"]

# Find a field title
bmi_field = pheno.find_field(title="Body mass index (BMI) | Instance 1")

# Print information on the field title
print(f"Field title: {bmi_field.title}, Field column name: {bmi_field.column_name}")
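
The lookup above relies on knowing the exact title. When only part of a title is known, one option is to scan the list of fields and compare titles directly. The helper below is a hypothetical sketch, not part of `dxdata`: it only assumes that each field object exposes a `title` attribute (which may be empty), as listed earlier. The stand-in objects and their names are made up so the example runs without a dataset.

```python
from types import SimpleNamespace


def titles_containing(fields, text):
    """Return the fields whose title contains `text`, ignoring case."""
    needle = text.lower()
    # Titles are optional, so skip fields without one.
    return [f for f in fields if f.title and needle in f.title.lower()]


# Stand-in field objects with made-up names and titles.
demo_fields = [
    SimpleNamespace(name="bmi_i0", title="Body mass index (BMI) | Instance 0"),
    SimpleNamespace(name="bmi_i1", title="Body mass index (BMI) | Instance 1"),
    SimpleNamespace(name="sex", title="Sex"),
    SimpleNamespace(name="untitled", title=None),
]

matches = titles_containing(demo_fields, "body mass index")
for f in matches:
    print(f"{f.name}: {f.title}")
```

With a loaded dataset, the same helper could be called as `titles_containing(pheno.fields, "body mass index")`, and the exact title of the match of interest then passed to `find_field`.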

## Explore Field Object

Let's look at all the attributes of `bmi_field`, the `field` object we just created.

In [None]:
for attr, value in vars(bmi_field).items():
    print("Field", attr, ":", value)
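
Many field attributes are optional, so the full listing can contain a lot of empty values. As a small sketch, the hypothetical helper below filters the result of `vars()` down to the attributes that are actually set. It is shown on a stand-in object with made-up values, and it assumes the attributes are stored on the instance, as the `vars()` call above implies.

```python
from types import SimpleNamespace


def non_empty_attributes(obj):
    """Return the instance attributes of `obj` that are not None or empty."""
    return {k: v for k, v in vars(obj).items() if v not in (None, "", [], {})}


# Stand-in object with made-up values; not a dxdata Field.
demo_field = SimpleNamespace(
    name="bmi_i1",
    title="Body mass index (BMI) | Instance 1",
    units="",
    description=None,
    is_multi_select=False,
)

summary = non_empty_attributes(demo_field)
for attr, value in summary.items():
    print("Field", attr, ":", value)
```

Note that `False` and `0` are kept, since they are real values rather than missing ones.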

## Extracting data from a dataset into a dataframe

### Retrieve Fields

The `retrieve_fields()` function can be called on a dataset or cohort object (or, as below, on an entity) to extract data into a dataframe of your choosing. The function natively returns a Spark dataframe, which can be cast to other common types, e.g. pandas (`.toPandas()`) or koalas (`.to_koalas()`).

In [None]:
# First, start out with a list of fields that you want to export
# Find by exact title
field_sex = pheno.find_field(title="Sex")
field_age = pheno.find_field(title="Age at recruitment")
field_smoke = pheno.find_field(title="Ever smoked | Instance 0")

# bmi_field is a single Field object, so wrap it in a list before concatenating
field_list = [field_sex, field_smoke, field_age] + [bmi_field]

In [None]:
# Extract data, decode any codings, and cast to a koalas dataframe
pheno_data = pheno.retrieve_fields(engine=engine, fields=field_list, coding_values="replace").to_koalas()

# See first five entries
pheno_data.head()