# Working with dxdata 
#### Datasets and Cohorts

`dxdata` is a Python module developed and maintained by DNAnexus. It provides functionality to access and describe data within Apollo through Dataset and Cohort objects. A Dataset object describes the logical content of the data and maps that logical organization to the physical location of the data stored in a database. For example, it lets you assign _something_ a "Sample ID," a term with specific attributes, and then store, retrieve, and use that term in a consistent manner. Below, we walk through the highlights of `dxdata`: initiating a Spark cluster, loading a dataset or cohort, exploring the data, and retrieving data into a dataframe.

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, [DNAnexus] will have no warranty, support, or other obligations with respect to [Materials] provided hereunder.

### Import dxdata and initiate Spark cluster

In [1]:
import dxdata
import databricks.koalas as ks

In [2]:
# Connect to Spark
engine = dxdata.connect(dialect="hive+pyspark")

### dxdata.dataset

`dxdata` contains the following classes and respective attributes within the `dataset` sub-module. Items in **bold** are attributes that are commonly used in the provided JupyterLab Notebooks that leverage `dxdata`.

- Datasets
    - **_entities_** (list of Entity): Entities in the dataset.
    - _edges_ (list of Edge): Graph edges connecting entities.
    - _folders_ (list of dict/string): Folder hierarchy to organize fields in the UI.
    - _primary_entity_ (str): Name of the entity containing the global primary key.
    - _dashboards_ (dict or None): Mapping of dashboard names to dxlinks
    - **_pheno_geno_link_info_** (dict or None): Identifiers for the subject-assay linking table.


- Entity
    - **_name_** (str): Logical name for the entity. From the data dictionary column "entity".
    - **_fields_** (list of Field): Fields associated with this entity.
    - **_database_name_** (str): Name of the database containing this entity's data.
    - _database_id_ (str): Platform ID of the data object, e.g. "database-xxxx".
    - _primary_key_ (str): Name of the field to use as the primary key for database operations. Derived from the data dictionary column "primary_key_type".
    - _longitudinal_axis_ (str): Name of the field to use as the default longitudinal axis for analysis. Derived from the data dictionary column "is_longitudinal_axis".


- Field - Represents all other DataDictionary columns not covered by Entity and Edge 
    - **_name_** (str): Field's internal name.
    - _type_ (str): Primitive type name, e.g. "integer", "string", "date" ...
    - **_table_name_** (str): Database table where this field's data values are stored.
    - **_column_name_** (str): Database column where field values are stored.
    - **_coding_** (Coding): Coding instance that applies to this field's values.
    - _is_multi_select_ (bool): Whether the field can contain multiple values per cell (array/set type)
    - _is_sparse_coding_ (bool): Whether all data values should be covered by codings.
    - **_title_** (str): Optional.
    - _description_ (str): Optional.
    - _units_ (str): Optional.
    - _concept_ (str): Optional.
    - _linkout_ (str): Optional.
    - `**kwargs` (dict of strings): Arbitrary additional attributes with string values.


- Edge - Represents DataDictionary columns: referenced_entity_field, relationship. Determines join_info. All joins are left joins.
    - _source_entity_ (str): Existing entity name
    - _source_field_ (str): Name of existing field in `source_entity`
    - _dest_entity_ (str): Existing entity name
    - _dest_field_ (str): Name of existing field in `dest_entity`
    - _relationship_ (str): "one_to_one" or "many_to_one"


- Coding - Capture meanings and hierarchy for categorical values.
    - _name_ (str): Name for the coding. (May be applied to multiple fields.)
    - **_codes_** (dict): Corresponds to 'encoding' element in descriptor JSON.
    - _hierarchy_ (list): Corresponds to 'hierarchy' element in descriptor JSON.
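
To make the role of `codes` concrete, the sketch below decodes raw values with a plain dictionary. It does not call `dxdata`: `sex_codes` and `decode` are hypothetical names, the code values are invented for illustration, and the assumption that codes are keyed by strings is ours rather than something the module documents here.

```python
# Hypothetical stand-in for Coding.codes: raw stored value -> human-readable meaning.
# These values are invented for illustration and do not come from any dataset.
sex_codes = {"1": "Male", "2": "Female"}


def decode(raw_values, codes):
    """Replace each raw value with its meaning; leave unmapped values unchanged."""
    return [codes.get(str(value), value) for value in raw_values]


decoded = decode([1, 2, 9], sex_codes)
print(decoded)
```

Values without a matching code (here `9`) pass through untouched, which is one reasonable way to handle a coding that does not cover every data value.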

### Global functions in `dxdata` for loading data:
- _load_cohort_ : Read the contents of a cohort record on the platform.
- _load_dataset_ : Read the contents of a Dataset record on the platform.
   

### A Dataset contains defined attributes, such as entities
#### Let's load a dataset to examine all attributes

In [3]:
# load with record ID
dataset = dxdata.load_dataset(id='record-G3814k006Fjgk5J74Kx3X1QG')

# OR with path to the dataset
# dataset = dxdata.load_dataset('/Datasets/Demo Dataset')

#### Entities

Let's look at the first attribute, `entities`, of the `Dataset` object. The `entities` attribute returns a list of `Entity` objects.

In [4]:
entities = dataset.entities
print(*entities, sep = "\n") 

<Entity "labs">
<Entity "variants">
<Entity "diagnoses">
<Entity "pg_link2">
<Entity "medications">
<Entity "patients">


#### Edges

The entities of a dataset are linked to each other by edges, and each edge may carry a specific relationship type (e.g., "one to one" or "one to many").

In [5]:
# Graph edges connecting entities.
for edge in dataset.edges[:4]:
    #print("* Edge:{}-{}.{}.{}".format(edge, edge.source_entity, edge.dest_entity, edge.relationship))
    print(f"* Edge: {edge}")
    print(f"\tSource entity: {edge.source_entity}")
    print(f"\tDestination entity: {edge.dest_entity}")
    print(f"\tRelationship from source to destination: {edge.relationship}")

* Edge: pheno_geno_sample_ids:p_sample -> pheno_sample_1:sample_id
	Source entity: pheno_geno_sample_ids
	Destination entity: phenotype
	Relationship from source to destination: one_to_many
* Edge: genotype_alt_read_optimized:sample_id -> pheno_geno_sample_ids:g_sample
	Source entity: genotype
	Destination entity: pheno_geno_sample_ids
	Relationship from source to destination: one_to_many
* Edge: allele_read_optimized:a_id -> genotype_alt_read_optimized:a_id
	Source entity: allele
	Destination entity: genotype
	Relationship from source to destination: one_to_many
* Edge: annotation_read_optimized:a_id -> genotype_alt_read_optimized:a_id
	Source entity: annotation
	Destination entity: genotype
	Relationship from source to destination: one_to_many
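
The edge list can also be folded into a simple adjacency map, which makes it easier to see which entities a given entity points to. The sketch below is a minimal illustration that uses `SimpleNamespace` stand-ins instead of real `Edge` objects; `neighbors_by_source` is a hypothetical helper, and it relies only on the `source_entity`, `dest_entity`, and `relationship` attributes printed above.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins for Edge objects, mirroring three of the edges printed above.
edges = [
    SimpleNamespace(source_entity="genotype", dest_entity="pheno_geno_sample_ids",
                    relationship="one_to_many"),
    SimpleNamespace(source_entity="allele", dest_entity="genotype",
                    relationship="one_to_many"),
    SimpleNamespace(source_entity="annotation", dest_entity="genotype",
                    relationship="one_to_many"),
]


def neighbors_by_source(edges):
    """Group (destination entity, relationship) pairs under each source entity."""
    graph = defaultdict(list)
    for edge in edges:
        graph[edge.source_entity].append((edge.dest_entity, edge.relationship))
    return dict(graph)


graph = neighbors_by_source(edges)
for source, targets in graph.items():
    print(source, "->", targets)
```

With a loaded dataset, the same helper should accept `dataset.edges` directly, assuming those attributes behave as shown in the cell above.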


#### Primary Entity
Name of the entity containing the global primary key.

In [5]:
# primary_entity (str): Name of the entity containing the global primary key.
prim = dataset.primary_entity
prim

<Entity "patients">

## Explore Entity Object

The `str`-typed attributes of each entity object can be displayed like this:

In [19]:
# entity = dataset.entities
for entity in dataset.entities:
    print(f'Entity: {entity.name}')
    print(f'    * Entity Title: {entity.entity_title}')
    print(f'    * Entity Label Singular: {entity.entity_label_singular}')
    print(f'    * Entity Label Plural: {entity.entity_label_plural}')
    print(f'    * Entity Primary Key: {entity.primary_key}')

Entity: labs
    * Entity Title: Lab
    * Entity Label Singular: Lab
    * Entity Label Plural: Labs
    * Entity Primary Key: lab_id
Entity: variants
    * Entity Title: Variant
    * Entity Label Singular: Variant
    * Entity Label Plural: Variants
    * Entity Primary Key: None
Entity: diagnoses
    * Entity Title: Diagnosis
    * Entity Label Singular: Diagnosis
    * Entity Label Plural: Diagnoses
    * Entity Primary Key: diagnosis_id
Entity: pg_link2
    * Entity Title: pg_link2
    * Entity Label Singular: pg_link2
    * Entity Label Plural: pg_link2
    * Entity Primary Key: None
Entity: medications
    * Entity Title: Medication
    * Entity Label Singular: Medication
    * Entity Label Plural: Medications
    * Entity Primary Key: medication_id
Entity: patients
    * Entity Title: Patient
    * Entity Label Singular: Patient
    * Entity Label Plural: Patients
    * Entity Primary Key: patient_id
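
Some entities above report `None` as their primary key. A small lookup table makes those easy to spot. This is a sketch using `SimpleNamespace` stand-ins for three of the entities listed above; `primary_keys` is a hypothetical helper that reads only the `name` and `primary_key` attributes.

```python
from types import SimpleNamespace

# Stand-ins for Entity objects, copied from the output above.
entities = [
    SimpleNamespace(name="labs", primary_key="lab_id"),
    SimpleNamespace(name="variants", primary_key=None),
    SimpleNamespace(name="patients", primary_key="patient_id"),
]


def primary_keys(entities):
    """Map each entity name to its primary key field name (or None)."""
    return {entity.name: entity.primary_key for entity in entities}


keys = primary_keys(entities)
missing = [name for name, key in keys.items() if key is None]
print(keys)
print("Entities without a primary key:", missing)
```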


Since `entity.fields` returns a list of `Field` objects, each with its own attributes, we need a nested `for` loop to iterate over both the list of entities and each entity's list of fields. The goal is to view the fields that exist for each entity.

In [37]:
# for entity in dataset.entities[:2]:
#     for field in entity.fields[:6]:
#         print(f' Entity: {entity.name} - Field: {field.name}')
#     print('----------')

n_entities = len(dataset.entities)
for entity in dataset.entities[:n_entities]:
    n_fields = len(entity.fields)
    print(f'Entity: {entity.name}')
    for field in entity.fields[:n_fields]:
        print(f'     Field column name: {field.name}')
        print(f'     Field title: {field.title}')
        print('\n')
            
    print('----------')

Entity: labs
     Field column name: o2_saturation
     Field title: Oxygen Saturation


     Field column name: alt
     Field title: Alanine Transaminase (Alanine AminoTransferase)


     Field column name: phosphate
     Field title: Phosphate


     Field column name: hgb
     Field title: Hemoglobin


     Field column name: heart_rate
     Field title: Heart Rate


     Field column name: bilirubin
     Field title: Bilirubin, Total


     Field column name: hct
     Field title: Hematocrit


     Field column name: magnesium
     Field title: Magnesium


     Field column name: lab_id
     Field title: lab_id


     Field column name: ast
     Field title: Aspartate Transaminase (Aspartate AminoTransferase)


     Field column name: temperature
     Field title: Temperature


     Field column name: potassium
     Field title: Potassium


     Field column name: lab_date
     Field title: Lab Date


     Field column name: patient_id
     Field title: patient_id


     Field col
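
Printing every field quickly produces long output. An alternative is to flatten the same information into rows and write them as CSV. The sketch below is only an illustration: it uses `SimpleNamespace` stand-ins for an entity and two of its fields, and `field_rows` is a hypothetical helper, not part of `dxdata`.

```python
import csv
import io
from types import SimpleNamespace

# Stand-in for one Entity with two Field objects, taken from the output above.
labs = SimpleNamespace(
    name="labs",
    fields=[
        SimpleNamespace(name="o2_saturation", title="Oxygen Saturation"),
        SimpleNamespace(name="hgb", title="Hemoglobin"),
    ],
)


def field_rows(entities):
    """Yield one (entity name, field name, field title) row per field."""
    for entity in entities:
        for field in entity.fields:
            yield (entity.name, field.name, field.title)


# Write to an in-memory buffer; a real file handle would work the same way.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["entity", "field", "title"])
writer.writerows(field_rows([labs]))
catalog_csv = buffer.getvalue()
print(catalog_csv)
```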

### Class Function in `Entity`
The `dataset` object has 6 unique entities for this particular dataset. Each entity has a name, available as `entity.name`, and a title, available as `entity.entity_title`, as listed above. To call the class function `find_field` on an entity, we first have to pick one, either from the list returned by `dataset.entities` or by name.

This function returns a `Field` object, so we need to access that object's attributes to display its content.

In [38]:
# First, assign the entity. Labs was arbitrarily chosen
labs_entity = dataset["labs"]

# Find a field of the chosen entity by its name, copied as it appears above
bilirubin_field = labs_entity.find_field("bilirubin")

# Print the field's title and column name
print(f"Field title: {bilirubin_field.title}, Field column name: {bilirubin_field.column_name}")

Field title: Bilirubin, Total, Field column name: bilirubin
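
`find_field` is used here with an exact name. When only part of a title is known, one option is to scan `entity.fields` directly. The sketch below shows the idea with `SimpleNamespace` stand-ins for three fields of the `labs` entity; `search_fields` is a hypothetical helper and not a `dxdata` function.

```python
from types import SimpleNamespace

# Stand-ins for Field objects, with names and titles copied from the output above.
fields = [
    SimpleNamespace(name="bilirubin", title="Bilirubin, Total"),
    SimpleNamespace(name="alt", title="Alanine Transaminase (Alanine AminoTransferase)"),
    SimpleNamespace(name="ast", title="Aspartate Transaminase (Aspartate AminoTransferase)"),
]


def search_fields(fields, text):
    """Return fields whose title contains `text`, ignoring case."""
    needle = text.lower()
    # Titles are optional, so treat a missing title as an empty string.
    return [field for field in fields if needle in (field.title or "").lower()]


matches = [field.name for field in search_fields(fields, "transaminase")]
print(matches)
```

With a real entity, passing `labs_entity.fields` as the first argument should work the same way, assuming each field exposes `title` as shown earlier.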


## Explore Field Object

Let's look at all the attributes of the `Field` object we just created, `bilirubin_field`. The same approach can be used to explore any field in more depth.

In [41]:
for attr, value in vars(bilirubin_field).items():
    print("Field", attr, ":", value)

Field name : bilirubin
Field coding : None
Field entity : <Entity "labs">
Field title : Bilirubin, Total
Field description : None
Field units : mg/dL
Field concept : None
Field linkout : None
Field folder_path : ['Labs', 'Liver functions']
Field longitudinal_axis_type : None
Field type : double
Field is_multi_select : False
Field is_sparse_coding : False
Field table_name : labs
Field column_name : bilirubin
Field database_name : xds_clinical_demo_dataset
Field database_id : database-G13pbx80bZgVPgj9Bzx4zfpP
Field optimized_column : None
Field primary_table_name : labs
Field logical_table_name : phenotype


## Extracting data from a dataset into a dataframe

### Retrieve Fields

The `retrieve_fields()` function can be called on any dataset or cohort object to extract data into a dataframe of your choosing. The function natively returns a Spark dataframe, which can be cast to other common types, e.g. pandas (`.toPandas()`) or koalas (`.to_koalas()`).

In [49]:
labs_entity = dataset["labs"]

# First, start out with a list of fields that you want to export
# Find by exact title
field_O2sat = labs_entity.find_field(title="Oxygen Saturation")
field_alt = labs_entity.find_field(title="Alanine Transaminase (Alanine AminoTransferase)")
field_P = labs_entity.find_field(title="Phosphate")

field_list = [field_O2sat, field_alt, field_P]

# Alternatively, this could be done by specifying the column name
# field_O2sat = labs_entity.find_field("o2_saturation")
# field_alt = labs_entity.find_field("alt")
# field_P = labs_entity.find_field("phosphate")

# field_list = [field_O2sat, field_alt, field_P]

field_list

[<Field "o2_saturation">, <Field "alt">, <Field "phosphate">]

In [51]:
# Extract data, decode any codings, and cast to a koalas dataframe
labs_data = labs_entity.retrieve_fields(engine=engine, fields=field_list, coding_values="replace").to_koalas()

# See first five entries
labs_data.head()

|   | o2_saturation | alt | phosphate | lab_id |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 188 |
| 1 | 0.0 | 0.0 | 0.0 | 179 |
| 2 | 100.0 | 0.0 | 0.0 | 52 |
| 3 | 99.0 | 0.0 | 0.0 | 137 |
| 4 | 100.0 | 0.0 | 0.0 | 139 |
