# Introduction to working with Molecular Expression Datasets
***

***As-Is Software Disclaimer***

*The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.*

*[MIT License](https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md) applies to this notebook.*
***

**Launch spec:**
- App name: JupyterLab with Spark Cluster
- Kernel: Python 3
- Instance type: mem1_ssd1_v2_x16
- Spark cluster configuration: single node
- Runtime: ~5 min

**Package dependencies:** 
- pprint [License](https://docs.python.org/3/license.html?#psf-license)
- pyspark [License](http://www.apache.org/licenses/LICENSE-2.0)

**Data description:** The record ID of a Dataset (or Cohort) that has an instance of a MolecularExpressionAssay. All molecular expression data in this notebook is synthetic.

**This notebook shows how to:** Access molecular expression data from a Dataset.
***

#### Load packages using `import` 

In [4]:
import dxdata
import pprint
import pyspark
from pyspark.sql import functions as F

#### Initiate Spark

In [5]:
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

#### Load Dataset (or Cohort)

In [3]:
dataset = dxdata.load_dataset(id="record-G7y66Vj0GjvK7xbp4Xg2Zjy5")

#### View attributes of the Dataset

In [4]:
pprint.pprint(dataset.__dict__)

{'annotations': [],
 'assays': DxOrderedDict([('my_expression_project',
                           <dxdata.dataset.assay.MolecularExpressionAssay object at 0x7f666adbc5f8>)]),
 'dashboards': OrderedDict(),
 'edges': [<dxdata.dataset.dataset.Edge object at 0x7f666adbc278>,
           <dxdata.dataset.dataset.Edge object at 0x7f666adbc358>],
 'entities': [<Entity "sample">],
 'entities_by_name': OrderedDict([('sample', <Entity "sample">)]),
 'primary_dashboard': None,
 'primary_entity': <Entity "sample">}


#### View Assay instances in the Dataset

In [5]:
dataset.assays

DxOrderedDict([('my_expression_project',
                <dxdata.dataset.assay.MolecularExpressionAssay at 0x7f666adbc5f8>)])

#### Select the desired assay and inspect the entities it exposes

In [6]:
assay = dataset.assays["my_expression_project"]
assay.entities

OrderedDefaultDict([('expr_annotation', <Entity "expr_annotation">),
                    ('expression', <Entity "expression">)])

#### Identify the table and unique database name for the 'expression' Entity

In [20]:
table = "expression"

database_id = assay.entities[table].database_id.lower().replace('-', '_')
database_name = assay.entities[table].database_name
unique_database_name = f"{database_id}__{database_name}"
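
The naming convention used in the cell above can be wrapped in a small helper, which makes it easier to reuse for other entities such as `expr_annotation`. This is a minimal sketch: the function name is hypothetical, and the database ID and database name in the example are placeholders, not values from this Dataset.

```python
def build_unique_database_name(database_id: str, database_name: str) -> str:
    """Combine a database ID and a database name into one Spark database name.

    Mirrors the convention used in the notebook: lowercase the ID,
    replace '-' with '_', then join ID and name with a double underscore.
    """
    normalized_id = database_id.lower().replace("-", "_")
    return f"{normalized_id}__{database_name}"


# Placeholder values for illustration only (not taken from a real Dataset)
example_name = build_unique_database_name("database-ABC123xyz", "my_expression_db")
print(example_name)  # database_abc123xyz__my_expression_db
```

In the notebook itself, the two arguments would come from `assay.entities[table].database_id` and `assay.entities[table].database_name`.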

#### Use the Assay information to run a query that extracts molecular expression data into a Spark DataFrame

In [7]:
# Build SQL
sql = f"SELECT * FROM {unique_database_name}.{table}"

# Build Spark DataFrame from SQL
sdf = spark.sql(sql)
sdf.show(10)

+---------------+---------+-----+
|     feature_id|sample_id|value|
+---------------+---------+-----+
|ENST00000346162| sample_2|   85|
|ENST00000438810| sample_1|   68|
|ENST00000305531| sample_2|   88|
|ENST00000427231| sample_3|   27|
|ENST00000427970| sample_3|    4|
|ENST00000290037| sample_3|   21|
|ENST00000358472| sample_1|    5|
|ENST00000398738| sample_1|    8|
|ENST00000421624| sample_1|   90|
|ENST00000397381| sample_2|   24|
+---------------+---------+-----+
only showing top 10 rows
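
`SELECT *` returns every row of the table. When only a few features are of interest, the query can be narrowed before it reaches Spark. The sketch below builds such a query string using the three column names shown in the output above. The function name is hypothetical, and the ID validation rule (letters, digits, underscore, dot) is an assumption made here so that values can be placed into the SQL text safely.

```python
import re

# Assumption: feature IDs contain only letters, digits, underscores, and dots
_SAFE_ID = re.compile(r"[A-Za-z0-9_.]+")


def build_expression_query(unique_database_name, table, feature_ids=None):
    """Build a SELECT statement, optionally restricted to a list of feature IDs."""
    sql = f"SELECT feature_id, sample_id, value FROM {unique_database_name}.{table}"
    if feature_ids:
        for feature_id in feature_ids:
            # Reject anything that could alter the SQL statement
            if not _SAFE_ID.fullmatch(feature_id):
                raise ValueError(f"Unexpected characters in feature ID: {feature_id!r}")
        quoted = ", ".join(f"'{feature_id}'" for feature_id in feature_ids)
        sql += f" WHERE feature_id IN ({quoted})"
    return sql


# Placeholder database name; the feature IDs appear in the output above
query = build_expression_query(
    "database_abc123xyz__my_expression_db",
    "expression",
    ["ENST00000346162", "ENST00000438810"],
)
print(query)
```

The resulting string can be passed to `spark.sql(...)` in the same way as the unfiltered query.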



#### Convert long format table to count matrix
*Note: By default, Spark allows at most 10,000 distinct values in the pivot column (`spark.sql.pivotMaxValues`). If the resulting table would have more columns than that, raise the limit before pivoting with `spark.conf.set('spark.sql.pivotMaxValues', max_n)`, where `max_n` is at least the number of distinct values. See the Spark documentation at https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration*
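
One way to apply the note above is to count the distinct values in the pivot column first and raise the limit only when needed. The helper below is a sketch, not part of `dxdata` or Spark: its name and the `headroom` parameter are hypothetical. It assumes `spark` is an active `SparkSession` and `sdf` is a Spark DataFrame, and it uses only `select`, `distinct`, `count`, `spark.conf.get`, and `spark.conf.set`.

```python
PIVOT_LIMIT_KEY = "spark.sql.pivotMaxValues"


def raise_pivot_limit(spark, sdf, pivot_column, headroom=0):
    """Raise spark.sql.pivotMaxValues if pivot_column has too many distinct values.

    Returns a tuple (number of distinct values, limit in effect afterwards).
    """
    # Counting distinct values triggers a Spark job, so this has a cost
    n_distinct = sdf.select(pivot_column).distinct().count()
    # Configuration values are returned as strings; 10000 is Spark's default
    current_limit = int(spark.conf.get(PIVOT_LIMIT_KEY, "10000"))
    needed = n_distinct + headroom
    if needed > current_limit:
        spark.conf.set(PIVOT_LIMIT_KEY, str(needed))
        current_limit = needed
    return n_distinct, current_limit
```

Calling `raise_pivot_limit(spark, sdf, "sample_id")` before the pivot leaves the setting untouched for small datasets like the one in this notebook.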

In [8]:
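# Pivot to matrix format: one row per feature, one column per sample.
# F.first assumes a single value per (feature_id, sample_id) pair.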
sdf_pivot = sdf.groupBy("feature_id").pivot("sample_id").agg(F.first("value"))
sdf_pivot.show(5)

+---------------+--------+--------+--------+
|     feature_id|sample_1|sample_2|sample_3|
+---------------+--------+--------+--------+
|ENST00000427970|      53|      34|       4|
|ENST00000448786|      37|      74|      29|
|ENST00000394713|      76|      76|      24|
|ENST00000459639|      10|      78|      52|
|ENST00000445980|      22|      46|      58|
+---------------+--------+--------+--------+
only showing top 5 rows
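
To make the reshaping concrete, the following plain-Python sketch performs the same long-to-wide transformation on six rows, using values for the first two features shown in the output above. It only illustrates what the pivot does and is not how Spark implements it. Keeping the first value seen for each pair is a loose analogue of `F.first`; the function name is hypothetical.

```python
from collections import defaultdict

# Long-format rows: (feature_id, sample_id, value)
long_rows = [
    ("ENST00000427970", "sample_1", 53),
    ("ENST00000427970", "sample_2", 34),
    ("ENST00000427970", "sample_3", 4),
    ("ENST00000448786", "sample_1", 37),
    ("ENST00000448786", "sample_2", 74),
    ("ENST00000448786", "sample_3", 29),
]


def pivot_long_to_wide(rows):
    """Group rows by feature_id and spread sample_id values into columns."""
    matrix = defaultdict(dict)
    for feature_id, sample_id, value in rows:
        # setdefault keeps the first value seen for each (feature, sample) pair
        matrix[feature_id].setdefault(sample_id, value)
    return dict(matrix)


wide = pivot_long_to_wide(long_rows)
samples = sorted({sample_id for _, sample_id, _ in long_rows})

# Print a small matrix: header row, then one row per feature
print("feature_id", *samples)
for feature_id, row in wide.items():
    print(feature_id, *(row.get(sample_id) for sample_id in samples))
```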



#### A Spark DataFrame may be written to a file, further analyzed using Spark (or Koalas), or read into memory and analyzed using pandas.
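
Two of these options can be sketched as small helpers. Both function names, the row cap, and the output path are hypothetical. The helpers assume their argument is a Spark DataFrame such as `sdf_pivot`, and they rely only on the standard DataFrame methods `limit`, `toPandas`, and `write.mode(...).parquet(...)`.

```python
def to_pandas_preview(sdf, max_rows=1000):
    """Collect at most max_rows rows to the driver as a pandas DataFrame.

    The cap guards against pulling a very large table into memory.
    """
    return sdf.limit(max_rows).toPandas()


def write_parquet(sdf, path):
    """Write the DataFrame to Parquet, replacing any existing output at path."""
    sdf.write.mode("overwrite").parquet(path)
```

For example, `to_pandas_preview(sdf_pivot)` would return the count matrix for in-memory analysis, and `write_parquet(sdf_pivot, "expression_matrix.parquet")` would persist it under a placeholder path.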