# Introduction to working with a MolecularExpressionAssay

This notebook shows how to:

1. Parse MolecularExpressionAssay metadata from a Dataset
2. Access molecular expression data from a Molecular Expression Dataset

## 1: Parse MolecularExpressionAssay metadata from a Dataset

#### From the JupyterLab instance, select a Python3 kernel and import dxdata

In [2]:
import dxdata

#### Load Dataset (or Cohort)

In [10]:
molecular_expression_dataset = dxdata.load_dataset(id="record-G7y66Vj0GjvK7xbp4Xg2Zjy5")

In [11]:
molecular_expression_dataset.__dict__

{'entities': [<Entity "sample">],
 'edges': [<dxdata.dataset.dataset.Edge at 0x7fc259133a90>,
  <dxdata.dataset.dataset.Edge at 0x7fc259126c18>],
 'entities_by_name': OrderedDict([('sample', <Entity "sample">)]),
 'primary_entity': <Entity "sample">,
 'dashboards': OrderedDict(),
 'primary_dashboard': None,
 'annotations': [],
 'assays': DxOrderedDict([('my_expression_project',
                 <dxdata.dataset.assay.MolecularExpressionAssay at 0x7fc259125ac8>)])}

In [15]:
molecular_expression_dataset.assays

DxOrderedDict([('my_expression_project',
                <dxdata.dataset.assay.MolecularExpressionAssay at 0x7fc259125ac8>)])

#### Select the assay you want and inspect the Assay object for its metadata

In [12]:
assay = molecular_expression_dataset.assays["my_expression_project"]
assay.entities

OrderedDefaultDict([('expr_annotation', <Entity "expr_annotation">),
                    ('expression', <Entity "expression">)])

## 2: Access molecular expression data from a Molecular Expression Dataset

#### A Spark-enabled JupyterLab instance can be used to access, analyze, or extract the data referenced in a Molecular Expression Assay Loader Dataset. In the JupyterLab instance, select a Python3 kernel and set up the environment.

In [5]:
import dxdata
import pyspark

# Initiate Spark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

#### Extract assay information and run a query

In [14]:
# Specify dataset and assay
dataset = dxdata.load_dataset(id="record-G7y66Vj0GjvK7xbp4Xg2Zjy5")
assay = dataset.assays["my_expression_project"]

# Build SQL
table = "expression"
database = assay.entities[table].database_name
sql = f"SELECT * FROM {database}.{table}"

# Build Spark DataFrame from SQL
sdf = spark.sql(sql)
sdf.show(10)

+---------------+---------+-----+
|     feature_id|sample_id|value|
+---------------+---------+-----+
|ENST00000346162| sample_2|   85|
|ENST00000438810| sample_1|   68|
|ENST00000305531| sample_2|   88|
|ENST00000427231| sample_3|   27|
|ENST00000427970| sample_3|    4|
|ENST00000290037| sample_3|   21|
|ENST00000358472| sample_1|    5|
|ENST00000398738| sample_1|    8|
|ENST00000421624| sample_1|   90|
|ENST00000397381| sample_2|   24|
+---------------+---------+-----+
only showing top 10 rows
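
The query above selects every row. To restrict it to particular features or samples, the SQL string can be built with a `WHERE` clause, as in the minimal sketch below. The helper `build_expression_query` and the database name `example_db` are hypothetical and not part of dxdata; the column names `feature_id`, `sample_id`, and `value` are taken from the output shown above. In a real session, `database` would come from `assay.entities[table].database_name`.

```python
import re

# Accept only simple identifiers so values can be embedded in SQL text safely.
_SAFE_ID = re.compile(r"^[A-Za-z0-9_.\-]+$")


def _quoted_list(values):
    """Return values as a comma-separated list of single-quoted SQL literals."""
    for v in values:
        if not _SAFE_ID.match(str(v)):
            raise ValueError(f"Unexpected characters in identifier: {v!r}")
    return ", ".join(f"'{v}'" for v in values)


def build_expression_query(database, table="expression",
                           feature_ids=None, sample_ids=None):
    """Hypothetical helper: build a SELECT limited to given features/samples."""
    sql = f"SELECT feature_id, sample_id, value FROM {database}.{table}"
    conditions = []
    if feature_ids:
        conditions.append(f"feature_id IN ({_quoted_list(feature_ids)})")
    if sample_ids:
        conditions.append(f"sample_id IN ({_quoted_list(sample_ids)})")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# "example_db" is a placeholder for assay.entities["expression"].database_name
sql = build_expression_query(
    "example_db",
    feature_ids=["ENST00000346162"],
    sample_ids=["sample_2"],
)
print(sql)
```

The resulting string can be passed to `spark.sql(sql)` in the same way as the unfiltered query.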



#### A Spark DataFrame can be written to a file, analyzed further using Spark (or Koalas), or read into memory and analyzed using Pandas.
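
As a sketch of the Pandas route, the example below reshapes the long-format table into a feature-by-sample matrix and computes a per-sample mean. It assumes the data fits in memory. To keep the example self-contained, the DataFrame is built by hand from the ten rows displayed above; in a live Spark session the same frame could be obtained with `sdf.toPandas()`.

```python
import pandas as pd

# Stand-in for `sdf.toPandas()`: the ten rows shown in the query output above.
rows = [
    ("ENST00000346162", "sample_2", 85),
    ("ENST00000438810", "sample_1", 68),
    ("ENST00000305531", "sample_2", 88),
    ("ENST00000427231", "sample_3", 27),
    ("ENST00000427970", "sample_3", 4),
    ("ENST00000290037", "sample_3", 21),
    ("ENST00000358472", "sample_1", 5),
    ("ENST00000398738", "sample_1", 8),
    ("ENST00000421624", "sample_1", 90),
    ("ENST00000397381", "sample_2", 24),
]
pdf = pd.DataFrame(rows, columns=["feature_id", "sample_id", "value"])

# Pivot from long format (one row per feature/sample pair) to a matrix with
# one row per feature and one column per sample. Pairs that are absent from
# the input become NaN.
wide = pdf.pivot(index="feature_id", columns="sample_id", values="value")

# Mean expression value per sample, computed over the rows present.
per_sample_mean = pdf.groupby("sample_id")["value"].mean()

print(wide)
print(per_sample_mean)
```

Because only ten rows are used here, most cells of the matrix are NaN; with the full table each feature would typically have a value for every sample.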