# Notebook 1B: Expression Data Transformation from a Dataset
***

***As-Is Software Disclaimer***

*The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.*

*[MIT License](https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md) applies to this notebook.*
***

**Launch spec:**
- App name: JupyterLab with Spark Cluster
- Kernel: Python 3
- Instance type: mem1_ssd1_v2_x16
- Spark cluster configuration: single node
- Runtime: ~10 min

**Package dependencies:** 
- pprint [License](https://docs.python.org/3/license.html?#psf-license)
- pandas [License](https://github.com/pandas-dev/pandas/blob/main/LICENSE)
- pyspark [License](http://www.apache.org/licenses/LICENSE-2.0)

**Data description:** The input is the record ID of a Dataset (or Cohort) that contains an instance of a MolecularExpressionAssay. The dataset used in this notebook contains a MolecularExpressionAssay instance and phenotype (pheno) data. It was created by:
1. Ingesting molecular expression data using the `Molecular Expression Assay Loader` app  
2. Using the `Assay Dataset Merger` app to add the MolecularExpressionAssay to a pheno dataset

**This notebook shows how to** extract molecular expression data from a Dataset for use in downstream analyses:

1. Retrieve Molecular Expression Data from a Dataset
2. Create and Export Count Matrix from Expression Data
3. Export Phenotype Summary from a Dataset
***

## Preparing your environment 
`import` the dependencies, then initialize the Spark context and session.

In [5]:
import pprint
import pandas as pd
import pyspark
import dxdata

In [6]:
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

## 1. Retrieve Molecular Expression Data from a Dataset

#### Load Dataset

In [4]:
molecular_expression_dataset = dxdata.load_dataset(id="record-G7b0pKQ0fPx7Pkf960PJ2x60")

#### View attributes of the Dataset

In [1]:
pprint.pprint(molecular_expression_dataset.__dict__)

#### View the Entities of the desired Assay

In [2]:
molecular_expression_dataset.assays['matrix_test_011922'].entities

#### Retrieve unique database name and table name of the Entities

In [10]:
def get_database_table_names(dataset, entity_name):
    """Retrieve the unique database name and table name of an Entity in a Dataset.

    Args:
        dataset (Dataset):
            Dataset containing the Entity.
        entity_name (str):
            Name of the Entity whose database and table name to retrieve.

    Returns:
        str: "<unique database name>.<table name>" for the specified Entity.

    Raises:
        ValueError: If no Entity named entity_name exists in the Dataset or its assays.
    """
    entity = None
    if entity_name in dataset.entities_by_name:
        # The Entity belongs to the Dataset itself.
        entity = dataset[entity_name]
    else:
        # Otherwise, search the Entities of each assay.
        for assay in dataset.assays:
            for name in dataset.assays[assay].entities:
                if name == entity_name:
                    entity = dataset.assays[assay].entities[name]

    if entity is None:
        raise ValueError(f"Entity '{entity_name}' not found in the Dataset or its assays")

    # Unique database name: "<lowercased database ID with underscores>__<database name>".
    db_id = entity.database_id.lower().replace('-', '_')
    db_uname = f"{db_id}__{entity.database_name}"
    tb_name = entity.fields[0].table_name
    return f"{db_uname}.{tb_name}"

In [5]:
expr_annotation_udb_tb_name = get_database_table_names(molecular_expression_dataset, 'expr_annotation')

In [6]:
expression_udb_tb_name = get_database_table_names(molecular_expression_dataset, 'expression')
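
The helper above builds the unique database name by lowercasing the database ID, replacing hyphens with underscores, and joining the result to the database name with a double underscore. The sketch below isolates just that string logic so the resulting format is easy to see. The function name `unique_table_name` and all input values are hypothetical placeholders, not identifiers from a real project.

```python
def unique_table_name(database_id: str, database_name: str, table_name: str) -> str:
    """Compose '<db_id>__<db_name>.<table>' with the same string steps as the helper."""
    db_id = database_id.lower().replace("-", "_")
    return f"{db_id}__{database_name}.{table_name}"


# Hypothetical placeholder values, for illustration only.
example = unique_table_name("database-ABC123", "my_database", "expression")
print(example)
```

The fully qualified name can then be interpolated into a SparkSQL query, as the next cells do.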

#### Retrieve molecular expression data into a Spark DataFrame using SparkSQL

In [13]:
expr_annotation_table = spark.sql(f"SELECT * FROM {expr_annotation_udb_tb_name}")
expression_table = spark.sql(f"SELECT * FROM {expression_udb_tb_name}")

## 2. Create and Export Count Matrix from Expression Data

#### Create a pandas DataFrame of the expression table

In [7]:
expression_pdf = expression_table.toPandas()

#### Transform the expression data from table to matrix

In [8]:
expression_matrix = pd.pivot_table(expression_pdf, values="value", index="feature_id", columns="sample_id")
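
To make the long-to-wide transformation concrete, here is a minimal sketch on a toy table that reuses the column names `feature_id`, `sample_id`, and `value` from the cell above. The IDs and values are made up. Note that `pd.pivot_table` aggregates duplicate (`feature_id`, `sample_id`) pairs with the mean by default, so it can be worth checking for duplicates first.

```python
import pandas as pd

# Toy long-format table; IDs and values are made up for illustration.
long_df = pd.DataFrame(
    {
        "feature_id": ["gene_a", "gene_a", "gene_b", "gene_b"],
        "sample_id": ["s1", "s2", "s1", "s2"],
        "value": [5.0, 0.0, 3.0, 7.0],
    }
)

# pivot_table would silently average duplicate (feature, sample) pairs,
# so check whether any exist before pivoting.
has_duplicates = bool(long_df.duplicated(subset=["feature_id", "sample_id"]).any())

# One row per feature, one column per sample.
toy_matrix = pd.pivot_table(
    long_df, values="value", index="feature_id", columns="sample_id"
)
print(has_duplicates)
print(toy_matrix)
```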

#### Export matrix as CSV file

In [29]:
expression_matrix.to_csv("expression_matrix.csv")

In [9]:
!head expression_matrix.csv
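
A downstream notebook will typically read the exported file back in. The sketch below shows one way to round-trip a matrix through CSV while keeping `feature_id` as the row index. It writes to an in-memory buffer instead of `expression_matrix.csv` so that it runs on its own, and the toy values are made up.

```python
import io

import pandas as pd

# Toy matrix with feature_id as the row index; values are made up.
toy_matrix = pd.DataFrame(
    {"s1": [5.0, 3.0], "s2": [0.0, 7.0]},
    index=pd.Index(["gene_a", "gene_b"], name="feature_id"),
)

# Write to an in-memory buffer (stand-in for a file on disk).
buffer = io.StringIO()
toy_matrix.to_csv(buffer)
buffer.seek(0)

# index_col restores feature_id as the index rather than a regular column.
reloaded = pd.read_csv(buffer, index_col="feature_id")
print(reloaded)
```

With a real file, the same `index_col="feature_id"` argument would be passed to `pd.read_csv("expression_matrix.csv", ...)`.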

#### Optional preprocessing step: Filter out rows where the expression value is 0 before making a count matrix

In [16]:
no_zero_expression_pdf = expression_pdf[expression_pdf.value != 0]

In [10]:
no_zero_expression_matrix = pd.pivot_table(no_zero_expression_pdf, values="value", index="feature_id", columns="sample_id")
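
Filtering zeros before pivoting changes the shape of the result in two ways: features that are zero in every sample disappear from the matrix, and the remaining zero entries become `NaN`. The toy example below (made-up IDs and values) illustrates both effects, along with the `fill_value` argument of `pd.pivot_table` as one possible way to put zeros back if a downstream tool does not accept missing values.

```python
import pandas as pd

# Toy long-format table; gene_c is zero in every sample.
long_df = pd.DataFrame(
    {
        "feature_id": ["gene_a", "gene_a", "gene_b", "gene_b", "gene_c", "gene_c"],
        "sample_id": ["s1", "s2", "s1", "s2", "s1", "s2"],
        "value": [5.0, 0.0, 3.0, 7.0, 0.0, 0.0],
    }
)

no_zero_df = long_df[long_df.value != 0]

# gene_c is dropped entirely; (gene_a, s2) has no row left and becomes NaN.
no_zero_matrix = pd.pivot_table(
    no_zero_df, values="value", index="feature_id", columns="sample_id"
)

# Optional: fill the missing cells with 0 again.
filled_matrix = pd.pivot_table(
    no_zero_df, values="value", index="feature_id", columns="sample_id", fill_value=0
)
print(no_zero_matrix)
print(filled_matrix)
```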

## 3. Export Phenotype Summary from a Dataset

NOTE: The MolecularExpressionAssay was added to a Dataset containing a pheno summary entity using the `Assay Dataset Merger` app.

#### View entities in the Dataset

In [11]:
molecular_expression_dataset.entities

#### Get the database and table name of the pheno_summary entity

In [12]:
pheno_summary_udb_tb_name = get_database_table_names(molecular_expression_dataset, 'pheno_summary')
pheno_summary_udb_tb_name

#### Retrieve pheno_summary data into a Spark DataFrame using SparkSQL

In [52]:
pheno_summary_table = spark.sql(f"SELECT * FROM {pheno_summary_udb_tb_name}")

#### Create a pandas DataFrame of the pheno_summary table

In [14]:
pheno_summary_pdf = pheno_summary_table.toPandas()

#### Export pheno_summary as a CSV file

In [45]:
pheno_summary_pdf.to_csv("pheno_summary.csv", index=False)

In [15]:
!head pheno_summary.csv

## Next Steps

With the exported CSV files, downstream analyses can be performed as shown in the transcriptomics tutorial notebook series (starting at part 2): https://github.com/dnanexus/OpenBio/blob/master/transcriptomics/tutorial_notebooks/