# Introduction to GenePattern Notebook

This notebook shows how to run an analysis in the GenePattern Notebook environment. You will normalize a dataset, cluster it, and then view the results in a heat map.

The analysis workflow looks like this:

<img src="https://datasets.genepattern.org/data/ccmi_tutorial/2020-01-24/GPNB_Intro_Workflow.png" style="height:350px;" align="left">
<div><br/></div>
<p style="clear: both;">
    
**Instructions are given in blue boxes, such as the one below.**

<div class="alert alert-info"><h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
Sign in to GenePattern by clicking the login button or entering your username and password into the form below.</div>
</p>

In [6]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

GPAuthWidget()

# Data Preparation

## Breast Cancer Data

For this notebook we will use RNA-Seq data from <a href="https://cancergenome.nih.gov/" target="_blank">The Cancer Genome Atlas (TCGA)</a>. We have selected 20 breast cancer (BRCA) primary tumors with their matched normal samples. We use the read count files for each tumor and normal sample (40 in total), downloaded from the <a href="https://gdc.cancer.gov" target="_blank">TCGA Genomic Data Commons</a> (GDC), together with sample metadata (tumor/normal status).

While the example we are using is for a specific selection of breast cancer samples, you can use the steps in this notebook to prepare any TCGA-derived data for use in GenePattern analyses.

We will use this dataset to perform hierarchical clustering. Because this analysis was developed with microarray data in mind, it is good practice to normalize RNA-Seq data as a preprocessing step.

## Step 1. DESeq2 normalization

We recommend using DESeq2 to preprocess RNA-Seq count data to make it suitable for our clustering analysis. The <a href="https://htmlpreview.github.io/?https://github.com/genepattern/DESeq2/blob/master/docs/v1/index.html">DESeq2</a> module performs the normalization explained in [Love, Huber, and Anders (2014)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302049/#Sec2title).
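
To make the idea of count normalization concrete, here is a minimal pure-Python sketch of the median-of-ratios size-factor calculation described in that paper. It is a simplified illustration, not the module's code (the module runs the DESeq2 R package), and the function names and toy counts are made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample with the median-of-ratios method.

    counts: list of gene rows, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Raises statistics.StatisticsError if every gene contained a zero.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide each count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 7],  # skipped when estimating size factors
]
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

After normalization, the first three genes have the same value in both samples, which is what we expect when the only difference between the samples is sequencing depth.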

<div class="well">
    
**Note:** DESeq2 works only with raw read counts as produced by HTSeq or RSEM.  These counts should not be normalized a priori and also should not be RPKM or FPKM values.

</div>
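
A quick sanity check along these lines can catch a dataset that has already been normalized. The sketch below is a heuristic under the assumption that raw counts are non-negative whole numbers; the function name is hypothetical and the check is not part of the DESeq2 module.

```python
def looks_like_raw_counts(values):
    """Return True if every value is a non-negative whole number.

    values: iterable of numbers or numeric strings taken from the data
    columns of an expression file.
    """
    for v in values:
        x = float(v)
        if x < 0 or not x.is_integer():
            return False
    return True


raw_ok = looks_like_raw_counts(["12", "0", "431"])
fpkm_like = looks_like_raw_counts(["12.7", "0.31", "431.0"])
```

Passing this check does not prove the values are raw counts (rounded normalized values would also pass), but failing it is a strong hint that the file holds RPKM, FPKM, or otherwise normalized values.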

<div class="alert alert-info">

<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>

<ul>
    <li>For the <b>input file*</b> parameter, drag and drop <a href="https://datasets.genepattern.org/data/module_support_files/DESeq2/BRCA_tumor_and_normal_20783x40.gct">this sample BRCA dataset</a>.</li>
    <li>For the <b>cls file*</b> parameter, drag and drop <a href="https://datasets.genepattern.org/data/module_support_files/DESeq2/BRCA_tumor_and_normal_x40.cls" target="_blank">this sample CLS file</a>.</li>
    <li>Leave the remaining parameters at their default values.</li>
</ul>
</div>
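
The two inputs must describe the same samples. As a minimal sketch, assuming the standard GCT v1.2 and CLS text layouts, the following checks that the sample counts in a GCT file and a CLS file agree. The function names and the toy file contents are made up for illustration.

```python
def read_gct_shape(text):
    """Return (n_rows, n_cols, sample_names) from GCT v1.2 text."""
    lines = text.splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("expected a GCT v1.2 file")
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    # Header line: Name, Description, then one column per sample.
    samples = lines[2].split("\t")[2:]
    return n_rows, n_cols, samples


def read_cls(text):
    """Return (n_samples, n_classes, class_names, labels) from CLS text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    return n_samples, n_classes, class_names, labels


def check_pair(gct_text, cls_text):
    """Return a list of human-readable problems (empty if none found)."""
    _, n_cols, samples = read_gct_shape(gct_text)
    n_samples, n_classes, class_names, labels = read_cls(cls_text)
    problems = []
    if len(samples) != n_cols:
        problems.append("GCT header lists a different number of samples")
    if n_samples != n_cols:
        problems.append("CLS and GCT disagree on the number of samples")
    if len(labels) != n_samples:
        problems.append("CLS label line has the wrong number of labels")
    if len(class_names) != n_classes:
        problems.append("CLS class-name line has the wrong number of names")
    return problems


# Toy files (not the BRCA data): 2 genes, 4 samples, 2 classes.
toy_gct = (
    "#1.2\n2\t4\n"
    "Name\tDescription\tT1\tN1\tT2\tN2\n"
    "GENE_A\tna\t10\t12\t30\t8\n"
    "GENE_B\tna\t0\t5\t7\t3\n"
)
toy_cls = "4 2 1\n# tumor normal\n0 1 0 1\n"
toy_problems = check_pair(toy_gct, toy_cls)
```

For the dataset in this notebook, the same check would expect 40 samples in both files.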


<div class="well well-sm"><strong>Runtime:</strong> About 10-15 minutes.</div>


<div class="alert alert-warning">
<p class="lead"> NOTE: <i class="fa fa-exclamation-triangle"></i></p>
Because of the long runtime, pre-generated results are provided below this analysis so that you can continue with the subsequent steps without waiting.
</div>

In [7]:
deseq2_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00362')
deseq2_job_spec = deseq2_task.make_job_spec()
deseq2_job_spec.set_parameter("input.file", "https://datasets.genepattern.org/data/module_support_files/DESeq2/BRCA_tumor_and_normal_20783x40.gct")
deseq2_job_spec.set_parameter("cls.file", "https://datasets.genepattern.org/data/module_support_files/DESeq2/BRCA_tumor_and_normal_x40.cls")
deseq2_job_spec.set_parameter("confounding.variable.cls.file", "")
deseq2_job_spec.set_parameter("output.file.base", "<input.file_basename>")
deseq2_job_spec.set_parameter("qc.plot.format", "skip")
deseq2_job_spec.set_parameter("fdr.threshold", "0.1")
deseq2_job_spec.set_parameter("top.N.count", "20")
deseq2_job_spec.set_parameter("random.seed", "779948241")
deseq2_job_spec.set_parameter("job.memory", "2 Gb")
deseq2_job_spec.set_parameter("job.queue", "gp-cloud-default")
deseq2_job_spec.set_parameter("job.cpuCount", "1")
deseq2_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(deseq2_task)

job194669 = gp.GPJob(genepattern.session.get(0), 194669)
genepattern.display(job194669)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00362')

GPJobWidget(job_number=194669)

## Step 2. Hierarchical Clustering

Run hierarchical clustering on genes and/or samples to create dendrograms for the clustered genes (*.gtr) and/or clustered samples (*.atr), as well as a file (*.cdt) that contains the original gene expression data ordered to reflect the clustering.

### Dataset
The dataset consists of 20 breast cancer samples and 20 matched normal samples from [The Cancer Genome Atlas](https://cancergenome.nih.gov/), preprocessed with DESeq2 in Step 1. You can use the output of your own DESeq2 job or the previously generated results included in this notebook (below).

### Considerations
* Best practice is to normalize (row/column normalize parameters) and center (row/column center parameters) the data being clustered. 
* The CDT output file must be converted to a GCT file before it can be used as an input file for another GenePattern module (other than HierarchicalClusteringViewer). For instructions on converting a CDT file to a GCT file, see [Creating Input Files](http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#creating-input-files).
* Learn more by reading about the [HierarchicalClustering](http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/HierarchicalClustering/6) module.
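
To illustrate what centering and normalizing a row mean, here is a small sketch. It assumes that mean centering subtracts the row mean and that normalizing scales the row so its sum of squares equals 1; the module's exact definitions may differ, so treat this as an illustration rather than a description of the module's implementation.

```python
import math
from statistics import mean


def center_row(row):
    """Subtract the row mean so the values are centered on zero."""
    m = mean(row)
    return [v - m for v in row]


def normalize_row(row):
    """Scale the row so that its sum of squares is 1 (assumed definition)."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)  # an all-zero row cannot be scaled
    return [v / norm for v in row]


centered = center_row([1.0, 2.0, 3.0])
scaled = normalize_row([3.0, 4.0])
```

Applying the same functions to the columns of the matrix gives the column-wise equivalents.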

### Pregenerated DESeq2 Results

<p>You can use the results from a previously run and publicly shared job (194669) so that you do not need to wait for your DESeq2 job (above) to complete.</p>

In [8]:
job194669 = gp.GPJob(genepattern.session.get(0), 194669)
genepattern.display(job194669)

GPJobWidget(job_number=194669)

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
<li>If your DESeq2 job has finished, click the <strong><em>input.filename</em></strong> parameter and select <b>BRCA_tumor_and_normal_20783x40.matched_normal.vs.primary_tumor.normalized_counts.gct</b> from the drop-down menu. Otherwise, drag the pre-generated result file <a class="nbtools-markdown-file" href="https://cloud.genepattern.org/gp/jobResults/194669/BRCA_tumor_and_normal_20783x40.matched_normal.vs.primary_tumor.normalized_counts.gct" target="_blank">BRCA_tumor_and_normal_20783x40.matched_normal.vs.primary_tumor.normalized_counts.gct</a> to the input.</li>
<li>Click <strong><em>Run</em></strong>.</li>
</ol>
</div>

<div class="well well-sm"><strong>Runtime:</strong> About 30 seconds</div>

In [9]:
hierarchicalclustering_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00009')
hierarchicalclustering_job_spec = hierarchicalclustering_task.make_job_spec()
hierarchicalclustering_job_spec.set_parameter("input.filename", "https://cloud.genepattern.org/gp/jobResults/194669/BRCA_tumor_and_normal_20783x40.matched_normal.vs.primary_tumor.normalized_counts.gct")
hierarchicalclustering_job_spec.set_parameter("column.distance.measure", "2")
hierarchicalclustering_job_spec.set_parameter("output_distance_matrix", "False")
hierarchicalclustering_job_spec.set_parameter("row.distance.measure", "No row clustering")
hierarchicalclustering_job_spec.set_parameter("clustering.method", "a")
hierarchicalclustering_job_spec.set_parameter("output.base.name", "<input.filename_basename>")
hierarchicalclustering_job_spec.set_parameter("row.centering", "Mean")
hierarchicalclustering_job_spec.set_parameter("row.normalize", "False")
hierarchicalclustering_job_spec.set_parameter("col.centering", "Mean")
hierarchicalclustering_job_spec.set_parameter("col.normalize", "False")
hierarchicalclustering_job_spec.set_parameter("job.memory", "2 Gb")
hierarchicalclustering_job_spec.set_parameter("job.queue", "gp-cloud-default")
hierarchicalclustering_job_spec.set_parameter("job.cpuCount", "1")
hierarchicalclustering_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(hierarchicalclustering_task)


GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00009')
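
The cell above clusters the samples (columns) only. As a rough illustration of that kind of clustering, the sketch below runs a naive average-linkage agglomerative clustering with a Pearson-correlation distance on toy data. It assumes that the value "2" for column.distance.measure selects Pearson correlation and that "a" for clustering.method selects pairwise average linkage; check the module documentation to confirm. It is not the module's implementation, and all names and data are made up.

```python
import math


def pearson_distance(x, y):
    """Return 1 - Pearson correlation (0 = identical profile, 2 = opposite)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # correlation is undefined for a constant profile
    return 1.0 - sxy / math.sqrt(sxx * syy)


def average_linkage(columns):
    """Cluster samples; return merges as (members_a, members_b, distance).

    columns: dict mapping sample name -> list of expression values.
    """
    names = list(columns)
    dist = {
        frozenset((a, b)): pearson_distance(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [dist[frozenset((a, b))]
                         for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


toy_columns = {
    "tumor1": [1.0, 2.0, 3.0, 4.0],
    "tumor2": [2.0, 4.0, 6.0, 9.0],
    "normal1": [4.0, 3.0, 2.0, 1.0],
    "normal2": [9.0, 6.0, 4.0, 2.0],
}
toy_merges = average_linkage(toy_columns)
```

In the toy data the two tumor profiles and the two normal profiles each merge first, and the final merge joins the tumor group with the normal group. The sequence of merges is the information a sample dendrogram encodes.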

## Step 3. View clustering results

Display a heat map of the clustered gene expression data, with dendrograms showing how the genes and/or samples were clustered.
 
Learn more by reading about the [HierarchicalClusteringViewer](http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/HierarchicalClusteringViewer/12) module.

### Considerations

* HierarchicalClusteringViewer displays gene expression data as a heat map, which makes it easier to see patterns in the numeric data. Gene names are row labels and sample names are column labels.
* Notebooks may contain any number of visualizations.
* You can select <b>File &gt; Save Image</b> to save the heat map and dendrograms to an image file. Supported formats include bmp, eps, jpeg, png, and tiff.
* To give the visualization more space on the screen, you can open the visualizer in a separate window by clicking on the gear icon (top left corner) and selecting <b>"Pop Out Visualizer"</b>.


<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul>
    <li>Fill out the <strong>CDT</strong> and <strong>ATR</strong> parameters for <em>HierarchicalClusteringViewer</em> below, using the files produced in the last step.</li>
    <li>The <strong>GTR</strong> parameter is not necessary. Leave it blank.</li>
    <li>Click <strong><em>Run</em></strong>.</li>
</ul>
</div>
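
If you prefer to fill in the parameters from code, the file selection can be sketched as a lookup by extension. The helper name and the example file names below are hypothetical; the parameter names cdt.file, atr.file, and gtr.file are the ones used in the cell that follows.

```python
def pick_viewer_inputs(file_names):
    """Map clustering output files to HierarchicalClusteringViewer parameters.

    file_names: names or URLs of the files produced by the clustering job.
    The GTR entry stays blank when no row dendrogram was produced.
    """
    by_ext = {}
    for name in file_names:
        ext = name.rsplit(".", 1)[-1].lower()
        by_ext.setdefault(ext, name)  # keep the first file of each type
    missing = [ext for ext in ("cdt", "atr") if ext not in by_ext]
    if missing:
        raise ValueError("missing required output file(s): " + ", ".join(missing))
    return {
        "cdt.file": by_ext["cdt"],
        "atr.file": by_ext["atr"],
        "gtr.file": by_ext.get("gtr", ""),
    }


# Hypothetical output names, for illustration only.
viewer_inputs = pick_viewer_inputs(["example.cdt", "example.atr", "stdout.txt"])
```

Each entry in the returned dictionary could then be passed to `set_parameter` on the viewer's job spec.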

In [10]:
hierarchicalclusteringviewer_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00031')
hierarchicalclusteringviewer_job_spec = hierarchicalclusteringviewer_task.make_job_spec()
hierarchicalclusteringviewer_job_spec.set_parameter("cdt.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("gtr.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("atr.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("job.memory", "2 Gb")
hierarchicalclusteringviewer_job_spec.set_parameter("job.queue", "gp-cloud-default")
hierarchicalclusteringviewer_job_spec.set_parameter("job.cpuCount", "1")
hierarchicalclusteringviewer_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(hierarchicalclusteringviewer_task)



GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00031')