# Hierarchical Clustering in GenePattern Notebook

Cluster genes and/or samples based on how close they are to one another. The result is a tree structure, referred to as a dendrogram.

## Before you begin

* Sign in to GenePattern by entering your username and password into the form below.
* Gene expression data must be in a [GCT or RES file](https://genepattern.broadinstitute.org/gp/pages/protocols/GctResFiles.html).
    * Example file: [all_aml_test.gct](https://datasets.genepattern.org/data/all_aml/all_aml_test.gct).
* Learn more by reading about [file formats](http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT).
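
To make the GCT layout concrete, here is a minimal sketch of a reader for GCT version 1.2 files: a `#1.2` version line, a line giving the row and column counts, a header line starting with `Name` and `Description`, then one tab-separated line per gene. The function name `read_gct` and the in-memory example data are illustrative assumptions, not part of the GenePattern library.

```python
import io


def read_gct(handle):
    """Parse a GCT v1.2 stream into (sample_names, rows).

    rows maps each row name to (description, list of float values).
    """
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version line: {version!r}")
    # Second line: number of data rows, then number of sample columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    # Third line: Name, Description, then one column per sample.
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]
    rows = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate trailing blank lines
        fields = line.rstrip("\n").split("\t")
        rows[fields[0]] = (fields[1], [float(v) for v in fields[2:]])
    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("declared dimensions do not match the data")
    return samples, rows


# Hypothetical two-gene, three-sample dataset used only for illustration.
example = io.StringIO(
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "GENE_A\tfirst gene\t1.0\t2.0\t3.0\n"
    "GENE_B\tsecond gene\t4.5\t5.5\t6.5\n"
)
samples, rows = read_gct(example)
print(samples)
print(rows["GENE_B"])
```

To read a downloaded file instead, pass an open file handle, for example `read_gct(open("all_aml_test.gct"))`.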


In [5]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://genepattern.broadinstitute.org/gp", "", ""))

## Step 1: PreprocessReadCounts

Preprocess RNA-Seq count data in a GCT file so that it is suitable for use in GenePattern analyses. 

### Introduction

* This module preprocesses RNA-Seq data into a form suitable for downstream GenePattern analyses such as GSEA, ComparativeMarkerSelection, and HierarchicalClustering, as well as visualizers. Many of these tools were originally designed for microarray data, particularly from Affymetrix arrays, so preprocessing must account for that origin. The module uses a mean-variance modeling technique to transform the dataset so that it approximates a normal distribution. Classic normal-based statistical methods and workflows developed for microarrays can then be applied.
* Learn more by reading about the [PreprocessReadCounts](http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/PreprocessReadCounts/1) module.
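
The idea of filtering low-count rows and then moving counts onto a log scale can be illustrated with a small sketch. It uses the log2 counts-per-million transform that serves as the starting point of limma's voom, one well-known mean-variance modeling method. Whether the module follows exactly these steps is an assumption, and the names `log_cpm`, `filter_low_expression`, and `min_samples`, the filtering rule, and the toy counts are all hypothetical.

```python
import math


def _library_sizes(counts):
    """Total read count per sample (column sums)."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def filter_low_expression(counts, threshold, min_samples=1):
    """Keep genes whose counts-per-million exceed threshold in at least
    min_samples samples. Assumed rule; the module's own rule may differ.
    Assumes every sample has at least one read."""
    lib_sizes = _library_sizes(counts)
    kept = {}
    for gene, row in counts.items():
        cpm = [c / lib_sizes[j] * 1e6 for j, c in enumerate(row)]
        if sum(v > threshold for v in cpm) >= min_samples:
            kept[gene] = row
    return kept


def log_cpm(counts):
    """log2 counts-per-million with the small offsets used by voom."""
    lib_sizes = _library_sizes(counts)
    return {
        gene: [
            math.log2((c + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
            for j, c in enumerate(row)
        ]
        for gene, row in counts.items()
    }


# Toy raw counts: gene -> one count per sample (illustrative values only).
counts = {
    "GENE_A": [500, 600, 550],
    "GENE_B": [0, 1, 0],
    "GENE_C": [499500, 399399, 449450],
}
kept = filter_low_expression(counts, threshold=4)
transformed = log_cpm(kept)
print(sorted(kept))
print(transformed["GENE_A"])
```

With a threshold of 4, the nearly silent `GENE_B` is dropped before the transform, which is the row-reduction effect the threshold parameter is meant to provide.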

<div class="alert alert-info">
<h3>Instructions</h3>

<p>For the <strong><em>input.file</em></strong> parameter, click &quot;Add Path or URL...&quot; then copy and paste this URL into the <em>&quot;Enter Path or URL&quot; </em>text box, and click <strong><em>Select</em></strong>:&nbsp;</p>

<p><a href="https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.gct" target="_blank">https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.gct</a></p>

<p>&nbsp;</p>

<p>For the <strong><em>cls.file</em></strong> parameter, click &quot;Add Path or URL...&quot; then copy and paste this URL into the <em>&quot;Enter Path or URL&quot; </em>text box, and click <strong><em>Select</em></strong>:</p>

<p><a href="https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.cls" target="_blank">https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.cls</a></p>

<p>&nbsp;</p>

<p>Set the <strong><em>expression.value.filter.threshold</em></strong> parameter to 4. This reduces the number of rows and therefore the computation time.</p>

<p>&nbsp;</p>

<p>Click the button <strong><em>Run</em></strong> on the analysis below.</p>
</div>


In [7]:
preprocessreadcounts_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00355')
preprocessreadcounts_job_spec = preprocessreadcounts_task.make_job_spec()
preprocessreadcounts_job_spec.set_parameter("input.file", "")
preprocessreadcounts_job_spec.set_parameter("cls.file", "")
preprocessreadcounts_job_spec.set_parameter("output.file", "<input.file_basename>.preprocessed.gct")
preprocessreadcounts_job_spec.set_parameter("expression.value.filter.threshold", "1")
genepattern.GPTaskWidget(preprocessreadcounts_task)

## Step 2: HierarchicalClustering

Run hierarchical clustering on genes and/or samples to create dendrograms for the clustered genes (*.gtr) and/or clustered samples (*.atr), as well as a file (*.cdt) that contains the original gene expression data ordered to reflect the clustering.

### Considerations
* Best practice is to normalize (row/column normalize parameters) and center (row/column center parameters) the data being clustered. 
* The CDT output file must be converted to a GCT file before it can be used as an input file for another GenePattern module (other than HierarchicalClusteringViewer). For instructions on converting a CDT file to a GCT file, see [Creating Input Files](http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#creating-input-files).
* Learn more by reading about the [HierarchicalClustering](https://genepattern.broadinstitute.org/gp/getTaskDoc.jsp?name=HierarchicalClustering) module.
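
For readers who want to see what centering and clustering do to the data, the following is a minimal pure-Python sketch of row centering followed by agglomerative clustering with Pearson correlation distance and average linkage. This is one common configuration chosen for illustration; it is not the module's implementation, and the function names and toy expression values are assumptions.

```python
import math


def center_rows(matrix):
    """Subtract each row's mean so every row is centered on zero."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered


def pearson_distance(x, y):
    """1 - Pearson correlation: near 0 for correlated profiles, near 2 for
    anti-correlated ones. Assumes neither profile is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles):
    """Repeatedly merge the two clusters with the smallest mean pairwise
    distance. Returns merges as (cluster_id_a, cluster_id_b, distance);
    leaves are numbered 0..n-1 and each merge creates the next id."""
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                d = sum(
                    pearson_distance(profiles[p], profiles[q]) for p, q in pairs
                ) / len(pairs)
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Toy expression matrix: rows are genes, columns are samples.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.5, 2.0],
]
centered = center_rows(genes)
merges = average_linkage(centered)
for merge in merges:
    print(merge)
```

The merge list is the information a dendrogram encodes: which items were joined, in what order, and at what distance. Pearson distance is unaffected by row centering, so centering here mainly matters for how the heat map is displayed and for distance measures that are not shift-invariant.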

<div class="alert alert-info">
<h3 style="position: relative; top: -10px">Instructions</h3>
<p>After the PreprocessReadCounts job above finishes running, send the GCT result of that job to HierarchicalClustering below. To do this, either drag and drop the link for the file above onto the <em>input.filename</em> input below, or click the link for the file above and select <em>Send to an Existing GenePattern Cell &gt; HierarchicalClustering</em> in the menu that appears.</p>

<p>Once this is done, click <em>Run</em> for the analysis below.</p>
</div>

In [6]:
hierarchicalclustering_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00009')
hierarchicalclustering_job_spec = hierarchicalclustering_task.make_job_spec()
hierarchicalclustering_job_spec.set_parameter("input.filename", "")
hierarchicalclustering_job_spec.set_parameter("column.distance.measure", "2")
hierarchicalclustering_job_spec.set_parameter("row.distance.measure", "2")
hierarchicalclustering_job_spec.set_parameter("clustering.method", "a")
hierarchicalclustering_job_spec.set_parameter("log.transform", "")
hierarchicalclustering_job_spec.set_parameter("row.center", "mean.row")
hierarchicalclustering_job_spec.set_parameter("row.normalize", "")
hierarchicalclustering_job_spec.set_parameter("column.center", "mean.column")
hierarchicalclustering_job_spec.set_parameter("column.normalize", "")
hierarchicalclustering_job_spec.set_parameter("output.base.name", "<input.filename_basename>")
genepattern.GPTaskWidget(hierarchicalclustering_task)

## Step 3: HierarchicalClusteringViewer

Display a heat map of the clustered gene expression data, with dendrograms showing how the genes and/or samples were clustered.

### Considerations

* Select File > Save Image to save the heat map and dendrograms to an image file. Supported formats include bmp, eps, jpeg, png, and tiff. 
* Learn more by reading about the [HierarchicalClusteringViewer](https://genepattern.broadinstitute.org/gp/getTaskDoc.jsp?name=HierarchicalClusteringViewer) module.

<div class="alert alert-info">
<h3>Instructions</h3>

<p>After the HierarchicalClustering job above finishes running, send the CDT, GTR, and ATR results of that job to HierarchicalClusteringViewer below. You will need to scroll back to the HierarchicalClustering job after sending each file. Once this is done, click <em>Run</em> for the analysis below. <em>(Note that the HierarchicalClustering job takes roughly 10-15 minutes to complete.)</em></p>
</div>

In [8]:
hierarchicalclusteringviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00031')
hierarchicalclusteringviewer_job_spec = hierarchicalclusteringviewer_task.make_job_spec()
hierarchicalclusteringviewer_job_spec.set_parameter("cdt.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("gtr.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("atr.file", "")
genepattern.GPTaskWidget(hierarchicalclusteringviewer_task)