# Hierarchical Clustering in GenePattern Notebook

Cluster genes and/or samples based on how close they are to one another. The result is a tree structure, referred to as a dendrogram.

## Before you begin

* Sign in to GenePattern by entering your username and password into the form below.
* Gene expression data must be in a [GCT or RES file](https://genepattern.broadinstitute.org/gp/pages/protocols/GctResFiles.html).
    * Example file: [all_aml_test.gct](https://datasets.genepattern.org/data/all_aml/all_aml_test.gct).
* Learn more by reading about [file formats](http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT).
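To make the GCT layout concrete, here is a minimal sketch of a reader, assuming the usual GCT structure: a `#1.2` version line, a line with the row and column counts, a header line (`Name`, `Description`, then sample names), and one tab-delimited line per gene. The function name `parse_gct` and the toy data are illustrative only and are not part of GenePattern.

```python
def parse_gct(text):
    """Parse GCT text into (sample_names, {gene_name: [values]})."""
    lines = text.splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("expected the '#1.2' version line")
    # Line 2 holds the number of data rows and the number of sample columns.
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    # Line 3 is the header: Name, Description, then one column per sample.
    samples = lines[2].split("\t")[2:]
    rows = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        # fields[1] is the free-text description; values start at index 2.
        rows[fields[0]] = [float(v) for v in fields[2:]]
    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("declared dimensions do not match the data")
    return samples, rows


# Toy data invented for illustration; not taken from all_aml_test.gct.
toy_gct = "\n".join([
    "#1.2",
    "2\t3",
    "Name\tDescription\ts1\ts2\ts3",
    "geneA\tfirst toy gene\t10\t200\t3000",
    "geneB\tsecond toy gene\t55.5\t60\t65",
])
samples, rows = parse_gct(toy_gct)
print(samples)
print(rows["geneB"])
```

A quick check like this can help confirm that a file is well formed before it is uploaded to a module.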


```python
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))
```

GPAuthWidget()

## Step 1: PreprocessDataset

Preprocess gene expression data to remove platform noise and genes that have little variation. Researchers generally preprocess data before clustering, but skip this step if doing so would remove relevant biological information.

### Considerations

* PreprocessDataset can preprocess the data in one or more ways (in this order):
    1. Set threshold and ceiling values. Any value lower than the threshold is reset to the threshold, and any value higher than the ceiling is reset to the ceiling.
    2. Convert each expression value to the log base 2 of the value.
    3. Remove a gene (row) if a given number of its sample values are less than a given threshold.
    4. Remove genes (rows) that do not have a minimum fold change or expression variation.
    5. Discretize or normalize the data.
* When using ratios to compare gene expression between samples, convert values to log base 2 to bring up- and down-regulated genes to the same scale. For example, ratios of 2 and 0.5, which indicate two-fold changes for up- and down-regulated expression, respectively, are converted to +1 and -1.
* If you did not generate the expression data, check whether preprocessing steps have already been taken before running the PreprocessDataset module. 
* Learn more by reading about the [PreprocessDataset](https://genepattern.broadinstitute.org/gp/getTaskDoc.jsp?name=PreprocessDataset) module.
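The considerations above can be sketched in plain Python. This is an illustration only, not the PreprocessDataset implementation: the helper names are hypothetical, and the variation rule (keep a row only if both max/min and max - min reach their minimums) is an assumption made for the example. The floor, ceiling, fold-change, and delta values mirror the defaults in the cell below.

```python
import math


def clamp(values, floor, ceiling):
    """Reset values below the floor or above the ceiling to those limits."""
    return [min(max(v, floor), ceiling) for v in values]


def passes_variation_filter(values, min_fold_change, min_delta):
    """Assumed rule: keep a row only if it varies enough across samples.

    Expects positive values (for example, after clamping with floor > 0).
    """
    lo, hi = min(values), max(values)
    return hi / lo >= min_fold_change and hi - lo >= min_delta


def log2_row(values):
    """Convert each value to its base-2 logarithm."""
    return [math.log2(v) for v in values]


# Ratios of 2 and 0.5 land on a symmetric scale after the log transform.
log_ratios = log2_row([2, 0.5])
print(log_ratios)

# Toy rows invented for illustration.
toy_rows = {
    "variable": [10, 15, 25000],
    "flat": [100, 120, 150],
    "small_delta": [30, 60, 110],
}
kept = [
    name
    for name, values in toy_rows.items()
    if passes_variation_filter(clamp(values, 20, 20000), 3, 100)
]
print(kept)
```

In this toy data, `flat` fails the fold-change test and `small_delta` fails the delta test, so only `variable` survives.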

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
Set the *min.fold.change* parameter to 5. Click the run button on the analysis below.
</div>

```python
preprocessdataset_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020')
preprocessdataset_job_spec = preprocessdataset_task.make_job_spec()
preprocessdataset_job_spec.set_parameter("input.filename", "https://datasets.genepattern.org/data/all_aml/all_aml_test.gct")
preprocessdataset_job_spec.set_parameter("threshold.and.filter", "1")
preprocessdataset_job_spec.set_parameter("floor", "20")
preprocessdataset_job_spec.set_parameter("ceiling", "20000")
preprocessdataset_job_spec.set_parameter("min.fold.change", "3")
preprocessdataset_job_spec.set_parameter("min.delta", "100")
preprocessdataset_job_spec.set_parameter("num.outliers.to.exclude", "0")
preprocessdataset_job_spec.set_parameter("row.normalization", "0")
preprocessdataset_job_spec.set_parameter("row.sampling.rate", "1")
preprocessdataset_job_spec.set_parameter("threshold.for.removing.rows", "")
preprocessdataset_job_spec.set_parameter("number.of.columns.above.threshold", "")
preprocessdataset_job_spec.set_parameter("log2.transform", "0")
preprocessdataset_job_spec.set_parameter("output.file.format", "3")
preprocessdataset_job_spec.set_parameter("output.file", "<input.filename_basename>.preprocessed")
genepattern.display(preprocessdataset_task)
```

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020')

## Step 2: HierarchicalClustering

Run hierarchical clustering on genes and/or samples to create dendrograms for the clustered genes (*.gtr) and/or clustered samples (*.atr), as well as a file (*.cdt) that contains the original gene expression data ordered to reflect the clustering.
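As a rough picture of how a dendrogram is built, the sketch below runs agglomerative clustering on a few toy gene profiles. It assumes a Pearson-correlation distance (1 - r) and pairwise average linkage; whether those match the settings in the cell further down should be checked against the module documentation. The function names and data are hypothetical, and this is not the module's implementation.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles):
    """Merge the two closest clusters until one remains.

    Returns a list of (left, right, distance) merges. Leaves are names;
    merged nodes are nested tuples, which together form the tree.
    """
    clusters = {name: [name] for name in profiles}  # node -> member leaves
    merges = []

    def dist(a, b):
        # Average distance over all cross-cluster pairs of leaves.
        pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
        total = sum(pearson_distance(profiles[p], profiles[q]) for p, q in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: dist(*ab))
        d = dist(a, b)
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


# Toy profiles invented for illustration: two rising genes, two falling genes.
toy_profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4.5, 2],
}
merges = average_linkage(toy_profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 4))
```

The two rising genes merge first, then the two falling genes, and the final merge joins the two groups at a distance close to 2 because their profiles are almost perfectly anti-correlated.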

### Considerations
* Best practice is to normalize (row/column normalize parameters) and center (row/column center parameters) the data being clustered. 
* The CDT output file must be converted to a GCT file before it can be used as an input file for another GenePattern module (other than HierarchicalClusteringViewer). For instructions on converting a CDT file to a GCT file, see [Creating Input Files](http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#creating-input-files).
* Learn more by reading about the [HierarchicalClustering](https://genepattern.broadinstitute.org/gp/getTaskDoc.jsp?name=HierarchicalClustering) module.
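To illustrate what centering and normalizing do to a row, here is a small sketch. It assumes that centering subtracts the row mean and that normalizing scales the row so its squared values sum to 1; the module offers its own options for both, so treat these definitions and the helper names as assumptions rather than the module's behavior.

```python
import math


def center_row(values):
    """Subtract the row mean so the row averages to zero (assumed definition)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def normalize_row(values):
    """Scale the row so its squared values sum to 1 (assumed definition)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


def transpose(matrix):
    """Swap rows and columns so the row helpers can be applied to columns."""
    return [list(col) for col in zip(*matrix)]


# Toy matrix invented for illustration: rows are genes, columns are samples.
toy_matrix = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]

centered = [center_row(row) for row in toy_matrix]
normalized = [normalize_row(row) for row in centered]
print(centered[0])
print(normalized[0])

# Column centering reuses the same helper on the transposed matrix.
column_centered = transpose([center_row(col) for col in transpose(toy_matrix)])
print(column_centered)
```

After both steps, rows with very different magnitudes end up on a comparable scale, which keeps high-abundance genes from dominating the distance calculation.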

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    
- After the PreprocessDataset job above finishes running, send the GCT result of that job to HierarchicalClustering below. To do this, either drag and drop the link for the file above onto the *input.filename* input below, or click the link for the file above and select *Send to an Existing GenePattern Cell > HierarchicalClustering* in the menu that appears.
    
- Once this is done, click Run for the analysis below.
</div>

```python
hierarchicalclustering_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00009')
hierarchicalclustering_job_spec = hierarchicalclustering_task.make_job_spec()
hierarchicalclustering_job_spec.set_parameter("input.filename", "")
hierarchicalclustering_job_spec.set_parameter("column.distance.measure", "2")
hierarchicalclustering_job_spec.set_parameter("row.distance.measure", "0")
hierarchicalclustering_job_spec.set_parameter("clustering.method", "a")
hierarchicalclustering_job_spec.set_parameter("log.transform", "")
hierarchicalclustering_job_spec.set_parameter("row.center", "")
hierarchicalclustering_job_spec.set_parameter("row.normalize", "")
hierarchicalclustering_job_spec.set_parameter("column.center", "")
hierarchicalclustering_job_spec.set_parameter("column.normalize", "")
hierarchicalclustering_job_spec.set_parameter("output.base.name", "<input.filename_basename>")
genepattern.display(hierarchicalclustering_task)
```

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00009')

## Step 3: HierarchicalClusteringViewer

Display a heat map of the clustered gene expression data, with dendrograms showing how the genes and/or samples were clustered.

### Considerations

* Select File > Save Image to save the heat map and dendrograms to an image file. Supported formats include bmp, eps, jpeg, png, and tiff. 
* Learn more by reading about the [HierarchicalClusteringViewer](https://genepattern.broadinstitute.org/gp/getTaskDoc.jsp?name=HierarchicalClusteringViewer) module.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    
- After the HierarchicalClustering job above finishes running, send the CDT and ATR results of that job to HierarchicalClusteringViewer below. You will need to scroll back to the HierarchicalClustering job after sending the first file. Leave the GTR file input blank.
- Once this is done, click Run for the analysis below.
</div>

```python
hierarchicalclusteringviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00031')
hierarchicalclusteringviewer_job_spec = hierarchicalclusteringviewer_task.make_job_spec()
hierarchicalclusteringviewer_job_spec.set_parameter("cdt.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("gtr.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("atr.file", "")
genepattern.display(hierarchicalclusteringviewer_task)
```

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00031')