# K-means Clustering in GenePattern Notebook

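Cluster genes or samples into a specified number of clusters, k. The initial cluster centers are randomly selected data points; the algorithm then reassigns points and recomputes centers until the result is k clusters, each centered on the mean of its members.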
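
To make the idea concrete, here is a minimal pure-Python sketch of standard k-means (Lloyd's algorithm). It is an illustration only, not the KMeansClustering module's implementation: the function names `kmeans` and `squared_distance`, the toy data points, and the use of squared Euclidean distance are all assumptions made for this example. The seed mirrors the `seed.value` used later in this notebook, but Python's random number generator will not reproduce the module's choices.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=12345, max_iter=100):
    """Hypothetical helper: returns (cluster label per point, final centers)."""
    rng = random.Random(seed)
    # Start from k randomly selected data points.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the result is stable.
        assignments = new_assignments
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # An empty cluster keeps its previous center.
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers


# Toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Because the starting centers are random, different seeds can yield different cluster numbering and, on less clearly separated data, different clusters.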

## Before you begin

* Sign in to GenePattern by entering your username and password into the form below. If you are seeing a block of code instead of the login form, go to the menu above and select Cell > Run All.
* Gene expression data must be in a [GCT or RES file](https://genepattern.broadinstitute.org/gp/pages/protocols/GctResFiles.html).
    * Example file: [all_aml_test.gct](https://software.broadinstitute.org/cancer/software/genepattern/data/all_aml/all_aml_test.gct).
* Learn more by reading about [file formats](http://www.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT).
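
As a rough guide to what a GCT file contains, the sketch below parses a tiny in-memory example. It assumes the GCT 1.2 layout: a `#1.2` version line, a line with the row and column counts, a header line starting with `Name` and `Description`, and then one tab-delimited row per gene. The function name `read_gct` and the sample data are hypothetical and are not part of the GenePattern library.

```python
import io


def read_gct(handle):
    """Hypothetical helper: parse GCT 1.2 text into (sample names, {gene: values})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError("expected GCT version line '#1.2', got %r" % version)
    # Second line: number of data rows, then number of data columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    # Third line: Name, Description, then one column per sample.
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]
    rows = {}
    for line in handle:
        if not line.strip():
            continue  # Skip blank lines.
        fields = line.rstrip("\n").split("\t")
        rows[fields[0]] = [float(v) for v in fields[2:]]
    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("declared dimensions do not match the data")
    return samples, rows


# Made-up two-gene, three-sample dataset for illustration.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\ts1\ts2\ts3\n"
    "geneA\tfirst gene\t10\t20\t30\n"
    "geneB\tsecond gene\t5\t0\t15\n"
)
samples, rows = read_gct(io.StringIO(example))
print(samples)
print(rows)
```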


In [3]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://genepattern.broadinstitute.org/gp", "", ""))



## Step 1: PreprocessDataset

Preprocess gene expression data to remove platform noise and genes that have little variation. Researchers generally preprocess data before clustering; however, if doing so would remove relevant biological information, skip this step.

### Considerations

* PreprocessDataset can preprocess the data in one or more ways (in this order):
    1. Set threshold and ceiling values. Any value below the threshold is reset to the threshold, and any value above the ceiling is reset to the ceiling.
    2. Convert each expression value to the log base 2 of the value.
    3. Remove genes (rows) if a given number of their sample values are less than a given threshold.
    4. Remove genes (rows) that do not have a minimum fold change or expression variation.
    5. Discretize or normalize the data.
* When using ratios to compare gene expression between samples, convert values to log base 2 to bring up- and down-regulated genes onto the same scale. For example, ratios of 2 and 0.5, indicating two-fold changes for up- and down-regulated expression, respectively, are converted to +1 and -1.
* If you did not generate the expression data, check whether preprocessing steps have already been taken before running the PreprocessDataset module. 
* Learn more by reading about the [PreprocessDataset](https://genepattern.broadinstitute.org/gp/getTaskDoc.jsp?name=PreprocessDataset) module.
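
The considerations above can be illustrated with a small sketch. It is not the PreprocessDataset implementation: the helper names `clamp_row` and `passes_variation_filter` are hypothetical, and the filter assumes that fold change means max/min and delta means max minus min within a row. The numeric settings (floor 20, ceiling 20000, minimum fold change 3, minimum delta 100) are the values set in this notebook's PreprocessDataset cell.

```python
import math


def clamp_row(values, floor, ceiling):
    """Reset values below the floor to the floor and above the ceiling to the ceiling."""
    return [min(max(v, floor), ceiling) for v in values]


def passes_variation_filter(values, min_fold_change, min_delta):
    """Keep a row only if it varies enough across samples.

    Assumption: fold change is max/min and delta is max - min.
    Values must be positive, which clamping to a positive floor guarantees.
    """
    lo, hi = min(values), max(values)
    return hi / lo >= min_fold_change and hi - lo >= min_delta


# Made-up expression rows for illustration.
variable_gene = clamp_row([5, 150, 25000], floor=20, ceiling=20000)
flat_gene = clamp_row([100, 120, 110], floor=20, ceiling=20000)

print(variable_gene)
print(passes_variation_filter(variable_gene, min_fold_change=3, min_delta=100))
print(passes_variation_filter(flat_gene, min_fold_change=3, min_delta=100))

# Log base 2 puts two-fold up- and down-regulation on a symmetric scale.
print(math.log2(2), math.log2(0.5))
```

In this sketch the flat row is dropped because its largest value is only 1.2 times its smallest, well under the minimum fold change of 3.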

In [5]:
# Look up the PreprocessDataset module by its LSID on the first registered session.
preprocessdataset_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020')
preprocessdataset_job_spec = preprocessdataset_task.make_job_spec()
preprocessdataset_job_spec.set_parameter("input.filename", "https://software.broadinstitute.org/cancer/software/genepattern/data/all_aml/all_aml_test.gct")
preprocessdataset_job_spec.set_parameter("threshold.and.filter", "1")
preprocessdataset_job_spec.set_parameter("floor", "20")
preprocessdataset_job_spec.set_parameter("ceiling", "20000")
preprocessdataset_job_spec.set_parameter("min.fold.change", "3")
preprocessdataset_job_spec.set_parameter("min.delta", "100")
preprocessdataset_job_spec.set_parameter("num.outliers.to.exclude", "0")
preprocessdataset_job_spec.set_parameter("row.normalization", "0")
preprocessdataset_job_spec.set_parameter("row.sampling.rate", "1")
preprocessdataset_job_spec.set_parameter("threshold.for.removing.rows", "")
preprocessdataset_job_spec.set_parameter("number.of.columns.above.threshold", "")
preprocessdataset_job_spec.set_parameter("log2.transform", "0")
preprocessdataset_job_spec.set_parameter("output.file.format", "3")
preprocessdataset_job_spec.set_parameter("output.file", "<input.filename_basename>.preprocessed")
genepattern.GPTaskWidget(preprocessdataset_task)



## Step 2: KMeansClustering

Run k-means clustering on genes (rows) or samples (columns). The module creates a GCT file for each cluster and a GCT file that organizes all of the expression data by cluster. 

In [6]:
# Look up the KMeansClustering module by its LSID on the first registered session.
kmeansclustering_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00081')
kmeansclustering_job_spec = kmeansclustering_task.make_job_spec()
kmeansclustering_job_spec.set_parameter("input.filename", "https://software.broadinstitute.org/cancer/software/genepattern/data/protocols/all_aml_test.preprocessed.gct")
kmeansclustering_job_spec.set_parameter("output.base.name", "<input.filename_basename>_KMcluster_output")
kmeansclustering_job_spec.set_parameter("number.of.clusters", "2")
kmeansclustering_job_spec.set_parameter("seed.value", "12345")
kmeansclustering_job_spec.set_parameter("cluster.by", "0")
kmeansclustering_job_spec.set_parameter("distance.metric", "0")
genepattern.GPTaskWidget(kmeansclustering_task)



## Step 3: HeatMapViewer

For an overview of the results, use a heatmap to display the expression data organized by cluster. 

### Considerations

* The HeatMapViewer displays gene expression data as a heat map, which makes it easier to see patterns in the numeric data. Gene names are row labels and sample names are column labels. 
* Learn more by reading about the [HeatMapViewer](https://genepattern.broadinstitute.org/gp/getTaskDoc.jsp?name=HeatMapViewer) module.

In [7]:
# Look up the HeatMapViewer module by its LSID on the first registered session.
heatmapviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00010')
heatmapviewer_job_spec = heatmapviewer_task.make_job_spec()
heatmapviewer_job_spec.set_parameter("dataset", "")
genepattern.GPTaskWidget(heatmapviewer_task)

