# Introduction to GenePattern Notebook

This notebook shows how to run an analysis in the GenePattern Notebook environment. You will preprocess a gene expression dataset, view the result as a heat map, and then convert the feature identifiers to gene names.

**Instructions are given in blue boxes, such as the one below.**

<div class="alert alert-info"><h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
Sign in to GenePattern by clicking the login button or entering your username and password into the form below.</div>

## Dataset information
In this example we will preprocess a leukemia dataset of 38 samples: 27 of subtype ALL and 11 of subtype AML. The data were generated on a microarray platform, but the [GCT](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT) file format also accommodates RNA-Seq data, as well as any other data type that can be expressed with samples as columns and features as rows.
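
To make the "samples as columns, features as rows" layout concrete, here is a minimal sketch of reading GCT text in plain Python. It assumes the version 1.2 layout (a `#1.2` line, a line with the row and column counts, a header line beginning with `Name` and `Description`, then one tab-separated line per feature). The function name `parse_gct` and the tiny example table are hypothetical and made up for illustration; they are not part of GenePattern or of the real dataset.

```python
def parse_gct(text):
    """Parse GCT v1.2 text into (sample_names, {feature_name: [values]})."""
    lines = [line for line in text.splitlines() if line.strip()]
    if lines[0].strip() != "#1.2":
        raise ValueError("expected a GCT version 1.2 header line")

    # Second line: number of data rows, then number of sample columns.
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])

    # Third line: Name, Description, then one column per sample.
    samples = lines[2].split("\t")[2:]

    rows = {}
    for line in lines[3:]:
        fields = line.split("\t")
        name = fields[0]              # feature identifier
        # fields[1] is the free-text description; it is ignored here.
        rows[name] = [float(v) for v in fields[2:]]

    if len(rows) != n_rows or len(samples) != n_cols:
        raise ValueError("table size does not match the declared dimensions")
    return samples, rows


# Made-up values for illustration only.
example = "\n".join([
    "#1.2",
    "2\t3",
    "Name\tDescription\tALL_1\tALL_2\tAML_1",
    "feature_a\tfirst made-up feature\t120\t340\t95",
    "feature_b\tsecond made-up feature\t15\t2200\t800",
])

samples, rows = parse_gct(example)
print(samples)
print(rows["feature_b"])
```

Each key of `rows` is one feature (a row of the file), and each position in its list corresponds to the sample at the same position in `samples`.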

In [13]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

GPAuthWidget()

## Step 1: PreprocessDataset

Preprocess gene expression data to remove platform noise and genes that have little variation. You can test this step by starting a job using parameters entered into the form below.

### Considerations

* PreprocessDataset can preprocess the data in one or more ways (in this order):
    1. Set threshold and ceiling values. Any value below the threshold is reset to the threshold, and any value above the ceiling is reset to the ceiling.
    2. Convert each expression value to the log base 2 of the value.
    3. Remove a gene (row) if a given number of its sample values are less than a given threshold.
    4. Remove genes (rows) that do not have a minimum fold change or expression variation.
    5. Discretize or normalize the data.
* If you did not generate the expression data, check whether preprocessing steps have already been taken before running the PreprocessDataset module.
* Learn more by reading about the [PreprocessDataset](https://gp-beta-ami.genepattern.org/gp/getTaskDoc.jsp?name=PreprocessDataset) module.
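
The thresholding and variation-filter steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the module's actual implementation: it assumes a row is kept only when both its fold change (maximum divided by minimum) and its delta (maximum minus minimum) reach their minimums, and it skips the log transform, row-count filter, and normalization steps. The function names and the two example rows are hypothetical; the default values mirror the parameters shown in the job cell of this notebook.

```python
def clamp(value, floor, ceiling):
    """Reset values below the floor or above the ceiling to that bound."""
    return max(floor, min(ceiling, value))


def preprocess(rows, floor=100.0, ceiling=20000.0,
               min_fold_change=3.0, min_delta=100.0):
    """Clamp every value, then drop rows with too little variation.

    Assumption: a row must pass BOTH the fold-change and the delta test.
    """
    if floor <= 0:
        raise ValueError("floor must be positive so max/min is defined")
    kept = {}
    for name, values in rows.items():
        clamped = [clamp(v, floor, ceiling) for v in values]
        low, high = min(clamped), max(clamped)
        if high / low >= min_fold_change and high - low >= min_delta:
            kept[name] = clamped
    return kept


# Made-up rows for illustration only.
rows = {
    "flat_feature": [90.0, 110.0, 105.0],
    "variable_feature": [50.0, 900.0, 25000.0],
}

result = preprocess(rows)
print(result)
```

With these made-up numbers, `flat_feature` is clamped to values between 100 and 110 and fails the fold-change test, while `variable_feature` is clamped to the range 100 to 20000 and is kept. Raising `min_fold_change` removes more rows.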

### Example

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>Change the <em>min fold change</em> parameter to 10.</li>
    <li>Click <b>Run</b> to launch the analysis.</li>
</ol>
</div>

In [14]:
preprocessdataset_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020')
preprocessdataset_job_spec = preprocessdataset_task.make_job_spec()
preprocessdataset_job_spec.set_parameter("input.filename", "https://datasets.genepattern.org/data/all_aml/all_aml_train.gct")
preprocessdataset_job_spec.set_parameter("threshold.and.filter", "1")
preprocessdataset_job_spec.set_parameter("floor", "100")
preprocessdataset_job_spec.set_parameter("ceiling", "20000")
preprocessdataset_job_spec.set_parameter("min.fold.change", "3")
preprocessdataset_job_spec.set_parameter("min.delta", "100")
preprocessdataset_job_spec.set_parameter("num.outliers.to.exclude", "0")
preprocessdataset_job_spec.set_parameter("row.normalization", "0")
preprocessdataset_job_spec.set_parameter("row.sampling.rate", "1")
preprocessdataset_job_spec.set_parameter("threshold.for.removing.rows", "")
preprocessdataset_job_spec.set_parameter("number.of.columns.above.threshold", "")
preprocessdataset_job_spec.set_parameter("log2.transform", "0")
preprocessdataset_job_spec.set_parameter("output.file.format", "3")
preprocessdataset_job_spec.set_parameter("output.file", "<input.filename_basename>.preprocessed")
genepattern.display(preprocessdataset_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020')

## Step 2: HeatMapViewer

Display a heat map of the preprocessed gene expression data. Because the *min fold change* setting in the previous step was so stringent, the heat map shows only the genes with large changes in expression.

### Considerations

* HeatMapViewer displays gene expression data as a heat map, which makes it easier to see patterns in the numeric data. Gene names are row labels and sample names are column labels.
* Notebooks may contain any number of visualizations.
* The features displayed here are labeled with identifiers rather than gene names. In the next section we will convert these identifiers to gene names.
* Learn more by reading about the [HeatMapViewer](https://gp-beta-ami.genepattern.org/gp/getTaskDoc.jsp?name=HeatMapViewer) module.

### Note on instructions
* Many result files have similar names and differ only in their suffixes. When the instructions refer to a file as `<filename>.gct`, for example, they mean the result file whose name ends in `.gct`.
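
The suffix convention can be expressed as a small helper. This is a minimal sketch in plain Python; the function name `find_result` and the list of file names are hypothetical and stand in for whatever result files a job actually produces.

```python
def find_result(file_names, suffix):
    """Return the single file name that ends with the given suffix."""
    matches = [name for name in file_names if name.endswith(suffix)]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one file ending in {suffix!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Hypothetical result file names for illustration only.
outputs = ["all_aml_train.preprocessed.gct", "stdout.txt", "gp_execution_log.txt"]

chosen = find_result(outputs, ".gct")
print(chosen)
```

Raising an error when zero or several files match is a deliberate choice here: it avoids silently picking the wrong file when two results share a suffix.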

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
When the job above shows that it is completed (status in the upper right corner of the job cell displays <b>Completed</b>):
<ol>
    <li>Click the link for the <code>&lt;filename&gt;.preprocessed.gct</code> result file.</li>
    <li>You will see a list of choices.</li>
    <li>Select <b>Send to Existing GenePattern Cell</b>.</li>
    <li>You will see a list of available cells.</li>
    <li>Select the <b>HeatMapViewer</b> cell.</li>
    <li>You will see the file populated in the <em>dataset</em> parameter of the <b>HeatMapViewer</b> cell below.</li>
    <li>Launch the <b>HeatMapViewer</b> job by clicking <b>Run</b>.</li>
</ol>
</div>

In [15]:
heatmapviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00010')
heatmapviewer_job_spec = heatmapviewer_task.make_job_spec()
heatmapviewer_job_spec.set_parameter("dataset", "")
genepattern.display(heatmapviewer_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00010')

## Step 3: Convert feature identifiers to gene names (CollapseDataset)

You can see that the dataset uses an identifier system from [UniGene](https://www.ncbi.nlm.nih.gov/unigene) to name each gene (strictly speaking, each transcript). We would like to convert these identifiers to more easily understood gene names. The standard naming system for human genes is the [HUGO (HUman Genome Organization)](https://www.genenames.org/) standard.
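
The idea behind the conversion can be sketched as follows. This is an illustration only, not the CollapseDataset implementation: it assumes that a "Maximum" collapse mode takes, for each sample, the largest value among all identifiers that map to the same gene name, and that identifiers without a gene name are dropped. The function name, the identifiers, the gene names, and the mapping are all hypothetical; in the real module the mapping comes from the chip platform file.

```python
def collapse_to_max(rows, id_to_gene):
    """Merge rows that share a gene name, keeping the per-sample maximum."""
    collapsed = {}
    for identifier, values in rows.items():
        gene = id_to_gene.get(identifier)
        if gene is None:
            continue  # assumption: identifiers with no gene name are dropped
        if gene in collapsed:
            collapsed[gene] = [max(a, b) for a, b in zip(collapsed[gene], values)]
        else:
            collapsed[gene] = list(values)
    return collapsed


# Made-up identifiers, gene names, and values for illustration only.
rows = {
    "id_1": [1.0, 5.0, 3.0],
    "id_2": [4.0, 2.0, 6.0],
    "id_3": [7.0, 8.0, 9.0],
    "id_4": [0.5, 0.5, 0.5],
}
id_to_gene = {"id_1": "GENE_A", "id_2": "GENE_A", "id_3": "GENE_B"}

collapsed = collapse_to_max(rows, id_to_gene)
print(collapsed)
```

In this example the two identifiers that map to `GENE_A` are merged into a single row, so the collapsed dataset has one row per gene name.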

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>In the cell below, for the <em>dataset file</em> parameter, click the dropdown arrow on the right side of the input box.</li>
    <li>You will see all of the available result files in this notebook that can be sent to this input.</li>
    <li>Select <code>&lt;filename&gt;.preprocessed.gct</code>.</li>
    <li>Click <b>Run</b>.</li>
</ol>
</div>

In [16]:
collapsedataset_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00134')
collapsedataset_job_spec = collapsedataset_task.make_job_spec()
collapsedataset_job_spec.set_parameter("dataset.file", "")
collapsedataset_job_spec.set_parameter("chip.platform", "ftp://ftp.broadinstitute.org/pub/gsea/annotations/HU6800.chip")
collapsedataset_job_spec.set_parameter("collapse.mode", "Maximum")
collapsedataset_job_spec.set_parameter("output.file.name", "<dataset.file_basename>.collapsed")
genepattern.display(collapsedataset_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00134')