# Generation of a GCT file from TCGA and analysis

## Before you begin

* Sign in to GenePattern by entering your username and password into the form below. 
* Get manifest and metadata files from TCGA.
* Learn more by reading about [how to download the files from TCGA](https://github.com/genepattern/download_from_gdc/blob/master/how_to_download_a_manifest_and_metadata.pdf).

In [10]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://gp-beta-ami.genepattern.org/gp", "", ""))

## Generating a GCT file from manifest and metadata
- You should now have the manifest and metadata files. If not, you can use the [sample manifest](https://raw.githubusercontent.com/genepattern/download_from_gdc/master/data/gdc_medifest_20171221_005438.txt) and [sample metadata](https://raw.githubusercontent.com/genepattern/download_from_gdc/master/data/metadata.cart.2017-12-21T21_41_22.870798.json).
- Click and drag your manifest file into the manifest section and your metadata file into the metadata section below. Set translate gene id to True if you would like to translate ENSEMBL IDs into HUGO gene symbols (this makes the job take somewhat longer).
- The download_from_gdc module takes these files, then downloads the relevant data and converts it into [GCT](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT) format for later analysis in GenePattern.
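
To make the GCT layout concrete, here is a minimal sketch of a reader for GCT version 1.2 files: a `#1.2` version line, a line with the row and column counts, a tab-delimited header (`Name`, `Description`, then one column per sample), and one row per gene. The function name `parse_gct` and the tiny dataset are made up for illustration; they are not part of GenePattern or of the download_from_gdc output.

```python
import io


def parse_gct(handle):
    """Parse a GCT v1.2 stream into (sample_names, rows).

    rows maps each row name to (description, list_of_float_values).
    """
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")

    # Line 2 holds the number of data rows and the number of sample columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])

    # Line 3 is the header: Name, Description, then the sample names.
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]

    rows = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate trailing blank lines
        fields = line.rstrip("\n").split("\t")
        rows[fields[0]] = (fields[1], [float(v) for v in fields[2:]])

    if len(samples) != n_cols or len(rows) != n_rows:
        raise ValueError("dimensions on line 2 do not match the data")
    return samples, rows


# Hypothetical two-gene, three-sample dataset used only for this example.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tsample_A\tsample_B\tsample_C\n"
    "GENE1\tna\t10.0\t200.5\t35.0\n"
    "GENE2\tna\t0.0\t12.0\t7.5\n"
)
samples, rows = parse_gct(io.StringIO(example))
print(samples)
print(rows["GENE1"][1])
```

For real files, pass an open file handle instead of the `io.StringIO` wrapper.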

In [14]:
download_from_gdc_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00369')
download_from_gdc_job_spec = download_from_gdc_task.make_job_spec()
download_from_gdc_job_spec.set_parameter("manifest", "")
download_from_gdc_job_spec.set_parameter("metadata", "")
download_from_gdc_job_spec.set_parameter("output_file_name", "TCGA_dataset")
download_from_gdc_job_spec.set_parameter("gct", "True")
download_from_gdc_job_spec.set_parameter("translate_gene_id", "False")
genepattern.GPTaskWidget(download_from_gdc_task)

## Preprocess Dataset

This module preprocesses the data to remove platform noise and genes with low variance. Removing the less informative data decreases the runtime of later steps.
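
As a rough illustration of this kind of filtering, the sketch below clamps every value to a floor and ceiling, then keeps only rows whose clamped values vary enough by both fold change (max/min) and absolute difference (max - min). The default values mirror the parameters in the cell below, but the function `threshold_and_filter`, the exact comparison rules, and the sample rows are assumptions made for this example; the module's own implementation may differ in detail.

```python
def threshold_and_filter(rows, floor=20.0, ceiling=20000.0,
                         min_fold_change=3.0, min_delta=100.0):
    """Clamp values to [floor, ceiling] and drop low-variation rows.

    rows maps a gene name to a list of expression values.
    Assumes floor > 0 so that the fold change max/min is well defined.
    """
    kept = {}
    for name, values in rows.items():
        # Thresholding: values below the floor or above the ceiling are clamped.
        clipped = [min(max(v, floor), ceiling) for v in values]
        lo, hi = min(clipped), max(clipped)
        # Variation filter: require enough relative AND absolute variation.
        if hi / lo >= min_fold_change and hi - lo >= min_delta:
            kept[name] = clipped
    return kept


# Hypothetical rows: one flat, one variable, one saturated above the ceiling.
example_rows = {
    "flat": [50, 60, 55],
    "variable": [5, 400, 90],
    "saturated": [30000, 25000, 40000],
}
kept = threshold_and_filter(example_rows)
print(kept)
```

Only the `variable` row survives here: the flat row changes too little, and the saturated row becomes constant once every value is clamped to the ceiling.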

In [15]:
preprocessdataset_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020')
preprocessdataset_job_spec = preprocessdataset_task.make_job_spec()
preprocessdataset_job_spec.set_parameter("input.filename", "")
preprocessdataset_job_spec.set_parameter("threshold.and.filter", "1")
preprocessdataset_job_spec.set_parameter("floor", "20")
preprocessdataset_job_spec.set_parameter("ceiling", "20000")
preprocessdataset_job_spec.set_parameter("min.fold.change", "3")
preprocessdataset_job_spec.set_parameter("min.delta", "100")
preprocessdataset_job_spec.set_parameter("num.outliers.to.exclude", "0")
preprocessdataset_job_spec.set_parameter("row.normalization", "0")
preprocessdataset_job_spec.set_parameter("row.sampling.rate", "1")
preprocessdataset_job_spec.set_parameter("threshold.for.removing.rows", "")
preprocessdataset_job_spec.set_parameter("number.of.columns.above.threshold", "")
preprocessdataset_job_spec.set_parameter("log2.transform", "0")
preprocessdataset_job_spec.set_parameter("output.file.format", "3")
preprocessdataset_job_spec.set_parameter("output.file", "<input.filename_basename>.preprocessed")
genepattern.GPTaskWidget(preprocessdataset_task)

## Hierarchical Clustering

Run hierarchical clustering to cluster genes or samples based on how close they are. This will produce a CDT file (clustered data file) which is reordered to reflect the clustering. If you clustered by genes, it will generate a GTR file (gene tree); b
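
To show what "how close they are" can mean in practice, here is a small pure-Python sketch of agglomerative clustering of samples, using one minus the Pearson correlation as the distance and average linkage to compare clusters. These are common choices and are assumed here for illustration; they are not confirmed to match the numeric codes passed to the module below. The sample names and values are made up.

```python
from itertools import combinations
from statistics import mean


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / (var_x * var_y) ** 0.5


def average_linkage(profiles):
    """Cluster samples bottom-up; return the list of (left, right, distance) merges."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            # Average distance over all cross-cluster pairs of members.
            a, b = pair
            return mean(pearson_distance(profiles[p], profiles[q])
                        for p in clusters[a] for q in clusters[b])

        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        # Replace the two merged clusters with a single combined cluster.
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Hypothetical expression profiles (four genes per sample).
profiles = {
    "tumor_1": [1.0, 2.0, 3.0, 4.0],
    "tumor_2": [2.0, 4.1, 6.0, 8.2],
    "normal_1": [4.0, 3.0, 2.0, 1.0],
    "normal_2": [8.0, 6.2, 4.0, 2.1],
}
merges = average_linkage(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 4))
```

The two tumor samples and the two normal samples merge first because their profiles are almost perfectly correlated; the final merge joins the two groups at a distance close to 2, reflecting their opposite trends.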

In [16]:
hierarchicalclustering_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00009')
hierarchicalclustering_job_spec = hierarchicalclustering_task.make_job_spec()
hierarchicalclustering_job_spec.set_parameter("input.filename", "")
hierarchicalclustering_job_spec.set_parameter("column.distance.measure", "2")
hierarchicalclustering_job_spec.set_parameter("output_distance_matrix", "False")
hierarchicalclustering_job_spec.set_parameter("row.distance.measure", "No row clustering")
hierarchicalclustering_job_spec.set_parameter("clustering.method", "a")
hierarchicalclustering_job_spec.set_parameter("output.base.name", "<input.filename_basename>")
hierarchicalclustering_job_spec.set_parameter("row.centering", "Mean")
hierarchicalclustering_job_spec.set_parameter("row.normalize", "False")
hierarchicalclustering_job_spec.set_parameter("col.centering", "Mean")
hierarchicalclustering_job_spec.set_parameter("col.normalize", "False")
genepattern.GPTaskWidget(hierarchicalclustering_task)

## Hierarchical Clustering Viewer

We can now take the output from hierarchical clustering and render it as a heatmap.

In [17]:
hierarchicalclusteringviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00031')
hierarchicalclusteringviewer_job_spec = hierarchicalclusteringviewer_task.make_job_spec()
hierarchicalclusteringviewer_job_spec.set_parameter("cdt.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("gtr.file", "")
hierarchicalclusteringviewer_job_spec.set_parameter("atr.file", "")
genepattern.GPTaskWidget(hierarchicalclusteringviewer_task)

 <a href id="deseq2"></a>
## Compute differentially expressed transcripts

We can also run other modules using the data from TCGA. For example, this module uses the [DESeq2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302049/) method to find significantly differentially expressed transcripts.

<div class="alert alert-info">
- Select the GCT file from the Preprocess dataset job by clicking on the "Add GenePattern File or URL" next to **input file** and selecting the gct file under PreprocessDataset.
- Select the cls file from the download_from_gdc job by clicking on the "Add GenePattern File or URL" next to **cls file** and selecting the cls file under download_from_gdc.
- Click **Run**.
</div>
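
The cls file tells the module which class each sample belongs to. As a minimal sketch, assuming the standard GenePattern categorical CLS layout (a line with the sample count, class count and a `1`; a `#` line naming the classes; and a line with one 0-based class index per sample), it can be read like this. The helper `parse_cls` and the class names `Tumor` and `Normal` are hypothetical, and whether the file produced by download_from_gdc uses numeric indices is an assumption.

```python
def parse_cls(text):
    """Return one class name per sample from categorical CLS text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Line 1: number of samples, number of classes, and the constant 1.
    n_samples, n_classes, _ = (int(x) for x in lines[0].split())

    # Line 2: '#' followed by the class names, in index order.
    class_names = lines[1].lstrip("#").split()

    # Line 3: one class index per sample, in the same order as the GCT columns.
    labels = lines[2].split()

    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header counts do not match its contents")
    return [class_names[int(label)] for label in labels]


# Hypothetical six-sample, two-class file.
example_cls = "6 2 1\n# Tumor Normal\n0 0 0 1 1 1\n"
sample_classes = parse_cls(example_cls)
print(sample_classes)
```

The order of labels must match the order of the sample columns in the GCT file, so a quick check like this can catch a mismatched pair of files before running the job.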

In [18]:
deseq2_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00362')
deseq2_job_spec = deseq2_task.make_job_spec()
deseq2_job_spec.set_parameter("input.file", "")
deseq2_job_spec.set_parameter("cls.file", "")
deseq2_job_spec.set_parameter("confounding.variable.cls.file", "")
deseq2_job_spec.set_parameter("output.file.base", "<input.file_basename>")
deseq2_job_spec.set_parameter("qc.plot.format", "skip")
deseq2_job_spec.set_parameter("fdr.threshold", "0.1")
deseq2_job_spec.set_parameter("top.N.count", "20")
deseq2_job_spec.set_parameter("random.seed", "779948241")
genepattern.GPTaskWidget(deseq2_task)

The output of DESeq2 is in a text-based, tab-delimited format. Before visualizing it with the GenePattern ComparativeMarkerSelectionViewer, we must convert it to the ODF file format that the viewer accepts.

<div class="alert alert-info">
- Click the DESeq2_results_report file (`<filename>.DESeq2_results_report.txt`)
- Choose **Send to Existing GenePattern Cell**
- Select **txt2odf**
- Set the *prune gct* parameter to `True`
- Select the GCT file from the Preprocess dataset job by clicking on the "Add GenePattern File or URL" next to **gct** and selecting the gct file under PreprocessDataset.
- Select the cls file from the download_from_gdc job by clicking on the "Add GenePattern File or URL" next to **cls** and selecting the cls file under download_from_gdc.
- Click **Run**.
</div>

In [11]:
txt2odf_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:8080.gpserver.ip-172-31-26-71.ip-172-31-26-71.ec2.internal:genepatternmodules:23')
txt2odf_job_spec = txt2odf_task.make_job_spec()
txt2odf_job_spec.set_parameter("txt_file", "")
txt2odf_job_spec.set_parameter("prune_gct", "False")
txt2odf_job_spec.set_parameter("gct", "")
txt2odf_job_spec.set_parameter("cls", "")
genepattern.GPTaskWidget(txt2odf_task)

## Visualizing Differential Expression Results

The ComparativeMarkerSelectionViewer allows you to view the results of a differential expression analysis as a heatmap, profile of differentially expressed genes, histogram, or list. It also includes features that allow you to filter results, zoom in and out of a section of the gene list, and export results in a number of formats.

Run the ComparativeMarkerSelectionViewer module to view the results. The viewer displays the test statistic score, its p value, and additional statistics as computed by the differential expression method.

* Learn more by reading about the [ComparativeMarkerSelectionViewer](https://gp-beta-ami.genepattern.org/gp/getTaskDoc.jsp?name=ComparativeMarkerSelectionViewer) module.

<div class="alert alert-info">
- In the **comparative marker selection filename** parameter, click the triangle in the file input box.
- Select the txt2odf result file as the input.
- In the **dataset filename** parameter, click the triangle in the file input box.
- Select the txt2odf result file as the input.
- Click **Run**.
</div>

Once the job has downloaded the necessary data, it will display a visualization of the differential expression results.

In [12]:
comparativemarkerselectionviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00045')
comparativemarkerselectionviewer_job_spec = comparativemarkerselectionviewer_task.make_job_spec()
comparativemarkerselectionviewer_job_spec.set_parameter("comparative.marker.selection.filename", "")
comparativemarkerselectionviewer_job_spec.set_parameter("dataset.filename", "")
genepattern.GPTaskWidget(comparativemarkerselectionviewer_task)