# Data Preparation


<h3 id="Breast-Cancer-Data">Breast Cancer Data</h3>

<p>For this notebook and some of the later exercises we will use RNA-Seq data from <a href="https://cancergenome.nih.gov/" target="_blank">The Cancer Genome Atlas (TCGA)</a>. We have selected 20 breast cancer (BRCA) primary tumors with their matched normal samples. The read counts file for each tumor and normal sample (40 in total) can be retrieved from the <a href="https://gdc.cancer.gov" target="_blank">TCGA Genomic Data Commons</a> (GDC), along with the sample metadata (tumor/normal status).</p>

<p><strong>While the example we are using is for a specific selection of breast cancer samples, you can use the steps in this notebook to prepare any TCGA-derived data for use in GenePattern analyses.</strong></p>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<p>Throughout this notebook, actions that you are supposed to take are written in a blue box like this.</p>
</div>

## Scientific summary

<p>For our data preparation phase, we will perform the following steps:</p>

<h3 id="1.-Download-TCGA-Breast-Cancer-data-from-the-GDC-Data-Portal-as-a-GCT-file">1. Download TCGA Breast Cancer data from the GDC Data Portal as a GCT file</h3>

<p>We have generated the TCGA Genomic Data Commons (GDC) <strong>metadata</strong> and <strong>manifest</strong> files for you, so that the breast cancer read counts data can be imported into GenePattern easily. For a notebook with full details on how to generate these two files, see <a href="https://github.com/genepattern/TCGAImporter/blob/develop/how_to_download_a_manifest_and_metadata.ipynb" target="_blank">How to Download a TCGA Manifest and Metadata</a>.</p>

<p>The manifest and metadata files will then be provided to the TCGAImporter module. This module will download the individual read counts files (one per sample) and generate a <a href="http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT" target="_blank">GCT</a> file with each sample as one column and each Ensembl gene id as a row. The module will also use the tumor/normal status of each sample in the metadata to generate a <a href="http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#CLS" target="_blank">CLS file</a> assigning a phenotype to each sample.</p>
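
<p>To make the two output formats concrete, here is a minimal sketch of how a GCT (version 1.2) file and a categorical CLS file are laid out. It is not the TCGAImporter implementation. The helper names <code>format_gct</code> and <code>format_cls</code>, the gene ids and the sample names are hypothetical and chosen only for illustration.</p>

```python
def format_gct(sample_names, rows):
    """Return GCT (version 1.2) text.

    rows is a list of (gene_id, description, values) tuples, where values
    holds one number per sample.
    """
    # Line 1: version tag. Line 2: number of rows, then number of samples.
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    # Line 3: column header, one column per sample after Name/Description.
    lines.append("\t".join(["Name", "Description", *sample_names]))
    for gene_id, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{gene_id}: expected {len(sample_names)} values")
        lines.append("\t".join([gene_id, description, *map(str, values)]))
    return "\n".join(lines) + "\n"


def format_cls(labels):
    """Return categorical CLS text for a list of per-sample class labels."""
    classes = list(dict.fromkeys(labels))  # unique labels, first-seen order
    index = {name: i for i, name in enumerate(classes)}
    header = f"{len(labels)} {len(classes)} 1"   # samples, classes, always 1
    names = "# " + " ".join(classes)             # class names
    assignments = " ".join(str(index[label]) for label in labels)
    return "\n".join([header, names, assignments]) + "\n"


# Hypothetical toy data: two genes measured in one tumor and one normal sample.
samples = ["sample_1", "sample_2"]
rows = [("ENSG_A", "na", [120, 15]), ("ENSG_B", "na", [0, 42])]
gct_text = format_gct(samples, rows)
cls_text = format_cls(["Tumor", "Normal"])
print(gct_text)
print(cls_text)
```

<p>The order of the labels in the CLS file must match the order of the sample columns in the GCT file, which is why the two files are generated together.</p>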

<h3 id="2.-Replace-Ensembl-gene-ids-with-HUGO-symbols-and-remove-duplicates">2. Replace Ensembl gene ids with HUGO symbols and remove duplicates</h3>

<p>To make the dataset more human-friendly for analysis we will replace the Ensembl gene ids with HUGO symbols. Since more than one Ensembl gene id can map to a single symbol, we need to collapse any rows with duplicate symbols to a single row. The CollapseDataset module does both the remapping and collapsing for us.</p>

<h3 id="3.-Normalize-for-downstream-analysis">3. Normalize for downstream analysis</h3>

<p>Preprocess RNA-Seq count data in a GCT file so that it is suitable for use in GenePattern analyses. The <a href="http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/VoomNormalize/2">VoomNormalize</a> module will be used for this.&nbsp; It uses a mean-variance modeling technique (&#39;voom&#39; from the limma Bioconductor package) to transform the dataset to fit an approximation of a normal distribution, with the goal of being able to apply statistical methods and workflows that assume a normal distribution to the resulting output dataset.</p>

<h3 id="Login">Login</h3>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ul>
	<li>If you are not logged in, select the GenePattern server, enter your username and password in the cell below and click <strong>Login</strong>.</li>
</ul>
</div>

<p>The logins to the notebook server and the GenePattern server are separate to allow you to run analyses hosted on different GenePattern servers in the same notebook.</p>


In [8]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

GPAuthWidget()

## Begin the analysis

<p>We will use the TCGAImporter cell below. This module takes the <a href="https://github.com/genepattern/TCGAImporter/blob/develop/how_to_download_a_manifest_and_metadata.ipynb">pre-generated&nbsp;TCGA manifest and metadata files from the TCGA Genomic Data Commons</a> and uses them to download the read counts files and compile them into a GCT file and a CLS file.</p>


<h3 id="Step-1.-Compile-multiple-read-count-files-into-a-matrix-and-a-file-describing-the-phenotypes">Step 1. Download TCGA Breast Cancer data from the GDC Data Portal as a GCT file</h3>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>

<ul>
	<li>Click <strong>Run</strong> in the cell below to generate the compiled read count (<strong>gct</strong>) and phenotype (<strong>cls</strong>) files.</li>
</ul>
</div>


<p>Once you click <strong>Run</strong>, a new GenePattern output cell will appear. You can watch the job&#39;s status change in its top right corner. Once the job is complete, the job status cell will show links to the output files.</p>

<div class="well well-sm"><strong>Runtime:</strong> About 4 minutes</div>

In [10]:
download_from_gdc_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00369')
download_from_gdc_job_spec = download_from_gdc_task.make_job_spec()
download_from_gdc_job_spec.set_parameter("manifest", "https://datasets.genepattern.org/data/ccmi_tutorial/2018-06-07/workshop_gdc_manifest_20180607.txt")
download_from_gdc_job_spec.set_parameter("metadata", "https://datasets.genepattern.org/data/ccmi_tutorial/2018-06-07/workshop_gdc_metadata_20180607.json")
download_from_gdc_job_spec.set_parameter("output_file_name", "BRCA_Dataset")
download_from_gdc_job_spec.set_parameter("gct", "True")
download_from_gdc_job_spec.set_parameter("translate_gene_id", "False")
download_from_gdc_job_spec.set_parameter("cls", "True")
genepattern.display(download_from_gdc_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00369')

<h3 id="Examine-your-output-files">Examine your output files</h3>

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>

<ul>
	<li>Click on the <strong>cls</strong> and <strong>gct</strong> files you just generated and select the <q>Open in new tab</q> option to view them in your browser.</li>
    <li>Alternatively, for the <strong>gct</strong> file you can click on it and select <q>Send to DataFrame</q> to look at the resulting gct file within this Jupyter notebook.</li>
</ul>
</div>
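
<p>If you would rather check the file in code, the sketch below parses GCT text with the Python standard library and confirms that the dimensions declared on the second line match the data. It assumes a well-formed GCT version 1.2 file. The function name <code>parse_gct</code> and the toy content are hypothetical.</p>

```python
def parse_gct(text):
    """Parse GCT v1.2 text into (sample_names, {row_name: [values]})."""
    lines = text.strip().splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("not a GCT v1.2 file")
    # Second line: declared number of rows and number of samples.
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    # Third line: Name, Description, then one column per sample.
    samples = lines[2].split("\t")[2:]
    data = {}
    for line in lines[3:]:
        fields = line.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(samples) != n_cols or len(data) != n_rows:
        raise ValueError("declared dimensions do not match the data")
    return samples, data


# Hypothetical two-gene, two-sample file content.
example = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tsample_1\tsample_2\n"
    "ENSG_A\tna\t120\t15\n"
    "ENSG_B\tna\t0\t42\n"
)
sample_names, counts = parse_gct(example)
print(sample_names)
print(counts)
```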


<h3 id="Step-2.-Replace-Ensembl-gene-ids-with-HUGO-symbols-and-remove-duplicates">Step 2. Replace Ensembl gene ids with HUGO symbols and remove duplicates</h3>

<p>When you looked at the gct file that was output, you may have noticed that its rows are identified by Ensembl gene ids. We would like to change these to HUGO symbols to make the dataset more human-friendly. After this mapping, more than one Ensembl gene id can end up assigned to the same HUGO symbol. The analyses we will run later expect each row name to be unique, so we need to collapse such duplicates down to a single row.</p>

<p>To do this we will use the CollapseDataset module which can collapse the rows and replace the Ensembl IDs with HUGO symbols in one step.</p>
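
<p>The idea behind the remapping and collapsing can be sketched in a few lines of Python. This is an illustration only, not the CollapseDataset implementation. It assumes that duplicates are resolved by taking the per-sample maximum and that ids without a symbol are dropped. The gene ids, symbols and the function name <code>collapse_to_symbols</code> are hypothetical.</p>

```python
def collapse_to_symbols(counts_by_gene_id, symbol_for_gene_id):
    """Rename rows to gene symbols, keeping the per-sample maximum
    whenever several gene ids share one symbol."""
    collapsed = {}
    for gene_id, values in counts_by_gene_id.items():
        symbol = symbol_for_gene_id.get(gene_id)
        if symbol is None:
            continue  # assumption: ids without a known symbol are dropped
        if symbol in collapsed:
            # Duplicate symbol: keep the larger value in each sample column.
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed


# Hypothetical data: ENSG_A and ENSG_B both map to GENE1.
counts = {"ENSG_A": [10, 3], "ENSG_B": [4, 8], "ENSG_C": [1, 1]}
mapping = {"ENSG_A": "GENE1", "ENSG_B": "GENE1", "ENSG_C": "GENE2"}
collapsed = collapse_to_symbols(counts, mapping)
print(collapsed)  # {'GENE1': [10, 8], 'GENE2': [1, 1]}
```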

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>

<ol>
	<li>Click on the&nbsp;<b>dataset file</b>&nbsp;parameter below and select the&nbsp;<q>BRCA_Dataset.gct</q>&nbsp;result file from the&nbsp;<b>TCGAImporter</b>&nbsp;analysis you just ran.</li>
	<li>Leave the <strong>chip platform</strong> parameter as&nbsp;<q><a href="https://datasets.genepattern.org/data/TCGA_HTSeq_counts/ENSEMBL_human_gene.chip">https://datasets.genepattern.org/data/TCGA_HTSeq_counts/ENSEMBL_human_gene.chip</a></q>, which should already be filled in.</li>
	<li>Click the button&nbsp;<strong><em>Run</em></strong>&nbsp;on the analysis below.</li>
</ol>
</div>


<div class="well well-sm"><strong>Runtime:</strong> Less than 2 minutes</div>

In [11]:
collapsedataset_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00134')
collapsedataset_job_spec = collapsedataset_task.make_job_spec()
collapsedataset_job_spec.set_parameter("dataset.file", "")
collapsedataset_job_spec.set_parameter("chip.platform", "https://datasets.genepattern.org/data/TCGA_HTSeq_counts/ENSEMBL_human_gene.chip")
collapsedataset_job_spec.set_parameter("collapse.mode", "Maximum")
collapsedataset_job_spec.set_parameter("output.file.name", "BRCA_Dataset.collapsed.gct")
genepattern.display(collapsedataset_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00134')

<h3 id="Step-3.-Normalize-for-downstream-analysis">Step 3. Normalize for downstream analysis</h3>

<p>Preprocess RNA-Seq count data in a GCT file so that it is suitable for use in GenePattern analyses.</p>

<ul>
	<li>The <strong>VoomNormalize</strong> module is used to preprocess RNA-Seq data into a form suitable for use downstream in other GenePattern analyses such as Gene Set Enrichment Analysis (GSEA), ComparativeMarkerSelection, HierarchicalClustering, as well as visualizers.</li>
	<li>Many of these approaches assume that the data is distributed normally, yet this is not true of RNA-seq read count data. The VoomNormalize module provides one approach to accommodate this. It uses a mean-variance modeling technique (&#39;<strong>voom</strong>&#39; from the <strong>limma</strong> Bioconductor package) to transform the dataset to fit an approximation of a normal distribution, with the goal of being able to apply statistical methods and workflows that assume a normal distribution.</li>
	<li>Learn more by reading about the <a href="http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/VoomNormalize/2" target="_blank">VoomNormalize</a> module.</li>
</ul>
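
<p>As a rough illustration of the transformation, the sketch below computes the log2 counts-per-million value that limma&#39;s voom uses as its starting point: log2((count + 0.5) / (library size + 1) &times; 10<sup>6</sup>). It deliberately omits the mean-variance precision weights that voom also estimates, as well as any filtering or between-sample normalization the VoomNormalize module may apply, so it is not a substitute for the module. The function name <code>log2_cpm</code> and the counts are hypothetical.</p>

```python
import math


def log2_cpm(counts_by_gene, prior_count=0.5):
    """Return log2 counts-per-million for each gene and sample.

    counts_by_gene maps a gene name to a list of raw counts, one per sample.
    The library size of a sample is the sum of its counts over all genes.
    """
    if not counts_by_gene:
        return {}
    genes = list(counts_by_gene)
    n_samples = len(counts_by_gene[genes[0]])
    lib_sizes = [sum(counts_by_gene[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [
            math.log2((counts_by_gene[g][j] + prior_count) / (lib_sizes[j] + 1.0) * 1e6)
            for j in range(n_samples)
        ]
        for g in genes
    }


# Hypothetical raw counts for three genes in two samples.
raw = {"GENE1": [10, 200], "GENE2": [0, 50], "GENE3": [990, 750]}
for gene, values in log2_cpm(raw).items():
    print(gene, [round(v, 2) for v in values])
```

<p>The small offset added to each count keeps zero counts finite after the logarithm, and dividing by the library size makes samples with different sequencing depths comparable.</p>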


<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>

<ol>
	<li>Click on the <b>input file</b> parameter below and select the <q>BRCA_Dataset.collapsed.gct</q> result file from the <b>CollapseDataset</b> analysis you just ran.</li>
	<li>Click on the <b>cls file</b> parameter below and select the <q>BRCA_Dataset.cls</q>&nbsp;result file from the <b>TCGAImporter</b> analysis you ran at the beginning of this notebook.</li>
    <li>Click on the <i class="fa fa-plus"></i> at the right edge of the <em>Advanced Parameters</em> header to display additional parameters.</li>
	<li>Set the <strong><em>expression.value.filter.threshold</em></strong> parameter to <q>4</q>. This reduces the number of rows and therefore the computation time.</li>
	<li>Copy and paste the name <q>BRCA_HUGO_symbols.preprocessed.gct</q> into the <strong>output file</strong> parameter.</li>
	<li>Click the button <strong><em>Run</em></strong> on the analysis below.</li>
</ol>
</div>


<div class="well well-sm"><strong>Runtime:</strong> About 30 seconds</div>

In [12]:
preprocessreadcounts_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00355')
preprocessreadcounts_job_spec = preprocessreadcounts_task.make_job_spec()
preprocessreadcounts_job_spec.set_parameter("input.file", "")
preprocessreadcounts_job_spec.set_parameter("cls.file", "")
preprocessreadcounts_job_spec.set_parameter("output.file", "BRCA_HUGO_symbols.preprocessed.gct")
preprocessreadcounts_job_spec.set_parameter("expression.value.filter.threshold", "4")
genepattern.display(preprocessreadcounts_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00355')

<h3>Review the newly generated file</h3>
<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>Click on the output file called <q>BRCA_HUGO_symbols.preprocessed.gct</q> above.</li>
<li>Select <q>Send to existing GenePattern Cell</q>. </li>
<li>Select <q>HeatMapViewer</q> (below).</li>
    <li><strong><em>Run</em></strong> the HeatMapViewer cell.</li>
</ol></div>

In [9]:
heatmapviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00010')
heatmapviewer_job_spec = heatmapviewer_task.make_job_spec()
heatmapviewer_job_spec.set_parameter("dataset", "")
genepattern.display(heatmapviewer_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00010')

## Extra Credit

<p>In the example above, we provided the cell with the <strong>TCGAImporter</strong> module pre-selected and with the metadata and manifest files pre-generated and pre-entered as input.&nbsp; If you would like to go through the steps to set this up for yourself, follow the instructions <a href="https://github.com/genepattern/TCGAImporter/blob/develop/how_to_download_a_manifest_and_metadata.ipynb" target="_blank">here</a>. You can use this as an opportunity to select samples from a different cancer type. Then you can re-run this notebook using your new inputs.</p>
