<h1>Data Preparation</h1>

<div class="alert alert-info"><p>Throughout this notebook, actions that you are supposed to take are written in a blue box like this.</p></div>

<h3>Breast Cancer Data</h3>

<p>For this notebook and some of the later exercises we will use RNA-Seq data from <a href="https://cancergenome.nih.gov/" target="_blank">The Cancer Genome Atlas (TCGA)</a>. We have selected 20 breast cancer (BRCA) primary tumors with their matched normal samples. We start with the read count files for each tumor and normal sample (40 in total), downloaded from the <a href="https://cancergenome.nih.gov/" target="_blank">TCGA data portal</a> and placed on a web server that permits unrestricted access. We also have a <a href="http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#sample-information-file" target="_blank">sample information file</a>, created in a spreadsheet, that lists the filenames, the sample names, and whether each sample is a tumor or a matched normal.</p>

**While the example we are using is for a specific selection of breast cancer samples, you can use the steps in this notebook to prepare any TCGA-derived data for use in GenePattern analyses.**

<h2>Scientific summary</h2>

<p>For our data preparation phase, we will perform the following steps:</p>

<h3>1. Compile multiple read count files into a matrix and a file describing the phenotypes</h3>

<p>We will provide the read count files (one per sample) and a sample info file to the MergeHTSeqCounts module. It will generate a <a href="http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GCT" target="_blank">GCT</a> file with each sample as one column and each Ensembl gene id as a row. The module will use the <strong>filename</strong> column of the <a href="http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#sample-information-file" target="_blank">sample information file</a> to identify which row corresponds to which read count file. It uses the <strong>samplename</strong> column to replace the read count filename with a more informative sample name in the output files. It uses the <strong>primary tumor/normal</strong> column to generate a <a href="http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#CLS" target="_blank">CLS file</a> assigning a phenotype to each sample.</p>
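
<p>To make the merge concrete, here is a minimal sketch in plain Python of what compiling per-sample counts into a matrix and a CLS file involves. It is not the MergeHTSeqCounts implementation: the function names <code>merge_counts</code> and <code>make_cls</code>, the sample names, and the counts are all hypothetical, and filling missing genes with 0 is an assumption made for illustration.</p>

```python
def merge_counts(sample_counts):
    """Combine {sample_name: {gene_id: count}} into (gene_ids, sample_names, rows).

    Illustrative only: genes missing from a sample are filled with 0.
    """
    sample_names = list(sample_counts)
    gene_ids = sorted({g for counts in sample_counts.values() for g in counts})
    rows = [[sample_counts[s].get(g, 0) for s in sample_names] for g in gene_ids]
    return gene_ids, sample_names, rows


def make_cls(phenotypes):
    """Build the three lines of a CLS file from a per-sample phenotype list."""
    classes = []
    for p in phenotypes:
        if p not in classes:
            classes.append(p)  # keep classes in order of first appearance
    header = "{} {} 1".format(len(phenotypes), len(classes))
    names = "# " + " ".join(classes)
    labels = " ".join(str(classes.index(p)) for p in phenotypes)
    return [header, names, labels]


# Toy data (hypothetical sample names and counts)
toy = {
    "A0CE-Tumor": {"ENSG00000000003.13": 10, "ENSG00000000419.11": 5},
    "A0CE-Normal": {"ENSG00000000003.13": 7, "ENSG00000000419.11": 9},
}
gene_ids, sample_names, rows = merge_counts(toy)
cls_lines = make_cls(["Tumor", "Normal"])
```

<p>Each entry of <code>rows</code> holds one gene's counts across the samples, in the order given by <code>sample_names</code>; <code>cls_lines</code> holds the sample and class counts, the class names, and one numeric label per sample.</p>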

<h3>2. Remove version suffix from Ensembl gene ids</h3>

<p>The Ensembl gene ids in the read count files include a version suffix (e.g. ENSG00000000419.11), but the module we will use in Step 3 (CollapseDataset) does not accept Ensembl ids that include the version. We will load the GCT file generated in Step 1 into this notebook, strip the versions from the ids (using Python), and then save it back to the GenePattern server to be used in the next step.</p>
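
<p>As a small illustration of the stripping step, the sketch below removes everything from the first period onward in each id. The helper name <code>strip_ensembl_version</code> is hypothetical; the notebook cell in Step 2 applies the same idea to the <code>Name</code> column of the GCT data frame.</p>

```python
def strip_ensembl_version(gene_id):
    """Return the Ensembl id without its trailing '.<version>' suffix."""
    return gene_id.split(".", 1)[0]


# Ids with and without a version suffix; unversioned ids pass through unchanged
versioned = ["ENSG00000000419.11", "ENSG00000000003.13", "ENSG00000000005"]
unversioned = [strip_ensembl_version(g) for g in versioned]
```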

<h3>3. Replace Ensembl gene ids with HUGO symbols and remove duplicates</h3>

<p>To make the dataset more human-friendly for analysis we will replace the Ensembl gene ids with HUGO symbols. Since more than one Ensembl gene id can map to a single symbol, we need to collapse any rows with duplicate symbols to a single row. The CollapseDataset module does both the remapping and collapsing for us.</p>
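
<p>The sketch below shows one way such a collapse could work. It assumes that the "Maximum" collapse mode keeps, for each sample, the largest value among the rows that share a symbol, and that ids without a symbol are dropped; both are assumptions for illustration rather than a description of the CollapseDataset code. The function name, ids, and symbols are hypothetical.</p>

```python
def collapse_to_symbols(rows, id_to_symbol):
    """Collapse {ensembl_id: [counts...]} to {symbol: [counts...]}.

    Rows sharing a symbol are merged by taking the per-sample maximum;
    ids without a symbol in the mapping are dropped.
    """
    collapsed = {}
    for ensembl_id, values in rows.items():
        symbol = id_to_symbol.get(ensembl_id)
        if symbol is None:
            continue  # no symbol known for this id
        if symbol in collapsed:
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed


# Toy matrix with two samples; the ids and symbols are made up
toy_rows = {
    "ENSG_TOY_1": [10, 2],
    "ENSG_TOY_2": [3, 8],
    "ENSG_TOY_3": [1, 1],
    "ENSG_TOY_4": [5, 5],
}
toy_map = {"ENSG_TOY_1": "GENE_A", "ENSG_TOY_2": "GENE_A", "ENSG_TOY_3": "GENE_B"}
collapsed = collapse_to_symbols(toy_rows, toy_map)
```

<p>Here the two rows mapped to <code>GENE_A</code> become a single row, and <code>ENSG_TOY_4</code> is omitted because the toy mapping has no symbol for it.</p>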

<h3>4. Normalize for downstream analysis</h3>
<p>Preprocess RNA-Seq count data in a GCT file so that it is suitable for use in GenePattern analyses.</p>

<h3>Login</h3>

<div class="alert alert-info"><ul><li>If you are not logged in, enter your username and password in the cell below and click **Login**.</li></ul></div>

<p>The logins to the notebook server and the GenePattern server are separate to allow you to run analyses hosted on different GenePattern servers in the same notebook.</p>


In [1]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://gp-beta-ami.genepattern.org/gp", "", ""))


<h3>Begin the analysis</h3>

<p>We will use the MergeHTSeqCounts cell below. This module takes the 40 read count files and the sample info file. We also tell it which columns of the sample info file hold the filename, the sample name we want to use going forward, and the class distinction used to generate a companion <b>cls</b> file.</p>

<p><i>We have prefilled the sample info file and the 40 input file URLs into the next cell to save time.</i></p>

<h3>Step 1. Compile multiple read count files into a matrix and a file describing the phenotypes</h3>
<p><div class="alert alert-info"><ul><li>Click **Run** in the cell below to generate the compiled read count (<strong>gct</strong>) and phenotype (<strong>cls</strong>) files.</li></ul></div></p>

<p>Once you hit run, a new GenePattern output cell will appear. You can watch the job's status change in its top right corner. Once it is complete it will show you links to the output files in the job status cell.</p>

In [2]:
mergehtseqcounts_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00354')
mergehtseqcounts_job_spec = mergehtseqcounts_task.make_job_spec()
mergehtseqcounts_job_spec.set_parameter("input.files", ["https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0CE-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0CE-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0CH-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0CH-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0D9-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0D9-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0DB-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A0DB-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A13E-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A13E-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A13F-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A13F-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A13G-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-A7-A13G-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A23H-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A23H-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A2FB-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A2FB-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A2FF-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A2FF-11.htseq.counts", 
"https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A2FM-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-AC-A2FM-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0AU-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0AU-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0AY-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0AY-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0AZ-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0AZ-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B3-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B3-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B5-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B5-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B7-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B7-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B8-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0B8-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0BA-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0BA-11.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0BC-01.htseq.counts", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA-BH-A0BC-11.htseq.counts"])
mergehtseqcounts_job_spec.set_parameter("output.prefix", "BRCA_with_versioned_ensemble_ids")
mergehtseqcounts_job_spec.set_parameter("sampleinfo.file", "https://datasets.genepattern.org/data/TCGA_BRCA/BRCA_HTSeqCounts/TCGA_BRCA_SAMPLEINFO.txt")
mergehtseqcounts_job_spec.set_parameter("filenames.column", "filename")
mergehtseqcounts_job_spec.set_parameter("class.division.column", "2")
mergehtseqcounts_job_spec.set_parameter("sample.name.column", "samplename")
genepattern.GPTaskWidget(mergehtseqcounts_task)

<h3>Examine your output files</h3>

<p><div class="alert alert-info"><ul><li>Click on the <strong>cls</strong> and <strong>gct</strong> files you just generated and select the `Open in new tab` option to view them in your browser.</li></ul></div></p>
<p>Alternatively, for the <strong>gct</strong> file you can click on it and select `Send to data frame` to look at the resulting gct file within this Jupyter notebook.</p>

<p>&nbsp;</p>

<h3>Step 2. Remove version suffix from Ensembl gene ids</h3>
<p><div class="alert alert-info">
<ol>
<li>Click on the `MergeHTSeqCounts gct file` input in the cell below and select <b>BRCA_with_versioned_ensemble_ids.gct</b> as its input.  Leave the output variable unchanged.</li>
<li>Run the cell below.</li>
</ol>
</div></p>


In [6]:
import os 
from gp.data import GCT
global my_local_url
@genepattern.build_ui( name="Strip Ensembl Version and write a new GCT",
                      description="Strip out the version from the Ensembl ids in a gct file and save it as "
                      +" a new gct file on the GenePattern server.  Returns the URL to be used in the next job "
                      +"and also writes it to a notebook variable called \"my_local_url\". The method requires the gct "
                      +" file from a completed MergeHTSeqCounts job.", parameters={
    "MergeHTSeqCounts_gct_file": {
        "type": "file",
        "kinds": ["gct"]
    }
})
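def stripEnsembleIdAndGetLocalUrl(MergeHTSeqCounts_gct_file):
    # declare the global here so the assignment near the end of this function
    # updates the notebook-level variable rather than creating a local one
    global my_local_url
    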
    output_gct_filename = "BRCA_unversioned_ensembl_ids.gct"
    
    # get the input filename and job number
    jobNum = MergeHTSeqCounts_gct_file.split("/")[-2]
    input_gct_file_Name = MergeHTSeqCounts_gct_file.split("/")[-1]
    
    # get the GenePattern input job object and my username
    lastJob = gp.GPJob(genepattern.get_session(0), jobNum)
    myUserId = genepattern.get_session(0).username
    
    # this is the part that actually removes the version id
    input_gct = GCT(lastJob.get_file(input_gct_file_Name))
    df2 = input_gct.dataframe.reset_index()
    df2['Name'] = df2['Name'].apply(lambda x: x.split(".")[0])
    
    # reset the index on name and Description in case we want to look at this dataframe later
    #df2.set_index(['Name', 'Description'])
    
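    # now save it back as a new file local to the Notebook server
    with open(output_gct_filename, 'w') as f:
        # GCT header: version line, then row count and sample count
        # (the two non-sample columns are Name and Description)
        f.write('#1.2\n{}\t{}\n'.format(df2.shape[0], df2.shape[1] - 2))
        df2.to_csv(f, sep='\t', index=False)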

    # upload the local file onto the GenePattern server so we can use it in the next module
    uploaded = genepattern.get_session(0).upload_file(output_gct_filename, output_gct_filename) 
    my_local_url = uploaded.get_url()  
    print("Stripped GCT file url is: "+ my_local_url)
    return my_local_url
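
<p>The header written in the cell above follows the GCT layout: a version line, a line with the number of rows and the number of samples, a column header, and then one line per row. The standalone sketch below writes a tiny GCT table to an in-memory buffer to show that layout. The function name <code>write_gct</code> and the toy values are hypothetical, and it is only meant as an illustration, not as a replacement for the cell above.</p>

```python
import io


def write_gct(handle, names, descriptions, sample_names, rows):
    """Write a GCT v1.2 table: version line, dimensions line, header, data rows."""
    handle.write("#1.2\n")
    handle.write("{}\t{}\n".format(len(rows), len(sample_names)))
    handle.write("\t".join(["Name", "Description"] + list(sample_names)) + "\n")
    for name, desc, values in zip(names, descriptions, rows):
        handle.write("\t".join([name, desc] + [str(v) for v in values]) + "\n")


# One gene, two samples (toy values)
buffer = io.StringIO()
write_gct(buffer, ["ENSG00000000419"], ["na"], ["S1", "S2"], [[5, 9]])
gct_text = buffer.getvalue()
```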

<h3>Step 3. Replace Ensembl gene ids with HUGO symbols and remove duplicates</h3>

<p>When you looked into the gct file that was output, you may have noticed that it uses Ensembl ids to label the rows of counts. We would like to change these to HUGO symbols to make the dataset more human-friendly. When we do this, we may end up with multiple Ensembl ids that all map to a single HUGO symbol. However, the analyses we will do later do not accept duplicate row names, so we will collapse each set of rows that share a symbol down to a single row.</p>

<p>To do this we will use the CollapseDataset module which can collapse the rows and replace the Ensembl IDs with HUGO symbols in one step.</p>
<p><div class="alert alert-info">
<ol>
<li>Run the CollapseDataset cell below, leaving the inputs unchanged.</li></ol>
</div></p><br/>

In [3]:
collapsedataset_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00134')
collapsedataset_job_spec = collapsedataset_task.make_job_spec()
collapsedataset_job_spec.set_parameter("dataset.file", "{{BRCA_with_versioned_ensemble_ids_stripped}}")
collapsedataset_job_spec.set_parameter("chip.platform", "ftp://ftp.broadinstitute.org/pub/gsea/annotations/ENSEMBL_human_gene.chip")
collapsedataset_job_spec.set_parameter("collapse.mode", "Maximum")
collapsedataset_job_spec.set_parameter("output.file.name", "<dataset.file_basename>.collapsed")
genepattern.GPTaskWidget(collapsedataset_task)

### Step 4. Normalize for downstream analysis

Preprocess RNA-Seq count data in a GCT file so that it is suitable for use in GenePattern analyses. 

* The **PreprocessReadCounts** module is used to preprocess RNA-Seq data into a form suitable for use downstream in other GenePattern analyses such as Gene Set Enrichment Analysis (GSEA), ComparativeMarkerSelection, HierarchicalClustering, as well as visualizers. 
* Many of these approaches assume that the data is distributed normally, yet this is not true of RNA-seq read count data. The PreprocessReadCounts module provides one approach to accommodate this. It uses a mean-variance modeling technique to transform the dataset to fit an approximation of a normal distribution, with the goal of being able to apply statistical methods and workflows that assume a normal distribution.
* Learn more by reading about the [PreprocessReadCounts](http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/PreprocessReadCounts/1) module.
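
To see why a transformation helps, the sketch below converts one sample's raw counts to log2 counts-per-million, the kind of scale on which mean-variance methods such as limma's voom operate. It is not the PreprocessReadCounts implementation; the function name `log2_cpm`, the 0.5 offset default, and the toy counts are assumptions chosen for illustration.

```python
import math


def log2_cpm(counts, prior=0.5):
    """Convert one sample's raw counts to log2 counts-per-million.

    Uses log2((count + prior) / (library_size + 1) * 1e6); the small offset
    keeps zero counts finite after the log.
    """
    library_size = sum(counts)
    return [math.log2((c + prior) / (library_size + 1) * 1e6) for c in counts]


# Hypothetical counts for four genes in one sample (library size 999999)
sample_counts = [0, 9, 90, 999900]
values = log2_cpm(sample_counts)
```

Raw counts that span several orders of magnitude are compressed onto a much narrower log scale, and a zero count still yields a finite value.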

<div class="alert alert-info">

1. Click on the <b>input file*</b> parameter below and select the `BRCA_unversioned_ensembl_ids.collapsed.gct` result file from the analysis you just did.
2. Click on the <b>cls file*</b> parameter below and select the <a href="https://raw.githubusercontent.com/genepattern/example-notebooks/master/2017-11-07_CCMI_workshop/BRCA_40_samples.cls">BRCA_with_versioned_ensemble_ids.cls</a> from the MergeHTSeqCounts analysis you performed at the beginning of this notebook.
3. Click on the + at the right edge of the *Advanced Parameters* header to display additional parameters.
4. Set the <strong><em>expression.value.filter.threshold</em></strong> parameter to 4 (to reduce the number of rows, and thus the computation time).
5. Copy and paste the name `BRCA_HUGO_symbols.preprocessed.gct` into the **output file** parameter.
6. Click the <strong><em>Run</em></strong> button on the analysis below.

</div>


In [4]:
preprocessreadcounts_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00355')
preprocessreadcounts_job_spec = preprocessreadcounts_task.make_job_spec()
preprocessreadcounts_job_spec.set_parameter("input.file", "")
preprocessreadcounts_job_spec.set_parameter("cls.file", "")
preprocessreadcounts_job_spec.set_parameter("output.file", "<input.file_basename>.preprocessed.gct")
preprocessreadcounts_job_spec.set_parameter("expression.value.filter.threshold", "4")
genepattern.GPTaskWidget(preprocessreadcounts_task)

<h3>Review the newly generated file</h3>
<div class="alert alert-info"><ol>
<li>Click on the output file called `BRCA_HUGO_symbols.preprocessed.gct`. </li>
<li>Select `Send to existing GenePattern Cell`. </li>
<li>Select `HeatMapViewer` (below).</li>
<li>Run the HeatMapViewer cell.</li>
</ol></div>



In [5]:
heatmapviewer_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00010')
heatmapviewer_job_spec = heatmapviewer_task.make_job_spec()
heatmapviewer_job_spec.set_parameter("dataset", "")
genepattern.GPTaskWidget(heatmapviewer_task)

<h3>Extra Credit</h3>

<p>In the example above, we provided the cell with the MergeHTSeqCounts module pre-selected and with all of the files pre-entered as input.&nbsp; If you would like to go through the steps to set this up for yourself, here is what you can do:</p>

<ol>
	<li>Create a cell and change its type to &quot;GenePattern&quot;.</li>
    <li>When the select module dialog opens, search to find the MergeHTSeqCounts module and select it.</li>
	<li>Open the datasets page from the link below in a separate window.</li>
	<li>Drag the file, <b>TCGA_BRCA_SAMPLEINFO.txt</b>, from the datasets page into the <strong>sampleinfo file</strong> parameter.</li>
	<li>Drag all of the htseq.counts files from the datasets page to the <strong>input files</strong> parameter. <strong>You can select and drag all the htseq.counts files as a single block rather than one at a time.</strong></li>
	<li>Set the appropriate values for the columns. Click on the link for the sample info file to see its format. In this file, the columns are <strong>filename</strong>, <strong>samplename</strong> and <strong>primary tumor/normal</strong>. You can use either these names or their column index to identify the columns to the module.</li>
	<li>Run the module and confirm that the results match the example above.</li>
	<li>Repeat the rest of the steps from the example above.</li>
</ol>

<p>&nbsp;</p>

<h3>Input Files:</h3>

<p>You can drag and drop the input files from the page at the link below. Do not worry about selecting just the links; the GenePattern file drop parameter will do the right thing if you grab the whole page. Make sure you differentiate between the <strong>&quot;*.htseq.counts&quot;</strong> files and the sample info file, <strong>TCGA_BRCA_SAMPLEINFO.txt</strong>.</p>
<p><a href="https://datasets.genepattern.org/index.html?prefix=data/ccmi_tutorial/2017-12-15/BRCA_HTSeqCounts/" target="_blank">https://datasets.genepattern.org/index.html?prefix=data/ccmi_tutorial/2017-12-15/BRCA_HTSeqCounts/</a></p>
