<h1 id="Differential-Expression-of-RNA-Seq-data-in-GenePattern-Notebook">Differential Expression of RNA-Seq data in GenePattern Notebook</h1>

<p>Compute differentially expressed genes or transcripts and visualize the results</p>

<h2 id="Before-you-begin">Before you begin</h2>

<p>You must log in to a GenePattern server. In this notebook we will use the <strong>GenePattern Production Server</strong>, hosted at the Broad Institute.</p>


<div class="alert alert-info">
<ul>
	<li>Sign in to GenePattern by clicking &quot;Login as...&quot; in the dialog that should be displayed below.&nbsp;
	<ul>
		<li>This logs you in with the same account you used to log in to the notebook repository.</li>
	</ul>
	</li>
</ul>
</div>


In [18]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://genepattern.broadinstitute.org/gp", "", ""))

<h2 id="RNA-Seq-Differential-Analysis-Workflow">RNA-Seq Differential Analysis Workflow</h2>

<p>As we progress through RNA-seq analysis, we are looking to answer several questions. The first question is &ndash; what does my data look like? Does it look approximately the way most RNA-seq data looks, or are there significant issues which suggest a problem with the sequencing? Are there biases toward 3&rsquo; or 5&rsquo; ends, different bases, different sequences?</p>

<p>Once we&rsquo;ve determined that the data looks approximately as expected, we can proceed with alignment and downstream analyses. At that stage we ask other questions, such as &ndash; how did my data align? How many of the reads aligned? Are there visual differences between the phenotypes?</p>

<p>Considerations to bear in mind as we progress:</p>

<ul>
	<li>Tools aren&#39;t always 100% accurate - check their work
	<ul>
		<li>Algorithms have biases - read the documentation and references to become familiar with them</li>
	</ul>
	</li>
	<li>When removing data, remember that anomalies may be relevant - be sure about why you are removing data, and what you might gain/lose by doing so.</li>
</ul>


<h3>Quality Control Checking - FastQC</h3>

<ul>
	<li>Is my raw data of high quality?</li>
	<li>Are any bases or sequences over- or under-represented?</li>
	<li>Are there any biases in the reads (e.g. 3&rsquo; end or 5&rsquo; end)?</li>
	<li>Is there any kind of enrichment bias in my samples (e.g. from PCR)?</li>
</ul>

<h3>Dataset: GSE52778, SRP033351</h3>

<p><strong>Human Airway Smooth Muscle Transcriptome Changes in Response to Asthma Medications</strong></p>

<p>Himes, Blanca E. et al. &ldquo;RNA-Seq Transcriptome Profiling Identifies&nbsp;<em>CRISPLD2&nbsp;</em>as a Glucocorticoid Responsive Gene That Modulates Cytokine Function in Airway Smooth Muscle Cells.&rdquo; Ed. Jan Peter Tuckermann.<em>&nbsp;PLoS ONE</em>&nbsp;9.6 (2014): e99625.&nbsp;<em>PMC</em>.</p>

<ul>
	<li>HASM cells from four white male donors</li>
	<li>Four treatment conditions:
	<ol>
		<li>no treatment</li>
		<li>treatment with a &beta;2-agonist (i.e. Albuterol, 1&mu;M for 18h)</li>
		<li>treatment with a glucocorticosteroid (i.e. Dexamethasone (Dex), 1&mu;M for 18h)</li>
		<li>simultaneous treatment with a &beta;2-agonist and glucocorticoid</li>
	</ol>
	</li>
</ul>

<div class="alert alert-info">
<ul>
	<li>Drag <a href="https://s3.amazonaws.com/datasets-genepattern-org/data/HASM_Asthma_RNA-seq_workshop_files/FastQC_Input/SRR1039508_1.fastq.gz" target="_blank">SRR1039508_1.fastq.gz</a> to the <b>input file</b>&nbsp;field.</li>
	<li>Leave the remaining parameters (found in the Advanced section) at their defaults</li>
	<li>Click <strong>Run</strong></li>
</ul>
</div>

In [19]:
fastqc_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00340')
fastqc_job_spec = fastqc_task.make_job_spec()
fastqc_job_spec.set_parameter("input.file", "")
fastqc_job_spec.set_parameter("input.format", "")
fastqc_job_spec.set_parameter("contaminant.file", "")
fastqc_job_spec.set_parameter("kmer.size", "5")
fastqc_job_spec.set_parameter("extract.output", "true")
genepattern.GPTaskWidget(fastqc_task)

In [21]:
genepattern.GPJobWidget(None)

<p>If your FastQC job has not finished running, you can look at the example job we&rsquo;ve already created, below this cell.<br />
The job produces several output files. We are interested in the one that aggregates all of this information &ndash; the <b><em>fastqc_report.html</em></b> file.</p>

<div class="alert alert-info">
<ul>
	<li>Click the <strong>fastqc_report.html</strong> file below</li>
	<li>then click &quot;Open in New Tab&quot; to view the file. This will open a new webpage.</li>
</ul>
</div>


In [23]:
job1657086 = gp.GPJob(genepattern.get_session(0), 1657086)
genepattern.GPJobWidget(job1657086)

<p>On the&nbsp;webpage that opens we can see the results of our job.</p>

<p>Each metric is rated good (green), warning (orange), or bad (red). Most of our metrics are green, which is good. We have orange warnings for overrepresented sequences and k-mer content, and one red flag for &lsquo;sequence duplication levels&rsquo;. For this workshop&nbsp;we&rsquo;re going to look specifically at the overrepresented sequences.</p>
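<p>The same ratings can also be read programmatically. The sketch below assumes the extracted FastQC output contains a tab-separated <code>summary.txt</code> with one line per metric (status, metric name, file name). The sample text is an illustrative stand-in that mirrors the report described above, and the function name <code>flagged_modules</code> is hypothetical.</p>

```python
# Illustrative stand-in for summary.txt: the statuses mirror the report
# described above, but the exact module list is an assumption.
SUMMARY_TEXT = "\n".join([
    "PASS\tBasic Statistics\tSRR1039508_1.fastq.gz",
    "PASS\tPer base sequence quality\tSRR1039508_1.fastq.gz",
    "FAIL\tSequence Duplication Levels\tSRR1039508_1.fastq.gz",
    "WARN\tOverrepresented sequences\tSRR1039508_1.fastq.gz",
    "WARN\tKmer Content\tSRR1039508_1.fastq.gz",
])


def flagged_modules(summary_text):
    """Return {metric_name: status} for every metric that is not PASS."""
    flagged = {}
    for line in summary_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t")
        if status != "PASS":
            flagged[module] = status
    return flagged


flags = flagged_modules(SUMMARY_TEXT)
for module, status in flags.items():
    print(f"{status}: {module}")
```

<p>Listing only the non-passing metrics makes it easier to compare many samples at once than opening each HTML report.</p>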

<div class="alert alert-info">
<ul>
	<li>Click on <em><strong>&lsquo;overrepresented sequences&rsquo;</strong></em>.

	<ul>
		<li>This shows us that a specific sequence makes up 0.25 percent of our library, which FastQC has flagged as unusually high.</li>
		<li>The software has found that this matches a TruSeq adapter, so we know there&rsquo;s adapter contamination in our dataset.</li>
	</ul>
	</li>
</ul>
</div>

<img src="https://s3.amazonaws.com/datasets-genepattern-org/images/Workshop+Notebooks/BU+Molecular+Bio/FastQCReport_beforeTrim.PNG"/>

<p>The FastQC documentation will show you examples of good and bad results to help you in determining how to handle your own data.</p>
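<p>To make the idea of an overrepresented sequence concrete, here is a minimal sketch that counts how often each read sequence occurs and reports those above a percentage threshold. It is a simplification, not FastQC&rsquo;s actual implementation; the function names, the toy reads, and the threshold used in the demonstration are all assumptions chosen for illustration.</p>

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line (2nd of every 4 lines) from FASTQ text lines."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def overrepresented(sequences, threshold_pct):
    """Return {sequence: percent_of_reads} for sequences above threshold_pct."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return {
        seq: 100.0 * n / total
        for seq, n in counts.items()
        if 100.0 * n / total > threshold_pct
    }


# Toy FASTQ records (invented): four reads, two of them identical.
toy_fastq = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
    "@r4", "GGGG", "+", "IIII",
]

hits = overrepresented(read_sequences(toy_fastq), threshold_pct=30.0)
print(hits)
```

<p>On real data the threshold would be far lower than in this toy example; the sequence discussed above was flagged at 0.25 percent of the library.</p>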

<p>We&rsquo;re going to need to trim this data to improve the quality.</p>


<h3 id="Run-Trimmomatic-on-Paired-FASTQ-files">Run Trimmomatic on Paired FASTQ files</h3>

<p>We use Trimmomatic to trim contaminating adapter sequences. We discard reads if they are below a specified length after trimming.</p>

<div class="alert alert-info">
<ul>
	<li>Drag <a href="https://s3.amazonaws.com/datasets-genepattern-org/data/HASM_Asthma_RNA-seq_workshop_files/FastQC_Input/SRR1039508_1.fastq.gz" target="_blank">SRR1039508_1.fastq.gz</a> to the <b>input file 1</b>&nbsp;field.</li>
	<li>Drag <a href="https://s3.amazonaws.com/datasets-genepattern-org/data/HASM_Asthma_RNA-seq_workshop_files/FastQC_Input/SRR1039508_2.fastq.gz" target="_blank">SRR1039508_2.fastq.gz</a> to the <b>input file 2</b>&nbsp;field.</li>
	<li>Collapse the <strong>Basic Input Parameters and Options</strong> section</li>
	<li>Expand the <strong>Adapter Clipping</strong><em>&nbsp;</em>section</li>
	<li>To trim the TruSeq2 adapter we must either supply or choose an adapter clip sequence file. Because TruSeq2 is common, the module provides one for us.
	<ul>
		<li>Click on the box containing&nbsp;<strong>Add GenePattern File or URL...</strong></li>
		<li>Choose<em>&nbsp;</em><strong>TruSeq2-PE.fa</strong>&nbsp;</li>
	</ul>
	</li>
	<li>Set the <strong>adapter clip seed mismatches</strong> to the recommended value of <strong>2</strong></li>
	<li>Set&nbsp;the <strong>adapter clip palindrome clip threshold</strong> to the recommended value of <strong>40</strong></li>
	<li>Set&nbsp;the <strong>adapter clip simple clip threshold</strong> to <strong>15</strong> (from the recommended range)</li>
	<li>Leave&nbsp;<strong>adapter clip min length</strong> unset so that Trimmomatic uses its default of 8.</li>
	<li>Set&nbsp;<strong>adapter clip keep both reads</strong>&nbsp;to&nbsp;<strong>yes</strong>, as is recommended</li>
	<li>Leave the remaining parameters at their defaults - starting with defaults is a good idea when you have no reason to choose otherwise.</li>
</ul>
<em>More information about these parameters, and why you might want to change them or add other trimming methods, can be found in the Trimmomatic documentation.</em></div>
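<p>On the Trimmomatic command line, the adapter-clipping settings listed above are written as a single colon-separated <code>ILLUMINACLIP</code> step. The sketch below assembles that string from the values chosen in this section. Whether the GenePattern module builds its command in exactly this way is an assumption, and the helper name <code>illuminaclip_step</code> is hypothetical.</p>

```python
def illuminaclip_step(adapter_fasta, seed_mismatches, palindrome_threshold,
                      simple_threshold, min_adapter_length=8,
                      keep_both_reads=True):
    """Build a Trimmomatic ILLUMINACLIP step from adapter-clipping settings."""
    fields = [
        "ILLUMINACLIP",
        adapter_fasta,
        str(seed_mismatches),
        str(palindrome_threshold),
        str(simple_threshold),
        str(min_adapter_length),
        "true" if keep_both_reads else "false",
    ]
    return ":".join(fields)


# Values taken from the instructions above; min length is left at 8.
step = illuminaclip_step("TruSeq2-PE.fa", seed_mismatches=2,
                         palindrome_threshold=40, simple_threshold=15)
print(step)
```

<p>Seeing the settings side by side in one string can help when comparing a GenePattern run against a command-line run of Trimmomatic.</p>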

<div class="alert alert-warning">Due to time constraints, please do not run this job. A &quot;prebaked&quot; output has been supplied for you below.</div>


In [25]:
trimmomatic_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00341')
trimmomatic_job_spec = trimmomatic_task.make_job_spec()
trimmomatic_job_spec.set_parameter("input.file.1", "")
trimmomatic_job_spec.set_parameter("input.file.2", "")
trimmomatic_job_spec.set_parameter("output.filename.base", "<input.file.1_basename>")
trimmomatic_job_spec.set_parameter("adapter.clip.sequence.file", "")
trimmomatic_job_spec.set_parameter("adapter.clip.seed.mismatches", "")
trimmomatic_job_spec.set_parameter("adapter.clip.palindrome.clip.threshold", "")
trimmomatic_job_spec.set_parameter("adapter.clip.simple.clip.threshold", "")
trimmomatic_job_spec.set_parameter("adapter.clip.min.length", "")
trimmomatic_job_spec.set_parameter("adapter.clip.keep.both.reads", "true")
trimmomatic_job_spec.set_parameter("trim.leading.quality.threshold", "")
trimmomatic_job_spec.set_parameter("trim.trailing.quality.threshold", "")
trimmomatic_job_spec.set_parameter("max.info.target.length", "")
trimmomatic_job_spec.set_parameter("max.info.strictness", "")
trimmomatic_job_spec.set_parameter("sliding.window.size", "")
trimmomatic_job_spec.set_parameter("sliding.window.quality.threshold", "")
trimmomatic_job_spec.set_parameter("min.read.length", "")
trimmomatic_job_spec.set_parameter("extra.steps", "")
trimmomatic_job_spec.set_parameter("phred.encoding", "phred33")
trimmomatic_job_spec.set_parameter("convert.phred.scores", "")
trimmomatic_job_spec.set_parameter("create.trimlog", "false")
genepattern.GPTaskWidget(trimmomatic_task)

In [27]:
genepattern.GPJobWidget(None)

<p>Below, you can see all the outputs of the job.&nbsp;</p>

<ul>
	<li>_1P and _2P contain the forward and reverse reads of pairs in which both reads passed the quality checks</li>
	<li>_1U and _2U contain &ldquo;unpaired&rdquo; reads whose mate did not pass</li>
</ul>

<p>The output files we are interested in&nbsp;are the _1P/_2P files, in which both reads of each pair passed the quality checks.</p>
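<p>When many samples are processed, picking out the paired files by hand becomes tedious. The following sketch selects them by suffix. The helper names are hypothetical, and the two <code>_1U</code>/<code>_2U</code> file names in the demonstration are assumed by analogy with the <code>_1P</code>/<code>_2P</code> names used in this notebook.</p>

```python
import re

# Matches the _1P/_2P/_1U/_2U suffix described above, before the extension.
SUFFIX_RE = re.compile(r"_([12])([PU])\.(?:fastq|fq)(?:\.gz)?$")


def classify(filename):
    """Return (mate, kind) such as (1, 'paired'), or None if no suffix matches."""
    match = SUFFIX_RE.search(filename)
    if match is None:
        return None
    mate = int(match.group(1))
    kind = "paired" if match.group(2) == "P" else "unpaired"
    return mate, kind


def paired_outputs(filenames):
    """Keep only files whose mate also passed, ordered by mate number."""
    paired = []
    for name in filenames:
        result = classify(name)
        if result is not None and result[1] == "paired":
            paired.append((result[0], name))
    return [name for _mate, name in sorted(paired)]


outputs = [
    "SRR1039508_2U.fastq.gz",  # assumed name
    "SRR1039508_2P.fastq.gz",
    "SRR1039508_1U.fastq.gz",  # assumed name
    "SRR1039508_1P.fastq.gz",
]
print(paired_outputs(outputs))
```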


In [29]:
job1657089 = gp.GPJob(genepattern.get_session(0), 1657089)
genepattern.GPJobWidget(job1657089)

<p>We will now send them through FastQC again, to see if the trimming has improved the quality of our reads.</p>

<div class="alert alert-info">
<ul>
	<li>Click on the <strong>_1P output&nbsp;</strong></li>
	<li><strong>Send to Existing GenePattern Cell&nbsp;</strong>
	<ul>
		<li><strong>FastQC - be sure to select the second instance of FastQC in the list</strong>
		<ul>
			<li><em>This sends the output to the FastQC cell below. You could instead send it to the FastQC cell you started with and rerun it on the newly trimmed data; however,&nbsp;to keep our flow linear while we are learning, we will use the FastQC cell below.&nbsp;</em></li>
		</ul>
		</li>
	</ul>
	</li>
</ul>
</div>

<div class="alert alert-warning">Due to time constraints, please do not run this job. A &quot;prebaked&quot; output has been supplied for you below.</div>

In [31]:
fastqc_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00340')
fastqc_job_spec = fastqc_task.make_job_spec()
fastqc_job_spec.set_parameter("input.file", "")
fastqc_job_spec.set_parameter("input.format", "")
fastqc_job_spec.set_parameter("contaminant.file", "")
fastqc_job_spec.set_parameter("kmer.size", "5")
fastqc_job_spec.set_parameter("extract.output", "true")
genepattern.GPTaskWidget(fastqc_task)

In [33]:
genepattern.GPJobWidget(None)

<div class="alert alert-info">
<ul>
    <li>Click the fastqc_report.html file below</li>
    <li>then click "Open in New Tab" to view the file. This will open a new webpage.</li>
</ul></div>

In [34]:
job1657628 = gp.GPJob(genepattern.get_session(0), 1657628)
genepattern.GPJobWidget(job1657628)

<h3 id="-Reviewing-our-Trimmed-Results">Reviewing our Trimmed Results</h3>

<div class="alert alert-info">
<ul>
	<li>Click on <em><strong>&lsquo;overrepresented sequences&rsquo;&nbsp;</strong></em>in the new window which is now displaying the latest FastQC report</li>
</ul>
</div>

<img src="https://s3.amazonaws.com/datasets-genepattern-org/images/Workshop+Notebooks/BU+Molecular+Bio/FastQCReport_afterTrim.PNG" />

<p>We can see that the <strong>&lsquo;overrepresented sequences&rsquo;</strong> metric, which was previously orange, is now green and the issue has been resolved.</p>

<p>We still have an error in <strong>sequence duplication levels</strong>, which means some sequences appear repeatedly in the samples. This can be due to technical duplicates arising from PCR artifacts, or to biological duplicates, which are natural collisions where different copies of exactly the same sequence are randomly selected. At the sequence level there is no way to distinguish between these two types, and both will be reported as duplicates here.</p>

<p>In general, a&nbsp;warning or error in this section&nbsp;is simply a statement that you have exhausted the diversity in at least part of your library and are re-sequencing the same sequences. Some library types naturally over-sequence parts of the library and so generate duplication; for these we expect to see warnings or errors.</p>

<p><em>(i.e., you will rarely, if ever, see&nbsp;all green checks)</em></p>
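<p>As a rough illustration of what a duplication level measures, the sketch below computes the percentage of reads that would remain if every sequence were kept only once, together with a histogram of copy numbers. This is a simplified stand-in rather than FastQC&rsquo;s own calculation, and the reads are invented toy data.</p>

```python
from collections import Counter


def duplication_summary(sequences):
    """Return (percent of reads left after deduplication, {copies: n_sequences})."""
    counts = Counter(sequences)
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no sequences supplied")
    percent_remaining = 100.0 * len(counts) / total_reads
    # How many distinct sequences were seen once, twice, three times, ...
    histogram = dict(Counter(counts.values()))
    return percent_remaining, histogram


# Toy reads (invented): one sequence seen 3 times, one twice, one once.
toy_reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGG", "GGGG"]
remaining, histogram = duplication_summary(toy_reads)
print(f"{remaining:.1f}% of reads would remain after deduplication")
print(histogram)
```

<p>Note that a calculation like this cannot tell PCR duplicates from biological ones, which is exactly the limitation described above.</p>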

<h3>Run TopHat to align reads - Guided Alignment</h3>

<p>TopHat can be used in a few different modes. We will use it to do a guided alignment, which uses a genome or transcriptome to guide the alignment of the reads. Here, the read fragments are aligned to the genome, and then transcripts are assembled from the aligned reads.</p>

<p>Aligning to the transcriptome with TopHat requires either a genome annotation file (GTF file), or a transcriptome index. We will use the UCSC hg19 reference annotation.</p>

<ol>
	<li>Create a transcriptome index.
	<ul>
		<li>To speed up alignment, we create a transcriptome index by running TopHat with the UCSC hg19 reference annotation on its own, without any reads.</li>
	</ul>
	</li>
	<li>Align reads to the transcriptome.
	<ul>
		<li>We align the reads to the transcriptome using the transcriptome index created in Step 1. We run TopHat for each sample (8 times).</li>
	</ul>
	</li>
</ol>

<p><em>This is the least computationally intensive mode in which to run TopHat; even so, alignment takes <u>more than 1 hour per sample.</u></em></p>


<h3>Create Transcriptome Index</h3>

<div class="alert alert-info">
<p>Here we just need to run TopHat with a Bowtie index and a GTF file to create a transcriptome index that we can use in subsequent alignments.</p>

<ul>
	<li>Click on the file input for <b>bowtie index</b>
	<ul>
		<li>Scroll to the&nbsp;<strong>FTP Server Files</strong>&nbsp;section of the list</li>
		<li>Select <strong>Homo_sapiens_hg19_UCSC</strong></li>
	</ul>
	</li>
    
	<li>Click on the file input for <strong>GTF file</strong>
	<ul>
		<li>Select <strong>Homo_sapiens_hg19_UCSC.gtf</strong></li>
	</ul>
	</li>
</ul>
</div>

<div class="alert alert-warning">Due to time constraints, please do not run this job. A &quot;prebaked&quot; output has been supplied for you below.</div>

In [20]:
tophat_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00230')
tophat_job_spec = tophat_task.make_job_spec()
tophat_job_spec.set_parameter("GTF.file", "")
tophat_job_spec.set_parameter("transcriptome.index", "")
tophat_job_spec.set_parameter("bowtie.index", "")
tophat_job_spec.set_parameter("reads.pair.1", "")
tophat_job_spec.set_parameter("reads.pair.2", "")
tophat_job_spec.set_parameter("mate.inner.dist", "50")
tophat_job_spec.set_parameter("mate.std.dev", "")
tophat_job_spec.set_parameter("library.type", "")
tophat_job_spec.set_parameter("Bowtie.preset.options", "")
tophat_job_spec.set_parameter("transcriptome.only", "no")
tophat_job_spec.set_parameter("max.transcriptome.hits", "")
tophat_job_spec.set_parameter("prefilter.multihits", "")
tophat_job_spec.set_parameter("raw.junctions.file", "")
tophat_job_spec.set_parameter("find.novel.junctions", "")
tophat_job_spec.set_parameter("min.anchor.length", "8")
tophat_job_spec.set_parameter("max.splice.mismatches", "0")
tophat_job_spec.set_parameter("min.intron.length", "")
tophat_job_spec.set_parameter("max.intron.length", "500000")
tophat_job_spec.set_parameter("max.insertion.length", "3")
tophat_job_spec.set_parameter("max.deletion.length", "3")
tophat_job_spec.set_parameter("quality.value.scale", "")
tophat_job_spec.set_parameter("max.multihits", "20")
tophat_job_spec.set_parameter("read.mismatches", "2")
tophat_job_spec.set_parameter("coverage.search", "--no-coverage-search")
tophat_job_spec.set_parameter("microexon.search", "")
tophat_job_spec.set_parameter("fusion.mapping", "")
tophat_job_spec.set_parameter("fusion.anchor.length", "20")
tophat_job_spec.set_parameter("fusion.read.mismatches", "2")
tophat_job_spec.set_parameter("output.prefix", "<job_id>_<reads.pair.1_basename>")
tophat_job_spec.set_parameter("read.edit.dist", "")
tophat_job_spec.set_parameter("read.gap.length", "")
tophat_job_spec.set_parameter("additional.tophat.options", "")
genepattern.GPTaskWidget(tophat_task)

In [22]:
genepattern.GPJobWidget(None)

In [24]:
job1657629 = gp.GPJob(genepattern.get_session(0), 1657629)
genepattern.GPJobWidget(job1657629)

<h3 id="Align-Reads-to-Transcriptome">Align Reads to Transcriptome</h3>

<div class="alert alert-info">This time, we are aligning reads to the transcriptome.
<ul>
	<li>Choose the same bowtie index as before (Homo_sapiens_hg19_UCSC)</li>
	<li>Drag the link to the folder of our transcriptome index, <a href="https://genepattern.broadinstitute.org/gp/jobResults/1657629/transcriptome_index/" target="_blank">https://genepattern.broadinstitute.org/gp/jobResults/1657629/transcriptome_index/</a>, to the <b>transcriptome index</b> parameter.</li>
	<li>Click the input field for&nbsp;<strong>reads pair 1&nbsp;</strong>and select&nbsp;<strong>SRR1039508_1P.fastq.gz&nbsp;</strong>from the Trimmomatic output.</li>
	<li>Click the input field for&nbsp;<strong>reads pair 2&nbsp;</strong>and select&nbsp;<strong>SRR1039508_2P.fastq.gz</strong>&nbsp;from the Trimmomatic output.</li>
	<li>Set the&nbsp;<strong>library type&nbsp;</strong>to&nbsp;<strong>Standard Illumina (fr-unstranded)</strong></li>
	<li>Set&nbsp;<strong>transcriptome only</strong>&nbsp;to&nbsp;<strong>yes</strong></li>
</ul>
</div>
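<p>Because this alignment step is repeated for every sample, it can help to write the settings down once as data. The sketch below collects the parameters listed above into one dictionary per sample, using the parameter names from the TopHat code cells in this notebook. The helper name <code>alignment_parameters</code> is hypothetical, and the value strings for <code>library.type</code> and <code>transcriptome.only</code> are assumptions, since the widget shows labels rather than the underlying values.</p>

```python
def alignment_parameters(sample_id, bowtie_index, transcriptome_index):
    """Return the TopHat settings listed above for one trimmed sample."""
    return {
        "bowtie.index": bowtie_index,
        "transcriptome.index": transcriptome_index,
        "reads.pair.1": f"{sample_id}_1P.fastq.gz",
        "reads.pair.2": f"{sample_id}_2P.fastq.gz",
        # Assumed value strings: the module may expect different values
        # behind the labels shown in the widget.
        "library.type": "fr-unstranded",
        "transcriptome.only": "yes",
    }


TRANSCRIPTOME_INDEX = (
    "https://genepattern.broadinstitute.org/gp/jobResults/1657629/"
    "transcriptome_index/"
)

# Only SRR1039508 is named in this notebook; add your own sample IDs here.
sample_ids = ["SRR1039508"]
plans = [
    alignment_parameters(s, "Homo_sapiens_hg19_UCSC", TRANSCRIPTOME_INDEX)
    for s in sample_ids
]
for plan in plans:
    for name, value in plan.items():
        print(f"{name} = {value}")
```

<p>Each dictionary could then be applied to a job spec with <code>set_parameter(name, value)</code>, in the same way the code cells in this notebook set parameters one at a time.</p>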

<div class="alert alert-warning">Due to time constraints, please do not run this job. A &quot;prebaked&quot; output has been supplied for you below.</div>


In [26]:
tophat_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00230')
tophat_job_spec = tophat_task.make_job_spec()
tophat_job_spec.set_parameter("GTF.file", "")
tophat_job_spec.set_parameter("transcriptome.index", "")
tophat_job_spec.set_parameter("bowtie.index", "")
tophat_job_spec.set_parameter("reads.pair.1", "")
tophat_job_spec.set_parameter("reads.pair.2", "")
tophat_job_spec.set_parameter("mate.inner.dist", "50")
tophat_job_spec.set_parameter("mate.std.dev", "")
tophat_job_spec.set_parameter("library.type", "")
tophat_job_spec.set_parameter("Bowtie.preset.options", "")
tophat_job_spec.set_parameter("transcriptome.only", "no")
tophat_job_spec.set_parameter("max.transcriptome.hits", "")
tophat_job_spec.set_parameter("prefilter.multihits", "")
tophat_job_spec.set_parameter("raw.junctions.file", "")
tophat_job_spec.set_parameter("find.novel.junctions", "")
tophat_job_spec.set_parameter("min.anchor.length", "8")
tophat_job_spec.set_parameter("max.splice.mismatches", "0")
tophat_job_spec.set_parameter("min.intron.length", "")
tophat_job_spec.set_parameter("max.intron.length", "500000")
tophat_job_spec.set_parameter("max.insertion.length", "3")
tophat_job_spec.set_parameter("max.deletion.length", "3")
tophat_job_spec.set_parameter("quality.value.scale", "")
tophat_job_spec.set_parameter("max.multihits", "20")
tophat_job_spec.set_parameter("read.mismatches", "2")
tophat_job_spec.set_parameter("coverage.search", "--no-coverage-search")
tophat_job_spec.set_parameter("microexon.search", "")
tophat_job_spec.set_parameter("fusion.mapping", "")
tophat_job_spec.set_parameter("fusion.anchor.length", "20")
tophat_job_spec.set_parameter("fusion.read.mismatches", "2")
tophat_job_spec.set_parameter("output.prefix", "<job_id>_<reads.pair.1_basename>")
tophat_job_spec.set_parameter("read.edit.dist", "")
tophat_job_spec.set_parameter("read.gap.length", "")
tophat_job_spec.set_parameter("additional.tophat.options", "")
genepattern.GPTaskWidget(tophat_task)

In [28]:
genepattern.GPJobWidget(None)

In [30]:
job1658233 = gp.GPJob(genepattern.get_session(0), 1658233)
genepattern.GPJobWidget(job1658233)

<div class="alert alert-warning">To run the next step in this workflow, <b><a href="https://genepattern.broadinstitute.org/gp/pages/index.jsf?lsid=urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00236:7" target="_blank">CuffDiff</a></b>, please use the <a href="https://genepattern.broadinstitute.org" target="_blank">GenePattern WebApp</a>. Setting conditions is not currently available in the GenePattern Notebook extension; this feature is due in an upcoming release of the extension.<br />
Log in to the GenePattern server with the same username and password you used for this notebook, and you will be able to use all of the jobs you have run in this notebook.</div>

<p>The output of the CuffDiff job is included below. You can also copy the <b>genes.fpkm_tracking</b> file from it to use in IGV.</p>


In [32]:
job1657632 = gp.GPJob(genepattern.get_session(0), 1657632)
genepattern.GPJobWidget(job1657632)
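<p>For a quick look at expression differences outside of IGV, a tab-separated tracking file can be read with the Python standard library. The sketch below is only an illustration: it assumes the file has a <code>gene_short_name</code> column and one <code>&lt;condition&gt;_FPKM</code> column per condition, and it runs on invented toy rows rather than real results. The condition labels, gene names, values, and the pseudocount are all hypothetical. It does not replace CuffDiff&rsquo;s own statistical testing.</p>

```python
import csv
import io
import math

# Toy stand-in for genes.fpkm_tracking: gene names, condition labels, and
# FPKM values are invented for illustration only.
TOY_TRACKING = (
    "tracking_id\tgene_short_name\tlocus\tuntreated_FPKM\ttreated_FPKM\n"
    "GENE_A\tGENE_A\tchr1:100-200\t10.0\t40.0\n"
    "GENE_B\tGENE_B\tchr2:300-400\t8.0\t8.0\n"
    "GENE_C\tGENE_C\tchr3:500-600\t0.0\t5.0\n"
)


def log2_fold_changes(handle, baseline, treatment, pseudocount=1.0):
    """Return {gene: log2((treatment + pc) / (baseline + pc))} for each row."""
    reader = csv.DictReader(handle, delimiter="\t")
    changes = {}
    for row in reader:
        base = float(row[f"{baseline}_FPKM"]) + pseudocount
        treat = float(row[f"{treatment}_FPKM"]) + pseudocount
        changes[row["gene_short_name"]] = math.log2(treat / base)
    return changes


changes = log2_fold_changes(io.StringIO(TOY_TRACKING), "untreated", "treated")
# Print genes ordered by the size of the change, largest first.
for gene, lfc in sorted(changes.items(), key=lambda kv: -abs(kv[1])):
    print(f"{gene}\t{lfc:+.2f}")
```

<p>The pseudocount avoids division by zero for genes with no measured expression in the baseline condition, at the cost of shrinking fold changes for lowly expressed genes.</p>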