<div style="padding-bottom:30px">
<a href="https://github.com/iqtk/iqtk"><img src="https://raw.githubusercontent.com/iqtk/iqtk/master/inquiry/docs/assets/logotype_blue_small.png" style="width:100px; margin-left:0px"></img></a>
<p style="color:#9E9E9E">
<a href="https://github.com/cwbeitel/inquiry/tree/master/docs">Getting Started Guide</a> // <a href="https://goo.gl/forms/2cOmuUrQ3n3CKpim1">Documentation Feedback</a></p>
</div>

<h1 style="color:#9E9E9E">Gene expression analysis</h1>

In this tutorial we perform differential expression analysis with the Cufflinks toolset, which includes [cufflinks](https://cole-trapnell-lab.github.io/cufflinks/), [tophat](https://ccb.jhu.edu/software/tophat/), and [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml). You can brush up on gene expression profiling [here](https://en.wikipedia.org/wiki/Gene_expression_profiling).

As a reminder, we'll run workflows generated by the Inquiry Toolkit directly from the Cloud Dataflow UI, so once you've completed the [getting started documentation]() you'll be ready for this tutorial!

Our data processing pipeline will deliver its results to a BigQuery table. We'll explore the data from the BigQuery UI and also pull a subset into this notebook for further analysis and visualization.

<h2 style="color:#9E9E9E">Configuration and Run</h2>

First we need to parameterize the analysis. The parameters file should have the following structure:

```json
{
  "ref_fasta": "gs://cflow-public/data/genomes/Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/BowtieIndex/genome.fa",
  "genes_gtf": "gs://cflow-public/data/genomes/Drosophila_melanogaster/Ensembl/BDGP5.25/Annotation/Archives/archive-2015-07-17-14-30-26/Genes/genes.gtf",
  "cond_a_pairs": [
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794483_C1_R1_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794483_C1_R1_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794484_C1_R2_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794484_C1_R2_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794485_C1_R3_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794485_C1_R3_2_small.fq"]
  ],
  "cond_b_pairs": [
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794486_C2_R1_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794487_C2_R2_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794487_C2_R2_2_small.fq"],
    ["gs://cflow-public/data/rnaseq/downsampled_reads/GSM794488_C2_R3_1_small.fq",
     "gs://cflow-public/data/rnaseq/downsampled_reads/GSM794488_C2_R3_2_small.fq"]
  ]
}
```
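
Before submitting a job it can help to sanity-check the parameters file, since a malformed entry may only surface once the workflow is already running. The following is a minimal sketch of such a check, not part of the Inquiry Toolkit: the helper names `validate_params` and `_is_gcs_path` and the `gs://my-bucket/...` paths are hypothetical, and the rules it enforces (four required keys, `gs://` paths, two files per read pair) are inferred from the structure shown above.

```python
# Hypothetical helper, not part of the Inquiry Toolkit.
REQUIRED_KEYS = ("ref_fasta", "genes_gtf", "cond_a_pairs", "cond_b_pairs")


def _is_gcs_path(value):
    """True if value is a string that looks like a Cloud Storage URI."""
    return isinstance(value, str) and value.startswith("gs://")


def validate_params(params):
    """Return a list of problems found in a parameters dict (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in params:
            problems.append("missing key: " + key)
    # The reference genome and annotation are each a single gs:// path.
    for key in ("ref_fasta", "genes_gtf"):
        value = params.get(key)
        if value is not None and not _is_gcs_path(value):
            problems.append(key + " is not a gs:// path")
    # Each condition is a non-empty list of [mate_1, mate_2] read pairs.
    for key in ("cond_a_pairs", "cond_b_pairs"):
        pairs = params.get(key)
        if pairs is None:
            continue
        if not isinstance(pairs, list) or not pairs:
            problems.append(key + " must be a non-empty list")
            continue
        for i, pair in enumerate(pairs):
            ok = (isinstance(pair, list) and len(pair) == 2
                  and all(_is_gcs_path(p) for p in pair))
            if not ok:
                problems.append("%s[%d] is not a pair of gs:// paths" % (key, i))
    return problems


# Placeholder paths for illustration only.
example = {
    "ref_fasta": "gs://my-bucket/genome.fa",
    "genes_gtf": "gs://my-bucket/genes.gtf",
    "cond_a_pairs": [["gs://my-bucket/a_1.fq", "gs://my-bucket/a_2.fq"]],
    "cond_b_pairs": [["gs://my-bucket/b_1.fq", "gs://my-bucket/b_2.fq"]],
}
print(validate_params(example))
```

The dict returned by `json.load` on a parameters file can be passed to `validate_params` directly. Note that this only checks the shape of the file; it does not confirm that the objects exist in Cloud Storage.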

The job can be initiated from the Dataflow custom template job submission UI, which should look like the following (TODO):
![](https://camo.githubusercontent.com/6557bc51dd5b7d8b55e636ae94298fec7adfee11/68747470733a2f2f636c6f75642e676f6f676c652e636f6d2f64617461666c6f772f696d616765732f776f7264636f756e745f74656d706c6174655f657865637574652e706e67)

Once our job is submitted we'll be able to check its status using the Cloud Dataflow UI, as described in the [getting started documentation](). For illustration, the workflow should look like the following:

![](../assets/wf-sshot-transcriptomics.png)

When the workflow is complete you can obtain the path in Google Cloud Storage to the resulting files as described in the [getting started documentation]().

<h2 style="color:#9E9E9E">Exploring the data</h2>

Next we'll obtain some data resulting from the above workflow and play around a bit.

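```python
import google.datalab.bigquery as bq

# Fetch a few rows from the table written by the workflow.
query = bq.Query('SELECT * FROM `jbei-cloud.somedataset.sometable2` LIMIT 3')
# Bypass the query cache so the results reflect the current table contents.
output_options = bq.QueryOutput.table(use_cache=False)
result = query.execute(output_options=output_options).result()
result.to_dataframe()
```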

| Unnamed: 0 | id | geneid | gene | locus | sample1 | sample2 | status | expression1 | expression2 | lnFoldChange | testStatistic | pValue | qValue | significant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | XLOC_001687 | XLOC_001687 | l(2)05714 | 2L:4969003-4971453 | C1 | C2 | NOTEST | | | | | 1.0 | 1.0 | |
| 1 | XLOC_014024 | XLOC_014024 | CG14408 | X:14721090-14726090 | C1 | C2 | NOTEST | | | | | 1.0 | 1.0 | |
| 2 | XLOC_000602 | XLOC_000602 | gcm2 | 2L:9608478-9612710 | C1 | C2 | NOTEST | | | | | 1.0 | 1.0 | |


<h3 style="color:#9E9E9E">References</h3>

1. Trapnell, Cole, et al. "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks." Nature protocols 7.3 (2012): 562-578.
2. Trapnell, Cole, Lior Pachter, and Steven L. Salzberg. "TopHat: discovering splice junctions with RNA-Seq." Bioinformatics 25.9 (2009): 1105-1111.
3. Langmead, Ben, and Steven L. Salzberg. "Fast gapped-read alignment with Bowtie 2." Nature methods 9.4 (2012): 357-359.

<h3 style="color:#9E9E9E">Contact</h3>

Want to get in touch? You can [provide feedback](https://goo.gl/forms/2cOmuUrQ3n3CKpim1) regarding this or other documentation,
[reach out to us](https://goo.gl/forms/j8FWdNJqABAoJvcW2) regarding collaboration, or [request a new feature or analytical capability](https://goo.gl/forms/dQm3SDcoNZsV7AAd2). We're looking forward to hearing from you!

<div style="padding-top: 30px">
<p style="color:#9E9E9E; text-align:center">This notebook was prepared for <a href="https://github.com/cwbeitel/inquiry">Project Inquiry</a> in support of the research mission of the Joint BioEnergy Institute (JBEI). Learn more at https://www.jbei.org/.</p>
<p style="color:#9E9E9E; text-align:center">The Joint BioEnergy Institute is a program of the U.S. Department of Energy Office of Science.</p>
<p style="color:#9E9E9E; text-align:center">© Regents of the University of California, 2017. Licensed under a BSD-3 <a href="https://github.com/cwbeitel/inquiry/blob/master/LICENSE">license</a>.</p>
<img src="https://raw.githubusercontent.com/iqtk/iqtk/master/inquiry/docs/assets/logotype_blue_small.png" style="width:100px"></img>
</div>