---
# A Novel de novo Transcriptome Visualization
## Project Proposal
-----
#### Camille Scott
#### Lab for Data Intensive Biology
#### UC Davis
---

---
## Background

* A transcriptome is the set of RNA sequences that are "expressed" by an organism.
* This contrasts with a *genome*, which is the set of DNA sequences in an organism.
* Transcriptomes are generated from the same data as genomes, and many transcriptomes are published every day.
---

<center>
![dogma](https://d2gne97vdumgn3.cloudfront.net/api/file/fjFYnUJEQxmhyo2b10h2)
</center>
From Wikimedia Commons

<center>
![cdna](http://www.discoveryandinnovation.com/BIOL202/notes/images/gene_model1.jpg)
</center>
from http://www.discoveryandinnovation.com/BIOL202/notes/lecture24.html

## Motivation

* Transcriptome sequencing ("RNA-seq") is cheaper and more accessible than complete genome sequencing.
* Though many transcriptomes are being published, there are few accepted standards and protocols for them.
* Quickly assessing and comparing different results is especially difficult: there is no "canonical" transcriptome visualization.
* Several approaches exist for producing short numerical reports (such as [transrate](http://hibberdlab.com/transrate/)), but I am aware of few, if any, that generate visualizations.

## The Project

* I aim to develop a new method for producing summary visualizations of transcriptomes based on established phylogenetic methods.
* This will serve as a means of assessing the *annotation quality* of a transcriptome, and indirectly assessing the *assembly quality*.
* This method will project annotated genes on a known phylogeny to give an at-a-glance view of annotation quality.

### An example phylogeny rendered as a sunburst

* The following is an example of the ITIS phylogeny rendered as a sunburst using d3. The final product will have sizing based on the number of annotated genes for a clade and coloring based on kingdom.
* I also aim to use a measurement of phylogenetic signal to produce quantitative results.

![example](example.png)
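
Sizing each clade by its number of annotated genes requires counting genes at every level of the tree. The following is a minimal sketch of that aggregation, assuming each annotated gene has already been assigned a lineage (an ordered list of clade names from kingdom downward). The function name `build_clade_tree`, the field names `genes` and `kingdom`, and the example lineages are all hypothetical and are not part of dammit. The output is a nested `name`/`children` structure of the kind that D3's hierarchy layouts accept.

```python
import json


def build_clade_tree(gene_lineages):
    """Aggregate annotated genes into a nested clade tree.

    gene_lineages maps a gene identifier to its lineage: an ordered
    sequence of clade names from kingdom down to the most specific clade.
    (Hypothetical input format, chosen for illustration.)
    """
    root = {"name": "root", "genes": 0, "kingdom": None, "children": {}}
    for lineage in gene_lineages.values():
        node = root
        node["genes"] += 1
        # The first clade in the lineage is treated as the kingdom,
        # which the sunburst would use for coloring.
        kingdom = lineage[0] if lineage else None
        for clade in lineage:
            children = node["children"]
            if clade not in children:
                children[clade] = {
                    "name": clade,
                    "genes": 0,
                    "kingdom": kingdom,
                    "children": {},
                }
            node = children[clade]
            # Every clade on the path counts this gene once.
            node["genes"] += 1
    return _to_nested_lists(root)


def _to_nested_lists(node):
    """Convert child dicts to name-sorted lists; leaves get no 'children' key."""
    out = {"name": node["name"], "genes": node["genes"], "kingdom": node["kingdom"]}
    kids = sorted(node["children"].values(), key=lambda child: child["name"])
    if kids:
        out["children"] = [_to_nested_lists(child) for child in kids]
    return out


# Made-up example annotations, for illustration only.
EXAMPLE_LINEAGES = {
    "gene1": ("Animalia", "Chordata", "Mammalia"),
    "gene2": ("Animalia", "Chordata", "Aves"),
    "gene3": ("Animalia", "Arthropoda"),
    "gene4": ("Plantae", "Tracheophyta"),
}

tree = build_clade_tree(EXAMPLE_LINEAGES)
tree_json = json.dumps(tree, indent=2)  # what a web front end could load
```

Storing the cumulative count on every node means the front end can size arcs directly from `genes` without re-summing the tree in the browser.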

## Challenges

* There are many phylogenetic methods and varying quality within existing phylogenies.
* Performance on such large trees could be an issue.
* Assessing the intuitiveness of the method is an open question.

## Further Work

This method will be implemented within the framework of an existing annotator that I maintain: https://github.com/camillescott/dammit

It will be implemented along with a more feature-complete annotation explorer. Using this existing annotator alleviates much of the installation burden for potential users and provides a way of testing the method on many different datasets in a manner that is already automated.

### The Explorer

This will have at least four panes:

1. The summary visualization pane;
2. One to view subtrees of the summary and view metadata such as the gene identifiers and signal;
3. One to view gene models and alignments, built on [GenomeD3Plot](https://github.com/lairdm/islandplot);
4. One to view full gene information and metadata, using the API provided by [mygene.info](http://mygene.info).
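
As a rough illustration of how the fourth pane might look up gene metadata, the sketch below builds a query URL for the mygene.info web service using only the Python standard library. It assumes the `v3/query` endpoint and the `q`, `species`, `fields`, and `size` parameters described in the service's public documentation; the helper names `build_query_url` and `fetch_hits` are hypothetical, and the explorer itself may access the service differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint, taken from the public mygene.info documentation.
MYGENE_QUERY_URL = "https://mygene.info/v3/query"


def build_query_url(term, species=None, fields=None, size=10):
    """Build a mygene.info gene query URL (hypothetical helper)."""
    params = {"q": term, "size": size}
    if species:
        params["species"] = species
    if fields:
        # The service takes a comma-separated list of field names.
        params["fields"] = ",".join(fields)
    return MYGENE_QUERY_URL + "?" + urlencode(params)


def fetch_hits(term, **kwargs):
    """Run the query over the network and return the list of hits.

    Requires network access; any network or HTTP error propagates
    to the caller.
    """
    with urlopen(build_query_url(term, **kwargs), timeout=10) as response:
        return json.load(response).get("hits", [])


example_url = build_query_url(
    "symbol:cdk2", species="human", fields=["symbol", "name"]
)
```

Separating URL construction from the network call keeps the query logic testable offline.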


### Implementation Details

* The annotator is implemented in Python.
* The web server will be built with the Python library Flask.
* The main visualization will use D3.js, with HTML and JavaScript for the explorer.
* The annotator already helps manage the installation of other analysis software.

```python
# the annotator is fully exposed as a Python package as well
# some parsers
from dammit.parsers import cmscan_to_df_iter
# the actual annotation app
from dammit import app
```

## References
```
Bostock, Michael, Vadim Ogievetsky, and Jeffrey Heer. 2011. “D3: Data-Driven Documents.” IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis). http://vis.stanford.edu/papers/d3.
Laird, Matthew R., Morgan G.I. Langille, and Fiona S.L. Brinkman. 2015. “GenomeD3Plot: A Library for Rich, Interactive Visualizations of Genomic Data in Web Applications.” Bioinformatics 31 (20): 3348–49. doi:10.1093/bioinformatics/btv376.
Revell, Liam, Luke Harmon, and David Collar. 2008. “Phylogenetic Signal, Evolutionary Process, and Rate.” Systematic Biology 57 (4): 591–601. doi:10.1080/10635150802302427.
Sayers, E. W., T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, et al. 2009. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research 37 (Database): D5–D15. doi:10.1093/nar/gkn741.
Scott, Camille. 2016. “dammit: An Open and Accessible de Novo Transcriptome Annotator.” In Prep.
Wu, Chunlei, Ian MacLeod, and Andrew I. Su. 2013. “BioGPS and MyGene.info: Organizing Online, Gene-Centric Information.” Nucleic Acids Research 41 (D1): D561–65. doi:10.1093/nar/gks1114.
```