Bioinformatics use case (RNA-Seq analysis) #135

olgabot · 2014-07-06T22:55:08Z

Hi @jbenet and @maxogden! Thank you so much for the time you took to meet with @mlovci and me this weekend. Here's an overview of our current data management situation and what our ideal case would be.

What we have now

Currently, we host a datapackage.json file containing resources named "experiment_design" (metadata on the samples, e.g. cell type and the colors to plot them with), "expression" (gene expression data), and "splicing" (scores of which version of a gene was used). At the end of the file there is an attribute called "species" (e.g. "hg19" for human genome build 19). It only works with hg19 and mm10 (Mus musculus, aka house mouse, genome build 10) because it points to the URL "http://sauron.ucsd.edu/flotilla_projects//datapackage.json", which we hand-curated. So we can only grab the species data if our data comes from one of these two species.
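
For anyone who hasn't opened the file, here is a rough sketch of the layout described above, written as a Python dict. Only the three resource names and the "species" attribute come from the description; the "name" value, the "path" values, and the exact nesting are assumptions for illustration and may not match the real file.

```python
import json

# Hypothetical sketch of the datapackage.json layout described above.
# Resource names and the "species" key come from the description;
# "name" and the "path" values are placeholders, not the real contents.
datapackage = {
    "name": "neural_diff_chr22",
    "resources": [
        {"name": "experiment_design", "path": "experiment_design.csv"},
        {"name": "expression", "path": "expression.csv"},
        {"name": "splicing", "path": "splicing.csv"},
    ],
    "species": "hg19",
}


def resource_names(package):
    """Return the names of all resources listed in a data package dict."""
    return [resource["name"] for resource in package["resources"]]


print(json.dumps(datapackage, indent=2))
print(resource_names(datapackage))
```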

Try this:

On a command line:

git clone git@github.com:YeoLab/flotilla
cd flotilla
pip install -e .

In Python:

import flotilla
study = flotilla.embark("http://sauron.ucsd.edu/flotilla_projects/neural_diff_chr22/datapackage.json")

This will load the data from our server at sauron.ucsd.edu; since you haven't downloaded anything with that filename yet, it will download it first. This is a test dataset with only information from human chromosome 22, so it is loadable on a regular laptop. Feel free to look through the code and JSON files. flotilla.data_model.Study.from_data_package does most of the heavy lifting in loading the data. Keep in mind that the project is in a pre-alpha stage and has a long way to go :)

What we would like

Two major issues are:

1. Get the data in the neural_diff_chr22 datapackage into a pandas.DataFrame object which can then be imported into flotilla.
   - Currently this is managed by the URL given for each file in datapackage.json, but it should first check locally for the data, so that it can be loaded offline if you have already downloaded it.
2. Grab related data, e.g. descriptions of genes and their functions: given an ID like ENSG00000100320, get the "gene symbol" (i.e. the familiar name that we know it by), RBFOX2, and the fact that this gene is an RNA-binding protein involved in alternative splicing and neural development.
   - Currently this is managed by the "species" attribute, but ideally it would be something like ENSEMBL_v75_homo_sapiens, which would link to the human data here: http://uswest.ensembl.org/info/data/ftp/index.html and then grab gene annotation (gtf files) and sequence information (fasta files) as needed by the analysis.
   - Relatedly, there is apparently an "eHive" system on ENSEMBL for data processing. I haven't explored it yet, but it may be good to be aware of.
   - Another major issue is how to merge analyses of different species' data. For example, the ENSEMBL website has mappings of human and mouse versions of genes that we could use to compare gene expression. Plus there's the HAVANA project, which categorizes orthologous (evolutionarily related) genes between different vertebrates. But what if I want to compare across non-traditional species? And many of them, not just two? I would like to be able to easily grab these data and submit a job (either to our local supercomputer or to Amazon AWS) that runs a script and outputs a mapping with some unique keys that you could merge all your different data on.
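
The "check locally first" behaviour in the first issue could look roughly like the sketch below. This is not flotilla's code: the function names, the cache file naming scheme, and the injectable downloader argument are all assumptions made for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path


def cached_path(url, cache_dir):
    """Map a URL to a stable file name inside cache_dir (hypothetical scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    basename = url.rstrip("/").rsplit("/", 1)[-1] or "resource"
    return Path(cache_dir) / f"{digest}_{basename}"


def fetch(url, cache_dir, downloader=urllib.request.urlretrieve):
    """Return a local path for url, downloading only if it is not cached yet."""
    target = cached_path(url, cache_dir)
    if target.exists():
        # Already on disk: no network access needed, so this works offline.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    downloader(url, str(target))
    return target
```

The returned path could then be handed to pandas.read_csv (or whichever reader matches the file format) to get the DataFrame. Hashing the URL into the file name is only there so that two resources with the same basename on different servers don't collide.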

Ideally, we could do something like this:

study = flotilla.embark('neurons')

This would fetch our mouse and human neuron data, follow some kind of link to ENSEMBL, attach all the relevant metadata about mouse and human genes, and give common keys where possible.
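
To make "common keys" concrete, here is a toy sketch of merging two per-species expression tables on an ortholog mapping. It uses plain dicts rather than flotilla or pandas; the function name, the mouse gene ID, and all numbers are placeholders, and only ENSG00000100320 / RBFOX2 comes from this thread.

```python
def merge_on_orthologs(human_expr, mouse_expr, orthologs):
    """Join two {gene_id: value} tables on a shared ortholog key.

    orthologs is a list of (common_key, human_id, mouse_id) tuples.
    Pairs where either gene is missing from its table are skipped.
    """
    merged = {}
    for key, human_id, mouse_id in orthologs:
        if human_id in human_expr and mouse_id in mouse_expr:
            merged[key] = {
                "human": human_expr[human_id],
                "mouse": mouse_expr[mouse_id],
            }
    return merged


# Illustrative values only: the mouse ID and all numbers are placeholders.
human_expr = {"ENSG00000100320": 12.5, "human_only_gene": 3.0}
mouse_expr = {"mouse_gene_1": 9.75}
orthologs = [
    ("RBFOX2", "ENSG00000100320", "mouse_gene_1"),
    ("OTHER", "human_only_gene", "mouse_gene_2"),
]

print(merge_on_orthologs(human_expr, mouse_expr, orthologs))
```

With more than two species the same idea would need one ID column per species in the mapping, which is the part we don't yet have a good source for.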

@mlovci - please add on with anything I missed.

bmpvieira · 2014-07-10T22:00:48Z

Hi @olgabot and @mlovci,

I think for the first major issue you could do something like Dat -> JSON -> Pandas. That is, use Dat for versioning, archiving, distribution, etc., and Pandas for analysis. However, maybe Dat could replace Pandas and provide functionality similar to flotilla when #133 gets solved?

For the second major issue, you could for now build a pipeline/tool (have a look at gasket) with bionode-ncbi and dat. A bionode-ensembl module is planned, but for now you can use ncbi and run bionode-ncbi search gene ENSG00000100320 | dat import --json. In the object returned by the search, the property name has the value RBFOX2 and taxid has the taxon id 9606, corresponding to Homo sapiens. If you run bionode-ncbi search genome txid9606, the property assembly_name gives you the common name of the reference assembly, GRCh38. If you run bionode-ncbi search assembly GRCH38, you get an array in meta.FtpSites.FtpPath with two objects. The one that has type RefSeq has the property _ with value ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/. In that FTP folder there's a directory named GFF with the annotations.
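
On the Python side, picking those two properties out of the search output could be sketched as below. It assumes the output is newline-delimited JSON (one object per line), which is an assumption rather than something confirmed here; the sample record is hypothetical and contains only the name and taxid properties mentioned above.

```python
import json


def summarize_gene_records(lines):
    """Yield (name, taxid) pairs from newline-delimited JSON records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record.get("name"), record.get("taxid")


# Hypothetical record holding only the two properties mentioned above.
sample = ['{"name": "RBFOX2", "taxid": 9606}', ""]
print(list(summarize_gene_records(sample)))
```

Passing sys.stdin instead of the sample list would let a script like this sit at the end of a shell pipe.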

Alternatively, you can just update bionode-ncbi and do bionode-ncbi download gff homo sapiens :-P

As we keep adding features to these projects, things will get simpler. What I'm trying to do in my PhD is similar to this use case: first, get all the metadata and store it in Dat; second, reformat and query the metadata to find what's interesting to me; third, send the reference IDs to the cluster so that the pipeline fetches all the raw data and runs the heavy analysis there.

max-mapper · 2014-07-19T10:08:21Z

@mlovci hey random question, but can you share the big flowchart diagram you showed me at nodeconf?

mlovci · 2014-07-19T14:57:02Z

Here, @maxogden: http://imgur.com/Ixub1nk.jpg (I'd appreciate it if you credit me when you use it). I can give you high-res too, if you need it.

@bmpvieira : Just saw this note, I'll try to get bionode-ncbi going and see what happens.

jbenet · 2014-07-19T21:38:47Z

That's so depressing to me :( It looks like a metabolic pathway in the worst of ways.

mlovci · 2014-07-19T22:10:02Z

:(

webmaven · 2014-07-21T20:14:05Z

I find it stimulating, myself. It is nice that the flows are (relatively) straightforward, and aren't tangled into circular references or a big-ball-o-mud, which means parts can be encapsulated at various scales, components switched out, APIs standardized, etc. Thanks, @mlovci!

webmaven · 2014-07-21T20:15:48Z

@mlovci, an article describing the flow in more detail would likely be rather useful, BTW (and give the image a natural 'home' on the web).

mlovci · 2014-07-21T20:48:53Z

thanks @webmaven that made me feel better. I'll think about writing an article.

webmaven · 2014-07-21T21:07:05Z

You're welcome @mlovci. Anyway, the 'metabolic pathways' analogy breaks down since there aren't a bunch of crazy feedback loops (and the couple of places that look like there might be are just because things were moved to fit horizontally).

max-mapper · 2014-08-19T03:01:10Z

@mlovci @olgabot Heya! We are finally done with the first stable version of dat. Our new website is up at http://dat-data.com/

It would be good to do a google hangout with you guys and @bmpvieira and @mafintosh to discuss how we could start modeling your data pipeline with dat!

olgabot · 2014-08-19T03:28:54Z

Yes, definitely! Out of curiosity, what's your current hosting solution? If we have some data but no public server, how do we proceed?

max-mapper · 2014-08-19T05:08:03Z

@olgabot we have been hosting mostly on https://www.digitalocean.com/ and working on large file backend hosting on Google Cloud Services, but we are also working on this list of hosts we'd like to support: dat-ecosystem-archive/datproject-discussions#5

olgabot · 2014-08-20T01:10:56Z

Cool! Would you be available for a hangout tomorrow at 1pm pst?

max-mapper · 2014-08-20T01:14:05Z

@olgabot yes definitely. @bmpvieira @mafintosh i'll invite you guys to the hangout tomorrow as well in case you wanna join

max-mapper · 2014-08-20T21:28:47Z

took some notes from the hangout today (mostly random, just posting for archival purposes):

high level rna-seq pipeline notes (probably wrong):

rna* -> bam/sam
     -> metadata
     -> splice junction file

-> gene expression -> tpm
-> splicing -> miso -> miso summary -> percent spliced in (psi)
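
For readers less familiar with the two quantities at the end of those arrows, here is a small worked sketch. TPM is computed the standard way (length-normalize the counts, then scale so they sum to one million). The psi function is a naive read-count ratio, not what MISO does (MISO estimates psi with a probabilistic model). All numbers are made up.

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and gene lengths (in bases)."""
    # Reads per base for each gene, so long genes are not over-counted.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    # Rescale so the values across all genes sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def psi(inclusion_reads, exclusion_reads):
    """Naive percent spliced in: fraction of reads supporting inclusion."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None  # no reads cover the event, so psi is undefined
    return inclusion_reads / total


# Made-up numbers for illustration.
counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 1000}
print(tpm(counts, lengths))
print(psi(30, 10))
```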

would be cool:
bam -> dat -> dalliance browser
^ dat could even act as cross-domain (CORS) proxy

random notes/links:

http://i.imgur.com/Ixub1nk.jpg
https://github.com/datproject/discussions/issues/5
dat -> pandas dataframe (can just use csv)
https://github.com/gpratt/gatk/tree/master/public/scala/src/org/broadinstitute/sting/queue/extensions/yeo_scripts/yeo
http://dat-ncbi-arthropod-full.inb.io
SGE
qsub
npm.dathub.org
https://github.com/bionode/bionode#project-status
Ensembl RESTful API http://rest.ensembl.org/

fastq -> vcf, genome map, structural variance

bionode-ensembl
bionode-sra

use google-drive for easily sharing big files over internet

olgabot · 2014-08-20T22:06:20Z

More Bioinformatics pipelining tools to know about (suggested from @gpratt)

Spiral Genetics: http://www.spiralgenetics.com/
DNA Nexus (Amazon AWS) backends: https://dnanexus.com/
Encode Data Coordination Center metadata database: code: https://github.com/ENCODE-DCC/encoded and interface: https://www.encodeproject.org/ (e.g. search for "stem cells")

bmpvieira · 2014-08-21T17:50:57Z

More pipelines/workflows:

Basespace https://basespace.illumina.com
Genestack http://genestack.org
iPlant http://www.iplantcollaborative.org
MG-RAST http://metagenomics.anl.gov / github.com/MG-RAST/MG-RAST
Taverna http://www.taverna.org.uk
Kepler https://kepler-project.org

thadguidry · 2014-08-21T18:17:27Z

+1 for Taverna... A few of my bioinformatics friends around Dallas universities and research groups actually use this, and I introduced Taverna to a few of them myself. Taverna already has data integration built in to many external tools as well, and there is already a REST and XPATH plugin available as a good starting point for anyone: http://www.taverna.org.uk/documentation/taverna-2-x/taverna-2-x-plugins/#rest

gpratt · 2014-08-21T18:20:08Z

If we are talking about pipeline tools, I use Queue: http://gatkforums.broadinstitute.org/discussion/1306/overview-of-queue

It works quite well for managing a cluster / wrapping commandline tools.

Gabriel Pratt
Bioinformatics Graduate Student, Yeo Lab
University of California San Diego

brainstorm · 2014-08-22T09:15:28Z

@bmpvieira, don't forget about Galaxy: http://galaxyproject.org/

bmpvieira · 2014-08-22T09:36:37Z

Thanks @brainstorm! Of course, Galaxy should be the first place to look. In my head, I thought someone had already mentioned it and I was giving alternatives. We talked about Galaxy in the hangout.

saketkc · 2014-11-15T06:20:30Z

This sounds really cool! I am going to rope in @jmchilton

olgabot · 2014-11-27T19:24:06Z

Has someone started writing a wrapper for the STAR genome aligner already? I'd like to write a blog post which uses Docker and dat to download the data from this paper (~202 GB of sequences), run RNA-seq alignment using either STAR or GSNAP, quantify gene expression using RSEM, create a matrix of expression values for all genes and all samples (~350 samples), and then use flotilla to recreate all the figures.

joehand · 2016-06-17T18:43:58Z

Moved to dat-ecosystem-archive/datproject-discussions#46

webmaven · 2016-07-03T03:17:52Z

@joehand, it looks like you missed copying over the last comment by olgabot.

max-mapper added the discussion/use case label Aug 18, 2014

max-mapper added the use case label Aug 20, 2014

joehand mentioned this issue Jun 17, 2016

Bioinformatics use case (RNA-Seq analysis) dat-ecosystem-archive/datproject-discussions#46

Open

joehand closed this as completed Jun 17, 2016
