Bioinformatics use case (RNA-Seq analysis) #135
I think for the first major issue you could do something like Dat -> JSON -> Pandas. That is, use Dat for versioning, archiving, distribution, etc., and Pandas for analysis. However, maybe Dat could replace Pandas and provide functionality similar to flotilla when #133 gets solved? For the second major issue, for now you could build a pipeline/tool (have a look at gasket) with bionode-ncbi and dat. A bionode-ensembl module is planned, but for now you can use ncbi instead. Alternatively, you can just update bionode-ncbi. As we keep adding features to these projects, things will get simpler. What I'm trying to do in my PhD is similar to this use case: first, get all the metadata and store it in Dat; second, reformat and query the metadata to find what's interesting to me; third, send the reference IDs to the cluster so that the pipeline fetches all the raw data and runs the heavy analysis there.
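The Dat -> JSON -> Pandas idea above can be sketched in a few lines. This is a minimal illustration only: the record fields and values are invented, and it assumes data has already been exported from a Dat store as newline-delimited JSON (rather than calling any Dat API directly).

```python
import json
import pandas as pd

# Hypothetical newline-delimited JSON, as might be exported from a Dat store.
# Field names and values here are invented for illustration.
ndjson = """\
{"sample": "s1", "celltype": "neuron", "expression": 5.2}
{"sample": "s2", "celltype": "glia", "expression": 1.3}
"""

# JSON -> list of dicts -> pandas DataFrame for analysis
records = [json.loads(line) for line in ndjson.splitlines()]
df = pd.DataFrame.from_records(records).set_index("sample")
print(df.loc["s1", "celltype"])  # neuron
```

From here the DataFrame can be sliced, joined, and plotted with the usual pandas tooling, while Dat handles versioning and distribution of the underlying records.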
@mlovci hey random question, but can you share the big flowchart diagram you showed me at nodeconf?
Here @maxogden: http://imgur.com/Ixub1nk.jpg (I'd appreciate it if you credit me when you use it). I can give you a high-res version too, if you need it. @bmpvieira: Just saw this note, I'll try to get
That's so depressing to me :( It looks like a metabolic pathway in the worst of ways.
:(
I find it stimulating, myself. It is nice that the flows are (relatively) straightforward, and aren't tangled into circular references or a big-ball-o-mud, which means parts can be encapsulated at various scales, components switched out, APIs standardized, etc. Thanks, @mlovci!
@mlovci, an article describing the flow in more detail would likely be rather useful, BTW (and give the image a natural 'home' on the web).
thanks @webmaven, that made me feel better. I'll think about writing an article.
You're welcome @mlovci. Anyway, the 'metabolic pathways' analogy breaks down since there aren't a bunch of crazy feedback loops (and the couple of places that look like there might be are just because things were moved to fit horizontally).
@mlovci @olgabot Heya! We are finally done with the first stable version of dat. Our new website is up at http://dat-data.com/ It would be good to do a google hangout with you guys and @bmpvieira and @mafintosh to discuss how we could start modeling your data pipeline with dat!
Yes, definitely! Out of curiosity, what's your current hosting solution?
@olgabot we have been hosting mostly on https://www.digitalocean.com/ and working on large file backend hosting on Google Cloud Services, but we are also working on this list of hosts we'd like to support: dat-ecosystem-archive/datproject-discussions#5 |
Cool! Would you be available for a hangout tomorrow at 1pm PST?
@olgabot yes definitely. @bmpvieira @mafintosh i'll invite you guys to the hangout tomorrow as well in case you wanna join
took some notes from the hangout today (mostly random, just posting for archival purposes):
More bioinformatics pipelining tools to know about (suggested by @gpratt):
+1 for Taverna... a few of my bioinformatics friends around Dallas
-Thad
If we are talking about pipeline tools, I use It works quite well for managing a cluster / wrapping command-line tools. Gabriel Pratt
@bmpvieira, don't forget about Galaxy: http://galaxyproject.org/
Thanks @brainstorm! Of course, Galaxy should be the first place to look. In my head, I thought someone had already mentioned it and I was giving alternatives. We talked about Galaxy in the hangout.
This sounds really cool! I am going to rope in @jmchilton
Has someone already started writing a wrapper for the STAR genome aligner? I'd like to write a blog post which uses Docker and dat to download the data from this paper (~202 GB of sequences), run RNA-seq alignment using either STAR or GSNAP, quantify gene expression using RSEM, create a matrix of expression values for all genes and all samples (~350 samples), and then use flotilla to recreate all the figures.
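The last step of that pipeline, collecting per-sample quantifications into one genes-by-samples expression matrix, can be sketched with pandas. The sample names, gene IDs, and values below are invented for illustration; in practice they would come from RSEM output files.

```python
import pandas as pd

# Toy per-sample quantifications (e.g. TPM values, as an RSEM run might
# report them); gene IDs and numbers are invented for illustration.
quantifications = {
    "sample_01": {"ENSG00000100320": 12.5, "ENSG00000000003": 0.4},
    "sample_02": {"ENSG00000100320": 9.8},  # genes missing in a sample become NaN
}

# Build a genes-x-samples matrix; fill genes not quantified in a sample with 0
matrix = pd.DataFrame(quantifications).fillna(0.0)
print(matrix.shape)  # (2, 2): 2 genes x 2 samples
```

With ~350 samples the same construction applies; only the dict of per-sample results grows.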
@joehand, it looks like you missed copying over the last comment by olgabot. |
Hi @jbenet and @maxogden! Thank you so much for the time you took to meet with @mlovci and me this weekend. Here's an overview of our current data management situation and what our ideal case would be.
What we have now
Currently, we host a `datapackage.json` file which contains `resources` with the names `"experiment_design"` (metadata on the samples, e.g. cell type and colors to plot them with), `"expression"` (gene expression data), and `"splicing"` (scores of which version of a gene was used). Then, at the end of the file, we have an attribute called `"species"` (e.g. `"hg19"` for the human genome build 19) that only works with `hg19` and `mm10` (Mus musculus, aka house mouse, genome build 10) because it points to the URL "http://sauron.ucsd.edu/flotilla_projects//datapackage.json", which we hand-curated. So if the data we use is from one of these two species, we can grab the data.

Try this:
On a command line:
In Python:
This will load the data from our server at sauron.ucsd.edu, and since you haven't downloaded anything with that filename yet, it will download it. Additionally, this is a test dataset with only information from human chromosome 22, so it is loadable on a regular laptop. Feel free to look through the code and JSON files.
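A small sketch of the `datapackage.json` structure described above, and how a loader might read it. The resource names come from the text; the `path` values are invented, and the species-to-URL pattern is an assumption inferred from the double slash in the hand-curated URL above (which looks like a slot for the species name).

```python
import json

# A minimal datapackage.json shaped like the one described above.
# Resource names are from the text; paths and the species value are illustrative.
datapackage = json.loads("""
{
  "name": "neural_diff_chr22",
  "resources": [
    {"name": "experiment_design", "path": "experiment_design.csv"},
    {"name": "expression", "path": "expression.csv"},
    {"name": "splicing", "path": "splicing.csv"}
  ],
  "species": "hg19"
}
""")

# List the resources a loader would fetch
names = [r["name"] for r in datapackage["resources"]]

# Assumed pattern: the species name fills the empty slot in the
# hand-curated URL mentioned above.
species_url = ("http://sauron.ucsd.edu/flotilla_projects/"
               + datapackage["species"] + "/datapackage.json")
print(names)  # ['experiment_design', 'expression', 'splicing']
```

A loader following the wishlist below would check for these paths on disk before hitting the network.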
`flotilla.data_model.Study.from_data_package` does most of the heavy lifting in loading the data. Keep in mind that the project is also in a pre-alpha stage, and has a long way to go :)

What we would like
Two major issues are:
1. Loading the `neural_diff_chr22` datapackage into a `pandas.DataFrame` object, which can then be imported into `flotilla`. Loading fetches the `datapackage.json` file for that dataset, but it should first check locally for the data and be able to load offline if you already have the data downloaded.
2. Taking an ID like `ENSG00000100320` and getting the "gene symbol" (i.e. the familiar name that we know it by) of RBFOX2, and the fact that this gene is an RNA-binding protein involved in alternative splicing and neural development. Right now we have the `"species"` attribute, but ideally it would be something like `ENSEMBL_v75_homo_sapiens`, which would link to the human data here: http://uswest.ensembl.org/info/data/ftp/index.html and then grab gene annotation (`gtf` files) / sequence information (`fasta` files) as needed by the analysis.

Ideally, we could do something like this:
This would fetch our mouse and human neuron data, which would have some kind of link to ENSEMBL, attach all the relevant metadata about mouse and human genes, and give common keys where possible.
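The "common keys where possible" part could look something like the sketch below: matching genes across species by a shared key such as the upper-cased gene symbol. All data here is invented for illustration (the mouse Ensembl ID in particular is a placeholder, not a verified accession); real annotations would come from ENSEMBL.

```python
import pandas as pd

# Invented gene annotations; real ones would come from ENSEMBL gtf files.
human = pd.DataFrame({
    "ensembl_id": ["ENSG00000100320"],  # RBFOX2 (from the text above)
    "symbol": ["RBFOX2"],
})
mouse = pd.DataFrame({
    "ensembl_id": ["ENSMUSG_PLACEHOLDER"],  # placeholder mouse ID
    "symbol": ["Rbfox2"],
})

# Use the upper-cased gene symbol as a shared key across species
human["key"] = human["symbol"].str.upper()
mouse["key"] = mouse["symbol"].str.upper()

# Join human and mouse metadata on the common key
merged = human.merge(mouse, on="key", suffixes=("_hg", "_mm"))
print(merged["key"].tolist())  # ['RBFOX2']
```

Symbol matching is only a heuristic (orthologs don't always share symbols), so a real implementation would likely prefer ENSEMBL's homology mappings where available.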
@mlovci - please add on with anything I missed.