Proteogenomics workflow #70
Comments
For me a workflow is just a higher level abstraction, based on tools. I added this label, but it would be nice to explicitly state which tools are needed :) Thanks @chambm!!! |
@chambm I generally build a workflow and test it in galaxy. All the tools in the workflow should be retrievable from the same toolshed, preferably https://toolshed.g2.bx.psu.edu |
Silly me, I didn't even search the toolshed for workflows. Actually I've never installed a workflow from the toolshed before. 😮 This looks like a good candidate? https://toolshed.g2.bx.psu.edu/view/galaxyp/proteomics_rnaseq_sap_db_workflow/3a11830963e3 |
Or this one? https://github.com/bgruening/galaxytools/tree/master/workflows/glimmer3 |
This work is aimed at human (possibly mouse) data, so we don't need to infer annotations. That's a whole other bag of genes! |
Hello Matt and Bjoern, I will get back to you with answers by evening today. Regards, |
Here are the workflows we used at the GCC2016 tutorial: |
@chambm this was just an example of workflows in git and github with example data and so on :) It's all on the TS as well. |
Hello Matt, I will work on workflow components from the workflow that we generated for https://github.com/galaxyproteomics/tools-galaxyp/tree/master/workflows/ We will need a few hours to come up with a hybrid workflow after discussion Regards, |
This is a well-written tutorial! https://netfiles.umn.edu/users/pjagtap/ABRF%202016/ABRF_2016_SW4_Galaxy_for_Multi-Omics.pdf I look forward to hearing your ideas for a hybrid. I worry a bit when I look at the GCC workflow's complexity. Galaxy's job failure handling just isn't robust enough (yet) to handle failures in such a complex workflow. It was a pain to rerun subsets of the Omicron workflow when one step failed halfway through the workflow, and that workflow is about a tenth of the size. |
Hello Matt, Yes - we will need to run these as sub-workflows which we did for the … As you might be aware, Galaxy also offers the ability to rerun the subsequent … Looking forward to a hybrid Omicron-GalaxyP workflow! Regards, Pratik
|
I should clarify that it was failures in a dataset collection that caused the problem. Jobs on single input files could be easily rerun. But jobs where only a few files from a collection failed could not be rerun. Yet using collections reduces the history size by a factor of , e.g. 10-25 fewer history items. |
Interesting - dataset collection has worked well in our hands. It will be Pratik Jagtap, |
Ping @tjgriff1 |
|
Pratik and Tim: the Omicron version of CustomProDB and PSM2SAM depend on a few Galaxy data managers to download the genome FASTA, index it with Bowtie, and download gene annotations from UCSC Table Browser (done inside the CustomProDB R script). Do you want to keep this architecture or look at some alternate implementation? Currently, the user needs to run these data manager "workflows" first, and refresh the local data tables manually because Galaxy doesn't update it automatically yet. I can see either moving all this reference data into the user's history so it can be a proper part of the workflow, or doing these steps in some kind of initialization step. The Omicron approach uses Docker to get a custom Galaxy flavor running instantly, but it can't include the reference data without making the Docker image gigantic. So I let the user download this reference data from within Galaxy using the data managers, rather than with an initialization script which runs when the Docker container runs. With the script alternative, it will take hours before the Galaxy flavor is running, but when it does run, it's directly ready for the workflow. |
Hello Matt, I am copying @getiria-onsongo, @jj-umn, @tmcgowan on this so that they can … Regards,
|
Hi - here's my thoughts. When using the Docker implementation, it seems I'm also envisioning that not all users will be dependent on the Docker In both cases it will be important to have the workflow documented so they Make sense?
Tim Griffin, Ph.D. Office: 7-144 Molecular Cellular Biology (MCB) Tel: 612-624-5249 https://www.cbs.umn.edu/bmbb/contacts/timothy-j-griffin |
My preference would be for the tool to have the option to use prebuilt indices and Shared Data Libraries for reference data. That would be the model that makes the most sense for institutions such as galaxy main or MSI that run persistent Galaxy instances. Retaining the option to build these on the fly makes sense when one is using the docker model and only using galaxy as a workflow engine, rather than as a collaboration and data sharing environment. JJ |
Does Galaxy have the necessary data types for indices to be able to share them in data libraries? |
No. |
Good point JJ - for the persistent instances it would be really good to
|
How coupled is SearchGUI/PeptideShaker to the MGF format? If we could keep things as mzML or mz5, preserving nativeID would be a big advantage for tracing downstream ids back to their source spectra. |
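If MGF turns out to be unavoidable, one possible fallback is to recover the nativeID from the spectrum title. The sketch below assumes the MGF was written by a converter that embeds a `NativeID:"..."` field in each TITLE line (ProteoWizard's msconvert does this with its default title format); whether SearchGUI/PeptideShaker carries that title through to its reports is not established in this thread. The function names, the `.raw` file name, and the scan number are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumption: the TITLE line contains a field of the form NativeID:"...".
NATIVE_ID_FIELD = re.compile(r'NativeID:"([^"]*)"')


def native_id_from_title(title_line: str) -> Optional[str]:
    """Return the nativeID embedded in an MGF TITLE line, or None if absent."""
    match = NATIVE_ID_FIELD.search(title_line)
    return match.group(1) if match else None


def parse_native_id(native_id: str) -> Dict[str, str]:
    """Split a space-separated key=value nativeID into a dictionary."""
    return dict(pair.split("=", 1) for pair in native_id.split())


# Hypothetical TITLE line; only the run name Mo_Tai_iTRAQ_f5 comes from the thread.
example_title = (
    'TITLE=Mo_Tai_iTRAQ_f5.1234.1234.2 File:"Mo_Tai_iTRAQ_f5.raw", '
    'NativeID:"controllerType=0 controllerNumber=1 scan=1234"'
)

example_native_id = native_id_from_title(example_title)
example_fields = parse_native_id(example_native_id) if example_native_id else {}
```

A title without the field simply yields `None`, so identifications from such files would still need another way back to their source spectra.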
Hello Matt, Currently MGF is the only input format that SearchGUI seems to support. We Regards, |
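@jj-umn In the GCC2016 workflow, there is a Select step filtering for ^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$ - This appears to be related to the Mouse pre-pro-B and pro-B FASTQ files, but the proteomic data I have is named like Mo_Tai_iTRAQ_f5. It looks like it's trying to filter on the protein column in the Peptide Shaker output, but my protein column looks like:
1 0, 1.503, 1.5436, 49.5419, 5.2189, 5.2784, 5.2833 GLLLYGPPGTGK
2 731.5188 SIYYITGESK
So I guess it has something to do with the input FASTA. Is it supposed to select only the non-reference accessions for each PSM? |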
Hello Matt, If this is a filtering step for the PSM Report output from PeptideShaker - then we used the accession numbers from the protein FASTA file (pre-pro-B and pro-B) to parse out the peptides from RNAseq data. Do you have accession numbers associated with these peptide identifications? @jj-umn might be able to provide more information. Regards, Pratik
Pratik Jagtap,
Research Assistant Professor, Department of Biochemistry, Molecular Biology
and Biophysics, University of Minnesota.
…On Tue, Dec 13, 2016 at 1:25 PM, Matt Chambers ***@***.***> wrote:
@jj-umn <https://github.com/jj-umn> In the GCC2016 workflow, there is a
Select step filtering for ^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$ - This
appears to be related to the Mouse pre-pro-B and pro-B FASTQ files, but the
proteomic data I have is named like Mo_Tai_iTRAQ_f5. It looks like it's
trying to filter on the protein column in the Peptide Shaker output, but my
protein column looks like:
1 0, 1.503, 1.5436, 49.5419, 5.2189, 5.2784, 5.2833 GLLLYGPPGTGK
2 731.5188 SIYYITGESK
So I guess it has something to do with the input FASTA. Is it supposed to
select only the non-reference accessions for each PSM?
|
@jj-umn Is the regex above for selecting only PSMs that map to ONLY non-reference protein sequences? |
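One way to check what the Select pattern quoted in this thread keeps is to run it against a few tab-separated rows. The sketch below uses Python's `re` module; the row layout (row number, protein column, peptide), the accession names such as `proB_0001`, and the peptide strings are made up for illustration and are not taken from the GCC2016 data.

```python
import re

# The Select pattern exactly as quoted in the thread.
SELECT_PATTERN = re.compile(r"^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$")


def keep_row(line: str) -> bool:
    """Return True if the Select step would keep this tab-separated row."""
    return SELECT_PATTERN.match(line) is not None


# Hypothetical rows: row number, protein accession(s), peptide.
rows = [
    "1\tproB_0001\tPEPTIDEK",                       # one pr.B accession
    "2\tpreB_0002, proB_0003\tSAMPLEK",             # only pr.B accessions
    "3\tproB_0001, sp|P12345|EXAMPLE\tSAMPLEK",     # mixed with another accession
    "4\tsp|P12345|EXAMPLE\tPEPTIDEK",               # no pr.B accession
]

kept = [row for row in rows if keep_row(row)]
```

Under these assumptions, rows 1 and 2 are kept and rows 3 and 4 are dropped: a row passes only when every comma-separated entry in the second column starts with `pr`, any single character, then `B`. A row whose PSM also maps to any other accession fails the match.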
Can someone (@PratikDJagtap) point me to the Galaxy-P proteogenomic workflow into which I should integrate my Omicron tools, e.g. CustomProDB and PSM2SAM? I checked the "Published workflows" section of the public Galaxy-P site and it's not there. We can discuss any design considerations for the fused workflow here.
I see there's a "Tool needed" label; it begs the question, why is there no "Workflow needed" label? Pinging @bgruening because I don't know who better to ask. ;)