Proteogenomics workflow #70

Closed
chambm opened this issue Nov 2, 2016 · 27 comments

@chambm
Contributor

chambm commented Nov 2, 2016

Can someone (@PratikDJagtap) point me to the Galaxy-P proteogenomic workflow into which I should integrate my Omicron tools, e.g. CustomProDB and PSM2SAM? I checked the "Published workflows" section of the public Galaxy-P site and it's not there. We can discuss any design considerations for the fused workflow here.

I see there's a "Tool needed" label, which raises the question: why is there no "Workflow needed" label? Pinging @bgruening because I don't know who better to ask. ;)

@bgruening
Member

For me a workflow is just a higher-level abstraction based on tools. I added this label, but it would be nice to explicitly state which tools are needed :)

Thanks @chambm!!!

@jj-umn
Member

jj-umn commented Nov 2, 2016

@chambm I generally build a workflow and test it in galaxy. All the tools in the workflow should be retrievable from the same toolshed, preferably https://toolshed.g2.bx.psu.edu
To publish the workflow in the toolshed, it should have a repository_dependencies.xml that contains all the tool dependencies. @PratikDJagtap Perhaps you and Getiria can work on a demo workflow.
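For reference, a minimal repository_dependencies.xml has this shape (the repository name, owner, and changeset revision below are placeholders, not real entries):

```xml
<?xml version="1.0"?>
<repositories description="Tool dependencies for the proteogenomics workflow">
    <!-- One entry per Tool Shed repository the workflow's tools come from -->
    <repository toolshed="https://toolshed.g2.bx.psu.edu"
                name="some_tool_repo"
                owner="galaxyp"
                changeset_revision="0123456789ab" />
</repositories>
```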

@chambm
Contributor Author

chambm commented Nov 2, 2016

Silly me, I didn't even search the toolshed for workflows. Actually I've never installed a workflow from the toolshed before. 😮 This looks like a good candidate? https://toolshed.g2.bx.psu.edu/view/galaxyp/proteomics_rnaseq_sap_db_workflow/3a11830963e3

@bgruening
Member

Or this one? https://github.com/bgruening/galaxytools/tree/master/workflows/glimmer3
We accept PRs here :)

@chambm
Contributor Author

chambm commented Nov 2, 2016

This work is aimed at human (possibly mouse) data, so we don't need to infer annotations. That's a whole other bag of genes!

@PratikDJagtap
Member

PratikDJagtap commented Nov 2, 2016

Hello Matt and Bjoern,

I will get back to you with answers by evening today.

Regards,
Pratik

@jj-umn
Member

jj-umn commented Nov 2, 2016

Here's the workflows we used at GCC2016 tutorial:
https://github.com/galaxyproteomics/tools-galaxyp/tree/master/workflows/gcc2016_tutorial

@bgruening
Member

@chambm this was just an example of workflows in git and github with example data and so on :) It's all on the TS as well.

@PratikDJagtap
Member

PratikDJagtap commented Nov 3, 2016

Hello Matt,

I will work on workflow components from the workflow that we generated for GCC2016 (mentioned by JJ below). JJ, we can also look at workflows from ABRF2016.

https://github.com/galaxyproteomics/tools-galaxyp/tree/master/workflows/gcc2016_tutorial

We will need a few hours to come up with a hybrid workflow after discussion with the Galaxy-P team. If required, it would also be a good idea to have a telephone / Google Hangouts session.

Regards,
Pratik

@chambm
Contributor Author

chambm commented Nov 3, 2016

This is a well-written tutorial! https://netfiles.umn.edu/users/pjagtap/ABRF%202016/ABRF_2016_SW4_Galaxy_for_Multi-Omics.pdf

I look forward to hearing your ideas for a hybrid. I worry a bit when I look at the GCC workflow's complexity. Galaxy's job failure handling just isn't robust enough (yet) to handle failures in such a complex workflow. It was a pain to rerun subsets of the Omicron workflow when one step failed halfway through the workflow, and that workflow is about a tenth of the size.

@PratikDJagtap
Member

Hello Matt,

Yes - we will need to run these as sub-workflows, which is what we did for the workshop and what we suggest users do when running it for their own projects.

As you might be aware, Galaxy also offers the ability to rerun the subsequent steps after a workflow failure (once the failed job issue is taken care of), so the user need not start from the beginning.

Looking forward to a hybrid Omicron-GalaxyP workflow!

Regards,

Pratik


@chambm
Contributor Author

chambm commented Nov 3, 2016

I should clarify that it was failures in a dataset collection that caused the problem. Jobs on single input files could easily be rerun, but jobs where only a few files from a collection failed could not. Yet using collections reduces the history size by a large factor, e.g. 10-25x fewer history items.

@PratikDJagtap
Member

PratikDJagtap commented Nov 3, 2016

Interesting - dataset collections have worked well in our hands. It will be good to exchange notes as we proceed.

Pratik

@chambm
Contributor Author

chambm commented Nov 3, 2016

Ping @tjgriff1

@bgruening
Member

Galaxy's job failure handling just isn't robust enough (yet)
There is a fix in the latest dev branch that makes this more robust, if I recall correctly.

@chambm
Contributor Author

chambm commented Nov 7, 2016

Pratik and Tim: the Omicron version of CustomProDB and PSM2SAM depend on a few Galaxy data managers to download the genome FASTA, index it with Bowtie, and download gene annotations from UCSC Table Browser (done inside the CustomProDB R script).

Do you want to keep this architecture or look at some alternate implementation? Currently, the user needs to run these data manager "workflows" first and refresh the local data tables manually, because Galaxy doesn't update them automatically yet. I can see either moving all this reference data into the user's history so it can be a proper part of the workflow, or doing these steps in some kind of initialization step.

The Omicron approach uses Docker to get a custom Galaxy flavor running instantly, but it can't include the reference data without making the Docker image gigantic. So I let the user download this reference data from within Galaxy using the data managers, rather than with an initialization script that runs when the Docker container starts. With the script alternative, it would take hours before the Galaxy flavor is running, but once it does, it's immediately ready for the workflow.

@PratikDJagtap
Member

PratikDJagtap commented Nov 7, 2016

Hello Matt,

I am copying @getiria-onsongo, @jj-umn, @tmcgowan on this so that they can
comment on which of the approaches would work best - or if any alternative
would help to make this process more user-friendly.

Regards,
Pratik


@tjgriff1

tjgriff1 commented Nov 8, 2016

Hi - here are my thoughts. When using the Docker implementation, it seems reasonable to continue doing this the way you have in the past: having the user download the needed reference data after the instance is up and running. That keeps the Docker image more agile. As long as we document the process users need to follow to download the reference data and get it processed correctly for DB generation, it should be an OK way to approach this.

I'm also envisioning that not all users will depend on the Docker image to gain access to these workflows. Some users may already have their own Galaxy instance running, or be using cloud infrastructure where instances are already in place (e.g. Jetstream). For these users, we can share the workflows they need to make the pipeline run. These workflows can include the steps needed to create the DB via CustomProDB.

In both cases it will be important to have the workflow documented so users know what they need to do to make it work.

Make sense?

  • TG


@PratikDJagtap
Member

PratikDJagtap commented Nov 8, 2016

My preference would be for the tool to have the option to use prebuilt indices and Shared Data Libraries for reference data.

That would be the model that makes the most sense for institutions such as galaxy main or MSI that run persistent Galaxy instances.

Retaining the option to build these on the fly makes sense when one is using the docker model and only using galaxy as a workflow engine, rather than as a collaboration and data sharing environment.

JJ
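For reference, the usual way a Galaxy tool offers both options is a conditional in the tool XML, along these lines (adapted from the pattern used by common wrappers such as bowtie2; parameter names here are illustrative):

```xml
<conditional name="reference_genome">
  <param name="source" type="select"
         label="Use a built-in genome index or a genome from your history?">
    <option value="indexed">Use a built-in genome index</option>
    <option value="history">Use a genome from the history</option>
  </param>
  <when value="indexed">
    <!-- Prebuilt indices registered via data managers / .loc files -->
    <param name="index" type="select" label="Select reference genome">
      <options from_data_table="bowtie2_indexes"/>
    </param>
  </when>
  <when value="history">
    <!-- Built on the fly from a FASTA dataset in the history -->
    <param name="own_file" type="data" format="fasta"
           label="Select the reference genome"/>
  </when>
</conditional>
```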

@chambm
Contributor Author

chambm commented Nov 8, 2016

Does Galaxy have the necessary data types for indices to be able to share them in data libraries?

@jj-umn
Member

jj-umn commented Nov 8, 2016

No.
We use data managers for the index files (fai, 2bit, bowtie, bwa, hisat, etc.), which get added to the .loc files. We symlink the annotation files (GTF, VCF, etc.) for those references in as Shared Data Libraries. These are admin tasks.
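For context, a data manager appends tab-separated lines like the following to a .loc file such as bowtie2_indices.loc (the path below is illustrative):

```
#<unique_build_id>	<dbkey>	<display_name>	<file_base_path>
hg19	hg19	Human (hg19)	/galaxy/tool-data/hg19/bowtie2_index/hg19
```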

@tjgriff1

tjgriff1 commented Nov 8, 2016

Good point JJ - for the persistent instances it would be really good to have the necessary files in the shared data library. That's much easier on the user than downloading and/or uploading large files.

  • TG


@chambm
Contributor Author

chambm commented Nov 18, 2016

How coupled is SearchGUI/PeptideShaker to the MGF format? If we could keep things as mzML or mz5, preserving nativeID would be a big advantage for tracing downstream IDs back to their source spectra.
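To illustrate what's lost: an mzML spectrum carries a structured nativeID (a Thermo-style example below), which MGF conversion typically collapses into a free-text TITLE. A minimal sketch of recovering the scan number from such a nativeID (the helper name is mine):

```python
import re

def scan_from_native_id(native_id: str) -> int:
    """Extract the scan number from a Thermo-style mzML nativeID."""
    m = re.search(r"\bscan=(\d+)\b", native_id)
    if m is None:
        raise ValueError(f"no scan number in nativeID: {native_id!r}")
    return int(m.group(1))

# Thermo raw files yield nativeIDs of this shape in mzML; once flattened
# to an MGF TITLE, this structure has to be parsed back out heuristically.
print(scan_from_native_id("controllerType=0 controllerNumber=1 scan=2234"))  # 2234
```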

@PratikDJagtap
Member

PratikDJagtap commented Nov 18, 2016

Hello Matt,

Currently MGF is the only input format that SearchGUI seems to support. We can open a GitHub issue on the SearchGUI and PeptideShaker repositories for this.

Regards,
Pratik

@chambm
Contributor Author

chambm commented Dec 13, 2016

@jj-umn In the GCC2016 workflow, there is a Select step filtering for ^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$ - This appears to be related to the Mouse pre-pro-B and pro-B FASTQ files, but the proteomic data I have is named like Mo_Tai_iTRAQ_f5. It looks like it's trying to filter on the protein column in the Peptide Shaker output, but my protein column looks like:

1 0, 1.503, 1.5436, 49.5419, 5.2189, 5.2784, 5.2833 GLLLYGPPGTGK
2 731.5188 SIYYITGESK

So I guess it has something to do with the input FASTA. Is it supposed to select only the non-reference accessions for each PSM?
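Reading that regex literally, it keeps a tab-separated row only when every comma-separated accession in the second (protein) column matches pr.B..., which would indeed select PSMs mapping exclusively to such accessions. A quick sketch with invented accession names:

```python
import re

# Select-step regex from the GCC2016 workflow: a row survives only if
# EVERY comma-separated accession in column 2 matches pr.B...
SELECT_RE = re.compile(r"^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$")

rows = [
    "1\tpr.B_var1\tGLLLYGPPGTGK",                    # variant-only: kept
    "2\tpr.B_var1, pr.B_var2\tSIYYITGESK",           # variant-only: kept
    "3\tsp|P12345|ALBU_HUMAN\tPEPTIDEK",             # reference-only: dropped
    "4\tpr.B_var1, sp|P12345|ALBU_HUMAN\tPEPTIDEK",  # mixed: dropped
]

kept = [row for row in rows if SELECT_RE.match(row)]
print([row.split("\t")[0] for row in kept])  # ['1', '2']
```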

@PratikDJagtap
Member

PratikDJagtap commented Dec 13, 2016 via email

@chambm
Contributor Author

chambm commented Dec 13, 2016

@jj-umn Is the regex above for selecting only PSMs that map to ONLY non-reference protein sequences?

@chambm chambm closed this as completed Apr 28, 2017