Proteogenomics workflow #70

Closed
chambm opened this issue Nov 2, 2016 · 27 comments

@chambm
Contributor

chambm commented Nov 2, 2016

Can someone (@PratikDJagtap) point me to the Galaxy-P proteogenomic workflow into which I should integrate my Omicron tools, e.g. CustomProDB and PSM2SAM? I checked the "Published workflows" section of the public Galaxy-P site and it's not there. We can discuss any design considerations for the fused workflow here.

I see there's a "Tool needed" label, which raises the question: why is there no "Workflow needed" label? Pinging @bgruening because I don't know who better to ask. ;)

@bgruening
Member

For me a workflow is just a higher-level abstraction based on tools. I added this label, but it would be nice to explicitly state which tools are needed :)

Thanks @chambm!!!

@jj-umn
Member

jj-umn commented Nov 2, 2016

@chambm I generally build a workflow and test it in galaxy. All the tools in the workflow should be retrievable from the same toolshed, preferably https://toolshed.g2.bx.psu.edu
To publish the workflow in the toolshed, it should have a repository_dependencies.xml that contains all the tool dependencies. @PratikDJagtap Perhaps you and Getiria can work on a demo workflow.
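For reference, a minimal repository_dependencies.xml has this shape (the repository name, owner, and changeset revision below are placeholders, not real entries):

```xml
<?xml version="1.0"?>
<repositories description="Tool dependencies for the proteogenomics workflow">
    <!-- One entry per Tool Shed repository the workflow's tools come from -->
    <repository toolshed="https://toolshed.g2.bx.psu.edu"
                name="some_tool_repo"
                owner="galaxyp"
                changeset_revision="0123456789ab" />
</repositories>
```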

@chambm
Contributor Author

chambm commented Nov 2, 2016

Silly me, I didn't even search the toolshed for workflows. Actually I've never installed a workflow from the toolshed before. 😮 This looks like a good candidate? https://toolshed.g2.bx.psu.edu/view/galaxyp/proteomics_rnaseq_sap_db_workflow/3a11830963e3

@bgruening
Member

Or this one? https://github.com/bgruening/galaxytools/tree/master/workflows/glimmer3
We accept PRs here :)

@chambm
Contributor Author

chambm commented Nov 2, 2016

This work is aimed at human (possibly mouse) data, so we don't need to infer annotations. That's a whole other bag of genes!

@PratikDJagtap
Member

PratikDJagtap commented Nov 2, 2016

Hello Matt and Bjoern,

I will get back to you with answers by evening today.

Regards,
Pratik

@jj-umn
Member

jj-umn commented Nov 2, 2016

Here's the workflows we used at GCC2016 tutorial:
https://github.com/galaxyproteomics/tools-galaxyp/tree/master/workflows/gcc2016_tutorial

@bgruening
Member

@chambm this was just an example of workflows in git and github with example data and so on :) It's all on the TS as well.

@PratikDJagtap
Member

PratikDJagtap commented Nov 3, 2016

Hello Matt,

I will work on workflow components from the workflow that we generated for GCC2016 (mentioned by JJ below). JJ, we can also look at workflows from ABRF2016.

https://github.com/galaxyproteomics/tools-galaxyp/tree/master/workflows/gcc2016_tutorial

We will need a few hours to come up with a hybrid workflow after discussion with the Galaxy-P team. If required, it would also be a good idea to have a telephone / Google Hangouts session.

Regards,
Pratik

@chambm
Contributor Author

chambm commented Nov 3, 2016

This is a well-written tutorial! https://netfiles.umn.edu/users/pjagtap/ABRF%202016/ABRF_2016_SW4_Galaxy_for_Multi-Omics.pdf

I look forward to hearing your ideas for a hybrid. I worry a bit when I look at the GCC workflow's complexity. Galaxy's job failure handling just isn't robust enough (yet) to handle failures in such a complex workflow. It was a pain to rerun subsets of the Omicron workflow when one step failed halfway through the workflow, and that workflow is about a tenth of the size.

@PratikDJagtap
Member

Hello Matt,

Yes - we will need to run these as sub-workflows, which is what we did for the workshop and what we suggest users do when running it for their own projects.

As you might be aware, Galaxy also offers the ability to rerun the subsequent steps after a workflow failure (once the failed job issue is taken care of), so the user need not start from the beginning.

Looking forward to a hybrid Omicron-GalaxyP workflow!

Regards,

Pratik


@chambm
Contributor Author

chambm commented Nov 3, 2016

I should clarify that it was failures in a dataset collection that caused the problem. Jobs on single input files could easily be rerun, but jobs where only a few files from a collection failed could not. Yet using collections reduces the history size by a large factor, e.g. 10-25x fewer history items.

@PratikDJagtap
Member

PratikDJagtap commented Nov 3, 2016

Interesting - dataset collections have worked well in our hands. It will be good to exchange notes as we proceed.

Pratik

@chambm
Contributor Author

chambm commented Nov 3, 2016

Ping @tjgriff1

@bgruening
Member

Galaxy's job failure handling just isn't robust enough (yet)
There is a fix in the latest dev branch that makes this more robust, if I recall correctly.

@chambm
Contributor Author

chambm commented Nov 7, 2016

Pratik and Tim: the Omicron version of CustomProDB and PSM2SAM depend on a few Galaxy data managers to download the genome FASTA, index it with Bowtie, and download gene annotations from UCSC Table Browser (done inside the CustomProDB R script).

Do you want to keep this architecture or look at some alternate implementation? Currently, the user needs to run these data manager "workflows" first and refresh the local data tables manually, because Galaxy doesn't update them automatically yet. I can see either moving all this reference data into the user's history so it can be a proper part of the workflow, or doing these steps in some kind of initialization step.

The Omicron approach uses Docker to get a custom Galaxy flavor running instantly, but it can't include the reference data without making the Docker image gigantic. So I let the user download this reference data from within Galaxy using the data managers, rather than with an initialization script that runs when the Docker container starts. With the script alternative, it would take hours before the Galaxy flavor is running, but once it does, it's immediately ready for the workflow.

@PratikDJagtap
Member

PratikDJagtap commented Nov 7, 2016

Hello Matt,

I am copying @getiria-onsongo, @jj-umn, @tmcgowan on this so that they can
comment on which of the approaches would work best - or if any alternative
would help to make this process more user-friendly.

Regards,
Pratik


@tjgriff1

tjgriff1 commented Nov 8, 2016

Hi - here are my thoughts. When using the Docker implementation, it seems reasonable to continue doing this the way you have in the past: having the user download the needed reference data after the instance is up and running. That keeps the Docker image more agile. As long as we document the process users need to follow to download the reference data and get it processed correctly for DB generation, it should be an OK way to approach this.

I'm also envisioning that not all users will depend on the Docker image to gain access to these workflows. Some users may already have their own Galaxy instance running, or be using cloud infrastructure where instances are already in place (e.g. Jetstream). For these users, we can share the workflows they need to make the pipeline run. These workflows can include the steps needed to create the DB via CustomProDB.

In both cases it will be important to have the workflow documented so users know what they need to do to make it work.

Make sense?

  • TG


@PratikDJagtap
Member

PratikDJagtap commented Nov 8, 2016

My preference would be for the tool to have the option to use prebuilt indices and Shared Data Libraries for reference data.

That would be the model that makes the most sense for institutions such as galaxy main or MSI that run persistent Galaxy instances.

Retaining the option to build these on the fly makes sense when one is using the docker model and only using galaxy as a workflow engine, rather than as a collaboration and data sharing environment.

JJ
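For reference, the usual way a Galaxy tool offers both options is a conditional in the tool XML, along these lines (adapted from the pattern used by common wrappers such as bowtie2; parameter names here are illustrative):

```xml
<conditional name="reference_genome">
  <param name="source" type="select"
         label="Use a built-in genome index or a genome from your history?">
    <option value="indexed">Use a built-in genome index</option>
    <option value="history">Use a genome from the history</option>
  </param>
  <when value="indexed">
    <!-- Prebuilt indices registered via data managers / .loc files -->
    <param name="index" type="select" label="Select reference genome">
      <options from_data_table="bowtie2_indexes"/>
    </param>
  </when>
  <when value="history">
    <!-- Built on the fly from a FASTA dataset in the history -->
    <param name="own_file" type="data" format="fasta"
           label="Select the reference genome"/>
  </when>
</conditional>
```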

@chambm
Contributor Author

chambm commented Nov 8, 2016

Does Galaxy have the necessary data types for indices to be able to share them in data libraries?

@jj-umn
Member

jj-umn commented Nov 8, 2016

No.
We use data managers for the index files (fai, 2bit, bowtie, bwa, hisat, etc.), which get added to the .loc files. We symlink the annotation files (GTF, VCF, etc.) for those references in as Shared Data Libraries. These are admin tasks.
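For context, a data manager appends tab-separated lines like the following to a .loc file such as bowtie2_indices.loc (the path below is illustrative):

```
#<unique_build_id>	<dbkey>	<display_name>	<file_base_path>
hg19	hg19	Human (hg19)	/galaxy/tool-data/hg19/bowtie2_index/hg19
```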

@tjgriff1

tjgriff1 commented Nov 8, 2016

Good point JJ - for the persistent instances it would be really good to have the necessary files in the shared data library. That's much easier on the user than downloading and/or uploading large files.

  • TG


@chambm
Contributor Author

chambm commented Nov 18, 2016

How coupled is SearchGUI/PeptideShaker to the MGF format? If we could keep things as mzML or mz5, preserving nativeID would be a big advantage for tracing downstream IDs back to their source spectra.
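To illustrate what's lost: an mzML spectrum carries a structured nativeID (a Thermo-style example below), which MGF conversion typically collapses into a free-text TITLE. A minimal sketch of recovering the scan number from such a nativeID (the helper name is mine):

```python
import re

def scan_from_native_id(native_id: str) -> int:
    """Extract the scan number from a Thermo-style mzML nativeID."""
    m = re.search(r"\bscan=(\d+)\b", native_id)
    if m is None:
        raise ValueError(f"no scan number in nativeID: {native_id!r}")
    return int(m.group(1))

# Thermo raw files yield nativeIDs of this shape in mzML; once flattened
# to an MGF TITLE, this structure has to be parsed back out heuristically.
print(scan_from_native_id("controllerType=0 controllerNumber=1 scan=2234"))  # 2234
```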

@PratikDJagtap
Member

PratikDJagtap commented Nov 18, 2016

Hello Matt,

Currently MGF is the only input format that SearchGUI seems to support. We can open a GitHub issue on the SearchGUI and PeptideShaker repositories for this.

Regards,
Pratik

@chambm
Contributor Author

chambm commented Dec 13, 2016

@jj-umn In the GCC2016 workflow, there is a Select step filtering for ^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$ - This appears to be related to the Mouse pre-pro-B and pro-B FASTQ files, but the proteomic data I have is named like Mo_Tai_iTRAQ_f5. It looks like it's trying to filter on the protein column in the Peptide Shaker output, but my protein column looks like:

1 0, 1.503, 1.5436, 49.5419, 5.2189, 5.2784, 5.2833 GLLLYGPPGTGK
2 731.5188 SIYYITGESK

So I guess it has something to do with the input FASTA. Is it supposed to select only the non-reference accessions for each PSM?
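Reading that regex literally, it keeps a tab-separated row only when every comma-separated accession in the second (protein) column matches pr.B..., which would indeed select PSMs mapping exclusively to such accessions. A quick sketch with invented accession names:

```python
import re

# Select-step regex from the GCC2016 workflow: a row survives only if
# EVERY comma-separated accession in column 2 matches pr.B...
SELECT_RE = re.compile(r"^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$")

rows = [
    "1\tpr.B_var1\tGLLLYGPPGTGK",                    # variant-only: kept
    "2\tpr.B_var1, pr.B_var2\tSIYYITGESK",           # variant-only: kept
    "3\tsp|P12345|ALBU_HUMAN\tPEPTIDEK",             # reference-only: dropped
    "4\tpr.B_var1, sp|P12345|ALBU_HUMAN\tPEPTIDEK",  # mixed: dropped
]

kept = [row for row in rows if SELECT_RE.match(row)]
print([row.split("\t")[0] for row in kept])  # ['1', '2']
```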

@PratikDJagtap
Member

PratikDJagtap commented Dec 13, 2016 via email

@chambm
Contributor Author

chambm commented Dec 13, 2016

@jj-umn Is the regex above for selecting only PSMs that map to ONLY non-reference protein sequences?

@chambm chambm closed this as completed Apr 28, 2017