Implement extended job metadata collection #8930

Merged Nov 13, 2019 (2 commits)

Conversation

jmchilton (Member) commented Nov 4, 2019:

In traditional Galaxy job metadata collection, set_meta produces JSON-serialized hda.metadata attributes for the declared outputs only. With extended metadata collection, the job instead produces a "model store" containing fully serialized model objects (including but not limited to their metadata attributes) for all outputs - declared and discovered datasets as well as collections. In theory, this model store is everything the job needs to communicate back to the Galaxy process in order to get its results merged into Galaxy's database and the job marked complete.
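To make the contrast concrete, here is a sketch of the difference between the two payloads a job might ship back to Galaxy. The field names and structure are purely illustrative assumptions, not Galaxy's actual serialization format:

```python
import json

# Traditional metadata collection: just JSON-serialized metadata
# attributes for each *declared* output (illustrative shape).
traditional_payload = {
    "output1": {"metadata": {"dbkey": "hg38", "data_lines": 1000}},
}

# Extended metadata collection: a "model store" with full serialized
# model objects for all outputs, declared and discovered alike
# (again, a hypothetical shape for illustration only).
extended_payload = {
    "datasets": [
        {
            "file_name": "output1.bed",
            "state": "ok",
            "metadata": {"dbkey": "hg38", "data_lines": 1000},
        },
        {
            # a discovered (undeclared) dataset is captured too
            "file_name": "discovered_0.bed",
            "state": "ok",
            "metadata": {"dbkey": "hg38", "data_lines": 42},
        },
    ],
    "collections": [],  # discovered collections would be serialized here
}

print(json.dumps(extended_payload, indent=2))
```

The key difference is scope: the traditional payload covers declared outputs' metadata attributes only, while the model store covers every output object in full.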

This modality is enabled by:

In this variant of metadata collection - set_metadata does significantly more work:

This approach is likely slower for a single job on a single machine - there is more overhead associated with the plumbing of Galaxy jobs: more "work" happens, more has to be serialized, and so on. However, for many jobs across a cluster this massively decentralizes the work that previously had to be done by the job handlers. With Pulsar today, all job completion tasks are still done by the job handler, and outputs need to be streamed back from each worker to the Galaxy server for jobs to complete properly. The amount of this data is unbounded per job, and so is the number of jobs - one can easily imagine hundreds of gigs per job and hundreds of jobs per hour for large analyses. Having to transfer that much data to the Galaxy server, and then from the Galaxy server to centralized storage, is a tremendous amount of overhead and creates a huge bottleneck on the job handling servers/pods/etc.

Pairing this approach with Pulsar, presumably at most a few megabytes per job need to be transferred back to the Galaxy server, and Galaxy does not need to read from or write to the object store to finalize jobs. This largely eliminates job handlers as a bottleneck and eliminates that hugely wasteful double transfer of output files to their final destination. This is necessary for Galaxy to properly exploit distributed file systems and distributed compute.
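A back-of-envelope comparison, using the rough numbers from the text, shows the scale of the savings. All figures here are illustrative assumptions, not measurements:

```python
# How many bytes pass through the Galaxy job handler per hour under
# each scheme? Numbers are assumptions based on the text above.
GB = 1024 ** 3
MB = 1024 ** 2

jobs_per_hour = 200          # "hundreds of jobs per hour"
output_per_job = 100 * GB    # "hundreds of gigs per job" (low end)

# Classic Pulsar completion: outputs stream worker -> Galaxy server ->
# object store, so each output byte crosses the handler twice.
handler_bytes_classic = jobs_per_hour * output_per_job * 2

# Extended metadata + Pulsar: only the small model store travels back
# ("at most a few megabytes per job"); outputs go directly to
# shared/distributed storage.
handler_bytes_extended = jobs_per_hour * 3 * MB

print(f"classic:  {handler_bytes_classic / GB:,.0f} GiB/hour through the handler")
print(f"extended: {handler_bytes_extended / GB:,.2f} GiB/hour through the handler")
```

Under these assumed numbers the handler moves tens of terabytes per hour in the classic scheme versus well under a gigabyte per hour with extended metadata, which is the bottleneck the PR description is pointing at.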

mvdbeek (Member) commented Nov 7, 2019:

> This approach is likely slower for a single job on a single machine

I think output discovery and data moving as part of the job outweighs all associated overhead by far. This should be the default and also likely fixes #1854

jmchilton changed the title from "[WIP] Implement extended job metadata collection" to "Implement extended job metadata collection" on Nov 7, 2019
jmchilton (Member, Author) commented:
I guess I'll pull this out of WIP since all the tests pass and this seems to be a fairly atomic unit in its current form. The 3 commented-out test cases in test/integration/test_extended_metadata.py are legitimate edge cases where this approach fails, but they are pretty esoteric tool features. If this is merged I'll open a WIP PR that uncomments those and tries to track down the problems. This works with and without Pulsar, on a broad range of tools, etc. I think the abstractions are solid and the approach is sound, and it will be easier to proceed with the next steps once this is merged.

I think the next steps would be:

  • Making job metrics work as part of this abstraction so they don't need to be transferred separately.
  • Automatically restricting what Pulsar transfers in this mode (transfer only this material by default, without special rules).
  • Make object_store_store_by=uuid the default for new Galaxy installs (how to detect?).
  • Make outputs_to_working=True the default for new Galaxy installs (for more robust, secure container handling by default).
  • Try to make a mix of object stores with different store_by attributes work - so we can use this on usegalaxy.* for new jobs.
  • Pulsar test cases in Galaxy using this.
  • Container test cases in Galaxy using this.
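Two of the proposed defaults above are galaxy.yml settings; a fragment enabling them might look roughly like the following. Option names are hedged: `object_store_store_by` appears in galaxy.yml, and `outputs_to_working_directory` is assumed to be the full spelling of the `outputs_to_working` option mentioned above.

```yaml
galaxy:
  # Proposed new default: address datasets by UUID rather than database id.
  object_store_store_by: uuid
  # Proposed new default: write tool outputs to the job working directory
  # first, for more robust and secure container handling.
  outputs_to_working_directory: true
```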


```python
export_store = None
if extended_metadata_collection:
    from galaxy.tool_util.parser.stdio import ToolStdioRegex, ToolStdioExitCode
```
Reviewer (Member):

Is there a reason not to import them from the top?

jmchilton (Member, Author) replied:
set_metadata gets called for every job - when I profiled tool tests years ago, loading modules in set_metadata was a significant factor in the general slowness of Galaxy (that was for tool tests - I'm sure it applies even more so across, say, Galaxy's API test suite). So I've sort of interpreted that as: the fewer imports it does, the better.
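The tradeoff being debated can be sketched in plain Python (illustrative, not Galaxy's actual code): module-level work runs at import time, so a script that runs once per job pays for every top-level import whether or not that code path is taken. Measuring fresh-import cost per module shows where startup time goes:

```python
import importlib
import sys
import time

def import_cost(name):
    """Measure the wall-clock cost of a fresh import of `name`."""
    sys.modules.pop(name, None)  # drop any cached copy to force re-import
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start

# Module-level code (ORM mapping setup, registrations, ...) executes at
# import time, so a per-job script like set_metadata pays this cost on
# every invocation for each module it imports at the top. Profiling
# candidate modules individually shows which imports are worth deferring
# into the branch that actually needs them:
for name in ("json", "decimal"):
    print(f"{name}: {import_cost(name) * 1000:.2f} ms")
```

Deferring an import inside `if extended_metadata_collection:` is exactly this pattern: the legacy path never triggers the module's import-time work.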

Reviewer (Member) replied:
Hmm, I'd assume the costly ones are imports that do work at the module level, like setting up the ORM mapping and things like that. But since that is sprinkled across set_metadata.py anyway, I guess we can keep it this way.

mvdbeek (Member) left a review:

This is amazing and much needed. I opened a PR with 2 small fixes here: #8984

