Implement extended job metadata collection #8930
Conversation
Force-pushed from 4b59232 to 904ac4d
I think output discovery and data moving as part of the job outweighs all the associated overhead by far. This should be the default and also likely fixes #1854.
I guess I'll pull this out of WIP since all the tests pass and this seems to be a fairly atomic unit in its current form. The 3 commented-out test cases in test/integration/test_extended_metadata.py are legitimate edge cases where this approach fails, but they are pretty esoteric tool features. If this is merged I'll open a WIP PR that uncomments those and tries to track down the problems. This works with and without Pulsar, on a broad range of tools, etc. I think the abstractions are solid and the approach is pretty good; it will be easier to proceed with next steps once this is merged. Next steps would be:
```python
export_store = None
if extended_metadata_collection:
    from galaxy.tool_util.parser.stdio import ToolStdioRegex, ToolStdioExitCode
```
Is there a reason not to import them at the top?
set_metadata gets called for every job. When I profiled tool tests years ago, loading stuff in set_metadata was a significant factor in the general slowness of Galaxy (that was for tool tests; I'm sure it applies even more across, say, Galaxy's API suite). So I've sort of interpreted that as: the fewer imports it does, the better.
Hmm, I'd assume those are imports that do stuff at the module level, like setting up the ORM mapping and things like that. But since that is sprinkled across set_metadata.py anyway, I guess we can keep it that way.
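The deferred-import pattern under discussion can be sketched as follows (a minimal illustration, not Galaxy code; `json` stands in for a heavier module such as `galaxy.tool_util.parser.stdio`):

```python
def set_metadata_like(extended_metadata_collection):
    """Sketch of the deferred-import pattern: a module needed only on one
    branch is imported inside that branch, so the common path in a
    short-lived per-job process never pays its import cost."""
    if extended_metadata_collection:
        # Imported lazily, mirroring the diff above; `json` is a stand-in
        # for a heavier module with module-level side effects.
        import json
        return json.dumps({"extended": True})
    return None
```

The trade-off is exactly the one debated above: deferred imports keep the common path of a frequently spawned process fast, at the cost of scattering imports through the function body.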
This is amazing and much needed. I opened a PR with 2 small fixes here: #8984
In traditional Galaxy job metadata collection, set_meta results in JSON-serialized `hda.metadata` attributes for the declared outputs. With extended metadata, a "model store" containing fully serialized model objects (including, but not limited to, their `metadata` attributes) for all outputs (declared and discovered datasets as well as collections) is produced. In theory, this model store is all the job needs to communicate back to the Galaxy process in order to get it merged with Galaxy's database and the job completed.

This modality is enabled by:

- Storing objects in the object store by `uuid` (setting `object_store_store_by`, implemented in Allow storing objects in the object store by UUID. #7154). This is needed to write outputs from within the job.
- Setting `metadata_strategy` to `extended` (the `metadata_strategy` config option was implemented for this work in Implement portable metadata generation. #7470).

In this variant of metadata collection, set_metadata does significantly more work:

- Output discovery (`<discover_datasets>`, unpopulated collections, etc.) happens as part of the job, and outputs are written to the object store. This leverages dozens of refactorings, including Collected Refactoring of Job Output Handling Code #7471, Make galaxy.model.dataset_collections more usable outside manager context. #7595, and Refactor output_collect.py toward a database session-less modality. #7541.

This approach is likely slower for a single job on a single machine - there is more overhead associated with the plumbing of Galaxy jobs: more "work" happens, more has to be serialized, etc. However, for many jobs over a cluster this massively decentralizes the work that needs to be done by the job handlers. With Pulsar, currently all job completion tasks are still done by the job handler, and outputs need to be streamed back from each worker to the Galaxy server for jobs to be properly completed. The amount of this data is unbounded per job, and the number of jobs is as well, but one can easily imagine hundreds of gigabytes per job and hundreds of jobs per hour for large analyses. Having to transfer this amount of data to the Galaxy server, and then from the Galaxy server to centralized storage, is a tremendous amount of overhead and creates a huge bottleneck on the job handling servers/pods/etc.
Pairing this approach with Pulsar, presumably at most a few megabytes per job need to be transferred back to the Galaxy server, and Galaxy does not need to read or write to the object store to finalize jobs. This largely eliminates job handlers as a bottleneck and eliminates the hugely wasteful double transfer of output files to their final destination. This is necessary for Galaxy to properly exploit distributed file systems and distributed compute.
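The bandwidth argument can be made concrete with a back-of-envelope calculation (the per-job and per-hour figures below are the low end of the "hundreds" mentioned above, and the few-megabyte model store size is an assumption, all chosen only for illustration):

```python
# Illustrative figures: low end of "hundreds of gigs per job" and
# "hundreds of jobs per hour" from the discussion above.
gb_per_job = 100
jobs_per_hour = 100

# Traditional path: outputs stream worker -> Galaxy server, then
# Galaxy server -> centralized storage (a double transfer).
handler_gb_per_hour = gb_per_job * jobs_per_hour * 2

# Extended metadata path: the job writes outputs directly to the object
# store; only a small model store (assume ~5 MB) returns per job.
metadata_gb_per_hour = jobs_per_hour * 5 / 1024

print(handler_gb_per_hour)              # 20000
print(round(metadata_gb_per_hour, 2))   # 0.49
```

Even with these conservative numbers, the handler goes from moving tens of terabytes per hour to under a gigabyte, which is the decentralization the description is after.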