Implement extended job metadata collection #8930
Conversation
Force-pushed from 4b59232 to 904ac4d
I think output discovery and data moving as part of the job outweighs all the associated overhead by far. This should be the default and also likely fixes #1854.
I guess I'll pull this out of WIP since all the tests pass and this seems to be a fairly atomic unit in its current form. The 3 commented-out test cases in test/integration/test_extended_metadata.py are legitimate edge cases where this approach fails, but they are pretty esoteric tool features. If this is merged I'll open a WIP PR that uncomments those and tries to track down the problems. This works with and without Pulsar, on a broad range of tools, etc. I think the abstractions are solid and the approach is pretty good; it will be easier to proceed with next steps once this is merged. Next steps would be:
```python
export_store = None
if extended_metadata_collection:
    from galaxy.tool_util.parser.stdio import ToolStdioRegex, ToolStdioExitCode
```
Is there a reason not to import them at the top?
set_metadata gets called for every job. When I profiled tool tests years ago, loading stuff in set_metadata was a significant factor in the general slowness of Galaxy (that was for tool tests; I'm sure it applies even more across, say, Galaxy's API suite). So I've sort of interpreted that as: the fewer imports it does, the better.
Hmm, I'd assume those are imports that do stuff at the module level, like setting up the ORM mapping and things like that. But since that is sprinkled across set_metadata.py anyway, I guess we can keep it that way.
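The deferred-import pattern under discussion can be sketched as follows (a minimal illustration, not Galaxy code; `json` stands in for a heavier module such as `galaxy.tool_util.parser.stdio`):

```python
def set_metadata_like(extended_metadata_collection):
    """Sketch of the deferred-import pattern: a module needed only on one
    branch is imported inside that branch, so the common path in a
    short-lived per-job process never pays its import cost."""
    if extended_metadata_collection:
        # Imported lazily, mirroring the diff above; `json` is a stand-in
        # for a heavier module with module-level side effects.
        import json
        return json.dumps({"extended": True})
    return None
```

The trade-off is exactly the one debated above: deferred imports keep the common path of a frequently spawned process fast, at the cost of scattering imports through the function body.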
This is amazing and much needed. I opened a PR with 2 small fixes here: #8984
In traditional Galaxy job metadata collection, set_meta results in JSON-serialized `hda.metadata` attributes for the declared outputs. With extended metadata, a "model store" containing fully serialized model objects (including, but not limited to, their `metadata` attributes) for all outputs (declared and discovered datasets as well as collections) is produced. In theory, this model store is all the job needs to communicate back to the Galaxy process in order to get it merged with Galaxy's database and the job completed.

This modality is enabled by:

- Storing objects in the object store by `uuid` (setting `object_store_store_by`, implemented in Allow storing objects in the object store by UUID. #7154). This is needed to write outputs from within the job.
- Setting `metadata_strategy` to `extended` (the `metadata_strategy` config option was implemented for this work in Implement portable metadata generation. #7470).

In this variant of metadata collection, set_metadata does significantly more work:

- Output discovery (`<discover_datasets>`, unpopulated collections, etc.) happens as part of the job, and outputs are written to the object store. This leverages dozens of refactorings, including Collected Refactoring of Job Output Handling Code #7471, Make galaxy.model.dataset_collections more usable outside manager context. #7595, and Refactor output_collect.py toward a database session-less modality. #7541.

This approach is likely slower for a single job on a single machine - there is more overhead associated with the plumbing of Galaxy jobs: more "work" happens, more has to be serialized, etc. However, for many jobs over a cluster this massively decentralizes the work that needs to be done by the job handlers. With Pulsar, currently all job completion tasks are still done by the job handler, and outputs need to be streamed back from each worker to the Galaxy server for jobs to be properly completed. The amount of this data is unbounded per job, and the number of jobs is as well, but one can easily imagine hundreds of gigabytes per job and hundreds of jobs per hour for large analyses. Having to transfer this amount of data to the Galaxy server, and then from the Galaxy server to centralized storage, is a tremendous amount of overhead and creates a huge bottleneck on the job handling servers/pods/etc.
Pairing this approach with Pulsar, presumably at most a few megabytes per job need to be transferred back to the Galaxy server, and Galaxy does not need to read or write to the object store to finalize jobs. This largely eliminates job handlers as a bottleneck and eliminates the hugely wasteful double transfer of output files to their final destination. This is necessary for Galaxy to properly exploit distributed file systems and distributed compute.
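The bandwidth argument can be made concrete with a back-of-envelope calculation (the per-job and per-hour figures below are the low end of the "hundreds" mentioned above, and the few-megabyte model store size is an assumption, all chosen only for illustration):

```python
# Illustrative figures: low end of "hundreds of gigs per job" and
# "hundreds of jobs per hour" from the discussion above.
gb_per_job = 100
jobs_per_hour = 100

# Traditional path: outputs stream worker -> Galaxy server, then
# Galaxy server -> centralized storage (a double transfer).
handler_gb_per_hour = gb_per_job * jobs_per_hour * 2

# Extended metadata path: the job writes outputs directly to the object
# store; only a small model store (assume ~5 MB) returns per job.
metadata_gb_per_hour = jobs_per_hour * 5 / 1024

print(handler_gb_per_hour)              # 20000
print(round(metadata_gb_per_hour, 2))   # 0.49
```

Even with these conservative numbers, the handler goes from moving tens of terabytes per hour to under a gigabyte, which is the decentralization the description is after.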