Implement portable metadata generation. #7470
Conversation
Indeed - with some more debugging I got:
Looks like that isn't getting initialized properly anymore.
Force-pushed from 057f60e to b0cacdc.
Rebased with a fix for that error.
lib/galaxy/metadata/__init__.py
@@ -217,20 +287,17 @@ def __get_filename_override():
     # return command required to build
     fd, fp = tempfile.mkstemp(suffix='.py', dir=tmp_dir, prefix="set_metadata_")
     metadata_script_file = abspath(fp)
-    os.fdopen(fd, 'w').write('from galaxy_ext.metadata.set_metadata import set_metadata; set_metadata()')
+    os.fdopen(fd, 'w').write(SET_METADATA_SCRIPT)
I think we'll want to close the descriptor here.
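One way to close the descriptor is to wrap it in a context manager so it is flushed and closed deterministically. A minimal sketch of that fix, assuming the `SET_METADATA_SCRIPT` constant and `tmp_dir` from the diff above (the helper's name is hypothetical):

```python
import os
import tempfile
from os.path import abspath

# Mirrors the constant referenced in the diff above.
SET_METADATA_SCRIPT = "from galaxy_ext.metadata.set_metadata import set_metadata; set_metadata()"


def write_metadata_script(tmp_dir):
    # mkstemp returns a raw OS-level descriptor; wrapping it with
    # os.fdopen in a `with` block guarantees it is closed rather
    # than leaked when the file object goes out of scope.
    fd, fp = tempfile.mkstemp(suffix='.py', dir=tmp_dir, prefix="set_metadata_")
    metadata_script_file = abspath(fp)
    with os.fdopen(fd, 'w') as f:
        f.write(SET_METADATA_SCRIPT)
    return metadata_script_file
```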
Added to tracking branch (c4da5ab).
# Same block as below...
set_meta_kwds = stringify_dictionary_keys(json.load(open(filename_kwds)))  # load kwds; need to ensure our keywords are not unicode
try:
    dataset = cPickle.load(open(filename_in, 'rb'))  # load DatasetInstance
Unrelated comment: I think pickling might bite us if we really start doing portable metadata generation (plus it's a performance hit, I think). We need very little from the dataset instance — it just needs to be compatible with the set_meta methods in the datatype framework, right? We could mock this out, which would also allow us to do metadata setting without the model / with a very minimal model.
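To make the "mock this out" idea concrete, here is a hypothetical sketch of a minimal stand-in object; the class name and attributes are assumptions about what typical datatype `set_meta()` implementations touch, not Galaxy's actual model:

```python
# Hypothetical minimal stand-in for a pickled DatasetInstance.
class MinimalDataset:
    def __init__(self, file_name, extension, metadata=None):
        self.file_name = file_name      # path to the dataset on disk
        self.extension = extension      # datatype extension, e.g. "tabular"
        self.metadata = metadata or {}  # mutable bag for collected metadata


def set_meta_line_count(dataset):
    # Example set_meta-style function that only needs the minimal model:
    # it reads the file and records a line count as metadata.
    with open(dataset.file_name) as f:
        dataset.metadata["data_lines"] = sum(1 for _ in f)
```

A stand-in like this would let metadata setting run in a process with no Galaxy model (or database) on the path at all.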
Once my history import/export rewrite branch that introduces the concept of a JSON model store merges with my remote metadata branch, we will be using JSON, not pickles, for remote metadata.
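The JSON direction might look something like the following sketch: describe the dataset as a plain dict so the remote side needs no Galaxy model classes or pickle compatibility. The helper names and keys here are illustrative assumptions, not the model store's actual schema:

```python
import json


# Illustrative only: serialize a dataset description as plain JSON
# instead of pickling a model object.
def dump_dataset_description(dataset_dict, path):
    with open(path, 'w') as f:
        json.dump(dataset_dict, f)


def load_dataset_description(path):
    # The remote side can reconstruct everything it needs from the dict
    # without importing any Galaxy model code.
    with open(path) as f:
        return json.load(f)
```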
I guess this is a slightly different context though - still here I agree it would be good to do this. I’ll try to keep this in mind.
Hmm, there's still a relevant error in https://jenkins.galaxyproject.org/job/docker-api/14864/testReport/junit/test.api.test_histories/HistoriesApiTestCase/test_import_metadata_regeneration/, it looks like:
Variant of external metadata generation that doesn't use command-line arguments or absolute paths and simply operates on the current job working directory and knows what to do.
- Uses relative paths, allowing the directory to be staged elsewhere via Pulsar, Condor, or Docker, for instance.
- Avoids DB objects for hard-coded paths, making everything more robust.
- Avoids huge command lines, which for huge collection reduction jobs might hit system limits on command-line length.
- Uses JSON dicts to communicate parameters, meaning we can avoid awkward argument parsing as this evolves (see the old `max_metadata_value_size` handling in `set_metadata.py`, for instance).
- Uses output names instead of database IDs to key files, so file naming is consistent and human readable.
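The "JSON dicts to communicate parameters" point can be sketched as follows — the script reads everything it needs from files in the working directory instead of parsing a long command line. The file name `metadata_params.json` is an assumption for illustration, not the actual file the PR uses:

```python
import json
import os


# Illustrative sketch: the portable variant finds its parameters in the
# job working directory rather than on argv, so the directory can be
# staged anywhere and the invocation stays tiny.
def load_metadata_params(working_directory="."):
    params_path = os.path.join(working_directory, "metadata_params.json")
    with open(params_path) as f:
        return json.load(f)
```

Because the invocation carries no per-output arguments, its length no longer grows with the number of outputs in a collection reduction job.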
There are some things that happen in `set_metadata.py` that assume the old-style `galaxy.json`. This sets the stage for proper external metadata collection of newer-style `galaxy.json` files.
Force-pushed from b0cacdc to f0912eb.
This PR starts by implementing a new metadata generation interface (in both the narrow and broad senses of the word). This variant of external metadata generation doesn't use command-line arguments or absolute paths; it simply operates on the current job working directory and knows what to do.

- Uses relative paths, allowing the directory to be staged elsewhere via Pulsar, Condor, or Docker, for instance.
- Avoids DB objects for hard-coded paths, making everything more robust.
- Avoids huge command lines, which for huge collection reduction jobs might hit system limits on command-line length.
- Uses JSON dicts to communicate parameters, meaning we can avoid awkward argument parsing as this evolves (see the old `max_metadata_value_size` handling in `set_metadata.py`, for instance).
- Uses output names instead of database IDs to key files, so file naming is consistent and human readable.

This builds on the new abstraction around metadata generation introduced in #7158 to create a new implementation of `MetadataCollectionStrategy` called `PortableDirectoryMetadataGenerator`. It also builds on the unit testing introduced in #7459.

The second commit in this PR improves the correctness of metadata generation in the context of tool-provided metadata by ensuring `set_metadata.py` uses the `ToolProvidedMetadata` abstraction introduced in #4437 and enhanced for this context in #7156. There were (and still are) things in `set_metadata.py` that assume the old-style `galaxy.json`, but using this abstraction sets the stage for proper external metadata collection of newer-style `galaxy.json` files.

This PR also brings in a fix for the legacy `ToolProvidedMetadata` implementation, for behavior previously defined only in `set_metadata.py`.

All of this work comes from #7058 (tracked on board https://github.com/galaxyproject/galaxy/projects/11). In that context, this new metadata approach is extended even further to deal with job finish tasks, including object store interactions, as part of the Galaxy job script — setting the stage for data outputs that are truly remote from Galaxy.
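For readers unfamiliar with the old-style format: the legacy `galaxy.json` is, by convention, a stream of one JSON object per line describing tool-provided metadata. A minimal reader might look like this sketch — an illustration under that assumption, not the actual `ToolProvidedMetadata` implementation:

```python
import json


def read_legacy_galaxy_json(path):
    # Old-style galaxy.json: one JSON object per non-empty line, each
    # describing a dataset or metadata value emitted by the tool run.
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries
```

Hiding this line-oriented parsing behind the `ToolProvidedMetadata` abstraction is what lets newer-style files be supported without touching the callers.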