Implement portable metadata generation. #7470

Merged
merged 5 commits into galaxyproject:dev from portable_metadata on Mar 11, 2019

Conversation

jmchilton
Member

This PR starts by implementing a new metadata generation interface (in both the narrow and the broad sense of the word). This variant of external metadata generation doesn't use command-line arguments or absolute paths; it simply operates on the current job working directory and knows what to do.

  • Uses relative paths, allowing the directory to be staged elsewhere via Pulsar, Condor, or Docker, for instance.
  • Avoids DB objects and hard-coded paths, making everything more robust.
  • Avoids huge command lines, which for large collection reduction jobs might hit system limits on command-line length.
  • Uses JSON dicts to communicate parameters, meaning we can avoid awkward argument parsing as this evolves (see the old max_metadata_value_size handling in set_metadata.py, for instance; a sketch follows this list).
  • Uses output names instead of database IDs to key files, so file naming is consistent and human readable.
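
To make the JSON-dict idea concrete, here is a minimal sketch of what a per-job parameter file could look like (the file name and keys are illustrative, not the PR's actual schema):

import json

# Everything the remote metadata script needs, keyed by output name, with
# job-directory-relative paths - no DB ids, no absolute paths, and nothing
# passed on the command line.
params = {
    "max_metadata_value_size": 5242880,
    "outputs": {
        "output1": {"filename": "outputs/output1.dat", "ext": "tabular"},
        "output2": {"filename": "outputs/output2.dat", "ext": "fastqsanger"},
    },
}

with open("metadata_params.json", "w") as f:
    json.dump(params, f, indent=2)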

This builds on the new abstraction around metadata generation introduced in #7158 to create a new implementation of MetadataCollectionStrategy called PortableDirectoryMetadataGenerator. This also builds on unit testing introduced in #7459.
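
As a rough sketch of that abstraction (only the two class names come from the PR; the method shown here is hypothetical):

class MetadataCollectionStrategy(object):
    # One interface, multiple ways of preparing external metadata generation.
    def setup_external_metadata(self, datasets, job_working_directory, **kwds):
        raise NotImplementedError()


class PortableDirectoryMetadataGenerator(MetadataCollectionStrategy):
    def setup_external_metadata(self, datasets, job_working_directory, **kwds):
        # Write JSON parameter files into the job working directory, keyed by
        # output name and containing only relative paths, so the directory can
        # be staged to Pulsar, Condor, or Docker unchanged.
        pass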

The second commit in this PR improves the correctness of metadata generation in the context of tool-provided metadata by ensuring set_metadata.py uses the ToolProvidedMetadata abstraction introduced in #4437 and enhanced for this context in #7156. There were (and still are) things in set_metadata.py that assume the old-style galaxy.json, but using this abstraction sets the stage for proper external metadata collection of newer-style galaxy.json files.
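
For context, galaxy.json is a file of one JSON object per line through which tools hand metadata back to Galaxy. A sketch of consuming it behind a small abstraction (the class, method, and key names here are illustrative, not the real ToolProvidedMetadata API):

import json

class ToolProvidedMetadataSketch(object):
    # Parse galaxy.json once and answer per-output queries, so set_metadata.py
    # no longer needs to care whether the file is old style or new style.
    def __init__(self, path):
        self._by_output = {}
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                self._by_output[entry.get("name")] = entry

    def get_meta(self, output_name):
        return self._by_output.get(output_name, {})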

This PR also brings in a fix for the legacy ToolProvidedMetadata implementation, covering behavior previously defined only in set_metadata.py.

All of this work comes from #7058 (tracked on board https://github.com/galaxyproject/galaxy/projects/11). In that context, this new metadata approach is extended even further to handle job finish tasks, including object store interactions, as part of the Galaxy job script, setting the stage for data outputs that are truly remote from Galaxy.

@jmchilton
Member Author

Indeed - with some more debugging I got:

ERROR:galaxy_ext.metadata.set_metadata:Problem sniffing datatype.
Traceback (most recent call last):
  File "/Users/john/workspace/galaxy/lib/galaxy_ext/metadata/set_metadata.py", line 48, in set_meta_with_tool_provided
    extension = sniff.handle_uploaded_dataset_file(dataset_instance.dataset.external_filename, datatypes_registry)
  File "/Users/john/workspace/galaxy/lib/galaxy/datatypes/sniff.py", line 720, in handle_uploaded_dataset_file
    return handle_uploaded_dataset_file_internal(*args, **kwds)[0]
  File "/Users/john/workspace/galaxy/lib/galaxy/datatypes/sniff.py", line 757, in handle_uploaded_dataset_file_internal
    guessed_ext = guess_ext(converted_path, sniff_order=datatypes_registry.sniff_order, is_binary=is_binary)
AttributeError: 'NoneType' object has no attribute 'sniff_order'

Looks like that isn't getting initialized properly anymore.
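
The chain makes the cause visible: guess_ext() dereferences datatypes_registry.sniff_order, so a registry that was never set up only fails two frames down. A fail-fast guard might look like this sketch (not necessarily the actual fix):

def require_datatypes_registry(datatypes_registry):
    # Surface the real problem at the call site instead of an AttributeError
    # deep inside the sniffer.
    if datatypes_registry is None:
        raise Exception("datatypes_registry must be initialized before sniffing")
    return datatypes_registry

# Hypothetical usage inside set_meta_with_tool_provided():
# extension = sniff.handle_uploaded_dataset_file(
#     dataset_instance.dataset.external_filename,
#     require_datatypes_registry(datatypes_registry))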

@jmchilton
Member Author

Rebased with a fix for that error.

@@ -217,20 +287,17 @@ def __get_filename_override():
     # return command required to build
     fd, fp = tempfile.mkstemp(suffix='.py', dir=tmp_dir, prefix="set_metadata_")
     metadata_script_file = abspath(fp)
-    os.fdopen(fd, 'w').write('from galaxy_ext.metadata.set_metadata import set_metadata; set_metadata()')
+    os.fdopen(fd, 'w').write(SET_METADATA_SCRIPT)
Member

I think we'll want to close the descriptor here.

Member Author

Added to tracking branch (c4da5ab).
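
For illustration, the descriptor-closing version of the hunk above could read as follows (a sketch, not necessarily the exact change in c4da5ab):

fd, fp = tempfile.mkstemp(suffix='.py', dir=tmp_dir, prefix="set_metadata_")
metadata_script_file = abspath(fp)
# Wrap the OS-level descriptor returned by mkstemp(); the 'with' block closes
# it even if write() raises.
with os.fdopen(fd, 'w') as f:
    f.write(SET_METADATA_SCRIPT)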

# Same block as below...
set_meta_kwds = stringify_dictionary_keys(json.load(open(filename_kwds)))  # load kwds; need to ensure our keywords are not unicode
try:
    dataset = cPickle.load(open(filename_in, 'rb'))  # load DatasetInstance
Member

Unrelated comment: I think pickling might bite us if we really start doing portable metadata generation (plus it's a performance hit, I think). We need very little from the dataset instance; it just needs to be compatible with the set_meta methods in the datatype framework, right? We could mock this out, which would also allow us to do metadata setting without the model / with a very minimal model.

Member Author

Once my history import/export rewrite branch, which introduces the concept of a JSON model store, merges with my remote metadata branch, we will be using JSON, not pickles, for remote metadata.

Member Author

I guess this is a slightly different context though - still, I agree it would be good to do this here. I'll try to keep this in mind.
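
For the record, the JSON direction sketched above could shrink the pickle down to something like this (purely illustrative, not the model store's actual schema):

import json

# The minimal, JSON-serializable slice of a DatasetInstance that set_meta()
# implementations actually consume - no ORM object, no pickle.
dataset_description = {
    "output_name": "output1",            # keyed by name, not database id
    "ext": "tabular",
    "file_name": "outputs/output1.dat",  # relative to the job directory
    "metadata": {},                      # filled in by set_meta()
}

with open("dataset_output1.json", "w") as f:
    json.dump(dataset_description, f)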

@mvdbeek
Member

mvdbeek commented Mar 7, 2019

Hmm, there's still a relevant error in https://jenkins.galaxyproject.org/job/docker-api/14864/testReport/junit/test.api.test_histories/HistoriesApiTestCase/test_import_metadata_regeneration/, it looks like:

galaxy.jobs ERROR 2019-03-07 19:58:51,543 Unable to cleanup job 35
Traceback (most recent call last):
  File "/galaxy/lib/galaxy/jobs/__init__.py", line 1523, in cleanup
    galaxy.tools.imp_exp.JobImportHistoryArchiveWrapper(self.app, self.job_id).cleanup_after_job()
  File "/galaxy/lib/galaxy/tools/imp_exp/__init__.py", line 159, in cleanup_after_job
    hda, new_history, jiha.job
  File "/galaxy/lib/galaxy/tools/__init__.py", line 2266, in regenerate_imported_metadata_if_needed
    history.id, job.user, incoming={'input1': hda}, overwrite=False
  File "/galaxy/lib/galaxy/tools/actions/metadata.py", line 93, in execute_via_app
    kwds={'overwrite': overwrite})
  File "/galaxy/lib/galaxy/metadata/__init__.py", line 131, in setup_external_metadata
    "job_metadata": job_relative_path(job_metadata),
  File "/galaxy/lib/galaxy/metadata/__init__.py", line 109, in job_relative_path
    path_relative = os.path.relpath(path, tmp_dir)
  File "/galaxy_venv/lib/python2.7/posixpath.py", line 428, in relpath
    raise ValueError("no path specified")
ValueError: no path specified
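
So job_metadata reaches job_relative_path() empty, and os.path.relpath() refuses an empty path. A defensive sketch (the real fix may instead ensure job_metadata is always set; the signature here is illustrative):

import os

def job_relative_path(path, tmp_dir):
    # os.path.relpath('') raises ValueError("no path specified"); tolerate a
    # missing path explicitly in the history-import code path.
    if not path:
        return None
    return os.path.relpath(path, tmp_dir)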

@jmchilton jmchilton merged commit 1ca256a into galaxyproject:dev Mar 11, 2019
@nsoranzo nsoranzo deleted the portable_metadata branch March 11, 2019 14:12