Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced tool options for dataset discovery #4437

Merged
merged 6 commits into from Aug 18, 2017

Conversation

Projects
None yet
2 participants
@jmchilton
Copy link
Member

commented Aug 16, 2017

Changes

  • Redo the structure of galaxy.json for profile >= 17.09 tools with two new test tools to demonstrate the new structure - for both output dataset metadata and discovered dataset metadata (tool_provided_metadata_5.xml and tool_provided_metadata_6.xml respectively). a73a027
  • Allow newer tools to specify the older style format for galaxy.json can be used - with XSD update and a test tool to demonstrate this (tool_provided_metadata_8.xml). This can be specified by setting the provided_metadata_style on the outputs element for a tool - valid values include legacy and default. 0df8205
  • When using the new galaxy.json structure - allow discovering datasets directly from galaxy.json instead of needing to specify file name patterns - including XSD update and test tools to demonstrate for new primary datasets and for dynamic collection contents (tool_provided_metadata_7.xml and collection_creates_dynamic_nested_from_json.xml respectively). ec37537
  • When using the new galaxy.json structure with collection dataset discovery - allow specification of collections in a nested fashion (the identifier_N flat semantics are harder to grok I think). There is more discussion below and the test tool collection_creates_dynamic_nested_from_json_elements.xml demonstrates the new syntax. 29532f3
  • Add test tool (tool_provided_metadata_4.xml) to demonstrate referencing entries in galaxy.json for datasets by dataset path basename instead of ID - this was existing functionality but not tested. a73a027
  • Add a bit more documentation to test tools. a73a027
  • Improved abstractions for interaction between tools and jobs to enable different tool provided metadata collection depending on tool profile version. a73a027
  • Allow override of the location of the tool provided metadata file and a test tool to demonstrate this (tool_provided_metadata_9.xml) 0df8205

New format for current galaxy.json functions

There are differences between discovering new primary datasets and finding metadata for existing outputs, but broadly speaking the older galaxy.json layout is one line per file, each line is a JSON blob containing the dataset_id to key in on. This stinks for a few reasons - it doesn't allow multiple line metadata, it isn't a real format, it requires exposing the dataset ID to through the tool interface and into the job (the output name is much more natural). The new format addresses these issues - for all newer profile tools this format is always keyed on the output name in the tool definition instead of the ID of the output dataset - additionally galaxy.json is expected to a real JSON file - a dictionary structure keyed on these output names.

For an example from the test tools, this older galaxy.json:

{"type": "dataset", "dataset_id": $out1.dataset.dataset.id, "name": "my dynamic name", "ext": "txt", "info": "my dynamic info", "dbkey": "cust1"}

becomes

{"out1": {
  "name": "my dynamic name",
  "ext": "txt",
  "info": "my dynamic info",
  "dbkey": "cust1"
}}

Which is a lot prettier and eliminates the need to pass around the dynamic dataset ID like that when the static tool output name (out1) is known to the tool other.

For dataset discovery - the following file:

{"type": "new_primary_dataset", "filename": "sample1.report.tsv", "name": "cool name 1", "ext": "txt", "info": "cool 1 info", "dbkey": "hg19"}
{"type": "new_primary_dataset", "filename": "sample2.report.tsv", "name": "cool name 2", "ext": "txt", "info": "cool 2 info", "dbkey": "hg19"}

would become:

{"sample": {
  "datasets": [
    {"filename": "sample1.report.tsv", "name": "cool name 1", "ext": "txt", "info": "cool 1 info", "dbkey": "hg19"},
    {"filename": "sample2.report.tsv", "name": "cool name 2", "ext": "txt", "info": "cool 2 info", "dbkey": "hg19"}
]
}}

This change is less substantial but still a good one I think.

New function for galaxy.json - discovery

Coming back to the example:

{"sample": {
  "datasets": [
    {"filename": "sample1.report.tsv", "name": "cool name 1", "ext": "txt", "info": "cool 1 info", "dbkey": "hg19"},
    {"filename": "sample2.report.tsv", "name": "cool name 2", "ext": "txt", "info": "cool 2 info", "dbkey": "hg19"}
]
}}

Previously, a discover dataset pattern would needed to be able to pick up these files first before the metadata would be used. The following matches this example in the test tools:

<discover_datasets pattern="(?P&lt;designation&gt;.+)\.report\.tsv" visible="true" />

Unfortunately this pattern based matching is non-intuitive and somewhat redundant given that we are listing the file paths in galaxy.json. After this work is merged, the files in galaxy.json can be used directly if the following discover_datasets flag is used:

<discover_datasets from_tool_provided_metadata="true" visible="true" />

Default values for ext, dbkey, etc... can still be specified on the discover_datasets tag when used in this way - just not the pattern or sort_by attributes. The pattern is unused and the sorting is assumed to be that specified in the JSON file.

If from_tool_provided_metadata is specified in conjunction with finding datasets for dynamically discovered collections - a nested variant of the output specified in galaxy.json can be used now as well:

{"list_output": {
    "elements": [
      {"name": "oe1", "elements": [
        {"name": "ie1", "filename": "oe1_ie1.fq"},
        {"name": "ie2", "filename": "oe1_ie2.fq"}
      ]},
      {"name": "oe2", "elements": [
        {"name": "ie1", "filename": "oe2_ie1.fq"},
        {"name": "ie2", "filename": "oe2_ie2.fq"}
      ]},
      {"name": "oe3", "elements": [
        {"name": "ie1", "filename": "oe3_ie1.fq"},
        {"name": "ie2", "filename": "oe3_ie2.fq"}
      ]}
]}}

This last discovered piece I think implements #2658.

@jmchilton

This comment has been minimized.

Copy link
Member Author

commented Aug 17, 2017

I made a bunch of fixes and enhancements to this PR tonight - I've updated the description with these changes.

@mvdbeek
Copy link
Member

left a comment

This looks nice to me!
I have some minor comments and questions inline. There is also a linting problem in https://travis-ci.org/galaxyproject/galaxy/jobs/265406252#L481

# LEGACY: Remove in 17.XX
if not os.path.exists( meta_file ):
# Maybe this is a legacy job, use the job working directory instead
meta_file = os.path.join( job_wrapper.working_directory, self.provided_metadata_file )

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

That's for jobs that were launched before the upgrade?

This comment has been minimized.

Copy link
@jmchilton

jmchilton Aug 17, 2017

Author Member

This is left over code from last year yeah - when we separated the "job directory" from the tool working directory "job_directory/working" I left this conditional in so that Galaxy would be able to find galaxy.json files for tools launched prior to the upgrade but finished after the upgrade.

try:
line = json.loads( line )
assert 'type' in line
except:

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

except Exception: ?

This comment has been minimized.

Copy link
@jmchilton

jmchilton Aug 17, 2017

Author Member

Yeah - this was copied and pasted from existing code but definitely I should fix that. I will do that.

class LegacyToolProvidedMetadata(object):

def __init__(self, job_wrapper, meta_file):
self.job_wrapper = job_wrapper

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

Seems like a great place to put doctests maybe.

This comment has been minimized.

Copy link
@jmchilton

jmchilton Aug 17, 2017

Author Member

There is already unit test coverage of this in test_collect_primary_datasets.py and there are several tool tests covering both tool provided metadata classes. Hopefully that is enough?


@property
def designation( self ):
# If collecting nested collection, grap identifier_0,

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

s/grap/grab/

""" Return name or None if not defined by the discovery pattern.
"""
name = None
if "name" in self.as_dict:

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

This function is equivalent to self.as_dict.get( "name" ), right ?

This comment has been minimized.

Copy link
@jmchilton

jmchilton Aug 17, 2017

Author Member

Yeah - these were copied and pasted from the variant below that was reading regexes. I should have thought that through and made them a bit cleaner.

@property
def dbkey( self ):
try:
return self.as_dict["dbkey"]

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

that could be written as

return self.as_dict.get("dbkey", self.collector.default_dbkey)
@property
def ext( self ):
try:
return self.as_dict["ext"]

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

same as above and below

Return "legacy" for older pre-17.09 default.
"""
return "default"

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Aug 17, 2017

Member

So classes that subclass get this as a default ? I.e is it intentional that XmlToolSource and YamlToolSource behave differently ?

This comment has been minimized.

Copy link
@jmchilton

jmchilton Aug 17, 2017

Author Member

Yes - the Yaml and XML implementations would be slightly different I think - outputs for instance is a list in the YAML version so we couldn't quite do the same thing. Actually I should fix the docstring for that though.

@jmchilton jmchilton force-pushed the jmchilton:galaxy_json_rev branch from 0df8205 to 6c9f46a Aug 17, 2017

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Aug 17, 2017

@jmchilton jmchilton force-pushed the jmchilton:galaxy_json_rev branch from 6c9f46a to 42e067d Aug 17, 2017

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Aug 17, 2017

@jmchilton jmchilton force-pushed the jmchilton:galaxy_json_rev branch from 42e067d to 85fb409 Aug 17, 2017

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Aug 17, 2017

@jmchilton jmchilton force-pushed the jmchilton:galaxy_json_rev branch from 85fb409 to 8b45760 Aug 17, 2017

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Aug 17, 2017

jmchilton added some commits Aug 16, 2017

Overhaul of tool provided job metadata.
- Add another test to demonstrate referencing entries in galaxy.json for datasets by dataset path basename instead of ID.
- Add a bit more documentation to test tools.
- Redo the structure of galaxy.json for profile >= 17.09 tools with two new test tools to demonstrate the new structure - for both output dataset metadata and discovered dataset metadata.
- Improved abstractions for interaction between tools and jobs to enable different tool provided metadata collection depending on tool profile version.
Allow discovering datasets directly from galaxy.json...
without needing to specify a discovered dataset pattern. If the ``discovered_datasets`` tag includes ``from_tool_provided_metadata="true"`` the datsets listed in galaxy.json will just be used directly without needing to be "discovered" using a dataset pattern. Metadata and such can continue to be specified either in that file or on the discovered_dataset element except things like pattern (which is not used) and sort_by (since the json should describe the order I suppose).
Move determining tool provided metadata style to tool parser.
Allow override by specifying "provided_metadata_style=" attribute on outputs in tool XML - this should allow newer profile tools to use the older style JSON providing a clear path for upgrades.
Allow override of provided metadata location.
Eliminates that terrible circular dependency between tools and jobs.

@jmchilton jmchilton force-pushed the jmchilton:galaxy_json_rev branch from 8b45760 to 1e1df67 Aug 17, 2017

@mvdbeek mvdbeek merged commit dd89a9c into galaxyproject:dev Aug 18, 2017

6 checks passed

api test Build finished. 281 tests run, 0 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 161 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 44 tests run, 0 skipped, 0 failed.
Details
lgtm analysis: JavaScript No alert changes
Details
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.
Details

natefoo added a commit to natefoo/galaxy that referenced this pull request Mar 6, 2018

WIP Upload and set_meta changes for a properly JSON galaxy.json befor…
…e (and shortly after) I found galaxyproject#4437. With that PR's changes, the scope is too large for this work now, so I will have to come back to it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.