Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[22.01] Limit required element validation to safe elements #13299

Conversation

mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Feb 3, 2022

There used to be a long-standing bug in setting up the MetadataValidator if default values were used. We fixed this in
e194ef9 (#13139), but that means we're now checking for all non-optional values before running a tool. We have a ton of non-optional MetadatElement items in datatypes that should maybe be optional (an indication might be if default and no_value are specified and set to the same value ... but I'm not sure that's a 100% thing). So I think that reviewing this requires domain knowledge of the datatypes and what elements are really required, and I'm not sure we can do this in a timely fashion, and not break something that used to work.

So my suggestion is that we add check_required_metadata=True on datatypes for which we have checked that non-optional metadata elements are really non-optional. For those metadata elements for which this is not the case we skip the validation as we would do prior to e194ef9 (#13139).

As an example I have marked RDS and RData with check_required_metadata and added a test for check_required_metadata.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

There used to be a long-standing bug in setting up the MetadataValidator
if default values were used. We fixed this in
galaxyproject@e194ef9,
but that means we're now checking for all non-optional values before
running a tool. We have a ton of non-optional MetadatElement items
in datatypes that should maybe be optional (an indication might be if
`default` and `no_value` are specified and set to the same value ... but
I'm not sure that's a 100% thing). So I think that reviewing this
requires domain knowledge of the datatypes and what elements are really required,
and I'm not sure we can do this in a timely fashion, and not break
something that used to work.

So my suggestion is that we add `check_required_metadata=True`
on datatypes for which we have checked that non-optional metadata
elements are really non-optional. For those metadata elements
for which this is not the case we skip the validation as we would
do prior to
galaxyproject@e194ef9.

As an example I have marked  RDS and RData with check_required_metadata
and added a test for check_required_metadata.
@mvdbeek mvdbeek force-pushed the disable_required_metadata_filter_for_now branch from 9418379 to 370d343 Compare February 3, 2022 11:58
@bernt-matthias
Copy link
Contributor

bernt-matthias commented Feb 3, 2022

Given the problems with getting an understanding what default and no_value are actually doing (#13296) this seems to be the direction of least resistance and will certainly work.

But I'm afraid that this will cement the current state of the art, as long as we do not have a plan how to achieve a good state. This should be that all required metadata is checked again.

Maybe we can add a deprecation date to it (e.g. 1 or 2 releases)?

Might it be an alternative to check the dozen cases identified here: galaxyproject/tools-iuc#4367 (comment) (but we might need a better grep because the definition of some metadataelements is broken over multiple lines)

I guess it might be save to set the metadataelements to optional in case that they are not touched by set_meta...

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 3, 2022

as long as we do not have a plan how to achieve a good state.

Do we really need a plan ? This feature has been broken ... maybe from the start ? We didn't have any bug reports specifically about this. It feels like we're potentially shooting ourselves in the foot for little gain. If there's a good usecase that would require all datatypes to handle optionality correctly I would maybe revise my position, but I don't see it now.

Mind you, for some specific datatypes this would makes sense, like in the BAM datatype we could say that the index is required and if it's not there you can't run tools on the dataset. But we handle this now by saying that setting metadata failed ... so, even for this case (which this PR gives us the option to do) the gain is small.

Might it be an alternative to check the dozen cases identified here

I think you're overestimating the coverage of datatypes that the IUC gives us, and as you correctly identified, we can never test datatypes that are display_in_upload="false" at this point. There's a lot of datatypes that were contributed by others for which we don't have IUC tools, and I think we don't want to break usage of a single datatype within Galaxy.

@bernt-matthias
Copy link
Contributor

You are probably right, but check_required_metadata just feels like a redundant optional.

Anyway, if we do it I would argue pro documentation.

I think you're overestimating the coverage of datatypes that the IUC gives us

Note, that the list is not based on the results on the IUC tests, but a simple grep in the sources. And I'm optimistic that those coincide with the failing tool tests. But I just don't have the time to look in detail.

we can never test datatypes that are display_in_upload="false" at this point.

This is not entirely true unless we merge #12073 .. at the moment we can test anything in planemo tool tests .. as far as I got it Galaxy does not check the datatype of data parameters :) But I guess metadata will be completely screwed up in these cases. So probably irrelevant.

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 3, 2022

This is from the IUC tests against this branch, where I just tried to make the most obvious case of a required metadata element work, and yet I guess wrong:
Screenshot 2022-02-03 at 16 43 18

I think this is going to be extremely difficult to get right.

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 3, 2022

You are probably right, but check_required_metadata just feels like a redundant optional.

Of course it is redundant, but optional can't be trusted, that's the point of the PR.

@bernt-matthias
Copy link
Contributor

So in plotdexseq the tool suggests rdata as input but its actually rds .. so its a tool error which we would never find without this .. but I have to admit that we (I) just recently introduced RDS. So its not really the tools fault.

Maybe this is an argument for your change .. because the old tool will stop working once we got strict metadata checking.

Or even better: we could hide strict metadata checking behind a profile version?

defintion.

Fixes
```
2022-02-03 12:40:03,491 ERROR [galaxy.web.framework.decorators] Uncaught exception in exposed API method:
Traceback (most recent call last):
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/web/framework/decorators.py", line 320, in decorator
    rval = func(self, trans, *args, **kwargs)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/webapps/galaxy/api/tools.py", line 563, in create
    return self._create(trans, payload, **kwd)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/webapps/galaxy/api/tools.py", line 631, in _create
    vars = tool.handle_input(trans, incoming, history=target_history, use_cached_job=use_cached_job, input_format=input_format)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/__init__.py", line 1767, in handle_input
    all_params, all_errors, rerun_remap_job_id, collection_info = self.expand_incoming(trans=trans, incoming=incoming, request_context=request_context, input_format=input_format)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/__init__.py", line 1747, in expand_incoming
    populate_state(request_context, self.inputs, expanded_incoming, params, errors, simple_errors=False, input_format=input_format)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/parameters/__init__.py", line 348, in populate_state
    _populate_state_legacy(request_context, inputs, incoming, state, errors=errors, context=context, check=check, simple_errors=simple_errors)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/parameters/__init__.py", line 408, in _populate_state_legacy
    state[input.name] = input.get_initial_value(request_context, context)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/parameters/basic.py", line 1001, in get_initial_value
    options = list(self.get_options(trans, other_values))
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/parameters/basic.py", line 890, in get_options
    return self.options.get_options(trans, other_values)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/parameters/dynamic_options.py", line 763, in get_options
    rval = filter.filter_options(rval, trans, other_values)
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/tools/parameters/dynamic_options.py", line 207, in filter_options
    if not r.metadata.element_is_set(self.key):
  File "/tmp/tmp778b108w/galaxy-dev/lib/galaxy/model/metadata.py", line 175, in element_is_set
    meta_spec = self.parent.metadata.spec[name]
KeyError: 'groups'
```

which happens when forcing a tabular dataset into a more specialized
input https://github.com/galaxyproject/tools-iuc/blob/e22821f51ae326b3696a9456cbc44f9982512a52/tools/mothur/homova.xml#L48

Arguably this is a tool issue, but we already log a KeyError above and
it seems more of a linting issue than something that should cause a test
failure.
@mvdbeek mvdbeek force-pushed the disable_required_metadata_filter_for_now branch from e5c9ec1 to f7eaa0a Compare February 3, 2022 17:31
@@ -243,7 +243,10 @@ def missing_meta(self, dataset, check=None, skip=None):
continue
if not check and len(skip) == 0 and dataset.metadata.spec[key].get("optional"):
continue # we skip check for optional and nonrequested values here
if not dataset.metadata.element_is_set(key):
if not dataset.metadata.element_is_set(key) and (check or dataset.metadata.spec[key].check_required_metadata):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is to keep the strict check, but add a profile check. So metadata will be checked only for tools with profile >= 22.01. Then wenn do not need the additional variable check_required_metadata and old tools will still run.

In #13296 we can then take our time and document it the way we want to have this from 22.01.

Or we just go for the check and doc for 22.05...

Clearly adding a profile seems quite an effort to add to the approx 5 places where the function is called. But in my opinion this is the better way forward.

Rdata always has a version, but the tool uses the wrong datatype. So making it optional seems wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then that breaks runs on existing data.

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 4, 2022

Let's move ahead here so we can deploy on .org and start the release testing.

@mvdbeek mvdbeek merged commit a66a8b6 into galaxyproject:release_22.01 Feb 4, 2022
@mvdbeek mvdbeek modified the milestones: 22.05, 22.01 Feb 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants