Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metaschema predicates pollute salad document namespace #506

Open
mr-c opened this issue Jan 14, 2022 · 8 comments
Open

metaschema predicates pollute salad document namespace #506

mr-c opened this issue Jan 14, 2022 · 8 comments

Comments

@mr-c
Copy link
Member

mr-c commented Jan 14, 2022

Alternative title: CWL enum symbols get fully expanded instead of being left as pure strings

Source issue: common-workflow-language/cwljava#70

For example, CommandInputEnumSchema https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/CommandLineTool.yml#L355 extends InputEnumSchema https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/Process.yml#L689 which extends sld:EnumSchemahttps://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/salad/schema_salad/metaschema/metaschema_base.yml#L135 which defines the symbols field as follows:

      type: string[]
      jsonldPredicate:
        _id: "sld:symbols"
        _type: "@id"
        identity: true
      doc: "Defines the set of valid sy

The _type: "@id" and identity: true might be fine for the metaschema of Schema Salad, but is inappropriate for a CWL Enum definition, where we expect plain strings.

cwltool ignores this, but codegen readers of CWL (cwljava, cwl-utils) cannot.

(all of this is also true for the Output and non-Command versions of EnumSchema in CWL)

I can't find a way to override this with a specializeFrom/To nor with overlapping field definitions, nor with re-rooting the CWL schema to not derive from sld:EnumSchema as the metaschema predicates collide with the CWL schema predicates

schema_salad.exceptions.SchemaException: Predicate collision on symbols, '{'@id': 'https://w3id.org/cwl/salad#symbols', '@type': '@id', 'identity': True}' != 'https://w3id.org/cwl/cwl#InputEnumSchema/symbols'

(even if I use jsonldPredicate: { _id: "sld:symbols" } ):

schema_salad.exceptions.SchemaException: Predicate collision on symbols, '{'@id': 'https://w3id.org/cwl/salad#symbols', '@type': '@id', 'identity': True}' != '{'@id': 'https://w3id.org/cwl/salad#symbols'}'

Fix options:

  1. Hack the schema (fixes the codegen, but introduces big problems down the line): Re-root the CWL schema to not extend sld:EnumSchema and rename symbols from the metaschema to be something else (and update all other schema salad documents to use this new name in their definitions)
  2. Hack the codegen (manually change the type of the symbols field post generation): would unblock Enum types are not parsed correctly cwljava#70 but it not sustainable
  3. Scope the predicates situation so that schema-salad documents themselves can use any predicate name from the metaschema, and then apply fix 1 without renaming the existing use of symbols in the metaschema. I like this, but I don't know how to implement it yet
  4. Do nothing. Consumers of the CWL codegen libraries will be forced to subtract out the ID of the parent object to get the true values of CWL type: enums.
@mr-c
Copy link
Member Author

mr-c commented Jan 14, 2022

Another option that might work: remove https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/Process.yml#L15 and copy in the salad metaschema parts we need, making the one change to the symbols definition

@ZimmerA
Copy link
Contributor

ZimmerA commented Jan 14, 2022

The TypeScript version has the same issue: https://runkit.com/zimmera/61e180e8b299c2000851e088 (output expandable at bottom)

@mr-c
Copy link
Member Author

mr-c commented Jan 14, 2022

@ZimmerA Thanks. Yeah, it is a schema issue; see common-workflow-language/cwl-v1.2#145 for my proposed fix

@tetron
Copy link
Member

tetron commented Jan 14, 2022

Enums are not intended to be pure strings, they're intended to be qualified URIs, but part of the vocabulary so you're able to use the shortened version. It sounds like it is behaving as designed, but the design is surprising.

@mr-c
Copy link
Member Author

mr-c commented Jan 14, 2022

Enums are not intended to be pure strings, they're intended to be qualified URIs, but part of the vocabulary so you're able to use the shortened version. It sounds like it is behaving as designed, but the design is surprising.

I agree for Schema Salad, but not for CWL.

CWL uses enums to create command line interfaces

https://www.commonwl.org/v1.2/CommandLineTool.html#CommandLineBinding doesn't say that we only use the short versions for the command line (but in fact cwltool does so)

So I think in practice the concepts are different.

@mr-c
Copy link
Member Author

mr-c commented Jan 15, 2022

@tetron common-workflow-language/cwltool#1591 shows that there is no cwltool unit test nor CWL conformance test that relies on enum symbols being qualified URIs; therefore I argue that they are factually not so and the schema proposal in common-workflow-language/cwl-v1.2#145 should be accepted for CWL v1.2.1 to confirm that.

@mr-c
Copy link
Member Author

mr-c commented Jan 19, 2022

Related: #132

@tetron
Copy link
Member

tetron commented Jan 19, 2022

Ok, I refreshed my recollection of what cwltool does.

cwltool has been operating under the premise that symbols are expanded to URIs for a long time, e.g.

$ cwltool --print-pre blah.cwl
INFO /home/peter/work/cwltool/venv3/bin/cwltool 3.1.20220103201022
INFO Resolved 'blah.cwl' to 'file:///home/peter/work/tmp/enum/blah.cwl'
{
    "arguments": [
        {
            "prefix": "species",
            "valueFrom": "$(inputs.first.species)"
        },
        {
            "prefix": "ncbi_build",
            "valueFrom": "$(inputs.first.ncbi_build)"
        }
    ],
    "baseCommand": "echo",
    "class": "CommandLineTool",
    "cwlVersion": "v1.2",
    "id": "file:///home/peter/work/tmp/enum/blah.cwl",
    "inputs": [
        {
            "id": "file:///home/peter/work/tmp/enum/blah.cwl#first",
            "type": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params"
        }
    ],
    "outputs": [
        {
            "id": "file:///home/peter/work/tmp/enum/blah.cwl#result",
            "outputBinding": {
                "glob": "8e100d950d274f5732d81f3ac0bf3712b420fa07"
            },
            "type": "File"
        }
    ],
    "requirements": [
        {
            "class": "SchemaDefRequirement",
            "types": [
                {
                    "fields": [
                        {
                            "name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build",
                            "type": [
                                "null",
                                {
                                    "symbols": [
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCh37",
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCh38",
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCm38"
                                    ],
                                    "type": "enum"
                                }
                            ]
                        },
                        {
                            "name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species",
                            "type": [
                                "null",
                                {
                                    "symbols": [
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species/homo_sapiens",
                                        "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species/mus_musculus"
                                    ],
                                    "type": "enum"
                                }
                            ]
                        }
                    ],
                    "name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params",
                    "type": "record"
                }
            ]
        }
    ],
    "stdout": "8e100d950d274f5732d81f3ac0bf3712b420fa07"
}

What happens is that for the purposes of validation, it gets the short name of the URI. How to get the short name is described in the schema salad spec:

https://www.commonwl.org/v1.2/SchemaSalad.html#Short_names

cwl-utils / cwljava should provide a conforming "shortname" function if it doesn't already.

The python_codegen itself also assumes enum symbols are URIs, and calls safe_name (which computes the short name).

                return self.declare_type(
                    TypeDef(
                        self.safe_name(type_declaration["name"]) + "Loader",
                        '_EnumLoader(("{}",))'.format(
                            '", "'.join(
                                self.safe_name(sym)
                                for sym in type_declaration["symbols"]
                            )
                        ),
                    )
                )

From my perspective, the correct thing to do is 4

Do nothing. Consumers of the CWL codegen libraries will be forced to subtract out the ID of the parent object to get the true values of CWL type: enums.

I feel that the proposed solution (making enum symbols unparsed strings) is effectively redefining the semantics of the enum data model which is inappropriate for a 1.2.1 version. It would be better to ensure existing validation and Command Line behavior is properly specified (i.e. derive the short name from the URI and check the string against that) for the purposes of checking the source document.

I agree that https://www.commonwl.org/v1.2/CommandLineTool.html#CommandLineBinding doesn't explicitly describe behavior for enums. However, in practice the current behavior is that it accepts short name as a string, and then the command line behavior is identical as for a string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants