[WIP] Export CWL-abstract workflow representation #9407

Draft. Wants to merge 10 commits into base: dev

ieguinoa
Contributor

@ieguinoa ieguinoa commented Feb 20, 2020

This is a first step towards exporting a Galaxy workflow as an RO-Crate object.
Besides the required metadata, the Workflow RO-Crate profile recommends accompanying the native workflow definition with an abstract CWL description.

This PR depends heavily on the implementation of the Abstract Operation in CWL (common-workflow-language/cwl-v1.2#3).
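
As a rough illustration of the idea (not the code in this PR), the sketch below turns a heavily simplified native-style workflow dictionary into an abstract CWL dictionary in which every tool step runs a `class: Operation` placeholder. The input layout (`steps`, `type`, `input_connections`, `outputs`) mirrors what a `.ga` file typically contains, but the helper name `to_abstract_cwl`, the identifier scheme based on labels, and the use of `File`/`Any` as placeholder types are assumptions made for this example.

```python
def to_abstract_cwl(native):
    """Convert a simplified native-style workflow dict to abstract CWL (sketch)."""
    steps = native["steps"]
    cwl = {
        "class": "Workflow",
        "cwlVersion": "v1.2",
        "inputs": {},
        "outputs": {},  # workflow-level outputs are left out of this sketch
        "steps": {},
    }

    def step_name(step_id):
        # Hypothetical naming scheme: prefer the label, fall back to the index.
        # Labels are assumed to already be valid CWL identifiers.
        step = steps[str(step_id)]
        return step.get("label") or "step_{}".format(step_id)

    for step_id, step in sorted(steps.items(), key=lambda kv: int(kv[0])):
        name = step_name(step_id)
        if step["type"] == "data_input":
            # Data inputs become workflow inputs with a placeholder type.
            cwl["inputs"][name] = {"type": "File"}
            continue

        step_in = {}
        for input_name, conn in step.get("input_connections", {}).items():
            source = steps[str(conn["id"])]
            if source["type"] == "data_input":
                step_in[input_name] = step_name(conn["id"])
            else:
                step_in[input_name] = "{}/{}".format(
                    step_name(conn["id"]), conn["output_name"]
                )

        out_names = [o["name"] for o in step.get("outputs", [])]
        cwl["steps"][name] = {
            "in": step_in,
            "out": out_names,
            # The abstract Operation only declares an interface, no command.
            "run": {
                "class": "Operation",
                "inputs": {n: "Any" for n in step_in},
                "outputs": {n: "Any" for n in out_names},
            },
        }
    return cwl


example = {
    "steps": {
        "0": {"type": "data_input", "label": "reads", "input_connections": {}},
        "1": {
            "type": "tool",
            "label": "trim",
            "input_connections": {"input1": {"id": 0, "output_name": "output"}},
            "outputs": [{"name": "out_file1", "type": "fastq"}],
        },
    }
}
abstract = to_abstract_cwl(example)
```

Subworkflows, multiple connections per input, and datatype information are deliberately ignored here; a real export would need to handle all three.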

@jmchilton
Member

I appreciate the effort here @ieguinoa, awesome start. I think I'm going to strongly advocate for us pushing this down a layer into gxformat2 though. There are some pros and cons of doing that - the serious cons include not having Galaxy runtime knowledge of valid tool state and connections. Certain things will require us to ensure that workflow exports carry the information needed to process these formats correctly - that is potentially a lot of extra work, but I'm more than happy to help with that process and improve the workflow exports to capture all the metadata we need.

Those are the cons - the pros, however, are numerous I think. I've opened the start of a PR to do this at that level to highlight some of those advantages. That PR is here: https://github.com/galaxyproject/gxformat2/pull/38/files.

The advantages of that PR over this approach are:

  • We have dozens of test cases - and they run instantly. The test cases are good too - I used cwltool's validation to ensure the resulting conversions are valid CWL documents.
  • The conversion process works on the textual format so it can be repackaged as a command-line utility (script to do this will be included in gxformat2 after that PR is merged) and can be integrated into Planemo.
  • Given that validation, linting and specification of the schema salad format happen in that repo, I think it is the correct place for reasoning about what the converted format should look like. It will be really nice and clean to just take a workflow, write a test case to describe what the result should look like, and then collaborate to ensure that the native (.ga) workflow output from Galaxy has what we need, as does the specification for format2 (http://galaxyproject.github.io/gxformat2/v19_09.html).
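
To give a feel for what a fast, purely textual test of a converted document might check, here is a minimal sketch of a structural check over an abstract CWL dictionary. It is a lightweight stand-in and not a replacement for cwltool's validation; the function name `find_problems` and the assumption that `inputs`, `steps` and each step's `in` are written in CWL's map form are made up for this example.

```python
def find_problems(doc):
    """Return a list of human-readable problems found in an abstract CWL dict."""
    problems = []
    if doc.get("class") != "Workflow":
        problems.append("top-level class is not Workflow")
    if "cwlVersion" not in doc:
        problems.append("cwlVersion is missing")

    steps = doc.get("steps", {})
    # Everything a step input may point at in this simplified model:
    # workflow inputs, plus "<step>/<output>" for every declared step output.
    known_sources = set(doc.get("inputs", {}))
    for step_name, step in steps.items():
        for out_name in step.get("out", []):
            known_sources.add("{}/{}".format(step_name, out_name))

    for step_name, step in steps.items():
        if step.get("run", {}).get("class") != "Operation":
            problems.append("{}: run is not an Operation".format(step_name))
        for input_name, source in step.get("in", {}).items():
            if source not in known_sources:
                problems.append(
                    "{}/{}: unknown source {!r}".format(step_name, input_name, source)
                )
    return problems


good = {
    "class": "Workflow",
    "cwlVersion": "v1.2",
    "inputs": {"reads": {"type": "File"}},
    "outputs": {},
    "steps": {
        "trim": {
            "in": {"input1": "reads"},
            "out": ["out_file1"],
            "run": {"class": "Operation", "inputs": {}, "outputs": {}},
        }
    },
}
problems = find_problems(good)  # an empty list means nothing was flagged
```

A test case per example workflow could then assert that the converted document produces no problems, with full validation by cwltool layered on top.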

@ieguinoa
Contributor Author

Hi @jmchilton ,

thanks for the response.
I agree with everything, and that was my realization shortly after creating this, so we created a separate command-line utility that takes the Galaxy workflow file as input: https://github.com/workflowhub-eu/galaxy2cwl/tree/master/galaxy2cwl
I never thought about including it in the gxformat2 repo because my main target was (and still is) native (.ga) formatted files.
That library is currently used in workflowhub.eu to get the CWL abstract and, from this, a diagram of the workflow, but I'm happy to get this logic formally into the gxformat2 repo, especially because that package is already used in the Galaxy code and I would like to export prepackaged workflows in RO-Crate format, which would include the cwl-abstract as an "interface"... so the conversion will be needed.
I think I can help with this, mainly with the conversion and with proper tests (there are some in the examples/ dir of that repo), since I have some cases where .ga and gxformat2 differ in the level of information content.

@jmchilton
Member

Ahhh - I had seen that repository also but I assumed this was newer because Björn pinged me on this yesterday. Sorry - I should have started from there. I do think the approach I outlined in gxformat2 is more promising because it doesn't have two separate blocks for format2 and native and handles type conversions and subworkflows a bit better - but your script has more help and does a better job with the I/O information that is contained in the native format but dropped as unneeded in gxformat2. I think the format2 schema (since it is based on CWL's) has room for all that extra input/output annotation information - it just wasn't strictly needed, so it gets dropped in a naive conversion. I've long wanted to do a variant of the conversion that preserves more of that information - I'll see if I can do that. I'll try to find some time to bring more of that other galaxy2cwl goodness into the gxformat2 script.

If there is anything else I can do to convince you to hack on the gxformat2 version and add test cases for missing features and get y'all to use it for workflowhub - please let me know.

I do have a quick question - are the inputs and outputs on the operation used by workflowhub (I assume just cwlviewer?) or is specifying the in/out on the workflow steps sufficient for your purposes? I handled that information and it seems to validate with cwltool, though the documents seem a bit off as a result.
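
To make the two places in question concrete, the sketch below compares the names a workflow step lists under `in`/`out` with the `inputs`/`outputs` declared on its embedded Operation, and reports where they disagree. The helper name `io_mismatches`, the report keys, and the map-form layout of the step are assumptions for illustration only; this is not how gxformat2 or cwlviewer inspect documents.

```python
def io_mismatches(step):
    """Compare step-level in/out names with the Operation's inputs/outputs."""
    run = step.get("run", {})
    step_in = set(step.get("in", {}))
    step_out = set(step.get("out", []))
    op_in = set(run.get("inputs", {}))
    op_out = set(run.get("outputs", {}))
    return {
        "in_not_on_operation": sorted(step_in - op_in),
        "operation_inputs_unused": sorted(op_in - step_in),
        "out_not_on_operation": sorted(step_out - op_out),
        "operation_outputs_unused": sorted(op_out - step_out),
    }


step = {
    "in": {"input1": "reads"},        # connections, declared on the step
    "out": ["out_file1"],
    "run": {
        "class": "Operation",
        "inputs": {"input1": "Any"},  # interface, declared on the operation
        "outputs": {},                # out_file1 is deliberately missing here
    },
}
report = io_mismatches(step)
```

In this example only `out_not_on_operation` is non-empty, which is the kind of discrepancy that could make a document look "a bit off" even when the step-level wiring is complete.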

@ieguinoa
Contributor Author

No problem, I actually forgot to post a link to that galaxy2cwl repo here.

No need to convince me to hack on the gxformat2 ;) the galaxy2cwl was a bit of a quick fix since we rushed the launch of workflowhub and wanted to get the cwlviewer part going as it is definitely useful for users to see a diagram of the workflow.
It's indeed separated into blocks, mainly because I realized, while coding, how much the information contained in the native and format2 formats differs. It was an attempt to retain as much information as possible.
And definitely it doesn't handle subworkflows that well (if at all).

Should I add the native-formatted workflow examples to the gxformat2 repo? I will look for the use cases that had the most differences between the .ga and format2.

About the workflowhub aim: using cwl-abstract representations is mainly a means to get a standardized metadata file with the workflow representation/interface. The goal is to actually get something similar for other WfMSs like Nextflow, so that most workflows submitted to workflowhub have a similar cwl-abstract representation. This can later be used for higher-level processing (search by workflow patterns, input and output types, EDAM ontologies, etc.).
So, it's important for us to have in the cwl-abstract representation as much information as possible from what is contained in the submitted workflow files. Ideally, as you mentioned before, the workflow exports would contain more metadata, e.g. having the datatypes (in Galaxy terms) that each step input accepts would be a great step. But I assume that's what you meant by a lot of work, as it needs to be extracted from the tool definition itself. Also, I understand that this "new" information will probably only fit in gxformat2.
I will raise this topic at the next workflowhub meeting and see how we can switch to using the gxformat2 library whenever this functionality is ready. Same for the ro-crate-py library.

@jmchilton
Member

Since you're on board for using it - I merged my PR and did a release that adds the experimental feature (https://pypi.org/project/gxformat2/) - it should make it easier to PR that repo and play with the functionality. I'll keep you updated on any more progress I can make this week.

@jmchilton
Member

> Should I add the native-formatted workflow examples to the gxformat2 repo?

Add whatever - I'm fine with just developing test cases around the native format - every new test case helps, and the implementation will translate the information to Format 2 anyway, so we get free testing of the conversion, the spec and so on when we write test cases around native format workflows.

@jmchilton
Member

I did another release of gxformat2 with another pass at the abstract CWL export - this one includes Marius' test COVID workflow as an example and it validates (https://github.com/galaxyproject/gxformat2/pull/43/files#diff-8fd380a30112029c42a33c720693aa77R73). That PR maybe gives some indication of how one can add test workflows - if you run that test case, the outputs are also written to tests/examples/abstractcwl/ so you can manually inspect the generated artifacts.

@nsoranzo
Member

@ieguinoa During last week's ELIXIR-UK hackathon I opened a few pull requests on the gxformat2 repository that may help with this:

@ieguinoa
Contributor Author

Awesome @nsoranzo, thanks a lot. This was also discussed in the IWC plans, where the plan is maybe to merge it into planemo, as it's a more stable and known resource.

@nsoranzo
Member

> This was also discussed in the IWC plans, where the plan is maybe to merge it into planemo, as it's a more stable and known resource.

Thanks for the heads-up! planemo and Galaxy already depend on gxformat2 and use it for format conversions (e.g. the already existing planemo workflow_convert command), so I'd say that gxformat2 is where we should concentrate the efforts for the conversion, with user-level interfaces in the Galaxy API and Planemo.

Contributor

@mr-c mr-c left a comment

Any updates on the plan for this code?

# Pack workflow data into a dictionary and return
data = {}
data['class'] = 'Workflow'
data['cwlVersion'] = "v1.2.0-dev1"
Contributor

Suggested change
- data['cwlVersion'] = "v1.2.0-dev1"
+ data['cwlVersion'] = "v1.2"

@nsoranzo nsoranzo marked this pull request as draft September 16, 2022 14:11

5 participants