
Thoughts on refactoring CWL #111

Open
stefan-pdx opened this issue Aug 15, 2015 · 10 comments

@stefan-pdx

Hi all,

First off, thanks for all the hard work to get CWL up to where it is today. I'm pretty excited to see where it goes.

I'm currently working on a project at OHSU where we want to use CWL for a workflow engine. I've been doing a bit of a deep dive into the current implementation and figuring out how we can use it for our application. However, it looks like we're going to be targeting the JVM, so I started piecing together an alternative implementation in Ruby (my language of choice) to get an idea of what the best approach would be.

Talking the talk

In doing so, I have some first-hand experience of working with the spec and wanted to give some feedback.

  1. Traversing across Avro-LD, JSON-LD, and RDF isn't practical. I understand that Avro-LD isn't spec'd out yet, but I've spent a lot of time trying to reverse-engineer how to parse out the RDF graph and JSON-LD context, and I still don't understand how to take the RDFS graph from the Avro-LD schema and compare it to the RDF graph generated from the JSON-LD document that's expanded from the workflow document and JSON-LD context. Long story short -- we really need to simplify the representation so that developers don't get overburdened with these details. I totally get the motivation for using Avro-LD, since it allows us to sub-class records, but I feel as though the cost (in complexity) outweighs the benefit (clean code).
  2. We need to pick between Avro and RDF. Trying to represent the schema in Avro so that we can traverse the graph in RDF feels redundant. What's the motivation for representing a workflow as triples? Why not use a rigid schema / data structure to express the workflow and let the implementation manage the logic for handling relationships? Can someone point me to where in the reference implementation the logic for parsing a workflow document relies on RDF queries?
  3. We need more tests and better coverage, for both the schema and the reference implementation. The reference implementation needs positive and negative test cases for each record type so that we can prevent regressions as the schema grows.
  4. We should consider language-independent tests that can be used for "conformance" testing across implementations. For example, Gherkin allows you to write features that can be parsed by test suites in different languages. We could have one set of tests that validates Python implementations (via Behave), Ruby implementations (via Cucumber), etc.

I'd love to see CWL grow to adopt best practices that enable developers to build more applications and platforms that take advantage of CWL.

Walking the walk

I wanted to take some time to see what it would look like if we refactored the CWL schema into Avro. Here's a rundown of what I've been able to do.

  • I created a bunch of Avro IDL records. Take workflow.avdl for example. IDL allows you to embed both comments and code documentation. The Avro IDL spec says:

    Comments that begin with /** are used as the documentation string for the type or field definition that follows the comment.

    So this should satisfy the requirement of being able to support multi-line documentation (one of the motivations for using YAML).

  • It's definitely nice being able to break the spec down into separate protocols. There's a top-level protocol that's used to compile the schemata.

  • There are tests! RSpec (a Ruby testing framework) provides a lot of functionality to have extensive tests.
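As an illustration of the doc-comment style quoted above, a hypothetical Avro IDL record might look like this (the protocol, record, and field names here are invented for the example, not taken from the actual CWL schema):

```avdl
protocol CWL {
  /** A single step in a workflow. A comment opened with slash-star-star,
      like this one, becomes the "doc" attribute of the record that
      follows it in the compiled schema. */
  record WorkflowStep {
    /** Unique identifier for this step. */
    string id;
    /** Identifiers of the steps this step depends on. */
    array<string> dependsOn = [];
  }
}
```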

My proposal

  1. Please consider refactoring CWL to use vanilla Avro. Relying on Avro-LD makes it difficult to build additional implementations.
  2. Please consider removing RDF as a dependency. I'm not sure what is gained by having multiple representations of a workflow and I think that validating to an Avro schema provides us the capabilities we need.
  3. Please consider adding more tests to both the schema definition and the reference implementation. I prefer RSpec since you can write clean tests along the lines of "tests as documentation". The reference implementation should be a guide for making implementations in other languages.
  4. Think about how we can have a test suite written in Gherkin that developers can use to validate their implementations.

Moving forward, I'd like to continue down this path and see if I can come up with an Avro-sans-RDF schema implementation that is compliant with the CWL standard document. If this work can be reconciled with the CWL roadmap, I'd love to commit to helping develop this project.

Please let me know if you have any thoughts or feedback!

Thanks!

@tetron
Member

tetron commented Aug 16, 2015

Hi @slnovak, you have a lot of different points here, so let me address what I can.

  1. The canonical representation of CWL is YAML, validated using a rigid Avro schema. The reference implementation internally converts the Avro-LD representation to plain Avro, which is fed programmatically to the Python Avro module.

  2. You can process and execute a CWL document without going to RDF. In fact the reference code doesn't convert to RDF at all for execution, it executes directly from the YAML document tree structure.

The link with RDF is the use of JSON-LD annotations, which enable converting the rigid YAML document to RDF; however, this is an entirely optional feature for an implementation. You can think of the conversion as YAML -> Avro -> JSON-LD -> RDF, where you can stop at Avro and still be able to do everything.
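To make the "stop at Avro" point concrete, here is a minimal, hypothetical sketch of validating a parsed document tree against a rigid record schema with no RDF involved. The schema and field names are a toy stand-in, not the real CWL Avro schema or the actual cwltool code:

```python
# Toy stand-in schema: field name -> required Python type. This is NOT the
# real CWL Avro schema; it only illustrates "rigid validation, no RDF".
SCHEMA = {
    "class": str,
    "inputs": list,
    "outputs": list,
}

def validate(doc, schema=SCHEMA):
    """Validate a parsed YAML/JSON document tree against a record schema."""
    if not isinstance(doc, dict):
        raise TypeError("document root must be a mapping")
    for field, ftype in schema.items():
        if field not in doc:
            raise ValueError("missing required field: %s" % field)
        if not isinstance(doc[field], ftype):
            raise TypeError("field %r must be of type %s" % (field, ftype.__name__))
    return True

doc = {"class": "CommandLineTool", "inputs": [], "outputs": []}
print(validate(doc))  # True
```

An implementation that only needs to execute documents can stop here; the JSON-LD and RDF layers are additive.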

The benefit of considering RDF in the design is that it provides a conceptual framework for describing how documents, objects, metadata, etc. relate to one another (which does not exist in Avro, so a plain Avro schema would still require additional semantics to define these features). It also provides a means of integrating metadata from ontologies, such as EDAM for describing file types (#7).
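As a concrete illustration of that optional JSON-LD layer, a context maps plain document keys to RDF predicates, so a rigid YAML/JSON document can be expanded into triples without changing the document itself. The IRIs and key names below are made up for illustration and are not the actual CWL context:

```json
{
  "@context": {
    "cwl": "http://example.org/cwl#",
    "edam": "http://edamontology.org/",
    "inputs": "cwl:inputs",
    "format": { "@id": "cwl:format", "@type": "@id" }
  }
}
```

With a context like this applied, a plain "format" key whose value is an EDAM term identifier expands to an RDF triple pointing at the ontology, while implementations that stop at Avro can ignore the context entirely.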

We did have an earlier version of the CWL schema written in straight Avro schema, and it was unmaintainable. Since plain Avro doesn't provide any "don't repeat yourself" abstractions, refactoring the schema to prototype new features (which happens a lot when a standard is still under development) is painful and error-prone, and things get out of sync very easily.

I am working on spinning off the schema language into its own project. It's not quite ready yet, but I will go ahead and push what I have to get the conversation started. The idea is that you'll be able to use the schema tools to generate a plain Avro schema, then use the Avro tools in your favorite language for validation/code generation. The reference code (cwltool) is already able to export the straight Avro schema using --print-avro; however, there are reports that it isn't actually compatible with the Java Avro implementation (#69), but that is just a bug that needs to be fixed.

  3. The conformance test suite already provides implementation-independent testing. It works by executing the CLI of a given CWL implementation against a set of CWL documents and inputs and checking for correct output. All that is needed is for the CWL implementation to provide a command-line tool that conforms to the testing interface. Since CWL is a document format and not an API, I'm not sure how else one would go about providing implementation-independent tests.

  4. I think the conformance test tool already accomplishes what you are looking for, but I'm very curious (specifically, with examples) how a tool like Gherkin would be better.

That's not to say the existing tests couldn't use more work: feature coverage is probably incomplete, and negative tests (checking that an implementation fails when required to fail) don't currently exist.
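To make the conformance-harness idea above concrete, here is a minimal, hypothetical sketch (not the actual conformance test code): run an implementation's command-line tool on a document and an input object, then compare its stdout, parsed as JSON, against the expected output. The "tool" used for demonstration is just an inline Python one-liner standing in for a real CWL implementation:

```python
import json
import subprocess
import sys

def check_conformance(tool_cmd, doc, job, expected):
    """Run a CWL implementation's CLI on (doc, job) and compare its
    stdout, parsed as JSON, against the expected output object."""
    result = subprocess.run(
        tool_cmd + [doc, job],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout) == expected

# Stand-in "implementation" for demonstration: an inline one-liner that
# ignores its arguments and prints a fixed JSON object.
fake_tool = [sys.executable, "-c",
             "import json; print(json.dumps({'out': 'hello'}))"]

print(check_conformance(fake_tool, "echo.cwl", "job.json", {"out": "hello"}))
# True
```

Because the harness only depends on the CLI contract and the JSON output, any implementation in any language can plug in by supplying its own `tool_cmd`.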

  5. One of my to-do items has been to start an "implementers guide" (or even just a FAQ) which would clarify these kinds of issues. Maybe this is a good place to start.

(You should attend the next CWL Google hangout call. You should have gotten an invite if you are on the Google Groups mailing list, if not send me an email and I will send you the information)

@tetron
Member

tetron commented Aug 17, 2015

One last comment: you can always skip formal validation, write your execution code assuming the input document is valid, and raise an exception if something unexpected is found. This isn't best practice, but the inability to validate directly against the official schema shouldn't be a blocker for producing a conforming CWL implementation.
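A minimal sketch of that fail-fast approach, executing straight from the document tree and raising on anything unexpected instead of validating up front. The command construction here is made up for illustration and is not how cwltool actually builds command lines:

```python
def run_step(step):
    """Build a command line for a step straight from the document tree,
    raising on anything unexpected instead of validating up front.
    (The command construction here is illustrative only.)"""
    kind = step.get("class")
    if kind == "CommandLineTool":
        return [str(step.get("baseCommand", "echo"))] + \
               [str(a) for a in step.get("arguments", [])]
    raise ValueError("unexpected document class: %r" % kind)

print(run_step({"class": "CommandLineTool", "baseCommand": "echo",
                "arguments": ["hello"]}))
# ['echo', 'hello']
```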

@kellrott
Contributor

Thank you both for bringing more attention to this issue. I’m glad to see there is movement in the area.

A couple opinions:

  1. @tetron, while CWL technically conforms to portions of the Avro spec, as noted in "Document avro-ld" (#69) it is broken in the case of Avro code generation. This is a critical failure, as our goal is to define a common RPC protocol for federating workflow invocations, and we would like to define that RPC protocol using existing Avro schemas and tools.

  2. Part of the confusion comes from the lack of documentation and confusion between projects. The stated reason for not using vanilla Avro for CWL wasn’t that it was impossible (as @slnovak and @tetron noted, it is quite possible) but that it was unmaintainable. But the solution for this was to define a second, new general-purpose description language. This more than doubles the scope of the work, because examples, libraries, documentation, and conformance tests now have to be developed for two different technical specifications. Designing a pipeline description schema like CWL is a narrow, specific, and well-defined goal (according to the last schema, just 34 record types), but designing Avro-LD means defining a fully descriptive general-purpose language, which is a much larger project than just defining CWL. The strain of this extra work is already showing in the lack of documentation and the implementation bugs that are impeding further development.

I’m hoping we can find a way to move forward. Perhaps @tetron could continue Avro-LD as a separate project, so that it has the chance to be documented, tested, and made fully compliant with Avro, and so that it doesn’t block the development of CWL.

@ghost

ghost commented Aug 18, 2015

Now that draft-2 is stable, perhaps we can commit e.g. .avdl files to the repository separately (to avoid potential bugs with --print-avro, since it's only tested with Avro's Python implementation)?

@tetron tetron mentioned this issue Aug 18, 2015
@tetron
Member

tetron commented Aug 18, 2015

I think @ntijanic and @kellrott have it right. Since @slnovak has already done the legwork to write the .avdl file by hand, there's no reason it can't be merged as an alternate schema for the stable draft-2. The schema language as a separate project (see my note on #69) will be ready for draft-3, and it will support conversion to .avsc (and maybe even export to .avdl) for use with Avro tooling, making it unnecessary to convert the schema by hand, so this manual conversion would be a one-off.

@stefan-pdx
Author

@tetron -- thanks for your feedback.

Do you think there is room within the standard to support the notion of "schema implementations" alongside tool implementations? That is, the community would be able to develop its own schema implementations as long as they conform to the standard. Do you know of any standards or data working groups that have done anything along these lines?

@tetron
Member

tetron commented Aug 19, 2015

@slnovak just to clarify, by "schema implementations" I assume you mean an encoding of the CWL data model (such as using Avro IDL), rather than an implementation of the schema language? I think as long as we are able to keep them in sync and avoid multiple sources of truth (different schemas that are in conflict), then that is perfectly reasonable. This is in line with the broader idea that CWL is as much about defining the right abstract data and computational models as it is about a specific document encoding.

@mr-c
Member

mr-c commented Oct 26, 2015

@slnovak Thank you for your detailed and thoughtful comment. Now that we are nearly in November, what do you think about the state of CWL?

@ghost

ghost commented Dec 9, 2015

Related: #173

Also, I think it's totally doable to define additional encodings that are isomorphic with the YAML and RDF versions of CWL. Any RDF encoding should automatically qualify, though other encodings don't need to depend on RDF at all. Even one-way mappings should be fine (a DSL?).

@ghost

ghost commented Oct 22, 2018

@slnovak, @tetron is this ticket outdated?
