Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

packaging imports with ../ #6

Open
mlin opened this issue Jan 31, 2022 · 17 comments
Open

packaging imports with ../ #6

mlin opened this issue Jan 31, 2022 · 17 comments

Comments

@mlin
Copy link

mlin commented Jan 31, 2022

Hi @rhpvorderman @DavyCats

I noticed wdl-packager rejects top-level WDLs that import from outside their own directory e.g. import "../../../foo/bar/bas.wdl". I'm concerned that several of the important Broad WDL repos (warp, viral-pipelines, etc.) have settled on that approach (example), so we should try to support it even if we organize our own source code differently...

It seems possible in principle to embed in the ZIP file, the minimum directory subtree needed to make the relative paths work. The problem with that is the "main" WDL file gets buried in some subdirectory which isn't straightforward to find upon opening the ZIP file.

We could add a manifest file to the ZIP, with the equivalent of Main-Class: from a JAR manifest and that also could hold default inputs if desired. I could make miniwdl read that, but obviously no control over Cromwell. Come to think of it, I wonder how are repos with the above structure sent into Cromwell/Terra anyhow? (Probably by public URL instead of ZIP file?)

Have you thought about this, any other approaches?

(I spent this past weekend educating myself about the packaging problem by implementing an alternative approach that inlines all the WDL source in a YAML file along with a manifest, and relies on dedicated logic in miniwdl's source loader to resolve the paths. However, after finishing it, I think I should just go back to ZIP files.)

@rhpvorderman
Copy link
Contributor

I noticed wdl-packager rejects top-level WDLs that import from outside their own directory e.g. import "../../../foo/bar/bas.wdl". I'm concerned that several of the important Broad WDL repos (warp, viral-pipelines, etc.) have settled on that approach (example), so we should try to support it even if we organize our own source code differently...

I am all supporting a wide variety of use cases. But I am not for supporting use cases that are inheritantly unpackagable. There is no intuitive way to solve these problems, so the only solution is one where wdl-packager would behave unintuitively. Having unintuitive behavior causes a lot of friction for developers and users, so that is a big sacrifice to make in my opinion.

As for the broad case. Instead of importing ../../../mytask.wdl. They could also do import mytask.wdl. And then ln -s ../../../mytask.wdl. That fixes their WDL and WDL-packager will simply convert the link into a file. Problem solved.

Maybe we can provide this sort of information in a packaging guide? That would be more helpful than coding around all possible package problems that might occur.

@mlin
Copy link
Author

mlin commented Jan 31, 2022

Hmm, you're making me reconsider whether to merge my custom file format! 😅

@rhpvorderman
Copy link
Contributor

Hmm, maybe wdl-packager can do some rewriting of import statements? i.e. ../../../wdltask.wdl to wdltask.wdl if there are no collissions?
That would be useful code to for https:// imports. Those can also be written to a file as well, ensuring reproducibility.

@rhpvorderman
Copy link
Contributor

Would that be a solution? It sounds like something WDL-packager can support. Also I can add warnings.warn(WhatAreYouDoingWarning(".. imports in WDL,seriously?!?!?")) to the wdl-packager code. That sounds like fun 😉!

@mlin
Copy link
Author

mlin commented Jan 31, 2022

Although uncomfortable to change the source, I think it's slightly better than my approach that needs special logic in the source loader to "reinterpret" the import paths. Despite miniwdl not supporting full code generation, we could do a good enough job on the constrained problem of rewriting import statements based on miniwdl's DocImport object that records the line/column ranges in the source code.

@rhpvorderman
Copy link
Contributor

That would be doable indeed. It would just be an ad verbatim copy except the import statement.

That is if, we think wdl-packager got WDL packaging right the first try. We might want to think about more about that.
In principle I think the zip file approach is correct. That is also what Java uses for JAR files, so it can't be that bad. Python packages also, are simply archive files. I prefer zip here, because it can access files randomly from the archive (tar needs complete unpacking). We might want to add some sort of metadata format file which is mandatory, that contains paths to all sorts of other metadata in the package (LICENSE files etc.).

@mlin
Copy link
Author

mlin commented Jan 31, 2022

The only downside of ZIP is that it doesn't achieve the best possible compression, because it compresses each file separately without sharing a "dictionary." Usually the difference would be negligible, except that the reason I just got interested in packaging (after letting chanzuckerberg/miniwdl#292 languish for years!) was to send the package in an AWS Batch SubmitJob environment variable, which has a 30 KiB total size limit. It'd be nice not to have to arrange for some S3 bucket or public web server to host the package. Anyway, I admit this is a weird case, it's just factually what got me onto this topic 😄

@rhpvorderman
Copy link
Contributor

rhpvorderman commented Jan 31, 2022

wdl-packager now uses uncompressed zip files. Because they are still relatively small compared to checking out an entire repo including test files.

I suppose WDL packager could also use tar files. It's just packing an archive.

-rw-r--r--  1 rhpvorderman rhpvorderman  43K 31 jan 11:38 germline_v4.1.0.tar.xz
-rw-r--r--  1 rhpvorderman rhpvorderman  98K 31 jan 11:38 germline_v4.1.0.tar.gz
-rw-r--r--  1 rhpvorderman rhpvorderman 149K 31 jan 11:38 germline_v4.1.0.deflate9.zip
-rw-r--r--  1 rhpvorderman rhpvorderman 680K 31 jan 11:30 germline_v4.1.0.zip
-rw-r--r--  1 rhpvorderman rhpvorderman 720K 31 jan 11:38 germline_v4.1.0.tar

Sorted on size. Tar uncompressed is bigger than zip. zip deflated is already quite small. But tar is indeed 33% smaller with the same deflate compression (level 9). XZ -9 works very well. Less than half the size of gzip.

EDIT: I uses zip initially because that is what cromwell supports. But as soon as tar.xz gets into the spec wdl-packager can also support that. I prefer to support only formats that can be parsed by Python's stdlib. (gzip, bzip2, lzma). Bzip2 is not a suitable candidate in my opinion as decompression speeds are very slow. Lzma works much better in that regard. Also I see that XZ format is supported in the java world as well.

@mlin
Copy link
Author

mlin commented Feb 2, 2022

The thought just occurred to me, since my need for maximum compression is kind of a corner case, for that purpose I can simply use XZ on the uncompressed ZIP file. It's probably better to stick with ZIP for general WDL packaging, which will be less surprising to the community.

@rhpvorderman
Copy link
Contributor

If it helps I can add a compression level flag to wdl-packager. That should shrink the size somewhat.

The reason the zipfiles are uncompressed by default.

  1. No need for a DEFLATE implementation in the reader. The zip packages can be read plainly.
  2. Our team frequently processes gigabytes upon gigabytes of data. The savings of compressing the zip are insignificant.

@rhpvorderman
Copy link
Contributor

@mlin while writing the package specification I thought of another way to handle .. style imports.

Given a file called my.wdl, which lives in /home/jdoe/projects/wdlworkflow that imports ../wdltasks/mytask.wdl:

  • Go one directory back
  • Save my.wdl as wdlworkflow/my.wdl in the zip
  • Save ../wdltasks/mytask.wdl as wdltasks/mytask.wdl in the zip
  • Set mainWorkflowURL to wdlworkflow/my.wdl in MANIFEST.json.

Done, and no need to rewrite the source. It won't work for http style imports, but I hope this helps a bit with doing a minimum of import rewriting.

@mlin
Copy link
Author

mlin commented Mar 12, 2022

@rhpvorderman I had some similar thoughts early on in looking at this,

It seems possible in principle to embed in the ZIP file, the minimum directory subtree needed to make the relative paths work. The problem with that is the "main" WDL file gets buried in some subdirectory which isn't straightforward to find upon opening the ZIP file.

We could add a manifest file to the ZIP, with the equivalent of Main-Class: from a JAR manifest and that also could hold default inputs if desired. I could make miniwdl read that, but obviously no control over Cromwell. Come to think of it, I wonder how are repos with the above structure sent into Cromwell/Terra anyhow? (Probably by public URL instead of ZIP file?)

I'd still be concerned about producing a zip that wouldn't readily work with Cromwell.

Another thing I was unsure about was where to put the default input JSON (and "additional files" as wdl-packager supports). Do they go in the root or alongside the main WDL? Depending on that, what kind of relative paths does the input JSON use to refer to the additional files? By putting the main WDL in the root, I at least avoided having to make those decisions. 😅

Thanks for the great PR on miniwdl btw, I'm a bit tied up the next few days but will send comments asarp.

@rhpvorderman
Copy link
Contributor

rhpvorderman commented Mar 14, 2022

I'd still be concerned about producing a zip that wouldn't readily work with Cromwell.

Ah, yes that is a very good reason. I can't believe we overlooked that.

Thanks for the great PR on miniwdl btw, I'm a bit tied up the next few days but will send comments asarp.

Take your time. I wrote a spec for WDL packaging too and put it up for discussion on the OpenWDL repo. Packaging is mainly a BioWDL issue, since we have all these modular repositories, so we had a discussion with the team and decided to push for this ourselves.
With my scala and python skills I can implement any necessary features for supporting the packaging myself. So my PR on miniwdl is part of that effort.
We also plan on hosting the initiial packages.openwdl.org repository that will be a result of this in the long run. Since we will be the primary benefactors.

@vsmalladi
Copy link

@rhpvorderman @mlin is this still open for discussion? I would like to make this a standard we can put in OpenWDL

@rhpvorderman
Copy link
Contributor

@vsmalladi Sure! The reason this has not had much activity is that currently this would only benefit the LUMC and we also have other things we do besides working on WDL pipelines. So these not very pressing concerns tend to get stuck on the backlog. It would be very nice if more people get involved.

@mlin
Copy link
Author

mlin commented Apr 12, 2023

@vsmalladi Also FYI- this repo & thread were the original inspiration for miniwdl zip which @rhpvorderman also subsequently improved. It does rewrite import statements as needed to match the layout of what it stores in the zip file (it prints a warning if it has to do this)

@rhpvorderman
Copy link
Contributor

@vsmalladi I already started on a discussion about packaging: openwdl/wdl#499

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants