Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TRS should be able to serve binary files #235

Open
coverbeck opened this issue Jan 3, 2023 · 5 comments
Open

TRS should be able to serve binary files #235

coverbeck opened this issue Jan 3, 2023 · 5 comments

Comments

@coverbeck
Copy link
Contributor

coverbeck commented Jan 3, 2023

A workflow could have binary files, such as compiled code, images, etc.

TRS does not have a way to serve up binaries; the FileWrapper object returns a JSON object with a String content field -- if the data were binary, it would need to be escaped.

Some brainstorming:

  • For JSON responses, base64 encoding the binary data seems like the best solution. But clients would then need to know that the content is base64-encoded, so that would entail adding another field to FileWrapper that indicates if the content field is encoded or not, so the client will know whether to decode it.
  • What to do if one of the PLAIN types, e.g., PLAIN_WDL is requested? An error response, or return the data with the content-type set accordingly?
  • A different idea is that the only way to fetch binary data is by fetching the workflow as a bundle. In practice, I think most clients should be using the zip endpoint, but it does seem incomplete to only be able to fetch and know about binaries this way.

┆Issue is synchronized with this Jira Story
┆Project Name: Zzz-ARCHIVE GA4GH tool-registry-service
┆Issue Number: TRS-64

@uniqueg
Copy link
Collaborator

uniqueg commented Jan 4, 2023

Good point. But I would like to point out that the current specs are at least consistent in that binary files appear to not be supported on purpose:

So with this in mind, let me address your points:

What to do if one of the PLAIN types, e.g., PLAIN_WDL is requested? An error response, or return the data with the content-type set accordingly?

I may be wrong, but I know of no workflow engine supporting the execution of workflows defined in one of the currently enumerated 'DescriptorTypes that allows executing workflows described in binary format. And even if, that would probably constitute highly engine-specific behavior. So if we were able to pull such binaries, how could we be sure it's in a format that a given engine supports. Shouldn't TRS then specify the binarization mechanism (e.g., a compression type) to be useful? Of course the case may be different for other workflow types that are currently not explicitly supported by TRS (i.e., not enumerated in the 'DescriptorType schema).

A different idea is that the only way to fetch binary data is by fetching the workflow as a bundle. In practice, I think most clients should be using the zip endpoint, but it does seem incomplete to only be able to fetch and know about binaries this way.

While I generally agree that there is an inconsistency here, the inability to fetch individual files is not restricted to binary files only, as described above. No file of ToolFile.file_type OTHER can currently be retrieved individually, outside of the GET /tools/{id}/versions/{version_id}/{type}/files endpoint with content type application/zip.

So, personally, I think we should not worry about supporting binary files unless we have a real-world use case that demonstrates a strong need for binary descriptor, test or container recipe files. Otherwise the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer though, both in the documentation and the specs, that clients are expected to make use of this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are more designed for allowing quick introspection of workflow descriptors, container recipes etc.

@denis-yuen
Copy link
Member

I may be wrong, but I know of no workflow engine supporting the execution of workflows defined in one of the currently enumerated 'DescriptorTypes that allows executing workflows described in binary format. And even if, that would probably constitute highly engine-specific behavior.

We're pondering Nextflow which has the option of binary files https://www.nextflow.io/docs/latest/faq.html#how-do-i-invoke-custom-scripts-and-tools

Otherwise the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer though, both in the documentation and the specs, that clients are expected to make use of this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are more designed for allowing quick introspection of workflow descriptors, container recipes etc.

That said, the zip archive/improving documentation route makes sense to me

@uniqueg
Copy link
Collaborator

uniqueg commented Jan 4, 2023

But this is not about descriptor files @denis-yuen. Surely workflows in the wild rely on scripts or executables being shipped with the workflows in the way described in the Nextflow documentation. A similar mechanism exists for Snakemake as well, and probably other workflow engines, including support for executing these executables in the indicated containers. So by all means, if you fetch a workflow and expect to be able to run it, you will need to make sure to fetch files of type OTHER as well.

By the way, if we go the doc improvement route, we might use the opportunity to more clearly specify that files of type TEST_FILE can be JSON only (unless we want to extend this to include YAML as well). Otherwise people may think that, e.g., input files for tests (arguably also test files), that may very well be in binary format, should be TEST_FILEs, too.

@coverbeck
Copy link
Contributor Author

Thanks for the thoughtful response @uniqueg .

There is no endpoint to allow fetching individual files of ToolFile.file_type OTHER

Ah, good point, I hadn't paid attention and wasn't aware of that. That does simplify things.

So then, I'm good with fetching via archive only and improving the doc.

A digression that this discussion about archives triggers for me... I suspect file permissions may be important in some cases for running the workflow and we may want to make tar an additional format option to handle that.

@denis-yuen
Copy link
Member

A digression that this discussion about archives triggers for me... I suspect file permissions may be important in some cases for running the workflow and we may want to make tar an additional format option to handle that.

🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants