Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose spec changes for multipart upload. #31

Merged
merged 4 commits into from
Jul 13, 2018
Merged

Conversation

tetron
Copy link
Contributor

@tetron tetron commented Jun 5, 2018

Add text describing multipart upload.

Remove workflow_descriptor because it is superceded by multipart
upload.

Follow up from Toronto meeting.

Corresponding proof of concept implementation here common-workflow-language/workflow-service#22

paging @geoffjentry @mckinsel @jaeddy @briandoconnor @dglazer

Remove workflow_descriptor because it is superceded by multipart
upload.
endpoint must understand and be able to access URLs supplied
in the input. This is implementation specific.

This endpoint can be used two different ways, selected by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not how I understood our agreement.

I was picturing the submission request to be a multipart with:

  • The JSON bundle of most things
  • 0 or more workflow descriptor files

@mckinsel Since you & I have the action item to design this, I'm curious to get your read on this. Was I the outlier here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize for jumping in on your action item. I was doing some prototyping so I wanted to write up my findings, and thought a PR would be the most productive way to kickstart the discussion.

We could specify that it only accepts multipart, but in any case one of the "parts" needs to be the json request, and the other "parts" are the files needed to run the workflow. The part containing the json payload can either be the first item, or an item with a particular name. I went with a particular name ("body") but it is fine either way.

Selecting on content-type makes life simpler for the submitter when they want to provide a workflow_url and not upload files. But that case can also be accommodated with multipart with a single item ("body").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah no I was happy you did :) It was also good because it made clear much faster that there was a difference of opinion on what was decided.

The second paragraph is exactly how I had been picturing it.

The third paragraph works for me insofar as the JSON blob has a well specified form field when multipart

My preference would be for the second paragraph, but I tend to dislike overloading things in general so I recognize that this quite possibly just my bias showing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be completely clear, I mean choosing behavior based on Content-Type of the primary headers as to whether the body of the request is just the request json or whether it is a multipart stream, and not talking about Content-Type fields on the individual parts.

But in the interests of keeping the interface minimal, I can take out that part, we specify that:

  • it always has to be multipart
  • the request json is the part named "body"
  • even if the submitter doesn't need to attach files, they still have to use multipart encoding on the body part

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should, at a minimum, wait for Marcus to comment prior to changing the PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tentatively on board with whatever the current WES implementors think is best/easiest. If you guys make a decision, I'll check back and make sure I can actually figure out what to do (as a client) from the description in the spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tetron "but also foresee the next person looking at the spec wondering why" FWIW when I have shown the spec to folks locally they're always wondering why the JSON blob approach is being used and it's not a multipart form altogether. I say this not as pushing for what @mckinsel suggested but rather to note that people are going to be coming to this with a variety of biases and will by nature be expecting to see what they're used to seeing.

@mckinsel Yes, there is (was?) a limitation in Swagger where you can't have arbitrarily large set of fields with the same name. That WorkflowParams{1..N} is purely a swagger thing, it's not necessary in e.g. curl, but was the only way we could get it to work at all in swagger.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess some of the non-workflow_descriptor fields are going to be JSON blobs either way: workflow_params and workflow_engine_parameters. So clients and implementers are doing some nested encoding/decoding in either scenario.

The swagger limitation appears to be fixed in openapi 3. You can do something like this (I think):

content:
  multipart/form-data:
    schema:
      WorkflowRequest:
        type: object
        properties:
          workflow_descriptor:
            type: array
            items:
              type: string
              format: binary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a preferred encoding in OpenAPI 3 then we should probably adopt that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mckinsel On your first point I don't think that's right. IIRC those were both specified to be actual JSON objects (instead of the current "whatever the heck one wants, but we know it'll be JSON" which means they can be sub-objects in the main JSON blob. The same would be true of form fields (as we see in the Cromwell API). So I don't think either side should require escaping?

@tetron
Copy link
Contributor Author

tetron commented Jun 20, 2018

@mckinsel @geoffjentry can we get to a consensus here? We don't want to supportmultiple content types, it always has to be multipart/form-data, but I don't know whether we want fields such as workflow_params to be separate form parts or packed into a single json encoded form part?

@geoffjentry
Copy link
Contributor

@tetron I have a slight personal preference towards "just make the whole thing a multi-part" but I'm biased as that's the world I come from. I'm OK w/ that not being the case.

I have a strong preference towards it always being some sort of multi-part, and if it's not all multipart the one which makes the most sense to me is workflow-descriptor-X (or whatever) and everything-else (or whatever)

If the final answer is one of those two then I would leave it totally up to you and @mckinsel (as the two other folks who seem to have opinions). If it's something else, I reserve the right to chime in again :)

Related: We're coming up on the quarter boundary which means I will be able to get time (and/or someone else's time) to start poking at this soon

this message with a workflow_type_version offered in ServiceInfo, one can initialize
CWL, WDL, or a base64 encoded gzip of the required workflow descriptors. When files must be
created in this way, the `workflow_url` should be set to the path of the main
workflow descriptor.
workflow_params:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the same thing for workflow_params? @achave11

@jaeddy
Copy link
Member

jaeddy commented Jun 21, 2018

After staring at the OpenAPI 3.0 spec and docs for a while (and following @mckinsel's example), I think something like this might work... I could also be wrong. Regardless of specification, I'm fine with the "just make the whole thing a multi-part" approach, as long as there's a sensible way to discriminate between allowable types for workflow_descriptor.

paths:
  /workflows:
    post:
      summary: Submit a workflow to run.
    requestBody:
      content:
        multipart/form-data:
          schema:
            $ref: '#/components/schemas/WorkflowRequest'

components:
  schemas:
    SimpleDescriptor:
      type: string
    ComplexDescriptor:
      type: array
      items:
        type: string
        format: binary
    WorkflowRequest:
      type: object
      properties:
        workflow_descriptor:
          oneOf:
          - $ref: '#/components/schemas/SimpleDescriptor'
          - $ref: '#/components/schemas/ComplexDescriptor'
        workflow_url:
          type: string
        workflow_params:
          $ref: '#/components/schemas/WesObject'
        workflow_type:
          type: string
        workflow_type_version:
          type: string
        tags:
          $ref: '#/components/schemas/WesObject'
        workflow_engine_parameters:
          $ref: '#/components/schemas/WesObject'

@tetron
Copy link
Contributor Author

tetron commented Jun 25, 2018

A few things:

You can describe multipart file upload in swagger 2.0: https://swagger.io/docs/specification/2-0/file-upload/

It also describes a workaround for describing arbitrary file uploads as an array of "binary" format strings, this is formalized in the OpenAPI 3.0 spec (the way of describing it is almost exactly the same).

Another limitation of Swagger 2.0 is that you can't have form parts which are "object" type. I think we can have them as strings and specify they are application/json though.

There's an online converter from Swagger 2.0 to OpenAPI 3.0:
https://openapi-converter.herokuapp.com/

Here's the problem: the tooling we're using (at least in Python, connexion and bravado) only understands Swagger 2.0. So if we switch to OpenAPI 3.0, what we gain in expressiveness, we lose in convenience.

My personal preference is to try and make it work with Swagger 2.0, at least for the time being.

@denis-yuen
Copy link
Member

We have the same problem with tooling in Java / Typescript for the near future.

@geoffjentry
Copy link
Contributor

@tetron Agree that for now it's probably worth sticking w/ 2.0 if possible for a host of practical reasons - beyond what you cited there's also the "who is volunteering to port this?" issue :)

I'd forgotten about that workaround. The case in the Cromwell API that led us to discover that functionality hole was such that the workaround wasn't useful as we explicitly wanted the Swagger API to provide the upload links, but for this purpose I think it's fine as long as the spec swagger is clear about what's happening.

Just so I'm clear, are we now agreed on the it's all multipart angle? The quarter starts next week so I'm hoping I can direct some of my group's attention this way ASAP once the switchover happens

@tetron
Copy link
Contributor Author

tetron commented Jun 25, 2018

Updated to specify that it has to consume: multipart/form-data and parameters are in: formData, renamed workflow_descriptor to workflow_attachment and updated text. I think this reflects the current consensus so please take a look. The corresponding branch is updated as well: common-workflow-language/workflow-service#22

@briandoconnor
Copy link
Contributor

briandoconnor commented Jul 2, 2018

@jaeddy is going to resolve conflicts.

Next week we call for a vote leading up to the next Monday call?

Voters will be: David (not no), Brian (not no), Marcus, @jaeddy, @geoffjentry

@geoffjentry
Copy link
Contributor

Tagging @mckinsel since Brian wasn't able to get the user ID live during the call


The `workflow_url` is either an absolute URL to a workflow
file that is accessible by the WES endpoint, or a relative URL
corresponding to one of the files attached using
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tetron Can you add an example for the latter case? I had to think about what that'd look like for a moment, and this seems like a case where a little ambiguity can lead to differences in behavior

Correct indentation and naming bugs that are causing server
setup to fail during build tests.
name: workflow_type
type: string

- in: formData
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idle thought - what is the appropriate response when the workflow type & version don't match reality? Is there an expectation that this be checked, or is it just "run it like this and hope for the best"?

We could also consider asking implementations to auto-detect these (e.g. it was pretty trivial to add this to Cromwell, we no longer require the equivalent fields in Cromwell's submit endpoint as we just autodetect wdl vs cwl and which version it is)

@briandoconnor
Copy link
Contributor

@mckinsel needs to vote
this may need to be modified after James' merge... @tetron will take point on the updates

REQUIRED
The workflow CWL or WDL document.
When workflow attachments files are provided, the `workflow_url` may be a relative path
corresponding to one of the attachments.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be clearer to just say that workflow_url should match one of the filename parameters in the Content-Disposition headers?

Copy link

@mckinsel mckinsel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. My one suggestion is to consider replacing the "relative URL" concept with a reference to the form filename parameters.

name: workflow_type_version
type: string

- in: formData
Copy link
Member

@david4096 david4096 Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it possible to send in a request body without multipart upload items? We might leave the WorkflowRunRequest untouched and offer the file upload as an option, as opposed to requiring sending strings to multipart file uploads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand you correctly we already discussed that as an option and voted on this path

@geoffjentry
Copy link
Contributor

In particular directed at @tetron @jaeddy @mckinsel but also in general:

Am I right in thinking that this PR has an impact on WorkflowRunRequest in the WorkflowLog section? If so, should we tackle that here or after merging this? Considering how long it's taken to get consensus on this (I think it is mergeable now?)

I think I'd prefer to clean up the WorkflowLog stuff separately - I have a feeling that there are other lurking ripple effects from this change and we can handle them all at once. For instance I came across this when updating the Cromwell WES code and noticing something didn't map cleanly anymore.

@tetron
Copy link
Contributor Author

tetron commented Jul 13, 2018

@geoffjentry the impact is that it ends up being defined separately (redundantly) in the API doc.

@tetron
Copy link
Contributor Author

tetron commented Jul 13, 2018

I'm going to merge this, we can always continue to refine it in another PR.

@tetron tetron merged commit a53bf0c into develop Jul 13, 2018
@geoffjentry
Copy link
Contributor

@tetron Definitely, wasn't suggesting it was tricky just that there are now other inaccuracies in the spec. Since you merged it already it looks like we're in agreement that it can be sorted out later :)

@kislyuk
Copy link

kislyuk commented Aug 10, 2018

Sorry I'm late to this discussion. Just wanted to throw the following out there: I think all modern APIs use JSON as the data interchange format, and the tooling for JSON-based I/O is continuously improving and being invested in. I think making the WES API spec require the use of multipart form data will negatively impact the adoption of the API.

(For reference, to see the I/O preferences we made in the HCA DSS API, here is its Swagger spec: https://github.com/HumanCellAtlas/data-store/blob/master/dss-api.yml)

@tetron
Copy link
Contributor Author

tetron commented Aug 10, 2018

@kislyuk Is the HCA DSS API handling actual file content? It looks to me like it is JSON objects containing URLs to files uploaded separately. WES provides for uploading file content directly. Also I noticed in that document you linked, it uses multipart framing for streaming notifications, not JSON.

@kislyuk
Copy link

kislyuk commented Aug 10, 2018

@tetron yes, it handles actual file content, but only by way of redirects. For webhook notifications, the preferred payload format is JSON, but one of our contributors decided to add multipart form POST support for legacy applications.

@kislyuk
Copy link

kislyuk commented Aug 10, 2018

In general, handling file content in-band in the API might be a design issue in and of itself. If a text file needs to be included though, it can always be included as a JSON string value.

@jaeddy jaeddy deleted the multipart-upload branch October 4, 2018 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants