Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MIME multipart parsing #2967

Closed
wants to merge 2 commits into from
Closed

Conversation

RX14
Copy link
Contributor

@RX14 RX14 commented Jul 7, 2016

This pull request adds classes for parsing and generating MIME multipart/* messages, based on RFC2046 Section 5.1.

This pr depends on #2972, #2973.

@kostya
Copy link
Contributor

kostya commented Jul 7, 2016

interesting how this performance compare to https://github.com/kostya/http_parser.cr/blob/master/benchmark/bench_multipart.cr

@RX14
Copy link
Contributor Author

RX14 commented Jul 7, 2016

From preliminary benchmarks it looks about half as fast:

https://gist.github.com/RX14/e74d8bb3e812485259274591d5f45d67

Not sure exactly what to improve, i'm sure the somewhat increased user-friendliness of this implementation has something to do with it.

@ysbaddaden
Copy link
Contributor

What about having a MIME::Multipart class to handle generic MIME multipart (for form data or email parts) along with MIME::Multipart::Parser, then having a HTTP::FormData class to build a multipart/form-data body along with HTTP::FormData::Parser to parse an incoming request?

Maybe have a simpler API to build a multipart/form-data, with headers being generated automatically, or even skipped to directly write the single Content-Disposition (for FormData). Also body_part looks like a getter, what about add_part or just add or append (to follow the DOM FormData interface)?

data = HTTP::FormData.new
data.add("username", "john")

File.open("avatar.jpg", "r") do |file|
  data.add("file", file, HTTP::Headers{"content-type" => "image/jpeg"})
end

We miss a mecanism in HTTP::Client to directly write the body to the IO, but we could allow body to accept a HTTP::FormData in addition to String|Nil?

HTTP::Client.put("https://api.example.com/user/avatar", body: data)

@RX14
Copy link
Contributor Author

RX14 commented Jul 8, 2016

I'm ok with renaming HTTP::Multipart to MIME::Multipart, but I wish to keep Parser and Generator.

I was thinking of having a simpler API for multipart/form-data, but I wasn't quite sure what it would be. Being able to provide the form data as a hash is something I wish to support however. As for parsing, I intentionally wish to make the default parsing method able to handle large files without resorting to the ugly hack that rack (and many other libraries) use: writing all large files out to disk and providing uploads as files. This means callbacks with an IO given, at least for all files.

The kind of interface I am currently thinking of for parsing multipart requests would be callback-based like this:

HTTP::FormData.parse(request) do |p|
  p.field("username") do |username|
    # handle username field
  end
  p.file("upload") do |io|
    # use IO
  end
end

However this approach is a little unwieldy for parsing non-file fields so maybe this would be better, if a little less consistent:

mp = HTTP::FormData::Parser.new(request)
mp.handle_file("upload") do |io, filename(, headers)|
  # handle file somehow
end
mp.handle_files do |field, io, filename(, headers)|
  # callback for every file upload
  # cannot be used with `#handle_file`
end
mp.parse # => {"username" => {value: "rx14", headers: HTTP::Headers{...}}, ...}

My preference for generating FormData would be with a builder, or providing a Hash(String, String | IO | {value: String | IO, headers: HTTP::Headers}).

@ysbaddaden
Copy link
Contributor

What about yielding each part as it comes, where field would be a struct with some info (eg: headers, name, filename, content type, ...) and value either being a String or an IO when the part is a file?

For example (yes, it uses a tempfile, but this is an usage example):

record UploadedFile, filename : String, file : Tempfile
params = Hash(String, String|UploadedFile).new

HTTP::FormData.parse(request) do |field, value|
  case value
  when IO
    tmp = Tempfile.new("multipart")
    IO.copy(value, tmp)
    params[field.name] = UploadedFile.new(field.filename, tmp)
  else
    params[field.name] = value
  end
end

@RX14
Copy link
Contributor Author

RX14 commented Jul 8, 2016

That kind of boilerplate was what I wanted to avoid. With that method you're essentially going to end up with people switch-casing in their block based on field names. Having callbacks based on field-name seems like the most user-friendly way to return io objects, and returning a hash seems to be the most user-friendly way to return strings/non-files. The question is whether to use callbacks for both for consistency, or do both which will likely be easier to use.

Maybe if the FormData callbacks were basically transformers from io to another storable value.

For example:

mp = HTTP::FormData::Parser.new(request)
mp.transform_file("upload") do |io, filename, headers|
  tempfile = Tempfile.new(filename)
  IO.copy(io, tempfile)
  tempfile
end
mp.parse # => {"username" => {value: "rx14", headers: HTTP::Headers{...}}, "upload" => {value: Tempfile, headers: HTTP::Headers{...}}}

would be the way to implement saving files to disk for later usage. This is flexible, because the transformer could handle the io right there, and map to nil. Not quite sure how to implement it with regards to typing though.

@RX14
Copy link
Contributor Author

RX14 commented Jul 8, 2016

Ok, After discussing on IRC I think the correct interface is like so:

username = nil
password = nil
email = nil
avatar_url = nil
HTTP::FormData.parse(request) do |p|
  p.field("username") { |u| username = u }
  p.field("password") { |p| password = p }
  p.field("email") { |e| email = e }
  p.file("avatar") do |io|
    # write avatar
    # generate url
    avatar_url = url
  end
end

It's got a little bit of boilerplate for collecting fields, but I think it's consistent. If anyone can work out a way of getting rid of the boilerplate cleanly, it would be appreciated.

@RX14
Copy link
Contributor Author

RX14 commented Jul 9, 2016

I can't rename HTTP::Multipart to MIME::Multipart because I depend on HTTP::Headers and the http header parsing stuff. Returning HTTP::Headers, and requiring http, from the MIME namespace seems weird. Neither do I want to duplicate HTTP::Headers etc. to the MIME namespace.

The most technically correct option would be to note that rfc5322 is the basis for MIME and HTTP headers, although i'm not quite sure HTTP headers and MIME headers have exactly the same format and semantics.

@@ -89,7 +89,7 @@ _canonicalize_file_path() {
local dir file
dir=$(dirname -- "$1")
file=$(basename -- "$1")
(cd "$dir" 2>/dev/null && printf '%s/%s\n' "$(pwd -P)" "$file")
(cd "$dir" 2>/dev/null >/dev/null && printf '%s/%s\n' "$(pwd -P)" "$file")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? What's a way to reproduce something that, without this change, breaks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set CDPATH on my system, which causes cd to output which directory it's cding to, because it can be ambiguous.

@asterite
Copy link
Member

asterite commented Jul 9, 2016

Overall looks good!

Maybe you could split this PR into several PRs so it's easy to discuss some parts of it? For example IO::Sized is never used outside its specs, so probably doesn't belong here. Then another PR for just IO::Delimited would be good. And once we accept those, we can discuss this PR, which should be much smaller.

What do you think?

@asterite
Copy link
Member

asterite commented Jul 9, 2016

Oh, and a separate PR for bin/crystal, it's not clear to me why it's needed, so it should be discussed separately.

@RX14
Copy link
Contributor Author

RX14 commented Jul 9, 2016

Ok, PR has been split into 4.

@RX14 RX14 force-pushed the feature/multipart branch 3 times, most recently from b3bce73 to ab82e2a Compare July 9, 2016 21:14
@RX14 RX14 force-pushed the feature/multipart branch 3 times, most recently from b0fb06c to 3a1367c Compare July 14, 2016 20:58
@RX14
Copy link
Contributor Author

RX14 commented Jul 15, 2016

This also needs HTTP::Request's body to be an IO to actually be useful. What issues/problems are we going to face doing that? Seems to be a bit of a pain to do from the way HTTP::Request works at the moment.

@sdogruyol
Copy link
Member

Up! What's the status on this please 🙏

@RX14
Copy link
Contributor Author

RX14 commented Jul 30, 2016

Just have to document and finish the specs. Nearly there!

@RX14 RX14 force-pushed the feature/multipart branch 3 times, most recently from 8df2e45 to eaa153f Compare July 30, 2016 23:32
@asterite
Copy link
Member

asterite commented Aug 4, 2016

I wouldn't expose the IO, I would make the parser an IO that's always positioned at some multipart part. So in the beginning it would probably start in the epilogue (or maybe this should be automatically skipped, I don't know), then you read until you don't have any more data. Then you call next_part and the IO now points to the next part, and so on. So you can't really keep a reference to some part, you are always working with the entire multipart body positioned at some point.

@RX14
Copy link
Contributor Author

RX14 commented Aug 4, 2016

@asterite the Multipart::Parser doesn't do callbacks, it just passes the io into a block to try to show the lifetime of the IO::Delimited clearly. FormData::Parser does do callbacks because I just couldn't find a better way to do it while keeping it clean and not buffering the whole formdata in memory.

I also think that having the parser be an IO and a parser is a bit messy, and tries to combine two seperate concerns into one object. I believe the current method is cleaner and more intuitive.

@asterite
Copy link
Member

asterite commented Aug 4, 2016

@RX14 Oh, now I see only the form data has callbacks... I'm still not sure about that, I'll probably need to review this with @waj at some point (and I'll make a big effort to do this before 0.19.0)

@jhass jhass added this to the 0.19.0 milestone Aug 4, 2016
@sdogruyol
Copy link
Member

Up up up 👍

@sdogruyol
Copy link
Member

@asterite aren't you gonna release this with 0.19.0 😢

@RX14
Copy link
Contributor Author

RX14 commented Sep 1, 2016

@sdogruyol why wouldn't it? It's still on the 0.19.0 milestone.

@asterite
Copy link
Member

asterite commented Sep 1, 2016

I'm removing the 0.19.0 milestone. I won't merge this as long as it uses callbacks. Programming against callbacks requires a different way of thinking and organizing code than in regular sequential code (or pull parsers).

@asterite asterite removed this from the 0.19.0 milestone Sep 1, 2016
@asterite
Copy link
Member

asterite commented Sep 1, 2016

@sdogruyol It's not a big deal, one can create a shard with this code and use it until we have this functionality in the std

@sdogruyol
Copy link
Member

@asterite actually it's not that easy to create an efficient and correct multipart parser since it's kinda hard to deal with(i tried at least 3 times and failed).

I was actually waiting on this to release the next version of Kemal 😞

@RX14
Copy link
Contributor Author

RX14 commented Sep 1, 2016

I'm not quite sure how the callbacks in this PR are semantically different to the server callbacks in HTTP::Server etc.

For example, see these two examples:

HTTP::Server.listen(8080) do |context|
  context.response.content_type = "text/plain"
  context.response.print "Hello world, got #{context.request.path}!"
end
HTTP::FormData.parse(request) do |parser|
  parser.field("field1") do |data|
    data # => "field data"
  end
end

Both use blocks (callbacks). Both of which are executed in response to an event (HTTP request/FormData section parse). Both blocks are executed before the call returns. Both accept a raw .new form which requires an extra call to start listening, or parsing. As far as I can see the only difference is that the multipart parser allows multiple callbacks (which Benchmark.bm has too, in the same style as this PR even). Where is the line between a do block and a callback anyway? When it's stored for later execution? Both examples do that.

Even if you could draw a line between this use of callbacks and all the others in the standard library (which is every single do block if you take the wikipedia definition), then I still wouldn't reject this PR just because it has a different style of programming. You use the correct tools for the correct jobs, and I have spent a lot of time thinking about this PR, and I think callbacks are the correct tool in this case.

@jwoertink
Copy link
Contributor

Just out of curiosity on this subject, couldn't this work similar to how File.open stuff works? (at least in ruby since I haven't done it in crystal yet). Maybe there could be advantages/disadvantages to each style like callback style could be executed immediately, and object style could be lazy evaluated (similar to Rails Model.where()).

# As an object
f = File.open('filename.txt', 'w+')
# ... do stuff

# As a callback
File.open('filename.txt', 'w+') do |f|
  # ... do stuff
end

@ozra
Copy link
Contributor

ozra commented Sep 2, 2016

Seems like "iterating" over the fields would be best? They do come in sequentially, and so it should be possible to stream the data as it arrives? (I was hit by this when customers started uploading GB's of video to their presentations pages in a system we wrote in node.js. After changing to continually parsing and handling the multipart there was no problem uploading bluerays through it.)

@RX14
Copy link
Contributor Author

RX14 commented Sep 3, 2016

@asterite Would it be possible to get this merged in 0.19.1? Can you reply to my comment?

@asterite
Copy link
Member

asterite commented Sep 3, 2016

@RX14 No, sorry. This PR is huge for me to digest and I wouldn't have used callbacks for this.

I'd suggest to move this to a shard and use that. If with time it turns out to work well, and I or someone else didn't make an alternative implementation in the standard library, we can include it.

@asterite
Copy link
Member

asterite commented Sep 3, 2016

@RX14 The main issue I have with callback-based parsers is that you can't easily make a pull-parser out of them. But you can make it the other way around. Also, callbacks will probably involve closures, so heap memory. So pull parsers are faster and more flexible.

@RX14
Copy link
Contributor Author

RX14 commented Sep 3, 2016

@asterite Currently the Multipart part of the PR is a pull parser, I think we should merge that at least. The FormData part is a thin wrapper around the Multipart pull parser. If I separate a pull parser out of the FormData parser (which I think is kind of ugly), would you merge the PR if it still retained the callback parser?

@RX14
Copy link
Contributor Author

RX14 commented Sep 3, 2016

@asterite My problem with a pull parser for FormData is that you often have a large number of fields to your form and you end up having a massive switch statement, where the string fields are kind of ugly using gets_to_end. So I would advocate including both types of parsers for practical reasons.

@RX14 RX14 force-pushed the feature/multipart branch 2 times, most recently from 3f717be to ec774e5 Compare September 3, 2016 16:35
@RX14
Copy link
Contributor Author

RX14 commented Sep 3, 2016

This PR is now solely for the raw multipart/mixed part of the PR. Seperate PRs for a pull-based multipart/form-data and a block-based multipart/form-data parser are incoming.

@RX14
Copy link
Contributor Author

RX14 commented Oct 1, 2016

This PR and #3243 are now in a shard: https://github.com/RX14/multipart.cr

@RX14 RX14 closed this Dec 18, 2016
@RX14
Copy link
Contributor Author

RX14 commented Dec 18, 2016

@asterite the shard has been in production on a site I develop, handing uploads with no complaints for a month. I'm pretty sure it's stable. Is it worth attempting to get this merged again?

@asterite
Copy link
Member

@RX14 Sure! Please send another PR, we can probably merge it. We can always improve things just before 1.0, if we think it's worth it (your shard implementation is probably good enough ^_^)

@sdogruyol
Copy link
Member

@asterite i second @RX14. It's been performing great at production 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants