Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(request): Decouple form and URL param parsing #493

Closed
wants to merge 2 commits into from
Closed

fix(request): Decouple form and URL param parsing #493

wants to merge 2 commits into from

Conversation

necaris
Copy link

@necaris necaris commented Apr 14, 2015

Separate form and URL parameter parsing into request.param and
request.form_param. Introduces new APIs for accessing the data.

Also, do not consume the POST stream unless the data is requested,
and make the raw POST data available on the request.

Closes #418

@lwcolton
Copy link
Contributor

This PR doubles memory usage in the case where the form data is parsed. This happens due to storing the raw body and also the parsed form parameters, where previously the raw body was discarded when form parameters were parsed. The behavior to store the raw body when parsing should be optional.

@necaris
Copy link
Author

necaris commented Apr 14, 2015

@lwcolton Good point, I'll put that into a RequestOption

@necaris
Copy link
Author

necaris commented Apr 14, 2015

Also, it's worth noting that falcon-bench on my machine is about 5% slower on this branch :-(

@necaris
Copy link
Author

necaris commented Apr 18, 2015

Option added as above and rebased onto upstream master

@necaris
Copy link
Author

necaris commented Apr 18, 2015

@falconry/owners Are multiple commits in a PR OK, or should I squash this into one commit?

@lwcolton
Copy link
Contributor

They will ask you to squash please and thank you

On Sat, Apr 18, 2015, 3:57 PM Rami Chowdhury notifications@github.com
wrote:

@falconry/owners https://github.com/orgs/falconry/teams/owners Are
multiple commits in a PR OK, or should I squash this into one commit?


Reply to this email directly or view it on GitHub
#493 (comment).

@lwcolton
Copy link
Contributor

Also make sure to follow contribution guidelines when putting together your
commit message it needs to have a specific format. There is a
contributing.MD file in the root of the repo with that info

On Sat, Apr 18, 2015, 4:23 PM Colton Leekley-Winslow lwcolton@gmail.com
wrote:

They will ask you to squash please and thank you

On Sat, Apr 18, 2015, 3:57 PM Rami Chowdhury notifications@github.com
wrote:

@falconry/owners https://github.com/orgs/falconry/teams/owners Are
multiple commits in a PR OK, or should I squash this into one commit?


Reply to this email directly or view it on GitHub
#493 (comment).

@necaris
Copy link
Author

necaris commented Apr 19, 2015

Thanks @lwcolton -- I think I've complied with most of the requirements, although I'm not sure about where to put the docstrings for the get_param* methods on the Request. Since the logic for these has been moved to request_helpers.ParamProxy it makes sense to keep the docstring close to the implementation, but since Request.get_param* forms part of the public API perhaps the docstring should be left on Request?

@lwcolton
Copy link
Contributor

Because it is part of the public API I personally would leave the docstring
on Request

On Sat, Apr 18, 2015 at 10:28 PM, Rami Chowdhury notifications@github.com
wrote:

Thanks @lwcolton https://github.com/lwcolton -- I think I've complied
with most of the requirements, although I'm not sure about where to put the
docstrings for the get_param* methods on the Request. Since the logic for
these has been moved to request_helpers.ParamProxy it makes sense to keep
the docstring close to the implementation, but since Request.get_param*
forms part of the public API perhaps the docstring should be left on
Request?


Reply to this email directly or view it on GitHub
#493 (comment).

Sincerely,

Colton Leekley-Winslow
651-242-9559
lwcolton@gmail.com

values. Behaves exactly as the ``params`` attribute.

raw_body (bytes): The raw POST data, if any. Note that Falcon will not
save the raw body unless the `save_raw_body` option is set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to explain why it defaults to not buffering/saving the body.

@necaris
Copy link
Author

necaris commented Sep 30, 2015

@kgriffs Thanks for the comments! I've made some changes to address your concerns but noticed that the tests are failing after I rebased onto the most recent master -- I'll address these and add a get_as_date method to update the PR to the current API.

return self._param

@property
def raw_body(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the most surprising thing with Falcon's form parsing has always been that it consumes req.stream and you can't go back to read it again. To mitigate this problem, we could say "hey, go and use this raw_body attribute over here in that case; otherwise maybe use req.stream as usual?" But I'm concerned about that leading to more confusion about when to use what, and generally making apps harder to reason about.

I understand that adding a more general raw_body feature may be useful in some cases, yet I still hesitate to do this without better understanding the use cases from the community (and how common they are). I want to be sure we don't encourage anti-patterns, and that we can provide something more elegant than using a middleware component or hook (which is how you would otherwise accomplish this.) All things considered, I think it would be better to split the general case out into a separate issue/PR from the one currently under discussion.

For now, as an alternative, what do you think about replacing req.stream with a file-like object that has been seeked back to the beginning, and simply wraps an instance of bytes that was read in the course of parsing the form body? This would only happen the first time a form param was accessed (thus triggering the consumption and parsing of the original req.stream). Since url-encoded form bodies are generally of reasonable size, this shouldn't cause a significant degradation in memory efficiency. We would have to make sure that the mock stream behaves similarly to the ones provided by, e.g., mod_wsgi, Gunicorn and uWSGI.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting alternative -- I think it could work pretty well. I agree that it's less than optimal to add more stuff to the API surface if it doesn't serve a very clear use case; I wonder if it could be as simple as using a BytesIO?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might do it. I just tested on gunicorn under 2.7 and 3.4 and stream.read() returns bytes, and b'' when the stream is at EOF. wsgiref is similar. You may want to check mod_wsgi and uWSGI as well.

@kgriffs
Copy link
Member

kgriffs commented Oct 16, 2015

@necaris Will you have some time to work on this during the next week or two?

necaris and others added 2 commits October 17, 2015 09:28
Separate form and URL parameter parsing into `request.param` and
`request.form_param`. Also, do not consume the POST stream unless
the data is requested, and make the raw POST data available on the
request if the `store_raw_body` option is set.

Closes #418
if a query parameter is assigned a comma-separated list of
values (e.g., 'foo=a,b,c'), only one of those values will be
returned, and it is undefined which one. Use
`req.get_param_as_list()` to retrieve all the values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

req.params.get_as_list() now :)

@yohanboniface
Copy link
Contributor

As we are using lazy proxy objects, here is a suggestion:

  • use req.query for query string parameters
  • use req.form for form like parameters
  • keep req.params as a merge of both: this will not break the current API plus keep the convenient shortcut.

And let's keep in mind that we'll need req.files at some point.

body,
keep_blank_qs_values=self.options.keep_blank_qs_values,
)

self._params.update(extra_params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since self._params no longer contains form params, we will need to modify req.params to avoid a breaking change.

@philiptzou
Copy link
Contributor

Perhaps we should allow customization and extension of ParamProxy class. For example I may want to use voluptuous to validate the input body. Also the content-type can be others for example "multipart/form-data" or "application/json" but not only the "application/x-www-form-urlencoded".

class Request(object):

    param_proxy_class = helpers.URLEncodedParamProxy
    # so we can easily custom the param proxy by inheriting from AbstractParamProxy class
    # which is neutral to any serialization formats and any validators

    ...

    @property
    def form(self):
        if self._form is None:
            # let param_proxy to handle the unserialization of raw_body but not in Request
            self._form = self.param_proxy_class(self, self.raw_body)
        return self._form

    def set_param_proxy_class(self, klass):
        # can also change param_proxy class by instance
        self.param_proxy_class = klass

    ...

So why not we decouple _parse_form_urlencoded from Request too?

@philiptzou
Copy link
Contributor

@necaris Would you still like to follow up the comments of this pull-request? Or I can take over of it since I am interested in and if you don't mind :)

@necaris
Copy link
Author

necaris commented Nov 24, 2015

@philiptzou I don't have a clear plan to implement all the API changes requested recently so I'd be just fine with you taking over this branch 👍

@philiptzou
Copy link
Contributor

@necaris I now understand why you don't have a clear plan... wow

I'm now consider to implement ParamProxy as a general lazy proxy but not only support for dictionary API. Since I want to allow future potential that JSON or XML can be parsed too.

@kgriffs
Copy link
Member

kgriffs commented Dec 11, 2015

@philiptzou This may not not be the best place to create an extension method for supporting arbitrary body media types. It isn't really a "form" at that point in the sense of the HTML-centric intent of this interface. I'd like to see how this would compare and contrast with parsing body documents via middleware methods or hooks.

For now I like @yohanboniface's suggestion of adding req.query and req.form alongside req.params. req.form and req.params would only consume the request body stream the first time either is called. Once consumed, the body would be buffered into a BytesIO instance which would then be returned by req.stream in lieu of the original req.env['wsgi.input'].

@philiptzou Were you still interested in taking this on? In preparation for the imminent Falcon 1.0, I'm considering putting up a PR to just add the BytesIO buffer, since that could be considered a subtle breaking change. Meanwhile, you could work on adding the additional, non-breaking req.query and req.form attributes in a 1.1 or something (assuming semantic versioning). Finally, in Falcon 2.0 we could consider removing req.params.

@philiptzou
Copy link
Contributor

@kgriffs: Actually I have an unfinished BytesIO reversion at now. I think I'm just too caution to minimize resources usage and performance impact. I mean if you want to use plain BytesIO it will always double the memory usage as @lwcolton worried (#493 (comment)). And we'll also lose the advantages of stream reading (had to wait all request body arrived).

Alternatively, We can implement a compatible interface of BytesIO which proxying the original file-like object from WSGI. The advantages is we can still do stream reading, but the proxy will affect performance and there are a lot code to be written since we are going to implement an interface.

I have not decided yet which one should be implemented. And I don't think I have a clear idea. That's the reason why I need some more discussion but I'm still interested in taking this on.

@kgriffs
Copy link
Member

kgriffs commented Dec 21, 2015

I can understand the trepidation, and I don't see a problem with making the buffering optional via RequestOptions. However, I suspect that most application/x-www-form-urlencoded strings are not very long, and so would not incur a significant memory overhead when buffered. Let's suppose that the lengths of form bodies in a given app average 256 bytes. Even if one of my web heads is serving an extremely large number of simultaneous requests/sec, say 10,000, the app would only use about 2 mb of RAM.

Likewise, with such small bodies, I don't think streaming is going to be much of a concern. Since the body will usually fit in one or two TCP packets, all the data will already have arrived by the time the app attempts its first read.

Regardless, I've been thinking that we need to step back and re-examine our assumptions. I'm still not sure that I completely understand why people have been wanting to re-parse form bodies. Here are a few options:

  • To deal with collisions between field names in the query string and in the form. But this doesn't make a lot of sense, since form params are parsed after the query string, meaning the app would need to re-parse the raw query string to get the clobbered value, not the request body.
  • To work around some problem with the way Falcon parses application/x-www-form-urlencoded. Ideally this should be addressed by fixing the problem directly.

Whatever the reason, if the request body were only parsed on-demand (i.e., when the hypothetical req.form is first called), then at least the application would have some control over whether or not (or when) the body is consumed.

With that in mind, it may be better to just go ahead with the plan to disentangle the interfaces for form and query string parsing. Parse params just-in-time (the first time requested), and don't replace req.stream with an instance of BytesIO or similar. If the community feedback still indicates a need for buffering the body, we can cross that bridge when we come to it.

@kgriffs
Copy link
Member

kgriffs commented Feb 24, 2016

Another thought on this... perhaps we can provide robust HTML form handling via a standalone add-on (e.g., "falcon-forms"), vs. being part of the core framework. We could add an option to disable parsing form params as currently implemented, in the case that a developer would like to handle it in a different way.

@kgriffs
Copy link
Member

kgriffs commented May 10, 2016

Update: I plan on discussing this during the PyCon sprints. I'd like to settle on a design direction and get this finished up.

@kgriffs
Copy link
Member

kgriffs commented May 10, 2016

@lwcolton
Copy link
Contributor

@kgriffs You previously commented that you liked the idea of lazy-access request attributes for form data, so that on first access the request stream is consumed, used to populate the form data and files request attributes, and not saved or wrapped. This is what I have seen in other frameworks and allows a developer to consume the stream themselves if they wish. Would that be acceptable here? Would love to help this finally get merged

@kgriffs
Copy link
Member

kgriffs commented Nov 23, 2016

@lwcolton Cool, thanks for following up on this. I've also been thinking about this and finally sat down today to write up a proposal. I'd love to get your thoughts on what I just posted to #418. I'm going to close this PR since it is so old. If this is something you'd like to work on, let me know. Otherwise we'll see if anyone else has some bandwidth to take and run with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Decouple parsing of form fields and URL params
5 participants