Skip to content

w4py3 can't properly process binary post data #14

@zarquon5

Description

@zarquon5

To whom it may concern,

If you post a request to w4py3 with a binary content type of application/octet-stream or, say, application/x-7z-compressed, w4py3 will always try to decode the binary payload, usually resulting in a traceback like this:

Traceback (most recent call last):
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\HTTPRequest.py", line 41, in __init__
    self._fields = FieldStorage.FieldStorage(
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\WebUtils\FieldStorage.py", line 212, in __init__
    self.read_single()
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\WebUtils\FieldStorage.py", line 409, in read_single
    self.read_binary()
  File "c:\ProgramData\conda\envs\myenv\lib\site-packages\webware\WebUtils\FieldStorage.py", line 434, in read_binary
    self.file.write(data.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 2: invalid start byte

This is because the FieldStorage code in w4py3, lifted from the deprecated cgi.py module, does an internal test on "whether or not the storage should be a binary file, viz:

            if self._binary_file:
                self.file.write(data)
            else:  # fix for issue 27777
                self.file.write(data.decode())     # <----- the problem line

However, unexpectedly, this self._binary_file flag is not set based on the Content type, but rather if and only if there is a Content-disposition header, and then only if a filename attribute in the options dictionary for that header (even though, as far as I can tell, that filename is otherwise ignored, viz:

        cdisp, pdict = '', {}
        if 'content-disposition' in self.headers:
            cdisp, pdict = parse_header(self.headers['content-disposition'])
            fslog.write(f'cdisp and pdict found in headers: {cdisp}, {pdict}\n')
        self.disposition = cdisp
        self.disposition_options = pdict
        self.name = None
        if 'name' in pdict:
            self.name = pdict['name']
        self.filename = None
        if 'filename' in pdict:
            self.filename = pdict['filename']
        self._binary_file = self.filename is not None

(I do not understand the reasoning behind this logic, as this header is normally associated with responses, rather than requests. But I digress...)

And, unfortunately, the HTTPRequest class does not pass headers into the FieldStorage constructor and the FieldStorage constructor does not parse this header out of the passed environment to create its own internal headers dictionary, viz:

        if headers is None:
            headers = {}
            if method == 'POST':
                headers['content-type'] = 'application/x-www-form-urlencoded'
            if 'CONTENT_TYPE' in environ:
                headers['content-type'] = environ['CONTENT_TYPE']
            if 'QUERY_STRING' in environ:
                self.qs_on_post = environ['QUERY_STRING']
            if 'CONTENT_LENGTH' in environ:
                headers['content-length'] = environ['CONTENT_LENGTH']

        else:
            if not (isinstance(headers, (Mapping, Message))):
                raise TypeError(
                    'headers must be mapping or an instance of'
                    ' email.message.Message')
        self.headers = headers

Because of the above logic, the _binary_file flag is therefore never set to true, and the upshot of this is that you can't post binary data to a w4py3 servlet like you could in w4py2, because read_binary() now always try to run data.decode() on it, which results in the above traceback for non-unicode POST payloads.

I'm not sure what the right fix here is, but my gut tells me that "binary" test should be based on the content-type, rather than the presence of the filename attribute in the content-disposition header; am I missing something?

One other thing: the exception handler for the FieldStorage constructor still assumes that the cgi package is importable:

            # Protect the loading of fields with an exception handler,
            # because bad headers sometimes can break the field storage
            # (see also https://bugs.python.org/issue27777).
            try:
                self._fields = FieldStorage.FieldStorage(
                    self._input, environ=self._environ,
                    keep_blank_values=True, strict_parsing=False)
            except Exception:
                self._fields = cgi.FieldStorage(keep_blank_values=True)
                traceback.print_exc(file=sys.stderr)

But, given that original cgi module is already deprecated and scheduled to go away in python 3.13, is there a plan to replace all of this the FieldStorage logic with the recommended (alas, non-core) "multipart" package?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions