Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly encode filenames in Content-Disposition headers. #2330

Merged
merged 7 commits into from Sep 11, 2017

Conversation

Projects
None yet
3 participants
@manthey
Copy link
Member

commented Sep 6, 2017

Prior to this, names that included quotes or unicode codepoints might produce broken Content-Dispositions.

Properly encode filenames in Content-Disposition headers.
Prior to this, names that included quotes or unicode codepoints might produce broken Content-Dispositions.
@brianhelba
Copy link
Member

left a comment

See some comments inline.

Also, there is a usage of setResponseHeader('Content-Disposition' in hashsum_download which should be updated to use this new function.

Finally, what do you think adding a special case to the top of setResponseHeader, whereby if the first parameter is 'Content-Disposition', then it will short-circuit to the new setContentDisposition function? This way, downstream and new code will automatically gain the benefits of this fix without having to explicitly remember to call it.

if not isinstance(quotedFilename, six.binary_type):
quotedFilename = quotedFilename.encode('ascii', 'ignore')
value += b'; filename*=UTF-8\'\'' + quotedFilename
if not isinstance(value, six.text_type):

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 7, 2017

Member

Why would value not be a six.text_type at this point?

This comment has been minimized.

Copy link
@manthey

manthey Sep 7, 2017

Author Member

I'll remove the check.

Content-Disposition header, but do not set it.
:returns: the content-disposition header value.
"""
if disposition is None:

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 7, 2017

Member

I think this is unnecessary complexity. The default can simply be 'attachment', and None should be an error (triggered implicitly when .encode) is called on it. The only time we need to use a default parameter value of None is when the value would be a mutable object.

def setContentDisposition(filename, disposition='attachment', setHeader=True):
"""
Set the content disposition header to either inline or attachment, and
specify a filename that is properly escaped.

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 7, 2017

Member

Can you include a reference to the RFC that this should confirm to?

Also, did you review the relevant RFC, and how well does this implementation conform to it?

This comment has been minimized.

Copy link
@manthey

manthey Sep 7, 2017

Author Member

As often the MDN page has a good summary of the issue and links to the relevant RFCs (see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition), and see https://tools.ietf.org/html/rfc2183 and https://tools.ietf.org/html/rfc6266.

I'll add a comment.

safeFilename = filename.encode('ascii', 'ignore').replace(b'"', b'')
encodedFilename = filename.encode('utf8', 'ignore')
value = disposition + b'; filename="' + safeFilename + b'"'
if safeFilename != encodedFilename:

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 7, 2017

Member

Flask's implementation of this seems a lot cleaner on the surface. Do you think it would be worthwhile to crib more of what they have (potentially utilizing the unicodedata library)?

This comment has been minimized.

Copy link
@manthey

manthey Sep 7, 2017

Author Member

Flask defers a bunch of the work to werkzeug. In werkzeug's Header() class sending a list of values is supported, which is then encoded by werkzeug, handling unicode and string/bytes appropriately. Multiple parameters are added in an arbitrary order (so in Flask, the filename* parameter might be before the filename parameter), which doesn't seem to affect any browser (see http://greenbytes.de/tech/tc2231/).

Flask uses the unicodedata library to decompose the unicode before making the fallback filename value, which is actually a clever way to de-accent characters where possible while converting to ISO-8859-1.

This comment has been minimized.

Copy link
@manthey

manthey Sep 7, 2017

Author Member

We'd have to modify setResponseHeader to accept a werkzeug-like dictionary of parameters to make our implementation look as 'clean' as Flask's.

@manthey

This comment has been minimized.

Copy link
Member Author

commented Sep 7, 2017

Adding a special case to setResponseHeader adds complexity unless we add a werkzeug-like ability to add a dictionary of values to the header. What would setResponseHeader do if sent 'Content-Disposition' and 'inline; filename=""? Would it try to extract out the improperly formed filename and reencode it? Something else? If we accepted setResponseHeader('Content-Disposition', 'inline', {'filename': }), then reencoding filenameand possibly addingfilename*` could make sense.

manthey added some commits Sep 7, 2017

Refactor based on feedback.
Add comments on RFC.  Use ISO8859-1 as default encoding.  Better escape quotes and slashes.  Better fallback values.  Use setContentDisposition in the hashsum_download plugin.

@manthey manthey force-pushed the content-disposition branch from 206d619 to 9129af2 Sep 7, 2017

manthey added some commits Sep 7, 2017

@brianhelba
Copy link
Member

left a comment

LGTM.

See one inline comment for an opportunity for future hardening of Girder tests. Let's report this as an issue if you agree.

I'll leave it up to you whether to implement my other comment about hardcoding the baseline string. If you decide to change that, I'll quickly reapprove the update.

@@ -781,6 +784,18 @@ def testFilesystemAssetstore(self):
copyTestFile = self._testUploadFile('helloWorld1.txt')
self._testCopyFile(copyTestFile)

# Test unicode filenames for content disposition. The test name has
# quotes, a Linear-B codepoint, Cyrllic, Arabic, Chinese, and an emoji.
filename = u'Unicode "sample" \U00010088 ' + \

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 8, 2017

Member

I think it'd be a good idea to put this test string somewhere general, and start using it as test data for any place that should be expected to handle general strings (which includes Unicode, quotes, etc.) gracefully.

There is some risk of "overfitting" to this one test string, but if we think of new corner cases that challenge string handling, we can just add them to the master string.

This comment has been minimized.

Copy link
@manthey

manthey Sep 8, 2017

Author Member

We probably should have a set of test strings and one master complex string like this one. I could see a pure right-to-left string giving different results than a mixed direction string, for instance.

u'\u0639\u064a\u0646\u0629 \u6a23\u54c1 \U0001f603'
file = self._testUploadFile(filename)
file = self.model('file').load(file['_id'], force=True)
testval = 'filename="%s"; filename*=UTF-8\'\'%s' % (

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 8, 2017

Member

Since this probably most useful as a regression test, maybe it's actually better to hardcode the result that we're expecting to be in filename*, so we implicitly test the full stack that setContentDisposition relies on. If you still want the more pure unit test, then you can still continue to dynamically compute the result, but compare it to the hardcoded string.

@manthey What do you think?

This comment has been minimized.

Copy link
@manthey

manthey Sep 8, 2017

Author Member

I'll change this to a hardcoded value (not today, though).

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 8, 2017

Member

My own trial (with just print(repr(...))) indicates that the values may be respectively:
'Unicode \\"sample\\" '
'Unicode%20%22sample%22%20%F0%90%82%88%20%D0%BE%D0%B1%D1%80%D0%B0%D0%B7%D0%B5%D1%86%20%D8%B9%D9%8A%D9%86%D8%A9%20%E6%A8%A3%E5%93%81%20%F0%9F%98%83'
but I'll trust however you want to encode them in this file.

"""
if (not disposition or (disposition not in ('inline', 'attachment') and
not disposition.startswith('form-data'))):
raise RestException('Error: Content-Disposition is not a recognized value.')

This comment has been minimized.

Copy link
@zachmullen

zachmullen Sep 11, 2017

Member

It's always nice in error messages like this to include the offending value in the message. :)

This comment has been minimized.

Copy link
@manthey

manthey Sep 11, 2017

Author Member

Done.

if not filename:
raise RestException('Error: Content-Disposition filename is empty.')
if not isinstance(disposition, six.binary_type):
disposition = disposition.encode('iso8859-1', 'ignore')

This comment has been minimized.

Copy link
@zachmullen

zachmullen Sep 11, 2017

Member

Why ISO 8859-1? I couldn't find mention of it in the standard mentioned in the docstring, is it mentioned somewhere further up the RFC chain?

This comment has been minimized.

Copy link
@zachmullen

zachmullen Sep 11, 2017

Member

Nevermind, I found it. For posterity

This comment has been minimized.

Copy link
@manthey

manthey Sep 11, 2017

Author Member

Added RFC 5987 link into the comments.

Show erroneous value in exception.
Reference one more RFC in comments.
@@ -271,7 +271,8 @@ def setContentDisposition(filename, disposition='attachment', setHeader=True):
"""
if (not disposition or (disposition not in ('inline', 'attachment') and
not disposition.startswith('form-data'))):
raise RestException('Error: Content-Disposition is not a recognized value.')
raise RestException(
'Error: Content-Disposition (%r) is not a recognized value.' % disposition)

This comment has been minimized.

Copy link
@brianhelba

brianhelba Sep 11, 2017

Member

The use of %r is definitely more appropriate here. We should remember to use this in place of %s when we display erroneous values throughout the code.

@manthey manthey merged commit e8eba6a into master Sep 11, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 90.16%)
Details
codecov/project Absolute coverage decreased by -0.02% but relative coverage increased by +9.83% compared to 0b0478f
Details

@manthey manthey deleted the content-disposition branch Sep 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.