Rework Content-Disposition parsing totally #461

Gowee · 2018-08-12T15:47:45Z

This PR is going to solve several bugs that exist in content_disposition.rs and improve the ergonomics of extracting data out of Content-Disposition header with actix-web.

The previous implementation:

Cannot parse Content-Disposition if it contains special characters like " (double quote) and ; (semicolon) due to its implementation limits;
Cannot parse Content-Disposition if it contains parameter values in UTF-8 as quoted-string (e.g. form-data; name=upload; filename="文件.webp"). UTF-8 in quoted-string is not compliant with old RFCs, however:
1. RFC 7578 states clearly that:
Some commonly deployed systems use multipart/form-data with file names directly encoded including octets outside the US-ASCII range. The encoding used for the file names is typically UTF-8, although HTML forms will use the charset associated with the form.
1. Mainstream browsers like Chrome and Firefox (Gecko) follow this practice. So the previous implementation cannot parse Content-Disposition at all from HTML forms submitted using Chrome and Firefox if non-ASCII characters are present in file name.
2. Generate Content-Disposition header without escaping some characters, resulting in unexpected file name truncation. For example, if a file called quote"here.zip is served with fs::StaticFiles, then browsers will see it as quote with all characters after the double quote stripped.
3. Other non-significant incompliance and problems.

Besides solving these problems listed, this commit also:

Tries to be compatible, even though not strictly compliant, with RFC6266 (Content-Disposition as response header used to indicate how to open files and file names to be saved as) and [RFC7578] (Content-Disposition generated for submitting HTML forms of multipart/form-data).
Should be able to parse all Content-Disposition generated by implementations compliant with RFC6266 and RFC7578 under most circumstances. (But, it could also possibly accept some malformed headers.)
Adds plenty of test cases to ensure it works as expected.
Tweaks related struct/enum to improve ergonomics and ensure conformance (e.g. the previous implementation treat filename and filename* equally, and users cannot distinguish between them, which is not wanted in some cases).
Adds a lot of useful methods to existing struct/enum to allow easier data extracting. For example:
https://github.com/Gowee/actix-web/blob/c4e29eb1dc0bb3339207f0ca22cbcca48bb46012/src/header/common/content_disposition.rs#L240

fafhrd91 · 2018-08-12T18:11:26Z

This is great!

@DoumanAsh could you review this pr.

DoumanAsh · 2018-08-12T18:50:42Z

src/header/common/content_disposition.rs

+            let (param_name, new_left) = split_once_and_trim(left, '=');
+            if param_name.len() == 0 {
+                return Err(::error::ParseError::Header);
+            }


Should we check for new_left != 0?

Yes but not necessary. Line 294 header::parse_extended_value will error out if ext_value is empty, which includes cases where new_left.len() == 0. The nested if block starting at L305 falls into the branch starting at L334, if new_left is empty, which accepts empty values. But check in L342 will ensure it error out.

But if new_left.len() == 0 it wouldn't fall into the line 289 branch.
I just feel like it would be better to leave earlier in bad case instead of checking it out.
Though I'm not sure should we just ignore such parameter or error....

Why wouldn't? Whether falling into L289 only dependent on the presence of trailing asterisk of param_name.

Fine, I will add that check in a new commit.

I am also not sure and fail to find related statements in those standards or elsewhere. But I do not think it is really a problem in practice because that is generated by browsers and does not rely on end-users input. Besides, what the lenient parsing implementation wants to do is just ensuring whatever directly submitted by end-users (typically, file names) are parsed correctly as long as browsers produce headers correctly.

If you think that the current version is ok as it is unlikely to happen then leave it as it is 😄

DoumanAsh · 2018-08-12T18:51:25Z

src/header/common/content_disposition.rs

+            left = new_left;
+            if param_name.ends_with('*') {
+                // extended parameters
+                let (param_name, _) = param_name.split_at(param_name.len() - 1); // trim asterisk


why split_at instead of slicing?

Because I just forget to use...

DoumanAsh · 2018-08-12T18:55:12Z

src/header/common/content_disposition.rs

+                            escaping = false;
+                            quoted_string.push(c);
+                        } else {
+                            if c == 0x5c // backslash


maybe it would be better to use char literals here instead of raw byte?

See the following comments.

DoumanAsh · 2018-08-12T18:57:08Z

src/header/common/content_disposition.rs

+                    let mut quoted_string = vec![];
+                    let mut end = None;
+                    // search for closing quote
+                    for (i, &c) in left.as_bytes().iter().skip(1).enumerate() {


why use reference for &c?

To dereference &u8 into u8? I think as_bytes().iter() produces &u8 instead of u8, am I right?
I am not sure how to write that idiomatically in Rust.

Ah yeah, I see.
If I'm not mistaken though you can omit & here

After change &c to c in line 311:

error[E0277]: can't compare `&u8` with `{integer}` --> src/header/common/content_disposition.rs:316:34 | 316 | if c == 0x5c // backslash | ^^ no implementation for `&u8 == {integer}` | = help: the trait `std::cmp::PartialEq<{integer}>` is not implemented for `&u8` error[E0277]: can't compare `&u8` with `{integer}` --> src/header/common/content_disposition.rs:319:41 | 319 | } else if c == 0x22 // double quote | ^^ no implementation for `&u8 == {integer}` | = help: the trait `std::cmp::PartialEq<{integer}>` is not implemented for `&u8` error[E0308]: mismatched types --> src/header/common/content_disposition.rs:331:59 | 331 | let quoted_string = String::from_utf8(quoted_string) | ^^^^^^^^^^^^^ expected u8, found &u8 | = note: expected type `std::vec::Vec<u8>` found type `std::vec::Vec<&u8>`

Ah sorry, I was sure Rust would copy it as it is, but it seems you went for right syntax originally

Aha, I am often messed up by those rules/implications too.

DoumanAsh · 2018-08-12T18:59:29Z

src/header/common/content_disposition.rs

+                    let mut quoted_string = vec![];
+                    let mut end = None;
+                    // search for closing quote
+                    for (i, &c) in left.as_bytes().iter().skip(1).enumerate() {


Should we maybe use .chars() here?
This way you could just collect it into string, instead of going through parsing

That is what I attempt at first, and then I find that the following split_at panics. Then I get to know Char, which is produced by the .chars() iterator, is different from u8 in Rust. But split_at takes byte offset instead of Char offset. Seems that this is also not idiomatic... Any suggestions to improve?

Ah I see... Yes, in this case chars are nasty as they are themself four bytes and doesn't represent actual character's byte len

DoumanAsh · 2018-08-12T18:59:46Z

src/header/common/content_disposition.rs

+                    }
+                    left = left.split_at(end.ok_or(::error::ParseError::Header)? + 1).1;
+                    left = split_once(left, ';').1.trim_left();
+                    // In fact, tt should not be Err if the above code is correct.


tt -> it

DoumanAsh

This is quite a work you did! Thank you.

I'm feeling that it would be better to use full fledged parser for Content-Disposition but since our parsing is quite lenient I suppose it is not really necessary.

One thing I'd like to ask:
How should we handle cases when param's value is missing?
I'd like to add such tests to be sure that we would react correctly

Gowee · 2018-08-13T00:08:24Z

@DoumanAsh
I agree. It would be more robust if a rigorous parser is used. But it will introduce new dependencies (if not writing a parser from the scratch). I am not sure what is your opinions on that. Besides, the new implementation should handle most cases and run a little faster.

Thanks for your patient review!

I assume cases where param's value is missing will only appear in token (e.g. filename=sample.png; L335) other than quoted-string (filename="sample.pdf") and ext-value (filename*=UTF-8''%c2%a3%20and%20%e2%82%ac%20rates''). Because the latter both contains boundary characters and certainly will/should not be accepted if value is totally missing.
In token, for param's value missing (e.g. attachment; filename= or form-data; name=; filename=foobar), it now just error out (see my first comment). I look through related statements in RFC7578 (-> RFC 2183 Section 2 -> RFC2045 Section 5.1) and RFC6266 (-> RFC2616 Section 3.2). They both require at least one character is present. So It should be syntax error if the value is empty as meets the current behavior. Negative test cases are welcome.

DoumanAsh · 2018-08-13T04:32:09Z

@Gowee
Yeah the implementation is already good as it is, 😄

I see that code covers the case, but I didn't notice unit tests for that.
For example could you add test to verify handling of attachment; filename=?

Gowee · 2018-08-13T05:53:33Z

Several test cases are added and tweaks have been made accordingly. : )

DoumanAsh

There are only two minor comments, but overall it is good :)

DoumanAsh · 2018-08-13T06:00:19Z

src/header/common/content_disposition.rs

+
+impl<'a> From<&'a str> for DispositionType {
+    fn from(origin: &'a str) -> DispositionType {
+        match origin.as_bytes().to_ascii_lowercase().as_slice() {


I know it makes it more elegant, but I think it would be better to use non-allocating comparison using eq_ignore_ascii_case

DoumanAsh · 2018-08-13T06:03:23Z

src/header/common/content_disposition.rs

-                };
+                }
+                cd.parameters
+                    .push(match param_name.to_ascii_lowercase().as_str() {


Can you please use eq_ignore_ascii_case? Know it is not so elegant but even more effectiveness 😄

yes. but I am not sure what is the idiomatic way to write that. is a long if block fine?

Yes, that's the only way sadly, you may wish to move if block outside of cd.parameters.push for a better read-ability

DoumanAsh · 2018-08-13T06:05:25Z

src/header/common/content_disposition.rs

+
+impl fmt::Display for DispositionParam {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        // TODO: More special charaters in filename should be escaped.


Btw we already use regex for example, so if you think we'd need some more characters you could add it here too using regex crate I guess.

DoumanAsh · 2018-08-13T06:07:59Z

@Gowee Could you please add changes with mention of your more robust parser for Content-Disposition

DoumanAsh · 2018-08-13T06:09:22Z

@fafhrd91 The commit makes breaking changes to DispositionParam, what do you think about it? Is it too early for such change?

codecov · 2018-08-13T07:10:09Z

Codecov Report

Merging #461 into master will increase coverage by 0.29%.
The diff coverage is 85.26%.

@@            Coverage Diff             @@
##           master     #461      +/-   ##
==========================================
+ Coverage   74.05%   74.34%   +0.29%     
==========================================
  Files         107      107              
  Lines       16790    17050     +260     
==========================================
+ Hits        12433    12675     +242     
- Misses       4357     4375      +18

Impacted Files	Coverage Δ
src/header/mod.rs	`88.59% <ø> (-2.02%)`	⬇️
src/multipart.rs	`62.36% <100%> (-0.11%)`	⬇️
src/fs.rs	`85.94% <100%> (-0.1%)`	⬇️
src/header/common/content_disposition.rs	`83.46% <85.14%> (+3.29%)`	⬆️
src/server/h1.rs	`78.87% <0%> (+1.23%)`	⬆️
src/server/h1writer.rs	`80.55% <0%> (+1.66%)`	⬆️
src/pipeline.rs	`64.2% <0%> (+3.57%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf7779a...59ba1c8. Read the comment docs.

…Dispostion display in content_disposition.rs

DoumanAsh

Just minor comment. We also need to consider breaking changes, if @fafhrd91 says it is a bit earlier, we might need to make DispositionParam without breaking changes.

DoumanAsh · 2018-08-13T13:19:31Z

src/header/common/content_disposition.rs

@@ -467,17 +475,23 @@ impl fmt::Display for DispositionType {

 impl fmt::Display for DispositionParam {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
-        // TODO: More special charaters in filename should be escaped.
+        // All ASCII control charaters (0-30, 127) excepting horizental tab, double quote, and


horizental -> horizontal

Gowee · 2018-08-13T13:33:27Z

f036c27
59ba1c8

Use regex to more rigorously escape characters in values like filename as suggested by @DoumanAsh.
While there is still a problem inherited from the previous code as well as the http crate: Some unprintable ASCII characters are allowed to be used directly in Content-Disposition as long as they are enclosed inside the quote pair and escaped. But all HeaderValue::from_... methods, including the one actix-web use to create CD header, do not accept characters in ASCII 0x1-0x31 and 0x127, and hence serving files whose name contains those special characters using actix-web with Content-Disposition set will fail. I have no idea how to solve this problem appropriately in view of the fact that Content-Disposition is so different than other headers. Anyway, it is not a significant problem because I suppose people do not often serve files with unprintable characters.
All code using to_ascii_lowercase should have been modified to use eq_ignore_ascii_case if I did not miss some.
Typo fixed.

fafhrd91 · 2018-08-13T13:43:40Z

I think breaking change is fine for this

Gowee · 2018-08-13T14:02:59Z

If necessary, could you please help me write CHANGES.md instead? I might fall into a sleep soon 😴 and be a little busy tomorrow.

DoumanAsh · 2018-08-13T14:33:32Z

@Gowee ahah, ok, I'll add changes myself. Thank you for your work

Rework content_disposition.rs

c4e29eb

fafhrd91 requested a review from DoumanAsh August 12, 2018 18:10

DoumanAsh reviewed Aug 12, 2018

View reviewed changes

Tweak content_disposition.rs

6650ec0

Tweak content_disposition.rs again and add more negative test cases.

1d86e9c

DoumanAsh approved these changes Aug 13, 2018

View reviewed changes

Use eq_ignore_ascii_case to avoid mem allocation and improve Content-…

f036c27

…Dispostion display in content_disposition.rs

DoumanAsh reviewed Aug 13, 2018

View reviewed changes

Fix a typo in content_disposition.rs

59ba1c8

DoumanAsh merged commit d9c7cd9 into actix:master Aug 13, 2018

Gowee mentioned this pull request Nov 1, 2019

Fix escaping/encoding problems in Content-Disposition header #1151

Merged

Rework Content-Disposition parsing totally #461

Rework Content-Disposition parsing totally #461

Conversation

Gowee commented Aug 12, 2018

fafhrd91 commented Aug 12, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DoumanAsh left a comment

Choose a reason for hiding this comment

Gowee commented Aug 13, 2018 • edited

DoumanAsh commented Aug 13, 2018

Gowee commented Aug 13, 2018 • edited

DoumanAsh left a comment

Choose a reason for hiding this comment

DoumanAsh Aug 13, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Gowee Aug 13, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DoumanAsh commented Aug 13, 2018

DoumanAsh commented Aug 13, 2018

codecov bot commented Aug 13, 2018 • edited

Codecov Report

DoumanAsh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Gowee commented Aug 13, 2018 • edited

fafhrd91 commented Aug 13, 2018

Gowee commented Aug 13, 2018

DoumanAsh commented Aug 13, 2018

Gowee commented Aug 13, 2018 •

edited

Gowee commented Aug 13, 2018 •

edited

DoumanAsh Aug 13, 2018 •

edited

Gowee Aug 13, 2018 •

edited

codecov bot commented Aug 13, 2018 •

edited

Gowee commented Aug 13, 2018 •

edited