Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework Content-Disposition parsing totally #461

Merged
merged 5 commits into from Aug 13, 2018
Merged

Conversation

Gowee
Copy link
Contributor

@Gowee Gowee commented Aug 12, 2018

This PR is going to solve several bugs that exist in content_disposition.rs and improve the ergonomics of extracting data out of Content-Disposition header with actix-web.

The previous implementation:

  • Cannot parse Content-Disposition if it contains special characters like " (double quote) and ; (semicolon) due to its implementation limits;
  • Cannot parse Content-Disposition if it contains parameter values in UTF-8 as quoted-string (e.g. form-data; name=upload; filename="文件.webp"). UTF-8 in quoted-string is not compliant with old RFCs, however:
    1. RFC 7578 states clearly that:

    Some commonly deployed systems use multipart/form-data with file names directly encoded including octets outside the US-ASCII range. The encoding used for the file names is typically UTF-8, although HTML forms will use the charset associated with the form.

    1. Mainstream browsers like Chrome and Firefox (Gecko) follow this practice. So the previous implementation cannot parse Content-Disposition at all from HTML forms submitted using Chrome and Firefox if non-ASCII characters are present in file name.
    2. Generate Content-Disposition header without escaping some characters, resulting in unexpected file name truncation. For example, if a file called quote"here.zip is served with fs::StaticFiles, then browsers will see it as quote with all characters after the double quote stripped.
    3. Other non-significant incompliance and problems.

Besides solving these problems listed, this commit also:

  • Tries to be compatible, even though not strictly compliant, with RFC6266 (Content-Disposition as response header used to indicate how to open files and file names to be saved as) and [RFC7578] (Content-Disposition generated for submitting HTML forms of multipart/form-data).
  • Should be able to parse all Content-Disposition generated by implementations compliant with RFC6266 and RFC7578 under most circumstances. (But, it could also possibly accept some malformed headers.)
  • Adds plenty of test cases to ensure it works as expected.
  • Tweaks related struct/enum to improve ergonomics and ensure conformance (e.g. the previous implementation treat filename and filename* equally, and users cannot distinguish between them, which is not wanted in some cases).
  • Adds a lot of useful methods to existing struct/enum to allow easier data extracting. For example:
    https://github.com/Gowee/actix-web/blob/c4e29eb1dc0bb3339207f0ca22cbcca48bb46012/src/header/common/content_disposition.rs#L240

@fafhrd91
Copy link
Member

This is great!

@DoumanAsh could you review this pr.

let (param_name, new_left) = split_once_and_trim(left, '=');
if param_name.len() == 0 {
return Err(::error::ParseError::Header);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check for new_left != 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but not necessary. Line 294 header::parse_extended_value will error out if ext_value is empty, which includes cases where new_left.len() == 0. The nested if block starting at L305 falls into the branch starting at L334, if new_left is empty, which accepts empty values. But check in L342 will ensure it error out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if new_left.len() == 0 it wouldn't fall into the line 289 branch.
I just feel like it would be better to leave earlier in bad case instead of checking it out.
Though I'm not sure should we just ignore such parameter or error....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't? Whether falling into L289 only dependent on the presence of trailing asterisk of param_name.

Fine, I will add that check in a new commit.

I am also not sure and fail to find related statements in those standards or elsewhere. But I do not think it is really a problem in practice because that is generated by browsers and does not rely on end-users input. Besides, what the lenient parsing implementation wants to do is just ensuring whatever directly submitted by end-users (typically, file names) are parsed correctly as long as browsers produce headers correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think that the current version is ok as it is unlikely to happen then leave it as it is 😄

left = new_left;
if param_name.ends_with('*') {
// extended parameters
let (param_name, _) = param_name.split_at(param_name.len() - 1); // trim asterisk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why split_at instead of slicing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I just forget to use...

escaping = false;
quoted_string.push(c);
} else {
if c == 0x5c // backslash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it would be better to use char literals here instead of raw byte?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the following comments.

let mut quoted_string = vec![];
let mut end = None;
// search for closing quote
for (i, &c) in left.as_bytes().iter().skip(1).enumerate() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use reference for &c?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To dereference &u8 into u8? I think as_bytes().iter() produces &u8 instead of u8, am I right?
I am not sure how to write that idiomatically in Rust.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, I see.
If I'm not mistaken though you can omit & here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After change &c to c in line 311:

error[E0277]: can't compare `&u8` with `{integer}`
   --> src/header/common/content_disposition.rs:316:34
    |
316 |                             if c == 0x5c // backslash
    |                                  ^^ no implementation for `&u8 == {integer}`
    |
    = help: the trait `std::cmp::PartialEq<{integer}>` is not implemented for `&u8`

error[E0277]: can't compare `&u8` with `{integer}`
   --> src/header/common/content_disposition.rs:319:41
    |
319 |                             } else if c == 0x22 // double quote
    |                                         ^^ no implementation for `&u8 == {integer}`
    |
    = help: the trait `std::cmp::PartialEq<{integer}>` is not implemented for `&u8`

error[E0308]: mismatched types
   --> src/header/common/content_disposition.rs:331:59
    |
331 |                     let quoted_string = String::from_utf8(quoted_string)
    |                                                           ^^^^^^^^^^^^^ expected u8, found &u8
    |
    = note: expected type `std::vec::Vec<u8>`
               found type `std::vec::Vec<&u8>`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I was sure Rust would copy it as it is, but it seems you went for right syntax originally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I am often messed up by those rules/implications too.

let mut quoted_string = vec![];
let mut end = None;
// search for closing quote
for (i, &c) in left.as_bytes().iter().skip(1).enumerate() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe use .chars() here?
This way you could just collect it into string, instead of going through parsing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I attempt at first, and then I find that the following split_at panics. Then I get to know Char, which is produced by the .chars() iterator, is different from u8 in Rust. But split_at takes byte offset instead of Char offset. Seems that this is also not idiomatic... Any suggestions to improve?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see... Yes, in this case chars are nasty as they are themself four bytes and doesn't represent actual character's byte len

}
left = left.split_at(end.ok_or(::error::ParseError::Header)? + 1).1;
left = split_once(left, ';').1.trim_left();
// In fact, tt should not be Err if the above code is correct.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tt -> it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix it

Copy link
Contributor

@DoumanAsh DoumanAsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite a work you did! Thank you.

I'm feeling that it would be better to use full fledged parser for Content-Disposition but since our parsing is quite lenient I suppose it is not really necessary.

One thing I'd like to ask:
How should we handle cases when param's value is missing?
I'd like to add such tests to be sure that we would react correctly

@Gowee
Copy link
Contributor Author

Gowee commented Aug 13, 2018

@DoumanAsh
I agree. It would be more robust if a rigorous parser is used. But it will introduce new dependencies (if not writing a parser from the scratch). I am not sure what is your opinions on that. Besides, the new implementation should handle most cases and run a little faster.

Thanks for your patient review!

I assume cases where param's value is missing will only appear in token (e.g. filename=sample.png; L335) other than quoted-string (filename="sample.pdf") and ext-value (filename*=UTF-8''%c2%a3%20and%20%e2%82%ac%20rates''). Because the latter both contains boundary characters and certainly will/should not be accepted if value is totally missing.
In token, for param's value missing (e.g. attachment; filename= or form-data; name=; filename=foobar), it now just error out (see my first comment). I look through related statements in RFC7578 (-> RFC 2183 Section 2 -> RFC2045 Section 5.1) and RFC6266 (-> RFC2616 Section 3.2). They both require at least one character is present. So It should be syntax error if the value is empty as meets the current behavior. Negative test cases are welcome.

@DoumanAsh
Copy link
Contributor

@Gowee
Yeah the implementation is already good as it is, 😄

I see that code covers the case, but I didn't notice unit tests for that.
For example could you add test to verify handling of attachment; filename=?

@Gowee
Copy link
Contributor Author

Gowee commented Aug 13, 2018

Several test cases are added and tweaks have been made accordingly. : )

Copy link
Contributor

@DoumanAsh DoumanAsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two minor comments, but overall it is good :)


impl<'a> From<&'a str> for DispositionType {
fn from(origin: &'a str) -> DispositionType {
match origin.as_bytes().to_ascii_lowercase().as_slice() {
Copy link
Contributor

@DoumanAsh DoumanAsh Aug 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it makes it more elegant, but I think it would be better to use non-allocating comparison using eq_ignore_ascii_case

};
}
cd.parameters
.push(match param_name.to_ascii_lowercase().as_str() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use eq_ignore_ascii_case? Know it is not so elegant but even more effectiveness 😄

Copy link
Contributor Author

@Gowee Gowee Aug 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. but I am not sure what is the idiomatic way to write that. is a long if block fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the only way sadly, you may wish to move if block outside of cd.parameters.push for a better read-ability


impl fmt::Display for DispositionParam {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
// TODO: More special charaters in filename should be escaped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw we already use regex for example, so if you think we'd need some more characters you could add it here too using regex crate I guess.

@DoumanAsh
Copy link
Contributor

@Gowee Could you please add changes with mention of your more robust parser for Content-Disposition

@DoumanAsh
Copy link
Contributor

@fafhrd91 The commit makes breaking changes to DispositionParam, what do you think about it? Is it too early for such change?

@codecov
Copy link

codecov bot commented Aug 13, 2018

Codecov Report

Merging #461 into master will increase coverage by 0.29%.
The diff coverage is 85.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #461      +/-   ##
==========================================
+ Coverage   74.05%   74.34%   +0.29%     
==========================================
  Files         107      107              
  Lines       16790    17050     +260     
==========================================
+ Hits        12433    12675     +242     
- Misses       4357     4375      +18
Impacted Files Coverage Δ
src/header/mod.rs 88.59% <ø> (-2.02%) ⬇️
src/multipart.rs 62.36% <100%> (-0.11%) ⬇️
src/fs.rs 85.94% <100%> (-0.1%) ⬇️
src/header/common/content_disposition.rs 83.46% <85.14%> (+3.29%) ⬆️
src/server/h1.rs 78.87% <0%> (+1.23%) ⬆️
src/server/h1writer.rs 80.55% <0%> (+1.66%) ⬆️
src/pipeline.rs 64.2% <0%> (+3.57%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf7779a...59ba1c8. Read the comment docs.

…Dispostion display in content_disposition.rs
Copy link
Contributor

@DoumanAsh DoumanAsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor comment. We also need to consider breaking changes, if @fafhrd91 says it is a bit earlier, we might need to make DispositionParam without breaking changes.

@@ -467,17 +475,23 @@ impl fmt::Display for DispositionType {

impl fmt::Display for DispositionParam {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
// TODO: More special charaters in filename should be escaped.
// All ASCII control charaters (0-30, 127) excepting horizental tab, double quote, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

horizental -> horizontal

@Gowee
Copy link
Contributor Author

Gowee commented Aug 13, 2018

f036c27
59ba1c8

  • Use regex to more rigorously escape characters in values like filename as suggested by @DoumanAsh.
    While there is still a problem inherited from the previous code as well as the http crate: Some unprintable ASCII characters are allowed to be used directly in Content-Disposition as long as they are enclosed inside the quote pair and escaped. But all HeaderValue::from_... methods, including the one actix-web use to create CD header, do not accept characters in ASCII 0x1-0x31 and 0x127, and hence serving files whose name contains those special characters using actix-web with Content-Disposition set will fail. I have no idea how to solve this problem appropriately in view of the fact that Content-Disposition is so different than other headers. Anyway, it is not a significant problem because I suppose people do not often serve files with unprintable characters.
  • All code using to_ascii_lowercase should have been modified to use eq_ignore_ascii_case if I did not miss some.
  • Typo fixed.

@fafhrd91
Copy link
Member

I think breaking change is fine for this

@Gowee
Copy link
Contributor Author

Gowee commented Aug 13, 2018

If necessary, could you please help me write CHANGES.md instead? I might fall into a sleep soon 😴 and be a little busy tomorrow.

@DoumanAsh
Copy link
Contributor

@Gowee ahah, ok, I'll add changes myself. Thank you for your work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants