Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support overwriting response's mime and charset #184

Merged
merged 7 commits into from
Nov 8, 2021
Merged

Conversation

ducaale
Copy link
Owner

@ducaale ducaale commented Oct 24, 2021

Ticks another item from #4

@ducaale ducaale changed the title Support overriding response's content-type and charset Support overwriting response's mime and charset Oct 24, 2021
@ducaale ducaale mentioned this pull request Oct 24, 2021
28 tasks
Copy link
Collaborator

@blyxxyz blyxxyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • HTTPie crashes (quickly) if it can't parse the encoding.
  • HTTPie recognizes some encoding names encoding_rs doesn't, notably utf16. (It has to be spelled utf-16.)

First one shouldn't be too hard, just change Args to response_charset: Option<Encoding> and do the parsing earlier.

The second one is tricky:

  • HTTPie defers to Python (this table), we defer to encoding-rs (this table).
  • encoding-rs doesn't support all the encodings Python does. That's too bad if you find a server that serves up UTF-7 but there's not much we can do about it.
  • Python normalizes the names. It seems to ignore case as well as most non-alphanumeric ASCII characters (like - and * and \x7F, but not . for some reason, and not unicode characters). So --response-charset=~~~~UtF////16@@ just works.
  • Python supports some aliases we don't (and vice versa, but that's less important). For example u8/u16 for UTF-8/UTF-16.

I think we can get quite far by first normalizing it by removing all ASCII non-alphanumeric characters except for _, - and : (lowercasing isn't necessary, encoding-rs does that) and then:

  1. Try it as-is.
  2. Check if it's one of the aliases u8/u16/u32/utf.
  3. Try it with all - and _ removed.
  4. Try it with all _ turned into -.
  5. Try it with all - turned into _.
  6. Try it with all - and _ removed and a - inserted before the first digit.

If all that fails, we can put a link to the table in the error message.

src/printer.rs Outdated Show resolved Hide resolved

match normalized_encoding.as_str() {
"u8" | "utf" => return Ok(encoding_rs::UTF_8),
"u16" => return Ok(encoding_rs::UTF_16LE),
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encoding_rs associates the label utf-16 with UTF_16LE but I am not sure if it is the same in Python.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think that encoding_rs supports utf-32

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my phone (ARM) it's little-endian:

>>> 'a'.encode('utf_16_be').decode('utf16')
'愀'
>>> 'a'.encode('utf_16_le').decode('utf16')
'a'

I wouldn't be surprised if it depended on the machine's architecture. x86 is also little-endian so we'd agree on most machines.
In any case, encoding-rs is made for the web, so if it disagrees with Python then it's probably Python which is wrong.

src/cli.rs Outdated
Comment on lines 1349 to 1361
fn parse_encoding_label() {
assert_eq!(
parse_encoding("~~~~UtF////16@@").unwrap(),
encoding_rs::UTF_16LE
);
assert_eq!(parse_encoding("utf_8").unwrap(), encoding_rs::UTF_8);
assert_eq!(parse_encoding("utf8").unwrap(), encoding_rs::UTF_8);
assert_eq!(parse_encoding("utf-8").unwrap(), encoding_rs::UTF_8);
assert_eq!(
parse_encoding("iso8859_6").unwrap(),
encoding_rs::ISO_8859_6
);
}
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any test cases you think are worth adding here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also try u8 (for the alias), utf16, iso_8859-2:1987 (for the colon), l1, elot-928 (the original has a required underscore), utf_16_be, utf16be, utf-16-be.

And maybe "notreal" and "" to round it off?

I guess that's a lot of cases. You could put them in a const &[(&str, &Encoding)] and test in a loop.

@ducaale ducaale requested a review from blyxxyz October 25, 2021 14:55
src/cli.rs Outdated
Comment on lines 1011 to 1018
Err(Error::with_description(
&format!(
"{} is not a supported encoding, please refer to https://encoding.spec.whatwg.org/#names-and-labels\
for supported encodings",
encoding
),
ErrorKind::InvalidValue,
))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing a space after the link, and you get the bold red "error:" twice because structopt wraps another clap error around this one:

error: Invalid value for '--response-charset ': error: utf32 is not a supported encoding, please refer to https://encoding.spec.whatwg.org/#names-and-labelsfor supported encodings

I think an error type of &'static str would work.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are flags other than --response-charset that are currently returning a clap error instead of a &str. I fix those as well.

src/cli.rs Outdated
Comment on lines 1349 to 1361
fn parse_encoding_label() {
assert_eq!(
parse_encoding("~~~~UtF////16@@").unwrap(),
encoding_rs::UTF_16LE
);
assert_eq!(parse_encoding("utf_8").unwrap(), encoding_rs::UTF_8);
assert_eq!(parse_encoding("utf8").unwrap(), encoding_rs::UTF_8);
assert_eq!(parse_encoding("utf-8").unwrap(), encoding_rs::UTF_8);
assert_eq!(
parse_encoding("iso8859_6").unwrap(),
encoding_rs::ISO_8859_6
);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also try u8 (for the alias), utf16, iso_8859-2:1987 (for the colon), l1, elot-928 (the original has a required underscore), utf_16_be, utf16be, utf-16-be.

And maybe "notreal" and "" to round it off?

I guess that's a lot of cases. You could put them in a const &[(&str, &Encoding)] and test in a loop.

@ducaale ducaale merged commit 1f0d775 into develop Nov 8, 2021
@ducaale ducaale deleted the custom-encoding branch November 8, 2021 10:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants