Add browser-style text encoding detection #203

Merged
ducaale merged 1 commit into ducaale:develop from strict-encoding
Dec 2, 2021

Conversation

blyxxyz
Collaborator

@blyxxyz blyxxyz commented Nov 29, 2021

  • If there's no explicit encoding it's detected using chardetng, which was made for Firefox.
  • HTML is never assumed to be UTF-8, that has to be declared explicitly. This is a bit iffy, it'll definitely break some real world output. But it's consistent with how browsers handle it. See https://hsivonen.fi/utf-8-detection/.
    One argument against this is that fetch and XHR can assume UTF-8 even if loading as a normal webpage can't. What do you think?
  • BOM sniffing is no longer used when the encoding is explicit.
  • Streaming and non-streaming decodes now behave the same when it comes to encoding detection and BOM handling.

I had to do a bunch of research for this one, I hope it's not too hard to follow.
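
For readers following along, the BOM rule in the description above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from this PR: `BomEncoding`, `sniff_bom`, and `bom_if_not_explicit` are hypothetical names, and the sketch assumes the only BOMs of interest are the UTF-8, UTF-16LE, and UTF-16BE ones.

```rust
/// Encodings that a byte order mark can identify (assumed set for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding named by a leading BOM, plus the BOM's length in bytes.
fn sniff_bom(body: &[u8]) -> Option<(BomEncoding, usize)> {
    if body.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if body.starts_with(&[0xFF, 0xFE]) {
        Some((BomEncoding::Utf16Le, 2))
    } else if body.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else {
        None
    }
}

/// Hypothetical helper: the BOM is consulted only when no encoding was declared.
fn bom_if_not_explicit(explicit: Option<&str>, body: &[u8]) -> Option<(BomEncoding, usize)> {
    match explicit {
        // An explicit encoding wins, so the BOM is not sniffed at all.
        Some(_) => None,
        None => sniff_bom(body),
    }
}
```

Calling `bom_if_not_explicit(None, b"\xEF\xBB\xBFhello")` reports a three-byte UTF-8 BOM, while passing `Some("latin1")` for the same body skips sniffing entirely.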

@ducaale ducaale self-requested a review November 29, 2021 18:44
Owner

@ducaale ducaale left a comment

> HTML is never assumed to be UTF-8, that has to be declared explicitly. This is a bit iffy, it'll definitely break some real world output. But it's consistent with how browsers handle it. See https://hsivonen.fi/utf-8-detection/.
> One argument against this is that fetch and XHR can assume UTF-8 even if loading as a normal webpage can't. What do you think?

Being consistent with how browsers handle it seems reasonable to me. If has_charset_utf8() and chardetng both fail, users will always be able to set the charset with --response-charset.
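
The fallback order described here could look something like the sketch below. It is an illustration under assumptions, not the project's implementation: `charset_from_content_type`, `Charset`, and `pick_charset` are hypothetical names, the header parsing is deliberately simplistic, and it assumes a user-supplied `--response-charset` value takes priority over whatever the response declares.

```rust
/// Extracts the `charset` parameter from a Content-Type header value, if any.
/// Simplified: splits on ';' and does not handle quoted semicolons.
fn charset_from_content_type(content_type: &str) -> Option<String> {
    content_type
        .split(';')
        .skip(1) // the first segment is the media type itself
        .filter_map(|param| {
            let (name, value) = param.split_once('=')?;
            if name.trim().eq_ignore_ascii_case("charset") {
                Some(value.trim().trim_matches('"').to_ascii_lowercase())
            } else {
                None
            }
        })
        .next()
}

/// Where the chosen encoding came from.
#[derive(Debug, PartialEq)]
enum Charset {
    /// Value given by the user (assumed to win over everything else).
    UserOverride(String),
    /// Value declared in the Content-Type header.
    Declared(String),
    /// Nothing explicit: fall back to BOM sniffing or a detector.
    NeedsDetection,
}

fn pick_charset(user_override: Option<&str>, content_type: Option<&str>) -> Charset {
    if let Some(label) = user_override {
        return Charset::UserOverride(label.to_ascii_lowercase());
    }
    match content_type.and_then(charset_from_content_type) {
        Some(label) => Charset::Declared(label),
        None => Charset::NeedsDetection,
    }
}
```

Only the `NeedsDetection` case would hand the body to a detector such as chardetng, so a wrong guess can always be corrected by supplying the charset explicitly.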

@blyxxyz
Collaborator Author

blyxxyz commented Dec 1, 2021

I think I changed my mind about HTML and UTF-8. HTTPie specializes in APIs, and you'd usually talk to those using some method that can assume UTF-8.

@ducaale
Owner

ducaale commented Dec 2, 2021

Can you update the commit description, especially the bit about HTML and UTF-8?

- If there's no explicit encoding it's detected using BOM sniffing or
  using chardetng, which was made for Firefox.

- BOM sniffing is no longer used when the encoding is explicit.

- Streaming and non-streaming decodes now behave the same.
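
One reason streaming and non-streaming decodes can diverge is that a BOM may be split across chunk boundaries. The sketch below shows one way to make the two paths agree, by holding back the first bytes of a stream until the BOM question can be answered. It is only an illustration under assumptions: `BomGate` and its methods are hypothetical names, and nothing here is taken from the merged commit.

```rust
/// Buffers the start of a stream until a BOM decision can be made.
struct BomGate {
    pending: Vec<u8>,
    decided: bool,
    /// Label of the encoding indicated by the BOM, if one was found.
    bom: Option<&'static str>,
}

impl BomGate {
    fn new() -> Self {
        BomGate { pending: Vec::new(), decided: false, bom: None }
    }

    /// Feeds one chunk and returns the bytes that are safe to pass on to the
    /// decoder, with any BOM removed. `last` marks the end of the stream.
    fn feed(&mut self, chunk: &[u8], last: bool) -> Vec<u8> {
        if self.decided {
            return chunk.to_vec();
        }
        self.pending.extend_from_slice(chunk);
        // The longest BOM considered here (UTF-8) is three bytes, so wait for
        // that many unless the stream has already ended.
        if self.pending.len() < 3 && !last {
            return Vec::new();
        }
        self.decided = true;
        let (bom, bom_len) = if self.pending.starts_with(&[0xEF, 0xBB, 0xBF]) {
            (Some("utf-8"), 3)
        } else if self.pending.starts_with(&[0xFF, 0xFE]) {
            (Some("utf-16le"), 2)
        } else if self.pending.starts_with(&[0xFE, 0xFF]) {
            (Some("utf-16be"), 2)
        } else {
            (None, 0)
        };
        self.bom = bom;
        // Everything after the BOM is released to the caller.
        self.pending.split_off(bom_len)
    }
}
```

Feeding a body in one call with `last = true` and feeding it in arbitrary pieces yield the same bytes and the same `bom` value, which is the property the last bullet asks for.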
Owner

@ducaale ducaale left a comment

Approved 🎉

Can you also update the pull request's description? Thanks.

@ducaale ducaale merged commit 75232aa into ducaale:develop Dec 2, 2021
@blyxxyz blyxxyz deleted the strict-encoding branch December 2, 2021 18:46
2 participants