
Add charset_normalizer detection. #1791

Merged: 6 commits into master on Aug 13, 2021
Conversation

tomchristie (Member)

Use charset_normalizer to auto-detect character encodings in cases where no `Content-Type: text/...; charset=...` header is included.

See #1657 and https://github.com/tomchristie/top-1000 for some evidence-led rationale behind this change.
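Conceptually, the change amounts to something like the sketch below. This is an illustration only, not the merged implementation; `detect_encoding`, its parameters, and the `"utf-8"` fallback are hypothetical names and assumptions.

from typing import Optional

from charset_normalizer import from_bytes

def detect_encoding(content: bytes, header_charset: Optional[str]) -> str:
    # Prefer an explicit charset from the Content-Type header when present.
    if header_charset is not None:
        return header_charset
    # Otherwise let charset_normalizer guess from the raw bytes.
    match = from_bytes(content).best()
    if match is None:
        return "utf-8"  # Assumed default when detection finds nothing.
    return match.encoding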

The review thread below is anchored on this excerpt from the new `apparent_encoding` code:

Return the encoding, as determined by `charset_normalizer`.
"""
content = getattr(self, "_content", b"")
if len(content) < 32:
@Ousret (Contributor) commented on Aug 11, 2021:

There are cases where the detection works just fine with small content. I would suggest silencing the warning instead.

(Each sample below is UTF-8 encoded before detection, i.e. the target bytes are `x.encode("utf_8")`.)

* Using `Qu'est ce que une étoile?`:
  chardet detects ISO-8859-1
  cchardet detects IBM852
  charset-normalizer detects utf-8

* Using `Qu’est ce que une étoile?`:
  chardet detects utf-8
  cchardet detects UTF-8
  charset-normalizer detects utf-8

* Using `<?xml ?><c>Financiën</c>`:
  chardet detects ISO-8859-1
  cchardet detects ISO-8859-13
  charset-normalizer detects utf-8

* Using `(° ͜ʖ °), creepy face, smiley 😀`:
  chardet detects Windows-1254
  cchardet detects UTF-8
  charset-normalizer detects utf-8

* Using `["Financiën", "La France"]`:
  chardet detects utf-8
  cchardet detects ISO-8859-13
  charset-normalizer detects utf-8
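For reference, a minimal script that reproduces the comparison above; it assumes the `chardet`, `cchardet`, and `charset-normalizer` packages are installed.

import chardet
import cchardet
from charset_normalizer import from_bytes

samples = [
    "Qu'est ce que une étoile?",
    "Qu’est ce que une étoile?",
    "<?xml ?><c>Financiën</c>",
    "(° ͜ʖ °), creepy face, smiley 😀",
    '["Financiën", "La France"]',
]

for text in samples:
    payload = text.encode("utf_8")
    best = from_bytes(payload).best()
    print(repr(text))
    print("  chardet:", chardet.detect(payload)["encoding"])
    print("  cchardet:", cchardet.detect(payload)["encoding"])
    print("  charset-normalizer:", best.encoding if best else None)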

WDYT?

@tomchristie (Member, Author) commented on Aug 11, 2021:

I went with that originally, and a couple of the tests with small amounts of content returned results I wasn't expecting. If `apparent_encoding` is None, then we'll end up decoding with `'utf-8', errors='replace'`, which I figure is a pretty reasonable default for the corner case.
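In other words, the fallback described above amounts to something like this fragment (`decode_content` and its parameters are illustrative names, not the merged code):

def decode_content(content: bytes, apparent_encoding):
    # When detection yields None, fall back to UTF-8 and replace any
    # undecodable bytes rather than raising.
    encoding = apparent_encoding if apparent_encoding is not None else "utf-8"
    return content.decode(encoding, errors="replace")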

@Ousret (Contributor) replied:

Alrighty, that is reasonable. 👍
On another note, you could run the detection anyway and check whether the result has a SIG/BOM; that could be reasonable too. Then discard the result if `len(content) < 32` and `best_guess.bom` is False.

from charset_normalizer import from_bytes

results = from_bytes(content)
best_guess = results.best()  # May be None if no encoding matched.

if best_guess is not None and best_guess.bom:
    ...
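Put together, the suggested guard might look like the following sketch (not the merged code; `guess_encoding` is a hypothetical name):

from charset_normalizer import from_bytes

def guess_encoding(content: bytes):
    # Run detection regardless of size, but only trust very short
    # payloads when a BOM/SIG makes the guess reliable.
    best_guess = from_bytes(content).best()
    if best_guess is None:
        return None
    if len(content) < 32 and not best_guess.bom:
        return None  # Discard low-signal guesses on tiny inputs.
    return best_guess.encoding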

@tomchristie merged commit acb5e6a into master on Aug 13, 2021.
@tomchristie deleted the charset-normalizer-detection branch on August 13, 2021 at 10:38.
tomchristie added a commit that referenced this pull request on Aug 31, 2021:
* 📝 Docs patch following PR #1791 section compatibility.encoding

Reintroducing charset detection

* 📝 Amend sentence in 3080a9d

Co-authored-by: Tom Christie <tom@tomchristie.com>