Add charset_normalizer detection. #1791
Return the encoding, as determined by `charset_normalizer`.
"""
content = getattr(self, "_content", b"")
if len(content) < 32:
There are cases where the detection works just fine with small content. I would suggest silencing the warning instead.
(target: x.encode("utf_8"))
* Using the following `Qu'est ce que une étoile?`
chardet detect ISO-8859-1
cchardet detect IBM852
charset-normalizer detect utf-8
* Using the following `Qu’est ce que une étoile?`
chardet detect utf-8
cchardet detect UTF-8
charset-normalizer detect utf-8
* Using the following `<?xml ?><c>Financiën</c>`
chardet detect ISO-8859-1
cchardet detect ISO-8859-13
charset-normalizer detect utf-8
* Using the following `(° ͜ʖ °), creepy face, smiley 😀`
chardet detect Windows-1254
cchardet detect UTF-8
charset-normalizer detect utf-8
* Using the following `["Financiën", "La France"]`
chardet detect utf-8
cchardet detect ISO-8859-13
charset-normalizer detect utf-8
WDYT?
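One reason detectors disagree on short inputs like these is that the same bytes decode without error under several single-byte encodings. The following is a minimal sketch using only the standard library, not any of the three detectors; the candidate list and the helper name `plausible_decodings` are illustrative assumptions.

```python
# Encodings taken from the results reported above (illustrative selection).
CANDIDATES = ["utf-8", "iso-8859-1", "cp852", "iso-8859-13", "cp1254"]


def plausible_decodings(payload: bytes) -> dict:
    """Return every candidate encoding that decodes payload without error."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = payload.decode(name)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence; skip it.
            continue
    return results


sample = "<?xml ?><c>Financiën</c>".encode("utf_8")
for name, text in plausible_decodings(sample).items():
    print(f"{name:12} {text!r}")
```

Every encoding that survives this filter is technically valid, so a detector has to rank them by how likely the decoded text is, and short content gives it very little evidence to rank with.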
I went with that originally, and a couple of the tests with small amounts of content returned results I wasn't expecting. If `apparent_encoding` is `None`, then we'll end up decoding it with `'utf-8', errors='replace'`, which I figure is a pretty reasonable default for the corner case.
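The fallback described here can be sketched with plain `bytes.decode`. This is an illustration of the behaviour, not the project's actual code; the function name `decode_body` is a hypothetical helper.

```python
from typing import Optional


def decode_body(content: bytes, apparent_encoding: Optional[str]) -> str:
    """Decode content, falling back to UTF-8 when no encoding was detected."""
    # Hypothetical helper: None stands for "detection was skipped or failed".
    encoding = apparent_encoding or "utf-8"
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return content.decode(encoding, errors="replace")


print(decode_body("Financiën".encode("utf-8"), None))
print(decode_body(b"caf\xe9", None))
```

The second call shows the cost of the default: a Latin-1 byte that is invalid UTF-8 comes back as the replacement character rather than raising an exception.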
Alrighty, that is reasonable. 👍
On another note, you could run the detection anyway and check whether the result has a SIG/BOM; that could be reasonable too. Then discard the result if len(content) < 32 and best_guess.bom is False.
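from charset_normalizer import from_bytes

results = from_bytes(content)
best_guess = results.best()  # None when no plausible encoding is found
if best_guess is not None and best_guess.bom:
    ...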
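The "discard short results unless a BOM is present" rule can be sketched with the standard library alone. In this sketch the BOM check is a stdlib stand-in for `best_guess.bom`, and `bom_encoding` and `accept_guess` are hypothetical names, not part of `charset_normalizer`.

```python
import codecs
from typing import Optional

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def bom_encoding(content: bytes) -> Optional[str]:
    """Return the codec implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if content.startswith(bom):
            return name
    return None


def accept_guess(content: bytes, guess: Optional[str], min_length: int = 32) -> Optional[str]:
    """Keep a detector's guess only if the content is long enough or has a BOM."""
    if len(content) >= min_length:
        return guess
    # Short content: trust the guess only when a BOM backs it up.
    return guess if bom_encoding(content) is not None else None
```

With this rule a short body that starts with a BOM still gets a detected encoding, while any other short body falls through to `None` and therefore to the default decoding.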
Use `charset_normalizer` to auto-detect character encodings in cases where no `Content-Type: text/...; charset=...` is included.

See #1657 and https://github.com/tomchristie/top-1000 for some evidence-led rationale behind this change.
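As a rough illustration of when auto-detection would apply, the sketch below reads the `charset` parameter from a `Content-Type` value using the standard library's `email.message.Message`. The helper names `declared_charset` and `needs_detection` are assumptions for this example and are not taken from the PR.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def needs_detection(content_type: Optional[str]) -> bool:
    """True when the header gives no charset, so detection would be the fallback."""
    return declared_charset(content_type) is None


print(declared_charset("text/html; charset=ISO-8859-1"))
print(needs_detection("text/html"))
```

An explicitly declared charset takes precedence in this sketch; detection is only consulted when the header is missing or carries no `charset` parameter.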