
Add charset_normalizer detection. #1791

Merged: 6 commits into master on Aug 13, 2021
Conversation

tomchristie (Member)

Use charset_normalizer to auto-detect character encodings in cases where no `Content-Type: text/...; charset=...` header is included.

See #1657 and https://github.com/tomchristie/top-1000 for some evidence-led rationale behind this change.
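Conceptually, the change amounts to something like the sketch below. This is an illustration only, not the merged implementation; `detect_encoding`, its parameters, and the `"utf-8"` fallback are hypothetical names and assumptions.

from typing import Optional

from charset_normalizer import from_bytes

def detect_encoding(content: bytes, header_charset: Optional[str]) -> str:
    # Prefer an explicit charset from the Content-Type header when present.
    if header_charset is not None:
        return header_charset
    # Otherwise let charset_normalizer guess from the raw bytes.
    match = from_bytes(content).best()
    if match is None:
        return "utf-8"  # Assumed default when detection finds nothing.
    return match.encoding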

The review thread below is anchored on this excerpt from the new `apparent_encoding` code:

Return the encoding, as determined by `charset_normalizer`.
"""
content = getattr(self, "_content", b"")
if len(content) < 32:
@Ousret (Contributor) commented on Aug 11, 2021:

There are cases where the detection works just fine with small content. I would suggest silencing the warning instead.

(Each sample below is UTF-8 encoded before detection, i.e. the target bytes are `x.encode("utf_8")`.)

* Using `Qu'est ce que une étoile?`:
  chardet detects ISO-8859-1
  cchardet detects IBM852
  charset-normalizer detects utf-8

* Using `Qu’est ce que une étoile?`:
  chardet detects utf-8
  cchardet detects UTF-8
  charset-normalizer detects utf-8

* Using `<?xml ?><c>Financiën</c>`:
  chardet detects ISO-8859-1
  cchardet detects ISO-8859-13
  charset-normalizer detects utf-8

* Using `(° ͜ʖ °), creepy face, smiley 😀`:
  chardet detects Windows-1254
  cchardet detects UTF-8
  charset-normalizer detects utf-8

* Using `["Financiën", "La France"]`:
  chardet detects utf-8
  cchardet detects ISO-8859-13
  charset-normalizer detects utf-8
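For reference, a minimal script that reproduces the comparison above; it assumes the `chardet`, `cchardet`, and `charset-normalizer` packages are installed.

import chardet
import cchardet
from charset_normalizer import from_bytes

samples = [
    "Qu'est ce que une étoile?",
    "Qu’est ce que une étoile?",
    "<?xml ?><c>Financiën</c>",
    "(° ͜ʖ °), creepy face, smiley 😀",
    '["Financiën", "La France"]',
]

for text in samples:
    payload = text.encode("utf_8")
    best = from_bytes(payload).best()
    print(repr(text))
    print("  chardet:", chardet.detect(payload)["encoding"])
    print("  cchardet:", cchardet.detect(payload)["encoding"])
    print("  charset-normalizer:", best.encoding if best else None)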

WDYT?

@tomchristie (Member, Author) commented on Aug 11, 2021:

I went with that originally, and a couple of the tests with small amounts of content returned results I wasn't expecting. If `apparent_encoding` is None, then we'll end up decoding with `'utf-8', errors='replace'`, which I figure is a pretty reasonable default for the corner case.
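In other words, the fallback described above amounts to something like this fragment (`decode_content` and its parameters are illustrative names, not the merged code):

def decode_content(content: bytes, apparent_encoding):
    # When detection yields None, fall back to UTF-8 and replace any
    # undecodable bytes rather than raising.
    encoding = apparent_encoding if apparent_encoding is not None else "utf-8"
    return content.decode(encoding, errors="replace")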

@Ousret (Contributor) replied:

Alrighty, that is reasonable. 👍
On another note, you could run the detection anyway and check whether the result has a SIG/BOM; that could be reasonable too. Then discard the result if `len(content) < 32` and `best_guess.bom` is False.

from charset_normalizer import from_bytes

results = from_bytes(content)
best_guess = results.best()  # May be None if no encoding matched.

if best_guess is not None and best_guess.bom:
    ...
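Put together, the suggested guard might look like the following sketch (not the merged code; `guess_encoding` is a hypothetical name):

from charset_normalizer import from_bytes

def guess_encoding(content: bytes):
    # Run detection regardless of size, but only trust very short
    # payloads when a BOM/SIG makes the guess reliable.
    best_guess = from_bytes(content).best()
    if best_guess is None:
        return None
    if len(content) < 32 and not best_guess.bom:
        return None  # Discard low-signal guesses on tiny inputs.
    return best_guess.encoding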

@tomchristie merged commit acb5e6a into master on Aug 13, 2021.
@tomchristie deleted the charset-normalizer-detection branch on August 13, 2021 at 10:38.
tomchristie added a commit that referenced this pull request on Aug 31, 2021:
* 📝 Docs patch following PR #1791 section compatibility.encoding

Reintroducing charset detection

* 📝 Amend sentence in 3080a9d

Co-authored-by: Tom Christie <tom@tomchristie.com>