tstadel changed the title from "LinkContentFetcher uses wrong encoding for responses where encoding is not explicitly defined in Content-Type header." to "LinkContentFetcher uses wrong encoding for html responses where encoding is not explicitly defined in Content-Type header." on Jul 4, 2024
Describe the bug
LinkContentFetcher uses the requests library under the hood. If requests receives a Content-Type header without an explicit encoding, it infers ISO-8859-1. This is in line with https://datatracker.ietf.org/doc/html/rfc2616#section-3.7.1 but is wrong most of the time, as utf-8 has become a de facto standard (see https://stackoverflow.com/a/52615216 and https://stackoverflow.com/a/44203633).
This is especially a problem with HTML, which lets you define the encoding in the HTML itself and does not rely on the Content-Type header.
As we want to get the bytes anyway, there's no need to handle the response as a string; we can simply access the bytes via response.content.
Error message
No error message, but the content is wrongly encoded.
Expected behavior
The encoding is correct.
Additional context
Add any other context about the problem here, like document types / preprocessing steps / settings of reader etc.
To Reproduce
As an example where this is an issue, take https://www.mckinsey.com/about-us/new-at-mckinsey-blog/equal-at-mckinsey : the encoding is correctly defined in the HTML as utf-8, but the Content-Type is only text/html.
Browsers and the httpx library handle this site correctly as utf-8; LinkContentFetcher does not.
FAQ Check
System:
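To illustrate the bug description above without a network call, here is a minimal offline sketch. It decodes the same UTF-8 bytes once as ISO-8859-1 (the fallback described above when Content-Type is only text/html) and once with the encoding declared in the HTML. The helper pick_encoding and the sample body are hypothetical and not part of Haystack or requests; in a real fetch, the bytes would come from response.content.

```python
import re

# Hypothetical helper for illustration only; not part of Haystack or requests.
_HEADER_CHARSET = re.compile(r"charset=[\"']?([\w-]+)", re.IGNORECASE)
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def pick_encoding(content: bytes, content_type: str = "") -> str:
    """Pick an encoding: header charset first, then <meta charset>, then utf-8."""
    header_match = _HEADER_CHARSET.search(content_type)
    if header_match:
        return header_match.group(1).lower()
    # Only scan the beginning of the document for a <meta ... charset=...> tag.
    meta_match = _META_CHARSET.search(content[:2048])
    if meta_match:
        return meta_match.group(1).decode("ascii").lower()
    return "utf-8"


# Sample bytes standing in for response.content (assumed, not a real response).
body = '<html><head><meta charset="utf-8"></head><body>café</body></html>'.encode("utf-8")

# Decoding with the ISO-8859-1 fallback garbles non-ASCII characters.
wrong = body.decode("iso-8859-1")

# Decoding the raw bytes with the encoding declared in the HTML keeps them intact.
right = body.decode(pick_encoding(body, "text/html"))

print("café" in right)   # True
print("cafÃ©" in wrong)  # True
```

Working on the bytes keeps the decoding decision in one place, which is the direction suggested in the description; a real fix might simply pass the bytes on unchanged instead of decoding them at all.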