Add html5 character encoding mappings #8

Open
mlissner opened this issue Nov 10, 2012 · 4 comments
Comments

@mlissner

I had a frustrating issue recently when trying to use chardet to work with a web page: http://stackoverflow.com/questions/11588458/how-to-handle-encodings-using-python-requests-library

My solution was to write a bit of custom code that says, "Whenever chardet reports ISO-8859-1, instead use cp1252."
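A minimal sketch of that kind of workaround, for illustration only. The names `BROWSER_OVERRIDES`, `browser_encoding`, and `remap_result` are hypothetical and not part of chardet; the only assumption about chardet is that `chardet.detect()` returns a dict with an `"encoding"` key.

```python
# Hypothetical post-processing step (not part of chardet): remap a detected
# encoding name to the encoding a browser would actually decode with.
# Only the ISO-8859-1 -> cp1252 rule from this issue is included here.
BROWSER_OVERRIDES = {
    "iso-8859-1": "cp1252",
}


def browser_encoding(detected):
    """Return the browser-style encoding for a detected encoding name."""
    if detected is None:
        # Detection can fail to produce any encoding at all.
        return None
    return BROWSER_OVERRIDES.get(detected.lower(), detected)


def remap_result(result):
    """Return a copy of a chardet-style result dict with the encoding remapped."""
    remapped = dict(result)
    remapped["encoding"] = browser_encoding(result.get("encoding"))
    return remapped
```

With chardet installed, this would be used as `remap_result(chardet.detect(raw_bytes))`. Keeping the remapping outside the detector is one way to make it opt-in rather than default behaviour.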

Basically, browsers don't actually use a number of character encodings; they map them to other encodings instead. Browsers did this unofficially for a while, but it's now enshrined in the HTML5 spec:

http://dev.w3.org/html5/spec/single-page.html#character-encodings-0

Since most of the data that chardet is used on will be coming from the web, it makes sense for it to return the character encodings that browsers actually use. This might make sense as an option rather than default functionality... I'm not sure, but I'd love to see this added.

If this is a feature that would be accepted, I'd be happy to put it together in a pull request, but I need guidance on what design would be accepted.

@dan-blanchard

@sigmavirus24 Is this something you fixed? I'm trying to migrate old issues over to our new repo.

@sigmavirus24

@dan-blanchard no, the issue in that commit refers to sv24-archive#8

@dan-blanchard

@mlissner I don't actually see this mapping you speak of in the latest version of the HTML5 spec:
http://www.w3.org/html/wg/drafts/html/master/single-page.html#character-encodings

That said, I do think always returning CP1252 instead of Latin-1 is advisable because it gets you the quotes and things that Latin-1 doesn't, and one almost never encounters documents that purposefully have the replaced control characters in them.
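To make the difference concrete, here is a small sketch using only Python's built-in codecs. The sample bytes are made up for illustration: 0x93 and 0x94 are curly quotes in cp1252, but C1 control characters in Latin-1.

```python
# Illustrative bytes: a word wrapped in cp1252 "smart quotes".
raw = b"\x93quoted\x94"

# cp1252 maps 0x93/0x94 to the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")

# Latin-1 maps the same bytes to the invisible C1 controls U+0093 and U+0094.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))
print(repr(as_latin1))
```

One caveat: Python's `cp1252` codec raises `UnicodeDecodeError` on the five bytes that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), whereas `latin-1` accepts every byte, so a blanket substitution may need an `errors=` argument or a fallback.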

@mlissner
Author

Well... My internet connection is so slow here I can't download the whole spec, but it might be in chapter eight? I can get the TOC, but downloading past chapter four is killing me.

Anyway, yeah, I filed this bug after encountering a page that was reporting Latin-1 when it had cp1252 characters, and I realized there's almost no reason to ever return Latin-1 anymore.
