How deal with encoding charset? #96

Closed
ebuildy opened this issue Dec 31, 2013 · 10 comments

Comments

@ebuildy

ebuildy commented Dec 31, 2013

I'm trying to parse the RSS feed http://fr.canoe.ca/rss/feed/nouvelles/aujourdhui.xml, which is encoded in ISO-8859-1.

I pipe the response through iconv to convert the XML to UTF-8, but how can I get the feed's encoding before parsing it?

Here is the code I am using: https://gist.github.com/ebuildy/0b5023e2d97ebc2e4a57

Thanks
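
A minimal sketch of one way to find the declared encoding before parsing, assuming the response body is available as a raw Buffer and the feed uses an ASCII-compatible encoding. The helper names `sniffXmlEncoding` and `charsetFromContentType` are hypothetical and are not part of feedparser; the sample header and XML are made up for illustration.

```javascript
// Hypothetical helper: read the encoding named in the XML declaration, if any.
function sniffXmlEncoding(rawBytes) {
  // The declaration itself is ASCII, so a latin1 decode of the first bytes is lossless here.
  const head = rawBytes.subarray(0, 256).toString('latin1');
  const match = /<\?xml[^>]*\bencoding\s*=\s*["']([^"']+)["']/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: pick the charset out of a Content-Type header value.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Illustrative input: a header without a charset and a feed that declares ISO-8859-1.
const sampleFeed = Buffer.from(
  '<?xml version="1.0" encoding="ISO-8859-1"?><rss></rss>',
  'latin1'
);
const declared =
  charsetFromContentType('text/xml') || sniffXmlEncoding(sampleFeed) || 'utf-8';
console.log(declared);
```

The detected name could then be handed to iconv as the source charset. Checking the header first and falling back to the XML declaration is only one possible order, and a declaration can be wrong about the actual bytes.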

@szwacz

szwacz commented Dec 31, 2013

@danmactough
Owner

I'll take that as a feature request: expose a method that returns the detected character encoding. Great idea.

@ebuildy
Author

ebuildy commented Dec 31, 2013

@szwacz's solution is perfect, but this should definitely be added to your library.

@ebuildy
Author

ebuildy commented Dec 31, 2013

You also need to set encoding: null in the request options.

@danmactough
Owner

Also need to set encoding : null in request option.

You need to specify that? You can't just omit the option? (I guess if it's omitted, it assumes a default of utf-8?)

@danmactough
Owner

Also, regarding that feed. I may be crazy, but I believe that the REAL problem you're having with it is that it declares itself as iso-8859-1, but it actually contains utf-8 multibyte characters. You should never need to "convert" from iso-8859-1 to utf-8.

For example, when you examine the character "ë", it is character code 65533 -- iso-8859-1 only goes up to 255 (right? -- not double-checking).

> str
'Cano�'
> str.length
5
> str.charCodeAt(3)
111
> str.charCodeAt(4)
65533
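
One hedged way to reproduce the output above: 65533 is U+FFFD, the Unicode replacement character, which a decoder emits when it meets bytes that are invalid in the encoding it was told to use. The sketch below assumes the feed's bytes really are ISO-8859-1 as declared and were decoded as UTF-8 somewhere along the way; that is an assumption about this feed, not something confirmed in the thread.

```javascript
// "Canoë" as ISO-8859-1 bytes: ë is the single byte 0xEB.
const rawBytes = Buffer.from([0x43, 0x61, 0x6e, 0x6f, 0xeb]);

// Decoding those bytes as UTF-8 cannot interpret the lone 0xEB,
// so the decoder substitutes U+FFFD (65533) for it.
const misread = rawBytes.toString('utf8');
console.log(misread.length, misread.charCodeAt(3), misread.charCodeAt(4));

// Decoding the same bytes as latin1 (ISO-8859-1) recovers the letter (code 235).
const decodedText = rawBytes.toString('latin1');
console.log(decodedText, decodedText.charCodeAt(4));
```

Under that assumption, the original byte is already gone once the body has been turned into a UTF-8 string, which would explain why the raw Buffer (encoding: null) is needed before converting.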

@ebuildy
Author

ebuildy commented Dec 31, 2013

Everything is fine with the encode solution. I am not sure you can inspect the string like this; UTF-8 and ISO-8859-1 have different character lengths ...

@danmactough
Owner

@ebuildy If you're satisfied with @szwacz's encode solution, that works for me. Just to close the loop on the encoding with that feed, your character length is not quite correct. You can refer to MDN, but the tl;dr is that the last character in the string Cano� (from that feed) is a single-byte character, but it's outside the range of ISO-8859-1 -- it's a UTF-8 single-byte character. In other words, the feed declares itself as ISO-8859-1, but it is not ISO-8859-1.
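
On the character-length point, a small sketch of how string length and byte length differ in Node. `String.prototype.length` counts UTF-16 code units regardless of the encoding the text arrived in, while the byte count depends on the encoding used to serialize it. The sample word is only illustrative.

```javascript
const word = 'Cano\u00eb'; // "Canoë", with ë written as an escape

// Length of the JavaScript string in UTF-16 code units.
console.log(word.length);

// Bytes needed to serialize the same text in each encoding.
console.log(Buffer.byteLength(word, 'latin1')); // ë is one byte in ISO-8859-1
console.log(Buffer.byteLength(word, 'utf8'));   // ë is two bytes in UTF-8

// U+FFFD is one code unit in a JS string but three bytes in UTF-8.
console.log('\uFFFD'.length, Buffer.byteLength('\uFFFD', 'utf8'));
```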

@adantoscano

@danmactough can you post @szwacz's solution? I have the same problem, but his page is no longer available.

@szwacz

szwacz commented Mar 2, 2017

@adantl The file in my repo still exists; it has just been moved: https://github.com/szwacz/sputnik/blob/master/app/core/helpers/feed_parser.js#L42

4 participants