How deal with encoding charset? #96

ebuildy · 2013-12-31T11:49:58Z

I am trying to parse the RSS feed http://fr.canoe.ca/rss/feed/nouvelles/aujourdhui.xml, which is encoded in the ISO-8859-1 charset.

I pipe the response through iconv to convert the XML to UTF-8, but how can I get the feed's encoding before parsing it?

Here is the code I am using: https://gist.github.com/ebuildy/0b5023e2d97ebc2e4a57

Thanks

szwacz · 2013-12-31T11:57:49Z

I can show you my solution: https://github.com/szwacz/sputnik/blob/master/app/helpers/feedParser.js

danmactough · 2013-12-31T12:50:41Z

I'll take that as a feature request: expose a method to return the detected character encoding. Great idea.
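
As an illustration of what detecting the encoding before parsing could look like, here is a minimal sketch. It is not feedparser's API: the helper names `charsetFromContentType`, `charsetFromXmlDeclaration`, and `detectCharset` are hypothetical, and the precedence (HTTP header first, then the XML declaration, then UTF-8) is an assumption that may not suit feeds whose headers are wrong.

```javascript
// Hypothetical helpers for illustration; not part of feedparser.

// Extract the charset parameter from a Content-Type header value, if present.
function charsetFromContentType(header) {
  const match = /charset\s*=\s*"?([^\s;"]+)"?/i.exec(header || '');
  return match ? match[1].toLowerCase() : null;
}

// Extract the encoding named in the XML declaration at the start of the body.
// The declaration itself is plain ASCII in UTF-8 and in single-byte charsets,
// so reading the first bytes as latin1 is enough to find it.
function charsetFromXmlDeclaration(buffer) {
  const head = buffer.subarray(0, 200).toString('latin1');
  const match = /<\?xml[^>]*encoding\s*=\s*["']([^"']+)["']/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Assumed precedence: HTTP header, then XML declaration, then UTF-8.
function detectCharset(contentTypeHeader, bodyBuffer) {
  return charsetFromContentType(contentTypeHeader) ||
    charsetFromXmlDeclaration(bodyBuffer) ||
    'utf-8';
}

const body = Buffer.from(
  '<?xml version="1.0" encoding="ISO-8859-1"?><rss></rss>',
  'latin1'
);
console.log(detectCharset('text/xml', body));                // iso-8859-1
console.log(detectCharset('text/xml; charset=UTF-8', body)); // utf-8
```

The detected name could then be handed to a converter such as iconv before the bytes reach the parser.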

ebuildy · 2013-12-31T13:03:53Z

@szwacz's solution is perfect, but this should definitely be added to your library.

ebuildy · 2013-12-31T13:08:33Z

You also need to set encoding: null in the request options.

danmactough · 2013-12-31T14:50:23Z

> Also need to set encoding : null in request option.

You need to specify that? You can't just omit the option? (I guess if omitted, it assumes a default of utf-8?)
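
As far as the request library's documentation describes it, leaving `encoding` undefined means the body is converted to a string as UTF-8, while `encoding: null` returns the raw Buffer. The sketch below only illustrates why that matters, using hand-written bytes rather than a real HTTP response: once non-UTF-8 bytes have been decoded as UTF-8, the original bytes cannot be recovered for a later conversion.

```javascript
// Bytes for "Canoë" as they would appear in an ISO-8859-1 document.
const rawBytes = Buffer.from([0x43, 0x61, 0x6e, 0x6f, 0xeb]);

// Decoding as UTF-8 too early: 0xEB is not a complete UTF-8 sequence,
// so the decoder substitutes U+FFFD and the original byte is lost.
const decodedTooEarly = rawBytes.toString('utf8');
const reEncoded = Buffer.from(decodedTooEarly, 'utf8');

// Keeping the raw bytes and decoding with the right charset preserves the text.
const decodedCorrectly = rawBytes.toString('latin1');

console.log(reEncoded.equals(rawBytes));        // false
console.log(decodedCorrectly === 'Cano\u00eb'); // true
```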

danmactough · 2013-12-31T15:19:11Z

Also, regarding that feed. I may be crazy, but I believe that the REAL problem you're having with it is that it declares itself as iso-8859-1, but it actually contains utf-8 multibyte characters. You should never need to "convert" from iso-8859-1 to utf-8.

For example, when you examine the character "ë", it is character code 65533 -- iso-8859-1 only goes up to 255 (right? -- not double-checking).

> str
'Cano�'
> str.length
5
> str.charCodeAt(3)
111
> str.charCodeAt(4)
65533
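
A note on the value 65533: it is U+FFFD, the Unicode replacement character that decoders emit for bytes that are invalid in the assumed encoding. As a sketch, the same output can be reproduced by decoding ISO-8859-1 bytes as UTF-8. That would point to the string having been decoded with the wrong charset, though how `str` was actually obtained here is an assumption.

```javascript
// "Canoë" encoded as ISO-8859-1: the final byte 0xEB is "ë".
const bytes = Buffer.from([0x43, 0x61, 0x6e, 0x6f, 0xeb]);

// Decoded as UTF-8, the lone 0xEB byte becomes U+FFFD.
const str = bytes.toString('utf8');
console.log(str.length);        // 5
console.log(str.charCodeAt(3)); // 111
console.log(str.charCodeAt(4)); // 65533

// Decoded as latin1 (ISO-8859-1), the same byte is "ë", code 235.
console.log(bytes.toString('latin1').charCodeAt(4)); // 235
```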

ebuildy · 2013-12-31T15:35:56Z

Everything is fine with the encoding solution. I am not sure you can inspect the string like this, though: UTF-8 and ISO-8859-1 use different character lengths...
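
To make the length difference concrete, here is a small sketch comparing the number of characters in a JavaScript string with the number of bytes it occupies in each encoding. The sample word is only an illustration.

```javascript
// "Canoë" written with an escape so the source file's own encoding is irrelevant.
const text = 'Cano\u00eb';

// Number of UTF-16 code units in the JavaScript string.
const charCount = text.length;

// Number of bytes needed to store the same text in each encoding.
const latin1Bytes = Buffer.byteLength(text, 'latin1');
const utf8Bytes = Buffer.byteLength(text, 'utf8');

console.log(charCount);   // 5
console.log(latin1Bytes); // 5
console.log(utf8Bytes);   // 6
```

The string length stays the same once the text is decoded; only the byte count depends on the encoding.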

danmactough · 2014-01-24T06:06:51Z

@ebuildy If you're satisfied with @szwacz's encode solution, that works for me. Just to close the loop on the encoding with that feed, your character length is not quite correct. You can refer to MDN, but the tl;dr is that the last character in the string Cano� (from that feed) is a single-byte character, but it's outside the range of ISO-8859-1 -- it's a UTF-8 single-byte character. In other words, the feed declares itself as ISO-8859-1, but it is not ISO-8859-1.

adantoscano · 2017-03-02T15:55:16Z

@danmactough can you post @szwacz's solution? I have the same problem, but his page is no longer available.

szwacz · 2017-03-02T16:00:39Z

@adantl The file in my repo still exists, it has just been moved: https://github.com/szwacz/sputnik/blob/master/app/core/helpers/feed_parser.js#L42

danmactough closed this as completed Jan 24, 2014

alabeduarte mentioned this issue Mar 2, 2017

encoding problems alabeduarte/feedparser-promised#9

Closed

pedrohh mentioned this issue Nov 6, 2018

Support to charset encoding filipedeschamps/rss-feed-emitter#172

Closed
