Non-latin characters get HTML-encoded with decodeEntities=true #866

dsavenko · 2016-05-17T09:34:13Z

Hi,

Consider this code:

var cheerio = require('cheerio')
var ch = cheerio.load('<div>абв</div>', { decodeEntities: true })
console.log(ch('div').html())

It prints &#x430;&#x431;&#x432;. If I set decodeEntities to false, the output is the expected абв.

Two issues here:

This is not how decodeEntities is supposed to work. I tested with htmlparser2 directly, and it works as expected both ways (the code is below).
htmlparser2 recommends always setting decodeEntities to true for security reasons.

Test code for htmlparser2:

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    ontext: function(text){
        console.log("-->", text);
    }
}, {decodeEntities: true});
parser.write("<div>абв</div>");
parser.end();

Output: --> абв (as expected)

P.S. Cheerio version: 0.20.0

shelldweller · 2016-06-01T00:47:50Z

This was reported in #565 and closed for some reason. I agree: .html() should not escape non-ASCII characters to entities, but should return a Unicode string, like .text() does.

davidbayo10 · 2016-06-03T18:03:10Z

This code is working for me. The problem was that the HTML response was not being decoded with the correct encoding.

const request = require('request');
const cheerio = require('cheerio');
const iconv = require('iconv-lite');

function getDOM(url) {
  return new Promise(function (resolve, reject) {
    request({ uri: url, encoding: null }, function (err, res, html) {
      if (err) {
        reject(err);
      } else {
        html = iconv.decode(html, 'ISO-8859-1');
        resolve(cheerio.load(html));
      }
    });
  });
}

Cactusbone · 2016-11-30T10:26:28Z

Looks like it's done that way on purpose in dom-serializer:

if (opts.decodeEntities && !(elem.parent && elem.parent.name in unencodedElements)) {
  data = entities.encodeXML(data);
}

This issue is linked to cheeriojs/dom-serializer#26
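
To make the effect of that branch concrete, here is a minimal sketch that mimics the output seen in this thread. It is not dom-serializer's or entities' actual code; `encodeLikeXml` and `XML_SPECIALS` are hypothetical names, and the assumption is simply that XML special characters become named entities while every non-ASCII code point becomes a hex entity.

```javascript
// Illustrative only: reproduces the *effect* reported in this issue,
// not the real implementation inside dom-serializer or entities.
const XML_SPECIALS = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&apos;' };

function encodeLikeXml(text) {
  let out = '';
  // for...of iterates by code point, so astral characters stay in one piece
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (XML_SPECIALS[ch]) {
      out += XML_SPECIALS[ch];
    } else if (code >= 0x80) {
      // Every non-ASCII character is turned into a hex entity
      out += '&#x' + code.toString(16).toUpperCase() + ';';
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(encodeLikeXml('абв'));       // &#x430;&#x431;&#x432;
console.log(encodeLikeXml('a < b & c')); // a &lt; b &amp; c
```

The second call shows the part of the encoding that is actually needed for safety; the first shows the part people in this thread want to avoid.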

Cactusbone · 2016-11-30T10:44:23Z

Found a way around the problem:

var cheerio = require('cheerio')
var ch = cheerio.load('<div>абв</div>', { decodeEntities: true })
console.log(ch('div').html({ decodeEntities: false }))

rlidwka · 2017-01-27T14:03:41Z

@Cactusbone :

I stumbled across this issue, so I took my time to test this workaround:

> var cheerio = require('cheerio')
undefined
> var ch = cheerio.load('<div>&lt;script&gt;alert(1)&lt;/script&gt;</div>', { decodeEntities: true })
undefined
> console.log(ch.html({ decodeEntities: false }))
<div><script>alert(1)</script></div>

It's working all right. The problem is: you're disabling all entity encoding, even entities that really need to be escaped. Which opens up an obvious XSS attack.
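
For comparison, a middle ground is to escape only the characters that are significant in markup and leave everything else alone. The sketch below follows the text-node escaping rules of the HTML Standard's fragment serialization algorithm (`&`, U+00A0, `<`, `>`). The function name `escapeTextForHtml` is hypothetical, and this is not a cheerio option or API, just an illustration of the idea.

```javascript
// Sketch of text-node escaping as described by the HTML fragment
// serialization algorithm: only &, U+00A0, < and > are replaced.
// Non-ASCII characters such as Cyrillic are passed through unchanged.
function escapeTextForHtml(text) {
  return text
    .replace(/&/g, '&amp;')       // must run first to avoid double-escaping
    .replace(/\u00A0/g, '&nbsp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

console.log(escapeTextForHtml('<script>alert(1)</script> абв'));
// &lt;script&gt;alert(1)&lt;/script&gt; абв
```

The markup stays inert while the Cyrillic text is readable, which is the behaviour most commenters here are asking for.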

Cactusbone · 2017-01-27T14:33:18Z

Thanks for testing! Indeed that's bad :( I'll add this to the test. Did you find a way around this?

rlidwka · 2017-01-27T15:55:43Z

> Did you find a way around this?

An obvious solution is to undo all the escaping work that dom-serializer does:

var cheerio = require('cheerio')

var cheerio_html = cheerio.prototype.html

cheerio.prototype.html = function wrapped_html() {
  var result = cheerio_html.apply(this, arguments)
 
  if (typeof result === 'string') {
    result = result.replace(/&#x([0-9a-f]{1,6});/ig, function (entity, code) {
      code = parseInt(code, 16)

      // don't unescape ascii characters, assuming that all ascii characters
      // are encoded for a good reason
      if (code < 0x80) return entity

      return String.fromCodePoint(code)
    })
  }

  return result
}

console.log(cheerio.load('<div>абв"&quot;&lt;&gt;</div>').root().html())
// <div>абв&quot;&quot;&lt;&gt;</div>

It modifies cheerio prototype (which might not be desirable in your case), and it could slow down parsing if you're calling html() a lot. So I'd be happy to know if there's a better solution out there.
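
If patching the prototype is a concern, the same post-processing can be applied as a plain function on the string returned by html(). The sketch below is one way to do that, with one extra guard: the regex accepts up to six hex digits, so a value above 0x10FFFF would make String.fromCodePoint throw a RangeError. The name `decodeNonAsciiHexEntities` is hypothetical.

```javascript
// Standalone variant of the workaround: no prototype patching.
// Call it on the string that .html() returns.
function decodeNonAsciiHexEntities(html) {
  return html.replace(/&#x([0-9a-f]{1,6});/gi, (entity, hex) => {
    const code = parseInt(hex, 16);
    // Keep ASCII entities (e.g. &#x3C; for "<"): they may be escaped for safety.
    if (code < 0x80) return entity;
    // Leave out-of-range values and lone surrogates untouched instead of throwing.
    if (code > 0x10ffff || (code >= 0xd800 && code <= 0xdfff)) return entity;
    return String.fromCodePoint(code);
  });
}

console.log(decodeNonAsciiHexEntities('&#x430;&#x431;&#x432; &#x3C;b&#x3E; &#xFFFFFF;'));
// абв &#x3C;b&#x3E; &#xFFFFFF;
```

Like the original workaround, this only handles hex entities, which is the form shown in the outputs in this thread.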

Cactusbone · 2017-01-30T11:13:41Z

I think cheeriojs/dom-serializer#33 & fb55/entities#28 might do the trick. I'll try that

ihteandr · 2017-08-04T22:42:58Z

Same problem with Russian on version 1.0.0-rc.2

konstantinblaesi · 2017-10-18T10:15:45Z

@fb55 @matthewmueller @jugglinmike What's your opinion on this? It confuses me that setting { decodeEntities: true } causes encoding of non-latin characters. It would be nice if cheerio behaved like jQuery in the case of .text() and .html(). Currently I have to set { decodeEntities: false } to prevent encoding of non-latin characters, but I have to use { decodeEntities: true } to decode HTML entities such as &amp;.

7iomka · 2018-03-25T11:36:50Z

Any ideas?

7iomka · 2018-03-25T11:53:31Z

@dsavenko Hi. Have you found any workaround so far?

ihteandr · 2018-03-25T12:19:01Z

In the real world I ran into this kind of encoding problem while using the request module.
I resolved it this way:
set the request option encoding to null;
decode the response manually with
var iconv = require('iconv-lite');
iconv.decode(html, 'win1251')
I hope this helps.

7iomka · 2018-03-25T17:28:29Z

@ihteandr, this does not help me. I have normal HTML with UTF-8 encoding, but after cheerio.load(html) only the Cyrillic characters come out looking like 'о&#x431'
And I do not need it to decode the HTML entities that are already encoded in my HTML! That is exactly what it does if you turn off the decodeEntities: false option
:(
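
The two problems discussed here look different in the output, which can help tell them apart. A wrong response charset produces mojibake, not numeric entities. The following sketch simulates a charset mismatch using only Node's built-in Buffer; it is an illustration and does not involve cheerio at all.

```javascript
const original = 'абв';

// Decode the UTF-8 bytes with the wrong charset (latin1) to simulate a
// response whose encoding was mis-declared or mis-detected.
const utf8Bytes = Buffer.from(original, 'utf8');
const mojibake = utf8Bytes.toString('latin1');

console.log(mojibake);                 // Ð°Ð±Ð²
console.log(mojibake.includes('&#x')); // false

// Re-encoding as latin1 and decoding as UTF-8 recovers the text.
const recovered = Buffer.from(mojibake, 'latin1').toString('utf8');
console.log(recovered === original);   // true
```

If the output contains sequences like &#x431; instead, the input was decoded correctly and the entities come from serialization, which is the subject of this issue.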

nitwhiz · 2018-04-21T18:07:04Z

For me decodeEntities doesn't change a single thing. They always get encoded.

The solution of @rlidwka works, but I need the &nbsp; and &amp; and so on in my HTML code. At the same time, I don't want some &#xx; for every non-ASCII character.

I don't see a problem with, or any reason against, leaving such things untouched.
How about just leaving the "text" as it is, at least optionally?

7iomka · 2018-04-22T09:24:14Z

@daintycode.
I resolved this problem (a heavy-handed solution, but it finally works!):

const fs = require('fs');
const cheerio = require('cheerio');
const sanitizeHtml = require('sanitize-html');

// `html` (input markup) and `output` (target file path) are defined elsewhere
const $ = cheerio.load(html, {
  decodeEntities: true
});
...
...
// Write sanitized html to the files
fs.writeFile(output, sanitizeHtml($.html(), {
  allowedTags: false,
  allowedAttributes: false,
  // Lots of these won't come up by default because we don't allow them
  selfClosing: ['img', 'br', 'hr', 'area', 'base', 'basefont', 'input', 'link', 'meta'],
  // URL schemes we permit
  allowedSchemes: ['http', 'https', 'ftp', 'mailto'],
  // allowedSchemes: false,
  allowedSchemesByTag: {},
  allowedSchemesAppliedToAttributes: ['href', 'src', 'cite'],
  // allowedSchemesAppliedToAttributes: false,
  allowProtocolRelative: true,
  // allowedIframeHostnames: ['www.youtube.com', 'player.vimeo.com']
  allowedIframeHostnames: false,
  parser: {
    // This parser option is what resolves the cheerio problem
    decodeEntities: true
  }
}), (err) => {
  if (err) {
    console.log(`Error rendering ${err.message}`);
  }
});

But I don't understand why Russian users (and everyone else whose language is not English) should run into problems like this when using a plugin out of the box? 👎

GHolk · 2018-11-25T02:52:52Z

I fixed this issue with a series of patches to entities, dom-serializer, and cheerio itself,
in pull request #1249. However, entities and dom-serializer have not accepted my pull requests yet.

claviska · 2019-04-12T21:36:32Z

Expanding on @rlidwka's solution, I wrote a wrapper module that monkey patches both HTML methods so you can use $.html() and $(selector).html() with consistent results.

Monkey patching isn't my preferred way to solve a problem, but when you're using multiple instances of Cheerio it makes things less painful. It also makes it easier to revert to the original lib once it's been fixed.

Tested in 0.22.0, 1.0.0-rc.1, and 1.0.0-rc.2.

const cheerio = require('cheerio');
const load = cheerio.load;

function decode(string) {
  return string.replace(/&#x([0-9a-f]{1,6});/ig, (entity, code) => {
    code = parseInt(code, 16);

    // Don't unescape ASCII characters, assuming they're encoded for a good reason
    if (code < 0x80) return entity;

    return String.fromCodePoint(code);
  });
}

function wrapHtml(fn) {
  return function() {
    const result = fn.apply(this, arguments);
    return typeof result === 'string' ? decode(result) : result;
  };
}

cheerio.load = function() {
  const instance = load.apply(this, arguments);

  instance.html = wrapHtml(instance.html);
  instance.prototype.html = wrapHtml(instance.prototype.html);

  return instance;
};

module.exports = cheerio;

Example:

const $ = cheerio.load('<p>Here’s a “quote” for ‘you’</p>');

console.log(
  $.html(),
  $.root().html(),
  $('p').html()
);

/*
Output without patch:

<html><head></head><body><p>Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;</p></body></html>
<html><head></head><body><p>Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;</p></body></html>
Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;

Output with patch:

<html><head></head><body><p>Here’s a “quote” for ‘you’</p></body></html>
<html><head></head><body><p>Here’s a “quote” for ‘you’</p></body></html>
Here’s a “quote” for ‘you’
*/

matthewmueller · 2019-04-15T04:33:32Z

Hey @fb55, do you know a way forward on this issue?

kouhin · 2019-04-15T05:16:57Z

@fb55 A simple test case is here

import cheerio from 'cheerio';
import { expect } from 'chai';

// Wrap bare fragments in <body> and serialize only the head and body contents
function cheerioLoad(str) {
  const $ = cheerio.load(
    str.indexOf('<body>') === -1 ? `<body>${str}</body>` : str,
  );
  $.originalHTML = $.html;
  $.html = () => $('head').html() + $('body').html();
  return $;
}

it('test', () => {
  const content = '<div><p>&lt;a&gt;あああああ&lt;img&gt;</p></div>';
  const $ = cheerioLoad(content);
  expect($.html()).to.be.equal(
    '<div><p>&lt;a&gt;あああああ&lt;img&gt;</p></div>',
  );
});

The result will be

      + expected - actual

      -<div><p>&lt;a&gt;&#x3042;&#x3042;&#x3042;&#x3042;&#x3042;&lt;img&gt;</p></div>
      +<div><p>&lt;a&gt;あああああ&lt;img&gt;</p></div>

dzcpy · 2019-07-13T20:25:34Z

Any updates? I just confirmed that there's still an issue with non-latin characters.

matthewmueller · 2019-08-26T02:31:56Z

Please keep comments constructive and useful.

peterbe · 2020-01-21T01:36:31Z

I'd like to endorse @claviska 's solution. It worked for me.
Can this please become part of cheerio core? Who knows how to make a PR that passes and gets accepted?

fb55 · 2020-12-22T12:31:17Z

This should be resolved with the latest release. We are now using a new serializer for HTML, which no longer encodes non-ASCII characters.

oppilate · 2020-12-22T13:27:39Z

> This should be resolved with the latest release. We are now using a new serializer for HTML, which no longer encodes non-ASCII characters.

Can you explain more? htmlparser2 seems to be the default only for XML parsing. For HTML it still uses parse5, unless you do

// Usage as of htmlparser2 version 3:
const htmlparser2 = require('htmlparser2');
const dom = htmlparser2.parseDOM(document, options);

const $ = cheerio.load(dom);

Am I right?

fb55 · 2020-12-22T16:00:49Z

Yes, we are now using parse5's serializer, which doesn't encode non-latin characters anymore.
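
For anyone who wants to confirm whether a given cheerio version still needs one of the workarounds above, a small probe can be run against its serializer. This is a hedged sketch: `stillEncodesNonAscii` is a hypothetical helper, and the serializer is passed in as a function so the check itself has no dependency on cheerio. With cheerio installed, the assumed usage would be something like `stillEncodesNonAscii((html) => cheerio.load(html).html())`.

```javascript
// `serialize` is any function that takes an HTML string and returns the
// serialized output (assumed usage: (html) => cheerio.load(html).html()).
function stillEncodesNonAscii(serialize) {
  const probe = '<p>абв</p>';
  const output = serialize(probe);
  // Any numeric entity (hex or decimal) in the output means the probe text was rewritten.
  return /&#(x[0-9a-f]+|[0-9]+);/i.test(output);
}

// Stand-ins for an old-style and a new-style serializer, for illustration only.
const oldStyle = (html) =>
  html.replace(/[\u0080-\uffff]/g, (c) => '&#x' + c.charCodeAt(0).toString(16) + ';');
const newStyle = (html) => html;

console.log(stillEncodesNonAscii(oldStyle)); // true
console.log(stillEncodesNonAscii(newStyle)); // false
```

A `false` result suggests the monkey patches discussed earlier can be dropped for that setup.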

oppilate · 2020-12-22T16:06:44Z

> Yes, we are now using parse5's serializer, which doesn't encode non-latin characters anymore.

Oh I see. Thanks!

Cactusbone added a commit to F4-Group/cheerio that referenced this issue Nov 30, 2016

unwanted characters encoded test for cheeriojs#866

8186bce

Cactusbone mentioned this issue Nov 30, 2016

Decode entities workaround test #953

Closed

chawyehsu mentioned this issue Dec 17, 2016

let cheerio do not decode entities bubkoo/hexo-toc#15

Merged

d4n1elchen mentioned this issue Dec 27, 2017

should .html() decode entities? #1124

Closed

peremenov mentioned this issue Apr 5, 2018

Plugin replaces all Cyrillic characters with html entities axe312ger/metalsmith-adaptive-images#42

Closed

sbmaxx mentioned this issue May 13, 2018

get rid off cheerio turboext/css#16

Merged

jamsinclair added a commit to jamsinclair/budou-node that referenced this issue Aug 10, 2018

Replace cheerio with JSDOM, known problem with encoding non latin chars

c7fdb1d

- See cheerio issue: cheeriojs/cheerio#866

crimx mentioned this issue Apr 12, 2019

Plugin converts all text to HTML entity encoding. crimx/hexo-filter-github-emojis#17

Closed

dzcpy mentioned this issue Jul 13, 2019

Non-latin text should not be HTML encoded automatically with @html selector matthewmueller/x-ray#350

Merged

Ritsuka314 mentioned this issue Aug 7, 2019

cheerio causes text to be converted to HTML entity encoding Ritsuka314/hexo-pandoc-tippy#4

Closed

claviska mentioned this issue Aug 7, 2019

decodeEntities false code tag #1198

Closed

crimx mentioned this issue Aug 14, 2019

Fixed code blocks rendering as html crimx/hexo-filter-github-emojis#22

Closed

zemlanin added a commit to zemlanin/scroll that referenced this issue Aug 21, 2019

monkeypatch cheerio to keep good unicode

963a0e5

cheeriojs/cheerio#866 (comment)

cheeriojs deleted a comment from hassanila Aug 26, 2019

claviska mentioned this issue Sep 22, 2019

Meta russian language Postleaf/postleaf#110

Open

hftf added a commit to hftf/oligodendrocytes that referenced this issue Oct 16, 2019

PG: Output Unicode, not entities

e689186

See cheeriojs/cheerio#866

Prinzhorn mentioned this issue Dec 5, 2019

cheerio convert dom to html is not expect #1006

Closed

mrvautin mentioned this issue Apr 4, 2020

Problems with utf-8 mrvautin/metaget#10

Closed

oppilate mentioned this issue May 11, 2020

Fix parsing of non-latin characters jbrayton/mercury-parser#15

Open

ryzzn added a commit to ryzzn/hexo-renderer-org that referenced this issue Aug 10, 2020

fix wrong text for toc of hexo articles

cdf09a2

ref cheeriojs/cheerio#866

ryzzn mentioned this issue Aug 10, 2020

Fix emacs failing to start when the file is too long, and fix garbled TOC text coldnew/hexo-renderer-org#78

Open

fb55 closed this as completed Dec 22, 2020

peterbe mentioned this issue Dec 22, 2020

No more need for monkeypatched-cheerio.js ? mdn/yari#2270

Closed

This was referenced Dec 22, 2020

chore: optimize performance (replace he with entities) DIYgod/RSSHub#6497

Merged

feat: use htmlparser2 to improve performance DIYgod/RSSHub#6521

Closed
