Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-latin characters get HTML-encoded with decodeEntities=true #866

Closed
dsavenko opened this issue May 17, 2016 · 27 comments
Closed

Non-latin characters get HTML-encoded with decodeEntities=true #866

dsavenko opened this issue May 17, 2016 · 27 comments

Comments

@dsavenko
Copy link

dsavenko commented May 17, 2016

Hi,

Consider this code:

var cheerio = require('cheerio')
var ch = cheerio.load('<div>абв</div>', { decodeEntities: true })
console.log(ch('div').html())

It prints &#x430;&#x431;&#x432;. If I set decodeEntities to false, the output will be the expected абв.

Two issues here:

  1. This is not how decodeEntities is supposed to work. I tested with htmlparser2 directly, and it works as expected both ways (the code is below).
  2. Htmlparser recommends always set decodeEntities to true for security reasons.

Test code for htmlparser:

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    ontext: function(text){
        console.log("-->", text);
    }
}, {decodeEntities: true});
parser.write("<div>абв</div>");
parser.end();

Output: --> абв (as expected)

P.S. Cheerio version: 0.20.0

@shelldweller
Copy link

This was reported in #565 and for some reason closed. I agree, .html() should not escape non-ASCII characters to entities, but should return Unicode string, like .text() does.

@davidbayo10
Copy link

This code is working for me. The problem was html response which was not encoding correct

const request = require('request');
const cheerio = require('cheerio');
const iconv = require('iconv-lite');

function getDOM(url) {
  return new Promise(function (resolve, reject) {
    request({ uri: url, encoding: null }, function (err, res, html) {
      if (err) {
        reject(err);
      } else {
        html = iconv.decode(html, 'ISO-8859-1');
        resolve(cheerio.load(html));
      }
    });
  });
}

@Cactusbone
Copy link

Cactusbone commented Nov 30, 2016

looks like it's manually done that way in dom-serializer.

 if (opts.decodeEntities && !(elem.parent && elem.parent.name in unencodedElements)) {
    data = entities.encodeXML(data);
  }

issue is linked to cheeriojs/dom-serializer#26

@Cactusbone
Copy link

found a way around the problem :

var cheerio = require('cheerio')
var ch = cheerio.load('<div>абв</div>', { decodeEntities: true })
console.log(ch('div').html({ decodeEntities: false }))

@rlidwka
Copy link

rlidwka commented Jan 27, 2017

@Cactusbone :

I stumbled across this issue, so I took my time to test this workaround:

> var cheerio = require('cheerio')
undefined
> var ch = cheerio.load('<div>&lt;script&gt;alert(1)&lt;/script&gt;</div>', { decodeEntities: true })
undefined
> console.log(ch.html({ decodeEntities: false }))
<div><script>alert(1)</script></div>

It's working all right. The problem is: you're disabling all entity encoding, even entities that really need to be escaped. Which opens up an obvious XSS attack.

@Cactusbone
Copy link

Thanks for the testing ! Indeed that's bad :( I'll add this to the test. Did you find a way around this ?

@rlidwka
Copy link

rlidwka commented Jan 27, 2017

Did you find a way around this ?

An obvious solution is, undo all the escaping work that dom-serializer does:

var cheerio = require('cheerio')

var cheerio_html = cheerio.prototype.html

cheerio.prototype.html = function wrapped_html() {
  var result = cheerio_html.apply(this, arguments)
 
  if (typeof result === 'string') {
    result = result.replace(/&#x([0-9a-f]{1,6});/ig, function (entity, code) {
      code = parseInt(code, 16)

      // don't unescape ascii characters, assuming that all ascii characters
      // are encoded for a good reason
      if (code < 0x80) return entity

      return String.fromCodePoint(code)
    })
  }

  return result
}

console.log(cheerio.load('<div>абв"&quot;&lt;&gt;</div>').root().html())
// <div>абв&quot;&quot;&lt;&gt;</div>

It modifies cheerio prototype (which might not be desirable in your case), and it could slow down parsing if you're calling html() a lot. So I'd be happy to know if there's a better solution out there.

@Cactusbone
Copy link

I think cheeriojs/dom-serializer#33 & fb55/entities#28 might do the trick. I'll try that

@ihteandr
Copy link

ihteandr commented Aug 4, 2017

some problem with Russian on version 1.0.0-rc.2

@konstantinblaesi
Copy link

konstantinblaesi commented Oct 18, 2017

@fb55 @matthewmueller @jugglinmike What's your opinion on this? It confuses me that setting { decodeEntities: true } causes encoding of non-latin characters. It would be nice if cheerio was behaving like jQuery in the case of .text() and .html(). Currently I have to set { decodeEntities: false } to prevent encoding of non-latin characters, but I have to use { decodeEntities: true } to decode html entities such as &amp;.

@7iomka
Copy link

7iomka commented Mar 25, 2018

Any ideas?

@7iomka
Copy link

7iomka commented Mar 25, 2018

@dsavenko Привет. Нашёл какое-то обходное решение на текущий момент?

@ihteandr
Copy link

I real world I such problem with encoding while use request module
I resolved it this way
put request options encoding=null;
encoded response manually with
var iconv = require('iconv-lite');
iconv.decode(html, 'win1251')
I hope this help

@7iomka
Copy link

7iomka commented Mar 25, 2018

@ihteandr, this not help for me. I have normal html with utf-8 encoding, but only cyrillic symbols after cheerio.load(html) looks like 'о&#x431'
And I do not need it to decode the html entities already encoded in my html! this is exactly what he does if you disable option decodeEntities: false
:(

@nitwhiz
Copy link

nitwhiz commented Apr 21, 2018

For me decodeEntities doesn't change a single thing. They get encoded always.

The solution of @rlidwka works, but I need the &nbsp; and &amp; and so on in my html code. On the same hand, I don't want some &#xx; for every non-ASCII character.

I don't see a problem and/or the need in not touching such things?
How about just leaving the "text" as it is, at least optional?

@7iomka
Copy link

7iomka commented Apr 22, 2018

@daintycode.
I resolved this problem (hard solution, but, it works finally!):

const cheerio = require('cheerio');
const sanitizeHtml = require('sanitize-html');
const $ = cheerio.load(html, {
            decodeEntities: true
        });
...
...
// Write sanitized html to the files
        fs.writeFile(output, sanitizeHtml($.html(), {
            allowedTags: false,
            allowedAttributes: false,
            // Lots of these won't come up by default because we don't allow them
            selfClosing: ['img', 'br', 'hr', 'area', 'base', 'basefont', 'input', 'link', 'meta'],
            // URL schemes we permit
            allowedSchemes: ['http', 'https', 'ftp', 'mailto'],
            // allowedSchemes: false,
            allowedSchemesByTag: {},
            allowedSchemesAppliedToAttributes: ['href', 'src', 'cite'],
            // allowedSchemesAppliedToAttributes: false,
            allowProtocolRelative: true,
            // allowedIframeHostnames: ['www.youtube.com', 'player.vimeo.com']
            allowedIframeHostnames: false,
            parser: {
                // THIS LINE OF CONFIG RESOLVE cheerio problem!
                decodeEntities: true
            }
        }), (err) => {
            if (err) {
                console.log(`Error rendering ${err.message}`);
            } 
        });

But, I don't understand why Russian users (and all others whose language is different from English) should experience similar problems using a plugin from the box? 👎

jamsinclair added a commit to jamsinclair/budou-node that referenced this issue Aug 10, 2018
@GHolk
Copy link

GHolk commented Nov 25, 2018

i fix this issue by a series of patch on entities, dom-serializer, and cheerio it self.
on pull request #1249 . however entities and dom-serializer not accept my pull request yet.

@claviska
Copy link

Expanding on @rlidwka's solution, I wrote a wrapper module that monkey patches both HTML methods so you can use $.html() and $(selector).html() with consistent results.

Monkey patching isn't my preferred way to solve a problem, but when you're using multiple instances of Cheerio it makes things less painful. It also makes it easier to revert to the original lib once it's been fixed.

Tested in 0.22.0, 1.0.0-rc.1, and 1.0.0-rc.2.

const cheerio = require('cheerio');
const load = cheerio.load;

function decode(string) {
  return string.replace(/&#x([0-9a-f]{1,6});/ig, (entity, code) => {
    code = parseInt(code, 16);

    // Don't unescape ASCII characters, assuming they're encoded for a good reason
    if (code < 0x80) return entity;

    return String.fromCodePoint(code);
  });
}

function wrapHtml(fn) {
  return function() {
    const result = fn.apply(this, arguments);
    return typeof result === 'string' ? decode(result) : result;
  };
}

cheerio.load = function() {
  const instance = load.apply(this, arguments);

  instance.html = wrapHtml(instance.html);
  instance.prototype.html = wrapHtml(instance.prototype.html);

  return instance;
};

module.exports = cheerio;

Example:

const $ = cheerio.load('<p>Here’s a “quote” for ‘you’</p>');

console.log(
  $.html(),
  $.root().html(),
  $('p').html()
);

/*
Output without patch:

<html><head></head><body><p>Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;</p></body></html>
<html><head></head><body><p>Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;</p></body></html>
Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;

Output with patch:

<html><head></head><body><p>Here’s a “quote” for ‘you’</p></body></html>
<html><head></head><body><p>Here’s a “quote” for ‘you’</p></body></html>
Here’s a “quote” for ‘you’
*/

@matthewmueller
Copy link
Member

Hey @fb55, do you know a way forward on this issue?

@kouhin
Copy link

kouhin commented Apr 15, 2019

@fb55 A simple test case is here

import cheerio from 'cheerio';

function cheerioLoad(str) {
  const $ = cheerio.load(
    str.indexOf('<body>') === -1 ? `<body>${str}</body>` : str,
  );
  $.originalHTML = $.html;
  $.html = () => $('head').html() + $('body').html();
  return $;
}

it('test', () => {
      const content = '<div><p>&lt;a&gt;あああああ&lt;img&gt;</p></div>';
      const $ = cheerioLoad(content);
      expect($.html()).to.be.equal(
        '<div><p>&lt;a&gt;あああああ&lt;img&gt;</p></div>',
      );
});

The result will be

      + expected - actual

      -<div><p>&lt;a&gt;&#x3042;&#x3042;&#x3042;&#x3042;&#x3042;&lt;img&gt;</p></div>
      +<div><p>&lt;a&gt;あああああ&lt;img&gt;</p></div>

@dzcpy
Copy link
Contributor

dzcpy commented Jul 13, 2019

Any updates? I just confirmed that there's still an issue with non-latin characters.

@matthewmueller
Copy link
Member

Please keep comments constructive and useful.

@peterbe
Copy link

peterbe commented Jan 21, 2020

I'd like to endorse @claviska 's solution. It worked for me.
Can this please become part of cheerio core? Who knows how to make a PR that passes and gets accepted?

@fb55
Copy link
Member

fb55 commented Dec 22, 2020

This should be resolved with the latest release. We are now using a new serializer for HTML, which no longer encodes non-ASCII characters.

@fb55 fb55 closed this as completed Dec 22, 2020
@oppilate
Copy link

This should be resolved with the latest release. We are now using a new serializer for HTML, which no longer encodes non-ASCII characters.

Can you explain more? htmlparser2 seems to be the default only for XML parsing. For HTML it still uses parser5, unless you

// Usage as of htmlparser2 version 3:
const htmlparser2 = require('htmlparser2');
const dom = htmlparser2.parseDOM(document, options);

const $ = cheerio.load(dom);

Am I right?

@fb55
Copy link
Member

fb55 commented Dec 22, 2020

Yes — we are now using parse5's serializer, which doesn't encode non-latin characters anymore.

@oppilate
Copy link

Yes — we are now using parse5's serializer, which doesn't encode non-latin characters anymore.

Oh I see. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests