Non-latin characters get HTML-encoded with decodeEntities=true #866
Comments
This was reported in #565 and for some reason closed. I agree, |
This code works for me. The problem was that the HTML response was not being decoded with the correct encoding:
const request = require('request');
const cheerio = require('cheerio');
const iconv = require('iconv-lite');
function getDOM(url) {
  return new Promise(function (resolve, reject) {
    // encoding: null makes request hand back the body as a raw Buffer
    request({ uri: url, encoding: null }, function (err, res, html) {
      if (err) {
        reject(err);
      } else {
        // Decode the raw bytes with the page's actual charset before parsing
        html = iconv.decode(html, 'ISO-8859-1');
        resolve(cheerio.load(html));
      }
    });
  });
}
|
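To illustrate why the charset matters in the snippet above, here is a minimal sketch that uses only Node's built-in Buffer instead of request and iconv-lite. The helper name `decodeBody` and the sample bytes are assumptions made for illustration; Node's `latin1` encoding maps each byte to the code point of the same value, which corresponds to ISO-8859-1.

```javascript
// Hypothetical helper: decode a raw response body with an explicit charset.
// This sketch only supports the encodings built into Node's Buffer.
function decodeBody(buffer, charset) {
  return buffer.toString(charset);
}

// "café" encoded as ISO-8859-1: the final byte 0xE9 is "é".
const body = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

const asLatin1 = decodeBody(body, 'latin1'); // matches the real charset
const asUtf8 = decodeBody(body, 'utf8');     // wrong charset for these bytes

console.log(asLatin1); // the text as the server meant it
console.log(asUtf8);   // 0xE9 is not valid UTF-8 here, so it becomes U+FFFD
```

If the page really is UTF-8, this step is unnecessary, which is why this workaround does not help everyone in the thread.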
looks like it's manually done that way in
issue is linked to cheeriojs/dom-serializer#26 |
Found a way around the problem:
var cheerio = require('cheerio')
var ch = cheerio.load('<div>абв</div>', { decodeEntities: true })
console.log(ch('div').html({ decodeEntities: false }))
|
I stumbled across this issue, so I took the time to test this workaround:
> var cheerio = require('cheerio')
undefined
> var ch = cheerio.load('<div>&lt;script&gt;alert(1)&lt;/script&gt;</div>', { decodeEntities: true })
undefined
> console.log(ch.html({ decodeEntities: false }))
<div><script>alert(1)</script></div>
It works all right. The problem is that you're disabling all entity encoding, even for entities that really need to stay escaped, which opens up an obvious XSS hole. |
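To make the risk described above concrete, here is a minimal sketch in plain JavaScript of the escaping that serialization is expected to apply to text content. It is not cheerio's or dom-serializer's actual code, and the function name `escapeText` is hypothetical.

```javascript
// Hypothetical helper, for illustration only: escape the characters that
// are significant in HTML text content. "&" has to be replaced first so
// that the entities produced by the later steps are not escaped again.
function escapeText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Text content as it exists in the DOM after entities were decoded on parse.
const userText = '<script>alert(1)</script>';

// Serializing with escaping keeps the text inert...
const safe = '<div>' + escapeText(userText) + '</div>';
// ...while serializing without it turns the text into real markup.
const unsafe = '<div>' + userText + '</div>';

console.log(safe);
console.log(unsafe);
```

A usable fix therefore has to leave this escaping in place and only stop encoding characters that are harmless, such as Cyrillic letters.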
Thanks for testing! Indeed, that's bad :( I'll add this to the tests. Did you find a way around this? |
An obvious solution is to undo the escaping work that dom-serializer does:
var cheerio = require('cheerio')
var cheerio_html = cheerio.prototype.html
cheerio.prototype.html = function wrapped_html() {
  var result = cheerio_html.apply(this, arguments)
  if (typeof result === 'string') {
    result = result.replace(/&#x([0-9a-f]{1,6});/ig, function (entity, code) {
      code = parseInt(code, 16)
      // don't unescape ascii characters, assuming that all ascii characters
      // are encoded for a good reason
      if (code < 0x80) return entity
      return String.fromCodePoint(code)
    })
  }
  return result
}
console.log(cheerio.load('<div>абв&quot;&quot;&lt;&gt;</div>').root().html())
// <div>абв&quot;&quot;&lt;&gt;</div>
It modifies the cheerio prototype (which might not be desirable in your case), and it could slow things down if you call html() a lot. So I'd be happy to know if there's a better solution out there. |
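The un-escaping step above can also be tried in isolation, without cheerio. This is a sketch under the assumption that the serializer emits hexadecimal numeric entities (`&#x...;`) for non-ASCII characters, as reported in this thread. The name `decodeNonAscii` and the upper range guard are additions for illustration and are not part of the original workaround.

```javascript
// Hypothetical standalone version of the un-escaping step.
function decodeNonAscii(html) {
  return html.replace(/&#x([0-9a-f]{1,6});/gi, function (entity, hex) {
    const code = parseInt(hex, 16);
    // Leave ASCII entities alone: they may be escaped for safety.
    if (code < 0x80) return entity;
    // Added guard: String.fromCodePoint throws a RangeError above 0x10FFFF.
    if (code > 0x10ffff) return entity;
    return String.fromCodePoint(code);
  });
}

// Cyrillic letters as hex entities, followed by an escaped "<b>" and
// named entities, all of which should survive unchanged.
const encoded = '<div>&#x430;&#x431;&#x432;&#x3C;b&#x3E;&lt;&gt;</div>';
console.log(decodeNonAscii(encoded));
```

Only the three Cyrillic entities are turned back into characters; the ASCII numeric entities and the named entities stay escaped, so the XSS concern raised earlier does not return.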
I think cheeriojs/dom-serializer#33 & fb55/entities#28 might do the trick. I'll try that |
Same problem with Russian on version 1.0.0-rc.2 |
@fb55 @matthewmueller @jugglinmike What's your opinion on this? It confuses me that setting |
Any ideas? |
@dsavenko Hi. Have you found any workaround for now? |
In the real world I ran into this kind of encoding problem while using the request module. |
@ihteandr, this doesn't help me. I have normal HTML in UTF-8, but after cheerio.load(html) the Cyrillic characters (and only those) come out looking like '&#x43E;&#x431;' |
For me The solution of @rlidwka works, but I need the I don't see a problem and/or the need in not touching such things? |
@daintycode.
But I don't understand why Russian users (and everyone else whose language isn't English) should run into problems like this when using a plugin out of the box. 👎 |
I fixed this issue with a series of patches to entities, dom-serializer, and cheerio itself. |
Expanding on @rlidwka's solution, I wrote a wrapper module that monkey patches both HTML methods so you can use
Monkey patching isn't my preferred way to solve a problem, but when you're using multiple instances of Cheerio it makes things less painful. It also makes it easier to revert to the original lib once it's been fixed. Tested in 0.22.0, 1.0.0-rc.1, and 1.0.0-rc.2.
const cheerio = require('cheerio');
const load = cheerio.load;
// Turn hex numeric entities for non-ASCII characters back into characters
function decode(string) {
  return string.replace(/&#x([0-9a-f]{1,6});/ig, (entity, code) => {
    code = parseInt(code, 16);
    // Don't unescape ASCII characters, assuming they're encoded for a good reason
    if (code < 0x80) return entity;
    return String.fromCodePoint(code);
  });
}
// Wrap an html() method so that string results are passed through decode()
function wrapHtml(fn) {
  return function() {
    const result = fn.apply(this, arguments);
    return typeof result === 'string' ? decode(result) : result;
  };
}
// Patch both $.html() and $(selector).html() on every loaded instance
cheerio.load = function() {
  const instance = load.apply(this, arguments);
  instance.html = wrapHtml(instance.html);
  instance.prototype.html = wrapHtml(instance.prototype.html);
  return instance;
};
module.exports = cheerio;
Example:
const $ = cheerio.load('<p>Here’s a “quote” for ‘you’</p>');
console.log(
  $.html(),
  $.root().html(),
  $('p').html()
);
/*
Output without patch:
<html><head></head><body><p>Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;</p></body></html>
<html><head></head><body><p>Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;</p></body></html>
Here&#x2019;s a &#x201C;quote&#x201D; for &#x2018;you&#x2019;
Output with patch:
<html><head></head><body><p>Here’s a “quote” for ‘you’</p></body></html>
<html><head></head><body><p>Here’s a “quote” for ‘you’</p></body></html>
Here’s a “quote” for ‘you’
*/
|
Hey @fb55, do you know a way forward on this issue? |
@fb55 A simple test case is here
The result will be
|
Any updates? I just confirmed that there's still an issue with non-latin characters. |
Please keep comments constructive and useful. |
I'd like to endorse @claviska's solution. It worked for me. |
This should be resolved with the latest release. We are now using a new serializer for HTML, which no longer encodes non-ASCII characters. |
Can you explain more?
Am I right? |
Yes — we are now using parse5's serializer, which doesn't encode non-latin characters anymore. |
Oh I see. Thanks! |
Hi,
Consider this code:
It prints &#x430;&#x431;&#x432;. If I set decodeEntities to false, the output will be the expected абв.
Two issues here:
decodeEntities is supposed to work. I tested with htmlparser2 directly, and it works as expected both ways (the code is below).
decodeEntities to true for security reasons.
Test code for htmlparser:
Output:
--> абв (as expected)
P.S. Cheerio version: 0.20.0