Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Script tag contents are HTML escaped #460

Closed
kpdecker opened this issue Apr 21, 2014 · 14 comments
Closed

Script tag contents are HTML escaped #460

kpdecker opened this issue Apr 21, 2014 · 14 comments

Comments

@kpdecker
Copy link
Contributor

As of the latest master the contents of script tags are HTML escaped.

  + expected - actual

  +"<!doctype html>\n<html>\n  <body>foo\n<script>var $serverCache = {\"http://localhost:51205/\": {\"data\":\"get!\"}};</script></body>\n</html>\n"
  -"<!doctype html>\n<html>\n  <body>foo\n<script>var $serverCache = {&quot;http://localhost:51205/&quot;: {&quot;data&quot;:&quot;get!&quot;}};</script></body>\n</html>\n"
@fb55
Copy link
Member

fb55 commented Apr 22, 2014

All entities are now decoded while parsing. That's generally an improvement (especially as this fixed several issues with entities in attribute values), but requires encoding characters once the document is rendered.

There are several characters that don't really have to be escaped, " in text nodes being one of them. <, & and non-ASCII characters should actually be the only ones that require escaping.

The fix in #461 is too narrow. In the end, we should maintain a list of elements that contain plain text, raw text or script data (ie. script, style, xmp, iframe, noembed, noframes, noscript, plaintext) or alternatively add this information to the DOM.

@kpdecker
Copy link
Contributor Author

For #461 this was a bit of a rushed fix as the merge with canonical master
was about to break production. (We're running bleeding edge right now :) )

I think in the long run reducing the number of escaped characters will have
a positive impact as this results in a smaller file size and depending on
the implementation should result in faster rendering times due to reduced
escaping overhead.

That being said, I don't think the list there is accurate, from my reading
of the spec. It seems like script, style, and foreign elements such as
embedded SVG have different escaping standards, but otherwise it's more or
less the same as all other elements (with content) would fall into the
normal elements category.
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#elements-0

On Tue, Apr 22, 2014 at 9:04 AM, Felix Böhm notifications@github.comwrote:

All entities are now decoded while parsing. That's generally an
improvement (especially as this fixed several issues with entities in
attribute values), but requires encoding characters once the document is
rendered.

There are several characters that don't really have to be escaped, " in
text nodes being one of them. <, & and non-ASCII characters should
actually be the only ones that require escaping.

The fix in #461 #461 is too
narrow. In the end, we should maintain a list of elements that contain
plain text, raw text or script data (ie. script, style, xmp, iframe,
noembed, noframes, noscript, plaintext) or alternatively add this
information to the DOM.


Reply to this email directly or view it on GitHubhttps://github.com//issues/460#issuecomment-41043093
.

@fb55
Copy link
Member

fb55 commented Apr 22, 2014

@kpdecker Have a look at the serialization part of the spec:

If current node is a Text node
If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally.

Otherwise, append the value of current node's data IDL attribute, escaped as described below.

cheerio should just assume that the "scripting flag" is set, even though scripts aren't executed.

@fb55
Copy link
Member

fb55 commented Apr 22, 2014

@kpdecker Do you want to give it a try? The dom-serializer module could need some love :)

(IMHO, the easiest solution would be to inline renderText…)

@kpdecker
Copy link
Contributor Author

Sure I could take a look. Like I said the original patch was a knee jerk fix to get around our immediate production issues :( Will try to make some time to look at the whole thing.

kpdecker added a commit to walmartlabs/lumbar that referenced this issue May 12, 2014
@giltayar
Copy link

Just stumbled on the bug when upgrading cheerio.

I also had to pin it to 0.15, as this is a breaking change for us too - we use cheerio to read and transform html, and if the transformed HTML includes a script tag, that script tag will not parse in browsers. (well, I tried Chrome, but I'm assuming all the others...)

@kennu
Copy link

kennu commented May 16, 2014

We also had to pin to Cheerio 0.15, because the output from .html() now has invalid escaped inline stylesheets etc.

@alexindigo
Copy link
Contributor

Similar problem with tag attributes, it mangles following <div data-fetch_summary="{&quot;collection&quot;:5}"> into <div data-fetch_summary="{&#x26;quot;collection&#x26;quot;:5}"> which is pain in the neck.

And I can't see the use case for escaping attribute values, since they weren't decoded on the first place.

@robogeek
Copy link

robogeek commented Jun 1, 2014

I've had to pin to Cheerio 0.15 as well, for the reasons noted above (script tags having their content mangled).

@fb55
Copy link
Member

fb55 commented Jun 1, 2014

Will be fixed by #458.

@alexindigo
Copy link
Contributor

I've found the solution for this problem and other ones, just need to fix the tests.
Going to submit soon.

@alexindigo
Copy link
Contributor

Submitted PR #499

@bitinn
Copy link

bitinn commented Jun 5, 2014

What we are doing might be an edge case, but we use cheerio to parse and build Swig template for rendering on server. And v0.16 starts to escape this line:

<title>{{__("i18n text")}}</title>

into

<title>{{__(&quot;i18n text&quot;)}}</title>

which cause Swig to choke. I am wondering if this is a supported use-case in future? Or can we turn off such escaping with a setting (apologize if i overlook something trivial)

@fb55
Copy link
Member

fb55 commented Jun 5, 2014

The next release will allow you to pass decodeEntities: false, resulting
in the desired behavior.

Felix

fb55 added a commit that referenced this issue Jun 9, 2014
Respect options on the element level

fixes #496, #460
@fb55 fb55 closed this as completed Jun 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants