Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintain tag attribute quote characters #720

Closed
desandro opened this issue Jun 16, 2015 · 14 comments
Closed

Maintain tag attribute quote characters #720

desandro opened this issue Jun 16, 2015 · 14 comments

Comments

@desandro
Copy link

Cheerio changes attributes with single quotes into double quotes.

var cheerio = require('cheerio');
// uses 'single quotes'
var $ = cheerio.load('<div attr=\'value\'></div>')
$.html();
// => <div attr="value"></div>
// has "double quotes"

This is useful for me, as I use JSON in HTML attributes for widget settings.

<div data-settings='{ "option": true }'></div>

Which is encoded with HTML entities (possibly breaking JSON) as

<div data-options="{ &quot;option&quot;: true }"></div>

Setting decodeEntities: false is encoded and breaks HTML

<div data-options="{ "option": true }"></div>

Ideally, cheerio would preserve which quote character is used. I understand this is an edge case, so I'm reporting it in case others run into it. Similar to #460

@fb55
Copy link
Member

fb55 commented Jun 16, 2015

This definitely won't break JSON within browsers. IMHO this won't be fixed.

@OverFlow636
Copy link

@fb55 is there any chance of this being fixed?

The html my app has to consume is very horribly written and I do not have control over fixing it. Not only does it contain raw json inside of a div tag, but most of the href attributes for anchors only work because of single quotes href='javascript: dostuff("asdf");' which like the issue breaks the tag when double quotes are replaced and the decodeEntities: false is used.

@rajsite
Copy link

rajsite commented Jan 9, 2016

+1, I also store JSON in HTML attributes and would like to second keeping single quoted escapes because it makes the html with embedded json much easier to manipulate and read

@achtan
Copy link

achtan commented Jul 15, 2016

+1

@rostacik
Copy link

rostacik commented Aug 9, 2016

gentlemen any update on this one? do you intend to fix this or not at all? its more that 1 year later and task is still open.

@EugeneButusov
Copy link

@fb55, any update for this? Problem is really important, can't store json in meta tags there.

@fb55
Copy link
Member

fb55 commented Oct 18, 2016

Switching to parse5 could fix this (it's another open issue). As I said
before, not sure if this will be fixed with the current architecture as it
doesn't break anything.

– Felix

@EugeneButusov
Copy link

ok @fb55, thank you very much for the feedback.

@gaecom
Copy link

gaecom commented Dec 26, 2016

I fix this.
steps:

  1. find the file node_modules/dom-serializer/index.js
  2. location the line at 68,change
else {
output += key + "='" + (opts.decodeEntities ? entities.encodeXML(value) : value) + "'";
}

To

else {
  if(/[^\\]\"/.test(value)){
        output += key + "='" + (opts.decodeEntities ? entities.encodeXML(value) : value) + "'";
      }else {
        output += key + '="' + (opts.decodeEntities ? entities.encodeXML(value) : value) + '"';

      }
}

@andreysm
Copy link

andreysm commented Jun 22, 2017

Thanks, @gaecom.
I've slightly modified your solution.

    if (opts.plainQuotes && /[^\\]\"/.test(value)) {
        if (opts.decodeEntities) {
          value = entities.encodeXML(value);
          if (opts.plainQuotes) { value = value.replace(/&quot;/g, '"'); }
        }
        output += key + "='" + value + "'";
      } else {
        if (opts.decodeEntities) {
          value = entities.encodeXML(value);
          if (opts.plainQuotes) { value = value.replace(/&apos;/g, "'"); }
        }
        output += key + '="' + value + '"';
      }

This variant works with decodeEntities: true.
It makes some double job with encoding and replacing &quot; back, but it is acceptable for better readability.

@nicolasbadia
Copy link

I also have this issue and the proposed solutions doesn't works for my use case. The HTML parsed by cheerio (via Inky) will also by parsed later by Twig and the quote needs to remain the same. Here is a HTML sample which fail:

{{ "The content may have simple quote like « l' » or « d' », as it is the case for <a href='http://my.link'>tags</a>."|trans }}

@alexflorisca
Copy link

We are having this issue with implementing the new AMP pages by google. One of the parameters requires JSON inside a data attribute like so:

<amp-ad rtc-config='{"urls": ["url1",...]}'>

The single quotes get converted to doubles which invalidates the HTML. JSON can't contain single quotes so swapping double and single quotes doesn't work.

Please cheerio, we need you.

@stephengardner
Copy link

stephengardner commented Jul 10, 2018

SOS. Does anyone have a fix for this without modifying external node_modules?

@ederchrono
Copy link

I just had this problem and fixed it by adding decodeEntities: false to the props when loading the html:

this.$ = cheerio.load(myHtml, {decodeEntities: false})

@fb55 fb55 closed this as completed Nov 1, 2018
kevinansfield added a commit to TryGhost/Ghost that referenced this issue Mar 6, 2024
…ds (#19727)

closes ENG-627

We were using `cheerio` to parse+modify+serialize our rendered HTML to modify links for member attribution. Cheerio's serializer has a [long-standing issue](cheeriojs/cheerio#720) (that we've [had to deal with before](TryGhost/SDK#124)) where it replaces single-quote attributes with double-quote attributes. That was resulting in broken rendering when content used single-quotes such as in HTML cards that have JSON data inside a `data-` attribute or otherwise used single-quotes to avoid escaping double-quotes in an attribute value.

- swapped the implementation that uses `cheerio` for one that uses `html5parser` to tokenize the html string, from there we can loop over the tokens and replace the href attribute values in the original string without touching any other part of the content. Avoids a full parse+serialize process which is both more costly and can result unexpected content changes due to serializer opinions.
  - fixes the quote change bug
  - uses tokenization directly to avoid cost of building a full AST
- updated Content API Posts snapshot
  - one of our fixtures has a missing closing tag which we're no longer "fixing" with a full parse+serialize step in the link replacer (keeps modified src closer to original and better matches behaviour elsewhere in the app / without member-attribution applied)
  - the link replacer no longer converts `attr=""` to `attr` (these are equivalent in the HTML spec so no change in behaviour other than preserving the original source html)
- added a benchmark test file comparing the two implementations because the link replacer runs on render so it's used in a hot path
  - new implementation has a 3x performance improvement
  - the separate files with the old/new implementations have been cleaned up but I've left the benchmark test file in place for future reference

Benchmark results comparing implementations:

```
❯ node test/benchmark.js

LinkReplacer
├─ cheerio: 5.03K /s ±2.20%
├─ html5parser: 16.5K /s ±0.43%

Completed benchmark in 0.9976526670455933s
┌─────────────┬─────────┬────────────┬─────────┬───────┐
│   (index)   │ percent │ iterations │ current │  max  │
├─────────────┼─────────┼────────────┼─────────┼───────┤
│   cheerio   │   ''    │ '5.03K/s'  │  5037   │ 5037  │
│ html5parser │   ''    │ '16.5K/s'  │  16534  │ 16534 │
└─────────────┴─────────┴────────────┴─────────┴───────┘
```
royalfig pushed a commit to TryGhost/Ghost that referenced this issue Mar 25, 2024
…ds (#19727)

closes ENG-627

We were using `cheerio` to parse+modify+serialize our rendered HTML to modify links for member attribution. Cheerio's serializer has a [long-standing issue](cheeriojs/cheerio#720) (that we've [had to deal with before](TryGhost/SDK#124)) where it replaces single-quote attributes with double-quote attributes. That was resulting in broken rendering when content used single-quotes such as in HTML cards that have JSON data inside a `data-` attribute or otherwise used single-quotes to avoid escaping double-quotes in an attribute value.

- swapped the implementation that uses `cheerio` for one that uses `html5parser` to tokenize the html string, from there we can loop over the tokens and replace the href attribute values in the original string without touching any other part of the content. Avoids a full parse+serialize process which is both more costly and can result unexpected content changes due to serializer opinions.
  - fixes the quote change bug
  - uses tokenization directly to avoid cost of building a full AST
- updated Content API Posts snapshot
  - one of our fixtures has a missing closing tag which we're no longer "fixing" with a full parse+serialize step in the link replacer (keeps modified src closer to original and better matches behaviour elsewhere in the app / without member-attribution applied)
  - the link replacer no longer converts `attr=""` to `attr` (these are equivalent in the HTML spec so no change in behaviour other than preserving the original source html)
- added a benchmark test file comparing the two implementations because the link replacer runs on render so it's used in a hot path
  - new implementation has a 3x performance improvement
  - the separate files with the old/new implementations have been cleaned up but I've left the benchmark test file in place for future reference

Benchmark results comparing implementations:

```
❯ node test/benchmark.js

LinkReplacer
├─ cheerio: 5.03K /s ±2.20%
├─ html5parser: 16.5K /s ±0.43%

Completed benchmark in 0.9976526670455933s
┌─────────────┬─────────┬────────────┬─────────┬───────┐
│   (index)   │ percent │ iterations │ current │  max  │
├─────────────┼─────────┼────────────┼─────────┼───────┤
│   cheerio   │   ''    │ '5.03K/s'  │  5037   │ 5037  │
│ html5parser │   ''    │ '16.5K/s'  │  16534  │ 16534 │
└─────────────┴─────────┴────────────┴─────────┴───────┘
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests