Encoding ampersands on HTML entities when parser.decodeEntities = false #249

WillGibson · 2018-08-29T14:59:58Z

E.g. Code like...

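const sanitizeHtml = require('sanitize-html');
const demand = require('must'); // assumption: `demand` is the Must.js assertion function

const text = 'This &amp; that &reg';
const sanitizeHtmlOptions = {
    parser: {
        decodeEntities: false
    }
};
demand(sanitizeHtml(text, sanitizeHtmlOptions)).equal(text);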

...results in...

AssertionError: "This &amp;amp; that &amp;reg" must equal "This &amp; that &reg"
+ expected - actual

-This &amp;amp; that &amp;reg
+This &amp; that &reg

I'm guessing that this behaviour is not intended?

boutell · 2018-08-29T15:02:30Z

Not all parser options can be safely combined with sanitize-html, which has to set some in a predictable way in order to work.

WillGibson · 2018-08-29T15:10:11Z

I would expect it not to do the extra encode on the ampersands.

In the morning I could look into a PR that makes it behave that way when parser.decodeEntities = false, if you think that would be desirable?

boutell · 2018-08-29T15:15:17Z

It would have to pass all of the unit tests and introduce no XSS vulnerabilities.

WillGibson · 2018-08-30T10:00:43Z

On it

timotm · 2018-09-26T13:30:42Z

I think this change caused some regression.

const sh = require('sanitize-html');
sh('<img src="<0&0;0.2&" />', {allowedTags: ['img']});

produces
<img src="&lt;0&0;0.2&amp;" />
instead of the expected
<img src="&lt;0&amp;0;0.2&amp;" />

boutell · 2018-09-26T14:25:17Z

@WillGibson did you miss a /g somewhere?

boutell · 2018-09-26T14:25:54Z

Or is it the presence of the 0;? I can see that it may not be as simple as we thought to detect the valid cases for leaving the & alone.

WillGibson · 2018-09-28T12:01:40Z

I'm on holiday now with no laptop. If no-one has addressed it before I get back I'll add this example to the tests and code away until they all go green again.

WillGibson · 2018-09-28T12:04:34Z

The regex will be picking up &0; as if it's an ampersand that's already part of an HTML entity and therefore not encoding it.
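
To make the failure mode concrete, here is a minimal sketch of this kind of check. The pattern and the function name `escapeLooseAmpersands` are hypothetical and are not taken from the actual change. The sketch only assumes that "already part of an entity" is decided by shape alone: an ampersand, a few letters, digits or `#`, then a semicolon.

```javascript
// Hypothetical shape-based check: leave "&" alone when it is followed by
// 1-20 letters, digits or "#" and then a semicolon; escape it otherwise.
const LOOSE_AMPERSAND = /&(?![a-zA-Z0-9#]{1,20};)/g;

function escapeLooseAmpersands(text) {
  return text.replace(LOOSE_AMPERSAND, '&amp;');
}

// "&amp;" is left alone, which avoids the double encoding reported above.
escapeLooseAmpersands('This &amp; that'); // 'This &amp; that'

// "&0;" has the same shape, so it is wrongly left alone as well.
escapeLooseAmpersands('<0&0;0.2&'); // '<0&0;0.2&amp;'
```

Any check that looks only at the shape of the text has this weakness, because `&0;` is not a real entity but is indistinguishable from one without a list of valid names.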

boutell · 2018-09-28T14:03:37Z

A fix for the regression with the module's default parser settings is under internal code review. I also added a commented-out test demonstrating Timo's issue, but bear in mind that a fix must rigorously validate that only actually valid entities are left alone to really meet the requirement. So for now decodeEntities: false is not recommended. (It has never been the default or a suggested configuration, though.)

boutell · 2018-09-28T15:49:34Z

The fix for the default case has been published to npm. decodeEntities: false is still broken as described.

WillGibson · 2018-10-11T15:04:10Z

@boutell Re: "I can see that it may not be as simple as we thought to detect the valid cases for leaving the & alone.".

I agree. It's a good challenge though, I'm going to have a go at it!

WillGibson · 2018-10-11T17:41:42Z

Hmmm, encoding ampersands without double encoding them on HTML entities is indeed hard given that the range of strings that can make up a valid HTML entity is so wide. There are too many edge cases.

In our use case, we want to strip all the stuff that isn't text, but keep HTML entities encoded.

After all this, I'm wondering whether I should have left sanitize-html alone and just used something else to encode the HTML entities after the text comes out clean from sanitize-html.
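
One possible shape for that second step is sketched below. It assumes the sanitized text already has `&`, `<` and `>` escaped, so the only remaining work is turning non-ASCII characters into numeric character references. The function name `encodeNonAscii` is made up for this illustration and is not part of sanitize-html.

```javascript
// Hypothetical post-processing step, applied to already sanitized text.
// Every character outside the ASCII range becomes a decimal character
// reference; everything else is passed through unchanged.
function encodeNonAscii(text) {
  let out = '';
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    out += codePoint > 0x7f ? `&#${codePoint};` : ch;
  }
  return out;
}

encodeNonAscii('This &amp; that \u00ae'); // 'This &amp; that &#174;'
```

This produces numeric references rather than named ones such as `&reg;`, which sidesteps the need for a table of entity names.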

alexandruluca · 2018-11-14T09:04:26Z

Any follow up on this?

WillGibson · 2018-11-17T07:27:05Z

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML

Trying to make a regex to pick up &htmlentity;, and not &sometext; is a thankless task.

Short of maintaining a full list of HTML entities, or using another package which already does that, I'm not sure how to deal with it.

I'm open to suggestions.
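
For illustration, the "full list" approach could look roughly like the sketch below. The `KNOWN_ENTITIES` set is a tiny hand-picked subset standing in for the complete list, and the function name `escapeAmpersands` is hypothetical. Numeric references are accepted by shape only, without checking that the code point is valid.

```javascript
// Tiny illustrative subset; a real implementation would need the full list
// of named character references or a package that ships it.
const KNOWN_ENTITIES = new Set(['amp', 'lt', 'gt', 'quot', 'apos', 'nbsp', 'reg', 'copy']);

// Matches "&" plus, optionally, something shaped like a reference:
// decimal (&#174;), hexadecimal (&#xAE;) or named (&reg;).
const AMPERSAND = /&(#\d+;|#[xX][0-9a-fA-F]+;|[a-zA-Z][a-zA-Z0-9]*;)?/g;

function escapeAmpersands(text) {
  return text.replace(AMPERSAND, (match, ref) => {
    if (!ref) return '&amp;';            // bare ampersand
    if (ref[0] === '#') return match;    // numeric reference, kept as-is
    const name = ref.slice(0, -1);       // drop the trailing ";"
    return KNOWN_ENTITIES.has(name) ? match : '&amp;' + ref;
  });
}

escapeAmpersands('This &amp; that &reg;'); // 'This &amp; that &reg;'
escapeAmpersands('&sometext; and &0;');    // '&amp;sometext; and &amp;0;'
```

With a lookup like this, `&sometext;` and `&0;` are escaped because they are not in the list, while known names pass through untouched.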

firefoxNX · 2019-05-13T22:16:06Z

Any update on this?

var sanitizeHtmlOptions = {
  parser: {
    decodeEntities: false
  }
};
assert.equal(sanitizeHtml('simple & test', sanitizeHtmlOptions), 'simple & test');

slidenerd · 2019-12-03T10:00:38Z

Ah, now I see. The library author's concern is that if decodeEntities is false, < and > are allowed through as-is, which means a vulnerability. Isn't there some way to opt out for some characters but not for others? & could probably be exempted while < and > are still encoded.
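
As a rough sketch of that idea, the helper below escapes angle brackets and leaves ampersands untouched. It is a hypothetical standalone function, not an existing sanitize-html option, and it says nothing about whether leaving `&` unescaped is safe in every context, such as inside attribute values.

```javascript
// Hypothetical helper: escape only "<" and ">", pass "&" through unchanged.
function escapeAngleBrackets(text) {
  return text.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

escapeAngleBrackets('a < b & c > d');        // 'a &lt; b & c &gt; d'
escapeAngleBrackets('This &amp; that &reg'); // 'This &amp; that &reg'
```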

stale · 2020-07-07T14:29:14Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

WillGibson mentioned this issue Aug 30, 2018

Stop double encoding ampersands on HTML entities #250

Merged

boutell mentioned this issue Jan 10, 2019

Characters escaping #246

Closed

stale bot added the stale label Jul 7, 2020

stale bot closed this as completed Jul 21, 2020

t1m0thyj mentioned this issue Jan 28, 2022

Fix angle brackets escaped in web help code blocks zowe/imperative#731

Merged
