Encoding ampersands on HTML entities when parser.decodeEntities = false #249

Closed
WillGibson opened this issue Aug 29, 2018 · 18 comments

@WillGibson

For example, code like...

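const demand = require('must');
const sanitizeHtml = require('sanitize-html');

const text = 'This & that &reg';
// Tell the underlying parser not to decode entities while parsing.
const sanitizeHtmlOptions = {
    parser: {
        decodeEntities: false
    }
};
// Expectation: the text comes back unchanged.
demand(sanitizeHtml(text, sanitizeHtmlOptions)).equal(text);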

...results in...

AssertionError: "This &amp; that &amp;reg" must equal "This & that &reg"
+ expected - actual

-This &amp; that &amp;reg
+This & that &reg

I'm guessing that this behaviour is not intended?

@boutell
Member

boutell commented Aug 29, 2018 via email

@WillGibson
Author

I would expect it not to do the extra encode on the ampersands.

If you think it would be desirable, I could have a look in the morning at whether I can make a PR that makes it behave like that when parser.decodeEntities = false is set.

@boutell
Member

boutell commented Aug 29, 2018 via email

@WillGibson
Author

On it

@timotm

timotm commented Sep 26, 2018

I think this change caused some regression.

const sh = require('sanitize-html');
sh('<img src="<0&0;0.2&" />', {allowedTags: ['img']});

produces
<img src="&lt;0&0;0.2&amp;" />
instead of the expected
<img src="&lt;0&amp;0;0.2&amp;" />

@boutell
Member

boutell commented Sep 26, 2018

@WillGibson did you miss a /g somewhere?

@boutell
Member

boutell commented Sep 26, 2018

Or is it the presence of the 0;? I can see that it may not be as simple as we thought to detect the valid cases for leaving the & alone.

@WillGibson
Author

I'm on holiday now with no laptop. If no-one has addressed it before I get back I'll add this example to the tests and code away until they all go green again.

@WillGibson
Author

The regex will be picking up &0; as if it's an ampersand that's already part of an HTML entity and therefore not encoding it.
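
To make that failure mode concrete, here is a minimal sketch. The pattern below is an assumption written for illustration (it is not copied from the sanitize-html source), and `encodeAmpersandsLoosely` is a hypothetical helper name. It skips any `&` that is followed by a short run of letters, digits or `#` and then a `;`, which is enough to let `&0;` through untouched.

```javascript
// Hypothetical loose pattern (an assumption, not the library's actual regex):
// leave "&" alone when it is followed by 1-20 letters, digits or "#", then ";".
const LOOSE_ENTITY_LOOKAHEAD = /&(?![a-zA-Z0-9#]{1,20};)/g;

// Hypothetical helper: encode only the ampersands the pattern does not skip.
function encodeAmpersandsLoosely(s) {
  return s.replace(LOOSE_ENTITY_LOOKAHEAD, '&amp;');
}

const looseOutput = encodeAmpersandsLoosely('0&0;0.2&');
console.log(looseOutput); // prints 0&0;0.2&amp;
```

The first ampersand survives because `0;` looks enough like an entity body to the lookahead, which mirrors the `src` value reported above.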

@boutell
Member

boutell commented Sep 28, 2018 via email

@boutell
Member

boutell commented Sep 28, 2018

The fix for the default case has been published to npm. decodeEntities: false is still broken as described.

@WillGibson
Author

@boutell Re: "I can see that it may not be as simple as we thought to detect the valid cases for leaving the & alone.".

I agree. It's a good challenge though, I'm going to have a go at it!

@WillGibson
Author

Hmmm, encoding ampersands without double encoding them on HTML entities is indeed hard given that the range of strings that can make up a valid HTML entity is so wide. There are too many edge cases.

In our use case, we want to strip all the stuff that isn't text, but keep HTML entities encoded.

After all this, I'm wondering whether I should have left sanitize-html alone and just used something else to encode the HTML entities after the text comes out clean from sanitize-html.

@alexandruluca

Any follow up on this?

@WillGibson
Author

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML

Trying to make a regex to pick up &htmlentity;, and not &sometext; is a thankless task.

Short of maintaining a full list of HTML entities, or using another package which already does that, I'm not sure how to deal with it.

I'm open to suggestions.
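
For anyone weighing up the "full list" option, a rough sketch of what it might look like follows. Everything here is hypothetical: `KNOWN_NAMED_ENTITIES` is a tiny illustrative subset rather than the real list, `encodeBareAmpersands` is not part of sanitize-html, and the sketch only recognises references that end in `;` (so a legacy form such as `&reg` without the semicolon would be encoded).

```javascript
// Illustrative subset only (assumption); a real implementation would need
// every named character reference, or a package that already ships them.
const KNOWN_NAMED_ENTITIES = new Set([
  'amp', 'lt', 'gt', 'quot', 'apos', 'nbsp', 'reg', 'copy', 'plusmn'
]);

// "&" followed, optionally, by something shaped like a reference body:
// decimal (#169;), hexadecimal (#xA9;) or named (reg;).
const CANDIDATE_REFERENCE = /&(#[0-9]+;|#[xX][0-9a-fA-F]+;|[a-zA-Z][a-zA-Z0-9]*;)?/g;

// Hypothetical helper: encode every "&" that does not start a recognised reference.
function encodeBareAmpersands(s) {
  return s.replace(CANDIDATE_REFERENCE, (match, body) => {
    if (body === undefined) return '&amp;'; // nothing reference-shaped follows
    if (body[0] === '#') return match;      // numeric reference, accepted on syntax alone
    const name = body.slice(0, -1);         // drop the trailing ";"
    return KNOWN_NAMED_ENTITIES.has(name) ? match : '&amp;' + body;
  });
}

console.log(encodeBareAmpersands('This & that &reg; &sometext; &#169;'));
console.log(encodeBareAmpersands('0&0;0.2&')); // prints 0&amp;0;0.2&amp;
```

With a lookup like this, `&sometext;` and `&0;` get their ampersand encoded while `&reg;` is left alone, at the cost of keeping the list complete.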

@firefoxNX

firefoxNX commented May 13, 2019

any update on this?

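    var assert = require('assert');
    var sanitizeHtml = require('sanitize-html');

    var sanitizeHtmlOptions = {
      parser: {
        decodeEntities: false
      }
    };
    // Expectation: a bare ampersand passes through unchanged.
    assert.equal(sanitizeHtml('simple & test', sanitizeHtmlOptions), 'simple & test');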

@slidenerd

Ah, now I see. The library author's concern is that if decodeEntities is false, < and > are allowed through as-is, which would be a vulnerability. Isn't there some way to opt out for some characters but not for the others? & could probably be exempted while < and > are still encoded.
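
As a rough illustration of that suggestion, and not an existing sanitize-html option, a hypothetical `escapeAngleBracketsOnly` helper could always encode `<` and `>` while passing `&` through. Whether leaving `&` untouched is safe in every context (attribute values, for instance) is not something this sketch settles.

```javascript
// Hypothetical helper (assumption, not part of sanitize-html):
// always encode angle brackets, never touch ampersands.
function escapeAngleBracketsOnly(s) {
  return s.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

const bracketOutput = escapeAngleBracketsOnly('a < b & c > d &reg;');
console.log(bracketOutput); // prints a &lt; b & c &gt; d &reg;
```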

@stale

stale bot commented Jul 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
