Leave text content? #15

codylindley · 2014-03-18T19:43:13Z

Can I remove an element node and not remove the text node inside of it?

cure53 · 2014-03-19T07:19:12Z

Currently, the entire element is being removed. Do you have an example at hands where your request would make sense? I'd have a look then.

cure53 · 2014-03-25T08:17:46Z

@codylindley Do you have a specific example on where and how you would need this feature? If not I would be tempted to close this ticket - cannot see a use-case right now.

codylindley · 2014-03-25T16:20:46Z

I'm actually not aware of a XXS tool that does not permit the configuration of leaving the text node while removing the element node. For example, I might not want to allow text wrapped with a span. I would think since DOMPurify can remove a span I should have the option to leave the text node inside of it, while removing the span. Make sense?

cure53 · 2014-03-26T11:13:00Z

@codylindley Kinda does ;) Now I am wondering: how would a config API look like? How would you expect to configure it from a developer's perceptive? if you give me a specific example, i can start thinking about an implementation - so far it's still a bit too vague.

codylindley · 2014-03-26T15:27:49Z

Have you seen js-xss?

Here is a look at how they filter out element nodes, but leave text nodes. And it's optional.
https://www.dropbox.com/s/haepbg2w8277dij/Screenshot%202014-03-26%2009.22.47.png

I would think this would be as simple as something like...

var clean = DOMPurify.sanitize(dirty, {
ALLOWED_TAGS: ['b', 'q'], 
ALLOWED_ATTR: ['style'],
KEEP_TEXT: true,
REMOVE_TEXT: ['script']
});

cure53 · 2014-03-26T16:02:57Z

Ah, that's a good example, Thx! And I didn't know that library yet. Your API suggestion is neat, sufficiently clean although I am not a big fan of the REMOVE_TEXT blacklist.

I'll have a look.

cure53 · 2014-03-27T19:53:50Z

@codylindley I had a look and I am tempted to reject this feature request.

So far I see only one way to do it safely, and that would involve iterating over the nodes twice instead of once. We'd first have to flag elements for deletion and then go over the tree again and remove them while having their harmless children and text nodes survive. And even then I believe we'd have to go from the deepest node upwards to the "thicker branches" which might end up in a performance killer. I am however open for alternative ways to solve it. If you have thoughts on that: appreciated!

However your request helped spotting a bug; DOMPurify was too intolerant about content starting with TextNode or consisting entirely of TextNode - that is being fixed, thx!

UltCombo · 2014-03-29T03:49:34Z

@cure53 I've been presented with this problem in a SO question where the user wanted to remove anchor elements, but keeping their text content. You can see my answer there with a couple ways to implement that, though I'm not sure if those are applicable to how DOM Purify works.

About performance: this option could be disabled by default, so there should be nearly no performance penalty, right?

I was actually going to recommend DOM Purify for the task in the question, then I got quite stumped that DOM Purify does not have an option to keep text nodes from removed elements yet.

Also, HTML Purifier keeps text nodes when removing not allowed tag names. Github and Stack Overflow's markdown editors also keep text nodes when removing unsafe tags.

cure53 · 2014-03-29T07:21:46Z

To get the described effect, the user can allow a but remove href. What remains are deactivated links, text preserved. If the user really needs to remove the entire link, they could use a regex to do so.

This feature would be easy to implement for a flat DOM - e.g. a link with no child elements. For a working implementation one would have to iterate recursively over all child elements and check their type. The given example on SO does not do that sufficiently. It also juggles with innerHTML and thereby opens a can of worms for mXSS attacks. Think for example <a href="">Foo<a><svg><style>*{font-family:'<iframe/>'}</style></svg><a>bar</a>.

I believe in a scenario where a feature like that would not be under attack and confronted with a simple and flat DOM it's not a problem. I our case however I think it will enable several attacks and is not feasible. Thus I need to reject it.

Just for the record, knowing it's not a security tool; the function you linked is vulnerable against XSS too:

var content = "<a href=\"1\">I</a> was going <img src=x onerror=alert(1)>here and then <a href=\"that\">that</a> happened.";

var container = document.createElement('div');
container.innerHTML = content;

var anchors = container.getElementsByTagName('a'),
    anchor;

while (anchor = anchors[0]) {
    var anchorParent = anchor.parentNode;

    while (anchor.firstChild) {
        anchorParent.insertBefore(anchor.firstChild, anchor);
    }
    anchorParent.removeChild(anchor);
}

var clean = container.innerHTML;
console.log(clean);

cure53 · 2014-03-29T09:10:33Z

I created #18 for discussion. Maybe an API would help here rather than adding more features to the core library.

UltCombo · 2014-03-29T18:38:27Z

@cure53 I'm aware that throwing arbitrary markup into innerHTML is vulnerable to XSS. If you check the comments on that answer, I mentioned running a XSS sanitizer in case of arbitrary markup before my sample code example.

As I said, that code is far from optimal for this library, it was just an example use case which would be much better covered by a tested library than some hack'ish algorithm/regex for those that need to do it securely. =]

codylindley · 2014-04-05T06:00:51Z

Can you clarify? Should the KEEP_CONTENT: true ... keep the text node?

Does not work for me if that is what it's suppose to do: http://jsfiddle.net/codylindley/9SxCb/

cure53 · 2014-04-05T06:41:04Z

Haha yes, I can ;) Right now we classify text nodes as actual elements. Which might not be the most intuitive. This would work just fine (note the extra #text):

var foo = DOMPurify.sanitize('123<h2>test</h2><p>test</p>',
    {
        ALLOWED_TAGS: ['#text','strong','a','table','td','tr','tbody','thead','tfoot','th','li','ul','ol','br','p'], 
        ALLOWED_ATTR: ['href,title,style,rows,rowspan,width,height,rel,align,cite,cols,colspan,target'],
        KEEP_CONTENT: true
});
console.log(foo);

I'll deploy a fix later today, it's fine when people don't pass in their own tags but fairly counter-intuitive in your and probably many other scenarios. Nice spot!

cure53 · 2014-04-05T09:30:56Z

@codylindley Done!

Fixed a problem with overzealous text-node removal

codylindley · 2014-04-05T14:42:14Z

Thanks. Yeah. After I sent the fiddle notice that in the source and sure enough #text was the key. Nice work all around here!

cure53 · 2014-04-05T15:48:32Z

Thx :)

Implemented basic assert.contains() Fixed the tests All tests green on Chrome 33 and FF28

cure53 closed this as completed Mar 29, 2014

cure53 mentioned this issue Mar 29, 2014

Provide an API to allow hooking into _sanitizeElements and _sanitizeAttributes #18

Closed

cure53 pushed a commit that referenced this issue Apr 5, 2014

see #15

93c2331

Fixed a problem with overzealous text-node removal

cure53 mentioned this issue Apr 5, 2014

Regression test Suite #5

Closed

cure53 pushed a commit that referenced this issue Apr 11, 2014

see #15

85c03c8

Implemented basic assert.contains() Fixed the tests All tests green on Chrome 33 and FF28

cure53 mentioned this issue Oct 8, 2020

Semantics of allow vs block vs defaults WICG/sanitizer-api#33

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leave text content? #15

Leave text content? #15

codylindley commented Mar 18, 2014

cure53 commented Mar 19, 2014

cure53 commented Mar 25, 2014

codylindley commented Mar 25, 2014

cure53 commented Mar 26, 2014

codylindley commented Mar 26, 2014

cure53 commented Mar 26, 2014

cure53 commented Mar 27, 2014

UltCombo commented Mar 29, 2014

cure53 commented Mar 29, 2014

cure53 commented Mar 29, 2014

UltCombo commented Mar 29, 2014

codylindley commented Apr 5, 2014

cure53 commented Apr 5, 2014

cure53 commented Apr 5, 2014

codylindley commented Apr 5, 2014

cure53 commented Apr 5, 2014

Leave text content? #15

Leave text content? #15

Comments

codylindley commented Mar 18, 2014

cure53 commented Mar 19, 2014

cure53 commented Mar 25, 2014

codylindley commented Mar 25, 2014

cure53 commented Mar 26, 2014

codylindley commented Mar 26, 2014

cure53 commented Mar 26, 2014

cure53 commented Mar 27, 2014

UltCombo commented Mar 29, 2014

cure53 commented Mar 29, 2014

cure53 commented Mar 29, 2014

UltCombo commented Mar 29, 2014

codylindley commented Apr 5, 2014

cure53 commented Apr 5, 2014

cure53 commented Apr 5, 2014

codylindley commented Apr 5, 2014

cure53 commented Apr 5, 2014