Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leave text content? #15

Closed
codylindley opened this issue Mar 18, 2014 · 16 comments
Closed

Leave text content? #15

codylindley opened this issue Mar 18, 2014 · 16 comments

Comments

@codylindley
Copy link

Can I remove an element node and not remove the text node inside of it?

@cure53
Copy link
Owner

cure53 commented Mar 19, 2014

Currently, the entire element is being removed. Do you have an example at hands where your request would make sense? I'd have a look then.

@cure53
Copy link
Owner

cure53 commented Mar 25, 2014

@codylindley Do you have a specific example on where and how you would need this feature? If not I would be tempted to close this ticket - cannot see a use-case right now.

@codylindley
Copy link
Author

I'm actually not aware of a XXS tool that does not permit the configuration of leaving the text node while removing the element node. For example, I might not want to allow text wrapped with a span. I would think since DOMPurify can remove a span I should have the option to leave the text node inside of it, while removing the span. Make sense?

@cure53
Copy link
Owner

cure53 commented Mar 26, 2014

@codylindley Kinda does ;) Now I am wondering: how would a config API look like? How would you expect to configure it from a developer's perceptive? if you give me a specific example, i can start thinking about an implementation - so far it's still a bit too vague.

@codylindley
Copy link
Author

Have you seen js-xss?

Here is a look at how they filter out element nodes, but leave text nodes. And it's optional.
https://www.dropbox.com/s/haepbg2w8277dij/Screenshot%202014-03-26%2009.22.47.png

I would think this would be as simple as something like...

var clean = DOMPurify.sanitize(dirty, {
ALLOWED_TAGS: ['b', 'q'], 
ALLOWED_ATTR: ['style'],
KEEP_TEXT: true,
REMOVE_TEXT: ['script']
});

@cure53
Copy link
Owner

cure53 commented Mar 26, 2014

Ah, that's a good example, Thx! And I didn't know that library yet. Your API suggestion is neat, sufficiently clean although I am not a big fan of the REMOVE_TEXT blacklist.

I'll have a look.

@cure53
Copy link
Owner

cure53 commented Mar 27, 2014

@codylindley I had a look and I am tempted to reject this feature request.

So far I see only one way to do it safely, and that would involve iterating over the nodes twice instead of once. We'd first have to flag elements for deletion and then go over the tree again and remove them while having their harmless children and text nodes survive. And even then I believe we'd have to go from the deepest node upwards to the "thicker branches" which might end up in a performance killer. I am however open for alternative ways to solve it. If you have thoughts on that: appreciated!

However your request helped spotting a bug; DOMPurify was too intolerant about content starting with TextNode or consisting entirely of TextNode - that is being fixed, thx!

@UltCombo
Copy link

@cure53 I've been presented with this problem in a SO question where the user wanted to remove anchor elements, but keeping their text content. You can see my answer there with a couple ways to implement that, though I'm not sure if those are applicable to how DOM Purify works.

About performance: this option could be disabled by default, so there should be nearly no performance penalty, right?

I was actually going to recommend DOM Purify for the task in the question, then I got quite stumped that DOM Purify does not have an option to keep text nodes from removed elements yet.

Also, HTML Purifier keeps text nodes when removing not allowed tag names. Github and Stack Overflow's markdown editors also keep text nodes when removing unsafe tags.

@cure53
Copy link
Owner

cure53 commented Mar 29, 2014

To get the described effect, the user can allow a but remove href. What remains are deactivated links, text preserved. If the user really needs to remove the entire link, they could use a regex to do so.

This feature would be easy to implement for a flat DOM - e.g. a link with no child elements. For a working implementation one would have to iterate recursively over all child elements and check their type. The given example on SO does not do that sufficiently. It also juggles with innerHTML and thereby opens a can of worms for mXSS attacks. Think for example <a href="">Foo<a><svg><style>*{font-family:'<iframe/>'}</style></svg><a>bar</a>.

I believe in a scenario where a feature like that would not be under attack and confronted with a simple and flat DOM it's not a problem. I our case however I think it will enable several attacks and is not feasible. Thus I need to reject it.

Just for the record, knowing it's not a security tool; the function you linked is vulnerable against XSS too:

var content = "<a href=\"1\">I</a> was going <img src=x onerror=alert(1)>here and then <a href=\"that\">that</a> happened.";

var container = document.createElement('div');
container.innerHTML = content;

var anchors = container.getElementsByTagName('a'),
    anchor;

while (anchor = anchors[0]) {
    var anchorParent = anchor.parentNode;

    while (anchor.firstChild) {
        anchorParent.insertBefore(anchor.firstChild, anchor);
    }
    anchorParent.removeChild(anchor);
}

var clean = container.innerHTML;
console.log(clean);

@cure53
Copy link
Owner

cure53 commented Mar 29, 2014

I created #18 for discussion. Maybe an API would help here rather than adding more features to the core library.

@UltCombo
Copy link

@cure53 I'm aware that throwing arbitrary markup into innerHTML is vulnerable to XSS. If you check the comments on that answer, I mentioned running a XSS sanitizer in case of arbitrary markup before my sample code example.

As I said, that code is far from optimal for this library, it was just an example use case which would be much better covered by a tested library than some hack'ish algorithm/regex for those that need to do it securely. =]

@codylindley
Copy link
Author

Can you clarify? Should the KEEP_CONTENT: true ... keep the text node?

Does not work for me if that is what it's suppose to do: http://jsfiddle.net/codylindley/9SxCb/

@cure53
Copy link
Owner

cure53 commented Apr 5, 2014

Haha yes, I can ;) Right now we classify text nodes as actual elements. Which might not be the most intuitive. This would work just fine (note the extra #text):

var foo = DOMPurify.sanitize('123<h2>test</h2><p>test</p>',
    {
        ALLOWED_TAGS: ['#text','strong','a','table','td','tr','tbody','thead','tfoot','th','li','ul','ol','br','p'], 
        ALLOWED_ATTR: ['href,title,style,rows,rowspan,width,height,rel,align,cite,cols,colspan,target'],
        KEEP_CONTENT: true
});
console.log(foo);

I'll deploy a fix later today, it's fine when people don't pass in their own tags but fairly counter-intuitive in your and probably many other scenarios. Nice spot!

@cure53
Copy link
Owner

cure53 commented Apr 5, 2014

@codylindley Done!

cure53 pushed a commit that referenced this issue Apr 5, 2014
Fixed a problem with overzealous text-node removal
@codylindley
Copy link
Author

Thanks. Yeah. After I sent the fiddle notice that in the source and sure enough #text was the key. Nice work all around here!

@cure53
Copy link
Owner

cure53 commented Apr 5, 2014

Thx :)

cure53 pushed a commit that referenced this issue Apr 11, 2014
Implemented basic assert.contains()
Fixed the tests
All tests green on Chrome 33 and FF28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants