Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XSS hazards need to be addressed #61

Closed
palant opened this issue Sep 4, 2014 · 13 comments
Closed

XSS hazards need to be addressed #61

palant opened this issue Sep 4, 2014 · 13 comments

Comments

@palant
Copy link

palant commented Sep 4, 2014

At the moment, the spec doesn't seem to mention XSS issues with a single word. The reference implementation is a huge footgun as it will happily render things like:

[Click me](javascript:alert("XSS"))
<javascript:alert("XSS")>
<div onclick="alert('XSS')">click me</div>
<script>alert("XSS")</script>

There should be at least a huge warning that the reference implementation shouldn't be used without some kind of additional XSS filter - or when the content rendered is trusted. Ideally however, a reference implementation should allow specifying a whitelist (URL schemes, tags, attributes) and even provide a whitelist that is appropriate in most cases. Otherwise the sanitization of the generated code will become yet another distinguishing point of every implementation.

@craigbarnes
Copy link
Contributor

The only sane way to do this is by filtering the HTML output, like GitHub does with html-pipeline.

@palant
Copy link
Author

palant commented Sep 4, 2014

From the look of it, it would make sense to filter the AST generated by the Markdown parser before it is rendered into HTML - this would remove the necessity to parse the same content twice (something that is pretty error-prone).

@mstum
Copy link

mstum commented Sep 4, 2014

Just in case someone wants to completely remove HTML Support:

var parser = new stmd.DocParser();
parser.inlineParser.parseHtmlTag = function() { return 0; }

That said, it would definitely make sense to call it out in the Spec.

@buro9
Copy link

buro9 commented Sep 5, 2014

This isn't a bug of Standard Markdown or the reference implementations.

Yes you should process any user generated content for XSS risks, using something like the OWASP Java HTML Sanitizer or an equivalent.

But... it's really important that such sanitization is the very last step in any processing of user generated content. This is to prevent bugs in libraries such as Standard Markdown, or any other processing library, from introducing attack vectors that are exploited after the sanitization.

So yes... sanitize against XSS. But don't force libraries which already have a clear purpose "parse Markdown" to take on far more responsibility than they have to. Especially when Markdown explicitly permits raw HTML, and that would force Standard Markdown to become a full HTML XSS sanitizer.

What's worse is that what is permitted by a sanitizer should be configurable depending on a whitelist... instead of just saying "Parse this Markdown", you'd have to get into the habit of saying "Parse this Markdown according to this policy of what is permitted for my use-case".

For context: I've been in security roles and have implemented the OWASP sanitizer in different organisations, as well as being the author of a Go based sanitizer. There is nearly always post-processing of HTML, and sanitization must be a separate and last process to be run.

All that we should demand from Standard Markdown is that for the Markdown input, that valid HTML is produced. For raw HTML input, it should be passed through as-is.

In the given example in the first comment... nearly all of that would produce valid HTML. It should be left as-is.

@craigbarnes
Copy link
Contributor

@buro9 Very much agreed. It's not possible to sanitize HTML correctly without a full HTML5 parser. Anything short of that is just an illusion of security.

@JordanMilne
Copy link
Contributor

@buro9

Agreed. The level of sanitization necessary is out of scope for a markdown library. You should be using an external library to sanitize and unambiguously reserialize the output even with HTML blocks disabled.

@rlidwka
Copy link
Contributor

rlidwka commented Sep 6, 2014

All that we should demand from Standard Markdown is that for the Markdown input, that valid HTML is produced.

Any thoughts about user-specified html tags inside markdown document? I mean, people can easily forget to close certain html tags, in which case markdown parser would technically output invalid html.

@JordanMilne
Copy link
Contributor

@rlidwka

The parser would have to know the doctype to do that. <br> is valid HTML 5, but not valid XHTML. Since only document fragments are generated, that's probably out of scope as well.

@buro9
Copy link

buro9 commented Sep 6, 2014

@rlidwka

A Markdown parsers' job is to parse Markdown. It isn't to parse HTML and validate it and to create great HTML from bad input. Whilst there should be no scenario in which Markdown syntax produces invalid HTML, so long as the parser consistently and predictably parses Markdown that's all we should ask of it.

HTML should be passed through untouched and, depending on context, the Markdown within the raw HTML should/or should not be processed.

Gruber was vague about this, specifying that the parser should produce valid HTML, but there isn't an implementation (including his) that then fixes invalid HTML, a few balance the tree but none fix a balanced but incorrect fragment of HTML. GIGO (garbage in, garbage out) rules when it comes to raw/inline HTML.

The spec would have to massively expand it's scope into a full HTML validator and fixer to be able to do this. Namely, it isn't enough to say "balance the tree", if an extra tag is found there are opinions about how to handle this (remove it? balance it? if balancing it produces an invalid hierarchy do we rewrite the tree?).

Opinions like that require a substantial set of options and deep understanding of HTML. The job of any Markdown parser should only be "parse Markdown". I'd argue strongly that the clarification Gruber should have included is that "valid HTML should be produced for all Markdown syntax, raw/inline HTML will be passed through unmodified".

A Markdown parser is not the "one tool to replace all tools", it is just a Markdown parser.

Just as a specialised tool should be used to fix XSS issues, so is the case with fixing invalid HTML.

Luckily in many cases the XSS tool will also handle fixing bad HTML too. As once you do have a full HTML parser to handle XSS attacks, fixing the HTML and doing things like adding rel="nofollow" to external anchors is effectively trivial.

With projects like the OWASP Java HTML Sanitizer in existence, and equivalents for other languages/environments existing, we are able to focus all of the work on stmd to the one task that stmd was created for: "parse Markdown syntax into predictable/consistent valid HTML".

@palant
Copy link
Author

palant commented Sep 6, 2014

As I said:

There should be at least a huge warning that the reference implementation shouldn't be used without some kind of additional XSS filter - or when the content rendered is trusted.

@coding-horror
Copy link
Contributor

This is discussed on talk and should be closed here

http://talk.commonmark.org/t/cross-site-scripting-issue-in-standard-markdown-example-at-try-standardmarkdown-com/55

Please keep all discussions there, so GH issues can be clean, actionable, and clear per

http://talk.commonmark.org/t/what-feedback-should-be-here-vs-on-github/25

@jgm
Copy link
Member

jgm commented Sep 10, 2014

I'd like to keep this here, because I think the takeaway is a simple
actionable thing, which I just haven't gotten around to doing:
adding a clear warning to the documentation that the processor
does not sanitize HTML, and that the user should take care to do
that.

+++ Jeff Atwood [Sep 09 14 17:03 ]:

This is discussed on talk and should be closed here

http://talk.commonmark.org/t/cross-site-scripting-issue-in-standard-markdown-example-at-try-standardmarkdown-com/55

Please keep all discussions there, so GH issues can be clean, actionable, and clear per

http://talk.commonmark.org/t/what-feedback-should-be-here-vs-on-github/25


Reply to this email directly or view it on GitHub:
#61 (comment)

@tunnckoCore
Copy link

@buro9 +1 agreed.

The level of sanitization necessary is out of scope for a markdown library.

Agreed.
The scope of markdown library must only be to output a valid HTML. If incoming HTML is not valid or it is unsecure/xss-ed -- "not my problem". "Markdown parser/library" not mean "the new freakin HTML, YEAH!" or "the new awesome HTML validator and sanitizer, fuk yea". Out there we have special projects and librarys for that. Why to mess parsers' responsibility with sanitizers'/validators' responsibilty?
For security and against XSS, the markdown MUST go through HTML sanitizers and validators BEFORE it outputs the final result., because THE PARSER will output valid HTML - this must be the state.
Okey, maybe pluggable before outputping (this is great idea), but this might be parsers responsibility and architecture.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants