XSS hazards need to be addressed #61
Comments
The only sane way to do this is by filtering the HTML output, as GitHub does with html-pipeline.
From the look of it, it would make sense to filter the AST generated by the Markdown parser before it is rendered into HTML; this would remove the need to parse the same content twice (something that is pretty error-prone).
Just in case someone wants to completely remove HTML support:
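The snippet that originally accompanied this comment is not preserved in this copy of the thread. As a hedged sketch (the function name and regex are illustrative, not part of stmd's API), one common way to disable raw-HTML support entirely is to escape anything tag-like before the Markdown parser ever sees it:

```python
import re

def strip_raw_html(markdown: str) -> str:
    """Escape anything that looks like the start of an HTML tag.

    The Markdown parser then treats it as literal text, so no raw HTML
    survives into the output. Illustrative only; not an stmd API.
    """
    # A "<" only starts a tag when followed by a letter, "/", "!" or "?",
    # so plain comparisons like "1 < 2" are left untouched.
    return re.sub(r"<(?=[/!?A-Za-z])", "&lt;", markdown)
```

For example, `strip_raw_html('<script>x</script> and 1 < 2')` neuters the tag while leaving the comparison intact.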
That said, it would definitely make sense to call it out in the spec.
This isn't a bug in Standard Markdown or the reference implementations. Yes, you should process any user-generated content for XSS risks, using something like the OWASP Java HTML Sanitizer or an equivalent. But it's really important that such sanitization is the very last step in any processing of user-generated content. This prevents bugs in libraries such as Standard Markdown, or any other processing library, from introducing attack vectors that are exploited after the sanitization.

So yes: sanitize against XSS. But don't force libraries which already have a clear purpose ("parse Markdown") to take on far more responsibility than they have to, especially when Markdown explicitly permits raw HTML, which would force Standard Markdown to become a full HTML XSS sanitizer. What's worse, what a sanitizer permits should be configurable against a whitelist: instead of just saying "parse this Markdown", you'd have to get into the habit of saying "parse this Markdown according to this policy of what is permitted for my use-case".

For context: I've been in security roles and have implemented the OWASP sanitizer in different organisations, as well as being the author of a Go-based sanitizer. There is nearly always post-processing of HTML, and sanitization must be a separate and final step.

All that we should demand from Standard Markdown is that, for Markdown input, valid HTML is produced. Raw HTML input should be passed through as-is. In the example given in the first comment, nearly all of that would produce valid HTML; it should be left as-is.
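The "sanitize last" principle above can be sketched as a pipeline. All function bodies here are stand-ins I've made up for illustration (a real setup would call stmd and a real sanitizer such as the OWASP Java HTML Sanitizer); the point is only the ordering:

```python
import html as html_mod

def render_markdown(text: str) -> str:
    # Placeholder for a real Markdown renderer (e.g. stmd) -- assumed, not a real API.
    return "<p>" + text + "</p>"

def post_process(rendered: str) -> str:
    # Any further transformation: emoji, @-mention linking, syntax highlighting...
    # A bug here could inject markup, which is exactly why it runs BEFORE sanitizing.
    return rendered.replace(":)", "<img src='smile.png'>")

def sanitize(rendered: str) -> str:
    # Stand-in for a real whitelist sanitizer; here we simply escape everything
    # so the ordering point is observable.
    return html_mod.escape(rendered)

def render_user_content(text: str) -> str:
    # Sanitization is deliberately the very last step, so bugs in the renderer
    # or post-processors cannot reintroduce an attack vector afterwards.
    return sanitize(post_process(render_markdown(text)))
```

With this ordering, even if `render_markdown` or `post_process` emits something dangerous, it never reaches the page unsanitized.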
@buro9 Very much agreed. It's not possible to sanitize HTML correctly without a full HTML5 parser. Anything short of that is just an illusion of security. |
Agreed. The level of sanitization necessary is out of scope for a Markdown library. You should be using an external library to sanitize and unambiguously reserialize the output, even with HTML blocks disabled.
Any thoughts on user-specified HTML tags inside a Markdown document? I mean, people can easily forget to close certain HTML tags, in which case the Markdown parser would technically output invalid HTML.
The parser would have to know the doctype to do that. |
A Markdown parser's job is to parse Markdown. It isn't to parse HTML, validate it, and create great HTML from bad input. While there should be no scenario in which Markdown syntax produces invalid HTML, so long as the parser consistently and predictably parses Markdown, that's all we should ask of it. HTML should be passed through untouched and, depending on context, the Markdown within the raw HTML should or should not be processed.

Gruber was vague about this, specifying that the parser should produce valid HTML, but there isn't an implementation (including his) that then fixes invalid HTML; a few balance the tree, but none fix a balanced-yet-incorrect fragment of HTML. GIGO (garbage in, garbage out) rules when it comes to raw/inline HTML.

The spec would have to massively expand its scope into a full HTML validator and fixer to be able to do this. It isn't enough to say "balance the tree": if an extra tag is found, there are opinions about how to handle it (remove it? balance it? if balancing it produces an invalid hierarchy, do we rewrite the tree?). Opinions like that require a substantial set of options and a deep understanding of HTML.

The job of any Markdown parser should only be "parse Markdown". I'd argue strongly that the clarification Gruber should have included is: "valid HTML should be produced for all Markdown syntax; raw/inline HTML will be passed through unmodified". A Markdown parser is not the "one tool to replace all tools", it is just a Markdown parser. Just as a specialised tool should be used to fix XSS issues, so is the case with fixing invalid HTML. Luckily, in many cases the XSS tool will also handle fixing bad HTML, since once you have a full HTML parser to handle XSS attacks, fixing the HTML comes along with it.

With projects like the OWASP Java HTML Sanitizer in existence, and equivalents for other languages/environments, we are able to focus all of the work on stmd on the one task stmd was created for: "parse Markdown syntax into predictable, consistent, valid HTML".
As I said:
This is discussed on talk and should be closed here. Please keep all discussions there, so GH issues can be clean, actionable, and clear, per http://talk.commonmark.org/t/what-feedback-should-be-here-vs-on-github/25
I'd like to keep this here, because I think the takeaway is a simple one.

+++ Jeff Atwood [Sep 09 14 17:03]:
@buro9 +1 agreed.
Agreed.
At the moment, the spec doesn't seem to mention XSS issues with a single word. The reference implementation is a huge footgun as it will happily render things like:
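The concrete examples from the original report are not preserved in this copy of the thread; the following are illustrative inputs of the kind a permissive reference implementation renders verbatim (assumed, not quoted from the issue):

```markdown
<script>alert(document.cookie)</script>

[click me](javascript:alert(1))

<img src=x onerror="alert(1)">
```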
There should be at least a huge warning that the reference implementation shouldn't be used without some kind of additional XSS filter, or only when the content being rendered is trusted. Ideally, however, a reference implementation should allow specifying a whitelist (URL schemes, tags, attributes) and even provide a default whitelist that is appropriate in most cases. Otherwise, the sanitization of the generated code will become yet another distinguishing point of every implementation.
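To make the whitelist idea concrete, here is a minimal stdlib-only sketch of a whitelist sanitizer (tag set, attribute set, and scheme set are illustrative; a production sanitizer such as the OWASP Java HTML Sanitizer handles far more, and as noted above anything short of a full HTML5 parser is an illusion of security):

```python
from html.parser import HTMLParser
from html import escape
from urllib.parse import urlparse

# Illustrative policy -- a real deployment would tune these per use-case.
ALLOWED_TAGS = {"p", "em", "strong", "a", "code", "pre", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto", ""}  # "" allows relative URLs

class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop disallowed tags entirely (their text content survives)
        kept = []
        for name, value in attrs:
            if name in ALLOWED_ATTRS.get(tag, set()):
                # Block javascript:, data:, vbscript:, and other dangerous schemes.
                if name == "href" and urlparse(value or "").scheme.lower() not in ALLOWED_SCHEMES:
                    continue
                kept.append(f' {name}="{escape(value or "", quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

def sanitize(html_text: str) -> str:
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, `sanitize('<a href="javascript:alert(1)">x</a>')` keeps the link text but drops the dangerous `href`, while an `http(s)` link passes through unchanged.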