Skip to content
This repository has been archived by the owner on Feb 28, 2022. It is now read-only.

fix(html pipe): Sanitize generated markdown to avoid XSS attacks #263

Merged
merged 17 commits into from
May 29, 2019

Conversation

ramboz
Copy link
Contributor

@ramboz ramboz commented Apr 15, 2019

Properly escape the generated HTML using DOMPurify to avoid potential XSS attacks via JS injection, and also avoid DOM clobbering.

  • Add an .before step right before the pre function in the html pipeline to sanitize the content
  • Adjust the default DOMPurify sanitization config to support the helix use case (esi: tags)
  • Add unit tests to validate escaping works on links and images
  • Add unit tests to validate custom elements and attributes in the MD are kept
  • Add unit tests to validate sanitization only applies to author-generated content and not the developer content (i.e. HTL template)
  • Add ADR to document the decision

fix #253

@codecov
Copy link

codecov bot commented Apr 15, 2019

Codecov Report

Merging #263 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master   #263   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          46     47    +1     
  Lines        1022   1041   +19     
=====================================
+ Hits         1022   1041   +19
Impacted Files Coverage Δ
src/utils/heading-handler.js 100% <100%> (ø) ⬆️
src/html/sanitize.js 100% <100%> (ø)
src/defaults/html.pipe.js 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 756229a...2c8ea4a. Read the comment docs.

@ramboz ramboz requested review from trieloff, koraa and kptdobe and removed request for trieloff and koraa April 15, 2019 22:56
const HELIX_SCHEMA_CUSTOMIZATIONS = {
tagNames: ['esi:include'],
attributes: {
'esi:include': ['src'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

esi:include ? this is never part of the markdown - or at least, it shouldn't be.
so best to remove it completely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This schema controls the HTML output, i.e. it lists all valid HTML tags. As we generate esi:include tags, they are valid.

test/testMdastToVDOM.js Outdated Show resolved Hide resolved
test/testMdastToVDOM.js Outdated Show resolved Hide resolved
test/testMdastToVDOM.js Outdated Show resolved Hide resolved
Copy link
Contributor

@trieloff trieloff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tentative approval: take out the non-XSS related changes to the tests, and ping me again – then I'll merge. As for custom elements, it makes most sense to open a separate issue.

const HELIX_SCHEMA_CUSTOMIZATIONS = {
tagNames: ['esi:include'],
attributes: {
'esi:include': ['src'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This schema controls the HTML output, i.e. it lists all valid HTML tags. As we generate esi:include tags, they are valid.


// Define helix specific customizations on top of the default GitHub schema
const HELIX_SCHEMA_CUSTOMIZATIONS = {
tagNames: ['esi:include'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a strict white-list of allowed tag names, we might run into issues when users want to customize the VDOMTransformer to create HTML custom elements. Maybe we can add all tag names with a dash in them to the white list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. But then again, white-listing is the safest XSS approach.
The best security is always "deny-all" and open up what you need. I'd rather let the customer extend the schema with their own as needed, than just "allow-all".

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is still an issue with your new implementation. For normal HTML elements, let's use a whitelist of known good elements. For custom elements, let's assume that they are all good, because they would have to be inserted by the developer of the site.

As long as we don't have a good way of configuring the XSS filter (and we probably should not introduce a configuration anyway), we should have defaults that work, even when you are using extensions.

Copy link
Contributor Author

@ramboz ramboz Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still would be an issue at the moment. DOMPurify will remove any custom elements by default. We'd have to follow their proposed solution to cure53/DOMPurify#184 in:
https://github.com/cure53/DOMPurify/blob/master/test/test-suite.js#L294-L312

we probably should not introduce a configuration anyway

I'm not sure about that one. We can have a catch-all solution for <\w+-\w+ tags, but you'd still end up with edge cases that authors may need to have.

  • What about <customtag/>?
  • Should we support custom properties as well? <custom-tag custom-prop/>
  • What about <esi:custom/>?

Assuming all custom elements are "good" is the same as saying all authors know what they are doing and their sessions cannot be compromised. While I'm ok with that argument for the DOM Clobbering topic as DOMPurifies is applied after that anyway and secures the response, I'm really not convinced here. I've seen way to many authors just copy/paste examples from elsewhere without understanding what they do.

My proposal would be to "white-list" all the esi: tags we need, but lock down the rest and let it be configured for each project by extending the default DOMPurify config (and abstracting it obviously as it's an implementation detail). Make helix "super-secured" by default, but let project teams open up some doors if they feel safe about them.

Just a random thought, but I would be happy as a front-end dev to define something like:
.helix-config.json

{
  "template-rules": {
    "allow-custom-html-tags": false, // allow none (the default, except esi: white-listed ones)
    "allow-custom-html-tags": true, // allow all
    "allow-custom-html-tags": ["my-tag", "my-other-tag"], // allow some
    "allow-custom-html-attributes": false, // allow none (the default)
    "allow-custom-html-attributes": true, // allow all
    "allow-custom-html-attributes": ["my-attribute", "my-other-attribute"], // allow some
  }
}

i.e. the same way I would define my linting configs for instance.

@lkrapf, @asanso, @rofe: any thoughts on that topic?

Copy link
Contributor

@trieloff trieloff Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about <customtag/>?

not a Custom Element, according to spec: forbidden.

Should we support custom properties as well? <custom-tag custom-prop/>

within a custom element, yes.

What about <esi:custom/>?

Neither a proper Custom Element, nor part of the ESI language (https://www.w3.org/TR/esi-lang): forbidden

For ESI, we need a while list:

  • esi:include
  • esi:remove

For Custom Elements we don't know what's coming in advance, so we can assume that the developer who is building the VDOM Matcher to produce my-foo-element is not trying to attack the other developer, working on the same code base, who is implementing class MyFooElement extends HTMLParagraphElement… on the client-side.

Copy link
Contributor Author

@ramboz ramboz Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree. But I'd really like to see some potential attack vectors described and tests for them being written. At the moment, I can't come up with any, but maybe @lkrapf or @asanso can?

Also for ESI, do we pre-whitelist all accepted tags from the spec, or only the ones we actually use at the moment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd only allow the ESI tags that are actually supported by Fastly (complete list above)

test/testMdastToVDOM.js Outdated Show resolved Hide resolved
Copy link
Contributor

@koraa koraa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick reminder that this is not really an XSS issue; we are processing md generated by the authors of the website; not user generated md.

I won't block because of the issues I took with the sanatization library because this may be a point where my opinion finds disagreement…but you could try to convince me why this is a better way ;)

Thanks for approaching this issue!

@@ -194,7 +195,9 @@ class VDOMTransformer {
* @returns {string} the corresponding HTML
*/
static toHTML(mdast, handlers) {
return hast2html(mdast2hast(mdast, { handlers }));
const rawHast = mdast2hast(mdast, { handlers });
const sanitizedHast = sanitize(rawHast);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a ton of well maintained, well reviewed (just by sheer amount) html sanitizers out there. Just fom 5s of research I get:

https://github.com/cure53/DOMPurify

I would rather trust one of those – if just by sheer number of users – than a
library that is used by few and touched every couple of months…
This is a security issue after all…

I don't see an advantage of doing this in htast over dom – but I am tooting the same horn here as in the other PR. If we model this as a dom transforming pipeline step we give users the ability to opt out (maybe they want to have js in ther markdown) and we are Independent from htast/remark.

Copy link
Contributor Author

@ramboz ramboz Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koraa

than a library that is used by few and touched every couple of months…

I'd tend to agree here. Although this was updated in the last couple of months and does have some good tests for it, so it didn't feel too "amateurish" either.

Plus it is from the same org as a lot of our other modules (hast-to-hyperscript, hast-util-to-html, mdast-util-to-hast, mdast-util-to-string, unist-util-map, unist-util-select, …) which basically face the same issues, so my assumption was these thresholds (last commit, number of maintainers, tests, etc.) were the accepted norm for helix.

In previous projects, I'd filter with these criteria:

  • last commit in the last 3 months
  • last closed bug in the last month
  • no stale bugs (>6 months old)
  • proper test coverage (80%+)
  • active community (10+ forks and 10+ stars)
  • proper JSDoc (or equivalent) in the source code
  • 3+ contributors

I'd be happy to start a discussion about how to properly chose dependencies for helix. Since the project is still in its infancy, it's a good time to define best practices and choose dependencies carefully for the long run and not only for the short-time gain.
Would you agree to handle that in a separate issue and close the security hole for now with this?

I'd also like to get some more use cases from @asanso to write more tests seeing his background in security. Ultimately, I also want a library that offers the best protection against XSS, and at the moment, the couple tests I have in this PR do not feel that comprehensive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but those other libraries are not as security-relevant I would say…

It does have good tests, that's true. I personally do not have specific thresholds for anything; it's more about comparing the options and finding the best choice.
How about you take a look at a couple of html sanatization libs and compare them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tend to take a look at the source code of libs and tbh I am more impressed with DOMPurify than hast-util-sanitize (the latter uses a bit of a weird style and the former seems to be much better documented and more complete)

Btw I didn't mean to imply that hast-util-sanitize was a bad or amateurish choice; if it was the only choice we had, I'd go for it. But it isn't the only choice and I think we may be able to do better.

Copy link
Contributor Author

@ramboz ramboz Apr 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. That is also why I hope to get some more input from @asanso. I don't want to make a pure judgment call on the lib itself for now, and want to make sure we properly cover the XSS use case with it.

I'm also not partial with hast-util-sanitize. I like the kind of discussion this PR triggers.
I'll play around with other libs this week and see what I can come up with

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++1

@ramboz
Copy link
Contributor Author

ramboz commented Apr 16, 2019

@koraa I see your point. I agree at the moment that the risk coming from the author is quite limited.

Now if I try to imagine a worst case scenario, I'd be looking at the risk of having someone gaining access to the author account, or being added to the project team, and then maliciously injecting random JS in the markdown (bitcoin miner, keylogger, etc.). Also, just a non-savvy author could randomly copy-paste from stackoverflow and end up opening some loopholes.

I could also imagine that at some point we would also integrate git comments, etc. in the pipeline and this would open up additional attack vectors. So I think sanitization and XSS escaping is definitely needed.

@koraa
Copy link
Contributor

koraa commented Apr 16, 2019

I could also imagine that at some point we would also integrate git comments, etc. in the pipeline and this would open up additional attack vectors. So I think sanitization and XSS escaping is definitely needed.

Agreed!

Copy link
Contributor

@tripodsan tripodsan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like @rofe mentioned in #253, the problem should be tackled in the output template, and not in the input. by escaping the mdast, you also remove possibility to easily parse the links.

if it is not possible to do it in HTL, then I agree with @koraa and would prefer that the XSS protection this done on the dom. since markdown might not be the only source for content. e.g. what if helix-embed introduces XSS? or the image-optimizer?

I'll look into why HTL doesn't escape correctly.

@rofe
Copy link
Contributor

rofe commented Apr 17, 2019

I'll look into why HTL doesn't escape correctly.

@lkrapf mentioned that the HTL regexp in Sling evolved quite a bit since you ported it to node.

@tripodsan
Copy link
Contributor

@lkrapf mentioned that the HTL regexp in Sling evolved quite a bit since you ported it to node.

why regexp?

@lkrapf
Copy link

lkrapf commented Apr 17, 2019

@tripodsan wrote:

why regexp?

This is the regexp Sling uses to validate href's:
https://github.com/apache/sling-org-apache-sling-xss/blob/5071a142cdfeaf128a0db1ed5222b4e3753303c9/src/main/resources/SLING-INF/content/config.xml#L76

built from this code:
https://github.com/apache/sling-org-apache-sling-xss/blob/5071a142cdfeaf128a0db1ed5222b4e3753303c9/src/main/java/org/apache/sling/xss/impl/XSSFilterImpl.java#L84

However, a good alternative for Helix might be to use https://github.com/cure53/DOMPurify, it is very well tested, and you could use it for both client- and serverside.

@adobe adobe deleted a comment Apr 19, 2019
@adobe adobe deleted a comment Apr 19, 2019
@@ -55,7 +55,7 @@ class HeadingHandler {
// Inject the id after transformation
const n = Object.assign({}, node);
const el = fallback(h, n);
el.properties.id = headingIdentifier;
el.properties.id = el.properties.id || `user-content-${headingIdentifier}`;
Copy link
Contributor Author

@ramboz ramboz Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an id was provided by a previous step, keep it as is, otherwise make sure to escape it to avoid DOM clobbering.
If provided id does clobber the DOM, DOMPurify will remove it during the sanitization step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user-content- prefix you are inserting here seems odd and is leading to heading IDs that don't match GitHub's.

Copy link
Contributor Author

@ramboz ramboz Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github does the same. You can check the helix home readme for instance and you'll notice that

## Background

becomes

<h2>
    <a id="user-content-background" class="anchor" aria-hidden="true" href="#background"></a>
    Background
</h2>

So we are behaving exactly like github here.

You can read more on the subject here for instance: https://opensoul.org/2014/09/05/dom-clobbering/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that the actual IDs in GitHub are different and that GitHub is using JavaScript to restore the scroll functionality they've broken through this clobbering.

I think we should not handle it the same way. Use proper id attribute values, and maybe add a blacklist of known DOM properties that are forbidden. You have access to JSDOM, which could help.

Copy link
Contributor Author

@ramboz ramboz Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tricky one to handle, as white/black listing is not necessarily enough, you also have to account for all character encoding possibilities. Any value put in an id or name attribute can potentially clobber the DOM, and browsers are supposed to prevent that (Chrome does, Firefox has a long history of not doing it, …)

If you only blacklist, then you fail to escape new properties that are introduced by the browser engine. The accepted norm here is to prefix using user-content- like github does it, so that you are sure the prefix alone prevents any collision with the DOM properties.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The markup producer is the author and trusted. I'd assume that most DOM clobbering issues that arise would be accidents rather than attacks.

Copy link
Contributor Author

@ramboz ramboz Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so assuming an "accident":

## Name

The process will look like this:

  • generate a heading with identifier name
  • DOM gets sanitized by DOMPurify which detects a collision on document.name, and removes it
  • final output is:
<h2>Name</h2>

with no anchor.

If customers want the prefixed version, they can do that in their own matching function for the heading

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. This is assuming that

## Name 1

would get an id, and DOMPurify isn't simply removing all id attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,

## Name 1

would be kept and turned into:

<h2 id="name-1">Name 1</h2>

In the end, I only see this being a real issue on API documentation websites where the author would probably use the API name for the heading:

# My Custom Object
## Properties
### Name
### OnClick
### OnChange

And for these edge cases, I would either:

  • Add a prefix to all headings: Property: Name, Event: OnClick in the markdown
  • Use a custom matcher function that injects a prefix in the identifier in the HTML
  • Use a custom JS script that does it clientside

@trieloff, @kptdobe, @rofe:
Since this would be a known edge case, where should we document this?
Is helix using ADRs (https://adr.github.io/) or do we have customer-facing documentation that would need an entry for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use ADRs, but it looks like a good template for this, so just put the first ADR into docs. Removing the ID for properties, name, etc. only seems like a good solution and there are enough documented workarounds that customers can implement if needed.

@@ -115,6 +115,6 @@ https://www.youtube.com/watch?v=KOxbO0EI4MA
assert.equal(result.response.body, `<p>Hello World
Here comes an embed.</p>
<esi:include src="https://example-embed-service.com/https://www.youtube.com/watch?v=KOxbO0EI4MA"></esi:include>
<p><img src="easy.png" alt="Easy!" srcset="easy.png?width=480&amp;auto=webp 480w,easy.png?width=1384&amp;auto=webp 1384w,easy.png?width=2288&amp;auto=webp 2288w,easy.png?width=3192&amp;auto=webp 3192w,easy.png?width=4096&amp;auto=webp 4096w" sizes="100vw"></p>`);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular changes here. DOMPurify just reorders the properties in the output

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a domEquivalence testes in helix.shared/src/dom.js that is precisely designed to work in such situations.

@ramboz
Copy link
Contributor Author

ramboz commented Apr 19, 2019

@tripodsan, @rofe: not sure about the HTL escape. It didn't seem to work here.
For now I updated the PR to use DOMPurify instead and also sanitize the output instead of the MDAST

@ramboz ramboz requested a review from tripodsan April 19, 2019 23:19
@rofe
Copy link
Contributor

rofe commented Apr 20, 2019

htlengine is supposed to output safe HTML in the first place. On the other hand, it's probably a good idea to run the response.body through a proven sanitizer before sending it to the browser.

@ramboz
Copy link
Contributor Author

ramboz commented May 28, 2019

@trieloff I've been rebasing a couple times to resolve the conflict. I'll do it again!

ramboz and others added 14 commits May 28, 2019 09:51
Properly escape the initial markdown and custom matchers to avoid potential XSS attacks via JS
injection, and also avoid DOM clubbering

fix #253
Properly escape the initial markdown and custom matchers to avoid potential XSS attacks via JS
injection, and also avoid DOM clubbering

fix #253
Removing duplicate import

fix “253
Refactor to sanitize the output using DOMPurify in the after step of the pipeline instead of the
previous alternative in the mdast to vdom step

fix #253
Remove the hast-util-sanitize based sanitization logic as this is now replaced by the DOMPurify
pipeline after step

fix #253
Removing unnecessary modification to existing tests as the use case is covered by a separate
dedicated test now

fix #253
… execution

Move the sanitization step right before the pre function execution so that author-generated content
is properly sanitized, but sanitizing the developer-generated content is the responsibility of the
template engine

fix #253
The HTML pipeline should sanitize the author-generated content, but not the developer code as the
templating engine will mostly do that.

fix #253
Let the sanitization step accept custom elements and attributes added to the markdown

fix #253
Adjust tests after removal of the DOM Clobbering escape logic

fix #253
Adding an architecture decision record for the sanitization of heading identifiers that would cause
DOM clobbering

fix #253
Fix a merge conflict from a broken rebase operation

fix #253
Update the sanitization logic to leverage the newly introduced middleware pattern

fix #253
Fix XSS tests so the match the new simplified pipeline step executor logic

fix #253
@adobe adobe deleted a comment from lgtm-com bot May 28, 2019
@ramboz
Copy link
Contributor Author

ramboz commented May 28, 2019

@trieloff Fixed the merge conflict and updated the tests according to the latest changes. Please take a moment to re-review, thanks :)

Context
-------

> Describe the detailed context and problem being addressed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line break missing

Suggested change
> Describe the detailed context and problem being addressed
> Describe the detailed context and problem being addressed

@@ -63,6 +64,7 @@ const htmlpipe = (cont, payload, action) => {
.before(selecttest)
.before(html).expose('html')
.before(responsive)
.before(sanitize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once we modified the entire pipeline (including the HTL engine) to be based on JSDOM (see #337), should be move this after the once() step?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because this would make it impossible to have unsanitized output, which would severely reduce the power and expressiveness of the templates. As we generally assume developers know what they want, we shouldn't do this.

@trieloff trieloff merged commit 0513756 into master May 29, 2019
@trieloff
Copy link
Contributor

Thanks for your contribution and your patience with us @ramboz

@adobe-bot
Copy link

🎉 This PR is included in version 2.2.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@rofe rofe deleted the issue/253 branch May 29, 2019 11:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[XSS] Sanitize generated markdown to avoid XSS attacks
7 participants