This repository has been archived by the owner on Feb 28, 2022. It is now read-only.

Data Pipelines with Markdown Templates? #593

Closed
trieloff opened this issue Feb 20, 2020 · 20 comments
@trieloff
Contributor

trieloff commented Feb 20, 2020

As a continuation of the discussion we had around query-backed pipelines (adobe/helix-home#90) I'm trying to write up a mini-spec around an offhand remark @davidnuescheler made along the lines of "I think query-backed pipelines should be data pipelines – and they should use Markdown templates". This is an obvious riff on @dylandepass's original headless CMS POC, so I'd love to hear his feedback.

Use case

Some content is just not hypertext-shaped, but effectively spreadsheet-shaped, for instance:

  • the latest blog posts for a blog's homepage (coming from a search query)
  • the navigation categories on a site, managed in a Google Spreadsheet
  • dealer locations for a company homepage, managed through Airtable

Current Solution

In the current solution (see theblog for an example), we would fetch the JSON representation of the data on the client-side, then navigate to "the big array", iterate over it and apply HTML templates (or some other templating language) to update the DOM.

This has the drawback of adding both engineering and runtime complexity, because you now have to deal with two templating systems (HTL on the server side and whatever you have on the client side).

Proposed solution

Instead of using a separate pipeline which would come with the problem of selecting the correct extension or selector, we supercharge the normal Markdown to HTML pipeline, so that it can detect some special directives in the Markdown front matter:

---
# this can be just the URL of a known data source, the pipeline will know how to fetch the JSON representation
hlx_data: https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0
# TBD, we might want to use a different templating language
hlx_template: handlebars
# a JSON pointer expression that points to the root node to start iterating
hlx_root: /sheets[1]
# a sift https://www.npmjs.com/package/sift expression to filter from the root
hlx_filter: 
  maker:
    $eq: Tesla
---

# A list of cars
  {{#each row}}
## {{this.maker}} {{this.model}}

{{this.model}} was first released in {{this.year}}
  {{/each}}

Upon detecting a hlx_data tag in the front matter, the pipeline will:

  1. fetch the associated JSON
  2. (maybe post-process it – possible pre.js extension point)
  3. use the JSON Pointer from hlx_root to navigate to an array node
  4. use sift and hlx_filter to remove elements from the array that don't match
  5. apply the template language specified in hlx_template (I'm picking handlebars here, @dylandepass suggested Squirrely) to generate a markdown document
  6. parse the markdown document and continue the pipeline
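
Steps 3 to 5 can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions: `resolvePointer`, `matches` and `render` are hypothetical helpers, not helix-pipeline APIs; the filter handles only a sift-style `$eq`; the shape of the fetched JSON is invented; and the pointer is written as `/sheets/1`, the RFC 6901 form of the `/sheets[1]` used in the front matter above.

```javascript
// Hypothetical sketch of steps 3-5; none of these helpers are helix-pipeline APIs.

// Step 3: resolve an RFC 6901 JSON Pointer such as "/sheets/1".
function resolvePointer(doc, pointer) {
  if (pointer === '') return doc;
  return pointer
    .split('/')
    .slice(1)
    .map((token) => token.replace(/~1/g, '/').replace(/~0/g, '~'))
    .reduce((node, token) => (node == null ? undefined : node[token]), doc);
}

// Step 4: keep rows matching the filter; only a sift-style `$eq` is handled here.
function matches(row, filter) {
  return Object.entries(filter).every(([key, cond]) => row[key] === cond.$eq);
}

// Step 5: stand-in for a template engine, repeating one Markdown snippet per row.
function render(rows, template) {
  return rows
    .map((row) => template.replace(/\{\{\s*(\w+)\s*\}\}/g,
      (match, key) => (key in row ? String(row[key]) : match)))
    .join('\n\n');
}

// Assumed shape of the fetched JSON; the real spreadsheet representation may differ.
const data = {
  sheets: [
    [],
    [
      { maker: 'Tesla', model: 'Model S', year: 2012 },
      { maker: 'Nissan', model: 'Leaf', year: 2010 },
    ],
  ],
};

const rows = resolvePointer(data, '/sheets/1')
  .filter((row) => matches(row, { maker: { $eq: 'Tesla' } }));
const body = render(rows, '## {{maker}} {{model}}\n\n{{model}} was first released in {{year}}');
const markdown = `# A list of cars\n\n${body}`;
console.log(markdown);
```

The resulting Markdown string would then go through the normal Markdown to HTML pipeline (step 6).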

Intended Limitations

In the spirit of "how simple can we make things and still get away with it", there are some limitations:

  • you can have only one JSON source per document. This means no bindings like in @dylandepass's original design. If you want to combine multiple sources, use ESI or a cgi-bin script
  • we should pick one templating language instead of making it configurable to make things a bit easier
@trieloff added the discussion and enhancement labels Feb 20, 2020
@trieloff
Contributor Author

Thinking about this a bit more over the weekend, we might want to go with an even simpler solution:

# A list of cars

---
# this can be just the URL of a known data source, the pipeline will know how to fetch the JSON representation
hlx_data: https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0
# a JSON pointer expression that points to the root node to start iterating
hlx_root: /sheets[1]
# a sift https://www.npmjs.com/package/sift expression to filter from the root
hlx_filter: 
  maker:
    $eq: Tesla
---

## {{maker}} {{model}}

{{model}} was first released in {{year}}

In this example, the iteration would be implicit (no {{each}} tags needed) and per-section. We would not need a full template engine and limit the depth of iteration.
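
A rough sketch of what implicit, per-section iteration could look like, assuming a hypothetical model in which each section is an object with a `body` and, for data sections only, an array of `rows`. The names `fillPlaceholders` and `expandSections` are made up for illustration and the rows are sample data.

```javascript
// Hypothetical model: a section is { body, rows? }; `rows` is set only for data sections.

// Replace every {{name}} with the matching value; unknown names are left untouched.
function fillPlaceholders(body, row) {
  return body.replace(/\{\{\s*(\w+)\s*\}\}/g,
    (match, key) => (key in row ? String(row[key]) : match));
}

// Data sections are repeated once per row; all other sections pass through as fixed content.
function expandSections(sections) {
  return sections.flatMap((section) => (section.rows
    ? section.rows.map((row) => fillPlaceholders(section.body, row))
    : [section.body]));
}

const sections = [
  { body: '# A list of cars' },
  {
    body: '## {{maker}} {{model}}\n\n{{model}} was first released in {{year}}',
    rows: [
      { maker: 'Tesla', model: 'Model S', year: 2012 },
      { maker: 'Tesla', model: 'Model 3', year: 2017 },
    ],
  },
];

const expanded = expandSections(sections);
console.log(expanded.join('\n\n---\n\n'));
```

No loop construct appears in the template itself: the section boundary is the only thing that decides what gets repeated.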

@tripodsan
Contributor

I would stick to a proper, well documented template engine, like handlebars and not come up with our own simple solution.

In the example above, how would you mix repetition with fixed content?

@trieloff
Contributor Author

In the example above,

# A list of cars

would be fixed content (because it's outside of the section) and

## {{maker}} {{model}}

{{model}} was first released in {{year}}

would be repeated for each matching element in https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0

@tripodsan
Contributor

don't you think this is too much magic?

  • how would you specify multiple sources?
  • how would you define how many sections to loop

I can easily imagine that a very simple solution will soon not be enough, and that developers will ask for a more comprehensive template language.

@trieloff
Contributor Author

I'm not sure about this yet. My proposal:

  • doesn't feel like magic to me, as you don't have to use weird spells like {{#each}}
  • doesn't feel like the kind of magic you can invoke accidentally

how would you specify multiple sources?

You don't. If you need to combine multiple sources, write a cgi-bin action that does so, or use includes (this does not allow mixing, though).

how would you define how many sections to loop

You loop only through the current section. You may argue with me that there can be multiple sections, each with a data source of its own to be looped.

nobody designs a list view

I don't think I'm violating the letter of the law (no mockups here) or the spirit. I'd even argue that my proposal does not support the creation of looped Markdown tables, so I'm feeling safe.

don't let markdown get weird

The YAML front matter definitely has potential for weirdness, and so do the {{variable}} tags in my proposal. However, when it comes to using a "proper, well documented", and "more comprehensive" template language (reasonable asks, I might add) the weirdness level is definitely increased.

At the moment I see the most viable choices to be Handlebars and MDX (Markdown + JSX). Handlebars looks like the original example:

---
# this can be just the URL of a known data source, the pipeline will know how to fetch the JSON representation
hlx_data: https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0
# TBD, we might want to use a different templating language
hlx_template: handlebars
# a JSON pointer expression that points to the root node to start iterating
hlx_root: /sheets[1]
# a sift https://www.npmjs.com/package/sift expression to filter from the root
hlx_filter: 
  maker:
    $eq: Tesla
---

# A list of cars
  {{#each row}}
## {{this.maker}} {{this.model}}

{{this.model}} was first released in {{this.year}}
  {{/each}}

MDX would look like this:

import { Each, Text } from "helix-mdx";

# A list of cars

<Each data="https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0" root="/sheets[1]">
  <Text src="{{model}}"/> was first released in <Text src="{{year}}"/>
</Each>

Using MDX would technically not introduce another templating language (as Helix already supports JSX), but I just can't see this being a good idea.

Handlebars may be a viable choice, but I think we might also add this functionality later when we need it.

@trieloff
Contributor Author

trieloff commented Feb 25, 2020

Even simpler (suggested by @davidnuescheler)

# A list of cars

https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit?root=sheets[1]&filter=maker:Tesla

## {{maker}} {{model}}

{{model}} was first released in {{year}}

---

That was a cool list, wasn't it?

we would detect the https://docs.google.com/spreadsheets/ embed to be a data-embed and turn the rest of the document into an iterator.
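
As a sketch of how such a link might be recognized and split into its parts, assuming the `root` and `filter` query parameters behave as in the example above (`key:value` meaning equality). `parseDataEmbed` is a hypothetical helper, not the detection code that ended up in the pipeline.

```javascript
// Hypothetical helper: split a data-embed URL into source, root and filter.
function parseDataEmbed(href) {
  const url = new URL(href);
  // Assumption: only Google Sheets links count as data embeds in this sketch.
  if (url.hostname !== 'docs.google.com' || !url.pathname.startsWith('/spreadsheets/')) {
    return null;
  }
  const root = url.searchParams.get('root');
  const filter = {};
  for (const pair of url.searchParams.getAll('filter')) {
    const index = pair.indexOf(':');
    // "maker:Tesla" becomes a sift-style equality condition.
    if (index > 0) filter[pair.slice(0, index)] = { $eq: pair.slice(index + 1) };
  }
  url.search = ''; // drop root/filter so only the spreadsheet URL remains
  return { source: url.href, root, filter };
}

const embed = parseDataEmbed(
  'https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit?root=sheets[1]&filter=maker:Tesla',
);
console.log(embed);
```

Any other link returns `null` and would be treated as a regular embed or plain link.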

@trieloff
Contributor Author

Another question from @davidnuescheler:

Is there a way to make this work more like regular embeds, i.e. just paste the URL to the spreadsheet and get the templated markdown/html as a result.

This would have the advantage of keeping the markdown decidedly non-weird (no templates at all), but it raises the question of how to specify the template (we cannot assume we will guess the correct formatting) and how to run it, especially in light of Helix Pages' "no server-side imperative code" paradigm.

trieloff added a commit that referenced this issue Mar 4, 2020
@trieloff trieloff mentioned this issue Mar 5, 2020
2 tasks
trieloff added a commit that referenced this issue Mar 11, 2020
trieloff added a commit that referenced this issue Mar 11, 2020
injects values from data embeds, but does not support sections or deep object access yet

fix #593
@trieloff
Contributor Author

@davidnuescheler I've implemented your penultimate idea (couldn't come up with a way to implement the last suggestion) in #611 – please take a look at the documentation here: https://github.com/adobe/helix-pipeline/blob/data-embeds/docs/markdown.md

We could (quite easily) also implement the frontmatter syntax that I suggested, but I'd leave that for a separate PR.

adobe-bot pushed a commit that referenced this issue Mar 18, 2020
# [6.8.0](v6.7.5...v6.8.0) (2020-03-18)

### Bug Fixes

* **data-sections:** ensure that unist map can handle async callbacks ([4f80666](4f80666))
* **data-sections:** ensure that unist map can handle async callbacks ([13dca79](13dca79))
* **embeds:** add proper error handling (logging) for failed data embed downloads ([a221714](a221714)), closes [#593](#593)
* **embeds:** add proper error handling (logging) for failed data embed downloads ([79d194e](79d194e)), closes [#593](#593)
* **embeds:** data embeds update the surrogate key based on the source URL ([638415b](638415b))
* **embeds:** remove dataEmbed nodes from mdast after detection ([ae6445c](ae6445c))
* **embeds:** remove dataEmbed nodes from mdast after detection ([0ab4dba](0ab4dba))
* **embeds:** use a proper logger when fetching data embeds ([aca4798](aca4798))
* **embeds:** use a proper logger when fetching data embeds ([2f959a0](2f959a0))

### Features

* **embeds:** add data section extraction step ([e7bc55a](e7bc55a))
* **embeds:** add data section extraction step ([bd721a5](bd721a5))
* **embeds:** detect data embeds ([155df67](155df67)), closes [#593 (comment)](https://github.com/adobe/helix-pipeline/issues/593#issuecomment-590956631)
* **embeds:** detect data embeds ([fec11d4](fec11d4)), closes [#593 (comment)](https://github.com/adobe/helix-pipeline/issues/593#issuecomment-590956631)
* **embeds:** implement data embeds for sections ([de54ccb](de54ccb)), closes [#593](#593)
* **embeds:** implement data embeds for sections ([366a7e9](366a7e9)), closes [#593](#593)
* **embeds:** provide basic data injection ([e733e24](e733e24)), closes [#593](#593)
* **embeds:** provide basic data injection ([355b2ba](355b2ba)), closes [#593](#593)
* **embeds:** support dot notation `{{foo.bar}}` in data embed templates ([283972a](283972a)), closes [#593](#593)
* **embeds:** support dot notation `{{foo.bar}}` in data embed templates ([ace965e](ace965e)), closes [#593](#593)
* **utils:** add cache-utils for merging cache-control headers ([3568d39](3568d39))
@adobe-bot

🎉 This issue has been resolved in version 6.8.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@rofe
Contributor

rofe commented Mar 18, 2020

I'd like to discuss two aspects:

  • output sanitization: we are dealing with arbitrary user content, and with large spreadsheets we can't assume authors have full control over it anymore. to be on the safe side, all content must be sanitized before rendering it in a browser. will this happen in the pipeline's sanitize step, or should we let authors control it through the syntax (e.g. handlebars' {{clean}} vs {{{dirty}}})?
  • pagination: for large amounts of data, it should be possible to limit the output and to navigate or lazy-load the rest. uncontrollably large HTML files could also become dangerous (DoS) if we just render whatever data we are given.
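
One way to bound the output size, sketched under the assumption that rows are already available as an array. The `paginate` helper and its `offset` and `limit` options are invented for this example; nothing like them exists in the pipeline as far as this thread shows.

```javascript
// Hypothetical paging helper; `offset` and `limit` are names invented for this sketch.
function paginate(rows, { offset = 0, limit = 50 } = {}) {
  const start = Math.max(0, offset);
  const page = rows.slice(start, start + limit);
  return {
    page,
    total: rows.length,
    // true while rows remain after this page
    hasMore: start + page.length < rows.length,
  };
}

const allRows = Array.from({ length: 120 }, (_, i) => ({ id: i + 1 }));
const first = paginate(allRows, { limit: 50 });
const last = paginate(allRows, { offset: 100, limit: 50 });
console.log(first.page.length, first.hasMore); // 50 true
console.log(last.page.length, last.hasMore); // 20 false
```

A hard upper bound on `limit` would cap the size of the rendered HTML regardless of how large the spreadsheet grows.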

@rofe rofe reopened this Mar 18, 2020
@tripodsan
Contributor

to be on the safe side, all content must be sanitized before rendering it in a browser.

since we are injecting markdown, the document gets sanitized the same way as normal markdown content.

@tripodsan
Contributor

An alternative to handlebars would be to use HTL expressions, e.g. ${car.model}.

But I think handlebars is more versatile, since it would allow for more control flow in the future.

@trieloff
Contributor Author

since we are injecting markdown, the document gets sanitized the same way as normal markdown content.

That's my assumption, too.

@rofe
Contributor

rofe commented Mar 19, 2020

That's my assumption, too.

We should add a test to confirm it :)

trieloff added a commit that referenced this issue Mar 20, 2020
@trieloff
Contributor Author

Since #356 we don't have the XSS sanitizer enabled.

@rofe
Contributor

rofe commented Mar 20, 2020

That's fine for markup from a markdown file, but bad for markup from a large spreadsheet.

@trieloff
Contributor Author

trieloff commented Mar 23, 2020

I'd think so, too. How would we fix this? With {{}} and {{{}}} for secure and insecure output, or should we try to filter everything regardless?
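
For comparison, this is roughly how the two-syntax option would behave, mirroring the Handlebars convention where `{{x}}` is HTML-escaped and `{{{x}}}` is emitted raw. The `inject` and `escapeHtml` functions are hand-rolled for illustration and are not the pipeline's implementation.

```javascript
const ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => ESCAPES[ch]);
}

// Triple braces are tried first so `{{{x}}}` is not read as `{{x}}` plus stray braces.
function inject(template, row) {
  const pattern = /\{\{\{\s*(\w+)\s*\}\}\}|\{\{\s*(\w+)\s*\}\}/g;
  return template.replace(pattern, (match, rawKey, safeKey) => {
    const key = rawKey || safeKey;
    if (!(key in row)) return match; // leave unknown placeholders alone
    return rawKey ? String(row[key]) : escapeHtml(row[key]);
  });
}

const cell = { model: '<script>alert(1)</script>' };
const escaped = inject('{{model}}', cell);
const unescaped = inject('{{{model}}}', cell);
console.log(escaped); // &lt;script&gt;alert(1)&lt;/script&gt;
console.log(unescaped); // <script>alert(1)</script>
```

Note that HTML escaping alone does not neutralize Markdown syntax in a cell value, so filtering everything would still need to happen on the rendered output.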

trieloff added a commit that referenced this issue Mar 23, 2020
@trieloff
Contributor Author

fb116cd adds another test, this time trying to brute-force inject <script> tags. This attempt does get thwarted.

@rofe
Contributor

rofe commented Mar 23, 2020

I'd go with filtering everything regardless, and wait for customers to complain.

@rofe rofe closed this as completed Sep 23, 2021