Data Pipelines with Markdown Templates? #593

trieloff · 2020-02-20T17:40:22Z

As a continuation of the discussion we had around query-backed pipelines (adobe/helix-home#90) I'm trying to write up a mini-spec around an offhand remark @davidnuescheler made along the lines of "I think query-backed pipelines should be data pipelines – and they should use Markdown templates". This is an obvious riff on @dylandepass's original headless CMS POC, so I'd love to hear his feedback.

Use case

Some content is just not hypertext-shaped, but effectively spreadsheet-shaped, for instance:

the latest blog posts for a blog's homepage (coming from a search query)
the navigation categories on a site, managed in a Google Spreadsheet
dealer locations for a company homepage, managed through Airtable

Current Solution

In the current solution (see theblog for an example), we fetch the JSON representation of the data on the client side, navigate to "the big array", iterate over it, and apply HTML templates (or some other templating language) to update the DOM.

The drawback is additional engineering and runtime complexity, because you now have to deal with two templating systems (HTL on the server side and whatever you use on the client side).
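
For reference, the client-side approach described above can be sketched roughly as follows. This is a minimal illustration, not the code used in theblog: the function name `renderPosts`, the payload shape `{ results: [...] }`, the field names `path` and `title`, and the URL in the comment are all assumptions made for the example.

```javascript
// Hypothetical sketch of the client-side approach.
// The payload shape ({ results: [...] }) and the field names are assumptions.

// Escape values before they are placed into HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Navigate to "the big array" and apply an HTML template to each entry.
function renderPosts(payload) {
  const rows = Array.isArray(payload.results) ? payload.results : [];
  return rows
    .map((row) => `<li><a href="${escapeHtml(row.path)}">${escapeHtml(row.title)}</a></li>`)
    .join('\n');
}

// In a browser this would typically be wired up like so (placeholder URL):
// fetch('/query-results.json')
//   .then((res) => res.json())
//   .then((json) => { document.querySelector('ul.posts').innerHTML = renderPosts(json); });
```

The second templating system is the template literal inside `renderPosts`, which lives next to whatever HTL produced the surrounding page.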

Proposed solution

Instead of using a separate pipeline, which would come with the problem of selecting the correct extension or selector, we supercharge the normal Markdown-to-HTML pipeline so that it can detect special directives in the Markdown front matter:

---
# this can be just the URL of a known data source, the pipeline will know how to fetch the JSON representation
hlx_data: https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0
# TBD, we might want to use a different templating language
hlx_template: handlebars
# a JSON pointer expression that points to the root node to start iterating
hlx_root: /sheets[1]
# a sift https://www.npmjs.com/package/sift expression to filter from the root
hlx_filter: 
  maker:
    $eq: Tesla
---

# A list of cars
  {{#each row}}
## {{this.maker}} {{this.model}}

{{this.model}} was first released in {{this.year}}
  {{/each}}

Upon detecting a hlx_data tag in the front matter, the pipeline will:

fetch the associated JSON
(maybe post-process it – possible pre.js extension point)
use the JSON Pointer from hlx_root to navigate to an array node
use sift and hlx_filter to remove elements from the array that don't match
apply the template language specified in hlx_template (I'm picking handlebars here, @dylandepass suggested Squirrelly) to generate a markdown document
parse the markdown document and continue the pipeline
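
The middle steps of this list (navigate, filter, apply the template) can be sketched as below. This is only an illustration under stated assumptions: the proposal names sift and Handlebars, but the helpers here (`resolveRoot`, `matches`, `renderEach`, `expandTemplate`) are hypothetical dependency-free stand-ins that cover just the `$eq` operator and a single `{{#each row}}` block. The sketch also assumes that `/sheets[1]` means the zero-based index 1; in RFC 6901 JSON Pointer syntax that would be written `/sheets/1`.

```javascript
// Minimal stand-ins for the proposed steps. Fetching, pre.js post-processing and
// markdown parsing are out of scope for this sketch.

// Resolve a pointer such as "/sheets[1]" (syntax from the example above) or
// "/sheets/1" (RFC 6901 style) against the fetched JSON.
function resolveRoot(json, pointer) {
  const tokens = pointer
    .replace(/\[(\d+)\]/g, '/$1') // treat "[1]" like "/1"
    .split('/')
    .filter((token) => token.length > 0);
  return tokens.reduce((node, token) => (node == null ? undefined : node[token]), json);
}

// Keep rows that match a filter of the form { field: { $eq: value } }.
function matches(row, filter) {
  return Object.entries(filter).every(([field, cond]) => row[field] === cond.$eq);
}

// Expand a single {{#each row}}...{{/each}} block containing {{this.field}} tags.
function renderEach(template, rows) {
  const block = /[ \t]*\{\{#each row\}\}\n([\s\S]*?)[ \t]*\{\{\/each\}\}\n?/;
  return template.replace(block, (whole, body) => rows
    .map((row) => body.replace(/\{\{this\.(\w+)\}\}/g, (tag, key) => String(row[key] ?? '')))
    .join(''));
}

// Navigate, filter, then apply the template.
function expandTemplate(json, frontMatter, template) {
  const root = resolveRoot(json, frontMatter.hlx_root);
  const rows = (Array.isArray(root) ? root : [])
    .filter((row) => matches(row, frontMatter.hlx_filter || {}));
  return renderEach(template, rows);
}
```

The result of `expandTemplate` is plain markdown again, which is what allows the rest of the pipeline to continue unchanged.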

Intended Limitations

In the spirit of "how simple can we make things and still get away with it", there are some limitations:

you can have only one JSON source per document. This means no bindings like in @dylandepass's original design. If you want to combine multiple sources, use ESI or a cgi-bin script
we should pick one templating language instead of making it configurable to make things a bit easier

trieloff · 2020-02-24T11:03:55Z

Thinking about this a bit more over the weekend, we might want to go with an even simpler solution:

# A list of cars

---
# this can be just the URL of a known data source, the pipeline will know how to fetch the JSON representation
hlx_data: https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0
# a JSON pointer expression that points to the root node to start iterating
hlx_root: /sheets[1]
# a sift https://www.npmjs.com/package/sift expression to filter from the root
hlx_filter: 
  maker:
    $eq: Tesla
---

## {{maker}} {{model}}

{{model}} was first released in {{year}}

In this example, the iteration would be implicit (no {{each}} tags needed) and per-section. We would not need a full template engine, and we would limit the depth of iteration to a single level.

tripodsan · 2020-02-25T00:35:23Z

I would stick to a proper, well-documented template engine like handlebars and not come up with our own simple solution.

In the example above, how would you mix repetition with fixed content?

trieloff · 2020-02-25T09:13:26Z

In the example above,

# A list of cars

would be fixed content (because it's outside of the section) and

## {{maker}} {{model}}

{{model}} was first released in {{year}}

would be repeated for each matching element in https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0
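
A rough sketch of that split between fixed and repeated sections is shown below. It is not the pipeline's implementation: the section model (`{ markdown, rows }`) and the helper names `fillPlaceholders` and `expandSections` are assumptions, and the rows are taken to be already fetched and filtered.

```javascript
// Hypothetical model: a document is a list of sections. A section that carries
// `rows` is repeated once per row; every other section is fixed content.

// Replace {{name}} placeholders with values from one row.
function fillPlaceholders(template, row) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (tag, key) => (
    Object.prototype.hasOwnProperty.call(row, key) ? String(row[key]) : tag
  ));
}

// Expand all sections and join them back into one markdown string.
function expandSections(sections) {
  return sections
    .flatMap((section) => (Array.isArray(section.rows)
      ? section.rows.map((row) => fillPlaceholders(section.markdown, row))
      : [section.markdown]))
    .join('\n\n');
}
```

Unknown placeholders are left untouched in this sketch, which is one possible choice; dropping them would be equally defensible.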

tripodsan · 2020-02-25T10:12:06Z

Don't you think this is too much magic?

how would you specify multiple sources?
how would you define how many sections to loop?

I can easily imagine that a very simple solution will soon not be enough, and that developers will ask for a more comprehensive template language.

stefan-guggisberg · 2020-02-25T12:47:54Z

BTW, just a reminder ;-)

trieloff · 2020-02-25T14:02:46Z

I'm not sure about this yet. My proposal:

doesn't feel like magic to me, as you don't have to use weird spells like {{#each}}
doesn't feel like the kind of magic you can invoke accidentally

how would you specify multiple sources?

You don't. If you need to combine multiple sources, write a cgi-bin action that does so, or use includes (this does not allow mixing, though).

how would you define how many sections to loop?

You loop only through the current section. You may argue with me that there can be multiple sections, each with a data source of its own to be looped.

nobody designs a list view

I don't think I'm violating the letter of the law (no mockups here) or the spirit. I'd even argue that my proposal does not support the creation of looped Markdown tables, so I'm feeling safe.

don't let markdown get weird

The YAML front matter definitely has potential for weirdness, and so do the {{variable}} tags in my proposal. However, when it comes to using a "proper, well documented", and "more comprehensive" template language (reasonable asks, I might add) the weirdness level is definitely increased.

At the moment I see the most viable choices to be Handlebars and MDX (Markdown + JSX). Handlebars looks like the original example:

---
# this can be just the URL of a known data source, the pipeline will know how to fetch the JSON representation
hlx_data: https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0
# TBD, we might want to use a different templating language
hlx_template: handlebars
# a JSON pointer expression that points to the root node to start iterating
hlx_root: /sheets[1]
# a sift https://www.npmjs.com/package/sift expression to filter from the root
hlx_filter: 
  maker:
    $eq: Tesla
---

# A list of cars
  {{#each row}}
## {{this.maker}} {{this.model}}

{{this.model}} was first released in {{this.year}}
  {{/each}}

MDX would look like this:

import { Each, Text } from "helix-mdx";

# A list of cars

<Each data="https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit#gid=0" root="/sheets[1]">
  <Text src="{{model}}"/> was first released in <Text src="{{year}}"/>
</Each>

Using MDX would technically not introduce another templating language (as Helix already supports JSX), but I just can't see this being a good idea.

Handlebars may be a viable choice, but I think we might also add this functionality later when we need it.

trieloff · 2020-02-25T16:48:12Z

Even simpler (suggested by @davidnuescheler)

# A list of cars

https://docs.google.com/spreadsheets/d/1IX0g5P74QnHPR3GW1AMCdTk_-m954A-FKZRT2uOZY7k/edit?root=sheets[1]&filter=maker:Tesla

## {{maker}} {{model}}

{{model}} was first released in {{year}}

---

That was a cool list, wasn't it?

We would detect the https://docs.google.com/spreadsheets/ embed as a data embed and turn the rest of the document into an iterator.
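
Detection of such a line could look roughly like this. The function name `parseDataEmbed` and the returned object shape are assumptions for illustration; only the `root` and `filter` query parameters from the example above are handled, and `filter` is assumed to hold a single `field:value` pair.

```javascript
// Hypothetical detector for data embeds, using the standard URL class.
function parseDataEmbed(line) {
  let url;
  try {
    url = new URL(line.trim());
  } catch (e) {
    return null; // not a URL at all, so not an embed
  }
  if (url.hostname !== 'docs.google.com' || !url.pathname.startsWith('/spreadsheets/')) {
    return null; // some other link or embed
  }
  // "maker:Tesla" becomes { maker: 'Tesla' }
  const filter = {};
  const rawFilter = url.searchParams.get('filter');
  if (rawFilter) {
    const idx = rawFilter.indexOf(':');
    if (idx > 0) filter[rawFilter.slice(0, idx)] = rawFilter.slice(idx + 1);
  }
  return {
    source: `${url.origin}${url.pathname}`,
    root: url.searchParams.get('root'),
    filter,
  };
}
```

A line that returns `null` would be treated as ordinary content; a line that returns an object would mark the remaining content as the template to iterate.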

trieloff · 2020-02-25T17:04:49Z

Another question from @davidnuescheler:

Is there a way to make this work more like regular embeds, i.e. just paste the URL to the spreadsheet and get the templated markdown/html as a result?

This would have the advantage of having markdown that is decidedly non-weird (no templates at all), but poses the question of how to specify the template (we cannot assume that we will guess the correct formatting) and how to run it, especially in light of Helix Pages' "no server-side imperative code" paradigm.

trieloff · 2020-03-11T14:31:03Z

@davidnuescheler I've implemented your penultimate idea (couldn't come up with a way to implement the last suggestion) in #611 – please take a look at the documentation here: https://github.com/adobe/helix-pipeline/blob/data-embeds/docs/markdown.md

We could (quite easily) also implement the frontmatter syntax that I suggested, but I'd leave that for a separate PR.

6.8.0 (v6.7.5...v6.8.0) (2020-03-18)

Bug Fixes

data-sections: ensure that unist map can handle async callbacks (4f80666)
data-sections: ensure that unist map can handle async callbacks (13dca79)
embeds: add proper error handling (logging) for failed data embed downloads (a221714), closes #593
embeds: add proper error handling (logging) for failed data embed downloads (79d194e), closes #593
embeds: data embeds update the surrogate key based on the source URL (638415b)
embeds: remove dataEmbed nodes from mdast after detection (ae6445c)
embeds: remove dataEmbed nodes from mdast after detection (0ab4dba)
embeds: use a proper logger when fetching data embeds (aca4798)
embeds: use a proper logger when fetching data embeds (2f959a0)

Features

embeds: add data section extraction step (e7bc55a)
embeds: add data section extraction step (bd721a5)
embeds: detect data embeds (155df67), closes #593 (issuecomment-590956631)
embeds: detect data embeds (fec11d4), closes #593 (issuecomment-590956631)
embeds: implement data embeds for sections (de54ccb), closes #593
embeds: implement data embeds for sections (366a7e9), closes #593
embeds: provide basic data injection (e733e24), closes #593
embeds: provide basic data injection (355b2ba), closes #593
embeds: support dot notation {{foo.bar}} in data embed templates (283972a), closes #593
embeds: support dot notation {{foo.bar}} in data embed templates (ace965e), closes #593
utils: add cache-utils for merging cache-control headers (3568d39)

adobe-bot · 2020-03-18T10:01:05Z

🎉 This issue has been resolved in version 6.8.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

rofe · 2020-03-18T20:05:07Z

I'd like to discuss two aspects:

output sanitization: we are dealing with arbitrary user content, e.g. large spreadsheets that we can't assume authors have full control over anymore. to be on the safe side, all content must be sanitized before rendering it in a browser. will this happen in the pipeline's sanitize step, or should we let authors control it through the syntax (e.g. handlebars' {{clean}} vs {{{dirty}}})?
pagination: for large amounts of data, it should be possible to limit and navigate the results, or to use smart loading. uncontrollably large HTML files could also become dangerous (DoS) if we just render whatever data we get.
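
For the pagination point, one possible shape is a hard cap on the number of rendered rows combined with an offset. This is purely a sketch: the `offset` and `limit` parameters, the `MAX_ROWS` value, and the `paginate` helper are assumptions and not part of the implemented data embeds.

```javascript
// Hypothetical pagination helper. MAX_ROWS is an arbitrary cap chosen for the
// example, so that a huge spreadsheet can never produce an unbounded page.
const MAX_ROWS = 100;

function paginate(rows, { offset = 0, limit = MAX_ROWS } = {}) {
  // Clamp both values so that malformed input falls back to safe numbers.
  const start = Math.max(0, Math.trunc(offset) || 0);
  const size = Math.min(MAX_ROWS, Math.max(0, Math.trunc(limit) || 0));
  return {
    page: rows.slice(start, start + size),
    total: rows.length,
    hasMore: start + size < rows.length,
  };
}
```

The `hasMore` flag is what a "next page" link or a lazy loader would key off.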

tripodsan · 2020-03-19T02:31:08Z

to be on the safe side, all content must be sanitized before rendering it in a browser.

since we are injecting markdown, the document gets sanitized the same way as normal markdown content.

tripodsan · 2020-03-19T02:35:46Z

An alternative to handlebars would be to use HTL expressions, e.g. ${car.model}.

But I think handlebars is more versatile, since it would allow for more control flow in the future.
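
To illustrate that the two expression syntaxes differ mainly in their delimiters, here is a small sketch that resolves dotted paths such as `car.model` for either style. The names `lookup`, `render`, and `SYNTAX` are hypothetical, and neither real Handlebars nor real HTL is involved.

```javascript
// Hypothetical resolver: looks up a dotted path such as "car.model" in a data object.
function lookup(data, path) {
  return path.split('.').reduce(
    (node, key) => (node != null && Object.prototype.hasOwnProperty.call(node, key)
      ? node[key]
      : undefined),
    data,
  );
}

// The same lookup can back either expression syntax; only the delimiters differ.
const SYNTAX = {
  handlebars: /\{\{\s*([\w.]+)\s*\}\}/g,
  htl: /\$\{\s*([\w.]+)\s*\}/g,
};

// Replace every expression whose path resolves; leave unresolved ones as written.
function render(template, data, syntax) {
  return template.replace(SYNTAX[syntax], (tag, path) => {
    const value = lookup(data, path);
    return value === undefined ? tag : String(value);
  });
}
```

Control flow (loops, conditionals) is where the two would diverge, which this sketch deliberately leaves out.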

trieloff · 2020-03-19T09:48:23Z

since we are injecting markdown, the document gets sanitized the same way as normal markdown content.

That's my assumption, too.

rofe · 2020-03-19T21:31:07Z

That's my assumption, too.

We should add a test to confirm it :)

trieloff · 2020-03-20T12:17:46Z

Since #356 we don't have the XSS sanitizer enabled.

rofe · 2020-03-20T20:13:21Z

That's fine for markup from a markdown file, but bad for markup from a large spreadsheet.

trieloff · 2020-03-23T08:13:53Z

I'd think so, too. How would we fix this? With {{}} for secure and {{{}}} for insecure output, or should we try to filter everything regardless?
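
The {{}} versus {{{}}} option could behave roughly as sketched here, mirroring the Handlebars convention that double braces escape HTML and triple braces do not. The helper names `escapeHtml` and `inject` are hypothetical and this is not the pipeline's code. Since values are injected into markdown rather than directly into HTML, escaping HTML characters alone may not cover every case.

```javascript
// Escape the characters that are significant in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// {{name}} inserts an escaped value, {{{name}}} inserts the raw value.
// The triple-brace alternative comes first so that it wins where both could match.
function inject(template, row) {
  const tags = /\{\{\{\s*(\w+)\s*\}\}\}|\{\{\s*(\w+)\s*\}\}/g;
  return template.replace(tags, (tag, rawKey, safeKey) => {
    const key = rawKey || safeKey;
    if (!Object.prototype.hasOwnProperty.call(row, key)) return '';
    return rawKey ? String(row[key]) : escapeHtml(row[key]);
  });
}
```

Filtering everything regardless would correspond to dropping the triple-brace branch entirely.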

trieloff · 2020-03-23T08:20:30Z

fb116cd adds another test, this time trying to brute-force inject <script> tags. This attempt does get thwarted.

rofe · 2020-03-23T12:58:46Z

I'd go with filtering everything regardless, and wait for customers to complain.

trieloff added discussion enhancement New feature or request labels Feb 20, 2020

trieloff added a commit that referenced this issue Mar 4, 2020

feat(embeds): detect data embeds

fec11d4

see #593 (comment)

trieloff added a commit that referenced this issue Mar 4, 2020

test(embed): add tests for data embeds

b9b6217

see #593

trieloff mentioned this issue Mar 5, 2020

Data Embeds #611

Merged

2 tasks

trieloff added a commit that referenced this issue Mar 10, 2020

feat(embeds): implement data embeds for sections

de54ccb

see #593

trieloff added a commit that referenced this issue Mar 10, 2020

feat(embeds): support dot notation {{foo.bar}} in data embed templates

ace965e

see #593

trieloff added a commit that referenced this issue Mar 11, 2020

fix(embeds): add proper error handling (logging) for failed data embe…

79d194e

…d downloads see #593

trieloff added a commit that referenced this issue Mar 11, 2020

feat(embeds): detect data embeds

155df67

see #593 (comment)

trieloff added a commit that referenced this issue Mar 11, 2020

test(embed): add tests for data embeds

d821119

see #593

trieloff added a commit that referenced this issue Mar 11, 2020

feat(embeds): provide basic data injection

e733e24

injects values from data embeds, but does not support sections or deep object access yet fix #593

trieloff added a commit that referenced this issue Mar 11, 2020

feat(embeds): implement data embeds for sections

366a7e9

see #593

trieloff added a commit that referenced this issue Mar 11, 2020

feat(embeds): support dot notation {{foo.bar}} in data embed templates

283972a

see #593

trieloff added a commit that referenced this issue Mar 11, 2020

fix(embeds): add proper error handling (logging) for failed data embe…

a221714

…d downloads see #593

trieloff added a commit that referenced this issue Mar 11, 2020

docs(markdown): document markdown syntax extensions

8193edb

see #593 #56 #267 #42

trieloff closed this as completed in 355b2ba Mar 18, 2020

adobe-bot added the released label Mar 18, 2020

rofe mentioned this issue Mar 18, 2020

Revert "Data Embeds" #628

Closed

rofe reopened this Mar 18, 2020

This was referenced Mar 19, 2020

Pagination for Data Embeds #630

Open

Create helix-data-embed service adobe/helix-home#112

Closed

trieloff added a commit that referenced this issue Mar 20, 2020

test(embeds): test if data embeds are suceptible to XSS injections

5b8743a

spoiler alert: yes, they are #593 (comment)

trieloff added a commit that referenced this issue Mar 23, 2020

test(embeds): more XSS tests

fb116cd

see #593 (comment)

rofe closed this as completed Sep 23, 2021
