Issues with keeping some HTML tags in Turndown's markdown output. #241

Closed
seth-brown opened this issue Jun 26, 2018 · 2 comments

@seth-brown

seth-brown commented Jun 26, 2018

Hello,

Thanks for this great project. I am interested in converting some HTML to Markdown where I wish to preserve img and h1 tags as HTML in the resulting Markdown. Here's a minimal working example of what I'm trying to do:

var html = `<h1>Hello</h1>
		    Here's some cool text.
		    Here is <a href="www.foo.com">a link</a>. 
		    Here is an image <img src="image.jpg">.`

var TurndownService = require('turndown')
var md = new TurndownService()
       .keep(['h1', 'img'])
       .turndown(html)

console.log(md)

The above example outputs:

Hello
===

Here's some cool text. Here is [a link](www.foo.com). Here is an image ![](image.jpg).

From reading the docs, my understanding is that keep will ignore the specified tags and NOT convert them to Markdown. How can I get Turndown to output the following?

<h1>Hello</h1>Here's some cool text. Here is [a link](www.foo.com). Here is an image <img src="image.jpg">.

p.s. Turndown is a GREAT name for the project!

@domchristie
Collaborator

Hi @drBunsen

To keep elements that already have markdown/commonmark equivalents, you will need to use addRule. To achieve your example output, you might want to try something like:

var turndownService = new TurndownService()
turndownService.addRule('keep', {
  filter: ['h1', 'img'],
  // Emit the element's original HTML instead of converting it
  replacement: function (content, node) {
    return node.outerHTML
  }
})
var md = turndownService.turndown(html)

or to have the h1s separated by blank lines:

turndownService.addRule('keep', {
  filter: ['h1', 'img'],
  // Pad block-level elements (such as h1) with blank lines; leave inline ones (such as img) as-is
  replacement: function (content, node) {
    return node.isBlock ? '\n\n' + node.outerHTML + '\n\n' : node.outerHTML
  }
})

The keep and remove features were mainly added as an easy way to handle non-markdown elements, for example: style and script.


To go into some more detail as to why this is the case: there are essentially three arrays of rules:

  1. CommonMark rules (along with any added rules)
  2. Keep rules
  3. Remove rules

Given that there might be conflicts between these rules, there had to be a defined precedence, so it was decided that the above would be the order. keep and remove were likely to be lesser-used features, so to avoid unnecessary checks, they took a lower precedence. What's more, if you wanted to change the behaviour of the CommonMark rules, you could always use addRule.
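
To make the precedence concrete, here is a rough sketch of the lookup described above. It is a simplified model, not Turndown's actual source: the names findRule, ruleFor and ruleSets are made up for illustration, and filters are reduced to plain lists of tag names.

```javascript
// Simplified model of rule precedence (hypothetical names, not Turndown's code).
// Each rule has a `name` and a `filter` listing the tag names it matches.
function findRule (rules, tagName) {
  return rules.find(function (rule) {
    return rule.filter.indexOf(tagName) !== -1
  })
}

// Check the three arrays in the order described above; first match wins.
function ruleFor (tagName, ruleSets) {
  return (
    findRule(ruleSets.rules, tagName) ||  // 1. CommonMark rules (and added rules)
    findRule(ruleSets.keep, tagName) ||   // 2. keep rules
    findRule(ruleSets.remove, tagName) || // 3. remove rules
    { name: 'default', filter: [] }       // nothing matched
  )
}

var ruleSets = {
  rules: [
    { name: 'heading', filter: ['h1', 'h2'] },
    { name: 'image', filter: ['img'] }
  ],
  keep: [{ name: 'keep', filter: ['h1', 'img', 'del'] }],
  remove: [{ name: 'remove', filter: ['script'] }]
}

console.log(ruleFor('h1', ruleSets).name)     // heading
console.log(ruleFor('del', ruleSets).name)    // keep
console.log(ruleFor('script', ruleSets).name) // remove
```

In this model, h1 never reaches the keep array because a CommonMark rule claims it first, which is why keep(['h1', 'img']) appears to have no effect in the original example.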

This order was also influenced by how the code architecture developed over time. I think it might be possible now to change this order to something like:

  1. added rules
  2. keep
  3. remove
  4. CommonMark rules

But I might wait for some more feedback before doing so.
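
As an illustration only, the proposed order could be modelled like this. Nothing here is implemented in Turndown; firstMatch and proposedOrder are hypothetical names, and each group is just a list of tag names.

```javascript
// Hypothetical model of the proposed order (not implemented in Turndown).
// Groups are checked top to bottom; the first group listing the tag wins.
function firstMatch (groups, tagName) {
  for (var i = 0; i < groups.length; i++) {
    if (groups[i].tags.indexOf(tagName) !== -1) return groups[i].name
  }
  return 'default'
}

var proposedOrder = [
  { name: 'added', tags: [] },                           // 1. added rules
  { name: 'keep', tags: ['h1', 'img'] },                 // 2. keep
  { name: 'remove', tags: ['script'] },                  // 3. remove
  { name: 'commonmark', tags: ['h1', 'h2', 'img', 'a'] } // 4. CommonMark rules
]

console.log(firstMatch(proposedOrder, 'h1')) // keep
console.log(firstMatch(proposedOrder, 'a'))  // commonmark
```

Under that ordering, keep(['h1', 'img']) would win over the built-in heading and image rules, so the original example would work without addRule.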

P.S. Glad to hear you like the name! Funnily enough I have another project called bunsn!

@seth-brown
Author

Thank you. I appreciate the helpful example and also the explanation. Keep up the good work!
