Empty lines in HTML attributes introduce new Markdown paragraph #490

Closed
Yogarine opened this Issue Aug 17, 2017 · 7 comments

Yogarine commented Aug 17, 2017

This:

<div id="foo" class="bar

  baz">
</div>

Will be converted to:

<div id="foo" class="bar
<p>baz&quot;&gt;</p>
</div>

See: http://spec.commonmark.org/dingus/?text=%3Cdiv%20id%3D%22foo%22%20class%3D%22bar%0A%20%20baz%22%3E%0A%3C%2Fdiv%3E%0A

However, one would expect CommonMark to ignore anything inside an HTML attribute, since newlines in attribute values are perfectly valid HTML.

This bug is annoying when trying to combine Markdown with tools that use big query strings, like Gravizo.
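
For reference, the same conversion can be reproduced outside the dingus; a minimal sketch, assuming the `commonmark` Python package (any conformant implementation should give the same output):

```python
# Quick reproduction sketch (assumes the `commonmark` Python package;
# any CommonMark-conformant implementation should give the same result).
import commonmark

source = '<div id="foo" class="bar\n\n  baz">\n</div>\n'
print(commonmark.commonmark(source))
# The blank line inside the attribute value ends the HTML block, so the
# remainder is parsed as a new paragraph, matching the dingus output above.
```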

jgm (Member) commented Aug 17, 2017

Yogarine commented Aug 17, 2017

Even though I understand the reasons, I don't believe this behaviour follows the Principle of Least Astonishment.

Also, the second reason is kind of moot, since the fact that you're supporting certain HTML tags in the first place already implies that you need to parse HTML. Applying additional, unusual rules to the parsing of that HTML actually complicates things. Besides, HTML parsers are a dime a dozen.

As a matter of fact, having written several HTML parsers myself, one of my first thoughts was what awkward constructions you would have to apply to the tokeniser to make it work the way it does currently. It makes much more sense, when encountering a string literal token inside an HTML tag, to just collect everything up to the matching quote. I haven't looked at the CommonMark code, but I suppose it either:

  • collects everything up to the matching quote or an empty line, or
  • collects the whole Markdown paragraph in a higher-level pass beforehand, or
  • ignores anything inside the HTML tag and reproduces it verbatim, as long as it opens and closes appropriately.

Maybe it's just because I'm so used to writing tokenisers and parsers, but the second argument just sounds like a bad excuse to me. And the first case just doesn't really seem like the expected behaviour.

aidantwoods (Contributor) commented Aug 17, 2017

> If your application uses big query strings, I suggest URL-escaping the whitespace in the queries (which is good practice anyway). Then the problem will disappear.

Instead of URL encoding, you should (perhaps aggressively) HTML-encode the offending whitespace to produce character references (e.g. using the &#xx; format); the browser should then treat the whitespace as if it were literal, as per https://www.w3.org/TR/html51/syntax.html#attribute-value
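
Something along these lines (a Python sketch; the helper name and the surrounding img tag are just for illustration). Once the raw newlines are gone, the tag no longer contains a blank line, so CommonMark keeps it as one HTML block:

```python
# Sketch: replace raw whitespace in an attribute value with numeric character
# references, so the tag text contains no newlines or blank lines.
# (Helper name, endpoint, and query are illustrative only.)
def encode_attr_whitespace(value):
    return (value.replace('\r', '&#13;')
                 .replace('\n', '&#10;')
                 .replace(' ', '&#32;'))

query = 'digraph G {\n\n  a -> b;\n}'
print('<img src="https://example.invalid/svg?{}">'.format(encode_attr_whitespace(query)))
# <img src="https://example.invalid/svg?digraph&#32;G&#32;{&#10;&#10;&#32;&#32;a&#32;->&#32;b;&#10;}">
```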

jgm (Member) commented Aug 17, 2017

Yogarine commented Aug 17, 2017

@aidantwoods

> Instead of URL encoding, you should (perhaps aggressively) HTML encode the offending whitespace to produce character references

That would make for very ugly code when using something like Gravizo (which is where my main gripe comes from at the moment).

@jgm

> If you want to suggest a concrete alternative, feel free.
> But keep in mind the costs.

I'm probably oversimplifying, but what I'd do is scan ahead to the next unescaped matching quote (" or '). If you don't find it, just disregard it and treat it the way you currently do. If you do find the matching quote, do another scan for the matching unescaped greater-than (>). If you find that, just continue tokenising or parsing from that character position onwards and copy the HTML verbatim. No need to backtrack, and the overhead of these scans is minimal.

This doesn't cover the case where you're only missing the closing '>' of a tag, but you've got that accounted for in the spec anyway.

It's a little bit of very specific additional logic, but it covers all the cases that I can imagine (or care about, anyway... ;-) ).
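
A rough sketch of the scan I have in mind (illustrative Python, not CommonMark's actual code; the function name is made up):

```python
# Illustrative sketch of the proposed scan; not CommonMark's actual code.
def scan_html_tag(text, start):
    """Given the index of the '<' that opens a tag, return the index just past
    the matching '>', skipping over quoted attribute values (which may contain
    newlines or blank lines).  Return None if no closing '>' is found, in which
    case the parser can fall back to its current behaviour."""
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch in ('"', "'"):
            # Collect everything up to the matching quote, blank lines included.
            close = text.find(ch, i + 1)
            if close == -1:
                return None
            i = close + 1
        elif ch == '>':
            return i + 1
        else:
            i += 1
    return None

source = '<div id="foo" class="bar\n\n  baz">\n</div>\n'
end = scan_html_tag(source, 0)
print(source[:end])   # prints the whole opening tag verbatim, blank line and all
```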

valich commented Aug 18, 2017

I personally think that the fact you have to escape some characters (# with %23) suggests that this link-based design is broken in some way. If one wants to write some code, code spans/fences should be used for that. A good example (to my taste) is how mermaid is used for drawing diagrams in Atom.

@jgm Should the spec tell us anything from the post-processing point of view?
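
For reference, the kind of escaping being referred to, as a sketch assuming Python's standard urllib (the Gravizo-style endpoint and the diagram source are only illustrative):

```python
# Sketch: URL-escape a diagram source before embedding it in an image URL,
# so the Markdown never contains raw '#', spaces, or newlines.
# (The Gravizo-style endpoint and the diagram text are only illustrative.)
from urllib.parse import quote

diagram = """digraph G {
  a -> b;
  b -> c;  # an edge
}"""

url = "https://g.gravizo.com/svg?" + quote(diagram)
print("![diagram]({})".format(url))
# '#' becomes %23, spaces %20, newlines %0A, so the image stays on one line.
```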

jgm (Member) commented Aug 18, 2017

The next version of pandoc will include a syntax for generic inclusion of text that will be passed verbatim to the output format. So, to include some HTML, you do

```{=html}
<div title="You

can have whatever you want
here"/>
```

This content will be passed through verbatim if the output format is HTML, and otherwise ignored. I like this explicit method, and I think it would have been a better choice for Markdown, but of course with CommonMark we're trying to support the classic Markdown features.

With postprocessing, you can always use a regular code block, something like:

``` .svg-diagram
<xml goes here>
```

For a cmark wrapper that allows you to easily write AST filters like this, see my lcmark.
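
lcmark filters are written in Lua; for anyone who would rather stay in Python, a rough sketch of the same idea is below, assuming the commonmark Python package (its node and walker API is a port of commonmark.js, and the '.svg-diagram' handling shown here is purely illustrative, not an lcmark or cmark API):

```python
# Illustrative sketch of an AST filter in the spirit of lcmark's Lua filters.
# Assumes the `commonmark` Python package, whose node/walker API is a port of
# commonmark.js; re-tagging a code_block node as html_block should make the
# HTML renderer emit its literal contents verbatim.
import commonmark

def svg_diagram_filter(ast):
    walker = ast.walker()
    event = walker.nxt()            # each event is {'node': ..., 'entering': ...}
    while event is not None:
        node = event['node']
        if (event['entering'] and node.t == 'code_block'
                and (node.info or '').strip() == '.svg-diagram'):
            node.t = 'html_block'   # pass the block contents through as raw HTML
        event = walker.nxt()
    return ast

source = '``` .svg-diagram\n<svg xmlns="http://www.w3.org/2000/svg"></svg>\n```\n'
ast = commonmark.Parser().parse(source)
print(commonmark.HtmlRenderer().render(svg_diagram_filter(ast)))
```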
