Empty lines in HTML attributes introduce new Markdown paragraph #490
Comments
jgm
Aug 17, 2017
Member
You probably know this, but this is in accord with the spec,
Sec. 4.6.
Start condition 6, end condition is blank line.
So when we hit "baz", we're out of the HTML block.
To be sure, this spec gives unexpected results on inputs
such as yours. But there are reasons to do the spec this
way:
* We want to allow authors the possibility of putting
Markdown content inside HTML blocks; the blank line
end conditions are one of the main mechanisms for this.
* We don't want to require building all the complexity of
parsing HTML into any conforming CommonMark parser, so
we use simple heuristics.
The idea is this: users can include raw HTML, but they have
to be mindful of the rules governing how this is interpreted
in CommonMark, or they may end up with garbage.
If your application uses big query strings, I suggest
URL-escaping the whitespace in the queries (which is
good practice anyway). Then the problem will disappear.
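A minimal sketch of the URL-escaping suggestion above, assuming the long query string is built programmatically before it is pasted into the HTML attribute. The function name `escape_whitespace` and the `example.com` URL are hypothetical; only whitespace is percent-encoded, so no blank line is left to end the HTML block.

```python
import re
from urllib.parse import quote, unquote

def escape_whitespace(url: str) -> str:
    """Percent-encode every whitespace character (spaces, tabs, newlines) in url."""
    return re.sub(r"\s", lambda m: quote(m.group(0), safe=""), url)

# Hypothetical URL with a multi-line query, including an empty line.
url = "https://example.com/render?a -> b;\n\nb -> c;"
escaped = escape_whitespace(url)

# The escaped form fits on one line and decodes back to the original.
round_trip = unquote(escaped)
```

Because the escaped URL contains no newline, it can no longer contain a blank line either, so the HTML block is not cut short.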
Yogarine
Aug 17, 2017
Even though I understand the reasons, I don't believe this behaviour follows the Principle of Least Astonishment.
Also, the second reason is kinda moot, since the fact that you're supporting certain HTML tags in the first place already implies that you need to parse HTML. Having additional weird rules applied to the parsing of this HTML actually complicates things. Besides, HTML parsers are a dime a dozen.
As a matter of fact, having written several HTML parsers myself, one of my first thoughts was what awkward constructions you would have to apply to the tokeniser to make this work the way it does currently. It makes much more sense, when encountering a string literal token inside the HTML tag, to just collect everything up until the matching quote. I haven't looked at the CommonMark code, but I suppose it either:
- collects up to the matching quote or an empty line, or
- collects the whole Markdown paragraph in a higher-level pass beforehand, or
- ignores anything inside the HTML tag and reproduces it verbatim, as long as it opens and closes appropriately.
Maybe it's just because I'm so used to writing tokenisers and parsers, but the second argument just sounds like a bad excuse. And the first case just doesn't really seem like the expected behaviour.
aidantwoods
Aug 17, 2017
Contributor
> If your application uses big query strings, I suggest URL-escaping the whitespace in the queries (which is good practice anyway). Then the problem will disappear.
Instead of URL encoding, you should (perhaps aggressively) HTML encode the offending whitespace to produce character references (e.g. using the &#xx; format), then the browser should treat the whitespace as if it were literal as per https://www.w3.org/TR/html51/syntax.html#attribute-value
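As a rough illustration of the character-reference approach, here is a sketch that rewrites tabs, newlines, and carriage returns in an attribute value as decimal character references. The function name `whitespace_to_char_refs` is hypothetical, and the sketch assumes the value is otherwise already safe to place inside a quoted attribute (for example, any literal `&` has been escaped beforehand).

```python
import html
import re

def whitespace_to_char_refs(value: str) -> str:
    """Replace tab, newline, and carriage return with decimal character references."""
    return re.sub(r"[\t\n\r]", lambda m: f"&#{ord(m.group(0))};", value)

# Hypothetical multi-line attribute value containing an empty line.
value = "a -> b;\n\nb -> c;"
encoded = whitespace_to_char_refs(value)

# An HTML-aware consumer decodes the references back to the original text.
decoded = html.unescape(encoded)
```

The encoded value sits on a single source line, so CommonMark never sees a blank line, while the browser still receives the newlines after decoding.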
jgm
Aug 17, 2017
Member
One thing we've tried to do is to make block structure
discernible on a line-by-line basis, so that the parser
need never backtrack.
E.g. we don't want to have to read 500 lines that start like
<div id="ooo
but don't form a proper HTML tag, and then have to backtrack
and reparse the whole thing as regular CommonMark text.
This behavior helps avoid pathological behavior and hence DOS
attacks.
As you can see from the spec, we really don't parse HTML at
all, as things stand. We see `<div ` and then go to the
next blank line. That's cheap and dirty, but it does the
job, which is to allow users to include raw HTML when they
need to, and to allow them to include CommonMark content
between HTML contents when they want to. (Of course they
can't just paste in ANY raw HTML, but it is always possible
to include semantically identical raw HTML.)
If you want to suggest a concrete alternative, feel free.
But keep in mind the costs. Parsing HTML is not exactly
simple. You can write your own simple parser, but if you
want to get it right (e.g. accord with the HTML5 spec),
it gets pretty complicated. So, realistically, if we wanted
to parse HTML we'd need to rely on a third-party library.
That's a significant cost, too. Currently the reference
library cmark is a self-contained C library with no external
dependencies. This makes it easy to provide bindings to many
other languages.
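The line-by-line rule described above can be sketched roughly as follows. This is not cmark's code: the tag list `BLOCK_TAGS` is a hypothetical, heavily shortened stand-in for the list in the spec's start condition 6, and `split_html_block` is an illustrative name. It only shows that the block is recognised from its first line and ends at the next blank line, with no look-ahead or backtracking.

```python
import re

# Hypothetical short list; the spec's list for start condition 6 is much longer.
BLOCK_TAGS = ("div", "p", "table", "ul", "ol")

# "<" or "</", a listed tag name, then whitespace, ">", "/>", or end of line.
START_6 = re.compile(
    r"^ {0,3}</?(?:%s)(?:\s|/?>|$)" % "|".join(BLOCK_TAGS), re.IGNORECASE
)

def split_html_block(lines):
    """Split lines into (html_block, rest) using only a per-line check."""
    if not lines or not START_6.match(lines[0]):
        return [], list(lines)
    for i, line in enumerate(lines):
        if line.strip() == "":
            # End condition: the first blank line closes the block.
            return list(lines[:i]), list(lines[i:])
    return list(lines), []

source = ['<div id="foo" class="bar', "", '  baz">', "</div>"]
block, rest = split_html_block(source)
```

With the blank line inside the attribute, only the first line ends up in the HTML block; everything from the blank line onward is handed back to be parsed as ordinary CommonMark, which is the behaviour reported in this issue.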
Yogarine
Aug 17, 2017
> Instead of URL encoding, you should (perhaps aggressively) HTML encode the offending whitespace to produce character references
That would make for very ugly code when using something like Gravizo (which is where my main gripe comes from at the moment).
> If you want to suggest a concrete alternative, feel free.
> But keep in mind the costs.
I'm probably oversimplifying, but what I'd do is scan ahead to the position of the next unescaped matching quote (" or '). If you don't find it, just disregard it and treat the line the way you currently do. If you do find the matching quote, do another scan for the matching unescaped greater-than sign (>). If you find that, just continue tokenising or parsing from that character position forward and copy the HTML verbatim. No need to backtrack and the overhead of these scans is minimal.
This doesn't cover the case where you're only missing the closing '>' of a tag, but you've got that accounted for in the spec anyway.
It's a little bit of very specific additional logic, but it covers all the cases that I can imagine (or care about anyway... ;-) ).
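A rough sketch of the two scans proposed above, under some simplifying assumptions: the function name `find_tag_end` is hypothetical, the caller already knows the position of the opening quote, and a plain search for the same quote character stands in for "unescaped", since HTML attribute values have no backslash escapes. It also ignores any further quoted attributes that might contain `>`.

```python
from typing import Optional

def find_tag_end(text: str, quote_pos: int) -> Optional[int]:
    """Return the index just past the tag's closing '>', or None to fall back."""
    quote_char = text[quote_pos]
    # Scan 1: look for the matching closing quote.
    close = text.find(quote_char, quote_pos + 1)
    if close == -1:
        return None  # no match: keep the current line-by-line behaviour
    # Scan 2: look for the '>' that closes the tag.
    gt = text.find(">", close + 1)
    if gt == -1:
        return None
    return gt + 1

text = '<div id="foo" class="bar\n\n  baz">\n</div>\n'
quote_pos = text.index('class="') + len("class=")
end = find_tag_end(text, quote_pos)
verbatim = text[:end] if end is not None else None

# An unterminated attribute falls back instead of forcing a reparse.
fallback = find_tag_end('<div id="ooo\nmore text', 8)
```

Note that both scans are unbounded forward searches, so on input like the 500-line example earlier in the thread they would read to the end of the document before giving up.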
valich
Aug 18, 2017
I personally think that the fact that you have to escape some characters (# with %23) suggests that this link-based design is broken in a way. If one wants to write some code, code spans or fences should be used for that. A good example (to my taste) is how mermaid is used for drawing diagrams in Atom.
@jgm Should the spec say anything from the post-processing point of view?
jgm
Aug 18, 2017
Member
The next version of pandoc will include a syntax for generic inclusion of text that will be passed verbatim to the output format. So, to include some HTML, you do
```{=html}
<div title="You
can have whatever you want
here"/>
```
This content will be passed through verbatim if the output format is HTML, and otherwise ignored. I like this explicit method, and I think it would have been a better choice for Markdown, but of course with CommonMark we're trying to support the classic Markdown features.
With postprocessing, you can always use a regular code block, something like:
``` .svg-diagram
<xml goes here>
```
For a cmark wrapper that allows you to easily write AST filters like this, see my lcmark.
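To make the post-processing idea concrete, here is a text-level sketch in Python. It is not an AST filter and does not use lcmark or cmark: the function name `unwrap_tagged_blocks` is hypothetical, and it only handles backtick fences whose closing fence is exactly three backticks. It drops the fences of blocks tagged `.svg-diagram` so their content passes through raw, and leaves every other line untouched.

```python
FENCE = "`" * 3  # three backticks, built here to keep this example's own fence intact

def unwrap_tagged_blocks(markdown: str, tag: str = ".svg-diagram") -> str:
    """Emit the content of fenced blocks tagged with tag without their fences."""
    out = []
    inside_tagged = False
    inside_other = False
    for line in markdown.split("\n"):
        stripped = line.strip()
        if inside_tagged:
            if stripped == FENCE:
                inside_tagged = False  # drop the closing fence
            else:
                out.append(line)  # raw content passes through
        elif inside_other:
            out.append(line)
            if stripped == FENCE:
                inside_other = False
        elif stripped.startswith(FENCE):
            if stripped[3:].strip() == tag:
                inside_tagged = True  # drop the opening fence
            else:
                inside_other = True
                out.append(line)
        else:
            out.append(line)
    return "\n".join(out)

source = "Intro\n\n" + FENCE + " .svg-diagram\n<svg></svg>\n" + FENCE + "\n\nOutro"
result = unwrap_tagged_blocks(source)
```

A real filter would work on the parsed document tree rather than on raw lines, which is what the cmark wrapper mentioned above is for.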
Yogarine commented Aug 17, 2017 (edited)
This:
Will be converted to:
See: http://spec.commonmark.org/dingus/?text=%3Cdiv%20id%3D%22foo%22%20class%3D%22bar%0A%20%20baz%22%3E%0A%3C%2Fdiv%3E%0A
However, one would expect CommonMark to ignore anything inside an HTML attribute value, since newlines in attribute values are perfectly valid HTML.
This bug is annoying when trying to combine Markdown with tools that use big query strings, like Gravizo.