Add clarifying tests about HTML and block quotes #738

Merged: 2 commits, Sep 13, 2024

notriddle
Contributor

This result makes sense, since inline HTML being inline implies that it is parsed after block quotes, while block HTML being copy-and-pasteable implies that it should eat Markdown syntax like block quotes. However, pulldown-cmark got this wrong, and apparently so do MD4C, markdown-it, and parsedown, according to Babelmark 3.

@wooorm
Contributor

wooorm commented Jun 23, 2023

Hmm, this doesn’t have anything to do with block quotes, from what I understand?

What’s going on in these two cases is that the start condition of HTML kind 6 matches for <div. But no block HTML start condition matches for <di or <d or <a or whatever else (but with a >, so <a>, it would match HTML kind 7).

https://spec.commonmark.org/0.30/#html-blocks.

Some examples:

<div
> a

<di>
> b

<di
> c

The same happens for “containers” other than block quotes:

<div
* d

Or say headings:

<div
# e?
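
To make the distinction above concrete, here is a minimal Python sketch of start conditions 6 and 7 as described in the linked spec section. The names `html_block_start`, `COND6`, and `COND7` are hypothetical, `BLOCK_TAGS` is a deliberately abbreviated subset of the spec's tag list, and conditions 1 to 5 are ignored, so treat it as an illustration rather than a reference implementation.

```python
import re

# Abbreviated, illustrative subset of the tag names listed for start
# condition 6; the list in the spec is much longer.
BLOCK_TAGS = ("address", "blockquote", "div", "h1", "li", "p", "table", "ul")

# Condition 6: "<" or "</", a listed tag name (case-insensitive), then a
# space, a tab, the end of the line, ">" or "/>".
COND6 = re.compile(
    r"^ {0,3}</?(?:%s)(?:[ \t]|$|/?>)" % "|".join(BLOCK_TAGS),
    re.IGNORECASE,
)

# Pieces of a complete open or closing tag, restricted to a single line.
TAG_NAME = r"[A-Za-z][A-Za-z0-9-]*"
ATTR_NAME = r"[A-Za-z_:][A-Za-z0-9_.:-]*"
ATTR_VALUE = r"""(?:[^ \t\r\n"'=<>`]+|'[^']*'|"[^"]*")"""
ATTRIBUTE = r"[ \t]+%s(?:[ \t]*=[ \t]*%s)?" % (ATTR_NAME, ATTR_VALUE)

# Condition 7: a complete open or closing tag, then only spaces and tabs
# up to the end of the line.
COND7 = re.compile(
    r"^ {0,3}(?:<(?P<open>%s)(?:%s)*[ \t]*/?>|</%s[ \t]*>)[ \t]*$"
    % (TAG_NAME, ATTRIBUTE, TAG_NAME)
)


def html_block_start(line):
    """Return 6 or 7 if `line` meets that start condition, else None.

    Sketch only: conditions 1-5 are not checked, and the rule that
    condition 7 cannot interrupt a paragraph is not modelled.
    """
    if COND6.match(line):
        return 6
    m = COND7.match(line)
    if m:
        name = m.group("open")
        # Open tags named pre, script, style or textarea are excluded.
        if name is None or name.lower() not in ("pre", "script", "style", "textarea"):
            return 7
    return None


for first_line in ["<div", "<di>", "<di", "<a", "<a>"]:
    print(repr(first_line), html_block_start(first_line))
```

Under these assumptions `<div` meets condition 6 and `<di>` and `<a>` meet condition 7, while `<di` and `<a` meet neither, which is why a following `>` line is free to open a block quote in those last two cases.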

@notriddle
Contributor Author

It's true that the <a tag isn't the start of an HTML block, but it is valid inline HTML. Try this example:

<a
ping>

@wooorm
Contributor

wooorm commented Jun 23, 2023

Right, but that’s something else. The inline rules/algos have nothing to do with how (block quotes and) HTML interact.

@notriddle
Contributor Author

notriddle commented Jun 23, 2023

Which section of the spec do you think these test cases belong in, then?

@wooorm
Contributor

wooorm commented Jun 23, 2023

From a quick glance, this seems close to the cases of https://spec.commonmark.org/0.30/#example-156 and 157? It has “emphasis” currently, and could also have say an ATX heading and a block quote?

@notriddle
Contributor Author

@wooorm Okay, I sorted the examples under the HTML Block and Raw HTML sections.

Contributor

@wooorm wooorm left a comment

I am pretty much in favor of this, I think it’s good to have!

spec.txt Outdated
<p><a
b></p>
````````````````````````````````

Contributor

not sure this one is needed, it can never be a block quote. But not against it. Wonder what other folks think

Contributor Author

You're right. Example 614 already covers this case.

<p>&lt;a</p>
<blockquote>
</blockquote>
````````````````````````````````
Contributor

It might be nice here to not have an empty block quote, but to actually have a word there? I don’t think many folks will want to write empty block quotes?

Contributor Author

Sure. Let's do that.

Comment on lines +9200 to +9201
A block quote can prevent a line from being parsed as inline HTML,
even though line breaks are allowed in tags:
Contributor

Maybe something along the lines of “Block quotes, and other blocks such as headings or lists, take precedence over inline things:”? My intent here is to show that this isn’t about block quotes per se. Do you think something along those lines makes sense?

Contributor Author

The trouble with bringing up other block structures is that their syntax isn't part of a valid HTML tag.

An attribute name consists of an ASCII letter, _, or :, followed by zero or more ASCII letters, digits, _, ., :, or -.

...

An open tag consists of a < character, a tag name, zero or more attributes, optional spaces, tabs, and up to one line ending, an optional / character, and a > character.

The weird thing about the spec is that the open tag syntax sounds like it allows <a\n>, but there's an entirely separate part of the specification (the rules around paragraph interruption) that means you can't actually write that. For an analogous, apparent ambiguity to show up, you'd have to be able to take the below template:

<a
SOMETHING>

... and find a string to fill in for SOMETHING that is both a valid HTML attribute, but also makes the line a valid instance of some block structure.

Since HTML attributes can't start with #, *, -, `, ~, or ASCII digits, that means it can't be ATX headings, bulleted lists, numbered lists, or fenced code blocks.
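
The argument above can be checked mechanically. The following is a small Python sketch that encodes the quoted attribute-name rule as a regular expression and tests it against lines that open the block structures named in this comment. The helper `starts_with_attribute_name` and the sample lines are hypothetical, chosen only for illustration.

```python
import re

# Attribute name as quoted above: an ASCII letter, "_" or ":", followed by
# zero or more ASCII letters, digits, "_", ".", ":" or "-".
ATTRIBUTE_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*")


def starts_with_attribute_name(line):
    """True if `line` begins with something that parses as an attribute name."""
    return ATTRIBUTE_NAME.match(line) is not None


# Sample second lines for the `<a` / `SOMETHING>` template. The fence
# strings are built with multiplication only to keep this listing tidy.
BLOCK_LINES = ["# heading", "* item", "- item", "1. item", "`" * 3, "~" * 3]

for line in BLOCK_LINES:
    print(repr(line), starts_with_attribute_name(line))

# A line such as "ping>" does start with an attribute name.
print(repr("ping>"), starts_with_attribute_name("ping>"))
```

None of the sample block lines starts with a valid attribute name, whereas `ping>` does. The block quote marker `>` is the special case: it is not an attribute name either, but it is the character that closes the tag.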

Contributor

I think this is a very good point, that in the case of block HTML, as condition 7, line endings are not supported, even though the productions of open tag and closing tag could have them. 🤔

Using the dingus:

<xxx
yyy>

…turns into:

<p><xxx
yyy></p>

Importantly, the text for start condition 7 starts with “line begins with” and ends with “followed by the end of the line”, so I think the intent there is to say that it is all on one line. But it is probably good to add that all of this has to be on one line?

Maybe something like this?:

Start condition: line begins with a complete open tag (with any tag name other than pre, script, style, or textarea) or a complete closing tag, followed by zero or more spaces and tabs, **all on a single line,** followed by the end of the line.

I wonder what @jgm or others think!
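
As a rough illustration of why the dingus output above is a paragraph containing inline HTML, here is a Python sketch of the open tag production with whitespace that may include up to one line ending. The names `OPEN_TAG`, `WS`, and `OPT_WS` are hypothetical, and the pattern is an approximation of the spec's wording, not a tested port of any parser.

```python
import re

ATTR_NAME = r"[A-Za-z_:][A-Za-z0-9_.:-]*"
ATTR_VALUE = r"""(?:[^ \t\r\n"'=<>`]+|'[^']*'|"[^"]*")"""

# Spaces, tabs, and up to one line ending; at least one character long.
WS = r"(?:[ \t]+(?:\n[ \t]*)?|\n[ \t]*)"
OPT_WS = r"(?:%s)?" % WS

ATTRIBUTE = r"%s%s(?:%s=%s%s)?" % (WS, ATTR_NAME, OPT_WS, OPT_WS, ATTR_VALUE)

# Open tag: "<", a tag name, zero or more attributes, optional whitespace,
# an optional "/", and ">".
OPEN_TAG = re.compile(
    r"<[A-Za-z][A-Za-z0-9-]*(?:%s)*%s/?>" % (ATTRIBUTE, OPT_WS)
)

text = "<xxx\nyyy>"

# Taken as a whole, the text is one complete open tag (inline raw HTML).
print(OPEN_TAG.fullmatch(text) is not None)

# Taken line by line, as block parsing does, neither line is a complete
# tag, so no line can satisfy start condition 7.
print([OPEN_TAG.fullmatch(line) is not None for line in text.split("\n")])
```

The whole text matches the open tag production, but neither individual line does, which is consistent with reading start condition 7 as a single-line rule.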

Member

I wouldn't object to adding "all on a single line" if you think it resolves a doubt or ambiguity.

Maybe something along the lines of “Block quotes, and other blocks such as headings or lists, take precedence over inline things:”? My intent here is to show that this isn’t about block quotes per se. Do you think something along those lines makes sense?

See the beginning of section 3.1, which makes this general assertion about the priority of block-level parsing over inline.

.
<div
>
````````````````````````````````
Contributor

Maybe fill this “block quote” with something? I have the same comment on the 3rd new example too

This result makes sense, since inline HTML being inline implies
that it is parsed after block quotes, while block HTML being
copy-and-pasteable implies that it should eat Markdown syntax
like block quotes. However, pulldown-cmark got this wrong, and
apparently so do MD4C, markdown-it, and parsedown, according
to [Babelmark 3].

[Babelmark 3]: https://babelmark.github.io/?text=%3Ca%0A%3E%0A%0A%3Cdiv%0A%3E
@notriddle
Contributor Author

Are there any additional blocking concerns for this patch?

@jgm
Member

jgm commented Sep 13, 2024

Looks good to me!

@jgm jgm merged commit 800e199 into commonmark:master Sep 13, 2024