Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken HTML parsing #597

Open
fabiospampinato opened this issue Jul 26, 2019 · 7 comments

Comments

@fabiospampinato
Copy link

commented Jul 26, 2019

I have the following 2 snippets:

<script>
    Foo
 
    Bar
</script>
<div>
  <script>
    Foo
 
    Bar
  </script>
</div>

Which are rendered respectively as:

<script>
    Foo
 
    Bar
</script>
<div>
  <script>
    Foo
<pre><code>Bar
</code></pre>
  </script>
</div>

On how to parse these the spec says, here:

Start condition: line begins with the string <script, <pre, or <style (case-insensitive), followed by whitespace, the string >, or the end of the line.
End condition: line contains an end tag </script>, , or </style> (case-insensitive; it need not match the start tag).

This makes sense to me, pre, style and script tags often contain empty lines and should be parsed correctly.

On how to parse the second snippet though the spec mentions that since the snippet started with <div> then the exit condition for the entire thing will be an empty line.

How does this make any sense?

Why shouldn't script's rule takes precedence for the lines wrapped between <script> and </script>?

@jgm

This comment has been minimized.

Copy link
Member

commented Jul 26, 2019

This tracker is for bug reports. Discussion and questions like this should go to the forum at talk.commonmark.org. Feel free to open a topic there, after searching for existing relevant discussions. (But I think if you just read the whole section on HTML blocks in the spec, you'll find your question answered.)

@jgm jgm closed this Jul 26, 2019

@jgm

This comment has been minimized.

Copy link
Member

commented Jul 26, 2019

Oh, I see you're not asking why in general a blank line ends an HTML block, but why the inner script tag's rule doesn't take precedence. Well, if you like you can bring this up in the forum. The rule is fairly simple; it won't automatically do what a human would expect in every case. We could talk about specific ways to change it (but please, nothing that requires unlimited backtracking).

@fabiospampinato

This comment has been minimized.

Copy link
Author

commented Jul 26, 2019

@jgm How is this not a bug report since script tags aren't parsed correctly?

The rule is fairly simple; it won't automatically do what a human would expect in every case.

I agree that it's pretty simple, but since no human would parse HTML in their heads that way I would consider it broken. Plus there's a specific rule for parsing script, pre and style tags containing empty lines, but it breaks down pretty quickly.

I'm not too familiar with parsers to make a detailed proposal about this, but roughly I'd say the rule for script, pre and style tags should just take the precedence inside their blocks.

Can we please reopen this?

@aidantwoods

This comment has been minimized.

Copy link
Contributor

commented Jul 26, 2019

@fabiospampinato As you noted, the HTML block started by <div> requires a blank line to end it, and HTML blocks are leaf blocks (cannot contain other blocks) so the line <script> never starts a HTML block.

@jgm & @fabiospampinato one idea for making this parse more "intuitively" might be to reassess the start condition on lines that don't meet the end condition and adjust the HTML block type accordingly in some order of precedence.
E.g. if a subsequent line of a type 6 or 7 HTML block (which don't have, let's say "intuitive", end conditions in some cases) could start a type 1 through 5 HTML block then the HTML block will change to the respective type. Another way to phrase this might be that HTML blocks of type 1 through 5 may interrupt a type 6 or 7 HTML block.

@fabiospampinato if you open something over on the forums, we can discuss pros and cons of, and ideas for an adjusted HTML block definition there that bears this example in mind?

@fabiospampinato

This comment has been minimized.

Copy link
Author

commented Jul 26, 2019

so the line <script> never starts a HTML block.

That's just a technicality of the spec, the rendered output contains a script tag, and that script tag is being outputted precisely because I wrote <script> in the Markdown I'm parsing, so I think it's reasonable to say that that <script> text is in fact the beginning of a script tag.

@fabiospampinato if you open something over on the forums, we can discuss pros and cons of, and ideas for an adjusted HTML block definition there that bears this example in mind?

Here you go: https://talk.commonmark.org/t/improve-detection-of-pre-style-and-script-tags-containing-an-empty-line/3216

@jgm

This comment has been minimized.

Copy link
Member

commented Jul 27, 2019

Well, I'll open this again here. Given that we're not doing a full HTML parse, there are always going to be cases where the spec doesn't "do the right thing" as a human would judge it. But this one might be relatively simple to handle in the way @aidantwoods describes. It would make the spec more complex, and it's an issue you can work around quite easily (just put a blank line between the <div> and the <script>, so that the latter opens up a new HTML block).

@jgm jgm reopened this Jul 27, 2019

@fabiospampinato

This comment has been minimized.

Copy link
Author

commented Jul 27, 2019

just put a blank line between the

and the <script>, so that the latter opens up a new HTML block

The "just" here is quite a big "just", in order to know about this workaround one would need to have a pretty deep level of understanding of the spec, a level of understanding that I can't ask of my users. Plus it would make the written code uglier.

There are more transparent workarounds for this, like for instance right now I'm inserting a non-breakable space between each 2 consecutive newline characters before piping the Markdown code through the compiler, and then removing them after that, but these kinds of workarounds shouldn't be necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.