Correctly parse HTML5 void elements #141

Open
nidico opened this issue Mar 13, 2023 · 1 comment

nidico commented Mar 13, 2023

(I'm unsure whether parsing "normal HTML" - and not only "custom HTML tags" - is in scope of the library. As a user I expected to be able to parse "normal HTML" as it is, so I opened this issue; if it's not in scope, feel free to close!)

HTML5 elements are not necessarily valid XML, e.g.

  • HTML void elements such as <br> must not have a closing tag (here: </br>).
  • The MDN document linked above even states that, in general, "Self-closing tags (<tag />) do not exist in HTML." A trailing slash can still be added to void elements in order to be XHTML compliant. Something like <p /> isn't valid HTML5, though!

In contrast, HtmlParser requires all "HTML" to be valid XML.

In practice this means that:

  • Void elements currently require a closing tag (e.g. <br></br>) or a self-closing tag (<br />); a bare <br> doesn't work.
  • Non-void elements are currently allowed to be self-closing (e.g. <p />) despite this being invalid HTML5.

My pragmatic suggestion would be to

  • fix the former, i.e. assume void elements always self-close -> treat <br> as <br />
  • and tolerate the latter, as <p /> is still valid XHTML and nobody would write this "by accident" anyway -> leave it as it is.
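
To make the first suggestion concrete, here is a minimal sketch of the "treat <br> as <br />" rule written as a preprocessing pass. Python is used purely for illustration, and the names `VOID_ELEMENTS` and `self_close_void_elements` are hypothetical; none of this is HtmlParser's actual code. The void element list is the one from the HTML Living Standard. The regex is deliberately naive and assumes no attribute value contains `<` or `>`.

```python
import re

# Void elements as listed in the HTML Living Standard.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Opening tag: name, raw attribute text, optional trailing slash.
# Closing tags never match because "/" cannot start a tag name.
_OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9-]*)([^<>]*?)(/?)>")


def self_close_void_elements(html: str) -> str:
    """Rewrite bare void tags such as <br> to the XML-friendly <br />."""

    def fix(match: re.Match) -> str:
        name, attrs, slash = match.groups()
        if name.lower() in VOID_ELEMENTS and not slash:
            return f"<{name}{attrs.rstrip()} />"
        # Non-void tags and already self-closed tags pass through untouched,
        # so <p /> is tolerated exactly as suggested above.
        return match.group(0)

    return _OPEN_TAG.sub(fix, html)


print(self_close_void_elements("<p>a<br>b</p>"))
```

This sketch only covers the bare-tag case. Input that pairs a void element with an explicit closing tag (<br></br>) would need that closing tag dropped as well, which this pass does not attempt.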

dillonkearns (Owner) commented

(I'm unsure whether parsing "normal HTML" - and not only "custom HTML tags" - is in scope of the library. As a user I expected to be able to parse "normal HTML" as it is, so I opened this issue; if it's not in scope, feel free to close!)

This question is discussed in this thread:

#99

It's not a clear-cut answer, but I laid out a few possible approaches. I still need to gather ideas and settle on a final design there, and I would be glad to hear your thoughts there as well!

My pragmatic suggestion would be to

  • fix the former, i.e. assume void elements always self-close -> treat <br> as <br />
  • and tolerate the latter, as <p /> is still valid XHTML and nobody would write this "by accident" anyway -> leave it as it is.

I agree with your reasoning here. With a self-closing tag, the intention is very clear so I don't see any harm in supporting that in any context regardless of whether it is valid XHTML or HTML.

And void elements should work as expected in HTML, so since <br> works in the browser (and I believe it is arguably the preferred syntax according to some standards), it makes sense for that to be valid. I'm not sure about the scope of void elements with regard to parsers, but I guess that would mean always treating void tags as self-closing, and always ignoring/throwing away a closing tag for a void element during parsing (such as </br>).
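
As a rough illustration of those two parsing rules (void tags never open a scope, and a stray closing tag for a void element is discarded), here is a small tree builder on top of Python's standard `html.parser`. It is only a sketch under assumptions: `TreeBuilder`, `parse`, and the dict-based node shape are made up for this example and say nothing about how HtmlParser is structured internally.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML Living Standard.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class TreeBuilder(HTMLParser):
    """Builds a nested dict/str tree; node shape is hypothetical."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]

    def _append(self, tag):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        return node

    def handle_starttag(self, tag, attrs):
        node = self._append(tag)
        # Rule 1: a void element is complete as soon as it is opened.
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Explicit <tag /> is accepted for any element, void or not.
        self._append(tag)

    def handle_endtag(self, tag):
        # Rule 2: throw away closing tags for void elements (e.g. </br>).
        if tag in VOID_ELEMENTS:
            return
        # Mismatched closing tags are silently ignored in this sketch;
        # a strict parser would report an error instead.
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1]["children"].append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root["children"]


# All three spellings of the line break yield the same tree.
for source in ("<p>a<br>b</p>", "<p>a<br />b</p>", "<p>a<br></br>b</p>"):
    print(parse(source))
```

With these rules, <br>, <br />, and <br></br> are indistinguishable after parsing, and <p /> keeps working as an empty paragraph.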

I don't have bandwidth to work on this, but I'd be happy to help review and merge a PR for either of these changes, they both sound like great improvements.

Thank you for the thoughtful writeup and context on this!
