This repository has been archived by the owner. It is now read-only.

Both start and end tags of html / head / body elements can't be omitted #98

Closed
abotalov opened this Issue Apr 27, 2015 · 4 comments

Contributor

abotalov commented Apr 27, 2015

The HTML5, HTML5.1, and WHATWG HTML specs say:

An html element's start tag may be omitted if the first thing inside the html element is not a comment.
An html element's end tag may be omitted if the html element is not immediately followed by a comment.

A head element's start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
A head element's end tag may be omitted if the head element is not immediately followed by a space character or a comment.

A body element's start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.
A body element's end tag may be omitted if the body element is not immediately followed by a comment.

However, that "feature" isn't supported. The html / head / body elements don't seem to be inserted into the DOM by Oga when both the start and end tags are omitted.

@abotalov abotalov changed the title from Both start and end tags of html / head / body / colgroup / tbody elements can't be omitted to Both start and end tags of html / head / body elements can't be omitted Apr 27, 2015

YorickPeterse Apr 27, 2015

Owner

This is correct. In an initial revision of the HTML handling system, the lexer did automatically insert html/head/body tags whenever needed. After thinking about this for a while, I decided to remove this, as ultimately it leads to unexpected behaviour. To explain: when parsing XML/HTML there are two kinds of input:

  • Full-blown documents (e.g. entire web pages)
  • Fragments of data (e.g. just a <form> tag)

Nokogiri supports this distinction in the form of Nokogiri.HTML() and Nokogiri::HTML.fragment(). When using Nokogiri.HTML(), any missing html/body/head tags as well as doctypes are inserted automatically. When using the fragment method, this is not the case.

The problem with this is that it complicates using the library. One has to think "am I parsing a document or a fragment?" every time they want to do something with HTML/XML. This distinction also complicates the lexing phase, as the lexer now has to include extra support based on some sort of flag (e.g. :document => true or :fragment => false).
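To make the concern concrete, here is a minimal sketch of what such a flag implies for callers. The helper `serialize_html` and its `document:` keyword are hypothetical and are not part of Oga or Nokogiri; the sketch only shows how the same input yields different output depending on a mode the caller has to remember to set.

```ruby
# Hypothetical helper, NOT part of Oga or Nokogiri. It only illustrates how a
# document/fragment flag changes what comes back for the same input.
def serialize_html(markup, document: false)
  # Fragment mode: return exactly what the caller passed in.
  return markup unless document

  # Document mode: wrap the input in tags the caller never wrote.
  "<html><body>#{markup}</body></html>"
end

fragment_output = serialize_html('<p>Hello</p>')
document_output = serialize_html('<p>Hello</p>', document: true)

puts fragment_output
puts document_output
```

Without such a flag there is only one behaviour to reason about, which seems to be the trade-off described here.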

If one is not aware of (or simply does not expect) the above distinction, this leads to unexpected behaviour. For example, say somebody is parsing the following snippet and wants to remove the class attribute:

require 'oga'

document = Oga.parse_html('<p class="example">Hello</p>')
paragraph = document.children[0]
paragraph.unset('class')

They then serialize the document back to XML and, lo and behold, they get this:

<html>
    <body>
        <p>Hello</p>
    </body>
</html>

This is very different from just receiving <p>Hello</p> as output.

One of the goals I have is that Oga does not return unexpected output. For example, Oga does not automatically add doctypes (unlike Nokogiri) or XML declarations. For that exact same reason I opted to not automatically add html/body/head tags even if the HTML5 specification says otherwise.

I intend to document this choice, but it seems you beat me to it before I could write it down :)

abotalov May 12, 2015

Contributor

Do you think it makes sense to insert start tags if end tags are present (in situations where it should be done according to the HTML spec)?

Oga.parse_html('</html>')

YorickPeterse May 12, 2015

Owner

@abotalov This is currently not possible, and I don't think I'll be adding this any time soon. Oga only tracks the names of opening tags (def on_element_name(name) vs def on_element_end). Changing this would introduce a pretty hefty performance penalty (due to extra string allocations), and I'd rather not do that any time soon.

Besides this I can't really think of any use cases where this would be useful.
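A rough sketch of why a nameless closing callback makes this hard. The `StackBuilder` class and its stack are hypothetical and are not Oga's actual parser; only the two callback names come from the comment above, and the sketch assumes the closing callback really receives no arguments.

```ruby
# Hypothetical handler, NOT Oga's implementation. Only the two callback names
# are taken from the comment; the stack bookkeeping is illustrative.
class StackBuilder
  attr_reader :closed

  def initialize
    @open = []   # names of the elements that are currently open
    @closed = [] # names of elements, in the order they were closed
  end

  # The opening-tag callback receives the element name.
  def on_element_name(name)
    @open.push(name)
  end

  # The closing-tag callback receives no name, so it can only close whatever
  # element is on top of the stack.
  def on_element_end
    name = @open.pop
    @closed.push(name) unless name.nil?
    name
  end
end

builder = StackBuilder.new
builder.on_element_name('div')
builder.on_element_name('p')
builder.on_element_end # closes "p"
builder.on_element_end # closes "div"

# A stray closing tag such as </html> would arrive as a bare on_element_end
# call: nothing is open, and there is no name to build a start tag from.
stray = builder.on_element_end
```

Under these assumptions, inserting a missing start tag for `</html>` would require passing the tag name to the closing callback as well, which is where the extra string allocations would come from.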

pcasaretto Jan 3, 2017

Nokogiri was driving me crazy by assuming too much: either adding tags when using full docs or removing them when using fragments.
Thanks for this! 🍻
