Both start and end tags of html / head / body elements can't be omitted #98
However, that "feature" isn't supported. The html / head / body elements don't seem to be inserted into the DOM by Oga if both their start and end tags are omitted.
This is correct. In an initial revision of the HTML handling system the lexer did automatically insert html/head/body tags whenever needed. After thinking about this for a while I decided to remove this as ultimately it leads to unexpected behaviour. To explain this, when parsing XML/HTML there are two kinds of inputs:
Nokogiri supports this distinction in the form of
The problem with this is that it complicates using the library. One has to think "am I parsing a document or a fragment?" every time one wants to do something with HTML/XML. This distinction also complicates the lexing phase, as the lexer now has to include extra support based on some sort of flag (e.g.
If one is not aware of (or simply does not expect) the above distinction, this leads to unexpected behaviour. For example, say somebody is parsing the following snippet and wants to remove the
They then serialize the document back to XML and lo and behold they get this:
This is very different compared to just receiving
One of the goals I have is that Oga does not return unexpected output. For example, Oga does not automatically add doctypes (unlike Nokogiri) or XML declarations. For that exact same reason I opted to not automatically add html/body/head tags even if the HTML5 specification says otherwise.
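To make the difference concrete, here is a toy sketch in plain Ruby. It is not Oga's or Nokogiri's actual code: `parse_as_document` and `parse_as_fragment` are hypothetical helpers that only mimic the two behaviours on strings, assuming a "document" mode that wraps bare input in html/head/body and a "fragment" mode that leaves it alone.

```ruby
# Hypothetical illustration only: these helpers are not part of Oga or
# Nokogiri. They work on plain strings to show the round-trip difference.

# Matches input that already starts with an <html> start tag.
HTML_START_TAG = /\A\s*<html[\s>]/i

# "Document" mode: adds html/head/body around input that lacks an html tag,
# which is roughly the effect of automatically inserting the implied tags.
def parse_as_document(input)
  return input if input.match?(HTML_START_TAG)

  "<html><head></head><body>#{input}</body></html>"
end

# "Fragment" mode: the input is returned exactly as it was given.
def parse_as_fragment(input)
  input
end

snippet = '<p>Hello</p>'

puts parse_as_document(snippet) # the caller gets back more than they put in
puts parse_as_fragment(snippet) # the caller gets back exactly what they put in
```

In this sketch the caller who never asked for html/head/body still receives them in "document" mode, which is the kind of surprise described above.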
I intend to document this choice, but it seems you beat me to it before I could write it down :)
@abotalov This is currently not possible, and I don't think I'll be adding this any time soon. Oga only tracks the names of opening tags (
Besides this I can't really think of any use cases where this would be useful.