Change from "html.HTMLParser" to "etree.XMLParser" for validation #97

jedie · 2021-10-09T13:12:36Z

html.HTMLParser doesn't accept "unknown" HTML tags,
e.g. it raised an error because of a <nav> tag :(

The etree.XMLParser seems to be a better choice: it raises errors on really broken HTML documents,
but accepts all tags.

Another difference: HTMLParser accepts whitespace in closing tags like </ \r\n h1>, and
XMLParser does not.
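
To illustrate the "accepts all tags, rejects broken markup" behaviour, here is a minimal sketch. It is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree (expat) instead of lxml's etree.XMLParser so that it runs without extra dependencies, and the helper name is_well_formed is hypothetical. The results follow from XML well-formedness rules, so a strict lxml XMLParser should behave the same way on these inputs, but that is not verified here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# An XML parser has no list of "known" tags, so <nav> and <foo> are fine.
print(is_well_formed("<nav><foo>ok</foo></nav>"))  # True

# Whitespace between "</" and the tag name is not allowed in XML.
print(is_well_formed("<h1>Test</ \r\n h1>"))  # False

# A "<" followed by a space cannot start a tag.
print(is_well_formed("<p> >broken< </p>"))  # False
```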

This PR also doesn't activate recover=False, because with this option, validating the Django admin
default templates doesn't work, even though they seem to be fine HTML...

Another thing:

etree.XMLSyntaxError raises a very, very big traceback without any context about the broken HTML
document.
This is a problem in combination with snapshot tests, because it's totally unknown which part of the
HTML document contains the error.

Now we get really helpful messages that point to the error and contain some context lines, e.g.:

StartTag: invalid element name, line 5, column 25
--------------------------------------------------------------------------------
02     <foo>
03         <bar>
04             <h1>Test</h1>
05             <p> >broken< </p>
----------------------------^
06             <p>the end</p>
07         <bar>
08     </foo>
--------------------------------------------------------------------------------
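
A minimal sketch of how a message like the one above could be assembled. This is not the code from this PR: the helper name format_parse_error and its parameters are hypothetical, and the line and column are assumed to be taken from the parser error (for lxml, the position attribute of etree.XMLSyntaxError should provide them).

```python
def format_parse_error(source: str, message: str, line: int, column: int,
                       context: int = 3) -> str:
    """Return `message` plus numbered source lines around `line` (1-based),
    with a marker row pointing at `column` below the offending line."""
    lines = source.splitlines()
    first = max(1, line - context)
    last = min(len(lines), line + context)
    width = max(2, len(str(last)))  # zero-padded line numbers, at least 2 digits
    rule = "-" * 80

    out = [message, rule]
    for number in range(first, last + 1):
        out.append(f"{number:0{width}d} {lines[number - 1]}")
        if number == line:
            # Dashes span the line-number prefix plus `column` characters.
            out.append("-" * (width + 1 + column) + "^")
    out.append(rule)
    return "\n".join(out)


if __name__ == "__main__":
    html = "<foo>\n  <bar>\n    <p> >broken< </p>\n  </bar>\n</foo>"
    print(format_parse_error(html, "StartTag: invalid element name", line=3, column=13))
```

With snapshot tests, raising an AssertionError that carries this text keeps the failure output short and shows the lines around the broken markup.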

