Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change from "html.HTMLParser" to "etree.XMLParser" for validation #97

Merged
merged 1 commit into from
Oct 9, 2021

Conversation

jedie
Copy link
Collaborator

@jedie jedie commented Oct 9, 2021

html.HTMLParser doesn't accept "unknown" html tags.
e.g.: It raised an error because of <nav> tag :(

The etree.XMLParser seems to be a better choice: It raise errors on really broken html documents,
but accepts all tags.

Another difference: HTMLParser accepts whitespaces in closing tags like: </ \r\n h1> and
XMLParser not.

Thsi PR also doesn't activate recover=False, because with this option, validate django admin
default templates doesn't work. But they seems that they are fine html codes...

Another thing:

etree.XMLSyntaxError will raise a very, very big traceback and not any context to the broken HTML
document.
This is a problem in combination with snapshot tests: Because it's totally unknown what part of the
HTML document contains the error.

Now we get really helpful messages, that points to the error and contains some context lines, e.g.:

StartTag: invalid element name, line 5, column 25
--------------------------------------------------------------------------------
02     <foo>
03         <bar>
04             <h1>Test</h1>
05             <p> >broken< </p>
----------------------------^
06             <p>the end</p>
07         <bar>
08     </foo>
--------------------------------------------------------------------------------

`html.HTMLParser` doesn't accept "unknown" html tags.
e.g.: It raised an error because of `<nav>` tag :(

The `etree.XMLParser` seems to be a better choice: It raise errors on really broken html documents,
but accepts all tags.

Another difference: `HTMLParser` accepts whitespaces in closing tags like: `</ \r\n h1>` and
`XMLParser` not.

Thsi PR also doesn't activate `recover=False`, because with this option, validate django admin
default templates doesn't work. But they seems that they are fine html codes...

Another thing:

`etree.XMLSyntaxError` will raise a very, very big traceback and not any context to the broken HTML
document.
This is a problem in combination with snapshot tests: Because it's totally unknown what part of the
HTML document contains the error.

Now we get really helpful messages, that points to the error and contains some context lines, e.g.:
```
StartTag: invalid element name, line 5, column 25
--------------------------------------------------------------------------------
02     <foo>
03         <bar>
04             <h1>Test</h1>
05             <p> >broken< </p>
----------------------------^
06             <p>the end</p>
07         <bar>
08     </foo>
--------------------------------------------------------------------------------
```
jedie added a commit to jedie/PyInventory that referenced this pull request Oct 9, 2021
jedie added a commit to jedie/PyInventory that referenced this pull request Oct 9, 2021
@phihag phihag merged commit 6d29c91 into master Oct 9, 2021
@phihag phihag deleted the enhance-validation branch October 9, 2021 13:58
jedie added a commit to jedie/PyInventory that referenced this pull request Oct 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants