-
-
Notifications
You must be signed in to change notification settings - Fork 881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XPath3.1: mimic handling of multiple root element nodes #2351
base: master
Are you sure you want to change the base?
Conversation
requirements.txt
Outdated
@@ -55,7 +55,7 @@ beautifulsoup4 | |||
lxml >=4.8.0,<6 | |||
|
|||
# XPath 2.0-3.1 support - 4.2.0 broke something? | |||
elementpath==4.1.5 | |||
elementpath==4.4.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is time to upgrade?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is time to upgrade?
Sure, if the tests pass it's OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change was required to fix this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR(#2351) uses fragment=True option, >=4.1.5 won't work. and 4.2.0 has another problem. So minimum is 4.2.1
]) | ||
def test_broken_DOM_01(html_content, xpath, answer): | ||
# In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs. | ||
with pytest.raises(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally add this test to reproduce the problem.
And, in the future, libxml2 may implement "html5"(https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). As I posted the issue, this problem will be gone, and this test will fail. The day, please remove these tests.
@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element]) | ||
@pytest.mark.parametrize("xpath, answer", [ | ||
("/html/body/p[1]", "First paragraph."), | ||
("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the critical point. why do I choose one element in the browser inspect window, but lxml returns two? Because there are two html tag elements and two body tag elements.
<p>First paragraph.</p> | ||
</body> | ||
</html> | ||
<html> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second html root element.
As an idea, what about having this enabled by default as a config option? |
Valid HTML DOM are all alike; every non-valid HTML DOM is not valid and unhappy in its own way. Since the suggestion will add another synthetic root element by default, the point is the initial context item. For that, three possible options exist.
One may say that I chose the wrong html parser and I should choose etree.HTML. But, in this case the "html" tag root element node can have html tag as a child. In this case "/html/html" To be honest, all sorts of solutions seem intuitively unfair even with mine. But libxml2 will develop html5. As I mentioned, selecting one of the html "root element nodes" as a context item is not an option. So, in the first method, the context item is one depth deeper(?) for all cases. (https://github.com/dgtlmoon/changedetection.io/actions/runs/9057420996/job/24881401637) |
do we need this one? it could help #2175 , thoughts? |
the minimum version of elementpath is 4.2.1 because of https://github.com/Constantin1489/changedetection.io/actions |
Looks OK to me, i guess it doubles the CPU usage for checking a watch right? |
Good point. Since the new function will inevitably increase the usage, I chose just another method to increase speed. In the 138 tests in CI, it was 0.39 sec(https://github.com/dgtlmoon/changedetection.io/actions/runs/9107046053/job/25035224418#step:9:3365). now it has become 0.31 sec(https://github.com/dgtlmoon/changedetection.io/actions/runs/9126805653/job/25095800223?pr=2351#step:9:3367) like this function doesn't exist.(https://github.com/dgtlmoon/changedetection.io/actions/runs/9124493905/job/25088778195#step:9:3292) |
BTW, I couldn't find any evidence that lxml parses the content again. |
BTW, the reason why I don't do lazy import is because it is Python. I'm not an expert of php but a SOF user said this point (https://stackoverflow.com/a/10084940/20307768) |
could you merge current master into this branch so we can test again? thanks! |
So this is nearly always caused by a missing |
@dgtlmoon As I posted to https://gitlab.gnome.org/GNOME/libxml2/-/issues/716,
OR
OR
In this case, libxml2, and lxml returns two html root element nodes.
or
|
I mean this is an awesome algorithm! How much wealth has been generated over the decades? |
please, could you update this with latest |
Obviously, some web server provides broken html.
The lxml and libxml2 fix it. It's good and indeed great!!! (We have been happy for decades!)
But, at the point, the error I want to solve occurs, the elementpath describes the DOM structure. it's because sometimes lxml or libxml2 returns multiple root element nodes when using html parser. (This could be a trace? of the browser wars. I don't remember the article but there were four kinds of html parser rules because of four major browsers.)
See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716
So I mimicked it.
The test I included describes the point.
fixes #2318