XPath3.1: mimic handling of multiple root element nodes #2351

Constantin1489 · 2024-05-07T15:46:17Z

Some web servers serve broken HTML.
lxml and libxml2 repair it, which is great. (We have been happy with that for decades!)

But the error I want to solve occurs at the point where elementpath describes the DOM structure, because lxml or libxml2 sometimes returns multiple root element nodes when the HTML parser is used. (This could be a trace of the browser wars. I don't remember the article, but there were four kinds of HTML parser rules because of the four major browsers.)

See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716

So I mimicked it.
The test I included describes the point.
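
To make the idea concrete, here is a minimal sketch that uses only Python's standard library instead of lxml or elementpath, so it is an analogy and not the code in this PR. The helper names transplant_forest and parses_as_single_document and the wrapper tag new_root are hypothetical. A strict XML parser rejects two root elements outright, while wrapping the markup in one synthetic root lets both trees be queried:

```python
import xml.etree.ElementTree as ET

# Well-formed markup, except that it has two root elements.
TWO_ROOTS = (
    "<html><body><p>First paragraph.</p></body></html>"
    "<html><body><p>Second paragraph.</p></body></html>"
)


def transplant_forest(markup: str, root_tag: str = "new_root") -> ET.Element:
    """Wrap markup that may hold several root elements in one synthetic root.

    Hypothetical helper for illustration only. It assumes the markup is
    otherwise well-formed XML, which real-world broken HTML often is not.
    """
    return ET.fromstring(f"<{root_tag}>{markup}</{root_tag}>")


def parses_as_single_document(markup: str) -> bool:
    """Return True if the markup is accepted as a single-root XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


root = transplant_forest(TWO_ROOTS)
# The first <p> of every <body>, across both <html> trees.
paragraphs = [p.text for p in root.findall("html/body/p[1]")]
```

The same path now yields one match per html element, which is the kind of surprise the included test is meant to pin down.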

fixes #2318

Constantin1489 · 2024-05-07T15:49:09Z

requirements.txt

@@ -55,7 +55,7 @@ beautifulsoup4
 lxml >=4.8.0,<6

 # XPath 2.0-3.1 support - 4.2.0 broke something?
-elementpath==4.1.5
+elementpath==4.4.0


Is it time to upgrade?

Sure, if the tests pass it's OK

Was this change required to fix this PR?

Since this PR (#2351) uses the fragment=True option, >=4.1.5 won't work, and 4.2.0 has another problem. So the minimum is 4.2.1.

changedetectionio/html_tools.py

Constantin1489 · 2024-05-07T17:02:23Z

changedetectionio/tests/test_xpath_selector_unit.py

+                          ])
+def test_broken_DOM_01(html_content, xpath, answer):
+    # In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
+    with pytest.raises(Exception):


I intentionally added this test to reproduce the problem.
In the future, libxml2 may implement "html5" (https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). As I posted in the issue, the problem will then be gone and this test will fail. When that day comes, please remove these tests.

Constantin1489 · 2024-05-07T17:12:40Z

changedetectionio/tests/test_xpath_selector_unit.py

+@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
+@pytest.mark.parametrize("xpath, answer", [
+    ("/html/body/p[1]", "First paragraph."),
+    ("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),


This is the critical point. Why does lxml return two elements when I select only one in the browser's inspector? Because there are two html elements and two body elements.

Constantin1489 · 2024-05-07T17:39:54Z

changedetectionio/tests/test_xpath_selector_unit.py

+    <p>First paragraph.</p>
+  </body>
+</html>
+<html>


The second html root element.

dgtlmoon · 2024-05-08T07:44:16Z

As an idea, what about having this enabled by default as a config option?

Constantin1489 · 2024-05-13T04:51:29Z

Valid HTML DOMs are all alike; every invalid HTML DOM is invalid and unhappy in its own way.

Since the suggestion adds another synthetic root element by default, the question is what the initial context item should be.

There are three possible options.

The first: fragment=True, and the context item is always the new root element.
The second: fragment=True, and the context item is dynamic. That is, if only one root element exists, it is the context item; when multiple root elements exist, the context item is the new_root element. So the second is meaningless: it would be the same as the current PR.
The third: fragment=True, ignore how many root elements exist, and select one of them as the context item. This reduces the information reachable in the multiple-root case, which is unacceptable.

One may say that I chose the wrong HTML parser and should use etree.HTML. But in that case the "html" root element node can have an html element as a child, i.e. "/html/html".

To be honest, every solution seems intuitively unfair, mine included. But libxml2 will develop html5.

As I mentioned, selecting one of the html "root element nodes" as the context item is not an option.

So, in the first method, paths from the context item are one level deeper for all cases. (https://github.com/dgtlmoon/changedetection.io/actions/runs/9057420996/job/24881401637)
e.g. manager[@name = 'Godot'] -> branches_to_visit/manager[@name = 'Godot']
So making the new root element node the default for all cases makes sense when the context item is the "new_root" element.
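
The effect on relative paths can be sketched with the standard library's limited XPath support (not elementpath, and not this PR's code). The document below is made up to match the manager and branches_to_visit names in the example above, and new_root stands for the hypothetical synthetic root:

```python
import xml.etree.ElementTree as ET

# Made-up document matching the element names in the example above.
DOC = (
    "<branches_to_visit>"
    "<manager name='Godot'/>"
    "<manager name='Other'/>"
    "</branches_to_visit>"
)

original_root = ET.fromstring(DOC)
# The same markup wrapped in a synthetic root element.
new_root = ET.fromstring(f"<new_root>{DOC}</new_root>")

QUERY = "manager[@name='Godot']"

# Context item = original root: the relative path matches directly.
from_original = original_root.findall(QUERY)
# Context item = synthetic root: the same path matches nothing...
from_new_root = new_root.findall(QUERY)
# ...until it is written one level deeper.
from_new_root_deeper = new_root.findall("branches_to_visit/" + QUERY)
```

With the synthetic root as the context item, every relative path gains one leading step, whether or not the input had multiple roots.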

dgtlmoon · 2024-05-15T10:23:11Z

do we need this one? it could help #2175 , thoughts?

Constantin1489 · 2024-05-16T05:36:26Z

the minimum version of elementpath is 4.2.1 because of https://github.com/Constantin1489/changedetection.io/actions

dgtlmoon · 2024-05-17T07:08:05Z

Looks OK to me, i guess it doubles the CPU usage for checking a watch right?

Constantin1489 · 2024-05-17T10:43:45Z

Good point. Since the new function inevitably increases the usage, I chose another method to speed it up.
In my test, 507427 function calls (502074 primitive calls) in 0.374 seconds became 12909 function calls (12291 primitive calls) in 0.007 seconds.

Across the 138 tests in CI, it took 0.39 sec (https://github.com/dgtlmoon/changedetection.io/actions/runs/9107046053/job/25035224418#step:9:3365). Now it takes 0.31 sec (https://github.com/dgtlmoon/changedetection.io/actions/runs/9126805653/job/25095800223?pr=2351#step:9:3367), as if this function didn't exist (https://github.com/dgtlmoon/changedetection.io/actions/runs/9124493905/job/25088778195#step:9:3292).
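
Figures of the form "N function calls (M primitive calls) in S seconds" are what Python's cProfile and pstats modules report. A minimal sketch of collecting such a summary follows; the workload function is a hypothetical stand-in, not the function changed in this PR, and profile_summary is an illustrative helper name:

```python
import cProfile
import io
import pstats


def workload(n: int) -> int:
    """Hypothetical stand-in for the code being measured.

    It is recursive so that total and primitive call counts can differ.
    """
    return 0 if n == 0 else n + workload(n - 1)


def profile_summary(func, *args) -> str:
    """Run func under cProfile and return the pstats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 entries
    return stream.getvalue()


report = profile_summary(workload, 50)
# Pick out the "... function calls ... in ... seconds" summary line.
summary_line = next(
    line.strip() for line in report.splitlines() if "function calls" in line
)
```

Comparing the summary line before and after a change gives the kind of numbers quoted above; absolute timings depend on the machine.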

Constantin1489 · 2024-05-20T12:45:53Z

BTW, I couldn't find any evidence that lxml parses the content again.
Could you provide me an example of double CPU usage?

Constantin1489 · 2024-05-20T13:13:26Z

BTW, the reason I don't use a lazy import is that this is Python. I'm not a PHP expert, but a Stack Overflow user made this point (https://stackoverflow.com/a/10084940/20307768).

dgtlmoon · 2024-05-22T08:20:31Z

could you merge current master into this branch so we can test again? thanks!

dgtlmoon · 2024-05-28T09:14:25Z

So this is nearly always caused by a missing <html> opening tag, right?

Constantin1489 · 2024-05-28T11:52:12Z

@dgtlmoon As I posted in https://gitlab.gnome.org/GNOME/libxml2/-/issues/716,
minimal examples are:

<!DOCTYPE HTML>
<html></html>
<link href="/example/uri" rel="stylesheet" type="text/css" />

OR

<!DOCTYPE HTML>
<html></html>
Some string

OR

<!DOCTYPE HTML>
<html></html>
<Some/>

In these cases, libxml2 and lxml return two html root element nodes.

cat <<EOF | xmllint --html -
<!DOCTYPE HTML>
<html></html>
Some string
EOF

or

cat <<EOF | xmllint --html -
<!DOCTYPE HTML>
<html></html>
<Some/>
EOF

Constantin1489 · 2024-05-28T11:57:40Z

I mean this is an awesome algorithm! How much wealth has been generated over the decades?

dgtlmoon · 2024-06-25T11:32:30Z

please, could you update this with latest master ?

Constantin1489 added 12 commits May 2, 2024 20:41

html_tools/fix: Add forest_transplanting to handle invalid DOM

8e1f170

requirements/fix: Upgrade and pin elementpath to support fragment option

1f776ff

html_tools/fix:

bf5c2c7

html_tools/fix: Another option

9f0cb35

html_tools/fix:

879d0b2

tests/test_xpath_selector_unit/test: Add test.

ed2aaf4

html_tools/docs: Remove comments

dd8b4fe

tests/test_xpath_selector_unit/fix: Typo

fbd5512

tests/test_xpath_selector_unit/test: Fix test and add more small test…

20195e7

…s for fragment

tests/test_xpath_selector_unit/test: Check error occurs.

220f484

tests/test_xpath_selector_unit/test: Fix

e84b9f1

tests/test_xpath_selector_unit/test: Add more unintuitive tests

60777e4

Constantin1489 changed the title ~~mimic several root element nodes handling~~ mimic multiple root element nodes handling May 7, 2024

Constantin1489 changed the title ~~mimic multiple root element nodes handling~~ XPATH3.1: mimic multiple root element nodes handling May 7, 2024

Constantin1489 added 2 commits May 8, 2024 01:04

tests/test_xpath_selector_unit/test: Trigger test again

e325e02

tests/test_xpath_selector_unit/fix: Trigger test again. why it doesn'…

6a2e1cf

…t work like my repo

Constantin1489 marked this pull request as draft May 7, 2024 16:33

Constantin1489 added 2 commits May 8, 2024 01:36

tests/test_xpath_selector_unit/test: Oops fix test name

55b2c6c

tests/test_xpath_selector_unit/test: Failed successfully

93a9585

Constantin1489 changed the title ~~XPATH3.1: mimic multiple root element nodes handling~~ XPATH3.1: mimic handling of multiple root element nodes May 7, 2024

Constantin1489 marked this pull request as ready for review May 7, 2024 16:59

Constantin1489 added 2 commits May 8, 2024 02:16

tests/test_xpath_selector_unit/test: Add count test

e6b13c9

tests/test_xpath_selector_unit/chore: Trigger CICD

2e3e781

Constantin1489 added 2 commits May 8, 2024 02:50

tests/test_xpath_selector_unit/test: Add same behavior for xpath 1

c295c5e

tests/test_xpath_selector_unit/test: Fix misc

5acd31f

Constantin1489 added 2 commits May 8, 2024 02:54

tests/test_xpath_selector_unit/test: Fix answer

de7b66b

html_tools/docs: Fix old comment

66a7dae

Constantin1489 added 3 commits May 13, 2024 13:53

tests/test_xpath_selector_unit/feat: Do forest_transplanting by default

4d266ca

tests/test_xpath_selector_unit/test: Fix tests

ebf7fd4

tests/test_xpath_selector_unit/test: Add context node related tests

26e4a58

Constantin1489 mentioned this pull request May 13, 2024

'str' object has no attribute '__name__' error on some xpath filters #2318

Open

Constantin1489 requested a review from dgtlmoon May 14, 2024 12:07

Constantin1489 mentioned this pull request May 16, 2024

chore: Skip elementpath 4.2 but support since 4.1.5. #2175

Closed

requirements/chore: Change minimum version of elementpath

dbf4e87

html_tools/fix: Improve speed for function calls

7cd764f

Constantin1489 added 2 commits May 22, 2024 17:22

Merge branch 'dgtlmoon:master' into transplanting

48a5aa2

Revert "html_tools/docs: Fix old comment"

3619877

This reverts commit 66a7dae.

Constantin1489 changed the title ~~XPATH3.1: mimic handling of multiple root element nodes~~ mimic handling of multiple root element nodes May 26, 2024

Constantin1489 changed the title ~~mimic handling of multiple root element nodes~~ XPath3.1: mimic handling of multiple root element nodes May 26, 2024

Constantin1489 marked this pull request as draft June 13, 2024 15:37
