Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support XPath2.0 to 3.1 #1774

Merged
merged 37 commits into from
Nov 13, 2023
Merged

Conversation

Constantin1489
Copy link
Contributor

@Constantin1489 Constantin1489 commented Sep 7, 2023

Is XPath 3.1 100% upwardly compatible with XPath 2.0?
What are the differences between versions of XPath (1.0, 2.0, 3.1)

This update allows syntax like this below.(matches and position)
//*[matches(@id,"container")]/section[1]/article[2]/div[2]/table/tbody/tr[position()>3]/td[2]/a[1]

I tested with mine, works well. However, exception catches and so on need to be added. Because I can't test every syntax or exception handling for this app. ex)
image(The issue in the image is fixed)

Also because of incompatibility between XPathes(1.0 vs 2.0), something will work differently.

Also, I didn't do tests in the directory, because the tests already failed before I made changes. BTW, I tested this patch with mine, works well.

Closes #1631
Closes #1674
#1773

[Is XPath 3.1 100% upwardly compatible with XPath 2.0?](https://stackoverflow.com/a/62438044/20307768)
[What are the differences between versions of XPath (1.0, 2.0, 3.1)](https://stackoverflow.com/a/51377660/20307768)
@dgtlmoon
Copy link
Owner

dgtlmoon commented Sep 7, 2023

very cool, running tests now

any chance you can add a test for something like //*[contains(@id,'post')]/div/div[2]/header/h2/a/text() | //*[contains(@id,'post')]/div/div[2]/header/h2/a/@href ?

just copy def test_check_with_prefix_include_filters to a new def test_xpath_20 in tests/test_xpath_selector.py

@dgtlmoon
Copy link
Owner

dgtlmoon commented Sep 7, 2023

, because the tests already failed before I made changes.

That is not correct because it passes here on github, so yeah, super important to add a tests or it cant be merged

@dgtlmoon
Copy link
Owner

dgtlmoon commented Sep 7, 2023

@Constantin1489 interesting

        res = client.post(
            url_for("edit_page", uuid="first"),
            data={"include_filters": "/something horrible", "url": test_url, "tags": "", "headers": "", 'fetch_backend': "html_requests"},
            follow_redirects=True
        )
>       assert b"is not a valid XPath expression" in res.data

So it should throw an exception if the xpath string is invalid right? I wonder

@Constantin1489
Copy link
Contributor Author

Okay, I do it. Please, give me few days

@Constantin1489
Copy link
Contributor Author

@dgtlmoon e295a17 fixes this. You don't have to take any action.

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Sep 7, 2023

Is there a method to run tests locally? Or is it just a port conflict?

@dgtlmoon
Copy link
Owner

dgtlmoon commented Sep 7, 2023

Is there a method to run tests locally? Or is it just a port conflict?

cd changedetectionio
pytest tests/test_foobar.py

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Sep 7, 2023

Yes, I did it before. I cannot configure it correctly.

FAILED tests/test_xpath_selector.py::test_setup - AttributeError: Can't pickle local object 'LiveServer.start.<locals>.worker'
FAILED tests/test_xpath_selector.py::test_check_xpath_text_function_utf8 - assert b'<div class="">Stock Alert (UK): RPi CM4' in b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n<title>500 Internal Server Error</title>\n<h1>Internal Server ...
FAILED tests/test_xpath_selector.py::test_check_with_prefix_include_filters - assert b'Some text thats the same' in b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>...
FAILED tests/test_xpath_selector.py::test_xpath_20 - assert b'Some text thats the same' in b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n<title>500 Internal Server Error</title>\n<h1>Internal Server Error</h1>\n<p>...
==================================================================== 4 failed, 3 passed, 5 warnings in 35.93s =====================================================================
Shutting down datastore thread

I tried it on Rocky os 9(RHEL) it also failed.

=================================================================== 111 failed, 10 passed, 8 warnings in 48.47s ===================================================================

@dgtlmoon
Copy link
Owner

dgtlmoon commented Sep 7, 2023

I cant help there sorry, i never tried in mac or windows

@Constantin1489
Copy link
Contributor Author

@dgtlmoon I added a 09b6203 test for the syntax. (#1774 (comment)).

Explanations will be commented on there.


res = client.post(
url_for("edit_page", uuid="first"),
data={"include_filters": "//*[contains(@class, 'sametext')]|//*[contains(@class, 'changetext')]",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing with the union operator. The operator is supported since 1.0.

//*[contains(@class, 'sametext')]
//*[contains(@class, 'changetext')]

will parse like

<div class="sametext">Some text thats the same</div>
<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>
<div class="changetext">Some new text</div>

But the operator//*[contains(@class, 'sametext')]|//*[contains(@class, 'changetext')] will parse like this

<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>
<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that it affects existing xpaths that people have saved?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It won't. It is just an xpath syntax explanation. I didn't explain here enough. I'm sorry for slabs.

Let's say modified.html contains the content below.

$cat modified.html
<html>
<body>
<p>Which is across multiple lines</p>
<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>
<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>
</body>
</html>

basex is cli xpath parser.
If the user uses the XPath union syntax, the result is below.

$basex -i modified.html "//*[contains(@class, 'sametext')]|//*[matches(@class, 'changetext')]" 
<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>
<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>

In your program, without the union syntax,

//*[contains(@class, 'sametext')]
//*[contains(@class, 'changetext')]

will show

<div class="sametext">Some text thats the same</div>
<div class="sametext">Some text thats the same</div>
<div class="changetext">Some new text</div>
<div class="changetext">Some new text</div>

)

assert b"Some text thats the same" in res.data #in selector
assert b"Some text that will change" in res.data #in selector
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the include_filters contains //*[contains(@class, 'changetext')]

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Sep 9, 2023

@dgtlmoon, Do not merge this commit until further notice, please.
#1674

@Constantin1489 Constantin1489 force-pushed the element_lib branch 2 times, most recently from 4a75940 to 79cf984 Compare September 9, 2023 21:23
@Constantin1489
Copy link
Contributor Author

@dgtlmoon I fixed problems only XPath2.0-3.1 related. If tests pass, it is okay to publish.

@Constantin1489
Copy link
Contributor Author

@dgtlmoon please, review the new commit!

@dgtlmoon
Copy link
Owner

dgtlmoon commented Nov 6, 2023

@Constantin1489 I found a bug, when using xpath1: it does not validate the rule anymore

@dgtlmoon
Copy link
Owner

dgtlmoon commented Nov 6, 2023

wachete uses xpath2 also in their export #1922

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Nov 6, 2023

  1. Oh, thank you for the bug! That is easy!
    related with git show da923352371fe5e121bb321efb390501c4547871 changedetectionio/forms.py
    2. I'm currently digging my PR with git log -p --author=Constantin.
    We need except etree.XPathEvalError as e: should be back, too!
    It will be except (elementpath.ElementPathError, etree.XPathEvalError) as e
    https://stackoverflow.com/a/24338247/20307768

2-2.
[ ] fix exception handling except etree.XPathEvalError as e: in changedetectionio/forms.py

2-3.
Previous if line.strip()[0] == '/': in class ValidateCSSJSONXPATHInput(object): of changedetectionio/forms.py
will be changed to cover xpath1:, xpath: too in new my commit(I will add it soon). Is there a special reason you didn't do that?

  1. We need to revive the tests for xpath1 too! Is it an easy way, without copy and paste all test code??

@dgtlmoon
Copy link
Owner

dgtlmoon commented Nov 6, 2023

2-3. Previous if line.strip()[0] == '/': in class ValidateCSSJSONXPATHInput(object): of changedetectionio/forms.py will be changed to cover xpath1:, xpath: too in new my commit(I will add it soon). Is there a special reason you didn't do that?

Not sure I understand, xpath: should be now xpath2 and xpath1 should be xpath1

  1. We need to revive the tests for xpath1 too! Is it an easy way, without copy and paste all test code??

please just copy test_xpath_validation to test_xpath1_validation and I will clean up the code later

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Nov 6, 2023

for 2-3.
It will become
if line.strip()[0] == '/' or line.strip().startswith('xpath:'):
and new block is
if line.strip().statswith('xpath1:'):

to validate

for 3. Okay!

@Constantin1489
Copy link
Contributor Author

I added! Would you please check what I missed?

@dgtlmoon dgtlmoon changed the title feat: Support XPath2.0-3.1 feat: Support XPath2.0 to 3.1 Nov 6, 2023
@dgtlmoon
Copy link
Owner

dgtlmoon commented Nov 6, 2023

I just realised the VisualSelector needs checking if it still works with xpath1: style syntax also, might need a tweak

@Constantin1489
Copy link
Contributor Author

Sorry, I didn't understand. Would you explain it to me?

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Nov 8, 2023

I considered why you wanted to add xpath1 on the visual selector.

I think you are worried about if a user selects an element on the visual selector tab but the result won't show the same as xpath1. I don't think that can happen.

The visual selector represents an element as an element or an element with a number.

Let's test it in ycombinator.

<td class="title"><span class="titleline"><a href="https://ooni.org/" rel="nofollow noreferrer">Open Observatory of Network Interference</a><span class="sitebit comhead"> (<a href="from?site=ooni.org"><span class="sitestr">ooni.org</span></a>)</span></span></td>

For this let's test /html/body/center/table/tbody/tr[3]/td/table/tbody/tr[31]/td[3].
this is the result

Open Observatory of Network Interference ( ooni.org )

when we test two kind of xpath,

/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[31]/td[3]
/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[31]/td
xpath1:/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[31]/td[3]
xpath1:/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[31]/td
/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[31]/td[3]
/html/body/center/table/tbody/tr[3]/td/table/tbody/tr[31]/td

result is

Open Observatory of Network Interference ( ooni.org ) 11. 
Open Observatory of Network Interference ( ooni.org ) Open Observatory of Network Interference ( ooni.org ) 11.
Open Observatory of Network Interference ( ooni.org ) Open Observatory of Network Interference ( ooni.org ) 11.
Open Observatory of Network Interference ( ooni.org )

So I don't believe there will be a different result for the visual selection because it is just a basic syntax.

Just for sure,
//*/text() can be different.(I don't know where the reason came from between xpath1 vs xpath2 or elementpath) but not that significant.
image

Also, this kind of syntax won't show a difference.

xpath1://tr[*]/td[3]/a[1]
//tr[*]/td[3]/a[1]

@Constantin1489
Copy link
Contributor Author

See also, https://www.w3.org/TR/xpath20/#id-backwards-compatibility
In this ref, no mention related to CD's visual selector.

@dgtlmoon
Copy link
Owner

dgtlmoon commented Nov 12, 2023

I think this is going to break functionality for atleast some people out there who have weird/tricky xPath1 selectors.. but CD absolutely needs xPath2/3 so we just have to soldier on :)

@dgtlmoon dgtlmoon merged commit 26931e0 into dgtlmoon:master Nov 13, 2023
3 checks passed
@dgtlmoon
Copy link
Owner

dgtlmoon commented Nov 20, 2023

@Constantin1489

image

I just got this :)

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Nov 20, 2023

@dgtlmoon that is a really great book. The author is one of the editors of XSLT2.0, 3.0 spec in W3C. He is also the editor of XPath2.0 reference of W3C!
He is an active user in StackOverFlow! If you ask a question related to XPath or XSLT, you will have a good chance to encounter him.
Yes, the book is written with authority.

@dgtlmoon
Copy link
Owner

dgtlmoon commented Dec 5, 2023

@Constantin1489 I think we can say that this went well in the end! thanks! havent got any large bug reports about problems caused by this

@Constantin1489
Copy link
Contributor Author

@dgtlmoon the update_14 was a great choice! Thank you for your patience, too!

@dgtlmoon
Copy link
Owner

dgtlmoon commented Feb 6, 2024

@Constantin1489 over at - #2164 something changed in elementpath, or one of the queries in your test was actually invalid but is not valid, or something... ( tests/test_xpath_selector_unit.py )

@Constantin1489
Copy link
Contributor Author

Thank you for the report. I will contact the repo.

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Feb 6, 2024

I've just tested if (count(//hotel/branch/staff) = 5) then true() else false() and if (count(/hotel/branch/staff) = 5) then true() else false(), using other tools(basex and xidel), they returned true. I think I will create some test cases and report them to the repo to discuss. Thank you @dgtlmoon !

@dgtlmoon
Copy link
Owner

dgtlmoon commented Feb 7, 2024

@Constantin1489 amazing thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature] xpath2.0 Formatting xpath results (adding spacing etc) Support XPath 2.0
2 participants