Rewrote LinkParseFilter + added XPathFilter + tests for JSoupFilters #953

Merged on Feb 22, 2022 (4 commits)

@jnioche commented Feb 18, 2022

The parsing bolts can become the bottleneck on some crawls if parse filters requiring XPath are in the configuration.
This is because they trigger a conversion from the JSoup document to a DOM representation, which is then queried with XPath. That conversion can be costly.
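
To make the DOM-based route concrete, here is a minimal sketch that builds a W3C DOM tree and then runs an XPath expression against it, using only the JDK's `javax.xml` APIs. It is an illustration under assumptions, not StormCrawler's code: the class name `DomXPathSketch` and the method `firstMatch` are hypothetical, and the DOM is built from a well-formed string as a stand-in for the JSoup-to-DOM conversion described above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of StormCrawler.
public class DomXPathSketch {

    /**
     * Builds a DOM tree from well-formed markup, evaluates the XPath
     * expression against it and returns the text of the first matching
     * node, or null when nothing matches.
     */
    static String firstMatch(String xhtml, String expression) {
        try {
            // Step 1: materialise a full DOM tree (the extra representation
            // that the PR description identifies as costly).
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Step 2: evaluate the XPath expression on that tree.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);

            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Example</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(firstMatch(page, "//title"));
        System.out.println(firstMatch(page, "//p/text()"));
    }
}
```

Every document processed this way pays for both the tree construction and the query, which is why skipping the first step can matter on large crawls.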

We recently added JSoupFilters (#847) so that we can implement ParseFilters directly on the JSoup documents. This PR uses the Xsoup library so that we can keep similar XPath-based extraction patterns without needing the costly conversion to DOM.

Please note that there are slight differences in the way the patterns should be written, see https://github.com/code4craft/xsoup#syntax-supported

For instance, if the text content of a node should be used as value, the pattern should end in /text().

We have noticed substantial improvements in parsing speed without any loss of content.

This also adds some unit tests for JSoupFilters.

I will modify the configuration of the parse filters in the archetypes so that future users get them by default.

This PR is generously donated by @sam-ulrich1 and https://www.gagepiracy.com/.

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
@jnioche added this to the 2.3 milestone Feb 18, 2022
@jnioche self-assigned this Feb 18, 2022

protected final Map<String, List<LabelledExpression>> expressions = new HashMap<>();

class LabelledExpression {

ClassCanBeStatic: Inner class is non-static but does not reference enclosing class (details)
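
As a sketch of what this warning asks for, the snippet below declares the nested class as `static`, so its instances do not carry a hidden reference to the enclosing object. Only the `expressions` field and the `LabelledExpression` name come from the diff; the enclosing class name `StaticNestedSketch`, the two fields, the constructor and the `add` helper are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical enclosing class standing in for the filter under review.
public class StaticNestedSketch {

    protected final Map<String, List<LabelledExpression>> expressions = new HashMap<>();

    // 'static' removes the implicit reference to the enclosing instance,
    // which this class never uses.
    static class LabelledExpression {
        final String key;        // hypothetical field
        final String expression; // hypothetical field

        LabelledExpression(String key, String expression) {
            this.key = key;
            this.expression = expression;
        }
    }

    // Hypothetical helper: registers an expression under a metadata key.
    void add(String key, String expression) {
        expressions.computeIfAbsent(key, k -> new ArrayList<>())
                .add(new LabelledExpression(key, expression));
    }

    public static void main(String[] args) {
        // A static nested class can be instantiated without an enclosing instance.
        LabelledExpression e = new LabelledExpression("title", "//title/text()");
        System.out.println(e.key + " -> " + e.expression);
    }
}
```

Besides silencing the check, this avoids keeping the outer object reachable through every `LabelledExpression` stored in the map.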

@jnioche merged commit 76cf0fa into master Feb 22, 2022
@jnioche deleted the xsoup branch February 22, 2022 12:13