Rewrote LinkParseFilter + added XPathFilter + tests for JSoupFilters #953
The parsing bolts can become the bottleneck on some crawls if the configuration contains ParseFilters that require XPath.
This is because XPath evaluation triggers a conversion of the JSoup document into a DOM representation, and that conversion can be costly.
We recently added JSoupFilters (#847) so that ParseFilters can be implemented directly on the JSoup documents. This PR uses the Xsoup library so that we can keep similar XPath-based extraction patterns without the costly conversion to DOM.
Please note that there are slight differences in the way the patterns should be written, see https://github.com/code4craft/xsoup#syntax-supported
For instance, if the text content of a node should be used as value, the pattern should end in /text().
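To illustrate what a pattern ending in `/text()` (or in an attribute such as `/@href`) selects, here is a minimal sketch. Note that it deliberately uses the JDK's built-in `javax.xml.xpath` engine rather than Xsoup, so that it runs without extra dependencies; the class name `TextPatternSketch`, the helper `extract` and the sample page are hypothetical. Xsoup's own behaviour should be checked against the syntax page linked above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration class, not part of this PR.
class TextPatternSketch {

    // Evaluates an XPath pattern against a well-formed XHTML snippet
    // using the JDK XPath engine (an assumption: NOT Xsoup) and returns
    // the string value of the first matching node.
    static String extract(String xhtml, String pattern) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(pattern, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page
        String page = "<html><head><title>Hello</title></head>"
                + "<body><a href=\"https://example.com/\">link</a></body></html>";

        // Pattern ending in /text(): selects the text node of <title>
        System.out.println(extract(page, "//title/text()"));

        // Pattern ending in an attribute: selects the href value
        System.out.println(extract(page, "//a/@href"));
    }
}
```

Patterns written this way state explicitly which part of the node is used as the value, which is the form the new filters expect.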
We have noticed substantial improvements to the speed of the parsing without any losses in content.
This also adds some unit tests for JSoupFilters.
I will modify the configuration of the parse filters in the archetypes so that future users get them by default.
This PR is generously donated by @sam-ulrich1 and https://www.gagepiracy.com/.