Bug with LXML parser #28

Closed
dmoklaf opened this issue Nov 10, 2020 · 2 comments

Comments

@dmoklaf
Contributor

dmoklaf commented Nov 10, 2020

On some documents Trafilatura 0.6.0 fails with this error:

...
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/external.py", line 123, in sanitize_tree
    tree = prune_html(tree)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/htmlprocessing.py", line 63, in prune_html
    element.drop_tree()
AttributeError: 'lxml.etree._Element' object has no attribute 'drop_tree'

A git blame on this line shows that it is new code, introduced 21 days ago in this revision:
74444d2

Note: I am using the latest version of lxml (4.6.1)

@dmoklaf
Contributor Author

dmoklaf commented Nov 10, 2020

Further investigation suggests that the lxml library has two types of elements:
lxml.etree._Element (generated here by Trafilatura, but it does not have the drop_tree method)
lxml.html.HtmlElement (not used here, but it has the drop_tree method that Trafilatura calls)
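
The difference between the two element types can be checked with a short sketch. The markup string and variable names below are illustrative assumptions, not taken from Trafilatura; only the lxml calls themselves are real.

```python
from lxml import etree, html

# Illustrative markup (an assumption, not a document from the report).
markup = "<html><body><div><p>keep</p><script>drop()</script></div></body></html>"

# Parsed through lxml.etree: the root is a plain lxml.etree._Element.
plain_root = etree.fromstring(markup, parser=etree.HTMLParser())

# Parsed through lxml.html: the root is an lxml.html.HtmlElement.
html_root = html.fromstring(markup)

# Only the HtmlElement class provides drop_tree().
print(type(plain_root).__name__, hasattr(plain_root, "drop_tree"))
print(type(html_root).__name__, hasattr(html_root, "drop_tree"))
```

Which type a tree contains depends on the parser that built it, so code calling drop_tree needs to know how the tree was produced.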

@adbar
Owner

adbar commented Nov 11, 2020

Good catch, thank you!
As drop_tree results in a slight performance increase, I wrapped the call in a try/except block so that the error is handled: 7abe222
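
A minimal sketch of what such a fallback could look like follows. It is not the code from the commit above: the helper name remove_subtree and the fallback branch are assumptions made for illustration. The fallback copies the tail text manually because removing a plain lxml.etree element from its parent also discards its tail, whereas drop_tree keeps it.

```python
from lxml import etree, html


def remove_subtree(element):
    """Hypothetical helper: remove an element and its children, keeping tail text."""
    try:
        # Fast path: only available on lxml.html.HtmlElement.
        element.drop_tree()
    except AttributeError:
        # Fallback for plain lxml.etree._Element objects.
        parent = element.getparent()
        if parent is None:
            return  # a root element cannot be removed from its parent
        if element.tail:
            previous = element.getprevious()
            if previous is not None:
                # Attach the tail to the preceding sibling.
                previous.tail = (previous.tail or "") + element.tail
            else:
                # No preceding sibling: attach the tail to the parent's text.
                parent.text = (parent.text or "") + element.tail
        parent.remove(element)


# Usage with a plain etree element, which takes the fallback branch.
root = etree.fromstring("<div><p>keep</p><script>x</script>tail</div>")
remove_subtree(root.find("script"))
print(etree.tostring(root))
```

The same helper takes the fast path when the tree was built with lxml.html, so callers do not need to check the element type themselves.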

@adbar adbar closed this as completed Nov 23, 2020