Trafilatura is skipping ordered list elements (ol tag) #10
Thanks for reporting the issue, I'll look into it. Do you have an example for result precision?
I meant that I'm not sure whether you want issues concerning the precision of the extracted content to be reported here. Anyway, I did some further manual tests. Here are the issues I found:
Thanks
Thank you for your feedback! I was able to solve some of the problems you mentioned; others cannot be treated ad hoc without destabilizing the whole structure. The improvements will ship very soon with version Do you have other examples of extraction issues, especially pages where the main content is missing?
The extraction has been significantly improved, including for the web pages you mentioned. I'm closing the issue for now.
Trafilatura is not including the ordered list (ol tag) on the following page, e.g. the list whose first element starts with "Der Arbeitgeber finanziert Ihre bAV allein":
https://www.finanztip.de/betriebliche-altersvorsorge/
It's also skipping the h3 titles on the aforementioned page.
PS: Not sure if you want issues with the result precision to be reported here?
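In case it helps with reproducing this, here is a minimal sketch of how the missing pieces could be checked. It uses only Python's standard `html.parser` to collect the text of `<ol>` items and `<h3>` headings from the raw HTML and reports which of them do not appear in the extracted text. The names `ListAndHeadingCollector` and `missing_fragments` are made up for illustration, the sample markup is invented (it is not the real page), and the extracted text is assumed to come from whatever extraction call you are testing.

```python
from html.parser import HTMLParser


class ListAndHeadingCollector(HTMLParser):
    """Collect the text of <li> elements inside <ol> and of <h3> headings."""

    def __init__(self):
        super().__init__()
        self.ol_depth = 0      # > 0 while the parser is inside an <ol>
        self.current = None    # "li" or "h3" while text is being captured
        self.buffer = []
        self.items = []        # text of list items found inside <ol>
        self.headings = []     # text of <h3> headings

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.ol_depth += 1
        elif tag == "li" and self.ol_depth > 0:
            self.current, self.buffer = "li", []
        elif tag == "h3":
            self.current, self.buffer = "h3", []

    def handle_endtag(self, tag):
        if tag == "ol" and self.ol_depth > 0:
            self.ol_depth -= 1
        elif tag == self.current:
            # Collapse whitespace so the comparison is not layout-sensitive.
            text = " ".join("".join(self.buffer).split())
            (self.items if tag == "li" else self.headings).append(text)
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)


def missing_fragments(html, extracted_text):
    """Return the <ol> items and <h3> headings absent from extracted_text."""
    collector = ListAndHeadingCollector()
    collector.feed(html)
    normalized = " ".join(extracted_text.split())
    return [text for text in collector.items + collector.headings
            if text and text not in normalized]


# Invented sample markup and extractor output, for illustration only.
sample_html = """
<h3>Example heading</h3>
<p>Some introductory paragraph.</p>
<ol>
  <li>Der Arbeitgeber finanziert Ihre bAV allein</li>
  <li>Another list item</li>
</ol>
"""
sample_extracted = "Some introductory paragraph."
print(missing_fragments(sample_html, sample_extracted))
```

This is only a rough check: it assumes list items are properly closed and does plain substring matching, so it can miss cases where the extractor rewrites whitespace or punctuation inside an item.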