Cannot extract heading correctly in a list #309
Comments
Hi @fortyfourforty, I'm not sure what you mean and I couldn't find an example. Could you please provide a concrete page from
Hi @adbar, this page for example:
Thanks, I get it, but there is a certain level of nesting involved... Not sure it's the priority right now, but let's keep track of the bug.
Actually, many very large sites do this for SEO purposes: they put h2 tags in a list to perform better on Google search result pages (rich snippets, jump links). This could be on your priority list IMO.
Yes, thankfully it was easy to fix.
The problem:
If a heading tag is wrapped in a list item, Trafilatura cannot extract the heading correctly in XML format. It extracts the heading as a plain <head> element but cannot identify it as an h2 or h3, as in <head rend="h2">. Sites such as thespruce.com and similar sites often wrap h2/h3 tags in an ol (ordered list) to display the headings in list format.
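
To make the report concrete, here is a minimal sketch of the markup pattern and the output the issue asks for. It does not call Trafilatura and is not how the library handles this internally; the sample markup and the function name `heads_from_list_items` are made up for illustration, and the sketch only looks at headings that are direct children of a list item.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: headings wrapped in ordered-list items,
# similar to the pattern described in this issue.
SAMPLE = """
<ol>
  <li><h2>First tip</h2><p>Some text.</p></li>
  <li><h3>Second tip</h3><p>More text.</p></li>
</ol>
"""

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def heads_from_list_items(markup):
    """Return serialized <head rend="hN"> elements for headings in <li> items.

    Only direct children of each <li> are inspected, so deeper nesting
    is deliberately out of scope for this sketch.
    """
    root = ET.fromstring(markup)
    heads = []
    for item in root.iter("li"):
        for child in item:
            if child.tag in HEADING_TAGS:
                # Keep the heading level in the rend attribute.
                head = ET.Element("head", rend=child.tag)
                head.text = (child.text or "").strip()
                heads.append(ET.tostring(head, encoding="unicode"))
    return heads


if __name__ == "__main__":
    for line in heads_from_list_items(SAMPLE):
        print(line)
    # prints:
    # <head rend="h2">First tip</head>
    # <head rend="h3">Second tip</head>
```

The point is only that the heading level should survive in the rend attribute even when the heading sits inside a list item, rather than collapsing to a bare <head>.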