Cannot extract heading correctly in a list #309

Closed
fortyfourforty opened this issue Feb 24, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@fortyfourforty

The problem:
If a heading tag is wrapped in a list item, Trafilatura does not extract the heading correctly in XML format. It emits the heading as <head> but loses the level, so h2 or h3 is not identified the way <head rend="h2"> would. Sites such as thespruce.com often wrap h2/h3 tags in an ordered list (ol) to present the headings as a list.
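
To make the markup pattern concrete, here is a minimal sketch using only the Python standard library. It does not call Trafilatura and is not how Trafilatura works internally; the sample HTML, `HeadingCollector`, and `collect_headings` are hypothetical names made up for illustration. It only shows that the heading level is still recoverable when the heading sits inside an `<li>`, which is the information the `rend` attribute is expected to carry.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the pattern described above:
# each h2 sits inside an <li> of an ordered list.
SAMPLE_HTML = """
<ol>
  <li><h2>Single-Pole Switch</h2><p>Controls one fixture.</p></li>
  <li><h2>Three-Way Switch</h2><p>Controls a fixture from two places.</p></li>
</ol>
<h3>Summary</h3>
"""

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (tag, text, inside_list_item) for every heading."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0    # number of currently open <li> elements
        self.current = None  # heading tag currently open, if any
        self.buffer = []     # text fragments of the open heading
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag in HEADING_TAGS:
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1
        elif tag == self.current:
            text = "".join(self.buffer).strip()
            self.headings.append((tag, text, self.li_depth > 0))
            self.current = None

    def handle_data(self, data):
        # Only keep text that belongs to an open heading.
        if self.current is not None:
            self.buffer.append(data)


def collect_headings(html):
    """Return the headings found in an HTML string, in document order."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


headings = collect_headings(SAMPLE_HTML)
for level, text, in_list in headings:
    # The expected XML would carry the level, e.g. <head rend="h2">.
    print(f'<head rend="{level}">{text}</head>  (inside <li>: {in_list})')
```

The two nested headings are reported as `h2` with the list flag set, while the trailing `h3` is reported as outside any list item.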

@adbar
Owner

adbar commented Feb 24, 2023

Hi @fortyfourforty, I'm not sure what you mean and I couldn't find an example.

Could you please provide a concrete page from thespruce.com?

@fortyfourforty
Author

Hi @adbar, this page for example: https://www.thespruce.com/types-of-electrical-switches-in-the-home-1824672 . You can see that all the light switch types are wrapped in an ordered list. If you extract the page to XML, you do not get the exact h2 level, only a generic heading tag.

@adbar added the bug label Feb 24, 2023
@adbar
Owner

adbar commented Feb 24, 2023

Thanks, I get it, but there is a certain level of nesting involved... Not sure it's the priority right now, but let's keep track of the bug.

@fortyfourforty
Author

fortyfourforty commented Feb 24, 2023

Actually, many very large sites do this for SEO purposes: they put h2 tags in a list to perform better on Google search result pages (rich snippets / jump links). This could be on your priority list IMO.

@adbar closed this as completed in 34b6960 Feb 24, 2023
@adbar
Owner

adbar commented Feb 24, 2023

Yes, thankfully it was easy to fix.
