Doesn't detect bullet points within tables #335

Closed
alroythalus opened this issue Apr 25, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@alroythalus

Site: https://stackoverflow.com/legal/privacy-policy#:~:text=We%20will%20only%20process%20your,be%20shared%20with%20other%20parties.

One of the tables has a list inside a cell, and that list content is missing from the output.

This is what trafilatura generated:
<row> <cell> <p>Marketing our services and those of selected third parties to:</p> </cell> <cell>For our legitimate interests or those of a third party, i.e., to promote our business to existing and former customers</cell> </row>

Hope this helps, thanks

@alroythalus alroythalus changed the title from "Doesnt detect bullet points within tables" to "Doesn't detect bullet points within tables" Apr 25, 2023
@adbar adbar added the enhancement New feature or request label Apr 25, 2023
@adbar
Owner

adbar commented Apr 25, 2023

I agree, it's uncommon to find bullet points in tables but it would indeed be a useful addition.

@alroythalus
Author

Do you feel this could be related to #318?

@adbar
Owner

adbar commented Apr 25, 2023

If the parent element is a table, it's a duplicate issue; otherwise, it's a problem with the extraction of nested elements.

@alroythalus
Author

This site uses bullet points in tables a lot: https://www.spotify.com/in-en/legal/privacy-policy/
Just leaving it here as another example, hope it's useful.

@adbar
Owner

adbar commented May 22, 2024

I see that the first case now appears to be solved. The second one can be addressed by focusing on recall:
favor_recall=True with Python, --recall on the command line.
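
For anyone landing here later, a minimal sketch of the recall setting in Python, plus a small stdlib check for whether any table cell in the XML output still carries a list. The helper names (extract_with_recall, cells_with_lists) and the sample strings are hypothetical, and the check assumes the XML output uses list/item tags for bullet points; adjust if your output differs.

```python
import xml.etree.ElementTree as ET


def extract_with_recall(html: str):
    """Run trafilatura with recall favored; returns XML text or None."""
    # Imported lazily so the checker below also works without trafilatura.
    import trafilatura

    return trafilatura.extract(
        html,
        output_format="xml",
        include_tables=True,
        favor_recall=True,  # the option suggested in this thread
    )


def cells_with_lists(xml_text: str) -> int:
    """Count <cell> elements that contain at least one <list> descendant."""
    root = ET.fromstring(xml_text)
    return sum(1 for cell in root.iter("cell") if cell.find(".//list") is not None)


# Shortened version of the row reported in this issue: the bullets are gone.
reported = (
    "<table><row><cell><p>Marketing our services and those of selected "
    "third parties to:</p></cell><cell>For our legitimate interests or "
    "those of a third party</cell></row></table>"
)

# Hypothetical shape of a row where the list survived (placeholder items,
# assumed tag names).
with_list = (
    "<table><row><cell><p>Marketing our services to:</p>"
    "<list><item>first item</item><item>second item</item></list>"
    "</cell><cell>Legitimate interests</cell></row></table>"
)

print(cells_with_lists(reported), cells_with_lists(with_list))
```

Passing the result of extract_with_recall(html) to cells_with_lists is a quick way to compare a page with and without favor_recall; note that extract can return None when nothing is extracted, so guard for that before parsing.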

@adbar adbar closed this as completed May 22, 2024