Paras get broken up into fragments #334

Closed
alroythalus opened this issue Apr 24, 2023 · 3 comments

Comments

@alroythalus

In the case of https://www.kia.com/us/en/privacy:

  • Headers do not get detected, maybe because of the class attached to every header.
    [Screenshot (548)]

  • The table on the page isn't represented by trafilatura as a table but as p tags.

@adbar
Owner

adbar commented Apr 24, 2023

Hi @alroythalus, I see two different issues here:

  1. The first case is not a header but a bold paragraph. Do you mean that it isn't in the extracted content?
  2. Do you mean the table that follows? In fact it isn't used as a table with columns, so the fact that it's treated as paragraphs isn't a problem, or is it?
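
For anyone who wants to check the first point on their own page, here is a minimal sketch that separates real heading tags from bold paragraphs, using only Python's standard library. It does not reproduce trafilatura's own logic, and the `SAMPLE` markup and the `HeadingAudit` and `audit_headings` names are hypothetical; the snippet is not the actual kia.com HTML.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}


class HeadingAudit(HTMLParser):
    """Collect text inside real heading tags and inside bold runs within <p>."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.headings = []         # text found inside <h1>..<h6>
        self.bold_paragraphs = []  # text found inside <b>/<strong> within <p>

    def handle_starttag(self, tag, attrs):
        # Attributes such as class are deliberately ignored here.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if any(t in HEADING_TAGS for t in self.stack):
            self.headings.append(text)
        elif "p" in self.stack and any(t in BOLD_TAGS for t in self.stack):
            self.bold_paragraphs.append(text)


def audit_headings(html):
    """Return (headings, bold_paragraphs) found in an HTML string."""
    parser = HeadingAudit()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.bold_paragraphs


# Hypothetical markup: one real heading and one bold paragraph that looks like one.
SAMPLE = """
<h2 class="title">Privacy Policy</h2>
<p><b class="title">Information We Collect</b></p>
<p>We collect information you provide.</p>
"""

headings, bold_paragraphs = audit_headings(SAMPLE)
print(headings)         # ['Privacy Policy']
print(bold_paragraphs)  # ['Information We Collect']
```

If the text that looks like a header shows up in the second list, the page marks it up as a bold paragraph rather than as a heading, whatever class it carries.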

@alroythalus
Author

  • Regarding the header: it gets extracted, but as a paragraph, even though on the page it is intended to be a header.

  • Regarding the table: it doesn't get handled ideally, and the HTML of the page does define it as a table.
    That's why I was wondering why the table gets converted to p tags.
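
Whether a `<table>` in the source is actually used as a table with columns, as asked earlier in the thread, can be checked by counting cells per row. The sketch below uses only the standard library and is an illustration under stated assumptions: `TableShape`, `row_widths`, and the `SAMPLE` markup are hypothetical names and data, not the real page, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Record the number of cells in each table row encountered."""

    def __init__(self):
        super().__init__()
        self.rows = []      # cell count per completed row
        self._cells = None  # running count for the open row, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.close_row()  # tolerate a missing </tr>
            self._cells = 0
        elif tag in ("td", "th") and self._cells is not None:
            self._cells += 1

    def handle_endtag(self, tag):
        if tag in ("tr", "table"):
            self.close_row()

    def close_row(self):
        if self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None


def row_widths(html):
    """Return the list of cell counts, one entry per table row."""
    parser = TableShape()
    parser.feed(html)
    parser.close()
    parser.close_row()  # flush a row left open at end of input
    return parser.rows


# Hypothetical two-column table, loosely in the style of a privacy notice.
SAMPLE = """
<table>
  <tr><td>Identifiers</td><td>Name, email address</td></tr>
  <tr><td>Commercial information</td><td>Purchase records</td></tr>
</table>
"""

widths = row_widths(SAMPLE)
print(widths)                      # [2, 2]
print(max(widths, default=0) > 1)  # True
```

A result where every row has a single cell would suggest the table is only used for layout, in which case paragraph output loses little. Rows with several cells are the case where comparing against `trafilatura.extract(..., output_format="xml")` would be worthwhile.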

@adbar
Owner

adbar commented Apr 24, 2023

I'm afraid it's related to issue #333, which you also filed: the package makes no assumptions as to the nature of the text segments.

@adbar closed this as not planned on Apr 25, 2023