Simple HTML processing issue? #337

Closed
alroythalus opened this issue Apr 26, 2023 · 4 comments
Labels
question Further information is requested

alroythalus commented Apr 26, 2023

For this simple piece of HTML:

<!DOCTYPE html>
<html lang="en">
<body>
        <h1>4. How Long We Keep df Information</h1>
</body>
</html>

This is the Trafilatura output:

<doc title="4. How Long We Keep df Information" categories="" tags="" fingerprint="g92davQ/jL7gis2YFBcyHhRLd2s=">
  <main>
    <p>4. How Long We Keep df Information</p>
  </main>
</doc>

It converts the <h1> tag to a <p> tag for some reason. Is this intended? Just curious.
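
A minimal sketch for reproducing and inspecting the report above. It assumes Trafilatura is installed and that `extract(..., output_format="xml")` is the call that produced the XML. The thread does not say how the output was generated, so that part is a guess and is skipped if the library is missing. The helper `main_child_tags` is a hypothetical name introduced here, not part of Trafilatura; it only lists which elements ended up under <main>.

```python
import xml.etree.ElementTree as ET

# Input HTML exactly as posted in this issue.
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
<body>
        <h1>4. How Long We Keep df Information</h1>
</body>
</html>"""

# XML output exactly as reported in this issue.
REPORTED_XML = """<doc title="4. How Long We Keep df Information" categories="" tags="" fingerprint="g92davQ/jL7gis2YFBcyHhRLd2s=">
  <main>
    <p>4. How Long We Keep df Information</p>
  </main>
</doc>"""


def main_child_tags(xml_text):
    """Return the tag names of the direct children of <main> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    main = root.find("main")
    return [] if main is None else [child.tag for child in main]


def extract_xml(html):
    """Run Trafilatura on the HTML; None if it is not installed or yields nothing."""
    try:
        from trafilatura import extract  # assumption: this is the call used in the report
    except ImportError:
        return None
    return extract(html, output_format="xml")


if __name__ == "__main__":
    # The reported output holds a single <p> where the <h1> was.
    print(main_child_tags(REPORTED_XML))  # ['p']
    result = extract_xml(SAMPLE_HTML)
    if result is not None:
        # May differ between Trafilatura versions; no output is assumed here.
        print(main_child_tags(result))
```

Comparing the tag list from a local run against the reported one shows whether the heading is still flattened to a paragraph in the installed version.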

adbar added the question label Apr 26, 2023

adbar commented Apr 26, 2023

It can be explained by various document conversion issues, notably conformity with the XML TEI standard.

alroythalus commented

Is this a problem that needs to be solved, or is it expected not to work for bare-bones, simple HTML?


adbar commented Apr 27, 2023

No, it's not an issue. I'll close the thread.

adbar closed this as not planned Apr 27, 2023
alroythalus commented

Sure. I really appreciate the effort and time you put into this library. I personally love it too.
