anchor issue #147
Comments
@pieterhartel There was a small issue here, which I fixed; the rest can be explained by the orphan text at the bottom. If you write The reason is that trailing titles at the bottom of articles are discarded during extraction, which improves extraction quality. In this particular case it does not work, but in general it is not a bug IMO.
@adbar wrote: "The reason is that trailing titles at the bottom of articles are discarded during extraction."
I get your point, but the last title in your example is followed by orphan text without a tag, so the last tag seen by the parser is
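For readers unfamiliar with how tree-based parsers treat such orphan text, here is a minimal sketch. It uses the standard library's ElementTree, which shares the text/tail model of lxml (the library Trafilatura builds on). The snippet is a hypothetical stand-in, not the example discussed in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet (an assumption, not the sample from this issue):
# a trailing title followed by text that has no enclosing tag of its own.
snippet = "<div><p>Body text.</p><h2>Last title</h2>orphan text</div>"

root = ET.fromstring(snippet)
last = root[-1]  # last child element of <div>

print(last.tag)   # h2
print(last.text)  # Last title
print(last.tail)  # orphan text
```

The orphan text does not become an element of its own; it is stored as the `tail` of the preceding `h2`, so the heading remains the last element in the tree.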
I don't know how common this is, but there are definitely pages with valuable text in h1 blocks lower in the page. One example I ran into is https://mywellself.ca/about-us. Is there any workaround to extract the text in these multiple h1 blocks and include it?
@chakravir Trafilatura tries to work in a generic way, and there is little potential for customization.
It seems that sometimes a link without an href is ignored. Consider the sample HTML below:
I would expect four occurrences of "The quick brown fox jumps over the lazy dog", with numbers 1, 2, 3 and 4. But #3 is missing:
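A check of this kind can be automated. The sketch below is not the reporter's sample. `missing_numbers` is a hypothetical helper, and it assumes each number directly follows the sentence in the extracted text. In practice `extracted` would be the string returned by `trafilatura.extract(html)`; here a stand-in string mirrors the reported result.

```python
import re

SENTENCE = "The quick brown fox jumps over the lazy dog"

def missing_numbers(extracted, expected=(1, 2, 3, 4)):
    """Return the expected numbers whose sentence is absent from the text.

    Assumes (hypothetically) that each occurrence is written as the
    sentence followed by its number, e.g. "... lazy dog 3".
    """
    pattern = re.escape(SENTENCE) + r"\s*(\d+)"
    found = {int(n) for n in re.findall(pattern, extracted)}
    return [n for n in expected if n not in found]

# Stand-in for extractor output matching the report (sentence 3 missing).
reported = "\n".join(f"{SENTENCE} {n}" for n in (1, 2, 4))
print(missing_numbers(reported))  # [3]
```

Running the same helper on the output of each release makes it easy to tell whether a fix restores the missing occurrence.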