anchor issue #147

Closed
pieterhartel opened this issue Nov 18, 2021 · 5 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments

@pieterhartel

It seems that sometimes a link without an href is ignored. Consider the sample HTML below:

$ cat anchor.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
</head>
<body>

<h1>FOO.</h1>
<p><strong>FOO!</strong></p>
<p>BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - <a href="http://peyueomdqxfjxtpg.onion">http://peyueomdqxfjxtpg.onion</a> Please bookmark us.</p>

    <h1>The quick brown fox jumps over the lazy dog  1</h1>
    <a>The quick brown fox jumps over the lazy dog  2</a>
    <h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
    The quick brown fox jumps over the lazy dog  4
Lorem ipsum
</body></html>

I would expect four occurrences of "The quick brown fox jumps over the lazy dog", numbered 1, 2, 3 and 4, but #3 is missing:

$ trafilatura --json --links <anchor.html 
{"title": "FOO.", "author": null, "hostname": null, "date": null, "categories": "", "tags": "",
"fingerprint": "O0AByIFzTc/NCqx2cgJPXyjnK3s=", "id": null, "license": null,
"raw-text": "FOO. FOO! BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us. The quick brown fox jumps over the lazy dog 1 The quick brown fox jumps over the lazy dog 2 The quick brown fox jumps over the lazy dog 4 Lorem ipsum",
"source": null, "source-hostname": null, "excerpt": null,
"text": "FOO.\nFOO!\nBE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us.\nThe quick brown fox jumps over the lazy dog 1\nThe quick brown fox jumps over the lazy dog 2\nThe quick brown fox jumps over the lazy dog 4\nLorem ipsum",
"comments": ""}
adbar added the bug label Nov 18, 2021
adbar added a commit that referenced this issue Jan 28, 2022
@adbar
Owner

adbar commented Jan 28, 2022

@pieterhartel There was a small issue here, which I fixed. The rest can be explained by the orphan text at the bottom: if you write <p>The quick brown fox jumps over the lazy dog 4</p>, then you will see it in the output.

The reason is that trailing titles at the bottom of articles are discarded during extraction, which improves extraction quality. In this particular case it does not work, but in general it is not a bug IMO.
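
As a rough illustration of the <p> suggestion above, the sketch below wraps orphan text in a paragraph before the HTML is handed to an extractor. It uses only Python's standard library, so it assumes well-formed markup such as a trimmed copy of the sample from this issue. `SAMPLE` and `wrap_orphan_text` are hypothetical names and not part of trafilatura, and the effect on trafilatura's output has not been verified here.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample from this issue (assumption:
# DOCTYPE and namespace removed so ElementTree tag names stay simple).
SAMPLE = """<html><body>
<h1>The quick brown fox jumps over the lazy dog  1</h1>
<a>The quick brown fox jumps over the lazy dog  2</a>
<h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
The quick brown fox jumps over the lazy dog  4
Lorem ipsum
</body></html>"""


def wrap_orphan_text(body):
    """Wrap bare text that follows a child of `body` in a new <p> element."""
    # Walk the children in reverse so that inserting a new element
    # does not shift the indices of children still to be visited.
    for index, child in reversed(list(enumerate(body))):
        tail = (child.tail or "").strip()
        if tail:
            paragraph = ET.Element("p")
            paragraph.text = tail
            paragraph.tail = "\n"
            child.tail = "\n"
            body.insert(index + 1, paragraph)
    return body


root = ET.fromstring(SAMPLE)
body = wrap_orphan_text(root.find("body"))
fixed_html = ET.tostring(root, encoding="unicode")
last_tag = list(body)[-1].tag  # the final heading is now followed by a <p>
```

Text that appears before the first child of <body> is not handled by this sketch, and real-world HTML that is not well-formed XML would need a more tolerant parser.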

adbar added the wontfix label Jan 28, 2022
@pieterhartel
Author

@adbar wrote "The reason is that trailing titles at the bottom of articles are discarded during extraction".
I don't think that this is the case. There is text following the <h1> element in the example. I installed the latest version of trafilatura and in the example added more text after the <h1> and the title still does not show up.

@adbar
Owner

adbar commented Feb 21, 2022

I get your point, but the last title in your example is followed by orphan text without a tag, so the last tag seen by the parser is <h1>.
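
To check whether a page has this shape, a small diagnostic along the following lines may help. It is only a sketch built on Python's standard `html.parser`; it does not reproduce trafilatura's internal logic. The class name `OrphanTextFinder`, the `VOID_TAGS` set, and the trimmed `SAMPLE` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements without a closing tag; skipped so the open-tag stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OrphanTextFinder(HTMLParser):
    """Record text placed directly inside <body> and the tag that precedes it."""

    def __init__(self):
        super().__init__()
        self.stack = []              # currently open elements
        self.last_body_child = None  # most recent direct child of <body>
        self.orphans = []            # (preceding tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if self.stack and self.stack[-1] == "body":
            self.last_body_child = tag
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack and self.stack[-1] == "body":
            self.orphans.append((self.last_body_child, text))


SAMPLE = """<html><body>
<h1>The quick brown fox jumps over the lazy dog  1</h1>
<a>The quick brown fox jumps over the lazy dog  2</a>
<h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
The quick brown fox jumps over the lazy dog  4
Lorem ipsum
</body></html>"""

finder = OrphanTextFinder()
finder.feed(SAMPLE)
finder.close()
orphans = finder.orphans
```

For the sample, `orphans` holds a single entry whose preceding tag is `h1`, which matches the description above: the untagged text comes after the last tag, and that tag is a heading.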

@chakravir

chakravir commented Sep 11, 2023

I don't know how common this is, but there are definitely pages where there is valuable text in h1 blocks lower in the page. An example I ran into is https://mywellself.ca/about-us.

Is there any workaround to extract the info in these multiple h1 blocks and include it?

@adbar
Owner

adbar commented Oct 10, 2023

@chakravir Trafilatura tries to work in a generic way, and there is little room for customization.
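
One possible workaround outside the library, offered only as a sketch: collect the h1 text separately with Python's standard `html.parser` and compare it with whatever the extractor returned, so that dropped headings can be added back by hand. `HeadingCollector`, `collect_headings` and `missing_headings` are hypothetical helpers, not trafilatura functions. `SAMPLE` is a trimmed copy of the HTML from this issue and `EXTRACTED` is shortened from the "text" field reported above.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the whitespace-normalized text of each heading element."""

    def __init__(self, heading_tags=("h1",)):
        super().__init__()
        self.heading_tags = set(heading_tags)
        self.depth = 0      # > 0 while inside a heading
        self.parts = []     # text fragments of the current heading
        self.headings = []  # finished heading texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.heading_tags:
            if self.depth == 0:
                self.parts = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.heading_tags and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.parts).split())
                if text:
                    self.headings.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def collect_headings(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


def missing_headings(html, extracted_text):
    """Return headings found in the HTML but absent from the extracted text."""
    return [h for h in collect_headings(html) if h not in extracted_text]


SAMPLE = """<html><body>
<h1>The quick brown fox jumps over the lazy dog  1</h1>
<a>The quick brown fox jumps over the lazy dog  2</a>
<h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
The quick brown fox jumps over the lazy dog  4
</body></html>"""

# Shortened from the "text" field of the output reported in this issue.
EXTRACTED = (
    "The quick brown fox jumps over the lazy dog 1\n"
    "The quick brown fox jumps over the lazy dog 2\n"
    "The quick brown fox jumps over the lazy dog 4"
)

all_headings = collect_headings(SAMPLE)
dropped = missing_headings(SAMPLE, EXTRACTED)
```

The comparison is a plain substring match on normalized whitespace, so it will miss headings that the extractor rewrote in any other way.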

adbar closed this as not planned Apr 9, 2024