anchor issue #147

pieterhartel · 2021-11-18T12:27:59Z

It seems that a link without an href is sometimes ignored. Consider the sample HTML below:

$ cat anchor.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
</head>
<body>

<h1>FOO.</h1>
<p><strong>FOO!</strong></p>
<p>BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - <a href="http://peyueomdqxfjxtpg.onion">http://peyueomdqxfjxtpg.onion</a> Please bookmark us.</p>

    <h1>The quick brown fox jumps over the lazy dog  1</h1>
    <a>The quick brown fox jumps over the lazy dog  2</a>
    <h1><a>The quick brown fox jumps over the lazy dog  3</a></h1>
    The quick brown fox jumps over the lazy dog  4
Lorem ipsum
</body></html>

I would expect four occurrences of "The quick brown fox jumps over the lazy dog", numbered 1, 2, 3 and 4. But #3 is missing:

$ trafilatura --json --links <anchor.html 
{"title": "FOO.", "author": null, "hostname": null, "date": null, "categories": "", "tags": "",
"fingerprint": "O0AByIFzTc/NCqx2cgJPXyjnK3s=", "id": null, "license": null,
"raw-text": "FOO. FOO! BE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us. The quick brown fox jumps over the lazy dog 1 The quick brown fox jumps over the lazy dog 2 The quick brown fox jumps over the lazy dog 4 Lorem ipsum",
"source": null, "source-hostname": null, "excerpt": null,
"text": "FOO.\nFOO!\nBE AWARE OF SCAMMERS WHO COPY OUR SITE! This is our one and ONLY site - http://peyueomdqxfjxtpg.onion Please bookmark us.\nThe quick brown fox jumps over the lazy dog 1\nThe quick brown fox jumps over the lazy dog 2\nThe quick brown fox jumps over the lazy dog 4\nLorem ipsum",
"comments": ""}

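The gap can also be checked mechanically. The following is a minimal sketch, not part of trafilatura: `missing_sentences` is a hypothetical helper, and `TEXT` is the tail of the "text" field copied from the output above.

```python
import re

# Tail of the "text" field from the JSON output above.
TEXT = (
    "The quick brown fox jumps over the lazy dog 1\n"
    "The quick brown fox jumps over the lazy dog 2\n"
    "The quick brown fox jumps over the lazy dog 4\n"
    "Lorem ipsum"
)


def missing_sentences(text, expected=range(1, 5)):
    """Return the expected sentence numbers that do not occur in `text`."""
    # Collect the number that follows each "lazy dog" occurrence.
    found = {int(n) for n in re.findall(r"lazy dog\s+(\d+)", text)}
    return [n for n in expected if n not in found]


print(missing_sentences(TEXT))
```

The same helper can be pointed at the "raw-text" field, which shows the same gap in the output above.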

adbar · 2022-01-28T17:10:41Z

@pieterhartel There was a small issue here, which I fixed; the rest can be explained by the orphan text at the bottom. If you write <p>The quick brown fox jumps over the lazy dog 4</p>, you will see it in the output.

The reason is that trailing titles at the bottom of articles are discarded during extraction, which improves extraction quality in general. It does not work well in this particular case, but it is not a bug IMO.
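For pages you do not control, the `<p>` wrapping suggested above could be applied as a pre-processing step before extraction. This is only a sketch using Python's standard `html.parser`; `OrphanWrapper` and `wrap_orphans` are hypothetical names, not trafilatura functions, and it assumes well-formed markup like the sample. Whether the rewritten HTML changes the result in a given trafilatura version is not verified here.

```python
from html import escape
from html.parser import HTMLParser


class OrphanWrapper(HTMLParser):
    """Re-emit HTML, wrapping bare text directly under <body> in <p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open tags
        self.out = []    # output fragments

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never enter the stack.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (assumes well-formed input).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        orphan = bool(self.stack) and self.stack[-1] == "body" and data.strip()
        if orphan:
            self.out.append(f"<p>{escape(data.strip(), quote=False)}</p>")
        else:
            self.out.append(escape(data, quote=False))


def wrap_orphans(html):
    parser = OrphanWrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(wrap_orphans("<html><body><h1><a>Fox 3</a></h1>\nFox 4\n</body></html>"))
```

The sketch drops whitespace around the wrapped text and does not handle unclosed void elements such as a bare `<br>`, so treat it as a starting point only.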

pieterhartel · 2022-01-30T15:43:36Z

@adbar wrote "The reason is that trailing titles at the bottom of articles are discarded during extraction".
I don't think that this is the case. There is text following the <h1> element in the example. I installed the latest version of trafilatura and added more text after the <h1> in the example, and the title still does not show up.

adbar · 2022-02-21T12:44:29Z

I get your point, but the last title in your example is followed by orphan text without a tag, so the last tag seen by the parser is <h1>.
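To see what "orphan text without a tag" means for the sample, here is a small sketch with Python's standard `html.parser` (not trafilatura's own parser, which may behave differently). `TextContext` and `text_contexts` are hypothetical names, and `SAMPLE` is a shortened version of the HTML from the report. It records the open tags around each piece of text.

```python
from html.parser import HTMLParser

# Shortened version of the sample from the report.
SAMPLE = """<html><head></head><body>
<h1>Fox 1</h1>
<a>Fox 2</a>
<h1><a>Fox 3</a></h1>
Fox 4
</body></html>"""


class TextContext(HTMLParser):
    """Record each non-blank text node with the stack of open tags around it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (the sample is well-formed).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((text, tuple(self.stack)))


def text_contexts(html):
    parser = TextContext()
    parser.feed(html)
    parser.close()
    return parser.found


for text, tags in text_contexts(SAMPLE):
    print(text, "->", " > ".join(tags))
```

In this sketch "Fox 4" sits directly under `body` with no element of its own, so the last element before the end of the body is the `<h1>` holding "Fox 3".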

chakravir · 2023-09-11T02:44:44Z

I don't know how common this is, but there are definitely pages with valuable text in h1 blocks lower in the page. One example I ran into: https://mywellself.ca/about-us

Is there any workaround to extract the text in these multiple h1 blocks and include it?

adbar · 2023-10-10T12:03:40Z

@chakravir Trafilatura tries to work in a generic way, and there is little room for customization.

adbar added the bug label Nov 18, 2021

adbar added a commit that referenced this issue Jan 28, 2022

fix: nested elements in titles (#147)

4c88205

adbar added the wontfix label Jan 28, 2022

adbar closed this as not planned Apr 9, 2024
