Consider #hash links in link density
mozilla/readability@3c83389 (not all these changes work as expected, so one section has been commented out for now)
fivefilters committed Aug 25, 2021
1 parent 84a5622 commit 1c66246
Showing 6 changed files with 57 additions and 10 deletions.
10 changes: 6 additions & 4 deletions src/Nodes/NodeTrait.php
@@ -238,19 +238,21 @@ public function getAllLinks()
*/
public function getLinkDensity()
{
$linkLength = 0;
$textLength = mb_strlen($this->getTextContent(true));

if (!$textLength) {
if ($textLength === 0) {
return 0;
}

$linkLength = 0;

$links = $this->getAllLinks();

if ($links) {
/** @var DOMElement $link */
foreach ($links as $link) {
$linkLength += mb_strlen($link->getTextContent(true));
$href = $link->getAttribute('href');
$coefficient = ($href && preg_match(NodeUtility::$regexps['hashUrl'], $href)) ? 0.3 : 1;
$linkLength += mb_strlen($link->getTextContent(true)) * $coefficient;
}
}

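A minimal standalone sketch of the weighting in the hunk above, not the library's code: links whose `href` is a same-page fragment contribute only 30% of their text length to the link total. The function name `weightedLinkDensity`, the array-based link representation, and the `/^#.+/` pattern are assumptions made for illustration (the real pattern lives in `NodeUtility::$regexps['hashUrl']` and is not shown in this diff). The final division is also an assumption, since the return statement falls outside the hunk.

```php
<?php

/**
 * Hypothetical helper illustrating the hash-link weighting; not part of the library.
 *
 * @param string $text  All text inside the node.
 * @param array<int, array{href: string, text: string}> $links  Links found in the node.
 */
function weightedLinkDensity(string $text, array $links): float
{
    $textLength = mb_strlen($text);

    // An empty node has no density to speak of.
    if ($textLength === 0) {
        return 0.0;
    }

    $linkLength = 0.0;

    foreach ($links as $link) {
        // Assumed pattern: an href that starts with '#' followed by at least one character.
        $isHashLink = $link['href'] !== '' && preg_match('/^#.+/', $link['href']) === 1;

        // Same-page links count for 30% of their text length, all others in full.
        $coefficient = $isHashLink ? 0.3 : 1.0;
        $linkLength += mb_strlen($link['text']) * $coefficient;
    }

    // Assumed: density is the weighted link length divided by the total text length.
    return $linkLength / $textLength;
}
```

With ten characters of text, a five-character `#hash` link gives roughly 0.15 where an external link of the same length gives 0.5, so blocks of in-page navigation are less likely to be judged link-heavy.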
21 changes: 19 additions & 2 deletions src/Readability.php
@@ -1882,8 +1882,6 @@ public function _cleanConditionally(DOMDocument $article, $tag)
return;
}

$isList = in_array($tag, ['ul', 'ol']);

/*
* Gather counts for other typical elements embedded within.
* Traverse backwards so we can remove nodes at the same time
@@ -1896,6 +1894,25 @@ public function _cleanConditionally(DOMDocument $article, $tag)
/** @var $node DOMElement */
$node = $allNodesWithTag[$length - 1 - $i];

$isList = in_array($tag, ['ul', 'ol']);
/*
// Doesn't seem to work as expected
// compared to JS version: https://github.com/mozilla/readability/commit/3c833899866ffb1f9130767110197fd6f5c08d4c
if (!$isList) {
$listLength = 0;
$listNodes = $this->_getAllNodesWithTag($node, ['ul', 'ol']);
array_walk($listNodes, function($list) use(&$listLength) {
$listLength += mb_strlen($list->getTextContent());
});
$nodeTextLength = mb_strlen($node->getTextContent());
if (!$nodeTextLength) {
$isList = true;
} else {
$isList = $listLength / $nodeTextLength > 0.9;
}
}
*/

// First check if this node IS data table, in which case don't remove it.
if ($tag === 'table' && $node->isReadabilityDataTable()) {
continue;
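The commented-out block above appears to estimate how much of a node's text sits inside lists. Below is a minimal sketch of that ratio using plain `DOMDocument`, with the 0.9 threshold taken from the diff. `looksLikeList` and its parameters are hypothetical names, and `getElementsByTagName` stands in for the library's `_getAllNodesWithTag`, so behaviour may differ from both the PHP and the JS versions.

```php
<?php

/**
 * Hypothetical helper: treats a node as a list when most of its text lives in <ul>/<ol>.
 * Not part of the library; a sketch of the commented-out heuristic only.
 */
function looksLikeList(DOMElement $node, float $threshold = 0.9): bool
{
    $nodeTextLength = mb_strlen($node->textContent);

    // Mirrors the commented-out branch: a node with no text is treated as a list.
    if ($nodeTextLength === 0) {
        return true;
    }

    $listLength = 0;

    foreach (['ul', 'ol'] as $tag) {
        // Every descendant list is counted, so text in nested lists is counted more than once.
        foreach ($node->getElementsByTagName($tag) as $list) {
            $listLength += mb_strlen($list->textContent);
        }
    }

    return $listLength / $nodeTextLength > $threshold;
}
```

Because nested lists contribute their text once per enclosing list, the ratio in this sketch can exceed 1 for deeply nested markup.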
2 changes: 1 addition & 1 deletion test/test-pages/aclu/expected-metadata.json
@@ -1,6 +1,6 @@
{
"Author": "Daniel Kahn Gillmor",
"Direction": "ltr",
"Direction": null,
"Excerpt": "Facebook collects data about people who have never even opted in. But there are ways these non-users can protect themselves.",
"Image": "https:\/\/www.aclu.org\/sites\/default\/files\/styles\/metatag_og_image_1200x630\/public\/field_share_image\/web18-facebook-socialshare-1200x628-v02.png?itok=p77cQjOm",
"Title": "Facebook Is Tracking Me Even Though I\u2019m Not on Facebook",
25 changes: 24 additions & 1 deletion test/test-pages/mozilla-1/expected.html
@@ -23,7 +23,30 @@ <h2>Designed to <br>be redesigned</h2>
<p><img src="http://mozorg.cdn.mozilla.net/media/img/firefox/desktop/customize/animations/flexible-bottom-fallback.cafd48a3d0a4.png" alt></p>
</div>
</div>

<div id="customize" data-ga-label="More ways to customize">
<h2>More ways to customize</h2>

<ul id="customizer-list" role="tablist">
<li> <a id="customize-themes" href="#themes">

Themes
</a>

</li>
<li> <a id="customize-addons" href="#add-ons">

Add-ons
</a>

</li>
<li> <a id="customize-awesomebar" href="#awesome-bar">

Awesome Bar
</a>

</li>
</ul>
</div>
<div id="customizers-wrapper">
<div id="themes" role="tabpanel" aria-labelledby="customize-themes">
<div>
5 changes: 4 additions & 1 deletion test/test-pages/nytimes-2/expected.html
@@ -33,7 +33,10 @@
<p data-para-count="479" data-total-count="1333">It is hard to overestimate how complex an asset sale like this is. Some of the assets are self-contained, but they must be gathered up and transferred. Employees need to be shuffled around and compensation arrangements redone. Many contracts, like the now-infamous one struck with the search engine Mozilla, which <a href="http://www.recode.net/2016/7/7/12116296/marissa-mayer-deal-mozilla-yahoo-payment">may result in a payment of up to a $1 billion</a>, will contain change-of-control provisions that will be set off and have to be addressed. Tax issues always loom large.</p> <p><a href="#story-continues-1">Continue reading the main story</a>
</p></div>


<div id="story-continues-1">

<p><a href="#story-continues-2">Continue reading the main story</a></p>
</div>
<div>
<p data-para-count="602" data-total-count="1935" id="story-continues-2">In the second step, at the closing, <a href="https://www.sec.gov/Archives/edgar/data/1011006/000119312516656036/d178500dex22.htm">Yahoo will sell the stock</a> in the single subsidiary to Verizon. At that point, Yahoo will change its name to something without “Yahoo” in it. My favorite is simply Remain Co., the name Yahoo executives are using. Remain Co. will become a holding company for the Alibaba and Yahoo Japan stock. Included will also be $10 billion in cash, plus the Excalibur patent portfolio and a number of minority investments including Snapchat. Ahh, if only Yahoo had bought Snapchat instead of Tumblr (indeed, if only Yahoo had bought Google or Facebook when it had the chance).</p>

4 changes: 3 additions & 1 deletion test/test-pages/wikipedia-2/expected.html
@@ -939,7 +939,9 @@ <h2>
Historically, extractive industries have contributed strongly to New Zealand's economy, focussing at different times on sealing, whaling, <a href="http://fakehost/wiki/Phormium" title="Phormium">flax</a>, gold, <a href="http://fakehost/wiki/Kauri_gum" title="Kauri gum">kauri gum</a>, and native timber.<sup id="cite_ref-RWT_export_evolution_203-0"><a href="#cite_note-RWT_export_evolution-203">[196]</a></sup> The first shipment of refrigerated meat on the <i><a href="http://fakehost/wiki/Dunedin_(ship)" title="Dunedin (ship)">Dunedin</a></i> in 1882 led to the establishment of meat and dairy exports to Britain, a trade which provided the basis for strong economic growth in New Zealand.<sup id="cite_ref-204"><a href="#cite_note-204">[197]</a></sup> High demand for agricultural products from the United Kingdom and the United States helped New Zealanders achieve higher living standards than both Australia and Western Europe in the 1950s and 1960s.<sup id="cite_ref-205"><a href="#cite_note-205">[198]</a></sup> In 1973, New Zealand's export market was reduced when the United Kingdom joined the <a href="http://fakehost/wiki/European_Economic_Community" title="European Economic Community">European Economic Community</a><sup id="cite_ref-206"><a href="#cite_note-206">[199]</a></sup> and other compounding factors, such as the <a href="http://fakehost/wiki/1973_oil_crisis" title="1973 oil crisis">1973 oil</a> and <a href="http://fakehost/wiki/1979_oil_crisis" title="1979 oil crisis">1979 energy</a> crises, led to a severe <a href="http://fakehost/wiki/Depression_(economics)" title="Depression (economics)">economic depression</a>.<sup id="cite_ref-207"><a href="#cite_note-207">[200]</a></sup> Living standards in New Zealand fell behind those of Australia and Western Europe, and by 1982 New Zealand had the lowest per-capita income of all the developed nations surveyed by <a href="http://fakehost/wiki/World_Bank_Group" title="World Bank Group">the World Bank</a>.<sup 
id="cite_ref-208"><a href="#cite_note-208">[201]</a></sup> In the mid-1980s New Zealand deregulated its <a href="http://fakehost/wiki/Agriculture_in_New_Zealand" title="Agriculture in New Zealand">agricultural sector</a> by phasing out <a href="http://fakehost/wiki/Agricultural_subsidy" title="Agricultural subsidy">subsidies</a> over a three-year period.<sup id="cite_ref-209"><a href="#cite_note-209">[202]</a></sup><sup id="cite_ref-210"><a href="#cite_note-210">[203]</a></sup> Since 1984, successive governments engaged in major <a href="http://fakehost/wiki/Macroeconomic" title="Macroeconomic">macroeconomic</a> restructuring (known first as <a href="http://fakehost/wiki/Rogernomics" title="Rogernomics">Rogernomics</a> and then <a href="http://fakehost/wiki/Ruthanasia" title="Ruthanasia">Ruthanasia</a>), rapidly transforming New Zealand from a <a href="http://fakehost/wiki/Protectionism" title="Protectionism">protected</a> and highly regulated economy to a liberalised <a href="http://fakehost/wiki/Free-trade" title="Free-trade">free-trade</a> economy.<sup id="cite_ref-Liberalisation_211-0"><a href="#cite_note-Liberalisation-211">[204]</a></sup><sup id="cite_ref-212"><a href="#cite_note-212">[205]</a></sup>
</p>
<div>
<p><a href="http://fakehost/wiki/File:MilfordSound.jpg"><img alt="Blue water against a backdrop of snow-capped mountains" src="http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MilfordSound.jpg/220px-MilfordSound.jpg" decoding="async" width="220" height="147" srcset="http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MilfordSound.jpg/330px-MilfordSound.jpg 1.5x, http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MilfordSound.jpg/440px-MilfordSound.jpg 2x" data-file-width="2048" data-file-height="1364"></a></p>
<p><a href="http://fakehost/wiki/File:MilfordSound.jpg"><img alt="Blue water against a backdrop of snow-capped mountains" src="http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MilfordSound.jpg/220px-MilfordSound.jpg" decoding="async" width="220" height="147" srcset="http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MilfordSound.jpg/330px-MilfordSound.jpg 1.5x, http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MilfordSound.jpg/440px-MilfordSound.jpg 2x" data-file-width="2048" data-file-height="1364"></a></p><div>
<p><a href="http://fakehost/wiki/Milford_Sound" title="Milford Sound">Milford Sound</a> is one of New Zealand's most famous tourist destinations.<sup id="cite_ref-213"><a href="#cite_note-213">[206]</a></sup>
</p></div>
</div>
<p>
Unemployment peaked above 10% in 1991 and 1992,<sup id="cite_ref-unemployment_214-0"><a href="#cite_note-unemployment-214">[207]</a></sup> following the <a href="http://fakehost/wiki/Black_Monday_(1987)" title="Black Monday (1987)">1987 share market crash</a>, but eventually fell to a record low (since 1986) of 3.7% in 2007 (ranking third from twenty-seven comparable OECD nations).<sup id="cite_ref-unemployment_214-1"><a href="#cite_note-unemployment-214">[207]</a></sup> However, the <a href="http://fakehost/wiki/Financial_crisis_of_2007%E2%80%932008" title="Financial crisis of 2007–2008">global financial crisis</a> that followed had a major impact on New Zealand, with the GDP shrinking for five consecutive quarters, the longest recession in over thirty years,<sup id="cite_ref-215"><a href="#cite_note-215">[208]</a></sup><sup id="cite_ref-216"><a href="#cite_note-216">[209]</a></sup> and unemployment rising back to 7% in late 2009.<sup id="cite_ref-217"><a href="#cite_note-217">[210]</a></sup> Unemployment rates for different age groups follow similar trends, but are consistently higher among youth. 
In the December 2014 quarter, the general unemployment rate was around 5.8%, while the unemployment rate for youth aged 15 to 21 was 15.6%.<sup id="cite_ref-unemployment_214-2"><a href="#cite_note-unemployment-214">[207]</a></sup> New Zealand has experienced a series of "<a href="http://fakehost/wiki/Brain_drain" title="Brain drain">brain drains</a>" since the 1970s<sup id="cite_ref-218"><a href="#cite_note-218">[211]</a></sup> that still continue today.<sup id="cite_ref-219"><a href="#cite_note-219">[212]</a></sup> Nearly one quarter of highly skilled workers live overseas, mostly in Australia and Britain, which is the largest proportion from any developed nation.<sup id="cite_ref-220"><a href="#cite_note-220">[213]</a></sup> In recent decades, however, a "brain gain" has brought in educated professionals from Europe and less developed countries.<sup id="cite_ref-221"><a href="#cite_note-221">[214]</a></sup><sup id="cite_ref-FOOTNOTEBain200644_222-0"><a href="#cite_note-FOOTNOTEBain200644-222">[215]</a></sup> Today New Zealand's economy benefits from a high level of <a href="http://fakehost/wiki/Innovation" title="Innovation">innovation</a>.<sup id="cite_ref-223"><a href="#cite_note-223">[216]</a></sup>