stage 3: fall back when chosen subtree is suspiciously small #2
Merged
The scored tree walk can lock onto a high-density-but-tiny element (an
intro paragraph, a small component-grid wrapper) and miss the substantive
main content elsewhere on the page. Symptom: extraction returns ≤5% of
the body text on a page with thousands of chars of legitimate content,
because the link-density penalty drove every link-dense subtree's
aggregate score negative and the intro paragraph won by default.
Adds a third fallback trigger:
body_text_len >= 200
AND kept_text_len * 100 < body_text_len * 15
AND fallback_text > kept_text_len * 2
When all three hold, we treat Stage 3's choice as untrustworthy and use
the fallback chain's result instead. The 15% threshold is empirical,
tuned against a small real-world corpus.
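The three conditions above can be sketched as a single predicate. This is a minimal illustration, not the code in this patch: the function name, the parameter name `fallback_text_len`, and the use of `usize` character counts are assumptions.

```rust
/// Hypothetical helper (name and signature are assumptions, not the patch's code).
/// All arguments are text lengths in chars, excluding link text.
fn stage3_pick_is_suspicious(
    body_text_len: usize,
    kept_text_len: usize,
    fallback_text_len: usize,
) -> bool {
    // 1. The page has enough text for a ratio to be meaningful.
    body_text_len >= 200
        // 2. Stage 3 kept less than 15% of the body text.
        //    Cross-multiplying keeps the comparison in integer arithmetic.
        && kept_text_len * 100 < body_text_len * 15
        // 3. The fallback chain found more than twice as much text.
        && fallback_text_len > kept_text_len * 2
}
```

Writing the ratio as `kept * 100 < body * 15` avoids floating point; for very large inputs a real implementation might prefer `saturating_mul` to rule out overflow.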
Empirical impact on a 23-URL spot-check:
- One page where Stage 3 was picking a 3 KB component-nav menu instead
of the substantive 30 KB body content: now correctly returns the
body content; extraction_quality moves from 0.22 to 0.40 (above the
0.30 confidence-gate threshold the downstream caller uses).
- 22 other URLs: extraction output unchanged.
- Golden corpus: all 54 fixtures still pass.
Known limitation NOT fixed by this patch: pages where almost all text is
link text (e.g. table-layout listings of all-anchor rows). On these,
text_len_excluding_links is near-zero for both body and chosen subtree,
so the 15%-of-body trigger doesn't fire. A follow-up may need a separate
"full text including links" comparison for that class of page.
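To make the limitation concrete, here is a small sketch of why the check stays silent on all-anchor pages. The `TextStats` struct, its fields, and the helper name are hypothetical and only illustrate the arithmetic; they are not taken from the codebase.

```rust
/// Hypothetical per-subtree text statistics, in chars.
struct TextStats {
    total_len: usize, // all text, including anchor text
    link_len: usize,  // text inside anchor elements only
}

impl TextStats {
    /// Assumed meaning of text_len_excluding_links: total minus anchor text.
    fn len_excluding_links(&self) -> usize {
        self.total_len.saturating_sub(self.link_len)
    }
}

/// The first two conditions of the trigger, parameterised over
/// whichever length measure the caller chooses to compare.
fn is_under_15_percent(body_len: usize, kept_len: usize) -> bool {
    body_len >= 200 && kept_len * 100 < body_len * 15
}
```

With illustrative numbers, a body of 6,000 chars of which 5,950 are link text has only 50 chars excluding links, so `is_under_15_percent` fails the 200-char floor no matter how small the chosen subtree is. Passing the `total_len` values instead (say 6,000 for the body and 300 for the chosen subtree) would satisfy the check, which is the variant a follow-up would have to evaluate.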
The scored tree walk in select_main can lock onto a tiny high-text-density subtree (an intro paragraph, a small component wrapper) and miss the substantive main content elsewhere on the page. The fallback chain is already capable of finding the real content in these cases via justext-style paragraph classification, but it currently only runs when Stage 3 returns nothing or returns under the absolute min_extraction_length threshold. If Stage 3 picks a subtree of, say, 60 chars on a page with 5,000 chars of body text, the fallback is never invoked even though the pick is obviously bad.
Fix
Adds a third fallback trigger in lib.rs::extract. When all three conditions hold (the body has at least 200 chars of text excluding links, the chosen subtree is < 15% of that body text, and the fallback chain found more than 2× the text Stage 3 picked), we use the fallback's choice. The 15% threshold is empirical, tuned against a small real-world corpus.
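As a rough sketch of how the new trigger might sit next to the two existing ones, the decision can be written as a function returning the reason the fallback was preferred. The enum, the function name, and the argument order are assumptions made for illustration, as is comparing the Stage 3 length against min_extraction_length using the same length measure.

```rust
/// Hypothetical reasons for preferring the fallback chain's result.
#[derive(Debug, PartialEq)]
enum FallbackReason {
    Stage3Empty,       // existing trigger: Stage 3 returned nothing
    BelowMinLength,    // existing trigger: under min_extraction_length
    SuspiciouslySmall, // new trigger described in this PR
}

/// `kept_text_len` is `None` when Stage 3 returned nothing.
/// Lengths are chars of text excluding link text.
fn fallback_reason(
    kept_text_len: Option<usize>,
    body_text_len: usize,
    fallback_text_len: usize,
    min_extraction_length: usize,
) -> Option<FallbackReason> {
    let kept = match kept_text_len {
        None => return Some(FallbackReason::Stage3Empty),
        Some(n) => n,
    };
    if kept < min_extraction_length {
        return Some(FallbackReason::BelowMinLength);
    }
    // New: the pick is a small fraction of the body and the fallback found far more.
    let suspicious = body_text_len >= 200
        && kept * 100 < body_text_len * 15
        && fallback_text_len > kept * 2;
    if suspicious {
        Some(FallbackReason::SuspiciouslySmall)
    } else {
        None // keep Stage 3's choice
    }
}
```

For the 60-char pick on a 5,000-char page mentioned above, and assuming a minimum length of 50 and a fallback result of 3,000 chars, this returns `Some(FallbackReason::SuspiciouslySmall)`, whereas before this change the same inputs would have kept Stage 3's choice.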
Empirical impact
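Verified on a 23-URL spot-check covering articles, docs, forums, listings, product pages, marketing, service pages:
- One page where Stage 3 was picking a 3 KB component-nav menu instead of the substantive 30 KB body content now returns the body content; extraction_quality moves from 0.22 to 0.40 (above the 0.30 threshold downstream callers gate on).
- 22 other URLs: extraction output unchanged.
- Golden corpus: all 54 fixtures still pass.
What this does NOT fix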
Pages where nearly all text is link text (table-layout listings of all-anchor rows). On those, text_len_excluding_links is near-zero for both the body and the chosen subtree, so the 15%-of-body trigger doesn't fire. The fix would need a "full text including links" variant of the comparison, which has its own trade-offs (normal articles with many inline links would risk false-positive fallback triggers). Separate follow-up.
Why no new unit test
The change is a runtime decision over real DOM shapes; the failure case requires a specific HTML structure (Stage 3 must lock onto a small subtree, and the fallback chain must find a bigger one). Synthesizing such HTML reliably enough to gate CI on is fragile. The golden corpus and the 23-URL real-world spot-check are the durable signal; if reviewers want a focused fixture I can add one.