stage 3: fall back when chosen subtree is suspiciously small #2

Merged
abimaelmartell merged 1 commit into main from fix/stage3-fallback-on-undersized-pick
May 20, 2026
@abimaelmartell
Member

The scored tree walk in select_main can lock onto a tiny high-text-density subtree (an intro paragraph, a small component wrapper) and miss the substantive main content elsewhere on the page. The fallback chain is already capable of finding the real content in these cases via justext-style paragraph classification, but it currently only runs when Stage 3 returns nothing or returns under the absolute min_extraction_length threshold. If Stage 3 picks a subtree of, say, 60 chars on a page with 5,000 chars of body text, the fallback is never invoked even though the pick is obviously bad.

Fix

Adds a third fallback trigger in lib.rs::extract:

  body_text_len >= 200
  AND kept_text_len * 100 < body_text_len * 15
  AND fallback_text > kept_text_len * 2

When all three conditions hold (the body has at least 200 chars of text excluding links, the chosen subtree is under 15% of that text, AND the fallback chain found more than 2× the text Stage 3 picked), we use the fallback's choice. The 15% threshold is empirical, tuned against a small real-world corpus.
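The trigger above can be sketched as a standalone predicate. This is a minimal illustration, not the code in lib.rs: the function name `should_use_fallback`, the constant names, and the `usize` parameter types are assumptions made for the example. Only the three comparisons come from this PR.

```rust
/// Minimum body text length (chars, excluding link text) before the
/// ratio check is considered meaningful. Value taken from the PR text.
const MIN_BODY_TEXT_LEN: usize = 200;

/// Stage 3's pick counts as undersized below this percentage of the
/// body text. Value taken from the PR text.
const UNDERSIZED_PICK_PERCENT: usize = 15;

/// Hypothetical helper: returns true when the fallback chain's result
/// should replace Stage 3's pick.
pub fn should_use_fallback(
    body_text_len: usize,
    kept_text_len: usize,
    fallback_text_len: usize,
) -> bool {
    let body_is_substantial = body_text_len >= MIN_BODY_TEXT_LEN;

    // Integer cross-multiplication stands in for `kept / body < 0.15`,
    // which avoids floats and division by zero. saturating_mul guards
    // against overflow on pathological inputs.
    let pick_is_undersized = kept_text_len.saturating_mul(100)
        < body_text_len.saturating_mul(UNDERSIZED_PICK_PERCENT);

    // The fallback must have found strictly more than twice the text.
    let fallback_is_much_larger = fallback_text_len > kept_text_len.saturating_mul(2);

    body_is_substantial && pick_is_undersized && fallback_is_much_larger
}
```

With the figures from the description (a 60-char pick on a page with 5,000 chars of body text), the predicate fires as long as the fallback found more than 120 chars. A pick at exactly 15% of the body does not fire, because the comparison is strict.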

Empirical impact

Verified on a 23-URL spot-check covering articles, docs, forums, listings, product pages, marketing, service pages:

  • One service-style page where Stage 3 was picking a ~3 KB component nav instead of the substantive ~30 KB main content: now correctly returns the body content. Output grows roughly 10×; extraction_quality moves from 0.22 to 0.40 (above the 0.30 threshold downstream callers gate on).
  • 22 other URLs: extraction output unchanged (to within a few bytes).
  • Golden corpus: all 54 fixtures still pass.
  • Existing unit + integration tests: 35/35 pass.

What this does NOT fix

Pages where nearly all text is link text (table-layout listings of all-anchor rows). On those, text_len_excluding_links is near-zero for both the body and the chosen subtree, so the 15%-of-body trigger doesn't fire. The fix would need a "full text including links" variant of the comparison, which has its own trade-offs (normal articles with many inline links would risk false-positive fallback triggers). Separate follow-up.
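To make the limitation concrete, here is a rough sketch under stated assumptions. The `TextStats` struct, its fields, and `trigger_fires` are hypothetical names invented for this example, and the lengths used in the note that follows are made up rather than measured. The predicate restates the three conditions from the Fix section.

```rust
/// Hypothetical text measurements for one DOM subtree.
#[derive(Clone, Copy)]
pub struct TextStats {
    /// Length of all text in the subtree, in chars.
    pub total_len: usize,
    /// Length of the text that sits inside anchor elements.
    pub link_len: usize,
}

impl TextStats {
    /// Text length with link text removed (the measure the trigger uses).
    pub fn excluding_links(self) -> usize {
        self.total_len.saturating_sub(self.link_len)
    }
}

/// The three-part trigger, restated so this sketch is self-contained.
pub fn trigger_fires(body_len: usize, kept_len: usize, fallback_len: usize) -> bool {
    body_len >= 200
        && kept_len.saturating_mul(100) < body_len.saturating_mul(15)
        && fallback_len > kept_len.saturating_mul(2)
}
```

Take an all-anchor listing whose body has 5,000 chars of text, 4,950 of them inside links. Excluding links leaves 50 chars, so the `body_len >= 200` guard fails and the trigger stays silent however small the pick is. Feeding the same page's totals including links (5,000 for the body, 60 for the pick) would fire the trigger, which is the variant the follow-up would have to weigh against false positives on link-heavy articles.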

Why no new unit test

The change is a runtime decision over real DOM shapes; the failure case requires a specific HTML structure (Stage 3 must lock onto a small subtree, and the fallback chain must find a bigger one). Synthesizing such HTML reliably enough to gate CI on is fragile. The golden corpus + the 23-URL real-world spot-check are the durable signal; if reviewers want a focused fixture I can add one.

…iciously small

The scored tree walk can lock onto a high-density-but-tiny element (an
intro paragraph, a small component-grid wrapper) and miss the substantive
main content elsewhere on the page. Symptom: extraction returns ≤5% of
the body text on a page with thousands of chars of legitimate content,
because the link-density penalty drove every link-dense subtree's
aggregate score negative and the intro paragraph won by default.

Adds a third fallback trigger:

  body_text_len >= 200
  AND kept_text_len * 100 < body_text_len * 15
  AND fallback_text > kept_text_len * 2

When all three hold, we treat Stage 3's choice as untrustworthy and use
the fallback chain's result instead. The 15% threshold is empirical,
tuned against a small real-world corpus.

Empirical impact on a 23-URL spot-check:

  - One page where Stage 3 was picking a 3 KB component-nav menu instead
    of the substantive 30 KB body content: now correctly returns the
    body content; extraction_quality moves from 0.22 to 0.40 (above the
    0.30 confidence-gate threshold the downstream caller uses).
  - 22 other URLs: extraction output unchanged.
  - Golden corpus: all 54 fixtures still pass.

Known limitation NOT fixed by this patch: pages where almost all text is
link text (e.g. table-layout listings of all-anchor rows). On these,
text_len_excluding_links is near-zero for both body and chosen subtree,
so the 15%-of-body trigger doesn't fire. A follow-up may need a separate
"full text including links" comparison for that class of page.
@abimaelmartell abimaelmartell merged commit d2fde0e into main May 20, 2026
6 checks passed