Cascade gets the Trafilatura ratio condition and an empty-URL bug
fix; anchor election arrives as a structural inversion (StripOptions::main_landmark); a committed regression corpus pins per-fixture F1 floors so future filter changes that drop F1 fail CI.
Highlights
- extract_html_cascade now uses Trafilatura's 2x ratio condition (the only published cascade form with measured F1 evidence) instead of the prior fixed-threshold form. Short-circuits readability when the scanner already produced enough.
- Anchor election via StripOptions::main_landmark(): pre-scans for <main> or longest <article> and restricts extraction to that subtree. WCXB article F1 0.855 → 0.867. Opt-in (regresses forum / product pages where content lives outside <main>).
- Regression corpus (tests/fixtures/regression/) with per-extractor F1 floors. Failing CI on F1 drift, not just test counts.
- Bug fix: extract_with_readability("...", "") silently returned None because dom_smoothie rejected the empty URL. The cascade's fallback branch had been disabled in practice; now fires correctly on real-world content-in-aside pages.
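
For readers unfamiliar with ratio-style cascades, here is a minimal sketch of a 2x length-ratio choice between two candidate extractions. It is not the crate's code: the function name choose_extraction, the use of trimmed character counts as the length measure, and the direction of the comparison (the fallback wins only when it is more than twice as long) are all assumptions made for illustration. It also does not model the short-circuit that skips readability entirely.

```rust
/// Pick between a primary (scanner) extraction and an optional fallback
/// (readability-style) extraction using a 2x length-ratio rule.
///
/// Assumption for this sketch: the fallback is preferred only when its
/// trimmed character count is more than twice that of the primary result.
/// The condition actually used by the crate may differ.
pub fn choose_extraction<'a>(primary: &'a str, fallback: Option<&'a str>) -> &'a str {
    let primary_len = primary.trim().chars().count();
    match fallback {
        // Fallback produced substantially more text: take it.
        Some(fb) if fb.trim().chars().count() > 2 * primary_len => fb,
        // No fallback, or the fallback is not clearly larger: keep the primary.
        _ => primary,
    }
}
```

A ratio compares the two outputs against each other, so the decision scales with page size, whereas a fixed threshold treats a short news brief and a long essay the same way.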
See CHANGELOG for the full list.
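
To make the F1-floor idea concrete, the sketch below scores an extraction against a reference text and fails when the score drops under a pinned floor. It assumes a whitespace-token, multiset-overlap F1; the project's actual scoring, fixture layout, and floor values are not shown in these notes, and the names token_f1 and check_floor are hypothetical.

```rust
use std::collections::HashMap;

/// Token-level F1 between extracted text and a reference text.
/// Tokens are whitespace-separated; overlap is counted as a multiset,
/// so a repeated token only matches as many times as it occurs in the reference.
pub fn token_f1(extracted: &str, reference: &str) -> f64 {
    let mut ref_counts: HashMap<&str, usize> = HashMap::new();
    let mut ref_total = 0usize;
    for tok in reference.split_whitespace() {
        *ref_counts.entry(tok).or_insert(0) += 1;
        ref_total += 1;
    }

    let mut ext_total = 0usize;
    let mut overlap = 0usize;
    for tok in extracted.split_whitespace() {
        ext_total += 1;
        if let Some(remaining) = ref_counts.get_mut(tok) {
            if *remaining > 0 {
                *remaining -= 1;
                overlap += 1;
            }
        }
    }

    // No shared tokens (including the empty-input cases) scores zero.
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / ext_total as f64;
    let recall = overlap as f64 / ref_total as f64;
    2.0 * precision * recall / (precision + recall)
}

/// Hypothetical floor check: returns Err with a readable message when the
/// measured F1 is below the floor pinned for that fixture.
pub fn check_floor(fixture: &str, f1: f64, floor: f64) -> Result<(), String> {
    if f1 < floor {
        Err(format!("{fixture}: F1 {f1:.3} fell below floor {floor:.3}"))
    } else {
        Ok(())
    }
}
```

In a test harness, a check like this would run once per fixture, so a filter change that lowers extraction quality fails the build even when every other test still passes.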
Test counts
677 tests + 17 doc tests passing under --all-features (was 661).