0.15.0

arclabs561 released this 23 Apr 20:57
· 14 commits to main since this release

The cascade gains the Trafilatura ratio condition and an empty-URL bug fix; anchor election arrives as a structural inversion (StripOptions::main_landmark); and a committed regression corpus pins per-fixture F1 floors, so future filter changes that drop F1 fail CI.

Highlights

  • extract_html_cascade now uses Trafilatura's 2x ratio
    condition (the only published cascade form with measured F1
    evidence) instead of the prior fixed-threshold form. It skips the
    readability pass when the scanner has already produced enough text.
  • Anchor election via StripOptions::main_landmark(): pre-scans
    for <main> or the longest <article> and restricts extraction to
    that subtree. WCXB article F1 0.855 → 0.867. Opt-in, because it
    regresses forum and product pages where content lives outside <main>.
  • Regression corpus (tests/fixtures/regression/) with
    per-extractor F1 floors. CI now fails on F1 drift, not just on
    test counts.
  • Bug fix: extract_with_readability("...", "") silently
    returned None because dom_smoothie rejected the empty URL. As a
    result, the cascade's fallback branch had been disabled in
    practice; it now fires correctly on real-world content-in-aside
    pages.
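
The ratio condition in the first highlight can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: pick_by_ratio is a hypothetical helper, the comparison by character count is an assumption, and the short-circuit that avoids running readability at all is not modelled. It assumes "2x ratio" means the fallback wins only when its output is more than twice as long as the scanner's.

```rust
/// Hypothetical helper (not part of the released API): choose between the
/// fast scanner output and a readability-style fallback.
/// Assumption: the fallback wins only when it is more than 2x as long.
fn pick_by_ratio<'a>(scanner: &'a str, fallback: Option<&'a str>) -> &'a str {
    let scanner_len = scanner.chars().count();
    match fallback {
        // Fallback produced strictly more than twice the scanner's text.
        Some(fb) if fb.chars().count() > 2 * scanner_len => fb,
        // Fallback missing (None) or not long enough: keep the scanner output.
        _ => scanner,
    }
}

fn main() {
    let long_fallback = "a much longer fallback extraction";
    // 33 chars vs 5 chars: the fallback is more than 2x longer, so it wins.
    assert_eq!(pick_by_ratio("short", Some(long_fallback)), long_fallback);
    // Similar lengths: the scanner output is kept.
    assert_eq!(
        pick_by_ratio("roughly same size", Some("about the same size")),
        "roughly same size"
    );
    // A fallback that returned None never replaces the scanner output.
    assert_eq!(pick_by_ratio("short", None), "short");
}
```

Under this reading, a fallback that silently returns None (as in the empty-URL bug above) always leaves the scanner output in place, which is why the fallback branch would appear disabled.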

See CHANGELOG for the full list.

Test counts

677 tests + 17 doc tests passing under --all-features (was 661).
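
For readers unfamiliar with F1-floor checks, the idea behind the regression corpus can be sketched as below. This is an illustrative sketch only: token_f1 and meets_floor are hypothetical names, the whitespace-token scoring is an assumption (the release does not say how F1 is computed), and the 0.85 floor is an example value rather than one taken from the corpus.

```rust
use std::collections::HashMap;

/// Hypothetical scorer: token-level F1 between extracted and reference text,
/// using whitespace tokens and multiset overlap.
fn token_f1(extracted: &str, reference: &str) -> f64 {
    // Count how often each token occurs in the reference.
    let mut ref_counts: HashMap<&str, usize> = HashMap::new();
    let mut ref_total = 0usize;
    for tok in reference.split_whitespace() {
        *ref_counts.entry(tok).or_insert(0) += 1;
        ref_total += 1;
    }
    // Consume reference counts as extracted tokens match them.
    let mut ext_total = 0usize;
    let mut overlap = 0usize;
    for tok in extracted.split_whitespace() {
        ext_total += 1;
        if let Some(n) = ref_counts.get_mut(tok) {
            if *n > 0 {
                *n -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / ext_total as f64;
    let recall = overlap as f64 / ref_total as f64;
    2.0 * precision * recall / (precision + recall)
}

/// A fixture passes when its score stays at or above the committed floor.
fn meets_floor(extracted: &str, reference: &str, floor: f64) -> bool {
    token_f1(extracted, reference) >= floor
}

fn main() {
    let reference = "the quick brown fox jumps";
    // A perfect extraction scores 1.0.
    assert!((token_f1(reference, reference) - 1.0).abs() < 1e-9);
    // Dropping two of five tokens: precision 1.0, recall 0.6, F1 0.75.
    let degraded = "the quick brown";
    assert!((token_f1(degraded, reference) - 0.75).abs() < 1e-9);
    // The degraded extraction falls below the example floor and would fail CI.
    assert!(meets_floor(reference, reference, 0.85));
    assert!(!meets_floor(degraded, reference, 0.85));
}
```

Pinning a floor per fixture means a filter change that keeps every test green but drops extraction quality still fails the build.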