deformat 0.14.0

arclabs561 released this 23 Apr 20:00

Minor release adding two new public APIs. No breaking changes.

Added

  • html::filter_low_cetd_density(segments, min_fraction_of_mean): a fourth composable filter implementing Composite Text Density with sibling smoothing (Sun et al., SIGIR 2011). It smooths per-segment character density as 0.25·prev + 0.5·self + 0.25·next, then drops NarrativeText / UncategorizedText segments whose smoothed density falls below min_fraction_of_mean × mean. Segments with structural roles are preserved. The filter is language-agnostic because it counts characters, not words. Composed after the existing three-filter pipeline on the WCXB dev split: overall F1 0.774 → 0.778 (+0.4pp), article F1 0.881 → 0.887 (+0.6pp), precision +1.0pp, without% +1.6pp.

  • New deformat::page_type module with detect_page_type(html) -> PageType and a PageType enum with variants Article / Documentation / Product / Forum / Listing / Collection / Service / Unknown. Detection is purely heuristic and inspects signals in priority order: (1) <meta property="og:type">, (2) JSON-LD @type, (3) schema.org itemtype, (4) <link rel="canonical"> URL path, (5) structural counts (<article>, .comment, .price).
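
The smoothing-and-threshold rule described for filter_low_cetd_density can be illustrated with a minimal, self-contained sketch. This is not deformat's implementation: the Role and Segment types and the function names below are hypothetical stand-ins, and two details are assumptions not stated in the notes (edge segments reuse their own density for the missing neighbour, and the mean is taken over the smoothed densities).

```rust
// Hypothetical stand-ins for illustration; not deformat's actual types.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    NarrativeText,
    UncategorizedText,
    Title, // stands in for any structural role
}

#[derive(Debug, Clone)]
struct Segment {
    role: Role,
    density: f64, // per-segment character density
}

/// Sibling smoothing: 0.25·prev + 0.5·self + 0.25·next.
/// Assumption: at the edges, the missing neighbour is replaced by the
/// segment's own density.
fn smooth(d: &[f64]) -> Vec<f64> {
    (0..d.len())
        .map(|i| {
            let prev = if i > 0 { d[i - 1] } else { d[i] };
            let next = if i + 1 < d.len() { d[i + 1] } else { d[i] };
            0.25 * prev + 0.5 * d[i] + 0.25 * next
        })
        .collect()
}

/// Drops NarrativeText / UncategorizedText segments whose smoothed density
/// is below `min_fraction_of_mean × mean`; every other role is kept.
/// Assumption: `mean` is the mean of the smoothed densities.
fn filter_low_density(segments: Vec<Segment>, min_fraction_of_mean: f64) -> Vec<Segment> {
    if segments.is_empty() {
        return segments;
    }
    let raw: Vec<f64> = segments.iter().map(|s| s.density).collect();
    let smoothed = smooth(&raw);
    let mean = smoothed.iter().sum::<f64>() / smoothed.len() as f64;
    let threshold = min_fraction_of_mean * mean;

    segments
        .into_iter()
        .zip(smoothed)
        .filter(|(seg, d)| {
            let droppable =
                matches!(seg.role, Role::NarrativeText | Role::UncategorizedText);
            !(droppable && *d < threshold)
        })
        .map(|(seg, _)| seg)
        .collect()
}
```

With raw densities 8, 8, 0, 0, 0 and a Title in fourth position, the third segment survives a 0.5 threshold because its dense neighbour lifts its smoothed value, the Title survives because of its role, and only the trailing narrative segment is dropped.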

Fixed

  • UTF-8 char-boundary panic in page_type when slicing near multi-byte characters (curly quote U+2019 near og:type keyword, observed on real WCXB pages). Added floor_char_boundary / ceil_char_boundary helpers. Regression guard: proptest strategy now includes Latin-1, CJK, Arabic, emoji, and curly-quote codepoints specifically to exercise char-boundary handling.
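
For readers unfamiliar with this class of bug: slicing a Rust &str at a byte index that falls inside a multi-byte character panics. The sketch below shows one plausible shape for helpers with the names mentioned above, built only on the stable str::is_char_boundary. The actual helpers in deformat may differ; the signatures here are an assumption.

```rust
/// Largest char boundary <= `i`, clamped to `s.len()`.
/// Hypothetical signature; deformat's own helper may differ.
fn floor_char_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Smallest char boundary >= `i`, clamped to `s.len()`.
fn ceil_char_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    // `s.len()` is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i += 1;
    }
    i
}

/// Returns a window of roughly `radius` bytes around byte offset `at`,
/// widened outward to the nearest char boundaries so slicing cannot panic.
fn safe_window(s: &str, at: usize, radius: usize) -> &str {
    let start = floor_char_boundary(s, at.saturating_sub(radius));
    let end = ceil_char_boundary(s, at.saturating_add(radius));
    &s[start..end]
}
```

In "it’s og:type" the curly quote U+2019 occupies bytes 2..5, so a naive slice ending at byte 3 or 4 would panic, whereas the helpers snap those indices to 2 or 5.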

Tests

  • 29 new tests (632 total, +2 doc-tests): CETD behaviour, page_type classification, multilang body preservation, structural preservation, arbitrary-bytes panic-guards.

Compatibility

  • MSRV unchanged (1.80.0). No breaking API changes. Published size: 196 KB compressed.

Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md