deformat 0.14.0
Minor release adding two new public APIs. No breaking changes.
Added
- html::filter_low_cetd_density(segments, min_fraction_of_mean): a fourth composable filter implementing Composite Text Density with sibling smoothing (Sun et al., SIGIR 2011). Smooths per-segment character density as 0.25·prev + 0.5·self + 0.25·next and drops NarrativeText / UncategorizedText segments whose smoothed density falls below min_fraction_of_mean × mean. Language-agnostic (character-based, not word-based). Preserves structural roles. Composed after the existing three-filter pipeline on the WCXB dev split: overall F1 0.774 → 0.778 (+0.4pp), article F1 0.881 → 0.887 (+0.6pp), precision +1.0pp, without% +1.6pp.
- New deformat::page_type module with detect_page_type(html) -> PageType and PageType enum variants Article / Documentation / Product / Forum / Listing / Collection / Service / Unknown. Pure heuristic that inspects, in priority order: (1) <meta property="og:type">, (2) JSON-LD @type, (3) schema.org itemtype, (4) <link rel="canonical"> URL path, (5) structural counts (<article>, .comment, .price).
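
The smoothing and threshold rule described for filter_low_cetd_density can be sketched as below. This is a minimal illustration, not the crate's implementation: the function names smooth_densities and keep_mask, the plain f64 density input, and the droppable flags (standing in for "role is NarrativeText or UncategorizedText") are all hypothetical. Two further assumptions are made where the notes are silent: edge segments reuse their own density for the missing neighbour, and the mean is taken over the smoothed densities.

```rust
/// Smooth per-segment densities with the 0.25 / 0.5 / 0.25 sibling kernel.
/// Assumption: at either end, the missing neighbour is replaced by the
/// segment's own density.
fn smooth_densities(densities: &[f64]) -> Vec<f64> {
    let n = densities.len();
    (0..n)
        .map(|i| {
            let own = densities[i];
            let prev = if i > 0 { densities[i - 1] } else { own };
            let next = if i + 1 < n { densities[i + 1] } else { own };
            0.25 * prev + 0.5 * own + 0.25 * next
        })
        .collect()
}

/// Return one keep/drop flag per segment (true = keep).
/// `droppable[i]` is a hypothetical stand-in for "segment i is NarrativeText
/// or UncategorizedText"; segments with other roles are always kept.
/// Assumption: the mean is computed over the smoothed densities.
fn keep_mask(densities: &[f64], droppable: &[bool], min_fraction_of_mean: f64) -> Vec<bool> {
    let smoothed = smooth_densities(densities);
    if smoothed.is_empty() {
        return Vec::new();
    }
    let mean = smoothed.iter().sum::<f64>() / smoothed.len() as f64;
    let threshold = min_fraction_of_mean * mean;
    smoothed
        .iter()
        .zip(droppable)
        .map(|(&density, &can_drop)| !(can_drop && density < threshold))
        .collect()
}
```

With densities [10.0, 10.0, 1.0, 10.0] the smoothed values are [10.0, 7.75, 5.5, 7.75] and their mean is 7.75, so a min_fraction_of_mean of 0.8 drops only the third segment, and only if it is marked droppable. The smoothing is what keeps a short but legitimate sentence between two dense paragraphs from being judged on its own density alone.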
Fixed
- UTF-8 char-boundary panic in page_type when slicing near multi-byte characters (curly quote U+2019 near the og:type keyword, observed on real WCXB pages). Added floor_char_boundary / ceil_char_boundary helpers. Regression guard: the proptest strategy now includes Latin-1, CJK, Arabic, emoji, and curly-quote codepoints specifically to exercise char-boundary handling.
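
For readers unfamiliar with this class of panic: slicing a Rust &str at a byte offset that falls inside a multi-byte character panics at runtime. The sketch below shows one way such helpers could look, built only on the stable str::is_char_boundary method. The bodies are an assumption, not the crate's code, and window_around is a hypothetical caller added purely for illustration.

```rust
/// Largest char boundary that is <= `index`, clamped to `s.len()`.
fn floor_char_boundary(s: &str, mut index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(index) {
        index -= 1;
    }
    index
}

/// Smallest char boundary that is >= `index`, clamped to `s.len()`.
fn ceil_char_boundary(s: &str, mut index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    // `s.len()` is always a boundary, so this loop terminates.
    while !s.is_char_boundary(index) {
        index += 1;
    }
    index
}

/// Hypothetical caller: take a byte window around `pos` without ever
/// slicing through the middle of a multi-byte character.
fn window_around(s: &str, pos: usize, radius: usize) -> &str {
    let start = floor_char_boundary(s, pos.saturating_sub(radius));
    let end = ceil_char_boundary(s, pos.saturating_add(radius));
    &s[start..end]
}
```

In the string "it’s og:type", the curly quote U+2019 occupies bytes 2 to 4, so a naive slice starting at byte 3 or 4 would panic. Rounding the start down and the end up widens the window slightly instead.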
Tests
- 29 new tests (632 total, +2 doc-tests): CETD behaviour, page_type classification, multilang body preservation, structural preservation, arbitrary-bytes panic-guards.
Compatibility
- MSRV unchanged (1.80.0). No breaking API changes. Published size: 196 KB compressed.
Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md