deformat 0.12.0
Fixed
- `strip_to_text_with_paths`: span `output_end` was rebased only for leading-whitespace trim, not trailing. Spans could retain `output_end > trimmed_text.len()` and panic callers on `text[span.output_start..span.output_end]`. Both sides are now clamped to the trimmed output.
- `remap_spans` demoted `SpanKind::Direct` to `EntityDecoded` only on byte-count changes. Whitespace runs like `" \n"` collapse to `"\n"` with count preserved but byte value swapped, leaving a `Direct` span whose output `\n` claimed byte-exact correspondence to source `' '`. Now also compares bytes and demotes on content mismatch. (Surfaced by proptest on `"a<span> </span><h1 />'"`.)
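The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the `Span` struct, its field layout, and the helper names `rebase_after_trim` and `demote_if_bytes_differ` are hypothetical stand-ins, and the real `remap_spans` may be structured quite differently.

```rust
/// Hypothetical stand-in for the crate's span kind; only two variants are modelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanKind {
    Direct,
    EntityDecoded,
}

/// Hypothetical span layout: byte ranges into the source and into the output.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Span {
    pub kind: SpanKind,
    pub source_start: usize,
    pub source_end: usize,
    pub output_start: usize,
    pub output_end: usize,
}

/// Rebase output offsets after `lead` bytes were trimmed from the front,
/// then clamp BOTH ends to the trimmed length so slicing cannot go out of bounds.
pub fn rebase_after_trim(spans: &mut [Span], lead: usize, trimmed_len: usize) {
    for s in spans.iter_mut() {
        s.output_start = s.output_start.saturating_sub(lead).min(trimmed_len);
        s.output_end = s.output_end.saturating_sub(lead).min(trimmed_len);
    }
}

/// Demote a `Direct` span unless its output bytes are identical to its source bytes.
/// Comparing slices catches both length changes and same-length content swaps.
pub fn demote_if_bytes_differ(span: &mut Span, source: &[u8], output: &[u8]) {
    if span.kind != SpanKind::Direct {
        return;
    }
    let src = source.get(span.source_start..span.source_end);
    let out = output.get(span.output_start..span.output_end);
    // An invalid range on either side is also treated as a mismatch.
    if src.is_none() || src != out {
        span.kind = SpanKind::EntityDecoded;
    }
}
```

In this sketch, a span whose source byte is `' '` and whose output byte is `'\n'` has equal length on both sides but fails the slice comparison, so it would be demoted to `EntityDecoded`.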
Added
- `strip_to_text_with_spans` and `strip_to_text_with_paths` now emit a single whole-input span on the plain-text fast path (input with no `<`). Kind is `Direct` when output bytes equal source, `EntityDecoded` when decoding or whitespace cleanup changed them. Previously the fast path returned an empty `SpanMap`, which was API-inconsistent with the tagged path.
- `html::filter_low_sentence_density(segments, min_sentences_per_100_words)`: drops `NarrativeText`/`UncategorizedText` segments whose `(punctuation count) / (word count) * 100` falls below the floor. Catches tag-cloud paragraphs that link-density misses. Preserves Title, Header, Footer, ListItem, Table, CodeSnippet, Formula, Image, FigureCaption, PageBreak, and short blocks (<15 words).
- DOCX tables emit `Segment::Table` with `metadata.text_as_html` populated from a normalized `<table><tr><td>…</td></tr></table>` representation. HTML-sensitive characters (`<`, `>`, `&`, `"`) in cell text are escaped.
- `examples/filter_pipeline.rs`: runnable walkthrough of the three composable filters. Each stage drops exactly one segment in the demo input.
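To make the sentence-density rule concrete, here is a small standalone sketch of the `(punctuation count) / (word count) * 100` score and the short-block exemption. It does not call the crate: the function names are hypothetical, and treating only `.`, `!`, and `?` as sentence punctuation is an assumption, since the notes do not say which characters are counted.

```rust
/// Sentence-ending punctuation per 100 whitespace-separated words.
/// Assumption: only '.', '!' and '?' count as sentence punctuation.
pub fn sentences_per_100_words(text: &str) -> Option<f64> {
    let words = text.split_whitespace().count();
    if words == 0 {
        return None; // density is undefined for empty text
    }
    let punct = text
        .chars()
        .filter(|c| matches!(c, '.' | '!' | '?'))
        .count();
    Some(punct as f64 / words as f64 * 100.0)
}

/// Decide whether a narrative block survives the filter.
/// Blocks under 15 words are always kept, mirroring the short-block exemption.
pub fn keep_block(text: &str, min_sentences_per_100_words: f64) -> bool {
    if text.split_whitespace().count() < 15 {
        return true;
    }
    match sentences_per_100_words(text) {
        Some(density) => density >= min_sentences_per_100_words,
        None => true,
    }
}
```

Under these assumptions, a 20-word tag cloud with no sentence punctuation scores 0 and would be dropped at a floor of 1.0, while ordinary prose with a period every dozen words or so clears the floor easily.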
WCXB benchmark (dev split, 1,497 pages)
| Pipeline | F1 | P | R | without% |
|---|---|---|---|---|
| `strip_to_text` (baseline) | 0.740 | 0.675 | 0.957 | 56.5% |
| + link-density (cap 0.45) | 0.748 | 0.696 | 0.944 | 64.6% |
| + sentence-density (1.0) | 0.740 | 0.678 | 0.952 | 59.2% |
| link + sentence + boilerplate | 0.765 | 0.739 | 0.909 | 78.2% |
Article F1 0.851 → 0.876. Forum +4.8pp, product +5.1pp, service +4.3pp. Listing −2.6pp (link-heavy pages are legitimately link-dense).
Tests
- `tests/spanmap.rs`: 36 → 67. Regression guards for the 0.11.0 `</a>` path-leak, sibling indexing, UTF-8 char-boundary safety, per-`SpanKind` `source_position` semantics, whitespace-collapse demotion, self-closing tags, unclosed tags, multibyte text, and trim-end OOB.
- `tests/proptest.rs`: 22 → 32. Invariants for span bounds, sort order, non-overlap, `source_range` monotonicity, `Direct` first-byte byte-exactness, and plain-strip output parity.
- `tests/segments.rs`: 22 → 29. DOCX table extraction with escaped `text_as_html` and sentence-density filter composition.
- `tests/bench_real_html.rs` migrated off live URLs to WCXB fixtures (3 `#[ignore]` smoke tests).
Total `cargo test --all-features --all-targets`: 467 → 559 passing. 14 doc-tests. Clippy and doc warnings: 0.
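For readers who want to assert the same span invariants in their own code, the bounds, sort-order, and non-overlap properties listed under Tests could be checked with something like the sketch below. It is a plain function rather than an actual proptest strategy, it models output spans as bare `Range<usize>` values, and the name `check_output_spans` is hypothetical.

```rust
use std::ops::Range;

/// Check that output spans are in bounds, sorted by start, and non-overlapping.
/// Requiring each start to be at or past the previous end covers both
/// sort order and non-overlap in a single comparison.
pub fn check_output_spans(spans: &[Range<usize>], output_len: usize) -> Result<(), String> {
    let mut prev_end = 0usize;
    for (i, s) in spans.iter().enumerate() {
        if s.start > s.end || s.end > output_len {
            return Err(format!("span {i} out of bounds: {s:?}"));
        }
        if s.start < prev_end {
            return Err(format!("span {i} overlaps or is out of order: {s:?}"));
        }
        prev_end = s.end;
    }
    Ok(())
}
```

A caller could run this over the spans returned alongside the stripped text before slicing, which turns a potential out-of-bounds panic into an error value.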
Compatibility
- MSRV unchanged (1.80.0).
- No breaking API changes.
Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md