deformat 0.12.0

arclabs561 released this 22 Apr 22:23
· 46 commits to main since this release

Fixed

  • strip_to_text_with_paths: span output_end was rebased for the leading-whitespace trim but not for the trailing one. A span could keep output_end > trimmed_text.len(), so callers slicing text[span.output_start..span.output_end] would panic. Both ends are now clamped to the trimmed output.
  • remap_spans demoted SpanKind::Direct to EntityDecoded only when the byte count changed. Whitespace runs like " \n" collapse to "\n" with the count preserved but the byte value swapped, which left a Direct span whose output \n claimed byte-exact correspondence to a source ' '. remap_spans now also compares the bytes and demotes on content mismatch. (Surfaced by proptest on "a<span> </span><h1 />'".)
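
The two fixes above can be sketched in isolation. This is a minimal illustration, not the crate's code: the Span struct, its source_start/source_end fields, and the helper names trim_and_clamp and demote_if_changed are hypothetical stand-ins, and only output_start, output_end, SpanKind::Direct and EntityDecoded come from the notes.

```rust
// Hypothetical stand-ins for the crate's span types (assumed for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SpanKind {
    Direct,
    EntityDecoded,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Span {
    kind: SpanKind,
    source_start: usize,
    source_end: usize,
    output_start: usize,
    output_end: usize,
}

/// Trim `untrimmed` and rebase `spans` onto the trimmed text.
/// Both ends of every span are shifted by the leading trim and then clamped
/// to the trimmed length, so slicing the result with a span cannot go out of
/// bounds.
fn trim_and_clamp(untrimmed: &str, spans: &mut Vec<Span>) -> String {
    let lead = untrimmed.len() - untrimmed.trim_start().len();
    let trimmed = untrimmed.trim();
    let len = trimmed.len();
    for s in spans.iter_mut() {
        s.output_start = s.output_start.saturating_sub(lead).min(len);
        s.output_end = s.output_end.saturating_sub(lead).min(len);
    }
    // Spans that covered only trimmed-away whitespace are now empty; drop them.
    spans.retain(|s| s.output_start < s.output_end);
    trimmed.to_string()
}

/// Demote a Direct span when its output bytes differ from its source bytes.
/// Slice comparison covers both a length change and a same-length content
/// change (for example ' ' in the source versus '\n' in the output).
fn demote_if_changed(source: &str, output: &str, span: &mut Span) {
    if span.kind != SpanKind::Direct {
        return;
    }
    let src = source.as_bytes().get(span.source_start..span.source_end);
    let out = output.as_bytes().get(span.output_start..span.output_end);
    if src.is_none() || src != out {
        span.kind = SpanKind::EntityDecoded;
    }
}
```

With this shape, a span that ran to the end of "  hello  " ends at 5 after trimming, and a Direct span mapping source "a b" to output "a\nb" is demoted even though both sides are three bytes long.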

Added

  • strip_to_text_with_spans and strip_to_text_with_paths now emit a single whole-input span on the plain-text fast path (input with no <). Kind is Direct when output bytes equal source, EntityDecoded when decoding or whitespace cleanup changed them. Previously the fast path returned an empty SpanMap, which was API-inconsistent with the tagged path.
  • html::filter_low_sentence_density(segments, min_sentences_per_100_words) — drops NarrativeText / UncategorizedText segments whose (punctuation count) / (word count) * 100 falls below the floor. Catches tag-cloud paragraphs that link-density misses. Preserves Title, Header, Footer, ListItem, Table, CodeSnippet, Formula, Image, FigureCaption, PageBreak, and short blocks (<15 words).
  • DOCX tables emit Segment::Table with metadata.text_as_html populated from a normalized <table><tr><td>…</td></tr></table> representation. HTML-sensitive characters (<, >, &, ") in cell text are escaped.
  • examples/filter_pipeline.rs — runnable walkthrough of the three composable filters: each stage drops exactly one segment in the demo input.
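
The sentence-density rule described above can be sketched as follows. This is a rough approximation under stated assumptions, not the crate's implementation: the Segment and Category types and the function name filter_low_sentence_density_sketch are hypothetical, and "punctuation" is assumed here to mean the characters '.', '!' and '?', which the notes do not specify.

```rust
// Hypothetical segment types; the real crate's Segment has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Category {
    NarrativeText,
    UncategorizedText,
    Title,
}

#[derive(Debug, Clone, PartialEq)]
struct Segment {
    category: Category,
    text: String,
}

/// Blocks shorter than this many words are always kept (per the notes).
const SHORT_BLOCK_WORDS: usize = 15;

/// Keep a segment unless it is narrative or uncategorized text, has at least
/// SHORT_BLOCK_WORDS words, and its punctuation-per-100-words score is below
/// `min_per_100_words`.
fn filter_low_sentence_density_sketch(
    segments: Vec<Segment>,
    min_per_100_words: f64,
) -> Vec<Segment> {
    segments
        .into_iter()
        .filter(|seg| {
            let filterable = matches!(
                seg.category,
                Category::NarrativeText | Category::UncategorizedText
            );
            let words = seg.text.split_whitespace().count();
            if !filterable || words < SHORT_BLOCK_WORDS {
                return true;
            }
            // Assumption: only '.', '!' and '?' count as punctuation.
            let marks = seg
                .text
                .chars()
                .filter(|c| matches!(*c, '.' | '!' | '?'))
                .count();
            let density = marks as f64 / words as f64 * 100.0;
            density >= min_per_100_words
        })
        .collect()
}
```

Under these assumptions, a 20-word tag cloud with no punctuation scores 0 and is dropped at a floor of 1.0, while an 18-word paragraph with two full stops scores about 11 and is kept. Titles and short blocks pass through regardless of score.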

WCXB benchmark (dev split, 1,497 pages)

Pipeline                        F1      P       R       without%
strip_to_text (baseline)        0.740   0.675   0.957   56.5%
+ link-density (cap 0.45)       0.748   0.696   0.944   64.6%
+ sentence-density (1.0)        0.740   0.678   0.952   59.2%
link + sentence + boilerplate   0.765   0.739   0.909   78.2%

Article F1 0.851 → 0.876. Forum +4.8pp, product +5.1pp, service +4.3pp. Listing −2.6pp (link-heavy pages are legitimately link-dense).

Tests

  • tests/spanmap.rs: 36 → 67. Regression guards for the 0.11.0 </a> path-leak, sibling indexing, UTF-8 char-boundary safety, per-SpanKind source_position semantics, whitespace-collapse demotion, self-closing tags, unclosed tags, multibyte text, and trim-end OOB.
  • tests/proptest.rs: 22 → 32. Invariants for span bounds, sort order, non-overlap, source_range monotonicity, Direct first-byte byte-exactness, and plain-strip output parity.
  • tests/segments.rs: 22 → 29. DOCX table extraction with escaped text_as_html and sentence-density filter composition.
  • tests/bench_real_html.rs migrated off live URLs to WCXB fixtures (3 #[ignore] smoke tests).

Total cargo test --all-features --all-targets: 467 → 559 passing. 14 doc-tests. Clippy and doc warnings: 0.

Compatibility

  • MSRV unchanged (1.80.0).
  • No breaking API changes.

Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md