deformat 0.13.0

arclabs561 released this 23 Apr 13:12 · 31 commits to main since this release

Fixed

  • Void HTML5 elements (<img>, <br>, <hr>, <input>, and the rest of the void-element list) were pushed onto the path-tracking stack but never popped (void elements have no closing tag). Text emitted after a void element within the same block inherited it in its PathSpan.path — e.g., "Before <img> between" produced article/p/img for "between". Void elements are no longer pushed.
  • strip_to_text_with_paths only clamped span output_end for leading-whitespace trim; trailing trim shortened the output string without adjusting spans, panicking callers that indexed text[span.output_start..span.output_end] on inputs with trailing whitespace. Both sides are now clamped.
  • remap_spans demoted SpanKind::DirectEntityDecoded only when byte counts changed after whitespace cleanup. A whitespace run like " \n" collapses to "\n" with byte count preserved but the byte value swapped, leaving a Direct span whose output \n claimed byte-exact correspondence to source ' '. Now also compares bytes and demotes on content mismatch.
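
The void-element fix can be illustrated with a small sketch. This is not deformat's scanner: `Token`, `is_void`, and `paths_for_text` are hypothetical names, and the input is assumed to be pre-tokenized so the example stays short. It only shows why skipping the push for void elements keeps later text on the correct path.

```rust
/// Hypothetical pre-tokenized input; the real scanner works on raw HTML.
#[derive(Debug, Clone, Copy)]
pub enum Token<'a> {
    Open(&'a str),
    Close(&'a str),
    Text(&'a str),
}

/// Void elements as listed in the HTML Living Standard (no closing tag).
const VOID_ELEMENTS: &[&str] = &[
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
];

pub fn is_void(tag: &str) -> bool {
    VOID_ELEMENTS.iter().any(|v| v.eq_ignore_ascii_case(tag))
}

/// Returns (text, path) pairs, where path is the '/'-joined stack of open elements.
pub fn paths_for_text<'a>(tokens: &[Token<'a>]) -> Vec<(&'a str, String)> {
    let mut stack: Vec<&str> = Vec::new();
    let mut out = Vec::new();
    for tok in tokens {
        match *tok {
            // A void element never receives a Close, so pushing it would leak
            // into the path of every later text run in the same block.
            Token::Open(tag) if is_void(tag) => {}
            Token::Open(tag) => stack.push(tag),
            Token::Close(tag) => {
                // Pop only when the close matches the innermost open element.
                if stack.last().map_or(false, |top| top.eq_ignore_ascii_case(tag)) {
                    stack.pop();
                }
            }
            Token::Text(t) => out.push((t, stack.join("/"))),
        }
    }
    out
}
```

Fed the tokens for `<article><p>Before <img> between</p></article>`, both text runs report `article/p`; with the void check removed, the second run would report `article/p/img`.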

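The two-sided clamping described for strip_to_text_with_paths can be sketched as follows. `Span` and `trim_with_spans` are hypothetical stand-ins (the real PathSpan carries more fields), and the sketch assumes span offsets are byte offsets on char boundaries.

```rust
/// Hypothetical span type holding byte offsets into the output string.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Span {
    pub output_start: usize,
    pub output_end: usize,
}

/// Trims whitespace from both ends of `text` and adjusts spans so that
/// `trimmed[span.output_start..span.output_end]` never goes out of bounds.
pub fn trim_with_spans(text: &str, spans: &[Span]) -> (String, Vec<Span>) {
    // Number of bytes removed from the front.
    let lead = text.len() - text.trim_start().len();
    let trimmed = text.trim();
    let new_len = trimmed.len();
    let adjusted: Vec<Span> = spans
        .iter()
        .map(|s| {
            // Shift left by the leading trim (saturating at 0), then clamp to
            // the trimmed length so trailing trim cannot leave a dangling end.
            let start = s.output_start.saturating_sub(lead).min(new_len);
            let end = s.output_end.saturating_sub(lead).min(new_len);
            Span { output_start: start, output_end: end.max(start) }
        })
        .collect();
    (trimmed.to_string(), adjusted)
}
```

Clamping only the leading side would leave the last span's `output_end` pointing past the trimmed string, which is the slice-index panic the fix removes.
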
Added

  • strip_to_segments emits Segment::Image for blocks whose text comes only from SpanKind::Synthetic spans — the typical case is a standalone <img> inside <figure>, <body>, or at the document root. Inline <img> within a paragraph keeps the enclosing NarrativeText.
  • Structural block roles (Title, Header, Footer, ListItem, Table, CodeSnippet, FigureCaption) always win over Image — an <img> inside <h1>, <td>, <li>, or <pre> belongs to that container's semantic role, not a bare Image.
  • Segment::CodeSnippet now populates metadata.languages from the <code class="language-X"> (or lang-X) class attribute. Handles Pandoc / GFM / Prism / highlight.js conventions; the language identifier is lowercased.
  • <summary> classifies as Segment::Title and updates last_title_id, so paragraphs inside a <details> carry parent_id pointing at the summary.
  • <address>, <fieldset>, <legend> classify as NarrativeText (previously fell through to UncategorizedText).
  • html::filter_low_sentence_density(segments, min_sentences_per_100_words) drops NarrativeText / UncategorizedText segments whose (punctuation count) / (word count) * 100 falls below the floor. Catches tag-cloud paragraphs that link-density misses because they aren't wrapped in anchors. (Shipped mid-0.12.0; formal release notes here.)
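
The sentence-density ratio above can be sketched on plain strings. This is an illustration, not the crate's implementation: `sentences_per_100_words` and `keep_dense` are hypothetical names, and counting only `.`, `!`, and `?` as punctuation is an assumption, since the notes do not say which characters are counted. The real function operates on segments and only considers NarrativeText / UncategorizedText.

```rust
/// (punctuation count) / (word count) * 100.
/// Assumption: only '.', '!' and '?' count as punctuation here.
pub fn sentences_per_100_words(text: &str) -> f64 {
    let words = text.split_whitespace().count();
    if words == 0 {
        return 0.0;
    }
    let punct = text
        .chars()
        .filter(|c| matches!(*c, '.' | '!' | '?'))
        .count();
    punct as f64 / words as f64 * 100.0
}

/// Keeps paragraphs whose density is at or above the floor.
pub fn keep_dense<'a>(paragraphs: &[&'a str], min_per_100: f64) -> Vec<&'a str> {
    paragraphs
        .iter()
        .copied()
        .filter(|p| sentences_per_100_words(p) >= min_per_100)
        .collect()
}
```

A tag-cloud paragraph such as "rust python go java kotlin swift scala haskell" scores 0.0 and is dropped at a floor of 1.0, while "This is a sentence. Here is another one!" scores 25.0 and is kept.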

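The CodeSnippet language hint from the Added list can be sketched as a class-attribute scan. `language_from_class` is a hypothetical helper, not the crate's API; lowercasing the whole token (which also makes the prefix match case-insensitive) is an assumption beyond what the notes state.

```rust
/// Extracts a lowercased language identifier from a class attribute value
/// such as "language-Rust" or "hljs lang-py". Returns None when no token
/// carries a `language-` or `lang-` prefix.
pub fn language_from_class(class_attr: &str) -> Option<String> {
    class_attr.split_ascii_whitespace().find_map(|token| {
        let lower = token.to_ascii_lowercase();
        let rest = lower
            .strip_prefix("language-")
            .or_else(|| lower.strip_prefix("lang-"))?;
        // A bare prefix with nothing after it carries no language.
        if rest.is_empty() {
            None
        } else {
            Some(rest.to_string())
        }
    })
}
```

Scanning every whitespace-separated token, rather than the whole attribute, lets the hint survive extra classes that highlighters add next to the language class.
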
Changed

  • The link-density filter preserves Table alongside the existing Title and Header. Tables that reach the segmenter past the scanner-level nav/footer/aside skip are content (product specs, comparison grids, TOC tables on documentation pages). WCXB triple-pipeline listing F1 0.580 → 0.613 (+3.3pp); overall F1 0.765 → 0.767.
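
A minimal sketch of a link-density filter with the role exemption described above. The types and names (`Role`, `Seg`, `link_density`, `filter_link_density`) are hypothetical, and defining density as linked characters over total characters, with an inclusive cap, is an assumption; only the exempt roles and the 0.45 cap come from these notes.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Title,
    Header,
    Table,
    NarrativeText,
}

/// Hypothetical segment summary: total characters and characters inside links.
#[derive(Debug, Clone)]
pub struct Seg {
    pub role: Role,
    pub text_chars: usize,
    pub link_chars: usize,
}

/// Assumed definition: fraction of the segment's text that sits inside anchors.
pub fn link_density(seg: &Seg) -> f64 {
    if seg.text_chars == 0 {
        0.0
    } else {
        seg.link_chars as f64 / seg.text_chars as f64
    }
}

/// Drops segments above the cap, except Title, Header and Table,
/// which are always preserved.
pub fn filter_link_density(segs: Vec<Seg>, cap: f64) -> Vec<Seg> {
    segs.into_iter()
        .filter(|s| {
            matches!(s.role, Role::Title | Role::Header | Role::Table)
                || link_density(s) <= cap
        })
        .collect()
}
```

With a cap of 0.45, a Table whose text is 90% links survives, while a NarrativeText segment with the same ratio is dropped.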

Examples + interop

  • examples/segments_json.rs — emit pure Vec<Segment> JSON to stdout.
  • scripts/langchain_interop.py — stdlib-only Python script that deserializes segments.json into (page_content, metadata) tuples matching langchain_core.documents.Document.
  • examples/filter_pipeline.rs — runnable walkthrough of the three-filter composition (link-density → sentence-density → boilerplate) on a single HTML page.

WCXB benchmark (dev split, 1,495 pages)

| Pipeline | F1 | P | R | without% |
|---|---|---|---|---|
| strip_to_text (baseline) | 0.740 | 0.675 | 0.957 | 56.5% |
| + link-density (cap 0.45) | 0.748 | 0.696 | 0.944 | 64.6% |
| + sentence-density (1.0) | 0.740 | 0.678 | 0.952 | 59.2% |
| link + sentence + boilerplate | 0.767 | 0.739 | 0.913 | 78.0% |

Per-type F1 deltas from baseline: article +2.5pp (0.851 → 0.876), service +4.2pp, forum +4.7pp, product +4.7pp, listing +1.1pp (recovering the 0.12.0 regression).

Tests

  • tests/spanmap.rs: 36 → 69. Void-element regression guards, path-leak regressions.
  • tests/segments.rs: 22 → 45. Segment::Image emission, structural roles overriding Image, <summary> classification, Table preservation under link-density, CodeSnippet language hints.
  • tests/proptest.rs: 22 → 32.
  • tests/bench_real_html.rs migrated from live URLs to WCXB-fixture smoke tests.

Total cargo test --all-features --all-targets: 467 (0.10.0) → 577 passing. 14 doc-tests. Clippy/doc/fmt clean. MSRV unchanged (1.80.0).

Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md