Release deformat 0.13.0 · arclabs561/deformat

Fixed

Void HTML5 elements (<img>, <br>, <hr>, <input>, and the rest of the void-element list) were pushed onto the path-tracking stack but never popped, because void elements have no closing tag. Text emitted after a void element within the same block therefore carried that element in its PathSpan.path: for example, "Before <img> between" produced article/p/img for "between". Void elements are no longer pushed.
strip_to_text_with_paths only clamped span output_end for leading-whitespace trim; trailing trim shortened the output string without adjusting spans, panicking callers that indexed text[span.output_start..span.output_end] on inputs with trailing whitespace. Both sides are now clamped.
remap_spans demoted SpanKind::Direct → EntityDecoded only when byte counts changed after whitespace cleanup. A whitespace run like " \n" collapses to "\n" with byte count preserved but the byte value swapped, leaving a Direct span whose output \n claimed byte-exact correspondence to source ' '. Now also compares bytes and demotes on content mismatch.
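
The two-sided clamp described for strip_to_text_with_paths can be pictured with a minimal sketch. The `Span` struct and the `trim_and_clamp` function below are hypothetical stand-ins written for this note, not deformat's actual types or internals; the sketch only assumes that spans are byte ranges into the output string.

```rust
/// Hypothetical stand-in for a span over the output string.
/// Not deformat's real span type; it has only the two fields needed here.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    output_start: usize,
    output_end: usize,
}

/// Trim whitespace from both ends of `text` and clamp every span so it
/// stays inside the trimmed string's byte range.
fn trim_and_clamp(text: &str, spans: &[Span]) -> (String, Vec<Span>) {
    // Number of bytes removed from the front by the leading trim.
    let lead = text.len() - text.trim_start().len();
    let trimmed = text.trim();
    let new_len = trimmed.len();

    let clamped = spans
        .iter()
        .map(|s| {
            // Shift left by the leading trim, then cap at the new length
            // so the trailing trim cannot leave a span out of bounds.
            let start = s.output_start.saturating_sub(lead).min(new_len);
            let end = s.output_end.saturating_sub(lead).min(new_len);
            Span { output_start: start, output_end: end.max(start) }
        })
        .collect();

    (trimmed.to_string(), clamped)
}
```

With this shape, slicing the trimmed text by any returned span stays in bounds even when the input ends in whitespace, which is the case that previously panicked.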

Added

strip_to_segments emits Segment::Image for blocks whose text comes only from SpanKind::Synthetic spans — the typical case is a standalone <img> inside <figure>, <body>, or at the document root. Inline <img> within a paragraph keeps the enclosing NarrativeText.
Structural block roles (Title, Header, Footer, ListItem, Table, CodeSnippet, FigureCaption) always win over Image — an <img> inside <h1>, <td>, <li>, or <pre> belongs to that container's semantic role, not a bare Image.
Segment::CodeSnippet now populates metadata.languages from the <code class="language-X"> (or lang-X) class attribute. Handles Pandoc / GFM / Prism / highlight.js conventions; the language identifier is lowercased.
<summary> classifies as Segment::Title and updates last_title_id, so paragraphs inside a <details> carry parent_id pointing at the summary.
<address>, <fieldset>, <legend> classify as NarrativeText (previously fell through to UncategorizedText).
html::filter_low_sentence_density(segments, min_sentences_per_100_words) drops NarrativeText / UncategorizedText segments whose (punctuation count) / (word count) * 100 falls below the floor. Catches tag-cloud paragraphs that link-density misses because they aren't wrapped in anchors. (Shipped mid-0.12.0; formal release notes here.)
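
As a rough illustration of the sentence-density floor, here is a minimal sketch. The `Seg` enum and both function names are hypothetical and much simpler than deformat's real `Segment` API. The sketch also assumes that "punctuation" means only '.', '!' and '?', and that words are split on whitespace; the release notes do not specify either.

```rust
/// Hypothetical, simplified segment type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Seg {
    Title(String),
    NarrativeText(String),
    UncategorizedText(String),
}

/// (punctuation count) / (word count) * 100.
/// ASSUMPTION: only '.', '!' and '?' count as punctuation, and words are
/// whitespace-separated. Text with no words scores 0.
fn sentences_per_100_words(text: &str) -> f64 {
    let words = text.split_whitespace().count();
    if words == 0 {
        return 0.0;
    }
    let marks = text.chars().filter(|c| matches!(c, '.' | '!' | '?')).count();
    marks as f64 / words as f64 * 100.0
}

/// Drop NarrativeText / UncategorizedText segments below the floor;
/// every other segment kind passes through untouched.
fn drop_low_sentence_density(segments: Vec<Seg>, min_per_100_words: f64) -> Vec<Seg> {
    segments
        .into_iter()
        .filter(|seg| match seg {
            Seg::NarrativeText(t) | Seg::UncategorizedText(t) => {
                sentences_per_100_words(t) >= min_per_100_words
            }
            _ => true,
        })
        .collect()
}
```

Under these assumptions, a tag-cloud paragraph such as "rust html parser scraper markdown text extraction crawler" scores 0 and is dropped at a floor of 1.0, while ordinary prose with a couple of sentences clears the floor easily.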

Changed

The link-density filter preserves Table alongside the existing Title and Header. Tables that reach the segmenter past the scanner-level nav/footer/aside skip are content (product specs, comparison grids, TOC tables on documentation pages). WCXB triple-pipeline listing F1 0.580 → 0.613 (+3.3pp); overall F1 0.765 → 0.767.
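
The role-preserving behaviour can be sketched as follows. The `Role` enum, the `Block` struct, and the definition of link density as anchor-text length over total text length are all assumptions made for this illustration, not deformat's actual types or formula. The 0.45 cap used in the usage note is the one quoted in the benchmark table.

```rust
/// Hypothetical block roles; only the ones needed for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Title,
    Header,
    Table,
    NarrativeText,
}

/// Hypothetical block summary: total text length and the part of it
/// that sits inside anchors.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    role: Role,
    text_len: usize,
    link_text_len: usize,
}

/// ASSUMPTION: link density = anchor text length / total text length.
fn link_density(b: &Block) -> f64 {
    if b.text_len == 0 {
        0.0
    } else {
        b.link_text_len as f64 / b.text_len as f64
    }
}

/// Keep Title, Header and Table blocks unconditionally; keep any other
/// block only if its link density does not exceed `cap`.
fn filter_link_density(blocks: Vec<Block>, cap: f64) -> Vec<Block> {
    blocks
        .into_iter()
        .filter(|b| {
            matches!(b.role, Role::Title | Role::Header | Role::Table)
                || link_density(b) <= cap
        })
        .collect()
}
```

Called with a cap of 0.45, a link-heavy Table block survives while an equally link-heavy NarrativeText block is dropped, which is the distinction the change introduces.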

Examples + interop

examples/segments_json.rs — emit pure Vec<Segment> JSON to stdout.
scripts/langchain_interop.py — stdlib-only Python script that deserializes segments.json into (page_content, metadata) tuples matching langchain_core.documents.Document.
examples/filter_pipeline.rs — runnable walkthrough of the three-filter composition (link-density → sentence-density → boilerplate) on a single HTML page.

WCXB benchmark (dev split, 1,495 pages)

Pipeline	F1	P	R	without%
`strip_to_text` (baseline)	0.740	0.675	0.957	56.5%
+ link-density (cap 0.45)	0.748	0.696	0.944	64.6%
+ sentence-density (1.0)	0.740	0.678	0.952	59.2%
link + sentence + boilerplate	0.767	0.739	0.913	78.0%

Per-type F1 deltas from baseline: article +2.5pp (0.851 → 0.876), service +4.2pp, forum +4.7pp, product +4.7pp, listing +1.1pp (recovering the 0.12.0 regression).

Tests

tests/spanmap.rs: 36 → 69. Void-element regression guards, path-leak regressions.
tests/segments.rs: 22 → 45. Segment::Image emission, structural roles overriding Image, <summary> classification, Table preservation under link-density, CodeSnippet language hints.
tests/proptest.rs: 22 → 32.
tests/bench_real_html.rs migrated from live URLs to WCXB-fixture smoke tests.

Total cargo test --all-features --all-targets: 467 (0.10.0) → 577 passing. 14 doc-tests. Clippy/doc/fmt clean. MSRV unchanged (1.80.0).

Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md
