deformat 0.13.0
Fixed
- Void HTML5 elements (`<img>`, `<br>`, `<hr>`, `<input>`, and the rest of the void-element list) were pushed onto the path-tracking stack but never popped (void elements have no closing tag). Text emitted after a void element within the same block inherited it in its `PathSpan.path`; e.g., `"Before <img> between"` produced `article/p/img` for `"between"`. Void elements are no longer pushed.
- `strip_to_text_with_paths` only clamped span `output_end` for leading-whitespace trim; trailing trim shortened the output string without adjusting spans, panicking callers that indexed `text[span.output_start..span.output_end]` on inputs with trailing whitespace. Both sides are now clamped.
- `remap_spans` demoted `SpanKind::Direct` → `EntityDecoded` only when byte counts changed after whitespace cleanup. A whitespace run like `" \n"` collapses to `"\n"` with byte count preserved but the byte value swapped, leaving a `Direct` span whose output `\n` claimed byte-exact correspondence to source `' '`. Now also compares bytes and demotes on content mismatch.
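To make the void-element fix concrete, here is a minimal sketch of path tracking that skips void elements. The `Event` enum and `text_paths` function are hypothetical names for illustration only and are not deformat's internal types; the void-element list is the one from the HTML Living Standard.

```rust
/// HTML void elements per the HTML Living Standard; they never have a closing tag.
const VOID_ELEMENTS: [&str; 13] = [
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link", "meta", "source",
    "track", "wbr",
];

/// Hypothetical tokenizer events, not deformat's actual types.
enum Event<'a> {
    Open(&'a str),
    Close(&'a str),
    Text(&'a str),
}

/// Pairs each text run with the element path that was open when it was emitted.
fn text_paths(events: &[Event]) -> Vec<(String, String)> {
    let mut stack: Vec<&str> = Vec::new();
    let mut out = Vec::new();
    for event in events {
        match event {
            // Void elements are never pushed: no closing tag would ever pop them.
            Event::Open(tag) if VOID_ELEMENTS.contains(tag) => {}
            Event::Open(tag) => stack.push(*tag),
            Event::Close(tag) => {
                // Pop only when the closing tag matches the innermost open element.
                if stack.last() == Some(tag) {
                    stack.pop();
                }
            }
            Event::Text(text) => out.push(((*text).to_string(), stack.join("/"))),
        }
    }
    out
}
```

With the guard arm removed, `img` would stay on the stack and the text after it would be reported under `article/p/img`, which is the leak described above.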
Added
- `strip_to_segments` emits `Segment::Image` for blocks whose text comes only from `SpanKind::Synthetic` spans; the typical case is a standalone `<img>` inside `<figure>`, `<body>`, or at the document root. Inline `<img>` within a paragraph keeps the enclosing `NarrativeText`.
- Structural block roles (`Title`, `Header`, `Footer`, `ListItem`, `Table`, `CodeSnippet`, `FigureCaption`) always win over `Image`: an `<img>` inside `<h1>`, `<td>`, `<li>`, or `<pre>` belongs to that container's semantic role, not a bare `Image`.
- `Segment::CodeSnippet` now populates `metadata.languages` from the `<code class="language-X">` (or `lang-X`) class attribute. Handles Pandoc / GFM / Prism / highlight.js conventions; language identifier is lowercased.
- `<summary>` classifies as `Segment::Title` and updates `last_title_id`, so paragraphs inside a `<details>` carry `parent_id` pointing at the summary.
- `<address>`, `<fieldset>`, `<legend>` classify as `NarrativeText` (previously fell through to `UncategorizedText`).
- `html::filter_low_sentence_density(segments, min_sentences_per_100_words)` drops `NarrativeText` / `UncategorizedText` segments whose `(punctuation count) / (word count) * 100` falls below the floor. Catches tag-cloud paragraphs that link-density misses because they aren't wrapped in anchors. (Shipped mid-0.12.0; formal release notes here.)
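The sentence-density formula can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: the `Seg` enum is a hypothetical stand-in for `Segment`, and counting only `.`, `!`, and `?` as punctuation is an assumption, since the notes do not say which characters are counted.

```rust
/// Hypothetical stand-in for a segment; deformat's real `Segment` type is richer.
#[derive(Debug, Clone, PartialEq)]
enum Seg {
    Title(String),
    NarrativeText(String),
    UncategorizedText(String),
}

/// Sentence terminators per 100 words.
/// Assumption: only '.', '!' and '?' count as punctuation.
fn sentence_density(text: &str) -> f64 {
    let words = text.split_whitespace().count();
    if words == 0 {
        return 0.0;
    }
    let punctuation = text
        .chars()
        .filter(|&c| matches!(c, '.' | '!' | '?'))
        .count();
    punctuation as f64 / words as f64 * 100.0
}

/// Drops text segments below the floor; every other role passes through untouched.
fn filter_low_density(segments: Vec<Seg>, min_per_100_words: f64) -> Vec<Seg> {
    segments
        .into_iter()
        .filter(|seg| match seg {
            Seg::NarrativeText(t) | Seg::UncategorizedText(t) => {
                sentence_density(t) >= min_per_100_words
            }
            _ => true,
        })
        .collect()
}
```

A tag-cloud paragraph such as `rust html parser markdown tokens spans` has no terminators, so its density is 0 and it is dropped at any positive floor, while ordinary prose clears a floor of 1.0 easily.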
Changed
- The link-density filter preserves `Table` alongside the existing `Title` and `Header`. Tables that reach the segmenter past the scanner-level nav/footer/aside skip are content (product specs, comparison grids, TOC tables on documentation pages). WCXB triple-pipeline listing F1 0.580 → 0.613 (+3.3pp); overall F1 0.765 → 0.767.
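A rough sketch of the exemption, assuming link density is the share of a segment's text that sits inside anchors and that segments above the cap are dropped. `Role`, `Seg`, and `filter_link_dense` are hypothetical names, not the crate's API.

```rust
/// Hypothetical segment roles; only the ones needed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Title,
    Header,
    Table,
    NarrativeText,
}

/// Hypothetical segment record: total text bytes and the bytes inside anchors.
#[derive(Debug, Clone, PartialEq)]
struct Seg {
    role: Role,
    text_len: usize,
    link_text_len: usize,
}

/// Assumed definition: share of a segment's text that is anchor text.
fn link_density(seg: &Seg) -> f64 {
    if seg.text_len == 0 {
        return 0.0;
    }
    seg.link_text_len as f64 / seg.text_len as f64
}

/// Keeps structural roles unconditionally and everything else at or below the cap.
fn filter_link_dense(segments: Vec<Seg>, cap: f64) -> Vec<Seg> {
    segments
        .into_iter()
        .filter(|seg| {
            // Structural roles are exempt, however link-heavy they are.
            matches!(seg.role, Role::Title | Role::Header | Role::Table)
                || link_density(seg) <= cap
        })
        .collect()
}
```

Under these assumptions a TOC table that is 90% anchor text survives, while a narrative block with the same density is dropped.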
Examples + interop
- `examples/segments_json.rs`: emit pure `Vec<Segment>` JSON to stdout.
- `scripts/langchain_interop.py`: stdlib-only Python script that deserializes `segments.json` into `(page_content, metadata)` tuples matching `langchain_core.documents.Document`.
- `examples/filter_pipeline.rs`: runnable walkthrough of the three-filter composition (link-density → sentence-density → boilerplate) on a single HTML page.
WCXB benchmark (dev split, 1,495 pages)
| Pipeline | F1 | P | R | without% |
|---|---|---|---|---|
| `strip_to_text` (baseline) | 0.740 | 0.675 | 0.957 | 56.5% |
| + link-density (cap 0.45) | 0.748 | 0.696 | 0.944 | 64.6% |
| + sentence-density (1.0) | 0.740 | 0.678 | 0.952 | 59.2% |
| link + sentence + boilerplate | 0.767 | 0.739 | 0.913 | 78.0% |
Per-type F1 deltas from baseline: article +2.5pp (0.851 → 0.876), service +4.2pp, forum +4.7pp, product +4.7pp, listing +1.1pp (recovering the 0.12.0 regression).
Tests
- `tests/spanmap.rs`: 36 → 69. Void-element regression guards, path-leak regressions.
- `tests/segments.rs`: 22 → 45. `Segment::Image` emission, structural roles overriding `Image`, `<summary>` classification, `Table` preservation under link-density, `CodeSnippet` language hints.
- `tests/proptest.rs`: 22 → 32.
- `tests/bench_real_html.rs` migrated from live URLs to WCXB-fixture smoke tests.
Total `cargo test --all-features --all-targets`: 467 (0.10.0) → 577 passing. 14 doc-tests. Clippy/doc/fmt clean. MSRV unchanged (1.80.0).
Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md