Release deformat 0.13.1 · arclabs561/deformat

Bug-fix release focused on malformed-HTML recovery and multilingual correctness. No breaking API changes.

Fixed — scanner recovery on malformed HTML

<main> landmark resets skip_depth when unclosed <nav> / <aside> / <footer> would otherwise strand the scanner. Recovers 3 of 4 WCXB articles that previously extracted 0 characters.
Unclosed attribute quotes (e.g. <img … wp-image-3737" />) no longer swallow the rest of the document. A 256-byte + tag-start recovery heuristic ends the phantom quote.
is_wiki_skip_tag id matching requires exact equality or a hyphen-separated prefix. Drupal id="toc_19421" per-section anchors are no longer false-matched as Wikipedia TOCs.
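
The stricter id rule can be illustrated with a minimal sketch. The function name id_matches_marker and the marker value "toc" are hypothetical, chosen only for illustration; the actual is_wiki_skip_tag implementation may be structured differently.

```rust
/// Hypothetical helper (not the crate's API): returns true when `id` is
/// exactly `marker`, or is `marker` followed by a hyphen and a suffix.
fn id_matches_marker(id: &str, marker: &str) -> bool {
    match id.strip_prefix(marker) {
        // "toc" (nothing left) or "toc-..." (hyphen-separated prefix).
        Some(rest) => rest.is_empty() || rest.starts_with('-'),
        // `id` does not start with `marker` at all.
        None => false,
    }
}
```

Under this rule, "toc" and "toc-main" match the marker "toc", while "toc_19421" does not, because the character after the prefix is an underscore rather than a hyphen.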

Fixed — multilingual content preservation

filter_low_sentence_density and filter_boilerplate counted only ASCII .?! terminators, silently dropping pages of CJK / Arabic / Hindi / Armenian / Ethiopic prose that use 。？！ / ؟ / । / ։ / ።. The check now recognizes the Unicode sentence-terminator inventory used across WCXB and Common Crawl multilingual corpora.
Word-count fallback switched from whitespace-split to a character-based proxy (chars / 5) so space-less scripts (Chinese, Japanese, Thai) clear the density-noise floor instead of silently skipping the filter.
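
A rough sketch of the two ideas above, assuming only the terminators listed in this note. The constant and function names are hypothetical, and the real filters may recognize a larger inventory or compute the proxy differently.

```rust
/// ASCII terminators plus the non-ASCII ones named in this release note:
/// CJK fullwidth, Arabic question mark, Devanagari danda,
/// Armenian full stop, Ethiopic full stop.
const SENTENCE_TERMINATORS: &[char] =
    &['.', '?', '!', '。', '？', '！', '؟', '।', '։', '።'];

/// Counts sentence terminators by Unicode scalar value, not by byte.
fn count_sentence_terminators(text: &str) -> usize {
    text.chars()
        .filter(|c| SENTENCE_TERMINATORS.contains(c))
        .count()
}

/// Character-based word-count proxy (chars / 5), usable for scripts
/// that do not separate words with spaces.
fn approx_word_count(text: &str) -> usize {
    text.chars().count() / 5
}
```

For a space-less string such as "今天天气很好我们去公园" (11 characters), a whitespace split yields a single token, while the character-based proxy yields 2, which is why the proxy is less likely to fall under a word-count floor for Chinese, Japanese, or Thai text.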

Added

Segment::CodeSnippet.metadata.languages now populates from the <code class="language-X"> / lang-X class attribute, matching the long-standing README + enum-doc claim. Handles Pandoc / GFM / Prism / highlight.js conventions. Language identifier is lowercased.
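
As an illustration of the class-attribute convention, the sketch below pulls a lowercased language identifier out of a class string. The function language_from_class is hypothetical and returns only the first match; how the crate actually parses the attribute and fills metadata.languages is not shown in this note.

```rust
/// Hypothetical helper (not the crate's API): finds the first class token
/// of the form `language-X` or `lang-X` and returns `X` lowercased.
fn language_from_class(class_attr: &str) -> Option<String> {
    class_attr.split_whitespace().find_map(|token| {
        token
            .strip_prefix("language-")
            .or_else(|| token.strip_prefix("lang-"))
            // Ignore a bare "language-" or "lang-" with no identifier.
            .filter(|name| !name.is_empty())
            .map(|name| name.to_lowercase())
    })
}
```

With this sketch, a class string like "hljs language-Rust" yields "rust", "lang-py" yields "py", and a class with no language token yields nothing.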

Tests and fixtures

New tests/fixtures/synthetic/ (~10 KB) with committed minimal-valid DOCX / XLSX / PPTX / EPUB / RTF authored here and dual-licensed MIT-OR-Apache-2.0. Generated deterministically by scripts/generate_synthetic_fixtures.py (Python stdlib only).
New tests/fixtures/adversarial/ with minimized regression repros for the three malformed-HTML recovery paths plus a 6-language multilingual article fixture.
tests/real_formats.rs rewritten against committed fixtures. The 8 #[ignore]-gated tests that previously only ran via scripts/fetch_fixtures.sh are now 14 non-ignored tests that run in CI.
tests/fixtures/PROVENANCE.md documents per-file origin, license, and the commit-vs-fetch decision rubric.
New proptest randomizes over 7 non-ASCII sentence terminators to guard the filter against future ASCII-centric regressions.
Total tests: 579 → 603.

Measured impact (WCXB dev split, 1,495 pages, triple-filter pipeline)

Metric	0.13.0	0.13.1	Δ
Overall F1	0.767	0.774	+0.7pp
Article F1	0.876	0.880	+0.4pp
Documentation F1	0.885	0.906	+2.1pp
Product F1	0.485	0.500	+1.5pp
Service F1	0.772	0.790	+1.8pp

The multilingual fix is primarily a quality-of-output win. WCXB is English-heavy, so the F1 shift is within measurement noise, but users feeding non-English content now get correct filter behavior.

Compatibility

MSRV unchanged (1.80.0).
Published size: 182 KB compressed (vs. 181 KB for 0.13.0).

Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md
