deformat 0.13.1

@arclabs561 arclabs561 released this 23 Apr 19:20

Bug-fix release focused on malformed-HTML recovery and multilingual correctness. No breaking API changes.

Fixed — scanner recovery on malformed HTML

  • <main> landmark resets skip_depth when unclosed <nav> / <aside> / <footer> would otherwise strand the scanner. Recovers 3 of 4 WCXB articles that previously extracted 0 characters.
  • Unclosed attribute quotes (e.g. <img … wp-image-3737" />) no longer swallow the rest of the document. A 256-byte + tag-start recovery heuristic ends the phantom quote.
  • is_wiki_skip_tag id matching requires exact equality or a hyphen-separated prefix. Drupal id="toc_19421" per-section anchors are no longer false-matched as Wikipedia TOCs.
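
The id-matching rule in the last bullet can be sketched as follows. This is a minimal illustration, not the crate's code: the helper name `id_matches_skip_prefix`, its signature, and the case-sensitive comparison are assumptions made for the example.

```rust
/// Hypothetical helper (not part of deformat's API): returns true when `id`
/// equals `needle` exactly, or starts with `needle` followed by a hyphen.
/// The comparison is case-sensitive, which is an assumption of this sketch.
fn id_matches_skip_prefix(id: &str, needle: &str) -> bool {
    match id.strip_prefix(needle) {
        // Exact match leaves nothing; a hyphen-separated prefix leaves "-...".
        Some(rest) => rest.is_empty() || rest.starts_with('-'),
        None => false,
    }
}
```

Under this rule, `toc` and `toc-main` match the needle `toc`, while `toc_19421` and `tocbar` do not, because the text after the prefix starts with something other than a hyphen.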

Fixed — multilingual content preservation

  • filter_low_sentence_density and filter_boilerplate counted only ASCII .?! terminators, silently dropping pages of CJK / Arabic / Hindi / Armenian / Ethiopic prose that use 。？！ / ؟ / । / ։ / ።. The check now recognizes the Unicode sentence-terminator inventory used across WCXB and Common Crawl multilingual corpora.
  • Word-count fallback switched from whitespace-split to a character-based proxy (chars / 5) so space-less scripts (Chinese, Japanese, Thai) clear the density-noise floor instead of silently skipping the filter.
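
A rough sketch of the two checks described above, under stated assumptions: the terminator list is an illustrative subset (the crate's full inventory is not given in these notes), and the function names `count_sentence_terminators` and `approx_word_count` are hypothetical. Whether the real proxy counts whitespace characters is also not stated; this version counts every character.

```rust
/// Illustrative subset of sentence terminators; the real inventory used by
/// the filters is not listed in these notes.
const SENTENCE_TERMINATORS: &[char] = &[
    '.', '?', '!', // ASCII
    '\u{3002}',    // ideographic full stop (CJK)
    '\u{FF1F}',    // fullwidth question mark
    '\u{FF01}',    // fullwidth exclamation mark
    '\u{061F}',    // Arabic question mark
    '\u{0964}',    // Devanagari danda (Hindi)
    '\u{0589}',    // Armenian full stop
    '\u{1362}',    // Ethiopic full stop
];

/// Counts characters that end a sentence in any of the scripts above.
fn count_sentence_terminators(text: &str) -> usize {
    text.chars()
        .filter(|c| SENTENCE_TERMINATORS.contains(c))
        .count()
}

/// Character-based word-count proxy (chars / 5), so scripts written without
/// spaces still produce a non-trivial word estimate.
fn approx_word_count(text: &str) -> usize {
    text.chars().count() / 5
}
```

For a Japanese string such as "これは文です。次の文です。", a whitespace split yields a single token, while the character proxy gives 13 / 5 = 2 and the terminator count is 2.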

Added

  • Segment::CodeSnippet.metadata.languages now populates from the <code class="language-X"> / lang-X class attribute, matching the long-standing README + enum-doc claim. Handles Pandoc / GFM / Prism / highlight.js conventions. Language identifier is lowercased.
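
One way the class-attribute convention could be parsed is sketched below. The function `language_from_class` is hypothetical and returns only the first matching token; how the crate actually fills `metadata.languages` (for example, whether it collects several tokens) is not described in these notes.

```rust
/// Hypothetical helper (not deformat's API): scans a `class` attribute value
/// for a `language-X` or `lang-X` token and returns `X` lowercased.
fn language_from_class(class_attr: &str) -> Option<String> {
    class_attr.split_whitespace().find_map(|token| {
        // Lowercase first so both the prefix check and the result are
        // case-insensitive (an assumption of this sketch).
        let lower = token.to_ascii_lowercase();
        let name = lower
            .strip_prefix("language-")
            .or_else(|| lower.strip_prefix("lang-"))?;
        if name.is_empty() {
            None
        } else {
            Some(name.to_string())
        }
    })
}
```

With this sketch, `class="language-Rust"` yields `rust`, `class="hljs lang-PY"` yields `py`, and a class list with no matching token yields `None`.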

Tests and fixtures

  • New tests/fixtures/synthetic/ (~10 KB) with committed minimal-valid DOCX / XLSX / PPTX / EPUB / RTF authored here and dual-licensed MIT-OR-Apache-2.0. Generated deterministically by scripts/generate_synthetic_fixtures.py (Python stdlib only).
  • New tests/fixtures/adversarial/ with minimized regression repros for the three malformed-HTML recovery paths plus a 6-language multilingual article fixture.
  • tests/real_formats.rs rewritten against committed fixtures. The 8 #[ignore]-gated tests that previously only ran via scripts/fetch_fixtures.sh are now 14 non-ignored tests that run in CI.
  • tests/fixtures/PROVENANCE.md documents per-file origin, license, and the commit-vs-fetch decision rubric.
  • New proptest randomizes over 7 non-ASCII sentence terminators to guard the filter against future ASCII-centric regressions.
  • Total tests: 579 → 603.

Measured impact (WCXB dev split, 1,495 pages, triple-filter pipeline)

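| Metric           | 0.13.0 | 0.13.1 | Δ      |
|------------------|--------|--------|--------|
| Overall F1       | 0.767  | 0.774  | +0.7pp |
| Article F1       | 0.876  | 0.880  | +0.4pp |
| Documentation F1 | 0.885  | 0.906  | +2.1pp |
| Product F1       | 0.485  | 0.500  | +1.5pp |
| Service F1       | 0.772  | 0.790  | +1.8pp |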

The multilingual fix is primarily a quality-of-output win: WCXB is English-heavy, so the fix's effect on F1 is within measurement noise, but users feeding non-English content now get correct filter behavior.

Compatibility

  • MSRV unchanged (1.80.0).
  • Published size: 182 KB compressed (vs. 181 KB for 0.13.0).

Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md