deformat 0.13.1
Bug-fix release focused on malformed-HTML recovery and multilingual correctness. No breaking API changes.
Fixed — scanner recovery on malformed HTML
- `<main>` landmark resets `skip_depth` when unclosed `<nav>` / `<aside>` / `<footer>` would otherwise strand the scanner. Recovers 3 of 4 WCXB articles that previously extracted 0 characters.
- Unclosed attribute quotes (e.g. `<img … wp-image-3737" />`) no longer swallow the rest of the document. A 256-byte + tag-start recovery heuristic ends the phantom quote.
- `is_wiki_skip_tag` id matching requires exact equality or a hyphen-separated prefix. Drupal `id="toc_19421"` per-section anchors are no longer false-matched as Wikipedia TOCs.
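The id-matching rule above (exact equality or a hyphen-separated prefix) can be sketched as follows. This is a minimal illustration, not the crate's code: the function name `id_matches_skip_keyword`, the keyword `"toc"`, and case-sensitive comparison are all assumptions made for the example.

```rust
/// Hypothetical helper (not part of deformat's API): returns true when `id`
/// equals `keyword` exactly, or starts with `keyword` followed by a hyphen.
/// Comparison is case-sensitive here, which is an assumption.
fn id_matches_skip_keyword(id: &str, keyword: &str) -> bool {
    match id.strip_prefix(keyword) {
        // Exact equality: nothing is left after the keyword.
        Some("") => true,
        // Hyphen-separated prefix, e.g. "toc-main".
        Some(rest) => rest.starts_with('-'),
        // `id` does not begin with the keyword at all.
        None => false,
    }
}
```

Under this rule an id such as `toc-main` still matches the assumed keyword `toc`, while `toc_19421` (underscore separator) and `tocsin` (plain substring prefix) do not.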
Fixed — multilingual content preservation
- `filter_low_sentence_density` and `filter_boilerplate` counted only ASCII `.?!` terminators, silently dropping pages of CJK / Arabic / Hindi / Armenian / Ethiopic prose that use `。?!` / `؟` / `।` / `։` / `።`. The check now recognizes the Unicode sentence-terminator inventory used across WCXB and Common Crawl multilingual corpora.
- Word-count fallback switched from whitespace-split to a character-based proxy (chars / 5) so space-less scripts (Chinese, Japanese, Thai) clear the density-noise floor instead of silently skipping the filter.
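A minimal sketch of the two ideas above, assuming a small hand-picked terminator set and a plain `chars / 5` proxy. The names `SENTENCE_TERMINATORS`, `count_sentence_terminators`, and `approx_word_count` are hypothetical, and the set shown is illustrative rather than the crate's actual inventory.

```rust
/// Illustrative terminator set (an assumption, not deformat's full list):
/// ASCII marks plus a few non-ASCII terminators named in these notes.
const SENTENCE_TERMINATORS: &[char] = &[
    '.', '?', '!',  // ASCII
    '。',           // CJK ideographic full stop
    '\u{FF1F}',     // fullwidth question mark
    '\u{FF01}',     // fullwidth exclamation mark
    '؟',            // Arabic question mark
    '।',            // Devanagari danda
    '։',            // Armenian full stop
    '።',            // Ethiopic full stop
];

/// Counts characters that end a sentence according to the set above.
fn count_sentence_terminators(text: &str) -> usize {
    text.chars()
        .filter(|c| SENTENCE_TERMINATORS.contains(c))
        .count()
}

/// Character-based word-count proxy: Unicode scalar values divided by 5.
/// Unlike a whitespace split, this does not collapse space-less scripts
/// into a single "word".
fn approx_word_count(text: &str) -> usize {
    text.chars().count() / 5
}
```

For a two-sentence Japanese string with no spaces, a whitespace split sees one word, whereas the proxy scales with text length, so a density check based on it has something meaningful to compare against.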
Added
- `Segment::CodeSnippet.metadata.languages` now populates from the `<code class="language-X">` / `lang-X` class attribute, matching the long-standing README + enum-doc claim. Handles Pandoc / GFM / Prism / highlight.js conventions. Language identifier is lowercased.
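The class-attribute convention described above can be sketched roughly like this. The helper `language_from_class` is hypothetical and operates on an already-extracted `class` attribute value; it covers only the `language-X` and `lang-X` prefixes and assumes the prefixes themselves are lowercase.

```rust
/// Hypothetical helper (not deformat's API): scans the whitespace-separated
/// tokens of a `class` attribute value for `language-X` or `lang-X` and
/// returns `X` lowercased. Returns `None` when no such token is present.
fn language_from_class(class_attr: &str) -> Option<String> {
    class_attr
        .split_whitespace()
        .find_map(|token| {
            token
                .strip_prefix("language-")
                .or_else(|| token.strip_prefix("lang-"))
                // Skip tokens that are only a bare prefix, e.g. "language-".
                .filter(|lang| !lang.is_empty())
        })
        .map(str::to_lowercase)
}
```

With this sketch, a value like `hljs language-Rust` yields `rust`, and a class list without either prefix yields `None`.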
Tests and fixtures
- New `tests/fixtures/synthetic/` (~10 KB) with committed minimal-valid DOCX / XLSX / PPTX / EPUB / RTF authored here and dual-licensed MIT-OR-Apache-2.0. Generated deterministically by `scripts/generate_synthetic_fixtures.py` (Python stdlib only).
- New `tests/fixtures/adversarial/` with minimized regression repros for the three malformed-HTML recovery paths plus a 6-language multilingual article fixture.
- `tests/real_formats.rs` rewritten against committed fixtures. The 8 `#[ignore]`-gated tests that previously only ran via `scripts/fetch_fixtures.sh` are now 14 non-ignored tests that run in CI.
- `tests/fixtures/PROVENANCE.md` documents per-file origin, license, and the commit-vs-fetch decision rubric.
- New proptest randomizes over 7 non-ASCII sentence terminators to guard the filter against future ASCII-centric regressions.
- Total tests: 579 → 603.
Measured impact (WCXB dev split, 1,495 pages, triple-filter pipeline)
| Metric | 0.13.0 | 0.13.1 | Δ |
|---|---|---|---|
| Overall F1 | 0.767 | 0.774 | +0.7pp |
| Article F1 | 0.876 | 0.880 | +0.4pp |
| Documentation F1 | 0.885 | 0.906 | +2.1pp |
| Product F1 | 0.485 | 0.500 | +1.5pp |
| Service F1 | 0.772 | 0.790 | +1.8pp |
The multilingual fix is primarily a quality-of-output win; WCXB is English-heavy, so the F1 shift is within measurement noise, but users feeding non-English content now get correct filter behavior.
Compatibility
- MSRV unchanged (1.80.0).
- Published size: 182 KB compressed (vs. 181 KB for 0.13.0).
Full changelog: https://github.com/arclabs561/deformat/blob/main/CHANGELOG.md