dExtract

Pure-Rust text extraction for common non-PDF local file formats and archives.

Author: Andrew Yates andrewyates.name@gmail.com
Version: 0.3.0
License: Apache-2.0

dExtract mechanically extracts structured text from common non-PDF local files without shelling out to system tools or external services. The facade currently registers built-in extractors for plain text, CSV/TSV, legacy Word/Excel/PowerPoint (doc, xls, ppt), legacy iWork XML packages (pages, numbers, key), Excel binary workbooks (xlsb), Office Open XML (docx, xlsx, pptx, plus accepted macro/template/slideshow aliases), OpenDocument (odt/fodt, ods/fods, odp/fodp, odg/otg/fodg, plus accepted template aliases), XPS/OXPS (xps, oxps), Visio drawing/XML/package/binary variants (vsdx, vdx, vsx, vtx, vsdm, vssx, vssm, vstx, vstm, vsd, vss, vst), CHM topic text (chm), bounded OneNote visible-text recovery (one), PostScript/EPS lexical text recovery (ps, eps), bounded DjVu TXTa text-layer extraction (djvu, djv), WordPerfect WPC document-area text (wpd), HTML/MHTML, EPUB, bounded MOBI/AZW/AZW3 UTF-8 text records, RTF, RFC 5322 / Outlook mail (eml, msg), and ZIP/TAR/limited 7z inputs, including gzip-compressed tar (.tar.gz, .tgz). Several extractors are deliberately limited: modern iWork IWA and directory packages, OneNote package/rich-object parsing, CHM rendering/scripts/resources, legacy Visio record/page semantics, macros, embedded payload recursion, OCR, and layout rendering remain out of scope. 
TeX/LaTeX sources, mailbox stores such as mbox/Maildir, WordStar documents, AbiWord documents, Microsoft Works files, Microsoft Access databases, SQLite database files, Microsoft Project schedule/interchange files, Outlook PST/OST mailbox stores, IBM Notes/Domino NSF/NTF databases, Microsoft InfoPath XSN form packages, OpenDocument ancillary variants, SPSS/SAS/Stata statistical data files, WARC/WACZ/Safari webarchive captures, FileMaker Pro databases, Kindle KFX/Topaz/AZW4 files, QuarkXPress documents, Scribus SLA page-layout files, Adobe PageMaker layout files, StarOffice/OpenOffice legacy files, Quattro Pro spreadsheets, FrameMaker MIF/FM documents, InDesign IDML packages, DWF/DWFx fixed-layout drawing packages, HP PCL/PCLXL print streams, and OLE-gated Publisher .pub inputs remain unsupported boundaries. PDF parsing and OCR belong in dpdf.

Installation

Until the crates are published to crates.io, install the CLI from the public Git repository:

cargo install --git https://github.com/andrewdyates/dextract dextract-cli

Use the library from the public Git branch:

[dependencies]
dextract = { git = "https://github.com/andrewdyates/dextract" }

From a checkout of this repo, install the CLI without publishing:

cargo install --path dextract-cli

Minimum Supported Rust Version

dExtract's workspace MSRV is Rust 1.88, declared in the root Cargo.toml. This follows the current dependency graph and may increase when dependencies require a newer compiler.

Local Supply-Chain Checks

The repository carries a local cargo-deny policy in deny.toml for RustSec advisories, yanked crates, license allow-listing, duplicate-version visibility, wildcard dependency bans, and crate source restrictions. Run it from a checkout with:

bash scripts/check-supply-chain.sh

If the script reports that cargo-deny is missing or mismatched, install the pinned CI version locally with cargo +stable install cargo-deny --version 0.19.4 --locked. Override the pin with CARGO_DENY_VERSION=<version> only when updating CI. Use bash scripts/check-supply-chain.sh --offline only after Cargo indexes and the RustSec advisory database are cached.

Local API Compatibility Checks

Public Rust API compatibility is checked with cargo-semver-checks for workspace library crates only:

bash scripts/check-api-compat.sh

If the script reports that cargo-semver-checks is missing or mismatched, install the pinned CI version with cargo install cargo-semver-checks --version 0.42.0 --locked. Override the pin with CARGO_SEMVER_CHECKS_VERSION=<version> only when updating CI. The check is dev-only and uses crates.io as the baseline source, so it skips crates that have not been published yet; that is expected before the first crates.io release.

Local GitHub Workflow Checks

GitHub Actions workflow validation is available as a lightweight local check:

bash scripts/check-github-workflows.sh

Install the pinned CI version of actionlint with go install github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 for full workflow syntax and expression validation. Without actionlint, the script still runs the repository-specific structural checks that keep release-critical CI gates wired.

Quick Start

CLI usage after cargo install --path dextract-cli:

dextract legacy.doc
dextract mail.eml
dextract report.docx
dextract --json workbook.ods
dextract --json report.xlsx
dextract --output extracted.txt notes.rtf
dextract --process-isolation legacy.doc
dextract --max-input-bytes 10485760 report.docx
dextract batch ./documents --budget 100000
dextract batch ./documents --max-input-bytes 10485760
dextract batch ./documents --recursive --max-files 5000
dextract batch ./documents --json
dextract formats
dextract formats --all --json
dextract completions zsh > _dextract
dextract manpage > dextract.1

From a fresh checkout, run the same commands with cargo run -p dextract-cli -- .... The CLI can extract one or more files, write sibling *.extracted.txt files for a directory, print the extractor-backed format matrix, and print the broader detected-format matrix with formats --all.

CLI output semantics:

Plain extraction writes extracted page text to stdout, or to --output when provided for a single input. Multiple plain-text inputs written to stdout are separated by one newline; multiple pages inside one input are separated by a form-feed byte.
--json writes one JSON object for one input file and one JSON array for multiple input files. That object-vs-array shape is the documented CLI contract.
JSON uses canonical lowercase format names such as docx, odt, xps, csv, eml, mhtml, and text.
--process-isolation runs each root-command extraction in a child worker process with the same output and exit-code contract. It can also be enabled with DEXTRACT_PROCESS_ISOLATION=1. This is not a sandbox: the parent still reads each input file into memory, the worker has a 30-second timeout and 64 MiB JSON response cap, and callers still need OS/container limits for hostile inputs.
--format accepts extractor-backed formats plus aliases such as html / htm, mhtml / mht, text / txt / plain, legacy ppt, pages, numbers, key / keynote, onenote / one / onetoc2 / onepkg, chm, visio-binary / vsd / vss / vst, and accepted OOXML/OpenDocument family variants (docm, dotx, dotm, xlsb, xlsm, xltx, xltm, pptm, potx, potm, ppsx, ppsm, sldx, sldm, ott, fodt, ots, fods, otp, fodp, odg, otg, fodg, xps, oxps, vsdx, visio-package, vsdm, vssx, vssm, vstx, vstm, visio-xml, vdx, vsx, vtx, postscript, ps, eps, djvu, djv, wordperfect, and wpd). It does not accept detected-but-unsupported formats such as wordstar, ws, ws1, ws2, ws3, ws4, ws5, ws6, ws7, ws8, wsd, wsm, wst, wsb, wsx, publisher, pub, access, mdb, accdb, mde, accde, sqlite, sqlite3, sqlite-database, db, db3, sqlite2, sdb, sqlite-wal, sqlite-shm, sqlite3-wal, sqlite3-shm, db-wal, db-shm, opendocument-ancillary, odm, odb, odf, odc, dbf, dbase, dif, sylk, slk, project, mpp, mpt, mpx, outlook-pst-ost, pst, ost, ibm-notes-domino, notes, domino, nsf, ntf, infopath-xsn, infopath, xsn, xsf, warc-wacz-webarchive, web-archive, warc, warc-gz, warc.gz, wacz, webarchive, safari-webarchive, filemaker, filemaker-pro, fmp12, fp7, fp5, fp3, fmp, fmpur, usr, fmpsl, pagemaker, adobe-pagemaker, pmd, p65, pm6, pm5, pm4, pmt, t65, dwf-dwfx, dwf, dwfx, pcl-pclxl, hp-pcl, hp-pclxl, pcl, pclxl, pxl, prn, spl, staroffice-openoffice-legacy, staroffice-openoffice, staroffice, openoffice-legacy, openoffice, sxw, stw, sxc, stc, sxi, sti, sxd, std, sxm, sxg, sdw, sdc, sdd, sda, tex-latex, tex, latex, ltx, mobi, azw, or azw3.
Auto-detection does not fall back to plain text for arbitrary unknown bytes. Plain text is selected only for known text-like extensions such as .txt, .md, .markdown, and .json, or when explicitly forced with --format text. Standalone .xml is not auto-detected as structured XML; use --format text to extract it as raw text.
Unknown and detected-but-unsupported formats fail root-command extraction with UnsupportedFormat and exit code 1. batch skips them; batch --json reports status: "skipped", the detected format, and guidance.
--budget BYTES is a best-effort extracted-text byte guard in the current release. Text-like, DOC, DOCX, XLS, XLSB, ODT, and modern package extractors generally return partial output with truncated = true; MSG follows the same caller-budget truncation contract while preserving hard errors for internal parser safety caps. 0 disables only the caller-specified text limit, not internal parser, package, or archive safety limits.
--max-input-bytes BYTES checks each input file's metadata length before reading it into memory; 0 disables this pre-read input cap. Over-limit root extraction exits with code 1 and no stdout. In batch, over-limit files are recorded as failed, JSON reports use stage: "input_limit", and remaining files continue processing.
batch --max-files FILES bounds directory traversal before extraction starts so large trees cannot accumulate unbounded path lists in memory. The default cap is 10000 scanned filesystem files; 0 disables this traversal cap. Over-limit batch traversal exits with code 1, emits no JSON report, and does not write partial extraction outputs.
formats lists extractor-backed formats. formats --all also lists detected-but-unsupported formats with status and guidance.
completions <shell> writes a shell completion script to stdout for Bash, Elvish, Fish, PowerShell, or Zsh.
manpage writes a roff manpage to stdout for packaging or local installation.
batch scans only direct children of the target directory by default. Use batch --recursive to opt into nested filesystem directory traversal. It writes sibling *.extracted.txt files, skips generated *.extracted.txt outputs on rerun, reports progress on stderr, writes no document text to stdout, skips unsupported detections, and exits successfully when every detected input is unsupported.
batch --json preserves the same extraction side effects and exit-code policy, but writes a machine-readable report to stdout with summary counters and per-file extracted, skipped, or failed rows.
CLI memory policy: each input file is read into memory before extraction, including when root extraction uses --process-isolation. batch and root-command multi-file extraction process one input file at a time. Root multi-file output and batch --json per-file report rows are spooled to temporary files before the final stdout or --output write, so later failures do not emit partial stdout and the CLI does not retain every extracted document or report row in memory. --budget limits extracted text only; --max-input-bytes is the CLI pre-read input file size cap; JSON modes change reporting shape and output buffering, not input buffering or parser memory. Use OS/process/container limits for large or untrusted inputs; process isolation adds a child-process boundary, timeout, and output cap, but does not replace those limits.
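
The pre-read input cap described above can be approximated in caller code that uses the library directly. The following is a minimal sketch using only the standard library; within_input_cap and read_with_cap are hypothetical helper names, not dExtract APIs, and treating a file exactly at the cap as accepted is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper, not part of dExtract: decides whether a file of
/// `len` bytes passes a pre-read cap. A cap of 0 disables the check.
/// Assumption: a file exactly at the cap is still accepted.
fn within_input_cap(len: u64, max_input_bytes: u64) -> bool {
    max_input_bytes == 0 || len <= max_input_bytes
}

/// Checks the metadata length first, so an oversized file is rejected
/// before any of its contents are read into memory.
fn read_with_cap(path: &Path, max_input_bytes: u64) -> io::Result<Vec<u8>> {
    let len = fs::metadata(path)?.len();
    if !within_input_cap(len, max_input_bytes) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!(
                "{} is {len} bytes, over the {max_input_bytes}-byte cap",
                path.display()
            ),
        ));
    }
    fs::read(path)
}
```

A metadata check like this bounds only the bytes read from disk. It says nothing about parser memory, which is why OS or container limits are still recommended for untrusted inputs.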

Exit-code policy:

0: success, including a batch run where all detected inputs are unsupported.
1: runtime, extraction, input, output, or filesystem failure.
2: command-line usage error reported by clap.
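
A caller that shells out to the CLI can map these exit codes onto its own result type. This is a rough sketch under the assumption that a dextract binary is on PATH; Outcome, classify_exit, and run_dextract are illustrative names, not part of dExtract.

```rust
use std::io;
use std::process::Command;

/// Illustrative caller-side view of the documented exit-code policy.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// Exit code 0, including a batch run where every input was unsupported.
    Success,
    /// Exit code 1: runtime, extraction, input, output, or filesystem failure.
    Failure,
    /// Exit code 2: command-line usage error.
    UsageError,
    /// Any other code, or no code at all (for example, killed by a signal).
    Other(Option<i32>),
}

fn classify_exit(code: Option<i32>) -> Outcome {
    match code {
        Some(0) => Outcome::Success,
        Some(1) => Outcome::Failure,
        Some(2) => Outcome::UsageError,
        other => Outcome::Other(other),
    }
}

/// Runs the CLI with the given arguments and returns the classified
/// outcome together with captured stdout.
fn run_dextract(args: &[&str]) -> io::Result<(Outcome, Vec<u8>)> {
    let output = Command::new("dextract").args(args).output()?;
    Ok((classify_exit(output.status.code()), output.stdout))
}
```

For example, run_dextract(&["--json", "report.docx"]) would return Outcome::Failure for an unsupported root-command input, since that case exits with code 1.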

Library usage:

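use dextract::DExtract;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The facade extracts from an in-memory byte slice, so read the file first.
    let data = std::fs::read("report.docx")?;
    let dx = DExtract::new();
    let output = dx.extract(&data, "report.docx", 0)?;

    // Emitted pages carry a 1-indexed page number and the extracted text.
    for page in &output.pages {
        println!("Page {}: {}", page.page_number, page.text);
    }
    Ok(())
}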

If detection returns DocumentFormat::Unknown, DExtract::extract() returns ExtractionError::UnsupportedFormat; it does not treat unknown input as plain text unless the filename has a supported text-like extension. Use extract_as() when intentionally forcing plain text or another extractor-backed format.

WordStar inputs (ws, ws1, ws2, ws3, ws4, ws5, ws6, ws7, ws8, and wsd) are detected-but-unsupported and do not fall through to plain text, RTF, OLE, ZIP/TAR/7z archive, HTML, or XML extraction. Adjacent macro/template/backup/index candidates (wsm, wst, wsb, and wsx) remain inventory-only. Text extraction, metadata extraction, formatting interpretation, dot-command interpretation, codepage decoding, print-control interpretation, macro execution, embedded-object extraction, native/converter dependency support, and extractor-backed support remain outside this boundary.

AppleWorks/ClarisWorks inputs (cwk) are detected-but-unsupported and do not fall through to iWork/Pages, ZIP/TAR/7z archive, OLE, HTML, XML, RTF, or plain-text extraction. Adjacent templates/stationery (cws, cwt), MacBinary/AppleDouble resource-fork material, text/metadata/cell extraction, graphics/formula parsing, converter/native dependency support, and extractor-backed support remain outside this boundary.

HWPX packages (hwpx) are detected-but-unsupported and do not fall through to ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or legacy HWP handling. Legacy HWP (hwp) remains evaluation-only because current fixtures are generated FileHeader sentinel bytes, not valid CFB documents.

StarOffice/OpenOffice legacy inputs (sxw, stw, sxc, stc, sxi, sti, sxd, std, sxm, sxg, sdw, sdc, sdd, sda) are detected-but-unsupported and do not fall through to supported ODT/ODS/ODP/ODG, generic ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text extraction. This boundary does not add OpenDocument aliasing, text/cell/metadata extraction, macro/script or formula execution, converter/native dependency support, or extractor-backed support.

Comic-book archives (cbz, cbt, cb7, cbr) are detected-but-unsupported and do not fall through to generic ZIP/TAR/7z archive, PDF, EPUB, OLE, HTML/XML, RTF, plain-text, image/OCR, or RAR handling. Archive listing metadata, ComicInfo.xml metadata, text sidecars, page images, recursive dispatch, and extractor-backed support remain outside this boundary.

Metadata Semantics

Metadata is best-effort and format-dependent in the current release. ExtractionOutput.metadata carries common document-level fields when an extractor can recover them: title, author, subject, creator, producer, keywords, creation date, modification date, and page_count.

ExtractedPage.metadata is page/section-scoped. It may carry extractor-specific page, sheet, slide, archive-entry, mail-part, or HTML page fields such as sheet_name, row_count, slide_number, path, size, content_type, description, or language. These keys are local to the extracted page or section and are not normalized into ExtractionOutput.metadata.

Emitted ExtractedPage values use 1-indexed page_number values in output order. source_format identifies the extractor family that produced the page, and byte_count is the byte length of the emitted text for that page. The cells vector is populated only by extractors with structured table or spreadsheet output; fixed-layout page extraction such as XPS/OXPS emits page text and page metadata but no cells.
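
The numbering and byte-count rules above can be illustrated with a stand-in type. PageSketch and number_pages are hypothetical and only mirror the documented semantics; the real ExtractedPage field types are not stated here, so usize is an assumption.

```rust
/// Hypothetical stand-in for an emitted page; not the real ExtractedPage.
#[derive(Debug, PartialEq, Eq)]
struct PageSketch {
    /// 1-indexed position in output order.
    page_number: usize,
    /// Extracted text for this page or section.
    text: String,
    /// Byte length of `text`, which can exceed its character count.
    byte_count: usize,
}

/// Assigns 1-indexed page numbers in output order and records the UTF-8
/// byte length of each page's text.
fn number_pages(texts: Vec<String>) -> Vec<PageSketch> {
    texts
        .into_iter()
        .enumerate()
        .map(|(index, text)| PageSketch {
            page_number: index + 1,
            byte_count: text.len(),
            text,
        })
        .collect()
}
```

Because byte_count measures bytes, a page containing "é" counts 2 even though it holds a single character.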

Some extractors intentionally promote source fields into document metadata. For example, HTML <title> becomes document-level metadata.title, HTML author/generator meta tags become metadata.author and metadata.producer, and OpenOffice-style HTML created/changed/modified meta dates in YYYYMMDD;HHMMSS form are normalized. page_count is the extractor-reported logical/source count when available, not a normalized cross-format guarantee. Text-like single-section formats such as CSV, TSV, HTML, MHTML, RTF, EML, and MSG currently report emitted sections. Office/OpenDocument/EPUB formats may report source-declared pages, sheets, slides, or spine items, which can differ from the number of emitted ExtractedPage sections; EPUB page_count is based on manifest-backed spine items after OPF logical caps are applied. Generic archives currently leave page_count = 0 even when archive entries are emitted as page-like sections; plain text also reports 0 because it has no source page-count metadata.
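
As an illustration of the YYYYMMDD;HHMMSS normalization mentioned above, the sketch below parses that form and emits an ISO 8601-style timestamp. The output format is an assumption, since the normalized form is not specified here, and normalize_meta_date is a hypothetical name rather than a dExtract function.

```rust
/// Hypothetical sketch: converts "YYYYMMDD;HHMMSS" into
/// "YYYY-MM-DDTHH:MM:SS". Returns None for anything that does not match
/// the expected shape or has out-of-range fields.
fn normalize_meta_date(raw: &str) -> Option<String> {
    let (date, time) = raw.trim().split_once(';')?;
    if date.len() != 8 || time.len() != 6 {
        return None;
    }
    // Require ASCII digits before slicing so byte offsets are safe.
    if !date.bytes().chain(time.bytes()).all(|b| b.is_ascii_digit()) {
        return None;
    }
    let num = |s: &str| s.parse::<u32>().ok();
    let (year, month, day) = (num(&date[0..4])?, num(&date[4..6])?, num(&date[6..8])?);
    let (hour, minute, second) = (num(&time[0..2])?, num(&time[2..4])?, num(&time[4..6])?);
    // Coarse range checks only; days-per-month are not validated.
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }
    Some(format!(
        "{year:04}-{month:02}-{day:02}T{hour:02}:{minute:02}:{second:02}"
    ))
}
```

Returning None for malformed input matches the best-effort framing: a date that cannot be parsed is simply left out instead of failing the whole extraction.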

The DExtract facade catches unwinding panics from registered extractors and returns ExtractionError::ExtractorPanicked. It does not install or suppress the process panic hook, so caught panics may still print diagnostics to stderr. Aborts, out-of-memory conditions, stack overflows, and direct extractor calls remain caller/process-level concerns. The CLI's optional --process-isolation mode adds a root-command child-process boundary around extraction; it is separate from the facade panic boundary and still is not a sandbox.
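
A panic boundary of this kind can be sketched with std::panic::catch_unwind. The code below is not the facade's implementation; SketchError and run_contained are hypothetical names that only show the general pattern of turning an unwinding panic into an error value.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type standing in for the facade's panic variant.
#[derive(Debug, PartialEq, Eq)]
enum SketchError {
    ExtractorPanicked(String),
}

/// Runs `extractor` and converts an unwinding panic into an error.
/// The process panic hook is left untouched, so the default hook may
/// still print the panic message to stderr before this returns.
fn run_contained<F>(extractor: F) -> Result<String, SketchError>
where
    F: FnOnce() -> String,
{
    match panic::catch_unwind(AssertUnwindSafe(extractor)) {
        Ok(text) => Ok(text),
        Err(payload) => {
            // Panic payloads are usually &str or String; anything else
            // gets a generic description.
            let message = if let Some(s) = payload.downcast_ref::<&str>() {
                (*s).to_string()
            } else if let Some(s) = payload.downcast_ref::<String>() {
                s.clone()
            } else {
                "non-string panic payload".to_string()
            };
            Err(SketchError::ExtractorPanicked(message))
        }
    }
}
```

catch_unwind only intercepts unwinding panics. Builds with panic = "abort", out-of-memory aborts, and stack overflows bypass it, which is consistent with those cases remaining process-level concerns.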

Supported Formats

Format	Detected	Extractor-backed	Fixture-backed	Compatibility-backed	Scope / notes
Plain text	Yes	Yes	Yes	Yes	UTF-8 lossy passthrough for local text files; `page_count` remains `0` because no source count is reported, with pinned CC0 compatibility sample.
CSV	Yes	Yes	Yes	Yes	Stable tab-separated text plus structured cells; reports one emitted section as `page_count = 1`, with pinned CC0 compatibility sample.
TSV	Yes	Yes	Yes	Yes	Tab-delimited input rendered as stable tab-separated text plus structured cells; reports one emitted section as `page_count = 1`, with pinned MIT compatibility sample.
TeX/LaTeX sources	Yes	No	Unsupported corpus + preflight fixtures	No	#158 registers `.tex`, `.ltx`, and `.latex` as detected-unsupported boundaries. Generated corpus fixtures cover plain, LaTeX, and archive/OLE/HTML/XML/RTF/plain-looking no-fallback inputs; preflight evidence records bounded command, brace, environment, blocked include/resource/execution directive, and resource-cap observations. No TeX execution, macro expansion, include traversal, graphics/bibliography loading, rendering, converter/native dependency, or extraction support is claimed.
DOC	Yes	Yes	Yes	Yes	Legacy OLE Word extraction with best-effort document metadata and source page-count metadata when available.
DOCX	Yes	Yes	Yes	Yes	Document body plus header/footer text, document metadata, and source page-count metadata when available; accepted `docm`, `dotx`, and `dotm` aliases route here.
ODT	Yes	Yes	Yes	Yes	OpenDocument text extraction with document metadata and source page-count metadata when available; accepted `ott` and flat XML `.fodt` inputs route here.
XLS	Yes	Yes	Yes	Yes	Legacy BIFF workbook extraction with sheet pages, structured cells, and sheet-count `page_count`.
XLSX	Yes	Yes	Yes	Yes	Shared strings and worksheet XML become structured pages and cells with sheet-count `page_count`; accepted `xlsm`, `xltx`, and `xltm` aliases route here.
ODS	Yes	Yes	Yes	Yes	OpenDocument spreadsheet extraction with sheet pages, structured cells, and source sheet-count `page_count`; accepted `ots` and flat XML `.fods` inputs route here.
PPT	Yes	Yes	Yes	Yes	Supported with limitations: bounded mechanical legacy PowerPoint text and metadata extraction from synthetic fixtures plus one pinned CC0 real-world sample; no rendering, macros, media extraction, embedded object recursion, or decryption.
PPTX	Yes	Yes	Yes	Yes	Slide text, speaker notes, presentation metadata, and slide-count `page_count`; accepted `pptm`, `potx`, `potm`, `ppsx`, `ppsm`, `sldx`, and `sldm` aliases route here.
ODP	Yes	Yes	Yes	Yes	OpenDocument presentation extraction with slide/page-count `page_count`; accepted `otp` and flat `.fodp` inputs route here.
ODG	Yes	Yes	Yes	Yes	Bounded OpenDocument drawing package/template/flat XML extraction for `.odg`, `.otg`, and `.fodg` visible drawing text, with a pinned Apache-2.0 packaged ODG compatibility sample.
HTML	Yes	Yes	Yes	Yes	Visible text plus document-level title, author, producer, normalized created/modified metadata, and `page_count = 1`; page metadata keeps description and language, with bounded nested-table cell capture and a pinned CC0 compatibility sample.
EPUB	Yes	Yes	Yes	Yes	XHTML chapters extracted in spine order with capped OPF metadata, manifest, and spine parsing plus spine-item `page_count`, with pinned W3C EPUB compatibility sample.
RTF	Yes	Yes	Yes	Yes	Body text plus `\info` metadata; reports one emitted section as `page_count = 1`, with pinned CC0 compatibility sample.
RTFD packages	No	No	Preflight inventory fixtures	No	Evaluation-only #170 `testdata/rtfd-preflight` fixtures record directory-package shape, `TXT.rtf` root presence, attachment/member inventory, nested `.rtfd` deferral, resource-fork sidecar deferral, member-name traversal examples, and resource caps. No public `.rtfd` detection, detected-unsupported no-fallback behavior, RTFD text extraction, attachment extraction, recursive traversal, image/OCR extraction, converter/native dependency, or extractor-backed support is claimed.
EML	Yes	Yes	Yes	Yes	RFC 5322 mail extraction with decoded headers, body selection, MIME traversal caps, and truncation for clipped internals; reports selected body sections as `page_count`, while ordinary attachment payloads are not extracted.
MSG	Yes	Yes	Yes	Yes	Outlook `.msg` extraction with message metadata and text bodies; reports emitted message body sections as `page_count`, while ordinary attachment payloads are not extracted.
Outlook PST/OST	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #181; generated `.pst`/`.ost` corpus fixtures cover Outlook store classification, guidance, and no MSG, EML, mbox/Maildir, ZIP/TAR/7z archive, OLE, HTML, RTF, or plain-text fallback, while `testdata/outlook-pst-ost-preflight` records synthetic NDB header/version sentinels, malformed/root/node/block-count cases, encrypted/protected markers, oversized store metadata, and fallback probes. No message extraction, attachment extraction, folder traversal, decryption/password handling, store repair, search-index interpretation, account/client access, network access, native/libpff dependency, or extractor-backed support is claimed.
IBM Notes/Domino NSF/NTF	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #185; generated `.nsf`/`.ntf` corpus fixtures cover Notes/Domino classification, guidance, and no MSG, EML, mbox/Maildir, Outlook PST/OST, Access, DBF/DIF/SYLK, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while `testdata/notes-domino-preflight` records synthetic header observations, malformed short headers, encrypted/ACL-looking markers, resource caps, fallback probes, and inventory-only `.box`/`.ndl`/`.id` adjacent artifacts. No message extraction, attachment extraction, folder/view/form traversal, ACL/security interpretation, replication handling, decryption, Domino API/native dependency, database repair, embedded-payload recursion, adjacent artifact public classification, or extractor-backed support is claimed.
Microsoft InfoPath XSN	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #186; generated `.xsn` corpus fixtures cover InfoPath package classification, guidance, and no OOXML/ODF, ZIP/TAR/7z archive, generic XML, HTML, RTF, or plain-text fallback, while `testdata/infopath-preflight` records XSN package structure, `manifest.xsf`-style metadata, InfoPath XML namespace samples, malformed package/XML cases, external data-connection markers, traversal-shaped member names, resource caps, and inventory-only `.xsf`/standalone XML candidates. No form rendering, schema validation, script execution, data-connection traversal, attachment extraction, SharePoint integration, native/converter dependency, standalone generic XML public classification, or extractor-backed support is claimed.
WARC/WACZ/Safari webarchive	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #187; generated `.warc`, `.warc.gz`, `.wacz`, and `.webarchive` corpus fixtures cover web-archive classification, guidance, and no MHTML, HTML, JSON/plain text, generic XML, RTF, ZIP/TAR/7z archive, gzip TAR, or Safari plist/plain fallback, while `testdata/web-archive-preflight` records WARC/WARC.GZ records, WACZ package/index shape, Safari webarchive markers, malformed cases, traversal member names, external-resource markers, resource caps, and inventory-only `.arc`/`.arc.gz`/`.cdx`/`.cdxj`/`.har` candidates. No WARC record extraction, WACZ package extraction, Safari webarchive parsing, HTTP payload extraction, page rendering, resource reconstruction, CDX index lookup, external-resource traversal, native/converter dependency, inventory-only extension public classification, or extractor-backed support is claimed.
FileMaker Pro	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #189; generated `.fmp12` and `.fp7` corpus fixtures cover FileMaker classification, guidance, and no Access, DBF/DIF/SYLK, SQLite, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while `testdata/filemaker-preflight` records synthetic header/version observations, malformed short/bad headers, encryption/container/external-container markers, resource caps, fallback probes, and inventory-only `.fp5`/`.fp3`/`.fmp`/`.fmpur`/`.usr`/`.fmpsl` candidates. No table extraction, layout/form/report rendering, script execution, calculation evaluation, container-field extraction, external-container traversal, account/security interpretation, decryption, repair, FileMaker native dependency, inventory-only extension public classification, or extractor-backed support is claimed.
SPSS	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #190; generated `.sav`, `.zsav`, and `.por` corpus fixtures cover SPSS classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while `testdata/statistical-data-preflight` records header/version, compression/encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only `.sps`/`.spv`/`.spo` evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed.
SAS	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #190; generated `.sas7bdat` and `.xpt` corpus fixtures cover SAS classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while `testdata/statistical-data-preflight` records header/version, encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only `.sas`/`.sas7bcat`/`.sas7bvew`/`.sas7bndx` evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed.
Stata	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #190; generated `.dta` corpus fixtures cover Stata classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while `testdata/statistical-data-preflight` records header/version, encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only `.do`/`.ado`/`.smcl`/`.gph`/`.dct` evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed.
Mailbox stores	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #156; generated `.mbox` corpus fixtures cover mailbox-store classification, guidance, and no EML, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while `testdata/mailbox-preflight` records bounded mbox boundary/order/resource-cap observations plus CLI-only direct Maildir-shaped directory detection through `new`/`cur`/`tmp` children. No mailbox text extraction, message enumeration, attachment extraction, recursive store handling, Maildir message traversal, account/client access, network access, native/converter dependency, or extractor-backed support is claimed.
Pages	Yes	Limited	Yes	Yes	Legacy XML Pages packages with root `index.xml.gz` or `index.xml` are extractor-backed for body paragraph text; modern IWA and directory packages remain unsupported subformats.
Numbers	Yes	Limited	Yes	Yes	Legacy XML Numbers packages with root `index.xml.gz` or `index.xml` are extractor-backed for table/cell text; modern IWA and directory packages remain unsupported subformats.
Keynote	Yes	Limited	Yes	Yes	Legacy XML Keynote packages with root `index.apxl.gz` or `index.apxl` are extractor-backed for slide paragraph/table text; modern IWA and directory packages remain unsupported subformats.
MHTML	Yes	Yes	Yes	Yes	MHTML web archive inputs (`.mht`, `.mhtml`) extract the selected root HTML or plain text part, with pinned Apache-2.0 compatibility evidence; linked resources are skipped rather than reconstructed or recursively extracted.
XLSB	Yes	Yes	Yes	Yes	Excel binary workbook packages (`.xlsb`) are supported through #96 with calamine-backed BIFF12 worksheet/cell extraction, ZIP/package preflight caps, panic containment, and pinned Apache-2.0 compatibility evidence; formulas are not executed, macros and embedded payloads are ignored, and external links are not followed.
XPS/OXPS	Yes	Yes	Yes	Yes	Bounded FixedDocument/FixedPage package extraction for `.xps` and `.oxps`; extracts `Glyphs` `UnicodeString` text and core package metadata, with pinned Apache-2.0 XPS/OpenXPS-content compatibility evidence and no rendering, OCR, font decoding, or glyph-index reconstruction.
VSDX	Yes	Yes	Yes	Yes	Bounded Visio drawing text extraction for `.vsdx`, with pinned MIT compatibility evidence and generated malformed relationship/XML/resource-cap fixtures; follows declared page relationships, extracts page `Text` elements, and fails closed on corrupt VSDX packages without rendering shapes, OCR, macro execution, or external links.
Legacy Visio binary	Yes	Limited	Yes	Yes	Bounded visible-text recovery from valid Visio CFB `.vsd`, `.vss`, and `.vst` inputs is extractor-backed with pinned Apache POI evidence. Real MS-VSD record validation, page/shape ordering, metadata extraction, rendering, macro/VBA parsing or execution, embedded-object extraction, external-link traversal, and native/converter dependency remain unsupported.
Visio XML/package variants	Yes	Limited	Yes	No	Bounded extractor-backed support for #153 covers legacy Visio XML `.vdx`, `.vsx`, and `.vtx` plus modern package `.vsdm`, `.vssx`, `.vssm`, `.vstx`, and `.vstm` fixtures generated in `testdata/corpus/basic`; extraction reads page `Text` elements, ignores macro/VBA parts, and fails closed on malformed XML, corrupt packages, missing relationships, oversized parts, external targets, and non-Visio packages without VSDX, ZIP/archive, XML/plain text, OLE, HTML, or legacy Visio binary fallback. `testdata/visio-package-preflight` records package spine and malformed-boundary evidence; metadata extraction, rendering, conversion, macro execution, embedded-payload recursion, external-link traversal, and broad compatibility are out of scope.
Publisher	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #126/#149; committed generated `testdata/corpus/basic/minimal_publisher.pub` fixture covers OLE-magic `.pub`/Publisher classification and unsupported no-fallback behavior, while `testdata/publisher-preflight` records synthetic OLE-header stream/version/text-candidate/metadata-candidate inventory, malformed-header, missing-stream, encrypted-marker, external data-source, embedded-object, and resource-cap parser-readiness evidence only. Publisher inputs do not route to DOC, XLS, PPT, MSG, generic OLE, archive, HTML, or plain-text extraction, and no Publisher text extraction, metadata extraction, rendering, conversion, mail-merge/data-source access, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, real Publisher compatibility, or Publisher stream/version validation is supported.
Adobe InDesign IDML	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #169; committed generated `testdata/corpus/basic/minimal_indesign.idml`, `rtf_looking_indesign.idml`, `html_looking_indesign.idml`, `ole_looking_indesign.idml`, and `plain_looking_indesign.idml` fixtures cover extension-gated `.idml` classification and unsupported no-fallback behavior, while `testdata/indesign-preflight` records IDML ZIP package shape, `designmap.xml` presence, story XML inventory, malformed XML, traversal-shaped member paths, external-resource markers, embedded asset deferral, and package resource caps. IDML inputs do not route to ZIP/archive, RTF, OLE, HTML, or plain-text extraction, and no IDML text extraction, metadata extraction, rendering/layout fidelity, image extraction/OCR, script/plugin execution, external-resource loading, recursive embedded-payload extraction, converter/native dependency, or extractor-backed support is claimed.
Adobe InDesign INDD	No	No	Preflight inventory fixtures	No	Evaluation-only #169 binary `.indd` feasibility lane in `testdata/indesign-preflight` records binary feasibility and ZIP-looking `.indd` fallback-risk inventory only. No public `.indd` detection, detected-unsupported no-fallback behavior, binary INDD parsing, rendering/layout fidelity, image extraction/OCR, script/plugin execution, external-resource loading, recursive embedded-payload extraction, converter/native dependency, or extractor-backed support is claimed.
Adobe FrameMaker MIF/FM	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #171; committed generated `testdata/corpus/basic/minimal_framemaker.mif`, `rtf_looking_framemaker.mif`, `zip_looking_framemaker.mif`, `html_looking_framemaker.mif`, `ole_looking_framemaker.fm`, and `plain_looking_framemaker.fm` fixtures cover extension-gated `.mif`/`.fm` classification and unsupported no-fallback behavior, while `testdata/framemaker-preflight` records MIF header/version markers, text-flow/paragraph inventory, escaped strings, blocked external-resource/include/cross-reference markers, binary payload rejection, fallback-looking signature probes, structure/file resource caps, and binary `.fm` feasibility rows. FrameMaker inputs do not route to RTF, ZIP/archive, OLE, HTML, or plain-text extraction, and no MIF text extraction, binary FM parsing, rendering/layout fidelity, imported graphic loading, external-resource traversal, converter/native dependency, or extractor-backed support is claimed.
CHM	Yes	Limited	Yes	Yes	Bounded CHM topic text extraction uses internal ITSF/ITSP/PMGL/PMGI/DataSpace readiness and LZX topic decoding, with pinned Apache Tika evidence. Rendering, scripts, external links, embedded payload recursion, broad compatibility, and generic archive/OLE/HTML/plain fallback remain unsupported.
OneNote	Yes	Limited	Yes	Yes	Bounded visible-text recovery for real OneStore revision-store files is extractor-backed with pinned Apache Tika evidence. `.onepkg` member extraction, object graph/rich-text semantics, handwriting/OCR, rendering, embedded payload recursion, and generic OLE/ZIP/plain fallback remain unsupported.
DjVu	Yes	Limited	Generated corpus fixtures + preflight fixtures	No	Bounded #151 extractor-backed slice for uncompressed `TXTa` text-layer bytes in `.djvu`/`.djv` IFF/FORM inputs; committed generated `testdata/corpus/basic/minimal_djvu.djvu` covers baseline extraction, while `testdata/djvu-preflight` and supporting `dextract-djvu-preflight` keep synthetic IFF/FORM container, page-form, directory, TXTa/TXTz, malformed-chunk, external-reference, and resource-cap coverage. DjVu inputs do not route to generic archive, OLE, HTML, or plain-text extraction; TXTz decompression, rendering, OCR, image extraction, metadata value extraction, external references, native dependencies, and converter shellouts remain unsupported.
PostScript/EPS	Yes	Limited	Generated corpus fixtures	No	Bounded lexical extraction for #150 after the #127 boundary; committed generated `testdata/corpus/basic/minimal_postscript.ps` and `testdata/corpus/basic/minimal_eps.eps` fixtures cover DSC metadata plus literal/hex string text recovery only. PostScript/EPS extraction does not execute PostScript, render, OCR, process EPS previews, load external resources, recover outlined text, interpret fonts, or use Ghostscript.
WordPerfect	Yes	Limited	Yes	Yes	Bounded WPC5/WPC6 `.wpd` document-area text extraction for #148, backed by generated `testdata/corpus/basic/minimal_wordperfect.wpd`, `testdata/wordperfect-preflight` prefix/malformed/adjacent fixtures, and pinned Apache-2.0 `testdata/realworld/apache-tika/testWordPerfect.wpd`. Extracts plain document-area text and WPC prefix metadata only; macros/templates (`.wpt`/`.wcm`), graphics/layout rendering, embedded objects, external resources, encrypted inputs, converter shellouts, and native dependencies remain unsupported.
WordPerfect adjacent/malformed boundaries	Yes	No	Parser-readiness preflight fixtures	No	Adjacent `.wpt`/`.wcm` fixtures remain non-public inventory only, and malformed, encrypted, unsupported-version/product/type, payload-cap, ZIP-looking, and OLE-looking `.wpd` inputs fail closed without generic archive, OLE, HTML, or plain-text fallback.
WordStar	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #194; generated `.ws`, `.ws7`, and `.wsd` corpus fixtures plus fallback-looking probes cover WordStar classification, guidance, and no plain text, RTF, OLE, ZIP/TAR/7z archive, HTML, or XML fallback, while `testdata/wordstar-preflight` records control-byte/header observations, dot-command/print-control markers, malformed/resource-cap cases, fallback probes, and inventory-only `.wsm`/`.wst`/`.wsb`/`.wsx` candidates. No text extraction, metadata extraction, formatting interpretation, dot-command interpretation, codepage decoding, print-control interpretation, macro execution, embedded-object extraction, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed.
AbiWord	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported close-out boundary for #157; committed generated fixtures cover extension-gated `.abw` classification and no-fallback behavior, while `testdata/abiword-preflight` and internal `dextract-abiword-preflight` record XML/gzip shape inventory, synthetic 1.x/2.x/3.x version/provenance candidates, structure-only section/paragraph/heading/list/table/cell counts, non-public `.zabw`/`.abw.gz` compressed candidates, malformed XML/entity/missing-version/unsupported-version/embedded-object/external-link/depth/element/attribute/text/metadata-key-limit cases, missing/corrupt gzip, gzip resource-cap boundaries, and metadata-key-name inventory only. AbiWord inputs do not route to generic XML, ZIP, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, extractor-backed XML parser readiness, real-world producer compatibility, compressed-variant public support, embedded-object recursion, external-link traversal, or converter/native dependency support is claimed.
Microsoft Works	Yes	No	Unsupported corpus + preflight feasibility fixtures	No	Detected unsupported boundary for #155; committed generated fixtures cover extension-gated `.wps`/`.wks`/`.wdb`/`.xlr` classification and no-fallback behavior, while `testdata/works-preflight` records synthetic Works-family sentinels, malformed/version/record cases, resource caps, encryption/embedded/external markers, fallback probes, missing legal compatibility samples, version-family unknowns, and parser-readiness blockers. Microsoft Works inputs do not route to RTF, XLS, XLSX, generic archive, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, real stream/version validation, database/spreadsheet parsing, embedded-object recursion, external-link traversal, decryption, native/converter dependency, or parser-readiness support is claimed.
Microsoft Access	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #182; generated `.mdb`/`.accdb` corpus fixtures cover Access classification, guidance, and no XLS/XLSX/ODS, DBF/DIF/SYLK, Works, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while `testdata/access-preflight` records synthetic Jet/ACE sentinels, malformed/version/object-count cases, password/encryption and linked-table markers, and inventory-only `.mde`/`.accde` candidates. No table extraction, query execution, form/report rendering, macro/VBA execution, linked-table traversal, external data-source access, embedded-object recursion, decryption, repair, native ACE/Jet dependency, inventory-only extension public classification, or extractor-backed support is claimed.
SQLite database	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #188; generated `.sqlite`/`.sqlite3` corpus fixtures cover SQLite classification, guidance, and no XLS/XLSX/ODS, Access, DBF/DIF/SYLK, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while `testdata/sqlite-preflight` records SQLite header/page-size/version observations, malformed short/bad headers, resource-cap cases, fallback probes, and inventory-only `.db`, `.db3`, `.sqlite2`, `.sdb`, WAL, and SHM sidecar candidates. No SQL execution, schema/table extraction, row/cell extraction, FTS/index parsing, WAL replay, extension loading, encryption/decryption, repair, native SQLite dependency, sidecar public classification, or extractor-backed support is claimed.
DBF/dBASE	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #183; generated `.dbf` corpus fixtures cover DBF/dBASE classification, guidance, and no XLS/XLSX/ODS, Access, DIF/SYLK, Works, Quattro Pro, Lotus 1-2-3, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while `testdata/dbf-preflight` records deterministic DBF header/version observations, malformed/header/resource-cap cases, fallback probes, and inventory-only `.dbt`/`.fpt`/`.ndx`/`.mdx`/`.cdx` sidecar candidates. No table/row/cell extraction, memo parsing, codepage decoding claims, deleted-record recovery, index parsing, shapefile/geospatial support, formula/macro execution, native/converter dependency, sidecar public classification, or extractor-backed DBF parser support is claimed.
DIF	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #184; generated `.dif` corpus fixtures cover DIF classification, guidance, and no CSV/TSV, plain-text, HTML/XML, RTF, archive, OLE, XLS/XLSX/ODS, Access, DBF, Works, Quattro, Lotus, or SYLK fallback, while shared `testdata/dif-sylk-preflight` records deterministic text-header observations, malformed/resource-cap cases, fallback probes, and inventory-only `.sylk` split evidence. No cell/table extraction, formula execution or evaluation, codepage/encoding normalization, external-link traversal, native/converter dependency, magic-only detection, or extractor-backed support is claimed.
SYLK	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #184; generated public `.slk` corpus fixtures cover SYLK classification, guidance, and no CSV/TSV, plain-text, HTML/XML, RTF, archive, OLE, XLS/XLSX/ODS, Access, DBF, Works, Quattro, Lotus, or DIF fallback, while shared `testdata/dif-sylk-preflight` keeps `.sylk` inventory-only. No cell/table extraction, formula execution or evaluation, codepage/encoding normalization, external-link traversal, native/converter dependency, magic-only detection, `.sylk` public classification, or extractor-backed support is claimed.
Microsoft Project	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #180; generated `.mpp`/`.mpt`/`.mpx` corpus fixtures cover Project classification, guidance, and no DOC/XLS/PPT/OOXML/ODF, Visio, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while `testdata/project-preflight` records synthetic CFB/MPX sentinels, malformed/version/task/resource/record-count cases, password/VBA/external-link/embedded-object markers, and fallback probes. No task extraction, resource extraction, schedule calculation, formula evaluation, Gantt/timeline rendering, image/OCR extraction, macro/VBA execution, external-link traversal, embedded-object recursion, decryption, repair, native/converter dependency, or extractor-backed support is claimed.
MOBI/AZW	Yes	Limited	Supported corpus + preflight fixtures	No	Bounded #145 support extracts unencrypted UTF-8 `.mobi`, `.azw`, and `.azw3` text records when the input validates as PDB-style `BOOK`/`MOBI` and uses uncompressed or classic PalmDOC compression. Generated supported corpus fixtures cover `.mobi`/`.azw`/`.azw3` extraction, while legacy minimal corpus fixtures and `testdata/mobi-preflight` keep unsupported-subset/no-fallback evidence for empty/minimal headers, numeric-only EXTH inventory, encoding-marker classification, unsupported Windows-1252/unknown encodings, HUFF/CDIC, encryption, malformed PalmDOC bytes, and output limits. No DRM/decryption, Windows-1252 decoding, metadata extraction, EXTH value decoding, HTML/XHTML conversion, rendering, resource extraction, embedded-payload recursion, external-link traversal, AZW4/Topaz/KFX support, or generic `.prc`/`.pdb` classification is supported.
Kindle KFX/Topaz/AZW4	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #176; generated `.kfx`/`.tpz`/`.azw1`/`.azw4` corpus fixtures cover Kindle KFX/Topaz/AZW4 classification, guidance, and no PDF, EPUB/ZIP, archive, MOBI/AZW, or plain-text fallback, while `testdata/kindle-ebook-boundary` keeps generic `.prc`/`.pdb`, `.azw6`, `.azw8`, and `.azw9` inventory-only. No Kindle KFX/Topaz/AZW4 text extraction, DRM/decryption, metadata extraction, EXTH value decoding, rendering, resource extraction, embedded-payload recursion, external-link traversal, converter/native dependency, or extractor-backed support is claimed.
QuarkXPress	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #178; generated `.qxd`/`.qxp` corpus fixtures cover QuarkXPress classification, guidance, and no ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while `testdata/quarkxpress-preflight` records synthetic page-layout sentinels, malformed/header/object-count cases, external-resource markers, and inventory-only `.qxt`/`.qpt`/`.qxb`/`.qxl`/`.xtg` candidates. No layout rendering, text extraction, metadata extraction, font interpretation, image/OCR extraction, external-resource loading, color management, converter/native dependency, inventory-only extension public classification, or extractor-backed support is claimed.
Scribus SLA	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #192; generated `.sla` corpus fixtures cover extension-gated Scribus classification, guidance, and no generic XML, gzip/archive, HTML, RTF, InDesign, QuarkXPress, Publisher, FrameMaker, or plain-text fallback, while `testdata/scribus-preflight` records Scribus XML root/version observations, `.sla.gz` inventory-only candidates, malformed XML/gzip cases, external image/font markers, traversal/script markers, resource caps, and fallback probes. No text extraction, metadata extraction, layout rendering, font/image loading, color management, script execution, external-resource traversal, gzip public support, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed.
Adobe PageMaker	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #193; generated `.pmd`/`.p65`/`.pm6` corpus fixtures cover PageMaker classification, guidance, and no InDesign, QuarkXPress, Scribus SLA, Publisher, FrameMaker, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while `testdata/pagemaker-preflight` records synthetic layout sentinels, malformed header/version cases, external image/font markers, traversal markers, resource caps, fallback probes, and inventory-only `.pm5`/`.pm4`/`.pmt`/`.t65` candidates. No text extraction, metadata extraction, layout rendering, font interpretation, image/OCR extraction, external-resource loading, color management, script execution, converter/native dependency, inventory-only public classification, or extractor-backed support is claimed.
Quattro Pro	Yes	No	Unsupported corpus + feasibility inventory	No	Detected unsupported boundary for #159; committed generated fixtures cover extension-gated `.qpw`/`.wb1`/`.wb2`/`.wb3` classification and no-fallback behavior, while `testdata/quattro-preflight` records fixture inventory, generated `.wb1`/`.wb2` PRONOM BOF signature/version markers, malformed signature boundaries, missing legal compatibility samples, and remaining parser-readiness blockers only. Quattro Pro inputs do not route to XLS, XLSX, ODS, generic archive, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, formula execution, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, workbook/sheet/cell validation, public support registration, or extractor-backed parser support is claimed.
Lotus 1-2-3	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #164; generated `.wk1`/`.wk3`/`.wk4`/`.123` corpus fixtures cover Lotus 1-2-3 classification, guidance, and no XLS/XLSX/ODS, DBF/DIF/SYLK, Works, Quattro Pro, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while `testdata/lotus-preflight` records generated PRONOM BOF signature/version fixtures, malformed signature boundaries, missing legal compatibility samples, and parser-readiness blockers only. No text extraction, metadata extraction, workbook/sheet/cell parsing, formula execution, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, public support registration, or extractor-backed Lotus parser support is claimed; `.wks` remains owned by the Microsoft Works boundary.
HWPX	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #162; generated `testdata/corpus/basic/*hwpx.hwpx` fixtures cover extension-gated `.hwpx` classification, guidance, and no ZIP/TAR/7z, OLE, HTML/XML, RTF, plain-text, or legacy HWP fallback, while `testdata/hwp-hwpx-preflight` records HWPX package-spine, malformed package/XML/resource-cap, and external-resource inventory. No text/metadata extraction, package-member extraction, OWPML/body parsing, rendering/layout, decryption, script execution, converter/native dependency, or extractor-backed support is claimed.
AutoCAD DWG/DXF	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #175; generated `.dwg`/`.dxf` corpus fixtures cover extension-gated AutoCAD classification, guidance, and no PDF/PostScript, ZIP/archive, OLE, HTML/XML, RTF, or plain-text fallback, while `testdata/autocad-preflight` records ASCII DXF section/table/entity and TEXT/MTEXT marker inventory, DWG `AC10xx` sentinels, malformed/header/resource-cap cases, and fallback-signature risk only. No CAD text extraction, metadata extraction, geometry interpretation, rendering/OCR, external-reference traversal, converter/native dependency, or extractor-backed support is claimed.
DWF/DWFx	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #177; generated `.dwf`/`.dwfx` corpus fixtures cover DWF/DWFx classification, guidance, and no PDF/PostScript, XPS/OXPS, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, image/OCR, or Visio fallback, while `testdata/dwf-dwfx-preflight` records synthetic package/signature fixtures, malformed package-shape cases, traversal member names, external-resource markers, embedded asset deferral, and package entry/byte caps. No text extraction, metadata extraction, CAD geometry extraction, fixed-layout rendering, image/OCR extraction, embedded-resource traversal, external-resource loading, native/converter dependency, or extractor-backed support is claimed.
HP PCL/PCLXL	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #191; generated `.pcl` corpus fixtures cover extension-gated HP PCL/PCLXL classification, guidance, and no PDF/PostScript, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or image/OCR fallback, while `testdata/pcl-preflight` records PJL, PCL escape-sequence, PCLXL sentinel, malformed/resource-cap, embedded-resource, fallback, and inventory-only `.prn`/`.pxl`/`.pclxl`/`.spl` evidence. No print-stream interpretation, text extraction, font interpretation, rendering/OCR, PJL execution, embedded-resource extraction, printer emulation, Ghostscript/native dependency, inventory-only extension public classification, or extractor-backed support is claimed.
Legacy HWP	No	No	Evaluation preflight inventory	No	Evaluation-only #162 lane; `testdata/hwp-hwpx-preflight` records generated `.hwp` FileHeader sentinels that are not valid CFB documents. No public `.hwp` detection, detected-unsupported no-fallback behavior, valid legacy CFB parser support, text/metadata extraction, decryption, converter/native dependency, or extractor-backed support is claimed.
StarOffice/OpenOffice legacy	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #166; generated `.sxw`/`.stw`/`.sxc`/`.stc`/`.sxi`/`.sti`/`.sxd`/`.std`/`.sxm`/`.sxg`/`.sdw`/`.sdc`/`.sdd`/`.sda` corpus fixtures cover classification, guidance, and no ODT/ODS/ODP/ODG, ZIP/TAR/7z, OLE, HTML/XML, RTF, or plain-text fallback, while `testdata/staroffice-openoffice-preflight` records package-shape fixtures, legacy binary sentinels, malformed ZIP/XML/resource boundaries, and parser-readiness blockers. No OpenDocument aliasing, text/cell/metadata extraction, macro/script or formula execution, converter/native dependency support, or extractor-backed support is claimed.
FictionBook FB2	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #163; generated `.fb2` and `.fb2.zip` corpus fixtures cover FictionBook FB2 classification, guidance, and no PDF, EPUB, MOBI/AZW, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while `testdata/fb2-preflight` records XML/package observations, malformed XML/resource caps, and ZIP member/traversal boundaries. No text extraction, metadata extraction, resource extraction, embedded-payload recursion, external-link traversal, converter/native dependency, public support registration, or extractor-backed support is claimed.
OpenDocument ancillary variants	Yes	No	Unsupported corpus + boundary inventory	No	Detected unsupported boundary for #165; generated `.odm`/`.odb`/`.odf`/`.odc` corpus fixtures cover OpenDocument ancillary classification, guidance, and no ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or neighboring ODT/ODS/ODP/ODG fallback, while `testdata/opendocument-ancillary-boundary` records variant separation and remaining parser questions. No master-document traversal, database/formula execution, chart rendering, macro/script execution, embedded-payload recursion, external-link traversal, decryption, native/converter dependency, or extractor-backed support is claimed.
AppleWorks/ClarisWorks	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #167; generated `testdata/corpus/basic/*appleworks.cwk` fixtures cover extension-gated `.cwk` classification, guidance, and no iWork/Pages, ZIP/TAR/7z, OLE, HTML, XML, RTF, or plain-text fallback, while `testdata/appleworks-preflight` records BOBO shared-signature inventory, byte caps, MacBinary/AppleDouble resource-fork deferral, and PRONOM observations. No text/metadata/cell extraction, graphics/formula parsing, converter/native dependency, or extractor-backed support is claimed.
Comic-book archives	Yes	No	Unsupported corpus + preflight fixtures	No	Detected unsupported boundary for #168; generated `testdata/corpus/basic/comic` and `minimal_cb.cb` fixtures cover extension-gated `.cbz`, `.cbt`, `.cb7`, and `.cbr` classification, guidance, and no generic ZIP/TAR/7z archive, PDF, EPUB, OLE, HTML/XML, RTF, plain-text, image/OCR, or RAR fallback, while `testdata/comic-archive-preflight` records container/signature observations, image-only entries, ComicInfo.xml inventory, text-sidecar boundaries, and traversal/resource-cap boundaries. No archive listing metadata, ComicInfo.xml metadata extraction, RAR support, image extraction/OCR, sidecar text extraction, recursive dispatch, converter/native dependency, or extractor-backed support is claimed.
ZIP	Yes	Yes	Yes	Yes	One page-like section per entry, with a pinned MIT archive compatibility sample; traversal paths are rejected. Archive entry bytes are decoded as UTF-8-lossy text and are not recursively dispatched to other extractors, so generic ZIP output does not amount to modern iWork/Numbers/Keynote support or to support for rejected OOXML/ODF and package boundaries.
TAR	Yes	Yes	Yes	Yes	Tar and gzip-compressed tar (`.tar.gz`, `.tgz`) via the same archive extractor, with pinned BSD-3-Clause GNU tar compatibility sample. Archive entry bytes are decoded as UTF-8-lossy text and are not recursively dispatched to other extractors.
7Z	Yes	Limited	Yes	Yes	Unencrypted non-solid 7z archives with unencoded headers, one coder per folder, no filters, and COPY, DEFLATE, BZIP2, or bounded-dictionary LZMA2 streams, with pinned Apache-2.0 COPY compatibility sample; LZMA, high-dictionary LZMA2, encrypted, encoded-header, solid, filtered, and coder-chain archives are rejected.
PDF	Yes	No	Unsupported fixture	No	Out of scope; use `dpdf` for PDF parsing and OCR.
OCR / images	No	No	No	No	Out of scope for this repo.

See docs/archive-policy.md for the archive recursion and resource-limit policy.
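
The archive rows above describe two behaviors: traversal-shaped entry paths are rejected, and entry bytes are decoded as UTF-8-lossy text without recursive dispatch. The following is a minimal sketch of that idea using only the Rust standard library. The function names `is_safe_entry_path` and `decode_entry` are hypothetical and are not taken from `dextract-archive`; the real policy lives in docs/archive-policy.md and may differ in detail.

```rust
use std::path::{Component, Path};

/// Hypothetical helper: accept an archive entry name only if every
/// component is a normal path segment (no `..`, no root, no leading `.`).
fn is_safe_entry_path(name: &str) -> bool {
    // Assumption for this sketch: empty names and backslash separators
    // are rejected outright instead of being normalized.
    if name.is_empty() || name.contains('\\') {
        return false;
    }
    Path::new(name)
        .components()
        .all(|c| matches!(c, Component::Normal(_)))
}

/// Hypothetical helper: fail closed on unsafe names, otherwise decode the
/// entry bytes as lossy UTF-8. The bytes are never handed to another extractor.
fn decode_entry(name: &str, bytes: &[u8]) -> Result<String, String> {
    if !is_safe_entry_path(name) {
        return Err(format!("rejected traversal-shaped entry path: {name}"));
    }
    Ok(String::from_utf8_lossy(bytes).into_owned())
}

fn main() {
    // A normal entry decodes; invalid UTF-8 becomes U+FFFD.
    println!("{:?}", decode_entry("notes/a.txt", b"hi \xFF"));
    // A traversal-shaped entry is rejected before any decoding happens.
    println!("{:?}", decode_entry("../etc/passwd", b"root"));
}
```

Rejecting before decoding keeps the failure mode closed: an unsafe name never produces partial output.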

Fixture-backed means covered by committed baseline fixtures or explicitly named unsupported-boundary fixtures under testdata/corpus/ or a format-specific testdata/ directory. Compatibility-backed means covered by pinned public samples under testdata/realworld/. Absence of a compatibility fixture does not mean a format is unsupported; it means the release evidence is currently limited to unit tests, integration tests, and baseline fixtures.

Repository Layout

dExtract is a workspace with 32 publishable crates plus five unpublished internal preflight crates. The public build uses the members declared in the root Cargo.toml, including bounded extractors for legacy .ppt, legacy XML iWork, CHM, OneNote visible-text recovery, legacy Visio binary visible-text recovery, PostScript/EPS, DjVu TXTa, MOBI/AZW/AZW3, and WordPerfect WPC text extraction.

Path	Purpose
`dextract-types`	Shared traits and data types for extractors and outputs.
`dextract-ole`	Shared bounded OLE preflight validation for legacy Office extractors.
`dextract-zip-package`	Shared hardened ZIP package reader for OOXML, OpenDocument, EPUB, and XPS/OXPS extractors.
`dextract`	Facade crate that registers the built-in extractors.
`dextract-cli`	`dextract` command-line entrypoint.
`dextract-doc`	Legacy DOC extractor.
`dextract-docx`	DOCX extractor.
`dextract-odt`	ODT extractor.
`dextract-xls`	Legacy XLS extractor.
`dextract-xlsb`	XLSB extractor.
`dextract-xlsx`	XLSX extractor.
`dextract-ods`	ODS extractor.
`dextract-ppt`	Legacy PPT extractor with bounded mechanical text and metadata support; no rendering, macros, media extraction, embedded object recursion, or decryption.
`dextract-pptx`	PPTX extractor.
`dextract-odp`	ODP extractor.
`dextract-odg`	ODG/OTG package and flat FODG drawing extractor.
`dextract-postscript`	PostScript/EPS bounded lexical extractor; no execution, rendering, OCR, previews, external resources, or Ghostscript.
`dextract-iwork`	Legacy XML Pages/Numbers/Keynote package extractor.
`dextract-wordperfect`	Bounded WordPerfect WPC5/WPC6 `.wpd` document-area extractor; no macros/templates, rendering, embedded-object recursion, external-resource traversal, converter shellout, or native dependency.
`dextract-iwork-preflight`	Unpublished iWork input preflight primitives; not extractor-backed support.
`dextract-onenote-preflight`	Unpublished OneNote input, `.onepkg` package-inventory preflight primitives, and fail-closed #101 text/object-page/object-graph/revision-table-sequence/page-rich-text-object-reference/visible-text-object readiness blocker checks; not extractor-backed support.
`dextract-mobi-preflight`	MOBI/AZW PDB/MOBI envelope preflight primitives used by `dextract-mobi`, including text-encoding marker classification and bounded uncompressed/classic PalmDOC text-record materialization.
`dextract-chm-preflight`	CHM ITSF envelope, ITSP/PMGL header validation, PMGI/DataSpace readiness, LZX topic decoding support, bounded `.hhc` TOC ordering, and fail-closed malformed reset-table checks used by the limited CHM facade extractor.
`dextract-djvu-preflight`	DjVu IFF/FORM container parser-readiness primitives and bounded TXTa byte materialization for `.djvu`/`.djv`; used by the limited facade extractor and preflight fixtures.
`dextract-abiword-preflight`	Unpublished AbiWord XML/gzip shape inventory, synthetic version-family checks, extraction-risk marker rejection, and structure-only element-count primitives for `.abw` plus non-public `.zabw`/`.abw.gz` candidates; not extractor-backed support.
`dextract-wordperfect-preflight`	Unpublished WordPerfect WPC5/WPC6 header and synthetic payload inventory primitives for `.wpd` plus non-public `.wpt`/`.wcm` candidates; not extractor-backed support.
`dextract-visio-binary-preflight`	Unpublished legacy Visio binary CFB stream-inventory, non-synthetic VisioDocument real-parser/version-policy compatibility/record-map/table/text-run/page-shape/metadata gating, and repo-owned synthetic record/text/page-shape ordering plus record/version/stream consistency parser-readiness primitives for `.vsd`/`.vss`/`.vst`; not real Visio record decoding or extractor-backed support.
`dextract-xps`	XPS/OXPS fixed-layout package extractor.
`dextract-vsdx`	VSDX plus bounded Visio XML/package variant extractor.
`dextract-html`	HTML extractor.
`dextract-csv`	CSV extractor.
`dextract-epub`	EPUB extractor.
`dextract-mobi`	Bounded MOBI/AZW/AZW3 UTF-8 text-record extractor for unencrypted uncompressed/classic PalmDOC inputs.
`dextract-rtf`	RTF extractor.
`dextract-eml`	RFC 5322 email and MHTML extractor.
`dextract-msg`	Outlook MSG extractor.
`dextract-archive`	ZIP, TAR, and limited 7z extractor.
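
To illustrate how a facade crate can register extractors behind a shared trait while keeping detected-unsupported formats from falling back to plain text, here is a minimal sketch. Every name in it (`Extractor`, `Registry`, `Route`, `ExtractError`) is hypothetical and is not the actual `dextract-types` or `dextract` API; see dextract/src/lib.rs for the real registration. As a simplifying assumption the sketch routes on file extension only, whereas the table above describes magic-gated classification for some formats.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ExtractError {
    /// The format was recognized but is deliberately not extracted.
    DetectedUnsupported { format: &'static str },
    /// The input matched no registered route.
    Unknown,
}

/// Hypothetical shared trait, standing in for whatever the types crate defines.
trait Extractor {
    fn extract(&self, bytes: &[u8]) -> Result<String, ExtractError>;
}

struct PlainText;

impl Extractor for PlainText {
    fn extract(&self, bytes: &[u8]) -> Result<String, ExtractError> {
        Ok(String::from_utf8_lossy(bytes).into_owned())
    }
}

/// A route is either a registered extractor or an explicit unsupported boundary.
enum Route {
    Supported(Box<dyn Extractor>),
    Unsupported(&'static str),
}

struct Registry {
    routes: HashMap<&'static str, Route>,
}

impl Registry {
    fn with_builtins() -> Self {
        let mut routes = HashMap::new();
        routes.insert("txt", Route::Supported(Box::new(PlainText)));
        // Detected-unsupported boundary: classified, never extracted.
        routes.insert("pub", Route::Unsupported("Publisher"));
        Registry { routes }
    }

    fn extract(&self, file_name: &str, bytes: &[u8]) -> Result<String, ExtractError> {
        let ext = file_name
            .rsplit_once('.')
            .map(|(_, e)| e.to_ascii_lowercase())
            .unwrap_or_default();
        match self.routes.get(ext.as_str()) {
            Some(Route::Supported(extractor)) => extractor.extract(bytes),
            // No fallback: the bytes are not retried as plain text.
            Some(Route::Unsupported(format)) => {
                Err(ExtractError::DetectedUnsupported { format: *format })
            }
            None => Err(ExtractError::Unknown),
        }
    }
}

fn main() {
    let registry = Registry::with_builtins();
    println!("{:?}", registry.extract("notes.txt", b"hello"));
    println!("{:?}", registry.extract("flyer.pub", b"looks like plain text"));
}
```

Modeling the unsupported boundary as its own route variant, instead of leaving the extension unregistered, is what lets the caller distinguish "recognized and refused" from "unknown input".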

Reference Docs

Cargo.toml in the repository root for the workspace manifest and public repo URL.
dextract/src/lib.rs in the repository for the facade API and built-in extractor registration.
dextract-cli/src/main.rs in the repository for CLI behavior and subcommands.
RELEASING.md, ROADMAP.md, testdata/README.md, and scripts/fetch_realworld_corpus.py in the repository for release and corpus guidance.

Development

Run the workspace checks from the repo root:

cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features --locked -- -D warnings
cargo test --workspace --all-targets --locked

Useful release-time checks:

bash scripts/check-release-tooling.sh
bash scripts/check-github-workflows.sh
cargo doc --workspace --no-deps --locked
bash scripts/check-package-list.sh
bash scripts/check-supply-chain.sh
bash scripts/check-api-compat.sh --required
python3 scripts/check_test_corpus_drift.py
python3 scripts/validate_format_gap_issue_drafts.py --quiet
cargo run -p dextract-cli -- formats
cargo run -p dextract-cli -- formats --json
cargo run -p dextract-cli -- formats --all --json
cargo package --list -p dextract
cargo publish --dry-run -p dextract-types

For reproducible local extraction performance checks against committed fixtures, use the CLI benchmark harness:

short_sha="$(git rev-parse --short=7 HEAD)"
python3 scripts/bench_extractors.py \
  --warmups 2 \
  --iterations 5 \
  --fixture-set all \
  --mode both \
  --output "target/perf/refreshed-performance-baseline-${short_sha}.json"

See docs/performance.md and docs/performance-baseline.md for the benchmark contract and the current same-machine baseline. For ad hoc, library-only timing against representative files, `cargo run -p dextract --example simple_bench` remains available.

The committed fixture corpus used for facade-level support checks lives under testdata/corpus/. Pinned public compatibility samples live under testdata/realworld/.

License

dExtract is released under the Apache License 2.0.
