ghuntley/dextract

 
 


dExtract

Pure-Rust text extraction for common non-PDF local file formats and archives.

dExtract mechanically extracts structured text from common non-PDF local files without shelling out to system tools or external services. The facade currently registers built-in extractors for:

  • plain text
  • CSV/TSV
  • legacy Word/Excel/PowerPoint (doc, xls, ppt)
  • legacy iWork XML packages (pages, numbers, key)
  • Excel binary workbooks (xlsb)
  • Office Open XML (docx, xlsx, pptx, plus accepted macro/template/slideshow aliases)
  • OpenDocument (odt/fodt, ods/fods, odp/fodp, odg/otg/fodg, plus accepted template aliases)
  • XPS/OXPS (xps, oxps)
  • Visio drawing/XML/package/binary variants (vsdx, vdx, vsx, vtx, vsdm, vssx, vssm, vstx, vstm, vsd, vss, vst)
  • CHM topic text (chm)
  • bounded OneNote visible-text recovery (one)
  • PostScript/EPS lexical text recovery (ps, eps)
  • bounded DjVu TXTa text-layer extraction (djvu, djv)
  • WordPerfect WPC document-area text (wpd)
  • HTML/MHTML
  • EPUB
  • bounded MOBI/AZW/AZW3 UTF-8 text records
  • RTF
  • RFC 5322 / Outlook mail (eml, msg)
  • ZIP/TAR/limited 7z inputs, including gzip-compressed tar (.tar.gz, .tgz)

Several extractors are deliberately limited: modern iWork IWA and directory packages, OneNote package/rich-object parsing, CHM rendering/scripts/resources, legacy Visio record/page semantics, macros, embedded payload recursion, OCR, and layout rendering remain out of scope.

TeX/LaTeX sources, mailbox stores such as mbox/Maildir, WordStar documents, AbiWord documents, Microsoft Works files, Microsoft Access databases, SQLite database files, Microsoft Project schedule/interchange files, Outlook PST/OST mailbox stores, IBM Notes/Domino NSF/NTF databases, Microsoft InfoPath XSN form packages, OpenDocument ancillary variants, SPSS/SAS/Stata statistical data files, WARC/WACZ/Safari webarchive captures, FileMaker Pro databases, Kindle KFX/Topaz/AZW4 files, QuarkXPress documents, Scribus SLA page-layout files, Adobe PageMaker layout files, StarOffice/OpenOffice legacy files, Quattro Pro spreadsheets, FrameMaker MIF/FM documents, InDesign IDML packages, DWF/DWFx fixed-layout drawing packages, HP PCL/PCLXL print streams, and OLE-gated Publisher .pub inputs remain unsupported boundaries. PDF parsing and OCR belong in dpdf.

Installation

Until the crates are published to crates.io, install the CLI from the public Git repository:

cargo install --git https://github.com/andrewdyates/dextract dextract-cli

Use the library from the public Git branch:

[dependencies]
dextract = { git = "https://github.com/andrewdyates/dextract" }

From a checkout of this repo, install the CLI without publishing:

cargo install --path dextract-cli


Minimum Supported Rust Version

dExtract's workspace MSRV is Rust 1.88, declared in the root Cargo.toml. This follows the current dependency graph and may increase when dependencies require a newer compiler.

Local Supply-Chain Checks

The repository carries a local cargo-deny policy in deny.toml for RustSec advisories, yanked crates, license allow-listing, duplicate-version visibility, wildcard dependency bans, and crate source restrictions. Run it from a checkout with:

bash scripts/check-supply-chain.sh

Install the pinned CI version locally with cargo +stable install cargo-deny --version 0.19.4 --locked if the script reports that it is missing or mismatched. Override the pin only when updating CI with CARGO_DENY_VERSION=<version>. Use bash scripts/check-supply-chain.sh --offline only after Cargo indexes and the RustSec advisory database are cached.

Local API Compatibility Checks

Public Rust API compatibility is checked with cargo-semver-checks for workspace library crates only:

bash scripts/check-api-compat.sh

Install the pinned CI version with cargo install cargo-semver-checks --version 0.42.0 --locked if the script reports that it is missing or mismatched. Override the pin only when updating CI with CARGO_SEMVER_CHECKS_VERSION=<version>. The check is dev-only and uses crates.io as the baseline source, so it skips crates that have not been published yet; that is expected before the first crates.io release.

Local GitHub Workflow Checks

GitHub Actions workflow validation is available as a lightweight local check:

bash scripts/check-github-workflows.sh

Install the pinned CI version of actionlint with go install github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 for full workflow syntax and expression validation. Without actionlint, the script still runs the repository-specific structural checks that keep release-critical CI gates wired.

Quick Start

CLI usage after cargo install --path dextract-cli:

dextract legacy.doc
dextract mail.eml
dextract report.docx
dextract --json workbook.ods
dextract --json report.xlsx
dextract --output extracted.txt notes.rtf
dextract --process-isolation legacy.doc
dextract --max-input-bytes 10485760 report.docx
dextract batch ./documents --budget 100000
dextract batch ./documents --max-input-bytes 10485760
dextract batch ./documents --recursive --max-files 5000
dextract batch ./documents --json
dextract formats
dextract formats --all --json
dextract completions zsh > _dextract
dextract manpage > dextract.1

From a fresh checkout, run the same commands with cargo run -p dextract-cli -- .... The CLI can extract one or more files, write sibling *.extracted.txt files for a directory, print the extractor-backed format matrix, and print the broader detected-format matrix with formats --all.

CLI output semantics:

  • Plain extraction writes extracted page text to stdout, or to --output when provided for a single input. Multiple plain-text inputs written to stdout are separated by one newline; multiple pages inside one input are separated by a form-feed byte.
  • --json writes one JSON object for one input file and one JSON array for multiple input files. That object-vs-array shape is the documented CLI contract.
  • JSON uses canonical lowercase format names such as docx, odt, xps, csv, eml, mhtml, and text.
  • --process-isolation runs each root-command extraction in a child worker process with the same output and exit-code contract. It can also be enabled with DEXTRACT_PROCESS_ISOLATION=1. This is not a sandbox: the parent still reads each input file into memory, the worker has a 30-second timeout and 64 MiB JSON response cap, and callers still need OS/container limits for hostile inputs.
  • --format accepts extractor-backed formats plus aliases such as html / htm, mhtml / mht, text / txt / plain, legacy ppt, pages, numbers, key / keynote, onenote / one / onetoc2 / onepkg, chm, visio-binary / vsd / vss / vst, and accepted OOXML/OpenDocument family variants (docm, dotx, dotm, xlsb, xlsm, xltx, xltm, pptm, potx, potm, ppsx, ppsm, sldx, sldm, ott, fodt, ots, fods, otp, fodp, odg, otg, fodg, xps, oxps, vsdx, visio-package, vsdm, vssx, vssm, vstx, vstm, visio-xml, vdx, vsx, vtx, postscript, ps, eps, djvu, djv, wordperfect, and wpd). It does not accept detected-but-unsupported formats such as wordstar, ws, ws1, ws2, ws3, ws4, ws5, ws6, ws7, ws8, wsd, wsm, wst, wsb, wsx, publisher, pub, access, mdb, accdb, mde, accde, sqlite, sqlite3, sqlite-database, db, db3, sqlite2, sdb, sqlite-wal, sqlite-shm, sqlite3-wal, sqlite3-shm, db-wal, db-shm, opendocument-ancillary, odm, odb, odf, odc, dbf, dbase, dif, sylk, slk, project, mpp, mpt, mpx, outlook-pst-ost, pst, ost, ibm-notes-domino, notes, domino, nsf, ntf, infopath-xsn, infopath, xsn, xsf, warc-wacz-webarchive, web-archive, warc, warc-gz, warc.gz, wacz, webarchive, safari-webarchive, filemaker, filemaker-pro, fmp12, fp7, fp5, fp3, fmp, fmpur, usr, fmpsl, pagemaker, adobe-pagemaker, pmd, p65, pm6, pm5, pm4, pmt, t65, dwf-dwfx, dwf, dwfx, pcl-pclxl, hp-pcl, hp-pclxl, pcl, pclxl, pxl, prn, spl, staroffice-openoffice-legacy, staroffice-openoffice, staroffice, openoffice-legacy, openoffice, sxw, stw, sxc, stc, sxi, sti, sxd, std, sxm, sxg, sdw, sdc, sdd, sda, tex-latex, tex, latex, ltx, mobi, azw, or azw3.
  • Auto-detection does not fall back to plain text for arbitrary unknown bytes. Plain text is selected only for known text-like extensions such as .txt, .md, .markdown, and .json, or when explicitly forced with --format text. Standalone .xml is not auto-detected as structured XML; use --format text to extract it as raw text.
  • Unknown and detected-but-unsupported formats fail root-command extraction with UnsupportedFormat and exit code 1. batch skips them; batch --json reports status: "skipped", the detected format, and guidance.
  • --budget BYTES is a best-effort extracted-text byte guard in v0.1.0. Text-like, DOC, DOCX, XLS, XLSB, ODT, and modern package extractors generally return partial output with truncated = true; MSG now follows the same caller-budget truncation contract while preserving hard errors for internal parser safety caps. 0 disables only the caller-specified text limit, not internal parser, package, or archive safety limits.
  • --max-input-bytes BYTES checks each input file's metadata length before reading it into memory; 0 disables this pre-read input cap. Over-limit root extraction exits with code 1 and no stdout. In batch, over-limit files are recorded as failed, JSON reports use stage: "input_limit", and remaining files continue processing.
  • batch --max-files FILES bounds directory traversal before extraction starts so large trees cannot accumulate unbounded path lists in memory. The default cap is 10000 scanned filesystem files; 0 disables this traversal cap. Over-limit batch traversal exits with code 1, emits no JSON report, and does not write partial extraction outputs.
  • formats lists extractor-backed formats. formats --all also lists detected-but-unsupported formats with status and guidance.
  • completions <shell> writes a shell completion script to stdout for Bash, Elvish, Fish, PowerShell, or Zsh.
  • manpage writes a roff manpage to stdout for packaging or local installation.
  • batch scans only direct children of the target directory by default. Use batch --recursive to opt into nested filesystem directory traversal. It writes sibling *.extracted.txt files, skips generated *.extracted.txt outputs on rerun, reports progress on stderr, writes no document text to stdout, skips unsupported detections, and exits successfully when every detected input is unsupported.
  • batch --json preserves the same extraction side effects and exit-code policy, but writes a machine-readable report to stdout with summary counters and per-file extracted, skipped, or failed rows.
  • CLI memory policy: each input file is read into memory before extraction, including when root extraction uses --process-isolation. batch and root-command multi-file extraction process one input file at a time. Root multi-file output and batch --json per-file report rows are spooled to temporary files before final stdout or --output writes, so later failures do not emit partial stdout and the CLI no longer retains every extracted document/report row in memory. --budget limits extracted text only; --max-input-bytes is the CLI pre-read input file size cap; JSON modes change reporting shape and output buffering, not input buffering or parser memory. Use OS/process/container limits for large or untrusted inputs; process isolation adds a child-process boundary, timeout, and output cap, but does not replace those limits.
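
The plain-output separator rule above can be consumed with a few lines of standard-library Rust. This is a minimal sketch for the captured stdout of a single input file; the helper name `split_pages` is hypothetical and not part of dExtract. It handles one input only, because the newline that separates multiple inputs cannot be told apart from newlines inside page text.

```rust
/// Hypothetical helper (not part of dExtract): splits the plain-text output
/// captured for ONE input file into pages and numbers them starting at 1.
/// Pages inside a single input are separated by a form-feed byte (0x0C).
fn split_pages(output: &str) -> Vec<(usize, &str)> {
    output
        .split('\u{000C}')
        .enumerate()
        // enumerate() counts from 0, so shift to 1-based page numbers.
        .map(|(index, text)| (index + 1, text))
        .collect()
}
```

When several inputs are extracted in one run, the --json mode described above gives an unambiguous per-file structure.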

Exit-code policy:

  • 0: success, including a batch run where all detected inputs are unsupported.
  • 1: runtime, extraction, input, output, or filesystem failure.
  • 2: command-line usage error reported by clap.
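
A caller that spawns the CLI can map this policy onto its own types. The sketch below assumes only the three documented codes; the `Outcome` enum and `classify_exit` function are hypothetical names used for illustration.

```rust
/// Hypothetical caller-side view of the documented exit-code policy.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// 0: success, including a batch run where every detected input is unsupported.
    Success,
    /// 1: runtime, extraction, input, output, or filesystem failure.
    Failure,
    /// 2: command-line usage error reported by clap.
    UsageError,
    /// Any other code, or no code at all (for example, killed by a signal).
    Other(Option<i32>),
}

/// Classifies the value returned by `std::process::ExitStatus::code()`.
fn classify_exit(code: Option<i32>) -> Outcome {
    match code {
        Some(0) => Outcome::Success,
        Some(1) => Outcome::Failure,
        Some(2) => Outcome::UsageError,
        other => Outcome::Other(other),
    }
}
```

The input would typically come from `std::process::Command::new("dextract").arg("report.docx").status()?.code()`.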

Library usage:

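use dextract::DExtract;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the whole input into memory; extraction works on a byte slice.
    let data = std::fs::read("report.docx")?;
    let dx = DExtract::new();
    // The filename is passed alongside the bytes.
    let output = dx.extract(&data, "report.docx", 0)?;

    // page_number is 1-indexed, in output order.
    for page in &output.pages {
        println!("Page {}: {}", page.page_number, page.text);
    }
    Ok(())
}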

If detection returns DocumentFormat::Unknown, DExtract::extract() returns ExtractionError::UnsupportedFormat; it does not treat unknown input as plain text unless the filename has a supported text-like extension. Use extract_as() when intentionally forcing plain text or another extractor-backed format.

WordStar inputs (ws, ws1, ws2, ws3, ws4, ws5, ws6, ws7, ws8, and wsd) are detected-but-unsupported and do not fall through to plain text, RTF, OLE, ZIP/TAR/7z archive, HTML, or XML extraction. Adjacent macro/template/backup/index candidates (wsm, wst, wsb, and wsx) remain inventory-only. Text extraction, metadata extraction, formatting interpretation, dot-command interpretation, codepage decoding, print-control interpretation, macro execution, embedded-object extraction, native/converter dependency support, and extractor-backed support remain outside this boundary.

AppleWorks/ClarisWorks inputs (cwk) are detected-but-unsupported and do not fall through to iWork/Pages, ZIP/TAR/7z archive, OLE, HTML, XML, RTF, or plain-text extraction. Adjacent templates/stationery (cws, cwt), MacBinary/AppleDouble resource-fork material, text/metadata/cell extraction, graphics/formula parsing, converter/native dependency support, and extractor-backed support remain outside this boundary.

HWPX packages (hwpx) are detected-but-unsupported and do not fall through to ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or legacy HWP handling. Legacy HWP (hwp) remains evaluation-only because current fixtures are generated FileHeader sentinel bytes, not valid CFB documents.

StarOffice/OpenOffice legacy inputs (sxw, stw, sxc, stc, sxi, sti, sxd, std, sxm, sxg, sdw, sdc, sdd, sda) are detected-but-unsupported and do not fall through to supported ODT/ODS/ODP/ODG, generic ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text extraction. This boundary does not add OpenDocument aliasing, text/cell/metadata extraction, macro/script or formula execution, converter/native dependency support, or extractor-backed support.

Comic-book archives (cbz, cbt, cb7, cbr) are detected-but-unsupported and do not fall through to generic ZIP/TAR/7z archive, PDF, EPUB, OLE, HTML/XML, RTF, plain-text, image/OCR, or RAR handling. Archive listing metadata, ComicInfo.xml metadata, text sidecars, page images, recursive dispatch, and extractor-backed support remain outside this boundary.
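
The no-fallback rule in the paragraphs above can be pictured as a three-way classification in which a detected-but-unsupported family never degrades to a generic extractor. The following is an illustrative sketch only, not dExtract's detector: it matches lowercase extensions alone, and the `Detection` enum and `detect_by_extension` function are hypothetical names.

```rust
/// Hypothetical three-way result; real detection is not limited to extensions.
#[derive(Debug, PartialEq)]
enum Detection {
    /// Known text-like extension: plain-text extraction is allowed.
    PlainText,
    /// Recognized family with no extractor; must not fall through to anything.
    DetectedUnsupported(&'static str),
    /// Nothing recognized; also must not be treated as plain text.
    Unknown,
}

fn detect_by_extension(filename: &str) -> Detection {
    // Take the text after the last dot; names without a dot are Unknown.
    let ext = match filename.rsplit_once('.') {
        Some((_, ext)) => ext,
        None => return Detection::Unknown,
    };
    match ext {
        "txt" | "md" | "markdown" | "json" => Detection::PlainText,
        "ws" | "ws1" | "ws2" | "ws3" | "ws4" | "ws5" | "ws6" | "ws7" | "ws8" | "wsd" => {
            Detection::DetectedUnsupported("WordStar")
        }
        "cwk" => Detection::DetectedUnsupported("AppleWorks/ClarisWorks"),
        "hwpx" => Detection::DetectedUnsupported("HWPX"),
        "sxw" | "stw" | "sxc" | "stc" | "sxi" | "sti" | "sxd" | "std" | "sxm" | "sxg"
        | "sdw" | "sdc" | "sdd" | "sda" => {
            Detection::DetectedUnsupported("StarOffice/OpenOffice legacy")
        }
        "cbz" | "cbt" | "cb7" | "cbr" => Detection::DetectedUnsupported("Comic-book archive"),
        // No catch-all plain-text arm: unknown input stays Unknown.
        _ => Detection::Unknown,
    }
}
```

The point of the shape is the final arm: there is no default that quietly routes unrecognized bytes to a text, archive, or OLE extractor.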

Metadata Semantics

Metadata is best-effort and format-dependent in v0.1.0. ExtractionOutput.metadata carries common document-level fields when an extractor can recover them: title, author, subject, creator, producer, keywords, creation date, modification date, and page_count.

ExtractedPage.metadata is page/section-scoped. It may carry extractor-specific page, sheet, slide, archive-entry, mail-part, or HTML page fields such as sheet_name, row_count, slide_number, path, size, content_type, description, or language. These keys are local to the extracted page or section and are not normalized into ExtractionOutput.metadata.

Emitted ExtractedPage values use 1-indexed page_number values in output order. source_format identifies the extractor family that produced the page, and byte_count is the byte length of the emitted text for that page. The cells vector is populated only by extractors with structured table or spreadsheet output; fixed-layout page extraction such as XPS/OXPS emits page text and page metadata but no cells.
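
To make the numbering and byte-count rules concrete, here is a small sketch that uses a hypothetical stand-in struct rather than the crate's actual ExtractedPage definition. `PageSketch` and `build_pages` are illustrative names, and the field types are assumptions.

```rust
/// Hypothetical stand-in for the page fields described above;
/// not the dextract crate's actual type definition.
#[derive(Debug)]
struct PageSketch {
    page_number: usize,
    text: String,
    byte_count: usize,
}

/// Builds pages in output order with 1-indexed numbers and UTF-8 byte lengths.
fn build_pages(texts: &[&str]) -> Vec<PageSketch> {
    texts
        .iter()
        .enumerate()
        .map(|(index, text)| PageSketch {
            // Output order starts at page 1, not 0.
            page_number: index + 1,
            text: text.to_string(),
            // str::len() counts bytes, not characters.
            byte_count: text.len(),
        })
        .collect()
}
```

Note that byte_count can exceed the character count for non-ASCII text, since it measures encoded bytes.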

Some extractors intentionally promote source fields into document metadata. For example, HTML <title> becomes document-level metadata.title, HTML author/generator meta tags become metadata.author and metadata.producer, and OpenOffice-style HTML created/changed/modified meta dates in YYYYMMDD;HHMMSS form are normalized. page_count is the extractor-reported logical/source count when available, not a normalized cross-format guarantee. Text-like single-section formats such as CSV, TSV, HTML, MHTML, RTF, EML, and MSG currently report emitted sections. Office/OpenDocument/EPUB formats may report source-declared pages, sheets, slides, or spine items, which can differ from the number of emitted ExtractedPage sections; EPUB page_count is based on manifest-backed spine items after OPF logical caps are applied. Generic archives currently leave page_count = 0 even when archive entries are emitted as page-like sections; plain text also reports 0 because it has no source page-count metadata.

The DExtract facade catches unwind panics from registered extractors and returns ExtractionError::ExtractorPanicked. It does not install or suppress the process panic hook, so caught panics may still print diagnostics to stderr. Aborts, out-of-memory conditions, stack overflows, and direct extractor calls remain caller/process-level concerns. The CLI's optional --process-isolation mode adds a root-command child-process boundary around extraction; it is separate from the facade panic boundary and still is not a sandbox.
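
The general technique behind such a panic boundary is std::panic::catch_unwind. The sketch below shows the pattern in isolation under stated assumptions: `SketchError` and `run_guarded` are hypothetical names, and this is not the facade's actual implementation.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type standing in for the facade's panic variant.
#[derive(Debug, PartialEq)]
enum SketchError {
    ExtractorPanicked,
}

/// Runs an extractor closure and converts an unwinding panic into an error.
/// The panic hook is left untouched, so the panic message still reaches stderr.
/// Aborts, out-of-memory conditions, and stack overflows are NOT caught here.
fn run_guarded<F>(extractor: F) -> Result<String, SketchError>
where
    F: FnOnce() -> String,
{
    panic::catch_unwind(AssertUnwindSafe(extractor))
        .map_err(|_| SketchError::ExtractorPanicked)
}
```

This only works when the binary is built with the default unwinding panic strategy; with panic = "abort" the process terminates instead.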

Supported Formats

| Format | Detected | Extractor-backed | Fixture-backed | Compatibility-backed | Scope / notes |
| --- | --- | --- | --- | --- | --- |
| Plain text | Yes | Yes | Yes | Yes | UTF-8 lossy passthrough for local text files; page_count remains 0 because no source count is reported, with pinned CC0 compatibility sample. |
| CSV | Yes | Yes | Yes | Yes | Stable tab-separated text plus structured cells; reports one emitted section as page_count = 1, with pinned CC0 compatibility sample. |
| TSV | Yes | Yes | Yes | Yes | Tab-delimited input rendered as stable tab-separated text plus structured cells; reports one emitted section as page_count = 1, with pinned MIT compatibility sample. |
| TeX/LaTeX sources | Yes | No | Unsupported corpus + preflight fixtures | No | #158 registers .tex, .ltx, and .latex as detected-unsupported boundaries. Generated corpus fixtures cover plain, LaTeX, and archive/OLE/HTML/XML/RTF/plain-looking no-fallback inputs; preflight evidence records bounded command, brace, environment, blocked include/resource/execution directive, and resource-cap observations. No TeX execution, macro expansion, include traversal, graphics/bibliography loading, rendering, converter/native dependency, or extraction support is claimed. |
| DOC | Yes | Yes | Yes | Yes | Legacy OLE Word extraction with best-effort document metadata and source page-count metadata when available. |
| DOCX | Yes | Yes | Yes | Yes | Document body plus header/footer text, document metadata, and source page-count metadata when available; accepted docm, dotx, and dotm aliases route here. |
| ODT | Yes | Yes | Yes | Yes | OpenDocument text extraction with document metadata and source page-count metadata when available; accepted ott and flat XML .fodt inputs route here. |
| XLS | Yes | Yes | Yes | Yes | Legacy BIFF workbook extraction with sheet pages, structured cells, and sheet-count page_count. |
| XLSX | Yes | Yes | Yes | Yes | Shared strings and worksheet XML become structured pages and cells with sheet-count page_count; accepted xlsm, xltx, and xltm aliases route here. |
| ODS | Yes | Yes | Yes | Yes | OpenDocument spreadsheet extraction with sheet pages, structured cells, and source sheet-count page_count; accepted ots and flat XML .fods inputs route here. |
| PPT | Yes | Yes | Yes | Yes | Supported with limitations: bounded mechanical legacy PowerPoint text and metadata extraction from synthetic fixtures plus one pinned CC0 real-world sample; no rendering, macros, media extraction, embedded object recursion, or decryption. |
| PPTX | Yes | Yes | Yes | Yes | Slide text, speaker notes, presentation metadata, and slide-count page_count; accepted pptm, potx, potm, ppsx, ppsm, sldx, and sldm aliases route here. |
| ODP | Yes | Yes | Yes | Yes | OpenDocument presentation extraction with slide/page-count page_count; accepted otp and flat .fodp inputs route here. |
| ODG | Yes | Yes | Yes | Yes | Bounded OpenDocument drawing package/template/flat XML extraction for .odg, .otg, and .fodg visible drawing text, with a pinned Apache-2.0 packaged ODG compatibility sample. |
| HTML | Yes | Yes | Yes | Yes | Visible text plus document-level title, author, producer, normalized created/modified metadata, and page_count = 1; page metadata keeps description and language, with bounded nested-table cell capture and a pinned CC0 compatibility sample. |
| EPUB | Yes | Yes | Yes | Yes | XHTML chapters extracted in spine order with capped OPF metadata, manifest, and spine parsing plus spine-item page_count, with pinned W3C EPUB compatibility sample. |
| RTF | Yes | Yes | Yes | Yes | Body text plus \\info metadata; reports one emitted section as page_count = 1, with pinned CC0 compatibility sample. |
| RTFD packages | No | No | Preflight inventory fixtures | No | Evaluation-only #170 testdata/rtfd-preflight fixtures record directory-package shape, TXT.rtf root presence, attachment/member inventory, nested .rtfd deferral, resource-fork sidecar deferral, member-name traversal examples, and resource caps. No public .rtfd detection, detected-unsupported no-fallback behavior, RTFD text extraction, attachment extraction, recursive traversal, image/OCR extraction, converter/native dependency, or extractor-backed support is claimed. |
| EML | Yes | Yes | Yes | Yes | RFC 5322 mail extraction with decoded headers, body selection, MIME traversal caps, and truncation for clipped internals; reports selected body sections as page_count, while ordinary attachment payloads are not extracted. |
| MSG | Yes | Yes | Yes | Yes | Outlook .msg extraction with message metadata and text bodies; reports emitted message body sections as page_count, while ordinary attachment payloads are not extracted. |
| Outlook PST/OST | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #181; generated .pst/.ost corpus fixtures cover Outlook store classification, guidance, and no MSG, EML, mbox/Maildir, ZIP/TAR/7z archive, OLE, HTML, RTF, or plain-text fallback, while testdata/outlook-pst-ost-preflight records synthetic NDB header/version sentinels, malformed/root/node/block-count cases, encrypted/protected markers, oversized store metadata, and fallback probes. No message extraction, attachment extraction, folder traversal, decryption/password handling, store repair, search-index interpretation, account/client access, network access, native/libpff dependency, or extractor-backed support is claimed. |
| IBM Notes/Domino NSF/NTF | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #185; generated .nsf/.ntf corpus fixtures cover Notes/Domino classification, guidance, and no MSG, EML, mbox/Maildir, Outlook PST/OST, Access, DBF/DIF/SYLK, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/notes-domino-preflight records synthetic header observations, malformed short headers, encrypted/ACL-looking markers, resource caps, fallback probes, and inventory-only .box/.ndl/.id adjacent artifacts. No message extraction, attachment extraction, folder/view/form traversal, ACL/security interpretation, replication handling, decryption, Domino API/native dependency, database repair, embedded-payload recursion, adjacent artifact public classification, or extractor-backed support is claimed. |
| Microsoft InfoPath XSN | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #186; generated .xsn corpus fixtures cover InfoPath package classification, guidance, and no OOXML/ODF, ZIP/TAR/7z archive, generic XML, HTML, RTF, or plain-text fallback, while testdata/infopath-preflight records XSN package structure, manifest.xsf-style metadata, InfoPath XML namespace samples, malformed package/XML cases, external data-connection markers, traversal-shaped member names, resource caps, and inventory-only .xsf/standalone XML candidates. No form rendering, schema validation, script execution, data-connection traversal, attachment extraction, SharePoint integration, native/converter dependency, standalone generic XML public classification, or extractor-backed support is claimed. |
| WARC/WACZ/Safari webarchive | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #187; generated .warc, .warc.gz, .wacz, and .webarchive corpus fixtures cover web-archive classification, guidance, and no MHTML, HTML, JSON/plain text, generic XML, RTF, ZIP/TAR/7z archive, gzip TAR, or Safari plist/plain fallback, while testdata/web-archive-preflight records WARC/WARC.GZ records, WACZ package/index shape, Safari webarchive markers, malformed cases, traversal member names, external-resource markers, resource caps, and inventory-only .arc/.arc.gz/.cdx/.cdxj/.har candidates. No WARC record extraction, WACZ package extraction, Safari webarchive parsing, HTTP payload extraction, page rendering, resource reconstruction, CDX index lookup, external-resource traversal, native/converter dependency, inventory-only extension public classification, or extractor-backed support is claimed. |
| FileMaker Pro | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #189; generated .fmp12 and .fp7 corpus fixtures cover FileMaker classification, guidance, and no Access, DBF/DIF/SYLK, SQLite, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/filemaker-preflight records synthetic header/version observations, malformed short/bad headers, encryption/container/external-container markers, resource caps, fallback probes, and inventory-only .fp5/.fp3/.fmp/.fmpur/.usr/.fmpsl candidates. No table extraction, layout/form/report rendering, script execution, calculation evaluation, container-field extraction, external-container traversal, account/security interpretation, decryption, repair, FileMaker native dependency, inventory-only extension public classification, or extractor-backed support is claimed. |
| SPSS | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #190; generated .sav, .zsav, and .por corpus fixtures cover SPSS classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while testdata/statistical-data-preflight records header/version, compression/encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only .sps/.spv/.spo evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| SAS | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #190; generated .sas7bdat and .xpt corpus fixtures cover SAS classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while testdata/statistical-data-preflight records header/version, encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only .sas/.sas7bcat/.sas7bvew/.sas7bndx evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| Stata | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #190; generated .dta corpus fixtures cover Stata classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while testdata/statistical-data-preflight records header/version, encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only .do/.ado/.smcl/.gph/.dct evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| Mailbox stores | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #156; generated .mbox corpus fixtures cover mailbox-store classification, guidance, and no EML, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/mailbox-preflight records bounded mbox boundary/order/resource-cap observations plus CLI-only direct Maildir-shaped directory detection through new/cur/tmp children. No mailbox text extraction, message enumeration, attachment extraction, recursive store handling, Maildir message traversal, account/client access, network access, native/converter dependency, or extractor-backed support is claimed. |
| Pages | Yes | Limited | Yes | Yes | Legacy XML Pages packages with root index.xml.gz or index.xml are extractor-backed for body paragraph text; modern IWA and directory packages remain unsupported subformats. |
| Numbers | Yes | Limited | Yes | Yes | Legacy XML Numbers packages with root index.xml.gz or index.xml are extractor-backed for table/cell text; modern IWA and directory packages remain unsupported subformats. |
| Keynote | Yes | Limited | Yes | Yes | Legacy XML Keynote packages with root index.apxl.gz or index.apxl are extractor-backed for slide paragraph/table text; modern IWA and directory packages remain unsupported subformats. |
| MHTML | Yes | Yes | Yes | Yes | MHTML web archive inputs (.mht, .mhtml) extract the selected root HTML or plain text part, with pinned Apache-2.0 compatibility evidence; linked resources are skipped rather than reconstructed or recursively extracted. |
| XLSB | Yes | Yes | Yes | Yes | Excel binary workbook packages (.xlsb) are supported through #96 with calamine-backed BIFF12 worksheet/cell extraction, ZIP/package preflight caps, panic containment, and pinned Apache-2.0 compatibility evidence; formulas are not executed, macros and embedded payloads are ignored, and external links are not followed. |
| XPS/OXPS | Yes | Yes | Yes | Yes | Bounded FixedDocument/FixedPage package extraction for .xps and .oxps; extracts Glyphs UnicodeString text and core package metadata, with pinned Apache-2.0 XPS/OpenXPS-content compatibility evidence and no rendering, OCR, font decoding, or glyph-index reconstruction. |
| VSDX | Yes | Yes | Yes | Yes | Bounded Visio drawing text extraction for .vsdx, with pinned MIT compatibility evidence and generated malformed relationship/XML/resource-cap fixtures; follows declared page relationships, extracts page Text elements, and fails closed on corrupt VSDX packages without rendering shapes, OCR, macro execution, or external links. |
| Legacy Visio binary | Yes | Limited | Yes | Yes | Bounded visible-text recovery from valid Visio CFB .vsd, .vss, and .vst inputs is extractor-backed with pinned Apache POI evidence. Real MS-VSD record validation, page/shape ordering, metadata extraction, rendering, macro/VBA parsing or execution, embedded-object extraction, external-link traversal, and native/converter dependency remain unsupported. |
| Visio XML/package variants | Yes | Limited | Yes | No | Bounded extractor-backed support for #153 covers legacy Visio XML .vdx, .vsx, and .vtx plus modern package .vsdm, .vssx, .vssm, .vstx, and .vstm fixtures generated in testdata/corpus/basic; extraction reads page Text elements, ignores macro/VBA parts, and fails closed on malformed XML, corrupt packages, missing relationships, oversized parts, external targets, and non-Visio packages without VSDX, ZIP/archive, XML/plain text, OLE, HTML, or legacy Visio binary fallback. testdata/visio-package-preflight records package spine and malformed-boundary evidence; metadata extraction, rendering, conversion, macro execution, embedded-payload recursion, external-link traversal, and broad compatibility are out of scope. |
| Publisher | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #126/#149; committed generated testdata/corpus/basic/minimal_publisher.pub fixture covers OLE-magic .pub/Publisher classification and unsupported no-fallback behavior, while testdata/publisher-preflight records synthetic OLE-header stream/version/text-candidate/metadata-candidate inventory, malformed-header, missing-stream, encrypted-marker, external data-source, embedded-object, and resource-cap parser-readiness evidence only. Publisher inputs do not route to DOC, XLS, PPT, MSG, generic OLE, archive, HTML, or plain-text extraction, and no Publisher text extraction, metadata extraction, rendering, conversion, mail-merge/data-source access, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, real Publisher compatibility, or Publisher stream/version validation is supported. |
| Adobe InDesign IDML | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #169; committed generated testdata/corpus/basic/minimal_indesign.idml, rtf_looking_indesign.idml, html_looking_indesign.idml, ole_looking_indesign.idml, and plain_looking_indesign.idml fixtures cover extension-gated .idml classification and unsupported no-fallback behavior, while testdata/indesign-preflight records IDML ZIP package shape, designmap.xml presence, story XML inventory, malformed XML, traversal-shaped member paths, external-resource markers, embedded asset deferral, and package resource caps. IDML inputs do not route to ZIP/archive, RTF, OLE, HTML, or plain-text extraction, and no IDML text extraction, metadata extraction, rendering/layout fidelity, image extraction/OCR, script/plugin execution, external-resource loading, recursive embedded-payload extraction, converter/native dependency, or extractor-backed support is claimed. |
| Adobe InDesign INDD | No | No | Preflight inventory fixtures | No | Evaluation-only #169 binary .indd feasibility lane in testdata/indesign-preflight records binary feasibility and ZIP-looking .indd fallback-risk inventory only. No public .indd detection, detected-unsupported no-fallback behavior, binary INDD parsing, rendering/layout fidelity, image extraction/OCR, script/plugin execution, external-resource loading, recursive embedded-payload extraction, converter/native dependency, or extractor-backed support is claimed. |
| Adobe FrameMaker MIF/FM | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #171; committed generated testdata/corpus/basic/minimal_framemaker.mif, rtf_looking_framemaker.mif, zip_looking_framemaker.mif, html_looking_framemaker.mif, ole_looking_framemaker.fm, and plain_looking_framemaker.fm fixtures cover extension-gated .mif/.fm classification and unsupported no-fallback behavior, while testdata/framemaker-preflight records MIF header/version markers, text-flow/paragraph inventory, escaped strings, blocked external-resource/include/cross-reference markers, binary payload rejection, fallback-looking signature probes, structure/file resource caps, and binary .fm feasibility rows. FrameMaker inputs do not route to RTF, ZIP/archive, OLE, HTML, or plain-text extraction, and no MIF text extraction, binary FM parsing, rendering/layout fidelity, imported graphic loading, external-resource traversal, converter/native dependency, or extractor-backed support is claimed. |
CHM Yes Limited Yes Yes Bounded CHM topic text extraction uses internal ITSF/ITSP/PMGL/PMGI/DataSpace readiness and LZX topic decoding, with pinned Apache Tika evidence. Rendering, scripts, external links, embedded payload recursion, broad compatibility, and generic archive/OLE/HTML/plain fallback remain unsupported.
OneNote Yes Limited Yes Yes Bounded visible-text recovery for real OneStore revision-store files is extractor-backed with pinned Apache Tika evidence. .onepkg member extraction, object graph/rich-text semantics, handwriting/OCR, rendering, embedded payload recursion, and generic OLE/ZIP/plain fallback remain unsupported.
DjVu Yes Limited Generated corpus fixtures + preflight fixtures No Bounded #151 extractor-backed slice for uncompressed TXTa text-layer bytes in .djvu/.djv IFF/FORM inputs; committed generated testdata/corpus/basic/minimal_djvu.djvu covers baseline extraction, while testdata/djvu-preflight and supporting dextract-djvu-preflight keep synthetic IFF/FORM container, page-form, directory, TXTa/TXTz, malformed-chunk, external-reference, and resource-cap coverage. DjVu inputs do not route to generic archive, OLE, HTML, or plain-text extraction; TXTz decompression, rendering, OCR, image extraction, metadata value extraction, external references, native dependencies, and converter shellouts remain unsupported.
PostScript/EPS Yes Limited Generated corpus fixtures No Bounded lexical extraction for #150 after the #127 boundary; committed generated testdata/corpus/basic/minimal_postscript.ps and testdata/corpus/basic/minimal_eps.eps fixtures cover DSC metadata plus literal/hex string text recovery only. PostScript/EPS extraction does not execute PostScript, render, OCR, process EPS previews, load external resources, recover outlined text, interpret fonts, or use Ghostscript.
WordPerfect Yes Limited Yes Yes Bounded WPC5/WPC6 .wpd document-area text extraction for #148, backed by generated testdata/corpus/basic/minimal_wordperfect.wpd, testdata/wordperfect-preflight prefix/malformed/adjacent fixtures, and pinned Apache-2.0 testdata/realworld/apache-tika/testWordPerfect.wpd. Extracts plain document-area text and WPC prefix metadata only; macros/templates (.wpt/.wcm), graphics/layout rendering, embedded objects, external resources, encrypted inputs, converter shellouts, and native dependencies remain unsupported.
WordPerfect adjacent/malformed boundaries Yes No Parser-readiness preflight fixtures No Adjacent .wpt/.wcm fixtures remain non-public inventory only, and malformed, encrypted, unsupported-version/product/type, payload-cap, ZIP-looking, and OLE-looking .wpd inputs fail closed without generic archive, OLE, HTML, or plain-text fallback.
WordStar Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #194; generated .ws, .ws7, and .wsd corpus fixtures plus fallback-looking probes cover WordStar classification, guidance, and no plain text, RTF, OLE, ZIP/TAR/7z archive, HTML, or XML fallback, while testdata/wordstar-preflight records control-byte/header observations, dot-command/print-control markers, malformed/resource-cap cases, fallback probes, and inventory-only .wsm/.wst/.wsb/.wsx candidates. No text extraction, metadata extraction, formatting interpretation, dot-command interpretation, codepage decoding, print-control interpretation, macro execution, embedded-object extraction, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed.
AbiWord Yes No Unsupported corpus + preflight fixtures No Detected unsupported close-out boundary for #157; committed generated fixtures cover extension-gated .abw classification and no-fallback behavior, while testdata/abiword-preflight and internal dextract-abiword-preflight record XML/gzip shape inventory, synthetic 1.x/2.x/3.x version/provenance candidates, structure-only section/paragraph/heading/list/table/cell counts, non-public .zabw/.abw.gz compressed candidates, malformed XML/entity/missing-version/unsupported-version/embedded-object/external-link/depth/element/attribute/text/metadata-key-limit cases, missing/corrupt gzip, gzip resource-cap boundaries, and metadata-key-name inventory only. AbiWord inputs do not route to generic XML, ZIP, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, extractor-backed XML parser readiness, real-world producer compatibility, compressed-variant public support, embedded-object recursion, external-link traversal, or converter/native dependency support is claimed.
Microsoft Works Yes No Unsupported corpus + preflight feasibility fixtures No Detected unsupported boundary for #155; committed generated fixtures cover extension-gated .wps/.wks/.wdb/.xlr classification and no-fallback behavior, while testdata/works-preflight records synthetic Works-family sentinels, malformed/version/record cases, resource caps, encryption/embedded/external markers, fallback probes, missing legal compatibility samples, version-family unknowns, and parser-readiness blockers. Microsoft Works inputs do not route to RTF, XLS, XLSX, generic archive, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, real stream/version validation, database/spreadsheet parsing, embedded-object recursion, external-link traversal, decryption, native/converter dependency, or parser-readiness support is claimed.
Microsoft Access Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #182; generated .mdb/.accdb corpus fixtures cover Access classification, guidance, and no XLS/XLSX/ODS, DBF/DIF/SYLK, Works, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/access-preflight records synthetic Jet/ACE sentinels, malformed/version/object-count cases, password/encryption and linked-table markers, and inventory-only .mde/.accde candidates. No table extraction, query execution, form/report rendering, macro/VBA execution, linked-table traversal, external data-source access, embedded-object recursion, decryption, repair, native ACE/Jet dependency, inventory-only extension public classification, or extractor-backed support is claimed.
SQLite database Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #188; generated .sqlite/.sqlite3 corpus fixtures cover SQLite classification, guidance, and no XLS/XLSX/ODS, Access, DBF/DIF/SYLK, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/sqlite-preflight records SQLite header/page-size/version observations, malformed short/bad headers, resource-cap cases, fallback probes, and inventory-only .db, .db3, .sqlite2, .sdb, WAL, and SHM sidecar candidates. No SQL execution, schema/table extraction, row/cell extraction, FTS/index parsing, WAL replay, extension loading, encryption/decryption, repair, native SQLite dependency, sidecar public classification, or extractor-backed support is claimed.
DBF/dBASE Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #183; generated .dbf corpus fixtures cover DBF/dBASE classification, guidance, and no XLS/XLSX/ODS, Access, DIF/SYLK, Works, Quattro Pro, Lotus 1-2-3, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/dbf-preflight records deterministic DBF header/version observations, malformed/header/resource-cap cases, fallback probes, and inventory-only .dbt/.fpt/.ndx/.mdx/.cdx sidecar candidates. No table/row/cell extraction, memo parsing, codepage decoding claims, deleted-record recovery, index parsing, shapefile/geospatial support, formula/macro execution, native/converter dependency, sidecar public classification, or extractor-backed DBF parser support is claimed.
DIF Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #184; generated .dif corpus fixtures cover DIF classification, guidance, and no CSV/TSV, plain-text, HTML/XML, RTF, archive, OLE, XLS/XLSX/ODS, Access, DBF, Works, Quattro, Lotus, or SYLK fallback, while shared testdata/dif-sylk-preflight records deterministic text-header observations, malformed/resource-cap cases, fallback probes, and inventory-only .sylk split evidence. No cell/table extraction, formula execution or evaluation, codepage/encoding normalization, external-link traversal, native/converter dependency, magic-only detection, or extractor-backed support is claimed.
SYLK Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #184; generated public .slk corpus fixtures cover SYLK classification, guidance, and no CSV/TSV, plain-text, HTML/XML, RTF, archive, OLE, XLS/XLSX/ODS, Access, DBF, Works, Quattro, Lotus, or DIF fallback, while shared testdata/dif-sylk-preflight keeps .sylk inventory-only. No cell/table extraction, formula execution or evaluation, codepage/encoding normalization, external-link traversal, native/converter dependency, magic-only detection, .sylk public classification, or extractor-backed support is claimed.
Microsoft Project Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #180; generated .mpp/.mpt/.mpx corpus fixtures cover Project classification, guidance, and no DOC/XLS/PPT/OOXML/ODF, Visio, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/project-preflight records synthetic CFB/MPX sentinels, malformed/version/task/resource/record-count cases, password/VBA/external-link/embedded-object markers, and fallback probes. No task extraction, resource extraction, schedule calculation, formula evaluation, Gantt/timeline rendering, image/OCR extraction, macro/VBA execution, external-link traversal, embedded-object recursion, decryption, repair, native/converter dependency, or extractor-backed support is claimed.
MOBI/AZW Yes Limited Supported corpus + preflight fixtures No Bounded #145 support extracts unencrypted UTF-8 .mobi, .azw, and .azw3 text records when the input validates as PDB-style BOOK/MOBI and uses uncompressed or classic PalmDOC compression. Generated supported corpus fixtures cover .mobi/.azw/.azw3 extraction, while legacy minimal corpus fixtures and testdata/mobi-preflight keep unsupported-subset/no-fallback evidence for empty/minimal headers, numeric-only EXTH inventory, encoding-marker classification, unsupported Windows-1252/unknown encodings, HUFF/CDIC, encryption, malformed PalmDOC bytes, and output limits. No DRM/decryption, Windows-1252 decoding, metadata extraction, EXTH value decoding, HTML/XHTML conversion, rendering, resource extraction, embedded-payload recursion, external-link traversal, AZW4/Topaz/KFX support, or generic .prc/.pdb classification is supported.
Kindle KFX/Topaz/AZW4 Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #176; generated .kfx/.tpz/.azw1/.azw4 corpus fixtures cover Kindle KFX/Topaz/AZW4 classification, guidance, and no PDF, EPUB/ZIP, archive, MOBI/AZW, or plain-text fallback, while testdata/kindle-ebook-boundary keeps generic .prc/.pdb, .azw6, .azw8, and .azw9 inventory-only. No Kindle KFX/Topaz/AZW4 text extraction, DRM/decryption, metadata extraction, EXTH value decoding, rendering, resource extraction, embedded-payload recursion, external-link traversal, converter/native dependency, or extractor-backed support is claimed.
QuarkXPress Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #178; generated .qxd/.qxp corpus fixtures cover QuarkXPress classification, guidance, and no ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/quarkxpress-preflight records synthetic page-layout sentinels, malformed/header/object-count cases, external-resource markers, and inventory-only .qxt/.qpt/.qxb/.qxl/.xtg candidates. No layout rendering, text extraction, metadata extraction, font interpretation, image/OCR extraction, external-resource loading, color management, converter/native dependency, inventory-only extension public classification, or extractor-backed support is claimed.
Scribus SLA Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #192; generated .sla corpus fixtures cover extension-gated Scribus classification, guidance, and no generic XML, gzip/archive, HTML, RTF, InDesign, QuarkXPress, Publisher, FrameMaker, or plain-text fallback, while testdata/scribus-preflight records Scribus XML root/version observations, .sla.gz inventory-only candidates, malformed XML/gzip cases, external image/font markers, traversal/script markers, resource caps, and fallback probes. No text extraction, metadata extraction, layout rendering, font/image loading, color management, script execution, external-resource traversal, gzip public support, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed.
Adobe PageMaker Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #193; generated .pmd/.p65/.pm6 corpus fixtures cover PageMaker classification, guidance, and no InDesign, QuarkXPress, Scribus SLA, Publisher, FrameMaker, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/pagemaker-preflight records synthetic layout sentinels, malformed header/version cases, external image/font markers, traversal markers, resource caps, fallback probes, and inventory-only .pm5/.pm4/.pmt/.t65 candidates. No text extraction, metadata extraction, layout rendering, font interpretation, image/OCR extraction, external-resource loading, color management, script execution, converter/native dependency, inventory-only public classification, or extractor-backed support is claimed.
Quattro Pro Yes No Unsupported corpus + feasibility inventory No Detected unsupported boundary for #159; committed generated fixtures cover extension-gated .qpw/.wb1/.wb2/.wb3 classification and no-fallback behavior, while testdata/quattro-preflight records fixture inventory, generated .wb1/.wb2 PRONOM BOF signature/version markers, malformed signature boundaries, missing legal compatibility samples, and remaining parser-readiness blockers only. Quattro Pro inputs do not route to XLS, XLSX, ODS, generic archive, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, formula execution, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, workbook/sheet/cell validation, public support registration, or extractor-backed parser support is claimed.
Lotus 1-2-3 Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #164; generated .wk1/.wk3/.wk4/.123 corpus fixtures cover Lotus 1-2-3 classification, guidance, and no XLS/XLSX/ODS, DBF/DIF/SYLK, Works, Quattro Pro, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/lotus-preflight records generated PRONOM BOF signature/version fixtures, malformed signature boundaries, missing legal compatibility samples, and parser-readiness blockers only. No text extraction, metadata extraction, workbook/sheet/cell parsing, formula execution, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, public support registration, or extractor-backed Lotus parser support is claimed; .wks remains owned by the Microsoft Works boundary.
HWPX Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #162; generated testdata/corpus/basic/*hwpx.hwpx fixtures cover extension-gated .hwpx classification, guidance, and no ZIP/TAR/7z, OLE, HTML/XML, RTF, plain-text, or legacy HWP fallback, while testdata/hwp-hwpx-preflight records HWPX package-spine, malformed package/XML/resource-cap, and external-resource inventory. No text/metadata extraction, package-member extraction, OWPML/body parsing, rendering/layout, decryption, script execution, converter/native dependency, or extractor-backed support is claimed.
AutoCAD DWG/DXF Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #175; generated .dwg/.dxf corpus fixtures cover extension-gated AutoCAD classification, guidance, and no PDF/PostScript, ZIP/archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/autocad-preflight records ASCII DXF section/table/entity and TEXT/MTEXT marker inventory, DWG AC10xx sentinels, malformed/header/resource-cap cases, and fallback-signature risk only. No CAD text extraction, metadata extraction, geometry interpretation, rendering/OCR, external-reference traversal, converter/native dependency, or extractor-backed support is claimed.
DWF/DWFx Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #177; generated .dwf/.dwfx corpus fixtures cover DWF/DWFx classification, guidance, and no PDF/PostScript, XPS/OXPS, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, image/OCR, or Visio fallback, while testdata/dwf-dwfx-preflight records synthetic package/signature fixtures, malformed package-shape cases, traversal member names, external-resource markers, embedded asset deferral, and package entry/byte caps. No text extraction, metadata extraction, CAD geometry extraction, fixed-layout rendering, image/OCR extraction, embedded-resource traversal, external-resource loading, native/converter dependency, or extractor-backed support is claimed.
HP PCL/PCLXL Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #191; generated .pcl corpus fixtures cover extension-gated HP PCL/PCLXL classification, guidance, and no PDF/PostScript, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or image/OCR fallback, while testdata/pcl-preflight records PJL, PCL escape-sequence, PCLXL sentinel, malformed/resource-cap, embedded-resource, fallback, and inventory-only .prn/.pxl/.pclxl/.spl evidence. No print-stream interpretation, text extraction, font interpretation, rendering/OCR, PJL execution, embedded-resource extraction, printer emulation, Ghostscript/native dependency, inventory-only extension public classification, or extractor-backed support is claimed.
Legacy HWP No No Evaluation preflight inventory No Evaluation-only #162 lane; testdata/hwp-hwpx-preflight records generated .hwp FileHeader sentinels that are not valid CFB documents. No public .hwp detection, detected-unsupported no-fallback behavior, valid legacy CFB parser support, text/metadata extraction, decryption, converter/native dependency, or extractor-backed support is claimed.
StarOffice/OpenOffice legacy Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #166; generated .sxw/.stw/.sxc/.stc/.sxi/.sti/.sxd/.std/.sxm/.sxg/.sdw/.sdc/.sdd/.sda corpus fixtures cover classification, guidance, and no ODT/ODS/ODP/ODG, ZIP/TAR/7z, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/staroffice-openoffice-preflight records package-shape fixtures, legacy binary sentinels, malformed ZIP/XML/resource boundaries, and parser-readiness blockers. No OpenDocument aliasing, text/cell/metadata extraction, macro/script or formula execution, converter/native dependency support, or extractor-backed support is claimed.
FictionBook FB2 Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #163; generated .fb2 and .fb2.zip corpus fixtures cover FictionBook FB2 classification, guidance, and no PDF, EPUB, MOBI/AZW, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/fb2-preflight records XML/package observations, malformed XML/resource caps, and ZIP member/traversal boundaries. No text extraction, metadata extraction, resource extraction, embedded-payload recursion, external-link traversal, converter/native dependency, public support registration, or extractor-backed support is claimed.
OpenDocument ancillary variants Yes No Unsupported corpus + boundary inventory No Detected unsupported boundary for #165; generated .odm/.odb/.odf/.odc corpus fixtures cover OpenDocument ancillary classification, guidance, and no ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or neighboring ODT/ODS/ODP/ODG fallback, while testdata/opendocument-ancillary-boundary records variant separation and remaining parser questions. No master-document traversal, database/formula execution, chart rendering, macro/script execution, embedded-payload recursion, external-link traversal, decryption, native/converter dependency, or extractor-backed support is claimed.
AppleWorks/ClarisWorks Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #167; generated testdata/corpus/basic/*appleworks.cwk fixtures cover extension-gated .cwk classification, guidance, and no iWork/Pages, ZIP/TAR/7z, OLE, HTML, XML, RTF, or plain-text fallback, while testdata/appleworks-preflight records BOBO shared-signature inventory, byte caps, MacBinary/AppleDouble resource-fork deferral, and PRONOM observations. No text/metadata/cell extraction, graphics/formula parsing, converter/native dependency, or extractor-backed support is claimed.
Comic-book archives Yes No Unsupported corpus + preflight fixtures No Detected unsupported boundary for #168; generated testdata/corpus/basic/*comic* and minimal_cb*.cb* fixtures cover extension-gated .cbz, .cbt, .cb7, and .cbr classification, guidance, and no generic ZIP/TAR/7z archive, PDF, EPUB, OLE, HTML/XML, RTF, plain-text, image/OCR, or RAR fallback, while testdata/comic-archive-preflight records container/signature observations, image-only entries, ComicInfo.xml inventory, text-sidecar boundaries, and traversal/resource-cap boundaries. No archive listing metadata, ComicInfo.xml metadata extraction, RAR support, image extraction/OCR, sidecar text extraction, recursive dispatch, converter/native dependency, or extractor-backed support is claimed.
ZIP Yes Yes Yes Yes One page-like section per entry, with pinned MIT archive compatibility sample; traversal paths are rejected. Archive entry bytes are decoded as UTF-8-lossy text and are not recursively dispatched to other extractors. Generic ZIP output therefore does not amount to modern iWork/Numbers/Keynote support, nor to support for rejected OOXML/ODF and package boundaries.
TAR Yes Yes Yes Yes Tar and gzip-compressed tar (.tar.gz, .tgz) via the same archive extractor, with pinned BSD-3-Clause GNU tar compatibility sample. Archive entry bytes are decoded as UTF-8-lossy text and are not recursively dispatched to other extractors.
7Z Yes Limited Yes Yes Unencrypted non-solid 7z archives with unencoded headers, one coder per folder, no filters, and COPY, DEFLATE, BZIP2, or bounded-dictionary LZMA2 streams, with pinned Apache-2.0 COPY compatibility sample; LZMA, high-dictionary LZMA2, encrypted, encoded-header, solid, filtered, and coder-chain archives are rejected.
PDF Yes No Unsupported fixture No Out of scope; use dpdf for PDF parsing and OCR.
OCR / images No No No No Out of scope for this repo.
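
Many rows above describe a "detected unsupported" boundary with no fallback: an extension-gated input is classified as recognized-but-unsupported and is never rerouted to a generic ZIP, OLE, HTML, or plain-text extractor, even when its bytes look like one. The following is a minimal sketch of that ordering only. The `Classification` enum, the `classify` function, and the three extensions chosen are hypothetical illustrations, not the facade's actual API or detection logic.

```rust
/// Hypothetical outcome of classifying an input before any extractor runs.
#[derive(Debug, PartialEq, Eq)]
enum Classification {
    /// A built-in extractor would handle this input.
    Supported(&'static str),
    /// The format is recognized, but deliberately has no extractor.
    DetectedUnsupported(&'static str),
    /// Nothing matched.
    Unknown,
}

/// Hypothetical classifier: extension gates run before any magic-byte probe,
/// so a ZIP-looking `.idml` file is reported as unsupported instead of being
/// handed to the generic ZIP extractor.
fn classify(file_name: &str, head: &[u8]) -> Classification {
    let ext = file_name
        .rsplit_once('.')
        .map(|(_, ext)| ext.to_ascii_lowercase());

    // Gates return immediately; no later probe may reinterpret these inputs.
    match ext.as_deref() {
        Some("idml") => return Classification::DetectedUnsupported("Adobe InDesign IDML"),
        Some("hwpx") => return Classification::DetectedUnsupported("HWPX"),
        Some("cbz") => return Classification::DetectedUnsupported("Comic-book archive"),
        _ => {}
    }

    // Only ungated inputs reach the generic container probe (ZIP local-file magic).
    if head.starts_with(b"PK\x03\x04") {
        return Classification::Supported("ZIP");
    }
    Classification::Unknown
}
```

With this ordering, `classify("layout.idml", b"PK\x03\x04")` yields `DetectedUnsupported`, while the same bytes under the name `bundle.zip` yield `Supported("ZIP")`.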

See docs/archive-policy.md for the archive recursion and resource-limit policy.
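
The ZIP and TAR rows describe three behaviors: traversal paths are rejected, each entry becomes one page-like section, and entry bytes are decoded as UTF-8-lossy text without recursive dispatch. A minimal sketch of those rules follows. It assumes entries have already been read into memory and that one traversal-shaped name rejects the whole archive; `Section`, `ArchiveError`, `is_traversal_path`, and `sections_from_entries` are hypothetical names, not the types used by dextract-archive.

```rust
/// Hypothetical page-like section produced for one archive entry.
#[derive(Debug, PartialEq, Eq)]
struct Section {
    title: String,
    text: String,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum ArchiveError {
    TraversalPath(String),
}

/// Returns true when a member name is absolute or contains a `..` segment.
/// Backslashes are treated as separators so Windows-style names are covered.
fn is_traversal_path(name: &str) -> bool {
    let normalized = name.replace('\\', "/");
    normalized.starts_with('/') || normalized.split('/').any(|segment| segment == "..")
}

/// Builds one section per entry. Bytes are decoded lossily (invalid UTF-8
/// becomes U+FFFD) and are never handed to another extractor.
fn sections_from_entries(entries: &[(&str, &[u8])]) -> Result<Vec<Section>, ArchiveError> {
    let mut sections = Vec::with_capacity(entries.len());
    for (name, bytes) in entries {
        // Assumption: a single traversal-shaped name fails the whole archive.
        if is_traversal_path(name) {
            return Err(ArchiveError::TraversalPath((*name).to_string()));
        }
        sections.push(Section {
            title: (*name).to_string(),
            text: String::from_utf8_lossy(bytes).into_owned(),
        });
    }
    Ok(sections)
}
```

Lossy decoding keeps the output deterministic for binary members: a `.docx` stored inside a ZIP yields mostly replacement characters rather than extracted document text, which is consistent with the no-recursion rule.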

Fixture-backed means covered by committed baseline fixtures or explicitly named unsupported-boundary fixtures under testdata/corpus/ or a format-specific testdata/ directory. Compatibility-backed means covered by pinned public samples under testdata/realworld/. Absence of a compatibility fixture does not mean a format is unsupported; it means the release evidence is currently limited to unit tests, integration tests, and baseline fixtures.

Repository Layout

dExtract is a workspace with 32 publishable crates plus 5 unpublished internal preflight crates. The public build uses the members declared in the root Cargo.toml, including bounded extractors for legacy .ppt, legacy XML iWork, CHM, OneNote visible-text recovery, legacy Visio binary visible-text recovery, PostScript/EPS, DjVu TXTa, MOBI/AZW/AZW3, and WordPerfect WPC text extraction.

Path Purpose
dextract-types Shared traits and data types for extractors and outputs.
dextract-ole Shared bounded OLE preflight validation for legacy Office extractors.
dextract-zip-package Shared hardened ZIP package reader for OOXML, OpenDocument, EPUB, and XPS/OXPS extractors.
dextract Facade crate that registers the built-in extractors.
dextract-cli dextract command-line entrypoint.
dextract-doc Legacy DOC extractor.
dextract-docx DOCX extractor.
dextract-odt ODT extractor.
dextract-xls Legacy XLS extractor.
dextract-xlsb XLSB extractor.
dextract-xlsx XLSX extractor.
dextract-ods ODS extractor.
dextract-ppt Legacy PPT extractor with bounded mechanical text and metadata support; no rendering, macros, media extraction, embedded object recursion, or decryption.
dextract-pptx PPTX extractor.
dextract-odp ODP extractor.
dextract-odg ODG/OTG package and flat FODG drawing extractor.
dextract-postscript PostScript/EPS bounded lexical extractor; no execution, rendering, OCR, previews, external resources, or Ghostscript.
dextract-iwork Legacy XML Pages/Numbers/Keynote package extractor.
dextract-wordperfect Bounded WordPerfect WPC5/WPC6 .wpd document-area extractor; no macros/templates, rendering, embedded-object recursion, external-resource traversal, converter shellout, or native dependency.
dextract-iwork-preflight Unpublished iWork input preflight primitives; not extractor-backed support.
dextract-onenote-preflight Unpublished OneNote input, .onepkg package-inventory preflight primitives, and fail-closed #101 text/object-page/object-graph/revision-table-sequence/page-rich-text-object-reference/visible-text-object readiness blocker checks; not extractor-backed support.
dextract-mobi-preflight MOBI/AZW PDB/MOBI envelope preflight primitives used by dextract-mobi, including text-encoding marker classification and bounded uncompressed/classic PalmDOC text-record materialization.
dextract-chm-preflight CHM ITSF envelope, ITSP/PMGL header validation, PMGI/DataSpace readiness, LZX topic decoding support, bounded .hhc TOC ordering, and fail-closed malformed reset-table checks used by the limited CHM facade extractor.
dextract-djvu-preflight DjVu IFF/FORM container parser-readiness primitives and bounded TXTa byte materialization for .djvu/.djv; used by the limited facade extractor and preflight fixtures.
dextract-abiword-preflight Unpublished AbiWord XML/gzip shape inventory, synthetic version-family checks, extraction-risk marker rejection, and structure-only element-count primitives for .abw plus non-public .zabw/.abw.gz candidates; not extractor-backed support.
dextract-wordperfect-preflight Unpublished WordPerfect WPC5/WPC6 header and synthetic payload inventory primitives for .wpd plus non-public .wpt/.wcm candidates; not extractor-backed support.
dextract-visio-binary-preflight Unpublished legacy Visio binary CFB stream-inventory, non-synthetic VisioDocument real-parser/version-policy compatibility/record-map/table/text-run/page-shape/metadata gating, and repo-owned synthetic record/text/page-shape ordering plus record/version/stream consistency parser-readiness primitives for .vsd/.vss/.vst; not real Visio record decoding or extractor-backed support.
dextract-xps XPS/OXPS fixed-layout package extractor.
dextract-vsdx VSDX plus bounded Visio XML/package variant extractor.
dextract-html HTML extractor.
dextract-csv CSV extractor.
dextract-epub EPUB extractor.
dextract-mobi Bounded MOBI/AZW/AZW3 UTF-8 text-record extractor for unencrypted uncompressed/classic PalmDOC inputs.
dextract-rtf RTF extractor.
dextract-eml RFC 5322 email and MHTML extractor.
dextract-msg Outlook MSG extractor.
dextract-archive ZIP, TAR, and limited 7z extractor.
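
The dextract-mobi and dextract-mobi-preflight rows mention bounded classic PalmDOC text-record materialization, with malformed bytes and output limits failing closed. As an illustration of what such a bounded decoder can look like, here is a minimal sketch of the commonly documented PalmDOC LZ77 scheme. The function name, error type, and `max_output` parameter are assumptions made for this example, not the crate's API, and the real limits are not stated in this README.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum PalmDocError {
    Truncated,
    BadBackReference,
    OutputLimitExceeded,
}

/// Decompresses one PalmDOC record, failing closed on malformed input and
/// once the output grows past `max_output` bytes.
fn palmdoc_decompress(input: &[u8], max_output: usize) -> Result<Vec<u8>, PalmDocError> {
    let mut out: Vec<u8> = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        i += 1;
        match byte {
            // 0x01..=0x08: copy that many following bytes verbatim.
            0x01..=0x08 => {
                let end = i + byte as usize;
                if end > input.len() {
                    return Err(PalmDocError::Truncated);
                }
                out.extend_from_slice(&input[i..end]);
                i = end;
            }
            // 0x00 and 0x09..=0x7f: a single literal byte.
            0x00 | 0x09..=0x7f => out.push(byte),
            // 0x80..=0xbf: two-byte back-reference holding an 11-bit
            // distance and a 3-bit length (stored as length - 3).
            0x80..=0xbf => {
                let next = *input.get(i).ok_or(PalmDocError::Truncated)?;
                i += 1;
                let pair = (((byte as usize) << 8) | next as usize) & 0x3fff;
                let distance = pair >> 3;
                let length = (pair & 0x07) + 3;
                if distance == 0 || distance > out.len() {
                    return Err(PalmDocError::BadBackReference);
                }
                // Copy byte by byte because source and destination may overlap.
                for _ in 0..length {
                    let b = out[out.len() - distance];
                    out.push(b);
                }
            }
            // 0xc0..=0xff: a space followed by the byte with its high bit cleared.
            0xc0..=0xff => {
                out.push(b' ');
                out.push(byte ^ 0x80);
            }
        }
        // Checked after every token, so the overshoot is at most a few bytes.
        if out.len() > max_output {
            return Err(PalmDocError::OutputLimitExceeded);
        }
    }
    Ok(out)
}
```

For example, the bytes `a b c 0x80 0x18` decode to `abcabc`: the trailing pair encodes distance 3 and length 3. A back-reference that points before the start of the output is rejected rather than clamped.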

Reference Docs

  • Cargo.toml in the repository root for the workspace manifest and public repo URL.
  • dextract/src/lib.rs in the repository for the facade API and built-in extractor registration.
  • dextract-cli/src/main.rs in the repository for CLI behavior and subcommands.
  • RELEASING.md, ROADMAP.md, testdata/README.md, and scripts/fetch_realworld_corpus.py in the repository for release and corpus guidance.

Development

Run the workspace checks from the repo root:

cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features --locked -- -D warnings
cargo test --workspace --all-targets --locked

Useful release-time checks:

bash scripts/check-release-tooling.sh
bash scripts/check-github-workflows.sh
cargo doc --workspace --no-deps --locked
bash scripts/check-package-list.sh
bash scripts/check-supply-chain.sh
bash scripts/check-api-compat.sh --required
python3 scripts/check_test_corpus_drift.py
python3 scripts/validate_format_gap_issue_drafts.py --quiet
cargo run -p dextract-cli -- formats
cargo run -p dextract-cli -- formats --json
cargo run -p dextract-cli -- formats --all --json
cargo package --list -p dextract
cargo publish --dry-run -p dextract-types

For reproducible local extraction performance checks against committed fixtures, use the CLI benchmark harness:

short_sha="$(git rev-parse --short=7 HEAD)"
python3 scripts/bench_extractors.py \
  --warmups 2 \
  --iterations 5 \
  --fixture-set all \
  --mode both \
  --output "target/perf/refreshed-performance-baseline-${short_sha}.json"

See docs/performance.md and docs/performance-baseline.md for the benchmark contract and current same-machine baseline. For ad hoc library-only timing against representative files, cargo run -p dextract --example simple_bench remains available.

The committed fixture corpus used for facade-level support checks lives under testdata/corpus/. Pinned public compatibility samples live under testdata/realworld/.

License

dExtract is released under the Apache License 2.0.

Copyright 2026 Dropbox
