Pure-Rust text extraction for common non-PDF local file formats and archives.
- Author: Andrew Yates andrewyates.name@gmail.com
- Version: 0.3.0
- License: Apache-2.0
dExtract mechanically extracts structured text from common non-PDF local files without shelling out to system tools or external services. The facade currently registers built-in extractors for plain text, CSV/TSV, legacy Word/Excel/PowerPoint (doc, xls, ppt), legacy iWork XML packages (pages, numbers, key), Excel binary workbooks (xlsb), Office Open XML (docx, xlsx, pptx, plus accepted macro/template/slideshow aliases), OpenDocument (odt/fodt, ods/fods, odp/fodp, odg/otg/fodg, plus accepted template aliases), XPS/OXPS (xps, oxps), Visio drawing/XML/package/binary variants (vsdx, vdx, vsx, vtx, vsdm, vssx, vssm, vstx, vstm, vsd, vss, vst), CHM topic text (chm), bounded OneNote visible-text recovery (one), PostScript/EPS lexical text recovery (ps, eps), bounded DjVu TXTa text-layer extraction (djvu, djv), WordPerfect WPC document-area text (wpd), HTML/MHTML, EPUB, bounded MOBI/AZW/AZW3 UTF-8 text records, RTF, RFC 5322 / Outlook mail (eml, msg), and ZIP/TAR/limited 7z inputs, including gzip-compressed tar (.tar.gz, .tgz). Several extractors are deliberately limited: modern iWork IWA and directory packages, OneNote package/rich-object parsing, CHM rendering/scripts/resources, legacy Visio record/page semantics, macros, embedded payload recursion, OCR, and layout rendering remain out of scope. 
TeX/LaTeX sources, mailbox stores such as mbox/Maildir, WordStar documents, AbiWord documents, Microsoft Works files, Microsoft Access databases, SQLite database files, Microsoft Project schedule/interchange files, Outlook PST/OST mailbox stores, IBM Notes/Domino NSF/NTF databases, Microsoft InfoPath XSN form packages, OpenDocument ancillary variants, SPSS/SAS/Stata statistical data files, WARC/WACZ/Safari webarchive captures, FileMaker Pro databases, Kindle KFX/Topaz/AZW4 files, QuarkXPress documents, Scribus SLA page-layout files, Adobe PageMaker layout files, StarOffice/OpenOffice legacy files, Quattro Pro spreadsheets, FrameMaker MIF/FM documents, InDesign IDML packages, DWF/DWFx fixed-layout drawing packages, HP PCL/PCLXL print streams, and OLE-gated Publisher .pub inputs remain unsupported boundaries. PDF parsing and OCR belong in dpdf.
Until the crates are published to crates.io, install the CLI from the public Git repository:
cargo install --git https://github.com/andrewdyates/dextract dextract-cli

Use the library from the public Git branch:
[dependencies]
dextract = { git = "https://github.com/andrewdyates/dextract" }

From a checkout of this repo, install the CLI without publishing:
cargo install --path dextract-cli

Or depend on the current public Git branch when tracking unreleased changes:
[dependencies]
dextract = { git = "https://github.com/andrewdyates/dextract" }

dExtract's workspace MSRV is Rust 1.88, declared in the root Cargo.toml.
This follows the current dependency graph and may increase when dependencies
require a newer compiler.
The repository carries a local cargo-deny policy in deny.toml for RustSec
advisories, yanked crates, license allow-listing, duplicate-version visibility,
wildcard dependency bans, and crate source restrictions. Run it from a checkout
with:
bash scripts/check-supply-chain.sh

Install the pinned CI version locally with
cargo +stable install cargo-deny --version 0.19.4 --locked if the script
reports that it is missing or mismatched. Override the pin only when updating
CI with CARGO_DENY_VERSION=<version>. Use
bash scripts/check-supply-chain.sh --offline only after Cargo indexes and the
RustSec advisory database are cached.
Public Rust API compatibility is checked with cargo-semver-checks for
workspace library crates only:
bash scripts/check-api-compat.sh

Install the pinned CI version with
cargo install cargo-semver-checks --version 0.42.0 --locked if the script
reports that it is missing or mismatched. Override the pin only when updating
CI with CARGO_SEMVER_CHECKS_VERSION=<version>. The check is dev-only and
uses crates.io as the baseline source, so it skips crates that have not been
published yet; that is expected before the first crates.io release.
GitHub Actions workflow validation is available as a lightweight local check:
bash scripts/check-github-workflows.sh

Install the pinned CI version of actionlint with
go install github.com/rhysd/actionlint/cmd/actionlint@v1.7.7 for full workflow
syntax and expression validation. Without actionlint, the script still runs the
repository-specific structural checks that keep release-critical CI gates wired.
CLI usage after cargo install --path dextract-cli:
dextract legacy.doc
dextract mail.eml
dextract report.docx
dextract --json workbook.ods
dextract --json report.xlsx
dextract --output extracted.txt notes.rtf
dextract --process-isolation legacy.doc
dextract --max-input-bytes 10485760 report.docx
dextract batch ./documents --budget 100000
dextract batch ./documents --max-input-bytes 10485760
dextract batch ./documents --recursive --max-files 5000
dextract batch ./documents --json
dextract formats
dextract formats --all --json
dextract completions zsh > _dextract
dextract manpage > dextract.1

From a fresh checkout, run the same commands with cargo run -p dextract-cli -- .... The CLI can extract one or more files, write sibling *.extracted.txt files for a directory, print the extractor-backed format matrix, and print the broader detected-format matrix with formats --all.
CLI output semantics:
- Plain extraction writes extracted page text to stdout, or to --output when provided for a single input. Multiple plain-text inputs written to stdout are separated by one newline; multiple pages inside one input are separated by a form-feed byte.
- --json writes one JSON object for one input file and one JSON array for multiple input files. That object-vs-array shape is the documented CLI contract.
- JSON uses canonical lowercase format names such as docx, odt, xps, csv, eml, mhtml, and text.
- --process-isolation runs each root-command extraction in a child worker process with the same output and exit-code contract. It can also be enabled with DEXTRACT_PROCESS_ISOLATION=1. This is not a sandbox: the parent still reads each input file into memory, the worker has a 30-second timeout and 64 MiB JSON response cap, and callers still need OS/container limits for hostile inputs.
- --format accepts extractor-backed formats plus aliases such as html/htm, mhtml/mht, text/txt/plain, legacy ppt, pages, numbers, key/keynote, onenote/one/onetoc2/onepkg, chm, visio-binary/vsd/vss/vst, and accepted OOXML/OpenDocument family variants (docm, dotx, dotm, xlsb, xlsm, xltx, xltm, pptm, potx, potm, ppsx, ppsm, sldx, sldm, ott, fodt, ots, fods, otp, fodp, odg, otg, fodg, xps, oxps, vsdx, visio-package, vsdm, vssx, vssm, vstx, vstm, visio-xml, vdx, vsx, vtx, postscript, ps, eps, djvu, djv, wordperfect, and wpd). It does not accept detected-but-unsupported formats such as wordstar, ws, ws1, ws2, ws3, ws4, ws5, ws6, ws7, ws8, wsd, wsm, wst, wsb, wsx, publisher, pub, access, mdb, accdb, mde, accde, sqlite, sqlite3, sqlite-database, db, db3, sqlite2, sdb, sqlite-wal, sqlite-shm, sqlite3-wal, sqlite3-shm, db-wal, db-shm, opendocument-ancillary, odm, odb, odf, odc, dbf, dbase, dif, sylk, slk, project, mpp, mpt, mpx, outlook-pst-ost, pst, ost, ibm-notes-domino, notes, domino, nsf, ntf, infopath-xsn, infopath, xsn, xsf, warc-wacz-webarchive, web-archive, warc, warc-gz, warc.gz, wacz, webarchive, safari-webarchive, filemaker, filemaker-pro, fmp12, fp7, fp5, fp3, fmp, fmpur, usr, fmpsl, pagemaker, adobe-pagemaker, pmd, p65, pm6, pm5, pm4, pmt, t65, dwf-dwfx, dwf, dwfx, pcl-pclxl, hp-pcl, hp-pclxl, pcl, pclxl, pxl, prn, spl, staroffice-openoffice-legacy, staroffice-openoffice, staroffice, openoffice-legacy, openoffice, sxw, stw, sxc, stc, sxi, sti, sxd, std, sxm, sxg, sdw, sdc, sdd, sda, tex-latex, tex, latex, ltx, mobi, azw, or azw3.
- Auto-detection does not fall back to plain text for arbitrary unknown bytes. Plain text is selected only for known text-like extensions such as .txt, .md, .markdown, and .json, or when explicitly forced with --format text. Standalone .xml is not auto-detected as structured XML; use --format text to extract it as raw text.
- Unknown and detected-but-unsupported formats fail root-command extraction with UnsupportedFormat and exit code 1. batch skips them; batch --json reports status: "skipped", the detected format, and guidance.
- --budget BYTES is a best-effort extracted-text byte guard in v0.1.0. Text-like, DOC, DOCX, XLS, XLSB, ODT, and modern package extractors generally return partial output with truncated = true; MSG now follows the same caller-budget truncation contract while preserving hard errors for internal parser safety caps. 0 disables only the caller-specified text limit, not internal parser, package, or archive safety limits.
- --max-input-bytes BYTES checks each input file's metadata length before reading it into memory; 0 disables this pre-read input cap. Over-limit root extraction exits with code 1 and no stdout. In batch, over-limit files are recorded as failed, JSON reports use stage: "input_limit", and remaining files continue processing.
- batch --max-files FILES bounds directory traversal before extraction starts so large trees cannot accumulate unbounded path lists in memory. The default cap is 10000 scanned filesystem files; 0 disables this traversal cap. Over-limit batch traversal exits with code 1, emits no JSON report, and does not write partial extraction outputs.
- formats lists extractor-backed formats. formats --all also lists detected-but-unsupported formats with status and guidance.
- completions <shell> writes a shell completion script to stdout for Bash, Elvish, Fish, PowerShell, or Zsh.
- manpage writes a roff manpage to stdout for packaging or local installation.
- batch scans only direct children of the target directory by default. Use batch --recursive to opt into nested filesystem directory traversal. It writes sibling *.extracted.txt files, skips generated *.extracted.txt outputs on rerun, reports progress on stderr, writes no document text to stdout, skips unsupported detections, and exits successfully when every detected input is unsupported.
- batch --json preserves the same extraction side effects and exit-code policy, but writes a machine-readable report to stdout with summary counters and per-file extracted, skipped, or failed rows.
- CLI memory policy: each input file is read into memory before extraction, including when root extraction uses --process-isolation. batch and root-command multi-file extraction process one input file at a time. Root multi-file output and batch --json per-file report rows are spooled to temporary files before final stdout or --output writes, so later failures do not emit partial stdout and the CLI no longer retains every extracted document/report row in memory. --budget limits extracted text only; --max-input-bytes is the CLI pre-read input file size cap; JSON modes change reporting shape and output buffering, not input buffering or parser memory. Use OS/process/container limits for large or untrusted inputs; process isolation adds a child-process boundary, timeout, and output cap, but does not replace those limits.
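Two of the rules in the list above are easy to mirror in calling code: pages inside one plain-text input are separated by a form-feed byte, and a byte cap of 0 means "disabled". The sketch below is illustrative only and is not dExtract's implementation. The names split_pages and exceeds_input_cap are hypothetical, and treating "over-limit" as strictly greater than the cap is an assumption.

```rust
/// Split the plain-text stdout of a single-input run into pages.
/// Pages are separated by the form-feed character (U+000C).
/// Hypothetical helper for callers; not part of the dextract API.
pub fn split_pages(stdout_text: &str) -> Vec<&str> {
    stdout_text.split('\u{000C}').collect()
}

/// Mirror the documented cap semantics: 0 disables the cap.
/// Assumption: a file is over the limit only when its length is
/// strictly greater than the cap.
pub fn exceeds_input_cap(file_len: u64, max_input_bytes: u64) -> bool {
    max_input_bytes != 0 && file_len > max_input_bytes
}
```

split_pages only applies to output captured from one input file, since multiple inputs written to stdout are separated by a newline instead. A caller could feed exceeds_input_cap the length reported by std::fs::metadata before deciding whether to invoke the CLI at all.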
Exit-code policy:
- 0: success, including a batch run where all detected inputs are unsupported.
- 1: runtime, extraction, input, output, or filesystem failure.
- 2: command-line usage error reported by clap.
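A wrapper script or service that shells out to the CLI can branch on these three codes. The following is a minimal sketch that assumes a dextract binary is on PATH; CliOutcome, classify_exit, and run_dextract are hypothetical names introduced here for illustration.

```rust
use std::process::Command;

/// Hypothetical classification of the documented exit codes.
#[derive(Debug, PartialEq, Eq)]
pub enum CliOutcome {
    /// Exit code 0.
    Success,
    /// Exit code 1: runtime, extraction, input, output, or filesystem failure.
    Failure,
    /// Exit code 2: command-line usage error.
    UsageError,
    /// Any other code, or no code at all (for example, killed by a signal).
    Other(Option<i32>),
}

/// Map a raw exit code onto the documented policy.
pub fn classify_exit(code: Option<i32>) -> CliOutcome {
    match code {
        Some(0) => CliOutcome::Success,
        Some(1) => CliOutcome::Failure,
        Some(2) => CliOutcome::UsageError,
        other => CliOutcome::Other(other),
    }
}

/// Run `dextract <path>` and classify the result.
/// Assumption: the `dextract` binary is installed and on PATH.
pub fn run_dextract(path: &str) -> std::io::Result<CliOutcome> {
    let status = Command::new("dextract").arg(path).status()?;
    Ok(classify_exit(status.code()))
}
```

Keeping an explicit Other variant avoids silently treating an unexpected code as success or failure.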
Library usage:
use dextract::DExtract;
// Read the whole document into memory; the facade takes bytes plus a filename.
let data = std::fs::read("report.docx")?;
let dx = DExtract::new();
let output = dx.extract(&data, "report.docx", 0)?;
// Print each emitted page with its page number.
for page in &output.pages {
    println!("Page {}: {}", page.page_number, page.text);
}
# Ok::<(), Box<dyn std::error::Error>>(())

If detection returns DocumentFormat::Unknown, DExtract::extract() returns
ExtractionError::UnsupportedFormat; it does not treat unknown input as plain
text unless the filename has a supported text-like extension. Use extract_as()
when intentionally forcing plain text or another extractor-backed format.
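The extension gate described above can be pictured as a small allow-list check. This is a sketch under stated assumptions, not the crate's detection code: the list contains only the extensions this README names (the real set may be larger), is_text_like is a hypothetical name, and case-insensitive matching is an assumption.

```rust
use std::path::Path;

/// Text-like extensions named in this README; the real list may be longer.
const TEXT_LIKE_EXTENSIONS: &[&str] = &["txt", "md", "markdown", "json"];

/// Return true when the filename has one of the known text-like extensions.
/// Hypothetical helper; case-insensitive comparison is an assumption.
pub fn is_text_like(filename: &str) -> bool {
    Path::new(filename)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            TEXT_LIKE_EXTENSIONS
                .iter()
                .any(|known| ext.eq_ignore_ascii_case(known))
        })
        .unwrap_or(false)
}
```

Under this sketch, a file such as data.xml or an extensionless blob is not treated as text, which matches the documented rule that unknown input is rejected rather than passed through.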
WordStar inputs (ws, ws1, ws2, ws3, ws4, ws5, ws6, ws7,
ws8, and wsd) are detected-but-unsupported and do not fall through to
plain text, RTF, OLE, ZIP/TAR/7z archive, HTML, or XML extraction. Adjacent
macro/template/backup/index candidates (wsm, wst, wsb, and wsx) remain
inventory-only. Text extraction, metadata extraction, formatting
interpretation, dot-command interpretation, codepage decoding, print-control
interpretation, macro execution, embedded-object extraction, native/converter
dependency support, and extractor-backed support remain outside this boundary.
AppleWorks/ClarisWorks inputs (cwk) are detected-but-unsupported and do not
fall through to iWork/Pages, ZIP/TAR/7z archive, OLE, HTML, XML, RTF, or
plain-text extraction. Adjacent templates/stationery (cws, cwt),
MacBinary/AppleDouble resource-fork material, text/metadata/cell extraction,
graphics/formula parsing, converter/native dependency support, and
extractor-backed support remain outside this boundary.
HWPX packages (hwpx) are detected-but-unsupported and do not fall through to
ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or legacy HWP handling.
Legacy HWP (hwp) remains evaluation-only because current fixtures are
generated FileHeader sentinel bytes, not valid CFB documents.
StarOffice/OpenOffice legacy inputs (sxw, stw, sxc, stc, sxi, sti,
sxd, std, sxm, sxg, sdw, sdc, sdd, sda) are
detected-but-unsupported and do not fall through to supported ODT/ODS/ODP/ODG,
generic ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text extraction. This
boundary does not add OpenDocument aliasing, text/cell/metadata extraction,
macro/script or formula execution, converter/native dependency support, or
extractor-backed support.
Comic-book archives (cbz, cbt, cb7, cbr) are detected-but-unsupported
and do not fall through to generic ZIP/TAR/7z archive, PDF, EPUB, OLE, HTML/XML,
RTF, plain-text, image/OCR, or RAR handling. Archive listing metadata,
ComicInfo.xml metadata, text sidecars, page images, recursive dispatch, and
extractor-backed support remain outside this boundary.
Metadata is best-effort and format-dependent in v0.1.0.
ExtractionOutput.metadata carries common document-level fields when an
extractor can recover them: title, author, subject, creator, producer,
keywords, creation date, modification date, and page_count.
ExtractedPage.metadata is page/section-scoped. It may carry
extractor-specific page, sheet, slide, archive-entry, mail-part, or HTML page
fields such as sheet_name, row_count, slide_number, path, size,
content_type, description, or language. These keys are local to the
extracted page or section and are not normalized into
ExtractionOutput.metadata.
Emitted ExtractedPage values use 1-indexed page_number values in output
order. source_format identifies the extractor family that produced the page,
and byte_count is the byte length of the emitted text for that page. The
cells vector is populated only by extractors with structured table or
spreadsheet output; fixed-layout page extraction such as XPS/OXPS emits page
text and page metadata but no cells.
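Downstream code can sanity-check these page invariants. The sketch below uses a hypothetical PageSketch struct that copies only three of the fields described above; it is not the crate's ExtractedPage type. It also assumes that "1-indexed in output order" means consecutive numbering (1, 2, 3, ...), which the text does not state explicitly.

```rust
/// Hypothetical stand-in for the fields described above;
/// not the crate's real ExtractedPage type.
pub struct PageSketch {
    pub page_number: usize,
    pub text: String,
    pub byte_count: usize,
}

/// Check that pages are numbered 1, 2, 3, ... in output order and that
/// each byte_count equals the UTF-8 byte length of the page text.
pub fn check_pages(pages: &[PageSketch]) -> Result<(), String> {
    for (index, page) in pages.iter().enumerate() {
        let expected = index + 1;
        if page.page_number != expected {
            return Err(format!(
                "position {} has page_number {}, expected {}",
                index, page.page_number, expected
            ));
        }
        // String::len() counts bytes, not characters.
        if page.byte_count != page.text.len() {
            return Err(format!(
                "page {} has byte_count {}, but text is {} bytes",
                page.page_number,
                page.byte_count,
                page.text.len()
            ));
        }
    }
    Ok(())
}
```

Note that byte_count is a byte length, so a page containing non-ASCII text has a byte_count larger than its character count.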
Some extractors intentionally promote source fields into document metadata. For
example, HTML <title> becomes document-level metadata.title, HTML
author/generator meta tags become metadata.author and metadata.producer,
and OpenOffice-style HTML created/changed/modified meta dates in
YYYYMMDD;HHMMSS form are normalized. page_count is the extractor-reported
logical/source count when available, not a normalized cross-format guarantee.
Text-like single-section formats such as CSV, TSV, HTML, MHTML, RTF, EML, and MSG
currently report emitted sections. Office/OpenDocument/EPUB formats may report
source-declared pages, sheets, slides, or spine items, which can differ from the
number of emitted ExtractedPage sections; EPUB page_count is based on
manifest-backed spine items after OPF logical caps are applied. Generic
archives currently leave page_count = 0 even when archive entries are emitted
as page-like sections; plain text also reports 0 because it has no source
page-count metadata.
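As an illustration of the date normalization mentioned above, the sketch below parses a YYYYMMDD;HHMMSS value and rewrites it as YYYY-MM-DDTHH:MM:SS. The output form is an assumption, since this README does not state what the normalized representation looks like, and normalize_openoffice_date is a hypothetical name rather than a crate function.

```rust
/// Parse an OpenOffice-style "YYYYMMDD;HHMMSS" meta date and rewrite it as
/// "YYYY-MM-DDTHH:MM:SS". Returns None for anything that does not match.
/// Assumption: the target form is illustrative, not the crate's format.
pub fn normalize_openoffice_date(raw: &str) -> Option<String> {
    let (date, time) = raw.trim().split_once(';')?;
    if date.len() != 8 || time.len() != 6 {
        return None;
    }
    // Reject non-digits before slicing, so byte offsets are always valid.
    if !date.bytes().chain(time.bytes()).all(|b| b.is_ascii_digit()) {
        return None;
    }
    let month: u32 = date[4..6].parse().ok()?;
    let day: u32 = date[6..8].parse().ok()?;
    let hour: u32 = time[0..2].parse().ok()?;
    let minute: u32 = time[2..4].parse().ok()?;
    let second: u32 = time[4..6].parse().ok()?;
    // Coarse range checks only; days-per-month is not validated.
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }
    Some(format!(
        "{}-{}-{}T{}:{}:{}",
        &date[0..4],
        &date[4..6],
        &date[6..8],
        &time[0..2],
        &time[2..4],
        &time[4..6]
    ))
}
```

Returning None on malformed input keeps the metadata best-effort: a caller can simply leave the field unset instead of failing the whole extraction.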
The DExtract facade catches unwind panics from registered extractors and returns ExtractionError::ExtractorPanicked. It does not install or suppress the process panic hook, so caught panics may still print diagnostics to stderr. Aborts, out-of-memory conditions, stack overflows, and direct extractor calls remain caller/process-level concerns. The CLI's optional --process-isolation mode adds a root-command child-process boundary around extraction; it is separate from the facade panic boundary and still is not a sandbox.
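The general technique behind such a panic boundary is std::panic::catch_unwind. The sketch below shows the pattern in isolation; it is not the facade's actual code, and SketchError and run_contained are hypothetical names. Because it leaves the panic hook untouched, a caught panic still prints its message to stderr, and it cannot intercept aborts, out-of-memory conditions, or stack overflows.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    ExtractorPanicked(String),
}

/// Run an extractor-like closure and convert an unwinding panic into an error.
/// The process panic hook is left alone, so diagnostics may still reach stderr.
pub fn run_contained<F>(extractor: F) -> Result<String, SketchError>
where
    F: FnOnce() -> String,
{
    panic::catch_unwind(AssertUnwindSafe(extractor)).map_err(|payload| {
        // Panic payloads are usually &str or String; handle both.
        let message = if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "non-string panic payload".to_string()
        };
        SketchError::ExtractorPanicked(message)
    })
}
```

This only works when the binary is built with unwinding panics; with panic = "abort" the process terminates before catch_unwind can return, which is one reason the child-process boundary is a separate mechanism.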
| Format | Detected | Extractor-backed | Fixture-backed | Compatibility-backed | Scope / notes |
|---|---|---|---|---|---|
| Plain text | Yes | Yes | Yes | Yes | UTF-8 lossy passthrough for local text files; page_count remains 0 because no source count is reported, with pinned CC0 compatibility sample. |
| CSV | Yes | Yes | Yes | Yes | Stable tab-separated text plus structured cells; reports one emitted section as page_count = 1, with pinned CC0 compatibility sample. |
| TSV | Yes | Yes | Yes | Yes | Tab-delimited input rendered as stable tab-separated text plus structured cells; reports one emitted section as page_count = 1, with pinned MIT compatibility sample. |
| TeX/LaTeX sources | Yes | No | Unsupported corpus + preflight fixtures | No | #158 registers .tex, .ltx, and .latex as detected-unsupported boundaries. Generated corpus fixtures cover plain, LaTeX, and archive/OLE/HTML/XML/RTF/plain-looking no-fallback inputs; preflight evidence records bounded command, brace, environment, blocked include/resource/execution directive, and resource-cap observations. No TeX execution, macro expansion, include traversal, graphics/bibliography loading, rendering, converter/native dependency, or extraction support is claimed. |
| DOC | Yes | Yes | Yes | Yes | Legacy OLE Word extraction with best-effort document metadata and source page-count metadata when available. |
| DOCX | Yes | Yes | Yes | Yes | Document body plus header/footer text, document metadata, and source page-count metadata when available; accepted docm, dotx, and dotm aliases route here. |
| ODT | Yes | Yes | Yes | Yes | OpenDocument text extraction with document metadata and source page-count metadata when available; accepted ott and flat XML .fodt inputs route here. |
| XLS | Yes | Yes | Yes | Yes | Legacy BIFF workbook extraction with sheet pages, structured cells, and sheet-count page_count. |
| XLSX | Yes | Yes | Yes | Yes | Shared strings and worksheet XML become structured pages and cells with sheet-count page_count; accepted xlsm, xltx, and xltm aliases route here. |
| ODS | Yes | Yes | Yes | Yes | OpenDocument spreadsheet extraction with sheet pages, structured cells, and source sheet-count page_count; accepted ots and flat XML .fods inputs route here. |
| PPT | Yes | Yes | Yes | Yes | Supported with limitations: bounded mechanical legacy PowerPoint text and metadata extraction from synthetic fixtures plus one pinned CC0 real-world sample; no rendering, macros, media extraction, embedded object recursion, or decryption. |
| PPTX | Yes | Yes | Yes | Yes | Slide text, speaker notes, presentation metadata, and slide-count page_count; accepted pptm, potx, potm, ppsx, ppsm, sldx, and sldm aliases route here. |
| ODP | Yes | Yes | Yes | Yes | OpenDocument presentation extraction with slide/page-count page_count; accepted otp and flat .fodp inputs route here. |
| ODG | Yes | Yes | Yes | Yes | Bounded OpenDocument drawing package/template/flat XML extraction for .odg, .otg, and .fodg visible drawing text, with a pinned Apache-2.0 packaged ODG compatibility sample. |
| HTML | Yes | Yes | Yes | Yes | Visible text plus document-level title, author, producer, normalized created/modified metadata, and page_count = 1; page metadata keeps description and language, with bounded nested-table cell capture and a pinned CC0 compatibility sample. |
| EPUB | Yes | Yes | Yes | Yes | XHTML chapters extracted in spine order with capped OPF metadata, manifest, and spine parsing plus spine-item page_count, with pinned W3C EPUB compatibility sample. |
| RTF | Yes | Yes | Yes | Yes | Body text plus \\info metadata; reports one emitted section as page_count = 1, with pinned CC0 compatibility sample. |
| RTFD packages | No | No | Preflight inventory fixtures | No | Evaluation-only #170 testdata/rtfd-preflight fixtures record directory-package shape, TXT.rtf root presence, attachment/member inventory, nested .rtfd deferral, resource-fork sidecar deferral, member-name traversal examples, and resource caps. No public .rtfd detection, detected-unsupported no-fallback behavior, RTFD text extraction, attachment extraction, recursive traversal, image/OCR extraction, converter/native dependency, or extractor-backed support is claimed. |
| EML | Yes | Yes | Yes | Yes | RFC 5322 mail extraction with decoded headers, body selection, MIME traversal caps, and truncation for clipped internals; reports selected body sections as page_count, while ordinary attachment payloads are not extracted. |
| MSG | Yes | Yes | Yes | Yes | Outlook .msg extraction with message metadata and text bodies; reports emitted message body sections as page_count, while ordinary attachment payloads are not extracted. |
| Outlook PST/OST | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #181; generated .pst/.ost corpus fixtures cover Outlook store classification, guidance, and no MSG, EML, mbox/Maildir, ZIP/TAR/7z archive, OLE, HTML, RTF, or plain-text fallback, while testdata/outlook-pst-ost-preflight records synthetic NDB header/version sentinels, malformed/root/node/block-count cases, encrypted/protected markers, oversized store metadata, and fallback probes. No message extraction, attachment extraction, folder traversal, decryption/password handling, store repair, search-index interpretation, account/client access, network access, native/libpff dependency, or extractor-backed support is claimed. |
| IBM Notes/Domino NSF/NTF | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #185; generated .nsf/.ntf corpus fixtures cover Notes/Domino classification, guidance, and no MSG, EML, mbox/Maildir, Outlook PST/OST, Access, DBF/DIF/SYLK, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/notes-domino-preflight records synthetic header observations, malformed short headers, encrypted/ACL-looking markers, resource caps, fallback probes, and inventory-only .box/.ndl/.id adjacent artifacts. No message extraction, attachment extraction, folder/view/form traversal, ACL/security interpretation, replication handling, decryption, Domino API/native dependency, database repair, embedded-payload recursion, adjacent artifact public classification, or extractor-backed support is claimed. |
| Microsoft InfoPath XSN | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #186; generated .xsn corpus fixtures cover InfoPath package classification, guidance, and no OOXML/ODF, ZIP/TAR/7z archive, generic XML, HTML, RTF, or plain-text fallback, while testdata/infopath-preflight records XSN package structure, manifest.xsf-style metadata, InfoPath XML namespace samples, malformed package/XML cases, external data-connection markers, traversal-shaped member names, resource caps, and inventory-only .xsf/standalone XML candidates. No form rendering, schema validation, script execution, data-connection traversal, attachment extraction, SharePoint integration, native/converter dependency, standalone generic XML public classification, or extractor-backed support is claimed. |
| WARC/WACZ/Safari webarchive | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #187; generated .warc, .warc.gz, .wacz, and .webarchive corpus fixtures cover web-archive classification, guidance, and no MHTML, HTML, JSON/plain text, generic XML, RTF, ZIP/TAR/7z archive, gzip TAR, or Safari plist/plain fallback, while testdata/web-archive-preflight records WARC/WARC.GZ records, WACZ package/index shape, Safari webarchive markers, malformed cases, traversal member names, external-resource markers, resource caps, and inventory-only .arc/.arc.gz/.cdx/.cdxj/.har candidates. No WARC record extraction, WACZ package extraction, Safari webarchive parsing, HTTP payload extraction, page rendering, resource reconstruction, CDX index lookup, external-resource traversal, native/converter dependency, inventory-only extension public classification, or extractor-backed support is claimed. |
| FileMaker Pro | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #189; generated .fmp12 and .fp7 corpus fixtures cover FileMaker classification, guidance, and no Access, DBF/DIF/SYLK, SQLite, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/filemaker-preflight records synthetic header/version observations, malformed short/bad headers, encryption/container/external-container markers, resource caps, fallback probes, and inventory-only .fp5/.fp3/.fmp/.fmpur/.usr/.fmpsl candidates. No table extraction, layout/form/report rendering, script execution, calculation evaluation, container-field extraction, external-container traversal, account/security interpretation, decryption, repair, FileMaker native dependency, inventory-only extension public classification, or extractor-backed support is claimed. |
| SPSS | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #190; generated .sav, .zsav, and .por corpus fixtures cover SPSS classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while testdata/statistical-data-preflight records header/version, compression/encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only .sps/.spv/.spo evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| SAS | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #190; generated .sas7bdat and .xpt corpus fixtures cover SAS classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while testdata/statistical-data-preflight records header/version, encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only .sas/.sas7bcat/.sas7bvew/.sas7bndx evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| Stata | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #190; generated .dta corpus fixtures cover Stata classification, guidance, and no spreadsheet/database/archive/OLE/HTML/XML/RTF/CSV/TSV/plain fallback, while testdata/statistical-data-preflight records header/version, encryption, variable/value-label, malformed/resource-cap, fallback, and inventory-only .do/.ado/.smcl/.gph/.dct evidence. No table, row, cell, value-label, metadata, compression, decryption, statistical interpretation, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| Mailbox stores | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #156; generated .mbox corpus fixtures cover mailbox-store classification, guidance, and no EML, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/mailbox-preflight records bounded mbox boundary/order/resource-cap observations plus CLI-only direct Maildir-shaped directory detection through new/cur/tmp children. No mailbox text extraction, message enumeration, attachment extraction, recursive store handling, Maildir message traversal, account/client access, network access, native/converter dependency, or extractor-backed support is claimed. |
| Pages | Yes | Limited | Yes | Yes | Legacy XML Pages packages with root index.xml.gz or index.xml are extractor-backed for body paragraph text; modern IWA and directory packages remain unsupported subformats. |
| Numbers | Yes | Limited | Yes | Yes | Legacy XML Numbers packages with root index.xml.gz or index.xml are extractor-backed for table/cell text; modern IWA and directory packages remain unsupported subformats. |
| Keynote | Yes | Limited | Yes | Yes | Legacy XML Keynote packages with root index.apxl.gz or index.apxl are extractor-backed for slide paragraph/table text; modern IWA and directory packages remain unsupported subformats. |
| MHTML | Yes | Yes | Yes | Yes | MHTML web archive inputs (.mht, .mhtml) extract the selected root HTML or plain text part, with pinned Apache-2.0 compatibility evidence; linked resources are skipped rather than reconstructed or recursively extracted. |
| XLSB | Yes | Yes | Yes | Yes | Excel binary workbook packages (.xlsb) are supported through #96 with calamine-backed BIFF12 worksheet/cell extraction, ZIP/package preflight caps, panic containment, and pinned Apache-2.0 compatibility evidence; formulas are not executed, macros and embedded payloads are ignored, and external links are not followed. |
| XPS/OXPS | Yes | Yes | Yes | Yes | Bounded FixedDocument/FixedPage package extraction for .xps and .oxps; extracts Glyphs UnicodeString text and core package metadata, with pinned Apache-2.0 XPS/OpenXPS-content compatibility evidence and no rendering, OCR, font decoding, or glyph-index reconstruction. |
| VSDX | Yes | Yes | Yes | Yes | Bounded Visio drawing text extraction for .vsdx, with pinned MIT compatibility evidence and generated malformed relationship/XML/resource-cap fixtures; follows declared page relationships, extracts page Text elements, and fails closed on corrupt VSDX packages without rendering shapes, OCR, macro execution, or external links. |
| Legacy Visio binary | Yes | Limited | Yes | Yes | Bounded visible-text recovery from valid Visio CFB .vsd, .vss, and .vst inputs is extractor-backed with pinned Apache POI evidence. Real MS-VSD record validation, page/shape ordering, metadata extraction, rendering, macro/VBA parsing or execution, embedded-object extraction, external-link traversal, and native/converter dependency remain unsupported. |
| Visio XML/package variants | Yes | Limited | Yes | No | Bounded extractor-backed support for #153 covers legacy Visio XML .vdx, .vsx, and .vtx plus modern package .vsdm, .vssx, .vssm, .vstx, and .vstm fixtures generated in testdata/corpus/basic; extraction reads page Text elements, ignores macro/VBA parts, and fails closed on malformed XML, corrupt packages, missing relationships, oversized parts, external targets, and non-Visio packages without VSDX, ZIP/archive, XML/plain text, OLE, HTML, or legacy Visio binary fallback. testdata/visio-package-preflight records package spine and malformed-boundary evidence; metadata extraction, rendering, conversion, macro execution, embedded-payload recursion, external-link traversal, and broad compatibility are out of scope. |
| Publisher | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #126/#149; committed generated testdata/corpus/basic/minimal_publisher.pub fixture covers OLE-magic .pub/Publisher classification and unsupported no-fallback behavior, while testdata/publisher-preflight records synthetic OLE-header stream/version/text-candidate/metadata-candidate inventory, malformed-header, missing-stream, encrypted-marker, external data-source, embedded-object, and resource-cap parser-readiness evidence only. Publisher inputs do not route to DOC, XLS, PPT, MSG, generic OLE, archive, HTML, or plain-text extraction, and no Publisher text extraction, metadata extraction, rendering, conversion, mail-merge/data-source access, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, real Publisher compatibility, or Publisher stream/version validation is supported. |
| Adobe InDesign IDML | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #169; committed generated testdata/corpus/basic/minimal_indesign.idml, rtf_looking_indesign.idml, html_looking_indesign.idml, ole_looking_indesign.idml, and plain_looking_indesign.idml fixtures cover extension-gated .idml classification and unsupported no-fallback behavior, while testdata/indesign-preflight records IDML ZIP package shape, designmap.xml presence, story XML inventory, malformed XML, traversal-shaped member paths, external-resource markers, embedded asset deferral, and package resource caps. IDML inputs do not route to ZIP/archive, RTF, OLE, HTML, or plain-text extraction, and no IDML text extraction, metadata extraction, rendering/layout fidelity, image extraction/OCR, script/plugin execution, external-resource loading, recursive embedded-payload extraction, converter/native dependency, or extractor-backed support is claimed. |
| Adobe InDesign INDD | No | No | Preflight inventory fixtures | No | Evaluation-only #169 binary .indd feasibility lane in testdata/indesign-preflight records binary feasibility and ZIP-looking .indd fallback-risk inventory only. No public .indd detection, detected-unsupported no-fallback behavior, binary INDD parsing, rendering/layout fidelity, image extraction/OCR, script/plugin execution, external-resource loading, recursive embedded-payload extraction, converter/native dependency, or extractor-backed support is claimed. |
| Adobe FrameMaker MIF/FM | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #171; committed generated testdata/corpus/basic/minimal_framemaker.mif, rtf_looking_framemaker.mif, zip_looking_framemaker.mif, html_looking_framemaker.mif, ole_looking_framemaker.fm, and plain_looking_framemaker.fm fixtures cover extension-gated .mif/.fm classification and unsupported no-fallback behavior, while testdata/framemaker-preflight records MIF header/version markers, text-flow/paragraph inventory, escaped strings, blocked external-resource/include/cross-reference markers, binary payload rejection, fallback-looking signature probes, structure/file resource caps, and binary .fm feasibility rows. FrameMaker inputs do not route to RTF, ZIP/archive, OLE, HTML, or plain-text extraction, and no MIF text extraction, binary FM parsing, rendering/layout fidelity, imported graphic loading, external-resource traversal, converter/native dependency, or extractor-backed support is claimed. |
| CHM | Yes | Limited | Yes | Yes | Bounded CHM topic text extraction uses internal ITSF/ITSP/PMGL/PMGI/DataSpace readiness and LZX topic decoding, with pinned Apache Tika evidence. Rendering, scripts, external links, embedded payload recursion, broad compatibility, and generic archive/OLE/HTML/plain fallback remain unsupported. |
| OneNote | Yes | Limited | Yes | Yes | Bounded visible-text recovery for real OneStore revision-store files is extractor-backed with pinned Apache Tika evidence. .onepkg member extraction, object graph/rich-text semantics, handwriting/OCR, rendering, embedded payload recursion, and generic OLE/ZIP/plain fallback remain unsupported. |
| DjVu | Yes | Limited | Generated corpus fixtures + preflight fixtures | No | Bounded #151 extractor-backed slice for uncompressed TXTa text-layer bytes in .djvu/.djv IFF/FORM inputs; committed generated testdata/corpus/basic/minimal_djvu.djvu covers baseline extraction, while testdata/djvu-preflight and supporting dextract-djvu-preflight keep synthetic IFF/FORM container, page-form, directory, TXTa/TXTz, malformed-chunk, external-reference, and resource-cap coverage. DjVu inputs do not route to generic archive, OLE, HTML, or plain-text extraction; TXTz decompression, rendering, OCR, image extraction, metadata value extraction, external references, native dependencies, and converter shellouts remain unsupported. |
| PostScript/EPS | Yes | Limited | Generated corpus fixtures | No | Bounded lexical extraction for #150 after the #127 boundary; committed generated testdata/corpus/basic/minimal_postscript.ps and testdata/corpus/basic/minimal_eps.eps fixtures cover DSC metadata plus literal/hex string text recovery only. PostScript/EPS extraction does not execute PostScript, render, OCR, process EPS previews, load external resources, recover outlined text, interpret fonts, or use Ghostscript. |
| WordPerfect | Yes | Limited | Yes | Yes | Bounded WPC5/WPC6 .wpd document-area text extraction for #148, backed by generated testdata/corpus/basic/minimal_wordperfect.wpd, testdata/wordperfect-preflight prefix/malformed/adjacent fixtures, and pinned Apache-2.0 testdata/realworld/apache-tika/testWordPerfect.wpd. Extracts plain document-area text and WPC prefix metadata only; macros/templates (.wpt/.wcm), graphics/layout rendering, embedded objects, external resources, encrypted inputs, converter shellouts, and native dependencies remain unsupported. |
| WordPerfect adjacent/malformed boundaries | Yes | No | Parser-readiness preflight fixtures | No | Adjacent .wpt/.wcm fixtures remain non-public inventory only, and malformed, encrypted, unsupported-version/product/type, payload-cap, ZIP-looking, and OLE-looking .wpd inputs fail closed without generic archive, OLE, HTML, or plain-text fallback. |
| WordStar | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #194; generated .ws, .ws7, and .wsd corpus fixtures plus fallback-looking probes cover WordStar classification, guidance, and no plain text, RTF, OLE, ZIP/TAR/7z archive, HTML, or XML fallback, while testdata/wordstar-preflight records control-byte/header observations, dot-command/print-control markers, malformed/resource-cap cases, fallback probes, and inventory-only .wsm/.wst/.wsb/.wsx candidates. No text extraction, metadata extraction, formatting interpretation, dot-command interpretation, codepage decoding, print-control interpretation, macro execution, embedded-object extraction, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| AbiWord | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported close-out boundary for #157; committed generated fixtures cover extension-gated .abw classification and no-fallback behavior, while testdata/abiword-preflight and internal dextract-abiword-preflight record XML/gzip shape inventory, synthetic 1.x/2.x/3.x version/provenance candidates, structure-only section/paragraph/heading/list/table/cell counts, non-public .zabw/.abw.gz compressed candidates, malformed XML/entity/missing-version/unsupported-version/embedded-object/external-link/depth/element/attribute/text/metadata-key-limit cases, missing/corrupt gzip, gzip resource-cap boundaries, and metadata-key-name inventory only. AbiWord inputs do not route to generic XML, ZIP, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, extractor-backed XML parser readiness, real-world producer compatibility, compressed-variant public support, embedded-object recursion, external-link traversal, or converter/native dependency support is claimed. |
| Microsoft Works | Yes | No | Unsupported corpus + preflight feasibility fixtures | No | Detected unsupported boundary for #155; committed generated fixtures cover extension-gated .wps/.wks/.wdb/.xlr classification and no-fallback behavior, while testdata/works-preflight records synthetic Works-family sentinels, malformed/version/record cases, resource caps, encryption/embedded/external markers, fallback probes, missing legal compatibility samples, version-family unknowns, and parser-readiness blockers. Microsoft Works inputs do not route to RTF, XLS, XLSX, generic archive, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, real stream/version validation, database/spreadsheet parsing, embedded-object recursion, external-link traversal, decryption, native/converter dependency, or parser-readiness support is claimed. |
| Microsoft Access | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #182; generated .mdb/.accdb corpus fixtures cover Access classification, guidance, and no XLS/XLSX/ODS, DBF/DIF/SYLK, Works, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/access-preflight records synthetic Jet/ACE sentinels, malformed/version/object-count cases, password/encryption and linked-table markers, and inventory-only .mde/.accde candidates. No table extraction, query execution, form/report rendering, macro/VBA execution, linked-table traversal, external data-source access, embedded-object recursion, decryption, repair, native ACE/Jet dependency, inventory-only extension public classification, or extractor-backed support is claimed. |
| SQLite database | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #188; generated .sqlite/.sqlite3 corpus fixtures cover SQLite classification, guidance, and no XLS/XLSX/ODS, Access, DBF/DIF/SYLK, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/sqlite-preflight records SQLite header/page-size/version observations, malformed short/bad headers, resource-cap cases, fallback probes, and inventory-only .db, .db3, .sqlite2, .sdb, WAL, and SHM sidecar candidates. No SQL execution, schema/table extraction, row/cell extraction, FTS/index parsing, WAL replay, extension loading, encryption/decryption, repair, native SQLite dependency, sidecar public classification, or extractor-backed support is claimed. |
| DBF/dBASE | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #183; generated .dbf corpus fixtures cover DBF/dBASE classification, guidance, and no XLS/XLSX/ODS, Access, DIF/SYLK, Works, Quattro Pro, Lotus 1-2-3, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/dbf-preflight records deterministic DBF header/version observations, malformed/header/resource-cap cases, fallback probes, and inventory-only .dbt/.fpt/.ndx/.mdx/.cdx sidecar candidates. No table/row/cell extraction, memo parsing, codepage decoding claims, deleted-record recovery, index parsing, shapefile/geospatial support, formula/macro execution, native/converter dependency, sidecar public classification, or extractor-backed DBF parser support is claimed. |
| DIF | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #184; generated .dif corpus fixtures cover DIF classification, guidance, and no CSV/TSV, plain-text, HTML/XML, RTF, archive, OLE, XLS/XLSX/ODS, Access, DBF, Works, Quattro, Lotus, or SYLK fallback, while shared testdata/dif-sylk-preflight records deterministic text-header observations, malformed/resource-cap cases, fallback probes, and inventory-only .sylk split evidence. No cell/table extraction, formula execution or evaluation, codepage/encoding normalization, external-link traversal, native/converter dependency, magic-only detection, or extractor-backed support is claimed. |
| SYLK | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #184; generated public .slk corpus fixtures cover SYLK classification, guidance, and no CSV/TSV, plain-text, HTML/XML, RTF, archive, OLE, XLS/XLSX/ODS, Access, DBF, Works, Quattro, Lotus, or DIF fallback, while shared testdata/dif-sylk-preflight keeps .sylk inventory-only. No cell/table extraction, formula execution or evaluation, codepage/encoding normalization, external-link traversal, native/converter dependency, magic-only detection, .sylk public classification, or extractor-backed support is claimed. |
| Microsoft Project | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #180; generated .mpp/.mpt/.mpx corpus fixtures cover Project classification, guidance, and no DOC/XLS/PPT/OOXML/ODF, Visio, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/project-preflight records synthetic CFB/MPX sentinels, malformed/version/task/resource/record-count cases, password/VBA/external-link/embedded-object markers, and fallback probes. No task extraction, resource extraction, schedule calculation, formula evaluation, Gantt/timeline rendering, image/OCR extraction, macro/VBA execution, external-link traversal, embedded-object recursion, decryption, repair, native/converter dependency, or extractor-backed support is claimed. |
| MOBI/AZW | Yes | Limited | Supported corpus + preflight fixtures | No | Bounded #145 support extracts unencrypted UTF-8 .mobi, .azw, and .azw3 text records when the input validates as PDB-style BOOK/MOBI and uses uncompressed or classic PalmDOC compression. Generated supported corpus fixtures cover .mobi/.azw/.azw3 extraction, while legacy minimal corpus fixtures and testdata/mobi-preflight keep unsupported-subset/no-fallback evidence for empty/minimal headers, numeric-only EXTH inventory, encoding-marker classification, unsupported Windows-1252/unknown encodings, HUFF/CDIC, encryption, malformed PalmDOC bytes, and output limits. No DRM/decryption, Windows-1252 decoding, metadata extraction, EXTH value decoding, HTML/XHTML conversion, rendering, resource extraction, embedded-payload recursion, external-link traversal, AZW4/Topaz/KFX support, or generic .prc/.pdb classification is supported. |
| Kindle KFX/Topaz/AZW4 | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #176; generated .kfx/.tpz/.azw1/.azw4 corpus fixtures cover Kindle KFX/Topaz/AZW4 classification, guidance, and no PDF, EPUB/ZIP, archive, MOBI/AZW, or plain-text fallback, while testdata/kindle-ebook-boundary keeps generic .prc/.pdb, .azw6, .azw8, and .azw9 inventory-only. No Kindle KFX/Topaz/AZW4 text extraction, DRM/decryption, metadata extraction, EXTH value decoding, rendering, resource extraction, embedded-payload recursion, external-link traversal, converter/native dependency, or extractor-backed support is claimed. |
| QuarkXPress | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #178; generated .qxd/.qxp corpus fixtures cover QuarkXPress classification, guidance, and no ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/quarkxpress-preflight records synthetic page-layout sentinels, malformed/header/object-count cases, external-resource markers, and inventory-only .qxt/.qpt/.qxb/.qxl/.xtg candidates. No layout rendering, text extraction, metadata extraction, font interpretation, image/OCR extraction, external-resource loading, color management, converter/native dependency, inventory-only extension public classification, or extractor-backed support is claimed. |
| Scribus SLA | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #192; generated .sla corpus fixtures cover extension-gated Scribus classification, guidance, and no generic XML, gzip/archive, HTML, RTF, InDesign, QuarkXPress, Publisher, FrameMaker, or plain-text fallback, while testdata/scribus-preflight records Scribus XML root/version observations, .sla.gz inventory-only candidates, malformed XML/gzip cases, external image/font markers, traversal/script markers, resource caps, and fallback probes. No text extraction, metadata extraction, layout rendering, font/image loading, color management, script execution, external-resource traversal, gzip public support, native/converter dependency, inventory-only public classification, or extractor-backed support is claimed. |
| Adobe PageMaker | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #193; generated .pmd/.p65/.pm6 corpus fixtures cover PageMaker classification, guidance, and no InDesign, QuarkXPress, Scribus SLA, Publisher, FrameMaker, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/pagemaker-preflight records synthetic layout sentinels, malformed header/version cases, external image/font markers, traversal markers, resource caps, fallback probes, and inventory-only .pm5/.pm4/.pmt/.t65 candidates. No text extraction, metadata extraction, layout rendering, font interpretation, image/OCR extraction, external-resource loading, color management, script execution, converter/native dependency, inventory-only public classification, or extractor-backed support is claimed. |
| Quattro Pro | Yes | No | Unsupported corpus + feasibility inventory | No | Detected unsupported boundary for #159; committed generated fixtures cover extension-gated .qpw/.wb1/.wb2/.wb3 classification and no-fallback behavior, while testdata/quattro-preflight records fixture inventory, generated .wb1/.wb2 PRONOM BOF signature/version markers, malformed signature boundaries, missing legal compatibility samples, and remaining parser-readiness blockers only. Quattro Pro inputs do not route to XLS, XLSX, ODS, generic archive, OLE, HTML, or plain-text extraction, and no text extraction, metadata extraction, formula execution, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, workbook/sheet/cell validation, public support registration, or extractor-backed parser support is claimed. |
| Lotus 1-2-3 | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #164; generated .wk1/.wk3/.wk4/.123 corpus fixtures cover Lotus 1-2-3 classification, guidance, and no XLS/XLSX/ODS, DBF/DIF/SYLK, Works, Quattro Pro, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, CSV/TSV, or plain-text fallback, while testdata/lotus-preflight records generated PRONOM BOF signature/version fixtures, malformed signature boundaries, missing legal compatibility samples, and parser-readiness blockers only. No text extraction, metadata extraction, workbook/sheet/cell parsing, formula execution, macro execution, embedded-object recursion, external-link traversal, decryption, native/converter dependency, public support registration, or extractor-backed Lotus parser support is claimed; .wks remains owned by the Microsoft Works boundary. |
| HWPX | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #162; generated testdata/corpus/basic/*hwpx.hwpx fixtures cover extension-gated .hwpx classification, guidance, and no ZIP/TAR/7z, OLE, HTML/XML, RTF, plain-text, or legacy HWP fallback, while testdata/hwp-hwpx-preflight records HWPX package-spine, malformed package/XML/resource-cap, and external-resource inventory. No text/metadata extraction, package-member extraction, OWPML/body parsing, rendering/layout, decryption, script execution, converter/native dependency, or extractor-backed support is claimed. |
| AutoCAD DWG/DXF | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #175; generated .dwg/.dxf corpus fixtures cover extension-gated AutoCAD classification, guidance, and no PDF/PostScript, ZIP/archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/autocad-preflight records ASCII DXF section/table/entity and TEXT/MTEXT marker inventory, DWG AC10xx sentinels, malformed/header/resource-cap cases, and fallback-signature risk only. No CAD text extraction, metadata extraction, geometry interpretation, rendering/OCR, external-reference traversal, converter/native dependency, or extractor-backed support is claimed. |
| DWF/DWFx | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #177; generated .dwf/.dwfx corpus fixtures cover DWF/DWFx classification, guidance, and no PDF/PostScript, XPS/OXPS, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, image/OCR, or Visio fallback, while testdata/dwf-dwfx-preflight records synthetic package/signature fixtures, malformed package-shape cases, traversal member names, external-resource markers, embedded asset deferral, and package entry/byte caps. No text extraction, metadata extraction, CAD geometry extraction, fixed-layout rendering, image/OCR extraction, embedded-resource traversal, external-resource loading, native/converter dependency, or extractor-backed support is claimed. |
| HP PCL/PCLXL | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #191; generated .pcl corpus fixtures cover extension-gated HP PCL/PCLXL classification, guidance, and no PDF/PostScript, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or image/OCR fallback, while testdata/pcl-preflight records PJL, PCL escape-sequence, PCLXL sentinel, malformed/resource-cap, embedded-resource, fallback, and inventory-only .prn/.pxl/.pclxl/.spl evidence. No print-stream interpretation, text extraction, font interpretation, rendering/OCR, PJL execution, embedded-resource extraction, printer emulation, Ghostscript/native dependency, inventory-only extension public classification, or extractor-backed support is claimed. |
| Legacy HWP | No | No | Evaluation preflight inventory | No | Evaluation-only #162 lane; testdata/hwp-hwpx-preflight records generated .hwp FileHeader sentinels that are not valid CFB documents. No public .hwp detection, detected-unsupported no-fallback behavior, valid legacy CFB parser support, text/metadata extraction, decryption, converter/native dependency, or extractor-backed support is claimed. |
| StarOffice/OpenOffice legacy | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #166; generated .sxw/.stw/.sxc/.stc/.sxi/.sti/.sxd/.std/.sxm/.sxg/.sdw/.sdc/.sdd/.sda corpus fixtures cover classification, guidance, and no ODT/ODS/ODP/ODG, ZIP/TAR/7z, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/staroffice-openoffice-preflight records package-shape fixtures, legacy binary sentinels, malformed ZIP/XML/resource boundaries, and parser-readiness blockers. No OpenDocument aliasing, text/cell/metadata extraction, macro/script or formula execution, converter/native dependency support, or extractor-backed support is claimed. |
| FictionBook FB2 | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #163; generated .fb2 and .fb2.zip corpus fixtures cover FictionBook FB2 classification, guidance, and no PDF, EPUB, MOBI/AZW, ZIP/TAR/7z archive, OLE, HTML/XML, RTF, or plain-text fallback, while testdata/fb2-preflight records XML/package observations, malformed XML/resource caps, and ZIP member/traversal boundaries. No text extraction, metadata extraction, resource extraction, embedded-payload recursion, external-link traversal, converter/native dependency, public support registration, or extractor-backed support is claimed. |
| OpenDocument ancillary variants | Yes | No | Unsupported corpus + boundary inventory | No | Detected unsupported boundary for #165; generated .odm/.odb/.odf/.odc corpus fixtures cover OpenDocument ancillary classification, guidance, and no ZIP/TAR/7z archive, OLE, HTML/XML, RTF, plain-text, or neighboring ODT/ODS/ODP/ODG fallback, while testdata/opendocument-ancillary-boundary records variant separation and remaining parser questions. No master-document traversal, database/formula execution, chart rendering, macro/script execution, embedded-payload recursion, external-link traversal, decryption, native/converter dependency, or extractor-backed support is claimed. |
| AppleWorks/ClarisWorks | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #167; generated testdata/corpus/basic/*appleworks.cwk fixtures cover extension-gated .cwk classification, guidance, and no iWork/Pages, ZIP/TAR/7z, OLE, HTML, XML, RTF, or plain-text fallback, while testdata/appleworks-preflight records BOBO shared-signature inventory, byte caps, MacBinary/AppleDouble resource-fork deferral, and PRONOM observations. No text/metadata/cell extraction, graphics/formula parsing, converter/native dependency, or extractor-backed support is claimed. |
| Comic-book archives | Yes | No | Unsupported corpus + preflight fixtures | No | Detected unsupported boundary for #168; generated testdata/corpus/basic/*comic* and minimal_cb*.cb* fixtures cover extension-gated .cbz, .cbt, .cb7, and .cbr classification, guidance, and no generic ZIP/TAR/7z archive, PDF, EPUB, OLE, HTML/XML, RTF, plain-text, image/OCR, or RAR fallback, while testdata/comic-archive-preflight records container/signature observations, image-only entries, ComicInfo.xml inventory, text-sidecar boundaries, and traversal/resource-cap boundaries. No archive listing metadata, ComicInfo.xml metadata extraction, RAR support, image extraction/OCR, sidecar text extraction, recursive dispatch, converter/native dependency, or extractor-backed support is claimed. |
| ZIP | Yes | Yes | Yes | Yes | One page-like section per entry, with pinned MIT archive compatibility sample; traversal paths are rejected. Archive entry bytes are decoded as UTF-8-lossy text and are not recursively dispatched to other extractors; generic ZIP output does not constitute modern iWork/Numbers/Keynote support, nor support for rejected OOXML/ODF and package boundaries. |
| TAR | Yes | Yes | Yes | Yes | Tar and gzip-compressed tar (.tar.gz, .tgz) via the same archive extractor, with pinned BSD-3-Clause GNU tar compatibility sample. Archive entry bytes are decoded as UTF-8-lossy text and are not recursively dispatched to other extractors. |
| 7Z | Yes | Limited | Yes | Yes | Unencrypted non-solid 7z archives with unencoded headers, one coder per folder, no filters, and COPY, DEFLATE, BZIP2, or bounded-dictionary LZMA2 streams, with pinned Apache-2.0 COPY compatibility sample; LZMA, high-dictionary LZMA2, encrypted, encoded-header, solid, filtered, and coder-chain archives are rejected. |
| PDF | Yes | No | Unsupported fixture | No | Out of scope; use dpdf for PDF parsing and OCR. |
| OCR / images | No | No | No | No | Out of scope for this repo. |
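
Many rows above describe a "detected unsupported boundary" with no fallback: the input is recognized, and extraction then fails closed instead of being routed to a generic extractor. The following is a minimal sketch of that idea only. `Classification`, `classify`, and `extract` are hypothetical names, not dExtract's API, and the sketch gates on file extension alone, whereas some real boundaries (Publisher, for example) also check file magic.

```rust
/// Hypothetical classification result; these names are illustrative only
/// and are not taken from the dExtract crates.
#[derive(Debug, PartialEq)]
enum Classification {
    Supported(&'static str),
    DetectedUnsupported(&'static str),
    Unknown,
}

/// Extension-gated classification (an assumption made for this sketch).
fn classify(file_name: &str) -> Classification {
    let ext = file_name
        .rsplit_once('.')
        .map(|(_, ext)| ext.to_ascii_lowercase());
    match ext.as_deref() {
        Some("txt") => Classification::Supported("plain text"),
        Some("idml") => Classification::DetectedUnsupported("Adobe InDesign IDML"),
        Some("abw") => Classification::DetectedUnsupported("AbiWord"),
        _ => Classification::Unknown,
    }
}

/// Fails closed for detected-unsupported inputs: the bytes are never
/// handed to an RTF, ZIP, OLE, HTML, or plain-text extractor.
fn extract(file_name: &str, bytes: &[u8]) -> Result<String, String> {
    match classify(file_name) {
        Classification::Supported(_) => Ok(String::from_utf8_lossy(bytes).into_owned()),
        Classification::DetectedUnsupported(name) => Err(format!(
            "{name} is detected but unsupported; no fallback extractor is tried"
        )),
        Classification::Unknown => Err("unrecognized format".to_string()),
    }
}

fn main() {
    // An RTF-looking payload behind an .idml name is still rejected.
    let rtf_looking = b"{\\rtf1 hello}";
    println!("{:?}", extract("rtf_looking_indesign.idml", rtf_looking));
    println!("{:?}", extract("notes.txt", b"hello"));
}
```

The point of the design is that classification decides the outcome before any content sniffing, so a fallback-looking payload cannot change which extractor runs.
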

See docs/archive-policy.md for the archive recursion and resource-limit policy.
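
The ZIP and TAR rows state that traversal paths are rejected and that entry bytes are decoded as UTF-8-lossy text without recursive dispatch. As a rough sketch of those two rules, using only the Rust standard library: the function names below are hypothetical, and the exact path checks (treating backslashes as separators, rejecting leading slashes and `..` segments) are assumptions rather than the policy documented in docs/archive-policy.md.

```rust
/// Returns true when an entry name cannot escape the extraction root.
/// Assumption for this sketch: backslashes count as separators, and
/// absolute paths or any `..` segment are rejected.
fn is_safe_entry_path(name: &str) -> bool {
    let normalized = name.replace('\\', "/");
    if normalized.is_empty() || normalized.starts_with('/') {
        return false;
    }
    normalized.split('/').all(|segment| segment != "..")
}

/// Decodes entry bytes as UTF-8, replacing invalid sequences with U+FFFD.
fn decode_entry(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Builds one (entry name, text) section per entry. The bytes are decoded
/// as text only; they are not dispatched to any other extractor.
fn section_for_entry(name: &str, bytes: &[u8]) -> Result<(String, String), String> {
    if !is_safe_entry_path(name) {
        return Err(format!("rejected traversal-shaped entry path: {name}"));
    }
    Ok((name.to_string(), decode_entry(bytes)))
}

fn main() {
    let entries = [
        ("docs/readme.txt", &b"hello"[..]),
        ("../etc/passwd", &b"x"[..]),
    ];
    for (name, bytes) in entries {
        match section_for_entry(name, bytes) {
            Ok((entry, text)) => println!("{entry}: {text}"),
            Err(reason) => println!("{reason}"),
        }
    }
}
```
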

Fixture-backed means covered by committed baseline fixtures or explicitly named
unsupported-boundary fixtures under testdata/corpus/ or a format-specific
testdata/ directory.

Compatibility-backed means covered by pinned public samples under
testdata/realworld/. Absence of a compatibility fixture does not mean a format
is unsupported; it means the release evidence is currently limited to unit
tests, integration tests, and baseline fixtures.

dExtract is a workspace with 32 publishable crates plus five unpublished
internal preflight crates. The public build uses the members declared in the
root Cargo.toml, including bounded extractors for legacy .ppt, legacy XML
iWork, CHM, OneNote visible-text recovery, legacy Visio binary visible-text recovery,
PostScript/EPS, DjVu TXTa, MOBI/AZW/AZW3, and WordPerfect WPC text
extraction.

| Path | Purpose |
|---|---|
| dextract-types | Shared traits and data types for extractors and outputs. |
| dextract-ole | Shared bounded OLE preflight validation for legacy Office extractors. |
| dextract-zip-package | Shared hardened ZIP package reader for OOXML, OpenDocument, EPUB, and XPS/OXPS extractors. |
| dextract | Facade crate that registers the built-in extractors. |
| dextract-cli | dextract command-line entrypoint. |
| dextract-doc | Legacy DOC extractor. |
| dextract-docx | DOCX extractor. |
| dextract-odt | ODT extractor. |
| dextract-xls | Legacy XLS extractor. |
| dextract-xlsb | XLSB extractor. |
| dextract-xlsx | XLSX extractor. |
| dextract-ods | ODS extractor. |
| dextract-ppt | Legacy PPT extractor with bounded mechanical text and metadata support; no rendering, macros, media extraction, embedded object recursion, or decryption. |
| dextract-pptx | PPTX extractor. |
| dextract-odp | ODP extractor. |
| dextract-odg | ODG/OTG package and flat FODG drawing extractor. |
| dextract-postscript | PostScript/EPS bounded lexical extractor; no execution, rendering, OCR, previews, external resources, or Ghostscript. |
| dextract-iwork | Legacy XML Pages/Numbers/Keynote package extractor. |
| dextract-wordperfect | Bounded WordPerfect WPC5/WPC6 .wpd document-area extractor; no macros/templates, rendering, embedded-object recursion, external-resource traversal, converter shellout, or native dependency. |
| dextract-iwork-preflight | Unpublished iWork input preflight primitives; not extractor-backed support. |
| dextract-onenote-preflight | Unpublished OneNote input, .onepkg package-inventory preflight primitives, and fail-closed #101 text/object-page/object-graph/revision-table-sequence/page-rich-text-object-reference/visible-text-object readiness blocker checks; not extractor-backed support. |
| dextract-mobi-preflight | MOBI/AZW PDB/MOBI envelope preflight primitives used by dextract-mobi, including text-encoding marker classification and bounded uncompressed/classic PalmDOC text-record materialization. |
| dextract-chm-preflight | CHM ITSF envelope, ITSP/PMGL header validation, PMGI/DataSpace readiness, LZX topic decoding support, bounded .hhc TOC ordering, and fail-closed malformed reset-table checks used by the limited CHM facade extractor. |
| dextract-djvu-preflight | DjVu IFF/FORM container parser-readiness primitives and bounded TXTa byte materialization for .djvu/.djv; used by the limited facade extractor and preflight fixtures. |
| dextract-abiword-preflight | Unpublished AbiWord XML/gzip shape inventory, synthetic version-family checks, extraction-risk marker rejection, and structure-only element-count primitives for .abw plus non-public .zabw/.abw.gz candidates; not extractor-backed support. |
| dextract-wordperfect-preflight | Unpublished WordPerfect WPC5/WPC6 header and synthetic payload inventory primitives for .wpd plus non-public .wpt/.wcm candidates; not extractor-backed support. |
| dextract-visio-binary-preflight | Unpublished legacy Visio binary CFB stream-inventory, non-synthetic VisioDocument real-parser/version-policy compatibility/record-map/table/text-run/page-shape/metadata gating, and repo-owned synthetic record/text/page-shape ordering plus record/version/stream consistency parser-readiness primitives for .vsd/.vss/.vst; not real Visio record decoding or extractor-backed support. |
| dextract-xps | XPS/OXPS fixed-layout package extractor. |
| dextract-vsdx | VSDX plus bounded Visio XML/package variant extractor. |
| dextract-html | HTML extractor. |
| dextract-csv | CSV extractor. |
| dextract-epub | EPUB extractor. |
| dextract-mobi | Bounded MOBI/AZW/AZW3 UTF-8 text-record extractor for unencrypted uncompressed/classic PalmDOC inputs. |
| dextract-rtf | RTF extractor. |
| dextract-eml | RFC 5322 email and MHTML extractor. |
| dextract-msg | Outlook MSG extractor. |
| dextract-archive | ZIP, TAR, and limited 7z extractor. |

- Cargo.toml in the repository root for the workspace manifest and public repo URL.
- dextract/src/lib.rs in the repository for the facade API and built-in extractor registration.
- dextract-cli/src/main.rs in the repository for CLI behavior and subcommands.
- RELEASING.md, ROADMAP.md, testdata/README.md, and scripts/fetch_realworld_corpus.py in the repository for release and corpus guidance.

Run the workspace checks from the repo root:
cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features --locked -- -D warnings
cargo test --workspace --all-targets --locked

Useful release-time checks:
bash scripts/check-release-tooling.sh
bash scripts/check-github-workflows.sh
cargo doc --workspace --no-deps --locked
bash scripts/check-package-list.sh
bash scripts/check-supply-chain.sh
bash scripts/check-api-compat.sh --required
python3 scripts/check_test_corpus_drift.py
python3 scripts/validate_format_gap_issue_drafts.py --quiet
cargo run -p dextract-cli -- formats
cargo run -p dextract-cli -- formats --json
cargo run -p dextract-cli -- formats --all --json
cargo package --list -p dextract
cargo publish --dry-run -p dextract-types

For reproducible local extraction performance checks against committed fixtures, use the CLI benchmark harness:
short_sha="$(git rev-parse --short=7 HEAD)"
python3 scripts/bench_extractors.py \
--warmups 2 \
--iterations 5 \
--fixture-set all \
--mode both \
--output "target/perf/refreshed-performance-baseline-${short_sha}.json"

See docs/performance.md and docs/performance-baseline.md for the benchmark
contract and current same-machine baseline. For ad hoc library-only timing
against representative files, cargo run -p dextract --example simple_bench
remains available.
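
The warmup-then-measure pattern behind the `--warmups` and `--iterations` flags can be sketched in Rust with the standard library alone. This is an illustrative harness, not the contents of scripts/bench_extractors.py or the simple_bench example: `time_runs` and `median` are hypothetical helpers, and the workload is a stand-in that decodes an in-memory buffer rather than calling any dExtract API.

```rust
use std::time::{Duration, Instant};

/// Runs `work` untimed for `warmups` rounds, then returns one timing
/// sample per measured iteration.
fn time_runs<F: FnMut()>(warmups: usize, iterations: usize, mut work: F) -> Vec<Duration> {
    for _ in 0..warmups {
        work();
    }
    (0..iterations)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Middle sample after sorting (the upper middle for even-length input);
/// returns None when there are no samples.
fn median(samples: &[Duration]) -> Option<Duration> {
    let mut sorted = samples.to_vec();
    sorted.sort();
    sorted.get(sorted.len() / 2).copied()
}

fn main() {
    // Stand-in workload: lossy UTF-8 decoding of a 64 KiB buffer.
    let input = vec![b'a'; 64 * 1024];
    let samples = time_runs(2, 5, || {
        let text = String::from_utf8_lossy(&input);
        std::hint::black_box(text.len());
    });
    println!("median over {} runs: {:?}", samples.len(), median(&samples));
}
```

Reporting the median rather than the mean keeps a single slow iteration from dominating a small sample, which matters when only five iterations are measured.
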
The committed fixture corpus used for facade-level support checks lives under
testdata/corpus/. Pinned public compatibility samples live under
testdata/realworld/.

dExtract is released under the Apache License 2.0.

Copyright 2026 Dropbox