Multi-format document parser with OCR fallback. Extracts text from PDF, Word, Excel, PowerPoint, and plaintext files. Falls back to Tesseract OCR for scanned PDFs and an AI vision API for low-quality scans.
Originally built for CRAG (regulatory compliance intelligence) and extracted as a standalone library.

```sh
npm install @crag-pro/doc-parser
```

```ts
import { parse } from "@crag-pro/doc-parser";

const result = await parse("./contract.pdf");
console.log(result.text);
console.log(result.metadata);
```

| Extension | Parser | Notes |
|---|---|---|
| `.pdf` | pdf-parse | OCR fallback (Tesseract) for scanned pages, AI vision API for low-confidence OCR |
| `.docx` | mammoth | |
| `.xlsx` | exceljs | |
| `.pptx` | XML extraction | |
| `.txt`, `.csv`, `.md`, `.msg` | plaintext | |
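
If you want to check whether a file is supported before calling `parse`, the table above can be expressed as a lookup keyed by extension. This is only a sketch: `ParserKind`, `PARSER_BY_EXTENSION`, and `parserFor` are hypothetical names that mirror the table, not the library's internals, and the case-insensitive match is an assumption.

```typescript
import { extname } from "node:path";

// Hypothetical names: these mirror the table above, not the library's internals.
type ParserKind = "pdf-parse" | "mammoth" | "exceljs" | "pptx-xml" | "plaintext";

const PARSER_BY_EXTENSION: Record<string, ParserKind> = {
  ".pdf": "pdf-parse",
  ".docx": "mammoth",
  ".xlsx": "exceljs",
  ".pptx": "pptx-xml",
  ".txt": "plaintext",
  ".csv": "plaintext",
  ".md": "plaintext",
  ".msg": "plaintext",
};

// Returns the parser listed for a path's extension, or undefined when the
// extension is not in the table. Matching is case-insensitive (an assumption).
function parserFor(filePath: string): ParserKind | undefined {
  return PARSER_BY_EXTENSION[extname(filePath).toLowerCase()];
}
```

For example, `parserFor("./contract.pdf")` yields `"pdf-parse"`, while an unlisted extension such as `.png` yields `undefined`.
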

```sh
npx @crag-pro/doc-parser ./input-dir ./output-dir
```

See `doc-parser --help` for full options.

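
The CLI's exact behavior is not documented here, so the following is not a description of it. It is a rough sketch of how a similar directory-to-directory run could be done by hand with the `parse` function. The `ParseFn` type, `parseDirectory`, and the `<name>.txt` output naming are all assumptions made for illustration, and only the `text` field of the result is written.

```typescript
import { mkdir, readdir, writeFile } from "node:fs/promises";
import { basename, extname, join } from "node:path";

// Assumption: the parse result exposes `text` as a string; other fields are ignored.
type ParseFn = (path: string) => Promise<{ text: string }>;

// Extensions listed in the supported-formats table.
const SUPPORTED = new Set([".pdf", ".docx", ".xlsx", ".pptx", ".txt", ".csv", ".md", ".msg"]);

// Hypothetical output naming: "contract.pdf" becomes "<outputDir>/contract.pdf.txt".
function outputPathFor(outputDir: string, inputFile: string): string {
  return join(outputDir, `${basename(inputFile)}.txt`);
}

// Parses every supported file directly inside inputDir (non-recursive) and
// writes the extracted text to outputDir. Returns the paths that were written.
async function parseDirectory(
  inputDir: string,
  outputDir: string,
  parse: ParseFn,
): Promise<string[]> {
  await mkdir(outputDir, { recursive: true });
  const written: string[] = [];
  for (const entry of await readdir(inputDir, { withFileTypes: true })) {
    if (!entry.isFile() || !SUPPORTED.has(extname(entry.name).toLowerCase())) continue;
    const result = await parse(join(inputDir, entry.name));
    const target = outputPathFor(outputDir, entry.name);
    await writeFile(target, result.text, "utf8");
    written.push(target);
  }
  return written;
}
```

With the package installed, you would pass the imported function in: `await parseDirectory("./input-dir", "./output-dir", parse)`. Injecting `parse` as a parameter keeps the sketch free of assumptions about the package's types and makes it easy to stub in tests.
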
Tesseract must be installed locally for OCR fallback:

```sh
# macOS
brew install tesseract

# Ubuntu
apt-get install tesseract-ocr
```

For AI vision fallback on low-quality scans, set `ANTHROPIC_API_KEY` in your environment.

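
Both fallbacks depend on the local environment, so it can help to confirm they are available before processing a batch of scans. The sketch below is one way to do that with Node's standard library. `checkFallbacks` and `FallbackStatus` are hypothetical helpers, not part of `@crag-pro/doc-parser`, and the check assumes the `tesseract` binary is expected on `PATH`.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper types, not part of @crag-pro/doc-parser.
interface FallbackStatus {
  tesseract: boolean; // `tesseract --version` ran and exited with status 0
  aiVision: boolean; // ANTHROPIC_API_KEY is set and not blank
}

function checkFallbacks(
  env: Record<string, string | undefined> = process.env,
): FallbackStatus {
  // spawnSync sets `error` (for example ENOENT) when the binary cannot be started.
  const probe = spawnSync("tesseract", ["--version"], { stdio: "ignore" });
  return {
    tesseract: probe.error === undefined && probe.status === 0,
    aiVision: Boolean(env.ANTHROPIC_API_KEY?.trim()),
  };
}

const status = checkFallbacks();
if (!status.tesseract) console.warn("tesseract not found on PATH: OCR fallback unavailable");
if (!status.aiVision) console.warn("ANTHROPIC_API_KEY not set: AI vision fallback unavailable");
```

The check only confirms that the prerequisites are present. It does not validate the API key or the installed Tesseract language data.
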
MIT