217 changes: 217 additions & 0 deletions pdf-processing-cli/.gitignore
@@ -0,0 +1,217 @@
# Created by https://www.toptal.com/developers/gitignore/api/node,linux,macos,windows
# Edit at https://www.toptal.com/developers/gitignore?templates=node,linux,macos,windows

### Linux ###
*~

# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*

# KDE directory preferences
.directory

# Linux trash folder which might appear on any partition or disk
.Trash-*

# .nfs files are created when an open file is removed but is still being accessed
.nfs*

### macOS ###
# General
.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon

# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

### macOS Patch ###
# iCloud generated files
*.icloud

### Node ###
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*

# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage
*.lcov

# nyc test coverage
.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# Snowpack dependency directory (https://snowpack.dev/)
web_modules/

# TypeScript cache
*.tsbuildinfo

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional stylelint cache
.stylelintcache

# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variable files
.env
.env.development.local
.env.test.local
.env.production.local
.env.local

# parcel-bundler cache (https://parceljs.org/)
.cache
.parcel-cache

# Next.js build output
.next
out

# Nuxt.js build / generate output
.nuxt
dist

# Gatsby files
.cache/
# Uncomment the public line below if your project uses Gatsby and not Next.js
# https://nextjs.org/blog/next-9-1#public-directory-support
# public

# vuepress build output
.vuepress/dist

# vuepress v2.x temp and cache directory
.temp

# Docusaurus cache and generated files
.docusaurus

# Serverless directories
.serverless/

# FuseBox cache
.fusebox/

# DynamoDB Local files
.dynamodb/

# TernJS port file
.tern-port

# Stores VSCode versions used for testing VSCode extensions
.vscode-test

# yarn v2
.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*

### Node Patch ###
# Serverless Webpack directories
.webpack/

# Optional stylelint cache

# SvelteKit build / generate output
.svelte-kit

### Windows ###
# Windows thumbnail cache files
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db

# Dump file
*.stackdump

# Folder config file
[Dd]esktop.ini

# Recycle Bin used on file shares
$RECYCLE.BIN/

# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp

# Windows shortcuts
*.lnk

# End of https://www.toptal.com/developers/gitignore/api/node,linux,macos,windows
1 change: 1 addition & 0 deletions pdf-processing-cli/.npmrc
@@ -0,0 +1 @@
save-exact=true
1 change: 1 addition & 0 deletions pdf-processing-cli/.tool-versions
@@ -0,0 +1 @@
nodejs 24.8.0
74 changes: 74 additions & 0 deletions pdf-processing-cli/README.md
@@ -0,0 +1,74 @@
# PDF Processing CLI

A Node.js command-line utility for extracting text and rendering PNG images from PDF files using the WebAssembly build of PDFium and a WASM-based PNG encoder.

## Prerequisites

- Node.js 24.8.0 (matching the project engine constraint)
- npm (bundled with Node.js)

## Setup

```bash
npm install
```

This installs the dependencies, including `@hyzyla/pdfium` and `@jsquash/png`, which ship the required WebAssembly binaries.

## Usage

```bash
node ./bin/pdf-tool.mjs <pdf-path> [options]
```

After running `npm install`, you can also run the tool with `npx pdf-tool`, or install the package globally to make `pdf-tool` available on your PATH.

### Options

- `--text-out <file>`: Write extracted text to the specified file.
- `--text-format <plain|json>`: Choose plain text (default) or JSON output.
- `--text-join`: Join all page text into a single block (plain text only).
- `--png-dir <directory>`: Directory to write rendered PNG files (one per page).
- `--scale <number>`: Render scale multiplier (default: 1). Mutually exclusive with `--width`/`--height`.
- `--width <pixels>` and `--height <pixels>`: Explicit output size (both required when used).
- `--pages <list>`: Comma-separated page numbers or ranges (e.g. `1,3-5`). Defaults to all pages.
- `--password <string>`: Password for encrypted PDFs.
- `--render-form-fields`: Draw interactive form fields when rendering PNGs.
- `--help`: Show usage information.
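The sizing rules in the list above (`--scale` excludes `--width`/`--height`, and the latter two must appear together) could be checked along these lines. This is a minimal sketch, not the CLI's actual code: `resolveRenderSize` and the shape of `opts` are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: validates the sizing flags described in the options list.
// `opts` holds the raw string values as they would arrive from the command line.
function resolveRenderSize(opts) {
  const hasScale = opts.scale !== undefined;
  const hasWidth = opts.width !== undefined;
  const hasHeight = opts.height !== undefined;

  // --scale is mutually exclusive with --width/--height.
  if (hasScale && (hasWidth || hasHeight)) {
    throw new Error("--scale cannot be combined with --width/--height");
  }
  // --width and --height are only valid as a pair.
  if (hasWidth !== hasHeight) {
    throw new Error("--width and --height must be given together");
  }

  // Parse a flag value as a positive number, optionally requiring an integer.
  const toPositive = (name, raw, integer) => {
    const value = Number(raw);
    const valid =
      Number.isFinite(value) && value > 0 && (!integer || Number.isInteger(value));
    if (!valid) {
      throw new Error(`--${name} must be a positive ${integer ? "integer" : "number"}`);
    }
    return value;
  };

  if (hasWidth) {
    return {
      width: toPositive("width", opts.width, true),
      height: toPositive("height", opts.height, true),
    };
  }
  // Documented default: scale 1 when no sizing flag is given.
  return { scale: hasScale ? toPositive("scale", opts.scale, false) : 1 };
}
```

Rejecting conflicting flags up front, before any page is rendered, keeps a long batch run from failing halfway through.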

### Examples

Extract text as JSON and render PNGs for pages 1–3 at 2× scale:

```bash
node ./bin/pdf-tool.mjs ./docs/sample.pdf \
--pages 1-3 \
--text-out output/sample.json \
--text-format json \
--png-dir output/images \
--scale 2
```
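A value such as `--pages 1-3` or `1,3-5` has to be expanded into individual page numbers before extraction or rendering. The sketch below shows one way to do that; `parsePageList` and its `pageCount` parameter are hypothetical, and the choice to reject out-of-range pages rather than clamp them is an assumption.

```javascript
// Hypothetical helper: expands a page list such as "1,3-5" into sorted,
// de-duplicated 1-based page numbers, validating against the document length.
function parsePageList(spec, pageCount) {
  const pages = new Set();
  for (const part of spec.split(",")) {
    // Accept "N" or "N-M", with optional surrounding whitespace.
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page entry: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start || end > pageCount) {
      throw new Error(`Page entry out of range: "${part.trim()}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return [...pages].sort((a, b) => a - b);
}

// parsePageList("1,3-5", 10) → [1, 3, 4, 5]
```

Collecting pages in a `Set` means overlapping entries such as `3-5,4` yield each page only once.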

Extract all text into a single plain-text file:

```bash
node ./bin/pdf-tool.mjs ./docs/sample.pdf --text-out output/sample.txt --text-join
```

Render every page, including form fields, at roughly 300 dpi (scale 4.17 ≈ 300 / 72, assuming scale 1 corresponds to 72 dpi):

```bash
node ./bin/pdf-tool.mjs ./docs/sample.pdf --png-dir output/images --scale 4.17 --render-form-fields
```
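The 4.17 figure comes from dividing the target resolution by 72, the number of PDF points per inch. The sketch below works through that arithmetic under the assumption that scale 1 renders one pixel per point (72 dpi). `scaleForDpi` and `pixelSize` are hypothetical helpers, and rounding to the nearest pixel is also an assumption; the renderer may round differently.

```javascript
// PDF user space: 1 point = 1/72 inch.
const POINTS_PER_INCH = 72;

// Hypothetical helper: converts a target DPI into a --scale value,
// assuming scale 1 corresponds to 72 dpi.
function scaleForDpi(dpi) {
  if (!Number.isFinite(dpi) || dpi <= 0) {
    throw new RangeError("dpi must be a positive number");
  }
  return dpi / POINTS_PER_INCH;
}

// Hypothetical helper: estimates the output bitmap size for a page whose
// dimensions are given in points (rounding to the nearest pixel is assumed).
function pixelSize(widthPt, heightPt, scale) {
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// A US Letter page is 612 x 792 points.
const scale = scaleForDpi(300); // 4.1666...
const letter = pixelSize(612, 792, scale); // { width: 2550, height: 3300 }
```

Passing the rounded value `4.17` instead of `300 / 72` gives a slightly larger bitmap (about 2552 × 3303 pixels for US Letter under the same assumptions), which is usually close enough for print-quality output.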

## Project Structure

- `src/pdfium/`: PDFium initialisation helpers and document loading utilities.
- `src/text/`: Text extraction module.
- `src/image/`: PDF page rendering and PNG encoding module.
- `bin/pdf-tool.mjs`: CLI entrypoint that wires text extraction and rendering features.

## Notes

- WASM binaries are loaded directly from the installed packages at runtime; no additional build step is required.
- Ensure the runtime has permission to read the target PDF and write to the requested output paths.