
feat(extract): add document extraction pipeline with OCR and LLM analysis #475

Merged
cpcloud merged 17 commits into main from feat/200-document-extraction-pipeline
Feb 23, 2026

cpcloud (Owner) commented Feb 22, 2026

Summary

  • Add document extraction pipeline: PDF text extraction (pdftotext), OCR (tesseract via pdftoppm + per-page OCR), and LLM-powered structured analysis (title, summary, document type via streaming JSON)
  • New extraction overlay with live progress: per-step status icons, elapsed timers, expandable log content with left border pipe, and accept/discard workflow
  • OCR and LLM steps stream progress through channels using the existing waitFor pattern; text extraction stays synchronous
  • Extraction is opt-in (requires LLM configured) and OCR gracefully degrades when tesseract/poppler aren't available
  • Schema additions: extracted_text, ocr_text, ocr_tsv columns on documents
  • Demo tape showing the full extraction flow on a scanned invoice PDF
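
The graceful degradation mentioned in the summary could look roughly like the sketch below. It is not the PR's code: the `tool` type and `canOCR` function are hypothetical names, and the rule that images need only tesseract while PDFs also need pdftoppm is an assumption drawn from the description of the OCR path.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"sync"
)

// tool caches a single PATH lookup for one external binary (hypothetical type).
type tool struct {
	name  string
	once  sync.Once
	found bool
}

// available runs exec.LookPath at most once and remembers the answer.
func (t *tool) available() bool {
	t.once.Do(func() {
		_, err := exec.LookPath(t.name)
		t.found = err == nil
	})
	return t.found
}

var (
	tesseract = &tool{name: "tesseract"}
	pdftoppm  = &tool{name: "pdftoppm"}
)

// canOCR reports whether the OCR step can run for a MIME type.
// Assumption: images need only tesseract; PDFs also need pdftoppm to rasterize pages.
func canOCR(mime string) bool {
	switch {
	case strings.HasPrefix(mime, "image/"):
		return tesseract.available()
	case mime == "application/pdf":
		return tesseract.available() && pdftoppm.available()
	default:
		return false
	}
}

func main() {
	for _, mime := range []string{"application/pdf", "image/png", "text/plain"} {
		if canOCR(mime) {
			fmt.Println(mime, "-> run OCR")
		} else {
			fmt.Println(mime, "-> skip OCR, continue with the remaining steps")
		}
	}
}
```

Caching the lookup behind `sync.Once` keeps repeated uploads from hitting the filesystem each time, at the cost of not noticing a tool installed while the program is running.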

Test plan

  • Upload a PDF with LLM configured: overlay opens, text/OCR/LLM steps run, accept persists hints
  • Upload a scanned PDF (no embedded text): OCR runs, then LLM analyzes OCR output
  • Upload an image: OCR runs directly on the image, then LLM
  • No tesseract installed: OCR step skipped, LLM runs on pdftotext output only
  • No LLM configured: no overlay (text extraction is silent)
  • LLM fails mid-stream: overlay stays open with error, Esc discards
  • Esc during extraction: overlay closes, extraction cancelled
  • go test -shuffle=on ./... passes

🤖 Generated with Claude Code

cpcloud force-pushed the feat/200-document-extraction-pipeline branch 5 times, most recently from 0e2df39 to 86d4eba, on February 22, 2026 22:22
cpcloud and others added 17 commits February 22, 2026 17:52
Add the text extraction foundation for the document extraction pipeline
(#200). Introduces ExtractedText and OCRData columns on Document, a pure-Go
PDF text extractor using ledongthuc/pdf, and a design plan document.

- Add ExtractedText (string) and OCRData ([]byte) to Document model
- Implement ExtractText for PDF, text/*, and markdown MIME types
- Add IsScanned heuristic (empty/whitespace text = scanned)
- Include test fixture generator and sample.pdf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
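
The IsScanned heuristic from the commit above (empty or whitespace-only text means scanned) fits in one line. This is a minimal sketch with a hypothetical lowercase name and signature, not the PR's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// isScanned reports whether extracted text is empty or whitespace only,
// which suggests the pages are images rather than digital text.
func isScanned(extracted string) bool {
	return strings.TrimSpace(extracted) == ""
}

func main() {
	fmt.Println(isScanned("Invoice #1042")) // false
	fmt.Println(isScanned(" \n\f\n"))       // true: only spaces, newlines, and a form feed
}
```

`strings.TrimSpace` also strips form feeds, so text consisting only of page-break characters still counts as empty.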
Add the OCR layer for the document extraction pipeline (#200). Scanned
PDFs and images are recognized via tesseract + pdftoppm when available,
with graceful degradation and a one-time hint when tools are missing.

- Add OCR function with PDF rasterization (pdftoppm) and image OCR paths
- Parse tesseract TSV output preserving word/line/paragraph structure
- Add tool detection with sync.Once caching (HasTesseract, HasPDFToPPM)
- Add tesseract + poppler-utils to devShell in flake.nix
- Add one-time tesseract hint setting in data/settings.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
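
For the TSV parsing step in the commit above, a rough sketch follows. It relies on tesseract's standard 12-column TSV layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 rows are words. The `Word` type, `parseTSV`, and `plainText` are hypothetical names, and the sample data is made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Word is one recognized word plus the structure it belongs to (hypothetical type).
type Word struct {
	Page, Block, Par, Line int
	Conf                   float64
	Text                   string
}

// parseTSV keeps only word rows (level 5) with non-empty text.
func parseTSV(tsv string) ([]Word, error) {
	var words []Word
	for i, row := range strings.Split(strings.TrimRight(tsv, "\r\n"), "\n") {
		if i == 0 {
			continue // header row
		}
		f := strings.SplitN(strings.TrimRight(row, "\r"), "\t", 12)
		if len(f) < 12 || f[0] != "5" {
			continue
		}
		text := strings.TrimSpace(f[11])
		if text == "" {
			continue
		}
		var ids [4]int // page_num, block_num, par_num, line_num
		for j := range ids {
			n, err := strconv.Atoi(f[j+1])
			if err != nil {
				return nil, fmt.Errorf("row %d: %w", i+1, err)
			}
			ids[j] = n
		}
		conf, err := strconv.ParseFloat(f[10], 64)
		if err != nil {
			return nil, fmt.Errorf("row %d: %w", i+1, err)
		}
		words = append(words, Word{ids[0], ids[1], ids[2], ids[3], conf, text})
	}
	return words, nil
}

// plainText joins words with spaces and starts a new line whenever the
// page, block, paragraph, or line number changes.
func plainText(words []Word) string {
	var b strings.Builder
	for i, w := range words {
		if i > 0 {
			p := words[i-1]
			if p.Page != w.Page || p.Block != w.Block || p.Par != w.Par || p.Line != w.Line {
				b.WriteByte('\n')
			} else {
				b.WriteByte(' ')
			}
		}
		b.WriteString(w.Text)
	}
	return b.String()
}

// sample is invented TSV output used only for this illustration.
const sample = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n" +
	"1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n" +
	"5\t1\t1\t1\t1\t1\t10\t10\t50\t12\t96.5\tInvoice\n" +
	"5\t1\t1\t1\t1\t2\t70\t10\t40\t12\t91\t#1042\n" +
	"5\t1\t1\t1\t2\t1\t10\t30\t45\t12\t88.2\tTotal\n"

func main() {
	words, err := parseTSV(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(plainText(words))
}
```

Confidence is parsed as a float because tesseract versions differ on whether the column is an integer or a decimal.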
Add the LLM extraction layer for the document extraction pipeline (#200).
When an extraction model is configured, documents are analyzed to extract
vendor, amounts, dates, entity links, and maintenance schedules.

- Add ExtractionHints and EntityContext types with validation maps
- Build extraction prompt with entity context for LLM matching
- Parse flexible LLM JSON responses (money as string/float, multiple
  date formats, code-fenced responses)
- Add [extraction] config section with model, max_ocr_pages, enabled
- Wire extraction config through app Model and main.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
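
One way to handle the flexible LLM JSON described in the commit above (money as string or float, several date formats, code-fenced responses) is with custom `UnmarshalJSON` methods. The sketch below is illustrative only: the `Hints` fields, the list of date layouts, and the choice to treat every amount as dollars are assumptions, not the PR's schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"strconv"
	"strings"
	"time"
)

const fence = "\x60\x60\x60" // three backticks, the Markdown code-fence marker

// Money stores an amount in cents. Assumption: inputs are dollar amounts.
type Money struct{ Cents int64 }

// UnmarshalJSON accepts 12.5, "12.50", or "$1,234.56".
func (m *Money) UnmarshalJSON(b []byte) error {
	var f float64
	if json.Unmarshal(b, &f) != nil {
		// Not a JSON number: fall back to a string form.
		var s string
		if err := json.Unmarshal(b, &s); err != nil {
			return fmt.Errorf("money: want number or string, got %s", b)
		}
		s = strings.NewReplacer("$", "", ",", "").Replace(strings.TrimSpace(s))
		parsed, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return fmt.Errorf("money: %w", err)
		}
		f = parsed
	}
	m.Cents = int64(math.Round(f * 100))
	return nil
}

// Date tries several layouts in order (the list is an assumption).
type Date struct{ time.Time }

var dateLayouts = []string{"2006-01-02", "01/02/2006", "January 2, 2006", "Jan 2, 2006"}

func (d *Date) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return fmt.Errorf("date: %w", err)
	}
	for _, layout := range dateLayouts {
		if t, err := time.Parse(layout, strings.TrimSpace(s)); err == nil {
			d.Time = t
			return nil
		}
	}
	return fmt.Errorf("date: unrecognized format %q", s)
}

// stripFences removes a surrounding Markdown code fence, if present.
func stripFences(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s
	}
	if i := strings.IndexByte(s, '\n'); i >= 0 {
		s = s[i+1:] // drop the opening fence line, including any "json" tag
	} else {
		s = strings.TrimPrefix(s, fence)
	}
	return strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(s), fence))
}

// Hints is a hypothetical subset of the extraction result.
type Hints struct {
	Vendor string `json:"vendor"`
	Total  Money  `json:"total"`
	Date   Date   `json:"date"`
}

func parseHints(raw string) (Hints, error) {
	var h Hints
	err := json.Unmarshal([]byte(stripFences(raw)), &h)
	return h, err
}

func main() {
	raw := fence + "json\n" + `{"vendor":"Acme Roofing","total":"$1,234.56","date":"02/03/2026"}` + "\n" + fence
	h, err := parseHints(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(h.Vendor, h.Total.Cents, h.Date.Format("2006-01-02"))
}
```

Putting the leniency in the field types keeps `parseHints` itself a plain `json.Unmarshal` call, so new fields inherit the same tolerance without extra parsing code.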
Add the Pipeline orchestrator that sequences text extraction, OCR, and
LLM extraction into a single Run call (#200). Wire it into the document
upload flow with entity context from the database.

- Add Pipeline.Run orchestrating all three extraction layers
- Add Store.EntityNames for LLM entity matching context
- Rewrite parseDocumentFormData to use Pipeline with documentParseResult
- Add buildExtractionPipeline and showTesseractHint to app Model
- Pre-fill document title from LLM suggestion, notes from summary
- Surface non-fatal extraction errors via status bar

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rogress

Add an interactive overlay that shows real-time progress when documents
are processed through the extraction pipeline (text -> OCR -> LLM).
Each step displays a spinner, elapsed time, and detail (page count,
model name, character count). Users accept results with `a` or cancel
with `esc`.

Key changes:
- Extraction overlay with step navigation (j/k), expand/collapse (enter)
- Channel-based OCR streaming with per-page and rasterization progress
- LLM token streaming in overlay with accumulated JSON display
- Accept/cancel flow: results held until user presses `a`
- Proper context cancellation: esc cancels all in-flight work
- OCR failure gracefully continues to LLM step
- currency_unit field in extraction schema for cents/dollars disambiguation
- Configurable pdftotext timeout (extraction.text_timeout / MICASA_TEXT_TIMEOUT)
- ExtractionPromptInput struct replaces 7 positional params
- Cached extraction LLM client on model
- Docs: configuration, keybindings, and documents guide updated

closes #200

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
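
The channel-based OCR streaming and context cancellation listed in the commit above can be illustrated with a small producer/consumer sketch. It assumes a hypothetical `ocrWithProgress` signature and a fake per-page OCR function; the overlay, rasterization progress, and the waitFor integration are left out.

```go
package main

import (
	"context"
	"fmt"
)

// Progress is one per-page update sent to the UI (hypothetical type).
type Progress struct {
	Page  int
	Total int
	Text  string
	Err   error
}

// ocrWithProgress runs ocrPage for each page in a goroutine and streams one
// update per page. The channel is closed when all pages are done, a page
// fails, or ctx is cancelled.
func ocrWithProgress(
	ctx context.Context,
	pages []string,
	ocrPage func(context.Context, string) (string, error),
) <-chan Progress {
	ch := make(chan Progress)
	go func() {
		defer close(ch)
		for i, page := range pages {
			if ctx.Err() != nil {
				return // cancelled before starting this page
			}
			text, err := ocrPage(ctx, page)
			select {
			case ch <- Progress{Page: i + 1, Total: len(pages), Text: text, Err: err}:
			case <-ctx.Done():
				return // nobody is listening any more
			}
			if err != nil {
				return
			}
		}
	}()
	return ch
}

// demo consumes updates and cancels after cancelAfter of them (0 = never),
// mimicking a user pressing esc during extraction.
func demo(cancelAfter int) []Progress {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	fake := func(_ context.Context, page string) (string, error) {
		return "text of " + page, nil
	}
	ch := ocrWithProgress(ctx, []string{"page-1.png", "page-2.png", "page-3.png"}, fake)
	var got []Progress
	for u := range ch {
		got = append(got, u)
		if len(got) == cancelAfter {
			cancel()
			break
		}
	}
	for range ch {
		// Drain until the producer notices the cancellation and closes the channel.
	}
	return got
}

func main() {
	for _, u := range demo(0) {
		fmt.Printf("page %d/%d: %s\n", u.Page, u.Total, u.Text)
	}
}
```

The `select` on both the send and `ctx.Done()` is what keeps the goroutine from blocking forever once the consumer stops reading.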
VHS tape that demonstrates the document extraction overlay: importing a
scanned PDF, OCR progress, LLM extraction, and accepting results.
Requires Ollama running with qwen3:0.6b.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Combine sample.pdf (digital text) and scanned-invoice.pdf (image pages)
into a 109KB mixed-inspection.pdf fixture. The pipeline test verifies
that pdftotext extracts digital pages while OCR handles scanned ones --
the common case for real-world inspection reports and permits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without these tools, all OCR and PDF text extraction tests are skipped
in CI. Install poppler-utils (pdftotext, pdftoppm) and tesseract-ocr
on all three platforms so the pipeline tests exercise real code paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace checked-in binary test fixtures (sample.pdf, invoice.png,
scanned-invoice.pdf, mixed-inspection.pdf) with 4 bash scripts that
generate them on demand. Fixtures are now gitignored and generated
via shell hook (local dev) or CI step (with shell: bash for Windows).

- gen-sample-pdf.bash: base64-embedded minimal PDF (no deps)
- gen-invoice-png.bash: magick-generated realistic invoice image
- gen-scanned-pdf.bash: magick image-to-PDF conversion
- gen-mixed-pdf.bash: pdfunite digital + scanned pages
- 5 nix apps: 4 individual + gen-testdata combined
- CI: imagemagick added to all platforms, shell: bash fixture step

closes #200

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use Shift+F to jump directly to Docs tab
- Add new document instead of editing existing one
- Increase terminal height for overlay centering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… tape

Clamp base view to terminal height before overlay compositing so the
extraction overlay centers correctly when opened from an in-place form
save (where the form content overflows the terminal).

Fix expanded log content: add left border pipe (│) to visually separate
log output from step headers, add blank line spacing between expanded
steps, and use directional triangles (▾ down when expanded, ▸ right
when collapsed).

Fix the demo tape: enter Edit mode before Shift+A, use a temp directory
with just the test PDF so the file picker navigation is deterministic.

closes #200

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ubuntu 22.04/24.04 ship ImageMagick v6 which only provides `convert`,
not the v7 `magick` command. Symlink convert to magick on Linux CI so
gen scripts work uniformly across all platforms.

Add gen-sample-text-png.bash for the OCR image integration test fixture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Choco's poppler package extracts binaries into a nested directory
that isn't on PATH. Split the install into two steps: choco install
in PowerShell, then find pdfunite.exe and add its directory to
GITHUB_PATH in bash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The choco poppler package only ships source code, not compiled
binaries. Drop it from Windows CI and gracefully skip gen-mixed-pdf
when pdfunite is unavailable. The mixed-PDF test already skips when
the fixture is missing.

macOS ImageMagick needs ghostscript for text rendering (magick -annotate).
Without it, fixture images are blank and tesseract returns empty text.

Also add skipOrFatalCI helper: tests skip locally when tools are missing,
but fail hard in CI on Linux/macOS where all tools should be installed.
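
A helper like the skipOrFatalCI described above could be written roughly as follows. Only the name comes from the commit message; the signature, the use of the `CI` environment variable to detect CI, and the split into a pure `mustFailInCI` decision function are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"testing"
)

// mustFailInCI decides whether a missing tool is an error rather than a skip.
// Assumption: CI is detected through a non-empty CI environment variable, and
// only Linux and macOS runners are expected to have every tool installed.
func mustFailInCI(ci, goos string) bool {
	return ci != "" && (goos == "linux" || goos == "darwin")
}

// skipOrFatalCI skips the test locally when a tool is missing, but fails it
// on CI platforms where the tool should have been installed.
func skipOrFatalCI(t testing.TB, msg string) {
	t.Helper()
	if mustFailInCI(os.Getenv("CI"), runtime.GOOS) {
		t.Fatal(msg)
	}
	t.Skip(msg)
}

func main() {
	fmt.Println("fail instead of skip here:", mustFailInCI(os.Getenv("CI"), runtime.GOOS))
}
```

In a test this would be called as `skipOrFatalCI(t, "tesseract not found")` after a failed tool lookup. Keeping the decision in a pure function makes the platform rule itself easy to check.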
Tests for the two main gaps in coverage:

- Pipeline with LLM: mock httptest server returns canned extraction
  JSON, verifying the full text -> LLM -> parsed hints path. Also
  tests LLM server down, garbage response, and no-text skip.

- OCRWithProgress: empty data, context cancellation, and integration
  tests for image and PDF paths (need tesseract/pdftoppm in CI).
cpcloud force-pushed the feat/200-document-extraction-pipeline branch from 0a8df47 to 3e75321 on February 22, 2026 22:53
cpcloud merged commit bbda404 into main on Feb 23, 2026
12 checks passed
cpcloud deleted the feat/200-document-extraction-pipeline branch on February 23, 2026 09:22