feat(extract): add document extraction pipeline with OCR and LLM analysis by cpcloud · Pull Request #475 · cpcloud/micasa

cpcloud · 2026-02-22T21:20:59Z

Summary

Add document extraction pipeline: PDF text extraction (pdftotext), OCR (tesseract via pdftoppm + per-page OCR), and LLM-powered structured analysis (title, summary, document type via streaming JSON)
New extraction overlay with live progress: per-step status icons, elapsed timers, expandable log content with left border pipe, and accept/discard workflow
OCR and LLM steps stream progress through channels using the existing waitFor pattern; text extraction stays synchronous
Extraction is opt-in (requires LLM configured) and OCR gracefully degrades when tesseract/poppler aren't available
Schema additions: extracted_text, ocr_text, ocr_tsv columns on documents
Demo tape showing the full extraction flow on a scanned invoice PDF
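
For readers unfamiliar with channel-based progress, here is a minimal sketch of the idea behind the streaming steps. It is not the code from this PR: `Progress`, `streamPages`, and the helper names are hypothetical, and the real pipeline feeds its channel into the existing waitFor pattern rather than draining it in a loop.

```go
package main

import (
	"context"
	"fmt"
)

// Progress is a hypothetical per-page update emitted by a long-running step.
type Progress struct {
	Page  int
	Total int
	Done  bool
}

// streamPages simulates per-page work and reports each page on a channel.
// The channel is closed when all pages are sent or ctx is cancelled.
func streamPages(ctx context.Context, total int) <-chan Progress {
	ch := make(chan Progress)
	go func() {
		defer close(ch)
		for page := 1; page <= total; page++ {
			// Check cancellation first so a cancelled context never emits.
			if ctx.Err() != nil {
				return
			}
			select {
			case <-ctx.Done():
				return
			case ch <- Progress{Page: page, Total: total, Done: page == total}:
			}
		}
	}()
	return ch
}

// collect drains the progress channel into a slice.
func collect(ctx context.Context, total int) []Progress {
	var out []Progress
	for p := range streamPages(ctx, total) {
		out = append(out, p)
	}
	return out
}

// runAll runs to completion with a background context.
func runAll(total int) []Progress {
	return collect(context.Background(), total)
}

// runCancelled cancels before any work starts, as Esc would.
func runCancelled(total int) []Progress {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return collect(ctx, total)
}

func main() {
	for _, p := range runAll(3) {
		fmt.Printf("page %d/%d done=%v\n", p.Page, p.Total, p.Done)
	}
	fmt.Println("updates after cancel:", len(runCancelled(3)))
}
```

Closing the channel from the producer lets the consumer treat "finished" and "cancelled" the same way: the range loop simply ends.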

Test plan

Upload a PDF with LLM configured: overlay opens, text/OCR/LLM steps run, accept persists hints
Upload a scanned PDF (no embedded text): OCR runs, then LLM analyzes OCR output
Upload an image: OCR runs directly on the image, then LLM
No tesseract installed: OCR step skipped, LLM runs on pdftotext output only
No LLM configured: no overlay (text extraction is silent)
LLM fails mid-stream: overlay stays open with error, Esc discards
Esc during extraction: overlay closes, extraction cancelled
go test -shuffle=on ./... passes

🤖 Generated with Claude Code

Add the text extraction foundation for the document extraction pipeline (#200). Introduces ExtractedText and OCRData columns on Document, a pure-Go PDF text extractor using ledongthuc/pdf, and a design plan document.
- Add ExtractedText (string) and OCRData ([]byte) to Document model
- Implement ExtractText for PDF, text/*, and markdown MIME types
- Add IsScanned heuristic (empty/whitespace text = scanned)
- Include test fixture generator and sample.pdf
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
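
The IsScanned heuristic described above can be sketched in a few lines. This is an illustration of the stated rule (empty or whitespace-only text means scanned), not the function from the commit; the name `isScanned` and its signature are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// isScanned reports whether the text extracted from a document is empty or
// whitespace only, which this sketch treats as "no embedded text layer".
// strings.TrimSpace also removes form feeds, which PDF text extractors
// often emit between pages.
func isScanned(extracted string) bool {
	return strings.TrimSpace(extracted) == ""
}

func main() {
	fmt.Println(isScanned(" \n\t\f"))
	fmt.Println(isScanned("Invoice #1"))
}
```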

Add the OCR layer for the document extraction pipeline (#200). Scanned PDFs and images are recognized via tesseract + pdftoppm when available, with graceful degradation and a one-time hint when tools are missing.
- Add OCR function with PDF rasterization (pdftoppm) and image OCR paths
- Parse tesseract TSV output preserving word/line/paragraph structure
- Add tool detection with sync.Once caching (HasTesseract, HasPDFToPPM)
- Add tesseract + poppler-utils to devShell in flake.nix
- Add one-time tesseract hint setting in data/settings.go
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
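
To make the TSV parsing step concrete, here is a rough sketch of rebuilding text lines from tesseract's TSV output. It relies on tesseract's standard 12-column layout (`level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`), where level 5 rows are words. The function name and sample data are hypothetical, and this version keeps only line structure, whereas the commit says it also preserves paragraphs.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleTSV is made-up data in tesseract's TSV column layout.
const sampleTSV = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n" +
	"1\t1\t0\t0\t0\t0\t0\t0\t100\t50\t-1\t\n" +
	"5\t1\t1\t1\t1\t1\t10\t10\t40\t12\t96\tInvoice\n" +
	"5\t1\t1\t1\t1\t2\t55\t10\t30\t12\t95\tTotal\n" +
	"5\t1\t1\t1\t2\t1\t10\t30\t40\t12\t91\t$42.00\n"

// linesFromTSV groups word-level rows by page/block/paragraph/line and
// joins the words of each group with single spaces.
func linesFromTSV(tsv string) []string {
	var lines, cur []string
	lastKey := ""
	for _, row := range strings.Split(strings.TrimRight(tsv, "\n"), "\n") {
		cols := strings.SplitN(row, "\t", 12)
		if len(cols) < 12 || cols[0] != "5" {
			continue // header, malformed row, or a non-word level
		}
		word := strings.TrimSpace(cols[11])
		if word == "" {
			continue
		}
		key := strings.Join(cols[1:5], "/") // page/block/par/line
		if key != lastKey && len(cur) > 0 {
			lines = append(lines, strings.Join(cur, " "))
			cur = nil
		}
		lastKey = key
		cur = append(cur, word)
	}
	if len(cur) > 0 {
		lines = append(lines, strings.Join(cur, " "))
	}
	return lines
}

func main() {
	for _, l := range linesFromTSV(sampleTSV) {
		fmt.Println(l)
	}
}
```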

Add the LLM extraction layer for the document extraction pipeline (#200). When an extraction model is configured, documents are analyzed to extract vendor, amounts, dates, entity links, and maintenance schedules.
- Add ExtractionHints and EntityContext types with validation maps
- Build extraction prompt with entity context for LLM matching
- Parse flexible LLM JSON responses (money as string/float, multiple date formats, code-fenced responses)
- Add [extraction] config section with model, max_ocr_pages, enabled
- Wire extraction config through app Model and main.go
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
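
As an illustration of what "flexible" parsing might look like, the sketch below accepts money as either a JSON number or a string and strips a surrounding code fence before decoding. The `Money` type, the `hints` struct, and the `vendor`/`total` field names are assumptions made for this example and are not taken from the PR's ExtractionHints type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// fence is a Markdown code fence marker, built without a literal so the
// example can itself live inside a fenced block.
var fence = strings.Repeat("`", 3)

// Money is a hypothetical type that decodes from 42, 42.5, or "$1,200.50".
type Money float64

func (m *Money) UnmarshalJSON(b []byte) error {
	var f float64
	if err := json.Unmarshal(b, &f); err == nil {
		*m = Money(f)
		return nil
	}
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return fmt.Errorf("money: want number or string, got %s", b)
	}
	s = strings.NewReplacer("$", "", ",", "").Replace(strings.TrimSpace(s))
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return fmt.Errorf("money: %w", err)
	}
	*m = Money(f)
	return nil
}

// stripCodeFence removes a leading fence line (with optional language tag)
// and a trailing fence, if present.
func stripCodeFence(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s
	}
	s = strings.TrimPrefix(s, fence)
	if nl := strings.IndexByte(s, '\n'); nl >= 0 {
		s = s[nl+1:]
	}
	s = strings.TrimSuffix(strings.TrimSpace(s), fence)
	return strings.TrimSpace(s)
}

// hints uses hypothetical field names for illustration only.
type hints struct {
	Vendor string `json:"vendor"`
	Total  Money  `json:"total"`
}

func parseHints(raw string) (hints, error) {
	var h hints
	err := json.Unmarshal([]byte(stripCodeFence(raw)), &h)
	return h, err
}

func main() {
	raw := fence + "json\n{\"vendor\":\"Acme\",\"total\":\"$1,200.50\"}\n" + fence
	h, err := parseHints(raw)
	fmt.Println(h.Vendor, float64(h.Total), err)
}
```

Putting the leniency in `UnmarshalJSON` keeps the struct definition simple: callers decode once and never see which shape the model chose.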

Add the Pipeline orchestrator that sequences text extraction, OCR, and LLM extraction into a single Run call (#200). Wire it into the document upload flow with entity context from the database.
- Add Pipeline.Run orchestrating all three extraction layers
- Add Store.EntityNames for LLM entity matching context
- Rewrite parseDocumentFormData to use Pipeline with documentParseResult
- Add buildExtractionPipeline and showTesseractHint to app Model
- Pre-fill document title from LLM suggestion, notes from summary
- Surface non-fatal extraction errors via status bar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
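
The sequencing can be pictured roughly as follows. This is a deliberately simplified sketch, not the Pipeline type from this PR: the struct fields, the `Result` shape, and the "run OCR only when no text was found" rule are assumptions, and the real pipeline also handles documents that mix digital and scanned pages.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Result holds the pipeline output plus any non-fatal errors.
type Result struct {
	Text  string
	Hints string
	Errs  []error
}

// Pipeline wires three optional stages together. OCR and Analyze may be
// nil when the corresponding tool or model is not available.
type Pipeline struct {
	ExtractText func(data []byte) (string, error)
	OCR         func(data []byte) (string, error)
	Analyze     func(text string) (string, error)
}

// Run executes text extraction, then OCR if no text was found, then
// analysis if there is text to analyze. Stage failures are recorded and
// the pipeline continues with whatever it has.
func (p Pipeline) Run(data []byte) Result {
	var r Result
	text, err := p.ExtractText(data)
	if err != nil {
		r.Errs = append(r.Errs, fmt.Errorf("text: %w", err))
	}
	if strings.TrimSpace(text) == "" && p.OCR != nil {
		ocr, err := p.OCR(data)
		if err != nil {
			r.Errs = append(r.Errs, fmt.Errorf("ocr: %w", err))
		} else {
			text = ocr
		}
	}
	r.Text = text
	if p.Analyze != nil && strings.TrimSpace(text) != "" {
		hints, err := p.Analyze(text)
		if err != nil {
			r.Errs = append(r.Errs, fmt.Errorf("llm: %w", err))
		}
		r.Hints = hints
	}
	return r
}

func main() {
	p := Pipeline{
		ExtractText: func([]byte) (string, error) { return "Roof inspection report", nil },
		OCR:         func([]byte) (string, error) { return "", errors.New("tesseract not found") },
		Analyze:     func(text string) (string, error) { return "title: " + text, nil },
	}
	r := p.Run(nil)
	fmt.Println(r.Hints, len(r.Errs))
}
```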

…rogress
Add an interactive overlay that shows real-time progress when documents are processed through the extraction pipeline (text -> OCR -> LLM). Each step displays a spinner, elapsed time, and detail (page count, model name, character count). Users accept results with `a` or cancel with `esc`.
Key changes:
- Extraction overlay with step navigation (j/k), expand/collapse (enter)
- Channel-based OCR streaming with per-page and rasterization progress
- LLM token streaming in overlay with accumulated JSON display
- Accept/cancel flow: results held until user presses `a`
- Proper context cancellation: esc cancels all in-flight work
- OCR failure gracefully continues to LLM step
- currency_unit field in extraction schema for cents/dollars disambiguation
- Configurable pdftotext timeout (extraction.text_timeout / MICASA_TEXT_TIMEOUT)
- ExtractionPromptInput struct replaces 7 positional params
- Cached extraction LLM client on model
- Docs: configuration, keybindings, and documents guide updated
closes #200
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
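
To show why a currency_unit field helps, here is a small sketch that normalizes an amount to integer cents based on the declared unit. The accepted values `"cents"` and `"dollars"` and the function name `toCents` are assumptions; the PR text only states that the field exists for disambiguation.

```go
package main

import (
	"fmt"
	"math"
)

// toCents converts an LLM-reported amount to integer cents using the unit
// the model declared. Without the unit, 1234 could mean $12.34 or $1,234.
func toCents(amount float64, unit string) (int64, error) {
	switch unit {
	case "cents":
		return int64(math.Round(amount)), nil
	case "dollars":
		// Round after scaling to absorb floating-point error (e.g. 19.99*100).
		return int64(math.Round(amount * 100)), nil
	default:
		return 0, fmt.Errorf("unknown currency_unit %q", unit)
	}
}

func main() {
	c, err := toCents(19.99, "dollars")
	fmt.Println(c, err)
	c, err = toCents(1999, "cents")
	fmt.Println(c, err)
}
```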

VHS tape that demonstrates the document extraction overlay: importing a scanned PDF, OCR progress, LLM extraction, and accepting results. Requires Ollama running with qwen3:0.6b. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Combine sample.pdf (digital text) and scanned-invoice.pdf (image pages) into a 109KB mixed-inspection.pdf fixture. The pipeline test verifies that pdftotext extracts digital pages while OCR handles scanned ones -- the common case for real-world inspection reports and permits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Without these tools, all OCR and PDF text extraction tests are skipped in CI. Install poppler-utils (pdftotext, pdftoppm) and tesseract-ocr on all three platforms so the pipeline tests exercise real code paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace checked-in binary test fixtures (sample.pdf, invoice.png, scanned-invoice.pdf, mixed-inspection.pdf) with 4 bash scripts that generate them on demand. Fixtures are now gitignored and generated via shell hook (local dev) or CI step (with shell: bash for Windows).
- gen-sample-pdf.bash: base64-embedded minimal PDF (no deps)
- gen-invoice-png.bash: magick-generated realistic invoice image
- gen-scanned-pdf.bash: magick image-to-PDF conversion
- gen-mixed-pdf.bash: pdfunite digital + scanned pages
- 5 nix apps: 4 individual + gen-testdata combined
- CI: imagemagick added to all platforms, shell: bash fixture step
closes #200
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use Shift+F to jump directly to Docs tab
- Add new document instead of editing existing one
- Increase terminal height for overlay centering
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… tape
Clamp base view to terminal height before overlay compositing so the extraction overlay centers correctly when opened from an in-place form save (where the form content overflows the terminal).
Fix expanded log content: add left border pipe (│) to visually separate log output from step headers, add blank line spacing between expanded steps, and use directional triangles (▾ down when expanded, ▸ right when collapsed).
Fix the demo tape: enter Edit mode before Shift+A, use a temp directory with just the test PDF so the file picker navigation is deterministic.
closes #200
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
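
The clamping fix can be illustrated with a small helper that truncates a rendered view to the terminal height before an overlay is composited on top. The name `clampHeight` and the choice to keep the top lines are assumptions; the commit does not say which part of the overflowing view is retained.

```go
package main

import (
	"fmt"
	"strings"
)

// clampHeight trims a newline-separated view to at most height lines,
// keeping the top of the view. A non-positive height yields an empty view.
func clampHeight(view string, height int) string {
	if height <= 0 {
		return ""
	}
	lines := strings.Split(view, "\n")
	if len(lines) > height {
		lines = lines[:height]
	}
	return strings.Join(lines, "\n")
}

func main() {
	base := "form line 1\nform line 2\nform line 3\nform line 4"
	fmt.Println(clampHeight(base, 2))
}
```

If the base view is taller than the terminal, any centering arithmetic based on the terminal height places the overlay against the wrong canvas, which is why the clamp has to happen first.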

Ubuntu 22.04/24.04 ship ImageMagick v6 which only provides `convert`, not the v7 `magick` command. Symlink convert to magick on Linux CI so gen scripts work uniformly across all platforms. Add gen-sample-text-png.bash for the OCR image integration test fixture. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Choco's poppler package extracts binaries into a nested directory that isn't on PATH. Split the install into two steps: choco install in PowerShell, then find pdfunite.exe and add its directory to GITHUB_PATH in bash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The choco poppler package only ships source code, not compiled binaries. Drop it from Windows CI and gracefully skip gen-mixed-pdf when pdfunite is unavailable. The mixed-PDF test already skips when the fixture is missing.

macOS ImageMagick needs ghostscript for text rendering (magick -annotate). Without it, fixture images are blank and tesseract returns empty text. Also add skipOrFatalCI helper: tests skip locally when tools are missing, but fail hard in CI on Linux/macOS where all tools should be installed.
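
A helper along the lines of skipOrFatalCI could look like the sketch below. It is a guess at the shape, not the helper from this commit: the `tb` interface, `missingToolAction`, and the use of the `CI` environment variable (which GitHub Actions sets) are assumptions made so the example runs outside a test binary.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// tb is the subset of testing.TB this sketch needs.
type tb interface {
	Helper()
	Skipf(format string, args ...any)
	Fatalf(format string, args ...any)
}

// missingToolAction decides what to do when a required tool is absent:
// fail in CI on Linux/macOS, where tools should be installed, and skip
// everywhere else.
func missingToolAction(inCI bool, goos string) string {
	if inCI && (goos == "linux" || goos == "darwin") {
		return "fatal"
	}
	return "skip"
}

// skipOrFatalCI skips or fails the test if tool is not on PATH.
func skipOrFatalCI(t tb, tool string) {
	t.Helper()
	if _, err := exec.LookPath(tool); err == nil {
		return
	}
	inCI := os.Getenv("CI") != ""
	if missingToolAction(inCI, runtime.GOOS) == "fatal" {
		t.Fatalf("%s is required in CI but was not found on PATH", tool)
	}
	t.Skipf("%s not found on PATH; skipping", tool)
}

func main() {
	fmt.Println(missingToolAction(true, "linux"))
	fmt.Println(missingToolAction(true, "windows"))
	fmt.Println(missingToolAction(false, "darwin"))
}
```

In a real test file, `*testing.T` satisfies `tb`, so a call such as `skipOrFatalCI(t, "tesseract")` at the top of a test would work as-is.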

Tests for the two main gaps in coverage:
- Pipeline with LLM: mock httptest server returns canned extraction JSON, verifying the full text -> LLM -> parsed hints path. Also tests LLM server down, garbage response, and no-text skip.
- OCRWithProgress: empty data, context cancellation, and integration tests for image and PDF paths (need tesseract/pdftoppm in CI).
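
The mock-server approach can be sketched with the standard library's `net/http/httptest`. The request and response shape here is invented for illustration (a bare JSON object with a `title` field); the actual tests talk to whatever API the extraction LLM client expects, and `newCannedServer` and `fetchTitle` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newCannedServer starts a local server that answers every request with
// the given status code and body.
func newCannedServer(status int, body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		io.WriteString(w, body)
	}))
}

// fetchTitle is a hypothetical stand-in for an LLM client call: it posts
// to url and decodes a JSON object with a "title" field.
func fetchTitle(url string) (string, error) {
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var out struct {
		Title string `json:"title"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", fmt.Errorf("decode: %w", err)
	}
	return out.Title, nil
}

func main() {
	srv := newCannedServer(http.StatusOK, "{\"title\":\"Roof inspection\"}")
	defer srv.Close()
	title, err := fetchTitle(srv.URL)
	fmt.Println(title, err)
}
```

The same helper covers the failure cases listed above: pass a non-JSON body for the garbage-response case, or close the server before calling to simulate the LLM being down.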

cpcloud force-pushed the feat/200-document-extraction-pipeline branch 5 times, most recently from 0e2df39 to 86d4eba on February 22, 2026 22:22

cpcloud and others added 17 commits February 22, 2026 17:52

docs: add extraction pipeline demo tape

f8c932d

docs: record extraction pipeline demo

380cf99

docs: fix extraction demo tape

cbf5553

cpcloud force-pushed the feat/200-document-extraction-pipeline branch from 0a8df47 to 3e75321 on February 22, 2026 22:53

cpcloud merged commit bbda404 into main Feb 23, 2026
12 checks passed

cpcloud deleted the feat/200-document-extraction-pipeline branch February 23, 2026 09:22

cpcloud merged 17 commits into main from feat/200-document-extraction-pipeline
